pmontgom@oak.math.ucla.edu (Peter Montgomery) (02/22/90)
In article <43278@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov
(Hugh LaMaster) writes:
>
>As an example, consider the following complaint: "Current machines don't
>do well on arbitrary precision arithmetic".  Question: Are there simple,
>cheap, architectural extensions which could do a much better job of
>implementing variable precision arithmetic, and, which would also be useful
>in the context of current problem domains?  To be more specific, are there
>forms of integer multiply step, for example, which would make it much easier
>to implement fast 128 bit, or 192 bit, arithmetic?
>
>What small set of reasonable instructions would provide significant
>performance improvements?

I have implemented a multiple precision integer arithmetic library.
For my applications (primarily integer factorization), the most
time-critical routine is one for modular multiplication (A*B mod N).
The routine uses the algorithm in my paper "Modular Multiplication
without Trial Division" (Mathematics of Computation, April 1985).
It really computes C = A*B/RADIX**lng mod N, where
0 <= A, B < N < RADIX**lng (RADIX is the radix used for multiple
precision arithmetic).  Its pseudocode is

        ! Arguments
        (single) integer lng, A(0:lng-1), B(0:lng-1), C(0:lng-1)
        (single) integer Ninv, N(0:lng-1)
        ! Ninv satisfies N(0)*Ninv == -1 mod RADIX

        ! Local variables
        (single) integer accum(0:lng-1), carry, i, j, mulb, muln
        (double) integer quad

        accum(0:lng-1) = 0
        do j = 0, lng-1
            mulb = A(j)
            quad = mulb*B(0) + accum(0)
            muln = quad*Ninv % RADIX
            carry = (quad + muln*N(0))/RADIX        ! remainder is 0
            do i = 1, lng-1
                quad = mulb*B(i) + muln*N(i) + accum(i) + carry
                accum(i-1) = quad % RADIX
                carry = quad/RADIX
            end do
            accum(lng-1) = carry
        end do
        Set C(0:lng-1) to accum or accum-N (propagating carries).

In the inner loop, mulb, muln, B(i), N(i), and accum(i) are all in the
interval [0, RADIX-1] (except accum(lng-1), which has a wider interval),
and 0 <= carry <= 2*RADIX-2.  Hence quad will lie in
[0, 2*RADIX**2 - RADIX - 1].
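On a machine with a 64-bit integer type, the pseudocode above translates
almost directly into C.  The following is only an illustrative sketch
under that assumption, not the library's actual code: the names monmul
and LNG are mine, LNG is fixed at two 30-bit digits for brevity, and the
closing conditional subtraction spells out the "accum or accum-N" step.

```c
#include <stdint.h>

#define LNG   2                      /* digits per operand (illustrative) */
#define RBITS 30
#define RADIX (1u << RBITS)
#define RMASK (RADIX - 1u)

/* C = A*B / RADIX**LNG mod N, following the pseudocode above.
 * Digits are little-endian, each in [0, RADIX-1], with 0 <= A, B < N.
 * Ninv satisfies N[0]*Ninv == -1 (mod RADIX).
 * quad <= 2*RADIX**2 - RADIX - 1 < 2**61, so it fits in 64 bits. */
static void monmul(uint32_t *C, const uint32_t *A, const uint32_t *B,
                   const uint32_t *N, uint32_t Ninv)
{
    uint32_t accum[LNG] = {0};

    for (int j = 0; j < LNG; j++) {
        uint32_t mulb = A[j];
        uint64_t quad = (uint64_t)mulb * B[0] + accum[0];
        uint32_t muln = (uint32_t)(quad * Ninv) & RMASK;
        uint64_t carry = (quad + (uint64_t)muln * N[0]) >> RBITS; /* remainder is 0 */
        for (int i = 1; i < LNG; i++) {
            quad = (uint64_t)mulb * B[i] + (uint64_t)muln * N[i]
                 + accum[i] + carry;
            accum[i - 1] = (uint32_t)quad & RMASK;
            carry        = quad >> RBITS;
        }
        accum[LNG - 1] = (uint32_t)carry;     /* may reach 2*RADIX - 2 */
    }

    /* accum < 2*N, so "accum or accum-N": subtract N at most once. */
    int ge = 1;                               /* is accum >= N ? */
    for (int i = LNG - 1; i >= 0; i--)
        if (accum[i] != N[i]) { ge = accum[i] > N[i]; break; }

    int64_t borrow = 0;
    for (int i = 0; i < LNG; i++) {
        int64_t t = (int64_t)accum[i] - (ge ? N[i] : 0) - borrow;
        borrow = (t < 0);
        if (t < 0) t += RADIX;
        C[i] = (uint32_t)t;
    }
}
```

One property such a routine must satisfy: since it returns
C = A*B/R mod N with R = RADIX**LNG, multiplying the result back by R
modulo N must recover A*B mod N.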
Assume a 32-bit machine and RADIX = 2**30.  The inner loop requires me
to compute two 32 x 32 = 64-bit products and add the results, while
also adding two single precision integer values.  The final sum (quad)
will be less than 2**61; I need its lower 30 bits and the next 31 bits.

I have implemented assembly versions on the CDC 7600, VAX 780, SUN 3,
and Alliant FX/80.  The CDC 7600 version used radix 2**47; the others
used radix 2**30.  The CDC 7600 implementation has been the fastest
single-processor version so far, even though that machine was built
around 1970 and the rest are more modern (approximate time ratios:
CDC = 1, Alliant = 5, SUN 3/280 = 12, VAX 780 = 50).  A speed factor
of 2.5 (= (47/30)**2) comes from the larger radix, since fewer loop
iterations are needed if the numbers fit in fewer words.  The 7600 also
had an easy way to get the upper and lower halves of a product.  The
DXi Xj*Xk instruction (DMULT) gave the lower 48 bits of a 48 x 47 or
47 x 48-bit product; the FXi Xj*Xk instruction (FMULT) gave the upper
48 bits of this product.  Although four multiply instructions (not two)
are required, they could occur in parallel; it took only 11 cycles
(302.5 nanoseconds) for all four to complete.  So I could compute

        t1 = FMULT(mulb, B(i)) + FMULT(muln, N(i))
        t2 = DMULT(mulb, B(i)) + DMULT(muln, N(i)) + accum(i) + carry

with both t1 and t2 fitting in 60-bit registers.  Now

        accum(i-1) = AND(t2, RADIX-1)
        carry      = SHIFT(t1, 48-47) + SHIFT(t2, -47)

completes the computations.  Excluding loop control, the inner loop
needs only 4 memory references, 4 multiplies, 5 adds, 2 shifts, and
1 AND, all of which fit in 4 60-bit words.  By careful coding, I was
able to improve this further and fit the combined inner and outer
loops in a 10-word instruction stack (cache).

On the VAX, I used EMUL (32 x 32 + 32 = 64-bit multiply and add)
twice, once to compute mulb*B(i) + accum(i) and once to compute
muln*N(i) + carry.  Two more instructions (add, add with carry) gave
the 64-bit value of quad.
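In C terms, the VAX grouping can be sketched as follows.  This is my
paraphrase, not the assembly itself (the name inner_quad is mine);
EMUL is actually a signed multiply, but radix-2**30 operands never set
the sign bit, so unsigned 64-bit arithmetic models it faithfully here.

```c
#include <stdint.h>

/* One inner-loop step, grouped the way the VAX code pairs its EMULs
 * (32 x 32 + 32 = 64-bit multiply-add).  Each partial sum stays below
 * 2**60, so the final quad stays below 2**61 and fits in 64 bits. */
static uint64_t inner_quad(uint32_t mulb, uint32_t b,
                           uint32_t muln, uint32_t n,
                           uint32_t acc, uint32_t carry)
{
    uint64_t t1 = (uint64_t)mulb * b + acc;    /* EMUL: mulb*B(i) + accum(i) */
    uint64_t t2 = (uint64_t)muln * n + carry;  /* EMUL: muln*N(i) + carry    */
    return t1 + t2;                            /* add, add with carry        */
}
```

Feeding in the extreme values (all digits RADIX-1, carry = 2*RADIX-2)
reproduces the upper bound 2*RADIX**2 - RADIX - 1 = 2**61 - 2**30 - 1
on quad stated above.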
An AND and a quadword shift, plus an end-of-loop test, completed the
inner loop (only seven instructions!).

The SUN 3 has a 32 x 32 = 64-bit multiply, but several adds and adds
with carry are needed to get quad.  Since it lacks a quadword shift,
multiple bit operations are needed to extract accum(i-1) and carry.

The Alliant code is a vectorized variation of the loop, with two
temporary arrays (lower and carry), roughly

        lower(1:lng) = 0
        carry(0:lng-1) = 0
        do j = 0, lng-1
            mulb = A(j)
            quad(0:lng-1) = mulb*B(0:lng-1) + lower(1:lng) + carry(0:lng-1)
            muln = quad(0)*Ninv % RADIX
            quad(0:lng-1) = quad(0:lng-1) + muln*N(0:lng-1)
            lower(0:lng-1) = quad(0:lng-1) % RADIX
            carry(0:lng-1) = quad(0:lng-1) / RADIX
        end do
        (answer is lower(1:lng) + carry(0:lng-1), or this minus N(0:lng-1))

The Alliant has a vectorized 32 x 32 = 32-bit integer multiply, but
only a scalar 32 x 32 = 64-bit integer multiply.  The integer vector
multiply can be used to get the new lower(0:lng-1), by ignoring all
integer overflow.  To get the new carry(0:lng-1), I repeat the
computation of quad in double precision floating point, subtract the
new DBLE(lower(0:lng-1)), multiply by 1.0/DBLE(RADIX), and round to
integer; this requires 3 double precision multiply-add vector
instructions plus several conversions between long and double.  Even
when lng = 32 (an optimum vector length), and with B(0:lng-1),
N(0:lng-1), DBLE(B(0:lng-1)), DBLE(N(0:lng-1)), and carry(0:lng-1) all
assigned to vector registers across the loop on "j", the Alliant
single-processor code runs 3-5 times slower than the 7600.

What existing vector machines can vectorize this loop well?  You
probably need a 32 x 32 = 64-bit integer multiply (signed or
unsigned).  An add of two 64-bit integer vectors.  A conversion of a
32-bit vector to a 64-bit vector.  Logical and shift instructions to
extract the upper and lower pieces.
A vector shift instruction to replace lower(0:lng-1) by lower(1:lng)
in the vector registers without going through memory (since
lower(0) = lower(lng) = 0, a circular shift will do).

--------
        Peter Montgomery
        pmontgom@MATH.UCLA.EDU
        Department of Mathematics, UCLA, Los Angeles, CA 90024