[comp.arch] Supporting arbitrary precision arithmetic

pmontgom@oak.math.ucla.edu (Peter Montgomery) (02/22/90)

In article <43278@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>
>As an example, consider the following complaint:  "Current machines don't
>do well on arbitrary precision arithmetic".  Question:  Are there simple,
>cheap, architectural extensions which could do a much better job of 
>implementing variable precision arithmetic, and, which would also be useful
>in the context of current problem domains?  To be more specific, are there
>forms of integer multiply step, for example, which would make it much easier
>to implement fast 128 bit, or 192 bit, arithmetic?  
>
>What small set of reasonable instructions would provide significant performance
>improvements?
>

	I have implemented a multiple precision integer arithmetic 
library.  For my applications (primarily integer factorization), 
the most time critical routine is one for modular multiplication 
(A*B mod N).  The routine uses the algorithm in my
paper "Modular Multiplication without Trial Division"
(Mathematics of Computation, April, 1985), and really computes 
C = A*B/RADIX**lng mod N, where 0 <= A, B < N < RADIX**lng
(RADIX is the radix used for multiple precision arithmetic).
Its pseudocode is

		! Arguments

	(single) integer lng, A(0:lng-1), B(0:lng-1), C(0:lng-1)
	(single) integer Ninv, N(0:lng-1)
		! Ninv satisfies N(0)*Ninv == -1 mod RADIX

		! Local variables

	(single) integer accum(0:lng-1), carry, i, j, mulb, muln
	(double) integer quad
	
	accum(0:lng-1) = 0
	do j = 0, lng-1
	    mulb = A(j)
	    quad = mulb*B(0) + accum(0)
	    muln = quad*Ninv % RADIX  
	    carry = (quad + muln*N(0))/RADIX  ! remainder is 0
	    do i = 1, lng-1
	        quad = mulb*B(i) + muln*N(i) + accum(i) + carry
	        accum(i-1) = quad % RADIX
	        carry = quad/RADIX
	    end do
	    accum(lng-1) = carry
	end do
	Set C(0:lng-1) to accum or accum-N (propagating carries).

	In the inner loop, mulb, muln, B(i), N(i), accum(i) are all in
the interval [0, RADIX-1] (except accum(lng-1), which has a wider interval)
and 0 <= carry <= 2*RADIX-2.  Hence quad <= 2*(RADIX-1)**2 + (RADIX-1)
+ (2*RADIX-2), so quad lies in [0, 2*RADIX**2 - RADIX - 1].

	Assume a 32-bit machine and RADIX = 2**30.  The inner
loop requires me to compute two 32 x 32 = 64-bit products and add 
the results, while also adding two single precision integer values.
The final sum (quad) will be less than 2**61-1.  I need its
lower 30 bits and the next 31 bits.
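
	In C on such a machine, the routine might look like the
minimal sketch below (uint32_t digits, a uint64_t for quad; the
names are illustrative, not the actual library code):

	#include <stdint.h>

	#define RADIX_BITS 30
	#define RADIX_MASK ((1u << RADIX_BITS) - 1)   /* RADIX - 1 */

	/* C = A*B / RADIX**lng mod N.  Assumes 0 <= A, B < N < RADIX**lng,
	   lng digits of 30 bits each, and N[0]*Ninv == -1 mod RADIX. */
	void modmul(int lng, const uint32_t *A, const uint32_t *B,
	            uint32_t *C, const uint32_t *N, uint32_t Ninv)
	{
	    uint32_t accum[lng];          /* C99 variable-length array */
	    uint32_t mulb, muln, carry;
	    uint64_t quad;                /* the (double) integer */
	    int64_t diff;
	    int i, j, ge;

	    for (i = 0; i < lng; i++)
	        accum[i] = 0;
	    for (j = 0; j < lng; j++) {
	        mulb = A[j];
	        quad = (uint64_t)mulb * B[0] + accum[0];
	        /* muln matters only mod RADIX, so 32-bit wraparound is harmless */
	        muln = ((uint32_t)quad * Ninv) & RADIX_MASK;
	        carry = (uint32_t)((quad + (uint64_t)muln * N[0]) >> RADIX_BITS);
	        for (i = 1; i < lng; i++) {
	            quad = (uint64_t)mulb * B[i] + (uint64_t)muln * N[i]
	                   + accum[i] + carry;                  /* < 2**61 */
	            accum[i - 1] = (uint32_t)quad & RADIX_MASK; /* low 30 bits  */
	            carry = (uint32_t)(quad >> RADIX_BITS);     /* next 31 bits */
	        }
	        accum[lng - 1] = carry;
	    }
	    /* accum < 2*N, so C = accum or accum - N (propagating borrows). */
	    ge = 1;
	    for (i = lng - 1; i >= 0; i--)
	        if (accum[i] != N[i]) { ge = accum[i] > N[i]; break; }
	    diff = 0;
	    for (i = 0; i < lng; i++) {
	        diff += (int64_t)accum[i] - (ge ? N[i] : 0u);
	        C[i] = (uint32_t)diff & RADIX_MASK;
	        diff >>= RADIX_BITS;  /* assumes arithmetic shift keeps the borrow */
	    }
	}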

	I have implemented assembly versions on the CDC 7600, 
VAX 780, SUN 3, and Alliant FX/80.  The CDC 7600 version used 
radix 2**47; the others used radix 2**30.  The CDC 7600 implementation 
has been the fastest single-processor implementation so far, even though
that machine was built around 1970 and the rest are more modern
(approximate relative times: CDC = 1, Alliant = 5, SUN 3/280 = 12,
VAX 780 = 50).
A speed factor of 2.5 (= (47/30)**2) comes from the larger radix,
since fewer loop iterations are needed if the numbers fit in fewer words.
The 7600 also had an easy way to get the upper and lower halves of the
product.  The DXi Xj*Xk instruction (DMULT) gave the lower 48 bits 
of a 48 x 47 or 47 x 48 -bit product; the FXi Xj*Xk (FMULT) 
instruction gave the upper 48 bits of this product.
Although four multiply instructions (not two) are required,
they could occur in parallel; it took only 11 cycles (302.5 nanoseconds)
for all four to complete.  So I could compute

	t1 = FMULT(mulb, B(i)) + FMULT(muln, N(i))
	t2 = DMULT(mulb, B(i)) + DMULT(muln, N(i)) + accum(i) + carry

with both t1 and t2 fitting in 60-bit registers.  Now 

	accum(i-1) = AND(t2, RADIX-1)
	carry = SHIFT(t1, 48-47) + SHIFT(t2, -47)

completes the computations.  Excluding loop control,
the inner loop needs only 4 memory references, 4 multiplies, 
5 adds, 2 shifts, 1 AND, all of which fit in 4 60-bit words.
By careful coding, I was able to improve this further and to fit the 
combined inner and outer loops in a 10-word instruction stack (cache).
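
	The same arithmetic can be modeled in C, with unsigned __int128
(a GCC/Clang extension) standing in for the 96-bit product; this
illustrates the FMULT/DMULT split, not the actual 7600 code:

	#include <stdint.h>

	typedef unsigned __int128 u128;

	#define HALF_BITS  48                 /* each product half is 48 bits */
	#define RADIX_BITS 47                 /* radix 2**47 */
	#define RADIX_MASK ((UINT64_C(1) << RADIX_BITS) - 1)

	/* Lower and upper 48 bits of the product, as DMULT and FMULT give. */
	static uint64_t DMULT(uint64_t a, uint64_t b)
	{ return (uint64_t)((u128)a * b) & ((UINT64_C(1) << HALF_BITS) - 1); }
	static uint64_t FMULT(uint64_t a, uint64_t b)
	{ return (uint64_t)(((u128)a * b) >> HALF_BITS); }

	/* One inner-loop step.  Since quad = t1*2**48 + t2, we have
	   quad mod 2**47 = t2 mod 2**47 and quad/2**47 = 2*t1 + t2/2**47. */
	static uint64_t cdc_step(uint64_t mulb, uint64_t muln,
	                         uint64_t Bi, uint64_t Ni,
	                         uint64_t accum_i, uint64_t carry,
	                         uint64_t *accum_prev)
	{
	    uint64_t t1 = FMULT(mulb, Bi) + FMULT(muln, Ni);
	    uint64_t t2 = DMULT(mulb, Bi) + DMULT(muln, Ni) + accum_i + carry;
	    *accum_prev = t2 & RADIX_MASK;          /* AND(t2, RADIX-1) */
	    return (t1 << 1) + (t2 >> RADIX_BITS);  /* SHIFT(t1, 48-47) + SHIFT(t2, -47) */
	}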

	On the VAX, I used EMUL (32 x 32 + 32 = 64-bit
multiply and add) twice, once to compute mulb*B(i)+accum(i)
and once to compute muln*N(i)+carry.  Two more instructions
(add, add with carry) gave the 64-bit value of quad.  An
AND and a quadword shift, plus an end-of-loop test,
completed the inner loop (only seven instructions!).
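
	Modeled in C (emul is an illustrative stand-in for the VAX
instruction; every operand here fits in the signed 32-bit range):

	#include <stdint.h>

	#define RADIX_BITS 30
	#define RADIX_MASK ((1 << RADIX_BITS) - 1)

	/* EMUL: signed 32 x 32 multiply plus a 32-bit addend, 64-bit result. */
	static int64_t emul(int32_t mulr, int32_t muld, int32_t add)
	{ return (int64_t)mulr * muld + add; }

	/* One inner-loop step: two EMULs, a 64-bit add (add plus
	   add-with-carry on the VAX), an AND, and a quadword shift. */
	static int32_t vax_step(int32_t mulb, int32_t muln, int32_t Bi,
	                        int32_t Ni, int32_t accum_i, int32_t carry,
	                        int32_t *accum_prev)
	{
	    int64_t quad = emul(mulb, Bi, accum_i) + emul(muln, Ni, carry);
	    *accum_prev = (int32_t)(quad & RADIX_MASK);
	    return (int32_t)(quad >> RADIX_BITS);
	}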

	The SUN 3 has a 32 x 32 = 64-bit multiply, but
several adds and adds with carry are needed to get quad.
Since it lacks a quadword shift, multiple bit operations
are needed to extract accum(i-1) and carry.
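
	For instance, with quad held as two 32-bit halves hi:lo
(quad = hi*2**32 + lo, quad < 2**61), the split costs an AND, two
single-width shifts, and an OR; a sketch:

	#include <stdint.h>

	#define RADIX_BITS 30
	#define RADIX_MASK ((1u << RADIX_BITS) - 1)

	/* quad < 2**61 means hi < 2**29, so hi << 2 cannot overflow,
	   and the OR below acts as an add. */
	void split(uint32_t hi, uint32_t lo, uint32_t *low30, uint32_t *next31)
	{
	    *low30  = lo & RADIX_MASK;                  /* quad mod 2**30 */
	    *next31 = (lo >> RADIX_BITS) | (hi << 2);   /* quad / 2**30   */
	}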

	The Alliant code is a vectorized variation of the loop,
with two temporary arrays (lower and carry), roughly

	lower(1:lng) = 0
	carry(0:lng-1) = 0 
	do j = 0, lng-1
	    mulb = A(j)
	    quad(0:lng-1) = mulb*B(0:lng-1) + lower(1:lng) + carry(0:lng-1)
	    muln = quad(0)*Ninv % RADIX
	    quad(0:lng-1) = quad(0:lng-1) + muln*N(0:lng-1)
	    lower(0:lng-1) = quad(0:lng-1) % RADIX
	    carry(0:lng-1) = quad(0:lng-1) / RADIX
	end do
	(answer is lower(1:lng) + carry(0:lng-1), propagating carries,
	or this minus N(0:lng-1))

The Alliant has a vectorized 32 x 32 = 32-bit integer multiply,
but only a scalar 32 x 32 = 64-bit integer multiply.
This integer vector multiply can be used to get the new lower(0:lng-1),
by ignoring all integer overflow.
To get the new carry(0:lng-1), I repeat the computation of quad in 
double precision floating point, subtract the new DBLE(lower(0:lng-1)),
multiply by 1.0/DBLE(RADIX), and round to integer; this requires
3 double precision multiply-add vector instructions
plus several conversions between integer and double.  Even
when lng = 32 (an optimal vector length), and with B(0:lng-1),
N(0:lng-1), DBLE(B(0:lng-1)), DBLE(N(0:lng-1)), and carry(0:lng-1)
all assigned to vector registers across the loop on "j", the Alliant
single-processor code is 3 to 5 times slower than the 7600 code.
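
	A C sketch of the per-element work, once muln is known (this
models the trick, not the actual Alliant vector code; lshift and cin
play the roles of lower(i+1) and carry(i)):

	#include <stdint.h>

	#define RADIX_BITS 30
	#define RADIX_MASK ((1u << RADIX_BITS) - 1)
	#define DRADIX     1073741824.0        /* DBLE(RADIX) = 2**30 */

	void element(uint32_t mulb, uint32_t muln, uint32_t Bi, uint32_t Ni,
	             uint32_t lshift, uint32_t cin,
	             uint32_t *lower_i, uint32_t *carry_i)
	{
	    /* 32 x 32 = 32-bit multiplies: overflow is ignored, yet the
	       low 30 bits of the sum remain exact. */
	    *lower_i = (mulb * Bi + muln * Ni + lshift + cin) & RADIX_MASK;

	    /* quad < 2**61 exceeds a double's 53-bit significand, so each
	       operation rounds; but quad - lower is an exact multiple of
	       2**30, and the accumulated error (a few thousand at most)
	       vanishes after the division and the rounding to integer. */
	    double quad = (double)mulb * Bi + (double)muln * Ni + lshift + cin;
	    *carry_i = (uint32_t)((quad - *lower_i) * (1.0 / DRADIX) + 0.5);
	}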

	What existing vector machines can vectorize this loop
well?  You probably need the following (a C sketch of the resulting
loop appears after the list):

	32 x 32 = 64-bit integer multiply (can be signed or unsigned).
	Add two 64-bit integer vectors.
	Convert 32-bit vector to a 64-bit vector.
	Logical and shift instructions to extract upper and lower pieces.
	A vector shift instruction to replace lower(0:lng-1) by lower(1:lng) 
		in the vector registers without going through memory
		(since lower(0) = lower(lng) = 0, a circular shift will do).
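
	Per element in C, each marked step below would map onto one of
those vector instructions (a sketch only; the scalar muln computation
stays between the vector operations):

	#include <stdint.h>

	#define RADIX_BITS 30
	#define RADIX_MASK ((UINT64_C(1) << RADIX_BITS) - 1)

	/* One pass of the j loop.  lower has lng+1 entries, lower[lng] = 0;
	   lower and carry start zeroed before the first pass. */
	void pass(int lng, uint32_t mulb, const uint32_t *B, const uint32_t *N,
	          uint32_t Ninv, uint64_t *quad, uint32_t *lower, uint32_t *carry)
	{
	    uint32_t muln;
	    int i;

	    /* 1: widen, 32 x 32 = 64-bit vector multiply, two 64-bit adds */
	    for (i = 0; i < lng; i++)
	        quad[i] = (uint64_t)mulb * B[i] + lower[i + 1] + carry[i];

	    muln = ((uint32_t)quad[0] * Ninv) & RADIX_MASK;   /* scalar */

	    /* 2: second vector multiply and 64-bit vector add */
	    for (i = 0; i < lng; i++)
	        quad[i] += (uint64_t)muln * N[i];

	    /* 3: AND and shift to split each quad into lower and carry */
	    for (i = 0; i < lng; i++) {
	        lower[i] = (uint32_t)(quad[i] & RADIX_MASK);
	        carry[i] = (uint32_t)(quad[i] >> RADIX_BITS);
	    }
	    /* The next pass reads lower(1:lng); the circular-shift
	       instruction would realign the register without touching
	       memory. */
	}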

	
--------
        Peter Montgomery
        pmontgom@MATH.UCLA.EDU 
	Department of Mathematics, UCLA, Los Angeles, CA 90024