[comp.arch] Vectorized double length multiplication and division

pmontgom@sonia.math.ucla.edu (Peter Montgomery) (07/04/89)

In article <57725@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
>In article <41957@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>:In article <1989Jun24.230056.27774@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>:>In article <57125@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
>:>>This, in my opinion is one of the major faults of RISC processors. They
>:>>do not provide basic arithmetic instructions. 
>:>
>:>When the list of "basic" arithmetic instructions is pages long, one starts
>:>to wonder how many of them are really "basic".  The instruction you ask
>:>for -- divide double length by single length yielding single-length result --
>:>is not exactly frequently needed.  Just how much silicon is it worth to make
>:>it run faster than an implementation as a subroutine?
>:
>:The moral of the story: If your application needs mul and div
>:performance, get a micro with hardware support for them.  If that
>:means using a CISC, so be it.
>
>No, to the last question. Such an instruction is required to compute
>AB/C or AB MOD C exactly, when AB can overflow and C > A,B. 
>
>The problem with CISC is that no one is likely to come out with a 40MIP
>CISC in the near future [e.g. i860]. I simply contend that operations
>like multiplication and division should be fundamental to ANY computer.
>I also include add, subtract and add/subtract with carry. How else
>would you compute AB/C or AB % C exactly?
>
	To some these may be "not exactly frequently needed", but I
use these double-length integer instructions very heavily, so much so 
that I request that manufacturers make them available in vector mode.
I am presently coding an integer FFT, and the usual inner loop operation

		(A, B) := (A + omega*B, A - omega*B)

(A, B, omega complex numbers;  omega a root of unity) is replaced by

		FOR each ip   (indexes over several primes)
		    (A[ip], B[ip]) := (A[ip] + omega[ip]*B[ip] mod P[ip], 
				       A[ip] - omega[ip]*B[ip] mod P[ip])
		END FOR

(0 <= A[ip], B[ip], omega[ip] < P[ip];  omega[ip] a root of unity mod P[ip]).

The primes P[ip] should be as large as possible (around 2^30 on a 32-bit
machine), to reduce the number of primes required.  I want this loop to run 
in vector mode, even if I have to code it in assembly.  Of the vector 
architectures I've studied (CRAY I, Alliant FX/8, IBM 3090), the CRAY has 
no vectorized integer multiplication or division (and there are not enough 
primes below 2^24 for this application, so even a 48-bit floating point 
product is inadequate).  The Alliant has a 32-bit vectorized integer 
multiplication and division (quotient only) but no vectorized 64-bit 
integer operations.  The 3090 does have a 64-bit vectorized integer 
product but lacks a 64-bit integer division.  

	The best vectorizable code I've found to date for the Alliant will 
require doing the multiplications and the division by P[ip] in double 
precision floating point, converting the quotients back to integer, 
and getting the remainders by taking the difference of two integer 
products mod 2^32 (with overflow ignored).  Even then, rounding errors 
may cause the remainder to be negative and need a correction.  
All for such a fundamental mathematical operation.
--------
        Peter Montgomery
        pmontgom@MATH.UCLA.EDU