pmontgom@sonia.math.ucla.edu (Peter Montgomery) (07/04/89)
In article <57725@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes: >In article <41957@bbn.COM> slackey@BBN.COM (Stan Lackey) writes: >:In article <1989Jun24.230056.27774@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes: >:>In article <57125@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes: >:>>This, in my opinion is one of the major faults of RISC processors. They >:>>do not provide basic arithmetic instructions. >:> >:>When the list of "basic" arithmetic instructions is pages long, one starts >:>to wonder how many of them are really "basic". The instruction you ask >:>for -- divide double length by single length yielding single-length result -- >:>is not exactly frequently needed. Just how much silicon is it worth to make >:>it run faster than an implementation as a subroutine? >: >:The moral of the story: If your application needs mul and div >:performance, get a micro with hardware support for them. If that >:means using a CISC, so be it. > >No, to the last question. Such an instruction is required to compute >AB/C or AB MOD C exactly, when AB can overflow and C > A,B. > >The problem with CISC is that no one is likely to come out with a 40MIP >CISC in the near future [e.g. i860]. I simply contend that operations >like multiplication and division should be fundamental to ANY computer. >I also include add, subtract and add/subtract with carry. How else >would you compute AB/C or AB % C exactly? > To some these may be "not exactly frequently needed", but I use these double-length integer instructions very heavily, so much so that I request that manufacturers make them available in vector mode. I am presently coding an integer FFT, and the usual inner loop operation (A, B) := (A + omega*B, A - omega*B) (A, B, omega complex numbers; omega a root of unity) is replaced by FOR each ip (indexes over several primes) (A[ip], B[ip]) := (A[ip] + omega[ip]*B[ip] mod P[ip], A[ip] - omega[ip]*B[ip] mod P[ip]) END FOR (0 <= A[ip], B[ip], omega[ip] < P[ip]; omega[ip] a root of unity mod P[ip]). The primes P[ip] should be as large as possible (around 2^30 on a 32-bit machine), to reduce the number of primes required. I want this loop to run in vector mode, even if I have to code it in assembly. Of the vector architectures I've studied (CRAY I, Alliant FX/8, IBM 3090), the CRAY has no vectorized integer multiplication or division (and there are not enough primes below 2^24 for this application, so even a 48-bit floating point product is inadequate). The Alliant has a 32-bit vectorized integer multiplication and division (quotient only) but no vectorized 64-bit integer operations. The 3090 does have a 64-bit vectorized integer product but lacks a 64-bit integer division. The best vectorizable code I've found to date for the Alliant will require doing the multiplications and the division by P[ip] in double precision floating point, converting the quotients back to integer, and getting the remainders by taking the difference of two integer products mod 2^32 (with overflow ignored). Even then, rounding errors may cause the remainder to be negative and need a correction. All for such a fundamental mathematical operation. -------- Peter Montgomery pmontgom@MATH.UCLA.EDU