vu0310@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/12/90)
In article <119733@linus.mitre.org> bs@gauss.UUCP (Robert D. Silverman) writes: \\\ >I wish people would refrain from shooting their mouths off about >multi-precision arithmetic until they've actually done a significant >amount of it. Typically, only having 32 x 32 --> 32 slows down >multi-precision arithmetic by about a factor of 10. Not having it >can render a machine all but useless for performing certain tasks. \\\ Well that ``factor of 10'', in my simple-minded way, sounded a bit too far off the mark (of course it only _asymptote_ to less than 3 for sure). So I ran my littul benchmark. The results for VAX 8750 and Sparc, using a variety of compilers and optimization levels are as follows: Machine Compiler Optimization benchmark 1 benchmark 2 VAX cc no 2.783 5.300 cc yes 2.566 5.133 gcc no 2.900 5.116 gcc yes 1.783 3.700 Sparc cc no 1.833 3.733 cc yes 1.033 2.100 Admittedly there is a lost to be said for my lack of understanding of experimental design, but I think we fail to see an order of magnitude difference between benchmarks 1 & 2. The difference? Benchmark 2 performed all calculations using int x int -> int while benchmark 1 used both int x int -> int and long x long -> long. I reiterate -- since int x int -> long is not _logically required_ then you must argue it's benefit from an _economic_ viewpoint rather than a convenience one. As the vast majority of computation (although an ill-defined quantity here) does not seem to suffer from its lack, and the admittedly special-purpose stuff here only runs a (binary) order of magnitude slower by using lower-precision, I can not see how allocation of 1 barn to int x int -> long can be justified. Of course, this situation may change -- technology has a habit of throwing a monkey-wrench into things (e.g. the motor industry and mating habbits). I await cogent arguments. -Kym Horsell
mash@mips.COM (John Mashey) (09/12/90)
I'm confused about the following exchange: In article <3984@bingvaxu.cc.binghamton.edu> vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) writes: >In article <119733@linus.mitre.org> bs@gauss.UUCP (Robert D. Silverman) writes: ... >>amount of it. Typically, only having 32 x 32 --> 32 slows down >>multi-precision arithmetic by about a factor of 10. Not having it >Well that ``factor of 10'', in my simple-minded way, sounded a bit >too far off the mark (of course it only _asymptote_ to less than 3 for sure). >So I ran my littul benchmark. The results for VAX 8750 and Sparc, >using a variety of compilers and optimization levels are as >follows: .... >Admittedly there is a lost to be said for my lack of understanding >of experimental design, but I think we fail to see an >order of magnitude difference between benchmarks 1 & 2. >The difference? Benchmark 2 performed all calculations using >int x int -> int while benchmark 1 used both int x int -> int >and long x long -> long. >I reiterate -- since int x int -> long is not _logically >required_ then you must argue it's benefit from an _economic_ >viewpoint rather than a convenience one. I'm confused. Since some fairly strong results are being stated (that Silverman is way off; anyone can be off, but on the other hand, Silverman certainly knows this problem domain), it would help to understand exactly what is being measured. In particular, I'm confused becasue Silverman clearly wished for 32x32->64, and the posted results talk about int x int -> int, long x long -> long. Unless I completely misunderstand the posting, I can't find any connection whatsoever between the measurements of an apparently portable C program on these machines, and Silverman's clear statement of wishing for 32 x 32 -> 64 bit product for a 10X difference in his work, since: int & long are both 32-bit anyway C doesn't expect 64-bit from 32x32; I don't know how, in C, you select between 32x32->32, and 32x32 -> 64,if both are actually available? Are there C compilers that give this choice? (Note: MIPS has 32x32->64, but C just uses the low-order 32 bits) Existing SPARC implementations don't HAVE multiply instructions at all (other than multiply-step) How about posting the benchmark, so we can understand what's happening? It might especially be good to post a sample of the dissassembled code that's supposed to be doing 32x32->64, if that is truly supposed to be happening. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
pmontgom@euphemia.math.ucla.edu (Peter Montgomery) (09/12/90)
In article <41425@mips.mips.COM> mash@mips.COM (John Mashey) writes: >I'm confused about the following exchange: > >In article <3984@bingvaxu.cc.binghamton.edu> vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) writes: >>In article <119733@linus.mitre.org> bs@gauss.UUCP (Robert D. Silverman) writes: >... >>>amount of it. Typically, only having 32 x 32 --> 32 slows down >>>multi-precision arithmetic by about a factor of 10. Not having it > >>Well that ``factor of 10'', in my simple-minded way, sounded a bit >>too far off the mark (of course it only _asymptote_ to less than 3 for sure). >>So I ran my littul benchmark. The results for VAX 8750 and Sparc, >>using a variety of compilers and optimization levels are as >>follows: >.... >>The difference? Benchmark 2 performed all calculations using >>int x int -> int while benchmark 1 used both int x int -> int >>and long x long -> long. > > >I'm confused. Since some fairly strong results are being stated >(that Silverman is way off; anyone can be off, but on the other hand, >Silverman certainly knows this problem domain), it would help >to understand exactly what is being measured. In particular, >I'm confused becasue Silverman clearly wished for 32x32->64, >and the posted results talk about int x int -> int, long x long -> long. > >Unless I completely misunderstand the posting, I can't find any >connection whatsoever between the measurements of an apparently portable >C program on these machines, and Silverman's clear statement of wishing for >32 x 32 -> 64 bit product for a 10X difference in his work. >How about posting the benchmark, so we can understand what's happening? >It might especially be good to post a sample of the disassembled code that's >supposed to be doing 32x32->64, if that is truly supposed to be happening. I do many of the same types of computations as Silverman (and, like he, some of my processes use a month of CPU time or more). But I dispute his 10X figure. Here is a benchmark which tries to compute the remainders i*j mod p where 0 <= i, j < p, using three algorithms expressible in C and also an assembly code algorithm (which, on the SUN 3, uses the 64-bit multiply and divide instructions). Those instructions perform five times as fast as the best of the C algorithms, but my SPARC assembly code is only slightly bettter than the best of the C algorithms. [If someone can significantly improve my SPARC assembly code, let me know.] #! /bin/csh -f # # This benchmark is intended to measure how much 64-bit # integer multiplication and division hardware affect # the computation of i*j mod p, where 0 <= i, j, < p < 2**31. # The task is done four times, via C routines MUL1, MUL2, MUL3, # and assembly routine MUL4. A probable primality test # (based on Fermat's little Theorem) is used to rate the algorithms. # # MUL1 uses additions, subtractions, and bit-wise operations, # (no multiplication or division), analyzing one bit at a time. # It provides a benchmark for rating the machine's raw speed. # # MUL2 uses floating point arithmetic to estimate the quotient # i*j/p, followed by unsigned long integer arithmetic (modulo 2^32 or # other power of 2) to compute the remainder and possibly adjust it. # # MUL3 breaks the operands up into 16-bit pieces. # # MUL4 is a machine-dependent, assembly-language program. # Presently SUN 3 and SUN 4 (SPARC) versions are included. # If the machine has 64-bit integer multiply and divide instructions, # the MUL1/MUL4 ratio estimates the benefit achieved by # using these instructions (when adjusted for compiler's # level of sophistication). # # Approximate times: # # SUN 3/280 SUN 4/260 (both with FPUs) # gcc 1.37.1 cc gcc 1.37.1 cc # MUL1 10.40 13.52 3.67 2.92 (simple integer arithmetic) # MUL2 9.00 10.38 2.03 2.05 (floating point arithmetic) # MUL3 10.14 11.30 6.11 6.32 (break into 16-bit pieces) # MUL4 1.88 1.98 1.82 1.83 (assembly code) # # My optimized assembly code for the SUN 4 does little better # than my straightforward assembly code for the SUN3, # even though the SUN 4's raw CPU speed is 3.5 - 5 times # that of the SUN 3 (as estimated by the MUL1 and MUL2 times). # This is an application on which the SPARC does not do # as well as we want. Residue (modular) arithmetic is # important to some of my applications -- see, for example, # "An FFT extension to the P-1 factoring algorithm" # by Robert D. Silverman and me, in the April, 1990 Math. Comp. # # Peter L. Montgomery # Department of Mathematics # University of California # Los Angeles, CA 90024-1555 # pmontgom@math.ucla.edu # September, 1990 cat << end_program >! modprd.c #include <stdio.h> #include <sys/time.h> #include <sys/resource.h> typedef unsigned long ulong; typedef long slong; long lrand48(); /* Random number generator (SUN-specific) */ ulong MUL4(); /* Assembly routine */ ulong MUL1(i, j, p) ulong i, j, p; { register ulong ii, jj, rem = 0; if (i < j) {ii = i; jj = j;} else {ii = j; jj = i;} /* ii = MIN(i, j), jj = MAX(i,j) */ while (ii != 0) { /* Answer is (ii*jj + rem) mod p */ if ((ii & 1) != 0) { rem += jj; if (rem >= p) rem -= p; } jj <<= 1; if (jj >= p) jj -= p; ii >>= 1; } return rem; } ulong MUL2(i, j, p) ulong i, j, p; { register ulong quot, rem; quot = (ulong)(slong) ( (double)(slong)i * (double)(slong)j /(double)(slong)p + 0.5); /* true quotient, or one too high */ rem = i*j - p*quot; /* remainder as signed integer */ if ((slong)rem <= -1) rem += p; /* The above test could be "< 0", but some compilers optimize "< 0" incorrectly, using the condition code from the subtract. */ return rem; } ulong MUL3(i, j, p) ulong i, j, p; { if (p < 65536) { return (i*j) % p; /* simple case */ } else { ulong itop = i >> 16, jtop = j >> 16; /* fit in 15 bits */ ulong ibot = i & 0xffff, jbot = j & 0xffff; register ulong prdtop = itop*jtop, prdbot = ibot*jbot; register ulong prdmid = prdtop + prdbot - (itop - ibot)*(jtop - jbot); /* i*j = prdtop*2^32 + prdmid*2^16 + prdbot */ /* 0 <= prdtop < p^2/2^32 < 2^30 */ /* 0 <= prdbot, prdmid <= 2^32 - 2^17 + 1 */ register ulong pnorm = p; /* Multiple of p */ register ulong pntop, pnbot; while (pnorm < (1L<<30)) pnorm <<= 1; /* 2^30 <= pnorm < 2^31 */ pntop = pnorm >> 16; pnbot = pnorm & 0xffff; prdmid += (prdbot >> 16); prdbot &= 0xffff; /* Accumulate carries */ prdtop += (prdmid >> 16); prdmid &= 0xffff; while (prdtop > pntop) { register ulong quot = prdtop/(pntop + 1); /* Fits in 16 bits */ register ulong tmpmid = quot*pnbot; prdtop -= quot*pntop; if (prdmid < tmpmid) prdtop -= 65536; prdmid -= tmpmid; prdtop += (prdmid >> 16); prdmid &= 0xffff; } prdmid += prdtop << 16; while (prdmid > pntop) { register ulong quot = prdmid/(pntop + 1); register ulong tmpbot = quot*pnbot; prdmid -= quot*pntop; if (prdbot < tmpbot) prdmid -= 65536; prdbot -= tmpbot; prdmid += (prdbot >> 16); prdbot &= 0xffff; } prdbot += (prdmid << 16); return prdbot%p; } } double CP_TIME() { /* Return CP time used so far, in seconds */ long sec, usec; struct rusage inf; getrusage(RUSAGE_SELF, &inf); sec = inf.ru_utime.tv_sec + inf.ru_stime.tv_sec; usec = inf.ru_utime.tv_usec + inf.ru_stime.tv_usec; return (double)sec + 1.0e-6*(double)usec; } ulong MODEXP(base, expon, p, func) ulong base, expon, p; ulong (*func)(); { /* Modular exponentiation, base^expon mod p */ if (expon == 0) { return 1; } else { register ulong b = base, ipow2 = 1; while ((ipow2 <<1) <= expon) ipow2 <<= 1; /* highest power of 2 <= expon */ while ((ipow2 >>= 1) != 0) { /* Answer = [b^(2*ipow2) * base^(e % (2*ipow2))] mod p */ b = func(b, b, p); if (expon & ipow2) b = func(b, base, p); } return b; } } void TIME(func, name) ulong (*func)(); /* MUL1, MUL2, MUL3, or MUL4 */ char *name; { double t1 = CP_TIME(), t2; ulong p, nprime = 0; /* Search for primes in the interval (10^9, 10^9 + 10000), using a pseudoprime test to base 1234567. Count the number of successes. */ for (p = 1000000001; p <= 1000010000; p += 2) { ulong test = MODEXP(1234567L, (p-1)/2, p, func); if (test == 1 || test == p-1) nprime++; } t2 = CP_TIME(); printf("%s found %ld probable primes in %10.2f seconds.\n", name, nprime, t2-t1); } int main(argc, argv) int argc; char **argv; { int itest; srand48(1990); /* Set seed */ for (itest = 1; itest <= 200; itest++) { ulong n1, n2, p; ulong ans1, ans2, ans3, ans4; while ((p = lrand48()) < 1000) {}; n1 = lrand48() % p; n2 = lrand48() % p; ans1 = MUL1(n1, n2, p); ans2 = MUL2(n1, n2, p); ans3 = MUL3(n1, n2, p); ans4 = MUL4(n1, n2, p); if (ans1 != ans2 || ans2 != ans3 || ans3 != ans4) { printf("Inconsistent results: n1 = %ld, n2 = %ld, p = %ld\n", n1, n2, p); printf(" ans1 = %ld, ans2 = %ld, ans3 = %ld, ans4 = %ld\n", ans1, ans2, ans3, ans4); } } /* for itest */ TIME(MUL1, "MUL1"); TIME(MUL2, "MUL2"); TIME(MUL3, "MUL3"); TIME(MUL4, "MUL4"); return 0; } end_program cat << end_sun3assem >! sun3assem.s | iprod = MUL4(i, j, p) Returns i*j mod p .globl _MUL4 _MUL4: | Uses only d0, d1 movl sp@(4),d1 | d1 = i mulul sp@(8),d0:d1 | Get 64-bit product i*j divul sp@(12),d0:d1 | Divide by p, get remainder in d0 rts | Return end_sun3assem cat << end_sun4assem >! sun4assem.s ! SPARC assembly version of MUL4(i, j, p) (i*j mod p) #define MULT1(prod,src) mulscc prod,src,prod #define MULT2(prod,src) MULT1(prod,src);MULT1(prod,src) #define MULT4(prod,src) MULT2(prod,src);MULT2(prod,src) #define MULT8(prod,src) MULT4(prod,src);MULT4(prod,src) #define MULT16(prod,src) MULT8(prod,src);MULT8(prod,src) /* When dividing divhi*2^n + divlo by divsor, where 0 <= divhi < divsor and 0 <= divlo < 2^n, the bottom half of the dividend (divlo) should be left-justified (i.e., pre-multiplied by 2^(32-n)). After n executions of the REMAINDER_STEP macro, register "divhi" will have the remainder. */ #define REMAINDER_STEP(divhi,divlo,divsor) \ addcc divlo,divlo,divlo /* Shift low dividend */ ;\ addx divhi,divhi,divhi /* Shift high dividend */ ;\ cmp divhi,divsor /* Compare dividend and divisor */ ;\ bgeu,a 1f /* if ge, branch */ ;\ sub divhi,divsor,divhi /* but first subtract divisor */; \ 1: #define MUL4_step REMAINDER_STEP(%o5,%o4,%o2) .global _MUL4 ! iprod = MUL4(i, j, p) (leaf procedure) _MUL4: ! %o0 = i, %o1 = j, %o2 = p subcc %o0,%o1,%o3 ! i-j bgeu 1f ! Branch if i >= j nop add %o1,%o3,%o1 ! %o1 = j + (i-j) = i = MIN(i,j) sub %o0,%o3,%o0 ! %o0 = i - (i-j) = j = MAX(i,j) 1: mov %o1,%y ! set multiplier (y reg) to MIN(i,j) srl %o1,16,%o4 tst %o4 bz mul4_16 ! Branch if multiplier fits in 16 bits andncc %o4,0xff,%g0 bz mul4_24 orcc %g0,%g0,%o5 ! Initialize upper product (delay) MULT16(%o5,%o0) MULT16(%o5,%o0) mulscc %o5,%g0,%o5 ! Align results mov %y,%o4 ! Get lower product MUL4_step; MUL4_step; MUL4_step; MUL4_step MUL4_step; MUL4_step; MUL4_step; MUL4_step mul4_x24: MUL4_step; MUL4_step; MUL4_step; MUL4_step MUL4_step; MUL4_step; MUL4_step; MUL4_step mul4_x16: MUL4_step; MUL4_step; MUL4_step; MUL4_step MUL4_step; MUL4_step; MUL4_step; MUL4_step mul4_x8: MUL4_step; MUL4_step; MUL4_step; MUL4_step MUL4_step; MUL4_step; MUL4_step; MUL4_step retl ! Return from leaf mov %o5,%o0 ! Move remainder (delay) mul4_24: MULT16(%o5,%o0) MULT8(%o5,%o0) mulscc %o5,%g0,%o5 ! Align results b mul4_x24 mov %y,%o4 ! and get lower product (delay) mul4_16: andncc %o1,0xff,%g0 ! Test if one operand is 8 bits or less bz mul4_8 orcc %g0,%g0,%o5 ! Initialize upper product (delay) MULT16(%o5,%o0) mulscc %o5,%g0,%o5 ! Align results b mul4_x16 mov %y,%o4 ! and get lower product mul4_8: MULT8(%o5,%o0) mulscc %o5,%g0,%o5 ! Align results b mul4_x8 mov %y,%o4 ! and get lower product end_sun4assem # ********************** Compile and execute the program ******************** if ("`arch`" == "sun3") then as -o assem.o sun3assem.s gcc -O modprd.c assem.o echo "SUN 3 gcc times" a.out echo "SUN 3 cc times" cc -O4 modprd.c assem.o a.out else if("`arch`" == "sun4") then as -o assem.o -P sun4assem.s echo "SUN 4 gcc times" gcc -O modprd.c assem.o a.out echo "SUN 4 cc times" cc -O4 modprd.c assem.o a.out else echo "Unknown architecture - `arch`" endif -- Peter L. Montgomery pmontgom@MATH.UCLA.EDU Department of Mathematics, UCLA, Los Angeles, CA 90024-1555 If I spent as much time on my dissertation as I do reading news, I'd graduate.
bs@linus.mitre.org (Robert D. Silverman) (09/13/90)
In article <353@kaos.MATH.UCLA.EDU> pmontgom@euphemia.math.ucla.edu (Peter Montgomery) writes: >In article <41425@mips.mips.COM> mash@mips.COM (John Mashey) writes: >>I'm confused about the following exchange: >> >>In article <3984@bingvaxu.cc.binghamton.edu> vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) writes: >>>In article <119733@linus.mitre.org> bs@gauss.UUCP (Robert D. Silverman) writes: stuff deleted. >>How about posting the benchmark, so we can understand what's happening? >>It might especially be good to post a sample of the disassembled code that's >>supposed to be doing 32x32->64, if that is truly supposed to be happening. > > I do many of the same types of computations as Silverman >(and, like he, some of my processes use a month of CPU time or more). >But I dispute his 10X figure. Here is a benchmark which tries >to compute the remainders i*j mod p where 0 <= i, j < p, using three >algorithms expressible in C and also an assembly code >algorithm (which, on the SUN 3, uses the 64-bit multiply and divide >instructions). Those instructions perform five times as fast as Even Peter reports a 5 fold speed increase. One difference between his code and mine that would tend to exaggerate the difference is that he codes the control loops directly in assembler [for his multiply routines] whereas I write my in C with low level calls to assembler routines for the 64 bit arithmetic. When I go from 32 to 64 bits, I cut my calls to these routines by a factor of 4. For me to reduce to 32 bit arithmetic is therefore more expensive. This applies especially to the VAX because the 'calls' instruction generated by the C compiler is a real pig. Secondly, he does his modular multiplication by a method that avoids the division instruction entirely (at the cost of some additional multiplies). I do mine by multiplication followed by long division [slower, but slightly more general]. Since division is slower than multiplication, the effect of going from 64/32 --> 32 to 32/16 --> 16 enhances the speed difference. My claim of 10x varies from machine to machine. On some machines [a microVAX] It has been as high as a factor of 13. On others as low as about 6. On one machine, [An Alliant] I observed a 20% speed difference just from cache effects. Storing in radix 2^30 versus 2^15 halves the storage requirements. I agree that my measurements DO NOT measure JUST the raw speed difference in arithmetic that one obtains from going from 32 to 64 bit multiplies and divides. I also agree that if I were to inline my calls to the low level assembler routines, the effect would be lessened. However, there can't be any doubt that there is a SIGNIFICANT difference between 32 and 64 bit arithmetic when doing multiple precision. The Sparc shows little difference because it has NO multiply or divide instruction for even 32 bits. It comes down to what we are measuring. I am measuring the overall performance benefit for large programs with lots of data including caching, subroutine calls etc. Peter's benchmarks tend to measure more closely just the speed difference in arithmetic. And in that respect, I agree with his assessments. If one had higher level language support for a 'long long' temporary data type so one could compute AB/C, AB mod C, (A << 30 + B)/C, (AB + C) mod 2^30, etc. directly in C, my claim for a 10x speed improvement would come down quite a bit. One thing is clear throughout ALL of this. Machines like the SPARC which provide NO integer multiply/divide are not worth much for this type of work. This is why I question ALUs like the i860 which also has no integer multiply/divide, yet does have a fancy graphics unit. -- Bob Silverman #include <std.disclaimer> Mitre Corporation, Bedford, MA 01730 "You can lead a horse's ass to knowledge, but you can't make him think"
preston@titan.rice.edu (Preston Briggs) (09/13/90)
In article <119977@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes: >One thing is clear throughout ALL of this. Machines like the SPARC >which provide NO integer multiply/divide are not worth much for >this type of work. This is why I question ALUs like the i860 >which also has no integer multiply/divide, yet does have a fancy >graphics unit. I expect you're right. The i860 is supposed to crunch floating-point. The integer half is supposed to do the necessary integer work to support addressing. I guess they intend the graphics portion to display the results... I argued about this with a guy who wanted to put a chess program on an i860. It seems like a big mismatch of problem and resources. Personally, I'd rather see the 860+ drop the graphics instructions and give me a "4-address" multiply-accumulate instruction. Then we could do some crunching. For your stuff, a good integer processor with no FP or graphics or whatever sounds like the ticket. Unfortunately, most machines use integers to do characters, addresses, and sometimes bits. Are there exceptions? Digital signal processor chips? -- Preston Briggs looking for the great leap forward preston@titan.rice.edu
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/13/90)
In article <119977@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes: \\\ >Even Peter reports a 5 fold speed increase. One difference between his ^^^^^^ >code and mine that would tend to exaggerate the difference is that he \\\ # SUN 3/280 SUN 4/260 (both with FPUs) # gcc 1.37.1 cc gcc 1.37.1 cc # MUL1 10.40 13.52 3.67 2.92 (simple integer arithmetic) # MUL2 9.00 10.38 2.03 2.05 (floating point arithmetic) # MUL3 10.14 11.30 6.11 6.32 (break into 16-bit pieces) # MUL4 1.88 1.98 1.82 1.83 (assembly code) Ratio of MUL3/MUL1: .97 .83 1.7 2.2 There does not seem to be a 5-fold difference between using ``simple integer arithmetic'' and ``break into 16-bit pieces'' in Peter's figures. Of course the ratio between ``simple integer arithmetic'' and ``assemby code'' are much higher -- this would be comparing apples & oranges. Since Peter various assembly sources to hand -- I would be interested to see figures concerning: 32x32->64 multiply and 32x32->32 via 16-bit pieces multiply Another experiment, still in assembler, would be to get figures for the same program using a non-naive multiply and divide in which the lack of 32x32->64 would be even less marked. -Kym Horsell
dik@cwi.nl (Dik T. Winter) (09/14/90)
In article <353@kaos.MATH.UCLA.EDU> pmontgom@euphemia.math.ucla.edu (Peter Montgomery) writes: > [If someone can significantly improve my SPARC assembly code, > let me know.] Not significantly, but improvement is possible, see below. > (Leaving out Sun3 and gcc timings the latter are only really relevant for MUL1, MUL2 and MUL3). > # Approximate times: > # > # SUN 4/260 (both with FPUs) > # cc > # MUL1 2.92 (simple integer arithmetic) > # MUL2 2.05 (floating point arithmetic) > # MUL3 6.32 (break into 16-bit pieces) > # MUL4 1.83 (assembly code) (One big note: the verification of results requires MUL4. So large improvements in MUL4 will have effects on timings for other MUL's too!) Below the results I obtained on a SPARCstation 1 (4/60), SPARCstation SLC (4/?) and a SPARCstation 1+ (4/65). Even the ratios are very different! For MUL4 I give three timings, see further down for the reason. SPARC-1 SPARC-SLC SPARC-1+ MUL1 3.98 3.97 3.18 MUL2 2.98 3.22 2.40 MUL3 23.27 10.02 8.09 MUL4 2.47/2.24/2.14 2.45/2.23/2.14 1.97/1.78/1.71 Note that for all three machines the same compiler was used (the SLC compiler), but that the libraries for the individual machines were linked in. The SPARC-1 still has SunOS 4.0.3, the others have SunOS 4.1; that might make a difference. Now the reason for three timings for MUL4. 1. The first timing is for the assembly code of Peter Montgomery. A quick analysis reveals that it uses 1 cycle per bit for the multiply and 5 cycles per bit for the remaindering. (Plus overhead of course.) (And I use 7 cycles for division plus remainder, which cannot be improved in my opinion.) 2. In Peter Montgomery's code there are a lot of annulled branches. The instruction in the delay slot is always executed (and takes time), although the result is voided at times. By a rearrangement of the instructions this can be avoided in most cases, resulting in an expected 4.5 cycles per bit for remaindering. That gives the second timings. 3. It is possible to do the remaindering in 4 cycles per bit, hence the third timings. (The program jumps wildly through the code, but that is no problem.) Now I am going to do something that is not done: comparing apples to oranges. (And back to the architecture issue!) SPARC: Optimum if remainder only is required: 4 cycles per bit; if also the quotient is required: add 3 cycles per bit. Multiply requires 1 cycle per bit. MIPS: Multiply is 10 cycles (I found this somewhere in Kanes book, but can not find it now). Divide for small numbers (i.e. 32/32) is a single instruction, for which I cannot find the timing now. Larger numbers require more cycles. My code uses 9 cycles per bit, which can be reduced to 6 if only the remainder is required. Using hardware divide might give some benefits here. Note that hardware multiply and hardware divide are free running, i.e. other operations can be performed, but the benefit would be very limited in this case. RTPC: (Not really RISC in this aspect, just like mips.) Multiply 2 cycles per bit (4 cycles per multiply step) and divide 3 cycles per bit (a single divide step). My third timings were based on the (well-known) process that is also used in the RTPC divide step instruction. You need a single status bit (RTPC uses carry). The divide step is as follows: 1. If status bit set: subtract divisor, clear status bit if borrow is generated. subtract one from quotient. else add divisor, set status bit if carry is generated. add one to quotient. 2. Shift dividend and quotient one bit. To use it you put the dividend and divisor in the proper registers, set the status bit and execute the required number of divide steps. Finally, if the status bit is clear you add the divisor to the dividend and increment the quotient. Now some (my) opinions: Sun would have done well to include a divide step as outlined above in its architecture. If done in a single cycle it would have reduced Peter Montgomery's timings by about 60%. (And the operation is not very dissimilar from the multiply step.) Mips would have done well to include a 64/32 bit divide in addition to the 32/32 bit divide. I do not think that would have added much complexity to the processor. (The fact that mips has no status bits, and so special instructions have to be performed to obtain status makes the mips divide code slower than the sparc divide code.) IBM with their RTPC did very well in this aspect. -- dik t. winter, cwi, amsterdam, nederland dik@cwi.nl
mash@mips.COM (John Mashey) (09/14/90)
In article <2118@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes: .... >Now some (my) opinions: ... >Mips would have done well to include a 64/32 bit divide in addition to the >32/32 bit divide. I do not think that would have added much complexity to >the processor. ... The opinion that 32x32->64, and 64/32->64 are essentially equivalent, and the latter would only cost a little complexity ..... is wrong, at least in the case of the R3000. (it may not be wrong for a microcoded machine, and it may or may not be wrong for RISCs in general; however, one my note that hardly any RISC gives you 32x32-> 64, but even fewer give you 64/32->64.... An explanation on why this is so was posted about 9 months ago by Craig Hansen and others. I also refer to Patterson&Hennessy, Appendix A on Computer Arithmetic. I won't repeat all of the details, but basically: 1) If divide follows the same, regular decoding structure as everything else, and if the dividend specifies the first of 2 registers (to get 64 bits), consider the consequences, assuming it was div dividend,divisor 1a) Either you need even/odd register pair (to get dividend and dividend|1 OR you need an extra adder (used by nothing else) to get dividend and dividend+1 1b) It becomes the ONLY instruction that needs to fetch 3 32-bit registers as inputs..... SO you need to add a 3rd read-port to the register file OR yuo must irregularize the pipeline in a way that no other instruction does to use 2 register fetch cycles... 2) Suppose you're willing to do that, or also pay the price of extra instructions to set up 64-bit dividend in the extra registers. 2b) For a bunch of reasons, you end up with extra latches & muxes (BAD). (Craig covered this) 3) But suppose you're willing to buy all of that. Now, it you look at Patterson and Hennessy, you will discover that if you do 32x32->64, and 64/32->32, as they say: "Note that the two block diagrams in Figure A.2 are very similar. ... By allowing these registers to shift bidirectionally, the same hardware can be shared between multiplication and division." "Figure A.2. The multiplier has an n-bit adder... the divider has an n+1 bit adder...." a) Integer multiply/divide takes up a nontrivial (not huge, but you can certainly see it on the layout) chunk of space. b) All of this stuff really just barely fit on the original R2000 in 2micron CMOS, and if you look at the layour again, you'll find that the right side of the chip is a regular stack of 32-bit wide datapaths, except the mul/div unit sticks out a few more bits. c) If you want 64/32->64, you discover that: the divisor wants to be expanded to 64 bits, i.e., n=64 now you want to have a 65-bit adder, which: may well be slower than a 33-bit adder may have exceeding awkward layout issues, either requiring something to be folded, or else be twice as wide as the whole rest of the stack. 4) SUMMARY: 32x32->64 with 32/32->32 -is regular in decode & register fetch -shares hardware well between mul and div -has few nasty, ugly layout implications 32x32->64 with 64/32->64 -is surprisingly irregular; in fact, it easily causes numerous special cases in the mainline paths of the machines where it would be the ONLY instruction to have some property - doesn't share as easily - is rife with ugly layout effects MORAL: -Apparently simple-looking additions can have surprisingly ugly and widespread effects -32x32->64 and 64/32->64 possess drastically different implementation effects in the context of a lean-pipelined, single-issue, 2-read-port, 32-bit RISC. They do NOT have the implementation similarity that one would expect. -The hardware folks could probably say this better, but these are at least some of the issues. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (09/14/90)
I was very surprised by the results of Horsell's benchmark. However, it is heavily biassed towards small numbers. I decided to focus on calculating factorials. My test program calculates n! for n from 0 to 1000. I modified Horsell's program to used a fixed buffer, so that malloc() costs could be eliminated. User System User ratio Horsell (16 x 16 -> 16, C code) 311.2 1.9 1.36 slower Horsell (32 x 32 -> 32, C code) 228.8 2.6 My code (32 x 32 -> 64, assembly) 52.6 0.5 4.35 faster These times were obtained on an Encode MultiMax with NS32532 cpus running Encore 4.3BSD, Encore's C compiler, time from /bin/time. I expect that multiplying bignums would also show about a 4x speedup on this machine from using the (32 x 32 -> 64) instruction. -- Heuer's Law: Any feature is a bug unless it can be turned off.
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/14/90)
In article <3759@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes: > User System User ratio >Horsell (16 x 16 -> 16, C code) 311.2 1.9 1.36 slower >Horsell (32 x 32 -> 32, C code) 228.8 2.6 >My code (32 x 32 -> 64, assembly) 52.6 0.5 4.35 faster Could you please include a few more numbers in a followup post? I guess you were trying to save b/w, but it didn't work. -Kym Horsell
keith@mips.COM (Keith Garrett) (09/14/90)
In article <2118@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes: >[multipy comparisions deleted] >MIPS: Multiply is 10 cycles (I found this somewhere in Kanes book, but > can not find it now). Divide for small numbers (i.e. 32/32) is a > single instruction, for which I cannot find the timing now. Larger > numbers require more cycles. My code uses 9 cycles per bit, which > can be reduced to 6 if only the remainder is required. Using hardware > divide might give some benefits here. Note that hardware multiply and > hardware divide are free running, i.e. other operations can be > performed, but the benefit would be very limited in this case. integer multiply is actually 12 cycles (32*32->64)...integer divide is 35 cycles (32/32->quot,rem) there are addition cycles required to move the result from the special mult/div result registers to gp registers (MFHI/MFLO). >[...] -- Keith Garrett "This is *MY* opinion, OBVIOUSLY" Mips Computer Systems, 930 Arques Ave, Sunnyvale, Ca. 94086 (408) 524-8110 keith@mips.com or {ames,decwrl,prls}!mips!keith
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/15/90)
Further to my recent posting of a simple benchmark for testing extended precision arithmetic vv intxint->long multiply, several people have expressed concerns regarding the veracity of the program & its design methodology (if any). Peter Montgomery expressed concerns in private email regarding the apparently small numbers that were being multiplied and that my benchmark might be criticized as being ``untypical'' in that regard. He states he (typically?) performs calculations on numbers in the 100's of decimal digits; Robert Silverman performs similar (and, I presume, larger) calculations. Greg Lindahl (gl8f@atsun7.astro.Virginia.EDU) also raised the question of whether cacheing, specifically code cacheing, would affect the results obtained. He was also concerned about the (possible?) vectorizing of any code that might make results atypical of those performed by Silverman & others. I had intended that my benchmark give _worst possible case_ comparisons between its short & long versions so as to give any possibility of a 10x slowdown a good chance of success. To this end I used the naive algorithm, made the program fairly tight, and tried to eliminate (roughly) secondary effects due to cacheing, memory management, etc. [Although I used malloc() to get storage for some of the intermediate results I noted it used only 0.3% of the runtime so left it in -- thanks for bringing it up O'Keefe]. However, in case there is some doubt about this (now that I look at it again) somewhat questionable reasoning, I adjusted my benchmark -- by `offsetting`` the i & j to calculate (slightly) larger factorials so that the average size of numbers exceeds 1000 decimal digits (1028 digits on average). The results for this modified benchmark are as follows: --------------------------------------------------------------------------- 16x16->32 8x8->16 ratio original VAX 6640 22.9 79.9 3.49 3.0 (14.7) (50.3) 3.42 2.8 VAX 8530 60.8 197 3.24 3.0 (58.6) (198) 3.38 3.2 Sun Sparc 1 48.2 142.6 2.96 2.8 (bc -- 181.9!) (25.9) (74.8) 2.89 3.0 SUN 3/60 91.5 244.2 2.67 2.5 (45.1) (123.8) 2.74 2.5 --------------------------------------------------------------------------- To summarize, my estimate that I had produced a ``worst case'' comparison was a little off -- all the ratios, except for the Sparc, have crept up marginally by going to larger precision calculations. Presuming a quadratic behavior we might perhaps expect (on the basis of this _very_ small sample) to have a 10x slowdown at around 1M digits or possibly 100,000. I will investigate this final possibility (anyone got a Cray with a few Gb for rent?) of 10x slowdown and report my findings at a later date. [I will not point out my dispassionate attitude toward my own hypotheses (;-) ]. As for possible affects by cacheing/vectorizing, etc., I _should_ investigate these (we _do_ have an underutilized 3090 out here). A priori I expect no appreciable change. No promises at this stage. -Kym Horsell P.S. I have been reading this group now for about 6 months; I've been a bit too busy for the last few years to do so. I'm amazed (and perhaps in John Mashey's case, a bit disappointed) that better use isn't made of the facility. ---- I know if you go near a horses ass it might kick you into the water.
vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (09/15/90)
In article <41497@mips.mips.COM>, mash@mips.COM (John Mashey) writes: > .... > -Apparently simple-looking additions can have surprisingly > ugly and widespread effects Not to disagree, I have many times bemoaned the lack of some kind of fast carry indication from the MIPS add instruction while looking for ways to make the TCP/IP checksum faster. If a single instruction could could compute 32-bit + 32-bit -> 32-bit with carry-bit, you could substantially better the number instructions in all of the algorithms I've found. The savings of having such a facility, counted in seconds, would be reduced by apparently unavoidable cache delays. That does not stop the moaning. One can construct tighter checksum loops on the 29K, 68K, and the 80386. Given operating system tricks on the byte copies, computing the checksum is the bigger of the two major places the code I know spends most of its time. My limited and decades past dabbling with general purpose long arithmetic makes me ask, would there be a similar benefit to extended precision addition? Vernon Schryver vjs@sgi.com
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/15/90)
In article <69434@sgi.sgi.com> vjs@rhyolite.wpd.sgi.com (Vernon Schryver) writes: \\\ >Not to disagree, I have many times bemoaned the lack of some kind of fast >carry indication from the MIPS add instruction while looking for ways to >make the TCP/IP checksum faster. If a single instruction could could compute > 32-bit + 32-bit -> 32-bit with carry-bit, >you could substantially better the number instructions in all of the >algorithms I've found. The savings of having such a facility, counted in \\\ I'm afraid the hard-hearted answer has got to be along the lines of: "What proportion of time is spent in the checksum code, & by what factor is it increased by not having the convient carry"? If the answer is a number greater than EPS (your tollerance for such things) will you then _personally_ pay me four figures to lay it out in Si? If the answer is ''no'' I'm afraid you have to stick in the extra instructions. -Kym Horsell
vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (09/15/90)
In article <4025@bingvaxu.cc.binghamton.edu>, kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes: >> In article <69434@sgi.sgi.com> I wrote about a carry bit > > I'm afraid the hard-hearted answer has got to be along the lines of: > "What proportion of time is spent in the checksum code, & by what factor > is it increased by not having the convient carry"? Details of the family of implementations I know best would probably be considered proprietary by those who paid for it. However, one can estimate, based on the often repeated fact that to a first approximation, TCP input runs as fast as you can fondle the bytes: a) any fetches and stores to get from the media to a system buffer (0 if you have media-to-host DMA) b) 1 fetch to compute the checksum c) 1 fetch and 1 store to copy to the user process buffer (unless you cheat--all's fair as long as the data gets where it is needed) A simple implementation with direct DMA and no cheating needs 2 fetches and 1 store, The total cost of the TCP checksum in such a system is around 30%. Output is similar. Cache effects are important but ignored here. The trickier you are elsewhere, the large the checksum looms. Current, fast workstations move about 1MByte/sec TCP/IP user-process-to- user-process over ethernet. Ignoring important details, one saved instruction/byte is a million saved instructions/sec. Simplistically, if you could "fetch, add word to accumulator, add-carry to accumulator", you could save 0.5 instructions/byte on a MIPS CPU. Of course an ADDC instruction would cost in many other places, dragging in all of the disadvantages of status bits. I understand that the 6MByte/sec (?) figure from Cray benefits from vector hardware to do the checksums. The TCP checksum by itself does not justify ALU status bits in general purpose CPU's in high performance workstations, because of various bits of specialized hardware in various vendors' implementations for various media. What do the extend precision experts say about carry bits? Vernon Schryver vjs@sgi.com
ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (09/15/90)
I wrote User System User ratio Horsell (16 x 16 -> 16, C code) 311.2 1.9 1.36 slower Horsell (32 x 32 -> 32, C code) 228.8 2.6 My code (32 x 32 -> 64, assembly) 52.6 0.5 4.35 faster (times in seconds reported by /bin/time on an Encore Multimax) In article <4015@bingvaxu.cc.binghamton.edu>, kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes: > Could you please include a few more numbers in a followup post? Willingly. But which? Surely the one number 4.35 establishes the disputed fact: 32x32->64 multiplication (which C does not give you access to at all) can give you a worthwhile speedup for bignum multiplication. Two very important numbers for this newsgroup are -- what impact did inclusion of 32x32->64 multiplication have on the rest of the architecture? (It also has 64/32->32+32 division.) -- how many NS32532 programs are actually likely to benefit from this instruction? My answers are "no idea" and "dunno, but probably very few, because the only HLL I've got that uses the instruction is T." -- Heuer's Law: Any feature is a bug unless it can be turned off.
cik@l.cc.purdue.edu (Herman Rubin) (09/15/90)
In article <41497@mips.mips.COM>, mash@mips.COM (John Mashey) writes: > In article <2118@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes: ........................ > 1) If divide follows the same, regular decoding structure as everything > else, and if the dividend specifies the first of 2 registers (to get 64 > bits), consider the consequences, assuming it was div dividend,divisor > 1a) Either you need even/odd register pair (to get > dividend and dividend|1 > OR you need an extra adder (used by nothing else) to get > dividend and dividend+1 > 1b) It becomes the ONLY instruction that needs to fetch 3 32-bit > registers as inputs..... There is another instruction, badly needed in many applications, which has the same problems; that is double-register shift by an arbitrary amount. Also, extract field and insert field would benefit greatly from this. Alternatively, this could be done in the floating point units, which already have even more complexity. None of the people wanting to use these instructions for arithmetic care that the address-and-loop-control unit is used for fo good arithmetic operations. ........................ > a) Integer multiply/divide takes up a nontrivial (not huge, > but you can certainly see it on the layout) chunk of space. > b) All of this stuff really just barely fit on the original R2000 > in 2micron CMOS, and if you look at the layour again, you'll find that > the right side of the chip is a regular stack of 32-bit wide > datapaths, except the mul/div unit sticks out a few more bits. The floating point unit already has this and more. -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!cik(UUCP)
aglew@crhc.uiuc.edu (Andy Glew) (09/16/90)
In article <4025@bingvaxu.cc.binghamton.edu>, kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes: > I'm afraid the hard-hearted answer has got to be along the lines of: > "What proportion of time is spent in the checksum code, & by what factor > is it increased by not having the convient carry"? On a 68030 based system I found that 15% of the entire CPU was being spent in in_chksum() (my memory may be failing wrt. exact name), on real user workloads (namely, the backbone machines on our local net). This was for the naive byte-at-a-time one's complement sum (again, memory may be failing me. I believe it was a one's complement sum, but there have been quite a variety of checksums). Unrolling the loop and computing the checksum 32 bits at a time instead of 8 bits at a time gave me approximately a 6-8-fold speedup. A bit of instrumentation showed that the overwhelming majority of packets were of only two sizes, and these were special cased. With the new code, in_chksum() was reduced to around 4% of the CPU. (Not linearly divided by speedup because of overhead, and traffic increase). I used the carry-out and carry-in to do this. Coding without the carry would approximately double number of instructions for this checksum, but many of the added instructions would be branches. Actually, I'd probably only do it 16 bits at a time, no branches, which would be, again, a 3-fold slowdown (shifts and masks in the loop). Ie. based on my experience coding in_chksum(), but not having coded it on a MIPS, I would estimate that the slowdown through not having carry out and in is approximately 3-fold wrt. good code that uses carry-out and in. But this is only an upper bound, because overhead of call, etc., gets in the way. I do not wish to pass judgement on the usefulness of carries; I only wished to provide a data point for "by what factor is it [the checksum] increased by not having the convenient carry". -- Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]
amos@taux01.nsc.com (Amos Shapir) (09/17/90)
[Quoted from the referenced article by ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe)] > >Two very important numbers for this newsgroup are > -- what impact did inclusion of 32x32->64 multiplication have on > the rest of the architecture? (It also has 64/32->32+32 division.) Not much - since this is a CISC, this type of division (there are 4 more) is just another entry of the division routine in the microcode. > -- how many NS32532 programs are actually likely to benefit from this > instruction? Quite a few - the NS32532 doen't have a FPU on-board, so many implementations emulate it in software; there are also many other uses which have been listed in this thread before. -- Amos Shapir amos@taux01.nsc.com, amos@nsc.nsc.com National Semiconductor (Israel) P.O.B. 3007, Herzlia 46104, Israel Tel. +972 52 522255 TWX: 33691, fax: +972-52-558322 GEO: 34 48 E / 32 10 N