vu0310@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/12/90)
After a number of private communications, I've managed to render
one of the little benchmarks I have presentable enough to post,
along with some performance figures from different machines
(basically, it's put up or shut up time).

It has been claimed that a lack of 32x32->64 multiplication
makes a factor of 10 difference in the running time of
typical extended precision arithmetic routines. Although it
obviously makes _a_ difference in run time I do not measure
an order of magnitude difference.

Following is a table of the total runtime for a particular
benchmark that does a few simple-minded products. The figures
in parens are those with optimization on.

---------------------------------------------------------------------------
Machine         32x32->32    8x8->16     ratio    bc

VAX 6440          1.3          3.9        3.0
                 (0.88)       (2.5)       2.8

VAX 8530 cc       3.2          9.5        3.0     3.1!!!
                 (2.9)        (9.3)       3.2
         gcc      3.1          8.4        2.7
                 (1.9)        (6.1)       3.2

SUN 3/60          4.8         12.2        2.54    7.8
                 (2.5)        (6.2)

Sun Sparc 1       2.4          6.6        2.75    2.5!!!
                 (1.1)        (3.3)       3.0
---------------------------------------------------------------------------

The benchmark itself follows:

#include <stdio.h>
#include <malloc.h>

#define DEBUG /***/
#define DOUBLE

#ifdef DOUBLE
#define NUM 15
#define MAX 100
#define SHORT unsigned short
#define LONG unsigned long
#else
#define NUM 15
#define MAX 200
#define SHORT unsigned char
#define LONG unsigned short
#endif /*DOUBLE*/

#ifdef __STDC__
#define _(x) x
#else
#define _(x) ()
#endif

#ifdef DEBUG
#define Assert(x) {if(!(x)){fprintf(stderr,"Assert failed: line %d\n",\
    __LINE__); exit(-1);}}
#else
#define Assert(x)
#endif

typedef struct {
    SHORT *top,*bot,*max;
    } B;

B *a[NUM][NUM],*b[NUM][NUM];

B *zap _((B*));
B *newb _((void));
void kill _((B*));
void part1 _((void));
void part2 _((void));
B *add2 _((B*,B*));
B *fact _((int));
B *mul3 _((B*,B*,B*));
B *muladd _((B*,B*,int));

main() {
    printf("sizeof SHORT=%d\n",8*sizeof(SHORT));
    printf("sizeof LONG=%d\n",8*sizeof(LONG));
    part1();
    part2();
    exit(0);
    }

void part1() {
    int i,j;
    for(i=0; i<NUM; ++i)
    for(j=0; j<NUM; ++j) {
        b[i][j] = fact(i+j);
        a[j][i] = fact(i*j);
        }
    }

void part2() {
    int i,j,k;
    B *t1 = newb(), *t2 = newb();
    for(i = 0; i<NUM; ++i)
    for(j = 0; j<NUM; ++j) {
        zap(t1);
        for(k = 0; k<NUM; ++k) {
            mul3(t2,a[i][k],b[k][j]);
            add2(t1,t2);
            }
        }
    kill(t1);
    kill(t2);
    }

B *fact(x) {
    B *res = newb();
    SHORT *ptr;
    LONG cry;
    res->bot[0] = 1;
    while(x-->1) {
        cry = 0;
        for(ptr=res->bot; ptr<=res->top; ) {
            cry += (LONG)x**ptr;
            *ptr++ = cry;
            cry >>= 8*sizeof(SHORT);
            }
        if(cry) *(res->top = ptr) = cry;
        }
    Assert(res->top<res->max);
    Assert(res->bot<=res->top);
    return res;
    }

/* x = y*z */
B *mul3(x,y,z) B *x,*y,*z; {
    SHORT *ptr;
    Assert(y->top<y->max);
    Assert(y->bot<=y->top);
    Assert(z->top<z->max);
    Assert(z->bot<=z->top);
    zap(x);
    for(ptr = z->bot; ptr<=z->top; )
        muladd(x,y,*ptr++);
    return x;
    }

/* x += y*k */
B *muladd(x,y,k) B *x,*y; {
    SHORT *ptr,*ptr2;
    LONG cry = 0;
    Assert(x->top<x->max);
    Assert(x->bot<=x->top);
    Assert(y->top<y->max);
    Assert(y->bot<=y->top);
    ptr = x->bot;
    for(ptr2 = y->bot; ptr2<=y->top; ) {
        Assert(ptr<x->max);
        cry += (LONG)k**ptr2++;
        if(ptr<=x->top) cry += *ptr;
        *ptr++ = cry;
        cry >>= 8*sizeof(SHORT);
        }
    while(cry) {
        cry += *ptr;
        *(x->top = ptr++) = cry;
        cry >>= 8*sizeof(SHORT);
        }
    return x;
    }

/* x += y */
B *add2(x,y) B *x,*y; {
    SHORT cry;
    SHORT *ptr,*ptr2;
    Assert(y->top<y->max);
    Assert(y->bot<=y->top);
    for(ptr = x->bot,ptr2 = y->bot; ptr2<=y->top; ) {
        cry += (LONG)*ptr+*ptr2++;
        *ptr++ = cry;
        cry >>= 8*sizeof(SHORT);
        }
    if(cry) *(ptr = x->top) = cry;
    return x;
    }

B *newb() {
    B *tmp;
    tmp = (B*)malloc(sizeof(B));
    Assert(tmp);
    tmp->top = tmp->bot = (SHORT*)malloc(MAX*sizeof(SHORT));
    Assert(tmp->top);
    tmp->max = tmp->bot + MAX;
    return tmp;
    }

void kill(x) B*x; {
    free(x->bot);
    free(x);
    }

B *zap(x) B*x; {
    x->top = x->bot;
    x->bot[0] = 0;
    return x;
    }

---------------------------------------------------------------------------

A word of explanation -- on the machines tested `short' was 16,
`char' was 8 and `long' (_sometimes_ the same size as `int') was
32 bits.

The idea is to compare the running times of the two versions of the
program created by alternately defining and undefining `DOUBLE'.

With DOUBLE _defined_ the program uses 32x32->32 arithmetic to
essentially perform 16x16->32 bit arithmetic. In std C this is
performed by

    int x,y;
    long z;
    z = (long)x * y;

which extends both operands (the order of loading operands into
registers is, of course, not defined, to allow optimization some
elbow-room).

With DOUBLE _undefined_ (comment out that line) the program performs
the same computation using 16x16->16 in order to perform essentially
8x8->16 bit arithmetic. The arrangement is similar to the above.
(Of course the program uses unsigned arithmetic in these contexts.)

I think the results show that nxn->2n bit arithmetic can be replaced,
at least for _this_ benchmark, by 2n x 2n -> 2n with a loss of about
a factor of 3.

Another point to note (although Robert Silverman is not impressed) is
the performance of `bc'. Taking into consideration that it has no
assembler coding (I am informed so by those-who-know at Sun) and, I
understand, performs its arithmetic on _bytes_ that simply code
`higits' in [0,99], it compares remarkably well with hard-coded
(though not very good) C.

I would appreciate someone pointing out the error in my reasoning.

-Kym Horsell.

P.S. The benchmark was _deliberately_ kept rather simple; I
wanted to measure the performance of _multiply_ in
the context of basic extended precision arithmetic, not
the memory or i/o subsystems.
bs@linus.mitre.org (Robert D. Silverman) (09/12/90)
In article <3989@bingvaxu.cc.binghamton.edu> vu0310@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>
>After a number of private communications, I've managed to render
>one of the little benchmarks I have presentable enough to post,
>along with some performance figures from different machines
>(basically, it's put up or shut up time).
>
>It has been claimed that a lack of 32x32->64 multiplication
>makes a factor of 10 difference in the running time of
>typical extended precision arithmetic routines. Although it
>obviously makes _a_ difference in run time I do not measure
>an order of magnitude difference.

 some silly benchmark deleted....

You are totally confused. Nowhere in your benchmark did I see any assembler
code to support 32 x 32 --> 64. You are STILL working in radix 2^15
[or maybe radix 2^16; I didn't check closely], whereas allowing 32 x 32 --> 64
would allow you to work in radix 2^30 or 2^31 [depending on how careful
you want to be about sign bit and 1's complement/2's complement portability
problems]. Change of radix ALONE would account for a factor of 4 FEWER
multiplies.

Secondly, you are working with tiny examples. Working in a larger radix
in a REAL WORLD PROGRAM means less storage required for each multi-precise
variable, less loop overhead, fewer subroutine calls, fewer cache misses,
etc. Try implementing (say) a prime proving algorithm, or an RSA encryption
algorithm, or a multi-precise FFT over the rationals or anything but a
tinkertoy program with just 1 or two multiprecise numbers where everything
fits in cache.

Furthermore, working in a power of two radix in performing 32 x 32 --> 64
allows one to carefully implement some optimizations (in assembler) that
are not readily available in C.

By the way, 32 x 32 --> 64 can be needed for OTHER than just multiple
precision support. Suppose A > B and A > C. How would one compute
BC/A EXACTLY without 64 bit support? If all are 32 bit quantities,
BC/A also fits into 32 bits, yet computing BC can overflow. Similarly,
one cannot compute BC mod A without such support.
--
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"
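[To make the BC/A example concrete, here is a small illustrative sketch.
It is not from either poster; it assumes a compiler with a 64-bit
`unsigned long long' type, used purely as a stand-in for a 32x32->64
hardware multiply and the matching 64/32 divide. The point is that with
a widening product both B*C/A and B*C mod A come out exactly, even
though B*C itself overflows 32 bits.]

    #include <stdio.h>

    /* "unsigned long" plays the 32-bit word of the discussion;
       "unsigned long long" stands in for the missing 32x32->64 support. */
    unsigned long quo_exact(unsigned long a, unsigned long b, unsigned long c)
    {
        unsigned long long p = (unsigned long long)b * c;  /* full product */
        return (unsigned long)(p / a);  /* < 2^32 whenever a > b and a > c */
    }

    unsigned long rem_exact(unsigned long a, unsigned long b, unsigned long c)
    {
        unsigned long long p = (unsigned long long)b * c;
        return (unsigned long)(p % a);
    }

    int main(void)
    {
        /* b*c here is about 9e18, far beyond any 32-bit intermediate */
        printf("%lu %lu\n",
               quo_exact(4000000000UL, 3000000000UL, 2999999999UL),
               rem_exact(4000000000UL, 3000000000UL, 2999999999UL));
        return 0;
    }

Without the widening type (or the hardware operation it stands in for),
neither quotient nor remainder can be written directly in C.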
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/13/90)
In article <119858@linus.mitre.org> bs@linus.UUCP (Robert D. Silverman) writes:
\\\
>You are totally confused. Nowhere in your benchmark did I see any assembler
>code to support 32 x 32 --> 64. You are STILL working in radix 2^15
>[or maybe radix 2^16; I didn't check closely], whereas allowing 32 x 32 --> 64
>would allow you to work in radix 2^30 or 2^31 [depending on how careful
>you want to be about sign bit and 1's complement/2's complement portability
>problems]. Change of radix ALONE would account for a factor of 4 FEWER
>multiplies. Secondly, you are working with tiny examples. Working in a
>larger radix in a REAL WORLD PROGRAM means less storage required for each
>multi-precise variable, less loop overhead, fewer subroutine calls, fewer
>cache misses, etc.
\\\

Funny, I don't _feel_ confused. Some of the points you make _are_
valid in some sense, but we're diverging from the topic.

My whole point is that 32x32->64 is not _required_ in order to support
extended precision arithmetic. There are various definitions of
_required_ here: ``logically required'', ``convenience'', etc.
Obviously the operation isn't _logically_ required, and as I've
pointed out, its introduction poses some logical _problems_. With
this benchmark, I'm also trying to show that _convenience_ is also
not an issue (esp. in the light of the ``bottom line'', which is what
(almost?) everything boils down to, whether we like it or not).

The fact that I'm doing multiply naively is in _favour_ of
your argument that

	``[without 32x32->64 high prec arithmetic will] run 10 times slower''.

If I had written this code using _better algorithms_ the difference
between using 16x16->16 and 32x32->32 would be _even less_,
thereby weakening your claim further.

Consider, we all know of a fairly simple divide-and-conquer
high-prec multiply with complexity of O(n^lg(3)). (Actually we
_can_ do them in almost linear time, but the cost isn't worth
it yet).

Instead of using 32x32->64 arithmetic we _could_ use
32x32->32 arithmetic and only use the low-order bits
for multiply.

If we do this we must perform 4 times more ``small''
multiplications to perform the ``big'' one; but each of these
``small'' multiplies is performed on, effectively, 16-bit
operands and should therefore be several times faster.
(Checks for effectively multiplying by zero are _almost_ universal).

In the limit, see above, this would work out at

	each multiply (2)^lg(3) times faster / 4 times more multiplies
		= 3/4 of ``nxn->2n'' speed

That is, *asymptotically* you only lose 25% in performance by using
``proper'' 32-bit arithmetic (to *simulate* 16x16->32) over having
the ``convenience'' operation of 32x32->64 multiply.

A 25% reduction in speed is not significant enough, in my accounting,
to justify any _extra_ costs required to maintain the 32x32->64 crock
in a HLL (_despite_ the fact that, historically, ``that's the way the
hardware does it anyway''). However, we can say 32 bits isn't close
enough to the asymptote to justify any claims -- hence a little
benchmark.

Considering the amount of extended precision arithmetic that is
performed anyway vis-a-vis fixed-point or the normal floating-point
hackery, the _theoretical_ (if you like) slowdown of 25% shown above,
and in the light of _actual measurements_ of factors of 2-3 using
naive algorithms, I remain unconvinced that 32x32->64 is required;
the folks who have dropped it off their list of things to implement
are justified I think.

-Kym Horsell
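[For concreteness, here is the ``4 times more small multiplies''
trade-off written out as C. This sketch is not part of Horsell's
benchmark; it assumes `unsigned int' is 32 bits, as `long' was on the
machines in his table. It builds the full 64-bit product of two 32-bit
operands from four 16x16 partial products, each computed with an
ordinary 32x32->32 multiply, recovering the carries by comparison.]

    /* Sketch only: assemble x*y (64 bits) from four 16x16->32 partial
     * products, using nothing wider than 32x32->32. */
    typedef unsigned int u32;       /* assumed to be 32 bits */

    void mul32x32(u32 x, u32 y, u32 *hi, u32 *lo)
    {
        u32 xl = x & 0xffff, xh = x >> 16;
        u32 yl = y & 0xffff, yh = y >> 16;

        u32 ll = xl * yl;           /* the four "small" multiplies;  */
        u32 lh = xl * yh;           /* none of them can overflow     */
        u32 hl = xh * yl;           /* 32 bits                       */
        u32 hh = xh * yh;

        u32 mid    = lh + hl;       /* may wrap by one bit...        */
        u32 midcry = (mid < lh);    /* ...which a compare recovers   */

        u32 low    = ll + (mid << 16);
        u32 lowcry = (low < ll);

        *lo = low;
        *hi = hh + (mid >> 16) + (midcry << 16) + lowcry;
    }

With a 32x32->64 instruction all of this collapses to one multiply;
without it, the four multiplies plus the carry bookkeeping are the
price, which is roughly the factor the benchmarks in this thread are
measuring.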
bs@linus.mitre.org (Robert D. Silverman) (09/13/90)
In article <3994@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
:In article <119858@linus.mitre.org> bs@linus.UUCP (Robert D. Silverman) writes:
:\\\
:>You are totally confused. Nowhere in your benchmark did I see any assembler
:>code to support 32 x 32 --> 64. You are STILL working in radix 2^15
:>[or maybe radix 2^16; I didn't check closely], whereas allowing 32 x 32 --> 64
:>would allow you to work in radix 2^30 or 2^31 [depending on how careful
:>you want to be about sign bit and 1's complement/2's complement portability
:>problems]. Change of radix ALONE would account for a factor of 4 FEWER
:>multiplies. Secondly, you are working with tiny examples. Working in a
:>larger radix in a REAL WORLD PROGRAM means less storage required for each
:>multi-precise variable, less loop overhead, fewer subroutine calls, fewer
:>cache misses, etc.
:\\\
:
:Funny, I don't _feel_ confused. Some of the points you make _are_
:valid in some sense, but we're diverging from the topic.
:
:My whole point is that 32x32->64 is not _required_ in order to support

It is if you want any speed out of your code.

:extended precision arithmetic. There are various definitions of
:_required_ here: ``logically required'', ``convenience'', etc.

Executing in 1 day instead of 10 isn't a convenience. It's a necessity.

:Obviously the operation isn't _logically_ required, and as I've

One can implement an ALU that has only 8 or so instructions. Logically,
nothing else is required. So why bother at all with the rest that get
implemented in practice? Logically, they are not required at all.
They get put in for reasons of SPEED. Those doing large scale
computations want computers to be FAST. This is hardly a matter of
convenience.

:The fact that I'm doing multiply naively is in _favour_ of
:your argument that
:
:	``[without 32x32->64 high prec arithmetic will] run 10 times slower''.
:
:If I had written this code using _better algorithms_ the difference
:between using 16x16->16 and 32x32->32 would be _even less_,

But one can apply these same 'better algorithms' with 32 x 32 --> 64
just as readily.

:thereby weakening your claim further.
:
:Consider, we all know of a fairly simple divide-and-conquer
:high-prec multiply with complexity of O(n^lg(3)). (Actually we
:_can_ do them in almost linear time, but the cost isn't worth

Have you ever implemented Karatsuba's algorithm? Tried it in practice?
What are the constants in your big O estimates? In practice, one does
not recursively bisect all the way down to the word level. One stops
the bisection at about 8 machine words. It is NOT n^lg 3 until one
reaches many hundreds (or thousands, depending on implementation).
Furthermore, the method is only good when multiplying numbers of the
same size. Karatsuba does NOTHING to change the basic argument. One
can apply it in radix 2^30 as easily as in 2^15.

:Instead of using 32x32->64 arithmetic we _could_ use
:32x32->32 arithmetic and only use the low-order bits
:for multiply.
:
:If we do this we must perform 4 times more ``small''
:multiplications to perform the ``big'' one; but each of these
:``small'' multiplies is performed on, effectively, 16-bit
:operands and should therefore be several times faster.

Wrong. With proper HARDWARE there is no reason why 32 x 32 --> 64
can't be done in as few cycles as 16 x 16 --> 32. It just takes more
gates.

I'm through arguing with you.
It's obvious that you've never actually done any large scale
multi-precise calculations, and in the absence of that, your arguments
have little weight behind them. Your mind seems to be made up and you
refuse to accept the arguments of others who have VASTLY more
experience in doing this kind of arithmetic.
--
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"
bmk@csc.ti.com (Brian M Kennedy) (09/13/90)
=>It has been claimed that a lack of 32x32->64 multiplication
=>makes a factor of 10 difference in the running time of
=>typical extended precision arithmetic routines. Although it
=>obviously makes _a_ difference in run time I do not measure
=>an order of magnitude difference.
I cannot comment directly about the performance gains possible
for general multi-precision arithmetic, but I can comment on
a special case, 64-bit arithmetic implemented via 32-bit
arithmetic.
I have a highly optimized simulator that must do 64-bit
arithmetic. For portability, the simulator is written
in C++/ANSI C. ANSI C provides no support for multi-precision
arithmetic. It only guarantees that long is *at least* 32-bit.
Therefore, I have written a C++ type, doubleLong, which
guarantees at least 64-bit. The operation doubleLong*doubleLong
is implemented, as you might guess, with multiple long*long
operations.
IF ANSI C supported a library function
void mult(long Op1, long Op2, long Product[2])
which multiplied Op1 times Op2 and placed the result in the
two longs making up the array Product,
AND my hardware provided support for 32*32->64, such that the
library function was implemented using that support,
THEN, I could implement doubleLong multiplication much more
efficiently.
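[To show why such a primitive pays off, here is a small sketch. This is
not Kennedy's doubleLong code; the umult() routine, the u32/dl types and
the hi/lo layout are assumptions made for illustration, and umult()
fakes the missing 32x32->64 operation with a 64-bit type purely so the
sketch is self-contained. With a widening multiply, a 64*64->64 product
needs one full multiply plus two truncated ones, since the cross terms
only contribute their low 32 bits and the top term vanishes mod 2^64.]

    /* Illustrative sketch only -- not the doubleLong class itself. */
    typedef unsigned int u32;             /* assumed to be 32 bits       */
    typedef struct { u32 hi, lo; } dl;    /* a 64-bit value as two words */

    /* Stand-in for the wished-for library routine: p[0] = high word,
     * p[1] = low word of a*b.  A real implementation would be the
     * hardware's 32x32->64 multiply. */
    static void umult(u32 a, u32 b, u32 p[2])
    {
        unsigned long long t = (unsigned long long)a * b;
        p[0] = (u32)(t >> 32);
        p[1] = (u32)t;
    }

    static dl dl_mul(dl a, dl b)          /* 64*64->64, i.e. mod 2^64 */
    {
        u32 p[2];
        dl r;

        umult(a.lo, b.lo, p);             /* the only full product needed */
        r.lo = p[1];
        r.hi = p[0] + a.lo * b.hi + a.hi * b.lo;  /* low 32 bits suffice  */
        return r;                         /* a.hi*b.hi*2^64 drops out     */
    }

Without the widening multiply, each of the three products above must
itself be synthesized from 16-bit halves, which is where the cost being
measured below comes from.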
The question in this thread has been "How much more efficient?"
Speed-ups from 2.5 to 10 have been claimed. Here are my measurements
using doubleLong:
Ideally, I would measure a program full of multiplications of
doubleLong's, first with the current 32*32->32 implementation,
and then with 32*32->64 implementation. Unfortunately, I have
no hardware support for the latter. Instead I will measure
an upper-bound on the performance increase by comparing:
64*64->64 via 32*32->32 vs. 32*32->32
I assert that (64*64->64 via 32*32->64) calculations are faster than
(64*64->64 via 32*32->32) ones, but slower than just plain 32*32->32.
The Environment:
---------------
Mips R3000 running RISC/os 4.50, medium load
C++ code compiled with ATT C++ 2.0
C code compiled with Mips 4.5 C compiler with -O1
The Results:
-----------
The program was run a number of times. The min and max times are
shown.
          32*32->32                       64*64->64 via 32*32->32
          ---------                       -----------------------
MIN:  %multiSpeed                         %multiSpeed
      Final product = 246447234           Final product = 99_999_970_000_002
      CPU Time  =  7650 ms                CPU Time  = 83730 ms
      Real Time = 17187 ms                Real Time = 86343 ms

MAX:  %multiSpeed                         %multiSpeed
      Final product = 246447234           Final product = 99_999_970_000_002
      CPU Time  =  7820 ms                CPU Time  = 85960 ms
      Real Time = 19191 ms                Real Time = 137034 ms
From these timings, I get an approximate ratio of -- 11 --.
This is about what I expected, knowing the code.
Therefore, the speedup I would see from having a 32*32->64
operation available to me through ANSI C would be less than 11.
Knowing the code, I would guess the speed-up would be about 4.
That speedup, of course, is on a program that does little more
than 64-bit multiplies. On my simulator that uses doubleLong,
the gain would be much less than five percent.
Finally, general multi-precision arithmetic may get more gains
than indicated here. Also the gains possible when coding in
assembly are probably higher than in C++.
The program (in C++):
--------------------
// Switch the definition of NUMBER between long and doubleLong
// to obtain the alternate versions of this program.
#include "doubleLong.h" // contains definition of doubleLong
#include "timer.h" // for software timings
#define NUMBER doubleLong // for 64*64->64 calculations,
// implemented with 32*32->32 calculations
//#define NUMBER long // for 32*32->32 calculations
#define COUNT 10000000
main ()
{NUMBER Product;
timer Timer;
Timer.start();
for(int I = 0; I < COUNT; I++)
{Product = NUMBER(I) * NUMBER(I-1);}
Timer.split();
cout << "Final product = " << Product << endl << endl;
cout << " CPU Time = " << Timer.cpu() << " ms" << endl;
cout << " Real Time = " << Timer.real() << " ms" << endl;
}
---------------------------------
Brian M. Kennedy <bmk@csc.ti.com>
Computer Systems Laboratory
Computer Science Center
Texas Instruments Incorporated
dik@cwi.nl (Dik T. Winter) (09/13/90)
Well, after the discussion between R. Kym Horsell and Robert D. Silverman,
here are some real facts. First the background of it.

Eons ago Arjen Lenstra (now at the University of Chicago, I think) wrote a
program (in Fortran) to do primality proving for long numbers (up to
about 320 digits). A writeup about it can be found in some back issue
of Mathematics of Computation. For this program I did the base arithmetic
kernels. Initially the program was written and executed on a CDC Cyber
750-75. Later this program was ported to a Cray 1 and a Cyber 205, and
again I did the basic arithmetic kernels, using vector operations. When
we moved over to do our main work on minis etc., I brought along the
program and ported it to a large platform of minis. The basic structure
of the arithmetic kernels is very simple, it allows for kernels written
in C or Fortran, using either the HOL arithmetic or some parts in assembler
allowing for 32x32->64 bit multiply and associated division. Also provisions
are made to allow for simple loops over these operations to be coded in
assembler. In this way the subroutine call overhead for a large number of
calls can be avoided.

Now whenever I get access to some machine I try to port the program to
it. For a full port at least a Fortran compiler is required (as the base
program is still in Fortran). Access to a C compiler is useful. I was
not successful on one port only (to a Mac II) because of severe bugs in
the Fortran compiler. Also there are a few platforms where an assembler
is not available (CDC Cyber 995, NEC SX/2). And in a number of cases I
did a port of the arithmetic kernels, but could not do a full port
because I had no access to an adequate Fortran compiler (DG/MV and Intel
8086 with Xenix). (And, no, the stuff is not yet publicly available;
some things have still to be done.)

Now, when Bob Silverman claimed that he experienced a 10 fold speedup
when going from 32x32->32 multiply to 32x32->64 multiply, I can only
say that I do not find such a speedup. See the table below. On the
other hand, Kym Horsell is also off base with his benchmark. Also his
claims about input and output are besides the truth. In my experience,
with this kind of work, a program will run happily during several seconds/
minutes/hours, doing no I/O at all and then finally write only a single
or a few numbers. So in effect every effort spent to speed up I/O is
essentially wasted. (The program I use does indeed twice write the number
it is testing; that is really a waste of time, but only a few
milliseconds in a run.)

Andrew Mullhaupt was of course right when he stated that 32x32->64
multiply is much more important than 64/32->32,32 division. Even in a
multiprecise division, the number of elementary divisions is order n
while the number of elementary multiplications is order n squared.

Now back to the data. First column gives the machine, second column the
amount of time needed when using 32x32->64 multiply as compared to 32x32->32
multiply (in percents), the third column gives the same when also some
basic loops are coded.
                                percentage of time needed when using
                                   32x32->64       coded loops
DEC VAX-780                          62.84%          46.50%
Harris HCX7 (Tahoe)                  61.75%          59.38%
Sun3 (3/60, 68020)                   70.68%          56.23%
Alliant FX/4 (68020(1))              66.50%          48.43%
Sequent Symmetry (80386)             56.48%            (2)
Sequent Balance (32032)              58.97%          46.08%
Encore Multimax (32332)              56.73%          45.81%
Encore Multimax (32532)              63.92%          48.02%
Gould PowerNode                      59.06%          42.09%
SGI (mips R2000)                     73.68%          68.30%
SGI (mips R3000)                     73.28%          65.99%
DecSTATION 5100 (mips R3000)         70.20%          65.22%
Sun4 SPARCStation 1 (4/60)           57.04%          55.28%
IBM RTPC                             36.79%          29.95%

Notes:
(1) Not really a 68020, but a 68020 look-alike. I did not use vector
    operations, because in that case I ought to go back to 32x32->32
    multiply.
(2) I did not do the coding of basic loops because of the lack of
    registers, and I really did not want to do all kinds of loads/stores
    to and from memory.

Additional notes: The CISC machines give in general a speed up to
55-70% of the original program in the first case and to 40-60% in the
second case. The Tahoe is extraordinary because of the small difference
between the first and second case; there is nearly no subroutine call
overhead. The RISC machines display a lack of speedup when going to
coded loops; that is what RISC is all about. The Mips machines display
a general lack of speedup; they have hardware multiply. SPARC and RTPC
both use multiply step instructions rather than hardware multiply. The
RTPC multiply step instruction delivers 2 bits each time it is
executed. Both SPARC and RTPC deliver a 64 bit result. They also both
use divide step instructions, and both are able to do 64/32 divide in
the same amount of time as 32/32 divide, although the supplied software
does not use this fact. I am still not satisfied with the Sun 4 results
and am thinking about methods to improve them. The RTPC displays a very
marked speedup. Is that a compiler problem?

An additional question is whether hardware multiply is required, or
whether multiply step instructions are sufficient. Below for one CISC
and three RISC machines the timing (in seconds) of the best version of
the program:

        Sun 3 (3/60, 68020)      24.36
        Sun 4 (4/60)             14.80
        IBM RTPC                 14.10
        SGI (R3000 @ 20 MHz)      5.88

We see that the mips processor has a decided advantage. On the other
hand, the RTPC shows that even with multiply step instructions a
reasonable performance can be obtained (the base machine is slower than
a Sun 4). The Sun 4 is in fact unacceptable for this kind of program.

And lastly the claim that 'bc' is extremely fast. To the contrary, the
above program performs about 15000 operations of the form a*b and also
about 15000 operations of the form c/%d, with a, b and d 53 digit numbers
and c a 106 digit number (plus of course many more operations). On the
Sun 4 'bc' needed 13 seconds to do only 400 of those operations. So who
is the winner?

I hope this helps.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
cik@l.cc.purdue.edu (Herman Rubin) (09/13/90)
In article <1990Sep12.223253.9574@csc.ti.com>, bmk@csc.ti.com (Brian M Kennedy) writes:
> =>It has been claimed that a lack of 32x32->64 multiplication
> =>makes a factor of 10 difference in the running time of
> =>typical extended precision arithmetic routines. Although it
> =>obviously makes _a_ difference in run time I do not measure
> =>an order of magnitude difference.
			............................
> Instead I will measure
> an upper-bound on the performance increase by comparing:
>
>   64*64->64 via 32*32->32    vs.    32*32->32

[Long description deleted.]

The original problem was 32x32 -> 64 compared to 32x32 -> 32. To do a
reasonable type of test, consider the general problem of NxN -> 2N vs.
NxN -> N.

Now to do this properly, one should remember that in the machine with
NxN -> N, N is the length available. Thus, in adding two N-bit numbers,
one must use a test-for-carry to detect a bit in position N (starting
the count from 0). Also, the comparison should not depend on the
peculiarities of a particular compiler, but should be done at the
machine-language level. This is not a lot of code.

To carry out the benchmark, one could use N = 16 (or even 8) to get a
general idea.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)
{purdue,pur-ee}!l.cc!cik(UUCP)
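[As an illustration of the test-for-carry Rubin mentions, here is a C
sketch added for concreteness; his point stands that the real comparison
belongs at the machine-language level, where the carry flag is directly
visible. When the digit already fills the machine word, the carry out of
an add has to be recovered with an unsigned comparison.]

    /* Sketch: N-bit add with carry in and carry out, using only N-bit
     * (NxN->N style) arithmetic.  "unsigned int" stands for the full
     * machine word. */
    typedef unsigned int word;

    static int addc(word a, word b, int cin, word *sum)
    {
        word t = a + b;
        int cry = (t < a);          /* wrapped => carry out of the top bit */
        t += (word)cin;
        cry += (t < (word)cin);     /* the carry-in can itself carry out */
        *sum = t;
        return cry;                 /* always 0 or 1 */
    }

In Horsell's benchmark the corresponding carry falls out of a shift
because the intermediate LONG is twice as wide as a SHORT digit; with
full-word digits the comparison above (or the hardware carry flag, in
assembler) takes its place.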
bs@linus.mitre.org (Robert D. Silverman) (09/13/90)
In article <2114@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
:Well, after the discussion between R. Kym Horsell and Robert D. Silverman,
:here are some real facts. First the background of it.
----
Does this imply my claims are not real?
My claims are based on actual benchmarks with large scale multiprecise
FFTs, with the Elliptic Curve factoring algorithm and with the Elliptic
Curve prime proving algorithm.
:Eons ago Arjen Lenstra (now at the University of Chicago, I think) wrote a
:program (in Fortran) to do primality proving for long numbers (up to
:about 320 digits). A writeup about it can be found in some back issue
:of Mathematics of Computation. For this program I did the base arithmetic
Math. Comp. vol. 48 (Jan 1987)
H. Cohen & A. Lenstra, "Implementation of a New Primality Test"
A large chunk of this algorithm is spent doing Jacobi sums that don't
consist of just multi-precise multiplications. This, I believe, is
relevant to some of the benchmarks given below.
:program and ported it to a large platform of minis. The basic structure
:of the arithmetic kernels is very simple, it allows for kernels written
:in C or Fortran, using either the HOL arithmetic or some parts in assembler
:allowing for 32x32->64 bit multiply and associated division. Also provisions
:are made to allow for simple loops over these operations to be coded in
:assembler. In this way the subroutine call overhead for a large number of
----------------------------------------------------------------
:calls can be avoided. Now whenever I get access to some machine I try
--------------------
If one chooses not to put the control loops in assembler, but instead
relies upon calling just low level routines for 64 bit arithmetic,
the speed difference becomes more dramatic. In going from 32 to 64
bit arithmetic, one makes 4 times fewer subroutine calls to low
level arithmetic routines.
:
:Now, when Bob Silverman claimed that he experienced a 10 fold speedup
:when going from 32x32->32 multiply to 32x32->64 multiply, I can only
:say that I do not find such a speedup. See the table below. On the
:other hand, Kym Horsell is also off base with his benchmark. Also his
:claims about input and output are besides the truth. In my experience,
:with this kind of work, a program will run happily during several seconds/
:minutes/hours, doing no I/O at all and then finally write only a single
:or a few numbers. So in effect every effort spent to speed up I/O is
:essentially wasted. (The program I use does indeed twice write the number
:Even in a multiprecise division, the number of elementary divisions is order
:n while the number of elementary multiplications is order n squared.
This is true when dividing a multi-precise number by a single precision
number. It is not true when dividing two multi-precise numbers.
Dividing 2n bits by n bits is O(n^2) not O(n).
:
:Now back to the data. First column gives the machine, second column the
:amount of time needed when using 32x32->64 multiply as compared to 32x32->32
:multiply (in percents), the third column gives the same when also some
:basic loops are coded.
:
: percentage of time needed when using
: 32x32->64 coded loops
:DEC VAX-780 62.84% 46.50%
Stuff deleted.
I don't dispute the data, but I'm a little unclear as to what is
being measured. 'percentage of time needed' is unclear to me.
Does this refer to the percentage of total execution time doing
multi-precise divides and multiples?
I believe what it means is that, for example on the VAX, the program
spent (respectively) 62% and 46% as much time to run when using 64
bit arithmetic, as it did when using 32 bit arithmetic.
However, he does not say what fraction of the program is spent in
doing this arithmetic. If a program (say) spends 1/2 its time doing
multi-precise multiplies, a 10-fold speed increase will reduce this
to 5%, but one still has the other 50%, resulting in a total run
time of 55% of the original.
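[In general, this is the standard Amdahl-style identity, written out
here for clarity; f is the fraction of run time spent in the sped-up
arithmetic and s the speedup of that part:]

    T_{\mathrm{new}} = \left( (1-f) + \frac{f}{s} \right) T_{\mathrm{old}},
    \qquad f = \tfrac{1}{2},\ s = 10 \;\Rightarrow\; T_{\mathrm{new}} = 0.55\,T_{\mathrm{old}}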
:And lastly the claim that 'bc' is extremely fast. To the contrary, the
:above program performs about 15000 operations of the form a*b and also
:about 15000 operations of the form c/%d, with a, b and d 53 digit numbers
:and c a 106 digit number (plus of course many more operations). On the
:Sun 4 'bc' needed 13 seconds to do only 400 of those operations. So who
:is the winner?
:
--
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/13/90)
In article <119983@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:
\\\
>This is true when dividing a multi-precise number by a single precision
>number. It is not true when dividing two multi-precise numbers.
>Dividing 2n bits by n bits is O(n^2) not O(n).
\\\

The complexity of _division_ is still (I understand) an open
question. According to some (old) work by Cook & Aanderaa it
is _unlikely_ to be less than O(n log n/(log log n)^2) on
``conventional'' machines.

The best division _algorithm_ that I know of runs in
O(n log n log log n).

-Kym Horsell
bs@linus.mitre.org (Robert D. Silverman) (09/13/90)
In article <3999@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
:In article <119983@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:
:\\\
:>This is true when dividing a multi-precise number by a single precision
:>number. It is not true when dividing two multi-precise numbers.
:>Dividing 2n bits by n bits is O(n^2) not O(n).
:\\\
:
:The complexity of _division_ is still (I understand) an open
:question. According to some (old) work by Cook & Aanderaa it
:is _unlikely_ to be less than O(n log n/(log log n)^2) on
:``conventional'' machines.
:
:The best division _algorithm_ that I know of runs in
:O(n log n log log n).
:
:-Kym Horsell

I'm talking about PRACTICAL algorithms here, not theoretical ones.
There doesn't seem to be a recursive bisection algorithm similar to
Karatsuba's that works for division. Methods based on FFT's are
ineffective until one reaches thousands of bits. For numbers in the
(say) 200 decimal digit range, ordinary long division is best, and
that algorithm is O(n^2).
--
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/14/90)
In article <120002@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:
[distinction between the theoretical complexity of division and the
complexity of particular algorithms]
>I'm talking about PRACTICAL algorithms here, not theoretical ones.
>There doesn't seem to be a recursive bisection algorithm similar to
>Karatsuba's that works for division. Methods based on FFT's are
>ineffective until one reaches thousands of bits. For numbers in the
>(say) 200 decimal digit range, ordinary long division is best, and
>that algorithm is O(n^2).

So long as you make the distinction between problem complexity,
algorithm complexity & the complexity of the algorithms that you
actually use. Such niceties are sometimes lost on some of us who read
this group.

I will check out your claim re _practicality_ of FFT algorithms and
will report the results to the net; they may be of use to those
concerned about implementation issues.

-Kym Horsell
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/14/90)
In article <120002@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:
\\\
>There doesn't seem to be a recursive bisection algorithm similar to
>Karatsuba's that works for division. Methods based on FFT's are
\\\

I am not sure about the Newton part of the obvious ``invert and
multiply'' using Karatsuba -- does this give you a practical technique
for high-precision divide in O(n^lg(3))?

Another straightforward scheme, I haven't checked the literature but I
don't remember it, is to use

	1/(2^n a + b)  approx=  2^-n (1/a) (1 - 2^-n b/a)

It _has_ been a long day, so could someone point out to me (I _know_
there are mathematicians out there) whether the rhs requires 2*n-bit
calcs?

-Kym Horsell
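[For reference, expanding the reciprocal as a geometric series -- a
standard manipulation, not from the original post -- gives]

    \frac{1}{2^n a + b}
      = \frac{2^{-n}}{a} \cdot \frac{1}{1 + 2^{-n} b/a}
      = \frac{2^{-n}}{a} \left( 1 - 2^{-n}\frac{b}{a}
            + 2^{-2n}\frac{b^2}{a^2} - \cdots \right)

so the first correction term is 2^-n b/a, and the error left after
truncating there is of order 2^-2n (b/a)^2.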
dik@cwi.nl (Dik T. Winter) (09/14/90)
In article <119983@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:
> In article <2114@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
> :Well, after the discussion between R. Kym Horsell and Robert D. Silverman,
> :here are some real facts. First the background of it.
> ----
>
> Does this imply my claims are not real?
No, not that. But your claims told only about an order of magnitude.
Perhaps your claims are based on programs where one was 10 times as
fast as the other on some processor. I have a lot of data for a lot of
different processors.
>
> My claims are based on actual benchmarks with large scale multiprecise
> FFTs, with the Elliptic Curve factoring algorithm and with the Elliptic
> Curve prime proving algorithm.
You did not state what processor gives which advantage. It highly
depends upon the processor used.
> ...
> A large chunk of this algorithm is spent doing Jacobi sums that don't
> consist of just multi-precise multiplications. This, I believe, is
> relevant to some of the benchmarks given below.
Right, but although it is a large chunk of the algorithm, it is not a
large chunk of the time required as far as I know. But I will time that.
> ...
> : In this way the subroutine call overhead for a large number of
> ----------------------------------------------------------------
> :calls can be avoided. Now whenever I get access to some machine I try
> --------------------
>
> If one chooses not to put the control loops in assembler, but instead
> relies upon calling just low level routines for 64 bit arithmetic,
> the speed difference becomes more dramatic. In going from 32 to 64
> bit arithmetic, one makes 4 times fewer subroutine calls to low
> level arithmetic routines.
Especially if (as you make clear now) you also use subroutine calls to
low level routines that could be inlined and coded in the HOL. Did you
try inlining on (for example) the Sparc?
>
> : (The program I use does indeed twice write the number
> :Even in a multiprecise division, the number of elementary divisions is order
> :n while the number of elementary multiplications is order n squared.
>
> This is true when dividing a multi-precise number by a single precision
> number. It is not true when dividing two multi-precise numbers.
> Dividing 2n bits by n bits is O(n^2) not O(n).
I wrote that the number of elementary divisions is order n, the number
of elementary multiplications and additions is n^2.
>
> :
> :Now back to the data. First column gives the machine, second column the
> :amount of time needed when using 32x32->64 multiply as compared to 32x32->32
> :multiply (in percents), the third column gives the same when also some
> :basic loops are coded.
> :
> :       percentage of time needed when using
> :          32x32->64       coded loops
> :DEC VAX-780  62.84%         46.50%
>
> Stuff deleted.
>
> I don't dispute the data, but I'm a little unclear as to what is
> being measured. 'percentage of time needed' is unclear to me.
> Does this refer to the percentage of total execution time doing
> multi-precise divides and multiples?
Excuse me for my poor English. The columns give the time required for
the whole prime proving run when using an alternate form of the
arithmetic, compared to the time required when doing everything in a
HOL.
>
> I believe what it means is that, for example on the VAX, the program
> spent (respectively) 62% and 46% as much time to run when using 64
> bit arithmetic, as it did when using 32 bit arithmetic.
Right.
>
> However, he does not say what fraction of the program is spent in
> doing this arithmetic. If a program (say) spends 1/2 its time doing
> multi-precise multiplies, a 10-fold speed increase will reduce this
> to 5%, but one still has the other 50%, resulting in a total run
> time of 55% of the original.
Also true. But I am *not* interested in the time it takes to do some
parts of a program; I am only interested in the time required for the
program to complete.

The same discussion is often heard in supercomputer circles about
vectorization. Look at Amdahl's law. What is better, a processor that
peaks at 800 MFlop but gets a performance of 10 MFlop on realistic
programs, or a processor that peaks at 400 MFlop and gets 100 MFlop on
realistic programs?

(To be honest, on more than one occasion I did this kind of micro
optimization, see my follow-up to Peter Montgomery's article. But is
it worthwhile? For me yes, I learn a bit more about the processor, but
that is all.)
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
pavlov@canisius.UUCP (Greg Pavlov) (09/14/90)
In article <119858@linus.mitre.org>, bs@linus.mitre.org (Robert D. Silverman) writes:
> ......... Suppose A > B and A > C. How would one compute
> BC/A EXACTLY without 64 bit support? .....
>
  ... and if A = PI, and B = sq.rt. of 2, how would one calculate BC/A
  EXACTLY ?

> "You can lead a horse's ass to knowledge, but you can't make him think"

  ... obviously, we are all (all the rest of us, anyway) horses' asses....

  greg pavlov, fstrf, amherst, ny
bs@linus.mitre.org (Robert D. Silverman) (09/14/90)
In article <2857@canisius.UUCP> pavlov@canisius.UUCP (Greg Pavlov) writes:
:In article <119858@linus.mitre.org>, bs@linus.mitre.org (Robert D. Silverman) writes:
:> ......... Suppose A > B and A > C. How would one compute
:> BC/A EXACTLY without 64 bit support? .....
:>
: ... and if A = PI, and B = sq.rt. of 2, how would one calculate BC/A
: EXACTLY ?
:
:> "You can lead a horse's ass to knowledge, but you can't make him think"
:
: ... obviously, we are all (all the rest of us, anyway) horses' asses....
:
: greg pavlov, fstrf, amherst, ny

If you had been following this thread you would know that we have been
discussing INTEGER arithmetic throughout. Therefore BC/A is an integer
that fits within a single machine word.
--
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"
pete@abccam.abcl.co.uk (Peter Cockerell) (09/14/90)
In article <3989@bingvaxu.cc.binghamton.edu>, vu0310@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>
> After a number of private communications, I've managed to render
> one of the little benchmarks I have presentable enough to post,
> along with some performance figures from different machines
> (basically, it's put up or shut up time).
[load of stuff deleted]
> P.S. The benchmark was _deliberately_ kept rather simple; I
> wanted to measure the performance of _multiply_ in
> the context of basic extended precision arithmetic, not
> the memory or i/o subsystems.

The results of the 'benchmark' when run on my ARM (Acorn Risc Machine)
system running BSD 4.3 are:

	DOUBLE defined		3.6
	DOUBLE not defined	8.9
	Ratio			2.5

All this seems to be telling me is that the conversions required to use
the ARM's 32*32->32 instruction to perform 8*8->16 arithmetic are more
onerous (and so slower) than those required to do 16*16->32.
Short<->int on the ARM requires masking and/or shifting; there are no
explicit conversion instructions, so a C int=short*short compiles to

	MOV	R0, R0, LSL #16		;Sign extend LHS
	MOV	R0, R0, ASR #16
	MOV	R1, R1, LSL #16		;Sign extend RHS
	MOV	R1, R1, ASR #16
	MUL	R2, R0, R1		;Do the multiply

Similarly, a short=char*char is

	MOV	R0, R0, LSL #24		;Sign extend LHS
	MOV	R0, R0, ASR #24
	MOV	R1, R1, LSL #24		;Sign extend RHS
	MOV	R1, R1, ASR #24
	MUL	R2, R0, R1		;Do the mul
	MOV	R2, R2, LSL #16		;Convert to short
	MOV	R2, R2, ASR #16

In comparison, int=int*int compiles to

	MUL	R2, R0, R1

The benchmark time for the case when LONG and SHORT are both defined
to be int (ie the natural length for the processor) is 0.4s!

Or am I missing something...?
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/17/90)
In article <513@abccam.abcl.co.uk> pete@abccam.abcl.co.uk (Peter Cockerell) writes:
\\\
>The benchmark time for the case when LONG and SHORT are both defined
>to be int (ie the natural length for the processor) is 0.4s!
>
>Or am I missing something...?

Maybe. What is happening to the high-order part of the 32-bit product?
It's lost; your benchmark isn't performing the same function as mine.
[And any difference is due to memory accessing effects -- what a
difference tho'!]

For those that don't want (yet another?) clarification of what I'm
trying to get at, plz hit `n' here. [I can't help it & doctors can't
help -- working in a college environment causes ``lecture latchup''.
But it _does_ help to clarify my own `ideas'.]

To reiterate, I wish to measure the difference between the performance
of XP software with & without the convenience of having multiply
produce a ``double size'' product.

Some folks argue that having a multiply that gives a double product is
crucial to the efficient running of their XP software. The question
then is, _how much_ is it worth (in terms of area, running time, etc)?
To this end I've released this lil' program for any interested party
to measure on their available h/w (and _I'm_ interested in the results
4 sure).

The program attempts to perform one of the things that tend to take
time in XP calcs -- big multiplies -- and I have adopted the naive
``pencil & paper'' method because (a) it is still used a lot (see a
lot of LISP ``bignum'' packages for one thing), and (b) it has a _high
dynamic density_ of machine-level multiply operations vis-a-vis adds
and shifts.

Now, since double-sized products are not universal, I have to
``guestimate'' their loss on some architectures. The way I have chosen
to do this is to perform some calculations using 32x32->32 and
16x16->16 arithmetic (where available). On machines where native
16x16->16 _isn't_ available we have a bit of a problem (not to mention
machines that don't have h/w multiply in _any_ form); but it's still
useful to have some numbers for these machines anyway.

O'Keefe has concentrated on computing factorials -- and this _may_ be
a good idea; the density of multiplies may be higher than in the
program I released. However, the first set of figures I posted was
based on the same idea (although I didn't _actually_ use any
machine-level support for it) and the differences between the 16 and
32-bit versions weren't as large as I thought they _might_ be in other
contexts -- hence the _second_ program.

O'Keefe's figure of 4-5 times speedup when _using_ vs _not using_ an
_actual_ double-sized product is important to note. Maybe I'll go back
and _actually_ insert this into my program. However, it's only
_possible_ for machines with the actual h/w support.

Summary -- O'Keefe has raised doubts (my own included) over the actual
speedup between 32x32->32 and 32x32->64, so I'm going back to the
bench (but not for a rest). Unrolling the loops is an experiment that
_has_ been suggested by several people.

Tnx to all who are participating.

-Kym Horsell
ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (09/20/90)
In article <4037@bingvaxu.cc.binghamton.edu>, kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> The program attempts to perform one of the things that tend to take
> time in XP calcs -- big multiplies -- and I have adopted the naive

Horsell's benchmark had two parts: part 1 calculated a bunch of
factorials, then part 2 added and multiplied them. I noticed that the
code to add two bignums was wrong ("cry" is one bit too short...), so
concentrated on part 1. part 1 does a lot of bigit * bignum multiplies,
except that it turns out to use very small bigits.

That's why I looked at factorials, because that was part of Horsell's
benchmark, and it was possible to use a technique for factorials which
_is_ similar to bignum multiplication.

> O'Keefe's figure of 4-5 times speedup when _using_ vs _not using_
> an _actual_ double-sized product is important to note.

It turned out that we have GCC here. That makes a difference.
New figures:

	Factorial test (NS32532, Encore Multimax 4.3BSD)
	version  mult.size      lang    User t  System  Ratio
	Horsell  "8x8 -> 16"    cc      311.2   1.9     1.68
	Horsell  "16x16 -> 32"  cc      228.8   2.6     1.23
	My code  "32x32 -> 64"  as       52.6   0.5     1/3.52
	Horsell  "8x8->16"      gcc     264.1   1.3     1.43
	Horsell  "16x16->32"    gcc     185.1   0.8     1.00
	My code  "32x32->64"    as       52.4   0.3     1/3.53  (.aligned)

So compared with Horsell's "DOUBLE" version compiled by a rather better
compiler, assembly code using 32x32->64 unsigned multiplication gets a
3.5ish speedup. (The .aligned note refers to the trick of aligning
branch targets on word boundaries. On machines with variable size
instructions that can often help. Here it didn't, not to speak of.)

A figure of about 3.5 sounds right: going from 16-bit to 32-bit chunks
we expect to do 1/4 the number of multiplications, and those
multiplications may be a wee bit slower.

It's worth noting that Horsell's representation for bignums is hairy.
I just used pointer-to-(length, bigit 0, bigit 1, ...) which amongst
other things let me use add-compare-and-branch.
--
Heuer's Law: Any feature is a bug unless it can be turned off.
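[For readers who want to experiment, here is a rough C sketch of the
flat representation O'Keefe describes and of the bigit-by-bignum inner
loop it supports. This is illustrative only, not his NS32532 assembler:
the layout, the names, and the use of a 64-bit `unsigned long long' as
a stand-in for the hardware's widening 32x32->64 multiply are all
assumptions made here.]

    /* Sketch: a bignum as (length, bigit 0, bigit 1, ...), least
     * significant bigit first.  "unsigned int" is assumed to be a
     * 32-bit bigit. */
    typedef unsigned int bigit;

    typedef struct {
        int len;                     /* number of bigits in use */
        bigit d[64];                 /* d[0] .. d[len-1]        */
    } bignum;

    /* x += y * k for a single bigit k; x->d must have room */
    static void muladd1(bignum *x, const bignum *y, bigit k)
    {
        unsigned long long cry = 0;  /* stand-in for 32x32->64  */
        int i;

        for (i = 0; i < y->len; i++) {
            cry += (unsigned long long)k * y->d[i]
                   + (i < x->len ? x->d[i] : 0);
            x->d[i] = (bigit)cry;    /* low 32 bits             */
            cry >>= 32;              /* carry into next bigit   */
        }
        while (cry) {                /* propagate the final carry */
            cry += (i < x->len ? x->d[i] : 0);
            x->d[i++] = (bigit)cry;
            cry >>= 32;
        }
        if (i > x->len)
            x->len = i;
    }

A full bignum*bignum product is then one muladd1() per bigit of the
second operand, which is essentially what Horsell's mul3()/muladd()
pair does in radix 2^16; the point of the 32x32->64 operation is that
the same loop runs with half as many bigits per operand, and so about a
quarter as many elementary multiplies.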