[comp.arch] benchmark for evaluating extended precision

vu0310@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/12/90)

After a number of private communications, I've managed to render
one of the little benchmarks I have presentable enough to post,
along with some performance figures from different machines
(basically, it's put up or shut up time).

It has been claimed that a lack of 32x32->64 multiplication
makes a factor of 10 difference in the running time of
typical extended precision arithmetic routines. Although it
obviously makes _a_ difference in run time I do not measure
an order of magnitude difference.

Following is a table of the total runtime for a particular
benchmark that does a few simple-minded products. The
figures in parens are those with optimization on.

---------------------------------------------------------------------------
Machine		32x32->32	8x8->16		ratio	bc

VAX 6440	
		1.3		3.9		3.0
		(0.88)		(2.5)		2.8

VAX 8530	
cc		3.2		9.5		3.0	3.1!!!
		(2.9)		(9.3)		3.2
gcc		3.1		8.4		2.7
		(1.9)		(6.1)		3.2

SUN 3/60
		4.8		12.2		2.54	7.8
		(2.5)		(6.2)

Sun Sparc 1
		2.4		6.6		2.75	2.5!!!
		(1.1)		(3.3)		3.0
---------------------------------------------------------------------------

The benchmark itself follows:

#include	<stdio.h>
#include	<malloc.h>

#define	DEBUG	/***/
#define	DOUBLE

#ifdef	DOUBLE
#define	NUM	15
#define	MAX	100
#define	SHORT	unsigned short
#define	LONG	unsigned long
#else
#define	NUM	15
#define	MAX	200
#define	SHORT	unsigned char
#define	LONG	unsigned short
#endif	/*DOUBLE*/

#ifdef	__STDC__
#define	_(x)	x
#else
#define	_(x)	()
#endif

#ifdef	DEBUG
#define	Assert(x)	{if(!(x)){fprintf(stderr,"Assert failed: line %d\n",\
				__LINE__); exit(-1);}}
#else
#define	Assert(x)
#endif

typedef struct {
	SHORT *top,*bot,*max;
	} B;

B *a[NUM][NUM],*b[NUM][NUM];

B *zap _((B*));
B *newb _((void));
void kill _((B*));
void part1 _((void));
void part2 _((void));
B *add2 _((B*,B*));
B *fact _((int));
B *mul3 _((B*,B*,B*));
B *muladd _((B*,B*,int));

main() {
	printf("sizeof SHORT=%d\n",8*sizeof(SHORT));
	printf("sizeof LONG=%d\n",8*sizeof(LONG));
	part1();
	part2();
	exit(0);
	}

void part1() {
	int i,j;

	for(i=0; i<NUM; ++i)
	for(j=0; j<NUM; ++j) {
		b[i][j] = fact(i+j);
		a[j][i] = fact(i*j);
		}
	}

void part2() {
	int i,j,k;
	B *t1 = newb(), *t2 = newb();

	for(i = 0; i<NUM; ++i)
	for(j = 0; j<NUM; ++j) {
		zap(t1);
		for(k = 0; k<NUM; ++k) {
			mul3(t2,a[i][k],b[k][j]);
			add2(t1,t2);
			}
		}
	kill(t1);
	kill(t2);
	}

B *fact(x) {
	B *res = newb();
	SHORT *ptr;
	LONG cry;

	res->bot[0] = 1;
	while(x-->1) {
		cry = 0;
		for(ptr=res->bot; ptr<=res->top; ) {
			cry += (LONG)x**ptr;
			*ptr++ = cry;
			cry >>= 8*sizeof(SHORT);
			}
		if(cry) *(res->top = ptr) = cry;
		}
	Assert(res->top<res->max);
	Assert(res->bot<=res->top);
	return res;
	}

/* x = y*z */
B *mul3(x,y,z) B *x,*y,*z; {
	SHORT *ptr;

	Assert(y->top<y->max);
	Assert(y->bot<=y->top);
	Assert(z->top<z->max);
	Assert(z->bot<=z->top);
	zap(x);
	for(ptr = z->bot; ptr<=z->top; )
		muladd(x,y,*ptr++);
	return x;
	}

/* x += y*k */
B *muladd(x,y,k) B *x,*y; {
	SHORT *ptr,*ptr2;
	LONG cry = 0;

	Assert(x->top<x->max);
	Assert(x->bot<=x->top);
	Assert(y->top<y->max);
	Assert(y->bot<=y->top);
	ptr = x->bot;
	for(ptr2 = y->bot; ptr2<=y->top; ) {
		Assert(ptr<x->max);
		cry += (LONG)k**ptr2++;
		if(ptr<=x->top) cry += *ptr;
		*ptr++ = cry;
		cry >>= 8*sizeof(SHORT);
		}
	while(cry) {
		cry += *ptr;
		*(x->top = ptr++) = cry;
		cry >>= 8*sizeof(SHORT);
		}
	return x;
	}

/* x += y */
B *add2(x,y) B *x,*y; {
	SHORT cry;
	SHORT *ptr,*ptr2;

	Assert(y->top<y->max);
	Assert(y->bot<=y->top);
	for(ptr = x->bot,ptr2 = y->bot; ptr2<=y->top; ) {
		cry += (LONG)*ptr+*ptr2++;
		*ptr++ = cry;
		cry >>= 8*sizeof(SHORT);
		}
	if(cry) *(ptr = x->top) = cry;
	return x;
	}

B *newb() {
	B *tmp;

	tmp = (B*)malloc(sizeof(B));
	Assert(tmp);
	tmp->top = tmp->bot = (SHORT*)malloc(MAX*sizeof(SHORT));
	Assert(tmp->top);
	tmp->max = tmp->bot + MAX;
	return tmp;
	}

void kill(x) B*x; {

	free(x->bot);
	free(x);
	}

B *zap(x) B*x; {

	x->top = x->bot;
	x->bot[0] = 0;
	return x;
	}

---------------------------------------------------------------------------

A word of explanation -- on the machines tested `short' was 16,
`char' was 8 and `long' (_sometimes_ the same size as `int') was 32
bits. The idea is to compare the running times of the two versions
of the program created by alternately defining and undefining `DOUBLE'.

With DOUBLE _defined_ the program uses 32x32->32 arithmetic
to essentially perform 16x16->32 bit arithmetic. In std C this
is performed by

	unsigned short x,y;
	unsigned long z;
	z = (unsigned long)x * y;

which extends both operands before the multiply (the order of loading
operands into registers is, of course, not defined, to allow the
optimizer some elbow-room).

With DOUBLE _undefined_ (comment out that line) the program performs
the same computation using 16x16->16 arithmetic in order to perform
essentially 8x8->16 bit arithmetic, with an arrangement similar to the
above. (Of course the program uses unsigned arithmetic in these contexts.)
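For concreteness (this snippet is mine, a paraphrase of the benchmark's
inner statement rather than a quote from it), the corresponding pattern is

	unsigned char x,y;
	unsigned short z;
	z = (unsigned short)x * y;

which mirrors `cry += (LONG)x**ptr;' with LONG defined as unsigned short.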

I think the results show that nxn->2n bit arithmetic can
be replaced, at least for _this_ benchmark, by 2n x 2n -> 2n
with a loss of about a factor of 3.

Another point to note (although Robert Silverman is not impressed)
is the performance of `bc'. Considering that it contains no assembler
(I am informed so by those-who-know at Sun) and, I understand, performs
its arithmetic on _bytes_ that each encode a `higit' in [0,99], it
compares remarkably well with hand-coded (though not very good) C.

I would appreciate someone pointing out the error in my reasoning.

-Kym Horsell.

P.S. The benchmark was _deliberately_ kept rather simple; I
	wanted to measure the performance of _multiply_ in
	the context of basic extended precision arithmetic, not
	the memory or i/o subsystems.

bs@linus.mitre.org (Robert D. Silverman) (09/12/90)

In article <3989@bingvaxu.cc.binghamton.edu> vu0310@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>
>After a number of private communications, I've managed to render
>one of the little benchmarks I have presentable enough to post,
>along with some performance figures from different machines
>(basically, it's put up or shut up time).
>
>It has been claimed that a lack of 32x32->64 multiplication
>makes a factor of 10 difference in the running time of
>typical extended precision arithmetic routines. Although it
>obviously makes _a_ difference in run time I do not measure
>an order of magnitude difference.

some silly benchmark deleted....


You are totally confused. Nowhere in your benchmark did I see any assembler
code to support 32 x 32 -- > 64. You are STILL working in radix 2^15 
[or maybe radix 2^16; I didn't check closely], whereas allowing 32 x 32 --> 64
would allow you to work in radix 2^30 or 2^31 [depending on how careful
you want to be about sign bit and 1's complement/2's complement portability
problems]. Change of radix ALONE would account for a factor of 4 FEWER
multiplies. Secondly, you are working with tiny examples. Working in a
larger radix in a REAL WORLD PROGRAM means less storage required for each
multi-precise variable, less loop overhead, fewer subroutine calls, fewer
cache misses, etc.
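
(To put a rough number on the radix point -- the figures here are just an
illustration, not taken from any of the programs discussed: a 512-bit operand
is 32 digits in radix 2^16 but only 16 digits in radix 2^32, so one schoolbook
product costs 32^2 = 1024 digit multiplies in the small radix against
16^2 = 256 in the large one -- the factor of 4.)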

Try implementing (say) a prime proving algorithm, or an RSA encryption
algorithm, or a multi-precise FFT over the rationals, or anything but
a tinkertoy program with just one or two multiprecise numbers where
everything fits in cache.

Furthermore, working in a power of two radix in performing 32 x 32--> 64
allows one to carefully implement some optimizations (in assembler) that
are not readily available in C.

By the way, 32 x 32 --> 64 can be needed for OTHER than just multiple
precision support. Suppose A > B and A > C. How would one compute
BC/A EXACTLY without 64 bit support? If all are 32 bit quantities,
BC/A also fits into 32 bits, yet computing BC can overflow. Similarly,
one cannot compute BC mod A without such support.
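
(For illustration only -- the following is a sketch, not code from any of the
programs in this thread, and it assumes a compiler extension: a 64-bit
``unsigned long long'' type, with unsigned long being 32 bits as throughout
this discussion.  With a 32x32->64 product available, both computations above
become one-liners, and both results are guaranteed to fit in 32 bits when
A > B and A > C.)

/* sketch: exact B*C/A and B*C mod A via a 64-bit intermediate */
unsigned long muldiv(unsigned long b, unsigned long c, unsigned long a)
{
	unsigned long long p = (unsigned long long)b * c;  /* 32x32->64 */
	return (unsigned long)(p / a);  /* B*C/A < C < 2^32, so it fits */
}

unsigned long mulmod(unsigned long b, unsigned long c, unsigned long a)
{
	unsigned long long p = (unsigned long long)b * c;
	return (unsigned long)(p % a);  /* B*C mod A < A < 2^32, so it fits */
}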

-- 
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/13/90)

In article <119858@linus.mitre.org> bs@linus.UUCP (Robert D. Silverman) writes:
\\\
>You are totally confused. Nowhere in your benchmark did I see any assembler
>code to support 32 x 32 -- > 64. You are STILL working in radix 2^15 
>[or maybe radix 2^16; I didn't check closely], whereas allowing 32 x 32 --> 64
>would allow you to work in radix 2^30 or 2^31 [depending on how careful
>you want to be about sign bit and 1's complement/2's complement portability
>problems]. Change of radix ALONE would account for a factor of 4 FEWER
>multiplies. Secondly, you are working with tiny examples. Working in a
>larger radix in a REAL WORLD PROGRAM means less storage required for each
>multi-precise variable, less loop overhead, fewer subroutine calls, fewer
>cache misses, etc.
\\\

Funny, I don't _feel_ confused. Some of the points you make _are_
valid in some sense, but we're diverging from the topic.

My whole point is that 32x32->64 is not _required_ in order to support 
extended precision arithmetic. There are various definitions of
_required_ here: ``logically required'', ``convenience'', etc.
Obviously the operation isn't _logically_ required, and as I've
pointed out, its introduction poses some logical _problems_.

With this benchmark, I'm also trying to show that _convenience_
is not really an issue either (esp. in the light of the ``bottom line'',
which is what (almost?) everything boils down to, whether we
like it or not).

The fact that I'm doing multiply naively is in _favour_ of
your argument that 
	``[without 32x32->64 high prec arithmetic will] run 10 times slower''.

If I had written this code using _better algorithms_ the difference
between using 16x16->16 and 32x32->32 would be _even less_,
thereby weakening your claim further.

Consider, we all know of fairly simple divide-and-conquer
high-prec multiply with complexity of O(n^lg(3)). (Actually we
_can_ do them in almost linear time, but the cost isn't worth
it yet).

Instead of using 32x32->64 arithmetic we _could_ use
32x32->32 arithmetic and only use the low-order bits
for multiply.

If we do this we must perform 4 times more ``small''
multiplications to perform the ``big'' one; but each of these
``small'' multiplies is performed on, effectively, 16-bit
operands and should therefore be several times faster.
(Checks for effectively multiplying by zero are _almost_ universal).
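
(A concrete sketch of the substitution being described -- my illustration,
not part of the benchmark: one full 64-bit product assembled from four
``small'' multiplies, each done with an ordinary 32x32->32 multiply on
16-bit halves.  unsigned long is assumed to be at least 32 bits.)

/* sketch: 32x32->64 using only the low 32 bits of each machine multiply */
void mul32x32to64(unsigned long x, unsigned long y,
		unsigned long *hi, unsigned long *lo)
{
	unsigned long xl = x & 0xFFFF, xh = (x >> 16) & 0xFFFF;
	unsigned long yl = y & 0xFFFF, yh = (y >> 16) & 0xFFFF;
	unsigned long ll = xl*yl, lh = xl*yh, hl = xh*yl, hh = xh*yh;
	unsigned long mid = (ll >> 16) + (lh & 0xFFFF) + (hl & 0xFFFF);

	*lo = (ll & 0xFFFF) | ((mid & 0xFFFF) << 16);		/* low 32 bits */
	*hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);	/* high 32 bits */
}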

In the limit, see above, this would work out at

	each multiply 2^lg(3) = 3 times faster / 4 times more multiplies
=	3/4 of ``nxn->2n'' speed

That is, *asymptotically* you only lose 25% in performance by
using ``proper'' 32-bit arithmetic (to *simulate* 16x16->32)
over having the ``convenience'' operation of 32x32->64 multiply.

A 25% reduction in speed is not significant enough, in my
accounting, to justify any _extra_ costs required to
maintain the 32x32->64 crock in a HLL (_despite_ the fact that, 
historically, ``that's the way the hardware does it anyway'').

However, we can say 32 bits isn't close enough to the
asymptote to justify any claims -- hence a little benchmark.

Considering the amount of extended precision arithmetic
that is performed anyway vis-a-vis fixed-point or
the normal floating-point hackery, the _theoretical_
(if you like) slowdown of 25% shown above, and in the light
of _actual measurements_ of factors of 2-3 using naive
algorithms, I remain unconvinced that 32x32->64 is required;
the folks who have dropped it off their list of things to
implement are justified, I think.

-Kym Horsell

bs@linus.mitre.org (Robert D. Silverman) (09/13/90)

In article <3994@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
:In article <119858@linus.mitre.org> bs@linus.UUCP (Robert D. Silverman) writes:
:\\\
:>You are totally confused. Nowhere in your benchmark did I see any assembler
:>code to support 32 x 32 -- > 64. You are STILL working in radix 2^15 
:>[or maybe radix 2^16; I didn't check closely], whereas allowing 32 x 32 --> 64
:>would allow you to work in radix 2^30 or 2^31 [depending on how careful
:>you want to be about sign bit and 1's complement/2's complement portability
:>problems]. Change of radix ALONE would account for a factor of 4 FEWER
:>multiplies. Secondly, you are working with tiny examples. Working in a
:>larger radix in a REAL WORLD PROGRAM means less storage required for each
:>multi-precise variable, less loop overhead, fewer subroutine calls, fewer
:>cache misses, etc.
:\\\
:
:Funny, I don't _feel_ confused. Some of the points you make _are_
:valid in some sense, but we're diverging from the topic.
:
:My whole point is that 32x32->64 is not _required_ in order to support 
 
It is if you want any speed out of your code.

:extended precision arithmetic. There are various definitions of
:_required_ here: ``logically required'', ``convenience'', etc.
 
Executing in 1 day instead of 10 isn't a convenience. It's a necessity.
 
:Obviously the operation isn't _logically_ required, and as I've
 
One can implement an ALU that has only 8 or so instructions. Logically,
nothing else is required. So why bother at all with the rest that
get implemented in practice? Logically, they are not required at all.
They get put in for reasons of SPEED. Those doing large scale computations
want computers to be FAST. This is hardly a matter of convenience.


:The fact that I'm doing multiply naively is in _favour_ of
:your argument that 
:	``[without 32x32->64 high prec arithmetic will] run 10 times slower''.
:
:If I had written this code using _better algorithms_ the difference
:between using 16x16->16 and 32x32->32 would be _even less_,
 
But one can apply these same 'better algorithms' with 32 x 32 --> 64 just
as readily.

:thereby weakening your claim further.
:
:Consider, we all know of fairly simple divide-and-conquer
:high-prec multiply with complexity of O(n^lg(3)). (Actually we
:_can_ do them in almost linear time, but the cost isn't worth
 
Have you ever implemented Karatsuba's algorithm? Tried it in practice?
What are the constants in your big O estimates? In practice, one does
not recursively bisect all the way down to the word level. One stops
the bisection at about 8 machine words.  It is NOT n^lg 3 until one reaches
many hundreds (or thousands, depending on implementation). Furthermore, the
method is only good when multiplying numbers of the same size.

 
Karatsuba does NOTHING to change the basic argument. One can apply it in
radix 2^30 as easily as in 2^15.



:Instead of using 32x32->64 arithmetic we _could_ use
:32x32->32 arithmetic and only use the low-order bits
:for multiply.
:
:If we do this we must perform 4 times more ``small''
:multiplications to perform the ``big'' one; but each of these
:``small'' multiplies is performed on, effectively, 16-bit
:operands and should therefore be several times faster.
 
Wrong. With proper HARDWARE there is no reason why 32 x 32 --> 64 can't
be done in as few cycles as 16 x 16 --> 32. It just takes more gates.

I'm through arguing with you. It's obvious that you've never actually done
any large scale multi-precise calculations, and in the absence of that, your
arguments have little weight behind them. Your mind seems to be made up
and you refuse to accept the arguments of others who have VASTLY more
experience in doing this kind of arithmetic. 

-- 
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"

bmk@csc.ti.com (Brian M Kennedy) (09/13/90)

=>It has been claimed that a lack of 32x32->64 multiplication
=>makes a factor of 10 difference in the running time of
=>typical extended precision arithmetic routines. Although it
=>obviously makes _a_ difference in run time I do not measure
=>an order of magnitude difference.

I cannot comment directly about the performance gains possible
for general multi-precision arithmetic, but I can comment on
a special case, 64-bit arithmetic implemented via 32-bit
arithmetic.

I have a highly optimized simulator that must do 64-bit 
arithmetic.  For portability, the simulator is written 
in C++/ANSI C.  ANSI C provides no support for multi-precision
arithmetic.  It only guarantees that long is *at least* 32-bit.

Therefore, I have written a C++ type, doubleLong, which 
guarantees at least 64-bit.  The operation doubleLong*doubleLong
is implemented, as you might guess, with multiple long*long
operations.

IF ANSI C supported a library function
	void mult(long Op1, long Op2, long Product[2])
which multiplied Op1 times Op2 and placed the result in the
two longs making up the array Product,
AND my hardware provided support for 32*32->64, such that the
library function was implemented using that support,
THEN, I could implement doubleLong multiplication much more
efficiently.
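
(As a sketch of what that would buy -- my own illustration in C rather than
C++, with several assumptions the text above does not make: long is 32 bits,
the operands are treated as unsigned, and Product[0] receives the low word of
the 64-bit result while Product[1] receives the high word.)

struct dlong { unsigned long hi, lo; };		/* stand-in for doubleLong */

void mult(long Op1, long Op2, long Product[2]);	/* the hypothetical primitive */

/* low 64 bits of x*y: one full 32x32->64 plus two truncated cross products */
struct dlong dmul(struct dlong x, struct dlong y)
{
	long p[2];
	struct dlong r;

	mult((long)x.lo, (long)y.lo, p);
	r.lo = (unsigned long)p[0];
	r.hi = (unsigned long)p[1] + x.lo*y.hi + x.hi*y.lo;  /* mod 2^32 */
	return r;
}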

The question in this thread has been "How much more efficient?"
Speed-ups from 2.5 to 10 have been claimed.  Here are my measurements
using doubleLong:

Ideally, I would measure a program full of multiplications of
doubleLong's, first with the current 32*32->32 implementation,
and then with 32*32->64 implementation.  Unfortunately, I have
no hardware support for the latter.  Instead I will measure
an upper-bound on the performance increase by comparing:

  64*64->64 via 32*32->32      vs.      32*32->32

I assert that (64*64->64 via 32*32->64) calculations are faster than
(64*64->64 via 32*32->32) but slower than just plain 32*32->32.


The Environment:
---------------
       Mips R3000 running RISC/os 4.50, medium load
       C++ code compiled with ATT C++ 2.0
       C code compiled with Mips 4.5 C compiler with -O1


The Results:
-----------
    The program was run a number of times.  The min and max times are
    shown.

      32*32->32                          64*64->64 via 32*32->32
      ---------                          -----------------------
MIN:  %multiSpeed                        %multiSpeed
      Final product = 246447234          Final product = 99_999_970_000_002 

           CPU Time = 7650 ms                 CPU Time = 83730 ms
          Real Time = 17187 ms               Real Time = 86343 ms

MAX:  %multiSpeed                        %multiSpeed
      Final product = 246447234          Final product = 99_999_970_000_002

           CPU Time = 7820 ms                 CPU Time = 85960 ms
          Real Time = 19191 ms               Real Time = 137034 ms


From these timings, I get an approximate ratio of -- 11 --.
This is about what I expected, knowing the code.
                                                 
Therefore, the speedup I would see from having a 32*32->64
operation available to me through ANSI C would be less than 11.
Knowing the code, I would guess the speed-up would be about 4.

That speedup, of course, is on a program that does little more
than 64-bit multiplies.  On my simulator that uses doubleLong,
the gain would be much less than five percent.

Finally, general multi-precision arithmetic may get more gains
than indicated here.  Also the gains possible when coding in 
assembly are probably higher than in C++.

The program (in C++):
--------------------
// Switch the definition of NUMBER between long and doubleLong
// to obtain the alternate versions of this program.

#include "doubleLong.h"      // contains definition of doubleLong

#include "timer.h"           // for software timings

#define NUMBER doubleLong    // for 64*64->64 calculations,
			     // implemented with 32*32->32 calculations

//#define NUMBER long	     // for 32*32->32 calculations

#define COUNT 10000000

main ()
   {NUMBER Product;
    timer Timer;

    Timer.start();

    for(int I = 0; I < COUNT; I++)
       {Product = NUMBER(I) * NUMBER(I-1);}

    Timer.split();

    cout << "Final product = " << Product << endl << endl;

    cout << "     CPU Time = " << Timer.cpu() << " ms" << endl;

    cout << "    Real Time = " << Timer.real() << " ms" << endl;
   }


---------------------------------
Brian M. Kennedy <bmk@csc.ti.com>
Computer Systems Laboratory
Computer Science Center
Texas Instruments Incorporated

dik@cwi.nl (Dik T. Winter) (09/13/90)

Well, after the discussion between R. Kym Horsell and Robert D. Silverman,
here are some real facts.  First the background of it.

Eons ago Arjen Lenstra (now at the University of Chicago, I think) wrote a
program (in Fortran) to do primality proving for long numbers (up to
about 320 digits).  A writeup about it can be found in some back issue
of Mathematics of Computation.  For this program I did the base arithmetic
kernels.  Initially the program was written and executed on a CDC Cyber
750-75.  Later this program was ported to a Cray 1 and a Cyber 205, and
again I did the basic arithmetic kernels, using vector operations.

When we moved over to do our main work on minis etc., I brought along the
program and ported it to a large platform of minis.  The basic structure
of the arithmetic kernels is very simple, it allows for kernels written
in C or Fortran, using either the HOL arithmetic or some parts in assembler
allowing for 32x32->64 bit multiply and associated division.  Also provisions
are made to allow for simple loops over these operations to be coded in
assembler.  In this way the subroutine call overhead for a large number of
calls can be avoided.  Now whenever I get access to some machine I try
to port the program to it.  For a full port at least a Fortran compiler is
required (as the base program is still in Fortran).  Access to a C compiler
is useful.  I was not successful on one port only (to a Mac II) because of
severe bugs in the Fortran compiler.  Also there are a few platforms where
an assembler is not available (CDC Cyber 995, NEC SX/2).  And in a number of
cases I did a port of the arithmetic kernels, but could not do a full port
because I had no access to an adequate Fortran compiler (DG/MV and Intel
8086 with Xenix).  (And, no, the stuff is not yet publicly available; some
things have still to be done.)

Now, when Bob Silverman claimed that he experienced a 10 fold speedup
when going from 32x32->32 multiply to 32x32->64 multiply, I can only
say that I do not find such a speedup.  See the table below.  On the
other hand, Kym Horsell is also off base with his benchmark.  Also his
claims about input and output are beside the truth.  In my experience,
with this kind of work, a program will run happily during several seconds/
minutes/hours, doing no I/O at all and then finally write only a single
or a few numbers.  So in effect every effort spent to speed up I/O is
essentially wasted.  (The program I use does indeed twice write the number
it is testing; that is really a waste of time, but only a few milliseconds
in a run.)   Andrew Mullhaupt was of course right when he stated that
32x32->64 multiply is much more important than 64/32->32,32 division.
Even in a multiprecise division, the number of elementary divisions is order
n while the number of elementary multiplications is order n squared.

Now back to the data.  First column gives the machine, second column the
amount of time needed when using 32x32->64 multiply as compared to 32x32->32
multiply (in percents), the third column gives the same when also some
basic loops are coded.

				percentage of time needed when using
				32x32->64	coded loops
DEC VAX-780			62.84%		46.50%
Harris HCX7 (Tahoe)		61.75%		59.38%
Sun3 (3/60, 68020)		70.68%		56.23%
Alliant FX/4 (68020(1))		66.50%		48.43%
Sequent Symmetry (80386)	56.48%		 (2)
Sequent Balance (32032)		58.97%		46.08%
Encore Multimax (32332)		56.73%		45.81%
Encore Multimax (32532)		63.92%		48.02%
Gould PowerNode			59.06%		42.09%

SGI (mips R2000)		73.68%		68.30%
SGI (mips R3000)		73.28%		65.99%
DecSTATION 5100 (mips R3000)	70.20%		65.22%

Sun4 SPARCStation 1 (4/60)	57.04%		55.28%
IBM RTPC			36.79%		29.95%

Notes:
(1)	Not really a 68020, but a 68020 look-alike.  I did not use vector
	operations, because in that case I ought to go back to 32x32->32
	multiply.
(2)	I did not do the coding of basic loops because of the lack of registers,
	and I really did not want to do all kinds of loads/stores to and from
	memory.

Additional notes:

In general the CISC machines speed up to 55-70% of the original run
time in the first case and to 40-60% in the second case.  The Tahoe
is extraordinary because of the small difference between first and second
case; there is nearly no subroutine call overhead.

The RISC machines display a lack of speedup when going to coded loops; that
is what RISC is all about.

Mips machines display a general lack of speedup.  They have hardware multiply.

SPARC and RTPC both use multiply step instructions rather than hardware
multiply.  The RTPC multiply step instruction delivers 2 bits each time
it is executed.  Both SPARC and RTPC deliver a 64 bit result.  They also
both use divide step instructions, and both are able to do 64/32 divide
in the same amount of time as 32/32 divide, although the supplied software
does not use this fact.  I am still not satisfied with the Sun 4 results and
thinking about methods to improve it.  RTPC displays a very marked speedup.
Is that a compiler problem?

An additional question is whether hardware multiply is required, or whether
multiply step instructions are sufficient.  Below, for one CISC and three
RISC machines the timing (in seconds) of the best version of the program:

Sun 3 (3/60, 68020)	24.36
Sun 4 (4/60)		14.80
IBM RTPC		14.10
SGI (R3000 @ 20 MHz)	 5.88

We see that the mips processor has a decided advantage.  On the other hand,
the RTPC shows that even with multiply step instructions a reasonable
performance can be obtained (the base machine is slower than a Sun 4).
Sun 4 is in fact unacceptable for this kind of program.

And lastly the claim that 'bc' is extremely fast.  To the contrary, the
above program performs about 15000 operations of the form a*b and also
about 15000 operations of the form c/%d, with a, b and d 53 digit numbers
and c a 106 digit number (plus of course many more operations).  On the
Sun 4 'bc' needed 13 seconds to do only 400 of those operations.  So who
is the winner?

I hope this helps.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

cik@l.cc.purdue.edu (Herman Rubin) (09/13/90)

In article <1990Sep12.223253.9574@csc.ti.com>, bmk@csc.ti.com (Brian M Kennedy) writes:
> =>It has been claimed that a lack of 32x32->64 multiplication
> =>makes a factor of 10 difference in the running time of
> =>typical extended precision arithmetic routines. Although it
> =>obviously makes _a_ difference in run time I do not measure
> =>an order of magnitude difference.

			............................

>                                      Instead I will measure
> an upper-bound on the performance increase by comparing:
> 
>   64*64->64 via 32*32->32      vs.      32*32->32

		[Long description deleted.]

The original problem was 32x32 -> 64 compared to 32x32 -> 32.  To
do a reasonable type of test, consider the general problem of NxN -> 2N
vs. NxN -> N.  Now to do this properly, one should remember that in the
machine with NxN -> N, N is the length available.  Thus, in adding two
N-bit numbers, one must use a test-for-carry to detect a bit in position
N (starting the count from 0).  Also, the comparison should not depend on
the peculiarities of a particular compiler, but should be done at the 
machine-language level.  This is not a long code.
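
(A sketch of that carry test in C, just as an illustration of the point --
Rubin suggests doing it at machine-language level; the ``digits'' here are
unsigned longs of width N.)

/* one digit of an N+N+carry addition when no 2N-bit type is available */
unsigned long add_digit(unsigned long a, unsigned long b,
		unsigned long carry_in, unsigned long *carry_out)
{
	unsigned long s = a + b;	/* wraps mod 2^N */
	unsigned long c = (s < a);	/* carry out of a+b */

	s += carry_in;			/* carry_in is 0 or 1 */
	c += (s < carry_in);		/* possible second carry */
	*carry_out = c;			/* always 0 or 1 */
	return s;
}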

To carry out the benchmark, one could use N = 16 (or even 8) to get a
general idea.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

bs@linus.mitre.org (Robert D. Silverman) (09/13/90)

In article <2114@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
:Well, after the discussion between R. Kym Horsell and Robert D. Silverman,
:here are some real facts.  First the background of it.
               ----
 
Does this imply my claims are not real?

My claims are based on actual benchmarks with large scale multiprecise
FFTs, with the Elliptic Curve factoring algorithm and with the Elliptic
Curve prime proving algorithm.


:Eons ago Arjen Lenstra (now at the University of Chicago, I think) wrote a
:program (in Fortran) to do primality proving for long numbers (upto
:about 320 digits).  A writeup about it can be found in some back issue
:of Mathematics of Computation.  For this program I did the base arithmetic
 
Math. Comp. vol. 48 (Jan 1987) 
H. Cohen & A. Lenstra, "Implementation of a New Primality Test"

A large chunk of this algorithm is spent doing Jacobi sums that don't
consist of just multi-precise multiplications. This, I believe, is
relevant to some of the benchmarks given below. 

:program and ported it to a large platform of minis.  The basic structure
:of the arithmetic kernels is very simple, it allows for kernels written
:in C or Fortran, using either the HOL arithmetic or some parts in assembler
:allowing for 32x32->64 bit multiply and associated division.  Also provisions
:are made to allow for simple loops over these operations to be coded in
:assembler.  In this way the subroutine call overhead for a large number of
             ----------------------------------------------------------------
:calls can be avoided.  Now whenever I get access to some machine I try
 --------------------
 
If one chooses not to put the control loops in assembler, but instead
relies upon calling just low level routines for 64 bit arithmetic,
the speed difference becomes more dramatic. In going from 32 to 64
bit arithmetic, one makes 4 times fewer subroutine calls to low
level arithmetic routines. 

:
:Now, when Bob Silverman claimed that he experienced a 10 fold speedup
:when going from 32x32->32 multiply to 32x32->64 multiply, I can only
:say that I do not find such a speedup.  See the table below.  On the
:other hand, Kym Horsell is also off base with his benchmark.  Also his
:claims about input and output are beside the truth.  In my experience,
:with this kind of work, a program will run happily during several seconds/
:minutes/hours, doing no I/O at all and then finally write only a single
:or a few numbers.  So in effect every effort spent to speed up I/O is
:essentially wasted.  (The program I use does indeed twice write the number
:Even in a multiprecise division, the number of elementary divisions is order
:n while the number of elementary multiplications is order n squared.
 
This is true when dividing a multi-precise number by a single precision
number. It is not true when dividing two multi-precise numbers. 
Dividing 2n bits by n bits is O(n^2) not O(n).

:
:Now back to the data.  First column gives the machine, second column the
:amount of time needed when using 32x32->64 multiply as compared to 32x32->32
:multiply (in percents), the third column gives the same when also some
:basic loops are coded.
:
:				percentage of time needed when using
:				32x32->64	coded loops
:DEC VAX-780			62.84%		46.50%

Stuff deleted. 

I don't dispute the data, but I'm a little unclear as to what is
being measured. 'percentage of time needed' is unclear to me.
Does this refer to the percentage of total execution time doing
multi-precise divides and multiplies?

 
I believe what it means is that, for example on the VAX, the program
spent (respectively) 62% and 46% as much time to run when using 64
bit arithmetic, as it did when using 32 bit arithmetic.

However, he does not say what fraction of the program is spent in
doing this arithmetic. If a program (say) spends 1/2 its time doing
multi-precise multiplies, a 10-fold speed increase will reduce this
to 5%, but one still has the other 50%, resulting in a total run
time of 55% of the original.

:And lastly the claim that 'bc' is extremely fast.  To the contrary, the
:above program performs about 15000 operations of the form a*b and also
:about 15000 operations of the form c/%d, with a, b and d 53 digit numbers
:and c a 106 digit number (plus of course many more operations).  On the
:Sun 4 'bc' needed 13 seconds to do only 400 of those operations.  So who
:is the winner?
:
-- 
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/13/90)

In article <119983@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:
\\\
>This is true when dividing a multi-precise number by a single precision
>number. It is not true when dividing two multi-precise numbers. 
>Dividing 2n bits by n bits is O(n^2) not O(n).
\\\

The complexity of _division_ is still (I understand) an open
question. According to some (old) work by Cook & Aanderaa it
is _unlikely_ to be less than O(n log n/(log log n)^2) on
``conventional'' machines.

The best division _algorithm_ that I know of runs in
O(n log n log log n).

-Kym Horsell

bs@linus.mitre.org (Robert D. Silverman) (09/13/90)

In article <3999@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
:In article <119983@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:
:\\\
:>This is true when dividing a multi-precise number by a single precision
:>number. It is not true when dividing two multi-precise numbers. 
:>Dividing 2n bits by n bits is O(n^2) not O(n).
:\\\
:
:The complexity of _division_ is still (I understand) an open
:question. According to some (old) work by Cook & Aanderaa it
:is _unlikely_ to be less than O(n log n/(log log n)^2) on
:``conventional'' machines.
:
:The best division _algorithm_ that I know of performs 
:O(n log n log log n).
:
:-Kym Horsell


I'm talking about PRACTICAL algorithms here, not theoretical ones.
There doesn't seem to be a recursive bisection algorithm similar to
Karatsuba's that works for division. Methods based on FFT's are 
ineffective until one reaches thousands of bits. For numbers in the
(say) 200 decimal digit range, ordinary long division is best, and
that algorithm is O(n^2).

-- 
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/14/90)

In article <120002@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:
[distinction between the theoretical complexity of division as opposed to
	the complexity of a particular algorithm].
>I'm talking about PRACTICAL algorithms here, not theoretical ones.
>There doesn't seem to be a recursive bisection algorithm similar to
>Karatsuba's that works for division. Methods based on FFT's are 
>ineffective until one reaches thousands of bits. For numbers in the
>(say) 200 decimal digit range, ordinary long division is best, and
>that algorithm is O(n^2).

So long as you make the distinction between problem complexity,
algorithm complexity & the complexity of algorithms that you use.
Such niceties are sometimes lost on some of us who read this group.

I will check out your claim re _practicality_ of FFT algorithms and
will report results to the net; they may be of use to those
concerned about implementation issues.

-Kym Horsell

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/14/90)

In article <120002@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:
\\\
>There doesn't seem to be a recursive bisection algorithm similar to
>Karatsuba's that works for division. Methods based on FFT's are 
\\\

I am not sure about the Newton part of the obvious ``invert and
multiply'' using Karatsuba -- does this give a practical O(n^lg(3))
technique for high-precision division?

Another straightforward scheme (I haven't checked the literature,
but I don't remember seeing it) is to use

	1/(2^n a + b)  approx=  2^-n (1/a) (1 - 2^-n b/a)

It _has_ been a long day so could someone point out to me
(I _know_ there are mathematicians out there) whether the
rhs requires 2*n-bit calcs?
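
(A quick double-precision sanity check of the approximation -- the sample
values are arbitrary, and this says nothing about the 2*n-bit question:)

#include <stdio.h>
#include <math.h>

int main(void)
{
	double n = 16.0, a = 3.0, b = 5.0;
	double exact  = 1.0/(pow(2.0,n)*a + b);
	double approx = pow(2.0,-n)*(1.0/a)*(1.0 - pow(2.0,-n)*b/a);

	printf("exact=%.17g approx=%.17g relerr=%g\n",
		exact, approx, (approx - exact)/exact);
	return 0;
}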

-Kym Horsell

dik@cwi.nl (Dik T. Winter) (09/14/90)

In article <119983@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:
 > In article <2114@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
 > :Well, after the discussion between R. Kym Horsell and Robert D. Silverman,
 > :here are some real facts.  First the background of it.
 >                ----
 >  
 > Does this imply my claims are not real?
No, not that.  But your claims spoke only of an order of magnitude.
Perhaps your claims are based on programs where one was 10 times
as fast as the other on some processor.  I have a lot of data for a
lot of different processors.
 > 
 > My claims are based on actual benchmarks with large scale multiprecise
 > FFTs, with the Elliptic Curve factoring algorithm and with the Elliptic
 > Curve prime proving algorithm.
You did not state what processor gives which advantage.  It highly depends
upon the processor used.
 > 
...
 > A large chunk of this algorithm is spent doing Jacobi sums that don't
 > consist of just multi-precise multiplications. This, I believe, is
 > relevent to some of the benchmarks given below. 
Right, but although it is a large chunk of the algorithm, it is not a large
chunk of the time required as far as I know.  But I will time that.
 > 
...
 > :            In this way the subroutine call overhead for a large number of
 >              ----------------------------------------------------------------
 > :calls can be avoided.  Now whenever I get access to some machine I try
 >  --------------------
 >  
 > If one chooses not to put the control loops in assembler, but instead
 > relies upon calling just low level routines for 64 bit arithmetic,
 > the speed difference becomes more dramatic. In going from 32 to 64
 > bit arithmetic, one makes 4 times fewer subroutine calls to low
 > level arithmetic routines. 
Especially if (as you make clear now) you also use subroutine calls to
low level routines that could be inlined and coded in the HOL.  Did you
try inlining on (for example) the Sparc?
 > 
 > :                      (The program I use does indeed twice write the number
 > :Even in a multiprecise division, the number of elementary divisions is order
 > :n while the number of elementary multiplications is order n squared.
 >  
 > This is true when dividing a multi-precise number by a single precision
 > number. It is not true when dividing two multi-precise numbers. 
 > Dividing 2n bits by n bits is O(n^2) not O(n).
I wrote that the number of elementary divisions is order n, the number of
elementary multiplications and additions is n^2.
 > 
 > :
 > :Now back to the data.  First column gives the machine, second column the
 > :amount of time needed when using 32x32->64 multiply as compared to 32x32->32
 > :multiply (in percents), the third column gives the same when also some
 > :basic loops are coded.
 > :
 > :				percentage of time needed when using
 > :				32x32->64	coded loops
 > :DEC VAX-780			62.84%		46.50%
 > 
 > Stuff deleted. 
 > 
 > I don't dispute the data, but I'm a little unclear as to what is
 > being measured. 'percentage of time needed' is unclear to me.
 > Does this refer to the percentage of total execution time doing
 > multi-precise divides and multiples?
Excuse me for my poor English.  The columns give the time required for
the whole prime proving run using an alternate form of the arithmetic,
compared to the time required when doing everything in a HOL.
 > 
 > I believe what it means is that, for example on the VAX, the program
 > spent (respectively) 62% and 46% as much time to run when using 64
 > bit arithmetic, as it did when using 32 bit arithmetic.
Right.
 > 
 > However, he does not say what fraction of the program is spent in
 > doing this arithmetic. If a program (say) spends 1/2 its time doing
 > multi-precise multiplies, a 10-fold speed increase will reduce this
 > to 5%, but one still has the other 50%, resulting in a total run
 > time of 55% of the original.
Also true.  But I am *not* interested in the time it takes to do some
parts of a program; I am only interested in the time that is required
for the complete program.  The same discussion is often heard in
supercomputer regions about vectorization.  Look at Amdahl's law.
What is better, a processor that peaks at 800 MFlop but gets a performance
of 10 MFlop on realistic programs, or a processor that peaks at 400 MFlop
and gets 100 MFlop on realistic programs?

(To be honest, on more than one occasion I did this kind of micro optimization,
see my follow-up to Peter Montgomery's article.  But is it worthwhile?  For me
yes, I learn a bit more about the processor, but that is all.)
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

pavlov@canisius.UUCP (Greg Pavlov) (09/14/90)

In article <119858@linus.mitre.org>, bs@linus.mitre.org (Robert D. Silverman) writes:
> ......... Suppose A > B and A > C. How would one compute
> BC/A EXACTLY without 64 bit support? .....
> 
   ... and if A = PI, and B = sq.rt. of 2, how would one calculate BC/A
       EXACTLY ?

> "You can lead a horse's ass to knowledge, but you can't make him think"

   ... obviously, we are all (all the rest of us, anyway) horses' asses....

    greg pavlov, fstrf, amherst, ny

bs@linus.mitre.org (Robert D. Silverman) (09/14/90)

In article <2857@canisius.UUCP> pavlov@canisius.UUCP (Greg Pavlov) writes:
:In article <119858@linus.mitre.org>, bs@linus.mitre.org (Robert D. Silverman) writes:
:> ......... Suppose A > B and A > C. How would one compute
:> BC/A EXACTLY without 64 bit support? .....
:> 
:   ... and if A = PI, and B = sq.rt. of 2, how would one calculate BC/A
:       EXACTLY ?
:
:> "You can lead a horse's ass to knowledge, but you can't make him think"
:
:   ... obviously, we are all (all the rest of us, anyway) horses' asses....
:
:    greg pavlov, fstrf, amherst, ny


If you had been following this thread you would know that we have been
discussing INTEGER arithmetic throughout. Therefore BC/A is an integer
that fits within a single machine word.

-- 
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"

pete@abccam.abcl.co.uk (Peter Cockerell) (09/14/90)

In article <3989@bingvaxu.cc.binghamton.edu>, vu0310@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> 
> After a number of private communications, I've managed to render
> one of the little benchmarks I have presentable enough to post,
> along with some performance figures from different machines
> (basically, it's put up or shut up time).

[load of stuff deleted]

> P.S. The benchmark was _deliberately_ kept rather simple; I
>       wanted to measure the performance of _multiply_ in
>       the context of basic extended precision arithmetic, not
>       the memory or i/o subsystems.


The results of the 'benchmark' when run on my ARM (Acorn Risc Machine)
system running BSD 4.3 are:


DOUBLE defined      3.6
DOUBLE not defined  8.9
Ratio               2.5

All this seems to be telling me is that the conversions required to use the
ARM's 32*32->32 instruction to perform 8*8->16 arithmetic are more onerous
(and so slower) than those required to do 16*16->32.

(Short<->int on the ARM requires masking and/or shifting; there are no
explicit conversion instructions, so a C int=short*short compiles to

    MOV R0, R0, LSL #16     ;Sign extend first operand
    MOV R0, R0, ASR #16
    MOV R1, R1, LSL #16     ;Sign extend second operand
    MOV R1, R1, ASR #16
    MUL R2, R0, R1          ;Do the multiply

Similarly, a short=char*char is

    MOV R0, R0, LSL #24     ;Sign extend first operand
    MOV R0, R0, ASR #24
    MOV R1, R1, LSL #24     ;Sign extend second operand
    MOV R1, R1, ASR #24
    MUL R2, R0, R1          ;Do the mul
    MOV R2, R2, LSL #16     ;Convert to short
    MOV R2, R2, ASR #16

In comparison, int=int*int compiles to

    MUL R2, R0, R1

The benchmark time for the case when LONG and SHORT are both defined
to be int (ie the natural length for the processor) is 0.4s!

Or am I missing something...?

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/17/90)

In article <513@abccam.abcl.co.uk> pete@abccam.abcl.co.uk (Peter Cockerell) writes:
\\\
>The benchmark time for the case when LONG and SHORT are both defined
>to be int (ie the natural length for the processor) is 0.4s!
>			
>Or am I missing something...?

Maybe. What is happening to the high-order half of each product?
It's lost; your benchmark isn't performing the same function as mine.
[And any remaining difference is due to memory accessing effects -- what a
difference tho'!]

For those who don't want (yet another?) clarification of what I'm trying to
get at, please hit `n' here.
[I can't help it & doctors can't help -- working in a college environment
causes ``lecture latchup''.  But it _does_ help to clarify my own `ideas'.]

To reiterate, I wish to measure the difference between performance
of XP software with & without the convenience of having multiply
produce a ``double size'' product.

Some folks argue that having a multiply that gives a double product
is crucial to efficient running of their XP software. The question
then is, _how much_ is it worth (in terms of area, running time, etc).

To this end I've released this lil' program for any interested party
to measure on their available h/w (and _I'm_ interested in the
results 4 sure).

The program attempts to perform one of the things that tend
to take time in XP calcs -- big multiplies -- and I have adopted the naive
``pencil & paper'' method because (a) it is still used a lot (see a lot of
LISP ``bignum'' packages for one thing), and (b) it has a _high dynamic
density_ of machine-level multiply operations vis-a-vis adds and shifts.

Now, since double-sized products are not universal, I have to
``guestimate'' their loss on some architectures. The way I have
chosen to do this is to perform some calculations using 32x32->32 and
16x16->16 arithmetic (where available). On machines where native
16x16->16 _isn't_ available we have a bit of a problem (not to mention
machines that don't have h/w multiply in _any_ form); but it's
still useful to have some numbers for these machines anyway.

O'Keefe has concentrated on computing factorials -- and this _may_
be a good idea; the density of multiplies may be higher than in
the program I released. However, the first set of figures I posted
was based on the same idea (although I didn't _actually_ use
any machine-level support for it) and the differences between the
16 and 32-bit versions weren't as large as I thought they _might_
be in other contexts -- hence the _second_ program.

O'Keefe's figure of 4-5 times speedup when _using_ vs _not using_
an _actual_ double-sized product is important to note. Maybe
I'll go back and _actually_ insert this into my program.
However, it's only _possible_ for machines with the actual
h/w support.

Summary -- O'Keefe has raised doubts (my own included) over
the actual speedup between 32x32->32 and 32x32->64, so I'm going back
to the bench (but not for a rest). Unrolling the loops is
an experiment that _has_ been suggested by several people.

Tnx to all who are participating.

-Kym Horsell

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (09/20/90)

In article <4037@bingvaxu.cc.binghamton.edu>, kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> The program attempts to perform one of the things that tend
> to take time in XP calcs -- big multiplies -- and I have adopted the naive

Horsell's benchmark had two parts:  part 1 calculated a bunch of factorials,
then part 2 added and multiplied them.  I noticed that the code to add two
bignums was wrong ("cry" is one bit too short...), so I concentrated on part 1.
Part 1 does a lot of bigit * bignum multiplies, except that it turns out to
use very small bigits.  That's why I looked at factorials, because that was
part of Horsell's benchmark, and it was possible to use a technique for
factorials which _is_ similar to bignum multiplication.

> O'Keefe's figure of 4-5 times speedup when _using_ vs _not using_
> an _actual_ double-sized product is important to note.

It turned out that we have GCC here.  That makes a difference.
New figures:

Factorial test (NS32532, Encore Multimax 4.3BSD)

version	mult.size	lang	User t	System	Ratio
Horsell "8x8 -> 16"	cc	311.2	1.9	1.68
Horsell	"16x16 -> 32"	cc	228.8	2.6	1.23
My code "32x32 -> 64"	as	 52.6	0.5   1/3.52

Horsell "8x8->16"	gcc	264.1	1.3	1.43
Horsell "16x16->32"	gcc	185.1	0.8	1.00
My code "32x32->64"	as	 52.4	0.3   1/3.53		(.aligned)

So compared with Horsell's "DOUBLE" version compiled by a rather better
compiler, assembly code using 32x32->64 unsigned multiplication gets a
3.5ish speedup.  (The .aligned note refers to the trick of aligning branch
targets on word boundaries.  On machines with variable size instructions
that can often help.  Here it didn't, not to speak of.)

A figure of about 3.5 sounds right:   going from 16-bit to 32-bit chunks
we expect to do 1/4 the number of multiplications, and those multiplications
may be a wee bit slower.

It's worth noting that Horsell's representation for bignums is hairy.
I just used pointer-to-(length, bigit 0, bigit 1, ...) which amongst
other things let me use add-compare-and-branch.
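
(In C the shape of that representation is roughly the following -- a sketch
of the description above, not the code that was timed; the digit order,
least significant first, is my assumption.)

typedef unsigned long bigit;

typedef struct {
	long  len;	/* number of bigits in use */
	bigit d[1];	/* d[0]..d[len-1]; allocate the rest with malloc */
} Bignum;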

-- 
Heuer's Law:  Any feature is a bug unless it can be turned off.