[comp.arch] int x int -> long for *

vu0310@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/12/90)

In article <119733@linus.mitre.org> bs@gauss.UUCP (Robert D. Silverman) writes:
\\\
>I wish people would refrain from shooting their mouths off about
>multi-precision arithmetic until they've actually done a significant
>amount of it. Typically, only having 32 x 32 --> 32 slows down
>multi-precision arithmetic by about a factor of 10. Not having it
>can render a machine all but useless for performing certain tasks.
\\\

Well that ``factor of 10'', in my simple-minded way, sounded a bit
too far off the mark (of course it only _asymptote_ to less than 3 for sure).
So I ran my littul benchmark. The results for VAX 8750 and Sparc,
using a variety of compilers and optimization levels are as
follows:

Machine	Compiler	Optimization	benchmark 1	benchmark 2

VAX	
	cc		no		2.783		5.300
	cc		yes		2.566		5.133
	gcc		no		2.900		5.116
	gcc		yes		1.783		3.700

Sparc
	cc		no		1.833		3.733
	cc		yes		1.033		2.100

Admittedly there is a lost to be said for my lack of understanding
of experimental design, but I think we fail to see an
order of magnitude difference between benchmarks 1 & 2.

The difference? Benchmark 2 performed all calculations using
int x int -> int while benchmark 1 used both int x int -> int
and long x long -> long.

I reiterate -- since int x int -> long is not _logically
required_ then you must argue it's benefit from an _economic_ 
viewpoint rather than a convenience one. 

As the vast majority of computation (although an ill-defined 
quantity here) does not seem to suffer from its lack, and the 
admittedly special-purpose stuff here only runs a (binary) order of 
magnitude slower by using lower-precision, I can not see how
allocation of 1 barn to int x int -> long can be justified.

Of course, this situation may change -- technology has a habit
of throwing a monkey-wrench into things (e.g. the motor
industry and mating habbits).  

I await cogent arguments.

-Kym Horsell

mash@mips.COM (John Mashey) (09/12/90)

I'm confused about the following exchange:

In article <3984@bingvaxu.cc.binghamton.edu> vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) writes:
>In article <119733@linus.mitre.org> bs@gauss.UUCP (Robert D. Silverman) writes:
...
>>amount of it. Typically, only having 32 x 32 --> 32 slows down
>>multi-precision arithmetic by about a factor of 10. Not having it

>Well that ``factor of 10'', in my simple-minded way, sounded a bit
>too far off the mark (of course it only _asymptote_ to less than 3 for sure).
>So I ran my littul benchmark. The results for VAX 8750 and Sparc,
>using a variety of compilers and optimization levels are as
>follows:
....
>Admittedly there is a lost to be said for my lack of understanding
>of experimental design, but I think we fail to see an
>order of magnitude difference between benchmarks 1 & 2.

>The difference? Benchmark 2 performed all calculations using
>int x int -> int while benchmark 1 used both int x int -> int
>and long x long -> long.

>I reiterate -- since int x int -> long is not _logically
>required_ then you must argue it's benefit from an _economic_ 
>viewpoint rather than a convenience one. 

I'm confused.  Since some fairly strong results are being stated
(that Silverman is way off; anyone can be off, but on the other hand,
Silverman certainly knows this problem domain), it would help
to understand exactly what is being measured. In particular,
I'm confused becasue Silverman clearly wished for 32x32->64,
and the posted results talk about int x int -> int, long x long -> long.

Unless I completely misunderstand the posting, I can't find any
connection whatsoever between the measurements of an apparently portable
C program on these machines, and Silverman's clear statement of wishing for
32 x 32 -> 64 bit product for a 10X difference in his work, since:
	int & long are both 32-bit anyway
	C doesn't expect 64-bit from 32x32; I don't know how, in C, you
		select between 32x32->32, and 32x32 -> 64,if both are
		actually available?  Are there C compilers that give this
		choice? (Note: MIPS has 32x32->64,
			but C just uses the low-order 32 bits)
	Existing SPARC implementations don't HAVE multiply instructions at all
		(other than multiply-step)

How about posting the benchmark, so we can understand what's happening?
It might especially be good to post a sample of the dissassembled code that's
supposed to be doing 32x32->64, if that is truly supposed to be happening.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

pmontgom@euphemia.math.ucla.edu (Peter Montgomery) (09/12/90)

In article <41425@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>I'm confused about the following exchange:
>
>In article <3984@bingvaxu.cc.binghamton.edu> vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) writes:
>>In article <119733@linus.mitre.org> bs@gauss.UUCP (Robert D. Silverman) writes:
>...
>>>amount of it. Typically, only having 32 x 32 --> 32 slows down
>>>multi-precision arithmetic by about a factor of 10. Not having it
>
>>Well that ``factor of 10'', in my simple-minded way, sounded a bit
>>too far off the mark (of course it only _asymptote_ to less than 3 for sure).
>>So I ran my littul benchmark. The results for VAX 8750 and Sparc,
>>using a variety of compilers and optimization levels are as
>>follows:
>....
>>The difference? Benchmark 2 performed all calculations using
>>int x int -> int while benchmark 1 used both int x int -> int
>>and long x long -> long.
>
>
>I'm confused.  Since some fairly strong results are being stated
>(that Silverman is way off; anyone can be off, but on the other hand,
>Silverman certainly knows this problem domain), it would help
>to understand exactly what is being measured. In particular,
>I'm confused becasue Silverman clearly wished for 32x32->64,
>and the posted results talk about int x int -> int, long x long -> long.
>
>Unless I completely misunderstand the posting, I can't find any
>connection whatsoever between the measurements of an apparently portable
>C program on these machines, and Silverman's clear statement of wishing for
>32 x 32 -> 64 bit product for a 10X difference in his work.
>How about posting the benchmark, so we can understand what's happening?
>It might especially be good to post a sample of the disassembled code that's
>supposed to be doing 32x32->64, if that is truly supposed to be happening.

	I do many of the same types of computations as Silverman
(and, like he, some of my processes use a month of CPU time or more).
But I dispute his 10X figure.  Here is a benchmark which tries
to compute the remainders i*j mod p where 0 <= i, j < p, using three
algorithms expressible in C and also an assembly code
algorithm (which, on the SUN 3, uses the 64-bit multiply and divide
instructions).  Those instructions perform five times as fast as
the best of the C algorithms, but my SPARC assembly code
is only slightly bettter than the best of the C algorithms.
[If someone can significantly improve my SPARC assembly code, 
let me know.]

#! /bin/csh -f
#
#	This benchmark is intended to measure how much 64-bit 
#	integer multiplication and division hardware affect
#	the computation of i*j mod p, where 0 <= i, j, < p < 2**31.  
#	The task is done four times, via C routines MUL1, MUL2, MUL3,
#	and assembly routine MUL4.  A probable primality test
#	(based on Fermat's little Theorem) is used to rate the algorithms.
#
#	MUL1 uses additions, subtractions, and bit-wise operations,
#	(no multiplication or division), analyzing one bit at a time. 
#	It provides a benchmark for rating the machine's raw speed.
#
#	MUL2 uses floating point arithmetic to estimate the quotient
#	i*j/p, followed by unsigned long integer arithmetic (modulo 2^32 or
#	other power of 2) to compute the remainder and possibly adjust it.
#
#	MUL3 breaks the operands up into 16-bit pieces.
#
#	MUL4 is a machine-dependent, assembly-language program.
#	Presently SUN 3 and SUN 4 (SPARC) versions are included.
#	If the machine has 64-bit integer multiply and divide instructions,
#	the MUL1/MUL4 ratio estimates the benefit achieved by
#	using these instructions (when adjusted for compiler's
#	level of sophistication).
#
#	Approximate times:
#
#	       SUN 3/280	     SUN 4/260	   (both with FPUs)
#      gcc 1.37.1     cc     gcc 1.37.1	    cc
# MUL1	    10.40  13.52           3.67	  2.92     (simple integer arithmetic)
# MUL2	     9.00  10.38           2.03   2.05	   (floating point arithmetic)
# MUL3	    10.14  11.30           6.11	  6.32     (break into 16-bit pieces)
# MUL4	     1.88   1.98 	   1.82	  1.83     (assembly code)
#
#	My optimized assembly code for the SUN 4 does little better
#	than my straightforward assembly code for the SUN3,
#	even though the SUN 4's raw CPU speed is 3.5 - 5 times
#	that of the SUN 3 (as estimated by the MUL1 and MUL2 times).
#	This is an application on which the SPARC does not do
#	as well as we want.  Residue (modular) arithmetic is 
#	important to some of my applications -- see, for example,
#	"An FFT extension to the P-1 factoring algorithm"
#	by Robert D. Silverman and me, in the April, 1990 Math. Comp.
#
#		Peter L. Montgomery
#		Department of Mathematics
#		University of California
#		Los Angeles, CA 90024-1555
#		pmontgom@math.ucla.edu
#		September, 1990

cat << end_program >! modprd.c

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

typedef unsigned long ulong;
typedef          long slong; 

long lrand48();		/* Random number generator (SUN-specific) */
ulong MUL4();		/* Assembly routine */

ulong MUL1(i, j, p)
ulong i, j, p;
{
    register ulong ii, jj, rem = 0;

    if (i < j) {ii = i; jj = j;} else {ii = j; jj = i;}
		/* ii = MIN(i, j), 	jj = MAX(i,j)	*/

    while (ii != 0) {		/* Answer is (ii*jj + rem) mod p */
	if ((ii & 1) != 0) {
	    rem += jj;
	    if (rem >= p) rem -= p;
        }
        jj <<= 1;
        if (jj >= p) jj -= p;
	ii >>= 1;
    } 
    return rem;
}

ulong MUL2(i, j, p)
ulong i, j, p;
{
    register ulong quot, rem;

    quot = (ulong)(slong) (  (double)(slong)i * (double)(slong)j 
		            /(double)(slong)p + 0.5);
				/* true quotient, or one too high */
    rem = i*j - p*quot;		/* remainder as signed integer */
    if ((slong)rem <= -1) rem += p;
	/* The above test could be "< 0", but some compilers optimize 
	   "< 0" incorrectly, using the condition code from the subtract.  */
    return rem;
}

ulong MUL3(i, j, p) 
ulong i, j, p;
{
    if (p < 65536) {
	return (i*j) % p;	/* simple case */
    } else {
   	ulong itop = i >> 16, jtop = j >> 16;  /* fit in 15 bits */
   	ulong ibot = i & 0xffff, jbot = j & 0xffff;

   	register ulong prdtop = itop*jtop, prdbot = ibot*jbot;
   	register ulong prdmid = prdtop + prdbot - (itop - ibot)*(jtop - jbot);
		/* i*j = prdtop*2^32 + prdmid*2^16 + prdbot */
		/* 0 <= prdtop < p^2/2^32  < 2^30 */
		/* 0 <= prdbot, prdmid <= 2^32 - 2^17 + 1 */
	register ulong pnorm = p;	/* Multiple of p */
	register ulong pntop, pnbot;

 	while (pnorm < (1L<<30)) pnorm <<= 1;  /* 2^30 <= pnorm < 2^31 */
	pntop = pnorm >> 16;
	pnbot = pnorm & 0xffff;

	prdmid += (prdbot >> 16); prdbot &= 0xffff;	/* Accumulate carries */
	prdtop += (prdmid >> 16); prdmid &= 0xffff;
        while (prdtop > pntop) {
	    register ulong quot = prdtop/(pntop + 1);	/* Fits in 16 bits */
	    register ulong tmpmid = quot*pnbot;
	    prdtop -= quot*pntop;
	    if (prdmid < tmpmid) prdtop -= 65536;
	    prdmid -= tmpmid;
	    prdtop += (prdmid >> 16); prdmid &= 0xffff;
	}
	prdmid += prdtop << 16;
	while (prdmid > pntop) {
	    register ulong quot = prdmid/(pntop + 1);
	    register ulong tmpbot = quot*pnbot;
	    prdmid -= quot*pntop;
	    if (prdbot < tmpbot) prdmid -= 65536;
	    prdbot -= tmpbot;
	    prdmid += (prdbot >> 16); prdbot &= 0xffff;
	}	
	prdbot += (prdmid << 16);
	return prdbot%p;
    }
}

double CP_TIME()
{ 		/* Return CP time used so far, in seconds */
    long sec, usec;
    struct rusage inf;

    getrusage(RUSAGE_SELF, &inf);
    sec	 = inf.ru_utime.tv_sec	+ inf.ru_stime.tv_sec;
    usec = inf.ru_utime.tv_usec + inf.ru_stime.tv_usec;
    return (double)sec + 1.0e-6*(double)usec;
}

ulong MODEXP(base, expon, p, func)
ulong base, expon, p;
ulong (*func)();	
{		/* Modular exponentiation, base^expon mod p */
    if (expon == 0) {
	return 1;
    } else {
	register ulong b = base, ipow2 = 1;
	while ((ipow2 <<1) <= expon) ipow2 <<= 1;
			/* highest power of 2 <= expon */
	while ((ipow2 >>= 1) != 0) {
	    /* Answer = [b^(2*ipow2) * base^(e % (2*ipow2))] mod p */
	    b = func(b, b, p);
	    if (expon & ipow2) b = func(b, base, p);
	}	
   	return b;
    } 
}

void TIME(func, name)
ulong (*func)();	/* MUL1, MUL2, MUL3, or MUL4 */
char *name;
{
    double t1 = CP_TIME(), t2;
    ulong p, nprime = 0;
/*
	Search for primes in the interval (10^9, 10^9 + 10000),
	using a pseudoprime test to base 1234567.
	Count the number of successes.
*/
    for (p = 1000000001; p <= 1000010000; p += 2) {
	ulong test = MODEXP(1234567L, (p-1)/2, p, func);
	if (test == 1 || test == p-1) nprime++;
    }
	
    t2 = CP_TIME();
    printf("%s found %ld probable primes in %10.2f seconds.\n", 
		name, nprime, t2-t1);
}

int main(argc, argv)
int argc;
char **argv;
{
    int itest;
    srand48(1990);		/* Set seed */
    for (itest = 1; itest <= 200; itest++) {
        ulong n1, n2, p;
	ulong ans1, ans2, ans3, ans4;

	while ((p = lrand48()) < 1000) {};
	n1 = lrand48() % p;	
	n2 = lrand48() % p;	
        ans1 = MUL1(n1, n2, p);
        ans2 = MUL2(n1, n2, p);
        ans3 = MUL3(n1, n2, p);
        ans4 = MUL4(n1, n2, p);
	if (ans1 != ans2 || ans2 != ans3 || ans3 != ans4) {
	    printf("Inconsistent results: n1 = %ld, n2 = %ld, p = %ld\n", 
				n1, n2, p);
	    printf("  ans1 = %ld, ans2 = %ld, ans3 = %ld, ans4 = %ld\n",
			        ans1, ans2, ans3, ans4);
	}
    }	/* for itest */
    TIME(MUL1, "MUL1");
    TIME(MUL2, "MUL2");
    TIME(MUL3, "MUL3");
    TIME(MUL4, "MUL4");
    return 0;
}
end_program
cat << end_sun3assem >! sun3assem.s
| 	iprod = MUL4(i, j, p)		Returns i*j mod p
	.globl	_MUL4			
_MUL4:					| Uses only d0, d1
	movl	sp@(4),d1		| d1 = i
	mulul   sp@(8),d0:d1		| Get 64-bit product i*j
	divul   sp@(12),d0:d1		| Divide by p, get remainder in d0
	rts				| Return
end_sun3assem
cat << end_sun4assem >! sun4assem.s
!		SPARC assembly version of MUL4(i, j, p)   (i*j mod p)
#define MULT1(prod,src) mulscc prod,src,prod
#define MULT2(prod,src) MULT1(prod,src);MULT1(prod,src)
#define MULT4(prod,src) MULT2(prod,src);MULT2(prod,src)
#define MULT8(prod,src) MULT4(prod,src);MULT4(prod,src)
#define MULT16(prod,src) MULT8(prod,src);MULT8(prod,src)

/*
	When dividing  divhi*2^n + divlo  by  divsor,
	where 0 <= divhi < divsor  and	0 <= divlo < 2^n,
	the bottom half of the dividend (divlo) should
	be left-justified (i.e., pre-multiplied by 2^(32-n)).
	After n executions of the REMAINDER_STEP macro,
	register "divhi" will have the remainder.
*/

#define REMAINDER_STEP(divhi,divlo,divsor) \
	addcc	divlo,divlo,divlo  /* Shift low dividend */ ;\
	addx	divhi,divhi,divhi  /* Shift high dividend */ ;\
	cmp	divhi,divsor	   /* Compare dividend and divisor */ ;\
	bgeu,a	1f		   /* if ge, branch */ ;\
	sub	divhi,divsor,divhi /* but first subtract divisor */; \
1:

#define MUL4_step REMAINDER_STEP(%o5,%o4,%o2)

	.global _MUL4		! iprod = MUL4(i, j, p)		(leaf procedure)

_MUL4: 				! %o0 = i, %o1 = j, %o2 = p
	subcc	%o0,%o1,%o3	! i-j
	bgeu	1f		! Branch if i >= j
	nop
	add	%o1,%o3,%o1	! %o1 = j + (i-j) = i = MIN(i,j)
	sub	%o0,%o3,%o0	! %o0 = i - (i-j) = j = MAX(i,j)
1:
	mov	%o1,%y		! set multiplier (y reg) to MIN(i,j)
	srl	%o1,16,%o4
	tst	%o4
	bz	mul4_16		! Branch if multiplier fits in 16 bits
	andncc	%o4,0xff,%g0
	bz	mul4_24
	orcc	%g0,%g0,%o5	! Initialize upper product (delay)

	MULT16(%o5,%o0)
	MULT16(%o5,%o0)
	mulscc	%o5,%g0,%o5	! Align results
	mov	%y,%o4		! Get lower product

	MUL4_step; MUL4_step; MUL4_step; MUL4_step
	MUL4_step; MUL4_step; MUL4_step; MUL4_step
mul4_x24:
	MUL4_step; MUL4_step; MUL4_step; MUL4_step
	MUL4_step; MUL4_step; MUL4_step; MUL4_step
mul4_x16:
	MUL4_step; MUL4_step; MUL4_step; MUL4_step
	MUL4_step; MUL4_step; MUL4_step; MUL4_step
mul4_x8:
	MUL4_step; MUL4_step; MUL4_step; MUL4_step
	MUL4_step; MUL4_step; MUL4_step; MUL4_step
	retl			! Return from leaf
	mov	%o5,%o0		! Move remainder (delay)

mul4_24:
	MULT16(%o5,%o0)
	MULT8(%o5,%o0)
	mulscc	%o5,%g0,%o5	! Align results
	b	mul4_x24
	mov	%y,%o4		! and get lower product (delay)

mul4_16:
	andncc	%o1,0xff,%g0	! Test if one operand is 8 bits or less
	bz	mul4_8
	orcc	%g0,%g0,%o5	! Initialize upper product (delay)

	MULT16(%o5,%o0)
	mulscc	%o5,%g0,%o5	! Align results
	b	mul4_x16		
	mov	%y,%o4		! and get lower product

mul4_8:
	MULT8(%o5,%o0)
	mulscc	%o5,%g0,%o5	! Align results
	b	mul4_x8
	mov	%y,%o4		! and get lower product
end_sun4assem

# ********************** Compile and execute the program ********************

if ("`arch`" == "sun3") then
	as -o assem.o sun3assem.s
	gcc -O modprd.c assem.o
	echo "SUN 3 gcc times"
	a.out
	echo "SUN 3 cc times"
	cc -O4 modprd.c assem.o
	a.out
else if("`arch`" == "sun4") then
	as -o assem.o -P sun4assem.s
	echo "SUN 4 gcc times"
	gcc -O modprd.c assem.o
	a.out
	echo "SUN 4 cc times"
	cc -O4 modprd.c assem.o
	a.out
else 
	echo "Unknown architecture - `arch`"
endif
--
        Peter L. Montgomery 
        pmontgom@MATH.UCLA.EDU 
        Department of Mathematics, UCLA, Los Angeles, CA 90024-1555
If I spent as much time on my dissertation as I do reading news, I'd graduate.

bs@linus.mitre.org (Robert D. Silverman) (09/13/90)

In article <353@kaos.MATH.UCLA.EDU> pmontgom@euphemia.math.ucla.edu (Peter Montgomery) writes:
>In article <41425@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>I'm confused about the following exchange:
>>
>>In article <3984@bingvaxu.cc.binghamton.edu> vu0310@bingvaxu.cc.binghamton.edu.cc.binghamton.edu (R. Kym Horsell) writes:
>>>In article <119733@linus.mitre.org> bs@gauss.UUCP (Robert D. Silverman) writes:

stuff deleted.

>>How about posting the benchmark, so we can understand what's happening?
>>It might especially be good to post a sample of the disassembled code that's
>>supposed to be doing 32x32->64, if that is truly supposed to be happening.
>
>	I do many of the same types of computations as Silverman
>(and, like he, some of my processes use a month of CPU time or more).
>But I dispute his 10X figure.  Here is a benchmark which tries
>to compute the remainders i*j mod p where 0 <= i, j < p, using three
>algorithms expressible in C and also an assembly code
>algorithm (which, on the SUN 3, uses the 64-bit multiply and divide
>instructions).  Those instructions perform five times as fast as

Even Peter reports a 5 fold speed increase. One difference between his
code and mine that would tend to exaggerate the difference is that he
codes the control loops directly in assembler [for his multiply routines]
whereas I write my in C with low level calls to assembler routines for
the 64 bit arithmetic. When I go from 32 to 64 bits, I cut my calls to
these routines by a factor of 4. For me to reduce to 32 bit arithmetic
is therefore more expensive. This applies especially to the VAX because
the 'calls' instruction generated by the C compiler is a real pig.

Secondly, he does his modular multiplication by a method that avoids
the division instruction entirely (at the cost of some additional
multiplies). I do mine by multiplication followed by long division
[slower, but slightly more general]. Since division is slower than
multiplication, the effect of going from 64/32 --> 32  to 32/16 --> 16
enhances the speed difference.

My claim of 10x varies from machine to machine. On some machines 
[a microVAX] It has been as high as a factor of 13. On others as low
as about 6. On one machine, [An Alliant] I observed a 20% speed difference
just from cache effects. Storing in radix 2^30 versus 2^15 halves the
storage requirements.

I agree that my measurements DO NOT measure JUST the raw speed difference
in arithmetic that one obtains from going from 32 to 64 bit multiplies
and divides. I also agree that if I were to inline my calls to the
low level assembler routines, the effect would be lessened.

However, there can't be any doubt that there is a SIGNIFICANT difference
between 32 and 64 bit arithmetic when doing multiple precision. The
Sparc shows little difference because it has NO multiply or divide
instruction for even 32 bits.

It comes down to what we are measuring. I am measuring the overall
performance benefit for large programs with lots of data including
caching, subroutine calls etc. Peter's benchmarks tend to measure
more closely just the speed difference in arithmetic. And in that
respect, I agree with his assessments. 

If one had higher level language support for a 'long long' temporary
data type so one could compute AB/C,  AB mod C, (A << 30 + B)/C,
(AB + C) mod 2^30, etc. directly in C, my claim for a 10x speed
improvement would come down quite a bit. 

One thing is clear throughout ALL of this. Machines like the SPARC
which provide NO integer multiply/divide are not worth much for
this type of work. This is why I question ALUs like the i860
which also has no integer multiply/divide, yet does have a fancy
graphics unit.

-- 
Bob Silverman
#include <std.disclaimer>
Mitre Corporation, Bedford, MA 01730
"You can lead a horse's ass to knowledge, but you can't make him think"

preston@titan.rice.edu (Preston Briggs) (09/13/90)

In article <119977@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:

>One thing is clear throughout ALL of this. Machines like the SPARC
>which provide NO integer multiply/divide are not worth much for
>this type of work. This is why I question ALUs like the i860
>which also has no integer multiply/divide, yet does have a fancy
>graphics unit.

I expect you're right.  The i860 is supposed to crunch floating-point.
The integer half is supposed to do the necessary integer work to
support addressing.  I guess they intend the graphics portion to
display the results...

I argued about this with a guy who wanted to put a chess program on
an i860.  It seems like a big mismatch of problem and resources.

Personally, I'd rather see the 860+ drop the graphics instructions
and give me a "4-address" multiply-accumulate instruction.
Then we could do some crunching.

For your stuff, a good integer processor with no FP or graphics
or whatever sounds like the ticket.  Unfortunately, most machines
use integers to do characters, addresses, and sometimes bits.  
Are there exceptions?  Digital signal processor chips?

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/13/90)

In article <119977@linus.mitre.org> bs@linus.mitre.org (Robert D. Silverman) writes:
\\\
>Even Peter reports a 5 fold speed increase. One difference between his
		      ^^^^^^
>code and mine that would tend to exaggerate the difference is that he
\\\

#	       SUN 3/280	     SUN 4/260	   (both with FPUs)
#      gcc 1.37.1     cc     gcc 1.37.1	    cc
# MUL1	    10.40  13.52           3.67	  2.92     (simple integer arithmetic)
# MUL2	     9.00  10.38           2.03   2.05	   (floating point arithmetic)
# MUL3	    10.14  11.30           6.11	  6.32     (break into 16-bit pieces)
# MUL4	     1.88   1.98 	   1.82	  1.83     (assembly code)

Ratio of MUL3/MUL1:

		.97 .83 		1.7 2.2

There does not seem to be a 5-fold difference between using ``simple integer
arithmetic'' and ``break into 16-bit pieces'' in Peter's figures.

Of course the ratio between ``simple integer arithmetic'' and
``assemby code'' are much higher -- this would be comparing apples & oranges. 

Since Peter various assembly sources to hand -- I would be
interested to see figures concerning:
	32x32->64 multiply
and
	32x32->32 via 16-bit pieces multiply

Another experiment, still in assembler, would be to get figures for
the same program using a non-naive multiply and divide in which
the lack of 32x32->64 would be even less marked.

-Kym Horsell

dik@cwi.nl (Dik T. Winter) (09/14/90)

In article <353@kaos.MATH.UCLA.EDU> pmontgom@euphemia.math.ucla.edu (Peter Montgomery) writes:
 > [If someone can significantly improve my SPARC assembly code, 
 > let me know.]
Not significantly, but improvement is possible, see below.
 > 
(Leaving out Sun3 and gcc timings the latter are only really relevant
for MUL1, MUL2 and MUL3).
 > #	Approximate times:
 > #
 > #	       SUN 4/260	   (both with FPUs)
 > #      	   cc
 > # MUL1	  2.92     (simple integer arithmetic)
 > # MUL2	  2.05	   (floating point arithmetic)
 > # MUL3	  6.32     (break into 16-bit pieces)
 > # MUL4	  1.83     (assembly code)
(One big note: the verification of results requires MUL4.  So large improvements
in MUL4 will have effects on timings for other MUL's too!)

Below the results I obtained on a SPARCstation 1 (4/60), SPARCstation SLC (4/?)
and a SPARCstation 1+ (4/65).  Even the ratios are very different!
For MUL4 I give three timings, see further down for the reason.
	   SPARC-1	   SPARC-SLC	   SPARC-1+
MUL1	     3.98	     3.97	     3.18
MUL2	     2.98	     3.22	     2.40
MUL3	    23.27	    10.02	     8.09
MUL4	2.47/2.24/2.14	2.45/2.23/2.14	1.97/1.78/1.71

Note that for all three machines the same compiler was used (the SLC compiler),
but that the libraries for the individual machines were linked in.  The SPARC-1
still has SunOS 4.0.3, the others have SunOS 4.1; that might make a difference.

Now the reason for three timings for MUL4.
1.	The first timing is for the assembly code of Peter Montgomery.
	A quick analysis reveals that it uses 1 cycle per bit for the
	multiply and 5 cycles per bit for the remaindering.  (Plus overhead
	of course.)  (And I use 7 cycles for division plus remainder, which
	cannot be improved in my opinion.)
2.	In Peter Montgomery's code there are a lot of annulled branches.
	The instruction in the delay slot is always executed (and takes time),
	although the result is voided at times.  By a rearrangement of the
	instructions this can be avoided in most cases, resulting in an
	expected 4.5 cycles per bit for remaindering.  That gives the
	second timings.
3.	It is possible to do the remaindering in 4 cycles per bit, hence the
	third timings.  (The program jumps wildly through the code, but that
	is no problem.)

Now I am going to do something that is not done: comparing apples to oranges.
(And back to the architecture issue!)
SPARC:	Optimum if remainder only is required: 4 cycles per bit; if also
	the quotient is required: add 3 cycles per bit.  Multiply requires
	1 cycle per bit.
MIPS:	Multiply is 10 cycles (I found this somewhere in Kanes book, but
	can not find it now).  Divide for small numbers (i.e. 32/32) is a
	single instruction, for which I cannot find the timing now.  Larger
	numbers require more cycles.  My code uses 9 cycles per bit, which
	can be reduced to 6 if only the remainder is required.  Using hardware
	divide might give some benefits here.  Note that hardware multiply and
	hardware divide are free running, i.e. other operations can be
	performed, but the benefit would be very limited in this case.
RTPC:	(Not really RISC in this aspect, just like mips.)  Multiply 2 cycles
	per bit (4 cycles per multiply step) and divide 3 cycles per bit (a
	single divide step).

My third timings were based on the (well-known) process that is also used in
the RTPC divide step instruction.  You need a single status bit (RTPC uses
carry).  The divide step is as follows:
1.	If status bit set:
		subtract divisor, clear status bit if borrow is generated.
		subtract one from quotient.
	else
		add divisor, set status bit if carry is generated.
		add one to quotient.
2.	Shift dividend and quotient one bit.
To use it you put the dividend and divisor in the proper registers, set the
status bit and execute the required number of divide steps.  Finally, if the
status bit is clear you add the divisor to the dividend and increment
the quotient.

Now some (my) opinions:
Sun would have done well to include a divide step as outlined above in its
architecture.  If done in a single cycle it would have reduced Peter
Montgomery's timings by about 60%.  (And the operation is not very dissimilar
from the multiply step.)
Mips would have done well to include a 64/32 bit divide in addition to the
32/32 bit divide.  I do not think that would have added much complexity to
the processor.  (The fact that mips has no status bits, and so special
instructions have to be performed to obtain status makes the mips divide code
slower than the sparc divide code.)
IBM with their RTPC did very well in this aspect.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

mash@mips.COM (John Mashey) (09/14/90)

In article <2118@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
....
>Now some (my) opinions:
...
>Mips would have done well to include a 64/32 bit divide in addition to the
>32/32 bit divide.  I do not think that would have added much complexity to
>the processor.  ...

The opinion that 32x32->64, and 64/32->64 are essentially equivalent,
and the latter would only cost a little complexity ..... is wrong,
at least in the case of the R3000.  (it may not be wrong for a microcoded
machine, and it may or may not be wrong for RISCs in general; however,
one my note that hardly any RISC gives you 32x32-> 64, but even fewer
give you 64/32->64....

An explanation on why this is so was posted about 9 months ago by Craig
Hansen and others.  I also refer to Patterson&Hennessy, Appendix A on
Computer Arithmetic.
I won't repeat all of the details, but basically:

1) If divide follows the same, regular decoding structure as everything
else, and if the dividend specifies the first of 2 registers (to get 64
bits), consider the consequences, assuming it was div dividend,divisor
	1a) Either you need even/odd register pair (to get
		dividend and dividend|1
	OR you need an extra adder (used by nothing else) to get
		dividend and dividend+1
	1b) It becomes the ONLY instruction that needs to fetch 3 32-bit
		registers as inputs.....
		SO you need to add a 3rd read-port to the register file
		OR yuo must irregularize the pipeline in a way that no other
		instruction does to use 2 register fetch cycles...

2) Suppose you're willing to do that, or also pay the price of extra
instructions to set up 64-bit dividend in the extra registers.
	2b) For a bunch of reasons, you end up with extra latches & muxes (BAD).
	(Craig covered this)

3) But suppose you're willing to buy all of that.	
	Now, it you look at Patterson and Hennessy, you will discover that
	if you do 32x32->64, and 64/32->32, as they say:
	"Note that the two block diagrams in Figure A.2 are very similar.
	... By allowing these registers to shift bidirectionally, the same
	hardware can be shared between multiplication and division."
	"Figure A.2.  The multiplier has an n-bit adder... the divider
	has an n+1 bit adder...."

	a) Integer multiply/divide takes up a nontrivial (not huge,
	but you can certainly see it on the layout) chunk of space.
	b) All of this stuff really just barely fit on the original R2000
	in 2micron CMOS, and if you look at the layour again, you'll find that
	the right side of the chip is a regular stack of 32-bit wide
	datapaths, except the mul/div unit sticks out a few more bits.
	c) If you want 64/32->64, you discover that:
		the divisor wants to be expanded to 64 bits, i.e., n=64
		now you want to have a 65-bit adder, which:
			may well be slower than a 33-bit adder
			may have exceeding awkward layout issues, either
			requiring something to be folded, or else be twice
			as wide as the whole rest of the stack.
4) SUMMARY:
	32x32->64 with 32/32->32
		-is regular in decode & register fetch
		-shares hardware well between mul and div
		-has few nasty, ugly layout implications
	32x32->64 with 64/32->64
		-is surprisingly irregular; in fact, it easily causes
		numerous special cases in the mainline paths of the machines
		where it would be the ONLY instruction to have some property
		- doesn't share as easily
		- is rife with ugly layout effects
MORAL:
	-Apparently simple-looking additions can have surprisingly
	ugly and widespread effects
	-32x32->64 and 64/32->64 possess drastically different implementation
	effects in the context of a lean-pipelined, single-issue,
	2-read-port, 32-bit RISC.  They do NOT have the implementation
	similarity that one would expect.
	-The hardware folks could probably say this better, but these are
	at least some of the issues.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (09/14/90)

I was very surprised by the results of Horsell's benchmark.
However, it is heavily biassed towards small numbers.
I decided to focus on calculating factorials.  My test program
calculates n! for n from 0 to 1000.  I modified Horsell's program
to used a fixed buffer, so that malloc() costs could be eliminated.

					User	System	User ratio
Horsell	(16 x 16 -> 16, C code)		311.2	1.9	1.36 slower
Horsell (32 x 32 -> 32, C code)		228.8	2.6
My code (32 x 32 -> 64, assembly)	 52.6	0.5	4.35 faster

These times were obtained on an Encode MultiMax with NS32532 cpus
running Encore 4.3BSD, Encore's C compiler, time from /bin/time.

I expect that multiplying bignums would also show about a 4x speedup
on this machine from using the (32 x 32 -> 64) instruction.

-- 
Heuer's Law:  Any feature is a bug unless it can be turned off.

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/14/90)

In article <3759@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
>					User	System	User ratio
>Horsell	(16 x 16 -> 16, C code)		311.2	1.9	1.36 slower
>Horsell (32 x 32 -> 32, C code)		228.8	2.6
>My code (32 x 32 -> 64, assembly)	 52.6	0.5	4.35 faster

Could you please include a few more numbers in a followup post?
I guess you were trying to save b/w, but it didn't work.

-Kym Horsell

keith@mips.COM (Keith Garrett) (09/14/90)

In article <2118@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>[multipy comparisions deleted]
>MIPS:	Multiply is 10 cycles (I found this somewhere in Kanes book, but
>	can not find it now).  Divide for small numbers (i.e. 32/32) is a
>	single instruction, for which I cannot find the timing now.  Larger
>	numbers require more cycles.  My code uses 9 cycles per bit, which
>	can be reduced to 6 if only the remainder is required.  Using hardware
>	divide might give some benefits here.  Note that hardware multiply and
>	hardware divide are free running, i.e. other operations can be
>	performed, but the benefit would be very limited in this case.
integer multiply is actually 12 cycles (32*32->64)...integer divide is 35
cycles (32/32->quot,rem) there are addition cycles required to move the
result from the special mult/div result registers to gp registers (MFHI/MFLO).
>[...]
-- 
Keith Garrett        "This is *MY* opinion, OBVIOUSLY"
      Mips Computer Systems, 930 Arques Ave, Sunnyvale, Ca. 94086
      (408) 524-8110     keith@mips.com  or  {ames,decwrl,prls}!mips!keith

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/15/90)

Further to my recent posting of a simple benchmark for testing
extended precision arithmetic vv intxint->long multiply,
several people have expressed concerns regarding the veracity
of the program & its design methodology (if any).

Peter Montgomery expressed concerns in private email regarding
the apparently small numbers that were being multiplied and
that my benchmark might be criticized as being ``untypical''
in that regard.  He states he (typically?) performs calculations
on numbers in the 100's of decimal digits; Robert Silverman performs
similar (and, I presume, larger) calculations.
Greg Lindahl (gl8f@atsun7.astro.Virginia.EDU) also raised the
question of whether cacheing, specifically code cacheing,
would affect the results obtained. He was also concerned about
the (possible?) vectorizing of any code that might make
results atypical of those performed by Silverman & others.

I had intended that my benchmark give _worst possible case_ comparisons
between its short & long versions so as to give any possibility of a 10x
slowdown a good chance of success. To this end I used the naive algorithm,
made the program fairly tight, and tried to eliminate (roughly) secondary
effects due to cacheing, memory management, etc.
[Although I used malloc() to get storage for some of the intermediate
results I noted it used only 0.3% of the runtime so left it in -- thanks
for bringing it up O'Keefe].

However, in case there is some doubt about this (now that I look at it
again) somewhat questionable reasoning, I adjusted my benchmark -- by
`offsetting`` the i & j to calculate (slightly) larger factorials so that
the average size of numbers exceeds 1000 decimal digits (1028 digits on
average).

The results for this modified benchmark are as follows:
---------------------------------------------------------------------------
	16x16->32	8x8->16		ratio	original 

VAX 6640
	22.9		79.9		3.49	3.0
	(14.7)		(50.3)		3.42	2.8
VAX 8530
	60.8		197		3.24	3.0
	(58.6)		(198)		3.38	3.2
Sun Sparc 1
	48.2		142.6		2.96	2.8 	(bc -- 181.9!)
	(25.9)		(74.8)		2.89	3.0
SUN 3/60
	91.5		244.2		2.67	2.5
	(45.1)		(123.8)		2.74	2.5
---------------------------------------------------------------------------

To summarize, my estimate that I had produced a ``worst case'' comparison was 
a little off -- all the ratios, except for the Sparc, have crept up marginally 
by going to larger precision calculations. Presuming a quadratic behavior we 
might perhaps expect (on the basis of this _very_ small sample) to have a 10x 
slowdown at around 1M digits or possibly 100,000. I will investigate this 
final possibility (anyone got a Cray with a few Gb for rent?) of 10x slowdown 
and report my findings at a later date.  
[I will not point out my dispassionate attitude toward my own hypotheses (;-) ].

As for possible affects by cacheing/vectorizing, etc., I _should_ investigate
these (we _do_ have an underutilized 3090 out here). A priori I expect
no appreciable change. No promises at this stage.

-Kym Horsell

P.S.	I have been reading this group now for about 6 months; I've been a 
	bit too busy for the last few years to do so. I'm amazed (and perhaps 
	in John Mashey's case, a bit disappointed) that better use isn't made 
	of the facility.
----
I know if you go near a horses ass it might kick you into the water.

vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (09/15/90)

In article <41497@mips.mips.COM>, mash@mips.COM (John Mashey) writes:
> ....
> 	-Apparently simple-looking additions can have surprisingly
> 	ugly and widespread effects

Not to disagree, I have many times bemoaned the lack of some kind of fast
carry indication from the MIPS add instruction while looking for ways to
make the TCP/IP checksum faster.  If a single instruction could could compute
	    32-bit + 32-bit -> 32-bit with carry-bit,
you could substantially better the number instructions in all of the
algorithms I've found.  The savings of having such a facility, counted in
seconds, would be reduced by apparently unavoidable cache delays.  That
does not stop the moaning.

One can construct tighter checksum loops on the 29K, 68K, and the 80386.

Given operating system tricks on the byte copies, computing the checksum is
the bigger of the two major places the code I know spends most of its time.

My limited and decades past dabbling with general purpose long arithmetic
makes me ask, would there be a similar benefit to extended precision addition?

Vernon Schryver     vjs@sgi.com

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (09/15/90)

In article <69434@sgi.sgi.com> vjs@rhyolite.wpd.sgi.com (Vernon Schryver) writes:
\\\
>Not to disagree, I have many times bemoaned the lack of some kind of fast
>carry indication from the MIPS add instruction while looking for ways to
>make the TCP/IP checksum faster.  If a single instruction could could compute
>	    32-bit + 32-bit -> 32-bit with carry-bit,
>you could substantially better the number instructions in all of the
>algorithms I've found.  The savings of having such a facility, counted in
\\\

I'm afraid the hard-hearted answer has got to be along the lines of:

"What proportion of time is spent in the checksum code, & by what factor
is it increased by not having the convient carry"?

If the answer is a number greater than EPS (your tollerance for such things) 
will you then _personally_ pay me four figures to lay it out in Si?

If the answer is ''no'' I'm afraid you have to stick in the extra 
instructions.

-Kym Horsell

vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (09/15/90)

In article <4025@bingvaxu.cc.binghamton.edu>, kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>> In article <69434@sgi.sgi.com> I wrote about a carry bit
> 
> I'm afraid the hard-hearted answer has got to be along the lines of:
> "What proportion of time is spent in the checksum code, & by what factor
> is it increased by not having the convient carry"?

Details of the family of implementations I know best would probably be
considered proprietary by those who paid for it.  However, one can
estimate, based on the often repeated fact that to a first approximation,
TCP input runs as fast as you can fondle the bytes:
	a) any fetches and stores to get from the media to a system buffer
		(0 if you have media-to-host DMA)
	b) 1 fetch to compute the checksum
	c) 1 fetch and 1 store to copy to the user process buffer (unless
		you cheat--all's fair as long as the data gets where it
		is needed)
A simple implementation with direct DMA and no cheating needs 2 fetches and
1 store,  The total cost of the TCP checksum in such a system is around 30%.
Output is similar.  Cache effects are important but ignored here.
The trickier you are elsewhere, the large the checksum looms.

Current, fast workstations move about 1MByte/sec TCP/IP user-process-to-
user-process over ethernet.  Ignoring important details, one saved
instruction/byte is a million saved instructions/sec.  Simplistically, if
you could "fetch, add word to accumulator, add-carry to accumulator", you
could save 0.5 instructions/byte on a MIPS CPU.  Of course an ADDC
instruction would cost in many other places, dragging in all of the
disadvantages of status bits.

I understand that the 6MByte/sec (?) figure from Cray benefits from vector
hardware to do the checksums.

The TCP checksum by itself does not justify ALU status bits in general
purpose CPU's in high performance workstations, because of various bits of
specialized hardware in various vendors' implementations for various media.

What do the extend precision experts say about carry bits?

Vernon Schryver    vjs@sgi.com

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (09/15/90)

I wrote

					User	System	User ratio
Horsell	(16 x 16 -> 16, C code)		311.2	1.9	1.36 slower
Horsell (32 x 32 -> 32, C code)		228.8	2.6
My code (32 x 32 -> 64, assembly)	 52.6	0.5	4.35 faster

(times in seconds reported by /bin/time on an Encore Multimax)

In article <4015@bingvaxu.cc.binghamton.edu>, kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> Could you please include a few more numbers in a followup post?

Willingly.  But which?  Surely the one number 4.35 establishes the
disputed fact:  32x32->64 multiplication (which C does not give you
access to at all) can give you a worthwhile speedup for bignum multiplication.

Two very important numbers for this newsgroup are
    --	what impact did inclusion of 32x32->64 multiplication have on
	the rest of the architecture?  (It also has 64/32->32+32 division.)
    --	how many NS32532 programs are actually likely to benefit from this
	instruction?
My answers are "no idea" and "dunno, but probably very few, because the
only HLL I've got that uses the instruction is T."

-- 
Heuer's Law:  Any feature is a bug unless it can be turned off.

cik@l.cc.purdue.edu (Herman Rubin) (09/15/90)

In article <41497@mips.mips.COM>, mash@mips.COM (John Mashey) writes:
> In article <2118@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:

			........................
  
> 1) If divide follows the same, regular decoding structure as everything
> else, and if the dividend specifies the first of 2 registers (to get 64
> bits), consider the consequences, assuming it was div dividend,divisor
> 	1a) Either you need even/odd register pair (to get
> 		dividend and dividend|1
> 	OR you need an extra adder (used by nothing else) to get
> 		dividend and dividend+1
> 	1b) It becomes the ONLY instruction that needs to fetch 3 32-bit
> 		registers as inputs.....

There is another instruction, badly needed in many applications, which
has the same problems; that is double-register shift by an arbitrary
amount.  Also, extract field and insert field would benefit greatly
from this.

Alternatively, this could be done in the floating point units, which
already have even more complexity.  None of the people wanting to use
these instructions for arithmetic care that the address-and-loop-control
unit is used for  fo good arithmetic operations.

			........................

> 	a) Integer multiply/divide takes up a nontrivial (not huge,
> 	but you can certainly see it on the layout) chunk of space.
> 	b) All of this stuff really just barely fit on the original R2000
> 	in 2micron CMOS, and if you look at the layour again, you'll find that
> 	the right side of the chip is a regular stack of 32-bit wide
> 	datapaths, except the mul/div unit sticks out a few more bits.

The floating point unit already has this and more.

-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

aglew@crhc.uiuc.edu (Andy Glew) (09/16/90)

In article <4025@bingvaxu.cc.binghamton.edu>, kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> I'm afraid the hard-hearted answer has got to be along the lines of:
> "What proportion of time is spent in the checksum code, & by what factor
> is it increased by not having the convient carry"?

On a 68030 based system I found that 15% of the entire CPU was being
spent in in_chksum() (my memory may be failing wrt. exact name), on
real user workloads (namely, the backbone machines on our local net).
This was for the naive byte-at-a-time one's complement sum (again,
memory may be failing me.  I believe it was a one's complement sum,
but there have been quite a variety of checksums).

Unrolling the loop and computing the checksum 32 bits at a time
instead of 8 bits at a time gave me approximately a 6-8-fold speedup.
A bit of instrumentation showed that the overwhelming majority of
packets were of only two sizes, and these were special cased.

With the new code, in_chksum() was reduced to around 4% of the CPU.
(Not linearly divided by speedup because of overhead, and traffic
increase).  I used the carry-out and carry-in to do this. Coding
without the carry would approximately double number of instructions
for this checksum, but many of the added instructions would be
branches.  Actually, I'd probably only do it 16 bits at a time, no
branches, which would be, again, a 3-fold slowdown (shifts and masks
in the loop).
    Ie. based on my experience coding in_chksum(), but not having
coded it on a MIPS, I would estimate that the slowdown through not
having carry out and in is approximately 3-fold wrt. good code that
uses carry-out and in. But this is only an upper bound, because
overhead of call, etc., gets in the way.

I do not wish to pass judgement on the usefulness of carries; I only
wished to provide a data point for "by what factor is it [the
checksum] increased by not having the convenient carry".

--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]

amos@taux01.nsc.com (Amos Shapir) (09/17/90)

[Quoted from the referenced article by ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe)]
>
>Two very important numbers for this newsgroup are
>    --	what impact did inclusion of 32x32->64 multiplication have on
>	the rest of the architecture?  (It also has 64/32->32+32 division.)

Not much - since this is a CISC, this type of division (there are 4 more)
is just another entry of the division routine in the microcode.

>    --	how many NS32532 programs are actually likely to benefit from this
>	instruction?

Quite a few - the NS32532 doen't have a FPU on-board, so many implementations
emulate it in software; there are also many other uses which have been
listed in this thread before.

-- 
	Amos Shapir		amos@taux01.nsc.com, amos@nsc.nsc.com
National Semiconductor (Israel) P.O.B. 3007, Herzlia 46104, Israel
Tel. +972 52 522255  TWX: 33691, fax: +972-52-558322 GEO: 34 48 E / 32 10 N