[comp.arch] RISC integer multiply/divide

guy@auspex.auspex.com (Guy Harris) (03/31/91)

>Are the new PA machines binary compatible with the old ones?  Reading
>the press, it sounds like there are new instructions in the new 
>machines so that while the old programs will run, the full performance
>will not be achieved unless you recompile.  Is this true?

I saw something that indicated that integer multiply and divide
instructions have been added to HP-PA, and that the chips in the new
machine, I think, implement them.

SPARC has also added them in Version 8 of the SPARC architecture; the
"Implementation Characteristics" appendix indicates that the Matsushita
MN10501 chip (I think that's the chip used in some Solbourne machines)
implements the multiply (but not divide) instructions.

I think the IBM ROMP had no integer multiply or divide instructions;
the POWER architecture does.

Are there any remaining RISC holdouts that originally didn't have
integer multiply and divide, and that haven't added those instructions
to their architectures?

dik@cwi.nl (Dik T. Winter) (04/01/91)

In article <6920@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
 > I saw something that indicated that integer multiply and divide
 > instructions have been added to HP-PA, and that the chips in the new
 > machine, I think, implement them.
Right.
 > 
 > I think the IBM ROMP had no integer multiply or divide instructions;
 > the POWER architecture does.
Also correct.
 > 
You may also mention the move in AMD 29k to the 29050.

 > Are there any remaining RISC holdouts that originally didn't have
 > integer multiply and divide, and that haven't added those instructions
 > to their architectures?

Now this is a bit slippery (checking the 32*32->64 bit multiply and
64/32->32,32 bit divide; this may be old hat, and, what is RISC?):
i860:	no integer multiply/divide, uses FP (slight support for multiply)
m88k:	full support (no remainder & no 64 bit res/src, uses FP)
amd29k:	full support (from 29050)
sparc:	full support (no remainder (from version 8))
mips:	full support (no 64 bit src for div/rem)
power:	full support (it appears when trying)
hppa:	full support (uses FP, might limit res/src to 32 bits, I am not sure)
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

daryl@hpcupt3.cup.hp.com (Daryl Odnert) (04/02/91)

Guy Harris writes:
> I saw something that indicated that integer multiply and divide
> instructions have been added to HP-PA, and that the chips in the new
> machine, I think, implement them.

HP's PA-RISC 1.1 architecture does include an unsigned integer multiply
instruction that operates on floating-point register operands.  However,
it does not include any integer division instructions.

Daryl Odnert
Hewlett-Packard
California Language Lab
Cupertino, California

huck@aspen.IAG.HP.COM (Jerry Huck) (04/02/91)

In comp.arch, guy@auspex.auspex.com (Guy Harris) writes:
>    I saw something that indicated that integer multiply and divide
>    instructions have been added to HP-PA, and that the chips in the new
>    machine, I think, implement them.

Integer multiply on operands in the floating-point register file has
been added (this was pretty simple to support).  No direct integer
divide.  The instruction set has several strategies to deal with
integer multiplication and division.  All the data to date does not
support the inclusion of integer divide instructions as being cost
effective for the markets and workloads HP sells to.  At this point,
the silicon/design effort/testing is better spent elsewhere.  Other
workloads may suggest other conclusions.

Jerry Huck
Hewlett Packard

meissner@osf.org (Michael Meissner) (04/04/91)

In article <3237@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:

| Now this is a bit slippery (checking the 32*32->64 bit multiply and
| 64/32->32,32 bit divide; this may be old hat, and, what is RISC?):

| m88k:	full support (no remainder & no 64 bit res/src, uses FP)

While it doesn't have a 32x32->64 bit multiply, the multiply unit is
pipelined, so that you can do a 32x32->64 bit multiply in about 13
clocks (including separating the two 32 bit numbers into 16 bit
pieces, doing 4 pipelined multiplies, and merging the results).
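
That split-and-merge can be sketched as follows (my illustration, not the
m88k code itself; C99 fixed-width types stand in for the 16-bit pieces):

```c
#include <stdint.h>

typedef struct { uint32_t lo, hi; } U64;

/* Schoolbook 32x32->64: split each operand into 16-bit pieces, form the
   four partial products, then merge the middle column with its carries. */
U64 mul32x32(uint32_t x, uint32_t y)
{
    uint32_t a = x >> 16, b = x & 0xffff;   /* x = (a << 16) + b */
    uint32_t c = y >> 16, d = y & 0xffff;   /* y = (c << 16) + d */
    uint32_t bd = b * d, ad = a * d, bc = b * c, ac = a * c;
    /* middle column is at most 0xfffe + 0xffff + 0xffff, so it fits */
    uint32_t mid = (bd >> 16) + (ad & 0xffff) + (bc & 0xffff);
    U64 r;
    r.lo = (bd & 0xffff) | (mid << 16);
    r.hi = ac + (ad >> 16) + (bc >> 16) + (mid >> 16);
    return r;
}
```

Each of the four partial products is independent, which is what lets a
pipelined multiplier overlap them.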
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/04/91)

In article <MEISSNER.91Apr3133852@curley.osf.org> meissner@osf.org (Michael Meissner) writes:
>While it doesn't have a 32x32->64 bit multiply, the multiply unit is
>pipelined, so that you can do a 32x32->64 bit multiply in about 13
>clocks (including separating the two 32 bit numbers into 16 bit
>pieces, doing 4 pipelined multiplies, and merging the results).

But you only need 3 multiplies -- see Knuth.
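
Knuth's identity is ad + bc = ac + bd - (a-b)(c-d): once ac and bd are
known, one more multiply yields both cross terms.  As a quick mod-2^32
sanity check (my sketch, not from the thread) -- though note the wrapped
differences below quietly lean on full 32-bit multiplies, and what it takes
to stay within a true 16x16 multiplier is exactly what gets argued about
later in this thread:

```c
#include <stdint.h>

/* Low 32 bits of x*y from three partial products instead of four:
   ad + bc is recovered as ac + bd - (a-b)(c-d), everything mod 2^32. */
uint32_t mul3_low(uint32_t x, uint32_t y)
{
    uint32_t a = x >> 16, b = x & 0xffff;
    uint32_t c = y >> 16, d = y & 0xffff;
    uint32_t ac = a * c, bd = b * d;
    uint32_t mid = ac + bd - (a - b) * (c - d);  /* = ad + bc (mod 2^32) */
    return bd + (mid << 16);        /* the ac << 32 term vanishes mod 2^32 */
}
```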

-kym

dik@cwi.nl (Dik T. Winter) (04/04/91)

In article <MEISSNER.91Apr3133852@curley.osf.org> meissner@osf.org (Michael Meissner) writes:
 > In article <3237@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
 > | Now this is a bit slippery (checking the 32*32->64 bit multiply and
 > | 64/32->32,32 bit divide; this may be old hat, and, what is RISC?):
 > 
 > | m88k:	full support (no remainder & no 64 bit res/src, uses FP)
 > 
 > While it doesn't have a 32x32->64 bit multiply, the multiply unit is
 > pipelined, so that you can do a 32x32->64 bit multiply in about 13
 > clocks (including separating the two 32 bit numbers into 16 bit
 > pieces, doing 4 pipelined multiplies, and merging the results).

Yes, I know.  I have in my archives a routine to do that in 18 clocks.
Possibly some cycles could be shaved off.  It was posted over a year
ago by Alan Lovejoy at AT&T Paradyne.

Some updates (newer information on hppa, and I added also arm and clipper):

hppa:	full multiply support, division through divide step
arm:	no divide or divide step, multiply is 32*32->32
clipper:full mult, separate div and rem and 32/32 only
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

hrubin@pop.stat.purdue.edu (Herman Rubin) (04/04/91)

In article <1991Apr3.232757.8020@bingvaxu.cc.binghamton.edu>, kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> In article <MEISSNER.91Apr3133852@curley.osf.org> meissner@osf.org (Michael Meissner) writes:
> >While it doesn't have a 32x32->64 bit multiply, the multiply unit is
> >pipelined, so that you can do a 32x32->64 bit multiply in about 13
> >clocks (including separating the two 32 bit numbers into 16 bit
> >pieces, doing 4 pipelined multiplies, and merging the results).
> 
> But you only need 3 multiplies -- see Knuth.

This is true, but you would need either a 17x17 -> 34 multiply, or a
(16+sign)*(16+sign)->(32+sign) multiply, together with considerable
bookkeeping of signs/carries, or else substantially more additions.

BTW, if unsigned multiplication is not available, additional problems
arise in any case.

-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/05/91)

In article <9538@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
[discussion of double-prec multiply, esp 32x32->64 (yuk!)]
>This is true, but you would need either 17x17 -> 34 multiply, or 
>(16+sign)*(16+sign)->(32+sign), as well as a considerable use of
>keeping track of signs/carries, or substantially more additions.

Well I'm not sure I understand this -- it's probably a slightly
different implementation of what I have in mind.

One of the reasons for keeping the 3-multiply trick in mind is that some
chips include a dedicated fast multiplier. It's not unknown to link up this 
guy with the rest of the alu so that simultaneous add and mults can be done 
(after all, we spent a lot of money to get that multiplier IN there so let's 
use it for everything :-). Instructions like multiply-and-add and
multiply-shift-and-add are therefore not unheard of. (I have in mind esp.
the TI 320 series; I presume those 6000 thingies can do similar things, but I
haven't looked at the gory details).

So for such machines we can both make use of the trick and try to use as
many fast instruction combinations as possible; the trouble _may_ be worth it 
in some applications (although I haven't actually measured anything).

The appended programs show how to obtain single (i.e. 32x32->32) and
double (32x32->64) results for multiply using a 16x16->32 multiplier. The 
routines are pretty rough (written on the fly), so forgive any glaring (or 
subtle) errors.

>BTW, if unsigned multiplication is not available, additional problems
>arise in any case.

I'm not sure I understand this either -- my routines use signed stuff. I can 
appreciate the problems regarding missing unsigned arithmetic, however.

-kym

C code:

===single precision (32x32->32) multiply using (16x16->32) multiplier===
/***

Taking:
	(aL+b)(cL+d) = LLac+Lad+Lbc+bd
	L(a-b)(c-d) = Lac-Lad-Lbc+Lbd
where L=2^k we find:
	L(a-b)(c-d) + (aL+b)(cL+d) = LLac+Lac+Lbd+bd

Therefore:
	mul(x,y) = hix hiy <<2k + hix hiy<<k + lox loy<<k + lox loy 
		- (hix-lox)(hiy-loy)<<k

If you only need the low-order 2k bits of the product (i.e. the same size as 
the operands):
	mul(x,y) = hix hiy<<k + lox loy<<k + (-hix+lox)(hiy-loy)<<k + lox loy

We could also reformulate this as:
	mul(x,y) = hix hiy<<k + lox loy<<k + (-hix+lox)(hiy-loy)<<k + loxyhi<<k
		+ loxylo
which requires only k-bit adds; 2k-bit adds are more convenient, though.

***/

#define SHORTBITS 16	/* size of ``short'' */

typedef int Num;	/* regular size */
typedef short SNum;	/* 1/2 size */

#define BIT(N) (1L<<(N))
#define MASK(N) (BIT(N)-1)

Num mul(x,y) Num x,y; {
	SNum hix,hiy,lox,loy;
	Num t1,t2;

/*** I'll rely on a smart compiler to sort out the following 
	mess into something reasonable. ***/

	hix=x>>SHORTBITS;
	lox=x&MASK(SHORTBITS);
	hiy=y>>SHORTBITS;
	loy=y&MASK(SHORTBITS);

	t2 = t1 = (Num)lox*loy;
	t2 += (Num)hix*hiy;		/* multiply-and-add */
	t2 += ((Num)-hix+lox)*((Num)hiy-loy); /* multiply-and-add */
	t1 += t2<<SHORTBITS;	/* shift-and-add */

	return t1;
	}

===double prec multiply (32x32->64) using (16x16->32) multiplier===

#define SHORTBITS 16

typedef long LNum;
typedef int Num;
typedef short SNum;

#define BIT(N) (1L<<(N))
#define MASK(N) (BIT(N)-1)

LNum mul(x,y) Num x,y; {
	SNum hix,hiy,lox,loy;
	LNum t1;

	hix=x>>SHORTBITS;
	lox=x&MASK(SHORTBITS);
	hiy=y>>SHORTBITS;
	loy=y&MASK(SHORTBITS);

	t1 = (Num)lox*loy;
	t1 += ((Num)hix*hiy)<<SHORTBITS; /* multiply-shift-and-add */
	t1 += t1<<SHORTBITS;	/* shift-and-add */
	t1 += (((Num)-hix+lox)*((Num)hiy-loy))<<SHORTBITS; /* multiply-shift-and-add */

	return t1;
	}
===end end end===

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/06/91)

Various people have responded via email regarding the ``3-multiply
trick''.

Mostly the feeling is that any extra trouble required by the method is
not worth the results. I must concur in the majority of cases. However,
we can ask the converse question ``on which architectures will the trick
work?'' As I indicated in a previous post, I had in mind DSP-type
chips that have fancy multiply-and-add and add-and-shift instructions.

The trick is obviously worthwhile on machines with a very expensive
multiply as well.

I've written another little ``benchmark'' program to compare a 32x32->64
(yuk!) unsigned multiply written using both the naive 4-multiply method:

	(a+b)*(c+d) = a*c + b*c + a*d + b*d

and the 3-multiply trick:

	(a+b)*(c+d) = (a-b)*(c-d) + 2*(b*c + a*d)

where a, b, c and d are the obvious high & low parts of the 2
extended-precision (xp) unsigned numbers.

The results from pixie show that unsigned multiplies account for about 10%
of all opcodes (all subroutine calls were removed by the optimizer, and lost
delay slots accounted for relatively few cycles).

Specifically in the 4-multiply case we find:

	andi	13.56
	addu	10.17
	multu	10.17
	srl	6.78
	sw	8.47
	addiu	8.48
	sll	3.39

and in the 3-multiply case we find:

	andi	13.63
	addu	9.09
	multu	7.57
	srl	6.06
	sw	9.09
	addiu	7.58
	sll	3.03

as percentages of dynamic instructions.

We can then ask ``how slow would multiply have to be to make an overall 
significant difference here?''

We set up the equation:

	10.17 T + (100-10.17) = (7.57 T + (100-7.57)) * 1.1

I.e. ``when will the total runtime for the 4-mult routine be 10% larger
than for the 3-mult routine?''. (I am assuming here that all other
instructions apart from multiply take about the same time).

We get T=
	(100-10.17) - 1.1*(100-7.57)
	----------------------------
	7.57 * 1.1 - 10.17

	= 6.43 approx.

The Sparc would therefore be a good architecture to use the trick on.

Running the benchmark on a Sun SS1 we have the output:

	3700550176	3700550176
	minfact=.75
	maxfact=1
	sumfact=8.71429
	arithavfact=.871429
	prodfact=.237305
	geoavfact=.237305

The first couple of numbers merely check that both routines produce
the same result (it totals the high-order parts of a large number of 
random 64-bit products -- a sorta ``signature''). The best the
3-multiply routine can do is 25% faster than the naive method -- as
expected. Some cases break even, however. The arithmetic average has
the 3-mult routine running a little more than 10% faster.

The next question we might ask is ``what is the situation on a machine with 
multiply-and-add?''. I can only approximate this at present (not having
my hands on a DSP chip that also has a good optimizing C compiler :-).

By combining the addu+multu figures above we arrive at:

	2*10.17 T + (100-2*10.17) = ((9.09+7.57) T + (100-9.09-7.57))*1.1

where T is the time of a synthetic multiply-and-add instruction.
(This is obviously only approximate -- not EVERY add is associated with
a multiply, although every multiply here is associated with an add).

We find: T =  6 approx.

Thus if a combined m&a instruction is 6 times slower ON AVERAGE than other 
instructions in a given architecture it may be worthwhile doing the 3-mult 
trick. 

If m&a is implemented by simply chaining the o/p of a multiplier to the input
of an adder, then it is likely that the average cost of multiply-and-add
exceeds this figure.

It might also be true even taking into account a more sophisticated 
pipelined h/w. The algorithm in question is subject to lost ``m&a slots'' 
due to the high density of m&a instructions; the result of one m&a
is typically required for a subsequent operation. While the alu
is busy we can't use it for the add/subtract phase of an m&a instruction
(and we rely on the compiler therefore NOT to schedule one in cases
where this would happen).

To summarize: using the 3-multiply trick will not get you anything
on machines where multiplies are not expensive (i.e. < 6.5 average 
instructions).  This lets out VAXes, 68ks, and MIPSes, but doesn't seem to
include the SS1 (I don't have any later SS's to hand, sorry). In the world of
smaller machines (:-), 386/486s without co-processors are likely further
candidates.

On machines with instructions such as multiply-and-add, etc, the 3-multiply 
trick may be more likely to pay off, as m&a will be subject to scheduling 
collisions and other slot losses for the algorithms in question.  The trick 
apparently requires a multiply-and-add to cost about 6 average instructions 
to pay off on these architectures.

-kym

BTW, I would be interested in any multiply-and-add figures people have to hand. 
If there is sufficient interest, I will collate and post a summary at a future 
date.

C code:

===compare 3-m and 4-m 32x32->64 unsigned multiply using 16x16->32 arithmetic===
#include	<stdio.h>
#include	<math.h>
#include	<assert.h>

#define	BIT(N)	(1L<<(N))
#define	MASK(N)	(BIT(N)-1)

#define	SHORTBITS	16
#define	MAX	(1L<<SHORTBITS)
#define	NSAMP	10000
#define	NIT	10

typedef unsigned short UShort;
typedef short Short;
typedef long Long;
typedef unsigned long ULong;
typedef struct {
	ULong lo,hi;
	} Bignum;

Bignum *umul3(x,y) register ULong x,y; {
	register UShort a,b,c,d;
	register ULong ac,bd;
	register ULong mid;
	static Bignum res;

	a=x>>SHORTBITS;
	b=(UShort)x;
	c=y>>SHORTBITS;
	d=(UShort)y;

/***
	(a+b)(c+d)

	32	16	16

	ac	
		bd
		ac	
		-(a-b)(c-d)
			bd
***/
	ac=a*c;
	bd=b*d;

	mid=bd+ac+(bd>>SHORTBITS);
	if(a>=b) if(c>=d) mid-=(a-b)*(c-d);
		else mid+=(a-b)*(-c+d);
	else if(c>=d) mid+=(-a+b)*(c-d);
		else mid-=(-a+b)*(-c+d);

	res.lo=((UShort)bd)+(mid<<SHORTBITS);
	res.hi=((UShort)mid)+((ac+(mid>>SHORTBITS))<<SHORTBITS);

	return &res;
	}

Bignum *umul4(x,y) register ULong x,y; {
	register UShort a,b,c,d;
	register ULong lo,mid;
	static Bignum res;

	a=x>>SHORTBITS;
	b=(UShort)x;
	c=y>>SHORTBITS;
	d=(UShort)y;

/***
	(a+b)(c+d)	=	ac + ad + bc + bd

	32	16	16
	ac
		bc
		ad
			bd
***/
	
	lo=b*d;
	mid=b*c+a*d+(lo>>SHORTBITS);

	res.lo=((UShort)lo)+(mid<<SHORTBITS);
	res.hi=((UShort)mid)+((a*c+(mid>>SHORTBITS))<<SHORTBITS);

	return &res;
	}

ULong x[NSAMP],y[NSAMP];

void init() {
	unsigned i;

	for(i=0;i<NSAMP;i++) {
		x[i]=rand();
		y[i]=rand();
		}
	}

void main(argc,argv) char *argv[]; {
	unsigned it,n;
	double t0,t1,t2;
	ULong sum3=0,sum4=0;
	double sumfact=0,prodfact=1;
	double minfact=1e38,maxfact= -1e38;

	for(it=0; it<NIT; it++) {
		double fact;

		init();

		t0=clock();
		for(n=0; n<NSAMP; n++) {
			sum4+=umul4(x[it],y[it])->hi;
			}
		t1=clock();

		for(n=0; n<NSAMP; n++) {
			sum3+=umul3(x[it],y[it])->hi;
			}
		t2=clock();

		fact=(t2-t1)/(t1-t0);
		prodfact*=fact;
		sumfact+=fact;
		if(fact>maxfact) maxfact=fact;
		if(fact<minfact) minfact=fact;
		}

	printf("%lu	%lu\n",sum3,sum4);/* in case of clever compiler */
	printf("minfact=%g\n",minfact);
	printf("maxfact=%g\n",maxfact);
	printf("sumfact=%g\n",sumfact);
	printf("arithavfact=%g\n",sumfact/NIT);
	printf("prodfact=%g\n",prodfact);
	printf("geoavfact=%g\n",pow(prodfact,1.0/NIT));

	exit(0);
	}
===end end end===

dik@cwi.nl (Dik T. Winter) (04/06/91)

In article <1991Apr4.213550.8106@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
 > In article <9538@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
 > [discussion of double-prec multiply, esp 32x32->64 (yuk!)]
 > >This is true, but you would need either 17x17 -> 34 multiply, or 
 > >(16+sign)*(16+sign)->(32+sign), as well as a considerable use of
 > >keeping track of signs/carries, or substantially more additions.
 > 
 > Well I'm not sure I understand this -- it's probably a slightly
 > different implementation of what I have in mind.
To understand it, see below.
 > 
 > One of the reasons for keeping the 3-multiply trick in mind is that some
 > chips include a dedicated fast multiplier. It's not unknown to link up this 
 > guy with the rest of the alu so that simultaneous add and mults can be done 
 > (after all, we spent a lot of money to get that multiplier IN there so let's 
 > use it for everything :-). Instructions like multiply-and-add, 
 > multiply-shift-and-add are therefore not unheard of. (I have in mind esp the 
 > TI 320 series, I presume those 6000 thingies can do similar things, but I 
 > haven't looked at the gory details).
TI 320, yes, RS6000, no.  As I mailed you, this came from a remark from me
about the m88k.  As on that machine with proper code a multiply takes only
a single cycle, there is no way to improve it with 3 multiplies only.  My
findings are that using an order n squared long integer multiply still pays
for fairly large numbers (hundreds of decimal digits on current architectures).
 > 
 > The appended programs show how to obtain single (i.e. 32x32->32) and
 > double (32x32->64) results for multiply using a 16x16->32 multiplier. The 
 > routines are pretty rough (written on the fly), so forgive any glaring (or 
 > subtle) errors.
Well, it is just those subtle errors that make a routine useless, and they
are inherently unforgivable.  Glaring errors we can live with; they are easily
corrected!  (It is just like my talk with a representative from some,
unnamed, computer manufacturer about a design error I found in an instruction.
It made the instruction inherently useless, because in some, admittedly
obscure, cases the instruction would deliver the wrong result.  Considering
that the design has already been on the market for more than 10 years, one
might wonder how many programs silently failed due to this!)
 > 
 > >BTW, if unsigned multiplication is not available, additional problems
 > >arise in any case.
I dunno, anyhow:
 > I'm not sure I understand this either -- my routines use signed stuff. I can 
 > appreciate the problems regarding missing unsigned arithmetic, however.
Yup, but here we find the first error: is >> sign extending or not?  You did
not deal with this.  (And yes, there are machines that do not have an
arithmetic right shift.)  But I do not think that is essential.
 > 
 > ===single precision (32x32->32) multiply using (16x16->32) multiplier===
I will not comment on this; it is pretty straightforward.  And contains
similar errors as the next one.

 > ===double prec multiply (32x32->64) using (16x16->32) multiplier===
I presume you assume short is 16 bits, int is 32 bits and long is 64 bits?

Some declarations here.... anyhow, LNum is long, Num is int, SNum is short.
I assume that all multiplies are signed 16*16->32 (as promised).

Some more details omitted, and next (hix,lox,hiy,loy are SNums, t1 is a LNum):
 > 	t1 += (((Num)-hix+lox)*((Num)hiy-loy))<<SHORTBITS;
Several problems here:  I do not know the current C standard well enough
but I would write '(Num)-hix' as '-(Num)hix' unless you are sure that promotion
to Num occurs before negation (in that case, why the cast?).  I think it will
fail if hix=0x8000 and lox=1.  Also the promotions to Num of lox and loy are
not clear.  These are the glaring errors.  I will assume that you intended
that the expressions '-hix+lox' and 'hiy-loy' are still signed 16 bit numbers
(otherwise the assumption about the multiply would fail!).  Let's see:
Suppose, x = y = -2147483647; we find hix = -32768, lox = 1, hiy = -32768
and loy = 1.  Next, -hix+lox = 32769, but we have to truncate to 1 because
we are using this as an operand of the 16*16->32 multiply!  This is the
subtle error; it will fail on some well-chosen examples.
So do you now understand Herman Rubin's remark that you need either 17*17->34
multiply or (16+sign)*(16+sign)->(32+sign)?

Anyhow, your implementation is worthless, even if we expunge the glaring errors.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/07/91)

In article <3275@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <1991Apr4.213550.8106@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
[I didn't understand the need for 17x17->34 arithmetic to do the 3-multiply
trick]
>To understand it, see below.

Well I'm still not convinced. In the followup article I _think_ I have a
working 32x32->64 multiply (not that I've checked it exhaustively mind
you). Both the naive and 3-multiply versions seem to get the same
answers over a large number of cases (and I did some independent checking
of the ``end conditions'' of all simple patterns of n bits rotated in a 32-bit
field, and those pretty patterns suggested by Knuth that are easy to eyeball).
I don't see anywhere that uses 33x33->66 bit arithmetic although a bit of 
sign-testing goes on before we do the product-of-differences stuff.

Forgive me if I'm wrong, but to convert the unsigned routines to
signed ones just requires a couple of further 32-bit adds; still no
funny multiplies.

[I talk about multiply-and-add and shift-and-add perhaps having an
impact on 3-m vs 4-m; esp on DSP chips such as TI 320, _maybe_ on 6000]
>TI 320, yes, RS6000, no.  As I mailed you, this came from a remark from me
>about the m88k.  As on that machine with proper code a multiply takes only
>a single cycle, there is no way to improve it with 3 multiplies only.  

Well I don't know what you _think_ you said in your email. But all I
have on file is ``all those halfword adds are not worth the pain'';
I don't see anything there about m88k's (although that may have been
implicit, it doesn't read that way).

>My findings are that using an order n squared long integer multiply still pays
>for fairly large numbers (hundreds of decimal digits on current architectures).

Well on any machine where multiplies < about 6.5 average instructions
I would agree with you (see my previous post for derivation). In the
context of other architectures, which are abundant but not common :-)
maybe this is not the case. As all on comp.arch know by now, I'm not one for
taking people's word for anything when I can measure it for myself.

[long analysis of poor programs]
>Anyhow, your implementation is worthless, even if we expunge the glaring errors.

Sorry, but I thought it was apparent they were meant to be illustrative, 
not implementations. It's pretty hard to make C do a real 16x16->32 multiply 
in the first place :-)

As illustrations I guess they leave something to be desired due to the
various ambiguities you outlined, but I hope some of this was cleared up
by the follow-up post of what (I hope) are working versions.

-kym

dik@cwi.nl (Dik T. Winter) (04/07/91)

In article <1991Apr5.191136.21806@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
 > Various people have responded via email regarding the ``3-multiply
 > trick''.
Eh, yup.
 > 
 > Mostly the feeling is that any extra trouble required by the method is
 > not worth the results. I must concur in the majority of cases. However,
 > we can ask the converse question ``on which architectures will the trick
 > work?'' As I indicated in a previous post, I had in mind DSP-type
 > chips that have fancy multiply-and-add and add-and-shift instructions.
Might be, however, it is better to have correct code than fast code.
Moreover, I see you changed from signed multiply to unsigned multiply.
 > 
 > The Sparc would therefore be a good architecture to use the trick on.
Barely.  33 multiply step instructions give you the 64 bits result.
 > 
 > Running the benchmark on a Sun SS1 we have the output:
 > 
 > 	3700550176	3700550176
I found this quite interesting.  Both umul3 and umul4 gave the same incorrect
result for both cases I tried!

 > 	mid=bd+ac+(bd>>SHORTBITS);
This can fail.  0 <= a,b,c,d < 2^16, so
0 <= mid < 2^33+2^16.  You do not handle the overflow.

Try:
	umul3(0xffffffff,0xffffffff)
and see that it returns 0xfffe000000000001; not entirely correct.
This is probably due to some glaring (and some subtle) bugs.
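
For what it is worth, here is one way the carry could be patched (a sketch of
mine, not from the thread; C99 fixed-width types).  The sign of (a-b)(c-d) is
kept in a flag so only unsigned 16x16->32 products are needed, which is
precisely the sign/carry bookkeeping Herman Rubin warned about:

```c
#include <stdint.h>

typedef struct { uint32_t lo, hi; } Bignum;

/* 3-multiply 32x32->64, with the intermediate carries propagated. */
Bignum umul3_fixed(uint32_t x, uint32_t y)
{
    uint16_t a = x >> 16, b = (uint16_t)x;
    uint16_t c = y >> 16, d = (uint16_t)y;
    uint32_t ac = (uint32_t)a * c;
    uint32_t bd = (uint32_t)b * d;
    /* take |a-b| and |c-d| so the product stays 16x16->32 unsigned;
       remember the sign of (a-b)(c-d) separately */
    uint16_t adiff = a >= b ? a - b : b - a;
    uint16_t cdiff = c >= d ? c - d : d - c;
    uint32_t cross = (uint32_t)adiff * cdiff;
    int neg = (a >= b) != (c >= d);      /* is (a-b)(c-d) negative? */
    /* mid = ad + bc = ac + bd -/+ cross; it can reach 2^33, so keep
       an explicit carry bit alongside the 32-bit value */
    uint32_t mid = ac + bd;
    uint32_t carry = mid < ac;
    if (neg) { uint32_t t = mid + cross; carry += t < mid; mid = t; }
    else     { carry -= mid < cross;     mid -= cross; }
    Bignum r;
    r.lo = bd + (mid << 16);
    r.hi = ac + (mid >> 16) + (carry << 16) + (r.lo < bd);
    return r;
}
```

With the carries folded in, umul3_fixed(0xffffffff, 0xffffffff) gives the
expected hi = 0xfffffffe, lo = 0x00000001.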

 > 		for(n=0; n<NSAMP; n++) {
 > 			sum4+=umul4(x[it],y[it])->hi;
 > 			}
You want to change both occurrences of 'it' to 'n'.  The same applies a bit
further down.

Bottom line, never compare apples with oranges, and certainly not rotten
apples with rotten oranges.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/08/91)

In article <3276@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>Might be, however, it is better to have correct code than fast code.

Sure. But I am not sure it is required for the purposes of comparison
(better to compare rotten apples if them's all you got).

As you pointed out my routines didn't handle integer overflow in
an intermediate result. But as the same is true of both routines, fixing it
would not affect the comparison much (if it is measurable at all).

As I pointed out before, the main point of the exercise was to test
the claims (a) that you need 17x17->34 arithmetic to implement this
technique, and (b) that it isn't (ever) worth the trouble.

I think I've illustrated, despite problems, that (a) is not the case. 
While (b) appears true in some cases there are extant architectures in which it 
is not correct. 

I realize the SS1 already performs 32x32->64 multiply; this is really beside
the point: 32x32->64 was what I benchmarked. The same should be approximately
true of 3m vs 4m algorithms for larger operands.

Although I've tried to kid others into writing up the appropriate code
(to check 64x64->128 for instance) no-one has volunteered.  My prediction 
from the extant data is that 3m/4m on the SS1 will be about 10% on
average in favor of 3m for a 64x64->128 multiply with peaks at 25%
(signed or not, correct or not). Anyone care to check it?
-- 
-kym

main(){int i=0,b=0,c=0,n=0;while(i<1920){if((b>>=1)<2)
b=n*n++;printf("\n%c"+!!(++i%80)," #"[(c^=1)^(b&1)]);}}

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/08/91)

In article <3276@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
> > 	mid=bd+ac+(bd>>SHORTBITS);
>This can fail.  0 <= a,b,c,d < 2^16, so
>0 <= mid < 2^33+2^16.  You do not handle the overflow.

Hmmm. While we're talking about accuracy -- I get 0 <= mid <= 2^33 - 3*2^16 + 1.
You seem to indicate >1 bits of overflow may be required.

-kym
-- 
main(){int i=0,b=0,c=0,n=0;while(i<1920){if((b>>=1)<2)
b=n*n++;printf("\n%c"+!!(++i%80)," #"[(c^=1)^(b&1)]);}}

tif@doorstop.austin.ibm.com (Paul Chamberlain) (04/08/91)

dik@cwi.nl (Dik T. Winter) writes:
>kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> > use it for everything :-). Instructions like multiply-and-add, 
> > multiply-shift-and-add are therefore not unheard of. (I have in mind ...
> > TI 320 series, I presume those 6000 thingies can do similar things,
>TI 320, yes, RS6000, no.

The rest of this article is too deep for my current mood; however,
if I understand the above correctly, you are wrong.  IBM goes to
great pains to brag about the multiply-and-add instruction.

Paul Chamberlain | I do NOT speak for IBM.          IBM VNET: PAULCC AT AUSTIN
512/838-9748     | ...!cs.utexas.edu!ibmchs!auschs!doorstop.austin.ibm.com!tif

halkoD@batman.moravian.EDU (David Halko) (04/23/91)

In article <6490@awdprime.UUCP>, tif@doorstop.austin.ibm.com (Paul Chamberlain) writes:
> dik@cwi.nl (Dik T. Winter) writes:
> >kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> > > use it for everything :-). Instructions like multiply-and-add, 
> > > multiply-shift-and-add are therefore not unheard of. (I have in mind ...
> > > TI 320 series, I presume those 6000 thingies can do similar things,
> >TI 320, yes, RS6000, no.
> 
> The rest of this article is too deep for my current mood; however,
> if I understand the above correctly, you are wrong.  IBM goes to
> great pains to brag about the multiply-and-add instruction.
> 
Multiply-and-add instructions are very useful in digital signal processing
applications, where filters need to be implemented.  Multiply-add instructions
performed in a pipeline significantly improve the performance of such
algorithms...

-- 
     _______________________________________________________________________
    /                                                                       \
   / "The use of COBOL cripples the mind; its teaching should, therefore, be \
  /  regarded as a criminal offence."           E.W.Dijkstra, 18th June 1975. \
 +-----------------------------------------------------------------------------+
  \  "Have you purchased a multi-   halkoD@moravian.edu  Have you booted your /
   \ media machine from IMS yet?"   David J. Halko       OS-9 computer today?/
    \_______________________________________________________________________/