guy@auspex.auspex.com (Guy Harris) (03/31/91)
>Are the new PA machines binary compatible with the old ones?  Reading
>the press, it sounds like there are new instructions in the new
>machines so that while the old programs will run, the full performance
>will not be achieved unless you recompile.  Is this true?

I saw something that indicated that integer multiply and divide
instructions have been added to HP-PA, and that the chips in the new
machine, I think, implement them.

SPARC has also added them in Version 8 of the SPARC architecture; the
"Implementation Characteristics" appendix indicates that the Matsushita
MN10501 chip (I think that's the chip used in some Solbourne machines)
implements the multiply (but not divide) instructions.

I think the IBM ROMP had no integer multiply or divide instructions;
the POWER architecture does.

Are there any remaining RISC holdouts that originally didn't have
integer multiply and divide, and that haven't added those instructions
to their architectures?
dik@cwi.nl (Dik T. Winter) (04/01/91)
In article <6920@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
> I saw something that indicated that integer multiply and divide
> instructions have been added to HP-PA, and that the chips in the new
> machine, I think, implement them.
Right.
>
> I think the IBM ROMP had no integer multiply or divide instructions;
> the POWER architecture does.
Also correct.  You may also mention the move in AMD 29k to the 29050.
> Are there any remaining RISC holdouts that originally didn't have
> integer multiply and divide, and that haven't added those instructions
> to their architectures?
Now this is a bit slippery (checking the 32*32->64 bit multiply and
64/32->32,32 bit divide; this may be old hat, and, what is RISC?):
i860:	no integer multiply/divide, uses FP (slight support for multiply)
m88k:	full support (no remainder & no 64 bit res/src, uses FP)
amd29k:	full support (from 29050)
sparc:	full support (no remainder (from version 8))
mips:	full support (no 64 bit src for div/rem)
power:	full support (it appears when trying)
hppa:	full support (uses FP, might limit res/src to 32 bits, I am not sure)
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
daryl@hpcupt3.cup.hp.com (Daryl Odnert) (04/02/91)
Guy Harris writes:
> I saw something that indicated that integer multiply and divide
> instructions have been added to HP-PA, and that the chips in the new
> machine, I think, implement them.

HP's PA-RISC 1.1 architecture does include an unsigned integer multiply
instruction that operates on floating-point register operands.  However,
it does not include any integer division instructions.

Daryl Odnert
Hewlett-Packard California Language Lab
Cupertino, California
huck@aspen.IAG.HP.COM (Jerry Huck) (04/02/91)
In comp.arch, guy@auspex.auspex.com (Guy Harris) writes:
> I saw something that indicated that integer multiply and divide
> instructions have been added to HP-PA, and that the chips in the new
> machine, I think, implement them.

Integer multiply on operands in the floating-point register file has
been added (this was pretty simple to support).  No direct integer
divide.  The instruction set has several strategies to deal with
integer multiplication and division.

All the data to date does not support the inclusion of integer divide
instructions as being cost effective for the markets and workloads HP
sells to.  At this point, the silicon/design effort/testing is better
spent elsewhere.  Other workloads may suggest other conclusions.

Jerry Huck
Hewlett Packard
meissner@osf.org (Michael Meissner) (04/04/91)
In article <3237@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:

| Now this is a bit slippery (checking the 32*32->64 bit multiply and
| 64/32->32,32 bit divide; this may be old hat, and, what is RISC?):

| m88k:	full support (no remainder & no 64 bit res/src, uses FP)

While it doesn't have a 32x32->64 bit multiply, the multiply unit is
pipelined, so that you can do a 32x32->64 bit multiply in about 13
clocks (including separating the two 32 bit numbers into 16 bit
pieces, doing 4 pipelined multiplies, and merging the results).
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled
ABUSENET?
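The four-multiply sequence described above can be sketched in portable C; this is a minimal illustration of the decomposition (using 64-bit intermediates for clarity), not the actual m88k instruction sequence, and `mul4` is just an illustrative name:

```c
#include <stdint.h>

/* Schoolbook 32x32->64 multiply from four 16x16->32 partial products,
   the decomposition the pipelined sequence above implements:
   with x = a*2^16 + b and y = c*2^16 + d,
   x*y = ac*2^32 + (ad + bc)*2^16 + bd. */
uint64_t mul4(uint32_t x, uint32_t y)
{
    uint64_t a = x >> 16, b = x & 0xffff;   /* split x into 16-bit halves */
    uint64_t c = y >> 16, d = y & 0xffff;   /* split y into 16-bit halves */
    uint64_t ac = a * c, ad = a * d, bc = b * c, bd = b * d;
    return (ac << 32) + ((ad + bc) << 16) + bd;
}
```

The real savings on the 88k come from pipelining the four independent partial products, not from the arithmetic itself.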
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/04/91)
In article <MEISSNER.91Apr3133852@curley.osf.org> meissner@osf.org (Michael Meissner) writes:
>While it doesn't have a 32x32->64 bit multiply, the multiply unit is
>pipelined, so that you can do a 32x32->64 bit multiply in about 13
>clocks (including separating the two 32 bit numbers into 16 bit
>pieces, doing 4 pipelined multiplies, and merging the results).

But you only need 3 multiplies -- see Knuth.

-kym
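The 3-multiply identity referred to here can be sketched as follows; this is a minimal illustration using 64-bit intermediates to sidestep the sign/width issues debated later in the thread, not a drop-in routine for a 16-bit multiplier:

```c
#include <stdint.h>

/* The 3-multiply decomposition (see Knuth, vol. 2): with
   x = a*2^16 + b and y = c*2^16 + d,
   x*y = ac*2^32 + (ac + bd - (a-b)*(c-d))*2^16 + bd,
   since ac + bd - (a-b)*(c-d) = ad + bc.  Only three distinct
   products are needed: ac, bd, and (a-b)*(c-d). */
uint64_t mul3(uint32_t x, uint32_t y)
{
    int64_t a = x >> 16, b = x & 0xffff;
    int64_t c = y >> 16, d = y & 0xffff;
    int64_t ac = a * c, bd = b * d;
    int64_t mid = ac + bd - (a - b) * (c - d);   /* = ad + bc, never negative */
    return ((uint64_t)ac << 32) + ((uint64_t)mid << 16) + (uint64_t)bd;
}
```

Note that (a-b) and (c-d) can each be negative and span 17 bits, which is exactly the snag discussed below when the multiplier itself is only 16x16.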
dik@cwi.nl (Dik T. Winter) (04/04/91)
In article <MEISSNER.91Apr3133852@curley.osf.org> meissner@osf.org (Michael Meissner) writes:
> In article <3237@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
> | Now this is a bit slippery (checking the 32*32->64 bit multiply and
> | 64/32->32,32 bit divide; this may be old hat, and, what is RISC?):
>
> | m88k:	full support (no remainder & no 64 bit res/src, uses FP)
>
> While it doesn't have a 32x32->64 bit multiply, the multiply unit is
> pipelined, so that you can do a 32x32->64 bit multiply in about 13
> clocks (including separating the two 32 bit numbers into 16 bit
> pieces, doing 4 pipelined multiplies, and merging the results).

Yes, I know.  I have in my archives a routine to do that in 18 clocks.
Possibly some cycles could be shaved off.  It was posted over a year
ago by Alan Lovejoy at AT&T Paradyne.

Some updates (newer information on hppa, and I added also arm and clipper):
hppa:	full multiply support, division through divide step
arm:	no divide or divide step, multiply is 32*32->32
clipper:	full mult, separate div and rem and 32/32 only
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
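For architectures in the table that offer only a divide-step primitive, the idea is one quotient bit per step.  A restoring divide loop might look like this in C; this is a sketch of the general scheme only, and the actual PA-RISC and SPARC divide-step instructions have their own specific semantics:

```c
#include <stdint.h>

/* Restoring division, one quotient bit per iteration -- the loop a
   "divide step" instruction folds into a single instruction, so a
   32-bit quotient costs roughly 32 steps. */
uint32_t udiv32(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0;
    uint64_t r = 0;                       /* partial remainder, needs 33 bits */
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1);    /* bring down next dividend bit */
        if (r >= d) {                     /* trial subtract succeeds? */
            r -= d;
            q |= 1u << i;
        }
    }
    *rem = (uint32_t)r;
    return q;
}
```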
hrubin@pop.stat.purdue.edu (Herman Rubin) (04/04/91)
In article <1991Apr3.232757.8020@bingvaxu.cc.binghamton.edu>, kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> In article <MEISSNER.91Apr3133852@curley.osf.org> meissner@osf.org (Michael Meissner) writes:
> >While it doesn't have a 32x32->64 bit multiply, the multiply unit is
> >pipelined, so that you can do a 32x32->64 bit multiply in about 13
> >clocks (including separating the two 32 bit numbers into 16 bit
> >pieces, doing 4 pipelined multiplies, and merging the results).
>
> But you only need 3 multiplies -- see Knuth.

This is true, but you would need either 17x17 -> 34 multiply, or
(16+sign)*(16+sign)->(32+sign), as well as a considerable use of
keeping track of signs/carries, or substantially more additions.

BTW, if unsigned multiplication is not available, additional problems
arise in any case.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/05/91)
In article <9538@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
[discussion of double-prec multiply, esp 32x32->64 (yuk!)]
>This is true, but you would need either 17x17 -> 34 multiply, or
>(16+sign)*(16+sign)->(32+sign), as well as a considerable use of
>keeping track of signs/carries, or substantially more additions.

Well I'm not sure I understand this -- it's probably a slightly
different implementation of what I have in mind.

One of the reasons for keeping the 3-multiply trick in mind is that
some chips include a dedicated fast multiplier.  It's not unknown to
link up this guy with the rest of the alu so that simultaneous adds and
mults can be done (after all, we spent a lot of money to get that
multiplier IN there so let's use it for everything :-).  Instructions
like multiply-and-add and multiply-shift-and-add are therefore not
unheard of.  (I have in mind esp the TI 320 series; I presume those
6000 thingies can do similar things, but I haven't looked at the gory
details).

So for such machines we can both make use of the trick and try to use
as many fast instruction combinations as possible; the trouble _may_ be
worth it in some applications (although I haven't actually measured
anything).

The appended programs show how to obtain single (i.e. 32x32->32) and
double (32x32->64) results for multiply using a 16x16->32 multiplier.
The routines are pretty rough (written on the fly), so forgive any
glaring (or subtle) errors.

>BTW, if unsigned multiplication is not available, additional problems
>arise in any case.

I'm not sure I understand this either -- my routines use signed stuff.
I can appreciate the problems regarding missing unsigned arithmetic,
however.
-kym

C code:

===single precision (32x32->32) multiply using (16x16->32) multiplier===
/***
	Taking:
		(aL+b)(cL+d) = LLac+Lad+Lbc+bd
		L(a-b)(c-d)  = Lac-Lad-Lbc+Lbd
	where L=2^k we find:
		L(a-b)(c-d) + (aL+b)(cL+d) = LLac+Lac+Lbd+bd
	Therefore:
		mul(x,y) = hix hiy <<2k + hix hiy<<k + lox loy<<k + lox loy
			 - (hix-lox)(hiy-loy)<<k
	If you only need the low-order 2k bits of the product (i.e. the
	same size as the operands):
		mul(x,y) = hix hiy<<k + lox loy<<k
			 + (-hix+lox)(hiy-loy)<<k + lox loy
	We could also reformulate this as:
		mul(x,y) = hix hiy<<k + lox loy<<k
			 + (-hix+lox)(hiy-loy)<<k + loxyhi<<k + loxylo
	that requires only k-bit adds.  2k-bit adds are more convenient.
***/

#define SHORTBITS 16	/* size of ``short'' */

typedef int Num;	/* regular size */
typedef short SNum;	/* 1/2 size */

#define BIT(N) (1L<<(N))
#define MASK(N) (BIT(N)-1)

Num mul(x,y)
Num x,y; {
	SNum hix,hiy,lox,loy;
	Num t1,t2;

	/***
		I'll rely on a smart compiler to sort out the following
		mess into something reasonable.
	***/

	hix=x>>SHORTBITS; lox=x&MASK(SHORTBITS);
	hiy=y>>SHORTBITS; loy=y&MASK(SHORTBITS);
	t2 = t1 = (Num)lox*loy;
	t2 += (Num)hix*hiy;			/* multiply-and-add */
	t2 += ((Num)-hix+lox)*((Num)hiy-loy);	/* multiply-and-add */
	t1 += t2<<SHORTBITS;			/* shift-and-add */
	return t1;
}

===double prec multiply (32x32->64) using (16x16->32) multiplier===

#define SHORTBITS 16

typedef long LNum;
typedef int Num;
typedef short SNum;

#define BIT(N) (1L<<(N))
#define MASK(N) (BIT(N)-1)

LNum mul(x,y)
Num x,y; {
	SNum hix,hiy,lox,loy;
	LNum t1;

	hix=x>>SHORTBITS; lox=x&MASK(SHORTBITS);
	hiy=y>>SHORTBITS; loy=y&MASK(SHORTBITS);
	t1 = (Num)lox*loy;
	t1 += ((Num)hix*hiy)<<SHORTBITS;	/* multiply-shift-and-add */
	t1 += t1<<SHORTBITS;			/* shift-and-add */
	t1 += (((Num)-hix+lox)*((Num)hiy-loy))<<SHORTBITS;
						/* multiply-shift-and-add */
	return t1;
}
===end end end===
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/06/91)
Various people have responded via email regarding the ``3-multiply
trick''.

Mostly the feeling is that any extra trouble required by the method is
not worth the results.  I must concur in the majority of cases.
However, we can ask the converse question ``on which architectures will
the trick work?''  As I indicated in a previous post, I had in mind
DSP-type chips that have fancy multiply-and-add and add-and-shift
instructions.  The trick is obviously worthwhile on machines with a
very expensive multiply as well.

I've written another little ``benchmark'' program to compare a
32x32->64 (yuk!) unsigned multiply written using both the naive
4-multiply method:

	(a+b)*(c+d) = a*c + b*c + a*d + b*d

and the 3-multiply trick:

	(a+b)*(c+d) = (a-b)*(c-d) + 2*(b*c + a*d)

where a, b, c and d are the obvious high & low parts of 2 xp unsigned
numbers.

The results from pixie show that unsigned multiplies account for about
10% of all opcodes (all subroutine calls were removed by the optimizer
and lost delay slots counted for relatively few cycles).  Specifically
in the 4-multiply case we find:

	andi	13.56
	addu	10.17
	multu	10.17
	srl	 6.78
	sw	 8.47
	addiu	 8.48
	sll	 3.39

and in the 3-multiply case we find:

	andi	13.63
	addu	 9.09
	multu	 7.57
	srl	 6.06
	sw	 9.09
	addiu	 7.58
	sll	 3.03

as percentages of dynamic instructions.

We can then ask ``how slow would multiply have to be to make an
overall significant difference here?''  We set up the equation:

	10.17 T + (100-10.17) = (7.57 T + (100-7.57)) * 1.1

I.e. ``when will the total runtime for the 4-mult routine be 10%
larger than for the 3-mult routine?''  (I am assuming here that all
other instructions apart from multiply take about the same time).
We get:

	T = (100-10.17) - 1.1*(100-7.57)
	    ----------------------------
	          7.57 * 1.1 - 10.17

	  = 6.43 approx.

The Sparc would therefore be a good architecture to use the trick on.
Running the benchmark on a Sun SS1 we have the output:

	3700550176 3700550176
	minfact=.75
	maxfact=1
	sumfact=8.71429
	arithavfact=.871429
	prodfact=.237305
	geoavfact=.237305

The first couple of numbers merely check that both routines produce
the same result (it totals the high-order parts of a large number of
random 64-bit products -- a sorta ``signature'').

The best the 3-multiply routine can do is 25% faster than the naive
method -- as expected.  Some cases break even, however.  The
arithmetic average has the 3-mult routine running a little more than
10% faster.

The next question we might ask is ``what is the situation on a machine
with multiply-and-add?''  I can only approximate this at present (not
having my hands on a DSP chip that also has a good optimizing C
compiler :-).  By combining the addu+multu figures above we arrive at:

	2*10.17 T + (100-2*10.17) = ((9.09+7.57) T + (100-9.09-7.57))*1.1

where T is the time of a synthetic multiply-and-add instruction.
(This is obviously only approximate -- not EVERY add is associated
with a multiply although vv).  We find:

	T = 6 approx.

Thus if a combined m&a instruction is 6 times slower ON AVERAGE than
other instructions in a given architecture it may be worthwhile doing
the 3-mult trick.  If m&a is implemented by simply chaining the o/p of
a multiplier to the input of an adder then it may be likely that the
average cost of multiply-and-add exceeds this figure.  It might also
be true even taking into account a more sophisticated pipelined h/w.

The algorithm in question is subject to lost ``m&a slots'' due to the
high density of m&a instructions; the result of one m&a is typically
required for a subsequent operation.  While the alu is busy we can't
use it for the add/subtract phase of an m&a instruction (and we rely
on the compiler therefore NOT to schedule one in cases where this
would happen).

To summarize: using the 3-multiply trick will not get you anything on
machines where multiplies are not expensive (i.e.
< 6.5 average instructions).  This lets out VAX's, 68k's, Mips's but
doesn't seem to include the SS1 (I don't have any later SS's to hand,
sorry).  In the world of smaller machines (:-) 386/486 without
co-processors are likely another.

On machines with instructions such as multiply-and-add, etc, the
3-multiply trick may be more likely to pay off, as m&a will be subject
to scheduling collisions and other slot losses for the algorithms in
question.  The trick apparently requires a multiply-and-add to cost
about 6 average instructions to pay off on these architectures.

-kym

BTW, I would be interested in any multiply-and-add figures people have
to hand.  If there is sufficient interest, I will collate and post a
summary at a future date.

C code:

===compare 3-m and 4-m 32x32->64 unsigned multiply using 16x16->32 arithmetic===
#include <stdio.h>
#include <math.h>
#include <assert.h>

#define BIT(N) (1L<<(N))
#define MASK(N) (BIT(N)-1)

#define SHORTBITS 16
#define MAX (1L<<SHORTBITS)

#define NSAMP 10000
#define NIT 10

typedef unsigned short UShort;
typedef short Short;
typedef long Long;
typedef unsigned long ULong;

typedef struct {
	ULong lo,hi;
} Bignum;

Bignum *umul3(x,y)
register ULong x,y; {
	register UShort a,b,c,d;
	register ULong ac,bd;
	register ULong mid;
	static Bignum res;

	a=x>>SHORTBITS; b=(UShort)x;
	c=y>>SHORTBITS; d=(UShort)y;

	/***
		 (a+b)(c+d)
		32  16  16
		 ac
		     bd
		     ac
		-(a-b)(c-d)
			 bd
	***/

	ac=a*c;
	bd=b*d;
	mid=bd+ac+(bd>>SHORTBITS);
	if(a>=b)
		if(c>=d) mid-=(a-b)*(c-d);
		else mid+=(a-b)*(-c+d);
	else
		if(c>=d) mid+=(-a+b)*(c-d);
		else mid-=(-a+b)*(-c+d);
	res.lo=((UShort)bd)+(mid<<SHORTBITS);
	res.hi=((UShort)mid)+((ac+(mid>>SHORTBITS))<<SHORTBITS);
	return &res;
}

Bignum *umul4(x,y)
register ULong x,y; {
	register UShort a,b,c,d;
	register ULong lo,mid;
	static Bignum res;

	a=x>>SHORTBITS; b=(UShort)x;
	c=y>>SHORTBITS; d=(UShort)y;

	/***
		(a+b)(c+d) = ac + ad + bc + bd
		32  16  16
		 ac
		     bc
		     ad
			 bd
	***/

	lo=b*d;
	mid=b*c+a*d+(lo>>SHORTBITS);
	res.lo=((UShort)lo)+(mid<<SHORTBITS);
	res.hi=((UShort)mid)+((a*c+(mid>>SHORTBITS))<<SHORTBITS);
	return &res;
}

ULong x[NSAMP],y[NSAMP];

void init() {
	unsigned i;

	for(i=0;i<NSAMP;i++) {
		x[i]=rand();
		y[i]=rand();
	}
}

void main(argc,argv)
char *argv[]; {
	unsigned it,n;
	double t0,t1,t2;
	ULong sum3=0,sum4=0;
	double sumfact=0,prodfact=1;
	double minfact=1e38,maxfact= -1e38;

	for(it=0; it<NIT; it++) {
		double fact;

		init();
		t0=clock();
		for(n=0; n<NSAMP; n++) {
			sum4+=umul4(x[it],y[it])->hi;
		}
		t1=clock();
		for(n=0; n<NSAMP; n++) {
			sum3+=umul3(x[it],y[it])->hi;
		}
		t2=clock();
		fact=(t2-t1)/(t1-t0);
		prodfact*=fact;
		sumfact+=fact;
		if(fact>maxfact) maxfact=fact;
		if(fact<minfact) minfact=fact;
	}
	printf("%lu %lu\n",sum3,sum4);	/* in case of clever compiler */
	printf("minfact=%g\n",minfact);
	printf("maxfact=%g\n",maxfact);
	printf("sumfact=%g\n",sumfact);
	printf("arithavfact=%g\n",sumfact/NIT);
	printf("prodfact=%g\n",prodfact);
	printf("geoavfact=%g\n",pow(prodfact,1.0/NIT));
	exit(0);
}
===end end end===
dik@cwi.nl (Dik T. Winter) (04/06/91)
In article <1991Apr4.213550.8106@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> In article <9538@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> [discussion of double-prec multiply, esp 32x32->64 (yuk!)]
> >This is true, but you would need either 17x17 -> 34 multiply, or
> >(16+sign)*(16+sign)->(32+sign), as well as a considerable use of
> >keeping track of signs/carries, or substantially more additions.
>
> Well I'm not sure I understand this -- it's probably a slightly
> different implementation of what I have in mind.
To understand it, see below.
>
> One of the reasons for keeping the 3-multiply trick in mind is that some
> chips include a dedicated fast multiplier.  It's not unknown to link up
> this guy with the rest of the alu so that simultaneous adds and mults can
> be done (after all, we spent a lot of money to get that multiplier IN
> there so let's use it for everything :-).  Instructions like
> multiply-and-add, multiply-shift-and-add are therefore not unheard of.
> (I have in mind esp the TI 320 series, I presume those 6000 thingies can
> do similar things, but I haven't looked at the gory details).
TI 320, yes, RS6000, no.  As I mailed you, this came from a remark from
me about the m88k.  As on that machine with proper code a multiply takes
only a single cycle, there is no way to improve it with 3 multiplies
only.  My findings are that using an order n squared long integer
multiply still pays for fairly large numbers (hundreds of decimal digits
on current architectures).
>
> The appended programs show how to obtain single (i.e. 32x32->32) and
> double (32x32->64) results for multiply using a 16x16->32 multiplier.
> The routines are pretty rough (written on the fly), so forgive any
> glaring (or subtle) errors.
Well, it is just those subtle errors that make a routine useless, and
are inherently unforgivable.  Glaring errors we can live with, they are
easily corrected!
(It is just like my talk with a representative from some, unnamed,
computer manufacturer about a design error I found in an instruction.
It made the instruction inherently useless, because in some, admittedly
obscure, cases the instruction would deliver the wrong result.
Considering that the design has already been more than 10 years on the
market, one might wonder how many programs silently failed due to this!)

> >BTW, if unsigned multiplication is not available, additional problems
> >arise in any case.
I dunno, anyhow:
> I'm not sure I understand this either -- my routines use signed stuff.
> I can appreciate the problems regarding missing unsigned arithmetic,
> however.
Yup, but here we find the first error: is >> sign extending or not?
You did not deal with this.  (And yes, there are machines that do not
have an arithmetic right shift.)  But I do not think that is essential.

> ===single precision (32x32->32) multiply using (16x16->32) multiplier===
I will not comment on this; it is pretty straightforward.  And it
contains similar errors as the next one.

> ===double prec multiply (32x32->64) using (16x16->32) multiplier===
I presume you assume short is 16 bits, int is 32 bits and long is 64
bits?
	some declarations here....
anyhow, LNum is long, Num is int, SNum is short.  I assume that all
multiplies are signed 16*16->32 (as promised).  Some more details
omitted, and next (hix,lox,hiy,loy are SNums, t1 is a LNum):
> 	t1 += (((Num)-hix+lox)*((Num)hiy-loy))<<SHORTBITS;
Several problems here: I do not know the current C standard well
enough, but I would write '(Num)-hix' as '-(Num)hix' unless you are
sure that promotion to Num occurs before negation (in that case, why
the cast?).  I think it will fail if hix=0x8000 and lox=1.  Also the
promotions to Num of lox and loy are not clear.  These are the glaring
errors.  I will assume that you intended that the expressions
'-hix+lox' and 'hiy-loy' are still signed 16 bit numbers (otherwise
the assumption about the multiply would fail!).
Let's see: Suppose x = y = -2147483647; we find hix = -32768, lox = 1,
hiy = -32768 and loy = 1.  Next, -hix+lox = 32769, but we have to
truncate to 1 because we are using this as an operand of the 16*16->32
multiply!  This is the subtle error; it will fail on some well-chosen
examples.

So do you now understand Herman Rubin's remark that you need either
17*17->34 multiply or (16+sign)*(16+sign)->(32+sign)?

Anyhow, your implementation is worthless, even if we expunge the
glaring errors.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/07/91)
In article <3275@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <1991Apr4.213550.8106@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
[I didn't understand the need for 17x17->34 arithmetic to do the
3-multiply trick]
>To understand it, see below.

Well I'm still not convinced.  In the followup article I _think_ I
have a working 32x32->64 multiply (not that I've checked it
exhaustively mind you).  Both the naive and 3-multiply versions seem
to get the same answers over a large number of cases (and I did some
independent checking of the ``end conditions'' of all simple patterns
of n bits rotated in a 32-bit field and those pretty patterns
suggested by Knuth that are easy to eyeball).

I don't see anywhere that uses 33x33->66 bit arithmetic although a bit
of sign-testing goes on before we do the product-of-differences stuff.
Forgive me if I'm wrong, but to convert the unsigned routines to
signed ones just requires a couple of further 32-bit adds; still no
funny multiplies.

[I talk about multiply-and-add and shift-and-add perhaps having an
impact on 3-m vs 4-m; esp on DSP chips such as TI 320, _maybe_ on 6000]
>TI 320, yes, RS6000, no.  As I mailed you, this came from a remark
>from me about the m88k.  As on that machine with proper code a
>multiply takes only a single cycle, there is no way to improve it
>with 3 multiplies only.

Well I don't know what you _think_ you said in your email.  But all I
have on file is ``all those halfword adds are not worth the pain''; I
don't see anything there about m88k's (although that may have been
implicit, it doesn't read that way).

>My findings are that using an order n squared long integer multiply
>still pays for fairly large numbers (hundreds of decimal digits on
>current architectures).

Well on any machine where multiplies < about 6.5 average instructions
I would agree with you (see my previous post for derivation).
In the context of other architectures, which are abundant but not
common :-) maybe this is not the case.  As all on comp.arch know by
now, I'm not one for taking people's word for anything when I can
measure it for myself.

[long analysis of poor programs]
>Anyhow, your implementation is worthless, even if we expunge the
>glaring errors.

Sorry, but I thought it was apparent they were meant to be
illustrative, not implementations.  It's pretty hard to make C do a
real 16x16->32 multiply in the first place :-)  As illustrations I
guess they leave something to be desired due to the various
ambiguities you outlined, but I hope some of this was cleared up by
the follow-up post of what (I hope) are working versions.

-kym
dik@cwi.nl (Dik T. Winter) (04/07/91)
In article <1991Apr5.191136.21806@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> Various people have responded via email regarding the ``3-multiply
> trick''.
Eh, yup.
>
> Mostly the feeling is that any extra trouble required by the method
> is not worth the results.  I must concur in the majority of cases.
> However, we can ask the converse question ``on which architectures
> will the trick work?''  As I indicated in a previous post, I had in
> mind DSP-type chips that have fancy multiply-and-add and
> add-and-shift instructions.
Might be, however, it is better to have correct code than fast code.
Moreover, I see you changed from signed multiply to unsigned multiply.
>
> The Sparc would therefore be a good architecture to use the trick on.
Barely.  33 multiply step instructions give you the 64 bits result.
>
> Running the benchmark on a Sun SS1 we have the output:
>
> 3700550176 3700550176
I found this quite interesting.  Both umul3 and umul4 gave the same
incorrect result for both cases I tried!
> 	mid=bd+ac+(bd>>SHORTBITS);
This can fail.  0 <= a,b,c,d < 2^16, so 0 <= mid < 2^33+2^16.  You do
not handle the overflow.  Try:
	umul3(0xffffffff,0xffffffff)
and see that it returns 0xfffe000000000001; not entirely correct.
This is probably due to some glaring (and some subtle) bugs.
> 	for(n=0; n<NSAMP; n++) {
> 		sum4+=umul4(x[it],y[it])->hi;
> 	}
You want to change both occurrences of 'it' to 'n'.  Same a bit
further down.

Bottom line, never compare apples with oranges, and certainly not
rotten apples with rotten oranges.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/08/91)
In article <3276@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>Might be, however, it is better to have correct code than fast code.

Sure.  But I am not sure it is required for the purposes of comparison
(better to compare rotten apples if them's all you got).  As you
pointed out, my routines didn't handle integer overflow in an
intermediate result.  But as the same is true of both routines, fixing
same would not affect the comparison too much (if measurable at all).

As I pointed out before, the main point of the exercise was to test
the claims (a) that you need 17x17->34 arithmetic to implement this
technique, and (b) that it isn't (ever) worth the trouble.  I think
I've illustrated, despite problems, that (a) is not the case.  While
(b) appears true in some cases there are extant architectures in which
it is not correct.

I realize the SS1 already performs 32x32->64 multiply; this is really
beside the point.  32x32->64 was what I benchmarked.

The same should be approximately true of 3m vs 4m algorithms for
larger operands.  Although I've tried to kid others into writing up
the appropriate code (to check 64x64->128 for instance) no-one has
volunteered.  My prediction from the extant data is that 3m/4m on the
SS1 will be about 10% on average in favor of 3m for a 64x64->128
multiply with peaks at 25% (signed or not, correct or not).  Anyone
care to check it?
--
-kym
main(){int i=0,b=0,c=0,n=0;while(i<1920){if((b>>=1)<2)
b=n*n++;printf("\n%c"+!!(++i%80)," #"[(c^=1)^(b&1)]);}}
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (04/08/91)
In article <3276@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
> > 	mid=bd+ac+(bd>>SHORTBITS);
>This can fail.  0 <= a,b,c,d < 2^16, so
>0 <= mid < 2^33+2^16.  You do not handle the overflow.

Hmmm.  While we're talking about accuracy -- I get

	0 <= mid <= 2^33 - 3*2^16 + 1.

You seem to indicate >1 bits of overflow may be required.

-kym
--
main(){int i=0,b=0,c=0,n=0;while(i<1920){if((b>>=1)<2)
b=n*n++;printf("\n%c"+!!(++i%80)," #"[(c^=1)^(b&1)]);}}
tif@doorstop.austin.ibm.com (Paul Chamberlain) (04/08/91)
dik@cwi.nl (Dik T. Winter) writes:
>kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> > use it for everything :-).  Instructions like multiply-and-add,
> > multiply-shift-and-add are therefore not unheard of.  (I have in mind ...
> > TI 320 series, I presume those 6000 thingies can do similar things,
>TI 320, yes, RS6000, no.

The rest of this article is too deep for my current mood; however, if
I understand the above correctly, you are wrong.  IBM goes to great
pains to brag about the multiply-and-add instruction.

Paul Chamberlain | I do NOT speak for IBM.     IBM VNET: PAULCC AT AUSTIN
512/838-9748     | ...!cs.utexas.edu!ibmchs!auschs!doorstop.austin.ibm.com!tif
halkoD@batman.moravian.EDU (David Halko) (04/23/91)
In article <6490@awdprime.UUCP>, tif@doorstop.austin.ibm.com (Paul Chamberlain) writes:
> dik@cwi.nl (Dik T. Winter) writes:
> >kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
> > > use it for everything :-).  Instructions like multiply-and-add,
> > > multiply-shift-and-add are therefore not unheard of.  (I have in mind ...
> > > TI 320 series, I presume those 6000 thingies can do similar things,
> >TI 320, yes, RS6000, no.
>
> The rest of this article is too deep for my current mood; however, if
> I understand the above correctly, you are wrong.  IBM goes to great
> pains to brag about the multiply-and-add instruction.

Multiply-and-add instructions are very useful in digital signal
processing applications, where filters need to be made.  Multiply-add
instructions performed in a pipeline significantly improve the
performance of such algorithms...
--
"The use of COBOL cripples the mind; its teaching should, therefore, be
regarded as a criminal offence."          E.W.Dijkstra, 18th June 1975.

"Have you purchased a multi-    halkoD@moravian.edu   Have you booted your
 media machine from IMS yet?"   David J. Halko        OS-9 computer today?
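The filter pattern referred to is one multiply-accumulate per tap; a minimal FIR inner-loop sketch (the `fir` helper is illustrative, not any particular DSP's code):

```c
#include <stdint.h>

/* FIR convolution inner loop: the acc += x[i]*h[i] step is exactly
   what a hardware multiply-and-add instruction executes per tap. */
int32_t fir(const int16_t *x, const int16_t *h, int ntaps)
{
    int32_t acc = 0;
    for (int i = 0; i < ntaps; i++)
        acc += (int32_t)x[i] * h[i];   /* the multiply-accumulate step */
    return acc;
}
```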