bs@linus.UUCP (Robert D. Silverman) (12/23/89)
Does any have, of know of software for the SPARC [SUN-4] that will perform the following: (1) Multiply two unsigned 32-bit integers, yielding a 64 bit product stored in two registers? (2) Take a 64 bit product and divide it by a 32 bit (unsigned) integer yielding a 32-bit quotient and remainder? or (3) Compute A*B Mod C directly? The SPARC is brain dead [as were its designers] when it comes to doing integer arithmetic. It can't multiply and it can't divide. Trying to convert to floating point, do the arithmetic, then convert back is far too slow. (I've already tried). -- Bob Silverman #include <std.disclaimer> Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs Mitre Corporation, Bedford, MA 01730
davidc@vlsisj.VLSI.COM (David Chapman) (12/27/89)
In article <84768@linus.UUCP> bs@linus.mitre.org (Robert D. Silverman) writes: >Does any have, of know of software for the SPARC [SUN-4] that will >perform the following: > > [standard multiply and divide] > >The SPARC is brain dead [as were its designers] when it comes to doing >integer arithmetic. It can't multiply and it can't divide. There should be instructions on the order of "multiply step" and "divide step", each of which will do one of the 32 adds/subtracts and then shift. I'm not particularly fond of the SPARC architecture (don't like register windows), but this is a theoretical viewpoint and is not based on any direct exposure to assembly-language programming for it (translation: sorry, I can't give you any more help). Neither SPARC nor its designers were brain-dead when it was built. It's just that it is difficult to get multiplication and division (especially the latter) to run in 1 or 2 clock cycles. All instructions are supposed to execute in the ALU in 1 cycle; if the multiply and divide instructions take more time then the front of the processor pipeline has to be able to stall and this added complexity will slow down the entire processor. Thus they provide you with the tools to do your own multiply and divide. One of the benefits is that a compiler can optimize small multiplies and divides to make them execute quicker (i.e. multiply by 10 takes 4 steps instead of 32). It is important that you understand this if you are to write assembly language programs for a SPARC. If your instructions are not carefully optimized, the result could be slower than if you write in a high-level language and compile with its optimizer! (Unless the SPARC assembler performs instruction reordering.) P.S. Don't write a loop on the order of "MULSTEP, DEC, BNZ" or it will be incredibly slow. Unroll the loop 4 or 8 times (MULSTEP, MULSTEP, MULSTEP, MULSTEP, SUB 4, BNZ). Branches are expensive. -- David Chapman {known world}!decwrl!vlsisj!fndry!davidc vlsisj!fndry!davidc@decwrl.dec.com
bs@linus.UUCP (Robert D. Silverman) (12/29/89)
In article <15418@vlsisj.VLSI.COM> davidc@vlsisj.UUCP (David Chapman) writes: :In article <84768@linus.UUCP> bs@linus.mitre.org (Robert D. Silverman) writes: :>Does any have, of know of software for the SPARC [SUN-4] that will :>perform the following: :> :> [standard multiply and divide] :> :>The SPARC is brain dead [as were its designers] when it comes to doing :>integer arithmetic. It can't multiply and it can't divide. : :There should be instructions on the order of "multiply step" and "divide :step", each of which will do one of the 32 adds/subtracts and then shift. There is a multiply step instruction. There is no such support for division. It can take 200+ cycles to do a division on the SPARC [worst case]. A 32 x 32 bit unsigned multiply takes 45-47 cycles. Programs that have a significant number of multiplies and divides can run SLOWER on a SPARC than on a SUN-3. [I have such!] ONLY because of the slow multiply/divides. :I'm not particularly fond of the SPARC architecture (don't like register :windows), but this is a theoretical viewpoint and is not based on any :direct exposure to assembly-language programming for it (translation: :sorry, I can't give you any more help). : :Neither SPARC nor its designers were brain-dead when it was built. It's just I didn't say they were. I said they were with respect to arithmetic. I stand by that assertion. Most programs may not need multiply/divide in hardware. However, for those that do require it, not having it is a real KILLER of algorithms. :that it is difficult to get multiplication and division (especially the :latter) to run in 1 or 2 clock cycles. All instructions are supposed to I know of quite a few DSP chips that do multiplies in 1 cycles. Divides take a little longer [but not much; Ernie Brickell of SANDIA invented a hardware divide that works much faster than standard conditional-shift/ subtract]. :execute in the ALU in 1 cycle; if the multiply and divide instructions take :more time then the front of the processor pipeline has to be able to stall :and this added complexity will slow down the entire processor. : :Thus they provide you with the tools to do your own multiply and divide. See above. They are too slow. :One of the benefits is that a compiler can optimize small multiplies and :divides to make them execute quicker (i.e. multiply by 10 takes 4 steps That's fine for multiply-by-constant. Most programs that NEED multiply/divide are multiplying variables. :P.S. Don't write a loop on the order of "MULSTEP, DEC, BNZ" or it will be : incredibly slow. Unroll the loop 4 or 8 times (MULSTEP, MULSTEP, : MULSTEP, MULSTEP, SUB 4, BNZ). Branches are expensive. Agreed. In fact my 32 x 32 bit multiply consists of 32 calls to multstep and no looping at all. It is still slow. -- Bob Silverman #include <std.disclaimer> Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs Mitre Corporation, Bedford, MA 01730
irf@kuling.UUCP (Bo Thide') (12/31/89)
In order to put the discussion on SPARC arithmetics slowness into some perspective, we ran Plum Hall benchmark on one of our CISC and two of our RISC machines. The machines we used were: HP9000/370 MC68030/33 MHz with and without floating point accelerator (fpa) HP9000/835 HP-PA RISC/15 MHz Sun SparcStation1 (SS1) SPARC RISC/20 MHz ================================================================================ register auto auto int function auto int short long multiply call+ret double HP9000/370 (fpa -O) 0.22 0.26 0.22 1.35 3.96 0.62 HP9000/370 (-O) 0.21 0.26 0.22 1.35 3.08 1.21 HP9000/370 (fpa no -O) 0.26 0.40 0.36 1.44 4.42 1.56 HP9000/370 (no -O) 0.26 0.40 0.37 1.45 3.38 2.72 HP9000/835 (-O) 0.27 0.29 0.27 5.49 0.31 0.27 HP9000/835 (no -O) 0.29 0.53 0.45 5.62 0.31 0.59 Sun SS1 (-O) 0.29 0.33 0.30 19.5 0.49 0.59 Sun SS1 (no -O) 0.38 0.40 0.35 19.7 0.51 0.72 ================================================================================ There is no question that Sun SparcStation 1 is *extremely* slow on integer multiply even for a RISC architecture -- scaling the HP-PA results to the same clock speed as the SPARC (20 MHz) we see that HP-PA is about 470% faster than SPARC!! Our results also show that integer arithmetics on CISC (MC68030) is much faster than on RISC (HP-PA, SPARC). For those of you not familiar with the Plum-Hall benchmark here is some info: ---------------------------------------------------------------------------- [The following article appeared in "C Users Journal" May 1988. It describes the purpose and use of the enclosed benchmarks. ] SIMPLE BENCHMARKS FOR C COMPILERS by Thomas Plum Dr.Plum is the author of several books on C, including Efficient C (co- authored with Jim Brodie). He is Vice-Chair of the ANSI X3J11 Committee, and Chairman of Plum Hall Inc, which offers introductory and advanced sem- inars on C. Copyright (c) 1988, Plum Hall Inc We are placing into the public domain some simple benchmarks with several appealing properties: They are short enough to type while browsing at trade shows. They are protected against overly-aggressive compiler optimizations. They reflect empirically-observed operator frequencies in C programs. They give a C programmer information directly relevant to programming. In Efficient C, Jim Brodie and I described how useful it can be for a pro- grammer to have a general idea of how many microseconds it takes to execute the "average operator" on register int's, on auto short's, on auto long's, and on double data, as well as the time for an integer multiply, and the time to call-and-return from a function. These six numbers allow a programmer to make very good first-order estimates of the CPU time that a particular algorithm will take.
tot@frend.fi (Teemu Torma) (01/02/90)
In article <1313@kuling.UUCP> irf@kuling.UUCP (Bo Thide') writes:
================================================================================
register auto auto int function auto
int short long multiply call+ret double
HP9000/370 (fpa -O) 0.22 0.26 0.22 1.35 3.96 0.62
HP9000/370 (-O) 0.21 0.26 0.22 1.35 3.08 1.21
HP9000/370 (fpa no -O) 0.26 0.40 0.36 1.44 4.42 1.56
HP9000/370 (no -O) 0.26 0.40 0.37 1.45 3.38 2.72
HP9000/835 (-O) 0.27 0.29 0.27 5.49 0.31 0.27
HP9000/835 (no -O) 0.29 0.53 0.45 5.62 0.31 0.59
Sun SS1 (-O) 0.29 0.33 0.30 19.5 0.49 0.59
Sun SS1 (no -O) 0.38 0.40 0.35 19.7 0.51 0.72
================================================================================
There is no question that Sun SparcStation 1 is *extremely* slow on integer
multiply even for a RISC architecture -- scaling the HP-PA results to the
same clock speed as the SPARC (20 MHz) we see that HP-PA is about 470% faster
than SPARC!! Our results also show that integer arithmetics on CISC (MC68030)
is much faster than on RISC (HP-PA, SPARC).
Strange. I got much better int multiply results from Sun SS1. Gcc
version was 1.36.
register auto auto int function auto
int short long multiply call+ret double
Sun 4/60 (cc, no -O) 0.38 0.40 0.36 3.52 0.28 0.72
Sun 4/60 (cc, -O) 0.32 0.35 0.32 3.62 0.28 0.62
Sun 4/60 (cc, -O2) 0.30 0.33 0.30 3.32 0.26 0.61
Sun 4/60 (cc, -O3) 0.30 0.33 0.30 3.45 0.28 0.63
Sun 4/60 (gcc, no -O) 0.21 0.38 0.38 3.50 0.27 0.77
Sun 4/60 (gcc, -O) 0.18 0.21 0.18 3.57 0.27 0.42
Sun 4/60 (gcc, <2>) 0.17 0.22 0.19 3.67 0.28 0.42
Sun 4/60 (gcc, <3>) 0.17 0.21 0.18 3.52 0.27 0.40
gcc <2>: -fstrength-reduce -fcombine-regs -fomit-frame-pointer
gcc <3>: as <2> including -mno-epologue
--
Teemu Torma
Front End Oy, Helsinki, Finland
Internet: tot@nyse.frend.fi
whit@milton.acs.washington.edu (John Whitmore) (01/24/90)
In article <15418@vlsisj.VLSI.COM> davidc@vlsisj.UUCP (David Chapman) writes: >In article <84768@linus.UUCP> bs@linus.mitre.org (Robert D. Silverman) writes: >>Does any have, of know of software for the SPARC [SUN-4] that will >>perform the following: >> >> [standard multiply and divide] > . . . >There should be instructions on the order of "multiply step" and "divide >step", each of which will do one of the 32 adds/subtracts and then shift. > I strongly disagree. Smarter routines can optimize a lot of that kind of "microcode" away, and should do so given the opportunity. >Thus they provide you with the tools to do your own multiply and divide. >One of the benefits is that a compiler can optimize small multiplies and >divides to make them execute quicker (i.e. multiply by 10 takes 4 steps >instead of 32). Yes, EXACTLY. So extend the principle; take four-bit nibbles of the argument and do a 16-way JUMP (whatever the equivalent is on a SPARC) to sixteen cases like CASE x0000: RTN CASE x0001: add (to accumulator) CASE x0010: shift +1, add CASE x0011: subtract, shift+3, add CASE x0100: shift+3, add CASE x0101: add, shift+3, add CASE x0110: shift+2, add, shift+3, add CASE x0111: subtract, shift+4, add and if I can figure it out, you experts are getting bored by now. Four operations MAX for a four-bit multiplicand, opposed to 12 operations (estimated) for the one-bit "MULSTEP" approach. > >P.S. Don't write a loop on the order of "MULSTEP, DEC, BNZ" or it will be > incredibly slow. Unroll the loop 4 or 8 times (MULSTEP, MULSTEP, > MULSTEP, MULSTEP, SUB 4, BNZ). Branches are expensive. Hm. The principle is good, but don't think small. Unroll it to really large chunks of code. The "BNZ" is a bottleneck that shouldn't be employed when really large fanout of the code path can be done in ONE step. I seem to recall that the trick (above) was employed by a hardware multiplier IBM made, some decade or more ago. It still works. I am known for my brilliance, John Whitmore by those who do not know me well.
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (01/25/90)
Actually memory is cheap enough to use 64k for a lookup table. Make a 16 bit address from the two bytes to multiply as 0 78 15 |_______|______| | byte1 | byte2| |_______|______| and just pull the answer out of memory. Actually, the Intel practice of treating a 32 bit register as lots of little registers, with the 8 and 16 bit portions addressible by name, would be handy for some of this stuff, eliminating a shift. It should still be quite fast. -- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) "Stupidity, like virtue, is its own reward" -me