[comp.lang.c] Integer Multiply/Divide on Sparc

bs@linus.UUCP (Robert D. Silverman) (12/23/89)

Does any have, of know of software for the SPARC [SUN-4] that will
perform the following:

(1) Multiply two unsigned 32-bit integers, yielding a 64 bit product
    stored in two registers?

(2) Take a 64 bit product and divide it by a 32 bit (unsigned) integer
    yielding a 32-bit quotient and remainder?
 
or (3) Compute A*B Mod C directly?


The SPARC is brain dead [as were its designers] when it comes to doing
integer arithmetic. It can't multiply and it can't divide.

Trying to convert to floating point, do the arithmetic, then convert
back is far too slow. (I've already tried).

-- 
Bob Silverman
#include <std.disclaimer>
Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
Mitre Corporation, Bedford, MA 01730

davidc@vlsisj.VLSI.COM (David Chapman) (12/27/89)

In article <84768@linus.UUCP> bs@linus.mitre.org (Robert D. Silverman) writes:
>Does any have, of know of software for the SPARC [SUN-4] that will
>perform the following:
>
> [standard multiply and divide]
>
>The SPARC is brain dead [as were its designers] when it comes to doing
>integer arithmetic. It can't multiply and it can't divide.

There should be instructions on the order of "multiply step" and "divide 
step", each of which will do one of the 32 adds/subtracts and then shift.  
I'm not particularly fond of the SPARC architecture (don't like register 
windows), but this is a theoretical viewpoint and is not based on any 
direct exposure to assembly-language programming for it (translation:
sorry, I can't give you any more help).

Neither SPARC nor its designers were brain-dead when it was built.  It's just
that it is difficult to get multiplication and division (especially the 
latter) to run in 1 or 2 clock cycles.  All instructions are supposed to
execute in the ALU in 1 cycle; if the multiply and divide instructions take
more time then the front of the processor pipeline has to be able to stall
and this added complexity will slow down the entire processor.

Thus they provide you with the tools to do your own multiply and divide.  
One of the benefits is that a compiler can optimize small multiplies and 
divides to make them execute quicker (i.e. multiply by 10 takes 4 steps 
instead of 32).

It is important that you understand this if you are to write assembly
language programs for a SPARC.  If your instructions are not carefully
optimized, the result could be slower than if you write in a high-level
language and compile with its optimizer!  (Unless the SPARC assembler
performs instruction reordering.)

P.S.  Don't write a loop on the order of "MULSTEP, DEC, BNZ" or it will be
      incredibly slow.  Unroll the loop 4 or 8 times (MULSTEP, MULSTEP,
      MULSTEP, MULSTEP, SUB 4, BNZ).  Branches are expensive.
-- 
		David Chapman

{known world}!decwrl!vlsisj!fndry!davidc
vlsisj!fndry!davidc@decwrl.dec.com

bs@linus.UUCP (Robert D. Silverman) (12/29/89)

In article <15418@vlsisj.VLSI.COM> davidc@vlsisj.UUCP (David Chapman) writes:
:In article <84768@linus.UUCP> bs@linus.mitre.org (Robert D. Silverman) writes:
:>Does any have, of know of software for the SPARC [SUN-4] that will
:>perform the following:
:>
:> [standard multiply and divide]
:>
:>The SPARC is brain dead [as were its designers] when it comes to doing
:>integer arithmetic. It can't multiply and it can't divide.
:
:There should be instructions on the order of "multiply step" and "divide 
:step", each of which will do one of the 32 adds/subtracts and then shift.  
 
There is a multiply step instruction. There is no such support for division.
It can take 200+ cycles to do a division on the SPARC [worst case]. 
A 32 x 32 bit unsigned multiply takes 45-47 cycles. Programs that have a
significant number of multiplies and divides can run SLOWER on a SPARC
than on a SUN-3. [I have such!]  ONLY because of the slow multiply/divides.

:I'm not particularly fond of the SPARC architecture (don't like register 
:windows), but this is a theoretical viewpoint and is not based on any 
:direct exposure to assembly-language programming for it (translation:
:sorry, I can't give you any more help).
:
:Neither SPARC nor its designers were brain-dead when it was built.  It's just
 
I didn't say they were. I said they were with respect to arithmetic. I stand
by that assertion. Most programs may not need multiply/divide in hardware.
However, for those that do require it, not having it is a real KILLER 
of algorithms.

:that it is difficult to get multiplication and division (especially the 
:latter) to run in 1 or 2 clock cycles.  All instructions are supposed to
 
I know of quite a few DSP chips that do multiplies in 1 cycles. Divides
take a little longer [but not much; Ernie Brickell of SANDIA invented a
hardware divide that works much faster than standard conditional-shift/
subtract].

:execute in the ALU in 1 cycle; if the multiply and divide instructions take
:more time then the front of the processor pipeline has to be able to stall
:and this added complexity will slow down the entire processor.
:
:Thus they provide you with the tools to do your own multiply and divide.  
 
See above. They are too slow.

:One of the benefits is that a compiler can optimize small multiplies and 
:divides to make them execute quicker (i.e. multiply by 10 takes 4 steps 
 
That's fine for multiply-by-constant. Most programs that NEED multiply/divide
are multiplying variables.

:P.S.  Don't write a loop on the order of "MULSTEP, DEC, BNZ" or it will be
:      incredibly slow.  Unroll the loop 4 or 8 times (MULSTEP, MULSTEP,
:      MULSTEP, MULSTEP, SUB 4, BNZ).  Branches are expensive.
 
Agreed. In fact my 32 x 32 bit multiply consists of 32 calls to multstep
and no looping at all. It is still slow.

-- 
Bob Silverman
#include <std.disclaimer>
Internet: bs@linus.mitre.org; UUCP: {decvax,philabs}!linus!bs
Mitre Corporation, Bedford, MA 01730

irf@kuling.UUCP (Bo Thide') (12/31/89)

In order to put the discussion on SPARC arithmetics slowness into
some perspective, we ran Plum Hall benchmark on one of our CISC
and two of our RISC machines.  The machines we used were:

HP9000/370 MC68030/33 MHz with and without floating point accelerator (fpa)
HP9000/835 HP-PA RISC/15 MHz
Sun SparcStation1 (SS1) SPARC RISC/20 MHz

================================================================================
                      register      auto      auto       int  function      auto
                           int     short      long  multiply  call+ret    double
 HP9000/370 (fpa -O)     0.22      0.26      0.22      1.35      3.96      0.62
       HP9000/370 (-O)   0.21      0.26      0.22      1.35      3.08      1.21
 HP9000/370 (fpa no -O)  0.26      0.40      0.36      1.44      4.42      1.56
    HP9000/370 (no -O)   0.26      0.40      0.37      1.45      3.38      2.72
       HP9000/835 (-O)   0.27      0.29      0.27      5.49      0.31      0.27 
    HP9000/835 (no -O)   0.29      0.53      0.45      5.62      0.31      0.59
          Sun SS1 (-O)   0.29      0.33      0.30     19.5       0.49      0.59
       Sun SS1 (no -O)   0.38      0.40      0.35     19.7       0.51      0.72
================================================================================
There is no question that Sun SparcStation 1 is *extremely* slow on integer
multiply even for a RISC architecture -- scaling the HP-PA results to the
same clock speed as the SPARC (20 MHz) we see that HP-PA is about 470% faster
than SPARC!!  Our results also show that integer arithmetics on CISC (MC68030)
is much faster than on RISC (HP-PA, SPARC).

For those of you not familiar with the Plum-Hall benchmark here is some info:
----------------------------------------------------------------------------

[The following article appeared in  "C Users Journal" May 1988.
 It describes the purpose and use of the enclosed benchmarks. ]


SIMPLE BENCHMARKS FOR C COMPILERS

by Thomas Plum

Dr.Plum is the author of several books on  C,  including  Efficient  C  (co-
authored  with  Jim  Brodie).  He is Vice-Chair of the ANSI X3J11 Committee,
and Chairman of Plum Hall Inc, which offers introductory and  advanced  sem-
inars on C.

Copyright (c) 1988, Plum Hall Inc


We are placing into the public domain some simple  benchmarks  with  several
appealing properties:

    They are short enough to type while browsing at trade shows.

    They are protected against overly-aggressive compiler optimizations.

    They reflect empirically-observed operator frequencies in C programs.

    They give a C programmer information directly relevant to programming.

In Efficient C, Jim Brodie and I described how useful it can be for  a  pro-
grammer  to have a general idea of how many microseconds it takes to execute
the "average operator" on   register  int's,  on   auto  short's,  on   auto
long's,  and  on  double  data, as well as the time for an integer multiply,
and the time to call-and-return from a function.  These six numbers allow  a
programmer  to  make  very good first-order estimates of the CPU time that a
particular algorithm will take.

tot@frend.fi (Teemu Torma) (01/02/90)

In article <1313@kuling.UUCP> irf@kuling.UUCP (Bo Thide') writes:

   ================================================================================
			 register      auto      auto       int  function      auto
			      int     short      long  multiply  call+ret    double
    HP9000/370 (fpa -O)     0.22      0.26      0.22      1.35      3.96      0.62
	  HP9000/370 (-O)   0.21      0.26      0.22      1.35      3.08      1.21
    HP9000/370 (fpa no -O)  0.26      0.40      0.36      1.44      4.42      1.56
       HP9000/370 (no -O)   0.26      0.40      0.37      1.45      3.38      2.72
	  HP9000/835 (-O)   0.27      0.29      0.27      5.49      0.31      0.27 
       HP9000/835 (no -O)   0.29      0.53      0.45      5.62      0.31      0.59
	     Sun SS1 (-O)   0.29      0.33      0.30     19.5       0.49      0.59
	  Sun SS1 (no -O)   0.38      0.40      0.35     19.7       0.51      0.72
   ================================================================================
   There is no question that Sun SparcStation 1 is *extremely* slow on integer
   multiply even for a RISC architecture -- scaling the HP-PA results to the
   same clock speed as the SPARC (20 MHz) we see that HP-PA is about 470% faster
   than SPARC!!  Our results also show that integer arithmetics on CISC (MC68030)
   is much faster than on RISC (HP-PA, SPARC).

Strange. I got much better int multiply results from Sun SS1. Gcc
version was 1.36.

			 register      auto      auto       int  function      auto
			      int     short      long  multiply  call+ret    double
Sun 4/60 (cc, no -O)	  0.38      0.40      0.36      3.52      0.28      0.72    
Sun 4/60 (cc, -O)	  0.32      0.35      0.32      3.62      0.28      0.62    
Sun 4/60 (cc, -O2)	  0.30      0.33      0.30      3.32      0.26      0.61    
Sun 4/60 (cc, -O3)	  0.30      0.33      0.30      3.45      0.28      0.63    
Sun 4/60 (gcc, no -O)	  0.21      0.38      0.38      3.50      0.27      0.77    
Sun 4/60 (gcc, -O)	  0.18      0.21      0.18      3.57      0.27      0.42    
Sun 4/60 (gcc, <2>)	  0.17      0.22      0.19      3.67      0.28      0.42    
Sun 4/60 (gcc, <3>)	  0.17      0.21      0.18      3.52      0.27      0.40    

gcc <2>: -fstrength-reduce -fcombine-regs -fomit-frame-pointer
gcc <3>: as <2> including -mno-epologue
--
Teemu Torma
Front End Oy, Helsinki, Finland
Internet: tot@nyse.frend.fi

whit@milton.acs.washington.edu (John Whitmore) (01/24/90)

In article <15418@vlsisj.VLSI.COM> davidc@vlsisj.UUCP (David Chapman) writes:
>In article <84768@linus.UUCP> bs@linus.mitre.org (Robert D. Silverman) writes:
>>Does any have, of know of software for the SPARC [SUN-4] that will
>>perform the following:
>>
>> [standard multiply and divide]
> . . .
>There should be instructions on the order of "multiply step" and "divide 
>step", each of which will do one of the 32 adds/subtracts and then shift.  
>
	I strongly disagree.  Smarter routines can optimize a lot of 
that kind of "microcode" away, and should do so given the opportunity.

>Thus they provide you with the tools to do your own multiply and divide.  
>One of the benefits is that a compiler can optimize small multiplies and 
>divides to make them execute quicker (i.e. multiply by 10 takes 4 steps 
>instead of 32).
	Yes, EXACTLY.  So extend the principle; take four-bit nibbles
of the argument and do a 16-way JUMP (whatever the equivalent is
on a SPARC) to sixteen cases like
CASE x0000:	RTN
CASE x0001:	add (to accumulator)
CASE x0010:	shift +1, add
CASE x0011:	subtract, shift+3, add
CASE x0100:	shift+3, add
CASE x0101:	add, shift+3, add
CASE x0110:	shift+2, add, shift+3, add
CASE x0111:	subtract, shift+4, add
	and if I can figure it out, you experts are getting bored
by now.  Four operations MAX for a four-bit multiplicand, opposed
to 12 operations (estimated) for the one-bit "MULSTEP" approach.
>
>P.S.  Don't write a loop on the order of "MULSTEP, DEC, BNZ" or it will be
>      incredibly slow.  Unroll the loop 4 or 8 times (MULSTEP, MULSTEP,
>      MULSTEP, MULSTEP, SUB 4, BNZ).  Branches are expensive.

	Hm.  The principle is good, but don't think small.  Unroll it
to really large chunks of code.  The "BNZ" is a bottleneck that
shouldn't be employed when really large fanout of the code path
can be done in ONE step.
	I seem to recall that the trick (above) was employed by a
hardware multiplier IBM made, some decade or more ago.  It still works.

I am known for my brilliance,                 John Whitmore
 by those who do not know me well.

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (01/25/90)

  Actually memory is cheap enough to use 64k for a lookup table. Make a
16 bit address from the two bytes to multiply as
	0      78      15
	|_______|______|
	| byte1 | byte2|
	|_______|______|
and just pull the answer out of memory.

  Actually, the Intel practice of treating a 32 bit register as lots of
little registers, with the 8 and 16 bit portions addressible by name,
would be handy for some of this stuff, eliminating a shift. It should
still be quite fast.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me