[comp.arch] SPARC and the Slow Multiply Instruction

rudell@ic.uucp (Richard Rudell) (03/02/88)

I became interested in the Sun/4 multiply/divide timings after reading
Appendix E of the Sun/4 Architecture Manual.  This Appendix describes
in detail the assembly language routines in SPARC for doing 32-bit
multiply and divide.

SPARC has a multiply-step instruction for unsigned multiply; it takes
32 instructions plus a minimum of another 8 (even if both operands are
positive) to check for negative arguments in a signed multiply.  They
also place another 5 instructions at the head of the routine to check
for a 'short' case (i.e., if argument %o0 is less than 12 bits long).
Hence, 12-bit multiplies go faster than the general case.  Note that
the routine does NOT compare the two operands to see which is smaller
before this test; hence, the short case only helps if the compiler
knows that one operand is small (e.g., structure size constants?).
Function call overhead to reach the multiply routine is negligible
because the Sun compiler avoids the normal subroutine linkage on 'leaf'
routines.
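
For anyone who wants to see the shape of the algorithm, here is a rough
C model of a one-bit-per-step multiply with a short-operand fast path.
It is only a sketch under my own assumptions: the name mul_steps, the
exact 12-bit cutoff test, and the step counter are hypothetical, each
loop iteration merely stands in for one multiply-step instruction, and
the sign fixup described above is left out.  It is not the Appendix E
routine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a multiply-step loop.
 * Returns the low 32 bits of a*b and stores the number of steps taken. */
static uint32_t mul_steps(uint32_t a, uint32_t b, int *steps)
{
    /* Short case: only the first operand is tested, as in the text. */
    int nsteps = (a < (1u << 12)) ? 12 : 32;
    uint32_t acc = 0;
    int i;

    for (i = 0; i < nsteps; i++) {
        if (a & 1u)      /* low multiplier bit set: add the multiplicand */
            acc += b;
        a >>= 1;         /* move on to the next multiplier bit */
        b <<= 1;         /* multiplicand weight doubles each step */
    }
    *steps = nsteps;
    return acc;
}
```

Calling mul_steps(100, 789012, &n) takes the 12-step path, while
mul_steps(789012, 100, &n) takes all 32 steps for the same product,
which is the operand-order sensitivity noted above.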

The divide algorithm is an iterative algorithm and the complexity in
instruction count is difficult to predict.  It is interesting that Sun 
decided not to provide a divide-step instruction as well.
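
To illustrate why the instruction count of an iterative divide depends
on the operands, here is a plain shift-and-subtract divide in C.  This
is an assumption-laden sketch, not the Sun routine: the name div_steps
and the step counter are hypothetical, and the real code may use a
different iteration scheme entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical shift-and-subtract unsigned divide.
 * Returns n / d and stores the number of subtract steps taken.
 * Sketch only: divide by zero returns 0 instead of trapping. */
static uint32_t div_steps(uint32_t n, uint32_t d, int *steps)
{
    uint32_t q = 0;
    int shift = 0, count = 0, i;

    if (d == 0) {
        *steps = 0;
        return 0;
    }

    /* Align the divisor under the dividend.  How far it moves depends
     * on the relative sizes of the operands. */
    while (!(d & 0x80000000u) && (d << 1) <= n) {
        d <<= 1;
        shift++;
    }

    /* One quotient bit per iteration. */
    for (i = shift; i >= 0; i--) {
        q <<= 1;
        if (n >= d) {
            n -= d;
            q |= 1u;
        }
        d >>= 1;
        count++;
    }
    *steps = count;
    return q;
}
```

In this model the step count tracks the size of the quotient, so a
small quotient finishes in a handful of steps and a large one needs
many more.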

Anyways, here are some 'experimental' timing results.  Great care/pain
was taken to get around the Sun/4 optimizing compiler attempting to
move things out of loops, and dropping 'dead' code.  The goal was to
measure raw instruction times:
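
A minimal sketch of this kind of measurement in present-day C is shown
below.  It is not the code used for the numbers that follow; the names
op_a, op_b, sink and time_multiply_us are hypothetical.  The volatile
qualifiers are what keep the compiler from hoisting the multiply out of
the loop or dropping it as dead code.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* volatile operands: the compiler must reload them on every iteration */
static volatile uint32_t op_a = 123456u, op_b = 789012u;
/* volatile result: the store cannot be discarded as dead code */
static volatile uint32_t sink;

/* Returns the average time per loop iteration in microseconds.
 * Loop overhead is included; timing an empty loop the same way and
 * subtracting it gives a closer estimate of the multiply alone. */
static double time_multiply_us(long iterations)
{
    clock_t start, end;
    long i;

    start = clock();
    for (i = 0; i < iterations; i++)
        sink = op_a * op_b;
    end = clock();

    return (double)(end - start) * 1e6 / CLOCKS_PER_SEC / (double)iterations;
}
```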


Multiply: 123456*789012

	Microvax II	5.3 us
	Sun 3/180	2.5 us
	Sun 4/280	3.0 us


Divide: 1234567890 / 789012

	Microvax II	7.7 us
	Sun 3/180	7.2 us
	Sun 4/280	9.3 us

Multiply and divide are slow.  Let's hope they are not used frequently.
On the other hand, here are the instruction times for executing:

	    f(f(0)+f(0))

given

	f(a)
	{ 
	    return a; 
	}

(i.e., 3 func calls with 1 argument plus a single addition.)

	Microvax II	55.9 us
	Sun 3/180	10.3 us
	Sun 4/280	 1.4 us

Note that this is an ideal case for the Sun 4 because the register windows
NEVER overflow.

Therefore, we proclaim the Sun/4 a 1.7 VAX-MIP multiplier, a
0.8 VAX-MIP divider, and a 40 VAX-MIP function-caller.

The microcycle time of the uVAX is 200 ns.  This compares to the Sun/4
'microcycle' time of 62.5 ns (except the uVAX never misses in its
'instruction cache').  This means that the uVAX takes only 26 cycles for
the multiply given above; the Sun/4 takes 48 cycles.  For divide, the
counts are 38 cycles for the uVAX and 149 cycles for the Sun/4.
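
The ratios and cycle counts can be re-derived from the tables with a
few lines of C.  The helper names speed_ratio and cycles are
hypothetical; the inputs are the measured times and cycle times quoted
above.

```c
#include <assert.h>

/* How many times faster the Sun/4 is than the uVAX for one operation. */
static double speed_ratio(double uvax_us, double sun4_us)
{
    return uvax_us / sun4_us;
}

/* Cycles spent on one operation: time in us, cycle time in ns. */
static double cycles(double time_us, double cycle_ns)
{
    return time_us * 1000.0 / cycle_ns;
}

/* speed_ratio(5.3, 3.0)  is about 1.77  (multiply)
 * speed_ratio(7.7, 9.3)  is about 0.83  (divide)
 * speed_ratio(55.9, 1.4) is about 39.9  (function call)
 *
 * cycles(5.3, 200.0) = 26.5    cycles(3.0, 62.5) = 48.0
 * cycles(7.7, 200.0) = 38.5    cycles(9.3, 62.5) = 148.8
 */
```

The uVAX figures of 26 and 38 in the text are these values with the
fraction dropped; the Sun/4 divide figure of 149 is 148.8 rounded.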

Oh, by the way, the Sun 4/280 is an 8.6 VAX-MIP machine for the program
Espresso (a strictly integer bit-cruncher).  Apparently multiply is not
very important to Espresso.

*** EDITORIAL MODE ON ***

I like RISC.  The VAX Architecture is a big lose if you want to go
fast.  VAX is a pig at function call.  But this does not mean that
'simple is better' and 'RISC multiply is no slower than executing out
of microcode.'  When you consider operand setup, etc., microcode or
special hardware is a win for multiply and divide (even if the
algorithm is still 1- or 2-bit at a time sequential).  This may not be
a problem for many typical Unix applications, but there are some
applications (16-bit DSP simulation, for example) which may run very
slowly on a Sun/4 compared to a Sun/3 or uVax.

I agree heartily with the MIPS Co. (and AMD 29000) decisions to put
multiply/divide instructions in the architecture.  This allows
different models/versions to implement them differently without losing
binary compatibility; current models can either trap to the equivalent
multiply-step loops in software (AMD 29000), or use special 'hard-wired
microcode' (MIPS R2000).

Why did SPARC not include a 2-bit at a time multiply instruction, or
a special multiply instruction in the architecture to allow for future
growth?  Why is there no divide instruction or divide-step instruction?
Is this important to anyone?

*** EDITORIAL MODE OFF ***


---------------------------------------------------------------------------
Richard Rudell
Graduate Student				rudell@ic.berkeley.edu (ARPA)
205 Cory Hall					...!ucbvax!ic!rudell (UUCP)
University of California			(415) 642-3626
Berkeley, CA  94720

rrr@naucse.UUCP (Bob Rose) (03/04/88)

In article <1157@pasteur.Berkeley.Edu>, rudell@ic.uucp (Richard Rudell) writes:
> SPARC has a multiply-step instruction for unsigned multiply; it takes
> 32 instructions plus a minimum of another 8 (even if both operands are
> positive) to check for negative arguments in a signed multiply.
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This one always gets me. Assuming you are multiplying two values that are
the same size (i.e. 32 bits) and storing the result back in the same size
(i.e. 32 bits again), then you do not have to check for negative arguments
if you are using a 2's complement machine that ignores overflow.
There is a mathematical proof for this.
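
The idea behind the proof: a 32-bit pattern read as signed differs from
the same pattern read as unsigned by a multiple of 2^32, so the two
products also differ by a multiple of 2^32 and their low 32 bits agree.
A small C check of that claim follows; the function names are
hypothetical, and the signed version widens to 64 bits only so that the
reference product cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Low 32 bits of the product, treating both operands as unsigned. */
static uint32_t mul_low_unsigned(uint32_t a, uint32_t b)
{
    return a * b;   /* unsigned arithmetic wraps modulo 2^32 */
}

/* Low 32 bits of the product, treating both operands as signed.
 * The 64-bit product is exact; the cast keeps the low 32 bits. */
static uint32_t mul_low_signed(int32_t a, int32_t b)
{
    return (uint32_t)((int64_t)a * (int64_t)b);
}
```

For example, mul_low_signed(-7, 3) and
mul_low_unsigned((uint32_t)-7, 3) give the same bit pattern, which is
the 32-bit two's complement encoding of -21.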

I have a 68010 machine (the 68010 does not have a 32-bit multiply), and
in the multiply routine they do that stupid check for negative arguments.
After removing the check, the routine ran 40% faster.

Maybe my math degree isn't worthless after all.

Robert R. Rose
Northern Arizona University, Box 15600
Flagstaff, AZ 86011
                    .....!ihnp4!arizona!naucse!rrr