Daniel.Stodolsky@cs.cmu.edu (01/05/90)
If my memory serves me correctly, one should be able to compute a 32 x 32 -> 64 bit multiply with four 16x16 -> 32 multiplies, 4 32 bit adds and a few shifts. So why not put some of those big-memory killer micros to work and have a 16 by 16 multiply lookup table? It would consume 10 megs of core, but that's nothing for a KILLER MICRO. Assuming memory access with a cache miss is around 3 cycles and one can schedule to avoid register interlocking (as in HP-PA), it seems possible to do 32x32 -> 64 (results in registers) in about 20 cycles. This huge table could be available as a shared read-only data segment, so every process wouldn't need its own copy. Comments? Daniel Stodolsky Engineering Design Research Center Carnegie Mellon University danner@cs.cmu.edu
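[The decomposition described above can be sketched directly: split each 32-bit operand into 16-bit halves, form the four partial products, and shift them into place. The function names here are illustrative; `mul16` stands in for whatever 16x16 -> 32 primitive (hardware or lookup table) is available.]

```python
MASK16 = 0xFFFF

def mul16(a, b):
    # Stand-in for the 16x16 -> 32 primitive (e.g. the proposed lookup table).
    return (a & MASK16) * (b & MASK16)

def mul32x32_to_64(x, y):
    # Split each unsigned 32-bit operand into high and low 16-bit halves.
    xh, xl = x >> 16, x & MASK16
    yh, yl = y >> 16, y & MASK16
    # Four 16x16 partial products, shifted into position and summed.
    return (mul16(xh, yh) << 32) + \
           ((mul16(xh, yl) + mul16(xl, yh)) << 16) + \
           mul16(xl, yl)
```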
mark@mips.COM (Mark G. Johnson) (01/09/90)
In article <sZcrr1G00hMNI3EF5i@cs.cmu.edu> Daniel.Stodolsky@cs.cmu.edu writes: > > >So why not put some of big memory killer micros to >work and have a 16 by 16 multiply lookup table? It would consume 10 megs >of core, but that's nothing for a KILLER MICRO. Assume memory access >with a cache miss is around 3 cycles and one can schedule to avoid >register interlocking (as in HP-PA), it seems possible to do 32x32 -> 64 >( results in registers) in about 20 cycles. > The idea above proposes to use 80 million bits of RAM and 20 clock cycles to compute a 32b integer multiply. This is noncompetitive when compared to killer micros, which multiply more quickly and consume far less real estate. Instead of lookup tables they implement dedicated hardware: R6000: 16 cycles 32x32 -> 64 R3000: 12 cycles 32x32 -> 64 M88000: 4 cycles 32x32 -> 32 ** **88k computes the 32 lsb's of the 64b product (upper bits are discarded). If, as rumored, a new imul instruction is added to SPARC, you can bet its implementation in hardware will be just as fast as, or perhaps faster than, those above. This is what Doctor Ross is good at. And of course the T.I. folks have lots of DSP experience with blazing fast hw multipliers. -- -- Mark Johnson MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086 (408) 991-0208 mark@mips.com {or ...!decwrl!mips!mark}
johnl@esegue.segue.boston.ma.us (John R. Levine) (01/09/90)
In article <sZcrr1G00hMNI3EF5i@cs.cmu.edu> Daniel.Stodolsky@cs.cmu.edu writes: >... one should be able to compute a 32 x 32 -> 64 bit multiply with four >16x16 -> 32 multiplies, 4 32 bit adds and a few shifts. So why not put some >big memory killer micros to work and have a 16 by 16 multiply lookup table? >It would consume 10 megs of core, but that's nothing for a KILLER MICRO. 10 meg? A 16x16 lookup table is 2^32 entries of 32 bits each, which by my arithmetic is 16 gigabytes of memory, a little rich for even a mainframe these days. And imagine testing the damn thing. You have to test every word and verify it against an independently computed value; if each test took 5us, a single pass through it would take six hours. Perhaps this isn't the right technology yet. I do know there are medium sized combinatorial multipliers; perhaps pushing a little on those would be more productive. -- John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650 johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl "Now, we are all jelly doughnuts."
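[The arithmetic behind the 16 GB and six-hour figures above checks out, and is easy to reproduce: the table is indexed by both 16-bit operands, not one.]

```python
entries = 2 ** 32            # one entry per (a, b) pair of 16-bit operands
bytes_per_entry = 4          # each 16x16 product fits in 32 bits
table_bytes = entries * bytes_per_entry
gigabytes = table_bytes // 2 ** 30       # 16 GB, as stated

test_seconds = entries * 5e-6            # 5 microseconds per verified entry
test_hours = test_seconds / 3600         # roughly six hours
```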
ejp@bohra.cpg.oz (Esmond Pitt) (01/09/90)
In article <sZcrr1G00hMNI3EF5i@cs.cmu.edu> Daniel.Stodolsky@cs.cmu.edu writes: > >If my memory serves me correctly, one should be able to compute a 32 x >32 -> 64 bit multiply with four 16x16 -> 32 multiplies, 4 32 bit adds [unfortunate spelling error deleted] Why use a quadratic algorithm? It can be done with THREE multiplications. See Knuth Vol II, #4.3.3. In general, two 2N-bit numbers can be multiplied using three N-bit multiplications; applied recursively this gives O(N^lg 3) single-precision multiplies, and still better is possible for large N. -- Esmond Pitt, Computer Power Group ejp@bohra.cpg.oz
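[The three-multiplication trick referenced above (Karatsuba's method, Knuth Vol. II, section 4.3.3) recovers both cross terms from a single extra product. A sketch for the 32-bit case, one level of recursion with N = 16; note the third multiply has 17-bit operands:]

```python
MASK16 = 0xFFFF

def karatsuba32(x, y):
    # Split unsigned 32-bit operands into 16-bit halves.
    xh, xl = x >> 16, x & MASK16
    yh, yl = y >> 16, y & MASK16
    hi  = xh * yh                    # multiply 1
    lo  = xl * yl                    # multiply 2
    mid = (xh + xl) * (yh + yl)      # multiply 3 (operands up to 17 bits)
    # mid - hi - lo == xh*yl + xl*yh, i.e. both cross terms at once.
    return (hi << 32) + ((mid - hi - lo) << 16) + lo
```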
jayl@bit.UUCP (Jay Lessert) (01/10/90)
In article <34259@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes: > estate. Instead of lookup tables they implement dedicated hardware: > > R6000: 16 cycles 32x32 -> 64 > R3000: 12 cycles 32x32 -> 64 > M88000: 4 cycles 32x32 -> 32 ** > Just for jollies, thought I'd point out that the B3110 Floating Point Multiplier used in the RC6280 will do a 32x32 -> 64 integer multiply in about 50ns, though I doubt that the 6280 uses it... No integer divide on the B3110, though. Sorry! :-) -- Jay Lessert {ogccse,sun,decwrl}!bit!jayl Bipolar Integrated Technology, Inc. 503-629-5490 (fax)503-690-1498
jgk@osc.COM (Joe Keane) (01/10/90)
In article <sZcrr1G00hMNI3EF5i@cs.cmu.edu> Daniel.Stodolsky@cs.cmu.edu writes: >If my memory serves me correctly, one should be able to compute a 32 x >32 -> 64 bit multiply with four 16x16 -> 32 multiplies, 4 32 bit adds >and a few shifts. So why not put some of big memory killer micros to >work and have a 16 by 16 multiply lookup table? It would consume 10 megs >of core, but that's nothing for a KILLER MICRO. Assume memory access >with a cache miss is around 3 cycles and one can schedule to avoid >register interlocking (as in HP-PA), it seems possible to do 32x32 -> 64 >( results in registers) in about 20 cycles. I'm a big supporter of lookup tables, and I've dealt with some pretty strange ones. One is a 12trits -> 16bits table to evaluate Othello positions. Or a table of rint(log(1+exp(x*LB))/LB) to do logarithm-representation addition. There's an interesting way to do multiplies which doesn't involve such a big table. You can use a table of squares and more adds and subtracts. It gets a little hairy to do 32x32 -> 64 in 16-bit chunks, but the table only takes up 512KB or so. I don't know how you got 10MB for your table. Of course, since everyone uses this table, you can put it in ROM. But wait, you can use fewer gates with a PLA, or just random logic. And once you've done that, you might as well throw in the shifter and adder. What the heck, buy a DSP, that's what it's for. :-)
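[The squares-table method mentioned above rests on the identity a*b = ((a+b)^2 - (a-b)^2)/4. Storing "quarter-squares" floor(x^2/4) keeps the identity exact for integers, since a+b and a-b always have the same parity. A sketch for one 16x16 chunk; the table is indexed by the 17-bit sum, which is roughly the 512KB Joe cites:]

```python
# Quarter-square table for all possible sums of two 16-bit operands.
TABLE = [x * x // 4 for x in range(2 ** 17)]

def mul16_by_table(a, b):
    # 16x16 -> 32 multiply via two table lookups and one subtract.
    assert 0 <= a < 2 ** 16 and 0 <= b < 2 ** 16
    return TABLE[a + b] - TABLE[abs(a - b)]
```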
USER@osf.org (Michael Meissner) (01/10/90)
In article <34259@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes: ... | The idea above proposes to use 80 million bits of RAM and 20 clock cycles | to compute a 32b integer multiply. This in noncompetitive when compared | to killer micros, which multiply more quickly and consume far less real | estate. Instead of lookup tables they implement dedicated hardware: | | R6000: 16 cycles 32x32 -> 64 | R3000: 12 cycles 32x32 -> 64 | M88000: 4 cycles 32x32 -> 32 ** | | **88k computes the 32 lsb's of the 64b product (upper bits are discarded). Actually, if I remember chapter 7 of the 88100 user's manual, a multiply takes 6 cycles (1 in FP1, 3 in the multiplier stage, 1 in FPLAST, and 1 writeback). Logically, the writeback phase should be available to feed forward, which would shave off 1 cycle. However, since none of the floating point operations do feed forwarding, I wouldn't be surprised if integer multiply/divide don't feed forward either. As alluded to in an earlier article, multiple multiplications can be done in parallel, since each cycle the multiplier advances the pipeline. Floating point adds can similarly be pipelined. -- Michael Meissner email: meissner@osf.org phone: 617-621-8861 Open Software Foundation, 11 Cambridge Center, Cambridge, MA Catproof is an oxymoron, Childproof is nearly so
marvin@oakhill.UUCP (Marvin Denman) (01/11/90)
In article <USER.90Jan10090700@pmax27.osf.org> (Michael Meissner) writes: >In article <34259@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes: >| R6000: 16 cycles 32x32 -> 64 >| R3000: 12 cycles 32x32 -> 64 >| M88000: 4 cycles 32x32 -> 32 ** >| >| **88k computes the 32 lsb's of the 64b product (upper bits are discarded). > >Actually if I remember chapter 7 of the 88100 user's manual, a >multiply 6 cycles (1 in FP1, 3 in the multiplier stage, 1 in FPLAST, >and 1 writeback). Logically, the writeback phase should be available >to be feed forward, which logically shaves off 1 cycle. However, >since non of the floating point operations do feed forwarding, I >wouldn't be surprised if integer multiply/divide don't feed forward >either. As alluded to in an earlier article, multiple multiplications >can be done in parallel, since each cycle, the multiplier advances the >pipeline. Floating point adds can similarly be pipelined. I think you may have a slight parity error in your memory. Integer multiply on the 88100 does take 4 cycles of latency just like Mark Johnson stated. Single precision floating point multiply however takes 6 cycles latency. Your comments about feed forward confuse me slightly. The latency numbers that Motorola quotes are all in terms of when another instruction can use the result so they already account for feed forward. Perhaps the source of your confusion is that double precision results have one extra clock of latency when feeding forward into store double instructions because of implementation considerations. Double precision results do feed forward with no delay to other double precision operations. A key point of the 88100 design was its pipelined design. Single precision multiply and add along with integer multiply are fully pipelined. Double precision operations are pipelined except for stalling for one clock on either end of the pipe. This allows very high throughput on well scheduled code. 
Currently, compilers do very little in the way of code scheduling for floating point (at least in the code I have looked at). When compilers begin to take advantage of the 88100 floating point pipelines there will be a jump in performance. The current compilers are already making progress, which will be seen in the newest release of the SPEC benchmark numbers when they are published. There is still much room for improvement. -- Marvin Denman Motorola 88000 Design cs.utexas.edu!oakhill!marvin