Daniel.Stodolsky@cs.cmu.edu (01/05/90)
If my memory serves me correctly, one should be able to compute a 32 x 32 -> 64 bit multiply with four 16x16 -> 32 multiplies, 4 32 bit adds and a few shifts. So why not put some of those big-memory killer micros to work and have a 16 by 16 multiply lookup table? It would consume 10 megs of core, but that's nothing for a KILLER MICRO. Assuming memory access with a cache miss is around 3 cycles and one can schedule to avoid register interlocking (as in HP-PA), it seems possible to do 32x32 -> 64 (results in registers) in about 20 cycles. This huge table could be available as a shared read-only data segment, so every process wouldn't need its own copy. Comments? Daniel Stodolsky Engineering Design Research Center Carnegie Mellon University danner@cs.cmu.edu
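[The decomposition described above can be sketched directly: split each 32-bit operand into 16-bit halves, form the four partial products, and shift them into place. The function names here are illustrative; `mul16` stands in for whatever 16x16 -> 32 primitive (hardware or lookup table) is available.]

```python
MASK16 = 0xFFFF

def mul16(a, b):
    # Stand-in for the 16x16 -> 32 primitive (e.g. the proposed lookup table).
    return (a & MASK16) * (b & MASK16)

def mul32x32_to_64(x, y):
    # Split each unsigned 32-bit operand into high and low 16-bit halves.
    xh, xl = x >> 16, x & MASK16
    yh, yl = y >> 16, y & MASK16
    # Four 16x16 partial products, shifted into position and summed.
    return (mul16(xh, yh) << 32) + \
           ((mul16(xh, yl) + mul16(xl, yh)) << 16) + \
           mul16(xl, yl)
```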
mark@mips.COM (Mark G. Johnson) (01/09/90)
In article <sZcrr1G00hMNI3EF5i@cs.cmu.edu> Daniel.Stodolsky@cs.cmu.edu writes: > > >So why not put some of big memory killer micros to >work and have a 16 by 16 multiply lookup table? It would consume 10 megs >of core, but that's nothing for a KILLER MICRO. Assume memory access >with a cache miss is around 3 cycles and one can schedule to avoid >register interlocking (as in HP-PA), it seems possible to do 32x32 -> 64 >( results in registers) in about 20 cycles. > The idea above proposes to use 80 million bits of RAM and 20 clock cycles to compute a 32b integer multiply. This is noncompetitive when compared to killer micros, which multiply more quickly and consume far less real estate. Instead of lookup tables they implement dedicated hardware: R6000: 16 cycles 32x32 -> 64 R3000: 12 cycles 32x32 -> 64 M88000: 4 cycles 32x32 -> 32 ** **88k computes the 32 lsb's of the 64b product (upper bits are discarded). If, as rumored, a new imul instruction is added to SPARC, you can bet its implementation in hardware will be just as fast as, or perhaps faster than, those above. This is what Doctor Ross is good at. And of course the T.I. folks have lots of DSP experience with blazing fast hw multipliers. -- -- Mark Johnson MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086 (408) 991-0208 mark@mips.com {or ...!decwrl!mips!mark}
johnl@esegue.segue.boston.ma.us (John R. Levine) (01/09/90)
In article <sZcrr1G00hMNI3EF5i@cs.cmu.edu> Daniel.Stodolsky@cs.cmu.edu writes: >... one should be able to compute a 32 x 32 -> 64 bit multiply with four >16x16 -> 32 multiplies, 4 32 bit adds and a few shifts. So why not put some >big memory killer micros to work and have a 16 by 16 multiply lookup table? >It would consume 10 megs of core, but that's nothing for a KILLER MICRO. 10 meg? A 16x16 lookup table is 2^32 entries of 32 bits each, which by my arithmetic is 16 gigabytes of memory, a little rich for even a mainframe these days. And imagine testing the damn thing. You have to test every word and verify it against an independently computed value; if each test took 5us, a single pass through it would take six hours. Perhaps this isn't the right technology yet. I do know there are medium sized combinatorial multipliers; perhaps pushing a little on those would be more productive. -- John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650 johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl "Now, we are all jelly doughnuts."
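[The arithmetic behind the 16 GB and six-hour figures above checks out, and is easy to reproduce: the table is indexed by both 16-bit operands, not one.]

```python
entries = 2 ** 32            # one entry per (a, b) pair of 16-bit operands
bytes_per_entry = 4          # each 16x16 product fits in 32 bits
table_bytes = entries * bytes_per_entry
gigabytes = table_bytes // 2 ** 30       # 16 GB, as stated

test_seconds = entries * 5e-6            # 5 microseconds per verified entry
test_hours = test_seconds / 3600         # roughly six hours
```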
ejp@bohra.cpg.oz (Esmond Pitt) (01/09/90)
In article <sZcrr1G00hMNI3EF5i@cs.cmu.edu> Daniel.Stodolsky@cs.cmu.edu writes: > >If my memory serves me correctly, one should be able to compute a 32 x >32 -> 64 bit multiply with four 16x16 -> 32 multiplies, 4 32 bit adds [unfortunate spelling error deleted] Why use a quadratic algorithm? It can be done with THREE multiplications. See Knuth Vol II, #4.3.3. In general, two 2N-bit numbers can be multiplied using three N-bit multiplications; applied recursively this gives O(N^lg 3) single-precision multiplies, and still better is possible for large N. -- Esmond Pitt, Computer Power Group ejp@bohra.cpg.oz
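[The three-multiplication trick referenced above (Karatsuba's method, Knuth Vol. II, section 4.3.3) recovers both cross terms from a single extra product. A sketch for the 32-bit case, one level of recursion with N = 16; note the third multiply has 17-bit operands:]

```python
MASK16 = 0xFFFF

def karatsuba32(x, y):
    # Split unsigned 32-bit operands into 16-bit halves.
    xh, xl = x >> 16, x & MASK16
    yh, yl = y >> 16, y & MASK16
    hi  = xh * yh                    # multiply 1
    lo  = xl * yl                    # multiply 2
    mid = (xh + xl) * (yh + yl)      # multiply 3 (operands up to 17 bits)
    # mid - hi - lo == xh*yl + xl*yh, i.e. both cross terms at once.
    return (hi << 32) + ((mid - hi - lo) << 16) + lo
```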
jayl@bit.UUCP (Jay Lessert) (01/10/90)
In article <34259@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes: > estate. Instead of lookup tables they implement dedicated hardware: > > R6000: 16 cycles 32x32 -> 64 > R3000: 12 cycles 32x32 -> 64 > M88000: 4 cycles 32x32 -> 32 ** > Just for jollies, thought I'd point out that the B3110 Floating Point Multiplier used in the RC6280 will do a 32x32 -> 64 integer multiply in about 50ns, though I doubt that the 6280 uses it... No integer divide on the B3110, though. Sorry! :-) -- Jay Lessert {ogccse,sun,decwrl}!bit!jayl Bipolar Integrated Technology, Inc. 503-629-5490 (fax)503-690-1498
jgk@osc.COM (Joe Keane) (01/10/90)
In article <sZcrr1G00hMNI3EF5i@cs.cmu.edu> Daniel.Stodolsky@cs.cmu.edu writes: >If my memory serves me correctly, one should be able to compute a 32 x >32 -> 64 bit multiply with four 16x16 -> 32 multiplies, 4 32 bit adds >and a few shifts. So why not put some of big memory killer micros to >work and have a 16 by 16 multiply lookup table? It would consume 10 megs >of core, but that's nothing for a KILLER MICRO. Assume memory access >with a cache miss is around 3 cycles and one can schedule to avoid >register interlocking (as in HP-PA), it seems possible to do 32x32 -> 64 >( results in registers) in about 20 cycles. I'm a big supporter of lookup tables, and I've dealt with some pretty strange ones. One is a 12trits -> 16bits table to evaluate Othello positions. Or a table of rint(log(1+exp(x*LB))/LB) to do logarithm-representation addition. There's an interesting way to do multiplies which doesn't involve such a big table. You can use a table of squares and more adds and subtracts. It gets a little hairy to do 32x32 -> 64 in 16-bit chunks, but the table only takes up 512KB or so. I don't know how you got 10MB for your table. Of course, since everyone uses this table, you can put it in ROM. But wait, you can use fewer gates with a PLA, or just random logic. And once you've done that, you might as well throw in the shifter and adder. What the heck, buy a DSP, that's what it's for. :-)
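[The squares-table method mentioned above rests on the identity a*b = ((a+b)^2 - (a-b)^2)/4. Storing "quarter-squares" floor(x^2/4) keeps the identity exact for integers, since a+b and a-b always have the same parity. A sketch for one 16x16 chunk; the table is indexed by the 17-bit sum, which is roughly the 512KB Joe cites:]

```python
# Quarter-square table for all possible sums of two 16-bit operands.
TABLE = [x * x // 4 for x in range(2 ** 17)]

def mul16_by_table(a, b):
    # 16x16 -> 32 multiply via two table lookups and one subtract.
    assert 0 <= a < 2 ** 16 and 0 <= b < 2 ** 16
    return TABLE[a + b] - TABLE[abs(a - b)]
```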
USER@osf.org (Michael Meissner) (01/10/90)
In article <34259@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes: ... | The idea above proposes to use 80 million bits of RAM and 20 clock cycles | to compute a 32b integer multiply. This in noncompetitive when compared | to killer micros, which multiply more quickly and consume far less real | estate. Instead of lookup tables they implement dedicated hardware: | | R6000: 16 cycles 32x32 -> 64 | R3000: 12 cycles 32x32 -> 64 | M88000: 4 cycles 32x32 -> 32 ** | | **88k computes the 32 lsb's of the 64b product (upper bits are discarded). Actually, if I remember chapter 7 of the 88100 user's manual, a multiply takes 6 cycles (1 in FP1, 3 in the multiplier stage, 1 in FPLAST, and 1 writeback). Logically, the writeback phase should be available to feed forward, which would shave off 1 cycle. However, since none of the floating point operations do feed forwarding, I wouldn't be surprised if integer multiply/divide don't feed forward either. As alluded to in an earlier article, multiple multiplications can be done in parallel, since each cycle the multiplier advances the pipeline. Floating point adds can similarly be pipelined. -- Michael Meissner email: meissner@osf.org phone: 617-621-8861 Open Software Foundation, 11 Cambridge Center, Cambridge, MA Catproof is an oxymoron, Childproof is nearly so
marvin@oakhill.UUCP (Marvin Denman) (01/11/90)
In article <USER.90Jan10090700@pmax27.osf.org> (Michael Meissner) writes: >In article <34259@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes: >| R6000: 16 cycles 32x32 -> 64 >| R3000: 12 cycles 32x32 -> 64 >| M88000: 4 cycles 32x32 -> 32 ** >| >| **88k computes the 32 lsb's of the 64b product (upper bits are discarded). > >Actually if I remember chapter 7 of the 88100 user's manual, a >multiply 6 cycles (1 in FP1, 3 in the multiplier stage, 1 in FPLAST, >and 1 writeback). Logically, the writeback phase should be available >to be feed forward, which logically shaves off 1 cycle. However, >since non of the floating point operations do feed forwarding, I >wouldn't be surprised if integer multiply/divide don't feed forward >either. As alluded to in an earlier article, multiple multiplications >can be done in parallel, since each cycle, the multiplier advances the >pipeline. Floating point adds can similarly be pipelined. I think you may have a slight parity error in your memory. Integer multiply on the 88100 does take 4 cycles of latency just like Mark Johnson stated. Single precision floating point multiply however takes 6 cycles latency. Your comments about feed forward confuse me slightly. The latency numbers that Motorola quotes are all in terms of when another instruction can use the result so they already account for feed forward. Perhaps the source of your confusion is that double precision results have one extra clock of latency when feeding forward into store double instructions because of implementation considerations. Double precision results do feed forward with no delay to other double precision operations. A key point of the 88100 design was its pipelined design. Single precision multiply and add along with integer multiply are fully pipelined. Double precision operations are pipelined except for stalling for one clock on either end of the pipe. This allows very high throughput on well scheduled code. 
Currently, compilers do very little in the way of code scheduling for floating point (at least in the code I have looked at). When compilers begin to take advantage of the 88100 floating point pipelines there will be a jump in performance. The current compilers are already making progress, which will be seen in the newest release of the SPEC benchmark numbers when they are published. There is still much room for improvement. -- Marvin Denman Motorola 88000 Design cs.utexas.edu!oakhill!marvin