mark@mips.COM (Mark G. Johnson) (10/26/89)
In article <3300080@m.cs.uiuc.edu> gillies@m.cs.uiuc.edu writes:
>
>At some point I heard that MIPS pulled out just about every stopper to
>speed up the floating point speed of the R2000/R3000.  In other words,
>they hardwired all the FPU ops, and provided 32 or 64-bit circuitry
>wherever it was needed, and used all the optimal designs (like carry
>lookahead and wallace trees -- please excuse my ignorance of
>arithmetic circuitry) in their arithmetic unit?
>
>So now they just wait for device technology and heavy pipelining to
>speed up their chip?  Couldn't Crays make a small comeback by
>exploiting their 1's complement arithmetic, which is supposed to be an
>inherently faster number system for digital implementation?

MIPS chose the IEEE-754 standard for floating point representation and
arithmetic; the other candidates were perceived to be marketing suicide.

The FP device is indeed hardwired (i.e. no microcode) and it has several
different functional units, so that four different FP operations can be
executing in parallel ("overlapped"): an FP load/store, an FP
add/subtract, an FP multiply, and an FP divide.  {It's described in
IEEE Micro, June 88.}

However, the chip contains only 75,000 transistors.  So there were lots
of potentially nifty hardware ideas that had to be omitted from the
design, to stay within budget.  Specifically, there are no Wallace trees
in the R3010.  The carry circuits on the R3010 are not the classical
"lookahead" variety.  The register file only has 4 ports.  Etc etc etc.

Of course, now that it is possible to cram a lot more than 75,000
transistors on a chip, perhaps it's reasonable to postulate that the
additional transistors could be used to improve f.p. performance.
(On top of speedups due to faster technology.)
--
 -- Mark Johnson
MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
(408) 991-0208    mark@mips.com  {or ...!decwrl!mips!mark}
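As a purely illustrative sketch of what "overlapped" buys -- this is
hypothetical C, not anything taken from the R3010 documentation -- the
three statements below are mutually independent, so a scheduler could
keep the divide, multiply, and add/subtract units all busy at once:

    /* Hypothetical example: independent FP operations that could
       execute overlapped in separate functional units. */
    double overlap_demo(double a, double b, double c, double d)
    {
        double q = a / b;   /* long-latency divide starts first       */
        double p = c * d;   /* multiply unit proceeds in parallel     */
        double s = a + c;   /* add/subtract unit proceeds in parallel */
        return q + p + s;   /* data dependences serialize only here   */
    }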
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/26/89)
In article <30100@obiwan.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>In article <3300080@m.cs.uiuc.edu> gillies@m.cs.uiuc.edu writes:
> >At some point I heard that MIPS pulled out just about every stopper to
> >speed up the floating point speed of the R2000/R3000.  In other words,

Actually, MIPSCo has a lot of room for improvement, as good as the R3010
is: by throwing more hardware at the job, they could lower f.p. multiply
to two cycles of latency and division to 8 cycles of latency, with full
segmentation (one operation started per clock cycle) on every operation.
They could then add vector instructions, and more memory bandwidth, to
make full use of the additional FPU bandwidth.  Fortunately, there are
ideas good for more improvement up to about 1 million gates, based on
Cray and CDC/ETA designs.  (Some of the CDC Cyber 205 models actually
had fully segmented division, but it added a lot of extra real
estate...)

The first increment of improvement could come about by segmenting
addition only and giving the multiply unit its own round/normalize
capability.  This might result in something like (a wild guess) a 50%
improvement on codes like Linpack, without adding very many more
transistors.

>However, the chip contains only 75,000 transistors.  So there were

(A question, I know, which can't be answered precisely, but how many
"gates" is that, very roughly...?)

>transistors on a chip, perhaps it's reasonable to postulate that the
>additional transistors could be used to improve f.p. performance.
>(On top of speedups due to faster technology)

Another approach would be to integrate the FPU on the same chip as the
CPU.  When you consider the cost of off-chip communication vs. the low
on-chip gate delays you get these days, it might make more sense to
first put the entire CPU/FPU on one chip, using the current
low-transistor-count design, and then add more gates to the FPU as
space allows with future technology.  (Hmmm, sounds like another
competitor's approach...)

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035    Phone:  (415)694-6117
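Some back-of-the-envelope arithmetic shows what full segmentation buys
(the latencies here are made-up round numbers, not measurements of any
real part).  For 100 independent multiplies:

    unpipelined, 5-cycle multiply:    100 * 5       = 500 cycles
    segmented, 2-cycle latency,
    one issue per clock:              2 + (100 - 1) = 101 cycles

That is roughly a 5x throughput gain, which is why segmentation and
vector instructions (to supply the stream of independent operations)
go together.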
aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (10/28/89)
>[Hugh LaMaster, commenting on MIPS' FP unit]:
>Fortunately, there are ideas good for more improvement up to about
>1 million gates, based on Cray and CDC/ETA designs.

Can you provide any references or publications on these designs?
Hell - if anyone has a technical reference or gate layouts, I'd be
interested in seeing them...

>(Some of the CDC Cyber 205 models actually had fully segmented
>division, but it added a lot of extra real estate...)
>
>The first increment of improvement could come about by segmenting
>addition only and giving the multiply unit its own round/normalize
>capability.

What do you mean by "segmenting"?  Do you mean pipelining - e.g. so
that divide doesn't need to use the same hardware over and over again
for several cycles?

By the way, can anyone provide details on Cyrix's 80387-superset
floating point chip?  I have heard, for example, that it does IEEE
extended (80 bit) division in 4 cycles.  It uses quotient prediction of
17 bits, with a 17 by 69 bit multiplier array used in the iteration.
(Let's see, 17 summands is no more than 5 3:2 CSA levels - actually
fewer.  That's fairly long as divider cycles go, according to
Fandrianto, but I suppose that it needs to be that long in order to
predict 17 bits.)  Does anyone know how they predict all 17 bits?  Is
it one level, or is it several levels (the way Taylor got 8-bit
prediction by doing 4-bit radix-16 prediction twice)?

And in another FP question, I notice that the IBM America (RT-2?) has a
multiply-accumulate instruction that performs no intermediate rounding.
I.e. it is ROUND(A*B+C).  This is "more accurate" but does not
necessarily produce the same answers as ROUND(ROUND(A*B)+C).  I wonder
what the numerical analysis mavens have to say about this: is it okay
to get an answer more accurate than IEEE, or is this something to be
avoided?
--
Andy "Krazy" Glew, Motorola MCD, aglew@urbana.mcd.mot.com
1101 E. University, Urbana, IL 61801, USA.  {uunet!,}uiucuxc!udc!aglew
My opinions are my own; I indicate my company only so that the reader
may account for any possible bias I may have towards our products.
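A quick way to see the two answers diverge, as a sketch using fma()
from C99's <math.h> -- a much later facility than anything available
here, but one that computes exactly the single-rounding ROUND(A*B+C)
described above:

    #include <math.h>     /* fma(), ldexp(); link with -lm */
    #include <stdio.h>

    int main(void)
    {
        /* a*a = 1 + 2^-26 + 2^-54; the 2^-54 term doesn't fit in a
           53-bit double significand, so a separate multiply rounds
           it away.  The fused form keeps it. */
        double a  = 1.0 + ldexp(1.0, -27);
        double ab = a * a;                   /* ROUND(A*B)              */

        double unfused = ab - ab;            /* ROUND(ROUND(A*B)+C) = 0 */
        double fused   = fma(a, a, -ab);     /* ROUND(A*B+C) = 2^-54    */

        printf("unfused = %g, fused = %g\n", unfused, fused);
        return 0;
    }

With C = -ROUND(A*B), the fused form recovers the rounding error of
the multiply exactly; that trick is the basis of double-double
arithmetic.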
ok@cs.mu.oz.au (Richard O'Keefe) (10/29/89)
In article <AGLEW.89Oct27193246@chant.urbana.mcd.mot.com>,
aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) writes:
> And in another FP question, I notice that the IBM America (RT-2?) has
> a multiply-accumulate instruction that performs no intermediate
> rounding.  I.e. it is ROUND(A*B+C).  This is "more accurate" but does
> not necessarily produce the same answers as ROUND(ROUND(A*B)+C).
> I wonder what the numerical analysis mavens have to say about this:

Check
    Compiler Support for Floating-Point Computation
    Charles Farnum
    Software--Practice and Experience 18(7) 701-709 July 1988

Also check ANSI/IEEE Std 854-1987.  A relevant part of that standard
says:
    "Some languages place the results of intermediate calculations in
    destinations beyond the user's control.  Nonetheless, this
    standard defines the result of an operation in terms of that
    destination's precision as well as the operand's values."

It seems to me that one might be able to get away with saying that
    X := A*B+C
is a case where the intermediate result A*B is placed in a (notional)
double-extended destination; this would interpret X := A*B+C as
    X := round_to_double(                 % there are
            double_extended_sum(          % three rounding steps
                double_extended_product(  % in this expansion
                    extend_double(A),
                    extend_double(B)),
                extend_double(C)))
Section 3.3 of IEEE Std 854 specifies only lower bounds on 'extended'
range and precision, so double-extended could be big enough for
double_extended_sum and double_extended_product to do no rounding in
this case.  Provided that double-extended was not otherwise provided,
it seems to me that this implementation could conform to IEEE Std 854.

If, however, your program contains the statements
    begin
        double AB := A*B;
        X := AB+C
    end;
then this may *not* be translated using the multiply-accumulate
instruction, because the results are specified precisely by IEEE 854.
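In C terms, and assuming a machine where long double is a
double-extended format (true of the 80387's 80-bit format; elsewhere
long double may be no wider than double), the two interpretations look
like this -- a sketch of the expansion above, not a claim about any
particular compiler:

    /* Extended-destination interpretation: the intermediate product
       lives in a (notional) double-extended destination. */
    double mac_via_extended(double a, double b, double c)
    {
        long double p = (long double)a * (long double)b;
                                        /* double_extended_product */
        long double s = p + (long double)c;
                                        /* double_extended_sum     */
        return (double)s;               /* round_to_double         */
    }

    /* The precisely specified form, which may NOT be fused: the
       assignment to ab forces a rounding to double. */
    double mac_two_roundings(double a, double b, double c)
    {
        double ab = a * b;              /* ROUND(A*B)              */
        return ab + c;                  /* ROUND(ROUND(A*B)+C)     */
    }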
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/31/89)
In article <AGLEW.89Oct27193246@chant.urbana.mcd.mot.com>
aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) writes:
>Can you provide any references or publications on these designs?

This is information which I gathered from reading hardware reference
manuals supplied by the manufacturers during courses they taught
on-site at Ames.  I find it an interesting architectural tradeoff that,
generally, CDC was able to deliver better results-per-clock-cycle
numbers, but Cray could deliver sufficiently better clock cycles :-)
How much of this was Cray himself, and how much of it was the fact that
there were significantly more gates in the CDC design, is open to
speculation.  That is, in fact, why I looked at the number of gates.
Only broad, "250,000 5/4 NAND gates"-type information is available in
these publications, however.  I think you would have to kill someone to
get the actual layouts.

>>The first increment of improvement could come about by segmenting
>What do you mean by "segmenting"?  Do you mean pipelining - e.g. so that

Yes, I mean pipelining.

>IEEE extended (80 bit)
>division in 4 cycles.

This does indeed sound interesting.  One would need to know how many
gate delays per stage there are, though, for it to be really exciting.

>rounding.  I.e. it is ROUND(A*B+C).  This is "more accurate" but does
>not necessarily produce the same answers as ROUND(ROUND(A*B)+C).
>I wonder what the numerical analysis mavens have to say about this:

At least ONE numerical analyst ( :-) seems to dislike this.  The
"Paranoia" benchmark, available from netlib, requires intermediate
results to be exactly 64-bit precision for a perfect score.

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035    Phone:  (415)694-6117
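For flavor, here is a tiny probe in the spirit of Paranoia's precision
tests -- my sketch, not code from the actual benchmark.  If
intermediate results are held wider than 64-bit double (as on an
extended-precision register stack), the expression comes out nonzero:

    #include <stdio.h>

    int main(void)
    {
        /* tiny is below half an ulp of 1.0, so in pure 64-bit double
           arithmetic (big + tiny) rounds back to 1.0 and r is 0.0.
           If the subexpression is kept in extended precision, tiny
           survives and r comes out about 2.2e-17. */
        volatile double big = 1.0, tiny = 2.2e-17;
        double r = (big + tiny) - big;

        printf("r = %g: %s\n", r,
               r == 0.0 ? "64-bit intermediates"
                        : "excess-precision intermediates");
        return 0;
    }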