randys@mipon3.intel.com (Randy Steck) (08/26/87)
The August 20 issue of Electronics has an interesting description of the
MIPS floating point processor in their "Technology to Watch" section.
The implementation looks pretty impressive, and I was wondering if someone
from MIPS (John?) could enlighten us on some of the more interesting
features of the part.

The execution times look very impressive, and the fact that the adder,
multiplier, and divider can all operate in parallel could really increase
performance.  However, what sort of algorithm is used to provide a double
precision divide in 5 clocks?

One of the interesting items was the way in which the pipeline of the
processor can be shut down when an exception in one of the floating point
operations is found.  Apparently the instruction stream can be restarted
on the failing instruction.  But since the execution units operate in
parallel, what happens when a 5-clock multiply is followed by a 2-clock
add and the multiply overflows or signals some other exception?  The add
could already have completed and changed one of the input operands to the
previous multiply, so simply restarting the multiply instruction would
not be sufficient to guarantee the correct result.

Also, do all operations conform to IEEE 754?  This would include rounding
and precision considerations.

All in all, the processor looks very interesting, and more detailed
information would be welcome.

Randy Steck
{...intelca!omepd!mipon3!randys}
rowen@mips.UUCP (Chris Rowen) (08/27/87)
>From: randys@mipon3.intel.com (Randy Steck)
>The August 20 issue of Electronics has an interesting description of the
>MIPS floating point processor in their "Technology to Watch" section. The
>implementation looks pretty impressive and I was wondering if someone from
>MIPS (John?) could enlighten us on some of the more interesting features
>of the part?

The R2010 is a CMOS single-chip, closely coupled floating point
coprocessor for the R2000 CPU.  It includes its own file of 16 64-bit
registers and interprets the instruction stream in parallel with the CPU.
Operands can be loaded directly from memory (data cache, usually) or from
the CPU's registers.  Its instructions look a lot like the CPU's: loads
and stores, arithmetic ops, compares and branches.  The tight handshake
between the two chips handles machine stalls and exceptions without
sacrificing speed or error recovery.

MIPS sells it as part of a chip set, in our line of 5-10 MIPS processor
boards, and in our M/Series UNIX boxes.  Our optimizing compiler suite
(C, FORTRAN, Pascal and more) takes advantage of the parallelism of the
R2010's independent add, multiply and divide units.

>The execution times look very impressive and the fact that the adder,
>multiplier, and divider can operate all in parallel could really increase
>performance. However, what sort of algorithm is used to provide a double
>precision divide in 5 clocks?

Gee, I wish WE knew how to do it in 5 cycles.  Unfortunately, this was a
misprint.  Here is a table of correct operation cycle counts for the
R2010:

    R2010 operation         cycles
    {add,sub}.{s,d}         2
    mul.s                   4
    mul.d                   5
    div.s                   12
    div.d                   19
    {mov,abs,neg}.{s,d}     1
    cvt.{s,d,w}.{s,d,w}     2-3

The divider effectively retires 4 bits per cycle, plus 3 cycles of
overhead for quotient adjustment and IEEE rounding at the end.  The
multiplier retires about 14 bits per cycle, with 1 cycle of overhead for
IEEE rounding.

For comparison, here are instruction latencies on back-to-back operations
for a bunch of FP units.
The Intel 80287 and Motorola 68881 are single-chip coprocessors like the
MIPS R2010.  The Weitek 1164/1165 multiplier and ALU require external
hardware for the register file, instruction decode, and exception control
to marry them to any particular CPU.  These numbers are for 64-bit
arithmetic on the Weitek and MIPS chips; the Motorola and Intel chips do
all internal operations in extended precision (~80 bits).  All values
assume register-to-register operations.

            R2010       1164/65     80287       68881
            16.67MHz    16.67MHz    10MHz       20MHz
    add     120ns       600ns       7000ns      2550ns
    mul     300         660         9000        3550
    div     1140        3840        19300       5150

(Sorry, we don't have numbers handy for the 80387, 68882, or Clipper.)

Other inaccuracies in the Electronics article:

* The article mentions that the R2010 chip replaces 16KB of SRAM on
  MIPS's old R2360 Weitek-based FPA board.  It's true there is a bunch
  of RAM on that board, but only a tiny section of it is used for the
  FP register file.

* The chip dissipates 3.8W worst case, not the 2-3W mentioned in the
  article.

* Whetstones at 16.67MHz are in the range of 12.0 MWhets (single) and
  9.3 MWhets (double), not 10.7 MWhets and 8.9 MWhets.  (Your mileage
  may vary, etc., etc.)

>One of the interesting items was the way in which the pipeline of the
>processor can be shut down when an exception in one of the floating point
>operations is found. Apparently the instruction stream can be restarted
>on the failing instruction. But, since the execution units operate in
>parallel, what happens when a 5 clock multiply is followed by a 2 clock add
>and the multiply overflows or signals some other exception? The add could
>have already completed and changed one of the input operands to the previous
>multiply so that simply restarting the multiply instruction would not be
>sufficient to guarantee the correct result.

We actually dedicate a good bit of hardware to make parallel operations
("flushing three toilets at the same time") work in the presence of
exceptions.
We delay committing state for an instruction until earlier instructions
are known to be exception-free.  A second write port into the register
file makes this easier.

>Also, do all operations conform to IEEE 754? This would include rounding
>and precision considerations.

Yes, systems based on the R2010 conform with the requirements and
recommendations of the standard (what a mouthful :-)).  The hardware for
rounding to nearest, zero, +infinity and -infinity is pretty complicated,
so it is implemented only once, in the add unit.  Multiply and divide
operations must get access to the adder at the end of their execution.

The chip adopts the "RISC philosophy" in handling certain infrequent
operands (like denormalized numbers) and exceptional results -- it punts
the problem over to system software.  This lets us concentrate hardware
on the frequent cases and moves complexity into the system software,
which has to be there anyway to provide user-level exception support.
The overhead of software handling for these special operands and
operations makes no perceptible difference in normal floating point
performance.

Chris Rowen     decwrl!mips!rowen   930 Arques Ave.
Mark Johnson    decwrl!mips!mark    Sunnyvale CA 94086

Generic disclaimer: We speak only for us...
mash@mips.UUCP (08/27/87)
In article <627@dumbo.UUCP> rowen@dumbo.UUCP (Chris Rowen & Mark Johnson)
write:
>...The R2010 is a CMOS single chip, closely coupled floating point
>coprocessor for the R2000 CPU... The tight handshake between the two
>chips handles machine stalls and exceptions without sacrificing speed
>or error recovery.... We delay committing state for an instruction
>until earlier instructions are known to be exception-free....

A little more, from the software side, on what this means:

When we see an FP exception, it's just like any other exception: the
exception PC points at the instruction that caused the exception, or at
the branch in whose delay slot the faulting instruction lies.  Every
instruction before the EPC has been fully executed; no instruction at
the EPC or logically after it has had any effect.

This is one area where the complexity stayed in the hardware; i.e., we
might have had multiple, imprecise exceptions.  After I generated the
512 lists describing what the OS would do with every combination of
exceptions, the chippers decided software would NEVER get it all right,
so we got precise exceptions instead.  (Thank goodness.)

Another fact worth mentioning is that the FPU chip is physically LARGER
than the CPU chip, even though the CPU has a big TLB, cache control,
almost twice as many I/Os, etc.  People often ask if we'd put the FPU
and CPU together as chip shrinks become available.  The usual answer is
NO: we'd use more silicon to make an even faster FPU.  In the near
future (the next few shrinks), it seems unlikely that anyone can
incorporate a truly high-performance FPU (in the R2010's league) on the
same chip as the integer unit.  Moral: fast FP uses LOTS of silicon.
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   {decvax,ucbvax,ihnp4}!decwrl!mips!mash  OR  mash@mips.com
DDD:    408-991-0253 or 408-720-1700, x253
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086