[comp.arch] MIPS Floating Point processor.

randys@mipon3.intel.com (Randy Steck) (08/26/87)

The August 20 issue of Electronics has an interesting description of the
MIPS floating point processor in their "Technology to Watch" section.  The
implementation looks pretty impressive, and I was wondering if someone from
MIPS (John?) could enlighten us on some of the more interesting features
of the part?  

The execution times look very impressive and the fact that the adder,
multiplier, and divider can all operate in parallel could really increase
performance.  However, what sort of algorithm is used to provide a double
precision divide in 5 clocks?

One of the interesting items was the way in which the pipeline of the
processor can be shut down when an exception in one of the floating point
operations is found.  Apparently the instruction stream can be restarted
on the failing instruction.  But, since the execution units operate in
parallel, what happens when a 5 clock multiply is followed by a 2 clock add
and the multiply overflows or signals some other exception?  The add could
have already completed and changed one of the input operands to the previous
multiply so that simply restarting the multiply instruction would not be
sufficient to guarantee the correct result.

Also, do all operations conform to IEEE 754?  This would include rounding
and precision considerations.

All in all, the processor looks very interesting and more detailed
information would be welcome.

Randy Steck  {...intelca!omepd!mipon3!randys}

rowen@mips.UUCP (Chris Rowen) (08/27/87)

>From: randys@mipon3.intel.com (Randy Steck)
>The August 20 issue of Electronics has an interesting description of the
>MIPS floating point processor in their "Technology to Watch" section.  The
>implementation looks pretty impressive, and I was wondering if someone from
>MIPS (John?) could enlighten us on some of the more interesting features
>of the part?  

The R2010 is a CMOS single chip, closely coupled floating point coprocessor 
for the R2000 CPU.  It includes its own file of 16 64-bit registers and
interprets the instruction stream in parallel with the CPU.  Operands
can be loaded directly from memory (data cache, usually) or from the
CPU's registers.  Its instructions look a lot like the CPU's: loads and 
stores, arithmetic ops, compares and branches.  The tight handshake between
the two chips handles machine stalls and exceptions without sacrificing
speed or error recovery.  MIPS sells it as part of a chip-set, in our line 
of 5-10MIPS processor boards and in our M/Series UNIX boxes.  Our optimizing 
compiler suite (C, FORTRAN, Pascal and more) takes advantage of the 
parallelism of the R2010's independent add, multiply and divide units.  

>The execution times look very impressive and the fact that the adder,
>multiplier, and divider can all operate in parallel could really increase
>performance.  However, what sort of algorithm is used to provide a double
>precision divide in 5 clocks?

Gee, I wish WE knew how to do it in 5 cycles.  Unfortunately, this was a 
misprint.  Here is a table of correct operation cycle counts for the R2010:

	R2010 operation		cycles

	{add,sub}.{s,d}		2	
	mul.s			4	
	mul.d			5	
	div.s			12	
	div.d			19	
	{mov,abs,neg}.{s,d}	1	
	cvt.{s,d,w}.{s,d,w}	2-3	

The divider effectively retires 4 bits per cycle, plus 3 cycles of
overhead for quotient adjustment and IEEE rounding at the end.  The 
multiplier retires about 14 bits per cycle, with 1 cycle of overhead
for IEEE rounding.
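The article doesn't say which division algorithm the R2010 actually uses (a higher-radix SRT scheme would be typical at this rate). Purely as an illustration of "retiring 4 bits per cycle", here is a toy restoring integer divider that produces one radix-16 quotient digit per iteration -- an assumption-laden sketch, not MIPS's design:

```python
def divide_radix16(n, d, steps):
    """Toy restoring division retiring 4 quotient bits per step.

    Illustrates "4 bits per cycle" only; NOT the R2010's actual algorithm.
    Requires d > 0 and 0 <= n < 16**steps.
    """
    q, r = 0, 0
    for i in range(steps):
        # Shift in the next 4 bits of the dividend, most significant first.
        chunk = (n >> (4 * (steps - 1 - i))) & 0xF
        r = (r << 4) | chunk
        digit = r // d          # one radix-16 quotient digit, 0..15
        q = (q << 4) | digit
        r -= digit * d          # partial remainder stays below d
    return q, r                 # quotient, remainder

print(divide_radix16(1000, 7, steps=4))   # (142, 6): 1000 == 7*142 + 6
```

A 53-bit double-precision mantissa at one digit per cycle then takes on the order of 14-16 cycles, which plus the 3 cycles of rounding overhead is consistent with the 19-cycle div.d figure above.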

For comparison, here are instruction latencies on back-to-back operations
for a bunch of FP units.  The Intel 80287 and Motorola 68881 are single-chip 
coprocessors like the MIPS R2010. The Weitek 1164/1165 multiplier and ALU 
require external hardware for register file, instruction decode and exception
control to marry them to any particular CPU.  These numbers are for 64-bit 
arithmetic on the Weitek and MIPS chips.  The Motorola and Intel chips do 
all internal operations in extended precision (~80 bits).  All values assume 
register-to-register operations.

		R2010		1164/65		80287		68881
	      16.67MHz		16.67MHz	10MHz		20MHz

add		120ns		600ns		7000ns		2550ns
mul		300		660		9000		3550
div	       1140	       3840	       19300		5150

(Sorry, we don't have numbers handy for the 80387, 68882, Clipper)

Other inaccuracies in the Electronics article:	

* The article mentions that the R2010 chip replaces 16KB of SRAM on MIPS's
  old R2360 Weitek-based FPA board.  It's true there is a bunch of RAM on that
  board, but only a tiny section of it is used for the FP register file.

* The chip dissipates 3.8W worst case, not the 2-3W mentioned in the article.

* Whetstones at 16.67MHz are in the range of 12.0MWhets (single) and 9.3MWhets
  (double), not 10.7MWhets and 8.9MWhets (Your mileage may vary, etc. etc.)

>One of the interesting items was the way in which the pipeline of the
>processor can be shut down when an exception in one of the floating point
>operations is found.  Apparently the instruction stream can be restarted
>on the failing instruction.  But, since the execution units operate in
>parallel, what happens when a 5 clock multiply is followed by a 2 clock add
>and the multiply overflows or signals some other exception?  The add could
>have already completed and changed one of the input operands to the previous
>multiply so that simply restarting the multiply instruction would not be
>sufficient to guarantee the correct result.

We actually dedicate a good bit of hardware to make parallel operations
("flushing three toilets at the same time") work in the presence of 
exceptions.  We delay committing state for an instruction until earlier 
instructions are known to be exception-free.  A second write port into 
the register file makes this easier.
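Conceptually -- and this toy model is my own sketch, not the R2010's actual microarchitecture -- execution units may finish in any order, but results are held back and written to the register file strictly in program order, stopping at the first exception:

```python
class ToyFPU:
    """Toy model: out-of-order completion, in-order commit (a sketch only)."""

    def __init__(self):
        self.regs = [0.0] * 16   # architectural FP register file
        self.done = {}           # program order -> (dest, value, exception)

    def complete(self, order, dest, value, exception=False):
        # Units may finish out of order, e.g. a 2-cycle add completing
        # before an earlier 5-cycle multiply.
        self.done[order] = (dest, value, exception)

    def commit(self):
        """Write results in program order; stop at the first exception
        so a restart at that instruction is precise."""
        for order in sorted(self.done):
            dest, value, exc = self.done[order]
            if exc:
                return order     # restart point; younger results discarded
            self.regs[dest] = value
        return None              # everything committed cleanly
```

In this model the add's result never reaches the register file if the older multiply traps, so restarting at the multiply still sees its original, unmodified operands.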

>Also, do all operations conform to IEEE 754?  This would include rounding
>and precision considerations.

Yes, systems based on the R2010 conform with requirements and recommendations 
of the standard (what a mouthful :-)).  The hardware for rounding to nearest, 
zero, +infinity and -infinity is pretty complicated, so it is implemented 
only once, in the add unit.  Multiply and divide operations must get access 
to the adder at the end of their executions.  The chip adopts the "RISC 
philosophy" in handling of certain infrequent operands (like denormalized 
numbers) and exceptional results -- it punts the problem over to system 
software.  This lets us concentrate hardware on the frequent cases and moves 
complexity over to the system software, which has to be there anyway to 
provide user-level exception support.  The overhead for software handling of 
these special operands and operations makes no perceptible difference in 
normal floating point performance.
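Punting denormals to software means the system software must recognize such operands at the bit level. A check for an IEEE 754 double-precision subnormal (exponent field all zeros, fraction field nonzero) looks like this:

```python
import struct

def is_subnormal(x: float) -> bool:
    """True if x is an IEEE 754 double-precision subnormal (denormalized)
    number: exponent field all zeros, fraction field nonzero."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return exponent == 0 and fraction != 0

print(is_subnormal(5e-324))   # True  (smallest positive subnormal)
print(is_subnormal(1.0))      # False (a normal number)
print(is_subnormal(0.0))      # False (zero is not subnormal)
```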


Chris Rowen		decwrl!mips!rowen		930 Arques Ave.
Mark Johnson		decwrl!mips!mark		Sunnyvale CA 94086

Generic disclaimer: We speak only for us...

mash@mips.UUCP (08/27/87)

In article <627@dumbo.UUCP> rowen@dumbo.UUCP (Chris Rowen & Mark Johnson) write:
>
>...The R2010 is a CMOS single chip, closely coupled floating point coprocessor 
>for the R2000 CPU...  The tight handshake between
>the two chips handles machine stalls and exceptions without sacrificing
>speed or error recovery....  We delay committing state for an instruction
>until earlier instructions are known to be exception-free....

A little more, from the software side, on what this means:

When we see an FP exception, it's just like any other exception:

The exception PC points at the instruction that caused the exception,
or at the branch in whose delay slot the faulting instruction lies.
Every instruction before the PC has been fully executed;
no instruction at the EPC or logically after has had any effect.
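The delay-slot case works because MIPS sets a branch-delay (BD) bit in the Cause register when the fault occurs in a delay slot; the handler then derives the faulting instruction's address itself. A minimal sketch of that convention:

```python
def faulting_address(epc, cause_bd):
    """Address of the instruction that actually faulted.

    cause_bd models the Cause register's BD bit: when set, EPC holds the
    address of the branch, and the fault occurred in its delay slot, one
    4-byte instruction later.
    """
    return epc + 4 if cause_bd else epc

print(hex(faulting_address(0x1000, False)))  # 0x1000 (ordinary fault)
print(hex(faulting_address(0x2000, True)))   # 0x2004 (delay-slot fault)
```

Note that restart must still begin at the branch (the EPC itself) so the branch re-executes along with its delay slot.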

This area is one where the complexity stayed in the hardware,
i.e., we might have had multiple, imprecise exceptions.
After I generated the 512 lists to describe what the OS would do with
every combination of exceptions, the chippers decided software
would NEVER get it all right, so we got precise exceptions instead.
(Thank goodness.)

Another fact worth mentioning is that the FPU chip is physically LARGER
than the CPU chip, even though the CPU has a big TLB, cache control,
almost twice as many I/Os, etc.

People often ask if we'd put the FPU and CPU together as chip shrinks
become available.  The usual answer is NO, we'd use more silicon to make
an even faster FPU.  In the near future (next few shrinks),
it seems unlikely that anyone can incorporate a truly high-performance
FPU (in the R2010's league) on the same chip as the integer unit.
Moral: fast FP uses LOTS of silicon.

-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086