[comp.arch] 3010 fp

mark@mips.COM (Mark G. Johnson) (10/26/89)

In article <3300080@m.cs.uiuc.edu> gillies@m.cs.uiuc.edu writes:
  >
  >At some point I heard that MIPS pulled out just about every stopper to
  >speed up the floating point speed of the R2000/R3000.  In other words,
  >they hardwired all the FPU ops, and provided 32 or 64-bit circuitry
  >wherever it was needed, and used all the optimal designs (like carry
  >lookahead and wallace trees -- please excuse my ignorance of
  >arithmetic circuitry) in their arithmetic unit?
  >
  >So now they just wait for device technology and heavy pipelining to
  >speed up their chip?  Couldn't Crays make a small comeback by
  >exploiting their 1's complement arithmetic, which is supposed to be an
  >inherently faster number system for digital implementation?

MIPS chose the IEEE-754 standard for floating point representation and
arithmetic; the other candidates were perceived to be marketing suicide.

The FP device is indeed hardwired (i.e. no microcode) and it has
several different functional units so that four different FP operations
can be executing in parallel ("overlapped"): an FP load/store, an FP
add/subtract, an FP multiply, and an FP divide.  {It's described in
IEEE Micro, June 88}

However, the chip contains only 75,000 transistors.  So there were
lots of potentially nifty hardware ideas that had to be omitted from
the design, to stay within budget.

Specifically, there are no Wallace trees in the R3010.  The carry
circuits on the R3010 are not the classical "lookahead" variety.
The register file only has 4 ports.  Etc etc etc.

Of course, now that it is possible to cram a lot more than 75,000
transistors on a chip, perhaps it's reasonable to postulate that the
additional transistors could be used to improve f.p. performance.
(On top of speedups due to faster technology)
-- 
 -- Mark Johnson	
 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
	(408) 991-0208    mark@mips.com  {or ...!decwrl!mips!mark}

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/26/89)

In article <30100@obiwan.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>In article <3300080@m.cs.uiuc.edu> gillies@m.cs.uiuc.edu writes:

>  >At some point I heard that MIPS pulled out just about every stopper to
>  >speed up the floating point speed of the R2000/R3000.  In other words,

Actually, MIPSCo has a lot of room for improvement, as good as the
R3010 is: by throwing more hardware at the job, they could lower f.p.
multiply to two cycles latency, and division to 8 cycles latency,
with full segmentation (1 operation started/clock cycle) on every operation.
They could then add vector instructions, and more memory bandwidth, to make full
use of the additional FPU bandwidth.  Fortunately, there are ideas good for
more improvement up to about 1 million gates, based on Cray and CDC/ETA 
designs.  (Some of the CDC Cyber 205 models actually had fully segmented
division, but it added a lot of extra real estate...)  

The first increment of improvement could come about by segmenting addition only
and giving the multiply unit its own round/normalize capability.  This might 
result in something like (a wild guess) a 50% improvement on codes like Linpack,
without adding very many more transistors.

>However, the chip contains only 75,000 transistors.  So there were

(A question, I know, which can't be answered precisely, but how many
"gates" is that, very roughly ...?)

>transistors on a chip, perhaps it's reasonable to postulate that the
>additional transistors could be used to improve f.p. performance.
>(On top of speedups due to faster technology)

Another approach would be to integrate the FPU on the same chip as the CPU.
When you consider the cost of off-chip communication vs. the low on-chip gate 
delays you get these days, it might make more sense to first
put the entire CPU/FPU on one chip, using the current low-transistor-count
design, and then, add more gates to the FPU as space allows with future 
technology.  (Hmmm, sounds like another competitor's approach ... )

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (10/28/89)

>[Hugh LaMaster, commenting on MIPS' FP unit]: Fortunately, there are
>ideas good for more improvement up to about 1 million gates, based on
>Cray and CDC/ETA designs.

Can you provide any references or publications on these designs?
Hell - if anyone has a technical reference or gate layouts, I'd be
interested in seeing them...

>(Some of the CDC Cyber 205 models actually had fully segmented
>division, but it added a lot of extra real estate...)
>
>The first increment of improvement could come about by segmenting
>addition only and giving the multiply unit its own round/normalize
>capability.

What do you mean by "segmenting"?  Do you mean pipelining - eg. so that
divide doesn't need to use the same hardware over and over again for several
cycles?

By the way, can anyone provide details on Cyrix's 80387 superset floating
point chip?  I have heard, for example, that it does IEEE extended (80 bit)
division in 4 cycles.  It uses quotient prediction of 17 bits, with a
17 by 69 bit multiplier array used in the iteration. 
    (Let's see, 17 summands is no more than 5 3:2 CSA levels -
actually fewer. That's fairly long as divider cycles go, according to
Fandrianto, but I suppose that it needs to be that long in order to
predict 17 bits)
    Does anyone know how they predict all 17 bits? Is it one level, or is
it several levels (the way Taylor got 8 bit prediction by doing 4 bit radix 
16 prediction twice)?
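For flavor, here is a toy model (mine, in Python, not from any of the chips
discussed) of how predicting p quotient bits per step retires the quotient.
The digit selection here is idealized -- computed exactly -- whereas real
hardware predicts from the leading bits of remainder and divisor and corrects
off-by-one digits in the multiply-subtract array:

```python
def high_radix_divide(n, d, steps, p=17):
    """Toy radix-2**p division: each step retires p quotient bits.

    Invariant after k steps: q == (n << (k*p)) // d  and  0 <= r < d.
    A real divider would *predict* the digit from the top bits of r and d;
    here the digit is computed exactly, so no correction step is needed.
    """
    assert 0 <= n < d
    q, r = 0, n
    for _ in range(steps):
        digit = (r << p) // d        # predicted quotient digit, < 2**p
        q = (q << p) | digit
        r = (r << p) - digit * d     # multiply-subtract (cf. a 17x69 array)
    return q, r

# Four steps of 17 bits each yield 68 quotient bits of 1/3:
q, r = high_radix_divide(1, 3, steps=4, p=17)
```

Four such steps covering a 68-bit significand is roughly the shape of a
"division in 4 cycles" claim, if each cycle can absorb the digit prediction
plus the multiply-subtract.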

And in another FP question, I notice that the IBM America (RT-2?) has
a multiply-accumulate instruction that performs no intermediate
rounding.  Ie. it is ROUND(A*B+C).  This is "more accurate" but does
not necessarily produce the same answers as ROUND(ROUND(A*B)+C).
I wonder what the numerical analysis mavens have to say about this:
is it okay to get an answer more accurate than IEEE, or is this
something to be avoided?
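A quick way to see the difference (a sketch of mine in Python, not from the
IBM hardware) is to compute the fused result exactly with rationals and round
once, then compare against the twice-rounded version:

```python
from fractions import Fraction

def fma_emulated(a, b, c):
    """One rounding: form a*b+c exactly, then round once to double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = b = 1.0 + 2.0**-27        # exactly representable in IEEE double
c = -(1.0 + 2.0**-26)         # exactly cancels the *rounded* product

two_roundings = (a * b) + c          # ROUND(ROUND(A*B)+C) -> 0.0
one_rounding  = fma_emulated(a, b, c)  # ROUND(A*B+C)      -> 2**-54
```

The exact product is 1 + 2**-26 + 2**-54; separate rounding discards the
low term before the add, so the fused form keeps a bit that the two-step
form loses entirely.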


--
Andy "Krazy" Glew,  Motorola MCD,    	    	    aglew@urbana.mcd.mot.com
1101 E. University, Urbana, IL 61801, USA.          {uunet!,}uiucuxc!udc!aglew
   
My opinions are my own; I indicate my company only so that the reader
may account for any possible bias I may have towards our products.

ok@cs.mu.oz.au (Richard O'Keefe) (10/29/89)

In article <AGLEW.89Oct27193246@chant.urbana.mcd.mot.com>, aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) writes:
> And in another FP question, I notice that the IBM America (RT-2?) has
> a multiply-accumulate instruction that performs no intermediate
> rounding.  Ie. it is ROUND(A*B+C).  This is "more accurate" but does
> not necessarily produce the same answers as ROUND(ROUND(A*B)+C).
> I wonder what the numerical analysis mavens have to say about this:

Check
	Compiler Support for Floating-Point Computation
	Charles Farnum
	Software--Practice and Experience 18(7) 701-709 July 1988

Also check ANSI/IEEE Std 854-1987. 
A relevant part of that standard says:
	"Some languages place the results of intermediate calculations
	in destinations beyond the user's control.  Nonetheless, this
	standard defines the result of an operation in terms of that
	destination's precision as well as the operand's values."
It seems to me that one might be able to get away with saying that
	X := A*B+C
is a case where the intermediate result A*B is placed in a (notional)
double-extended destination; this would interpret X := A*B+C as
	X := round_to_double(			% there are
		double_extended_sum(		% three rounding steps
		    double_extended_product(	% in this expansion
			extend_double(A), extend_double(B)),
		    extend_double(C)))
Section 3.3 of IEEE Std 854 specifies only lower bounds on 'extended'
range and precision, so double-extended could be big enough for
double_extended_sum and double_extended_product to do no rounding in
this case.  Provided that double-extended was not otherwise provided
it seems to me that this implementation could conform to IEEE Std 854.

If, however, your program contains the statements

	begin double AB := A*B; X := AB+C end;

then this may *not* be translated using the multiply-accumulate
instruction, because the results are specified precisely by IEEE 854.
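The three-rounding expansion above can be illustrated (my sketch, letting IEEE
single stand in for "double" and IEEE double for "double-extended", since the
double product and sum of two singles are exact) with only the standard
library:

```python
import struct

def round_to_single(x):
    """Round a Python float (IEEE double) to IEEE single precision."""
    return struct.unpack('f', struct.pack('f', x))[0]

a = b = round_to_single(1 + 2**-13)   # exactly representable in single
c = round_to_single(-(1 + 2**-12))    # cancels the rounded product

# Extended destination: product and sum are exact in double, so the only
# rounding that matters is the final one back to the single destination.
x_extended = round_to_single(a * b + c)                    # -> 2**-26

# Working precision throughout: the product is rounded before the add.
x_single = round_to_single(round_to_single(a * b) + c)     # -> 0.0
```

As with the multiply-accumulate case, the extended-destination reading keeps
a low-order bit that the narrow intermediate destination throws away.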

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/31/89)

In article <AGLEW.89Oct27193246@chant.urbana.mcd.mot.com> aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) writes:

>Can you provide any references or publications on these designs?

This is information which I gathered from reading hardware reference manuals
supplied by the mfrs. during courses they taught on-site at Ames.  I find it
an interesting architectural tradeoff that, generally, CDC was able to deliver
better results/clock cycle  numbers, but Cray could deliver sufficiently
better clock cycles :-)  How much of this was Cray himself, and how much of it
was the fact that there were significantly more gates in the CDC design, is 
open to speculation.  That is, in fact, why I looked at the number of gates.
Only the broad description (250,000 5/4 NAND gates - type information) is
available in these publications, however.  I think you would have to kill
someone to get the actual layouts.

>>The first increment of improvement could come about by segmenting

>What do you mean by "segmenting"?  Do you mean pipelining - eg. so that

Yes, I mean pipelining.  
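
The throughput argument is simple enough to put in a couple of lines (my
back-of-the-envelope model, not a description of any particular chip): an
unpipelined unit is busy for its full latency on each operation, while a
fully segmented unit starts one independent operation per cycle:

```python
def cycles_unpipelined(n_ops, latency):
    # The unit is occupied for the full latency of every operation.
    return n_ops * latency

def cycles_pipelined(n_ops, latency):
    # One operation enters per cycle; the last finishes latency-1 later.
    return n_ops + latency - 1

# 100 independent divides on an 8-cycle divider:
#   unpipelined: 800 cycles; fully segmented: 107 cycles.
```

Hence the appeal of segmenting even a long operation like divide, provided
the code can supply enough independent operations to fill the pipe.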

>IEEE extended (80 bit)
>division in 4 cycles.

This does indeed sound interesting.  One would need to know how many gate
delays per stage there are, though, to judge how exciting it really is.

>rounding.  Ie. it is ROUND(A*B+C).  This is "more accurate" but does
>not necessarily produce the same answers as ROUND(ROUND(A*B)+C).
>I wonder what the numerical analysis mavens have to say about this:

At least ONE Numerical Analyst  ( :-) seems to dislike this.  The "Paranoia"
benchmark, available from netlib, requires intermediate results to be
exactly 64 bit precision for a perfect score.

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117