[comp.arch] Pipelined FP add

mark@mips.COM (Mark G. Johnson) (12/15/89)

In article <241@dg.dg.com> uunet!dg!chris (Chris Moriondo) writes:
  >
  >Anyone care to speculate as to how much pipelining add would win/lose
  >in terms of useful overlap in FP codes versus increased latency? 
  >

Several articles were posted to comp.arch this spring, talking about the
perceived benefits of a pipelined FP adder in the Motorola MC88K risc.
Perhaps the Moto folks can say more now.  [ after the departure of Ross ]

I dimly recall that the last time around, they were beginning to feel
that the 88K's shared register file -- same regs for integer and FP
operands -- required large numbers of read and write ports to make
FP programs run quickly.  The other design alternative, separate
integer regs from the FP regs, has lots of ports already since it's
2 copies of the hardware.  I'm sure if my memory is faulty somebody
will gently correct me.

The earlier MC88K articles were posted by wca@oakhill.UUCP and
mpaton@oakhill.UUCP; your site archives may contain comp.arch article
<1362@oakhill.UUCP>.
-- 
 -- Mark Johnson	
 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
	(408) 991-0208    mark@mips.com  {or ...!decwrl!mips!mark}

preston@titan.rice.edu (Preston Briggs) (12/15/89)

In article <33570@hal.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>In article <241@dg.dg.com> uunet!dg!chris (Chris Moriondo) writes:
>  >
>  >Anyone care to speculate as to how much pipelining add would win/lose
>  >in terms of useful overlap in FP codes versus increased latency? 
>  >
>
>Several articles were posted to comp.arch this spring, talking about the
>perceived benefits of a pipelined FP adder in the Motorola MC88K risc.

Well, long pipelines will usually increase latency.  They'll also
increase FP performance, given some effort by the compiler.
Deciding the tradeoff is a judgment call, probably based on the
target applications.  For a workstation
that *I* would use (to compile and edit things like compilers and
editors), FP performance isn't very important.  But when people
really want FP performance, they can usually manage to avoid task 
switching too often.
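
As a made-up illustration of the kind of compiler effort I mean: a
straightforward reduction serializes on the adder's latency, but keeping
several partial sums lets independent adds fill the pipeline.  Something
like

      DOUBLE PRECISION FUNCTION VSUM(N, V)
      INTEGER N, I
      DOUBLE PRECISION V(N), S1, S2, S3, S4
      S1 = 0.0D0
      S2 = 0.0D0
      S3 = 0.0D0
      S4 = 0.0D0
C     Four independent partial sums, so successive adds need not wait
C     on each other.  Assumes N is a multiple of 4, to keep this short.
      DO 10 I = 1, N, 4
         S1 = S1 + V(I)
         S2 = S2 + V(I+1)
         S3 = S3 + V(I+2)
         S4 = S4 + V(I+3)
   10 CONTINUE
      VSUM = (S1 + S2) + (S3 + S4)
      END

can approach one add per cycle on a 4-stage pipelined adder, instead of
one add per latency, at the cost of a few extra registers and a slightly
different rounding of the result.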

What'll happen when people get more used to cheap FP?  Probably more
FP in everyday code, particularly in graphics display.

>I dimly recall that the last time around, they were beginning to feel
>that the 88K's shared register file -- same regs for integer and FP
>operands -- required large numbers of read and write ports to make
>FP programs run quickly.  The other design alternative, separate
>integer regs from the FP regs, has lots of ports already since it's
>2 copies of the hardware.

The point about read/write ports is good.  I used to believe
in a single (large) set of multi-purpose registers controlled
by the compiler.  This scheme would seem to be less wasteful
of resources (when you need more FP regs, spill more integers,
and vice-versa, whichever works out cheapest for particular programs).

Nowadays, I've come around to the view that extra resources for 
the moments of peak demand (triply-nested loops and such) are better.  
In particular, I prefer separate FP and integer register sets.
Locally, friends are working on loop transformations that will
use *all* the available FP registers profitably.  If we've got
to balance the FP register pressure against the integer pressure,
then we've added another hard issue to optimization.
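
To give the flavor (a made-up sketch, not my colleagues' actual work):
unroll-and-jam of matrix multiply keeps a block of the result, plus
reused pieces of the operands, in FP registers; bigger blocks mean less
memory traffic per flop but many more live FP values.

C     Sketch only: a 2x2 unrolling wants 4 FP accumulators plus
C     temporaries; larger blocks want correspondingly more FP registers.
C     Assumes N is even, to keep the example short.
      SUBROUTINE MM2X2(N, A, B, C)
      INTEGER N, I, J, K
      DOUBLE PRECISION A(N,N), B(N,N), C(N,N)
      DOUBLE PRECISION C11, C21, C12, C22
      DO 30 J = 1, N, 2
         DO 20 I = 1, N, 2
            C11 = C(I,J)
            C21 = C(I+1,J)
            C12 = C(I,J+1)
            C22 = C(I+1,J+1)
            DO 10 K = 1, N
               C11 = C11 + A(I,K)*B(K,J)
               C21 = C21 + A(I+1,K)*B(K,J)
               C12 = C12 + A(I,K)*B(K,J+1)
               C22 = C22 + A(I+1,K)*B(K,J+1)
   10       CONTINUE
            C(I,J) = C11
            C(I+1,J) = C21
            C(I,J+1) = C12
            C(I+1,J+1) = C22
   20    CONTINUE
   30 CONTINUE
      END

Scale the block up and the FP register demand grows quickly; that's the
peak demand I'd rather not have to trade against integer register
pressure.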

Of course, it's extra hardware and it'll cost money.

Preston Briggs
preston@titan.rice.edu

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (12/15/89)

In article <3740@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>In article <33570@hal.mips.COM> mark@mips.COM (Mark G. Johnson) writes:

>>that the 88K's shared register file -- same regs for integer and FP
>>operands -- required large numbers of read and write ports to make

>The point about read/write ports is good.  I used to believe
>in a single (large) set of multi-purpose registers controlled
>by the compiler.  This scheme would seem to be less wasteful

>In particular, I prefer separate FP and integer register sets.

There are alternatives.  For example, instead of creating
separate register files to increase the available ports, there is a
technique to use two identical register files for the same purpose.
The technique was used on the Cyber 205, and is described in an article
in IEEE Transactions on Computers [May 1982] by Neil Lincoln.  This redundancy
doubles the real estate per register, but, on the other hand, there
are advantages to doing this over using separate fp register files.
Codes which would otherwise require moving data between the integer
registers (where the logical and bit-crunching instructions typically
operate) and the fp register file don't require any data movement.
(I hope nobody still uses the IBM 360 setup, where you have to push
the data to *memory* to get it back and forth between integer and fp units...)


  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

mark@mips.COM (Mark G. Johnson) (12/15/89)

In article <38132@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
  >(I hope nobody still uses the IBM 360 setup, where you have to push
  >the data to *memory* to get it back and forth between integer and fp
  > units...)

I believe SPARC does this, because there aren't FPR <-> GPR move instructions.
However, "*memory*" is just a few cycles away (3 for the store, 2 more for
the load) thanks to cache.

davec@proton.amd.com (Dave Christie) (12/15/89)

In article <38132@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
| In article <3740@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
|   
    [discussion about separate vs. multi-port shared fp/int registers
     and integer/fp register allocation tradeoffs]
| 
| >In particular, I prefer separate FP and integer register sets.
| 
| There are alternatives.  For example, instead of creating
| separate register files to increase the available ports, there is a
| technique to use two identical register files for the same purpose.
| The technique was used on the Cyber 205, and is described in an article
| in IEEE Transactions on Computers [May 1982] by Neil Lincoln.  This redundancy
| doubles the real estate per register, but, on the other hand, there
| are advantages to doing this over using separate fp register files.

This technique is fairly common in machines from the '70s (at least CDC machines)
that used small off-the-shelf single-port SRAMs for the register file.  Doubling
the register file real estate is well worth it for the improved performance that
comes with accessing two operands at once.  (Small off-the-shelf multi-ported
SRAMs didn't quite cut it for various reasons.)

The extra ports that come free with separate register sets aren't the only benefit:
Preston alluded to the removal from the compiler of a difficult optimization
tradeoff (int vs. fp register allocation).  (But I think one could argue about
whether or not this is an enhancement, i.e. not having the opportunity to do it.)

Another benefit appears for multiple instruction issue: issuing simultaneous
fp/int instructions is simplified if you don't have to worry about dependency
checking between the two (witness the IBM 360/91, i860, Apollo Prism; separate
register sets?).  It also makes it easier to put all fp operations on a separate chip and
still maintain high performance (a la MIPS).

| Codes which would otherwise require moving data between the integer
| registers (where the logical and bit-crunching instructions typically
| operate) and the fp register file don't require any data movement.
   [in a shared register set]

Quite true, but (IMHO) for most applications, int<->fp traffic is not very
high, so having to explicitly move some data around is no big deal ... as
long as it is done reasonably efficiently so that you don't blow one or two
applications out of the water.

------------
Dave Christie         My humble opinions, not my employer's.

slackey@bbn.com (Stan Lackey) (12/16/89)

In article <38132@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>In article <3740@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>>In particular, I prefer separate FP and integer register sets.
>
>There are alternatives.  For example, instead of creating
>separate register files to increase the available ports, there is a
>technique to use two identical register files for the same purpose.

It's also possible to build the register file with both multiple read and
multiple write ports.  I'm pretty sure the Multiflow Trace has this.
It's expensive, especially for multiple write ports, but it can certainly
be done; there is basically a mux in front of each bit.  I'd be surprised
if any of the fast RISCs have fewer than two read ports, in fact.
-Stan

ram@shukra.Sun.COM (Renu Raman) (12/16/89)

In article <33623@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>In article <38132@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>  >(I hope nobody still uses the IBM 360 setup, where you have to push
>  >the data to *memory* to get it back and forth between integer and fp
>  > units...)
>
>I believe SPARC does this, because there aren't FPR <-> GPR move instructions.
>However, "*memory*" is just a few cycles away (3 for the store, 2 more for
>the load) thanks to cache.

  minor addendum:

  What mark@mips said applies to the Fujitsu/LSI/Cypress parts.

  You can at best crunch it down to 3 cycles (2 cycles for the store and 1 cycle
  for the load; for doubles, it would be 4) if you can design a good
  cache system using the BIT ECL parts (which is left as an exercise to
  the reader :-)).  So it is very implementation dependent.  The best case
  of course is 2 cycles (if you can do single-cycle stores; you can do
  single-cycle load doubles to FP using the BIT parts).

  renu raman

pmontgom@sonia.math.ucla.edu (Peter Montgomery) (12/17/89)

In article <28413@amdcad.AMD.COM> davec@cayman.amd.com (Dave Christie) writes:
>
>Quite true, but (IMHO) for most applications, int<->fp traffic is not very
>high, so having to explicitly move some data around is no big deal ... as
>long as it is done reasonably efficiently so that you don't blow one or two
>applications out of the water.
>
	I have written a multiple precision integer arithmetic package.
Its time-critical routines are conditionally compiled so
the algorithms can be individually optimized for different machines 
(yes, I resemble Herman Rubin in wanting full access to machine
instructions from high-level languages).  On some machines,
almost all arithmetic is integer.  Recently, while porting
my program to an Alliant, I vectorized some of these codes.
A frequent primitive operation requires finding integers X, Y such that 
X*2**30 + Y = A*B + C where A, B, C are given (vectors of) integers
up to 2**30 (2**30 is the radix for multiple-precision arithmetic).
The Alliant has a vectorized integer multiply instruction, but
only for the lower 32 bits of a product.  Hence I can get 
Y = IAND(A*B + C, 2**30 - 1) with vectorized integer operations.
To get X (upper half of product) using vector operations, 
I use floating point, such as

	X = NINT((DBLE(A)*DBLE(B) + DBLE(C - Y))*0.5**30)

(DBLE = convert integer to double, NINT = convert double to nearest integer).
This statement uses 4 conversions between integer and floating point
while doing only 3 floating point operations.
The vectorized DBLE is compiled inline, but the vectorized NINT
is done in a subroutine; the NINT subroutine uses 10% of my program's time
(but total program time is down from before vectorization).
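
	Spelled out as a loop, the primitive looks roughly like this
(a reconstruction from the formula above, not the exact code in my
package; the subroutine name MPSPLT is invented, and it assumes the
integer multiply keeps the lower 32 bits of the product, as on the
Alliant):

C     Split A*B + C into X*2**30 + Y, for digits A, B, C up to 2**30.
      SUBROUTINE MPSPLT(N, A, B, C, X, Y)
      INTEGER N, I
      INTEGER A(N), B(N), C(N), X(N), Y(N)
      DO 10 I = 1, N
C        Low half from the 32-bit integer multiply, masked to 30 bits.
         Y(I) = IAND(A(I)*B(I) + C(I), 2**30 - 1)
C        High half recovered in double precision, as in the formula above.
         X(I) = NINT((DBLE(A(I))*DBLE(B(I)) + DBLE(C(I) - Y(I)))
     &          * 0.5D0**30)
   10 CONTINUE
      END

Since A and B are at most 2**30, DBLE(A)*DBLE(B) can be off by a few
hundred from the exact product, but after scaling by 0.5**30 the error
is far below one half, so NINT lands exactly on X.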

	BTW, I have asked the X3J3 committee to add this operation 
(i.e., given A, B, C, D, all nonnegative, and either A < D or B < D,
find X, Y such that X*D + Y = A*B + C) to Fortran 8x.


--------
        Peter Montgomery
        pmontgom@MATH.UCLA.EDU 
	Department of Mathematics, UCLA, Los Angeles, CA 90024

davec@proton.amd.com (Dave Christie) (12/18/89)

  
In article <2100@sunset.MATH.UCLA.EDU> pmontgom@math.ucla.edu (Peter Montgomery) writes:
#In article <28413@amdcad.AMD.COM> davec@cayman.amd.com (Dave Christie) writes:
#>
#>Quite true, but (IMHO) for most applications, int<->fp traffic is not very
#>high, so having to explicitly move some data around is no big deal ... as
#>long as it is done reasonably efficiently so that you don't blow one or two
#>applications out of the water.
#>
#	I have written a multiple precision integer arithmetic package.
#Its time-critical routines are conditionally compiled so
#the algorithms can be individually optimized for different machines 
#(yes, I resemble Herman Rubin in wanting full access to machine
#instructions from high-level languages).  On some machines,

Yes, multiple precision seems to be one of the applications you may
want to be careful about (Herman Rubin replied as well!).  My rash 
generalization comes from a former life at a TLA mainframe company 
where I did detailed instruction frequency measurements of a wide range
of real-world Fortran programs - such things as structural analysis, 
flight simulation, ECAD, many other flavors (but not likely multiple 
precision - notice I'm not using the words "representative" or 
"comprehensive" ;-)  This indicated (among many other things) that the 
frequency of converts would not be a big factor in deciding whether 
to go with separate or shared registers, were one to design a new
general-purpose ISA - other factors carry more weight.  Of course,
your mileage may vary.  What one sees fit to do generally depends
on what one thinks will bring in the most bucks, right?  

#A frequent primitive operation requires finding integers X, Y such that 
#X*2**30 + Y = A*B + C where A, B, C are given (vectors of) integers
#up to 2**30 (2**30 is the radix for multiple-precision arithmetic).
#The Alliant has a vectorized integer multiply instruction, but
#only for the lower 32 bits of a product.  Hence I can get 
#Y = IAND(A*B + C, 2**30 - 1) with vectorized integer operations.
#To get X (upper half of product) using vector operations, 
#I use floating point, such as
#
#	X = NINT((DBLE(A)*DBLE(B) + DBLE(C - Y))*0.5**30)
#
#(DBLE = convert integer to double, NINT = convert double to nearest integer).
#This statement uses 4 conversions between integer and floating point
#while doing only 3 floating point operations.

This sounds like a nice extreme case.  Now I'm curious - does anyone know,
or care to figure out (I'm not curious enough to take the time myself :-)
whether the simultaneous integer/fp issue capability of the i860 would 
more than compensate for the penalty on converts caused by the separate 
int/fp registers, even on this extreme piece of code?

#        Peter Montgomery
#        pmontgom@MATH.UCLA.EDU 
#	Department of Mathematics, UCLA, Los Angeles, CA 90024

------------
Dave Christie
The opinions are mine, the facts belong to everyone .... oops - sorry,
the facts are proprietary.....