[comp.arch] i860 registers/chip in general

rstewart@megatek.UUCP (Rich Stewart) (06/11/90)

Does anyone out there *know* why Intel insisted on creating a distinction
between floating point and integer registers? Are they planning to
change this in the ix60 model?? Sure could write faster software if they
did.

-Rich

preston@titan.rice.edu (Preston Briggs) (06/12/90)

In article <495@tau.megatek.uucp> rstewart@megatek.UUCP (Rich Stewart) writes:
>Does anyone out there *know* why Intel insisted on creating a distinction
>between floating point and integer registers? Are they planning to
>change this in the ix60 model?? Sure could write faster software if they
>did.

I *like* separate FP and integer register sets.  Am I in a minority among
compiler writers?  I can give a few reasons of varying importance.

- Separate register sets allow us to specify more registers in
  a shorter instruction word.  5 bits can specify 32 int regs or
  32 FP regs.  On a three-address machine, we save 3 bits (3 x 5 = 15
  rather than 3 x 6 = 18) over naming 64 registers another way.

- Coloring register allocators can handle either style, but it's
  sometimes possible (e.g., on the i860, and probably the MIPS and SPARC)
  to do separate allocations.  This allows a very nice space saving
  at compile time.  (The interference graph is proportional to n^2 where
  n is the number of live ranges to be colored.  By coloring the ints
  separately, we can save up to a factor of 4 on the interference
  graph alone.  There is also a related time saving for zeroing
  the graph initially.  Additionally, the savings in cache
  misses and paging overhead will be important.)

- Fancy restructuring to take advantage of the memory hierarchy
  needs to know the size of the cache and the size of the register
  set(s).  Separate register sets increase the precision of our knowledge.
  That is, we don't have an unknown number of addressing temporaries
  out there competing for our FP registers.
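
The arithmetic behind the first two points can be checked with a short
sketch. The function names and the figure of 2000 live ranges are made up
for illustration, and the graph is modeled as a plain n-by-n bit matrix,
which may not match what a given allocator actually stores. The factor of 4
is the best case: an even int/FP split, with the two graphs built one after
the other so only one is in memory at a time.

```python
def operand_bits(num_operands, regs_per_set):
    """Bits needed to name num_operands registers out of regs_per_set."""
    bits_per_field = (regs_per_set - 1).bit_length()
    return num_operands * bits_per_field


def graph_bits(live_ranges):
    """Size in bits of a full n-by-n interference bit matrix (assumed layout)."""
    return live_ranges * live_ranges


# Three-address instruction: two sets of 32 versus one unified set of 64.
split_bits = operand_bits(3, 32)    # three 5-bit fields
unified_bits = operand_bits(3, 64)  # three 6-bit fields
bits_saved = unified_bits - split_bits

# Hypothetical routine: 2000 live ranges, half integer and half FP.
n = 2000
one_graph = graph_bits(n)
# Coloring one class at a time keeps a single half-sized graph live.
split_graph = graph_bits(n // 2)
ratio = one_graph / split_graph

print(bits_saved, ratio)
```

If both half-sized graphs had to be held in memory at once, the saving
would drop to a factor of 2.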

How do separate sets slow your code?  I can think of integer multiplies
and perhaps the pipelined loads which can only be done in the FP set.

Me?  I vote for a pipelined reciprocal and no pixel operations and
a prefetch for the data cache and ...

--
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

marc@oahu.cs.ucla.edu (Marc Tremblay) (06/12/90)

In article <495@tau.megatek.uucp> rstewart@megatek.UUCP (Rich Stewart) writes:
>Does anyone out there *know* why Intel insisted on creating a distinction
>between floating point and integer registers? Are they planning to
>change this in the ix60 model?? Sure could write faster software if they
>did.

As mentioned in another article, compilers can use that feature to improve
performance. Other reasons, just as important, are related to the logic and
the layout of the chip.

First of all, in order to have two instructions execute in one cycle
(one "core" and one floating-point instruction in the dual-instruction mode),
the architecture must allow parallelism between operations using the core
register file and the fpu register file.
This could be done by providing one large register file with many ports and
extra logic to avoid read/write conflicts between the two units.
Notice that the fpu register file already has 5 ports and the core unit
has a 3-port register file (some of these ports are time-multiplexed).
Combining these two register files could lead to extra capacitance on the
buses, slower register cells and possibly a longer clock cycle. Separation
of the two units allows faster separate execution.

The register files of the i860 are disjoint and can thus each be tailored
to their own unit. The fixed-point unit has a 32-bit datapath,
so the register file needs to be only 32 bits wide, with two buses running on
top of it to provide the two operands.
The floating-point unit, on the other hand, has a 64-bit datapath
going through the adder and multiplier. The registers for the fpu seem to be
organized as a stack of 64-bit registers so that two operands can be routed
to the adder/multiplier in one cycle and match the width of the datapath.
To allow floating-point loads of up to 128 bits to be done in a single cycle,
the width of the datapath was made wider than the 32-bit datapath of the core.
It is not clear whether, when loading 128 bits, the words are demultiplexed
onto two buses to load two rows of registers (each row containing 64 bits),
or whether the register file is organized as a stack of 128-bit registers which
is in turn multiplexed onto the 64-bit datapath.
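
As a toy model of the first of those two possibilities only, the sketch
below treats the fpu register file as a list of 64-bit rows and splits a
128-bit load across two adjacent rows. Everything here is an assumption made
for illustration: the 16-row size, the even-row alignment check, and the
choice to put the low half in the lower-numbered row. It says nothing about
how the actual silicon is laid out.

```python
MASK64 = (1 << 64) - 1


def load128_two_rows(regfile, row, value128):
    """Split a 128-bit value across two adjacent 64-bit rows (assumed layout)."""
    if row % 2 != 0:
        # Assumption: a 128-bit load must start on an even row.
        raise ValueError("128-bit load assumed to need an even row index")
    regfile[row] = value128 & MASK64               # low half, first bus
    regfile[row + 1] = (value128 >> 64) & MASK64   # high half, second bus


def read_operand(regfile, row):
    """Read one 64-bit operand, matching the 64-bit adder/multiplier datapath."""
    return regfile[row]


regfile = [0] * 16  # assumed: 16 rows of 64 bits
load128_two_rows(regfile, 4, (0xAAAA << 64) | 0xBBBB)
low = read_operand(regfile, 4)
high = read_operand(regfile, 5)
```

In this model each row feeds the 64-bit datapath directly, so no
multiplexing is needed on the read side; the alternative organization
would move that cost from the load path to the operand path.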

Finally, the 8-Kbyte (2-way) data cache also has a 128-bit internal path
which fits very well (on the chip) with the fpu: 4 data cache cells
have the same pitch as one fpu cell. This interaction between the
floating-point register file and the data cache allows the cache
to be used as "vector registers" fulfilling the necessary bandwidth
for some matrix operations.

In a word, YES it makes a lot of sense, from the architectural and
VLSI point of view, to separate the register files.

			Marc Tremblay
			internet: marc@CS.UCLA.EDU
			UUCP: ...!{uunet,ucbvax,rutgers}!cs.ucla.edu!marc

dik@cwi.nl (Dik T. Winter) (06/12/90)

In article <8744@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
 > How do separate sets slow your code?  I can think of integer multiplies
 > and perhaps the pipelined loads which can only be done in the FP set.

One important point (that Sun forgot to take into account in the SPARC):
it is important to have one of two (perhaps more that I cannot think of now)
properties:
a.  A direct datapath from the general registers to the floating point
    registers.
b.  A calling sequence that allows passing parameters in floating point
    registers.
And also important:
c.  Conversion from integer to floating point and vice versa should go from
    integer registers to floating point registers, and the other way around.

The SPARC architecture as defined provides neither a, nor b, nor c, which
results in a considerable slow-down on many floating-point applications.

(Not that I want to bash the SPARC; it is a good architecture in quite a
few ways, but it has its architectural defects.)
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

mark@mips.COM (Mark G. Johnson) (06/12/90)

In article <1635@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
> ... it is important to have ...
>a....
>b....
>c.  Conversion from integer to floating point and vice versa should go from
>    integer registers to floating point registers, and the other way around.

Has there ever been an architecture developed which [1] had/has separate
registers for FP and for integers; [2] obeyed Winter's property (c.)
above?  Machines like the 88k (having non-separate int/FP registers)
do so trivially, of course.
-- 
 -- Mark Johnson	
 	MIPS Computer Systems, 930 E. Arques M/S 2-02, Sunnyvale, CA 94086
	(408) 524-8308    mark@mips.com  {or ...!decwrl!mips!mark}

rob@wiley.UUCP (Robert Heiss) (06/13/90)

In article <1635@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
[SPARC is missing ...]
>a.  A direct datapath from the general registers to the floating point
>    registers.
>b.  A calling sequence that allows passing parameters in floating point
>    registers.
>c.  Conversion from integer to floating point and vice versa should go from
>    integer registers to floating point registers, and the other way around.

In article <39315@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>Has there ever been an architecture developed which [1] had/has separate
>registers for FP and for integers; [2] obeyed Winter's property (c.) above?

The CLIPPER architecture has properties a, b, and c.  With floating point
integrated on the same chip as the integer unit, one would expect f.p.
conversions to be fast.  But the C100 implementation has such inefficiently
microcoded conversion instructions that the equivalent SPARC sequence is
faster.

I would like to see an experimental compiler/library for SPARC which avoids
the register windows.  Code as if there were 31 general registers and pass
parameters in registers like MIPS or i860.  Pass f.p. parameters in f.p.
registers.  Do interprocedural optimization too.  If the register windows
aren't a speed advantage, consider defining a SPARC2 architecture without
the window hardware.  As of 1990, transistors still aren't free.

	-Robert Heiss    rob@wilbur.coyote.trw.com