[comp.arch] Longer load/store because of register windows

marc@oahu.cs.ucla.edu (Marc Tremblay) (10/28/88)

>In article <469@oracle.UUCP> csimmons@oracle.UUCP (Charles Simmons) writes:
>
>If I remember the arguments from MIPS correctly (want to help me out
>John?), there's a stronger objection to multiple-window-register-files.
>I think it's something to the effect that register-windows cause the
>load/store access time to be slower.  

Having a multiple-window register file, or more precisely, having many
registers, slows down the processor cycle. Even with an independent port for 
the load/store, the operation is still based on the basic processor cycle.
With a longer cycle the load/store accesses become slower. 
There are two reasons: 1) for a large register file, let's say 128 registers,
the decoding of the register addresses takes longer (more bits to decode,
even if you use partial decoding there is still a penalty), 
2) the data bus is longer because it has to go over so many registers. 
A longer data bus implies larger capacitance and longer discharge time, 
thus longer processor cycle.
Usually the access to the register file, either on a register read/write
or on a load/store, is part of the critical path.

You can play some tricks to get around those drawbacks, 
for example the Am29000 uses overlapping to avoid the penalty caused by
the decoding. Even though the hardware is quite expensive (3 large decoders,
3 small adders, and some multiplexers), it is a gain.
The Intel 80960 uses a cache for local register sets. 
I haven't seen the layout :-), but it seems like the sets are separated
in a way that the data bus is not lengthened.
Finally you can organize the layout in such a way that the current register set
is always at the same place. Every time there is a change of window,
you need to shift out the current window to a backup window, and shift in 
the new window into the current window. This whole operation can be done 
in ONE cycle for register files of a reasonable size (we've done it for
a register file of 128 registers). This method makes the length of the 
data bus independent of the number of windows.
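
[Editor's sketch, not from the original post: a behavioral model of the
"current window always at the same place" idea. The class name, the window
size of 16 and the count of 8 windows are assumptions picked to total 128
registers. Window overlap and overflow/underflow spilling are not modeled.]

```python
WINDOW_SIZE = 16   # hypothetical registers per window
NUM_WINDOWS = 8    # hypothetical; 8 * 16 = 128 registers in total


class ShiftingWindowFile:
    """Reads and writes only ever touch one fixed-position 'current' array."""

    def __init__(self):
        self.current = [0] * WINDOW_SIZE
        self.backup = [[0] * WINDOW_SIZE for _ in range(NUM_WINDOWS)]
        self.cwp = 0   # index of the window now held in 'current'

    def _switch(self, new_cwp):
        # Modeled as a single step: shift the current window out to its
        # backup slot, then shift the new window in.
        self.backup[self.cwp] = list(self.current)
        self.current = list(self.backup[new_cwp])
        self.cwp = new_cwp

    def call(self):
        self._switch((self.cwp + 1) % NUM_WINDOWS)

    def ret(self):
        self._switch((self.cwp - 1) % NUM_WINDOWS)

    def read(self, r):
        return self.current[r]

    def write(self, r, value):
        self.current[r] = value
```

The access path sees WINDOW_SIZE registers no matter how large NUM_WINDOWS
is, which is the property claimed above for the data bus length.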

So the question is: Is it clever to invest in a large register file with
windows or is it better to use the silicon for other circuitry? 
The answer depends on how good your compiler people are!

					Marc Tremblay
					marc@CS.UCLA.EDU
					...!(ihnp4,ucbvax)!ucla-cs!marc
					Computer Science Department, UCLA

mash@mips.COM (John Mashey) (10/28/88)

In article <17268@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:

>Having a multiple-window register file, or more precisely, having many
>registers, slows down the processor cycle. Even with an independent port for 
>the load/store, the operation is still based on the basic processor cycle.
>With a longer cycle the load/store accesses become slower. 
>There are two reasons: 1) for a large register file, let's say 128 registers,
>the decoding of the register addresses takes longer (more bits to decode,
>even if you use partial decoding there is still a penalty), 
>2) the data bus is longer because it has to go over so many registers. 
>A longer data bus implies larger capacitance and longer discharge time, 
>thus longer processor cycle....

>So the question is: Is it clever to invest in a large register file with
>windows or is it better to use the silicon for other circuitry? ....

Could you quantify the hit from these issues, or point at some references
that show data for this?
In any of my competitive analysis, I've never made an issue of it,
and generally haven't thought it was an issue....if somebody else
proves it is an issue, I won't argue too much! :-)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

tim@crackle.amd.com (Tim Olson) (10/28/88)

In article <17268@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
| Having a multiple-window register file, or more precisely, having many
| registers, slows down the processor cycle. Even with an independent port for 
| the load/store, the operation is still based on the basic processor cycle.
| With a longer cycle the load/store accesses become slower. 

While larger register files do slow down register access by a very small
amount, I don't think that the register file access is really the
critical path on most machines (it certainly isn't the main critical
path on the Am29000).  As has been stated here before, cache accesses
(for instructions, data, or both), or MMU translations are probably the
limiting factor.

The empirical evidence also tends to support this:  both Am29000 and
SPARC (with large register files) have 30MHz versions available, while
MIPS and 88000 (32-register files) are at 20 - 20MHz.  Note that this
certainly does not mean that they couldn't be at 30MHz, I'm just
pointing out that the register file size has little to do with the
overall cycle time in real processors.


	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)

bcase@cup.portal.com (Brian bcase Case) (10/29/88)

>Having a multiple-window register file, or more precisely, having many
>registers, slows down the processor cycle.  With a longer cycle the
>load/store accesses become slower. There are two reasons: 1) for a large
>register file, let's say 128 registers, the decoding of the register
>addresses is longer (more bits to decode, even if you use partial decoding
>there is still a penalty), 2) the data bus is longer because it has to go
>over so many registers. A longer data bus implies larger capacitance and
>longer discharge time, thus longer processor cycle.

This is true in theory.  However, there are two effects that prevent a
(within reason) large reg. file from slowing down the basic cycle:
	Circuit design can solve some speed problems; it's just a matter
	of spending the power budget.  It is possible to make a register
	file that reads and writes *at the same time*.  But it is larger
	and more wasteful.  Yes, decoding time and capacitive effects
	can be a problem.  But even for the 29K's reg file, the access
	time is something like 10ns, and that's good old 1.25 micron
	technology.

	However, no one designs a reg. file that reads and writes at the
	same time because it isn't necessary; the register file isn't
	the critical path.  Things like the cache tag access-> compare->
	set_select and ALU_subtract->condition_code  (or its equivalent)
	are the harder things.  Again, clever circuit design applies, but
	what usually happens is that a compromise is reached so that
	almost everything "is the critical path."  If you think about it,
	you see that anything else would be a poor, unbalanced design,
	and the circuit designers would get fired!
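
[Editor's sketch, not from the original post: the "critical path" argument as
arithmetic. Only the roughly 10 ns register file access and the 30 MHz clock
come from this thread; the other two path delays are hypothetical numbers
chosen to fit inside a 30 MHz cycle.]

```python
def cycle_time_ns(clock_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / clock_mhz


def min_cycle_ns(path_delays_ns):
    """The cycle can be no shorter than the slowest path."""
    return max(path_delays_ns.values())


paths_ns = {
    "register file access": 10.0,                # "something like 10ns" in the post
    "cache tag -> compare -> set select": 30.0,  # hypothetical
    "ALU subtract -> condition code": 28.0,      # hypothetical
}

critical = max(paths_ns, key=paths_ns.get)
print("critical path:", critical)
print("30 MHz period:", round(cycle_time_ns(30), 1), "ns")

# Slowing the register file by 20% leaves the minimum cycle unchanged,
# because another path is still the slowest one.
slower_rf = {**paths_ns, "register file access": 12.0}
print(min_cycle_ns(paths_ns) == min_cycle_ns(slower_rf))
```

In a balanced design the other paths would be tuned until they are all close
to the limit, so the margin in this toy example is larger than a real one.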

>You can play some tricks to get around those drawbacks, 
>for example the Am29000 uses overlapping to avoid the penalty caused by
>the decoding.

The 29k does no more overlapping than the next guy, to my knowledge.
BUT, I am not one of the circuit designers, so I might be wrong.

>The Intel 80960 uses a cache for local register sets. 
>I haven't seen the layout :-), but it seems like the sets are separated
>in a way that the data bus is not lengthened.

Again, the lengthening of the data bus is a concern, but need not be a
problem.  In a reasonable architecture, a pipeline register sits right
after the register file and right after the ALU.  The lengthening of
the data bus is important only for those internal bus operations that
have to traverse the entire length of the processor bit cells.  Very
few operations have to do this (maybe things involving indirect jumps
or something; the PC section might be at the end opposite the reg. file
in the bit cell), and, again, you just have to have mega drivers for them.

I don't mean to say this is *free*, but it is possible, and so a large
register file does not necessarily slow the processor cycle.

>So the question is: Is it clever to invest in a large register file with
>windows or is it better to use the silicon for other circuitry? 
>The answer depends on how good your compiler people are!

Well, the windows question is a hard one, so I won't even try that one.
But as for *lots* of registers:

Registers have three times the bandwidth and much
less latency than the next level in the memory hierarchy.  Maybe I can't
think of what to do with them today (actually, I can think of several
things), but putting a critical resource in a register instead of a
cache makes a BIG difference on almost any machine, certainly on simple
ones.  The 29k UNIX implementations keep operating system goodies in
protected global registers while user code is running.  So do some
CISC processors, but in them, you have no choice about what gets put
there, it's defined by the architecture.  Lots of registers give you
a powerful resource that you can use any way you want!  When 29k compilers
get the sort of capability found in the DEC Titan compilers (wishful
thinking?), the ability to do universal register allocation, global registers
will be useful for keeping a program's global scalars in registers.  This
will probably necessitate implementing the "missing" 64 global registers
because the greedy OS guys have already hoarded most of the 64 existing
globals. :-)

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/29/88)

In article <7195@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>In article <17268@shemp.CS.UCLA.EDU> marc@cs.ucla.edu (Marc Tremblay) writes:
>
>>Having a multiple-window register file, or more precisely, having many
>>registers, slows down the processor cycle. Even with an independent port for 

>Could you quantify the hit from these issues, or point at some references
>that show data for this?

I don't know enough of the details to know why this is so, but I have been
told that "all other things being equal" (i.e. a typical processor design
with typical technology) it is difficult to provide more than about
32 registers with single cycle read/write access.  The only machine that
I know well that is germane is the Cyber 205, which has 256 G.P. registers
and takes two cycles to read/write the register file, and which has
additional logic to hide that fact in most circumstances.  So, it doesn't
prove the case, but it doesn't contradict it either.  Since 32 is usually
acknowledged to be in the diminishing returns area, I'm not sure how many
people have tried to provide more registers.  I have no knowledge of how
the registers are organized in one of the new register window machines, 
and whether this is a problem on those machines.

Anyway, the original poster's hypothesis, that "many" registers creates an
access speed problem, seems reasonable at this point.


-- 
  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

bcase@cup.portal.com (Brian bcase Case) (10/30/88)

>I don't know enough of the details to know why this is so, but I have been
>told that "all other things being equal" (i.e. a typical processor design
>with typical technology) it is difficult to provide more than about
>32 registers with single cycle read/write access.

Wow, I don't know who told you this, but it is not true for VLSI
implementations.  Creating a register file from discrete memory chips
is *much* harder (I did it once in ECL; try it with two write ports!).
In VLSI, the register file is typically implemented in a write-before-
read fashion; so in some sense, the register file is doing two accesses
per cycle.  I.e., a register file read must take less than 1/2 cycle
(leaving room for pipeline register set-up time, some clock margin, etc.).
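
[Editor's sketch, not from the original post: a behavioral model of a
write-before-read register file. The class and method names are hypothetical.
Each call to cycle() stands for one clock: the write lands in the first half
and the reads happen in the second half, so a read of the register being
written returns the new value.]

```python
class WriteBeforeReadFile:
    def __init__(self, size=128):
        self.regs = [0] * size

    def cycle(self, write=None, reads=()):
        """One clock cycle: perform the write first, then the reads."""
        if write is not None:
            index, value = write
            self.regs[index] = value             # first half-cycle
        return [self.regs[r] for r in reads]     # second half-cycle


rf = WriteBeforeReadFile()
print(rf.cycle(write=(5, 99), reads=(5, 6)))
```

Because the read sees the value written in the same cycle, no separate
bypass is needed for that particular case in this model.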

WRT the discussion about variable-sized windows, the 29K allows (doesn't
force) activation records (stack frames) to be allocated in the local
register file (this is what windowing means).  These activation records
are variable-sized windows, with resolution of one register, up to
the size of the local register file (128 registers).  The "stack-pointer-
plus-offset" calculation is overlapped with the register file write
operation (write-before-read).

The register file can also be used as a flat space of 128 registers.
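
[Editor's sketch, not from the original post: variable-sized windows addressed
as stack pointer plus offset, as described above. The names, the downward
growth direction and the modulo-128 wrap are assumptions of this sketch.
Spilling to memory when frames exceed 128 registers is not modeled.]

```python
LOCAL_REGS = 128   # size of the local register file, per the post


class LocalRegisterFile:
    def __init__(self):
        self.regs = [0] * LOCAL_REGS
        self.sp = 0   # register stack pointer, counted in registers

    def physical(self, offset):
        """Stack-pointer-plus-offset, wrapped to the size of the file."""
        return (self.sp + offset) % LOCAL_REGS

    def allocate(self, frame_size):
        # Assumed convention: a new frame is allocated by moving sp downward.
        self.sp = (self.sp - frame_size) % LOCAL_REGS

    def release(self, frame_size):
        self.sp = (self.sp + frame_size) % LOCAL_REGS

    def read(self, offset):
        return self.regs[self.physical(offset)]

    def write(self, offset, value):
        self.regs[self.physical(offset)] = value


rf = LocalRegisterFile()
rf.allocate(6)       # caller takes a 6-register frame
rf.write(0, 11)
rf.allocate(3)       # callee takes a 3-register frame
rf.write(0, 22)
print(rf.read(3))    # the caller's register 0, seen from the callee's frame
```

Frames of any size from one register up to the whole file use the same
address calculation; only the amount by which sp moves changes.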