[comp.arch] RISC register windows

andy@Shasta.UUCP (02/25/87)

The sliding register window had an area for parameters passed
to the procedure, locals, and parameters passed to other procedures.
Calling lots of little procedures (that don't call other procedures)
that don't use their window fully is equivalent to doing a better
job allocating but having one less window.  The first call might
overflow, but subsequent ones won't.

There were also a set of global registers.

One argument for fixed window sizes is that register access
shouldn't require an addition.  Bit-wise or is one thing (or the
register number with the current base to determine the register);
addition is an expensive operation to put on the register access
data path.  This basically restricts window sizes to powers of two.
(Then again, if there are cheap functions that one could use to
implement variable size register windows, I'm sure someone on
the net will mention them.)  It then is probably easier to pick
a power than support a range of them.

Then again, I seem to remember that the window size wasn't a power
of two on the Berkeley RISC, which surprised me.  Maybe I'm wrong.

-andy
-- 
Andy Freeman
UUCP:  ...!decwrl!shasta!andy forwards to
ARPA:  andy@sushi.stanford.edu
(415) 329-1718/723-3088 home/cubicle

bcase@amdcad.UUCP (02/25/87)

In article <1335@Shasta.STANFORD.EDU> andy@Shasta.UUCP (Andy Freeman) writes:
>One argument for fixed window sizes is that register access
>shouldn't require an addition.  Bit-wise or is one thing (or the
>register number with the current base to determine the register);
>addition is an expensive operation to put on the register access
>data path.  This basically restricts window sizes to powers of two.
>
>Then again, I seem to remember that the window size wasn't a power
>of two on the Berkeley RISC, which surprised me.  Maybe I'm wrong.
>
True, the entire window in the Berkeley scheme (and Pyramid) isn't a power
of two in size, but the amount allocated/deallocated on call/return *is*
a power of two:  6 in, 10 local, 6 out = 22.  But 6 of those are overlapped
and aren't allocated on call, thus, 16 is the allocation size.

Also, the Berkeley scheme requires more than a simple or function; see
Katevenis' thesis.

    bcase
---------
It'll be worth it.  Keep watching this space.

johnw@astroatc.UUCP (02/26/87)

Its clear to me from the questions, that many people don't understand RISC.

RISC stands for Reduced Istruction Set Computer.

Features common to RISC machines:
 + One load/one store insturction.   (All others are reg-to-reg.)
 + Regular (read: easy to decode) instructions
    uaually a single (or limited set of) instruction format(s)
 + usually a fairly small set of insturctions

Additional features of the Berkeley RISC machine:
  + Register Windows
  + delayed branchs
  + Reg 0 is always 0
  + a one-bit field in the instuction to set codition codes
	(This was a great feature, in my opionion)
  + 3 operands/instruction

The Purpose of Register Windows was to remove the need for a cache.  As a
student project, a real cache design was deemed too hard.  In practice, I
think the Sun/3's and similar machines are proving that windows are not
the "only-true-way" to get good performace.

The division of registers:
-------------------------
Reg 0		always 0
Reg 1-9		Global (static) registers
Reg 10-16  (6)  input parameters
Reg 17-27 (10)  locals
Reg 28-31  (6)  output parameters

This 6/10/6 division of register is arbitrary, but is kept fixed to insure
maximum speed.  To make it dynamic, would require a size-register
(effectively nargs() value) probably *IN* the window (otherwize it would
not be possible to unwind the call-stack).

The division of register (6/10/6) was selected carfully, after weeks of
study on real code (see the RISC article in Computer, a few years back)

-   -   -   -   -   -   -   -   -   -   -   -   -   -   

As for dealing with register window over-flow, under-flow, the Berkeley
paper (as I recall) said that it's occurence was not frequent enough to
sweat about.   As for "dribbling" registers in or out (at the ends of the
window stack -- a circular queue with a dead-zone to represent the stack), 
would be fine for a small machine.

Thought experiment:

Lets build a *FAST* machine (I-cache, and DATA-cache) with
windows, and a memory system that's wide enough to fill a full window of
registers in one (memory access + split-transfer) time unit.  Now what
happens when you want to run a new process???   The state you must save is
not just your current window, but your *WHOLE* stack of windows!  Windows
are great for running benchmarks, but are not so good for doing context
switching.   Anyone know how Pyramid solved this???  How many users (vi,
csh, compiles, etc.) does it take to bring a Pyramid to its knees??

> >From: eppstein@tom.columbia.edu (David Eppstein)
> Subject: register window machine questions
> 
> (1) Has anyone tried making the window block size be just one register,
> i.e. the window can have an arbitrary alignment in relation to memory?
> I would expect this to be more efficient in terms of registers used (and
> therefore also memory), but register access time might suffer.
I don't understand this question!   The instruction has bit fields to
specify the register, so you can't go over the max-register number, and if
you don't use all you're registers, you're wasing resources anyway!  I
would venture to guess that FAST machines spend less that 10% of there
time running within 10% of their PEAK performance!!
 
> (2) dirty regs and (3) staic/window --> see above
> David Eppstein, eppstein@cs.columbia.edu, Columbia U. Computer Science Dept.
> 
---------------------------------------------
> 
> >From: chuck@amdahl.UUCP (Charles Simmons)
> Subject: Re: subroutine frequency
> 
> Are there reasons I am missing that make fixed sized windows extremely
> advantageous?
Just that fixed it simple (fast), easy to do, and doesn't require alot of
state to keep track of.  If you can show how dynamic windows are better
when you assume that memory *BANDWITH* is high enough, but its just a
*LATENCY* problem we have to deal with, then I'll vote with you.

NOTES: bandwith refers to thruput, or a long-term performace average.
       latency refers to responce, or the short-term time it takes to get
       what you ask for.   It's frequently easy to trade bandwith and
       latency, but its hard to improve *BOTH* at the same time!

			John W

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Name:	John F. Wardale
UUCP:	... {seismo | harvard | ihnp4} !uwvax!astroatc!johnw
arpa:   astroatc!johnw@rsch.wisc.edu
snail:	5800 Cottage Gr. Rd. ;;; Madison WI 53716
audio:	608-221-9001 eXt 110

To err is human, to really foul up world news requires the net!

howard@cpocd2.UUCP (02/26/87)

In article <1335@Shasta.STANFORD.EDU> andy@Shasta.UUCP (Andy Freeman) writes:
>The sliding register window had an area for parameters passed
>to the procedure, locals, and parameters passed to other procedures.
>There were also a set of global registers.
>
>One argument for fixed window sizes is that register access
>shouldn't require an addition.  Bit-wise or is one thing (or the
>register number with the current base to determine the register);
>addition is an expensive operation to put on the register access
>data path.  This basically restricts window sizes to powers of two.

Actually we didn't even use bitwise manipulation, we just built double
decoders for the overlapped registers so that two addresses would both
activate the same register.  You can easily see this in the chip photos,
in the upper left, as light and dark bands in the regfile decoder.  In
retrospect I'm not sure that was the best method.  Dave Johannsen once
suggested to me an alternate window structure in which the overlap could
be implemented with a "normal" RAM and only 2 extra gates!

>Then again, I seem to remember that the window size wasn't a power
>of two on the Berkeley RISC, which surprised me.  Maybe I'm wrong.

You are.  In RISC-I, say, in each window you have 6 "high" and 6 "low"
registers which overlap with the next window, 10 "local" ones which
don't, and 10 "global" ones which are the same in all windows.  That
adds up to 32.  However, the 10 global registers only exist in one
place, not in each window, so each window has 22 registers which are
non-global; this may be where you got your non-power-of-2-idea.  On
the other hand, adding an extra window costs 16 registers, because
either the high or the low (I forget which) already exist in the
previous window, so we're back to a power of 2 again.

The numbers in RISC-II are different, but could be plugged into the
preceding paragraph without making it untrue (I think).

Let H = # of high registers, M = # of local registers, L = # of low registers,
G = # of global registers, R = # of registers seen in each window, W = # of
windows, P = # of physical registers.  The following must hold for 2 or more
windows, assuming circular overflow (top-window high = bottom-window low):

	H = L
	R = H + M + L + G
	P = W*(M+L) + G
-- 

	Howard A. Landman
	...!intelca!mipos3!cpocd2!howard

bcase@amdcad.UUCP (02/26/87)

In article <161@astroatc.UUCP> johnw@astroatc.UUCP (John F. Wardale) writes:
>Its clear to me from the questions, that many people don't understand RISC.
Well, I am confident in *my* understanding.

>RISC stands for Reduced Istruction Set Computer.
True, this is true.  But the choice of the RISC acronym was unfortunate.
Reducing the number of instructions is incidental.  I know that the Berkeley
people emphasized this point in the beginning, but I think they (Patterson
and Katevenis (yes, I know he is out of the country)) would no longer
emphasize this aspect of simple machines.

>Features common to RISC machines:
>[a list of features]
Sigh, RISC machines have whatever features are necessary to achieve an
instruction rate of close to one per cycle and achieve a good compiler
interface.

>The Purpose of Register Windows was to remove the need for a cache.  As a
>student project, a real cache design was deemed too hard.  In practice, I
>think the Sun/3's and similar machines are proving that windows are not
>the "only-true-way" to get good performace.
Who ever said that windows were the only true way to get good performance?

>This 6/10/6 division of register is arbitrary, but is kept fixed to insure
No, not arbitrary. The amount allocated (in this case 16 regs) must be a
power of two.

>maximum speed.  To make it dynamic, would require a size-register
>(effectively nargs() value) probably *IN* the window (otherwize it would
>not be possible to unwind the call-stack).
Well, you can use dynamic links too.  (This makes the protocol for variable
sized windows more efficient.)

>Lets build a *FAST* machine (I-cache, and DATA-cache) with
>windows, and a memory system that's wide enough to fill a full window of
>registers in one (memory access + split-transfer) time unit.  Now what
Wow.  That's a wide bus, but I like the idea!  Bandwidth or bust!

>happens when you want to run a new process???   The state you must save is
>not just your current window, but your *WHOLE* stack of windows!  Windows
>are great for running benchmarks, but are not so good for doing context
>switching.   Anyone know how Pyramid solved this???  How many users (vi,
>csh, compiles, etc.) does it take to bring a Pyramid to its knees??
I worked at Pyramid, my thesis details a machine with variable sized windows,
and I have build a simple machine with variable size windows (er, activation
records that is).  Windows are good for running benchmarks *and* real
programs.  Variable sized windows are just better.  You are right, the entire
active portion of the register file must be swapped on a full context switch.
However, you are completely ignoring the fact that swapping the register file
is only a part, a small part, of the work that must be done to perform a
context switch.  The number of users needed to bring a Pyramid to its knees
is not correlated to the fact that it has register windows.  To be sure,
however, a real-time machine might be better off without windows (latency
vs. throughput).

Context switch overhead is mittigated by variable sized windows:  at least
the registers you are swpping are holding useful data.  With fixed size
windows, there is a high probability that some of the registers you are
swapping are not being used.  A smaller register file allowing variable
sized windows achieves the performance of a larger register file with fixed
sized windows.

    bcase
---------
Are you watching this space?