[comp.arch] register window machine questions

eppstein@tom.columbia.edu.UUCP (02/24/87)

Register window machines such as the Berkeley RISCs have been discussed
recently in this group.  I have some questions about these architectures:

(1) Has anyone tried making the window block size be just one register,
i.e. the window can have an arbitrary alignment in relation to memory?
I would expect this to be more efficient in terms of registers used (and
therefore also memory), but register access time might suffer.

(2) Do any architectures dribble out dirty registers near the bottom of
the set of on-chip registers in otherwise unused memory cycles, or
similarly dribble them in when they move back on-chip rather than
stopping everything else while bringing them back in?

(3) Are there machines with both dynamic (windowed) and static (normal)
registers?  For instance it might be useful for quick O.S. interrupts to
use only statics and not go to the trouble of setting up a register
window stack (which would involve lots of reads and writes moving the
old stack completely out to memory and then back again later).  Static
registers could also be useful for non-local variables.
-- 
David Eppstein, eppstein@cs.columbia.edu, Columbia U. Computer Science Dept.

howard@cpocd2.UUCP (02/25/87)

In article <4376@columbia.UUCP> eppstein@tom.columbia.edu (David Eppstein) writes:
>Register window machines such as the Berkeley RISCs have been discussed
>recently in this group.  I have some questions about these architectures:
>
>(1) Has anyone tried making the window block size be just one register,
>i.e. the window can have an arbitrary alignment in relation to memory?
>I would expect this to be more efficient in terms of registers used (and
>therefore also memory), but register access time might suffer.

Decoding gets a lot more complicated, but this approach might have some merit
if the memory itself was expensive relative to the decoding (probably not the
case in any MOS technology).  The maximum number of useful windows is about
6 or 7 anyway, for the UNIX programs examined by the Berkeley group.  Beyond
that the increased access time costs more than the extra windows save.  Of
course, with finer granularity, the access time would grow more slowly ...

>(2) Do any architectures dribble out dirty registers near the bottom of
>the set of on-chip registers in otherwise unused memory cycles, or
>similarly dribble them in when they move back on-chip rather than
>stopping everything else while bringing them back in?

The Berkeley RISC machines don't have any unused memory cycles.  So for them,
the answer is no.  This exact matching of processor memory requirement to
available memory bandwidth is a subtle, powerful, and not-often-appreciated
feature of those architectures.

>(3) Are there machines with both dynamic (windowed) and static (normal)
>registers?

Even in the Berkeley RISCs, 10 or so of the 32 registers are "global",
meaning they are the same in all windows.  So, yes.

>For instance it might be useful for quick O.S. interrupts to
>use only statics and not go to the trouble of setting up a register
>window stack (which would involve lots of reads and writes moving the
>old stack completely out to memory and then back again later).

This is a severely muddled question.  Mu!  Register windows IN HARDWARE,
which is what we're discussing, require almost NO setup time.  So I don't
understand what "trouble" and "reads" and "writes" you're talking about.
If you are talking about context switches, then yes there is an increased
overhead since (depending on how you implement it) multiple windows of
registers may need to be saved.  But context switches occur about two
orders of magnitude less frequently (in UNIX) than calls/returns, and so
this has no significant performance impact.  As far as "quick O.S. interrupts"
are concerned, these only need one set of registers, and so can be treated
pretty much just like a procedure call (being careful not to use the
overlapped registers).  No context switch is necessary in most cases.
Again, interrupts occur far less frequently than calls/returns, and so
even if it wasn't easy to handle them efficiently, the performance effect
would be small.  The only exception might be extremely high speed real-time
applications.

>Static registers could also be useful for non-local variables.

That's what they get used for, all right.  It turns out that in C
virtually all arrays are global, for example, and if you're looping
through all the elements of one it's nice to keep the address in a
register.
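For instance, something like this (a made-up fragment, not code from any
particular machine; "register" here is only a hint to the C compiler, but
on a windowed machine the compiler would pin the base address in one of
the global registers):

    #include <stdio.h>

    #define N 1000
    int table[N];                   /* like most C arrays: a global */

    long sum_table(void)
    {
        register int *p = table;    /* array base kept in a register */
        register int i;
        long sum = 0;

        for (i = 0; i < N; i++)     /* no reload of the base inside the loop */
            sum += *p++;
        return sum;
    }

    int main(void)
    {
        int i;
        for (i = 0; i < N; i++)
            table[i] = i;
        printf("sum = %ld\n", sum_table());
        return 0;
    }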
-- 

	Howard A. Landman
	...!intelca!mipos3!cpocd2!howard

jack@mcvax.UUCP (02/25/87)

Something that I would be very interested in is whether there
is any research done into machines with an unlimited register
file.
What I have in mind is a stackish machine, with an intelligent
stack cache that will delay writes around the stack pointer
(since there's a very good chance that they will never have
to be written at all).

The cache doesn't even have to be that intelligent: data should
be written out as soon as a write to a different address lands
in the same cache entry. Moreover, whenever the stack pointer
is decremented (on routine exit, for instance), you can scratch
all data above it.
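To make that policy concrete, here is a rough C simulation of the idea
(my own sketch, with made-up names; not a description of any real
hardware): a direct-mapped set of slots where a dirty word is written
back only when a write to a *different* address collides with its slot,
and where popping the stack just invalidates everything above the new
stack pointer.

    #include <stdio.h>

    #define MEM_WORDS 1024
    #define SLOTS       16          /* direct-mapped, one word per slot */

    static int memory[MEM_WORDS];   /* backing store for the stack */
    static int sp = 0;              /* stack grows toward higher addresses */

    static struct slot {
        int valid, dirty, addr, data;
    } cache[SLOTS];

    static void cache_write(int addr, int data)
    {
        struct slot *s = &cache[addr % SLOTS];
        if (s->valid && s->dirty && s->addr != addr)
            memory[s->addr] = s->data;      /* delayed write-back, forced by a collision */
        s->valid = 1; s->dirty = 1; s->addr = addr; s->data = data;
    }

    static int cache_read(int addr)
    {
        struct slot *s = &cache[addr % SLOTS];
        if (s->valid && s->addr == addr)
            return s->data;                 /* hit: never went to memory */
        return memory[addr];                /* miss: read from backing store */
    }

    static void push(int data) { cache_write(sp++, data); }

    static void pop_frame(int new_sp)       /* routine exit: sp moves back down */
    {
        int i;
        for (i = 0; i < SLOTS; i++)         /* scratch everything above the new sp */
            if (cache[i].valid && cache[i].addr >= new_sp)
                cache[i].valid = 0;
        sp = new_sp;
    }

    int main(void)
    {
        int frame = sp;
        push(1); push(2); push(3);          /* locals of a short-lived routine */
        printf("top of stack = %d\n", cache_read(sp - 1));
        pop_frame(frame);                   /* exit: those writes never reach memory */
        return 0;
    }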

One of the advantages of this seems to be that compilers become
simpler (no special cases for routines that run out of registers),
and routines that are recursive at the beginning would benefit
by not having (unnecessary) register-saves.
-- 
	Jack Jansen, jack@cwi.nl (or jack@mcvax.uucp)
	The shell is my oyster.

bcase@amdcad.UUCP (02/26/87)

In article <4376@columbia.UUCP> eppstein@tom.columbia.edu (David Eppstein) writes:
>Register window machines such as the Berkeley RISCs have been discussed
>recently in this group.  I have some questions about these architectures:
>
>(3) Are there machines with both dynamic (windowed) and static (normal)
>registers?  For instance it might be useful for quick O.S. interrupts to
>use only statics and not go to the trouble of setting up a register
>window stack (which would involve lots of reads and writes moving the
>old stack completely out to memory and then back again later).  Static
>registers could also be useful for non-local variables.

The Pyramid machine has 16 "global" registers that are addressable by all
procedures; i.e., they are a part of every procedure context.

    bcase
---------
This space is not for rent; it will soon be used, so keep watching.

bcase@amdcad.UUCP (02/26/87)

In article <7284@boring.mcvax.cwi.nl> jack@boring.UUCP (Jack Jansen) writes:
>
>Something that I would be very interested in is whether there
>is any research done into machines with an unlimited register
>file.
>What I have in mind is a stackish machine, with an intelligent
>stack cache that will delay writes around the stack pointer
>(since there's a very good chance that they will never have
>to be written at all).

Close, but not quite, the C Machine Stack Cache.  See SIGARCH/SIGPLAN
notices, vol. 17, no. 4, April 1982.

baum@apple.UUCP (02/27/87)

--------
>
>>(3) Are there machines with both dynamic (windowed) and static (normal)
>>registers? For instance it might be useful for quick O.S. interrupts to
>>use only statics and not go to the trouble of setting up a register
>>window stack (which would involve lots of reads and writes moving the
>>old stack completely out to memory and then back again later).


Ditzel et al. have finally published details of the CRISP processor,
which is a chip that implements a stack cache with 32 registers.  This
implementation appears to be, in effect, a register window with
variable overlap.  An entry instruction at the head of each procedure
allocates space in the 'register file', spilling to main memory if
required.  A cleanup instruction after return specifies how many
top-of-stack locations will be needed, and hardware will 'unspill' to
keep that many valid.  The architecture is memory to memory... the
stack cache can be addressed as ordinary memory (but is obviously
faster), and can be byte addressed.  They claim exceptionally high hit
ratios on the stack cache, and performance of about 9x a VAX 780
running at 16MHz (for some number of benchmarks including PCC --
smaller benchmarks show 30x!).  There are a number of other
architectural innovations, most importantly a zero-cycle branch.
There are papers in both this week's Compcon and ISSCC proceedings.
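Roughly, the mechanism behaves like this (my own back-of-the-envelope C
model reconstructed from the description above, with made-up names like
entry/leave/cleanup; not AT&T's actual design): the stack cache holds the
top words of the memory stack, the entry instruction allocates a frame
and spills the oldest words when the cache fills, and the cleanup
instruction unspills enough to keep the requested number of top-of-stack
words resident.

    #include <stdio.h>

    #define CACHE_WORDS 32
    #define STACK_WORDS 4096

    static int backing[STACK_WORDS];    /* the memory stack */
    static int cache[CACHE_WORDS];      /* holds stack words [low, sp) */
    static int sp  = 0;                 /* top of stack, grows upward */
    static int low = 0;                 /* lowest stack address still cached */

    static void spill_to(int new_low)   /* write the oldest words out to memory */
    {
        for (; low < new_low; low++)
            backing[low] = cache[low % CACHE_WORDS];
    }

    static void unspill_to(int new_low) /* read spilled words back into the cache */
    {
        while (low > new_low) {
            low--;
            cache[low % CACHE_WORDS] = backing[low];
        }
    }

    static void entry(int nwords)       /* procedure entry: allocate a frame */
    {
        sp += nwords;
        if (sp - low > CACHE_WORDS)
            spill_to(sp - CACHE_WORDS);
    }

    static void leave(int nwords)       /* procedure exit: pop the frame */
    {
        sp -= nwords;
    }

    static void cleanup(int nwords)     /* after return: keep nwords of TOS valid */
    {
        int target = sp - nwords;
        if (target < 0)
            target = 0;
        if (low > target)
            unspill_to(target);
    }

    int main(void)
    {
        entry(8);                       /* caller's frame */
        entry(40);                      /* big callee frame: forces a spill */
        leave(40);                      /* callee returns */
        cleanup(8);                     /* refill the caller's 8 words on demand */
        printf("sp=%d low=%d words resident=%d\n", sp, low, sp - low);
        return 0;
    }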

Interrupt handling is very simple.  The PC, PSW, and an interrupt ID get
pushed onto the interrupt stack.  The stack cache, however, is still
addressed by the regular stack pointer; therefore all interrupt-stack
accesses go to main memory instead of the stack cache.  If the interrupt
code runs in the same address space, simply changing the interrupt stack
pointer to be the same as the stack pointer suddenly gives you access to
the stack cache.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

chuck@amdahl.UUCP (02/27/87)

In article <448@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes:
>In article <4376@columbia.UUCP> eppstein@tom.columbia.edu (David Eppstein) writes:
>>[David implies that an interrupt requires a complete context switch
>>of writing all active windows to memory and reloading them after
>>handling the interrupt]
>
>[Howard points out that for an interrupt, the operating system code
>that handles the interrupt can promise not to trash active windows,
>and continue to use the existing stack.  He also points out that
>context switches between user processes occur relatively infrequently
>and thus the overhead of writing out all active windows and reading in
>new windows on a context switch is not important.]

I agree with this, so I'm going to change the subject slightly.  Lately
I've been thinking about operating systems which would perform a complete
context switch on every interrupt.  In particular, interrupt handlers
would run in a restricted environment (i.e. user mode) where their
ability to accidentally trash other interrupt handlers, the operating
system, and device drivers would be minimized.

My biggest problem with this concept is that context switching, even
on a machine with "only" 16 registers, is extremely expensive.
I'd be interested in hearing about architectures that minimize the
expense of a context switch.
 
(The expected advantage of the OS I'm toying with is that it would
make it much easier to write and debug programs (subroutines?) that
are normally part of a Unix kernel.  I've been writing some device
drivers lately, and I'm getting frustrated with the cycle of fixing
a bug, compiling the code, linking it with the kernel, rebooting the
system, testing the code and finding another bug, rebooting the system
to fix the new bug...)

(If device drivers ran in user land, my cycle would be reduced to:
fixing a bug, compiling the code, testing the code and finding a bug...)
 
-- Chuck

rw@beatnix.UUCP (02/27/87)

In article <5763@amdahl.UUCP> chuck@amdahl.UUCP (Charles Simmons) writes:
>
>I agree with this, so I'm going to change the subject slightly.  Lately
>I've been thinking about operating systems which would perform a complete
>context switch on every interrupt.  In particular, code for handling
>interrupts would run in a restricted environment (ie user mode) where 
>their ability to accidentally trash other interrupt handlers, the operating 
>system, and device drivers would be minimized.
>
>My biggest problem with this concept is that context switching, even
>on a machine with "only" 16 registers is extremely expensive.
>I'd be interested in hearing about architectures that minimize the
>expense of a context switch.

In Embos (Elxsi Message Based Operating System), I/O drivers (called access
managers) are user processes.  Everything is a "user process" -- there are
no privileged instructions.  While there are intra-process interrupts, they
are not used to implement I/O.  I/O is done through normal use of the message
system.  I/O controllers look like software processes, and the access managers
just send and receive messages with them.  This makes debugging AMs for new
controllers easy because you can simulate the controller in software
transparently to the AM.

The Elxsi architecture has 16 (64-bit) GPRs per process.  The current
implementation has 15 register sets (each set holds the GPRs plus other
process context).  A
process in each CPU called the register set manager tries to keep all register
sets filled with ready-to-run processes.  Context switches between processes
in register sets are done by the microcode in about 10 microseconds.  This
number, while true, isn't very significant, since the first thing an AM does 
when dispatched to handle an I/O completion message is a receive instruction, 
and that takes about another 60 microseconds.  You could speed this up without
affecting the context switch time, though.  Even in a machine without multiple
register sets, you could easily get a suitably fast context switch given
appropriate hardware and/or microcode support. Multiple register sets just
turned out to be the best way for our situation.

So Elxsi is toward the other end of the spectrum in the tradeoff between
lots of registers per process context and lots of process contexts.  But of course those
things are influenced by the kind of machine too.  You'd expect different
choices for a large, lots-of-users, multi-CPU machine implemented on 3 giant
ECL boards than for a single chip single-or-few-user machine.

Russell Williams
..!{sun|styx}!elxsi!rw

jay@voder.UUCP (02/27/87)

In article <14986@amdcad.UUCP>, bcase@amdcad.UUCP (Brian Case) writes:
> In article <7284@boring.mcvax.cwi.nl> jack@boring.UUCP (Jack Jansen) writes:
> >
> >Something that I would be very interested in is whether there
> >is any research done into machines with an unlimited register
> >file.
> 
> Close, but not quite, the C Machine Stack Cache.  See SIGARCH/SIGPLAN
> notices, vol. 17, no. 4, April 1982.

It seems like a waste to have a machine with unlimited registers.
At any given point there is only one set of registers active and the other
registers are just sitting there.  Instead, a better idea would be to
limit the number of windows to, say, two or three and do loads and stores
if you exceed the number of available windows.  It might be a worthwhile
experiment to find out how many windows are optimal for running UNIX.

Jay

mason@tmsoft.UUCP (02/28/87)

In article <7284@boring.mcvax.cwi.nl> jack@boring.UUCP (Jack Jansen) writes:
>What I have in mind is a stackish machine, with an intelligent
>stack cache that will delay writes around the stack pointer
>(since there's a very good chance that they will never have
>to be written at all).

I have thought a fair amount about this.  The problem for really high
performance is that, since all the activity takes place around the stack
pointer, it is difficult to get much pipelining in progress.  This means the cache
must be very fancy if you want to do Cray class machines, although RISC class
seems pretty easy.  I have toyed with this direction for a Ph.D. thesis.

>One of the advantages of this seems to be that compilers become
>simpler (no special cases for routines that run out of registers),

(But then all that fancy compiler technique the RISC people cite would be
a waste :-)
(But the compiler induced bug rate would certainly go down.)
-- 
	../Dave Mason,	TM Software Associates	(Compilers & System Consulting)
	..!{utzoo seismo!mnetor utcsri utgpu lsuc}!tmsoft!mason

markp@valid.UUCP (03/01/87)

> Register window machines such as the Berkeley RISCs have been discussed
> recently in this group.  I have some questions about these architectures:
> 
> (1) Has anyone tried making the window block size be just one register,
> i.e. the window can have an arbitrary alignment in relation to memory?
> I would expect this to be more efficient in terms of registers used (and
> therefore also memory), but register access time might suffer.
> 

I believe that the "Strategies for Managing the Register File in RISC"
paper, mentioned earlier, concluded that fully-programmable register window
size had very low marginal utility over a small variety of fixed sizes (i.e.
4/8/12 registers), and would probably be a lose considering the additional
complexity and critical path effects.  (I.e., you're better off making a
larger file with less control complexity.)

The other significant conclusion that I recall they made was that the
most effective overflow/underflow handling mechanism (averaged across
many applications) is to save exactly one context on demand (subroutine
call overflows), and to restore exactly one context on demand (subroutine
return underflows).
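
A toy C simulation of that policy (my own sketch of the idea, with
made-up names; not code from the paper): a circular file of fixed-size
windows, where a call that runs out of windows spills exactly one window
to memory, and a return that finds its caller's window gone refills
exactly one.

    #include <stdio.h>

    #define NWINDOWS  8
    #define WINSIZE  16
    #define MAXDEPTH 256

    static int windows[NWINDOWS][WINSIZE];  /* on-chip register file */
    static int backing[MAXDEPTH][WINSIZE];  /* spill area in memory */

    static int depth  = 0;     /* current call nesting depth */
    static int lowest = 0;     /* shallowest depth whose window is still on chip */
    static int spills = 0, fills = 0;

    static void call(void)
    {
        int i;
        depth++;
        if (depth - lowest >= NWINDOWS) {   /* overflow trap */
            for (i = 0; i < WINSIZE; i++)   /* save exactly one window */
                backing[lowest][i] = windows[lowest % NWINDOWS][i];
            lowest++;
            spills++;
        }
    }

    static void ret(void)
    {
        int i;
        depth--;
        if (depth < lowest) {               /* underflow trap */
            lowest--;
            for (i = 0; i < WINSIZE; i++)   /* restore exactly one window */
                windows[lowest % NWINDOWS][i] = backing[lowest][i];
            fills++;
        }
    }

    int main(void)
    {
        int i;
        for (i = 0; i < 20; i++)            /* dive 20 levels deep ... */
            call();
        for (i = 0; i < 20; i++)            /* ... and come all the way back */
            ret();
        printf("spills=%d fills=%d\n", spills, fills);
        return 0;
    }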

Overflow/underflow problems are correlated with the subroutine nesting
depth over time of an application.  Programs which go down and come back
up several levels with few repetitions at any given level (e.g. Ackermann's
function) tend to be pathological overflowers, while programs which spend
a great deal of time at or near a particular nesting depth practically never
overflow.  The point here is that sensible thinking about the calling pattern
of your program can dramatically improve performance in a register-window
machine.  This isn't really a very difficult concept, but people don't always
think about it.
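
For reference, Ackermann's function in C is just the standard textbook
definition below; its recursion depth rises and falls constantly, which
is exactly why, on a windowed machine, so many of its calls and returns
hit the overflow/underflow traps.

    #include <stdio.h>

    int ack(int m, int n)
    {
        if (m == 0) return n + 1;
        if (n == 0) return ack(m - 1, 1);
        return ack(m - 1, ack(m, n - 1));   /* depth oscillates wildly */
    }

    int main(void)
    {
        printf("ack(3,3) = %d\n", ack(3, 3));   /* small args, thousands of calls */
        return 0;
    }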

> (3) Are there machines with both dynamic (windowed) and static (normal)
> registers?  For instance it might be useful for quick O.S. interrupts to
> use only statics and not go to the trouble of setting up a register
> window stack (which would involve lots of reads and writes moving the
> old stack completely out to memory and then back again later).  Static
> registers could also be useful for non-local variables.

Absolutely.  A register window implementation generally maintains a set of
global registers which are accessible from any context (i.e. rM-rN are always
global registers).  Non-register window RISC machines must make a distinction
between globals, locals, and parameter registers in order to make the best
use of the register file, though the allocation is done by the smart
compiler/optimizer.  I'm sure John can speak more specifically.

	Mark Papamarcos
	Valid Logic
	{ihnp4,hplabs}!pesnta!valid!markp

markp@valid.UUCP (03/02/87)

Incidentally, another good reference in the area of register windows is:

"Analyzing Multiple Register Sets" by Hitchcock and Sprunt from CMU

Unfortunately, I neglected to write on my copy the publication from which
it came, but judging from "1985 IEEE," it's either the 1985 Architecture
conference or Transactions.

At any rate, the premise behind the paper is to decouple the effects of
multiple register sets and register windows from the architecture of the
machine.  To do this, the authors constructed simulators for RISC I, a 68000,
and a VAX and compared the number of memory references generated by
applications running with no multiple register sets, with multiple register
sets (no overlap), and with overlapped register windows.  The decrease in
memory references was as great as 2:1 for the VAX with windows as compared
to the normal VAX.  In one case (Ackermann's function, a pathological program
for any subroutine-call benchmark), register windows actually behaved WORSE
than a conventional register set due to the constant overflows and underflows.
The severity of this problem obviously depends on the number of windows
provided in the register file and the size of each window.  The reduction
in memory references for the 3 architectures varied from very little (25%)
to a factor of 7, depending on the application.  The RISC I fared particularly
poorly without its register windows.  The actual payoff is influenced by the
frequency of register variable access, overflow/underflow rate, and compiler
smartness.  In addition, some program constructs such as up-level references
in Pascal code can generate tremendous headaches for register-window
architectures.

Considering the results of this study, it is a rational conclusion that
register windows do not necessarily imply RISC, and should be considered
as a viable architectural parameter in any machine, subject to the usual
tradeoffs.  BTW- I could not find any indication in the paper of what the
relative sizes were of the normal and overlapped register files.  If indeed
(as I suspect) the overlapped register file was substantially larger than
the normal register file, then similar performance gains could probably be
attained through use of a large register file and a very good register
allocation scheme in the compiler/optimizer.  Actual performance comparison
of these 2 approaches with the same size register file is WAY beyond the
scope of this posting! (although I would LOVE to see some numbers and/or
references, with an indication of just how bad [in practice] the compiler
complexity issues are between the two approaches)

An earlier poster maintained that a primary purpose of register windows was
to allow operation without a cache.  This is not strictly true.  In fact,
register windows can function even better WITH a cache since they generate
block write and read requests on overflows and underflows.  Furthermore,
it is impractical to maintain a large enough register file to function as
an effective data cache.  In the interest of calling a certain class of
gardening tools "spades", register windows should be regarded as no more
and no less than they are: a clever architectural mechanism to decrease
the register-related overhead of subroutine call and return.

	Mark Papamarcos
	Valid Logic
	{ihnp4,hplabs}!pesnta!valid!markp

jjw@celerity.UUCP (03/04/87)

In article <4376@columbia.UUCP> eppstein@tom.columbia.edu (David Eppstein)
writes:
>Register window machines such as the Berkeley RISCs have been discussed
>recently in this group.  I have some questions about these architectures:

I can't answer questions 1 or 2, but:
>(3) Are there machines with both dynamic (windowed) and static (normal)
>registers?  For instance it might be useful for quick O.S. interrupts to
>use only statics and not go to the trouble of setting up a register
>window stack (which would involve lots of reads and writes moving the
>old stack completely out to memory and then back again later).  Static
>registers could also be useful for non-local variables.

The Celerity "Accel" (a.k.a. C1200, C1260) provides both types of registers.
This is because of the NCR-32 chip we use.  That chip has a RISC-like
instruction set (register to register operations, "off-line" fetch and
store, pipeline with delayed jumps, etc.) but provides a set of 16 static,
on-chip registers.  The chip also provides access to off-chip registers and
Celerity has configured 48 of those into a register stack (16 overlapped
with caller, 16 overlapped with callee, and 16 local).

Because the chip allows more and faster operations on the static registers
they are used whenever possible.  The stack registers are used mostly for
parameter passing and static register saving but they are also used for
local variables when all the static registers are allocated.

The system provides several sets (banks) of register stacks (number varies
by model) which are allocated to running processes to reduce memory saving
and restoring on process switch.  For fast OS operation the kernel just
switches to a separate bank allocated for this purpose.

Other off-chip registers have been allocated as privileged "kernel scratch
pad" registers.  These registers can be used to hold kernel information
about the executing process and are used to quickly save some of the static
registers.

						-- J. J. Whelan
						   Celerity Computing

mash@mips.UUCP (03/05/87)

In article <495@apple.UUCP> baum@apple.UUCP (Allen Baum) writes:
>Ditzel et al. have finally published details of the CRISP processor,
>which is a chip that implements a stack cache with 32 registers....
>.... They claim exceptionally high
>ratios on the stack cache, and performance of about 9x a VAX780
>running at 16Mhz (for some number of benchmarks including PCC--
>smaller benchmarks show 30x!). There are a number of other
>architectural innovations, most importantly a zero cycle branch.
>There are papers in both this week's Compcon and ISSCC proceedings.

Unfortunately, I wasn't able to catch either COMPCON, or Dave's later
CRISP talk at Stanford, but I do believe that was X a 750, not a 780
[BTL-Research usually compares that way since they have 750s!].  From
reports of the Stanford talk, I also heard (maybe somebody else can confirm):
	13X	(750) Dhrystone	(11,000? hard to know what their 750s do?)
	20X	wc
	30X	Ackermann's function
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

adb@alice.UUCP (03/10/87)

In article <115@tmsoft.UUCP>, mason@tmsoft.UUCP writes:
> >What I have in mind is a stackish machine, with an intelligent
> >stack cache that will delay writes around the stack pointer
> The problem for really high performance is that, since all the activity takes
> place around the stack pointer, it is difficult to get much pipelining in progress.  ...
For the CRISP microprocessor, we avoid the problem associated with stack
pointer writes by designing the architecture so that stack pointer writes are
hardly ever done -- only on procedure entry and exit.  The two instructions
that modify the SP are carefully managed so addresses are still calculated
properly, and the stack cache only flushes on procedure entry.
> >One of the advantages of this seems to be that compilers become
> >simpler (no special cases for routines that run out of registers),
> (But the compiler induced bug rate would certainly go down.)
We hope so.

	Alan Berenbaum, AT&T-Something,
	..!ihnp4!homxa!adb, ..!research!adb

adb@alice.UUCP (03/10/87)

In article <14@winchester.mips.UUCP>, mash@mips.UUCP writes:
> >Ditzel et al. have finally published details of the CRISP processor,
> >which is a chip that implements a stack cache with 32 registers....
> >.... They claim exceptionally high
> >ratios on the stack cache, and performance of about 9x a VAX780
> >running at 16Mhz (for some number of benchmarks including PCC--
> >smaller benchmarks show 30x!). There are a number of other
> I do believe that was X a 750, not a 780 ...
Sorry, John, it really was 9 times a 780, not a 750.  I suspected that posting
a slide with 750 numbers would confuse people.
> 	13X	(750) Dhrystone	(11,000? hard to know what their 750s do?)
The Dhrystone number we published was 13,560, and the 750 we used ran about
996.  The precise numbers for wc and Ackermann's are 19.5 and 33.6 times
a 750, respectively.

	Alan Berenbaum, AT&T-Something
	..!ihnp4!homxa!adb   ..!research!adb