[comp.arch] Register Windows

lindsay@k.gp.cs.cmu.edu.UUCP (10/03/87)

Register windows interest me.

Win#1:  less memory traffic to save/restore registers at procedure 
	call/returns. (Caches get some of this win.)
Win#2:  less memory traffic to build/access subroutine argument lists.
	(Caches get some of this win.) (Non-window machines can have
	much of this win by changing their calling convention.)
Win#3:  more likely that a compiler can "target" (compute a value in the
	place where it is to be used, thus saving "move" instructions).
Win#4:  hardware is dense and regular. (Caches are too, but are more complex.)
Lose#1: more state to be saved/restored at task switches. (Caches can
	be just declared invalid, or have tags.)
Lose#2: more hardware that can be on the ALU critical path. (Longer R bus,
	deeper R decode ). 
Lose#3: the extra chip area could have been doing something else, like,
	being a cache, which you wanted anyway.
Lose#4: Some programs (particularly Unix's "printf" ) want their parameter
	list to be at an address.

I'm fond of win#3, because only windows give it, and because compilers don't
need great subtlety to pick up the win. 

There are some ways to fix things up. For example, instead of a
call sliding by (say) 8 registers, slide by a program-supplied distance.
The average slide might then be (say) 3. This economy helps lose#1, and
changes lose#2 in very implementation-specific ways. Another fixup is to
"dribble" the state out to memory as a background activity. It's more
complex, but if it uses spare memory cycles, then task switches get a
"free" win.

What ways have been used around lose#4 ?
-- 
	Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

marc@ucla-cs.UUCP (10/03/87)

In article <1242@k.gp.cs.cmu.edu> lindsay@k.gp.cs.cmu.edu (Donald Lindsay) writes:
>Register windows interest me.
>
>Lose#1: more state to be saved/restored at task switches. (Caches can
>	be just declared invalid, or have tags.)
>Lose#2: more hardware that can be on the ALU critical path. (Longer R bus,
>	deeper R decode ). 

We designed a register file which almost eliminates this problem. By using
a "Shift-Register File", the data bus only goes through a single window
of registers. In our case it is 16 registers. Adding more windows to
the file does not lengthen the data bus.

>There are some ways to ways to fix things up. For example, instead of a
>call sliding  by (say) 8 registers, slide by a program-supplied distance.
>The average slide might then be (say) 3. This economy helps lose#1, and
>changes lose#2 in very implementation-specific ways. 

To solve some of this problem, we make use of variable-size windows.
Our window sizes vary in increments of 4 (8, 12, 16) so that they can
fit a procedure call/return appropriately. In this way very few registers
are wasted because of bigger-than-necessary windows. That is why we claim
that our file of 64 registers performs just as well as a normal
128-register file.

For more information you can read:
1) "A Reduced Register File for RISC Architectures", M. Huguet & T. Lang,
    Computer Architecture News (June 1985).
2) "VLSI Implementation of a Shift-Register File", M. Tremblay & T. Lang,
    20th Hawaii Int. Conf. on System Sciences, pp 112-121 (January 1987).

					Marc Tremblay
					marc@CS.UCLA.EDU
					...!(ihnp4,ucbvax)!ucla-cs!marc
					Computer Science Department, UCLA

guy%gorodish@Sun.COM (Guy Harris) (10/04/87)

> Lose#4: Some programs (particularly Unix's "printf" ) want their parameter
> 	list to be at an address.
> 
> What ways have been used around lose#4 ?

Our "printf" uses the "varargs" package, our "varargs.h" include file maps
"va_alist" into "__builtin_va_alist" on SPARC machines, and our SPARC compiler
recognizes the "__builtin_va_alist" construct and moves arguments from
registers to memory as needed.  Basically, if you want code using
"varargs"-like techniques to work on machines built around the SPARC chip,
you'd better not just use any "varargs"-like technique, you'd better use
"varargs".  (ANSI C contains a similar facility, and mandates its use in this
case.)

Note that this is a lose of any implementation where arguments are passed in
registers; it can either be supported by requiring this sort of code to be
written using higher-level primitives such as "varargs" rather than low-level
fiddling, or by having the compiler somehow detect that the low-level tricks
are being used (as I believe MIPSco's compiler for their non-register-window
machine does).
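
To make this concrete, here's a minimal sketch of what "use varargs"
means (the function and its terminate-on-zero convention are invented
for illustration):

        #include <varargs.h>

        /* sum a list of ints terminated by a zero argument */
        int
        sum(va_alist)
                va_dcl
        {
                va_list ap;
                int arg, total = 0;

                va_start(ap);
                while ((arg = va_arg(ap, int)) != 0)
                        total += arg;
                va_end(ap);
                return total;
        }

The callee walks the list with "va_arg" instead of assuming the
arguments sit at consecutive memory addresses, so the compiler is free
to dump register-resident arguments to memory behind the scenes, as
described above.
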
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com

srg@quick.COM (Spencer Garrett) (10/05/87)

In article <1242@k.gp.cs.cmu.edu>, lindsay@k.gp.cs.cmu.edu (Donald Lindsay) writes:
> Register windows interest me.
> 
> Win#3:  more likely that a compiler can "target" (compute a value in the
> 	place where it is to be used, thus saving "move" instructions).
> 
> I'm fond of win#3, because only windows give it, and because compilers don't
> need great subtlety to pick up the win. 

Unfortunately, it's not so simple.  When you have procedure calls in
argument lists you have to tuck away all the arguments you've already
computed while you call the next procedure.  With a stack-based arg list,
you just compute the args in reverse order and let the stack grow to
include each new one in turn.  If your compiler is good about how
it does the tucking, it's probably still a net win, but it's tricky to
do.  Pyramid, for instance, often ends up pushing registers onto the
stack to preserve their values across procedure calls, thus circumventing
one of the main advantages of their architecture.

bcase@apple.UUCP (Brian Case) (10/06/87)

In article <1242@k.gp.cs.cmu.edu> lindsay@k.gp.cs.cmu.edu (Donald Lindsay) writes:
>Register windows interest me.
>Lose#4: Some programs (particularly Unix's "printf" ) want their parameter
>	list to be at an address.
>
>What ways have been used around lose#4 ?

Well, I wouldn't say this is such a great way around it, but it does "work:"
In the compiler I implemented for internal use at AMD, if the address of
*any* argument was taken inside a procedure, the compiler assumed that at
least 16 arguments were passed to the procedure (the max. that the compiler
would place in registers), and it stored all 16 arguments in memory in such
a way that if more than 16 were passed, they would all line up real nice,
thus facilitating printf.  The method that I implemented required three
(count 'em, three) stacks.  It was a very quick solution to a somewhat
tricky problem, but there are lots of others.
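
To illustrate the trigger with an invented fragment: taking the address
of any parameter, as in the first line of the body below, is what makes
the compiler assume 16 arguments and store them all to memory, so that
the (non-portable) indexing in the second line still finds the next
argument where it expects it:

        int
        peek_next(a, b)
        int a, b;
        {
                int *p = &a;    /* address of an argument taken */
                return p[1];    /* assumes b sits just past a   */
        }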

BTW, you mentioned a coupla things about reg. windows that need elaboration:
Registers *are* a cache, so the tradeoff "the space taken by registers could
have been used for something else, like a cache" is kinda weird; you seemed
to be saying in one "win" that registers were more dense than cache, but I
don't know that this is really true:  caches are typically single-ported
memories while register files are at least two-ported (probably three-).
The tags necessary for traditional caches take some space, true, but I still
think caches, as a general class, are denser than register files.  Register
files have two or three times the bandwidth, *this* is the big win.

bcase@apple.UUCP (Brian Case) (10/07/87)

In article <131@quick.COM> srg@quick.COM (Spencer Garrett) writes:
>Unfortunately, it's not so simple.  When you have procedure calls in
>argument lists you have to tuck away all the arguments you've already
>computed while you call the next procedure.  With a stack-based arg list,
>you just compute the args in reverse order and let the stack grow to
>include each new one in turn.  If your compiler is good about how
>it does the tucking, it's probably still a net win, but it's tricky to
>do.  Pyramid, for instance, often ends up pushing registers onto the
>stack to preserve their values across procedure calls, thus circumventing
>one of the main advantages of their architecture.

Well, first of all, I doubt that the registers are pushed onto the stack
*often.*  I can see it happening in some cases, but this doesn't sound
like the common case.

Further, C doesn't specify the order of evaluation of arguments; I used
this to advantage in a RISC compiler that I wrote:  the
compiler analyzes the parse tree so that inner-most calls (if any,
typically there are none) are done first.  This is neither hard nor
time-consuming.
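
A tiny made-up example of what the reordering buys:

        extern int f(), g();

        int
        caller(a, b, c, d)
        int a, b, c, d;
        {
                /* evaluating g(d) first means (a + b) * c can be computed
                   directly into its outgoing-argument register and never
                   has to be tucked away across the inner call */
                return f((a + b) * c, g(d));
        }

Since C leaves the order of argument evaluation unspecified, the
compiler is free to do the inner call first.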

csg@pyramid.pyramid.com (Carl S. Gutekunst) (10/13/87)

Since Pyramid has the oddest architecture of any register window machine :-),
let me see if I can contribute some useful information. People with additional
questions can send me e-mail.

In article <1242@k.gp.cs.cmu.edu> lindsay@k.gp.cs.cmu.edu (Donald Lindsay) writes:
>Win#3:  more likely that a compiler can "target" (compute a value in the
>	place where it is to be used, thus saving "move" instructions).
>....
>I'm fond of win#3, because only windows give it, and because compilers don't
>need great subtlety to pick up the win. 

Actually, this doesn't win that much. While it gives me a warm fuzzy feeling
to see the compiler putting the calculations for

	printf("%d\n", (i+j)*47);

directly into the register that will be used to pass the argument, this only
saves at most one 100ns move instruction. (I know, I know, removing one 100ns
move instruction in the right place has a dramatic effect on Dhrystone. :-))
Or suppose, as is often the case, that I need the expression ((i+j)*47) more
than once. Then I want the compiler to identify the common expression, and
keep it in a local variable for copying when needed. I'm a lot more concerned
about the latter than the former, and the presence of either complicates the
implementation of the other. 

Pyramid's C compiler does catch both cases. (It also determines when recalcu-
lating the expression costs exactly the same as MOV'ing it, e.g. addition by
small integers.) But it's not exactly free.

>Lose#4: Some programs (particularly Unix's "printf" ) want their parameter
>	list to be at an address.

As pointed out nicely by Guy Harris and John Mashey, the real answer is to
use <varargs.h>. Anything that does variable length/type argument handling
in C without using <varargs.h> is going to have problems, especially now that
non-VAX-like architectures are becoming commonplace in the UNIX universe.

As it happens, though, you *can* take the address of a register on the Pyramid
90x. This was deemed necessary because of the huge volume of (non-portable) C
code that depends on it. (Note that MIPSco got around this problem by making
the compiler smart enough to detect it. And in case anyone got lost, the MIPS
processor does *not* use register windows. It was dragged into the discussion
as an example of a processor that does not use VAX-style parameter passage.
Sun's SPARC processor *does* use register windows.)

The Pyramid's register set is transparently mapped into the "Control Stack,"
a bottom-up stack in the virtual address space that is used for all function
calls. (Yes, there is also a Data Stack. It grows top-down.) Calls and returns
bump the Control Stack Pointer up and down 32*4, and the register window moves
along with it. When the CSP falls off the end of the register set (after
nesting function calls 16 levels deep), the bottom 32 registers are written
back to real memory, and mapped back up to the top. 
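
In C-ish terms, the overflow bookkeeping amounts to something like this
(a rough sketch of the idea as described above, not Pyramid's actual
logic; all the names are invented):

        #define WINDOW_REGS     32      /* registers slid per call      */
        #define NUM_WINDOWS     16      /* call depth resident on chip  */

        struct ctlstack {
                unsigned long   csp;            /* control stack ptr (bytes)  */
                unsigned long   resident_base;  /* lowest addr still in regs  */
        };

        void
        call_push(cs, memory, regfile)
        struct ctlstack *cs;
        unsigned long *memory, *regfile;
        {
                int i;

                cs->csp += WINDOW_REGS * 4;
                if (cs->csp - cs->resident_base > NUM_WINDOWS * WINDOW_REGS * 4) {
                        /* the oldest window falls off the register set:
                           write it back to the memory "behind" the
                           control stack and let it be reused */
                        for (i = 0; i < WINDOW_REGS; i++) {
                                unsigned long word = cs->resident_base / 4 + i;
                                memory[word] =
                                    regfile[word % (NUM_WINDOWS * WINDOW_REGS)];
                        }
                        cs->resident_base += WINDOW_REGS * 4;
                }
        }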

Writebacks and context switches *do* take time, but not as much as you might
expect; the chunk of memory "behind" the control stack is cached (just like
the rest of the memory), and the Pyramid has a 256-bit (32 byte) datapath to
the main store.

This does *not* mean that the Pyramid's <varargs.h> is as simple as that of the
PDP-11 or the VAX, since the size of the register window is finite, and there
are some objects you cannot put in a register, like structs and (in Pascal)
arrays. These go on the data stack. The Pyramid handles this in <varargs.h>;
nothing special is done in the compiler.

In article <131@quick.COM> srg@quick.COM (Spencer Garrett) writes:
>Pyramid, for instance, often ends up pushing registers onto the stack to
>preserve their values across procedure calls, thus circumventing one of the
>main advantages of their architecture.

The Pyramid *NEVER* saves registers on the data stack. Parameters are passed
on the data stack, but only when the call uses more than 13 arguments, or when
passing entire arrays or structures. And there are writebacks when you nest
functions more than 16 levels deep. But that's it.

To what are you referring?

In article <756@winchester.UUCP> mash@mips.UUCP (John Mashey) remarks:
>a) Most of the machines have 2 register sets: integer and FP....

The Pyramid has one register set. FP lives with everything else. One less
thing for <varargs.h> to worry about. :-) But John's comments still apply to
structs and arrays.

>If you're using high-powered global optimization technology (at least 3 of
>the 4 above are: I don't know about Pyramid, they may be also.)  then a lot
>of the information-gathering needed to generate good code for the argument
>passing falls out of technology.

Yes, Pyramid employs a relatively powerful global optimizer, although this
has little bearing on parameter passage. This is one aspect that I have often
thought of as an oversight (and perhaps the only oversight) in the MIPS R2000
architecture: parameter passage has been made very difficult for the compiler.
In the Pyramid it falls out cleanly. On the other hand, the MIPS compiler is
allowed to make trade-offs that the Pyramid's compiler is not.

<csg>

lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) (10/19/87)

It occurs to me that register-window machines tend to remember data which a
cache would (correctly) overwrite.

For example, assume that a program is half a dozen calls deep. What is the
average amount of execution before the bottom (oldest) window is uncovered ?

A reasonable answer: more than the average amount of execution between
system calls, or more than the average amount of execution between
interrupts, or more than the average amount of execution between task
switches, or all of the above.

If this dead weight is sent to memory more than once (by the inevitable
second system call, or second interrupt) then considerable traffic could
occur.  This leads to the idea that a register-window machine should spill
its elderly data to memory, perhaps at the first interrupt, perhaps at the
first system call, or perhaps by a "trickle" mechanism. It shouldn't recall
these values back from memory until some (much) later time - say, when their
slot is returned-to (or when the one above them is returned-to) (or when
return-from-interrupt notices that they are almost uncovered).

It would be interesting to see simulation results on this. There is an
excellent article in the September issue of (IEEE) Computer, by Flynn and
others, which reports simulation results on various other register models.
I consider the article flawed by its emphasis on counting memory traffic,
while ignoring the topic of syscalls and interrupts. If they would factor
these in, then the ideas presented above could be tested.

Yes, I know that interrupt frequency is system-dependent (and so on). 
My point remains. Our understanding of register issues still needs work.
-- 
	Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) (10/19/87)

I posted yesterday on the subject of minimizing the traffic required to save
and restore registers.

There are those who argue that interrupts and task switches are enormously
rare compared to subroutine calls and returns. They are right. They then
argue that the rare events can be penalized, if this makes the common events
run faster. (The RISC argument, if you will.)

I don't agree. There's the issue of "rare for which customer".  The HP
Spectrum addresses this by allowing differing resource mixes, so for
example, a customer who does a lot of multiplies can have a multiplier.

There's also the issue of "rare but critical". This brings me back to
register window state. If it's big, then interrupt latency and task-switch
latency are high. This affects more than just the upper bound on interrupt
rate. Interrupts may be announcing important events, and you want to get to
work. Aside from any costs due to slow response, there is the fact that a
slow chip just won't be as widely useful, or used. Also, machine features
tend to determine programming style. This is usually a pity. For example,
many device drivers are subroutine packages, rather than full processes,
simply because "faster" dominated "cleaner" at design time.

-- 
	Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

henry@utzoo.UUCP (Henry Spencer) (10/20/87)

> ...register window state. If it's big, then interrupt latency and task-switch
> latency are high...

Well, maybe.  If your interrupt handlers are big and complex -- all too
likely these days -- then they will be calling functions, and hence register
windows will speed them up too.  Register windows may end up being a net win
despite context-switch overhead.
-- 
PS/2: Yesterday's hardware today.    |  Henry Spencer @ U of Toronto Zoology
OS/2: Yesterday's software tomorrow. | {allegra,ihnp4,decvax,utai}!utzoo!henry

frazier@CS.UCLA.EDU (10/21/87)

In article <8801@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>                                 Register windows may end up being a net win
>despite context-switch overhead.
>-- 

There isn't that much of a context-switch overhead, either.  Saving
the register windows does not dominate the cost of a context switch.
Context switches require hundreds of instructions, while a pipelined
system should be able to save the register file in about 100-150
instructions.  This cost is more than offset by the savings incurred
in the calls between process swaps, even in a system which performs
the swaps fairly often.  In addition, smart systems only restore the
"top" window of a swapped-out process.  This prevents filling the
register file with windows which then have to be saved for further
calls, and incurs no (read little) extra cost if the process, once
active, proceeds to execute a series of returns and the windows have
to be restored after all.

$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$

Greg Frazier              Your feet are frozen to the floor...  -more-
CS dept., UCLA

Internet: frazier@CS.UCLA.EDU
    UUCP: ...!{ihnp4,ucbvax,sdcrdcf,trwspp,randvax,ism780}!ucla-cs!frazier

$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$

howard@cpocd2.UUCP (10/22/87)

In article <201@PT.CS.CMU.EDU> lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) writes:
>There are those who argue that interrupts and task switches are enormously
>rare compared to subroutine calls and returns. They are right. They then
>argue that the rare events can be penalized, if this makes the common events
>run faster. (The RISC argument, if you will.)

The argument is simply that this approach leads to the best average
performance.  Which it does.

The CISC argument is then that everything has to be optimized, even if
it's rare and optimizing it (locally) adversely affects overall cost and
performance.  Optimizing everything is the same as optimizing nothing.

>I don't agree. There's the issue of "rare for which customer".  The HP
>Spectrum addresses this by allowing differing resource mixes, so for
>example, a customer who does a lot of multiplies can have a multiplier.

"Which customer" is the one you based your studies on.  This gives you a
clean target and a clear metric for choosing between design variations.

Lack of a multiplier in the RISC-I was more due to lack of student design
time than anything else.  This has nothing to do with RISC.  Many RISC
machines have multipliers.  (E.g. MIPS-R2000)

>There's also the issue of "rare but critical". This brings me back to
>register window state. If it's big, then interrupt latency and task-switch
>latency are high.

Not necessarily.  Many interrupts can be handled almost like procedure
calls in some RISC architectures, so all you need to save is (possibly)
one register frame.  The "original" RISC architecture, the IBM 801, I
believe had a separate set of registers for interrupts (I'm possibly confusing
this with another machine, but the idea is certainly practical and provides
an easy counterexample to Donald's sweeping generalization).  Task switch is
harder to do well, agreed, but it occurs about 1/100th as often as call/return.

All these concerns are addressed in the early RISC papers.  Why bring them up
now, unless you haven't read those papers?

>This affects more than just the upper bound on interrupt
>rate. Interrupts may be announcing important events, and you want to get to
>work. Aside from any costs due to slow response, there is the fact that a
>slow chip just won't be as widely useful, or used.

You seem to have a very narrow idea of "slow".  Interrupt latency counts,
but program execution time doesn't?  Give me a break!

>Also, machine features
>tend to determine programming style. This is usually a pity. For example,
>many device drivers are subroutine packages, rather than full processes,
>simply because "faster" dominated "cleaner" at design time.

And many people write in assembler because they have atrocious compilers,
which are in part due to overly complicated architectures.  It is better
to program in HLL when possible, which supports the notion of hardware
designed for compilers instead of the other way round.  This is where
RISC began.

-- 
	Howard A. Landman
	{oliveb,hplabs}!intelca!mipos3!cpocd2!howard	<- works
	howard%cpocd2%sc.intel.com@RELAY.CS.NET		<- recently flaky
	howard%cpocd2.intel.com@RELAY.CS.NET		<- ??? try this

mwm@eris.BERKELEY.EDU (Mike (My watch has windows) Meyer) (10/22/87)

In article <8758@shemp.UCLA.EDU> frazier@CS.UCLA.EDU (Greg Frazier) writes:
<In article <8801@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
<>                                 Register windows may end up being a net win
<>despite context-switch overhead.
<>-- 

Every time I've looked at an article on register windows, I think of
something I refer to as "the Johnson stack machine." This machine
seemed to have all the advantages of register windows, and avoided
most of the disadvantages.

The basic idea was to not bother with either a data cache or a
register file. Instead, you used all that chip space for a
register-speed cache that is mapped onto memory. The mapping is tracked
by a base/length pair. It's a write-through cache.

The trick is that this is mapped onto the top of the stack. Pushing
something onto the stack involved incrementing the length counter,
possibly wrapping around in the file (my mind goes hazy on how that
worked - it may have actually required shuffling elements in the
cache).

But context switches were *cheap*. The save half involved nothing more
than making sure all writes had actually finished. Loading a new
context required reloading (tagging words as invalid, maybe?) the
file.
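
For concreteness, the bookkeeping I have in mind looks roughly like
this (a sketch only, with invented names; the wrap-around is exactly
the part I'm hazy on):

        #define CACHE_WORDS     64      /* register-speed words on chip */

        struct stackcache {
                unsigned long   words[CACHE_WORDS];
                unsigned long   base;   /* word addr of oldest cached word */
                unsigned long   length; /* words currently mapped          */
        };

        void
        push(sc, memory, value)
        struct stackcache *sc;
        unsigned long *memory, value;
        {
                unsigned long top = sc->base + sc->length;

                memory[top] = value;            /* write-through        */
                sc->words[top % CACHE_WORDS] = value;
                if (sc->length < CACHE_WORDS)
                        sc->length++;
                else
                        sc->base++;     /* oldest word drops out; memory
                                           already has it, nothing to save */
        }

A context save then just waits for outstanding writes; a restore can
start with length = 0 and let the cache refill on demand.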

I have good reason to believe that AT&T was working on hardware that
used this model. Does anyone know what became of that?

Better yet, can someone who actually understands hardware explain why
I've not seen anything like the above in the real world?

	Thanx,
	<mike


--
I'm gonna lasso you with my rubberband lazer,		Mike Meyer
Pull you closer to me, and look right to the moon.	mwm@berkeley.edu
Ride side by side when worlds collide,			ucbvax!mwm
And slip into the Martian tide.				mwm@ucbjade.BITNET

lamaster@ames.arpa (Hugh LaMaster) (10/22/87)

This may have come up before- if so, I missed it.  The Sept. 87 issue
of IEEE Computer has an article, one of whose authors is The Michael
Flynn, which examines certain aspects of RISC architectures.  The
simulations described are using rather simple benchmarks, so it would
be dangerous to generalize too much from the article, but register
windows don't seem to be a win even where you'd expect them to do best:
the benchmarks were medium sized Pascal programs with lots of procedure
calls.  (Another part of the article is an examination of tradeoffs
between compact instructions to reduce instruction traffic, and
simpler instructions + instruction cache using the same chip area...)

Do any of the register window advocates have similar studies which
demonstrate a WIN for register windows?

tim@amdcad.AMD.COM (Tim Olson) (10/22/87)

In article <933@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes:

	... good stuff about register windows deleted ...

|an easy counterexample to Donald's sweeping generalization).  Task switch is
|harder to do well, agreed, but it occurs about 1/100th as often as call/return.

Actually, we measure more like .5% - 1% procedure calls ( == 1% - 2%
call+return). Assuming 40ns cycles and ~100 - 200 context switches/sec, then
context switches are only 1/1000 to 1/4000 as often as call/return.
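For the record, the arithmetic behind that (assuming roughly one
instruction per 40ns cycle): 25,000,000 instructions/sec at 1%
call+return is about 250,000 calls+returns per second, against
100 - 200 context switches per second -- a ratio of well over 1000:1.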

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)

 

baum@apple.UUCP (Allen J. Baum) (10/22/87)

--------
[]
>In article <5567@jade.BERKELEY.EDU> mwm@eris.BERKELEY.EDU (Mike (My watch has windows) Meyer) writes:
>Every time I've looked at an article on register windows, I think of
>something I refer to as "the Johnson stack machine."
>I have good reason to believe that AT&T was working on hardware that
>used this model. Does anyone know what became of that?
>
>Better yet, can someone who actually understands hardware explain why
>I've not seen anything like the above in the real world?

Check out the ATT CRISP microprocessor (see ASPLOS II, or Feb.87 Compcon
proceedings). That's what they are using.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

hansen@mips.UUCP (Craig Hansen) (10/22/87)

In article <933@cpocd2.UUCP>, howard@cpocd2.UUCP (Howard A. Landman) writes:
> In article <201@PT.CS.CMU.EDU> lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) writes:
> >There are those who argue that interrupts and task switches are enormously
> >rare compared to subroutine calls and returns. They are right. They then
> >argue that the rare events can be penalized, if this makes the common events
> >run faster. (The RISC argument, if you will.)
> 
> The argument is simply that this approach leads to the best average
> performance.  Which it does.

Sigh. As always, RISC means many different things to different people.  "The
RISC argument" leads to varying conclusions, depending on what you're
focusing on, and what your design team can and can't do easily. I've worked
on three RISC architectures in different environments, and they each ended
up quite unique. The MIPS team has the strongest software and measurement
skills of the three designs, which let us delegate a lot more to software,
and quickly see where design trade-offs were taking us.

Unfortunately, comparing interrupt and task switch rates to subroutine call
and return rates is only part of the story. The register window approach
trades hardware resources against optimizing software. The MIPS-R2000, which
doesn't have hardware register windows, relies on an intelligent use of a
flat, uniform, register set, implemented by optimizing compilers.  As has
been described in this forum previously, procedure arguments and return
values are passed in registers. In addition, the compiler system identifies
leaf-level routines, and allocates variables into registers, starting at
these leaf-level routines, working its way up the call tree. When registers
must be spilled to memory, the optimizer selects appropriate locations for
the register saving and restoring code (moving up the call tree and outside
loops, etc.). The end result is a healthy reduction in the amount of
register saving and restoring at procedure boundaries. The register window
approach opts for simpler software (a good idea if you don't have software),
but spends more die area, and ultimately, cycle speed, on register file
accesses and register file spilling hardware.

Ultimately, the selection of an architecture is the result of several
inter-related design trade-offs. While it is not an absolute requirement of
register windows, let me observe that every one of these machines built to
date takes multiple cycles to perform canonical load and store operations.
There's no question that register windows can, in some cases, reduce load
and store frequencies, but to really pay their cost, they have to reduce
load and store frequencies sufficiently to offset the higher cost of load
and store operations on these machines. I've not been on the design team of
a register windowed machine, but it seems that the H/W designers might have
assumed that the register windows were going to eliminate so many memory
references that they didn't spend the time they should have on making loads
and stores go fast.

If you are concerned about UNIX kernel performance, it would be worthwhile
to note that the MIPS-R2000 instruction set and compiler system are designed
to permit efficient references to data contained in structures (a reference
to a structure-element through a pointer is a single-cycle operation, but is
two-to-three cycles at minimum for the SPARC and Am29k register window-based
machines). Now what's the relative frequency of structure references in a
UNIX kernel? (Hint: Dhrystone will give you the wrong answer. I'm not trying
to be glib here; Dhrystone is a fine benchmark for comparing some things
[uncached machines without optimizing compilers, x86 compilers for Personal
Computers], but the characteristics of data references, particularly the use
of "addressing modes" and data dependencies, aren't representative of
anything.)
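
Concretely, the pattern in question is something like this (an invented
fragment):

        struct buf {
                int     b_flags;
                int     b_count;
        };

        int
        getcount(bp)
        struct buf *bp;
        {
                return bp->b_count;     /* one base+offset load on a machine
                                           with an offset addressing mode; an
                                           add followed by a load without one */
        }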

If you've got some dusty FORTRAN card decks that you've been using on some
itty-bitty-mainframe computer, you should note that procedure calls aren't
all that frequent, compared to Dhrystone. Having an optimizer that can put a
couple of dozen variables (not necessarily local variables) into registers
efficiently, and make fast references to global variables (FORTRAN is
notorious for its use of variables stuffed into common blocks) is a big win
here. The MIPS compiler system puts aside one register to point to a region
of the global data space, where one single-cycle instruction can get to its
value.  Again, here, the speed of basic load and store operations will
matter more than whether the registers are windowed.

-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen

aglew@ccvaxa.UUCP (10/24/87)

>>Henry Spencer:
>>Register windows may end up being a net win despite context-switch overhead.
>
>There isn't that much of a context-switch overhead, either.  Saving
>the register windows does not dominate the cost of a
>context-switch.  Context switches require hundreds of instructions,
>while a pipelined system should be able to save the register file
>in about 100-150 instructions.
>
>Greg Frazier

Why do context switches require hundreds of instructions? On Real-Time OSes
a context switch can require much less than a hundred instructions - and
that's without a CISCy dispatch instruction. Even on UNIX a context switch
can be made to occur in not much more time than a syscall (which is another
story).

The 'hidden' loss due to cache flushing, etc., can cost a lot, but on real
memory systems that shouldn't bother you.

Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms.arpa

I always felt that disclaimers were silly and affected, but there are people
who let themselves be affected by silly things, so: my opinions are my own,
and not the opinions of my employer, or any other organisation with which I am
affiliated. I indicate my employer only so that other people may account for
any possible bias I may have towards my employer's products or systems.

tim@amdcad.AMD.COM (Tim Olson) (10/24/87)

In article <821@mips.UUCP>, hansen@mips.UUCP (Craig Hansen) writes:
+-----
| There's no question that register windows can, in some cases, reduce load
| and store frequencies, but to really pay their cost, they have to reduce
| load and store frequencies sufficiently to offset the higher cost of load
| and store operations on these machines. I've not been on the design team of
| a register windowed machine, but it seems that the H/W designers might have
| assumed that the register windows were going to eliminate so many memory
| references that they didn't spend the time they should have on making loads
| and stores go fast.
+-----

Yes, the register windows eliminate *many* memory references (although
fast loads/stores are still a high priority).  For example, MIPS
published a paper at the recent ASPLOS II convention showing dynamic
load/store percentages, as well as the percent of load/store
instructions that required a non-zero offset calculation:

			nroff		asl
     load/store %	  28%		 30%
     non-zero offset%	  88%		 80%

Contrast this to the statistics gathered on the Am29000 simulator
(register-windowed machine):

			nroff		asm29k
     load/store %	  16%		 16%
     non-zero offset%	  9.0%		 9.2%

Now, since MIPS didn't publish the input they used on these programs to
derive their numbers, we obviously cannot perform a direct comparison.
We assume, however, that the input wasn't "specialized" in any way.
One possible explanation for this is that non-register-windowed machines
have many more loads/stores to the stack during procedure call
entry/exit, resulting in higher load/store percentages as well as higher
utilization of the base+offset addressing mode.

The main point to realize here is that a series of trade-offs are made
in any processor design, and what might be a big win for one processor
may have minimal impact on another.  These individual tradeoffs cannot
be taken as "absolutes" without examining their relationship with the
rest of the architecture.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)
	

hansen@mips.UUCP (Craig Hansen) (10/25/87)

In article <18843@amdcad.AMD.COM>, tim@amdcad.AMD.COM (Tim Olson) writes:
> [a comparison of load/store frequencies and non-zero offset frequencies]

> Now, since MIPS didn't publish the input they used on these programs to
> derive their numbers, we obviously cannot perform a direct comparison.

Even if the input were the same, because the instruction count isn't
necessarily the same for these two machines, the comparison of load/store
frequencies isn't really relevant.  In addition, as1 and am29k aren't even
the same program! (There are several variants of nroff around, too.)

> We assume, however, that the input wasn't "specialized" in any way.
> One possible explanation for this is that non-register-windowed machines
> have many more loads/stores to the stack during procedure call
> entry/exit, resulting in higher load/store percentages as well as higher
> utilization of the base+offset addressing mode.

Yes, a possible explanation, but wrong. When compiling nroff (a version
released as part of UMIPS-BSD), using our inter-procedural register
allocator, and an input file of 700 lines, 4670 words, and 30530 characters,
we get the following dynamic statistics:

26.4% loads+stores
52.1 instructions per call
Average registers saved per call: 1.6
Register save+restore: 23.3% of loads+stores (7.0% of instructions)
    (This percentage includes variables which must reside in memory
     to handle call by address parameters.)

If you eliminated all these register save/restores, load+store frequency
would be 20.9% (remember that the total instruction count changes too).

Loads/stores to variables through global pointer register: 15.2% (Because of
use of the global pointer register, these references are single instructions,
all of which have non-zero offset values.) I would suspect that the Am29000
statistics do not count these as non-zero offset values, though they are
more than half of all the references.

Other load/stores: 4.2% (Zero-valued offsets are frequent here within this
sub-class, but are only about 3% of total instruction references.)

> The main point to realize here is that a series of trade-offs are made
> in any processor design, and what might be a big win for one processor
> may have minimal impact on another.  These individual tradeoffs cannot
> be taken as "absolutes" without examining their relationship with the
> rest of the architecture.

Amen. Tim should also be careful, though, to realize that the tools and
compilers used to measure the architecture also influence the results.
Unless you set out to design both the architecture and the compiler
concurrently, you won't be able to make good trade-offs between them. It
should also be noted, again, that the selection of programs used as
benchmarks will influence the results too. Making your architectural
trade-offs on the basis of Dhrystone, or even nroff, which is not very
representative of more modern C code, isn't very smart.

I dare say, though, that trading the MIPS architecture for the MIPS
architecture + windowed-registers - displacements - single-cycle
load/stores, would be a loss.

Regards,
-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen

tim@amdcad.AMD.COM (Tim Olson) (10/26/87)

In article <833@mips.UUCP> hansen@mips.UUCP (Craig Hansen) writes:

| Loads/stores to variables through global pointer register: 15.2% (Because of
| use of the global pointer register, these references are single instructions,
| all of which have non-zero offset values.) I would suspect that the Am29000
| statistics do not count these as non-zero offset values, though they are
| more than half of all the references.


You are correct; these were not counted.  The statistics were gathered
by looking (at "runtime") for explicit address calculations of load
addresses.  I don't really consider the use of the global pointer
register to be a "non-zero offset" addressing mode, however.  Because
the lower 16 bits of the global pointer register are zero, the offset is
really just a concatenation -- no "add" is performed.

| Tim should also be careful, though, to realize that the tools and
| compilers used to measure the architecture also influence the results.

Certainly.

| Unless you set out to design both the architecture and the compiler
| concurrently, you won't be able to make good trade-offs between them.

THE compiler?  Which one?  C? Pascal? FORTRAN? Ada? Common LISP? Smalltalk?
Obviously you must design the architecture with a great deal of thought
put into how the resources are to be used by software and what can be
easily/cheaply performed in software vs hardware, but I disagree with
your statement.  MIPS certainly didn't write an entire Ada compiler
before starting architectural design, but that in no way decreases the
viability of Ada running on a MIPS machine.

| It should also be noted, again, that the selection of programs used as
| benchmarks will influence the results too. Making your architectural
| trade-offs on the basis of Dhrystone, or even nroff, which is not very
| representative of more modern C code, isn't very smart.

Making architectural decisions based on *any* single program isn't very
smart.  You should examine a large body of code, looking at older,
heavily-used programs as well as more "modern" code (output of C++
compilers, object-oriented programming).

Discounting nroff is kind of silly, however; it happens to be a large,
heavily-used utility.  MIPS used it, along with other significant
UNIX utilities, in early papers on the R2000 [Rowen, et al, VLSI Systems
Design, March 1986], and continues to reference it in John Mashey's
performance briefs.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)
	

earl@mips.UUCP (Earl Killian) (10/26/87)

In article <18855@amdcad.AMD.COM>, tim@amdcad.AMD.COM (Tim Olson) writes:
> In article <833@mips.UUCP> hansen@mips.UUCP (Craig Hansen) writes:
> | It should also be noted, again, that the selection of programs used as
> | benchmarks will influence the results too. Making your architectural
> | trade-offs on the basis of Dhrystone, or even nroff, which is not very
> | representative of more modern C code, isn't very smart.
> Making architectural decisions based on *any* single program isn't very
> smart.  You should examine a large body of code, looking at older,
> heavily-used programs as well as more "modern" code (output of C++
> compilers, object-oriented programming).

It sounds like everyone's in agreement, and yet, so far the discussion
has talked about one program!  What's interesting about programs is
that some statistics are fairly consistent and some vary all over the
place.  To show the variance of the statistics relevant to this
discussion, consider a wide range of programs (statistics for the
MIPSco architecture):
						  non-sp/gp  non-sp/gp
			  sp-based  reg  gp-based  0-offset non-0 offset
	    loads  stores  ld/st   ld/st   ld/st    ld/st     ld/st
            -----   -----  -----   -----   -----    -----     -----
espresso    19.6%    1.1%   0.1%    0.1%    1.3%    18.7%      0.4%
spice       26.9%   16.3%   7.2%    2.8%    4.2%     1.5%     27.5%
wolf        25.3%    8.2%   7.1%    1.9%    3.6%     8.0%     12.9%
yacc        15.7%    2.1%   0.9%    0.5%    2.5%    12.4%      1.5%
diff        16.2%    3.2%   0.5%    0.7%    4.6%     7.2%      6.4%
compress    18.3%   10.6%   0.1%    3.5%    8.1%     7.8%      9.4%
uopt        21.8%    8.4%   5.6%    5.2%    1.2%     6.8%     11.4%
as1         18.3%   11.2%   4.4%    6.8%    3.7%     3.8%     10.8%
nroff       18.8%    8.6%   0.4%    7.7%   14.5%     3.1%      1.7%
tex         21.9%   13.8%   3.6%    9.2%   10.8%     5.1%      7.0%
ccom        18.7%   12.2%   3.4%   11.9%    3.8%     5.0%      6.9%
doduc       29.4%   10.2%  10.1%    4.1%   12.3%     1.5%     11.6%

The sum of the last 5 columns should be equal to the sum of the first
2.  In other words the last 5 are a partition of the loads and stores
into sp-based, register saves and restores at procedure entry/exit,
gp-based (that is small static variables addressed by a dedicated
register, which always have a nonzero offset on the MIPSco machine),
other load/stores with zero offset, and other load/stores with nonzero
offset.  Everything is a percentage of the total instructions
executed.

My conclusion: be careful of drawing conclusions from a small number
of data points.  For example, when deciding whether to have an offset
for load/store, looking at just espresso (0.4% nonzero) or just spice
(27.5% nonzero) would lead to very different results.

Caveat: these statistics are from one architecture / compiler.  Your
actual mileage may vary.  In particular, I would bet the 29k compilers
probably try hard to avoid needing offsets.  To the extent that these
techniques succeed, it would reduce the 27.5% slowdown you'd expect in
spice.  However, I think these statistics do give some indication of
the desirability of having a load/store offset.

daveb@geac.UUCP (10/26/87)

In article <5567@jade.BERKELEY.EDU> mwm@eris.BERKELEY.EDU 
	(Mike (My watch has windows) Meyer) writes:
>Every time I've looked at an article on register windows, I think of
>something I refer to as "the Johnson stack machine."
>I have good reason to believe that AT&T was working on hardware that
>used this model. Does anyone know what became of that?

  Well, ICL (in Britain) was working with stack-top caches on the
large 2900's some years ago (10!).  If anyone from Blighty would
care to comment on what happened subsequently, I'll refrain from
blurting out what was then the hot technique in the internal
rumor-mill.  (Can you say "half-cache"?)


 --dave (walking security breach) c-b
-- 
 David Collier-Brown.                 {mnetor|yetti|utgpu}!geac!daveb
 Geac Computers International Inc.,   |  Computer Science loses its
 350 Steelcase Road,Markham, Ontario, |  memory (if not its mind)
 CANADA, L3R 1B3 (416) 475-0525 x3279 |  every 6 months.

bcase@apple.UUCP (Brian Case) (10/26/87)

In article <18843@amdcad.AMD.COM> tim@amdcad.AMD.COM (Tim Olson) writes:
> [[MIPS numbers]]
>			nroff		asl
>     load/store %	  28%		 30%
>     non-zero offset%	  88%		 80%
>
>Contrast this to the statistics gathered on the Am29000 simulator
>(register-windowed machine):
>
>			nroff		asm29k
>     load/store %	  16%		 16%
>     non-zero offset%	  9.0%		 9.2%
>
>Now, since MIPS didn't publish the input they used on these programs to
>derive their numbers, we obviously cannot perform a direct comparison.

Just out of curiosity, are these numbers static or dynamic?  I would
hope that they both represent dynamic measurements.

Another thing to notice is that the load/store operations incurred by
stack cache over-/under-flows are guaranteed to be serial in nature;
that is, they are going to be to/from sequential addresses and are thus
amenable to load-/store-multiple (i.e. burst-mode) operations.  It may
be that a majority of load/store operations in non-stack-cache machines
are also around procedure calls and would also be amenable to burst-mode
transfers.  However, since over-/under-flows are rare events in stack-
cache machines (at least on the Am29000), the effect may not be significant.
Do you MIPS Co./AMD guys have any data on sequential/non-sequential
load/store behavior?

tim@amdcad.AMD.COM (Tim Olson) (10/27/87)

In article <6554@apple.UUCP> bcase@apple.UUCP (Brian Case) writes:
| In article <18843@amdcad.AMD.COM> tim@amdcad.AMD.COM (Tim Olson) writes:
| >Contrast this to the statistics gathered on the Am29000 simulator
| >(register-windowed machine):
| >
| >			nroff		asm29k
| >     load/store %	  16%		 16%
| >     non-zero offset%	  9.0%		 9.2%
| >
| Just out of curiosity, are these numbers static or dynamic?  I would
| hope that they both represent dynamic measurements.

Yes, they are dynamic measurements.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)
	

henry@utzoo.UUCP (Henry Spencer) (10/28/87)

> ...you used all that chip space for a
> register-speed cache that is mapped onto memory. The mapping is tracked
> by a base/length pair. It's a write-through cache.
> The trick is that this is mapped onto the top of the stack...

How much this wins depends on details.  If the stuff is addressed as normal
memory -- remember that optimizing compilers often want to use their
temporaries in a non-LIFO pattern, so making "push" and "pop" special cases
isn't necessarily a win -- then you pay the various penalties involved in
needing memory addresses for all your temporaries.  If the cache pretends
to be registers, then what you have is register windows with a lot more
writes to memory, some small fraction of which incidentally make it
unnecessary to do writes at context-switch time.  This is probably not a
win unless you are big on interrupt latency at the expense of throughput,
because you *are* doing a lot more memory accesses, although they are less
concentrated.

It's an interesting idea that might work well if done right, but it's *not*
"all of the advantages of register windows with none of the disadvantages".
It has some of the advantages and has disadvantages of its own.
-- 
PS/2: Yesterday's hardware today.    |  Henry Spencer @ U of Toronto Zoology
OS/2: Yesterday's software tomorrow. | {allegra,ihnp4,decvax,utai}!utzoo!henry

aglew@ccvaxa.UUCP (10/28/87)

..> Tim Olson posts comparisons of load/stores, and load/stores
..> with non-zero offsets, for MIPS (lots of indexing) and AMD29K (little).

How do you count load/stores with non-zero offsets for the AMD29000,
seeing as it has no such addressing mode? Was this for a design
alternative, that did have indexing? Or do you count by looking at the
number of loads that have an add immediately preceding to the register
used in the address of the load/store? Variations of this last method
would seem to me to be very susceptible to masking by code rearrangement
in the optimizer.

A while back I posted static counts of address constants on VAX 4.2 BSD.
As I remember it, about 40% were for locals (stack offsets), but 20% were
for pointer+field_offset. How do you optimize those away?

tim@amdcad.UUCP (10/29/87)

In article <28200058@ccvaxa> aglew@ccvaxa.UUCP writes:

| How do you count load/stores with non-zero offsets for the AMD29000,
| seeing as it has no such addressing mode? Was this for a design
| alternative, that did have indexing? Or do you count by looking at the
| number of loads that have an add immediately preceding to the register
| used in the address of the load/store? Variations of this last method
| would seem to me to be very susceptible to masking by code rearrangement
| in the optimizer.

Yes, we counted the "offset" loads and stores by looking for an add
(dynamic, at runtime) preceding the load or store which computes into
the load/store address register.

You are correct that this might be susceptible to code rearrangement --
however, our internal compiler doesn't schedule loads, and the only
other code motion is due to loop-invariance [not a problem here] and
filling branch delay slots [the add, load/store will still be sequential
at runtime].
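
In other words, the check was essentially the following (a sketch with
invented names, not our actual simulator code):

        #define OP_ADD          1
        #define OP_LOAD         2
        #define OP_STORE        3

        struct insn {
                int     opcode;
                int     dest_reg;       /* result register of an add        */
                int     addr_reg;       /* address register of a load/store */
        };

        /* count a load/store as "offset" if the dynamically preceding
           instruction was an add computing into its address register */
        int
        is_offset_access(prev, cur)
        struct insn *prev, *cur;
        {
                if (cur->opcode != OP_LOAD && cur->opcode != OP_STORE)
                        return 0;
                return prev != 0 && prev->opcode == OP_ADD
                    && prev->dest_reg == cur->addr_reg;
        }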

| A while back I posted static counts of address constants on VAX 4.2 BSD.
| As I remember it, about 40% were for locals (stack offsets), but 20% were
| for pointer+field_offset. How do you optimize those away?

Perhaps there is a big difference in static vs dynamic statistics?

	-- tim
	

bcase@apple.UUCP (Brian Case) (10/29/87)

In article <28200058@ccvaxa> aglew@ccvaxa.UUCP writes:
>
>..> Tim Olson posts comparisons of load/stores, and load/stores
>..> with non-zero offsets, for MIPS (lots of indexing) and AMD29K (little).
>
>How do you count load/stores with non-zero offsets for the AMD29000,
>seeing as it has no such addressing mode? Was this for a design
>alternative, that did have indexing? Or do you count by looking at the
>number of loads that have an add immediately preceding to the register
>used in the address of the load/store? Variations of this last method
>would seem to me to be very susceptible to masking by code rearrangement
>in the optimizer.
>
>A while back I posted static counts of address constants on VAX 4.2 BSD.
>As I remember it, about 40% were for locals (stack offsets), but 20% were
>for pointer+field_offset. How do you optimize those away?

Well, I can tell you that, at least for the original 29000, an offset
addressing mode was NEVER a design alternative.  I would have had a
tantrum.  Anyway, when Tim told me that he was collecting these statistics,
one of my first thoughts was:  "But wait:  optimization will move some
of the add instructions away from the object load or store instruction.
That will skew the results!"  Upon further thought, I realized that if
optimization can move the add instruction away (out of a loop or to fill
a branch delay or load/store delay slot), then it *shouldn't* be counted
anyway.  The more of these that happen, the more sense it makes not to
have the offset addressing mode.  In other words, let's count only those
offset additions that would be *profitably* integrated with the load or
store instruction itself.  Even though the numbers Tim collected are low,
I believe that a real optimizing compiler (coming soon I hear) will
produce even lower (if only marginally lower) numbers.  My compiler does
not do load/store latency scheduling; if it did, more opportunities to
overlap the offset add would arise.  In retrospect, large performance
increases could have been realized, especially for the 29000 with VDRAM
memory configurations (this has been proven by hand coding; sigh, one
has only so much time...).

I'll reiterate:  you just can't say things like:  "processors need to
have x, no matter what the rest of the architecture is like."  The offset
addressing mode has its own costs.  From the statistics collected from the
output of my (not-wonderfully-optimizing) compiler, one is led to believe
that an offset addressing mode would have little positive impact on the
29000.  Further, it would only add another delay in the memory addressing
path.

I was very glad to see the numbers you collected from studying the VAX.
Were those dynamic measurements?

bcase@apple.UUCP (Brian Case) (10/29/87)

In article <18903@amdcad.AMD.COM> tim@amdcad.UUCP (Tim Olson) writes:
>Yes, we counted the "offset" loads and stores by looking for an add
>(dynamic, at runtime) preceding the load or store which computes into
>the load/store address register.
>
>You are correct that this might be susceptible to code rearrangement --
>however, our internal compiler doesn't schedule loads, and the only
>other code motion is due to loop-invariance [not a problem here] and
>filling branch delay slots [the add, load/store will still be sequential
>at runtime].

Well, it is true that load/store scheduling isn't done, but the offset
computation instruction *will* be factored out of a loop when its source
operands are loop invariant; I have seen the compiler do this.  And then,
if this instruction is the only one in the loop header, and there is only
a branch to the loop header, the offset add will get moved even farther
away.  Anybody wanna write a debugger?  :-)
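
A trivial (made-up) example of the case I mean:

        struct stats {
                int     count;
        };

        int
        tally(sp, bufp, n)
        struct stats *sp;
        int *bufp, n;
        {
                while (n-- > 0)
                        sp->count += *bufp++;   /* &sp->count is loop-invariant;
                                                   the add that forms it gets
                                                   hoisted out of the loop */
                return sp->count;
        }

Here the add is paid for once per loop, so counting it against every
load and store inside the loop would overstate the value of an offset
addressing mode.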


bcase@apple.UUCP (Brian Case) (10/29/87)

In article <6571@apple.UUCP> bcase@apple.UUCP (Brian Case) writes:
>In article <28200058@ccvaxa> aglew@ccvaxa.UUCP writes:
>>A while back I posted static counts of address constants on VAX 4.2 BSD.
                        ******
>>As I remember it, about 40% were for locals (stack offsets), but 20% were
>>for pointer+field_offset. How do you optimize those away?
>
>I was very glad to see the numbers you collected from studying the VAX.
>Were those dynamic measurements?
            *******

Oops, sorry for the stupid question.

aglew@ccvaxa.UUCP (10/31/87)

..> Procedure call vs. interrupt frequency.

Something else to point out: globally optimizing compilers can do inline
expansion of much code, reducing procedure call overhead and frequency.
This cannot be done to interrupts.

Of course, there are a whole slew of tradeoffs here. Eg. reduced call
overhead vs. cache misses, if the function is called more than once.
Code size. Inlining militates against shared libraries.


Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms.arpa

I always felt that disclaimers were silly and affected, but there are people
who let themselves be affected by silly things, so: my opinions are my own,
and not the opinions of my employer, or any other organisation with which I am
affiliated. I indicate my employer only so that other people may account for
any possible bias I may have towards my employer's products or systems.

aglew@ccvaxa.UUCP (10/31/87)

..> Address constants, register windows, and indexed addressing modes
..> vs. AMD's non-indexed addressing for the AMD29000.

Addressing modes have fascinated me for a while, and I must admit
that I am overjoyed by AMD's indications that non-indexing is not too
painful. Now, would no register windows + non-indexing be useful?

Something else to consider here is a non-virtual machine. Remember
base registers? Base registers were originally introduced so that 
programs could be relocated, before virtual memory - on relocation,
just update the base registers, plus the pool of address constants that
the base registers are loaded from (ever wonder why PL/1 had a BASED
attribute?). This requires that the base register + rest of address
addition be atomic, or at least that no relocation may occur between them.
It is possible to do this without base register addressing modes, but 
that handicaps optimizing compilers.

Usually, base register addressing is:
	BASE + REGISTER + OFFSET
since people want to have the REGISTER OFFSET addressing mode.
Three input adders, ugh.

But if we can get away without indexing, you just have
     	BASE + LITERAL
	BASE + REGISTER
And, as I am fond of pointing out, you can use OR instead of ADD.
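
(Why OR works: keep the base register aligned so its low-order bits are
zero, and keep the offset small enough to fit in those bits; then the
two quantities never carry into each other. For example:

        base   = 0x00430000             low 16 bits zero
        offset =     0x01f4             fits in 16 bits
        base + offset  =  base | offset  =  0x004301f4  )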

Why would I want base register addressing in this age of virtual memory?
Well, as you know, Crays aren't virtual... it might be nice to be able to
introduce a high performance, non-virtual, addition to your product line.

The biggest problem is keeping your compiler writer's dirty hands off the base
registers, since you want to reserve them for the OS. Especially, you don't
want them to use the base registers like an extended constant pool (grr..)


Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms.arpa

I always felt that disclaimers were silly and affected, but there are people
who let themselves be affected by silly things, so: my opinions are my own,
and not the opinions of my employer, or any other organisation with which I am
affiliated. I indicate my employer only so that other people may account for
any possible bias I may have towards my employer's products or systems.

aglew@ccvaxa.UUCP (11/02/87)

>I had an idea some time ago that I'm surprised I've never seen discussed.
>Suppose, for example, that your instruction processor has four stages.  With
>conventional pipelining, that means that four consecutive instructions from
>the same program are at some stage of execution at the same time.
>
>Instead, why not have four different execution threads being performed
>simultaneously?  This eliminates the dependency checks and latency delays
>inherent in "vertical" pipelining.  (Many RISCs put these into the compiler
>instead of the architecture, but they're still there).  On a multi-user
>system with a reasonable load level, it seems to me that this should
>represent a performance improvement.  Of course, it won't look good on the
>standard benchmarks.
>
>Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
>Ashton-Tate          52 Oakland Ave North         E. Hartford, CT 06108

This idea, multiprocessing through pipelining, was used in the Denelcor
HEP machine, except that the HEP had many more stages in the pipeline. If there
are folk out there who worked at Denelcor, they can tell us what happened;
I think the gist of it is that they went under because they were trying out
too many new ideas, and because of manufacturing problems. I've heard stories
like <<Somebody in manufacturing saw this great big loop of wire on the
backplane, and decided to cut it off and make it shorter. Except that the loop
was there for timing...>>. Apparently UNIX exercised the machine in ways
that their proprietary OS did not, and exposed bugs...

Overall, though, I *really* *like* the multiprocessing through pipelining
idea. I was first introduced to it by a British prof who said "but who in the
world can keep 40 processes at a time active?". "UNIX", I thought.
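
As a toy illustration of the idea (mine, not the HEP's actual mechanism):
with a 4-stage pipe, issue from a different thread each cycle, round-robin,
so consecutive pipe stages never hold two instructions from the same thread
and there are no intra-thread hazards to interlock against.

#include <stdio.h>

/* Toy sketch of "multiprocessing through pipelining": round-robin
 * thread selection, one instruction issued per cycle. */
#define NSTAGE   4
#define NTHREAD  NSTAGE

struct thread { int pc; };          /* trivially small context here   */

int main(void)
{
    struct thread th[NTHREAD] = { {0}, {100}, {200}, {300} };

    for (int cycle = 0; cycle < 12; cycle++) {
        int t = cycle % NTHREAD;    /* round-robin thread select      */
        printf("cycle %2d: issue thread %d, pc %d\n",
               cycle, t, th[t].pc);
        th[t].pc++;                 /* that thread advances one instr */
    }
    return 0;
}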

Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms.arpa

I always felt that disclaimers were silly and affected, but there are people
who let themselves be affected by silly things, so: my opinions are my own,
and not the opinions of my employer, or any other organisation with which I am
affiliated. I indicate my employer only so that other people may account for
any possible bias I may have towards my employer's products or systems.

fu@uicsrd.csrd.uiuc.edu (11/03/87)

/* Written  7:49 am  Oct 26, 1987 by daveb@geac.UUCP in uicsrd:comp.arch */
In article <5567@jade.BERKELEY.EDU> mwm@eris.BERKELEY.EDU 
	(Mike (My watch has windows) Meyer) writes:
>Every time I've looked at an article on register windows, I think of
>something I refer to as "the Johnson stack machine."
>I have good reason to believe that AT&T was working on hardware that
>used this model. Does anyone know what became of that?

>>  Well, ICL (in Britain) was working with stack-top caches on the
>>large 2900's some years ago (10!).  If anyone from Blighty would
>>care to comment on what happened subsequently, I'll refrain from
>>blurting out what was then the hot technique in the internal
>>rumor-mill.  (Can you say "half-cache"?)

I assume you are talking about the ICL 2980, which was announced
over 10 years ago as the top-of-the-line model. I didn't work on that
particular machine, only on the follow-on, but I seem to remember that
the cache or slave store (in ICL-ese) only held stack elements due to
its small size. As the ICL 2900 architecture was stack based, this
seemed a reasonable trade-off. There was no specific architectural
support for this slave store, while, from what I can recall of the AT&T
stack cache, its operation is supported in the instruction set.

In the follow-on machine we had slightly larger chips to play with
(1K ECL RAMs!) and decided to cache all primary operands. Fortunately
the machine was cancelled, for lack of the small gyms necessary to house
the systems had we actually finished it (over 200 PCBs with over 100
packages per board). The last I heard, ICL no longer builds large
systems, but buys them ready to sell from the Japanese.

John Fu

University of Illinois,
Center for Supercomputer Research and Development.
ARPANET: fu%uicsrd@a.cs.uiuc.edu
	 or sometimes fu%tallis@decwrl.dec.com

jpdres10@usl-pc.UUCP (Green Eric Lee) (11/04/87)


I've been following this conversation for some time. I wonder, how
much does the adder in the register path (to add the register stack
pointer to the instruction's desired register) slow things down in the
AMD29000? It might also seem like if you have register windows, you
have the same thing sitting there in the middle of the address path
between instruction latch and register addressing, but one scheme I've
read about is to make your register windows out of shift-registers,
i.e. no adding, as far as the machine is concerned, you just have your
16 registers out there or whatever, and when a procedure call is made,
or a return made, shift the current registers down or up (with some
provisions for stack overflow). And, of course, Plain Old Registers
have no problem at all with something out there in the register
addressing path, except, of course, the decode tree... it seems
offhand that the MIPS approach, with lots of registers and a very
smart compiler capable of allocating them to procedures as needed,
might offer some potential future speed advantages over traditional
(non-shift-register) register windows or AMD's "stack window"
approach, unless some very heavy pipelining is employed.

---
Eric Green  elg@usl.CSNET       from BEYOND nowhere:
{ihnp4,cbosgd}!killer!elg,      P.O. Box 92191, Lafayette, LA 70509
{ut-sally,killer}!usl!elg     "there's someone in my head, but it's not me..."

rpw3@amdcad.AMD.COM (rpw3) (11/07/87)

In article <28200061@ccvaxa> aglew@ccvaxa.UUCP writes:
+---------------
| ..> Procedure call vs. interrupt frequency.
| Something else to point out: globally optimizing compilers can do inline
| expansion of much code, reducing procedure call overhead and frequency.
| This cannot be done to interrupts.
+---------------

Just another note about the Am29000... The entire register set does *NOT*
have to be saved on either a system call or an interrupt, only on a "forced"
context switch.

For a system call, not even the temps have to be saved; a system call is
a "subroutine", and temps are assumed destroyed across subroutines. A system
call also uses the same register window (set of busy/free registers) as the
user (yup!). Thus, if the current register window happens to have enough
registers free at the end, no registers need be saved/restored to complete
the system call.

Interrupts (at least those which call C code, and aren't just "soft-DMA"
assembly-language heads) *do* have to save the "global" temps, but still
need only save/restore the "local" stack-cache registers actually used.

Only in the case where an interrupt forces a context switch (say, at a
runtime quantum expiration) does the entire set have to be saved/restored.
(Context switches due to blocking-I/O had to have a system call first,
so a lot of the registers were free already.)

A related point: Since system calls are implemented with a synchronous
"trap"-type instruction, virtually none of the processor state has to
be saved either, only the user PC (the 29000's "PC1"). Contrast this with
CISC machines for which any mode change saves beaucoup bits...

Do a "vmstat" on your VAX, and you'll see that making system calls and
interrupts faster than context switches is a "good thing".


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun,attmail}!redwood!rpw3
ATTmail:  !rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

bcase@apple.UUCP (Brian Case) (11/09/87)

In article <230@usl-pc.UUCP> jpdres10@usl-pc.UUCP (Green Eric Lee) writes:
>I've been following this conversation for some time. I wonder, how
>much does the adder in the register path (to add the register stack
>pointer to the instruction's desired register) slow things down in the
>AMD29000? It might also seem like if you have register windows, you
>have the same thing sitting there in the middle of the address path
>between instruction latch and register addressing, but one scheme I've
>read about is to make your register windows out of shift-registers,
>i.e. no adding, as far as the machine is concerned, you just have your
>16 registers out there or whatever, and when a procedure call is made,
>or a return made, shift the current registers down or up (with some
>provisions for stack overflow). And, of course, Plain Old Registers
>have no problem at all with something out there in the register
>addressing path, except, of course, the decode tree... it seems
>offhand that the MIPS approach, with lots of registers and a very
>smart compiler capable of allocating them to procedures as needed,
>might offer some potential future speed advantages over traditional
>(non-shift-register) register windows or AMD's "stack window"
>approach, unless some very heavy pipelining is employed.

Well, that's a lot of stuff for one posting.  First off, the adder in
the register addressing path doesn't cost much, in the current implementation,
other than silicon area; reason:  the chip does register write before read,
and the write must complete before reads can be done so that an extra level
of forwarding (bypassing is another term often used) can be avoided.  So,
while the write is being done, we might as well do something useful like
an address add.  Machines like Berkeley RISC II (and I suspect SUN SPARC)
simply integrate the "add" into the decode tree.  This is somewhat less
expensive, but the idea is the same.  Ok, so the question comes up (at least
it did at AMD):  "What about other implementations, e.g. ECL?"  Well, it
is quite likely that a 4-stage pipeline is still going to be desirable in
other technologies.  It is also likely that the elimination of a stage of
forwarding with write-before-read is desirable.  At least with the kind of
ECL we were considering at AMD, things should scale reasonably well.  Sure,
things aren't going to scale exactly, but we felt the large register file
was worth the risk of a slight mismatch in pipestage lengths.

As for the viability of flat register files, there is some evidence beyond
whatever you consider MIPS et al. to have provided.  David Wall of DECWRL
constructed what is basically an optimizing linker for the experimental
Titan machine.  This linker uses complete knowledge of the program object
code, and can even make use of fed-back run-time info, to construct a very
good register allocation.  The effect is somewhat like register windows.
Unfortunately, this technique is based on static analysis, even when run-time
info is fed back, and so doesn't do the same job as register windows.
However, in practice, it is worth something.  The Titan has 63 registers.
Wall didn't say how well the technique would work with fewer registers.

Anyway, here is the reference:

Wall, D. W., "Global Register Allocation At Link Time," ACM SIGPLAN Compiler
Construction Conference, Palo Alto, June 1986.

Sorry, I don't have the Vol and No info since I got the paper directly
from David.

Ok, so to address future speed advantages, yes there might be some speed
advantages for those with simple register files.  However, for the Am29000,
the critical paths were quite balanced (Dave Witt, are you out there?)
with, I believe, the TLB and/or instruction cache being the limiting
factor.  Next came the ALU, and then the register file.  Unless you want
to do things like spread the ALU cost over two pipestages (possible to do),
I don't think the register file is going to be the limiting factor.

First, I don't mean to imply that we considered all other implementation
technologies nor that we considered any very seriously.  Second, some of
what I said about the 29000 implementation may not be completely correct,
but the gist is correct.  Third, what do other people have to say?
Probably not much, since talking about this would be tantamount to disclosing
future plans.  Oh well.

david@neptune.AMD.COM (11/10/87)

In article <6681@apple.UUCP> bcase@apple.UUCP (Brian Case) writes:
>Ok, so to address future speed advantages, yes there might be some speed
>advantages for those with simple register files.  However, for the Am29000,
>the critical paths were quite balanced (Dave Witt, are you out there?)
>with, I believe, the TLB and/or instruction cache being the limiting
>factor.  Next came the ALU, and then the register file.  Unless you want
>to do things like spread the ALU cost over two pipestages (possible to do),
>I don't think the register file is going to be the limiting factor.

	well, since my friend bcase requested a response from me, on the 29k 
	design the stack relative add was one of the speed paths 
	encountered on the part, but certainly no worse than the
	64-32 funnel shift or worst case alu adds, tlb translation
	or conditional jump and read from the branch target cache.

	Specifically for that particular path, in one half clock phase,
	the internal pipe was required to discharge the instruction
	bus and statically add the stack pointer to the a,b,c offset
	in three separate 7-bit adders.  In parallel, a zero detect
	and a check on the msb of the a,b,c values determined the
	selection in a 3:1 multiplexor to enable the stack-relative
	local register, the global registers, or the indirect pointers.
	
	The output of the multiplexor was the address for the row/column
	decode for the 3-port register file which would be locally
	decoded and accessed in the next half clock phase for a double
	read. (the write is obviously delayed due to the internal pipe
	and therefore not a speed path).  The total amount of gate
	delays for this path (including a small amount of lookahead
	for the adder) was 12 gates.  In initial silicon at nominal
	temp, this worst-case path was passing in excess of 35 MHz.

	In my opinion, in our internal pipe, it was not a major concern
	in terms of designing in this functionality.
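
For illustration only, here is a rough C model of the register-specifier path
described above.  This is my own reconstruction from the posting, not AMD
documentation; the zero-detect/msb selection and the 7-bit add are as Witt
describes them, but the details (and the treatment of the indirect pointers)
are simplified guesses.

#include <stdio.h>

/* Rough model: an 8-bit specifier selects the indirect pointers
 * (specifier == 0, a simplification), the globals (msb clear), or the
 * stack-relative locals (msb set), where the low 7 bits are added to a
 * 7-bit stack pointer to form the row/column address. */
enum bank { INDIRECT, GLOBAL, LOCAL };

struct regaddr { enum bank bank; unsigned phys; };

static struct regaddr decode(unsigned spec, unsigned sp)
{
    struct regaddr r;
    if (spec == 0) {                       /* zero detect              */
        r.bank = INDIRECT; r.phys = 0;
    } else if ((spec & 0x80) == 0) {       /* msb clear: global        */
        r.bank = GLOBAL;   r.phys = spec & 0x7f;
    } else {                               /* msb set: stack-relative  */
        r.bank = LOCAL;
        r.phys = (spec + sp) & 0x7f;       /* 7-bit add, wraps         */
    }
    return r;
}

int main(void)
{
    struct regaddr r = decode(0x85, 0x60); /* hypothetical values      */
    printf("bank %d phys %u\n", r.bank, r.phys);
    return 0;
}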


	

henry@utzoo.UUCP (Henry Spencer) (11/11/87)

> ... the MIPS approach, with lots of registers and a very
> smart compiler capable of allocating them to procedures as needed,
> might offer some potential future speed advantages over traditional
> (non-shift-register) register windows or AMD's "stack window"
> approach...

Don't forget that this is, to some extent, a function of what language
you want to use.  The assumption behind lots-of-registers-plus-smart-
compiler is that calling patterns are predictable enough to permit the
compiler to base optimizations on them.  This is not so (or not necessarily
so) for more dynamic languages like Smalltalk.
-- 
Those who do not understand Unix are |  Henry Spencer @ U of Toronto Zoology
condemned to reinvent it, poorly.    | {allegra,ihnp4,decvax,utai}!utzoo!henry

hankd@pur-ee.UUCP (Hank Dietz) (10/04/88)

In article <ANDREW.88Sep28160417@jung.harlqn.uucp>, andrew@jung.harlqn.uucp (Andrew Watson) writes:
> In article <91@zeno.MN.ORG> gene@zeno.MN.ORG (Gene H. Olson) writes:
> 
>    [stuff as to ignoring register windows for SPARC...]
> 
> 	   SPARC offers you the option of register windows.  If they don't
> 	   help in some situations (they seem to work well in others) there
> 	   is no reason you need use them.  Since only two of them are
> 	   required, there is no serious penalty if they are unused by an
> 	   implementation.  It's clearly a win-win situation.
> 
> Just a small point from a compiler-writer who's been attempting to write a code
> generator for the SPARC that *does* ignore register windows - it's possible,
> but the architecture really doesn't support it.
> 
> [complaint about having no burst register push/pop instructions]

At HICSS'21, in January 1988, I chaired a discussion on RISC architecture,
which quickly became a discussion of register windowing concepts.  There
were several key points which surfaced in the discussion; the following is
just a bit about what you really want instead of burst register save/restore
instructions....

	LAZY STORE/LOAD SHOULD BE USED.

	How Many Windows Do You Need?  For some reason, it has become
popular to assume that the more registers you have, the better off you are.
However, consider window management using lazy operations:  you need only
two *actual* register sets, and more sets help only a very little.

	Suppose we start executing in window A.  Suddenly, we find that we
must execute a subroutine call, making the current window B (yet keeping A in
its register set).  In most machines, not every memory reference cycle is
actually used to refer to memory...  let's suppose that F is the fraction of
time during which the memory reference time slot is empty.  Suppose also that
the number of instructions between creating window A and switching to window
B is I, and the number of instructions executed between creating B and
reverting to A is J.  Further suppose that there are R registers used in a
typical register window.  If I > R*(1/F) and J > R*(1/F), then by having the
hardware automatically save and restore "in-use" registers lazily (whenever
it sees free memory cycles), one can pretend to have as many register sets as
one wants, even though the hardware need implement only two:  by the time a
register set must be reused for another set, the set will have been saved
using lazy stores, hence, it will be ready and waiting.  Likewise, when a set
is to be removed, the hardware begins lazy-reloading of the previous set,
hence no delay is seen.  Compare this to the very-hard-to-hide delay of using
burst stores, with or without register windowing....
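
To put rough numbers on that condition (the arithmetic below is mine, not
from the posting): I instructions supply about F*I free memory slots, so the
lazy stores keep up when F*I >= R, i.e. when I >= R*(1/F).

#include <stdio.h>

/* Back-of-the-envelope check, using the posting's symbols:
 * R registers to save, F = fraction of free memory slots,
 * I = instructions available before the set must be reused. */
static int lazy_keeps_up(int R, double F, int I)
{
    return F * I >= R;
}

int main(void)
{
    /* e.g. R = 8 live registers, F = 0.7: about 12 instructions
     * between calls are enough. */
    printf("I=10: %s\n", lazy_keeps_up(8, 0.7, 10) ? "keeps up" : "falls behind");
    printf("I=12: %s\n", lazy_keeps_up(8, 0.7, 12) ? "keeps up" : "falls behind");
    return 0;
}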

	For three conceptual sets, A, B, and C, the sequence is like:

		A begins in set 0
		A executes some stuff
		B begins in set 1, A remains in 0
		B executes some stuff, while A is lazily stored
		C begins in set that had been A (set 0), B remains in 1
		C executes some stuff, while B is lazily stored
		C terminates, B is still available in 1
		B continues to execute (in 1), while A is lazily reloaded (in 0)
		B terminates, A is now available again (set 0)
		A continues to execute in set 0
		A terminates

It is easy to see that having more sets simply makes the hardware more
tolerant to variations in R, F, I, and J (since these values are effectively
averaged over all the actual register sets).

	Notice that a lazy-store/lazy-reload, register-name-addressable
top-of-stack cache is virtually equivalent to the lazy window scheme, but
probably is a bit more general.


...There is plenty more to say about this, if people are interested.  (Of
course, when it comes to register window use, I'm biased:  I'd rather use the
compiler-driven register/cache/CReg management that Chi and I have developed;
however, I'm perfectly happy to start a discussion which will lead everyone
else to the same conclusion ;-).

							-Prof. Hank Dietz
     __         /|
  _ |  |  __   / |  Compiler-oriented
 /  |--| |  | |  |  Architecture
/   |  | |__| |_/   Researcher from
\__ |  | | \  |     Purdue
    \    |  \  \
	 \      \   hankd@ee.ecn.purdue.edu

bcase@cup.portal.com (10/05/88)

>	LAZY STORE/LOAD SHOULD BE USED.
>you need only two *actual* register sets and more help only a very little.
[SEE THE ORIGINAL FOR ADDITIONAL VERBOSE CONTEXT]
>one can pretend to have as many register sets as
>one wants, even though the hardware need implement only two:  by the time a
>register set must be reused for another set, the set will have been saved
>using lazy stores, hence, it will be ready and waiting.  Likewise, when a set
>is to be removed, the hardware begins lazy-reloading of the previous set,
>hence no delay is seen.  Compare this to the very-hard-to-hide delay of using
>burst stores, with or without register windowing....

In my opinion, this is BS:  This is tantamount to saying you can predict
the sequence of calls and returns!!  If my CPU starts lazily saving but
the next thing it does is return, it has wasted all the memory cycles spent
lazily saving.  If my CPU lazily restores but the next thing it does is
call, it has wasted all the memory cycles spent restoring.  This is related
to the question:  "How many windows should be saved/restored on a call/
return?"  This question was answered by the Berkeley people:  exactly one.
The reason it is one is that you can't predict the future.

"But," you say, "the lazy cycles aren't wasted since they don't interfere
with normal memory references."  Sorry, I don't buy this because to insert
memory references so that they don't interfere means that you are able
to predict the pattern of memory references.  Sure, by looking ahead and
all that, but this is *not simple*.  You must be able to predict conditional
branches!

Lastly, a register window scheme with multiple windows provides hysteresis
*without any* memory references at all!  The two-window-plus-laziness
scheme doesn't.

Just because the memory interface isn't completely saturated *doesn't*
mean that the unused capacity can be used without penalty.  There is
an allocation penalty associated with arbitrating for any fixed
resource; think of an intersection controlled by a stop light at rush
hour.  The resource simply can't be used 100% of the time:  we have
to switch the direction of traffic flow every now and then.  This is
the purpose behind bursting:  once the direction of traffic (memory)
flow has been established, speed through the intersection as fast as
possible.  Changing the direction of flow after every car (memory
reference) is inefficient.  I think queuing theory also comes into
play here.  True, if you could look ahead at
the instruction stream, predict correctly all the conditional branches
etc., then you could know exactly where to insert the lazy loads and
stores.  But I think this is unrealistic.

You have to consider the realities of implementation!

If my analysis has erred, please enlighten me!

jmd@granite.dec.com (John Danskin) (10/07/88)

I just love writing replies to unidentified flamers from the portal system 8-).

In article <9725@cup.portal.com> bcase@cup.portal.com writes:
>>	LAZY STORE/LOAD SHOULD BE USED.
>>you need only two *actual* register sets and more help only a very little.
>[SEE THE ORIGINAL FOR ADDITIONAL VERBOSE CONTEXT]
>>one can pretend to have as many register sets as
>>one wants, even though the hardware need implement only two:  by the time a
>>register set must be reused for another set, the set will have been saved
>>using lazy stores, hence, it will be ready and waiting.  Likewise, when a set
>>is to be removed, the hardware begins lazy-reloading of the previous set,
>>hence no delay is seen.  Compare this to the very-hard-to-hide delay of using
>>burst stores, with or without register windowing....
>
>In my opinion, this is BS:  This is tantamount to saying you can predict
>the sequence of calls and returns!!  If my CPU starts lazily saving but
>the next thing it does is return, it has wasted all the memory cycles spent
>lazily saving.

[Followed by a bunch more stuff about how you can't predict
the instruction stream]

There are a couple of ways people have proposed doing lazy saving:

o	Katevenis (or somebody working with him) proposed trying to
	save register windows during unused memory cycles. They did
	some analysis and concluded that there were not enough unused
	cycles. Part of the problem here might have been that they needed
	to save the whole window before any of it was usable.

o	When you start a new register window, flag all of the registers
	as 'clean'. Whenever you update one, flag it as dirty. Now,
	when it comes time to reuse a register window, whenever the
	processor tries to update a previously dirty register, copy it
	somewhere and try to save it during an otherwise unused memory
	cycle.

	On return, flag registers which have lost their values as 'empty'
	(or something). When the processor tries to refer to an empty
	register, the value is reloaded.

	This is really lazy.  You only save/restore registers when you
	need to (a rough sketch of this flag scheme follows below).
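
Here is a rough sketch of that second, flag-based scheme, fleshed out
editorially: the register count, the names, and the single caller/callee
level are all invented for illustration, and a real machine would do the
spill in an otherwise idle memory cycle rather than inline as shown.

#include <stdio.h>

/* One caller frame shares the physical window with one callee.
 * Writing a register that still holds the caller's unsaved value
 * spills that value first; after the return, a register the callee
 * clobbered is EMPTY and is reloaded only if the caller reads it. */
#define NREG 16

enum flag { CLEAN, DIRTY, EMPTY };     /* state of the caller's value   */

static int       reg[NREG];            /* the physical registers        */
static enum flag flag[NREG];
static int       mem[NREG];            /* stand-in for the spill frame  */
static int       in_callee;            /* are we below the call?        */

static void write_reg(int n, int v)
{
    if (in_callee && flag[n] == DIRTY) {
        mem[n]  = reg[n];              /* lazy store of caller's value  */
        flag[n] = EMPTY;               /* caller's copy now only in mem */
    } else if (!in_callee) {
        flag[n] = DIRTY;               /* caller creates a live value   */
    }
    reg[n] = v;
}

static int read_reg(int n)
{
    if (!in_callee && flag[n] == EMPTY) {
        reg[n]  = mem[n];              /* lazy reload on demand         */
        flag[n] = DIRTY;
    }
    return reg[n];
}

int main(void)
{
    write_reg(3, 42);                  /* caller's live value in r3     */
    in_callee = 1;                     /* call                          */
    write_reg(3, 7);                   /* callee reuses r3; 42 spills   */
    in_callee = 0;                     /* return                        */
    printf("r3 = %d\n", read_reg(3));  /* prints 42: reloaded lazily    */
    return 0;
}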

None of these schemes imply knowing the instruction stream in advance.
I guess the point that (unnamed) missed was that (other unnamed) assumed
that the hardware was managing the lazy loads/stores, not the compiler.

Of course, these schemes produce more complicated silicon than burst
read/write schemes. In particular, anything with per-register flags
introduces latency into the register read/write path, which is the
critical path unless you have blown everything.

Is it a win? simulate and see. Tell the world.

By the way:
	There is a paper:
		"Register Windows Vs. General Registers: A Comparison of
		Memory Access Patterns" by Scott Morrison and Nancy Walker
		of UC Berkeley.

	Which shows that the MIPS R2000 (aside from running faster) achieves
	fewer memory references (in almost all cases) than SPARC with all
	levels of optimization and as many as 7 register windows.

	a) Does anyone know if/where (Earl?) this paper was published?
	(I got a copy from MIPS people, they love to give it away).

	b) Does anybody at SUN have an answer (tell us how they got it all
	wrong, register windows really DO save memory references).

	c) Anybody at AMD (Tim?) want to say something about how burst
	read/write makes the extra references OK?

-- 
John Danskin				| decwrl!jmd
DEC Technology Development		| (415) 853-6724 
100 Hamilton Avenue			| My comments are my own.
Palo Alto, CA  94306			| I do not speak for DEC.

hankd@pur-ee.UUCP (Hank Dietz) (10/07/88)

Pertaining to my posting about lazy store/reload of register frames:

In article <9725@cup.portal.com>, bcase@cup.portal.com writes:
> [Flaming as to lazy ops not being free and some obscure concept about
>  needing to predict the future...]
> The reason it is one is that you can't predict the future.
> [More flaming about memory intereference...]
> ...  You must be able to predict conditional branches!
> 
> Lastly, a register window scheme with multiple windows provides hysteresis
> *without any* memory references at all!  The two-window-plus-laziness
> scheme doesn't.
>
> ...  True, if you could look ahead at
> the instruction stream, predict correctly all the conditional branches
> etc., then you could know exactly where to insert the lazy loads and
> stores.  But I think this is unrealistic.

You don't seem to know what lazy operations are.  A lazy operation is NOT a
simple "delayed" load or store, and it isn't triggered by an opcode being
executed; it is an operation which is automatically triggered either when
certain conditions exist which favor its efficient completion or when its
result is required --  one does NOT insert lazy operations in anything.

Let's say you have 8 non-lazy windows (and one heck of a lot of valuable die
space consumed by them).  What do you do when the 9th nested call is made?
The 10th?  You do a sit-and-wait-for-it burst store, that's what...  would
you really describe that as being "*without any* memory references at all!"?

With lazy store/reload built-in, by the time you make a call nested deeper
than you have windows, it is very likely that you've got at least one
register set saved so that you can immediately reuse it -- without waiting
for any memory references.  When a procedure returns, the frame which was
saved begins to lazily reload its values into the set it was flushed from.
Now, you can argue that there might not be enough time between calls or
between returns to lazily store or reload a set, but that's unlikely
because:

Calls:		The lazy stores only have to store registers which are live
		and dirty, i.e., whose value will be referenced after return
		from the call and also is not the same as a value stored
		somewhere in memory (e.g., a variable) by the programmer's
		code.  In most RISC processors, the only instructions able
		to make a register dirty are register = register op
		register...  which all have a free memory reference cycle!
		In the worst case, you'd lag behind by just one register
		store, which sure beats doing a non-lazy burst every so
		often. QED.

Returns:	The obvious worst case is if the instruction immediately
		after each call is a return.  However, if you do return
		immediately, then those registers were not live after the
		call (see above):  you would not have saved 'em in the first
		place, so they'd be restored in zero time.  Since live
		values are usually kept in registers primarily to be
		operated upon, a similar argument to that given for calls
		can be applied...  lazy reloading might not keep up with
		returns, but it shouldn't be far behind.

How unlikely?  Specify an instruction set and timings and find out.  That's
one of the reasons I posted this -- to inspire a bit of thought and research.
However, one has very good reason to believe it will work quite well, and if
it works even nearly as well as non-lazy windowing but requires far fewer
windows, just look at all that valuable die space you just bought!

As for predicting the future, it has very little to do with lazy operation,
and "prediction" is something of a misnomer anyway.  Statically examining code
(e.g., at compile time) there is no past or future; you know everything
probabilistically, and that's
usually good enough -- I agree that hardware can't predict the future, so you
ought not try to make it do that...  hardware is good at other things, like
telling you exactly which way THIS branch goes or which memory module THIS
pointer references.  The static/dynamic tradeoffs are what my research group,
CARP, is all about.

     __         /|
  _ |  |  __   / |  Compiler-oriented
 /  |--| |  | |  |  Architecture
/   |  | |__| |_/   Researcher from
\__ |  | | \  |     Purdue
    \    |  \  \
	 \      \   hankd@ee.ecn.purdue.edu

tim@crackle.amd.com (Tim Olson) (10/07/88)

In article <287@granite.dec.com> jmd@granite.UUCP (John Danskin) writes:
| By the way:
| 	There is a paper:
| 		"Register Windows Vs. General Registers: A Comparison of
| 		Memory Access Patterns" by Scott Morrison and Nancy Walker
| 		of UC Berkeley.
| 
| 	Which shows that the MIPS R2000 (aside from running faster) achieves
| 	fewer memory references (in almost all cases) than SPARC with all
| 	levels of optimization and as many as 7 register windows.

This says fewer overall memory references, but what is missing here is
the ratio of loads and stores to the rest of the instruction mix.  I
wouldn't be surprised if it is just that the Sun compiler is not doing
as good a job in general, and so the total number of instructions
(including loads and stores) increased with respect to the MIPS
compiler.  However, the number of loads and stores as a percentage of
the instruction mix might be lower. 

| 	a) Does anyone know if/where (Earl?) this paper was published?
| 	(I got a copy from MIPS people, they love to give it away).
| 
| 	b) Does anybody at SUN have an answer (tell us how they got it all
| 	wrong, register windows really DO save memory references).
| 
| 	c) Anybody at AMD (Tim?) want to say something about how burst
| 	read/write makes the extra references OK?

Well, I don't know what SUN seems to be doing wrong, but let's try this:
bsd 4.3 nroff with the 4.3 libraries running

	nroff /usr/doc/misc/sysperf/2.t	[a 10655 byte file]

results in:

---------- Pipeline ----------
 32.63% idle pipeline:
	 18.39% Instruction Fetch Wait
	 11.44% Data Transaction Wait
	  0.69% Page Boundary Crossing Fetch Wait
	  0.01% Unfilled Cache Fetch Wait
	  0.00% Load/Store Multiple Executing	<-- Hmm, not much time here!
	  2.07% Load/Load Transaction Wait
	  0.03% Pipeline Latency

---------- Bus Utilization ----------
Inst Bus Utilization:	 63.97%
	 8669133 Instruction Fetches

Data Bus Utilization:	  9.75%
	  979830 Loads
	  340998 Stores

---------- Instruction Mix ----------
	  1.86% Calls
	 15.65% Jumps
	 10.73% Loads
	  3.74% Stores
	  4.33% No-ops

---------- Register File Spilling/Filling ----------
	       3 Spills				<-- this is why
	       0 Fills

Spill/Fill sizes:
   1 registers:         0 time(s) (  0.00%)
   2 registers:         1 time(s) ( 33.33%)
   3 registers:         0 time(s) (  0.00%)
   4 registers:         1 time(s) ( 33.33%)
   5 registers:         0 time(s) (  0.00%)
   6 registers:         0 time(s) (  0.00%)
   7 registers:         0 time(s) (  0.00%)
   8 registers:         0 time(s) (  0.00%)
   9 registers:         0 time(s) (  0.00%)
  10 registers:         0 time(s) (  0.00%)
  11 registers:         0 time(s) (  0.00%)
  12 registers:         1 time(s) ( 33.33%)
  13 registers:         0 time(s) (  0.00%)
  14 registers:         0 time(s) (  0.00%)
  15 registers:         0 time(s) (  0.00%)
  16 registers:         0 time(s) (  0.00%)
> 16 registers:         0 time(s) (  0.00%)

So for the entire nroff run, we wrote a total of 18 words out to memory
due to stack cache overflow.  And we loaded 0 words from the stack due
to underflow (this is because nroff calls exit() while it is still a few
procedures down in the overall call chain).

I would be interested in seeing how this compares to a
non-register-windowed processor, in particular the total number of
loads/stores, the loads/stores as a percentage of instruction mix, and
the number of words of scalar data transferred to/from the stack.
	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)

ok@quintus.uucp (Richard A. O'Keefe) (10/07/88)

In article <9410@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>Pertaining to my posting about lazy store/reload of register frames:
>
Wasn't there an MIT paper with a title something like
	"Lazy Scoping -- the Dream of a Lifetime"
which had to do with lazy register saves and restores?

tim@crackle.amd.com (Tim Olson) (10/07/88)

In article <9410@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
| Let's say you have 8 non-lazy windows (and one heck of a lot of valuable die
| space consumed by them).  What do you do when the 9th nested call is made?
| The 10th?  You do a sit-and-wait-for-it burst store, that's what...  would
| you really describe that as being "*without any* memory references at all!"

True, but then non-lazy windows are only loading or storing what is absolutely
required, when it is required.  Lazy operations are continually trying
to load/store ahead.  This seems like the real misnomer -- non-lazy
windows are truly lazy (only doing what is required) vs. "lazy windows"
(which are quite active).

Consider an "on-demand" load/store window scheme with 4 windows, vs.  a
"background" load/store window scheme with 2 windows (which was implied
to be all that was required).  If the call chain looks like

1 2 3 4 5 6 7 6 7 6 7 6 7 6 7 8 7 6 5 6

i.e., it spends a lot of time around a local maximum depth (certainly not
atypical).  The "on-demand" window scheme has no saving or restoring to
do while bouncing around between levels 7 and 6, because of the built-in
hysteresis.  The "background" load/store scheme, however, is
continually saving and restoring.  This is a waste of memory bandwidth.
What register windows are buying is this hysteresis in saving and
restoring the stack frame, and the only way to get it is to provide a
large number of windows.  Once that is available, background load/stores
don't buy much, because register file spilling/filling just doesn't
occur that often in real programs (maybe 0.5% of all calls spill).
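
As an editorial toy model (not the Am29000's actual stack cache), counting
spills and fills for the trace above with W on-demand windows shows the
hysteresis: for this trace and W=4, the only spills come on the initial
descent and on the single call to depth 8, and there are no fills at all.

#include <stdio.h>

/* W resident windows; spill the shallowest resident frame only when a
 * call needs a window and none is free, fill only when a return lands
 * on a frame that was spilled. */
int main(void)
{
    int trace[] = { 1,2,3,4,5,6,7,6,7,6,7,6,7,6,7,8,7,6,5,6 };
    int n = sizeof trace / sizeof trace[0];
    int W = 4;                          /* on-demand windows            */
    int low = trace[0], depth = trace[0];
    int spills = 0, fills = 0;

    for (int i = 1; i < n; i++) {
        if (trace[i] > depth) {         /* call                         */
            depth++;
            if (depth - low + 1 > W) {  /* no free window: spill oldest */
                low++;
                spills++;
            }
        } else {                        /* return                       */
            depth--;
            if (depth < low) {          /* caller's frame was spilled   */
                low = depth;
                fills++;
            }
        }
    }
    printf("W=%d: %d spills, %d fills\n", W, spills, fills);
    return 0;
}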

One other problem with "background" load/stores occurs when memory
operations take more than a single cycle.  In this case, a "background"
memory operation may be started when the memory was otherwise idle, and
right after that, a regular load or store is requested.  The requested
operation must wait for the background one to complete, decreasing
performance (this is what I think Brian Case was talking about when he
mentioned having to predict future operations -- to ensure that this
kind of collision doesn't occur).

Finally, if background loads/stores are interspersed with the regular
load/store stream, they cannot take full advantage of their sequential
nature, and thus cannot take full advantage of any faster burst-mode
capability that memory may provide. 

	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)

bcase@cup.portal.com (10/08/88)

John Danskin at decwrl writes: (the stupid portal mailer doesn't include
the "who wrote it" line when you reply, sigh):
>I just love writing replies to unidentified flamers from the portal system 8-).

??Er, I don't understand what you mean by unidentified; my login name
was included in your article.  Do you mean that my full name wasn't
included?  It is Brian Case...  I am dismayed to discover that I am
being judged because my postings originate from Portal!  Anybody want
to give me an account on a respectable UNIX system?
                                  |||||
>In article <9725@cup.portal.com> bcase@cup.portal.com writes:
>>In my opinion, this is BS:  This is tantamount to saying you can predict
>>the sequence of calls and returns!!  If my CPU starts lazily saving but
>>the next thing it does is return, it has wasted all the memory cycles spent
>>lazily saving.

John left out my comment that I believe the lazy loads/stores interfere
with normal memory traffic.  I am probably wrong though....

>None of these schemes [the ones he mentioned in his reply to my
>posting] imply knowing the instruction stream in advance.
>I guess the point that (unnamed) missed was that (other unnamed) assumed
>that the hardware was managing the lazy loads/stores, not the compiler.

Which unnamed am I?  :-)  Anyway, I was assuming that hardware would be
scheduling the lazy loads/stores.  While the hardware at least has
knowledge about what instructions are in its pipeline (safely assuming
a pipelined implementation, I think), it still can't predict the
future.  I still maintain that a great deal of disruptive loading/storing
can go on just to find out later that it was wasted.  With more than
two windows, hysteresis causes the register file to capture the working
set of the register window stack.

I guess the point that I was trying to make in my posting was that
hardware can't know exactly when to insert them lazy loads/stores:
if it inserts one that causes a cache miss (though this is probably
unlikely; there are probably better cases to use as an example) and
immediately after the insertion a conditional branch is taken to a
"real" load (one that the program wants to do), the "real" load will
be delayed by the lazy load/store.  But, you say, if the lazy load/
store had been done explicitly, it would be guaranteed to delay the
progress of the program.  Yes, but if the explicit lazy loads/stores
are done as part of a burst, each one is much more efficient.

I think something that might be able to end this discussion is the
exposition of the hardware algorithm that would be used to implement
lazy load/store.  It can't be complicated, so it should be suitable
for posting.  If it is good and works, then it will be clear that I
was wrong to attack the original suggestion that two windows are
sufficient.  (Of course, there is a whole contingent who says that
"no windows" is sufficient!)

>Of course, these schemes produce more complicated silicon than burst
>read/write schemes. In particular, anything with flags/register
>introduces latency into the register read/write path which is the
>critical path unless you have blown everything.

As I have said before, I think that one of the various caches in the
system (excluding the register file if you think of that as a cache),
e.g., the TLB or instruction cache, will be the critical path.

>Is it a win? simulate and see. Tell the world.

Yes, this is the right thing to do.

>By the way:
>	There is a paper:  [from Berkeley]
>	Which shows that the MIPS R2000 (aside from running faster) achieves
>	fewer memory references (in almost all cases) than SPARC with all
>	levels of optimization and as many as 7 register windows.

Well, this is news to me!  I must blush and apologize for many of my
past postings if these results are true for architectures like the 29K.
I do hope they included the effects of high-bandwidth, sequential-access
memories.

bcase@cup.portal.com (10/09/88)

First off, let me apologize to the net in general for starting this
discussion at all and particularly for apparently starting it off with
a touch of flame.  My intent was not to flame or get personal, but I
did want to say what I really thought was right.  Believe me, I'll drop
it after this because I am clearly not making myself understood and am
probably wrong anyway.  My intent is not to piss off anyone; I am not
that kind of person (usually :-).  I might meet anyone here at a conference
or whatever some day; I don't want the reaction to be "oh, yeah, that 
flaming *sshole from portal."

Hank Dietz (Hope I spelled the name right, our mailer doesn't include the
name automagically) writes:
>In article <9725@cup.portal.com>, bcase@cup.portal.com writes:
>> ...  True, if you could look ahead at
>> the instruction stream, predict correctly all the conditional branches
>> etc., then you could know exactly where to insert the lazy loads and
>> stores.  But I think this is unrealistic.
>You don't seem to know what lazy operations are.  A lazy operation is NOT a
>simple "delayed" load or store, and it isn't triggered by an opcode being
>executed; it is an operation which is automatically triggered either when
>certain conditions exist which favor its efficient completion or when its
>result is required --  one does NOT insert lazy operations in anything.

No, I do understand what lazy operations are; the word "insert" as I have
used it refers to "inserting" the loads/stores into the pipeline.  I think
I understand what lazy means here because we considered using this technique
(call it lazy or dribble-back or whatever; BTW, it was called dribble-back
a long time ago, I don't know who named it) in the implementation of the 29K.
That was in the very early days of the design.  We must not have had access
to the info you have now because we considered it too expensive or complicated
or something to implement.  At least I think I understand what lazy means;
correct me if I am wrong.

>Let's say you have 8 non-lazy windows (and one heck of a lot of valuable die
>space consumed by them).  What do you do when the 9th nested call is made?
>The 10th?  You do a sit-and-wait-for-it burst store, that's what...  would
>you really describe that as being "*without any* memory references at all!"

No!  Of course not; clearly that has memory references!  The research that
I did at AMD (I instrumented the PCC to count the number and size of overflows
and underflows the VAX would have if it had a register file like the ones
we were considering for the 29K; I modified applications like vi and circuit
design tools to allocate their large, non-scalar local items from the heap
instead of from the stack and then compiled and ran them.  At exit, they
they spit out stats) showed that overflowing and underflowing would happen
infrequently; the actual results from simulations and real chips supports
my research conclusions.  Thus, yes, overflows do happen, but they happen
infrequently in real life (there are pathological cases for both schemes,
though).  What I meant is that on average, several "windows" (the 29K
doesn't have "windows") can capture the working set of the scalar run-time
stack, and can effect most of the calls/returns with no memory references.
Sorry for the confusion.

Your comment about the windows taking one heck of a lot of valuable die
space is correct; I happen to believe that that valuable die space is
occupied by a valuable resource:  registers.  Let me say for the record
that I prefer the 29K approach, where all registers, or a very large
number of them, are simultaneously addressable, over the SPARC approach
where only one window's worth is addressable at any one time.  I feel
that registers are a very performance-improving resource, one for which
there is no substitute.  They are the fastest part of the memory
hierarchy and have three times the bandwidth of a single-ported data
cache, etc. etc. etc.

>Now, you can argue that there might not be enough time between calls or
>between returns to lazily store or reload a set, but that's unlikely
>because:
>Calls:		The lazy stores only have to store registers which are live
>		and dirty, i.e., whose value will be referenced after return
>		from the call and also is not the same as a value stored
>		somewhere in memory (e.g., a variable) by the programmer's
>		code.  In most RISC processors, the only instructions able
>		to make a register dirty are register = register op
>		register...  which all have a free memory reference cycle!
>		In the worst case, you'd lag behind by just one register
>		store, which sure beats doing a non-lazy burst every so
>		often. QED.

I think QED is a bit strong.  If you can prove that the lazy scheme "sure
beats doing a non-lazy burst every so often," then I am wrong, of course.
But your saying it doesn't make it so.  As I said in a previous posting,
I think seeing the algorithm, which can't be complicated or it wouldn't
be suitable for hardware implementation, might end the discussion; and if
it is good, it will certainly make me look foolish.  Maybe that's what I
need, although it wouldn't be the first time....

Let me take one last crack at making my point:  loads/stores typically take
two or more cycles.  Hardware will have to be very clever about finding
the right place to put the lazy loads/stores.  I think the added lazy
loads/stores will bring the data memory channel utilization up to near
100% (which can be a good thing if doing so does not slow down the general
rate of load/store completion; pipelining might solve the problem...).
Thus, I believe that the lazy loads/stores will, even assuming the optimal
placement in time, interfere with the loads/stores actually required by the
algorithm being executed.  This interference will lower the benefit of lazy
loads/stores to bring this scheme at least on equal footing with the non-lazy,
burst scheme.  Since the burst scheme is simpler, I would prefer it.  But the
lazy scheme uses fewer registers, so you prefer it instead.  Since the lazy
scheme does not take advantage of the benefits afforded the sequential access
pattern of the burst scheme, I would not choose it (since I know that I
can make the burst scheme fast).  Since the lazy scheme makes better
utilization of the memory channel (closer to 100%), you would choose it.

Or whatever.  I am not saying that these comments are God's absolute truth.
I am just trying to make clear my original point.  The stuff about predicting
the future was just my (apparently stupid) way of saying that figuring out
where to insert the lazy loads/stores is very difficult (at least that's what
I think).  And if you don't put them in the best places, you will interfere
with loads/stores required by the algorithm at hand.  If interference happens,
which I believe will, then I don't see the advantage of laziness over
explicitness.  At least the explicit scheme can take advantage of high-
bandwidth sequential access memories.  There, now I think I have made myself
clear.  Wrong maybe, but clear.

>The static/dynamic tradeoffs are what my research group,
>CARP, is all about.

This is valuable research; I didn't mean to attack your group or your
research directions either directly or indirectly.  I was once in grad
school; you are probably all thinking "I swear, some of those industry-
types are truly complete jerks."  I know *I* thought that about industry-
types more than a few times!  Sigh, I guess we all eventually become that
which we despise!  (Parents?, e.g.)

mcdonald@uxe.cso.uiuc.edu (10/10/88)

What is a register window?

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/11/88)

In article <9837@cup.portal.com> bcase@cup.portal.com writes:
>John Danskin at decwrl writes: (the stupid portal mailer doesn't include
>>By the way:
>>	There is a paper:  [from Berkeley]
>>	Which shows that the MIPS R2000 (aside from running faster) achieves
>>	fewer memory references (in almost all cases) than SPARC with all
>>	levels of optimization and as many as 7 register windows.
>Well, this is news to me!  I must blush and appologize for many of my
>past postings if these results are true for architectures like the 29K.
>I do hope they included the effects of high-bandwidth, sequential-access
>memories.

I wonder what that paper is.  There is another paper which may be of interest:
Look in the September 1987 IEEE Computer,
"And Now a Case for More Complex Instruction Sets", IEEE Computer, Sept. 87,
Vol. 20, No. 9, pages 71-83.

This paper makes the case, again, that you need to simulate and test, not
hypothesize, what the effects of the various optimizations are.  However,
while we are hypothesizing, the results of the paper again call into
question the efficacy of register windows.  For example:

"From data traffic considerations alone, it seems that OBI360 [a particular
1 address instruction format looked at in the paper] with a register set of
about size 16 plus a small data cache is preferable to multiple register sets 
for most area combinations..."   [page 83]

How about someone publishing a simulation showing the results of using
lazy windows?




-- 
  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

johnl@ima.ima.isc.com (John R. Levine) (10/13/88)

In article <46500028@uxe.cso.uiuc.edu> mcdonald@uxe.cso.uiuc.edu writes:
>What is a register window?

It's a little piece of Lexan in the chip package of a microprocessor that lets
you see into the machine's registers. Writers of optimizing compilers find
these windows invaluable for debugging and tuning object code. For example, in
case of register overflow you can see bits floating near the register that
overflowed. In case of register collisions, which all compiler writers will
agree are a big pain, if you examine the registers very closely with a
magnifying glass, you can usually see little scratches and dents where the
collision occurred.

Register windows are such a useful tool that it's a shame that the SPARC is
the only commercial chip that provides them, although even on the SPARC the
Lexan gets scratched after a few months, making them hard to see through. I
wish that quartz windows weren't so expensive, they last a lot longer.
-- 
John R. Levine, IECC, PO Box 349, Cambridge MA 02238-0349, +1 617 492 3869
{ bbn | think | decvax | harvard | yale }!ima!johnl, Levine@YALE.something
Rome fell, Babylon fell, Scarsdale will have its turn.  -G. B. Shaw

dillon@jumbo.dec.com (John Dillon) (10/13/88)

In article <2768@ima.ima.isc.com>, johnl@ima.ima.isc.com (John R. Levine) writes:
> In article <46500028@uxe.cso.uiuc.edu> mcdonald@uxe.cso.uiuc.edu writes:
> >What is a register window?

> It's a little piece of Lexan in the chip package of a microprocessor that lets
> you see into the machine's registers. Writers of optimizing compilers find
> these windows invaluable for debugging and tuning object code. For example, in
> case of register overflow you can see bits floating near the register that
> overflowed. In case of register collisions, which all compiler writers will
> agree are a big pain, if you examine the registers very closely with a
> magnifying glass, you can usually see little scratches and dents where the
> collision occurred.
> 
> Register windows are such a useful tool that it's a shame that the SPARC is
> the only commercial chip that provides them, although even on the SPARC the
> Lexan gets scratched after a few months, making them hard to see through. I
> wish that quartz windows weren't so expensive, they last a lot longer.
> -- 
> John R. Levine, IECC, PO Box 349, Cambridge MA 02238-0349, +1 617 492 3869
> { bbn | think | decvax | harvard | yale }!ima!johnl, Levine@YALE.something
> Rome fell, Babylon fell, Scarsdale will have its turn.  -G. B. Shaw


I must take Mr. Levine to task for forgetting the AMD 29000.

AMD goes to the trouble of providing the user with pictures of the
metallization, since metal shows the little scratches and dents much
better than silicon.

The 29000 does not provide the same register windows as SPARC.  Instead,
AMD adopted the neat strategy of trapping to supervisor code when
the register window overflows and underflows.

This strategy dovetails nicely with AMD's commitment to shadow registers
and serial scan.  The trap handler, using the shadow register (actually
a miniature on-chip ccd camera obscura), takes a picture of the
metallization and scans it out to the companion AMD 291984 image
processing chip, which reconstructs the picture for the user.

Consider the advantages and disadvantages (as I see them :-)
  + no need for either lexan or quartz window
  + the packaged part is insensitive to light
  + no need for a magnifying glass: you put the image on your display
  + only compiler writers need purchase the 291984
  - the 291984 is unlikely to be in your machine when you need it
  - it takes time to reconstruct and scan the image: no real-time window

No, I have no connection with either AMD or Sun.

   -- John

disclaimer
n.
1. the act of disclaiming; the renouncing, repudiating, or denying of a
   claim; disavowal.
2. a person who disclaims.
3. a statement, document, or the like, that disclaims.

news@ism780c.isc.com (News system) (10/14/88)

In article <2768@ima.ima.isc.com> johnl@ima.UUCP (John R. Levine) writes:
[definition of a register window]
>
>It's a little piece of Lexan in the chip package of a microprocessor that lets
>you see into the machine's registers. Writers of optimizing compilers find
>these windows invaluable for debugging and tuning object code.

Historical Note:
The first programmable digital computer that I worked with was a CPC (card
programmed calculator).  This machine had 800 decimal digits of storage that
could be arranged into one or more registers via a patch panel.  The
implementation of the digits was mechanical, much like an automobile
odometer.  The value in a register could actually be seen (numbers painted on
wheels).  The primary debugging aid for this machine was a flashlight to
illuminate the registers.

   Marv Rubinstein

paul@unisoft.UUCP (n) (10/14/88)

In article <13381@jumbo.dec.com> dillon@jumbo.dec.com (John Dillon) writes:
>I must take Mr. Levine to task for forgetting the AMD 29000.
>
>This strategy dovetails nicely with AMD's commitment to shadow registers
>and serial scan.  The trap handler, using the shadow register (actually
>a miniature on-chip ccd camera obscura), takes a picture of the
>metallization and scans it out to the companion AMD 291984 image
>processing chip, which reconstructs the picture for the user.
>
>Consider the advantages and disadvantages (as I see them :-)
>  + no need for either lexan or quartz window
>  + the packaged part is insensitive to light
>  + no need for a magnifying glass: you put the image on your display
>  + only compiler writers need purchase the 291984
>  - the 291984 is unlikely to be in your machine when you need it
>  - it takes time to reconstruct and scan the image: no real-time window

Actually these last two are not a problem, one simply pages in a 
simulation of a 291984 and uses that (it is a RISC chip after all).

On a related note: I'm looking for venture capital for a new state-of-the-art
product, 'Register Windex'.  We expect a great demand for this product,
especially from those systems that use 'dirty' bits for management of their windows.

	Paul

-- 
Paul Campbell, UniSoft Corp. 6121 Hollis, Emeryville, Ca ..ucbvax!unisoft!paul  
Nothing here represents the opinions of UniSoft or its employees (except me)
"Gorbachev is returning to the heritage of the great Lenin" - Ronald Reagan 1988
  (then the Washington Post attacked RR [from the right] for being a Leninist)