[comp.arch] Register usage

bs@linus.UUCP (Robert D. Silverman) (01/20/87)

There is some ongoing discussion about register usage in programs.
While it is true that one generally wants to keep as many variables
as possible in registers (especially such things as loop indices!), there
are times when using a lot of registers can actually slow things down.

This arises when a subroutine shares registers in common with its
parent. If the routine is called many times, the cost of saving and
restoring registers during the call and return can offset any
speed savings through register usage. I have actually seen this
phenomenon occur quite strikingly in programs that require 
multi-precision arithmetic. The cost of saving the registers for
a multi-precision add or multiply routine (say) when it is called
~10^8 or 10^9 times can greatly slow down a program.
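
[Editor's sketch: a minimal C illustration of the effect Bob describes; the
mp_add name and the 3-word operand size are invented.  If mp_add is an
out-of-line call, every one of those ~10^8 calls pays the callee's register
save/restore in its prologue and epilogue; compiled in-line (or written as
a macro), the carry and index can stay in registers with no save/restore
at all.]

	typedef unsigned long mp[3];

	static void mp_add(mp r, const mp a, const mp b)
	{
	    register unsigned long carry = 0;
	    register int i;

	    for (i = 0; i < 3; i++) {
	        unsigned long t = a[i] + carry;   /* carry is 0 or 1     */
	        unsigned long c1 = (t < carry);   /* wrapped on carry-in */
	        r[i] = t + b[i];
	        carry = c1 | (r[i] < b[i]);       /* carry out           */
	    }
	}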
 
Bob Silverman

neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) (05/11/89)

Summary: If your compiler is not antiquated (compared to this one, even
	 gcc looks bad), you can use gazillions of registers. The speedup
	 is on the order of 30%, so don't hold your breath.
 
David W. Wall of Digital's Western Research Lab did exactly the kind of
analysis you are all looking for, in his paper "Global Register
Allocation at Link Time" (SIGPLAN Notices 21(7), pp. 264-275). The benchmarks
used were the Livermore loops, Whetstone, LINPACK, the Stanford benchmark
suite, and two applications: a logic simulator, RSIM, and a timing verifier.
 
The compiler used was for the not-so-widely-known DECWRL Titan RISC machine,
an ECL RISC with 64 non-windowed registers and a single-cycle load. The compiler
does global register allocation (yes, global variables in the sense of C) at
link time and is a common backend for C, Fortran and Modula-2. The number of
registers does not refer to the expression evaluation and address generation
registers (this seems to be the 4-7 registers people have been talking about
so far) but to those used by the optimizer to hold things. It partitions all
scalar variables into non-conflicting groups and tries to allocate those to
registers.
From the introduction:
 
	"When we use our method for 52 registers, our benchmarks speed up
	 by 10 to 25 %. Even with only 8 registers, the speedup can be 
	 nearly that large, if we use previously collected profile information
	 to guide the allocation. We cannot do much better, because programs
	 whose variables all fit in registers rarely speed up by more than
	 30 %. Moreover, profiling shows us that we usually remove 60-90% of
	 the loads and stores of scalar variables that the program performs
	 during its execution."
	 
The benchmarks' characteristics:
 
		Variables	Groups 	coloring coloring + profile
	-------------------------------------------------------------------
Livermore	 166		 165	  18 %		19 %
Whetstone	 254		 181	  10 %		10 %
Linpack	 	 214		 119	  13 %		13 %
Stanford	 402		 211	  27 %		28 %
Simulator	 811		 262	  15 %		16 %
Verifier	1395		 693	  15 %		19 %
 
Where the columns stand for:
 
	Variables: Candidate scalar variables for global register allocation
	Groups:	   Groups of non-conflicting variables from the above; variables
		   that overlap need separate registers. If there were that many
		   registers in the machine, they could all be assigned to registers.
	coloring:  Speedup obtained by global register allocation versus
		   ordinary register allocation of the sort that, say, gcc -O does.
	coloring + profile: Speedup obtained by guiding the allocator with
		   actual profiles of execution.
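
[Editor's sketch in C of what "groups" means; the function is invented.  Two
scalars whose live ranges never overlap can share one register, so they fall
into the same group:]

	int f(int n)
	{
	    int i, sum = 0;

	    for (i = 0; i < n; i++)          /* i is live only in this loop */
	        sum += i;
	    {
	        int j, prod = 1;             /* j is live only in this block; */
	                                     /* i and j never conflict, so    */
	        for (j = 1; j <= n; j++)     /* one register can serve both   */
	            prod *= j;
	        return sum + prod;
	    }
	}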
 
David measured the number of memory references that could be eliminated in
the benchmarks if all scalars could be held in registers and then plotted
the percentage his scheme actually removed:
 
 
		coloring  coloring + profile
	    -----------------------------------
Livermore	 81 %		94 %
Whetstone	 75 %		88 %
Linpack	 	 95 %		99 %
Stanford	 90 %		98 %
Simulator	 83 %		95 %
Verifier	 61 %		83 %
 
This shows that his scheme is very effective at removing these memory
references. Please note that, given the enormous "hit rate" he achieves
and the not-so-impressive speedups he gets, the overall percentage of
scalar memory references cannot be that big compared to accesses to
bigger data structures.
 
Now the interesting tables. What happens if you use fewer registers?
The following table shows the speed improvements with 52, 32 and 8 registers.
All of these performance measures are the relative improvement the programs
showed with global register allocation guided by profile information, relative
to "naive register allocation".
 
		52	32	8
	-----------------------------
Livermore	19 %	18 %	12 %
Whetstone	10 %	10 %	 5 %
Linpack	 	13 %	13 %	10 %
Stanford	28 %	27 %	20 %
Simulator	16 %	15 %	 8 %
Verifier	19 %	16 %	 7 %
 
There is another very interesting paper by David comparing register window
schemes of varying organization with this global allocation scheme, and it
seems to suggest that a slightly bigger global register file beats register
windows if you are willing to use these extremely advanced compilation
techniques. The paper appeared in Proc. of the SIGPLAN 1988 Conference on
Programming Language Design and Implementation, June 1988. The paper's title
is "Register Windows vs. Register Allocation". It's way too long
to reproduce here, and the graphics in there are much nicer than anything
I can type here.
 
		Burkhard Neidecker-Lutz, Digital CEC Karlsruhe
 
Disclaimer: I don't speak for Digital, etc. ...

sjs@spectral.ctt.bellcore.com (Stan Switzer) (05/11/89)

In article <921@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
> The one paper I read about this (unfortunately for John Mashey I cannot
> find the exact reference -- the reason is too embarrassing, even if not for
> me, to state publicly) was about taking the PCC (for the PDP) and changing the
> number of registers available to its Sethi-Ullman register allocator, and
> then benchmarking a few Unix tools.
> 
> They found that in these conditions (CISC machine, no interexpression
> optimization, virtually only fixed point computation) speed/code size did
> not improve substantially with more than three scratch registers, and four
> were plenty.

My own experience with a C-like compiler for the Honeywell Level/6
bears this out.  As long as you are simply doing expression-level code
generation, three registers, surprisingly enough, are quite
sufficient.  The Sethi-Ullman "pebbling" scheme (a limited form of
coloring) works well in this case.
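
[Editor's sketch of Sethi-Ullman numbering in C; the names are invented.  A
leaf needs one register; an interior node needs the larger of its children's
needs, or one more if they tie.  A tree whose root's need is at most k can be
evaluated with k scratch registers and no spills, which is why three go so
far:]

	/* Assumes a full binary tree: every node has 0 or 2 children. */
	struct node { struct node *left, *right; };

	int need(const struct node *t)
	{
	    int l, r;

	    if (t->left == NULL)                 /* leaf */
	        return 1;
	    l = need(t->left);
	    r = need(t->right);
	    return l == r ? l + 1 : (l > r ? l : r);
	}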

Larger register sets pay you back when you do dataflow analysis and
allocate registers over larger spans of code than C expression
statements.  This is where I disagree with Piercarlo.  You _can_ make
good use of more registers, but you'll have to do some global analysis
first.

If you only have a handful of registers, though, you might as well
keep it simple.

The Level/6, if I remember correctly, had 14 registers: 7 integer
(logical) and 7 pointer registers.  Of the integer registers, numbered
1 to 7, #1-3 could be used as indexes, #1 could be used as a shift
count, #6-7 could be paired to handle large (32 bit) values.  Of the
seven pointer registers, only #1-3 could be used in indexed addressing
modes.  Because of this grunginess, my compiler used the first three
regs in each set as expression temps and used the rest for dedicated
purposes (frame pointer, return linkage, common storage, etc.).  Going
beyond statement-level allocation would have probably been pointless,
but a good peephole optimizer would have helped a bit.

Stan Switzer  sjs@ctt.bellcore.com

cquenel@polyslo.CalPoly.EDU (18 more school days) (05/12/89)

In 9851 neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) writes:
	(a very informative and interesting summary of
	 a paper on register allocation)

>The compiler used was for the not-so-widely-known DECWRL Titan RISC machine,
>an ECL RISC with 64 non-windowed registers and a single-cycle load.
                                                  ^^^^^^^^^^^^^^^^^

	This could have something to do with the degree of
	speed-up from removing these loads.  I don't think
	the speed-ups would get much bigger here, but on RISC machines
	with a 1 or 2 cycle delay on a load from memory,
	the difference could be more significant.
	(And don't forget those nasty cache misses that
	could be avoided once in a while.)
	
-- 
@---@  -----------------------------------------------------------------  @---@
\. ./  | Chris (The Lab Rat) Quenelle      cquenel@polyslo.calpoly.edu |  \. ./
 \ /   |  You can keep my things, they've come to take me home -- PG   |   \ / 
==o==  -----------------------------------------------------------------  ==o==

a464@mindlink.UUCP (Bruce Dawson) (05/12/89)

     One thing that needs to be kept in mind when talking about the advantages
of huge numbers of registers is that some of the advantages of registers go
away when you have a lot (when you have a lot available simultaneously, I
should say).  In the extreme case of the computer someone mentioned that had
256 registers, a register-register operation would use up sixteen bits just to
specify the two registers involved.  Contrast that with the six bits required
if you only have eight registers.  Given the finite memory speeds that we have
to deal with, an extra ten bits so that you can have 256 instead of eight
registers is probably too big a price to pay and would probably slow programs
down.

     Speaking from my experience on the 68000, I would say that for assembly
language programming, 16 registers is generally enough, with more being
occasionally desirable.  Because C can't do a very good job of comprehending an
entire substantial subroutine or program and deciding what should be in
registers (actually, replace 'C' with HLL), I would think that a compiler could
actually get by with slightly fewer.  One thing to keep in mind is that the
80386's eight registers are actually eight, whereas the 680x0 family's sixteen
are usually more like thirteen or fourteen (by the time you use one for a stack
pointer, one for a data segment pointer and one for a stack frame pointer).

.Bruce.

neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) (05/12/89)

Reference: 10189,9851
 
Yes, the savings have to do with the load delays from the data cache.
Another table in the paper shows the relative savings for a certain
allocation method with 52 registers and varying data cache latencies:
 
Program		Data cache latency in cycles
 
		 1	 2	 3	 4	 5
------------------------------------------------------------------------
Simulator	12 %	19 %	24 %	27 %	29 %
Verifier	10 %	15 %	19 %	21 %	23 %
 
So there is an increasing potential for savings with longer data
cache latencies, but a faster cache will always help you on those
nasty array references (combined with block refills); have a look
at the R2000/R3000 speed/MHz numbers for these effects. And given
the availability of ridiculously fast static RAMs (I have a Cypress
Semiconductor catalogue in front of me: CY100E474 BiCMOS RAM, 4k,
3 ns...) there doesn't seem to be a good reason not to have at least
the first-level cache be single cycle, even if you go to > 100 MHz clocks.

henry@utzoo.uucp (Henry Spencer) (05/14/89)

In article <259@mindlink.UUCP> a464@mindlink.UUCP (Bruce Dawson) writes:
>...Because C can't do a very good job of comprehending an
>entire substantial subroutine or program and deciding what should be in
>registers (actually, replace 'C' with HLL) I would think that a compiler could
>actually get by with slightly less...

Speak for your own compilers, please.  Some folks' compilers do an *excellent*
job of comprehending an entire subroutine, or even an entire file, of C or
other HLL and deciding what should be in registers.
-- 
Subversion, n:  a superset     |     Henry Spencer at U of Toronto Zoology
of a subset.    --J.J. Horning | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

brooks@vette.llnl.gov (Eugene Brooks) (05/15/89)

In article <259@mindlink.UUCP> a464@mindlink.UUCP (Bruce Dawson) writes:
>
>     One thing that needs to be kept in mind when talking about the advantages
>of huge numbers of registers is that some of the advantages of registers go
>away when you have a lot (when have you a lot availabel simultaneously I should
>say).  In the extreme case of the computer someone mentioned that had 256
>registers, a register-register operation would use up sixteen bits just to
>specify the two registers involved.  Contrast that with the six bits required
>if you only have eight registers.  Given the finite memory speeds that we have
Actually, if you have three-register operations, you chew up 24 bits for
the register fields.  Add, say, an 8 bit operation code and you have 32 bits
for each basic computation instruction.  This is what we had for the
instruction set of the Cerberus multiprocessor simulator.  The instructions
which need a static offset for addressing are even wider, another 32 bits.
With the huge size of code caches these days I don't see a problem here.

brooks@maddog.llnl.gov, brooks@maddog.uucp

elg@killer.Dallas.TX.US (Eric Green) (05/15/89)

in article <259@mindlink.UUCP>, a464@mindlink.UUCP (Bruce Dawson) says:
>      One thing that needs to be kept in mind when talking about the advantages
> of huge numbers of registers is that some of the advantages of registers go
> away when you have a lot (when have you a lot availabel simultaneously I should
> say).  In the extreme case of the computer someone mentioned that had 256
> registers, a register-register operation would use up sixteen bits just to
> specify the two registers involved.  Contrast that with the six bits required
> if you only have eight registers.  Given the finite memory speeds that we have
> to deal with, an extra ten bits so that you can have 256 instead of eight
> registers is probably too big a price to pay and would probably slow programs
> down.

The AMD29000 addresses 256 registers (although there are only 192 actual
registers -- 128 organized as a stack, the rest as global registers).
Each AMD29000 instruction is 32 bits long. I seem to recall that it's
a three-address machine instead of a two-address machine a la the
68000/80x86, so instructions are 8 bits of opcode and 24 bits of
register addresses. How does this slow it down??? In fact, the 29000
is one of the faster RISCs out there, though it didn't catch on in the
Unix workstation world (for one thing, the idea of kludging in
byte-fetch logic externally probably turned off potential system
designers). 

As I mentioned in a previous posting, program-memory bandwidth is
almost unlimited on these kinds of high end machines (large cache,
Harvard-style separate program and data busses, possibly interleaved
program memory and burst-mode DRAM accesses by the program cache
controller to take advantage of sequential accesses, etc.). The only
glitches in the pipeline are a) fetching data from memory, and b)
branches, so you want to reduce both as much as possible, which is why
you want a lot of registers and (on the branch side) things like
"smart" conditions to reduce the number of conditional branches, and
branch target caches to minimize the effects when you do get a branch.
Program memory bandwidth becomes almost inconsequential insofar as
performance is concerned, under those conditions, and fixed-size
32-bit instructions greatly ease pipeline design.  It's only when you
get to low end non-Harvard machines like the 68000 that program
bandwidth becomes important.

Conclusions: The program memory bandwidth increase resulting from
increasing the number of addressable registers is inconsequential.
Other factors, such as real-time responsiveness, compiler technology,
and the number of gates needed to decode the register addresses
(remember the Cray axiom -- minimum # of gates in critical paths) are
more important in determining how many registers to put in your
architecture.

--
|    // Eric Lee Green              P.O. Box 92191, Lafayette, LA 70509     |
|   //  ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg     (318)989-9849     |
|  //    Join the Church of HAL, and worship at the altar of all computers  |
|\X/   with three-letter names (e.g. IBM and DEC). White lab coats optional.|

tim@crackle.amd.com (Tim Olson) (05/15/89)

In article <8104@killer.Dallas.TX.US> elg@killer.Dallas.TX.US (Eric Green) writes:
| (for one thing, the idea of kludging in
| byte-fetch logic externally probably turned off potential system
| designers). 

This is something that we have changed -- Revision 'C' Am29000's have
fully-implemented byte and half-word loads and stores, with both signed
and unsigned versions of loads.


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

rpw3@amdcad.AMD.COM (Rob Warnock) (05/16/89)

In article <25633@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
+---------------
| In article <8104@killer.Dallas.TX.US> elg@killer.Dallas.TX.US writes:
| | (for one thing, the idea of kludging in byte-fetch logic externally
| | probably turned off potential system designers). 
| This is something that we have changed -- Revision 'C' Am29000's have
| fully-implemented byte and half-word loads and stores, with both signed
| and unsigned versions of loads.
+---------------

And it was done in a fully upwards-compatible way. Old code continues to run
correctly on new chips. And the C compiler can still be told to generate any
of the flavors of old, old-but-with-byte-write, or new code.

Anyway, the compilers have always handled 29k byte-load and byte-store on
word-only memory just fine [LOAD/EXTRACT or LOAD/INSERT/STORE]. All the
published performance numbers (until Rev.C) were for *no* hardware byte
support. And Unix doesn't care (both 4.3 & S5r3 run). I suspect the real
reason you don't see 29k-based Unix workstations is that it just wasn't
quite there during the window that a bunch of people were choosing their
RISC/Unix CPU. "Missed the wave", as it were.  [Who knows, it may see
some Unix use yet, by those who'd rather not compete directly with other
SPARC/MIPS/88k vendors.]
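
[Editor's sketch in C of the LOAD/EXTRACT and LOAD/INSERT/STORE sequences,
assuming big-endian byte order and word-only memory; the helper names are
invented:]

	/* Read the byte at address a out of word-wide memory. */
	unsigned char load_byte(unsigned long a)
	{
	    unsigned long w = *(unsigned long *)(a & ~3UL);     /* LOAD    */
	    return (w >> ((3 - (a & 3)) * 8)) & 0xFF;           /* EXTRACT */
	}

	/* Write a byte by read-modify-writing the containing word. */
	void store_byte(unsigned long a, unsigned char b)
	{
	    unsigned long *p = (unsigned long *)(a & ~3UL);     /* LOAD   */
	    int sh = (3 - (int)(a & 3)) * 8;
	    *p = (*p & ~(0xFFUL << sh))                         /* INSERT */
	       | ((unsigned long)b << sh);                      /* STORE  */
	}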

Still, it makes a neat embedded controller...


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

pcg@aber-cs.UUCP (Piercarlo Grandi) (05/16/89)

In article <8905110956.AA12655@decwrl.dec.com> neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) writes:

    Summary: If your compiler is not antiquated (compared to this one, even
    	 gcc looks bad), you can use gazillions of registers. The speedup
    	 is on the order of 30%, so don't hold your breath.

Well, "gazillions" is a bit exaggerated, but 30% is not a bad improvement;
the curve, though, has a very mild slope at the end, so the knee comes
fairly early. Read on...
     
    The compiler used was for the not-so-widely-known DECWRL Titan RISC
    machine, an ECL RISC with 64 non-windowed registers and a single-cycle
    load. The compiler does global register allocation (yes, global
    variables in the sense of C) at link time and is a common backend for C,
    Fortran and Modula-2.

If I remember correctly, the decwrl machine is a titan, and is the favourite
plaything of Paul Vixie :-) :-).

    The number of registers does not refer to the expression evaluation and
    address generation registers (this seems to be the 4-7 registers people
    have been talking about so far)

I am not surprised...

    but to those used by the optimizer to hold things.
    It partitions all scalar variables into non-conflicting groups and tries
    to allocate those to registers.  From the introduction:
     
    	"When we use our method for 52 registers, our benchmarks speed up
    	 by 10 to 25 %. Even with only 8 registers, the speedup can be 
    	 nearly that large, if we use previously collected profile information
    	 to guide the allocation.

I thank you enormously. This quote greatly cheers me up. It is very good
support for my position on "register" (where a competent programmer is
expected to do this, with great simplification of the compiler, because
compilers cannot possibly know which variables are the most frequently used
dynamically).

         We cannot do much better, because programs whose variables all fit
	 in registers rarely speed up by more than 30 %.

Excellent. I can surmise that this is simply because there are not that many
hot spots in a program...

	[ ... many interesting numbers ... ]
     
    This shows that his scheme is very effective at removing these memory
    references. Please note that, given the enormous "hit rate" he achieves
    and the not-so-impressive speedups he gets, the overall percentage of
    scalar memory references cannot be that big compared to accesses to
    bigger data structures.

I am not surprised either. It is the old rule of hot spots on one hand, and
of the desirability of using vector architectures for vector code...
     
    Now the interesting tables. What happens if you use fewer registers?
    The following table shows the speed improvements with 52, 32 and 8 registers.
    All of these performance measures are the relative improvement the programs
    showed with global register allocation guided by profile information, relative
    to "naive register allocation".
     
    		52	32	8
    	-----------------------------
    Livermore	19 %	18 %	12 %
    Whetstone	10 %	10 %	 5 %
    Linpack 	13 %	13 %	10 %
    Stanford	28 %	27 %	20 %
    Simulator	16 %	15 %	 8 %
    Verifier	19 %	16 %	 7 %

This does not surprise me either. I would go of course for 8 registers....
Note that these 8 global registers chosen using profiling probably would
be more than plenty instead of just adequate on a reg-mem machine... I remain
solid in my reckoning that for a reg-mem machine 8 registers overall is just
adequate, and 16 is plenty, with some increment required for reg-reg.

    There is another very interesting paper by David comparing register window
    schemes of varying organization with this global allocation stuff and this
    seems to suggest that a slightly bigger global register file beats register
    windows

The 29k guys will cheer...

    if you are willing to use these extremely advanced compilation
    techniques.

Or the "register" keyword, if you are a competent programmer. Possibly
extended to global variables...

    The paper appeared in Proc. of the SIGPLAN 1988 Conference on
    Programming Language Design and Implementation, June 1988. The paper's
    title is "Register Windows vs. Register Allocation". It's way too long to
    reproduce here, and the graphics in there are much nicer than anything I
    can type here.

It is very good indeed. But the question is always, of course, whether a
statically allocated large register file can still be called a register
file, and not rather a first level memory; if it is so large that you
can store essentially all the variables into it for virtually all of
the program (i.e. use it part as data and part as stack), then I beg
to submit that you have a two level mem-mem architecture. Maybe, as the
AMD 29,000 has a couple hundred registers, the AMD 290,000 will have
two thousand, and paging/swapping of registers, etc... :-).
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (05/16/89)

In article <259@mindlink.UUCP> a464@mindlink.UUCP (Bruce Dawson) writes:

>say).  In the extreme case of the computer someone mentioned that had 256

I was the person who mentioned 256 registers.  The machine is the Cyber 205.

>registers, a register-register operation would use up sixteen bits just to
>specify the two registers involved.  Contrast that with the six bits required
>if you only have eight registers.  

The register to register instructions on the 205 were 32 bits long, the same
length as most of today's current crop of machines.  One byte went to an 8-bit
opcode, and three bytes went to specify the two input and one output register
of each instruction.  

>to deal with, an extra ten bits so that you can have 256 instead of eight
>registers is probably too big a price to pay and would probably slow programs
>down.

I agree that 256 registers were probably too many, because the compiler had
trouble using more than about 60 typically.  A 64 register machine would
have been adequate, given what compilers can generally make use of.
I note that these were 64-bit registers.

I am not sure what you mean by "slow programs down".  Certainly, RISC machines
have bigger code than compact-code CISCs like the VAX and NS32000.  But the RISC
machines have generally been significantly faster when implemented in the 
same technology than the corresponding CISC machines.  By your reasoning, the
Intel iAPX 432 should have been the fastest machine around, since it used
bit-variable-length instructions which could begin on any bit.

>     Speaking from my experience on the 68000, I would say that for assembly
>language programming, 16 registers is generally enough, with more being
>occasionally desirable.  

This statement has been repeated in
various ways by several people over the last few weeks.  The problem with this
is that the compilers referred to, and the assembly code and coders referred to,
ASSUME that registers are for expression evaluation only.  We have seen 
evidence that C expression evaluation does not require more than 16 registers,
but that DOES NOT imply that substantial improvements in code speed cannot be
made by making "global" (beyond one statement) register assignments.  This is
not even new.  Commercial products have been doing it for at least 15 years.

>    Because C can't do a very good job of comprehending an
>entire substantial subroutine or program and deciding what should be in
>registers (actually, replace 'C' with HLL) I would think that a compiler could
>actually get by with slightly less.  

The postings on register usage by the AMD 29000 C compiler were representative
of what I have read about most modern C compilers.  Clearly, compilers today
can make use of 20-30 registers to advantage in C.  When dealing with double
precision, Fortran, etc., I suspect a case could be made (it needs
investigation) for 64 32-bit registers.  


  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

danm@amdcad.AMD.COM (Daniel Mann) (05/16/89)

In article <25635@amdcad.AMD.COM> rpw3@amdcad.UUCP (Rob Warnock) writes:
+---------------
| ... And Unix doesn't care (both 4.3 & S5r3 run). I suspect the real
| reason you don't see 29k-based Unix workstations is that it just wasn't
| quite there during the window that a bunch of people were choosing their
| RISC/Unix CPU. "Missed the wave", as it were.  [Who knows, it may see
| some Unix use yet, by those who'd rather not compete directly with other
| SPARC/MIPS/88k vendors.]
+---------------

AMD wishes to support customers developing UNIX systems based upon the
Am29000. Therefore, the complete source code for a port of 4.3bsd UNIX
to the Am29000 is available to any licensed AT&T user.

The 4.3bsd Berkeley UNIX currently runs on the AMD PC Execution Board
(PCEB29K), which plugs into the PC-bus of an IBM-PC/XT/AT or compatible.
It is a simple and inexpensive implementation of the Am29000 architecture,
utilising a video-RAM memory. The PC is used as an I/O server.

A detailed application note entitled "4.3bsd Unix and the PCEB" is
available which discusses the 4.3 port. It is suitable reading for
experienced UNIX system programmers wishing to understand Am29000
issues. Designers of new Am29000 systems may wish to evaluate the
experience gained in the PCEB port before committing to any new
hardware design.

For a copy of the application note or for further information about
obtaining the Am29000 4.3bsd Unix software, call or write:

CONTACT:       Daniel Mann, Am29000 Software Development Supervisor
ADDRESS:       AMD, 901 Thompson Place, MS 6, PO Box 3453, Sunnyvale, CA 94088
PHONE:         408 749-5058
E-MAIL:        danm@amdcad.amd.com

rpw3@amdcad.AMD.COM (Rob Warnock) (05/18/89)

In article <952@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
+---------------
| In article <8905110956.AA12655@decwrl.dec.com> neideck@nestvx.dec.com writes:
| > There is another very interesting paper by David comparing register window
| > schemes of varying organization with this global allocation stuff and this
| > seems to suggest that a slightly bigger global register file beats register
| > windows
| The 29k guys will cheer...
+---------------

Not really.

Again, to clear up the slight misunderstanding, by current 29k software
convention [with hardware support from the "gr1+local" address adder],
the Am29000 *is* a (variable-sized) register window machine. The "local"
regs are allocated with a strict stack discipline, which, as Neideck
implies, may not be quite as good as a super-global register allocation
strategy.  However, with much less than 1% of cycles being spent on
register cache spilling/filling (all benchmarks except "Ackerman"),
it's probably good enough. [And of course since the current register
discipline *is* a software convention, nothing prohibits a future
compiler from switching to a super-global allocation strategy...]


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

andrew@frip.WV.TEK.COM (Andrew Klossner) (05/18/89)

[]

	"Larger register sets pay you back when you do dataflow
	analysis and allocate registers over larger spans of code than
	C expression statements ... If you only have a handful of
	registers, though, you might as well keep it simple."

Dataflow analysis is good for more than register allocation, and can be
a win on a register-starved machine.  For example, when compiling this
code (which swaps "x" and "y"):

	temp := x;
	x := y;
	y := temp;

if dataflow analysis tells you that "temp" is dead after this fragment,
you can avoid a store operation, and you've won even if you have only
two registers.
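
[Editor's sketch of the same point in C; the comments show one plausible
two-register code sequence.  Because "temp" is dead after the swap, it
lives its whole life in a register and is never stored:]

	void swap(int *x, int *y)
	{
	    int temp = *x;          /* load  r1,(x)               */
	    *x = *y;                /* load  r2,(y); store r2,(x) */
	    *y = temp;              /* store r1,(y) -- temp itself*/
	}                           /* is never stored to memory  */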

  -=- Andrew Klossner   (uunet!tektronix!orca!frip!andrew)      [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]

andrew@frip.WV.TEK.COM (Andrew Klossner) (05/18/89)

[On the original 29k chips, which had no byte load/store:]

	"Anyway, the compilers have always handled 29k byte-load and
	byte-store on word-only memory just fine [LOAD/EXTRACT or
	LOAD/INSERT/STORE].  And Unix doesn't care ..."

My Unix kernel device drivers sure care!  I have to talk to peripheral
chips, some of whose registers require word-wide load/store and some of
which require byte-wide.  I can't implement byte-wide with word-wide
because of the side effects.  The extra logic to solve this problem is
not pretty.

The 29k looks like a nice chip, though, if you deal only with
well-behaved peripheral interfaces.

  -=- Andrew Klossner   (uunet!tektronix!orca!frip!andrew)      [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]

davecb@yunexus.UUCP (David Collier-Brown) (05/18/89)

In article <952@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
[responding to some factual information on register usage]
>I thank you enormously. This quote greatly cheers me up. This is a very good
>support for my position on "register" (where a competent programmer is
>expected to do this, with great simplication of the compiler, because
>compilers cannot possibly know which variables are the most frequently used
>dynamically).

  True if you define "compiler" narrowly (ie, normally).  However, if
you think of an optimization "environment", you can justify feeding the
results of a profile back into the compiler to guide register allocation.
  I would tend to structure this as an optional "pass" (well, pseudo-pass)
for purely practical reasons...

  And if you want to experiment with the idea right now, consider
writing an output processor for a profiler which orders variables by
use in individual functions, and then another utility to rewrite the
source code with register declarations for the X hottest variables
per function iff the call cost/run cost ratio would not be deranged
thereby.  (It is fairly hard (:-)).
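
[Editor's sketch of what such a rewriter might emit; the function and the
profile are invented.  Given a profile showing p, sum and i as the hottest
variables in this function, the utility would turn the first version into
the second:]

	/* before */
	long checksum(char *p, long n)
	{
	    long sum = 0;
	    long i;

	    for (i = 0; i < n; i++)
	        sum += p[i];
	    return sum;
	}

	/* after: the hottest variables get register declarations */
	long checksum(register char *p, long n)
	{
	    register long sum = 0;
	    register long i;

	    for (i = 0; i < n; i++)
	        sum += p[i];
	    return sum;
	}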

--dave (only two more days to monomania!) c-b
--
David Collier-Brown,  | {toronto area...}lethe!dave
72 Abitibi Ave.,      |  Joyce C-B:
Willowdale, Ontario,  |     He's so smart he's dumb.
CANADA. 223-8968      |


rpw3@amdcad.AMD.COM (Rob Warnock) (05/23/89)

In article <3355@orca.WV.TEK.COM> andrew@frip.WV.TEK.COM writes:
+---------------
| [On the original 29k chips, which had no byte load/store:]
| 	"Anyway, the compilers have always handled 29k byte-load and byte-store
| 	... And Unix doesn't care ..."
| My Unix kernel device drivers sure care!  I have to talk to peripheral chips,
| some of whose registers require word-wide load/store and some of which require
| byte-wide.  I can't implement byte-wide with word-wide because of the side
| effects.  The extra logic to solve this problem is not pretty.
+---------------

True, which was why they changed the chip.

But note that the byte-access bits came out to the pins, even on the
"non-byte-read/write" chips. That is, even though the memory interface
was nominally word-wide, the byte indicators in the load and store
instructions and the LSBs of the addresses came out, and could be used
by external hardware, if desired. So assembly-language code could always
do the right thing. It's just that the code generated by the earlier C
compilers couldn't, since they "knew" that "all" memory was word-wide.
[And now that's fixed, so it's moot.]

By the way, another common trick [used even on other CPUs, such as 68000's]
was to simply put byte-wide chips on one byte rail on the bus (pick one...
for a couple of reasons, the *high*-order usually turned out to be the most
convenient on the 29k), and to "don't care" the LSBs of the CPU. That is,
wire CPU A2 to I/O chip A0, etc. Then, simply write the device header file
to name the I/O chip's registers as being 4 bytes apart...
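
[Editor's sketch of such a device header in C; the chip and the base address
are invented.  With a byte-wide part on one byte rail and CPU A2 wired to
chip A0, consecutive chip registers land 4 bytes apart in the CPU's address
map:]

	struct uart_regs {                  /* hypothetical byte-wide UART */
	    volatile unsigned char data;    unsigned char pad0[3];
	    volatile unsigned char status;  unsigned char pad1[3];
	    volatile unsigned char control; unsigned char pad2[3];
	};

	#define UART ((struct uart_regs *)0xFFF00000)    /* made-up address */

	/* usage:  while (!(UART->status & 1)) ;  c = UART->data;  */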

+---------------
| The 29k looks like a nice chip, though, if you deal only with well-behaved
| peripheral interfaces.
+---------------

And the fact that the lockstep instruction pipelining is "exposed"
to the programmer means that (at least in the present 29k family)
you can predict what sequence of bus accesses a given instruction
sequence will generate, which also makes it really nice when dealing
with not-so-well-behaved interfaces...


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

mcg@mipon2.intel.com (05/30/89)

In article <25382@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>In article <259@mindlink.UUCP> a464@mindlink.UUCP (Bruce Dawson) writes:
>
>>say).  In the extreme case of the computer someone mentioned that had 256
>>registers, 
>
>I agree that 256 registers were probably too many, because the compiler had
>trouble using more than about 60 typically.
>
>I am not sure what you mean by "slow programs down".  Certainly, RISC machines
>have bigger code than compact code CISC's like VAX and NS32000.  But the RISC
>machines have generally been significantly faster when implemented in the 
>same techology than the corresponding CISC machines.

One thing that no one has yet pointed out is that a reason not to implement
huge directly-addressable register files is that, in any reasonable
implementation, the register file must be multi-ported.  A six-ported register
cell is about 5x the size of a single-ported register cell or a cache RAM
cell.  To more fully utilize micro-parallelism in an architecture, more
sources and results need to be fetched from the register file simultaneously,
thus the additional ports.

This, IMHO, is one of the greatest flaws of the 29k - it exposes 192
(actually 256) architectural registers.  In the current implementation they
are (I believe) 3-ported, and even now occupy a large amount of the 29k
die space.  I believe that they will run into serious problems if they
ever attempt to dispatch and execute multiple general instructions per cycle.

The 960, on the other hand, exposes 32 general registers architecturally,
but because 16 of these are "local", and saved/restored on call/return
to/from architecturally-hidden resources, we can easily move from a cheap
(1-ported by 4 sets) register implementation, to a very fast one
(6-ported by 8+ sets) in high-performance implementations.

The cut line is different on every architecture - 32 is sufficient on the
960, but I am not disagreeing with Wall's estimate of 64.  Certainly
in floating-point intensive scientific applications dominated by
double-precision arithmetic in loops, more registers are needed.  But
substantially more than 64 seems to limit architectural flexibility
quite severely.

S. McGeady
Intel Corp.

tim@crackle.amd.com (Tim Olson) (05/30/89)

In article <m0fRx4x-0001fDC@mipon2.intel.com> mcg@mipon2.UUCP (Steven McGeady) writes:
| One thing that no one has yet pointed out is that a reason not to implement
| huge directly-addressable register files is that, in any reasonable
| implementation, the register file must be multi-ported.  A six-ported register
| cell is about 5x the size of a single-ported register cell or a cache RAM
| cell.  To more fully utilize micro-parallelism in an architecture, more
| sources and results need to be fetched from the register file simultaneously,
| thus the additional ports.

I don't see why this is a reason not to implement large register files. 
You need to apply transistors where they will do the most good.  Note
that processors that attempt to "more fully utilize micro-parallelism"
also tend to want to have more general-purpose registers available to
maintain full performance.

| This, IMHO, is one of the greatest flaws of the 29k - it exposes 192
| (actually 256) architectural registers.  In the current implementation they
| are (I believe) 3-ported, and even now occupy a large amount of the 29k
| die space.  I believe that they will run into serious problems if they
| ever attempt to dispatch and execute multiple general instructions per cycle.

Well, we see no problems in either our 2nd or 3rd generation parts...

| The 960, on the other hand, exposes 32 general registers architecturally,
| but because 16 of these are "local", and saved/restored on call/return
| to/from architecurally-hidden resources, we can easily move from a cheap
| (1-ported by 4 sets) register implementation, to a very fast one
| (6-ported by 8+ sets) in high-performance implementations.

So you *will* be looking at large register files (128+, 6-ported) for
high performance.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

bcase@cup.portal.com (Brian bcase Case) (05/31/89)

>A six-ported register cell is about 5x the size of a single-ported
>register cell or a cache RAM cell.

This could be, depending on technology details, of course.  (The width
of metal dominates, usually?)

>This, IMHO, is one of the greatest flaws of the 29k - it exposes 192
>(actually 256) architectural registers.  In the current implementation they
>are (I believe) 3-ported, and even now occupy a large amount of the 29k
>die space.  I believe that they will run into serious problems if they
>ever attempt to dispatch and execute multiple general instructions per cycle.

I can speak to this point with a little authority.  The register file in the
current implementation is indeed 3-ported.  I must admit that I have had
second thoughts about making the register file so big.  The advantages are
many, but the cost of additional ports is indeed bigger than for other
architectures (boy, the '386 architecture has a leg up here! :-).

The current 29K implementation has about 1/5 of the usable (i.e., non-pad
ring) die area dedicated to the register file using 1.26 (Dave Witt:
what is it really?) micron technology.
At about 1 micron, that will shrink to about 1/9 of the usable area, and at
.8 micron, about 1/11.  Increasing the number of ports to 6, let's say, will
increase the size by about a factor of 1-2/3 (3x a single-ported cell to
(using Steve's number) about 5x a single-ported cell).  Thus, at 1 micron,
the register file will use about 18% of the usable die area, and at .8 micron
about 15% of the area.  This is not an insignificant amount of area, but it
is not "too" much in my opinion.  Why?  Because having lots of registers IS
GOOD.  If the 29K spends more area on a 6-ported data storage resource than
other processors, I think it's an advantage as long as it's not taking
"too" much area!  Up to a point, I would rather spend area on a 6-ported
resource than a 1-ported resource (cache).  I guess we are talking about
where the "point" is in "up to a point."

On the other hand, a 6-ported, 32-register file would be
about (I am guessing by simple scaling)
2% to 5% of the usable die area.  On the 960, some of the
13% difference is used for the register file "backing store", but not much,
say another 3% (I dunno what it really is).  So, the 960 has a 10% "bonus"
chunk of die area.  What can be done with it?  At some points on the technology
curve, it will have a larger cache than the 29K at the same point.  At some
other points, the 10% die area will "only" result in a smaller die because
10% isn't enough to increase the size of a cache or an FP somethingorother
(but 10% smaller die are cheaper die, depending on yield, greediness, etc.).

So the 960 seems to have a slight implementation advantage.  What's the real
difference?  I don't know, because 10% is a small enough amount that it can
be lost in the "noise" of implementation: if one guy uses automatic tools
and the other uses full custom design/layout, some difference will result.
Also, just due to "the way things are", the 10% might not be usable.  Some
die are square while some are rectangular because of "the way things are."

I am making this argument in full recognition of an argument about CISCs
that I used to believe:  "CISCs will always have an implementation
disadvantage because of the microcode ROM."  This is bogus for the same
reason:  as the technology improves, the ROM itself shrinks until it can no
longer be seen with the naked eye!  What constitutes either a current RISC
processor or a current CISC processor (the PROCESSOR pipeline, not the caches,
TLBs, etc.) would be a very small corner on the die if implemented in the
technology of 1995.

However, what constitutes a current RISC or CISC processor will be
wholly uninteresting in 1995.  For the implementations to come, including
superscalar stuff, the issues will be the complexity of implementation and
the cost (read:  people) of realizing that implementation.  This is where
CISCs, I believe, will falter.  One of the great advantages of RISCs is 
that they are conceptually easier to implement.  This effect is compounded
with increasing ambitions for greater performance:  the fewer interactions
between instructions, the easier a multi-instruction-per-cycle
implementation will be to construct.  In 1995, there might be another
reactionary simplification movement in computer architecture; maybe current
RISCs are too complex!  Too many special cases!  But I digress...

Size will still matter, but I believe it will be dominated by the many
connections (buses) and small structures required to handle the special
cases and resource interactions, not the larger, regular structures like
register files and caches (although we will still be trying to fit as much
cache as possible).  Thus, the 14-ported 192-register file of the 29K will
still be larger than the 14-ported 32-register files of the 960, MIPS, i860,
etc., but it won't matter because the 29K's register file will be 1% of the
die while the 960's will be 0.1%.  (BTW, the register file will be the center
of the processor, not at one end of the data path as it is now.)
So I believe.

>The 960, on the other hand, exposes 32 general registers architecturally,
>but because 16 of these are "local", and saved/restored on call/return
>to/from architecurally-hidden resources, we can easily move from a cheap
>(1-ported by 4 sets) register implementation, to a very fast one
>(6-ported by 8+ sets) in high-performance implementations.

This indeed gives more flexibility in implementation choices.  The 29K's
register file gives more flexibility in choices for using the register file.
Only time will tell whether more advantage is gained from having implementation
choices or from use choices.  If you believe that, in the end, business
issues dominate, maybe the business issue of cost is more important.

>The cut line is different on every architecture - 32 is sufficient on the
>960, but I am not disagreeing with Wall's estimate of 64.  Certainly
>in floating-point intensive scientific applications dominated by
>double-precision arithmetic in loops, more registers are needed.  But
>substantialy more than 64 seems to limit architectural flexibility
>quite severely.

That should be "more than 64 seems to limit *implementation* flexibility."
The 29K has more architectural flexibility (the register file can be used
as a stack cache, a flat pool of 192 registers, or as a few pools of a
smaller number of registers.  Is this important?  I dunno yet.).

These are just a few of my opinions mixed up with some pseudo-facts.  Don't
believe any of them!

"I did it my way."  - Sinatra.

lars@salt.acc.com (Lars J Poulsen) (05/31/89)

In article <25382@ames.arc.nasa.gov>
	   lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>>I agree that 256 registers were probably too many, because the compiler had
>>trouble using more than about 60 typically.

In article <m0fRx4x-0001fDC@mipon2.intel.com>
	   mcg@mipon2.UUCP (Steven McGeady) writes:
>This, IMHO, is one of the greatest flaws of the 29k - it exposes 192
>(actually 256) architectural registers.
>
>The 960, on the other hand, exposes 32 general registers architecturally,

From a humble applications programmer, who occasionally has written a bit
of kernel code: The biggest pain with an architecture that exposes too
large a register file is saving and restoring on context switches. While
interrupt service routines can ignore this and store only what they
need, context switches require storing of the entire register set. Or do
people really feel that the processors today are fast enough that
scheduling pre-emption is too rare to influence the selection of
register file size?

Many years ago I switched from working on (then) Univac-1100 to PDP-11's
and VAXen; the 1100 had about 44 visible registers in the user set; few
programs really used more than half of them; worse yet, they were
asymmetric, with different properties between the three major groups.

The VAX instruction set got more mileage out of its 16 general registers
than the Univac got out of its 44, and saved many cycles on register
save/restores.

/ Lars Poulsen <lars@salt.acc.com>     (800) 222-7308  or (805) 963-9431 ext 358
  ACC Customer Service                Affiliation stated for identification only
                  My employer probably would not agree if he knew what I said !!

brooks@vette.llnl.gov (Eugene Brooks) (05/31/89)

In article <m0fRx4x-0001fDC@mipon2.intel.com> mcg@mipon2.UUCP (Steven McGeady) writes:
>The cut line is different on every architecture - 32 is sufficient on the
>960, but I am not disagreeing with Wall's estimate of 64.  Certainly
>in floating-point intensive scientific applications dominated by
>double-precision arithmetic in loops, more registers are needed.  But
>substantialy more than 64 seems to limit architectural flexibility
>quite severely.
The cut line for scratch registers is directly proportional to the memory
latency.  The longer your latency, the more registers (and concurrently
executing computation) you need to mask it.
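
[Editor's sketch in C of why that is; the function is invented.  Unrolling a
reduction by four keeps four independent loads and four partial sums in
flight, and each needs its own register, so the latency you can mask scales
with the registers you have:]

	double dot(const double *a, const double *b, int n)
	{
	    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
	    int i;

	    for (i = 0; i + 3 < n; i += 4) {     /* four sums in flight */
	        s0 += a[i]     * b[i];
	        s1 += a[i + 1] * b[i + 1];
	        s2 += a[i + 2] * b[i + 2];
	        s3 += a[i + 3] * b[i + 3];
	    }
	    for (; i < n; i++)                   /* leftovers */
	        s0 += a[i] * b[i];
	    return s0 + s1 + s2 + s3;
	}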


brooks@maddog.llnl.gov, brooks@maddog.uucp

khb@gammara.Sun.COM (Keith Bierman - SPD Languages Marketing -- MTS) (05/31/89)

In article <1RcY6x#64Zq3Y=news@anise.acc.com> lars@salt.acc.com (Lars J Poulsen) writes:
.... # regs on different machines>>

>
>From a humble applications programmer, who occasionally has written a bit
>of kernel code: The biggest pain with an architecture that exposes too
>large a register file is saving and restoring on context switches. While
>interrupt service routines can ignore this and store only what they
>need, context switches require storing of the entire register set. Or do
>people really feel that the processors today are fast enough that
>scheduling pre-emption is too rare to influence the selection of
>register file size ?
>
>Many years ago I switched from working on (then) Univac-1100 to PDP-11's
>and VAXen; the 1100 had about 44 visible registers in the user set; few
>programs really used more than half of them; worse yet, they were
>asymmetric, with different properties between the three major groups.
>
>The VAX instruction set got more mileage out of its 16 general registers
>than the Univac got out of its 44, and saved many cycles on register
>save/restores.

"modern" compilers tend to do global (well, misnomer, all of a
procedure at once) analysis; older compilers tended to look at smaller
bits of code (block, maybe interblock) ... so "modern" compilers have
the opportunity to do better.

On scientific application codes the bodies of the modules are large
enough, and there are enough other effects, that saving a considerable
number of registers is not necessarily a large component of the total
cost (on one machine, in a former life, it tended to be less than 5%
of the call cost). 

If one inlines "leaf" nodes, very respectable numbers of registers can
get used.

For those who are slightly twisted, the Cydra 5 register usage,
documented in the recent ASPLOS-III proceedings, shows how the
combination of software pipelining and VLIW can eat up registers quite
quickly. 

Upshot: modern compilers can employ as many registers as you can
	design in. If you got 'em, you gotta figure out a way to 
	save 'em. Windows, "multiconnect", register pointers, special
	purpose (including vector) and other solutions are workable.

Naive rationale for infinite (as long as they are free) registers:

constant propagation and common subexpressions, combined with loop
unrolling (or more advanced techniques like percolation scheduling)
result in a need for as many as you got ... and then some.

OS is different from application programs, and it is not clear that
one might not be better off with a special OS register allocation
scheme.
Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     |	Marketing Technical Specialist    ! kbierman@sun.com
I Voted for Bill &    |   Languages and Performance Tools. 
Opus  (* strange as it may seem, I do more engineering now     *)

firth@sei.cmu.edu (Robert Firth) (05/31/89)

In article <1RcY6x#64Zq3Y=news@anise.acc.com> lars@salt.acc.com (Lars J Poulsen) writes:

>From a humble applications programmer, who occasionally has written a bit
>of kernel code: The biggest pain with an architecture that exposes too
>large a register file is saving and restoring on context switches.

It's more than a pain; it can cause a severe performance penalty in
some application domains.  This is one of the reasons I don't like
register-file machines: ALL the registers have to be saved and
restored across a full context switch, and there are an awful lot
of them.  Moreover, a hardware instruction isn't much help, since
the holdup is the actual data copy to and from memory, not the
instruction fetch.

For this reason, I think people with hard real-time applications should
look pretty carefully at context switch costs.

> While
>interrupt service routines can ignore this and store only what they
>need, context switches require storing of the entire register set.

There's another pitfall here.  Some machines use the same register
window mechanism to handle interrupts, automatically shifting to
a new set.  If the interrupt hits you at just the wrong call depth,
you get an automatic register spill.  So about one-sixth or one-fourth
of the time, it takes maybe three times as long overall to field the
interrupt and maybe thirty times as long to reach the first instruction
of the handler.  Since hard real time people care about predictability
next only to performance, this gives loads of grief.

frazier@oahu.cs.ucla.edu (Greg Frazier) (06/01/89)

In article <3427@bd.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
>In article <1RcY6x#64Zq3Y=news@anise.acc.com> lars@salt.acc.com (Lars J Poulsen) writes:
>
>>From a humble applications programmer, who occasionally has written a bit
>>of kernel code: The biggest pain with an architecture that exposes too
>>large a register file is saving and restoring on context switches.
>
>It's more than a pain; it can cause a severe performance penalty in
>some application domains.  This is one of the reasons I don't like
>register-file machines: ALL the registers have to be saved and
>restored across a full context switch, and there are an awful lot
>of them.  Moreover, a hardware instruction isn't much help, since
>the holdup is the actual data copy to and from memory, not the
>instruction fetch.
Actually, this is not entirely true.  While you do have to save
all currently used registers, it is actually optimal to only
restore a single window within the register file (if you are
using windows - if you're not, then yes, all currently used
must be restored).  In addition, if you have a cache, saving
the registers is very fast (unless your cache is write-through).
If the cache is context dependent, then the time required to
flush the cache and the delay caused by a cold start will make
the register save/restore time insignificant.
>
>For this reason, I think people with hard real-time applications should
>look pretty carefully at context switch costs.
Hard real-time applications always have problems with memory
hierarchies - cache misses can be just as painful as register
saving/restoring, and don't even mention page faults! :-)
>> While
>>interrupt service routines can ignore this and store only what they
>>need, context switches require storing of the entire register set.
>
>There's another pitfall here.  Some machines use the same register
>window mechanism to handle interrupts, automatically shifting to
>a new set.  If the interrupt hits you at just the wrong call depth,
>you get an automatic register spill.  So about one-sixth or one-fourth
>of the time, it takes maybe three times as long overall to field the
>interrupt and maybe thirty times as long to reach the first instruction
>of the handler.  Since hard real time people care about predictability
>next only to performance, this gives loads of grief.
Obviously, for realtime applications, you need to save a window (or
some set of registers) for interrupt handling.  And, equally obvious,
you still run into problems if you get a high-priority interrupt
while you are handling a high-priority interrupt.  Hard realtime
systems require a high degree of predictability (i.e. you have
to thoroughly understand your application).  As I mentioned before,
introducing hierarchical memory makes the task of understanding
MUCH more difficult.  Add to this multiple processors (just to add
another issue), and you have no choice but to do soft realtime,
unless your application is VERY simple.  ...And, if your application
is that simple, you can (?) take register behavior into account
(I think - I have not done any research in hard realtime, so I'm
just blathering).

Greg Frazier
&&&&&&&&&&&&&&&&&&!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!((((((((((((((((((

Greg Frazier	    o	Internet: frazier@CS.UCLA.EDU
CS dept., UCLA	   /\	UUCP: ...!{ucbvax,rutgers}!ucla-cs!frazier
	       ----^/----
		   /

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (06/01/89)

In article <1RcY6x#64Zq3Y=news@anise.acc.com> lars@salt.acc.com (Lars J Poulsen) writes:

>large a register file is saving and restoring on context switches. While
>interrupt service routines can ignore this and store only what they

From time to time, I have seen the argument that large register files slow
down context switches.  Yes and no.  If you are looking at general purpose
operating systems like Unix, register saves are a very small part of the
cost of a context switch.  Last week I saw an ad for a real time version
of Unix with a guaranteed response time (switch to kernel to high priority
process) of 5 milliseconds.  Now, for the sizes of register files we are
talking about, a register save is about 1/1000th to 1/100th of the total time.

If you are looking at embedded real time systems with no general purpose kernel,
a large number of registers could be a problem.  Where it gets "interesting"
is if you are trying to support microtasking at the outer do-loop level in
Fortran from within the kernel.  Well, there is no such thing as a free lunch.
I guess you just have to design carefully when you get near the edge...

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

lars@salt.acc.com (Lars J Poulsen) (06/01/89)

>In article <1RcY6x#64Zq3Y=news@anise.acc.com>
	lars@salt.acc.com (Lars J Poulsen) writes:
>.... # regs on different machines ...
>>The biggest pain with an architecture that exposes too
>>large a register file is saving and restoring on context switches. While
>>interrupt service routines can ignore this and store only what they
>>need, context switches require storing of the entire register set. Or do
>>people really feel that the processors today are fast enough that
>>scheduling pre-emption is too rare to influence the selection of
>>register file size ?

In article <107302@sun.eng.sun.com>
	khb@sun.UUCP (Keith Bierman - SPD Languages Marketing -- MTS) writes:
>Upshot: modern compilers can employ as many registers as you can
>	design in. If you got 'em, you gotta figure out a way to 
>	save 'em. Windows, "multiconnect", register pointers, special
>	purpose (including vector) and other solutions are workable.
>
>Naive rationale for infinite (as long as they are free) registers:
>
>constant propagation and common subexpressions, combined with loop
>unrolling (or more advanced techniques like percolation scheduling)
>result in a need for as many as you got ... and then some.
>
>OS is different from application programs, and it is not clear that
>one might not be better off with a special OS register allocation
>scheme.

What I am talking about is saving and restoring registers when the
operating system switches from one user process to another. This is not
something that the compiler can improve on. Maybe the architecture could
define a "highest register currently in use" pointer, and encourage
context switches just before it gets incremented ?
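
A minimal sketch, in C, of the bookkeeping such a pointer would buy
(illustrative only: the names and the 64-register size are invented,
and a real machine would do this in kernel assembler or microcode):

#define NREGS 64

struct context {
	int hiwater;			/* highest register currently in use */
	unsigned long regs[NREGS];
};

/* only registers at or below the high-water mark need to be saved */
void save_context(struct context *c, const unsigned long regfile[NREGS])
{
	int i;

	for (i = 0; i <= c->hiwater; i++)
		c->regs[i] = regfile[i];
}

The payoff is obvious: a process that never climbs above register 10
pays for 11 saves instead of 64.
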
/ Lars Poulsen <lars@salt.acc.com>     (800) 222-7308  or (805) 963-9431 ext 358
  ACC Customer Service                Affiliation stated for identification only
                  My employer probably would not agree if he knew what I said !!

mash@mips.COM (John Mashey) (06/01/89)

In article <3427@bd.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
....
>of the handler.  Since hard real time people care about predictability
>next only to performance, this gives loads of grief.

Actually, some hard real time people care about predictability more than
performance.  Maybe some knowledgeable R.T. folks out there might care
to post something about this: it's an important area, and one that
doesn't get discussed here as much as it deserves.  In particular, there
are of course serious implications of going faster (which usually implies
more caching) and being less predictable (sometimes implied by caching).
Maybe some real time folks could post some references, or some descriptions
of the different flavors of real-time (as there are quite a few).
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

jwest@pitstop.West.Sun.COM (Jeremy West) (06/01/89)

In article <3427@bd.sei.cmu.edu>, firth@sei.cmu.edu (Robert Firth) writes:
> 
> This is one of the reasons I don't like
> register-file machines: ALL the registers have to be saved and
> restored across a full context switch, and there are an awful lot
> of them.  Moreover, a hardware instruction isn't much help, since
> the holdup is the actual data copy to and from memory, not the
> instruction fetch.
> 

In SunOS running on SPARC it only saves the register windows that
you have actually used when it does a context switch. This is often all
of them but not always. You do have to fetch instructions to save registers,
and the double-word load/store that SPARC uses reduces the number
of instructions by half. Since this already gives some assistance, I think
the SPARC architects decided that there wasn't much more to gain from a
more specialised instruction.
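
A toy model in C of that policy (this is not SunOS source; the window
count, the dirty bitmask, and the double-word grouping are just
illustrative):

#define NWINDOWS 8
#define WINSIZE  16			/* registers per window */

struct task {
	unsigned dirty;			/* bitmask of windows actually used */
	unsigned long save[NWINDOWS][WINSIZE];
};

void save_windows(struct task *t, unsigned long rf[NWINDOWS][WINSIZE])
{
	int w, r;

	for (w = 0; w < NWINDOWS; w++) {
		if (!(t->dirty & (1u << w)))
			continue;	/* never touched: nothing to save */
		for (r = 0; r < WINSIZE; r += 2) {
			/* one double-word store covers two registers */
			t->save[w][r]     = rf[w][r];
			t->save[w][r + 1] = rf[w][r + 1];
		}
	}
}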

> For this reason, I think people with hard real-time applications should
> look pretty carefully at context switch costs.

If you're that keen on context switching, the fastest thing around
IMHO is the Transputer, which does it in 600ns at 20 MHz for a full
switch between two low priority tasks. Of course it effectively has
one register (sort of) and prevents you from leaving things in
registers when certain instructions are executed :-}

With SPARC you can partition the register windows so that two or more tasks
have separate sets of register windows. You can then get much faster
context switches between a small number of tasks. In fact to do this
effectively you need to *increase* the number of registers and I think
that some future SPARC chips (aimed specifically at realtime) will
have more.
> 
> Some machines use the same register
> window mechanism to handle interrupts, automatically shifting to
> a new set.  If the interrupt hits you at just the wrong call depth,
> you get an automatic register spill.  So about one-sixth or one-fourth
> of the time, it takes maybe three times as long overall to field the
> interrupt and maybe thirty times as long to reach the first instruction
> of the handler.  Since hard real time people care about predictability
> next only to performance, this gives loads of grief.

SPARC always reserves one register window for taking traps and
interrupts in. For an 8-window system the user code would only get 7 before
having to save to memory. If you divide the file between two processes, then
each gets 3 windows for user code and one for taking traps.

Adrian Cockcroft
Sun Cambridge UK TSE
sun!sunuk!acockcroft

(Borrowing Jerry West's account at Mountain View to get at USENET)

slackey@bbn.com (Stan Lackey) (06/01/89)

In article <18965@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>For the implementations to come, including
>superscalar stuff, the issues will be the complexity of implementation and
>the cost (read:  people) of realizing that implementation.  This is where
>CISCs, I believe, will falter.  One of the great advantages of RISCs is 
>that they are conceptually easier to implement.

In order to go superscalar, future RISC chips will end up having new,
incompatible architectures.  I make this statement because I understand
the difficulty of the alternative, which is the execution of several
existing-architecture instructions concurrently.

I strongly suspect the RISC guys are, as we speak, analyzing stuff like:
 What kind of sequences frequently show up
 How to handle register dependencies and overutilized facilities
 Should totally independent instr's go in parallel only, or can dependent
   ones be issued and then pipelined (load to r0 followed by add to r0)

Let me guess what they're finding.  Correct me if I'm wrong.

 Load/store followed by increment/decr of the address is common.
 Load followed by dependent use of the data is common.
 Decrement of register, followed by test (compare or just zero), followed
   by branch, is common.
 Addition to register followed by its use as an address.
 Bunch of loads in a row, bunch of stores in a row.
[I could probably think of more with more time.]
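
For what it's worth, one hypothetical C loop exhibits most of these
(an illustration, not from any particular measurement):

/* the inner loop compiles to a load followed by a dependent add, an
   address increment next to the load, and a decrement-test-branch
   at the loop bottom */
long sum(const long *p, long n)
{
	long s = 0;

	while (n-- > 0)
		s += *p++;
	return s;
}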

Look familiar?  This prompted my first reaction when RISC started
getting popular: Where do they go from here? Yes, it's an interesting
technology opportunity.  But I'd rather implement the next generation
machines as CISCs than as RISCs, IF I HAVE TO STAY INSTRUCTION
SET COMPATIBLE.  In other words, I think that when they learn how to
implement a fast CISC, it will go faster than the same-technology
RISC.  Why?  Because for the RISC to keep up, it will have to execute
an average of two instructions per cycle, and one instruction is a lot
easier to implement than two, even with auto-increment addressing.

Of course, there is always the possibility to do what they did the
first time: come out with a new architecture that matches the
technology window exactly.  Note recent product announcements: this
just keeps happening!  It's just that when you do this, expect a
limited lifetime of the architecture.

-Stan  [Disclaimer:  Do I have an opinion yet?]

earl@orac.mips.com (Earl Killian) (06/01/89)

Another problem with register windows is that they make it difficult
to support coroutines.  I typically do queuing simulations by having
multiple coroutines representing separate entities, each with a
separate stack.  I've implemented this on VAXs, 68000s, and MIPS boxes
with only 20-50 lines of assembler.  But I can't see any way to do
stack switching on a register window machine without a kernel call
(yuck).  Have any of the register window architectures implemented a
way for user code to do stack switching?
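
For flavor, here is the idea in portable C using the System V-style
ucontext primitives (a sketch, not the assembler mentioned above; on a
conventional register machine the library swap just saves and reloads
the callee-saved registers and stack pointer, entirely in user mode):

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t sim, entity;

static void entity_body(void)
{
	puts("entity: first event");
	swapcontext(&entity, &sim);	/* yield back to the simulator */
	puts("entity: second event");
}					/* falling off resumes uc_link */

int main(void)
{
	size_t sz = 64 * 1024;
	char *stk = malloc(sz);		/* the coroutine's separate stack */

	getcontext(&entity);
	entity.uc_stack.ss_sp = stk;
	entity.uc_stack.ss_size = sz;
	entity.uc_link = &sim;
	makecontext(&entity, entity_body, 0);

	swapcontext(&sim, &entity);	/* run entity to its first yield */
	puts("simulator: between events");
	swapcontext(&sim, &entity);	/* resume it */
	free(stk);
	return 0;
}

The question stands for window machines: the analogous swap has to
flush the windows to memory first, and on some of them that operation
is privileged.
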
--
UUCP: {ames,decwrl}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086

cik@l.cc.purdue.edu (Herman Rubin) (06/01/89)

In article <20810@orac.mips.COM>, earl@orac.mips.com (Earl Killian) writes:
> Another problem with register windows is that they make it difficult
> to support coroutines.  I typically do queuing simulations by having
> multiple coroutines representing separate entities, each with a
> separate stack.  I've implemented this on VAXs, 68000s, and MIPS boxes
> with only 20-50 lines of assembler.  But I can't see any way to do
> stack switching on a register window machine without a kernel call
> (yuck).  Have any of the register window architectures implemented a
> way for user code to do stack switching?

Suppose you have two coroutines.  Then the variables fall into three classes:
those which belong to both, those which belong to the first, and those which
belong to the second.  Why not partition the register file accordingly?

As for stack switching, if the instructions effectively have a dedicated
stack register, then that register must be saved and restored.  But it is
necessary to do this anyhow.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

baum@Apple.COM (Allen J. Baum) (06/01/89)

[]
>In article <40718@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>I strongly suspect the RISC guys are, as we speak, analyzing stuff like:
> What kind of sequences frequently show up
>Let me guess what they're finding.  Correct me if I'm wrong.
> (..lots of pairs that look like single CISC instructions...)
>Look familiar?  This prompted my first reaction when RISC started
>getting popular: Where do they go from here? Yes, it's an interesting
>technology opportunity.  But I'd rather implement the next generation
>machines as CISC's rather than RISC's, IF I HAVE TO STAY INSTRUCTION
>SET COMPATIBLE.  In other words, I think that when they learn how to
>implement a fast CISC, it will go faster than the same-technology
>RISC.  Why?  Because for the RISC to keep up, it will have to execute
>an average of two instructions per cycle, and one instruction is a lot
>easier to implement than two, even with auto-increment addressing.
>
>Of course, there is always the possibility to do what they did the
>first time: come out with a new architecture that matches the
>technology window exactly.  Note recent product announcements: this
>just keeps happening!  It's just that when you do this, expect a
>limited lifetime of the architecture.
>
>-Stan  [Disclaimer:  Do I have an opinion yet?]

Absolutely- maybe. I certainly agree that the current RISC philosophy is just
a window in time. As design tradeoffs change (and they will), a lot of what
currently is dogma will become rather obsolete. For example, just as compiler
technology affected our design choices, silicon compiler technology might affect
it. New device technologies (GaAs, new kinds of memory device) will affect the
relative speeds of memory versus logic. Advances in compiler technology may
take advantage of cache control operations to optimize cache hit ratios, or
be able to compile for parallel machines.

I'm not sure, however, that you can say with authority that it will be easier to
implement a CISC type mem-reg architecture than it would be to execute two RISC
style instructions at once (at the same speed), because the RISC approach lets
you schedule the interactions at compile time, while the CISC is forced to do
it at run time. Therefore, the RISC might keep up without going balls out for
speed.

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

pcg@aber-cs.UUCP (Piercarlo Grandi) (06/01/89)

In article <107302@sun.Eng.Sun.COM> khb@sun.UUCP (Keith Bierman - SPD Languages Marketing -- MTS) writes:
    
    "modern" compilers tend to do global (well, misnomer, all of a procedure
    at once) analysis; older compilers tended to look at smaller bits of
    code (block, maybe interblock) ... so "modern" compilers have the
    opportunity to do better.

It must be repeated for the Nth time that this is only true if spill
minimization is of paramount importance; if you look at speed, then most
spills avoided by global optimizers with large register sets don't make much
of a difference.
    
    Upshot: modern compilers can employ as many registers as you can design
	in.

But pointlessly... And even old compilers can just register everything in
sight; if there are many registers, then using an optimizer is not very
important.  The hard work, as we have just discussed, is to cache *only* the
variables that matter, for *only* the section of code where they matter (and
this can be done by the programmer using "register" in C, or by the compiler
when fed with either "representative" profile data, or with calculations or
estimations of where hot spots lie).  This typically requires far fewer
registers than minimizing spills regardless of whether they are expensive
ones or not.
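
A made-up C fragment of the point: hint only the variables that are
hot for the whole loop, and let everything else live in memory:

double dot(const double *a, const double *b, int n)
{
	register double acc = 0.0;	/* hot across the entire loop */
	register int i;			/* so is the loop index */

	for (i = 0; i < n; i++)
		acc += a[i] * b[i];
	return acc;			/* nothing else needs a register */
}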
    
    Naive rationale for infinite (as long as they are free) registers:
				  ^^^^^^^^^^^^^^^^^^^^^^^^

Unfortunately they are not free; more registers make the system stiffer, in
that they do raise the cost of multithreading, which is where os technology
is finally heading (Mach, Os/2, etc...), and they do have costs in real
estate and even, possibly, cycle time lengthening (Cray's law). You only
need a handful of registers to capture most of the benefit of expression
optimization, and another handful to capture most of the benefits of intra statement
optimization (whether you do it via "register" in C or leave it to the
compiler).

Large register banks are only justified for special purpose machines
(vector, VLIW, superscalar) where the only thing that matters is raw speed
in processing batched numeric codes with an inherent high degree
of parallelism in the algorithms employed.
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

rw@beatnix.UUCP (Russell Williams) (06/01/89)

In article <1RcY6x#64Zq3Y=news@anise.acc.com> lars@salt.acc.com (Lars J Poulsen) writes:
>From a humble applications programmer, who occasionally has written a bit
>of kernel code: The biggest pain with an architecture that exposes too
>large a register file is saving and restoring on context switches. While
>interrupt service routines can ignore this and store only what they
>need, context switches require storing of the entire register set. Or do
>people really feel that the processors today are fast enough that
>scheduling pre-emption is too rare to influence the selection of
>register file size ?

   It's been my experience in working with general purpose operating systems
(not realtime monitors for embedded systems) that the register save time
is negligible compared to the other software overhead involved in a 
task or process switch. 

Russell Williams
..uunet!elxsi!rw
..ucbvax!sun!elxsi!rw

pcg@aber-cs.UUCP (Piercarlo Grandi) (06/01/89)

In article <26204@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:

    In article <1RcY6x#64Zq3Y=news@anise.acc.com> lars@salt.acc.com (Lars J Poulsen) writes:
    
    >large a register file is saving and restoring on context switches. While
    >interrupt service routines can ignore this and store only what they
    
    From time to time, I have seen the argument that large register files slow
    down context switches.  Yes and no.  If you are looking at general purpose
    operating systems like Unix, register saves are a very small part of the
    cost of a context switch.

I seem to remember reading that Unix context switch time is 10% the actual
switch, 90% the cost of calling the scheduler to choose the next running
process.

This can be mitigated in three ways:

	[1] long timeslices for CPU bound processes (Unix uses 1 second); for
	    io bound processes, process switches typically occur when the process
	    goes to sleep, so calling the scheduler is almost inevitable.
	[2] a two level scheduler (e.g. MUSS); the short range scheduler only
	    does context switching, always selecting the next process to run
	    as the first in a short queue of very-runnable processes, while the
	    medium term scheduler runs periodically to determine which of
	    the runnable processes should belong to the very-runnable set
	    (a minimal sketch follows this list).
	[3] like [2], but the short term scheduler is done by hardware,
	     e.g. the very old Honeywell 8 contexts-in-a-ring mainframe.
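
As promised in [2], a minimal sketch in C (the names and the queue
depth are invented; the point is only that the short-term path does
no policy work at all):

struct proc;					/* opaque process descriptor */
extern struct proc *pick_best_runnable(void);	/* the full policy lives here */

#define VR_SLOTS 4
static struct proc *very_runnable[VR_SLOTS];	/* the short queue */
static int vr_next;

/* short range scheduler: constant time, no policy, just rotate */
struct proc *short_term_next(void)
{
	vr_next = (vr_next + 1) % VR_SLOTS;
	return very_runnable[vr_next];
}

/* medium term scheduler: runs periodically, refills the short queue */
void medium_term_tick(void)
{
	int i;

	for (i = 0; i < VR_SLOTS; i++)
		very_runnable[i] = pick_best_runnable();
}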

Using [1], the Unix strategy, large register files don't matter a lot, simply
because (as Hugh LaMaster points out) we are accepting such colossal overheads that
everything pales by comparison.

But large register sets do severely impact the speed of switching in case
[2], less in [3] (where the average cost may be low, because of the multiple
hw contexts for the processes in the very-runnable queue, but variance is
still large when changing its composition).

    Last week I saw an ad for a real time version of Unix with a guaranteed
    response time (switch to kernel to high priority process) of 5
    milliseconds.

A note here: I saw similar ads in the past. 5 ms means 200 Hz. Booooooo!
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

bcase@cup.portal.com (Brian bcase Case) (06/02/89)

>>For the implementations to come, including
>>superscalar stuff, the issues will be the complexity of implementation and
>>the cost (read:  people) of realizing that implementation.  This is where
>>CISCs, I believe, will falter.  One of the great advantages of RISCs is 
>>that they are conceptually easier to implement.
>
>In order to go superscalar, future RISC chips will end up having new,
>incompatible architectures.  I make this statement because I understand
>the difficulty of the alternative, which is the execution of several
>existing-architecture instructions concurrently.

By definition, a superscalar implementation of an architecture is an
implementation that *is compatible* with previous non-superscalar
implementations.  It doesn't make sense to say that to go superscalar,
future RISC chips will have incompatible architectures.  New things will
certainly be learned about the interactions between architecture
and implementation, but that doesn't mean that current RISC architectures
won't have superscalar implementations.  They are in the works as we
type....

>In other words, I think that when they learn how to
>implement a fast CISC, it will go faster than the same-technology
>RISC.  Why?  Because for the RISC to keep up, it will have to execute
>an average of two instructions per cycle, and one instruction is a lot
>easier to implement than two, even with auto-increment addressing.

First of all, the fast CISCs that we are seeing, 486 and 68040, are looking
more and more like RISCs, to the extent that that is possible.  That is,
the simple, pipelinable instructions go fast while the complex, inherently
serial instructions go slow.  This trend will only increase.  Note that
the fast CISCs, which have *next generation* technology on their side, are
still not faster than current technology RISCs.

The process of subsetting a CISC's instruction set and changing the
compilers to take advantage of the fast subset is the only way for a CISC
to go reasonably fast.
For a superscalar implementation of a CISC, the subsetting will get more
severe, I predict.  Trying to execute two complex instructions, with side
effects, at the same time is much harder to understand and implement
(correctly) than trying to execute two simple instructions, relatively
free of side effects, at the same time.

>Of course, there is always the possibility to do what they did the
>first time: come out with a new architecture that matches the
>technology window exactly.  Note recent product announcements: this
>just keeps happening!  It's just that when you do this, expect a
>limited lifetime of the architecture.

The commercial introduction of new architectures is completely up to
companies.  If a new architecture facilitates superscalar, or super-
mumble, implementation, then it is the job of academics to discover
and document it.  I guess I am saying that no one should expect that
CISCs will make a comeback because of the desire to build superscalar
implementations.  Quite the contrary.

On limited lifetimes:  If we only had a portable means of distributing
applications, such as that proposed by the OSF ANDF (architecture-neutral
distribution format), then we could innovate at the architectural level
'till we're blue in the face (maybe we already are).  An architecture
might be in vogue for a limited amount of time, but with ANDF, it would
still be able to run new software.

tim@crackle.amd.com (Tim Olson) (06/02/89)

In article <20810@orac.mips.COM> earl@orac.mips.com (Earl Killian) writes:
| Another problem with register windows is that they make it difficult
| to support coroutines.  I typically do queuing simulations by having
| multiple coroutines representing separate entities, each with a
| separate stack.  I've implemented this on VAXs, 68000s, and MIPS boxes
| with only 20-50 lines of assembler.  But I can't see any way to do
| stack switching on a register window machine without a kernel call
| (yuck).  Have any of the register window architectures implemented a
| way for user code to do stack switching?

Well, this seems true for SPARC, since the Current Window Pointer field
is in the Processor Status Register, which is protected, but it is not
true of the Am29000 or the i960.  The i960 has a FLUSHREG instruction
which seems to be user-accessible, and in the 29k the user has access to
all of the local registers and the stack pointer register.  The
low-level thread package I wrote for the 29k has a swapThread procedure
which takes 35 instructions, all in user-mode.


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

les@unicads.UUCP (Les Milash) (06/02/89)

In article <1Rd5QS#7Pjn10=news@anise.acc.com> lars@salt.acc.com (Lars J Poulsen) writes:
>What I am talking about is saving and restoring registers when the
>operating system switches from one user process to another. This is not
>something that the compiler can improve on. Maybe the architecture could
>define a "highest register currently in use" pointer, and encourage
>context switches just before it gets incremented ?

Somebody recently also pointed out that the INMOS Transputer can c-switch
in .6uS.  The way they do it is sort of like this, but weirder;
and it's different-but-related wrt this thread:
there are 6 registers, a PC, an FP, and a 3-or-4 deep hardware stack.
all local variables are in "memory", and they have a fast FP+short
offset addressing mode.  current chips have (1 kWord, soon 8) of on-chip 50ns
memory where you want to put your stack(s?).  architecturally i think
the chip could be done with NO on-chip (i.e. fixed address) ram but
cache instead.  in fact you can disable the on-chip bank if you want to,
although i'm not sure there's the perfect set of signals coming
out that cache users would want.

context switches will only happen on certain instructions (like loop bottoms,
and in/out, which might block).  they do have a "highest register currently
in use" pointer, sort of, but they assume that the compiler will make sure
that NONE of them are in use at those points, so it only saves the PC and FP.
#define HIGHEST_USED_REGISTER 0 /* HA.  so THERE.  deal with THAT, mate! */

remember the TI 99*?  before my time, but it's a memory-to-memory machine,
with the "registers" being memory-mapped, FP relative.  i heard that it
was a dog, at least the micro version.  has anybody considered
memory-to-memory architectures in the face of modern cache design?

it seems to solve some problems (but probably introduces some, i just can't
see what right off the bat).  somebody recently was wondering: if
you do global register allocation, what if your most frequent variable
had pointers to it, how do you handle that?  this'll solve that at least,
right?

in america the transputer gets little attention (i guess cause it's got no
mmu and it's not intel- or moto- compatible and cause inmos
took about 5 years before they got anybody to understand what the heck it
was).  but it is weird, and in lots of ways good, and we can learn from it.
it's risc not in the sense of "register-to-register with windows" but in the
sense of "optimized in the face of real data", or at least they do constantly
quote numbers about how "only 1.5% of these actually require those so we'll
make you do it in software" and they were one of the first to make short
immediate constants load fast.

have load-store day.

Les Milash

baum@Apple.COM (Allen J. Baum) (06/03/89)

[]
>In article <479@unicads.UUCP> les@unicads.UUCP (Les Milash) writes:
>
>Somebody recently also pointed out that the INMOS Transputer can c-switch
>in .6uS.  The way they do it is sort of like this but wierder;
>and it's different-but-related wrt this thread:
>there are 6 registers, a PC, a FP, and a 3-or-4 deep hardware stack.
>all local variables are in "memory", and they have fast FP+short
>offset addressing mode.
>
>remember the TI 99*?  before my time, but it's a memory-to-memory machine,
>with the "registers" being memory-mapped, FP relative.

The ATT CRISP chip is similar; it's actually a memory-memory architecture,
where all references are relative to a stack pointer. The top 32 entries are
cached in a top-of-stack cache. This is like a register window, with a
granularity of one register (i.e. the window slides by a minimum of 1,
instead of a fixed 8), and with the advantage that you can address into
the registers.

A context switch can be extremely fast; just change the stack pointer. Of 
course, then all accesses to the 'registers' are cache misses.
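
A toy model of why, in C (this is not ATT's implementation; the
32-entry figure comes from above, everything else is invented):

#define TOS_CACHE 32			/* stack entries cached on chip */

struct cpu {
	unsigned long *sp;		/* every "register" access is sp[n] */
};

/* a full context switch is just a pointer change... */
void switch_to(struct cpu *cpu, unsigned long *new_sp)
{
	cpu->sp = new_sp;
	/* ...but the next TOS_CACHE distinct accesses through sp will
	   miss in the stack cache and refill from memory */
}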

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

khb@chiba.Sun.COM (Keith Bierman - SPD Languages Marketing -- MTS) (06/03/89)

In article <978@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

>...
>It must be repeated for the Nth time that this is only true if spill
>minimization is of paramount importance; if you look at speed, then most
>spills avoided by global optimizers with large register sets don't make much
>of a difference.

And for the N+1th time, I guess, it must be repeated that the class of
machines for which minimizing spills matters is quite interesting (and
getting bigger all the time).  Consider the paper by Hsu, Dehnert, and
Bratt in ASPLOS-III on the Cydra 5, as an example.

>    
>    Upshot: modern compilers can employ as many registers as you can design
>	in.
>
>But pointlessly... And even old compilers can just register everything in

One can, but old compilers tend to do a very bad job of it. 

>sight; if there are many registers, then using an optimizer is not very
>important.  The hard work, as we have just discussed, is to cache *only* the
>variables that matter, for *only* the section of code where they matter (and

Which with software pipelining, trace scheduling and similar techniques
can be quite a long time indeed. On the Cydra 5 a memory write was
assumed to take 26 cycles, two memory references could be initiated
EVERY clock, as well as two integer operations and two FP operations.
Since the FP operations required more than one cycle, the instruction
scheduling was quite interesting. With respect to register "live time",
a given register might be required several loop iterations into the future.
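
An illustrative C rendering of that (hand-pipelined, depth 2; any
resemblance to Cydra 5 code is loose - the point is that the value
loaded now is not consumed until two iterations later, so its register
stays live across iterations):

/* a[i] = 2*b[i] for n >= 2, with loads issued two iterations ahead */
void scale(double *a, const double *b, int n)
{
	double t0 = b[0], t1 = b[1];	/* prologue: two loads in flight */
	int i;

	for (i = 0; i + 2 < n; i++) {
		a[i] = 2.0 * t0;	/* consume iteration i's load */
		t0 = t1;		/* shift the pipeline */
		t1 = b[i + 2];		/* issue the load for iteration i+2 */
	}
	a[n - 2] = 2.0 * t0;		/* epilogue: drain the pipeline */
	a[n - 1] = 2.0 * t1;
}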

>this can be done by the programmer using "register" in C, or by the compiler

Which is often ignored by the compiler... simply because programmers
cannot reasonably guess how the compiler will unroll, split and
otherwise contort your code (try it on your Multiflow Trace/28 for a
bunch of codes and show us what is produced!).


>when fed with either "representative" profile data, or with calculations or
>estimations of where hot spots lie).  This tipically requires many less
>registers than minimizing spills regardless of whether they are expensive
>ones or not.

Doing a "spill" (i.e. running out of interconnect) on the Cydra 5
meant your loop ran 10x slower. This is not acceptable to most programmers.


>    Naive rationale for infinite (as long as they are free) registers:
>				  ^^^^^^^^^^^^^^^^^^^^^^^^
>
>Unfortunately they are not free; more registers make the system stiffer, in

True. Which is why folks build windows, small (32) register files,
file pointers (AMD, Gould) and other stuff.

>that they do raise the cost of multithreading, which is where os technology
>is finally heading (Mach, Os/2, etc...), and they do have costs in real
>estate and even, possibly, cycle time lengthening (Cray's law). You only
>need a handful of register to capture most of the benefit of expression
>optimization, and another to capture most of the benefits of intra statement
>optimization (whether you do it via "register" in C or leave it to the
>compiler).

Your assertion about multithreading is quite true; it is here to
stay. It is far from clear that transputer type designs (tiny machine,
fast communication) will win out over somewhat "chunkier" designs.

But the assertion about a handful of registers being sufficient on
high performance machines is simply not borne out. All of Seymour's
machines have a bunch (don't forget those vector registers), and this
is NECESSARY for those long pipes (superpipelining ?) ... and it is
just as true for software pipelining.


>Large register banks are only justified for special purpose machines
>(vector, VLIW, superscalar) where the only thing that matters is raw speed
>in processing batched numeric codes where there is an inherent high degree
>of parallelism in the algorithms employed.

Multiflow claims that they eat "general" code just fine. As Mashey has
pointed out, there are several superscalar projects running around ...
and business codes, database codes, and windowing systems benefit from
that kind of parallelism just like numeric codes (although writing the
code in C makes it much harder to extract the parallelism).


Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     |	Marketing Technical Specialist    ! kbierman@sun.com
I Voted for Bill &    |   Languages and Performance Tools. 
Opus  (* strange as it may seem, I do more engineering now     *)

hays@zeus.hf.intel.com (hays) (06/03/89)

In article <26204@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:

>Last week I saw an ad for a real time version
>of Unix with a guaranteed response time (switch to kernel to high priority
>process) of 5 milliseconds.

Hugh is absolutely right in his response here - you must design for
your environment.  A guaranteed task switch of 5 milliseconds is *soft*
real time, but is still impressive for a real time Unix.

I have worked on a *hard* real time kernel for the 386 (iRMK, from
Intel) where we designed for a 50 uS maximum guaranteed interrupt
latency, and a 5 uS maximum guaranteed task switch (at 16 MHz).  Not
easy, but three orders of magnitude faster than the Unix mentioned
above - and with its own market niche.

The point?  Design for your environment - chips for UNIX should have
large register files, real time chips should have small register files,
and compromises will be made for many successful products.

-----
Kirk Hays -- "I'm the NRA!"

mcg@mipon2.intel.com (06/05/89)

In article <25786@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <m0fRx4x-0001fDC@mipon2.intel.com> mcg@mipon2.UUCP (Steven McGeady) writes:
>| [Re:] huge directly-addressable register files ... in any reasonable
>| implementation, the register file must be multi-ported.  A six-ported register
>| cell is about 5x the size of a single-ported register cell ...
>
>I don't see why this is a reason not to implement large register files. 
>You need to apply transistors where they will do the most good.

This is a truism, a cliche.  The point is that we are, and have always been,
limited in our scope by how many devices could be {effectively, economically,
manufacturably} put on a die.  If one spends 2/3 of one's silicon budget
on a register file, then correspondingly less is available for function
units (e.g. floating-point), caches, and on-chip peripherals.  The question
of whether a large register file will "do the most good" is precisely the
point here, and I have seen no evidence that it is.

>Note
>that processors that attempt to "more fully utilize micro-parallelism"
>also tend to want to have more general-purpose registers available to
>maintain full performance.

As above, I would like to see your experimental evidence that suggests
that more than 64 general registers are useful for a preponderance of
embedded applications (which is, I believe, your target market), or
for any application other than multi-precision scientific code.
We have seen that, given current and projected compiler technology,
an extremely large *addressable* register set is substantially less
useful than a large on-chip cache.  With wide (128-bit) low-latency
transfers from cache to registers overlapped with other operations,
a large register set is not really needed.  Large embedded applications
we have examined show an average of 6 local scalar variables in registers,
with an additional 3-5 temporaries in use.  Additional registers may be
used to cache global variables, and to temporally separate destination
values that could otherwise share registers, thus avoiding scoreboarding
blocks.

And since one can put 5x the cache in the place of the registers that
would be added, the cache makes even more sense as a repository for
values of import that are not currently in registers.


>| This, IMHO, is one of the greatest flaws of the 29k - it exposes 192
>| (actually 256) architectural registers.  In the current implementation they
>| are (I believe) 3-ported, and even now occupy a large amount of the 29k
>| die space.  I believe that they will run into serious problems if they
>| ever attempt to dispatch and execute multiple general instructions per cycle.
>
>Well, we see no problems in either our 2nd or 3rd generation parts...

Well then, do speak up.  I can't imagine how you will effectively increase
the number of ports in the register file without creating an imbalance
between CPU speed and the bus (because of poor register/cache balance).
And I can't imagine how you will lower your CPI without it.  But then,
perhaps I'm just not imaginative enough.

>| The 960, on the other hand, exposes 32 general registers architecturally,
>| but because 16 of these are "local", and saved/restored on call/return
>| to/from architecurally-hidden resources, we can easily move from a cheap
>| (1-ported by 4 sets) register implementation, to a very fast one
>| (6-ported by 8+ sets) in high-performance implementations.
>
>So you *will* be looking at large register files (128+, 6-ported) for
>high performance.

No, not at all.  If you read Glenn Hinton's Compcon paper, you will realize
that, while in the first generation we used a ((16*4)+16) single-ported
register file, the subsequent generation includes a 32 register 6-ported
file, and that registers spilled by call and restored on return are
flushed to an on-chip cache capable of storing 8 or more of these register
sets.  The registers are flushed across a 128-bit wide dual bus, so
call and return take only 4 clocks each.

S. McGeady
Intel Corp.

mcg@mipon2.intel.com (06/05/89)

In article <26145@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes:
>In article <m0fRx4x-0001fDC@mipon2.intel.com> mcg@mipon2.UUCP (Steven McGeady) writes:
>>The cut line is different on every architecture - 32 is sufficient on the
>>960, but I am not disagreeing with Wall's estimate of 64.  Certainly
>>in floating-point intensive scientific applications dominated by
>>double-precision arithmetic in loops, more registers are needed.  But
>>substantialy more than 64 seems to limit architectural flexibility
>>quite severely.
>The cut line for scratch registers is directly proportional to the memory
>latency.  The longer your latency the more registers, and concurrently
>handled computation, you need to mask it.

This is entirely correct, but doesn't mention several distinctions.  First,
there is a distinction between *addressable* registers (of which the
960 has 32) and *available* registers (of which the 960 architecture has
an undefined number; the K* implementations have 80, and the subsequent
implementations have more than 144).  The distinction is important because
one must decide what the locality of register use is: do you wish to/need to
use the preponderance of your registers in a single routine, or do you
use them across many routines?  In the former case, a larger addressable
register file helps; in the latter case it does not, and may adversely
impact other aspects of the architecture.  Scientific code often falls
into the former category.  Most control code (at least when well-structured
and/or written in HLL) falls into the latter.

Second, memory latency is not simply a function of external bus speed
and memory wait-states.  It is affected by the size, speed, and behaviour
of on-chip caches.

The balance between caching and registers is tricky.  Caches can never
be big enough, so there is a temptation to replace them with registers
and assume they will be implemented off-chip.  This is a fine choice
for some applications, but not for those with an emphasis on low-cost
computing.  In cost-sensitive applications, off-chip memory is strictly
DRAM, and the customer tunes the cost (aka speed) of the DRAM to her
performance needs.  You need 90% of the rated performance?  Use 1
wait-state; only 70%?  Use 3 wait-states.  Contrast this with the
equivalent message of other architectures: Don't want SRAM cache? Then
you get 50% performance; Don't want two distinct memory systems?
Then use this special memory that's more expensive and slower than what
you might want to use.  Of course, most comp.arch readers probably
don't actually *build* systems out of these chips, so these questions
are at best boring, and probably irrelevant here.

S. McGeady
Intel Corp.

tim@crackle.amd.com (Tim Olson) (06/07/89)

In article <m0fU2pe-0001hNC@mipon2.intel.com> mcg@mipon2.UUCP (Steven McGeady) writes:
| In article <25786@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
| 
| [Steve talking about problems with multi-porting large register files]
| 
| >Well, we see no problems in either our 2nd or 3rd generation parts...
| 
| Well then, do speak up.  I can't imagine how you will effectively increase
| the number of ports in the register file without creating an imbalance
| between CPU speed and bus (because of poor register/cache balance).
| And I can't imagine how you will lower your CPI without it.  But then,
| perhaps I'm just not imaginative enough.

I think you are setting up a "straw-man" argument here.  We certainly
aren't talking about 2/3 of the die area as you previously implied.  The
Am29000's triple-ported, 192 register file currently takes about 21% of
the die area.  Even if we (hypothetically) doubled the number of ports
on it for a future generation part, standard process shrinks would still
end up making it a much smaller portion of the die than it already is.
It would end up being dwarfed by the large caches and multiple
functional units required to sustain multiple instruction/cycle
throughput.


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

mcg@mipon2.intel.com (06/08/89)

In article <20810@orac.mips.COM> earl@orac.mips.com (Earl Killian) writes:
>Another problem with register windows is that they make it difficult
>to support coroutines.  I typically do queuing simulations by having
>multiple coroutines representing separate entities, each with a
>separate stack.  I've implemented this on VAXs, 68000s, and MIPS boxes
>with only 20-50 lines of assembler.  But I can't see any way to do
>stack switching on a register window machine without a kernel call
>(yuck).  Have any of the register window architectures implemented a
>way for user code to do stack switching?

This is easiest to answer with code for the 960.  I count 15 instructions.
This code was written by Jim Valerio.  'pfp' is the previous frame pointer
register.


/*
 *	C callable coroutine support for 960
 *
 */

/*
 *	co_fp = co_init(stk_addr, init_ip);
 *
 * Creates a coroutine on the stack `stk_addr' to begin at `init_ip'.
 * The value returned should be assigned to a global variable which
 * is used in calls to co_jump (e.g. task1 = co_init(...) and later
 * co_jump(task1)).  We assume that `stk_addr' is properly aligned.
 *
 */
	.globl	_co_init
_co_init:
	st	g14,(g0)	/* pfp := 0  (assumes g14 is 0) */
	lda	0x40(g0),g2
	st	g2,4(g0)	/* initial sp */
	st	g1,8(g0)	/* init_ip */
	ret

/*
 *	co_jump(&my_fp, new_co_fp);
 *
 * Saves away the calling-frame's FP for a later return, and then
 * returns to the new coroutine's context.  We know here that the
 * C calling convention requires that only g8..g12 be saved.
 */
	.globl	_co_jump
_co_jump:
	movq	g8,r4
	mov	g12,r8
	call	co_swap
	mov	r8,g12
	movq	r4,g8
	ret

co_swap:
	st	pfp,(g0)
	flushreg
	mov	g1,pfp
	ret			/* goto new coroutine */
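
For completeness, a hypothetical C-side use of these routines,
following the conventions in the comments above (the stack size and
the alignment trick are guesses):

extern void *co_init(void *stk_addr, void (*init_ip)(void));
extern void co_jump(void **my_fp, void *new_co_fp);

static void *main_co, *task1_co;
static double stack1[512];		/* doubles only to get alignment */

static void task1_main(void)
{
	for (;;)
		co_jump(&task1_co, main_co);	/* do some work, then yield */
}

int use_coroutines(void)
{
	task1_co = co_init(stack1, task1_main);
	co_jump(&main_co, task1_co);	/* run task1 until it yields */
	co_jump(&main_co, task1_co);	/* ...and resume it once more */
	return 0;
}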


-------------------------------------------------------------------------------

S. McGeady
Intel Corp.

wayne@dsndata.uucp (Wayne Schlitt) (06/10/89)

>In article <m0fU37i-0001hNC@mipon2.intel.com> mcg@mipon2.intel.com writes:
>                         Of course, most comp.arch readers probably
> don't actually *build* systems out of these chips, so these questions
> are at best boring, and probably irrelevant here.

> S. McGeady
> Intel Corp.


umm... just because we don't actually build systems out of these chips
doesn't mean we aren't interested in knowing the trade-offs that went
into the chip designs.  i'm just a lowly software person (:->), but in
order for me to guess what the hardware is going to look like five
years from now, i need to know these kinds of things.


-wayne