bs@linus.UUCP (Robert D. Silverman) (01/20/87)
There is some ongoing discussion about register usage in programs. While it is true that one generally wants to keep as many variables as possible in registers (especially such things as loop indices!), there are times when using a lot of registers can actually slow things down. This arises when a subroutine shares registers with its parent. If the routine is called many times, the cost of saving and restoring registers during the call and return can offset any speed savings through register usage. I have actually seen this phenomenon occur quite strikingly in programs that require multi-precision arithmetic. The cost of saving the registers for a multi-precision add or multiply routine (say) when it is called ~10^8 or 10^9 times can greatly slow down a program. Bob Silverman
neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) (05/11/89)
Summary: If your compiler is not antiquated (compared to this one, even gcc looks bad), you can use gazillions of registers. The speedup is on the order of 30%, so don't hold your breath. David W. Wall of Digital's Western Research Lab did exactly the kind of analysis you are all looking for, in his paper "Global Register Allocation at Link Time" (SIGPLAN Notices 21, 7, pp. 264-275). The benchmarks used were the Livermore loops, Whetstone, LINPACK, the Stanford benchmark suite and two applications, a logic simulator (RSIM) and a timing verifier. The compiler used was for the not-so-widely-known DECWRL Titan RISC machine, an ECL RISC with 64 non-windowed registers and a single-cycle load. The compiler does global register allocation (yes, global variables in the sense of C) at link time and is a common backend for C, Fortran and Modula-2. The number of registers does not refer to the expression evaluation and address generation registers (these seem to be the 4-7 registers people have been talking about so far) but to those used by the optimizer to hold things. It analyzes all scalar variables into non-conflicting groups and tries to allocate those to registers. From the introduction: "When we use our method for 52 registers, our benchmarks speed up by 10 to 25%. Even with only 8 registers, the speedup can be nearly that large, if we use previously collected profile information to guide the allocation. We cannot do much better, because programs whose variables all fit in registers rarely speed up by more than 30%. Moreover, profiling shows us that we usually remove 60-90% of the loads and stores of scalar variables that the program performs during its execution."
The benchmarks' characteristics:

            Variables  Groups  coloring  coloring + profile
-----------------------------------------------------------
Livermore      166      165      18 %         19 %
Whetstone      254      181      10 %         10 %
Linpack        214      119      13 %         13 %
Stanford       402      211      27 %         28 %
Simulator      811      262      15 %         16 %
Verifier      1395      693      15 %         19 %

Where the columns stand for:

Variables:          candidate scalar variables for global register allocation.
Groups:             overlapping variables of the above which need separate
                    registers. If there were that many registers in the
                    machine, they could all be assigned to registers.
coloring:           speedup obtained by global register allocation versus
                    register allocation just like what, say, gcc -O does.
coloring + profile: speedup obtained by guiding the allocator with actual
                    profiles of execution.

David measured the number of memory references that could be eliminated in the benchmarks if all scalars could be held in registers, and then plotted the percentage his scheme actually removed:

            coloring  coloring + profile
----------------------------------------
Livermore     81 %         94 %
Whetstone     75 %         88 %
Linpack       95 %         99 %
Stanford      90 %         98 %
Simulator     83 %         95 %
Verifier      61 %         83 %

This shows that his scheme is very efficient at removing these memory references. Please note that, given the enormous "hit rate" he gets and the not so impressive speedups, the overall percentage of scalar memory references cannot be that big compared to accesses to bigger data structures. Now the interesting tables. What happens if you use fewer registers? The following table shows the speed improvements with 52, 32 and 8 registers. All of these performance measures are the relative improvement of the programs with global register allocation guided by profile information, relative to "naive register allocation".
             52      32      8
------------------------------
Livermore   19 %    18 %   12 %
Whetstone   10 %    10 %    5 %
Linpack     13 %    13 %   10 %
Stanford    28 %    27 %   20 %
Simulator   16 %    15 %    8 %
Verifier    19 %    16 %    7 %

There is another very interesting paper by David comparing register window schemes of varying organization with this global allocation scheme; it seems to suggest that a slightly bigger global register file beats register windows, if you are willing to use these extremely advanced compilation techniques. The paper appeared in Proc. of the SIGPLAN 1988 Conference on Programming Language Design and Implementation, June 1988. The paper's title is "Register Windows vs. Register Allocation". It's way too long to reproduce here, and the graphics in there are much nicer than anything I can type here. Burkhard Neidecker-Lutz, Digital CEC Karlsruhe Disclaimer: I don't speak for Digital, etc. ...
sjs@spectral.ctt.bellcore.com (Stan Switzer) (05/11/89)
In article <921@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
> The one paper I read about this (unfortunately for John Mashey I cannot
> find the exact reference -- the reason is too embarrassing, even if not for
> me, to state publicly) was about taking the PCC (for the PDP) and changing
> the number of registers available to its Sethi-Ullman register allocator,
> and then benchmarking a few Unix tools.
>
> They found that in these conditions (CISC machine, no interexpression
> optimization, virtually only fixed point computation) speed/code size did
> not improve substantially with more than three scratch registers, and four
> were plenty.
My own experience with a C-like compiler for the Honeywell Level/6 bears this out. As long as you are simply doing expression-level code generation, three registers, surprisingly enough, are quite sufficient. The Sethi-Ullman "pebbling" scheme (a limited form of coloring) works well in this case. Larger register sets pay you back when you do dataflow analysis and allocate registers over larger spans of code than C expression statements. This is where I disagree with Piercarlo. You _can_ make good use of more registers, but you'll have to do some global analysis first. If you only have a handful of registers, though, you might as well keep it simple. The Level/6, if I remember correctly, had 14 registers: 7 integer (logical) and 7 pointer registers. Of the integer registers, numbered 1 to 7, #1-3 could be used as indexes, #1 could be used as a shift count, and #6-7 could be paired to handle large (32-bit) values. Of the seven pointer registers, only #1-3 could be used in indexed addressing modes. Because of this grunginess, my compiler used the first three regs in each set as expression temps and used the rest for dedicated purposes (frame pointer, return linkage, common storage, etc.). Going beyond statement-level allocation would probably have been pointless, but a good peephole optimizer would have helped a bit.
Stan Switzer sjs@ctt.bellcore.com
cquenel@polyslo.CalPoly.EDU (18 more school days) (05/12/89)
In 9851 neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) writes: (a very informative and interesting summary of a paper on register allocation)
>The compiler used was for the not-so-widely-known DECWRL Titan RISC machine,
>a ECL RISC with 64 non-windowed registers and a single cycle load.
                                                 ^^^^^^^^^^^^^^^^^
This could have something to do with the degree of speed-up obtained by removing these loads. I don't think the savings get very much bigger here, but on RISC machines with a 1 or 2 cycle delay on a load from memory, the difference could be more significant. (And don't forget those nasty cache misses, which could be avoided once in a while.)
-- 
@---@ ----------------------------------------------------------------- @---@
\. ./ | Chris (The Lab Rat) Quenelle      cquenel@polyslo.calpoly.edu |  \. ./
 \ /  | You can keep my things, they've come to take me home  --  PG  |   \ /
==o== ----------------------------------------------------------------- ==o==
a464@mindlink.UUCP (Bruce Dawson) (05/12/89)
One thing that needs to be kept in mind when talking about the advantages of huge numbers of registers is that some of the advantages of registers go away when you have a lot (when you have a lot available simultaneously, I should say). In the extreme case of the computer someone mentioned that had 256 registers, a register-register operation would use up sixteen bits just to specify the two registers involved. Contrast that with the six bits required if you only have eight registers. Given the finite memory speeds that we have to deal with, an extra ten bits so that you can have 256 instead of eight registers is probably too big a price to pay and would probably slow programs down. Speaking from my experience on the 68000, I would say that for assembly language programming, 16 registers is generally enough, with more being occasionally desirable. Because C can't do a very good job of comprehending an entire substantial subroutine or program and deciding what should be in registers (actually, replace 'C' with HLL), I would think that a compiler could actually get by with slightly fewer. One thing to keep in mind is that the 80386's eight registers are actually eight, whereas the 680x0 family's sixteen are usually more like thirteen or fourteen (by the time you use one for a stack pointer, one for a data segment pointer and one for a stack frame pointer). .Bruce.
neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) (05/12/89)
Reference: 10189,9851

Yes, the savings have to do with the load delays from the data cache. Another table in the paper shows the relative savings for a certain allocation method with 52 registers and varying data cache latencies:

Program      Data cache speed in cycles
               1      2      3      4      5
------------------------------------------------
Simulator    12 %   19 %   24 %   27 %   29 %
Verifier     10 %   15 %   19 %   21 %   23 %

So there is an increasing potential for savings with longer data cache latencies, but a faster cache will always help you on those nasty array references (combined with block refills); have a look at the R2000/R3000 speed/MHz numbers for these effects. And given the availability of ridiculously fast static RAMs (I have a Cypress Semiconductor catalogue in front of me: CY100E474 BiCMOS RAM, 4k, 3 ns...) there doesn't seem to be a good reason not to have at least the first-level cache be single cycle, even if you go to > 100 MHz clocks.
henry@utzoo.uucp (Henry Spencer) (05/14/89)
In article <259@mindlink.UUCP> a464@mindlink.UUCP (Bruce Dawson) writes: >...Because C can't do a very good job of comprehending an >entire substantial subroutine or program and deciding what should be in >registers (actually, replace 'C' with HLL) I would think that a compiler could >actually get by with slightly less... Speak for your own compilers, please. Some folks' compilers do an *excellent* job of comprehending an entire subroutine, or even an entire file, of C or other HLL and deciding what should be in registers. -- Subversion, n: a superset | Henry Spencer at U of Toronto Zoology of a subset. --J.J. Horning | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
brooks@vette.llnl.gov (Eugene Brooks) (05/15/89)
In article <259@mindlink.UUCP> a464@mindlink.UUCP (Bruce Dawson) writes:
> One thing that needs to be kept in mind when talking about the advantages
>of huge numbers of registers is that some of the advantages of registers go
>away when you have a lot (when you have a lot available simultaneously I should
>say). In the extreme case of the computer someone mentioned that had 256
>registers, a register-register operation would use up sixteen bits just to
>specify the two registers involved. Contrast that with the six bits required
>if you only have eight registers. Given the finite memory speeds that we have
Actually, if you have three-register operations, you chew up 24 bits for the register fields. Add, say, an 8-bit operation code and you have 32 bits for each basic computation instruction. This is what we had for the instruction set of the Cerberus multiprocessor simulator. The instructions which need a static offset for addressing are even wider, another 32 bits. With the huge size of code caches these days I don't see a problem here. brooks@maddog.llnl.gov, brooks@maddog.uucp
elg@killer.Dallas.TX.US (Eric Green) (05/15/89)
in article <259@mindlink.UUCP>, a464@mindlink.UUCP (Bruce Dawson) says:
> One thing that needs to be kept in mind when talking about the advantages
> of huge numbers of registers is that some of the advantages of registers go
> away when you have a lot (when you have a lot available simultaneously I should
> say). In the extreme case of the computer someone mentioned that had 256
> registers, a register-register operation would use up sixteen bits just to
> specify the two registers involved. Contrast that with the six bits required
> if you only have eight registers. Given the finite memory speeds that we have
> to deal with, an extra ten bits so that you can have 256 instead of eight
> registers is probably too big a price to pay and would probably slow programs
> down.
The AMD29000 addresses 256 registers (although there are only 192 actual registers -- 128 organized as a stack, the rest as global registers). Each AMD29000 instruction is 32 bits long. I seem to recall that it's a three-address machine instead of a two-address machine a la the 68000/80x86, so instructions are 8 bits of opcode and 24 bits of register addresses. How does this slow it down??? In fact, the 29000 is one of the faster RISCs out there, though it didn't catch on in the Unix workstation world (for one thing, the idea of kludging in byte-fetch logic externally probably turned off potential system designers). As I mentioned in a previous posting, program-memory bandwidth is almost unlimited on these kinds of high-end machines (large cache, Harvard-style separate program and data busses, possibly interleaved program memory and burst-mode DRAM accesses by the program cache controller to take advantage of sequential accesses, etc.).
The only glitches in the pipeline are a) fetching data from memory, and b) branches, so you want to reduce both as much as possible. This is why you want a lot of registers and (on the branch side) things like "smart" conditions to reduce the number of conditional branches, and branch target caches to minimize the effects when you do take a branch. Program memory bandwidth becomes almost inconsequential insofar as performance is concerned, under those conditions, and fixed-size 32-bit instructions greatly ease pipeline design. It's only when you get to low-end non-Harvard machines like the 68000 that program bandwidth becomes important. Conclusions: The program memory bandwidth increase resulting from increasing the number of addressable registers is inconsequential. Other factors, such as real-time responsiveness, compiler technology, and the number of gates needed to decode the register addresses (remember the Cray axiom -- minimum # of gates in critical paths) are more important in determining how many registers to put in your architecture. -- | // Eric Lee Green P.O. Box 92191, Lafayette, LA 70509 | | // ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg (318)989-9849 | | // Join the Church of HAL, and worship at the altar of all computers | |\X/ with three-letter names (e.g. IBM and DEC). White lab coats optional.|
tim@crackle.amd.com (Tim Olson) (05/15/89)
In article <8104@killer.Dallas.TX.US> elg@killer.Dallas.TX.US (Eric Green) writes: | (for one thing, the idea of kludging in | byte-fetch logic externally probably turned off potential system | designers). This is something that we have changed -- Revision 'C' Am29000's have fully-implemented byte and half-word loads and stores, with both signed and unsigned versions of loads. -- Tim Olson Advanced Micro Devices (tim@amd.com)
rpw3@amdcad.AMD.COM (Rob Warnock) (05/16/89)
In article <25633@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes: +--------------- | In article <8104@killer.Dallas.TX.US> elg@killer.Dallas.TX.US writes: | | (for one thing, the idea of kludging in byte-fetch logic externally | | probably turned off potential system designers). | This is something that we have changed -- Revision 'C' Am29000's have | fully-implemented byte and half-word loads and stores, with both signed | and unsigned versions of loads. +--------------- And it was done in a fully upwards-compatible way. Old code continues to run correctly on new chips. And the C compiler can still be told to generate any of the flavors of old, old-but-with-byte-write, or new code. Anyway, the compilers have always handled 29k byte-load and byte-store on word-only memory just fine [LOAD/EXTRACT or LOAD/INSERT/STORE]. All the published performance numbers (until Rev.C) were for *no* hardware byte support. And Unix doesn't care (both 4.3 & S5r3 run). I suspect the real reason you don't see 29k-based Unix workstations is that it just wasn't quite there during the window that a bunch of people were choosing their RISC/Unix CPU. "Missed the wave", as it were. [Who knows, it may see some Unix use yet, by those who'd rather not compete directly with other SPARC/MIPS/88k vendors.] Still, it makes a neat embedded controller... Rob Warnock Systems Architecture Consultant UUCP: {amdcad,fortune,sun}!redwood!rpw3 DDD: (415)572-2607 USPS: 627 26th Ave, San Mateo, CA 94403
pcg@aber-cs.UUCP (Piercarlo Grandi) (05/16/89)
In article <8905110956.AA12655@decwrl.dec.com> neideck@nestvx.dec.com (Burkhard Neidecker-Lutz) writes:
Summary: If your compiler is not antiquated (compared to this one, even
gcc looks bad), you can use gazillions of registers. The speedup
is on the order of 30%, so don't hold your breath.
Well, "gazillions" is a bit exaggerated, but 30% is not a bad improvement;
the curve though has a very mild slope at the end, so the knee comes
fairly early. Read on...
The compiler used was for the not-so-widely-known DECWRL Titan RISC
machine, an ECL RISC with 64 non-windowed registers and a single-cycle
load. The compiler does global register allocation (yes, global
variables in the sense of C) at link time and is a common backend for C,
Fortran and Modula 2.
If I remember correctly, the DECWRL machine is a Titan, and is the favourite
plaything of Paul Vixie :-) :-).
The number of registers does not apply to the expression evaluation and
address generation registers (this seems to be the 4-7 registers people
have been talking about so far)
I am not surprised...
but those used by the optimizer to hold things.
It analyzes all scalar variables into non-conflicting groups and tries
to allocate those to registers. From the introduction:
"When we use our method for 52 registers, our benchmarks speed up
by 10 to 25 %. Even with only 8 registers, the speedup can be
nearly that large, if we use previously collected profile information
to guide the allocation.
I thank you enormously. This quote greatly cheers me up. It is very good
support for my position on "register" (where a competent programmer is
expected to do this, with great simplification of the compiler, because
compilers cannot possibly know which variables are the most frequently used
dynamically).
We cannot do much better, because programs whose variables all fit
in registers rarely speed up by more than 30 %.
Excellent. I can surmise that this is simply because there are not that many
hot spots in a program...
[ ... many interesting numbers ... ]
This shows that his scheme is very efficient in removing these memory
references. Please note that given the enormous "hit rate" he has and
given the not so impressive speedups he got, the overall percentage of
scalar memory references cannot be that big versus accesses to bigger
data structures.
I am not surprised either. It is the old rule of hot spots on the one hand,
and of the desirability of using vector architectures for vector code on
the other...
Now the interesting tables. What happens if you use fewer registers ?
The following table shows the speed improvements with 52, 32 and 8 registers.
All of these performance measures are the relative improvement the programs
took with global register allocation guided by profile information relative
to "naive register allocation".
52 32 8
-----------------------------
Livermore 19 % 18 % 12 %
Whetstone 10 % 10 % 5 %
Linpack 13 % 13 % 10 %
Stanford 28 % 27 % 20 %
Simulator 16 % 15 % 8 %
Verifier 19 % 16 % 7 %
This does not surprise me either. I would of course go for 8 registers....
Note that these 8 global registers, chosen using profiling, would probably
be more than plenty, instead of just adequate, on a reg-mem machine... I remain
solid in my reckoning that for a reg-mem machine 8 registers overall is just
adequate, and 16 is plenty, with some increment required for reg-reg.
There is another very interesting paper by David comparing register window
schemes of varying organization with this global allocation stuff and this
seems to suggest that a slightly bigger global register file beats register
windows
The 29k guys will cheer...
if you are willing to use these extremely advanced compilation
techniques.
Or the "register" keyword, and you are a competent programmer. Possibly
extended to global variables...
The paper appeared in Proc. of the SIGPLAN 1988 Conference on
Programming Language Design and Implementation, June 1988. The paper's
title is "Register Windows vs. Register Allocation". It's way too long to
reproduce here and the graphics in there are much nicer than anything I
can type here.
It is very good indeed. But the question of course is always whether a
statically allocated large register file can still be called a register
file, and not rather a first-level memory; if it is so large that you
can store essentially all the variables in it for virtually all of
the program (i.e. use it part as data and part as stack), then I beg
to submit that you have a two-level mem-mem architecture. Maybe, as
the AMD 29,000 has a couple hundred registers, the AMD 290,000 will have
two thousand, and paging/swapping of registers, etc... :-).
--
Piercarlo "Peter" Grandi | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (05/16/89)
In article <259@mindlink.UUCP> a464@mindlink.UUCP (Bruce Dawson) writes:
>say). In the extreme case of the computer someone mentioned that had 256
I was the person who mentioned 256 registers. The machine is the Cyber 205.
>registers, a register-register operation would use up sixteen bits just to
>specify the two registers involved. Contrast that with the six bits required
>if you only have eight registers.
The register-to-register instructions on the 205 were 32 bits long, the same length as most of today's current crop of machines. One byte went to an 8-bit opcode, and three bytes went to specify the two input and one output register of each instruction.
>to deal with, an extra ten bits so that you can have 256 instead of eight
>registers is probably too big a price to pay and would probably slow programs
>down.
I agree that 256 registers were probably too many, because the compiler typically had trouble using more than about 60. A 64-register machine would have been adequate, given what compilers can generally make use of. I note that these were 64-bit registers. I am not sure what you mean by "slow programs down". Certainly, RISC machines have bigger code than compact-code CISCs like the VAX and NS32000. But the RISC machines have generally been significantly faster when implemented in the same technology than the corresponding CISC machines. By your reasoning, the Intel iAPX 432 should have been the fastest machine around, since it used bit-variable-length instructions which could begin on any bit.
> Speaking from my experience on the 68000, I would say that for assembly
>language programming, 16 registers is generally enough, with more being
>occasionally desirable.
This statement has been repeated in various ways by several people over the last few weeks. The problem with this is that the compilers referred to, and the assembly code and coders referred to, ASSUME that registers are for expression evaluation only.
We have seen evidence that C expression evaluation does not require more than 16 registers, but that DOES NOT imply that substantial improvements in code speed cannot be made by making "global" (beyond one statement) register assignments. This is not even new; commercial products have been doing it for at least 15 years.
> Because C can't do a very good job of comprehending an
>entire substantial subroutine or program and deciding what should be in
>registers (actually, replace 'C' with HLL) I would think that a compiler could
>actually get by with slightly less.
The postings on register usage by the AMD 29000 C compiler were representative of what I have read about most modern C compilers. Clearly, compilers today can make use of 20-30 registers to advantage in C. When dealing with double precision, Fortran, etc., I suspect a case could be made (it needs investigation) for 64 32-bit registers. Hugh LaMaster, m/s 233-9, UUCP ames!lamaster NASA Ames Research Center ARPA lamaster@ames.arc.nasa.gov Moffett Field, CA 94035 Phone: (415)694-6117
danm@amdcad.AMD.COM (Daniel Mann) (05/16/89)
In article <25635@amdcad.AMD.COM> rpw3@amdcad.UUCP (Rob Warnock) writes: +--------------- | ... And Unix doesn't care (both 4.3 & S5r3 run). I suspect the real | reason you don't see 29k-based Unix workstations is that it just wasn't | quite there during the window that a bunch of people were choosing their | RISC/Unix CPU. "Missed the wave", as it were. [Who knows, it may see | some Unix use yet, by those who'd rather not compete directly with other | SPARC/MIPS/88k vendors.] +--------------- AMD wishes to support customers developing UNIX systems based upon the Am29000. Therefore, the complete source code for a port of 4.3bsd UNIX to the Am29000 is available to any licensed AT&T user. The 4.3bsd Berkeley UNIX currently runs on the AMD PC Execution Board (PCEB29K), which plugs into the PC-bus of an IBM-PC/XT/AT or compatible. It is a simple and inexpensive implementation of the Am29000 architecture, utilising a video-RAM memory. The PC is used as an I/O server. A detailed application note entitled "4.3bsd Unix and the PCEB" is available which discusses the 4.3 port. It is suitable reading for experienced UNIX system programmers wishing to understand Am29000 issues. Designers of new Am29000 systems may wish to evaluate the experience gained in the PCEB port before committing to any new hardware design. For a copy of the application note or for further information about obtaining the Am29000 4.3bsd Unix software, call or write: CONTACT: Daniel Mann, Am29000 Software Development Supervisor ADDRESS: AMD, 901 Thompson Place, MS 6, PO Box 3453, Sunnyvale, CA 94088 PHONE: 408 7495058 E-MAIL: danm@amdcad.amd.com
rpw3@amdcad.AMD.COM (Rob Warnock) (05/18/89)
In article <952@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes: +--------------- | In article <8905110956.AA12655@decwrl.dec.com> neideck@nestvx.dec.com writes: | > There is another very interesting paper by David comparing register window | > schemes of varying organization with this global allocation stuff and this | > seems to suggest that a slightly bigger global register file beats register | > windows | The 29k guys will cheer... +--------------- Not really. Again, to clear up the slight misunderstanding, by current 29k software convention [with hardware support from the "gr1+local" address adder], the Am29000 *is* a (variable-sized) register window machine. The "local" regs are allocated with a strict stack discipline, which, as Neideck implies, may not be quite as good as a super-global register allocation strategy. However, with much less than 1% of cycles being spent on register cache spilling/filling (all benchmarks except "Ackerman"), it's probably good enough. [And of course since the current register discipline *is* a software convention, nothing prohibits a future compiler from switching to a super-global allocation strategy...] Rob Warnock Systems Architecture Consultant UUCP: {amdcad,fortune,sun}!redwood!rpw3 DDD: (415)572-2607 USPS: 627 26th Ave, San Mateo, CA 94403
andrew@frip.WV.TEK.COM (Andrew Klossner) (05/18/89)
[] "Larger register sets pay you back when you do dataflow analysis and allocate registers over larger spans of code than C expression statements ... If you only have a handful of registers, though, you might as well keep it simple." Dataflow analysis is good for more than register allocation, and can be a win on a register-starved machine. For example, when compiling this code (which swaps "x" and "y"): temp := x; x := y; y := temp; if dataflow analysis tells you that "temp" is dead after this fragment, you can avoid a store operation, and you've won even if you have only two registers. -=- Andrew Klossner (uunet!tektronix!orca!frip!andrew) [UUCP] (andrew%frip.wv.tek.com@relay.cs.net) [ARPA]
andrew@frip.WV.TEK.COM (Andrew Klossner) (05/18/89)
[On the original 29k chips, which had no byte load/store:] "Anyway, the compilers have always handled 29k byte-load and byte-store on word-only memory just fine [LOAD/EXTRACT or LOAD/INSERT/STORE]. And Unix doesn't care ..." My Unix kernel device drivers sure care! I have to talk to peripheral chips, some of whose registers require word-wide load/store and some of which require byte-wide. I can't implement byte-wide with word-wide because of the side effects. The extra logic to solve this problem is not pretty. The 29k looks like a nice chip, though, if you deal only with well-behaved peripheral interfaces. -=- Andrew Klossner (uunet!tektronix!orca!frip!andrew) [UUCP] (andrew%frip.wv.tek.com@relay.cs.net) [ARPA]
davecb@yunexus.UUCP (David Collier-Brown) (05/18/89)
In article <952@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes: [responding to some factual information on register usage]
>I thank you enormously. This quote greatly cheers me up. It is very good
>support for my position on "register" (where a competent programmer is
>expected to do this, with great simplification of the compiler, because
>compilers cannot possibly know which variables are the most frequently used
>dynamically).
True if you define "compiler" narrowly (ie, normally). However, if you think of an optimization "environment", you can justify feeding the results of a profile back into the compiler to guide register allocation. I would tend to structure this as an optional "pass" (well, pseudo-pass) for purely practical reasons... And if you want to experiment with the idea right now, consider writing an output processor for a profiler which orders variables by use in individual functions, and then another utility to rewrite the source code with register declarations for the X hottest variables per function iff the call cost/run cost ratio would not be deranged thereby. (It is fairly hard (:-)). --dave (only two more days to monomania!) c-b -- David Collier-Brown, | {toronto area...}lethe!dave 72 Abitibi Ave., | Joyce C-B: Willowdale, Ontario, | He's so smart he's dumb. CANADA. 223-8968 |
rpw3@amdcad.AMD.COM (Rob Warnock) (05/23/89)
In article <3355@orca.WV.TEK.COM> andrew@frip.WV.TEK.COM writes: +--------------- | [On the original 29k chips, which had no byte load/store:] | "Anyway, the compilers have always handled 29k byte-load and byte-store | ... And Unix doesn't care ..." | My Unix kernel device drivers sure care! I have to talk to peripheral chips, | some of whose registers require word-wide load/store and some of which require | byte-wide. I can't implement byte-wide with word-wide because of the side | effects. The extra logic to solve this problem is not pretty. +--------------- True, which was why they changed the chip. But note that the byte-access bits came out to the pins, even on the "non-byte-read/write" chips. That is, even though the memory interface was nominally word-wide, the byte indicators in the load and store instructions and the LSBs of the addresses came out, and could be used by external hardware, if desired. So assembly-language code could always do the right thing. It's just that the code generated by the earlier C compilers couldn't, since they "knew" that "all" memory was word-wide. [And now that's fixed, so it's moot.] By the way, another common trick [used even on other CPUs, such as 68000's] was to simply put byte-wide chips on one byte rail on the bus (pick one... for a couple of reasons, the *high*-order usually turned out to be the most convenient on the 29k), and to "don't care" the LSBs of the CPU. That is, wire CPU A2 to I/O chip A0, etc. Then, simply write the device header file to name the I/O chip's registers as being 4 bytes apart... +--------------- | The 29k looks like a nice chip, though, if you deal only with well-behaved | peripheral interfaces. 
+---------------

And the fact that the lockstep instruction pipelining is "exposed" to the programmer means that (at least in the present 29k family) you can predict what sequence of bus accesses a given instruction sequence will generate, which also makes it really nice when dealing with not-so-well-behaved interfaces...

Rob Warnock
Systems Architecture Consultant
UUCP: {amdcad,fortune,sun}!redwood!rpw3
DDD: (415)572-2607
USPS: 627 26th Ave, San Mateo, CA 94403
mcg@mipon2.intel.com (05/30/89)
In article <25382@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>In article <259@mindlink.UUCP> a464@mindlink.UUCP (Bruce Dawson) writes:
>
>>say). In the extreme case of the computer someone mentioned that had 256
>>registers,
>
>I agree that 256 registers were probably too many, because the compiler had
>trouble using more than about 60 typically.
>
>I am not sure what you mean by "slow programs down". Certainly, RISC machines
>have bigger code than compact code CISC's like VAX and NS32000. But the RISC
>machines have generally been significantly faster when implemented in the
>same technology than the corresponding CISC machines.

One thing that no one has yet pointed out is that a reason not to implement huge directly-addressable register files is that, in any reasonable implementation, the register file must be multi-ported. A six-ported register cell is about 5x the size of a single-ported register cell or a cache RAM cell. To more fully utilize micro-parallelism in an architecture, more sources and results need to be fetched from the register file simultaneously, thus the additional ports.

This, IMHO, is one of the greatest flaws of the 29k - it exposes 192 (actually 256) architectural registers. In the current implementation they are (I believe) 3-ported, and even now occupy a large amount of the 29k die space. I believe that they will run into serious problems if they ever attempt to dispatch and execute multiple general instructions per cycle.

The 960, on the other hand, exposes 32 general registers architecturally, but because 16 of these are "local", and saved/restored on call/return to/from architecturally-hidden resources, we can easily move from a cheap (1-ported by 4 sets) register implementation, to a very fast one (6-ported by 8+ sets) in high-performance implementations. The cut line is different on every architecture - 32 is sufficient on the 960, but I am not disagreeing with Wall's estimate of 64.
Certainly in floating-point intensive scientific applications dominated by double-precision arithmetic in loops, more registers are needed. But substantially more than 64 seems to limit architectural flexibility quite severely.

S. McGeady
Intel Corp.
tim@crackle.amd.com (Tim Olson) (05/30/89)
In article <m0fRx4x-0001fDC@mipon2.intel.com> mcg@mipon2.UUCP (Steven McGeady) writes:
| One thing that no one has yet pointed out is that a reason not to implement
| huge directly-addressable register files is that, in any reasonable
| implementation, the register file must be multi-ported. A six-ported register
| cell is about 5x the size of a single-ported register cell or a cache RAM
| cell. To more fully utilize micro-parallelism in an architecture, more
| sources and results need to be fetched from the register file simultaneously,
| thus the additional ports.

I don't see why this is a reason not to implement large register files. You need to apply transistors where they will do the most good. Note that processors that attempt to "more fully utilize micro-parallelism" also tend to want to have more general-purpose registers available to maintain full performance.

| This, IMHO, is one of the greatest flaws of the 29k - it exposes 192
| (actually 256) architectural registers. In the current implementation they
| are (I believe) 3-ported, and even now occupy a large amount of the 29k
| die space. I believe that they will run into serious problems if they
| ever attempt to dispatch and execute multiple general instructions per cycle.

Well, we see no problems in either our 2nd or 3rd generation parts...

| The 960, on the other hand, exposes 32 general registers architecturally,
| but because 16 of these are "local", and saved/restored on call/return
| to/from architecturally-hidden resources, we can easily move from a cheap
| (1-ported by 4 sets) register implementation, to a very fast one
| (6-ported by 8+ sets) in high-performance implementations.

So you *will* be looking at large register files (128+, 6-ported) for high performance.

--
Tim Olson
Advanced Micro Devices
(tim@amd.com)
bcase@cup.portal.com (Brian bcase Case) (05/31/89)
>A six-ported register cell is about 5x the size of a single-ported
>register cell or a cache RAM cell.

This could be, depending on technology details, of course. (The width of metal dominates, usually?)

>This, IMHO, is one of the greatest flaws of the 29k - it exposes 192
>(actually 256) architectural registers. In the current implementation they
>are (I believe) 3-ported, and even now occupy a large amount of the 29k
>die space. I believe that they will run into serious problems if they
>ever attempt to dispatch and execute multiple general instructions per cycle.

I can speak to this point with a little authority. The register file in the current implementation is indeed 3-ported. I must admit that I have had second thoughts about making the register file so big. The advantages are many, but the cost of additional ports is indeed bigger than for other architectures (boy, the '386 architecture has a leg up here! :-). The current 29K implementation has about 1/5 of the usable (i.e., non-pad ring) die area dedicated to the register file using 1.26 (Dave Witt: what is it really?) micron technology. At about 1 micron, that will shrink to about 1/9 of the usable area, and at .8 micron, about 1/11. Increasing the number of ports to 6, let's say, will increase the size by about a factor of 1-2/3 (3x a single-ported cell to (using Steve's number) about 5x a single-ported cell). Thus, at 1 micron, the register file will use about 18% of the usable die area, and at .8 micron about 15% of the area. This is not an insignificant amount of area, but it is not "too" much in my opinion. Why? Because having lots of registers IS GOOD. If the 29K spends more area on a 6-ported data storage resource than other processors, I think it's an advantage as long as it's not taking "too" much area! Up to a point, I would rather spend area on a 6-ported resource than a 1-ported resource (cache). I guess we are talking about where is the "point" in "up to a point."
On the other hand, a 6-ported, 32-register file would be about (I am guessing by simple scaling) 2% to 5% of the usable die area. On the 960, some of the 13% difference is used for the register file "backing store", but not much, say another 3% (I dunno what it really is). So, the 960 has a 10% "bonus" chunk of die area. What can be done with it? At some points on the technology curve, it will have a larger cache than the 29K at the same point. At some other points, the 10% die area will "only" result in a smaller die because 10% isn't enough to increase the size of a cache or an FP somethingorother (but 10% smaller die are cheaper die, depending on yield, greediness, etc.). So the 960 seems to have a slight implementation advantage. What's the real difference? I don't know, because 10% is a small enough amount that it can be lost in the "noise" of implementation! If one guy uses automatic tools and the other uses full custom design/layout, some difference will result. Also, just due to "the way things are" the 10% might not be usable. Some die are square while some are rectangular because of "the way things are."

I am making this argument in full recognition of an argument about CISCs that I used to believe: "CISCs will always have an implementation disadvantage because of the microcode ROM." This is bogus for the same reason: as the technology improves, the ROM itself shrinks until it can no longer be seen with the naked eye! What constitutes either a current RISC processor or a current CISC processor (the PROCESSOR pipeline, not the caches, TLBs, etc.) would be a very small corner on the die if implemented in the technology of 1995. However, what constitutes a current RISC or CISC processor will be wholly uninteresting in 1995. For the implementations to come, including superscalar stuff, the issues will be the complexity of implementation and the cost (read: people) of realizing that implementation. This is where CISCs, I believe, will falter.
One of the great advantages of RISCs is that they are conceptually easier to implement. This effect is compounded with increasing ambitions for greater performance: the fewer interactions between instructions, the easier a multi-instruction-per-cycle implementation will be to construct. In 1995, there might be another reactionary simplification movement in computer architecture; maybe current RISCs are too complex! Too many special cases! But I digress... Size will still matter, but I believe it will be dominated by the many connections (buses) and small structures required to handle the special cases and resource interactions, not the larger, regular structures like register files and caches (although we will still be trying to fit as much cache as possible). Thus, the 14-ported 192-register file of the 29K will still be larger than the 14-ported 32-register files of the 960, MIPS, i860, etc., but it won't matter because the 29K's register file will be 1% of the die while the 960's will be 0.1%. (BTW, the register file will be the center of the processor, not at one end of the data path as it is now.) So I believe.

>The 960, on the other hand, exposes 32 general registers architecturally,
>but because 16 of these are "local", and saved/restored on call/return
>to/from architecturally-hidden resources, we can easily move from a cheap
>(1-ported by 4 sets) register implementation, to a very fast one
>(6-ported by 8+ sets) in high-performance implementations.

This indeed gives more flexibility in implementation choices. The 29K's register file gives more flexibility in choices for using the register file. Only time will tell if more advantage is gained from having implementation choices or from use choices. If the belief is that, in the end, business issues dominate, maybe the business issue of cost is more important.

>The cut line is different on every architecture - 32 is sufficient on the
>960, but I am not disagreeing with Wall's estimate of 64.
>Certainly
>in floating-point intensive scientific applications dominated by
>double-precision arithmetic in loops, more registers are needed. But
>substantially more than 64 seems to limit architectural flexibility
>quite severely.

That should be "more than 64 seems to limit *implementation* flexibility." The 29K has more architectural flexibility (the register file can be used as a stack cache, a flat pool of 192 registers, or as a few pools of a smaller number of registers. Is this important? I dunno yet.).

These are just a few of my opinions mixed up with some pseudo-facts. Don't believe any of them!

"I did it my way." - Sinatra.
lars@salt.acc.com (Lars J Poulsen) (05/31/89)
In article <25382@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>>I agree that 256 registers were probably too many, because the compiler had
>>trouble using more than about 60 typically.

In article <m0fRx4x-0001fDC@mipon2.intel.com> mcg@mipon2.UUCP (Steven McGeady) writes:
>This, IMHO, is one of the greatest flaws of the 29k - it exposes 192
>(actually 256) architectural registers.
>
>The 960, on the other hand, exposes 32 general registers architecturally,

From a humble applications programmer, who occasionally has written a bit of kernel code: The biggest pain with an architecture that exposes too large a register file is saving and restoring on context switches. While interrupt service routines can ignore this and store only what they need, context switches require storing of the entire register set. Or do people really feel that the processors today are fast enough that scheduling pre-emption is too rare to influence the selection of register file size?

Many years ago I switched from working on (then) Univac-1100 to PDP-11's and VAXen; the 1100 had about 44 visible registers in the user set; few programs really used more than half of them; worse yet, they were asymmetric, with different properties between the three major groups.

The VAX instruction set got more mileage out of its 16 general registers than the Univac got out of its 44, and saved many cycles on register save/restores.

/ Lars Poulsen <lars@salt.acc.com> (800) 222-7308 or (805) 963-9431 ext 358
ACC Customer Service
Affiliation stated for identification only
My employer probably would not agree if he knew what I said !!
brooks@vette.llnl.gov (Eugene Brooks) (05/31/89)
In article <m0fRx4x-0001fDC@mipon2.intel.com> mcg@mipon2.UUCP (Steven McGeady) writes:
>The cut line is different on every architecture - 32 is sufficient on the
>960, but I am not disagreeing with Wall's estimate of 64. Certainly
>in floating-point intensive scientific applications dominated by
>double-precision arithmetic in loops, more registers are needed. But
>substantially more than 64 seems to limit architectural flexibility
>quite severely.

The cut line for scratch registers is directly proportional to the memory latency. The longer your latency, the more registers, and concurrently handled computation, you need to mask it.

brooks@maddog.llnl.gov, brooks@maddog.uucp
khb@gammara.Sun.COM (Keith Bierman - SPD Languages Marketing -- MTS) (05/31/89)
In article <1RcY6x#64Zq3Y=news@anise.acc.com> lars@salt.acc.com (Lars J Poulsen) writes:
[... # regs on different machines ...]
>From a humble applications programmer, who occasionally has written a bit
>of kernel code: The biggest pain with an architecture that exposes too
>large a register file is saving and restoring on context switches. While
>interrupt service routines can ignore this and store only what they
>need, context switches require storing of the entire register set. Or do
>people really feel that the processors today are fast enough that
>scheduling pre-emption is too rare to influence the selection of
>register file size ?
>
>Many years ago I switched from working on (then) Univac-1100 to PDP-11's
>and VAXen; the 1100 had about 44 visible registers in the user set; few
>programs really used more than half of them; worse yet, they were
>asymmetric, with different properties between the three major groups.
>
>The VAX instruction set got more mileage out of its 16 general registers
>than the Univac got out of its 44, and saved many cycles on register
>save/restores.

"Modern" compilers tend to do global (well, misnomer: all of a procedure at once) analysis; older compilers tended to look at smaller bits of code (a block, maybe interblock) ... so "modern" compilers have the opportunity to do better. On scientific application codes the bodies of the modules are large enough, and there are enough other effects, that saving a considerable number of registers is not necessarily a large component of the total cost (on one machine, in a former life, it tended to be less than 5% of the call cost). If one inlines "leaf" nodes, very respectable numbers of registers can get used.

For those who are slightly twisted, the Cydra 5 register usage, documented in the recent ASPLOS-III proceedings, shows how the combination of software pipelining and VLIW can eat up registers quite quickly.

Upshot: modern compilers can employ as many registers as you can design in.
If you got 'em, you gotta figure out a way to save 'em. Windows, "multiconnect", register pointers, special purpose (including vector) and other solutions are workable.

Naive rationale for infinite (as long as they are free) registers: constant propagation and common subexpressions, combined with loop unrolling (or more advanced techniques like percolation scheduling) result in a need for as many as you got ... and then some.

OS is different from application programs, and it is not clear that one might not be better off with a special OS register allocation scheme.

Keith H. Bierman    | *My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault   | Marketing Technical Specialist ! kbierman@sun.com
I Voted for Bill &  | Languages and Performance Tools.
Opus                | (* strange as it may seem, I do more engineering now *)
firth@sei.cmu.edu (Robert Firth) (05/31/89)
In article <1RcY6x#64Zq3Y=news@anise.acc.com> lars@salt.acc.com (Lars J Poulsen) writes:
>From a humble applications programmer, who occasionally has written a bit
>of kernel code: The biggest pain with an architecture that exposes too
>large a register file is saving and restoring on context switches.

It's more than a pain; it can cause a severe performance penalty in some application domains. This is one of the reasons I don't like register-file machines: ALL the registers have to be saved and restored across a full context switch, and there are an awful lot of them. Moreover, a hardware instruction isn't much help, since the holdup is the actual data copy to and from memory, not the instruction fetch.

For this reason, I think people with hard real-time applications should look pretty carefully at context switch costs.

>While
>interrupt service routines can ignore this and store only what they
>need, context switches require storing of the entire register set.

There's another pitfall here. Some machines use the same register window mechanism to handle interrupts, automatically shifting to a new set. If the interrupt hits you at just the wrong call depth, you get an automatic register spill. So about one-sixth or one-fourth of the time, it takes maybe three times as long overall to field the interrupt and maybe thirty times as long to reach the first instruction of the handler. Since hard real time people care about predictability next only to performance, this gives loads of grief.
frazier@oahu.cs.ucla.edu (Greg Frazier) (06/01/89)
In article <3427@bd.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
>In article <1RcY6x#64Zq3Y=news@anise.acc.com> lars@salt.acc.com (Lars J Poulsen) writes:
>
>>From a humble applications programmer, who occasionally has written a bit
>>of kernel code: The biggest pain with an architecture that exposes too
>>large a register file is saving and restoring on context switches.
>
>It's more than a pain; it can cause a severe performance penalty in
>some application domains. This is one of the reasons I don't like
>register-file machines: ALL the registers have to be saved and
>restored across a full context switch, and there are an awful lot
>of them. Moreover, a hardware instruction isn't much help, since
>the holdup is the actual data copy to and from memory, not the
>instruction fetch.

Actually, this is not entirely true. While you do have to save all currently used registers, it is actually optimal to only restore a single window within the register file (if you are using windows - if you're not, then yes, all currently used must be restored). In addition, if you have a cache, saving the registers is very fast (unless your cache is write-through). If the cache is context dependent, then the time required to flush the cache and the delay caused by a cold start will make the register save/restore time insignificant.

>For this reason, I think people with hard real-time applications should
>look pretty carefully at context switch costs.

Hard real-time applications always have problems with memory hierarchies - cache misses can be just as painful as register saving/restoring, and don't even mention page faults! :-)

>> While
>>interrupt service routines can ignore this and store only what they
>>need, context switches require storing of the entire register set.
>
>There's another pitfall here. Some machines use the same register
>window mechanism to handle interrupts, automatically shifting to
>a new set.
>If the interrupt hits you at just the wrong call depth,
>you get an automatic register spill. So about one-sixth or one-fourth
>of the time, it takes maybe three times as long overall to field the
>interrupt and maybe thirty times as long to reach the first instruction
>of the handler. Since hard real time people care about predictability
>next only to performance, this gives loads of grief.

Obviously, for realtime applications, you need to save a window (or some set of registers) for interrupt handling. And, equally obvious, you still run into problems if you get a high-priority interrupt while you are handling a high-priority interrupt. Hard realtime systems require a high degree of predictability (i.e. you have to thoroughly understand your application). As I mentioned before, introducing hierarchical memory makes the task of understanding MUCH more difficult. Add to this multiple processors (just to add another issue), and you have no choice but to do soft realtime, unless your application is VERY simple. ...And, if your application is that simple, you can (?) take register behavior into account (I think - I have not done any research in hard realtime, so I'm just blathering).

Greg Frazier
&&&&&&&&&&&&&&&&&&!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!((((((((((((((((((
Greg Frazier  o        Internet: frazier@CS.UCLA.EDU
CS dept., UCLA /\      UUCP: ...!{ucbvax,rutgers}!ucla-cs!frazier
           ----^/----
               /
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (06/01/89)
In article <1RcY6x#64Zq3Y=news@anise.acc.com> lars@salt.acc.com (Lars J Poulsen) writes:
>large a register file is saving and restoring on context switches. While
>interrupt service routines can ignore this and store only what they

From time to time, I have seen the argument that large register files slow down context switches. Yes and no. If you are looking at general purpose operating systems like Unix, register saves are a very small part of the cost of a context switch. Last week I saw an ad for a real time version of Unix with a guaranteed response time (switch to kernel to high priority process) of 5 milliseconds. Now, for the sizes of register files we are talking about, a register save is about 1/1000th to 1/100th of the total time.

If you are looking at embedded real time systems with no general purpose kernel, a large number of registers could be a problem. Where it gets "interesting" is if you are trying to support microtasking at the outer do-loop level in Fortran from within the kernel. Well, there is no such thing as a free lunch. I guess you just have to design carefully when you get near the edge...

Hugh LaMaster, m/s 233-9,    UUCP ames!lamaster
NASA Ames Research Center    ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035
Phone: (415)694-6117
lars@salt.acc.com (Lars J Poulsen) (06/01/89)
>In article <1RcY6x#64Zq3Y=news@anise.acc.com> lars@salt.acc.com (Lars J Poulsen) writes:
>[... # regs on different machines ...]
>>The biggest pain with an architecture that exposes too
>>large a register file is saving and restoring on context switches. While
>>interrupt service routines can ignore this and store only what they
>>need, context switches require storing of the entire register set. Or do
>>people really feel that the processors today are fast enough that
>>scheduling pre-emption is too rare to influence the selection of
>>register file size ?

In article <107302@sun.eng.sun.com> khb@sun.UUCP (Keith Bierman - SPD Languages Marketing -- MTS) writes:
>Upshot: modern compilers can employ as many registers as you can
>       design in. If you got 'em, you gotta figure out a way to
>       save 'em. Windows, "multiconnect", register pointers, special
>       purpose (including vector) and other solutions are workable.
>
>Naive rationale for infinite (as long as they are free) registers:
>
>constant propagation and common subexpressions, combined with loop
>unrolling (or more advanced techniques like percolation scheduling)
>result in a need for as many as you got ... and then some.
>
>OS is different from application programs, and it is not clear that
>one might not be better off with a special OS register allocation
>scheme.

What I am talking about is saving and restoring registers when the operating system switches from one user process to another. This is not something that the compiler can improve on. Maybe the architecture could define a "highest register currently in use" pointer, and encourage context switches just before it gets incremented?

/ Lars Poulsen <lars@salt.acc.com> (800) 222-7308 or (805) 963-9431 ext 358
ACC Customer Service
Affiliation stated for identification only
My employer probably would not agree if he knew what I said !!
mash@mips.COM (John Mashey) (06/01/89)
In article <3427@bd.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
....
>of the handler. Since hard real time people care about predictability
>next only to performance, this gives loads of grief.

Actually, some hard real time people care about predictability more than performance. Maybe some knowledgeable R.T. folks out there might care to post something about this: it's an important area, and one that doesn't get discussed here as much as it deserves. In particular, there are of course serious implications of going faster (which usually implies more caching) and being less predictable (sometimes implied by caching). Maybe some real time folks could post some references, or some descriptions of the different flavors of real-time (as there are quite a few).

--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
jwest@pitstop.West.Sun.COM (Jeremy West) (06/01/89)
In article <3427@bd.sei.cmu.edu>, firth@sei.cmu.edu (Robert Firth) writes:
> This is one of the reasons I don't like
> register-file machines: ALL the registers have to be saved and
> restored across a full context switch, and there are an awful lot
> of them. Moreover, a hardware instruction isn't much help, since
> the holdup is the actual data copy to and from memory, not the
> instruction fetch.

In SunOS running on SPARC it only saves the register windows that you have actually used when it does a context switch. This is often all of them but not always. You have to fetch instructions to save registers, and the double-word load/store that SPARC uses reduces the number of instructions by half. Since this gives some assistance I think the SPARC architects decided that there wasn't much more to gain from a more specialised instruction.

> For this reason, I think people with hard real-time applications should
> look pretty carefully at context switch costs.

If you're that keen on context switching, the fastest thing around IMHO is the Transputer, which does it in 600ns at 20 MHz for a full switch between two low priority tasks. Of course it effectively has one register (sort of) and prevents you from leaving things in registers when certain instructions are executed :-}

With SPARC you can partition the register windows so that two or more tasks have separate sets of register windows. You can then get much faster context switches between a small number of tasks. In fact to do this effectively you need to *increase* the number of registers, and I think that some future SPARC chips (aimed specifically at realtime) will have more.

> Some machines use the same register
> window mechanism to handle interrupts, automatically shifting to
> a new set. If the interrupt hits you at just the wrong call depth,
> you get an automatic register spill.
> So about one-sixth or one-fourth
> of the time, it takes maybe three times as long overall to field the
> interrupt and maybe thirty times as long to reach the first instruction
> of the handler. Since hard real time people care about predictability
> next only to performance, this gives loads of grief.

SPARC always reserves one register window for taking traps and interrupts in. For an 8-window system the user code would only get 7 before having to save to memory. If you divide for two processes then they each get 3 windows for user processes and one for taking traps.

Adrian Cockcroft
Sun Cambridge UK TSE
sun!sunuk!acockcroft
(Borrowing Jerry West's account at Mountain View to get at USENET)
slackey@bbn.com (Stan Lackey) (06/01/89)
In article <18965@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>For the implementations to come, including
>superscalar stuff, the issues will be the complexity of implementation and
>the cost (read: people) of realizing that implementation. This is where
>CISCs, I believe, will falter. One of the great advantages of RISCs is
>that they are conceptually easier to implement.

In order to go superscalar, future RISC chips will end up having new, incompatible architectures. I make this statement because I understand the difficulty of the alternative, which is the execution of several existing-architecture instructions concurrently. I strongly suspect the RISC guys are, as we speak, analyzing stuff like:

  What kind of sequences frequently show up
  How to handle register dependencies and overutilized facilities
  Should totally independent instr's go in parallel only, or can dependent
    ones be issued and then pipelined (load to r0 followed by add to r0)

Let me guess what they're finding. Correct me if I'm wrong.

  Load/store followed by increment/decr of the address is common.
  Load followed by dependent use of the data is common.
  Decrement of register, followed by test (compare or just zero),
    followed by branch, is common.
  Addition to register followed by its use as an address.
  Bunch of loads in a row, bunch of stores in a row.
  [I could probably think of more with more time.]

Look familiar? This prompted my first reaction when RISC started getting popular: Where do they go from here? Yes, it's an interesting technology opportunity. But I'd rather implement the next generation machines as CISC's rather than RISC's, IF I HAVE TO STAY INSTRUCTION SET COMPATIBLE. In other words, I think that when they learn how to implement a fast CISC, it will go faster than the same-technology RISC. Why?
Because for the RISC to keep up, it will have to execute an average of two instructions per cycle, and one instruction is a lot easier to implement than two, even with auto-increment addressing. Of course, there is always the possibility to do what they did the first time: come out with a new architecture that matches the technology window exactly. Note recent product announcements: this just keeps happening! It's just that when you do this, expect a limited lifetime of the architecture. -Stan [Disclaimer: Do I have an opinion yet?]
earl@orac.mips.com (Earl Killian) (06/01/89)
Another problem with register windows is that they make it difficult to support coroutines. I typically do queuing simulations by having multiple coroutines representing separate entities, each with a separate stack. I've implemented this on VAXs, 68000s, and MIPS boxes with only 20-50 lines of assembler. But I can't see any way to do stack switching on a register window machine without a kernel call (yuck). Have any of the register window architectures implemented a way for user code to do stack switching?

--
UUCP: {ames,decwrl}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
cik@l.cc.purdue.edu (Herman Rubin) (06/01/89)
In article <20810@orac.mips.COM>, earl@orac.mips.com (Earl Killian) writes:
> Another problem with register windows is that they make it difficult
> to support coroutines. I typically do queuing simulations by having
> multiple coroutines representing separate entities, each with a
> separate stack. I've implemented this on VAXs, 68000s, and MIPS boxes
> with only 20-50 lines of assembler. But I can't see any way to do
> stack switching on a register window machine without a kernel call
> (yuck). Have any of the register window architectures implemented a
> way for user code to do stack switching?

Suppose you have two coroutines. Then the variables fall into three classes: those which belong to both, those which belong to the first, and those which belong to the second. Why not partition the register file accordingly? As far as stack switching, if the instructions have effectively a dedicated stack register, then that register must be saved and restored. But it is necessary to do this anyhow.

--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
baum@Apple.COM (Allen J. Baum) (06/01/89)
[] >In article <40718@bbn.COM> slackey@BBN.COM (Stan Lackey) writes: >I strongly suspect the RISC guys are, as we speak, analyzing stuff like: > What kind of sequences frequently show up >Let me guess what they're finding. Correct me if I'm wrong. > (..lots of pairs that look like single CISC instructions...) >Look familiar? This prompted my first reaction when RISC started >getting popular: Where do they go from here? Yes, it's an interesting >technology opportunity. But I'd rather implement the next generation >machines as CISC's rather than RISC's, IF I HAVE TO STAY INSTRUCTION >SET COMPATIBLE. In other words, I think that when they learn how to >implement a fast CISC, it will go faster than the same-technology >RISC. Why? Because for the RISC to keep up, it will have to execute >an average of two instructions per cycle, and one instruction is a lot >easier to implement than two, even with auto-increment addressing. > >Of course, there is always the possibility to do what they did the >first time: come out with a new architecture that matches the >technology window exactly. Note recent product announcements: this >just keeps happening! It's just that when you do this, expect a >limited lifetime of the architecture. > >-Stan [Disclaimer: Do I have an opinion yet?] Absolutely- maybe. I certainly agree that the current RISC philosophy is just a window in time. As design tradeoffs change (and they will), a lot of what currently is dogma will become rather obsolete. For example, just as compiler technology affected our design choices, silicon compiler technology might affect it. New device technologies (GaAs, new kinds of memory device) will affect the relative speeds of memory versus logic. Advances in compiler technology may take advantage of cache control operations to optimize cache hit ratios, or be able to compile for parallel machines. 
I'm not sure, however, that you can say with authority that it will be easier to implement a CISC type mem-reg architecture than it would be to execute two RISC style instructions at once (at the same speed), because the RISC approach lets you schedule the interactions at compile time, while the CISC is forced to do it at run time. Therefore, the RISC might keep up without going balls out for speed. -- baum@apple.com (408)974-3385 {decwrl,hplabs}!amdahl!apple!baum
pcg@aber-cs.UUCP (Piercarlo Grandi) (06/01/89)
In article <107302@sun.Eng.Sun.COM> khb@sun.UUCP (Keith Bierman - SPD Languages Marketing -- MTS) writes:
  "modern" compilers tend to do global (well, misnomer, all of a procedure
  at once) analysis; older compilers tended to look at smaller bits of
  code (block, maybe interblock) ... so "modern" compilers have the
  opportunity to do better.
It must be repeated for the Nth time that this is only true if spill
minimization is of paramount importance; if you look at speed, then most
spills avoided by global optimizers with large register sets don't make much
of a difference.
  Upshot: modern compilers can employ as many registers as you can design
  in.
But pointlessly... And even old compilers can just register everything in
sight; if there are many registers, then using an optimizer is not very
important. The hard work, as we have just discussed, is to cache *only* the
variables that matter, for *only* the section of code where they matter (and
this can be done by the programmer using "register" in C, or by the compiler
when fed with either "representative" profile data, or with calculations or
estimations of where hot spots lie). This typically requires many fewer
registers than minimizing spills regardless of whether they are expensive
ones or not.
Naive rationale for infinite (as long as they are free) registers:
^^^^^^^^^^^^^^^^^^^^^^^^
Unfortunately they are not free; more registers make the system stiffer, in
that they do raise the cost of multithreading, which is where OS technology
is finally heading (Mach, OS/2, etc...), and they do have costs in real
estate and even, possibly, cycle time lengthening (Cray's law). You only
need a handful of registers to capture most of the benefit of expression
optimization, and another to capture most of the benefits of intra-statement
optimization (whether you do it via "register" in C or leave it to the
compiler).
Large register banks are only justified for special purpose machines
(vector, VLIW, superscalar) where the only thing that matters is raw speed
in processing batched numeric codes where there is an inherent high degree
of parallelism in the algorithms employed.
--
Piercarlo "Peter" Grandi | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
rw@beatnix.UUCP (Russell Williams) (06/01/89)
In article <1RcY6x#64Zq3Y=news@anise.acc.com> lars@salt.acc.com (Lars J Poulsen) writes: >From a humble applications programmer, who occasionally has written a bit >of kernel code: The biggest pain with an architecture that exposes too >large a register file is saving and restoring on context switches. While >interrupt service routines can ignore this and store only what they >need, context switches require storing of the entire register set. Or do >people really feel that the processors today are fast enough that >scheduling pre-emption is too rare to influence the selection of >register file size ? It's been my experience in working with general purpose operating systems (not realtime monitors for embedded systems) that the register save time is negligible compared to the other software overhead involved in a task or process switch. Russell Williams ..uunet!elxsi!rw ..ucbvax!sun!elxsi!rw
pcg@aber-cs.UUCP (Piercarlo Grandi) (06/01/89)
In article <26204@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes: In article <1RcY6x#64Zq3Y=news@anise.acc.com> lars@salt.acc.com (Lars J Poulsen) writes: >large a register file is saving and restoring on context switches. While >interrupt service routines can ignore this and store only what they From time to time, I have seen the argument that large register files slow down context switches. Yes and no. If you are looking at general purpose operating systems like Unix, register saves are a very small part of the cost of a context switch. I seem to remember reading that Unix context switch time is 10% the actual switch, 90% the cost of calling the scheduler to choose the next running process. This can be obviated in three ways: [1] long timeslices for CPU bound processes (Unix uses 1 second); for io bound processes, process switches typically occur when the process goes to sleep, so calling the scheduler is almost inevitable. [2] a two level scheduler (e.g. MUSS); the short range scheduler only does context switching, always selecting the next process to run as the first in a short queue of very-runnable processes, while the medium term scheduler runs periodically to determine which of the runnable processes should belong to the very-runnable set. [3] like [2], but the short term scheduler is done by hardware, e.g. the very old Honeywell 8 contexts-in-a-ring mainframe. Using [1], the Unix strategy, large register files don't matter a lot, simply because (as Hugh LaMaster notes) we are accepting such colossal overheads that everything pales by comparison. But large register sets do severely impact the speed of switching in case [2], less in [3] (where the average cost may be low, because of the multiple hw contexts for the processes in the very-runnable queue, but variance is still large when changing its composition). 
>Last week I saw an ad for a real time version >of Unix with a guaranteed response time (switch to kernel to high priority >process) of 5 milliseconds. A note here: I saw similar ads in the past. 5 ms means 200 Hz. Booooooo! -- Piercarlo "Peter" Grandi | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk Dept of CS, UCW Aberystwyth | UUCP: ...!mcvax!ukc!aber-cs!pcg Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
bcase@cup.portal.com (Brian bcase Case) (06/02/89)
>>For the implementations to come, including >>superscalar stuff, the issues will be the complexity of implementation and >>the cost (read: people) of realizing that implementation. This is where >>CISCs, I believe, will falter. One of the great advantages of RISCs is >>that they are conceptually easier to implement. > >In order to go superscalar, future RISC chips will end up having new, >incompatible architectures. I make this statement because I understand >the difficulty of the alternative, which is the execution of several >existing-architecture instructions concurrently. By definition, a superscalar implementation of an architecture is an implementation that *is compatible* with previous non-superscalar implementations. It doesn't make sense to say that to go superscalar, future RISC chips will have incompatible architectures. New things will certainly be learned about the interactions between architecture and implementation, but that doesn't mean that current RISC architectures won't have superscalar implementations. They are in the works as we type.... >In other words, I think that when they learn how to >implement a fast CISC, it will go faster than the same-technology >RISC. Why? Because for the RISC to keep up, it will have to execute >an average of two instructions per cycle, and one instruction is a lot >easier to implement than two, even with auto-increment addressing. First of all, the fast CISCs that we are seeing, 486 and 68040, are looking more and more like RISCs, to the extent that that is possible. That is, the simple, pipelinable instructions go fast while the complex, inherently serial instructions go slow. This trend will only increase. Note that the fast CISCs, which have *next generation* technology on their side, are still not faster than current technology RISCs. The process of subsetting a CISC's instruction set and changing the compilers to take advantage of the fast subset is the only way for a CISC to go reasonably fast. 
For a superscalar implementation of a CISC, the subsetting will get more severe, I predict. Trying to execute two complex instructions, with side effects, at the same time is much harder to understand and implement (correctly) than trying to execute two simple instructions, relatively free of side effects, at the same time. >Of course, there is always the possibility to do what they did the >first time: come out with a new architecture that matches the >technology window exactly. Note recent product announcements: this >just keeps happening! It's just that when you do this, expect a >limited lifetime of the architecture. The commercial introduction of new architectures is completely up to companies. If a new architecture facilitates superscalar, or super-mumble, implementation, then it is the job of academics to discover and document it. I guess I am saying that no one should expect that CISCs will make a comeback because of the desire to build superscalar implementations. Quite the contrary. On limited lifetimes: If we only had a portable means of distributing applications, such as that proposed by the OSF ANDF (architecture-neutral distribution format), then we could innovate at the architectural level 'till we're blue in the face (maybe we already are). An architecture might be in vogue for a limited amount of time, but with ANDF, it would still be able to run new software.
tim@crackle.amd.com (Tim Olson) (06/02/89)
In article <20810@orac.mips.COM> earl@orac.mips.com (Earl Killian) writes: | Another problem with register windows is that they make it difficult | to support coroutines. I typically do queuing simulations by having | multiple coroutines representing separate entities, each with a | separate stack. I've implemented this on VAXs, 68000s, and MIPS boxes | with only 20-50 lines of assembler. But I can't see any way to do | stack switching on a register window machine without a kernel call | (yuck). Have any of the register window architectures implemented a | way for user code to do stack switching? Well, this seems true for SPARC, since the Current Window Pointer field is in the Processor Status Register, which is protected, but it is not true of the Am29000 or the i960. The i960 has a FLUSHREG instruction which seems to be user-accessible, and in the 29k the user has access to all of the local registers and the stack pointer register. The low-level thread package I wrote for the 29k has a swapThread procedure which takes 35 instructions, all in user-mode. -- Tim Olson Advanced Micro Devices (tim@amd.com)
les@unicads.UUCP (Les Milash) (06/02/89)
In article <1Rd5QS#7Pjn10=news@anise.acc.com> lars@salt.acc.com (Lars J Poulsen) writes: >What I am talking about is saving and restoring registers when the >operating system switches from one user process to another. This is not >something that the compiler can improve on. Maybe the architecture could >define a "highest register currently in use" pointer, and encourage >context switches just before it gets incremented ? Somebody recently also pointed out that the INMOS Transputer can c-switch in .6uS. The way they do it is sort of like this but weirder; and it's different-but-related wrt this thread: there are 6 registers, a PC, a FP, and a 3-or-4 deep hardware stack. all local variables are in "memory", and they have fast FP+short offset addressing mode. current chips have (1, soon 8, kWord) on-chip 50ns memory where you want to put your stack(s?). architecturally i think the chip could be done with NO on-chip (i.e. fixed address) ram but cache instead. in fact you can disable the on-chip bank if you want to, although i'm not sure there're the perfect set of signals coming out that cache users would want. context switches will only happen on certain instructions (like loop bottom, and in/out, which might block). they do have a "highest register currently in use" sort of, but they assume that the compiler will make sure that NONE of them are in use at those points, it only saves the PC and FP. #define HIGHEST_USED_REGISTER 0 /* HA. so THERE. deal with THAT, mate! */ remember the TI 99*? before my time, but it's a memory-to-memory machine, with the "registers" being memory-mapped, FP relative. i heard that it was a dog, at least the micro version. has anybody considered memory-to-memory architectures in the face of modern cache design? it seems to solve some problems (but probably introduce some, i just can't see what right off the bat). 
somebody recently was wondering: if you do global register allocation, what if your most frequent variable had pointers to it, how do you handle that? this'll solve that at least, right? in america the transputer gets little attention (i guess cause it's got no mmu and it's not intel- or moto- compatible and cause inmos took about 5 years before they got anybody to understand what the heck it was). but it is weird, and in lots of ways good, and we can learn from it. it's risc not in the sense of "register-to-register with windows" but in the sense of "optimized in the face of real data", or at least they do constantly quote numbers about how "only 1.5% of these actually require those so we'll make you do it in software" and they were one of the first to make short immediate constants load fast. have load-store day. Les Milash
baum@Apple.COM (Allen J. Baum) (06/03/89)
[] >In article <479@unicads.UUCP> les@unicads.UUCP (Les Milash) writes: > >Somebody recently also pointed out that the INMOS Transputer can c-switch >in .6uS. The way they do it is sort of like this but weirder; >and it's different-but-related wrt this thread: >there are 6 registers, a PC, a FP, and a 3-or-4 deep hardware stack. >all local variables are in "memory", and they have fast FP+short >offset addressing mode. > >remember the TI 99*? before my time, but it's a memory-to-memory machine, >with the "registers" being memory-mapped, FP relative. The AT&T CRISP chip is similar; it's actually a memory-memory architecture, where all references are relative to a stack pointer. The top 32 entries are cached in a top-of-stack cache. This is like a register window, with a granularity of one register (i.e. the window slides by a minimum of 1, instead of a fixed 8), and with the advantage that you can address into the registers. A context switch can be extremely fast; just change the stack pointer. Of course, then all accesses to the 'registers' are cache misses. -- baum@apple.com (408)974-3385 {decwrl,hplabs}!amdahl!apple!baum
khb@chiba.Sun.COM (Keith Bierman - SPD Languages Marketing -- MTS) (06/03/89)
In article <978@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes: >... >It must be repeated for the Nth time that this is only true if spill >minimization is of paramount importance; if you look at speed, then most >spills avoided by global optimizers with large register sets don't make much >of a difference. And for the N+1th time, I guess, it must be repeated that there is a class of machines for which minimizing spills is quite important (and it is getting larger all the time). Consider the paper by Hsu, Dehnert, and Bratt in ASPLOS-III on the Cydra 5, as an example. > > Upshot: modern compilers can employ as many registers as you can design > in. > >But pointlessly... And even old compilers can just register everything in One can, but old compilers tend to do a very bad job of it. >sight; if there are many registers, then using an optimizer is not very >important. The hard work, as we have just discussed, is to cache *only* the >variables that matter, for *only* the section of code where they matter (and Which with software pipelining, trace scheduling and similar techniques can be quite a long time indeed. On the Cydra 5 a memory write was assumed to take 26 cycles, two memory references could be initiated EVERY clock, as well as two integer operations and two FP operations. Since the FP operations required more than one cycle, the instruction scheduling was quite interesting. With respect to register "live time" a given register might be required several loop iterations into the future. >this can be done by the programmer using "register" in C, or by the compiler Which is often ignored by the compiler... simply because programmers cannot reasonably guess how the compiler (try it on your Multiflow Trace/28 for a bunch of codes and show us what is produced!) will unroll, split and otherwise contort your code. >when fed with either "representative" profile data, or with calculations or >estimations of where hot spots lie. 
>This typically requires many fewer >registers than minimizing spills regardless of whether they are expensive >ones or not. Doing a "spill" (i.e. running out of interconnect) on the Cydra 5 meant your loop ran 10x slower. This is not acceptable to most programmers. > Naive rationale for infinite (as long as they are free) registers: > ^^^^^^^^^^^^^^^^^^^^^^^^ > >Unfortunately they are not free; more registers make the system stiffer, in True. Which is why folks build windows, small (32) register files, file pointers (AMD, Gould) and other stuff. >that they do raise the cost of multithreading, which is where OS technology >is finally heading (Mach, OS/2, etc...), and they do have costs in real >estate and even, possibly, cycle time lengthening (Cray's law). You only >need a handful of registers to capture most of the benefit of expression >optimization, and another to capture most of the benefits of intra-statement >optimization (whether you do it via "register" in C or leave it to the >compiler). Your assertion about multithreading is quite true, it is here to stay. It is far from clear that transputer type designs will win (tiny machine fast communication) out over somewhat "chunkier" designs. But the assertion about a handful of registers being sufficient on high performance machines is simply not borne out. All of Seymour's machines have a bunch (don't forget those vector registers), and this is NECESSARY for those long pipes (superpipelining?) ... and it is just as true for software pipelining. >Large register banks are only justified for special purpose machines >(vector, VLIW, superscalar) where the only thing that matters is raw speed >in processing batched numeric codes where there is an inherent high degree >of parallelism in the algorithms employed. Multiflow claims that they eat "general" code just fine. As Mashey has pointed out there are several superscalar projects running around ... 
and business codes, database codes, and windowing systems benefit from that kind of parallelism just like numeric codes (although writing the code in C makes it much harder to extract the parallelism). Keith H. Bierman |*My thoughts are my own. Only my work belongs to Sun* It's Not My Fault | Marketing Technical Specialist ! kbierman@sun.com I Voted for Bill & | Languages and Performance Tools. Opus (* strange as it may seem, I do more engineering now *)
hays@zeus.hf.intel.com (hays) (06/03/89)
In article <26204@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes: >Last week I saw an ad for a real time version >of Unix with a guaranteed response time (switch to kernel to high priority >process) of 5 milliseconds. Hugh is absolutely right in his response here - you must design for your environment. A guaranteed task switch of 5 milliseconds is *soft* real time, but is still impressive for a real time Unix. I have worked on a *hard* real time kernel for the 386 (iRMK, from Intel) where we designed for a 50 uS maximum guaranteed interrupt latency, and a 5 uS maximum guaranteed task switch (at 16 MHz). Not easy, but three orders of magnitude faster than the Unix mentioned above - and with its own market niche. The point? Design for your environment - chips for UNIX should have large register files, real time chips should have small register files, and compromises will be made for many successful products. ----- Kirk Hays -- "I'm the NRA!"
mcg@mipon2.intel.com (06/05/89)
In article <25786@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes: >In article <m0fRx4x-0001fDC@mipon2.intel.com> mcg@mipon2.UUCP (Steven McGeady) writes: >| [Re:] huge directly-addressable register files ... in any reasonable >| implementation, the register file must be multi-ported. A six-ported register >| cell is about 5x the size of a single-ported register cell ... > >I don't see why this is a reason not to implement large register files. >You need to apply transistors where they will do the most good. This is a truism, a cliche. The point is that we are, and have always been, limited in our scope by how many devices could be {effectively, economically, manufacturably} put on a die. If one spends 2/3 of one's silicon budget on a register file, then correspondingly less is available for function units (e.g. floating-point), caches, and on-chip peripherals. The question of whether a large register file will "do the most good" is precisely the point here, and I have seen no evidence that it is. >Note >that processors that attempt to "more fully utilize micro-parallelism" >also tend to want to have more general-purpose registers available to >maintain full performance. As above, I would like to see your experimental evidence that suggests that more than 64 general registers are useful for a preponderance of embedded applications (which is, I believe, your target market), or for any application other than multi-precision scientific code. We have seen that, given current and projected compiler technology, an extremely large *addressable* register set is substantially less useful than a large on-chip cache. With wide (128-bit) low-latency transfers from cache to registers overlapped with other operations, a large register set is not really needed. Large embedded applications we have examined show an average of 6 local scalar variables in registers, with an additional 3-5 temporaries in use. 
Additional registers may be used to cache global variables, and to temporally separate destination values that could otherwise share registers, thus avoiding scoreboarding blocks. And since one can put 5x the cache in the place of the registers that would be added, the cache makes even more sense as a repository for values of import that are not currently in registers. >| This, IMHO, is one of the greatest flaws of the 29k - it exposes 192 >| (actually 256) architectural registers. In the current implementation they >| are (I believe) 3-ported, and even now occupy a large amount of the 29k >| die space. I believe that they will run into serious problems if they >| ever attempt to dispatch and execute multiple general instructions per cycle. > >Well, we see no problems in either our 2nd or 3rd generation parts... Well then, do speak up. I can't imagine how you will effectively increase the number of ports in the register file without creating an imbalance between CPU speed and bus (because of poor register/cache balance). And I can't imagine how you will lower your CPI without it. But then, perhaps I'm just not imaginative enough. >| The 960, on the other hand, exposes 32 general registers architecturally, >| but because 16 of these are "local", and saved/restored on call/return >| to/from architecturally-hidden resources, we can easily move from a cheap >| (1-ported by 4 sets) register implementation, to a very fast one >| (6-ported by 8+ sets) in high-performance implementations. > >So you *will* be looking at large register files (128+, 6-ported) for >high performance. No, not at all. If you read Glenn Hinton's Compcon paper, you will realize that, while in the first generation we used a ((16*4)+16) single-ported register file, the subsequent generation includes a 32 register 6-ported file, and that registers spilled by call and restored on return are flushed to an on-chip cache capable of storing 8 or more of these register sets. 
The registers are flushed across a 128-bit wide dual bus, so call and return take only 4 clocks each. S. McGeady Intel Corp.
mcg@mipon2.intel.com (06/05/89)
In article <26145@lll-winken.LLNL.GOV> brooks@maddog.llnl.gov (Eugene Brooks) writes: >In article <m0fRx4x-0001fDC@mipon2.intel.com> mcg@mipon2.UUCP (Steven McGeady) writes: >>The cut line is different on every architecture - 32 is sufficient on the >>960, but I am not disagreeing with Wall's estimate of 64. Certainly >>in floating-point intensive scientific applications dominated by >>double-precision arithmetic in loops, more registers are needed. But >>substantially more than 64 seems to limit architectural flexibility >>quite severely. >The cut line for scratch registers is directly proportional to the memory >latency. The longer your latency the more registers, and concurrently >handled computation, you need to mask it. This is entirely correct, but doesn't mention several distinctions. First, there is a distinction between *addressable* registers (of which the 960 has 32) and *available* registers (of which the 960 architecture has an undefined number, the K* implementations have 80, and the subsequent implementations have more than 144). The distinction is important because one must decide what the locality of register use is: do you wish to/need to use the preponderance of your registers in a single routine, or do you use them across many routines? In the former case, a larger addressable register file helps; in the latter case it does not, and may adversely impact other aspects of the architecture. Scientific code often falls into the former category. Most control code (at least when well-structured and/or written in HLL) falls into the latter. Second, memory latency is not simply a function of external bus speed and memory wait-states. It is affected by the size, speed, and behaviour of on-chip caches. The balance between caching and registers is tricky. Caches can never be big enough, so there is a temptation to replace them with registers and assume they will be implemented off-chip. 
This is a fine choice for some applications, but not for those with an emphasis on low-cost computing. In cost-sensitive applications, off-chip memory is strictly DRAM, and the customer tunes the cost (aka speed) of the DRAM to her performance needs. You need 90% of the rated performance? Use 1 wait-state; only 70%? Use 3 wait-states. Contrast this with the equivalent message of other architectures: Don't want SRAM cache? Then you get 50% performance; Don't want two distinct memory systems? Then use this special memory that's more expensive and slower than what you might want to use. Of course, most comp.arch readers probably don't actually *build* systems out of these chips, so these questions are at best boring, and probably irrelevant here. S. McGeady Intel Corp.
tim@crackle.amd.com (Tim Olson) (06/07/89)
In article <m0fU2pe-0001hNC@mipon2.intel.com> mcg@mipon2.UUCP (Steven McGeady) writes: | In article <25786@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes: | | [Steve talking about problems with multi-porting large register files] | | >Well, we see no problems in either our 2nd or 3rd generation parts... | | Well then, do speak up. I can't imagine how you will effectively increase | the number of ports in the register file without creating an imbalance | between CPU speed and bus (because of poor register/cache balance). | And I can't imagine how you will lower your CPI without it. But then, | perhaps I'm just not imaginative enough. I think you are setting up a "straw-man" argument here. We certainly aren't talking about 2/3 of the die area as you previously implied. The Am29000's triple-ported, 192 register file currently takes about 21% of the die area. Even if we (hypothetically) doubled the number of ports on it for a future generation part, standard process shrinks would still end up making it a much smaller portion of the die than it already is. It would end up being dwarfed by the large caches and multiple functional units required to sustain multiple instruction/cycle throughput. -- Tim Olson Advanced Micro Devices (tim@amd.com)
mcg@mipon2.intel.com (06/08/89)
In article <20810@orac.mips.COM> earl@orac.mips.com (Earl Killian) writes: >Another problem with register windows is that they make it difficult >to support coroutines. I typically do queuing simulations by having >multiple coroutines representing separate entities, each with a >separate stack. I've implemented this on VAXs, 68000s, and MIPS boxes >with only 20-50 lines of assembler. But I can't see any way to do >stack switching on a register window machine without a kernel call >(yuck). Have any of the register window architectures implemented a >way for user code to do stack switching? This is easiest to answer with code for the 960. I count 15 instructions. This code was written by Jim Valerio. 'pfp' is the previous frame pointer register.

/*
 * C callable coroutine support for 960
 */

/*
 * co_fp = co_init(stk_addr, init_ip);
 *
 * Creates a coroutine on the stack `stk_addr' to begin at `init_ip'.
 * The value returned should be assigned to a global variable which
 * is used in calls to co_jump (e.g. task1 = co_init(...) and later
 * co_jump(task1)).  We assume that `stk_addr' is properly aligned.
 */
	.globl	_co_init
_co_init:
	st	g14,(g0)	/* pfp := 0 (assumes g14 is 0) */
	lda	0x40(g0),g2
	st	g2,4(g0)	/* initial sp */
	st	g1,8(g0)	/* init_ip */
	ret

/*
 * co_jump(&my_fp, new_co_fp);
 *
 * Saves away the calling-frame's FP for a later return, and then
 * returns to the new coroutine's context.  We know here that the
 * C calling convention requires that only g8..g12 be saved.
 */
	.globl	_co_jump
_co_jump:
	movq	g8,r4
	mov	g12,r8
	call	co_swap
	mov	r8,g12
	movq	r4,g8
	ret

co_swap:
	st	pfp,(g0)
	flushreg
	mov	g1,pfp
	ret			/* goto new coroutine */

------------------------------------------------------------------------------- S. McGeady Intel Corp.
wayne@dsndata.uucp (Wayne Schlitt) (06/10/89)
>In article <m0fU37i-0001hNC@mipon2.intel.com> mcg@mipon2.intel.com writes: > Of course, most comp.arch readers probably > don't actually *build* systems out of these chips, so these questions > are at best boring, and probably irrelevant here. > S. McGeady > Intel Corp. umm... just because we don't actually build systems out of these chips doesn't mean we aren't interested in knowing the trade offs that went into the chip designs. i'm just a lowly software person (:->), but in order for me to guess what the hardware is going to look like five years from now, i need to know these kind of things. -wayne