lindsay@k.gp.cs.cmu.edu.UUCP (10/03/87)
Register windows interest me.

Win#1: less memory traffic to save/restore registers at procedure
	call/returns. (Caches get some of this win.)
Win#2: less memory traffic to build/access subroutine argument lists.
	(Caches get some of this win.) (Non-window machines can have much of
	this win by changing their calling convention.)
Win#3: more likely that a compiler can "target" (compute a value in the
	place where it is to be used, thus saving "move" instructions).
Win#4: hardware is dense and regular. (Caches are too, but are more complex.)

Lose#1: more state to be saved/restored at task switches. (Caches can
	be just declared invalid, or have tags.)
Lose#2: more hardware that can be on the ALU critical path. (Longer R bus,
	deeper R decode.)
Lose#3: the extra chip area could have been doing something else, like
	being a cache, which you wanted anyway.
Lose#4: Some programs (particularly Unix's "printf") want their parameter
	list to be at an address.

I'm fond of win#3, because only windows give it, and because compilers don't
need great subtlety to pick up the win.

There are some ways to fix things up. For example, instead of a call
sliding by (say) 8 registers, slide by a program-supplied distance. The
average slide might then be (say) 3. This economy helps lose#1, and changes
lose#2 in very implementation-specific ways. Another fixup is to "dribble"
the state out to memory as a background activity. It's more complex, but if
it uses spare memory cycles, then task switches get a "free" win.

What ways have been used around lose#4?
-- 
Don		lindsay@k.gp.cs.cmu.edu	CMU Computer Science
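The fixed-slide vs. program-supplied-slide economy can be made concrete with a toy C model (my own sketch, not any real machine's spill logic): each call pushes a window of `slide` registers onto a 64-register file, spilling the oldest frames to memory on overflow and refilling on deep returns. Ten nested calls at a fixed slide of 8 overflow the file and cost traffic; the same calls at an average slide of 3 fit entirely.

```c
#include <assert.h>

/* Toy register-window file: NREGS registers shared by a stack of
 * variable-size windows.  Frames 0..spilled-1 live only in memory;
 * frames spilled..depth-1 are resident in the file. */
#define NREGS    64
#define MAXDEPTH 1024

typedef struct {
    int  slide[MAXDEPTH];  /* slide distance of each active frame */
    int  depth;            /* current call depth */
    int  resident;         /* registers currently held in the file */
    int  spilled;          /* count of frames pushed out to memory */
    long traffic;          /* words of save/restore memory traffic */
} WindowFile;

void wf_call(WindowFile *wf, int slide)
{
    wf->slide[wf->depth++] = slide;
    wf->resident += slide;
    while (wf->resident > NREGS) {        /* overflow: spill oldest frame */
        int victim = wf->slide[wf->spilled++];
        wf->resident -= victim;
        wf->traffic  += victim;
    }
}

void wf_return(WindowFile *wf)
{
    wf->resident -= wf->slide[--wf->depth];
    if (wf->depth > 0 && wf->spilled >= wf->depth) {
        /* caller's frame was spilled: refill it from memory */
        wf->spilled   = wf->depth - 1;
        wf->resident += wf->slide[wf->depth - 1];
        wf->traffic  += wf->slide[wf->depth - 1];
    }
}
```

Driving this with ten calls of slide 8 followed by ten returns gives 32 words of traffic; the same sequence at slide 3 gives none, which is exactly the lose#1 economy argued for above.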
marc@ucla-cs.UUCP (10/03/87)
In article <1242@k.gp.cs.cmu.edu> lindsay@k.gp.cs.cmu.edu (Donald Lindsay) writes:
>Register windows interest me.
>
>Lose#1: more state to be saved/restored at task switches. (Caches can
>	be just declared invalid, or have tags.)
>Lose#2: more hardware that can be on the ALU critical path. (Longer R bus,
>	deeper R decode.)

We designed a register file which almost eliminates this problem. By using a
"Shift-Register File", the data bus only goes through a single window of
registers. In our case it is 16 registers. Adding more windows to the file
does not lengthen the data bus.

>There are some ways to fix things up. For example, instead of a
>call sliding by (say) 8 registers, slide by a program-supplied distance.
>The average slide might then be (say) 3. This economy helps lose#1, and
>changes lose#2 in very implementation-specific ways.

To solve some of this problem, we make use of variable-size windows. Our
window sizes vary in increments of 4 (8, 12, 16) so that they can fit a
procedure call/return appropriately. In this way very few registers are
wasted on bigger-than-necessary windows. That is why we claim that our file
of 64 registers performs just as well as a normal 128-register file.

For more information you can read:
1) "A Reduced Register File for RISC Architectures", M. Huguet & T. Lang,
   Computer Architecture News (June 1985).
2) "VLSI Implementation of a Shift-Register File", M. Tremblay & T. Lang,
   20th Hawaiian Int. Conf. on System Sciences, pp. 112-121 (January 1987).

Marc Tremblay
marc@CS.UCLA.EDU
...!(ihnp4,ucbvax)!ucla-cs!marc
Computer Science Department, UCLA
guy%gorodish@Sun.COM (Guy Harris) (10/04/87)
> Lose#4: Some programs (particularly Unix's "printf" ) want their parameter
> list to be at an address.
> 
> What ways have been used around lose#4 ?

Our "printf" uses the "varargs" package, our "varargs.h" include file maps
"va_alist" into "__builtin_va_alist" on SPARC machines, and our SPARC
compiler recognizes the "__builtin_va_alist" construct and moves arguments
from registers to memory as needed.

Basically, if you want code using "varargs"-like techniques to work on
machines built around the SPARC chip, you'd better not just use any
"varargs"-like technique, you'd better use "varargs". (ANSI C contains a
similar facility, and mandates its use in this case.)

Note that this is a lose on any implementation where arguments are passed in
registers; it can either be supported by requiring this sort of code to be
written using higher-level primitives such as "varargs" rather than low-level
fiddling, or by having the compiler somehow detect that the low-level tricks
are being used (as I believe MIPSco's compiler for their non-register-window
machine does).

	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com
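The distinction can be sketched in a few lines of C (shown in the ANSI `<stdarg.h>` spelling of the facility Guy mentions; the function names here are my own):

```c
#include <stdarg.h>

/* Portable variable-argument access: let the va_ macros hide whether
 * the arguments arrived in registers or in memory. */
int sum_args(int count, ...)
{
    va_list ap;
    int i, total = 0;

    va_start(ap, count);
    for (i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}

/* The non-portable idiom that loses on register-passing machines
 * looks like this:
 *
 *     int *p = &count;       (assumes args are contiguous in memory)
 *     total += *++p;         (wrong where args travel in registers)
 *
 * It is exactly this pattern that the SPARC compiler's
 * __builtin_va_alist machinery exists to rescue. */
```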
srg@quick.COM (Spencer Garrett) (10/05/87)
In article <1242@k.gp.cs.cmu.edu>, lindsay@k.gp.cs.cmu.edu (Donald Lindsay) writes:
> Register windows interest me.
> 
> Win#3: more likely that a compiler can "target" (compute a value in the
> place where it is to be used, thus saving "move" instructions).
> 
> I'm fond of win#3, because only windows give it, and because compilers don't
> need great subtlety to pick up the win.

Unfortunately, it's not so simple. When you have procedure calls in argument
lists you have to tuck away all the arguments you've already computed while
you call the next procedure. With a stack-based arg list, you just compute
the args in reverse order and let the stack grow to include each new one in
turn. If your compiler is good about how it does the tucking, it's probably
still a net win, but it's tricky to do. Pyramid, for instance, often ends up
pushing registers onto the stack to preserve their values across procedure
calls, thus circumventing one of the main advantages of their architecture.
bcase@apple.UUCP (Brian Case) (10/06/87)
In article <1242@k.gp.cs.cmu.edu> lindsay@k.gp.cs.cmu.edu (Donald Lindsay) writes:
>Register windows interest me.
>Lose#4: Some programs (particularly Unix's "printf" ) want their parameter
>	list to be at an address.
>
>What ways have been used around lose#4 ?

Well, I wouldn't say this is such a great way around it, but it does "work:"
In the compiler I implemented for internal use at AMD, if the address of
*any* argument was taken inside a procedure, the compiler assumed that at
least 16 arguments were passed to the procedure (the max. that the compiler
would place in registers), and it stored all 16 arguments in memory in such a
way that if more than 16 were passed, they would all line up real nice, thus
facilitating printf. The method that I implemented required three (count 'em,
three) stacks. It was a very quick solution to a somewhat tricky problem, but
there are lots of others.

BTW, you mentioned a coupla things about reg. windows that need elaboration:
Registers *are* a cache, so the tradeoff "the space taken by registers could
have been used for something else, like a cache" is kinda weird; you seemed
to be saying in one "win" that registers were more dense than cache, but I
don't know that this is really true: caches are typically single-ported
memories while register files are at least two-ported (probably three-). The
tags necessary for traditional caches take some space, true, but I still
think caches, as a general class, are denser than register files. Register
files have two or three times the bandwidth; *this* is the big win.
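The "line up real nice" trick can be sketched in C. This is my own toy model of argument homing, not the AMD compiler itself: the first 16 arguments travel in registers; if the callee takes any argument's address, the compiler stores them into a save area laid out directly below the overflow (memory) arguments, so a pointer walked past argument 15 falls through to the rest of the list.

```c
#include <string.h>

#define NREG_ARGS 16   /* max arguments passed in registers */

/* Home the register-passed arguments into a frame laid out so that
 * frame[0..15] hold the (former) register args and frame[16..] hold
 * the args that already lived in memory.  printf-style code can then
 * walk one pointer across the entire argument list. */
void home_args(int frame[], const int reg_args[NREG_ARGS],
               const int mem_args[], int n_mem)
{
    memcpy(frame, reg_args, NREG_ARGS * sizeof frame[0]);
    memcpy(frame + NREG_ARGS, mem_args, n_mem * sizeof frame[0]);
}
```

The real compiler emits stores rather than a memcpy, of course, and has to get the stack layout right for its three stacks; the point is only the contiguous layout.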
bcase@apple.UUCP (Brian Case) (10/07/87)
In article <131@quick.COM> srg@quick.COM (Spencer Garrett) writes:
>Unfortunately, it's not so simple. When you have procedure calls in
>argument lists you have to tuck away all the arguments you've already
>computed while you call the next procedure. With a stack-based arg list,
>you just compute the args in reverse order and let the stack grow to
>include each new one in turn. If your compiler is good about how
>it does the tucking, it's probably still a net win, but it's tricky to
>do. Pyramid, for instance, often ends up pushing registers onto the
>stack to preserve their values across procedure calls, thus circumventing
>one of the main advantages of their architecture.

Well, first of all, I doubt that the registers are pushed onto the stack
*often.* I can see it happening in some cases, but this doesn't sound like
the common case. Further, C doesn't specify the order of evaluation of
arguments; I used this to advantage in a RISC compiler that I wrote: the
compiler analyzes the parse tree so that inner-most calls (if any; typically
there are none) are done first. This is neither hard nor time-consuming.
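A sketch of the reordering described here, written out in source form (the function names are invented for illustration):

```c
/* Source:  r = f(a + b, g(c));
 *
 * Evaluated left-to-right, a+b would have to be tucked away across the
 * call to g().  Since C leaves argument evaluation order unspecified,
 * the compiler may emit the inner call first, then target a+b straight
 * into its outgoing argument register: */
static int g(int c)        { return c * 2; }
static int f(int x, int y) { return x + y; }

int call_reordered(int a, int b, int c)
{
    int t = g(c);        /* inner call first: nothing live to save */
    return f(a + b, t);  /* a+b computed directly at its use site  */
}
```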
csg@pyramid.pyramid.com (Carl S. Gutekunst) (10/13/87)
Since Pyramid has the oddest architecture of any register window machine :-),
let me see if I can contribute some useful information. People with
additional questions can send me e-mail.

In article <1242@k.gp.cs.cmu.edu> lindsay@k.gp.cs.cmu.edu (Donald Lindsay) writes:
>Win#3: more likely that a compiler can "target" (compute a value in the
>	place where it is to be used, thus saving "move" instructions).
>....
>I'm fond of win#3, because only windows give it, and because compilers don't
>need great subtlety to pick up the win.

Actually, this doesn't win that much. While it gives me a warm fuzzy feeling
to see the compiler putting the calculations for

	printf("%d\n", (i+j)*47);

directly into the register that will be used to pass the argument, this only
saves at most one 100ns move instruction. (I know, I know, removing one 100ns
move instruction in the right place has a dramatic effect on Dhrystone. :-))
Or suppose, as is often the case, that I need the expression ((i+j)*47) more
than once. Then I want the compiler to identify the common expression, and
keep it in a local variable for copying when needed. I'm a lot more concerned
about the latter than the former, and the presence of either complicates the
implementation of the other. Pyramid's C compiler does catch both cases. (It
also determines when recalculating the expression costs exactly the same as
MOV'ing it, e.g. addition by small integers.) But it's not exactly free.

>Lose#4: Some programs (particularly Unix's "printf" ) want their parameter
>	list to be at an address.

As pointed out nicely by Guy Harris and John Mashey, the real answer is to
use <varargs.h>. Anything that does variable length/types argument handling
in C without using <varargs.h> is going to have problems, especially now that
non-VAX-like architectures are becoming commonplace in the UNIX universe.

As it happens, though, you *can* take the address of a register on the
Pyramid 90x.
This was deemed necessary because of the huge volume of (non-portable) C code
that depends on it. (Note that MIPSco got around this problem by making the
compiler smart enough to detect it. And in case anyone got lost, the MIPS
processor does *not* use register windows. It was dragged into the discussion
as an example of a processor that does not use VAX-style parameter passage.
Sun's SPARC processor *does* use register windows.)

The Pyramid's register set is transparently mapped into the "Control Stack,"
a bottom-up stack in the virtual address space that is used for all function
calls. (Yes, there is also a Data Stack. It grows top-down.) Calls and
returns bump the Control Stack Pointer up and down by 32*4, and the register
window moves along with it. When the CSP falls off the end of the register
set (after nesting function calls 16 levels deep), the bottom 32 registers
are written back to real memory, and mapped back up to the top. Writebacks
and context switches *do* take time, but not as much as you might expect;
the chunk of memory "behind" the control stack is cached (just like the rest
of the memory), and the Pyramid has a 256-bit (32 byte) datapath to the main
store.

This does *not* mean that the Pyramid's <varargs.h> is as simple as that of
the PDP-11 or the VAX, since the size of the register window is finite, and
there are some objects you cannot put in a register, like structs and (in
Pascal) arrays. These go on the data stack. The Pyramid handles this in
<varargs.h>; nothing special is done in the compiler.

In article <131@quick.COM> srg@quick.COM (Spencer Garrett) writes:
>Pyramid, for instance, often ends up pushing registers onto the stack to
>preserve their values across procedure calls, thus circumventing one of the
>main advantages of their architecture.

The Pyramid *NEVER* saves registers on the data stack. Parameters are passed
on the data stack, but only when the call uses more than 13 arguments, or
when passing entire arrays or structures.
And there are writebacks when you nest functions more than 16 levels deep.
But that's it. To what are you referring?

In article <756@winchester.UUCP> mash@mips.UUCP (John Mashey) remarks:
>a) Most of the machines have 2 register sets: integer and FP....

The Pyramid has one register set. FP lives with everything else. One less
thing for <varargs.h> to worry about. :-) But John's comments still apply to
structs and arrays.

>If you're using high-powered global optimization technology (at least 3 of
>the 4 above are: I don't know about Pyramid, they may be also.) then a lot
>of the information-gathering needed to generate good code for the argument
>passing falls out of technology.

Yes, Pyramid employs a relatively powerful global optimizer, although this
has little bearing on parameter passage. This is one aspect that I have often
thought of as an oversight (and perhaps the only oversight) in the MIPS R2000
architecture: parameter passage has been made very difficult for the
compiler. In the Pyramid it falls out cleanly. On the other hand, the MIPS
compiler is allowed to make trade-offs that the Pyramid's compiler is not.

<csg>
lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) (10/19/87)
It occurs to me that register-window machines tend to remember data which a
cache would (correctly) overwrite.

For example, assume that a program is half a dozen calls deep. What is the
average amount of execution before the bottom (oldest) window is uncovered?
A reasonable answer: more than the average amount of execution between system
calls, or more than the average amount of execution between interrupts, or
more than the average amount of execution between task switches, or all of
the above. If this dead weight is sent to memory more than once (by the
inevitable second system call, or second interrupt) then considerable traffic
could occur.

This leads to the idea that a register-window machine should spill its
elderly data to memory, perhaps at the first interrupt, perhaps at the first
system call, or perhaps by a "trickle" mechanism. It shouldn't recall these
values back from memory until some (much) later time - say, when their slot
is returned-to (or when the one above them is returned-to) (or when
return-from-interrupt notices that they are almost uncovered).

It would be interesting to see simulation results on this. There is an
excellent article in the September issue of (IEEE) Computer, by Flynn and
others, which reports simulation results on various other register models. I
consider the article flawed by its emphasis on counting memory traffic, while
ignoring the topic of syscalls and interrupts. If they would factor these in,
then the ideas presented above could be tested.

Yes, I know that interrupt frequency is system-dependent (and so on). My
point remains. Our understanding of register issues still needs work.
-- 
Don		lindsay@k.gp.cs.cmu.edu	CMU Computer Science
lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) (10/19/87)
I posted yesterday on the subject of minimizing the traffic required to save
and restore registers.

There are those who argue that interrupts and task switches are enormously
rare compared to subroutine calls and returns. They are right. They then
argue that the rare events can be penalized, if this makes the common events
run faster. (The RISC argument, if you will.)

I don't agree. There's the issue of "rare for which customer". The HP
Spectrum addresses this by allowing differing resource mixes, so for example,
a customer who does a lot of multiplies can have a multiplier. There's also
the issue of "rare but critical". This brings me back to register window
state. If it's big, then interrupt latency and task-switch latency are high.
This affects more than just the upper bound on interrupt rate. Interrupts may
be announcing important events, and you want to get to work. Aside from any
costs due to slow response, there is the fact that a slow chip just won't be
as widely useful, or used.

Also, machine features tend to determine programming style. This is usually a
pity. For example, many device drivers are subroutine packages, rather than
full processes, simply because "faster" dominated "cleaner" at design time.
-- 
Don		lindsay@k.gp.cs.cmu.edu	CMU Computer Science
henry@utzoo.UUCP (Henry Spencer) (10/20/87)
> ...register window state. If it's big, then interrupt latency and
> task-switch latency are high...

Well, maybe. If your interrupt handlers are big and complex -- all too likely
these days -- then they will be calling functions, and hence register windows
will speed them up too. Register windows may end up being a net win despite
context-switch overhead.
-- 
PS/2: Yesterday's hardware today.    |  Henry Spencer @ U of Toronto Zoology
OS/2: Yesterday's software tomorrow. | {allegra,ihnp4,decvax,utai}!utzoo!henry
frazier@CS.UCLA.EDU (10/21/87)
In article <8801@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>Register windows may end up being a net win
>despite context-switch overhead.

There isn't that much of a context-switch overhead, either. Saving the
register windows does not dominate the cost of a context switch. Context
switches require hundreds of instructions, while a pipelined system should be
able to save the register file in about 100-150 instructions. This cost is
more than offset by the savings incurred in the calls between process swaps,
even in a system which performs the swaps fairly often.

In addition, smart systems only restore the "top" window of a swapped-out
process. This prevents filling the register file with windows which then have
to be saved for further calls, and incurs no (read: little) extra cost if the
process, once active, proceeds to execute a series of returns and the windows
have to be restored after all.

$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$
Greg Frazier		Your feet are frozen to the floor... -more-
CS dept., UCLA
Internet: frazier@CS.UCLA.EDU
UUCP: ...!{ihnp4,ucbvax,sdcrdcf,trwspp,randvax,ism780}!ucla-cs!frazier
$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$-$
howard@cpocd2.UUCP (10/22/87)
In article <201@PT.CS.CMU.EDU> lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) writes:
>There are those who argue that interrupts and task switches are enormously
>rare compared to subroutine calls and returns. They are right. They then
>argue that the rare events can be penalized, if this makes the common events
>run faster. (The RISC argument, if you will.)

The argument is simply that this approach leads to the best average
performance. Which it does. The CISC argument is then that everything has to
be optimized, even if it's rare and optimizing it (locally) adversely affects
overall cost and performance. Optimizing everything is the same as optimizing
nothing.

>I don't agree. There's the issue of "rare for which customer". The HP
>Spectrum addresses this by allowing differing resource mixes, so for
>example, a customer who does a lot of multiplies can have a multiplier.

"Which customer" is the one you based your studies on. This gives you a clean
target and a clear metric for choosing between design variations. Lack of a
multiplier in the RISC-I was more due to lack of student design time than
anything else. This has nothing to do with RISC. Many RISC machines have
multipliers. (E.g. MIPS-R2000)

>There's also the issue of "rare but critical". This brings me back to
>register window state. If it's big, then interrupt latency and task-switch
>latency are high.

Not necessarily. Many interrupts can be handled almost like procedure calls
in some RISC architectures, so all you need to save is (possibly) one
register frame. The "original" RISC architecture, the IBM 801, I believe had
a separate set of registers for interrupts (I'm possibly confusing this with
another machine, but the idea is certainly practical and provides an easy
counterexample to Donald's sweeping generalization). Task switch is harder to
do well, agreed, but it occurs about 1/100th as often as call/return. All
these concerns are addressed in the early RISC papers.
Why bring them up now, unless you haven't read those papers?

>This affects more than just the upper bound on interrupt
>rate. Interrupts may be announcing important events, and you want to get to
>work. Aside from any costs due to slow response, there is the fact that a
>slow chip just won't be as widely useful, or used.

You seem to have a very narrow idea of "slow". Interrupt latency counts, but
program execution time doesn't? Give me a break!

>Also, machine features
>tend to determine programming style. This is usually a pity. For example,
>many device drivers are subroutine packages, rather than full processes,
>simply because "faster" dominated "cleaner" at design time.

And many people write in assembler because they have atrocious compilers,
which are in part due to overly complicated architectures. It is better to
program in HLL when possible, which supports the notion of hardware designed
for compilers instead of the other way round. This is where RISC began.
-- 
	Howard A. Landman
	{oliveb,hplabs}!intelca!mipos3!cpocd2!howard	<- works
	howard%cpocd2%sc.intel.com@RELAY.CS.NET		<- recently flaky
	howard%cpocd2.intel.com@RELAY.CS.NET		<- ??? try this
mwm@eris.BERKELEY.EDU (Mike (My watch has windows) Meyer) (10/22/87)
In article <8758@shemp.UCLA.EDU> frazier@CS.UCLA.EDU (Greg Frazier) writes:
<In article <8801@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
<> Register windows may end up being a net win
<>despite context-switch overhead.

Every time I've looked at an article on register windows, I think of
something I refer to as "the Johnson stack machine." This machine seemed to
have all the advantages of register windows, and avoided most of the
disadvantages.

The basic idea was to not bother with either a data cache or a register file.
Instead, you used all that chip space for a register-speed cache that is
mapped onto memory. The mapping is tracked by a base/length pair. It's a
write-through cache. The trick is that this is mapped onto the top of the
stack. Pushing something onto the stack involved incrementing the length
counter, possibly wrapping around in the file (my mind goes hazy on how that
worked - it may have actually required shuffling elements in the cache).

But context switches were *cheap*. The save half involved nothing more than
making sure all writes had actually finished. Loading a new context required
reloading (tagging words as invalid, maybe?) the file.

I have good reason to believe that AT&T was working on hardware that used
this model. Does anyone know what became of that?

Better yet, can someone who actually understands hardware explain why I've
not seen anything like the above in the real world?

	Thanx,
	<mike
--
I'm gonna lasso you with my rubberband lazer,		Mike Meyer
Pull you closer to me, and look right to the moon.	mwm@berkeley.edu
Ride side by side when worlds collide,			ucbvax!mwm
And slip into the Martian tide.				mwm@ucbjade.BITNET
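The scheme as described can be modeled in a few lines of C. This is my own sketch of the idea, not the actual AT&T hardware: a small on-chip window mirrors the top of the stack, every write goes through to memory, so the "save" half of a context switch costs nothing and the only cost is refilling the window from the new context's stack (or tagging it invalid and refilling lazily).

```c
#include <string.h>

#define CACHE_WORDS 32   /* register-speed stack cache on chip */

typedef struct {
    int *memory;  /* backing store for this context's stack */
    int  sp;      /* stack pointer (index into memory) */
    int  base;    /* stack index of the first cached word */
    int  line[CACHE_WORDS];
} StackCache;

/* Push: write through to memory, keeping the cache mapped onto the
 * top CACHE_WORDS words of the stack (shuffling on wrap, as the post
 * guesses the real machine might have). */
void sc_push(StackCache *sc, int value)
{
    if (sc->sp - sc->base == CACHE_WORDS) {
        /* cache full: slide the mapping up by one word */
        memmove(sc->line, sc->line + 1, (CACHE_WORDS - 1) * sizeof(int));
        sc->base++;
    }
    sc->line[sc->sp - sc->base] = value;
    sc->memory[sc->sp] = value;   /* write-through: memory always current */
    sc->sp++;
}

/* Context switch: nothing to save, since memory is already up to
 * date; just remap onto the new stack and refill the window. */
void sc_switch(StackCache *sc, int *new_memory, int new_sp)
{
    int live = new_sp < CACHE_WORDS ? new_sp : CACHE_WORDS;
    sc->memory = new_memory;
    sc->sp     = new_sp;
    sc->base   = new_sp - live;
    memcpy(sc->line, new_memory + sc->base, live * sizeof(int));
}
```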
lamaster@ames.arpa (Hugh LaMaster) (10/22/87)
This may have come up before - if so, I missed it. The Sept. 87 issue of IEEE
Computer has an article, one of whose authors is The Michael Flynn, which
examines certain aspects of RISC architectures. The simulations described use
rather simple benchmarks, so it would be dangerous to generalize too much
from the article, but register windows don't seem to be a win even where we
would expect them to do best: the benchmarks were medium-sized Pascal
programs with lots of procedure calls. (Another part of the article is an
examination of tradeoffs between compact instructions to reduce instruction
traffic, and simpler instructions + instruction cache using the same chip
area...)

Do any of the register window advocates have similar studies which
demonstrate a WIN for register windows?
tim@amdcad.AMD.COM (Tim Olson) (10/22/87)
In article <933@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes:

	... good stuff about register windows deleted ...

|an easy counterexample to Donald's sweeping generalization). Task switch is
|harder to do well, agreed, but it occurs about 1/100th as often as call/return.

Actually, we measure more like .5% - 1% procedure calls (== 1% - 2%
call+return). Assuming 40ns cycles and ~100 - 200 context switches/sec, then
context switches are only 1/1000 to 1/4000 as often as call/return.
-- 
	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)
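The arithmetic here can be checked mechanically. The little helper below assumes, as the post appears to, roughly one instruction completed per 40ns cycle:

```c
/* Ratio of context switches to calls: switches/sec divided by
 * (instructions/sec * fraction of instructions that are calls).
 * Assumes one instruction completes per cycle. */
double switch_to_call_ratio(double cycle_ns, double call_frac,
                            double switches_per_sec)
{
    double insns_per_sec = 1e9 / cycle_ns;   /* 25 million at 40ns */
    return switches_per_sec / (insns_per_sec * call_frac);
}
```

Plugging in the extremes (0.5% - 1% calls, 100 - 200 switches/sec) gives ratios between roughly 1:625 and 1:2500 counting calls alone, and 1:1250 to 1:5000 counting call+return - the same order of magnitude as the 1/1000 to 1/4000 quoted.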
baum@apple.UUCP (Allen J. Baum) (10/22/87)
In article <5567@jade.BERKELEY.EDU> mwm@eris.BERKELEY.EDU (Mike (My watch has windows) Meyer) writes:
>Every time I've looked at an article on register windows, I think of
>something I refer to as "the Johnson stack machine."
>I have good reason to believe that AT&T was working on hardware that
>used this model. Does anyone know what became of that?
>
>Better yet, can someone who actually understands hardware explain why
>I've not seen anything like the above in the real world?

Check out the AT&T CRISP microprocessor (see ASPLOS II, or the Feb. 87
Compcon proceedings). That's what they are using.
-- 
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385
hansen@mips.UUCP (Craig Hansen) (10/22/87)
In article <933@cpocd2.UUCP>, howard@cpocd2.UUCP (Howard A. Landman) writes:
> In article <201@PT.CS.CMU.EDU> lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) writes:
> >There are those who argue that interrupts and task switches are enormously
> >rare compared to subroutine calls and returns. They are right. They then
> >argue that the rare events can be penalized, if this makes the common events
> >run faster. (The RISC argument, if you will.)
> 
> The argument is simply that this approach leads to the best average
> performance. Which it does.

Sigh. As always, RISC means many different things to different people. "The
RISC argument" leads to varying conclusions, depending on what you're
focusing on, and what your design team can and can't do easily. I've worked
on three RISC architectures in different environments, and they each ended up
quite unique. The MIPS team has the strongest software and measurement skills
of the three designs, and that let us delegate a lot more to software, and
quickly see where design trade-offs were taking us.

Unfortunately, comparing interrupt and task switch rates to subroutine call
and return rates is only part of the story. The register window approach
trades hardware resources against optimizing software. The MIPS-R2000, which
doesn't have hardware register windows, relies on intelligent use of a flat,
uniform register set, implemented by optimizing compilers. As has been
described in this forum previously, procedure arguments and return values are
passed in registers. In addition, the compiler system identifies leaf-level
routines, and allocates variables into registers, starting at these
leaf-level routines and working its way up the call tree. When registers must
be spilled to memory, the optimizer selects appropriate locations for the
register saving and restoring code (moving up the call tree and outside
loops, etc.). The end result is a healthy reduction in the amount of register
saving and restoring at procedure boundaries.
The register window approach opts for simpler software (a good idea if you
don't have software), but spends more die area, and ultimately cycle speed,
on register file accesses and register file spilling hardware. Ultimately,
the selection of an architecture is the result of several inter-related
design trade-offs.

While it is not an absolute requirement of register windows, let me observe
that every one of these machines built to date takes multiple cycles to
perform canonical load and store operations. There's no question that
register windows can, in some cases, reduce load and store frequencies, but
to really pay their way, they have to reduce load and store frequencies
sufficiently to offset the higher cost of load and store operations on these
machines. I've not been on the design team of a register-windowed machine,
but it seems that the H/W designers might have assumed that the register
windows were going to eliminate so many memory references that they didn't
spend the time they should have on making loads and stores go fast.

If you are concerned about UNIX kernel performance, it would be worthwhile to
note that the MIPS-R2000 instruction set and compiler system are designed to
permit efficient references to data contained in structures (a reference to a
structure element through a pointer is a single-cycle operation, but is
two-to-three cycles at minimum for the SPARC and Am29k register-window-based
machines). Now what's the relative frequency of structure references in a
UNIX kernel? (Hint: Dhrystone will give you the wrong answer. I'm not trying
to be glib here; Dhrystone is a fine benchmark for comparing some things
[uncached machines without optimizing compilers, x86 compilers for Personal
Computers], but the characteristics of data references, particularly the use
of "addressing modes" and data dependencies, aren't representative of
anything.)
If you've got some dusty FORTRAN card decks that you've been using on some
itty-bitty-mainframe computer, you should note that procedure calls aren't
all that frequent, compared to Dhrystone. Having an optimizer that can put a
couple of dozen variables (not necessarily local variables) into registers
efficiently, and make fast references to global variables (FORTRAN is
notorious for its use of variables stuffed into common blocks), is a big win
here. The MIPS compiler system puts aside one register to point to a region
of the global data space, where one single-cycle instruction can get to its
value. Again, here, the speed of basic load and store operations will matter
more than whether the registers are windowed.
-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen
aglew@ccvaxa.UUCP (10/24/87)
>>Henry Spencer:
>>Register windows may end up being a net win despite context-switch overhead.
>
>There isn't that much of a context-switch overhead, either. Saving
>the register windows does not dominate the cost of a
>context-switch. Context switches require hundreds of instructions,
>while a pipelined system should be able to save the register file
>in about 100-150 instructions.
>
>Greg Frazier

Why do context switches require hundreds of instructions? On Real-Time OSes a
context switch can require much less than a hundred instructions - and that's
without a CISCy dispatch instruction. Even on UNIX a context switch can be
made to occur in not much more time than a syscall (which is another story).
The 'hidden' loss due to cache flushing, etc., can cost a lot, but on real
memory systems that shouldn't bother you.

Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms.arpa

I always felt that disclaimers were silly and affected, but there are people
who let themselves be affected by silly things, so: my opinions are my own,
and not the opinions of my employer, or any other organisation with which I
am affiliated. I indicate my employer only so that other people may account
for any possible bias I may have towards my employer's products or systems.
tim@amdcad.AMD.COM (Tim Olson) (10/24/87)
In article <821@mips.UUCP>, hansen@mips.UUCP (Craig Hansen) writes:
+-----
| There's no question that register windows can, in some cases, reduce load
| and store frequencies, but to really pay their cost, they have to reduce
| load and store frequencies sufficiently to offset the higher cost of load
| and store operations on these machines. I've not been on the design team of
| a register windowed machine, but it seems that the H/W designers might have
| assumed that the register windows were going to eliminate so many memory
| references that they didn't spend the time they should have on making loads
| and stores go fast.
+-----

Yes, the register windows eliminate *many* memory references (although fast
loads/stores are still a high priority). For example, MIPS published a paper
at the recent ASPLOS II conference showing dynamic load/store percentages, as
well as the percent of load/store instructions that required a non-zero
offset calculation:

				 nroff	  asl
	load/store %		  28%	  30%
	non-zero offset %	  88%	  80%

Contrast this to the statistics gathered on the Am29000 simulator (a
register-windowed machine):

				 nroff	  asm29k
	load/store %		  16%	  16%
	non-zero offset %	  9.0%	  9.2%

Now, since MIPS didn't publish the input they used on these programs to
derive their numbers, we obviously cannot perform a direct comparison. We
assume, however, that the input wasn't "specialized" in any way.

One possible explanation for this is that non-register-windowed machines have
many more loads/stores to the stack during procedure call entry/exit,
resulting in higher load/store percentages as well as higher utilization of
the base+offset addressing mode.

The main point to realize here is that a series of trade-offs is made in any
processor design, and what might be a big win for one processor may have
minimal impact on another. These individual tradeoffs cannot be taken as
"absolutes" without examining their relationship with the rest of the
architecture.
-- Tim Olson Advanced Micro Devices (tim@amdcad.amd.com)
hansen@mips.UUCP (Craig Hansen) (10/25/87)
In article <18843@amdcad.AMD.COM>, tim@amdcad.AMD.COM (Tim Olson) writes: > [a comparison of load/store frequencies and non-zero offset frequencies] > Now, since MIPS didn't publish the input they used on these programs to > derive their numbers, we obviously cannot perform a direct comparison. Even if the input were the same, because the instruction count isn't necessarily the same for these two machines, the comparison of load/store frequencies isn't really relevant. In addition, as1 and am29k aren't even the same program! (There are several variants of nroff around, too.) > We assume, however, that the input wasn't "specialized" in any way. > One possible explination for this is that non-register-windowed machines > have many more loads/stores to the stack during procedure call > entry/exit, resulting in higher load/store percentages as well as higher > utilization of the base+offset addressing mode. Yes, a possible explanation, but wrong. When compiling nroff (a version released as part of UMIPS-BSD), using our inter-procedural register allocator, and an input file of 700 lines, 4670 words, and 30530 characters, we get the following dynamic statistics: 26.4% loads+stores 52.1 instructions per call Average registers saved per call: 1.6 Register save+restore: 23.3% of loads+stores (7.0% of instructions) (This percentage includes variables which must reside in memory to handle call by address parameters.) If you eliminated all these register save/restores, load+store frequency would be 20.9% (remember that the total instruction count changes too). Loads/stores to variables through global pointer register: 15.2% (Because of use of the global pointer register, these references are single instructions, all of which have non-zero offset values.) I would suspect that the Am29000 statistics do not count these as non-zero offset values, though they are more than half of all the references. 
Other load/stores: 4.2% (Zero-valued offsets are frequent within this
sub-class, but are only about 3% of total instruction references.)

> The main point to realize here is that a series of trade-offs are made
> in any processor design, and what might be a big win for one processor
> may have minimal impact on another. These individual tradeoffs cannot
> be taken as "absolutes" without examining their relationship with the
> rest of the architecture.

Amen. Tim should also be careful, though, to realize that the tools and
compilers used to measure the architecture also influence the results.
Unless you set out to design both the architecture and the compiler
concurrently, you won't be able to make good trade-offs between them.

It should also be noted, again, that the selection of programs used as
benchmarks will influence the results too. Making your architectural
trade-offs on the basis of Dhrystone, or even nroff, which is not very
representative of more modern C code, isn't very smart. I dare say,
though, that trading the MIPS architecture for the MIPS architecture
+ windowed-registers - displacements - single-cycle load/stores would be
a loss.

Regards,
--
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen
tim@amdcad.AMD.COM (Tim Olson) (10/26/87)
In article <833@mips.UUCP> hansen@mips.UUCP (Craig Hansen) writes:
| Loads/stores to variables through global pointer register: 15.2% (Because of
| use of the global pointer register, these references are single instructions,
| all of which have non-zero offset values.) I would suspect that the Am29000
| statistics do not count these as non-zero offset values, though they are
| more than half of all the references.

You are correct; these were not counted. The statistics were gathered by
looking (at "runtime") for explicit address calculations of load
addresses. I don't really consider the use of the global pointer
register to be a "non-zero offset" addressing mode, however. Because
the lower 16 bits of the global pointer register are zero, the offset is
really just a concatenation -- no "add" is performed.

| Tim should also be careful, though, to realize that the tools and
| compilers used to measure the architecture also influences the results.

Certainly.

| Unless you set out to design both the architecture and the compiler
| concurrently, you won't be able to make good trade-offs between them.

THE compiler? Which one? C? Pascal? FORTRAN? Ada? Common LISP?
Smalltalk? Obviously you must design the architecture with a great deal
of thought put into how the resources are to be used by software and
what can be easily/cheaply performed in software vs. hardware, but I
disagree with your statement. MIPS certainly didn't write an entire Ada
compiler before starting architectural design, but that in no way
decreases the viability of Ada running on a MIPS machine.

| It should also be noted, again, that the selection of programs used as
| benchmarks will influence the results too. Making your architectural
| trade-offs on the basis of Dhrystone, or even nroff, which is not very
| representative of more modern C code, isn't very smart.

Making architectural decisions based on *any* single program isn't very
smart.
You should examine a large body of code, looking at older, heavily-used programs as well as more "modern" code (output of C++ compilers, object-oriented programming). Discounting nroff is kind of silly, however; it happens to be a large, heavily-used utility. MIPS used it, along with other significant UNIX utilities, in early papers on the R2000 [Rowen, et al, VLSI Systems Design, March 1986], and continues to reference it in John Mashey's performance briefs. -- Tim Olson Advanced Micro Devices (tim@amdcad.amd.com)
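Olson's concatenation point can be sketched numerically: with the low 16
bits of the global pointer held at zero, a 16-bit offset never generates
a carry, so an OR (pure concatenation) yields the same address as an
ADD. A minimal sketch; the gp value below is invented, not taken from
either machine:

```python
GP = 0x00400000                  # hypothetical gp; low 16 bits are zero

def addr_via_add(offset):
    # what a base+offset addressing mode would compute
    return GP + offset

def addr_via_or(offset):
    # pure concatenation: no carry chain needed in hardware
    return GP | offset

# equal for every 16-bit offset, because no bit positions overlap
for off in (0, 1, 0x7FFF, 0xFFFF):
    assert addr_via_add(off) == addr_via_or(off)
```

This is why such references need not be charged as "offset" address
calculations: no adder is on the path.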
earl@mips.UUCP (Earl Killian) (10/26/87)
In article <18855@amdcad.AMD.COM>, tim@amdcad.AMD.COM (Tim Olson) writes:
> In article <833@mips.UUCP> hansen@mips.UUCP (Craig Hansen) writes:
> | It should also be noted, again, that the selection of programs used as
> | benchmarks will influence the results too. Making your architectural
> | trade-offs on the basis of Dhrystone, or even nroff, which is not very
> | representative of more modern C code, isn't very smart.
> Making architectural decisions based on *any* single program isn't very
> smart. You should examine a large body of code, looking at older,
> heavily-used programs as well as more "modern" code (output of C++
> compilers, object-oriented programming).

It sounds like everyone's in agreement, and yet, so far the discussion
has talked about one program! What's interesting about programs is that
some statistics are fairly consistent and some vary all over the place.
To show the variance of the statistics relevant to this discussion,
consider a wide range of programs (statistics for the MIPSco
architecture):

                                                     non-sp/gp  non-sp/gp
           loads  stores  sp-based  reg      gp-based 0-offset  non-0 offset
                          ld/st     ld/st    ld/st    ld/st     ld/st
           -----  ------  --------  -----    -------- --------  ------------
espresso   19.6%   1.1%    0.1%      0.1%     1.3%    18.7%      0.4%
spice      26.9%  16.3%    7.2%      2.8%     4.2%     1.5%     27.5%
wolf       25.3%   8.2%    7.1%      1.9%     3.6%     8.0%     12.9%
yacc       15.7%   2.1%    0.9%      0.5%     2.5%    12.4%      1.5%
diff       16.2%   3.2%    0.5%      0.7%     4.6%     7.2%      6.4%
compress   18.3%  10.6%    0.1%      3.5%     8.1%     7.8%      9.4%
uopt       21.8%   8.4%    5.6%      5.2%     1.2%     6.8%     11.4%
as1        18.3%  11.2%    4.4%      6.8%     3.7%     3.8%     10.8%
nroff      18.8%   8.6%    0.4%      7.7%    14.5%     3.1%      1.7%
tex        21.9%  13.8%    3.6%      9.2%    10.8%     5.1%      7.0%
ccom       18.7%  12.2%    3.4%     11.9%     3.8%     5.0%      6.9%
doduc      29.4%  10.2%   10.1%      4.1%    12.3%     1.5%     11.6%

The sum of the last 5 columns should be equal to the sum of the first 2.
In other words, the last 5 are a partition of the loads and stores into
sp-based, register saves and restores at procedure entry/exit, gp-based
(that is, small static variables addressed by a dedicated register,
which always have a nonzero offset on the MIPSco machine), other
load/stores with zero offset, and other load/stores with nonzero
offset. Everything is a percentage of the total instructions executed.

My conclusion: be careful of drawing conclusions from a small number of
data points. For example, when deciding whether to have an offset for
load/store, looking at just espresso (0.4% nonzero) or just spice
(27.5% nonzero) would lead to very different results.

Caveat: these statistics are from one architecture/compiler. Your
actual mileage may vary. In particular, I would bet the 29k compilers
try hard to avoid needing offsets. To the extent that these techniques
succeed, it would reduce the 27.5% slowdown you'd expect in spice.
However, I think these statistics do give some indication of the
desirability of having a load/store offset.
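Killian's "very different results" point can be made concrete with a
first-order model: if the offset addressing mode were dropped, each
load/store in the non-0-offset column would need a separate add, so the
instruction count grows by roughly that column's share of all
instructions. A sketch using four rows of the table (this deliberately
ignores any compiler scheduling that could hide the add):

```python
# "non-0 offset ld/st" column from the table above, as % of all instructions
nonzero_offset_pct = {
    "espresso": 0.4,
    "spice": 27.5,
    "yacc": 1.5,
    "nroff": 1.7,
}

def naive_slowdown(program):
    """Fractional growth in instruction count if every nonzero-offset
    load/store needed an explicit add (first-order estimate only)."""
    return nonzero_offset_pct[program] / 100.0

# espresso barely notices; spice pays heavily -- the variance is the point
assert naive_slowdown("espresso") < 0.01
assert abs(naive_slowdown("spice") - 0.275) < 1e-9
```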
daveb@geac.UUCP (10/26/87)
In article <5567@jade.BERKELEY.EDU> mwm@eris.BERKELEY.EDU (Mike (My watch has windows) Meyer) writes: >Every time I've looked at an article on register windows, I think of >something I refer to as "the Johnson stack machine." >I have good reason to believe that AT&T was working on hardware that >used this model. Does anyone know what became of that? Well, ICL (in Britain) was working with stack-top caches on the large 2900's some years ago (10!). If anyone from Blighty would care to comment on what happened subsequently, I'll refrain from blurting out what was then the hot technique in the internal rumor-mill. (Can you say "half-cache"?) --dave (walking security breach) c-b -- David Collier-Brown. {mnetor|yetti|utgpu}!geac!daveb Geac Computers International Inc., | Computer Science loses its 350 Steelcase Road,Markham, Ontario, | memory (if not its mind) CANADA, L3R 1B3 (416) 475-0525 x3279 | every 6 months.
bcase@apple.UUCP (Brian Case) (10/26/87)
In article <18843@amdcad.AMD.COM> tim@amdcad.AMD.COM (Tim Olson) writes: > [[MIPS numbers]] > nroff asl > load/store % 28% 30% > non-zero offset% 88% 80% > >Contrast this to the statistics gathered on the Am29000 simulator >(register-windowed machine): > > nroff asm29k > load/store % 16% 16% > non-zero offset% 9.0% 9.2% > >Now, since MIPS didn't publish the input they used on these programs to >derive their numbers, we obviously cannot perform a direct comparison. Just out of curiosity, are these numbers static or dynamic? I would hope that they both represent dynamic measurements. Another thing to notice is that the load/store operations incurred by stack cache over-/under-flows are guaranteed to be serial in nature; that is, they are going to be to/from sequential addresses and are thus amenable to load-/store-multiple (i.e. burst-mode) operations. It may be that a majority of load/store operations in non-stack-cache machines are also around procedure calls and would also be amenable to burst-mode transfers. However, since over-/under-flows are rare events in stack- cache machines (at least on the Am29000), the effect may not be significant. Do you MIPS Co./AMD guys have any data on sequential/non-sequential load/store behavior?
tim@amdcad.AMD.COM (Tim Olson) (10/27/87)
In article <6554@apple.UUCP> bcase@apple.UUCP (Brian Case) writes: | In article <18843@amdcad.AMD.COM> tim@amdcad.AMD.COM (Tim Olson) writes: | >Contrast this to the statistics gathered on the Am29000 simulator | >(register-windowed machine): | > | > nroff asm29k | > load/store % 16% 16% | > non-zero offset% 9.0% 9.2% | > | Just out of curiosity, are these numbers static or dynamic? I would | hope that they both represent dynamic measurements. Yes, they are dynamic measurements. -- Tim Olson Advanced Micro Devices (tim@amdcad.amd.com)
henry@utzoo.UUCP (Henry Spencer) (10/28/87)
> ...you used all that chip space for a > register-speed cache that is mapped onto memory. The mapping is tracked > by a base/length pair. It's a write-through cache. > The trick is that this is mapped onto the top of the stack... How much this wins depends on details. If the stuff is addressed as normal memory -- remember that optimizing compilers often want to use their temporaries in a non-LIFO pattern, so making "push" and "pop" special cases isn't necessarily a win -- then you pay the various penalties involved in needing memory addresses for all your temporaries. If the cache pretends to be registers, then what you have is register windows with a lot more writes to memory, some small fraction of which incidentally make it unnecessary to do writes at context-switch time. This is probably not a win unless you are big on interrupt latency at the expense of throughput, because you *are* doing a lot more memory accesses, although they are less concentrated. It's an interesting idea that might work well if done right, but it's *not* "all of the advantages of register windows with none of the disadvantages". It has some of the advantages and has disadvantages of its own. -- PS/2: Yesterday's hardware today. | Henry Spencer @ U of Toronto Zoology OS/2: Yesterday's software tomorrow. | {allegra,ihnp4,decvax,utai}!utzoo!henry
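A toy model of the scheme Spencer is critiquing may make the trade-off
easier to see: a register-speed cache mapped onto the top of the memory
stack, tracked by a base/length pair, with every write going through to
memory. All names here are mine; this is a sketch of the idea, not of
any real design:

```python
class StackTopCache:
    """Write-through cache covering memory addresses [base, base+length)."""

    def __init__(self, memory, length):
        self.mem, self.len, self.base = memory, length, 0
        self.cache = [0] * length

    def covers(self, addr):
        return self.base <= addr < self.base + self.len

    def read(self, addr):
        # hit inside the window; otherwise go to memory
        return self.cache[addr - self.base] if self.covers(addr) else self.mem[addr]

    def write(self, addr, val):
        self.mem[addr] = val                 # write-through: memory always current
        if self.covers(addr):
            self.cache[addr - self.base] = val

    def move_top(self, new_base):
        # on call/return the window is remapped; refill is safe because
        # write-through kept memory up to date
        self.base = new_base
        for i in range(self.len):
            self.cache[i] = self.mem[new_base + i]
```

Note how Spencer's objection shows up directly: every `write` touches
memory, trading throughput for cheap context switches.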
aglew@ccvaxa.UUCP (10/28/87)
..> Tim Olson posts comparisons of load/stores, and load/stores ..> with non-zero offsets, for MIPS (lots of indexing) and AMD29K (little). How do you count load/stores with non-zero offsets for the AMD29000, seeing as it has no such addressing mode? Was this for a design alternative, that did have indexing? Or do you count by looking at the number of loads that have an add immediately preceding to the register used in the address of the load/store? Variations of this last method would seem to me to be very susceptible to masking by code rearrangement in the optimizer. A while back I posted static counts of address constants on VAX 4.2 BSD. As I remember it, about 40% were for locals (stack offsets), but 20% were for pointer+field_offset. How do you optimize those away?
tim@amdcad.UUCP (10/29/87)
In article <28200058@ccvaxa> aglew@ccvaxa.UUCP writes:
| How do you count load/stores with non-zero offsets for the AMD29000,
| seeing as it has no such addressing mode? Was this for a design
| alternative, that did have indexing? Or do you count by looking at the
| number of loads that have an add immediately preceding to the register
| used in the address of the load/store? Variations of this last method
| would seem to me to be very susceptible to masking by code rearrangement
| in the optimizer.

Yes, we counted the "offset" loads and stores by looking for an add
(dynamic, at runtime) preceding the load or store which computes into
the load/store address register. You are correct that this might be
susceptible to code rearrangement -- however, our internal compiler
doesn't schedule loads, and the only other code motion is due to loop
invariance [not a problem here] and filling branch delay slots [the
add, load/store will still be sequential at runtime].

| A while back I posted static counts of address constants on VAX 4.2 BSD.
| As I remember it, about 40% were for locals (stack offsets), but 20% were
| for pointer+field_offset. How do you optimize those away?

Perhaps there is a big difference between static and dynamic statistics?

-- tim
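Olson's dynamic counting rule can be sketched as a trace scan. The
instruction tuple format below is invented for illustration; the rule is
the one he describes: a load/store counts as "offset" when the
immediately preceding executed instruction is an add into its address
register:

```python
# invented trace format:
#   ("add",   dest_reg, src_reg, immediate)
#   ("load",  dest_reg, addr_reg)
#   ("store", addr_reg, src_reg)

def count_offset_ldst(trace):
    """Return (offset_ldst, total_ldst) over a dynamic trace."""
    offset_ldst = total_ldst = 0
    prev = None
    for insn in trace:
        if insn[0] in ("load", "store"):
            total_ldst += 1
            addr_reg = insn[2] if insn[0] == "load" else insn[1]
            # preceding add computed this load/store's address register?
            if prev and prev[0] == "add" and prev[1] == addr_reg:
                offset_ldst += 1
        prev = insn
    return offset_ldst, total_ldst
```

As the thread notes, any optimizer that hoists the add (out of a loop,
into a delay slot) makes the pair non-adjacent and the count misses it,
which is exactly the masking Glew worries about.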
bcase@apple.UUCP (Brian Case) (10/29/87)
In article <28200058@ccvaxa> aglew@ccvaxa.UUCP writes:
>
>..> Tim Olson posts comparisons of load/stores, and load/stores
>..> with non-zero offsets, for MIPS (lots of indexing) and AMD29K (little).
>
>How do you count load/stores with non-zero offsets for the AMD29000,
>seeing as it has no such addressing mode? Was this for a design
>alternative, that did have indexing? Or do you count by looking at the
>number of loads that have an add immediately preceding to the register
>used in the address of the load/store? Variations of this last method
>would seem to me to be very susceptible to masking by code rearrangement
>in the optimizer.
>
>A while back I posted static counts of address constants on VAX 4.2 BSD.
>As I remember it, about 40% were for locals (stack offsets), but 20% were
>for pointer+field_offset. How do you optimize those away?

Well, I can tell you that, at least for the original 29000, an offset
addressing mode was NEVER a design alternative. I would have had a
tantrum. Anyway, when Tim told me that he was collecting these
statistics, one of my first thoughts was: "But wait: optimization will
move some of the add instructions away from the object load or store
instruction. That will skew the results!" Upon further thought, I
realized that if optimization can move the add instruction away (out of
a loop or to fill a branch delay or load/store delay slot), then it
*shouldn't* be counted anyway. The more of these that happen, the more
sense it makes not to have the offset addressing mode. In other words,
let's count only those offset additions that would be *profitably*
integrated with the load or store instruction itself. Even though the
numbers Tim collected are low, I believe that a real optimizing compiler
(coming soon, I hear) will produce even lower (if only marginally lower)
numbers. My compiler does not do load/store latency scheduling; if it
did, more opportunities to overlap the offset add would arise.
In retrospect, large performance increases could have been realized, especially for the 29000 with VDRAM memory configurations (this has been proven by hand coding; sigh, one has only so much time...). I'll reiterate: you just can't say things like: "processors need to have x, no matter what the rest of the architecture is like." The offset addressing mode has its own costs. From the statistics collected from the output of my (not-wonderfully-optimizing) compiler, one is led to believe that an offset addressing mode would have little positive impact on the 29000. Further, it would only add another delay in the memory addressing path. I was very glad to see the numbers you collected from studying the VAX. Were those dynamic measurements?
bcase@apple.UUCP (Brian Case) (10/29/87)
In article <18903@amdcad.AMD.COM> tim@amdcad.UUCP (Tim Olson) writes:
>Yes, we counted the "offset" loads and stores by looking for an add
>(dynamic, at runtime) preceding the load or store which computes into
>the load/store address register.
>
>You are correct that this might be suceptible to code rearrangement --
>however, our internal compiler doesn't schedule loads, and the only
>other code motion is due to loop-invarience [not a problem here] and
>filling branch delay slots [the add, load/store will still be sequential
>at runtime].

Well, it is true that load/store scheduling isn't done, but the offset
computation instruction *will* be factored out of a loop when its source
operands are loop invariant; I have seen the compiler do this. And
then, if this instruction is the only one in the loop header, and there
is only a branch to the loop header, the offset add will get moved even
farther away. Anybody wanna write a debugger? :-)
bcase@apple.UUCP (Brian Case) (10/29/87)
In article <6571@apple.UUCP> bcase@apple.UUCP (Brian Case) writes: >In article <28200058@ccvaxa> aglew@ccvaxa.UUCP writes: >>A while back I posted static counts of address constants on VAX 4.2 BSD. ****** >>As I remember it, about 40% were for locals (stack offsets), but 20% were >>for pointer+field_offset. How do you optimize those away? > >I was very glad to see the numbers you collected from studying the VAX. >Were those dynamic measurements? ******* Oops, sorry for the stupid question.
aglew@ccvaxa.UUCP (10/31/87)
..> Procedure call vs. interrupt frequency.

Something else to point out: globally optimizing compilers can do inline
expansion of much code, reducing procedure call overhead and frequency.
This cannot be done to interrupts.

Of course, there are a whole slew of tradeoffs here. E.g. reduced call
overhead vs. cache misses, if the function is called more than once.
Code size. Inlining militates against shared libraries.

Andy "Krazy" Glew. Gould CSD-Urbana.  USEnet: ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801  ARPAnet: aglew@gswd-vms.arpa
aglew@ccvaxa.UUCP (10/31/87)
..> Address constants, register windows, and indexed addressing modes ..> vs. AMD's non-indexed addressing for the AMD29000. Addressing modes have fascinated me for a while, and I must admit that I am overjoyed by AMD's indications that non-indexing is not too painful. Now, would no register windows + non-indexing be useful? Something else to consider here is a non-virtual machine. Remember base registers? Base registers were originally introduced so that programs could be relocated, before virtual memory - on relocation, just update the base registers, plus the pool of address constants that the base registers are loaded from (ever wonder why PL/1 had a BASED attribute?). This requires that the base register + rest of address addition be atomic, or at least that no relocation may occur between them. It is possible to do this without base register addressing modes, but that handicaps optimizing compilers. Usually, base register addressing is: BASE + REGISTER + OFFSET since people want to have the REGISTER OFFSET addressing mode. Three input adders, ugh. But if we can get away without indexing, you just have BASE + LITERAL BASE + REGISTER And, as I am fond of pointing out, you can use OR instead of ADD. Why would I want base register addressing in this age of virtual memory? Well, as you know, Crays aren't virtual... it might be nice to be able to introduce a high performance, non-virtual, addition to your product line. The biggest problem is keeping your compiler writer's dirty hands off the base registers, since you want to reserve them for the OS. Especially, you don't want them to use the base registers like an extended constant pool (grr..) Andy "Krazy" Glew. Gould CSD-Urbana. USEnet: ihnp4!uiucdcs!ccvaxa!aglew 1101 E. 
University, Urbana, IL 61801  ARPAnet: aglew@gswd-vms.arpa
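Glew's relocation point, sketched minimally: if every address is formed
as base + offset at access time, moving a program means updating one
base register (plus the address-constant pool), with no patching of
individual references. The class and names below are illustrative only:

```python
class BasedProgram:
    """Addresses are formed at each access, never stored pre-added."""

    def __init__(self, base):
        self.base = base

    def addr(self, offset):
        # base + rest of address; must be atomic w.r.t. relocation
        return self.base + offset

    def relocate(self, new_base):
        self.base = new_base      # the only state that changes on a move

p = BasedProgram(0x1000)
before = p.addr(0x20)
p.relocate(0x8000)
after = p.addr(0x20)
assert before == 0x1020 and after == 0x8020
```

If code and data regions are aligned so the base's low bits are zero,
the addition can become the OR Glew mentions.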
aglew@ccvaxa.UUCP (11/02/87)
>I had an idea some time ago that I'm surprised I've never seen discussed.
>Suppose, for example, that your instruction processor has four stages. With
>conventional pipelining, that means that four consecutive instructions from
>the same program are at some stage of execution at the same time.
>
>Instead, why not have four different execution threads being performed
>simultaneously? This eliminates the dependency checks and latency delays
>inherent in "vertical" pipelining. (Many RISCs put these into the compiler
>instead of the architecture, but they're still there). On a multi-user
>system with a reasonable load level, it seems to me that this should
>represent a performance improvement. Of course, it won't look good on the
>standard benchmarks.
>
>Frank Adams ihnp4!philabs!pwa-b!mmintl!franka
>Ashton-Tate 52 Oakland Ave North E. Hartford, CT 06108

This idea, multiprocessing through pipelining, was used in the Denelcor
HEP machine, except that the HEP had many more stages in the pipeline.
If there are folk out there who worked at Denelcor, they can tell us
what happened; I think the gist of it is that they went under because
they were trying out too many new ideas, and because of manufacturing
problems. I've heard stories like <<Somebody in manufacturing saw this
great big loop of wire on the backplane, and decided to cut it off and
make it shorter. Except that the loop was there for timing...>>.
Apparently UNIX exercised the machine in ways that their proprietary OS
did not, and exposed bugs...

Overall, though, I *really* *like* the multiprocessing through
pipelining idea. I was first introduced to it by a British prof who
said "but who in the world can keep 40 processes at a time active?"
"UNIX", I thought.

Andy "Krazy" Glew. Gould CSD-Urbana.  USEnet: ihnp4!uiucdcs!ccvaxa!aglew
1101 E.
University, Urbana, IL 61801  ARPAnet: aglew@gswd-vms.arpa
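The HEP-style "multiprocessing through pipelining" can be sketched with
a round-robin issue loop: each cycle the issue stage draws from a
different thread, so a thread's next instruction never enters the pipe
while its previous one is still in flight, and the hazard checks Adams
mentions disappear. A toy model, with thread count standing in for
pipeline depth:

```python
def interleaved_issue(threads, cycles):
    """threads: list of per-thread instruction lists.
    Returns the (thread, instruction) issue order over `cycles` cycles."""
    pcs = [0] * len(threads)
    order = []
    for cycle in range(cycles):
        t = cycle % len(threads)          # round-robin thread select
        if pcs[t] < len(threads[t]):
            order.append((t, threads[t][pcs[t]]))
            pcs[t] += 1
        # else: that thread has finished; the slot goes empty (a bubble)
    return order
```

The British prof's objection is visible here too: with an N-stage pipe
you need N runnable threads every cycle to keep the machine busy.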
fu@uicsrd.csrd.uiuc.edu (11/03/87)
/* Written 7:49 am Oct 26, 1987 by daveb@geac.UUCP in uicsrd:comp.arch */
In article <5567@jade.BERKELEY.EDU> mwm@eris.BERKELEY.EDU (Mike (My
watch has windows) Meyer) writes:
>Every time I've looked at an article on register windows, I think of
>something I refer to as "the Johnson stack machine."
>I have good reason to believe that AT&T was working on hardware that
>used this model. Does anyone know what became of that?

>> Well, ICL (in Britain) was working with stack-top caches on the
>>large 2900's some years ago (10!). If anyone from Blighty would
>>care to comment on what happened subsequently, I'll refrain from
>>blurting out what was then the hot technique in the internal
>>rumor-mill. (Can you say "half-cache"?)

I assume you are talking about the ICL 2980, which was announced over 10
years ago as the top-of-the-line model. I didn't work on this
particular machine but on the follow-on, but I seem to remember that the
cache or slave store (in ICL-ese) only held stack elements due to its
small size. As the ICL 2900 architecture was stack based, this seemed a
reasonable trade-off. There was no specific architectural support for
this slave store, while, from what I can recall of the AT&T stack cache,
its operation is supported in the instruction set.

In the follow-up machine we had slightly larger chips to play with (1K
ECL RAMs!) and decided to cache all primary operands. Fortunately the
machine was cancelled due to lack of the small gyms necessary to house
the systems had we actually finished it (over 200 PCBs with over 100
packages per board). The last I heard, ICL no longer builds large
systems, but buys them ready to sell from the Japanese.

John Fu
University of Illinois,
Center for Supercomputer Research and Development.
ARPANET: fu%uicsrd@a.cs.uiuc.edu or sometimes fu%tallis@decwrl.dec.com
jpdres10@usl-pc.UUCP (Green Eric Lee) (11/04/87)
I've been following this conversation for some time. I wonder, how much
does the adder in the register path (to add the register stack pointer
to the instruction's desired register) slow things down in the Am29000?
If you have register windows, it might seem that the same thing sits in
the middle of the address path between instruction latch and register
addressing. But one scheme I've read about is to make your register
windows out of shift registers, i.e. no adding: as far as the machine is
concerned, you just have your 16 registers out there, or whatever, and
when a procedure call or return is made, the current registers are
shifted down or up (with some provisions for stack overflow). And, of
course, Plain Old Registers have no problem at all with something out
there in the register addressing path, except, of course, the decode
tree... It seems offhand that the MIPS approach, with lots of registers
and a very smart compiler capable of allocating them to procedures as
needed, might offer some potential future speed advantages over
traditional (non-shift-register) register windows or AMD's "stack
window" approach, unless some very heavy pipelining is employed.

--
Eric Green  elg@usl.CSNET       from BEYOND nowhere:
{ihnp4,cbosgd}!killer!elg,      P.O. Box 92191, Lafayette, LA 70509
{ut-sally,killer}!usl!elg       "there's someone in my head, but it's not me..."
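The shift-register window Green describes (and that Tremblay & Lang
implemented) can be sketched as follows. The visible window is always
physical registers 0..N-1, so no adder or window pointer sits in the
register-address path; call and return shift the file itself. Sizes are
illustrative, and this sketch spills a window on every call for
simplicity, where real hardware would spill only on overflow:

```python
WINDOW = 4        # registers visible at once (illustrative, not 16)
DEPTH = 3         # windows the hardware can hold

class ShiftRegFile:
    def __init__(self):
        self.file = [0] * (WINDOW * DEPTH)   # window 0 is the visible one
        self.spill = []                      # memory backing store

    def call(self):
        # shift down: a fresh window appears at indices 0..WINDOW-1;
        # the oldest window falls off the bottom into memory
        self.spill.extend(self.file[-WINDOW:])
        self.file = [0] * WINDOW + self.file[:-WINDOW]

    def ret(self):
        # shift up: the caller's window slides back into view; refill
        # the bottom from the spill area (zeros if nothing was spilled)
        refill = self.spill[-WINDOW:] if self.spill else [0] * WINDOW
        del self.spill[-WINDOW:]
        self.file = self.file[WINDOW:] + refill

    def read(self, r):                       # always indexes 0..WINDOW-1:
        return self.file[r]                  # no add, no window pointer

    def write(self, r, v):
        self.file[r] = v
```

The pay-off is in `read`/`write`: the register number from the
instruction indexes the file directly, which is the point Tremblay makes
about the data bus crossing only one window.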
rpw3@amdcad.AMD.COM (rpw3) (11/07/87)
In article <28200061@ccvaxa> aglew@ccvaxa.UUCP writes:
+---------------
| ..> Procedure call vs. interrupt frequency.
| Something else to point out: globally optimizing compilers can do inline
| expansion of much code, reducing procedure call overhead and frequency.
| This cannot be done to interrupts.
+---------------

Just another note about the Am29000... The entire register set does
*NOT* have to be saved on either a system call or an interrupt, only on
a "forced" context switch.

For a system call, not even the temps have to be saved; a system call is
a "subroutine", and temps are assumed destroyed across subroutines. A
system call also uses the same register window (set of busy/free
registers) as the user (yup!). Thus, if the current register window
happens to have enough registers free at the end, no registers need be
saved/restored to complete the system call.

Interrupts (at least those which call C code, and aren't just "soft-DMA"
assembly-language heads) *do* have to save the "global" temps, but still
need only save/restore the "local" stack-cache registers actually used.
Only in the case where an interrupt forces a context switch (say, at a
runtime quantum expiration) does the entire set have to be
saved/restored. (Context switches due to blocking I/O had to have a
system call first, so a lot of the registers were free already.)

A related point: Since system calls are implemented with a synchronous
"trap"-type instruction, virtually none of the processor state has to be
saved, either, only the user PC (the 29000's "PC1"). Contrast this with
CISC machines for which any mode change saves beaucoup bits... Do a
"vmstat" on your VAX, and you'll see that making system calls and
interrupts faster than context switches is a "good thing".

Rob Warnock
Systems Architecture Consultant
UUCP:  {amdcad,fortune,sun,attmail}!redwood!rpw3
ATTmail: !rpw3    DDD: (415)572-2607
USPS: 627 26th Ave, San Mateo, CA 94403
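Warnock's ordering (system call cheaper than interrupt, interrupt
cheaper than forced context switch) can be captured in a toy accounting
function. All register counts below are invented for illustration; only
the relative costs are the point:

```python
GLOBAL_TEMPS = 16     # caller-save temps (hypothetical count)
LOCAL_WINDOW = 128    # stack-cache / window registers (hypothetical count)
OTHER_GLOBALS = 64    # remaining global state (hypothetical count)

def regs_to_save(event, locals_in_use=32):
    """Registers a 29000-like kernel entry must save (toy model)."""
    if event == "syscall":
        # same window as the user; temps are assumed destroyed anyway
        return 0
    if event == "interrupt":
        # C-level handler: global temps plus only the locals actually used
        return GLOBAL_TEMPS + locals_in_use
    if event == "context_switch":
        # forced switch: everything goes to memory
        return GLOBAL_TEMPS + OTHER_GLOBALS + LOCAL_WINDOW
    raise ValueError(event)

assert regs_to_save("syscall") < regs_to_save("interrupt") \
       < regs_to_save("context_switch")
```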
bcase@apple.UUCP (Brian Case) (11/09/87)
In article <230@usl-pc.UUCP> jpdres10@usl-pc.UUCP (Green Eric Lee) writes: >I've been following this conversation for some time. I wonder, how >much does the adder in the register path (to add the register stack >pointer to the instruction's desired register) slow things down in the >AMD29000? It might also seem like if you have register windows, you >have the same thing sitting there in the middle of the address path >between instruction latch and register addressing, but one scheme I've >read about is to make your register windows out of shift-registers, >i.e. no adding, as far as the machine is concerned, you just have your >16 registers out there or whatever, and when a procedure call is made, >or a return made, shift the current registers down or up (with some >provisions for stack overflow). And, of course, Plain Old Registers >have no problem at all with something out there in the register >addressing path, except, of course, the decode tree... it seems >offhand that the MIPS approach, with lots of registers and a very >smart compiler capable of allocating them to procedures as needed, >might offer some potential future speed advantages over traditional >(non-shift-register) register windows or AMD's "stack window" >approach, unless some very heavy pipelining is employed. Well, that's a lot of stuff for one posting. First off, the adder in the register addressing path doesn't cost much, in the current implementation, other than silicon area; reason: the chip does register write before read, and the write must complete before reads can be done so that an extra level of forwarding (bypassing is another term often used) can be avoided. So, while the write is being done, we might as well do something useful like an address add. Machines like Berkeley RISC II (and I suspect SUN SPARC) simply integrate the "add" into the decode tree. This is somewhat less expensive, but the idea is the same. 
Ok, so the question comes up (at least it did at AMD): "What about other
implementations, e.g. ECL?" Well, it is quite likely that a 4-stage
pipeline is still going to be desirable in other technologies. It is also
likely that the elimination of a stage of forwarding with write-before-read
is desirable. At least with the kind of ECL we were considering at AMD,
things should scale reasonably well. Sure, things aren't going to scale
exactly, but we felt the large register file was worth the risk of a slight
mismatch in pipestage lengths.

As for the viability of flat register files, there is some evidence beyond
whatever you consider MIPS et al. to have provided. David Wall of DECWRL
constructed what is basically an optimizing linker for the experimental
Titan machine. This linker uses complete knowledge of the program object
code, and can even make use of fed-back run-time info, to construct a very
good register allocation. The effect is somewhat like register windows.
Unfortunately, this technique is based on static analysis, even when
run-time info is fed back, and so doesn't do the same job as register
windows. However, in practice, it is worth something. The Titan has 63
registers; Wall didn't say how well the technique would work with fewer
registers. Anyway, here is the reference:

    Wall, D. W., "Global Register Allocation at Link Time," ACM SIGPLAN
    Compiler Construction Conference, Palo Alto, June 1986.

Sorry, I don't have the Vol. and No. info, since I got the paper directly
from David.

Ok, so to address future speed advantages: yes, there might be some speed
advantages for those with simple register files. However, for the Am29000,
the critical paths were quite balanced (Dave Witt, are you out there?),
with, I believe, the TLB and/or instruction cache being the limiting
factor. Next came the ALU, and then the register file.
Unless you want to do things like spread the ALU cost over two pipestages
(possible to do), I don't think the register file is going to be the
limiting factor.

First, I don't mean to imply that we considered all other implementation
technologies, nor that we considered any very seriously. Second, some of
what I said about the 29000 implementation may not be completely correct,
but the gist is correct. Third, what do other people have to say? Probably
not much, since talking about this would be tantamount to disclosing future
plans. Oh well.
david@neptune.AMD.COM (11/10/87)
In article <6681@apple.UUCP> bcase@apple.UUCP (Brian Case) writes:
>Ok, so to address future speed advantages, yes there might be some speed
>advantages for those with simple register files. However, for the Am29000,
>the critical paths were quite balanced (Dave Witt, are you out there?)
>with, I believe, the TLB and/or instruction cache being the limiting
>factor. Next came the ALU, and then the register file. Unless you want
>to do things like spread the ALU cost over two pipestages (possible to do),
>I don't think the register file is going to be the limiting factor.

Well, since my friend bcase requested a response from me: on the 29K
design, the stack-relative add was one of the speed paths encountered on
the part, but certainly no worse than the 64-32 funnel shift, worst-case
ALU adds, TLB translation, or a conditional jump and read from the branch
target cache.

Specifically, for that particular path, in one half clock phase the
internal pipe was required to discharge the instruction bus and statically
add the stack pointer to the a, b, c offsets in three separate 7-bit
adders. In parallel, a zero detect and a check on the MSB of the a, b, c
values determined the selection in a 3:1 multiplexor to enable the
stack-relative local registers, the global registers, or the indirect
pointers. The output of the multiplexor was the address for the row/column
decode of the 3-port register file, which would be locally decoded and
accessed in the next half clock phase for a double read. (The write is
obviously delayed due to the internal pipe and therefore not a speed path.)

The total number of gate delays for this path (including a small amount of
lookahead for the adder) was 12 gates. In initial silicon at nominal temp
this worst-case path was passing in excess of 35 MHz. In my opinion, in our
internal pipe, it was not a major concern in terms of designing in this
functionality.
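The half-phase decode described above can be sketched functionally. This is
an illustrative model only: the bank conventions assumed here (field 0
selecting the indirect pointers, the MSB selecting stack-relative locals,
a 128-entry local file) are my own simplifications, not AMD's documented
encoding.

```python
# Sketch of a stack-relative register decode: a 7-bit adder adds the
# stack pointer to the instruction's register field, while a zero
# detect and the field's MSB steer a 3:1 mux among the indirect
# pointers, the global registers, and the stack-relative locals.
# Conventions are illustrative, not the actual Am29000 encoding.

def decode_register(field, stack_pointer):
    """Map an 8-bit register field to a (bank, physical index) pair."""
    if field == 0:                       # zero detect: indirect pointer
        return ("indirect", 0)
    if field < 128:                      # MSB clear: global register
        return ("global", field)
    # MSB set: stack-relative local; the 7-bit add wraps mod 128
    return ("local", ((field & 0x7F) + stack_pointer) & 0x7F)
```

The point of the 7-bit add wrapping mod 128 is that the local file behaves
as a circular buffer, so the "stack" of frames can slide through it without
any physical data movement.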
henry@utzoo.UUCP (Henry Spencer) (11/11/87)
> ... the MIPS approach, with lots of registers and a very
> smart compiler capable of allocating them to procedures as needed,
> might offer some potential future speed advantages over traditional
> (non-shift-register) register windows or AMD's "stack window"
> approach...

Don't forget that this is, to some extent, a function of what language you
want to use. The assumption behind lots-of-registers-plus-smart-compiler
is that calling patterns are predictable enough to permit the compiler to
base optimizations on them. This is not so (or not necessarily so) for
more dynamic languages like Smalltalk.
-- 
Those who do not understand Unix are | Henry Spencer @ U of Toronto Zoology
condemned to reinvent it, poorly.    | {allegra,ihnp4,decvax,utai}!utzoo!henry
hankd@pur-ee.UUCP (Hank Dietz) (10/04/88)
In article <ANDREW.88Sep28160417@jung.harlqn.uucp>, andrew@jung.harlqn.uucp
(Andrew Watson) writes:
> In article <91@zeno.MN.ORG> gene@zeno.MN.ORG (Gene H. Olson) writes:
> > [stuff as to ignoring register windows for SPARC...]
> > SPARC offers you the option of register windows. If they don't
> > help in some situations (they seem to work well in others) there
> > is no reason you need use them. Since only two of them are
> > required, there is no serious penalty if they are unused by an
> > implementation. It's clearly a win-win situation.
>
> Just a small point from a compiler-writer who's been attempting to write
> a code generator for the SPARC that *does* ignore register windows -
> it's possible, but the architecture really doesn't support it.
>
> [complaint about having no burst register push/pop instructions]

At HICSS'21, in January 1988, I chaired a discussion on RISC architecture,
which quickly became a discussion of register windowing concepts. Several
key points surfaced in the discussion; the following is just a bit about
what you really want instead of burst register save/restore
instructions.... LAZY STORE/LOAD SHOULD BE USED.

How Many Windows Do You Need?

For some reason, it has become popular to assume that the more registers
you have, the better off you are. However, consider window management
using lazy operations: you need only two *actual* register sets, and more
help only a very little.

Suppose we start executing in window A. Suddenly, we find that we must
execute a subroutine call, making the current window B (yet keeping A in
its register set). In most machines, not every memory reference cycle is
actually used to refer to memory... let's suppose that F is the fraction
of time during which the memory reference time slot is empty. Suppose also
that the number of instructions executed between creating window A and
switching to window B is I, and the number of instructions executed
between creating B and reverting to A is J.
Further suppose that there are R registers used in a typical register
window. If R*(1/F) < I and R*(1/F) < J, then by having the hardware
automatically save and restore "in-use" registers lazily (whenever it sees
free memory cycles), one can pretend to have as many register sets as one
wants, even though the hardware need implement only two: by the time a
register set must be reused for another set, the old set will have been
saved using lazy stores, hence it will be ready and waiting. Likewise,
when a set is to be removed, the hardware begins lazy reloading of the
previous set, hence no delay is seen. Compare this to the very-hard-to-hide
delay of using burst stores, with or without register windowing....

For three conceptual sets, A, B, and C, the sequence is like:

    A begins in set 0
    A executes some stuff
    B begins in set 1, A remains in 0
    B executes some stuff, while A is lazily stored
    C begins in the set that had been A's (set 0), B remains in 1
    C executes some stuff, while B is lazily stored
    C terminates, B is still available in 1
    B continues to execute (in 1), while A is lazily reloaded (in 0)
    B terminates, A is now available again (set 0)
    A continues to execute in set 0
    A terminates

It is easy to see that having more sets simply makes the hardware more
tolerant of variations in R, F, I, and J (since these values are
effectively averaged over all the actual register sets). Notice that a
lazy-store/lazy-reload, register-name-addressable top-of-stack cache is
virtually equivalent to the lazy window scheme, but probably a bit more
general.

...There is plenty more to say about this, if people are interested. (Of
course, when it comes to register window use, I'm biased: I'd rather use
the compiler-driven register/cache/CReg management that Chi and I have
developed; however, I'm perfectly happy to start a discussion which will
lead everyone else to the same conclusion ;-).

-Prof.
Hank Dietz
Compiler-oriented Architecture Researcher from Purdue
hankd@ee.ecn.purdue.edu
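The condition above reduces to simple arithmetic. A toy model, with my own
illustrative parameters rather than anything from the posting: if a
fraction F of instruction slots leaves the memory port idle, the R dirty
registers of a window can be spilled entirely in the background iff
R <= F * I, where I is the number of instructions executed between window
switches.

```python
# Toy model of hiding window spills in idle memory slots.
# R = registers to spill, F = fraction of idle memory slots,
# I = instructions executed between window switches.

def lazy_stall_cycles(R, F, I):
    """Store cycles NOT hidden by idle memory slots between switches."""
    free_slots = int(F * I)            # idle memory cycles available
    return max(0, R - free_slots)      # leftover stores stall the pipe
```

For example, with half the memory slots idle (F = 0.5), 8 registers are
fully hidden across a 20-instruction interval; at F = 0.2 only 4 of the 8
stores fit, and the remaining 4 would have to stall, burst-style.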
bcase@cup.portal.com (10/05/88)
> LAZY STORE/LOAD SHOULD BE USED.
>you need only two *actual* register sets and more help only a very little.
[SEE THE ORIGINAL FOR ADDITIONAL VERBOSE CONTEXT]
>one can pretend to have as many register sets as
>one wants, even though the hardware need implement only two: by the time a
>register set must be reused for another set, the set will have been saved
>using lazy stores, hence it will be ready and waiting. Likewise, when a set
>is to be removed, the hardware begins lazy-reloading of the previous set,
>hence no delay is seen. Compare this to the very-hard-to-hide delay of
>using burst stores, with or without register windowing....

In my opinion, this is BS: it is tantamount to saying you can predict the
sequence of calls and returns!! If my CPU starts lazily saving but the
next thing it does is return, it has wasted all the memory cycles spent
lazily saving. If my CPU lazily restores but the next thing it does is
call, it has wasted all the memory cycles spent restoring.

This is related to the question: "How many windows should be
saved/restored on a call/return?" This question was answered by the
Berkeley people: exactly one. The reason it is one is that you can't
predict the future.

"But," you say, "the lazy cycles aren't wasted, since they don't interfere
with normal memory references." Sorry, I don't buy this, because to insert
memory references so that they don't interfere means that you are able to
predict the pattern of memory references. Sure, by looking ahead and all
that, but this is *not simple*. You must be able to predict conditional
branches!

Lastly, a register window scheme with multiple windows provides hysteresis
*without any* memory references at all! The two-window-plus-laziness
scheme doesn't. Just because the memory interface isn't completely
saturated *doesn't* mean that the unused capacity can be used without
penalty.
There is an allocation penalty associated with arbitrating for any fixed
resource; think of an intersection controlled by a stoplight at rush hour.
The resource simply can't be used 100% of the time: we have to switch the
direction of traffic flow every now and then. This is the purpose behind
bursting: once the direction of traffic (memory) flow has been
established, speed through the intersection as fast as possible. Changing
the direction of flow after every car (memory reference) is inefficient. I
think queuing theory also comes into play here.

True, if you could look ahead at the instruction stream and predict
correctly all the conditional branches, etc., then you could know exactly
where to insert the lazy loads and stores. But I think this is
unrealistic. You have to consider the realities of implementation! If my
analysis has erred, please enlighten me!
jmd@granite.dec.com (John Danskin) (10/07/88)
I just love writing replies to unidentified flamers from the portal
system 8-).

In article <9725@cup.portal.com> bcase@cup.portal.com writes:
>> LAZY STORE/LOAD SHOULD BE USED.
>>you need only two *actual* register sets and more help only a very little.
>[SEE THE ORIGINAL FOR ADDITIONAL VERBOSE CONTEXT]
>>one can pretend to have as many register sets as
>>one wants, even though the hardware need implement only two: by the time a
>>register set must be reused for another set, the set will have been saved
>>using lazy stores, hence it will be ready and waiting. Likewise, when a set
>>is to be removed, the hardware begins lazy-reloading of the previous set,
>>hence no delay is seen. Compare this to the very-hard-to-hide delay of
>>using burst stores, with or without register windowing....
>
>In my opinion, this is BS: This is tantamount to saying you can predict
>the sequence of calls and returns!! If my CPU starts lazily saving but
>the next thing it does is return, it has wasted all the memory cycles spent
>lazily saving.
[Followed by a bunch more stuff about how you can't predict the
instruction stream]

There are a couple of ways people have proposed doing lazy saving:

o Katevenis (or somebody working with him) proposed trying to save
  register windows during unused memory cycles. They did some analysis and
  concluded that there were not enough unused cycles. Part of the problem
  here might have been that they needed to save the whole window before
  any of it was usable.

o When you start a new register window, flag all of the registers as
  'clean'. Whenever you update one, flag it as dirty. Now, when it comes
  time to reuse a register window, whenever the processor tries to update
  a previously dirty register, copy it somewhere and try to save it during
  an otherwise unused memory cycle. On return, flag registers which have
  lost their values as 'empty' (or something). When the processor tries to
  refer to an empty register, the value is reloaded. This is really lazy.
You only save/restore registers when you need to. Neither of these schemes
implies knowing the instruction stream in advance. I guess the point that
(unnamed) missed was that (other unnamed) assumed that the hardware was
managing the lazy loads/stores, not the compiler.

Of course, these schemes produce more complicated silicon than burst
read/write schemes. In particular, anything with flags per register
introduces latency into the register read/write path, which is the
critical path unless you have blown everything. Is it a win? Simulate and
see. Tell the world.

By the way: There is a paper:

    "Register Windows vs. General Registers: A Comparison of Memory Access
    Patterns" by Scott Morrison and Nancy Walker of UC Berkeley,

which shows that the MIPS R2000 (aside from running faster) makes fewer
memory references (in almost all cases) than SPARC, with all levels of
optimization and as many as 7 register windows.

a) Does anyone know if/where (Earl?) this paper was published?
   (I got a copy from MIPS people; they love to give it away.)

b) Does anybody at Sun have an answer (tell us how they got it all wrong,
   register windows really DO save memory references)?

c) Anybody at AMD (Tim?) want to say something about how burst read/write
   makes the extra references OK?

-- 
John Danskin                  | decwrl!jmd
DEC Technology Development    | (415) 853-6724
100 Hamilton Avenue           | My comments are my own.
Palo Alto, CA 94306           | I do not speak for DEC.
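The second (really lazy) scheme above can be sketched in a few lines of
bookkeeping. This is a hypothetical rendering of the clean/dirty/empty
flags, not any shipped design; `backing` stands in for the register's
memory home.

```python
# Sketch of per-register lazy save/restore: 'dirty' marks values at
# risk (modified since window entry, not yet saved), 'empty' marks
# values lost from the file that must be reloaded on first use.

class LazyWindow:
    def __init__(self, size=16):
        self.value = [0] * size
        self.dirty = [False] * size     # modified, unsaved
        self.empty = [False] * size     # lost; reload on demand
        self.backing = {}               # memory home for saved regs

    def write(self, i, v):
        self.value[i] = v
        self.dirty[i] = True
        self.empty[i] = False

    def read(self, i):
        if self.empty[i]:               # demand reload on first use
            self.value[i] = self.backing[i]
            self.empty[i] = False
        return self.value[i]

    def spill_one(self):
        """Save one dirty register in an otherwise idle memory cycle."""
        for i, d in enumerate(self.dirty):
            if d:
                self.backing[i] = self.value[i]
                self.dirty[i] = False
                return i
        return None                     # nothing at risk; cycle unused
```

Note that `spill_one` is only invoked when the memory port is idle, and
does nothing when no register is dirty, which is exactly the "only
save/restore when you need to" property claimed above.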
hankd@pur-ee.UUCP (Hank Dietz) (10/07/88)
Pertaining to my posting about lazy store/reload of register frames: in
article <9725@cup.portal.com>, bcase@cup.portal.com writes:
> [Flaming as to lazy ops not being free and some obscure concept about
> needing to predict the future...]
> The reason it is one is that you can't predict the future.
> [More flaming about memory interference...]
> ... You must be able to predict conditional branches!
>
> Lastly, a register window scheme with multiple windows provides
> hysteresis *without any* memory references at all! The
> two-window-plus-laziness scheme doesn't.
>
> ... True, if you could look ahead at
> the instruction stream, predict correctly all the conditional branches
> etc., then you could know exactly where to insert the lazy loads and
> stores. But I think this is unrealistic.

You don't seem to know what lazy operations are. A lazy operation is NOT a
simple "delayed" load or store, and it isn't triggered by an opcode being
executed; it is an operation which is automatically triggered either when
certain conditions exist which favor its efficient completion or when its
result is required -- one does NOT insert lazy operations in anything.

Let's say you have 8 non-lazy windows (and one heck of a lot of valuable
die space consumed by them). What do you do when the 9th nested call is
made? The 10th? You do a sit-and-wait-for-it burst store, that's what...
Would you really describe that as being "*without any* memory references
at all"?

With lazy store/reload built in, by the time you make a call nested deeper
than you have windows, it is very likely that you've got at least one
register set saved, so that you can immediately reuse it -- without
waiting for any memory references. When a procedure returns, the frame
which was saved begins to lazily reload its values into the set it was
flushed from.
Now, you can argue that there might not be enough time between calls or
between returns to lazily store or reload a set, but that's unlikely,
because:

Calls:   The lazy stores only have to store registers which are live and
         dirty, i.e., whose values will be referenced after return from
         the call and are not the same as values stored somewhere in
         memory (e.g., a variable) by the programmer's code. In most RISC
         processors, the only instructions able to make a register dirty
         are register = register op register... which all have a free
         memory reference cycle! In the worst case, you'd lag behind by
         just one register store, which sure beats doing a non-lazy burst
         every so often. QED.

Returns: The obvious worst case is if the instruction immediately after
         each call is a return. However, if you do return immediately,
         then those registers were not live after the call (see above):
         you would not have saved 'em in the first place, so they'd be
         restored in zero time. Since live values are usually kept in
         registers primarily to be operated upon, an argument similar to
         that given for calls can be applied... lazy reloading might not
         keep up with returns, but it shouldn't be far behind.

How unlikely? Specify an instruction set and timings and find out. That's
one of the reasons I posted this -- to inspire a bit of thought and
research. However, one has very good reason to believe it will work quite
well, and if it works even nearly as well as non-lazy windowing but
requires far fewer windows, just look at all that valuable die space you
just bought!

As for predicting the future: it has very little to do with lazy
operation, and "prediction" is a misnomer here anyway. Statically
examining code (e.g., at compile time), there is no past or future; you
know everything probabilistically, and that's usually good enough. I agree
that hardware can't predict the future, so you ought not try to make it do
that...
hardware is good at other things, like telling you exactly which way THIS
branch goes, or which memory module THIS pointer references. The
static/dynamic tradeoffs are what my research group, CARP, is all about.

Hank Dietz
Compiler-oriented Architecture Researcher from Purdue
hankd@ee.ecn.purdue.edu
tim@crackle.amd.com (Tim Olson) (10/07/88)
In article <287@granite.dec.com> jmd@granite.UUCP (John Danskin) writes:
| By the way: There is a paper:
|     "Register Windows Vs. General Registers: A Comparison of
|     Memory Access Patterns" by Scott Morrison and Nancy Walker
|     of UC Berkeley.
| Which shows that the MIPS R2000 (aside from running faster) achieves
| fewer memory references (in almost all cases) than SPARC with all
| levels of optimization and as many as 7 register windows.

This says fewer overall memory references, but what is missing here is
the ratio of loads and stores to the rest of the instruction mix. I
wouldn't be surprised if it is just that the Sun compiler is not doing as
good a job in general, so that the total number of instructions (including
loads and stores) increased with respect to the MIPS compiler. However,
the number of loads and stores as a percentage of the instruction mix
might be lower.

| a) Does anyone know if/where (Earl?) this paper was published?
|    (I got a copy from MIPS people, they love to give it away).
|
| b) Does anybody at SUN have an answer (tell us how they got it all
|    wrong, register windows really DO save memory references).
|
| c) Anybody at AMD (Tim?) want to say something about how burst
|    read/write makes the extra references OK?

Well, I don't know what Sun seems to be doing wrong, but let's try this:
BSD 4.3 nroff with the 4.3 libraries, running

    nroff /usr/doc/misc/sysperf/2.t    [a 10655-byte file]

results in:

---------- Pipeline ----------
32.63% idle pipeline:
    18.39% Instruction Fetch Wait
    11.44% Data Transaction Wait
     0.69% Page Boundary Crossing Fetch Wait
     0.01% Unfilled Cache Fetch Wait
     0.00% Load/Store Multiple Executing  <-- Hmm, not much time here!
     2.07% Load/Load Transaction Wait
     0.03% Pipeline Latency

---------- Bus Utilization ----------
Inst Bus Utilization: 63.97%
    8669133 Instruction Fetches
Data Bus Utilization:  9.75%
     979830 Loads
     340998 Stores

---------- Instruction Mix ----------
 1.86% Calls
15.65% Jumps
10.73% Loads
 3.74% Stores
 4.33% No-ops

---------- Register File Spilling/Filling ----------
3 Spills   <-- this is why
0 Fills
Spill/Fill sizes:
     1 registers: 0 time(s) ( 0.00%)
     2 registers: 1 time(s) ( 33.33%)
     3 registers: 0 time(s) ( 0.00%)
     4 registers: 1 time(s) ( 33.33%)
     5 registers: 0 time(s) ( 0.00%)
     6 registers: 0 time(s) ( 0.00%)
     7 registers: 0 time(s) ( 0.00%)
     8 registers: 0 time(s) ( 0.00%)
     9 registers: 0 time(s) ( 0.00%)
    10 registers: 0 time(s) ( 0.00%)
    11 registers: 0 time(s) ( 0.00%)
    12 registers: 1 time(s) ( 33.33%)
    13 registers: 0 time(s) ( 0.00%)
    14 registers: 0 time(s) ( 0.00%)
    15 registers: 0 time(s) ( 0.00%)
    16 registers: 0 time(s) ( 0.00%)
  > 16 registers: 0 time(s) ( 0.00%)

So for the entire nroff run, we wrote a total of 18 words out to memory
due to stack cache overflow. And we loaded 0 words from the stack due to
underflow (this is because nroff exit()s while it is still a few
procedures down in the overall call chain).

I would be interested in seeing how this compares to a
non-register-windowed processor: in particular, the total number of
loads/stores, the loads/stores as a percentage of the instruction mix, and
the number of words of scalar data transferred to/from the stack.

-- Tim Olson
Advanced Micro Devices
(tim@crackle.amd.com)
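A counter of the kind that would produce the spill/fill totals above can
be sketched as a walk over a call-depth trace. This is a toy model of a
windowed register file (my own construction, not AMD's simulator): `saved`
tracks how many frames currently live only in memory, a call that would
exceed the file's capacity spills the oldest resident frame, and a return
into a spilled frame fills it back.

```python
# Count overflow spills and underflow fills for a file holding
# `windows` frames, given a trace of calls (+1) and returns (-1).

def spill_fill_counts(trace, windows):
    depth = saved = 0        # call depth; frames spilled to memory
    spills = fills = 0
    for step in trace:
        if step > 0:
            depth += 1
            if depth - saved > windows:   # overflow: evict oldest frame
                saved += 1
                spills += 1
        else:
            depth -= 1
            if depth and depth == saved:  # returning into a spilled frame
                saved -= 1
                fills += 1
    return spills, fills
```

A trace that goes deep once and then bounces near its maximum produces no
traffic at all once it fits in the file, which is the hysteresis that makes
whole-run spill counts as small as the three shown above.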
ok@quintus.uucp (Richard A. O'Keefe) (10/07/88)
In article <9410@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>Pertaining to my posting about lazy store/reload of register frames:

Wasn't there an MIT paper with a title something like "Lazy Scoping -- the
Dream of a Lifetime" which had to do with lazy register saves and restores?
tim@crackle.amd.com (Tim Olson) (10/07/88)
In article <9410@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
| Let's say you have 8 non-lazy windows (and one heck of a lot of valuable
| die space consumed by them). What do you do when the 9th nested call is
| made? The 10th? You do a sit-and-wait-for-it burst store, that's what...
| would you really describe that as being "*without any* memory references
| at all!"

True, but then non-lazy stores only load or store what is absolutely
required, when it is required. Lazy operations are continually trying to
load/store ahead. This seems like the real misnomer -- non-lazy windows
are truly lazy (only doing what is required) vs. "lazy windows" (which are
quite active).

Consider an "on-demand" load/store window scheme with 4 windows, vs. a
"background" load/store window scheme with 2 windows (which was implied to
be all that is required). Suppose the call chain looks like

    1 2 3 4 5 6 7 6 7 6 7 6 7 6 7 8 7 6 5 6

i.e., it spends a lot of time bouncing around a local maximum depth
(certainly not atypical). The "on-demand" window scheme has no saving or
restoring to do while bouncing between levels 7 and 6, because of the
built-in hysteresis. The "background" load/store scheme, however, is
continually saving and restoring. This is a waste of memory bandwidth.

What register windows are buying is this hysteresis in saving and
restoring the stack frame, and the only way to get it is to provide a
large number of windows. Once that is available, background loads/stores
don't buy much, because register file spilling/filling just doesn't occur
that often in real programs (maybe 0.5% of all calls spill).

One other problem with "background" loads/stores occurs when memory
operations take more than a single cycle. In this case, a "background"
memory operation may be started when the memory was otherwise idle, and
right after that, a regular load or store is requested.
The requested operation must wait for the background one to complete,
decreasing performance. (This is what I think Brian Case was talking about
when he mentioned having to predict future operations -- to ensure that
this kind of collision doesn't occur.)

Finally, if background loads/stores are interspersed with the regular
load/store stream, they cannot take advantage of their sequential nature,
and thus cannot take full advantage of any faster burst-mode capability
that memory may provide.

-- Tim Olson
Advanced Micro Devices
(tim@crackle.amd.com)
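The hysteresis point above can be made quantitative with a toy model. Both
traffic counters below are my own illustrative bookkeeping (one frame's
worth of traffic per spill, fill, background save, or background reload),
not measurements of any real machine: the on-demand 4-window file spills
only on overflow and fills only on underflow, while the 2-window
background scheme is modeled as saving a frame on every call past depth 2
and starting a reload on every return.

```python
# Compare memory traffic (in frames moved) for an on-demand windowed
# file vs. a 2-window background save/restore scheme, on the call
# chain from the discussion above. Purely illustrative bookkeeping.

CHAIN = [1, 2, 3, 4, 5, 6, 7, 6, 7, 6, 7, 6, 7, 6, 7, 8, 7, 6, 5, 6]

def on_demand_traffic(chain, windows):
    """Frames moved: spill only on overflow, fill only on underflow."""
    saved = ops = 0                      # frames spilled; frames moved
    for prev, depth in zip(chain, chain[1:]):
        if depth > prev:
            if depth - saved > windows:  # overflow: evict oldest frame
                saved += 1
                ops += 1
        elif depth and depth == saved:   # returning into a spilled frame
            saved -= 1
            ops += 1
    return ops

def background_traffic(chain):
    """2 windows: save a frame on every call past depth 2, start a
    reload on every return -- 'continually saving and restoring'."""
    ops = 0
    for prev, depth in zip(chain, chain[1:]):
        if depth > prev and depth > 2:
            ops += 1                     # background-save frame two below
        elif depth < prev and depth > 1:
            ops += 1                     # background-reload previous frame
    return ops
```

In this model the on-demand file moves 4 frames over the whole chain
(three spills going deep, one at the brief excursion to depth 8), while
the background scheme moves 18 -- most of them wasted on the 7-6 bounce
that the deeper file absorbs for free.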
bcase@cup.portal.com (10/08/88)
John Danskin at DECWRL writes (the stupid Portal mailer doesn't include
the "who wrote it" line when you reply, sigh):
>I just love writing replies to unidentified flamers from the portal
>system 8-).

Er, I don't understand what you mean by unidentified; my login name was
included in your article. Do you mean that my full name wasn't included?
It is Brian Case... I am dismayed to discover that I am being judged
because my postings originate from Portal! Anybody want to give me an
account on a respectable UNIX system?

>In article <9725@cup.portal.com> bcase@cup.portal.com writes:
>>In my opinion, this is BS: This is tantamount to saying you can predict
>>the sequence of calls and returns!! If my CPU starts lazily saving but
>>the next thing it does is return, it has wasted all the memory cycles
>>spent lazily saving.

John left out my comment that I believe the lazy loads/stores interfere
with normal memory traffic. I am probably wrong though....

>None of these schemes [the ones he mentioned in his reply to my
>posting] imply knowing the instruction stream in advance.
>I guess the point that (unnamed) missed was that (other unnamed) assumed
>that the hardware was managing the lazy loads/stores, not the compiler.

Which unnamed am I? :-) Anyway, I was assuming that hardware would be
scheduling the lazy loads/stores. While the hardware at least has
knowledge about what instructions are in its pipeline (safely assuming a
pipelined implementation, I think), it still can't predict the future. I
still maintain that a great deal of disruptive loading/storing can go on,
only to find out later that it was wasted. With more than two windows,
hysteresis causes the register file to capture the working set of the
register window stack.
I guess the point that I was trying to make in my posting was that
hardware can't know exactly when to insert the lazy loads/stores: if it
inserts one that causes a cache miss (though this is probably unlikely;
there are probably better cases to use as an example) and immediately
after the insertion a conditional branch is taken to a "real" load (one
that the program wants to do), the "real" load will be delayed by the lazy
load/store. But, you say, if the lazy load/store had been done explicitly,
it would be guaranteed to delay the progress of the program. Yes, but if
the explicit lazy loads/stores are done as part of a burst, each one is
much more efficient.

I think something that might be able to end this discussion is the
exposition of the hardware algorithm that would be used to implement lazy
load/store. It can't be complicated, so it should be suitable for posting.
If it is good and works, then it will be clear that I was wrong to attack
the original suggestion that two windows are sufficient. (Of course, there
is a whole contingent who say that "no windows" is sufficient!)

>Of course, these schemes produce more complicated silicon than burst
>read/write schemes. In particular, anything with flags/register
>introduces latency into the register read/write path which is the
>critical path unless you have blown everything.

As I have said before, I think that one of the various caches in the
system (excluding the register file, if you think of that as a cache),
e.g., the TLB or instruction cache, will be the critical path.

>Is it a win? simulate and see. Tell the world.

Yes, this is the right thing to do.

>By the way: There is a paper: [from Berkeley]
>Which shows that the MIPS R2000 (aside from running faster) achieves
>fewer memory references (in almost all cases) than SPARC with all
>levels of optimization and as many as 7 register windows.

Well, this is news to me!
I must blush and apologize for many of my past postings if these results
are true for architectures like the 29K. I do hope they included the
effects of high-bandwidth, sequential-access memories.
bcase@cup.portal.com (10/09/88)
First off, let me apologize to the net in general for starting this
discussion at all, and particularly for apparently starting it off with a
touch of flame. My intent was not to flame or get personal, but I did want
to say what I really thought was right. Believe me, I'll drop it after
this, because I am clearly not making myself understood and am probably
wrong anyway. My intent is not to piss off anyone; I am not that kind of
person (usually :-). I might meet anyone here at a conference or whatever
some day; I don't want the reaction to be "oh, yeah, that flaming *sshole
from Portal."

Hank Dietz (hope I spelled the name right; our mailer doesn't include the
name automagically) writes:
>In article <9725@cup.portal.com>, bcase@cup.portal.com writes:
>> ... True, if you could look ahead at
>> the instruction stream, predict correctly all the conditional branches
>> etc., then you could know exactly where to insert the lazy loads and
>> stores. But I think this is unrealistic.
>You don't seem to know what lazy operations are. A lazy operation is NOT a
>simple "delayed" load or store, and it isn't triggered by an opcode being
>executed; it is an operation which is automatically triggered either when
>certain conditions exist which favor its efficient completion or when its
>result is required -- one does NOT insert lazy operations in anything.

No, I do understand what lazy operations are; the word "insert" as I have
used it refers to "inserting" the loads/stores into the pipeline. I think
I understand what lazy means here, because we considered using this
technique (call it lazy or dribble-back or whatever; BTW, it was called
dribble-back a long time ago, and I don't know who named it) in the
implementation of the 29K. That was in the very early days of the design.
We must not have had access to the info you have now, because we
considered it too expensive or complicated or something to implement. At
least I think I understand what lazy means; correct me if I am wrong.
>Let's say you have 8 non-lazy windows (and one heck of a lot of valuable
>die space consumed by them). What do you do when the 9th nested call is
>made? The 10th? You do a sit-and-wait-for-it burst store, that's what...
>would you really describe that as being "*without any* memory references
>at all!"

No! Of course not; clearly that has memory references! The research that I
did at AMD showed that overflowing and underflowing would happen
infrequently. (I instrumented the PCC to count the number and size of
overflows and underflows the VAX would have if it had a register file like
the ones we were considering for the 29K; I modified applications like vi
and circuit design tools to allocate their large, non-scalar local items
from the heap instead of from the stack, then compiled and ran them; at
exit, they spit out stats.) The actual results from simulations and real
chips support my research conclusions. Thus, yes, overflows do happen, but
they happen infrequently in real life (there are pathological cases for
both schemes, though). What I meant is that on average, several "windows"
(the 29K doesn't have "windows") can capture the working set of the scalar
run-time stack, and can effect most of the calls/returns with no memory
references. Sorry for the confusion.

Your comment about the windows taking one heck of a lot of valuable die
space is correct; I happen to believe that that valuable die space is
occupied by a valuable resource: registers. Let me say for the record that
I prefer the 29K approach, where all registers, or a very large number of
them, are simultaneously addressable, over the SPARC approach, where only
one window's worth is addressable at any one time. I feel that registers
are a very performance-improving resource, one for which there is no
substitute. They are the fastest part of the memory hierarchy, and they
have three times the bandwidth of a single-ported data cache, etc., etc.
>Now, you can argue that there might not be enough time between calls or
>between returns to lazily store or reload a set, but that's unlikely
>because:
>Calls: The lazy stores only have to store registers which are live
>       and dirty, i.e., whose value will be referenced after return
>       from the call and also is not the same as a value stored
>       somewhere in memory (e.g., a variable) by the programmer's
>       code. In most RISC processors, the only instructions able
>       to make a register dirty are register = register op
>       register... which all have a free memory reference cycle!
>       In the worst case, you'd lag behind by just one register
>       store, which sure beats doing a non-lazy burst every so
>       often. QED.

I think QED is a bit strong. If you can prove that the lazy scheme "sure beats doing a non-lazy burst every so often," then I am wrong, of course. But your saying it doesn't make it so. As I said in a previous posting, I think seeing the algorithm, which can't be complicated or it wouldn't be suitable for hardware implementation, might end the discussion; and if it is good, it will certainly make me look foolish. Maybe that's what I need, although it wouldn't be the first time....

Let me take one last crack at making my point: loads/stores typically take two or more cycles. Hardware will have to be very clever about finding the right place to put the lazy loads/stores. I think the added lazy loads/stores will bring the data memory channel utilization up to near 100% (which can be a good thing, if doing so does not slow down the general rate of load/store completion; pipelining might solve the problem...). Thus, I believe that the lazy loads/stores will, even assuming optimal placement in time, interfere with the loads/stores actually required by the algorithm being executed. This interference will lower the benefit of lazy loads/stores enough to bring the scheme at best onto equal footing with the non-lazy, burst scheme. Since the burst scheme is simpler, I would prefer it.
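Dietz's "free memory reference cycle" argument is easy to model: on a simple one-port machine, every register = register op register instruction leaves the memory port idle, so a lazy spiller can retire one pending store per such instruction. A rough sketch, with an invented instruction mix:

```python
# Rough sketch of the free-cycle argument (instruction mix is invented):
# each 'alu' instruction donates its idle memory cycle to one pending
# lazy store; 'mem' instructions keep the port for themselves.

def stores_left_over(instrs, pending_stores):
    """Return how many lazy stores could NOT be hidden in free cycles."""
    for op in instrs:
        if op == "alu" and pending_stores > 0:
            pending_stores -= 1       # a store rides along for free
    return pending_stores

# Two ALU ops per memory op, 8 dirty registers to drain:
print(stores_left_over(["alu", "alu", "mem"] * 8, 8))   # → 0, all hidden

# A load/store-dominated stretch hides nothing:
print(stores_left_over(["mem"] * 10, 3))                # → 3
```

Whether real code offers the lazy engine enough ALU instructions between calls — and whether the extra port traffic delays the program's own loads and stores — is exactly the question under debate here.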
But the lazy scheme uses fewer registers, so you prefer it instead. Since the lazy scheme does not take advantage of the benefits afforded by the sequential access pattern of the burst scheme, I would not choose it (since I know that I can make the burst scheme fast). Since the lazy scheme makes better utilization of the memory channel (closer to 100%), you would choose it. Or whatever. I am not saying that these comments are God's absolute truth; I am just trying to make clear my original point. The stuff about predicting the future was just my (apparently stupid) way of saying that figuring out where to insert the lazy loads/stores is very difficult (at least that's what I think). And if you don't put them in the best places, you will interfere with loads/stores required by the algorithm at hand. If interference happens, which I believe it will, then I don't see the advantage of laziness over explicitness. At least the explicit scheme can take advantage of high-bandwidth sequential-access memories. There, now I think I have made myself clear. Wrong maybe, but clear.

>The static/dynamic tradeoffs are what my research group,
>CARP, is all about.

This is valuable research; I didn't mean to attack your group or your research directions, either directly or indirectly. I was once in grad school; you are probably all thinking "I swear, some of those industry-types are truly complete jerks." I know *I* thought that about industry-types more than a few times! Sigh, I guess we all eventually become that which we despise! (Parents, e.g.?)
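The sequential-access point can be put in numbers with a toy cost model (all cycle counts invented): a burst of n stores to consecutive addresses pays one access latency plus n streaming cycles, while n scattered lazy stores each pay the full latency.

```python
# Toy cost model for burst vs. scattered spills; cycle counts are invented.

def burst_cost(n, latency=4, stream_cycle=1):
    return latency + n * stream_cycle    # open once, then stream sequentially

def scattered_cost(n, latency=4):
    return n * latency                   # each store stands alone

print(burst_cost(16), scattered_cost(16))   # → 20 64
```

Of course, the lazy side of the argument is that its cycles hide in otherwise-idle slots, so raw bus occupancy is not the whole story; this model only captures the explicit-burst half of the tradeoff.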
mcdonald@uxe.cso.uiuc.edu (10/10/88)
What is a register window?
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/11/88)
In article <9837@cup.portal.com> bcase@cup.portal.com writes:
>John Danskin at decwrl writes: (the stupid portal mailer doesn't include
>>By the way:
>> There is a paper: [from Berkeley]
>> Which shows that the MIPS R2000 (aside from running faster) achieves
>> fewer memory references (in almost all cases) than SPARC with all
>> levels of optimization and as many as 7 register windows.
>Well, this is news to me! I must blush and apologize for many of my
>past postings if these results are true for architectures like the 29K.
>I do hope they included the effects of high-bandwidth, sequential-access
>memories.

I wonder what that paper is. There is another paper which may be of interest: "And Now a Case for More Complex Instruction Sets," IEEE Computer, Sept. 1987, Vol. 20, No. 9, pages 71-83. This paper makes the case, again, that you need to simulate and test, not hypothesize, what the effects of the various optimizations are. However, while we are hypothesizing, the results of the paper again call into question the efficacy of register windows. For example:

"From data traffic considerations alone, it seems that OBI360 [a particular 1-address instruction format looked at in the paper] with a register set of about size 16 plus a small data cache is preferable to multiple register sets for most area combinations..." [page 83]

How about someone publishing a simulation showing the results of using lazy windows?

-- Hugh LaMaster, m/s 233-9, UUCP ames!lamaster
NASA Ames Research Center ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035 Phone: (415)694-6117
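A simulation of the kind LaMaster asks for could start from a toy harness like this one (my own sketch, certainly not a publishable result): walk a random call/return trace and tally words of spill traffic, with window count, window size, and dirty fraction as knobs to sweep.

```python
# Toy spill-traffic tally, not a published simulation; every knob
# (window count, registers per window, dirty fraction) is invented.
import random

def spill_traffic(n_events, windows=8, regs_per_window=16,
                  dirty_fraction=0.5, seed=1):
    random.seed(seed)                   # deterministic walk for a fixed seed
    depth = resident = words = 0
    for _ in range(n_events):
        if depth == 0 or random.random() < 0.5:
            depth += 1
            if resident == windows:
                # overflow: only the dirty registers really need storing
                words += int(regs_per_window * dirty_fraction)
            else:
                resident += 1
        else:
            depth -= 1
            resident -= 1
            if resident == 0 and depth > 0:
                words += regs_per_window    # underflow: reload a whole window
                resident = 1
    return words

# Sweep the knobs and compare; a real study would use real program traces.
print(spill_traffic(10000, windows=8))
print(spill_traffic(10000, windows=16))
```

The interesting output of a real version would be the crossover points: how few windows (and how small a dirty fraction) you can get away with before spill traffic swamps the 16-registers-plus-cache alternative the paper advocates.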
johnl@ima.ima.isc.com (John R. Levine) (10/13/88)
In article <46500028@uxe.cso.uiuc.edu> mcdonald@uxe.cso.uiuc.edu writes:
>What is a register window?

It's a little piece of Lexan in the chip package of a microprocessor that lets you see into the machine's registers. Writers of optimizing compilers find these windows invaluable for debugging and tuning object code. For example, in case of register overflow you can see bits floating near the register that overflowed. In case of register collisions, which all compiler writers will agree are a big pain, if you examine the registers very closely with a magnifying glass, you can usually see little scratches and dents where the collision occurred.

Register windows are such a useful tool that it's a shame that the SPARC is the only commercial chip that provides them, although even on the SPARC the Lexan gets scratched after a few months, making them hard to see through. I wish that quartz windows weren't so expensive; they last a lot longer.

-- John R. Levine, IECC, PO Box 349, Cambridge MA 02238-0349, +1 617 492 3869
{ bbn | think | decvax | harvard | yale }!ima!johnl, Levine@YALE.something
Rome fell, Babylon fell, Scarsdale will have its turn. -G. B. Shaw
dillon@jumbo.dec.com (John Dillon) (10/13/88)
In article <2768@ima.ima.isc.com>, johnl@ima.ima.isc.com (John R. Levine) writes:
> In article <46500028@uxe.cso.uiuc.edu> mcdonald@uxe.cso.uiuc.edu writes:
> >What is a register window?
> It's a little piece of Lexan in the chip package of a microprocessor that lets
> you see into the machine's registers. Writers of optimizing compilers find
> these windows invaluable for debugging and tuning object code. For example, in
> case of register overflow you can see bits floating near the register that
> overflowed. In case of register collisions, which all compiler writers will
> agree are a big pain, if you examine the registers very closely with a
> magnifying glass, you can usually see little scratches and dents where the
> collision occurred.
>
> Register windows are such a useful tool that it's a shame that the SPARC is
> the only commercial chip that provides them, although even on the SPARC the
> Lexan gets scratched after a few months, making them hard to see through. I
> wish that quartz windows weren't so expensive, they last a lot longer.

I must take Mr. Levine to task for forgetting the AMD 29000. AMD goes to the trouble of providing the user with pictures of the metallization, since metal shows the little scratches and dents much better than silicon. The 29000 does not provide the same register windows as SPARC. Instead, AMD adopted the neat strategy of trapping to supervisor code when the register window overflows and underflows. This strategy dovetails nicely with AMD's commitment to shadow registers and serial scan.
The trap handler, using the shadow register (actually a miniature on-chip CCD camera obscura), takes a picture of the metallization and scans it out to the companion AMD 291984 image processing chip, which reconstructs the picture for the user. Consider the advantages and disadvantages (as I see them :-):

 + no need for either Lexan or quartz window
 + the packaged part is insensitive to light
 + no need for a magnifying glass: you put the image on your display
 + only compiler writers need purchase the 291984
 - the 291984 is unlikely to be in your machine when you need it
 - it takes time to reconstruct and scan the image: no real-time window

No, I have no connection with either AMD or Sun.
-- John

disclaimer n. 1. the act of disclaiming; the renouncing, repudiating, or denying of a claim; disavowal. 2. a person who disclaims. 3. a statement, document, or the like, that disclaims.
news@ism780c.isc.com (News system) (10/14/88)
In article <2768@ima.ima.isc.com> johnl@ima.UUCP (John R. Levine) writes: [definition of a register window]
>It's a little piece of Lexan in the chip package of a microprocessor that lets
>you see into the machine's registers. Writers of optimizing compilers find
>these windows invaluable for debugging and tuning object code.

Historical note: The first programmable digital computer that I worked with was a CPC (Card Programmed Calculator). This machine had 800 decimal digits of storage that could be arranged into one or more registers via a patch panel. The implementation of the digits was mechanical, much like an automobile odometer. The value in a register could actually be seen (numbers painted on wheels). The primary debugging aid for this machine was a flashlight to illuminate the registers.

Marv Rubinstein
paul@unisoft.UUCP (n) (10/14/88)
In article <13381@jumbo.dec.com> dillon@jumbo.dec.com (John Dillon) writes:
>I must take Mr. Levine to task for forgetting the AMD 29000.
>
>This strategy dovetails nicely with AMD's commitment to shadow registers
>and serial scan. The trap handler, using the shadow register (actually
>a miniature on-chip ccd camera obscura), takes a picture of the
>metallization and scans it out to the companion AMD 291984 image
>processing chip, which reconstructs the picture for the user.
>
>Consider the advantages and disadvantages (as I see them :-)
> + no need for either lexan or quartz window
> + the packaged part is insensitive to light
> + no need for a magnifying glass: you put the image on your display
> + only compiler writers need purchase the 291984
> - the 291984 is unlikely to be in your machine when you need it
> - it takes time to reconstruct and scan the image: no real-time window

Actually, these last two are not a problem: one simply pages in a simulation of a 291984 and uses that (it is a RISC chip, after all).

On a related note: I'm looking for venture capital for a new state-of-the-art product, 'Register Windex'. We expect a great demand for this product, especially from those systems that use 'dirty' bits for management of their windows.

Paul
--
Paul Campbell, UniSoft Corp. 6121 Hollis, Emeryville, Ca ..ucbvax!unisoft!paul
Nothing here represents the opinions of UniSoft or its employees (except me)
"Gorbachev is returning to the heritage of the great Lenin" - Ronald Reagan
1988 (then the Washington Post attacked RR [from the right] for being a Leninist)