eppstein@tom.columbia.edu.UUCP (02/24/87)
Register window machines such as the Berkeley RISCs have been discussed recently in this group. I have some questions about these architectures:

(1) Has anyone tried making the window block size be just one register, i.e. the window can have an arbitrary alignment in relation to memory? I would expect this to be more efficient in terms of registers used (and therefore also memory), but register access time might suffer.

(2) Do any architectures dribble out dirty registers near the bottom of the set of on-chip registers in otherwise unused memory cycles, or similarly dribble them in when they move back on-chip, rather than stopping everything else while bringing them back in?

(3) Are there machines with both dynamic (windowed) and static (normal) registers? For instance it might be useful for quick O.S. interrupts to use only statics and not go to the trouble of setting up a register window stack (which would involve lots of reads and writes moving the old stack completely out to memory and then back again later). Static registers could also be useful for non-local variables.
--
David Eppstein, eppstein@cs.columbia.edu, Columbia U. Computer Science Dept.
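Question (1) can be put in concrete terms with a toy model: how many registers does a stack of call frames consume when each frame is allocated at single-register granularity versus rounded up to a fixed block size? The frame sizes and block size below are illustrative assumptions, not figures from any real machine.

```python
BLOCK = 16          # fixed window size in registers (illustrative)

def registers_used(frame_sizes, granularity):
    """Registers consumed by a stack of frames, each frame's size
    rounded up to a multiple of `granularity`."""
    round_up = lambda n: -(-n // granularity) * granularity  # ceiling division
    return sum(round_up(n) for n in frame_sizes)

# A hypothetical call chain whose frames need 3, 7, 12, 5, and 9 registers.
chain = [3, 7, 12, 5, 9]
print(registers_used(chain, 1))      # single-register granularity: 36
print(registers_used(chain, BLOCK))  # block-aligned windows: 80
```

The exact-fit scheme holds more than twice as many frames in the same file here, which is the efficiency Eppstein expects; the cost he anticipates (slower access) would come from the wider decoders needed to address arbitrarily aligned windows.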
howard@cpocd2.UUCP (02/25/87)
In article <4376@columbia.UUCP> eppstein@tom.columbia.edu (David Eppstein) writes:
>Register window machines such as the Berkeley RISCs have been discussed
>recently in this group. I have some questions about these architectures:
>
>(1) Has anyone tried making the window block size be just one register,
>i.e. the window can have an arbitrary alignment in relation to memory?
>I would expect this to be more efficient in terms of registers used (and
>therefore also memory), but register access time might suffer.

Decoding gets a lot more complicated, but this approach might have some merit if the memory itself were expensive relative to the decoding (probably not the case in any MOS technology). The maximum number of useful windows is about 6 or 7 anyway, for the UNIX programs examined by the Berkeley group. Beyond that the increased access time costs more than the extra windows save. Of course, with finer granularity, the access time would grow more slowly ...

>(2) Do any architectures dribble out dirty registers near the bottom of
>the set of on-chip registers in otherwise unused memory cycles, or
>similarly dribble them in when they move back on-chip rather than
>stopping everything else while bringing them back in?

The Berkeley RISC machines don't have any unused memory cycles. So for them, the answer is no. This exact matching of processor memory requirement to available memory bandwidth is a subtle, powerful, and not-often-appreciated feature of those architectures.

>(3) Are there machines with both dynamic (windowed) and static (normal)
>registers?

Even in the Berkeley RISCs, 10 or so of the 32 registers are "global", meaning they are the same in all windows. So, yes.

>For instance it might be useful for quick O.S. interrupts to
>use only statics and not go to the trouble of setting up a register
>window stack (which would involve lots of reads and writes moving the
>old stack completely out to memory and then back again later).
This is a severely muddled question. Mu! Register windows IN HARDWARE, which is what we're discussing, require almost NO setup time. So I don't understand what "trouble" and "reads" and "writes" you're talking about.

If you are talking about context switches, then yes, there is an increased overhead, since (depending on how you implement it) multiple windows of registers may need to be saved. But context switches occur about two orders of magnitude less frequently (in UNIX) than calls/returns, and so this has no significant performance impact.

As far as "quick O.S. interrupts" are concerned, these only need one set of registers, and so can be treated pretty much just like a procedure call (being careful not to use the overlapped registers). No context switch is necessary in most cases. Again, interrupts occur far less frequently than calls/returns, and so even if it weren't easy to handle them efficiently, the performance effect would be small. The only exception might be extremely high speed real-time applications.

>Static registers could also be useful for non-local variables.

That's what they get used for, all right. It turns out that in C virtually all arrays are global, for example, and if you're looping through all the elements of one it's nice to keep the address in a register.
--
Howard A. Landman
...!intelca!mipos3!cpocd2!howard
jack@mcvax.UUCP (02/25/87)
Something that I would be very interested in is whether any research has been done into machines with an unlimited register file.

What I have in mind is a stackish machine, with an intelligent stack cache that will delay writes around the stack pointer (since there's a very good chance that they will never have to be written at all). The cache doesn't even have to be that intelligent: data should be written out as soon as a write to a different address lands in the same cache entry. Moreover, whenever the stack pointer is decremented (on routine exit, for instance), you can scratch all data above it.

One of the advantages of this seems to be that compilers become simpler (no special cases for routines that run out of registers), and routines that are recursive at the beginning would benefit by not having (unnecessary) register saves.
--
Jack Jansen, jack@cwi.nl (or jack@mcvax.uucp)
The shell is my oyster.
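Jack's proposed policy can be sketched directly: delay writes in the cache, push data to memory only when a different address collides with its entry, and discard everything above the stack pointer on routine exit. This is a minimal model, assuming a direct-mapped cache indexed by address modulo its size and a stack that grows upward; the post fixes neither of those details.

```python
class StackCache:
    """Write-delaying stack cache: writes stay on chip until evicted,
    and data above a lowered SP is scratched without ever reaching memory."""

    def __init__(self, size, memory):
        self.size = size
        self.memory = memory            # backing store: dict addr -> value
        self.entries = [None] * size    # each slot holds (addr, value) or None

    def write(self, addr, value):
        slot = addr % self.size
        old = self.entries[slot]
        if old is not None and old[0] != addr:
            # A write to a different address lands in this entry:
            # only now does the old data get written out to memory.
            self.memory[old[0]] = old[1]
        self.entries[slot] = (addr, value)

    def pop_to(self, sp):
        # SP decremented (routine exit): everything at or above the
        # new SP is dead, so just scratch it -- no write-back at all.
        for slot, e in enumerate(self.entries):
            if e is not None and e[0] >= sp:
                self.entries[slot] = None

mem = {}
c = StackCache(8, mem)
for a in range(12):          # push 12 words; addresses 8-11 reuse slots 0-3
    c.write(a, a * 10)
print(mem)                   # only the 4 evicted low words reached memory
c.pop_to(4)                  # routine exits: words above SP=4 are discarded
```

Most of the traffic (8 of 12 words here) never touches memory, which is exactly the chance Jack is betting on.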
bcase@amdcad.UUCP (02/26/87)
In article <4376@columbia.UUCP> eppstein@tom.columbia.edu (David Eppstein) writes:
>Register window machines such as the Berkeley RISCs have been discussed
>recently in this group. I have some questions about these architectures:
>
>(3) Are there machines with both dynamic (windowed) and static (normal)
>registers? For instance it might be useful for quick O.S. interrupts to
>use only statics and not go to the trouble of setting up a register
>window stack (which would involve lots of reads and writes moving the
>old stack completely out to memory and then back again later). Static
>registers could also be useful for non-local variables.

The Pyramid machine has 16 "global" registers that are addressable by all procedures; i.e., they are a part of every procedure context.

    bcase
---------
This space is not for rent; it will soon be used, so keep watching.
bcase@amdcad.UUCP (02/26/87)
In article <7284@boring.mcvax.cwi.nl> jack@boring.UUCP (Jack Jansen) writes:
>Something that I would be very interested in is whether there
>is any research done into machines with an unlimited register
>file.
>
>What I have in mind is a stackish machine, with an intelligent
>stack cache that will delay writes around the stack pointer
>(since there's a very good chance that they will never have
>to be written at all).

Close, but not quite, the C Machine Stack Cache. See SIGARCH/SIGPLAN Notices, vol. 17, no. 4, April 1982.
baum@apple.UUCP (02/27/87)
--------
>>(3) Are there machines with both dynamic (windowed) and static (normal)
>>registers? For instance it might be useful for quick O.S. interrupts to
>>use only statics and not go to the trouble of setting up a register
>>window stack (which would involve lots of reads and writes moving the
>>old stack completely out to memory and then back again later).

Ditzel et al. have finally published details of the CRISP processor, a chip that implements a stack cache with 32 registers. This implementation appears to be, in effect, a register window with variable overlap. An entry instruction at the head of each procedure allocates space in the 'register file', spilling to main memory if required. A cleanup instruction after return specifies how many top-of-stack locations will be needed, and hardware will 'unspill' to keep that many valid. The architecture is memory-to-memory: the stack cache can be addressed as ordinary memory (but is obviously faster), and can be byte addressed. They claim exceptionally high hit ratios on the stack cache, and performance of about 9x a VAX 780 running at 16MHz (for some number of benchmarks including PCC; smaller benchmarks show 30x!). There are a number of other architectural innovations, most importantly a zero-cycle branch. There are papers in both this week's COMPCON and ISSCC proceedings.

The interrupt sequence is very simple. The PC, PSW, and an interrupt ID get pushed onto the interrupt stack. The stack cache, however, is still addressed by the regular stack pointer; therefore all accesses will go to main memory instead of the stack cache. If the interrupt code runs in the same address space, simply changing the interrupt stack pointer to be the same as the stack pointer suddenly gives you access to the stack cache.
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
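The entry/cleanup discipline described above can be modeled in a few lines. This is a rough sketch, not the CRISP implementation: the 32-word capacity matches the post, but word-at-a-time spilling, the address scheme, and the method names are my assumptions.

```python
from collections import deque

class CrispStyleCache:
    def __init__(self, capacity=32):
        self.capacity = capacity
        self.resident = deque()   # stack word addresses on chip, oldest first
        self.spilled = []         # word addresses spilled to memory, oldest first
        self.top = 0              # next free stack address (stack grows up)

    def enter(self, frame_words):
        # Entry instruction: allocate a frame in the cache,
        # spilling the oldest resident words to memory on overflow.
        for _ in range(frame_words):
            self.resident.append(self.top)
            self.top += 1
            if len(self.resident) > self.capacity:
                self.spilled.append(self.resident.popleft())

    def cleanup(self, frame_words, keep):
        # Return + cleanup: free the callee's frame, then unspill
        # until `keep` top-of-stack words are valid on chip.
        for _ in range(frame_words):
            self.top -= 1
            if self.resident and self.resident[-1] == self.top:
                self.resident.pop()
        while len(self.resident) < keep and self.spilled:
            self.resident.appendleft(self.spilled.pop())

c = CrispStyleCache(capacity=32)
c.enter(20)
c.enter(20)                  # second frame overflows: 8 oldest words spill
print(len(c.spilled))        # 8
c.cleanup(20, keep=16)       # return; hardware unspills back to 16 valid words
print(len(c.resident))       # 16
```

The point of the cleanup operand is visible here: the caller states how much of its context it still needs, so the hardware refills just that much rather than the whole spilled region.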
chuck@amdahl.UUCP (02/27/87)
In article <448@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes:
>In article <4376@columbia.UUCP> eppstein@tom.columbia.edu (David Eppstein) writes:
>>[David implies that an interrupt requires a complete context switch
>>of writing all active windows to memory and reloading them after
>>handling the interrupt]
>
>[Howard points out that for an interrupt, the operating system code
>that handles the interrupt can promise not to trash active windows,
>and continue to use the existing stack. He also points out that
>context switches between user processes occur relatively infrequently
>and thus the overhead of writing out all active windows and reading in
>new windows on a context switch is not important.]

I agree with this, so I'm going to change the subject slightly. Lately I've been thinking about operating systems which would perform a complete context switch on every interrupt. In particular, code for handling interrupts would run in a restricted environment (i.e. user mode) where its ability to accidentally trash other interrupt handlers, the operating system, and device drivers would be minimized.

My biggest problem with this concept is that context switching, even on a machine with "only" 16 registers, is extremely expensive. I'd be interested in hearing about architectures that minimize the expense of a context switch.

(The expected advantage of the OS I'm toying with is that it would make it much easier to write and debug programs (subroutines?) that are normally part of a Unix kernel. I've been writing some device drivers lately, and I'm getting frustrated with the cycle of fixing a bug, compiling the code, linking it with the kernel, rebooting the system, testing the code and finding another bug, rebooting the system to fix the new bug... If device drivers ran in user land, my cycle would be reduced to: fixing a bug, compiling the code, testing the code and finding a bug...)
--
Chuck
rw@beatnix.UUCP (02/27/87)
In article <5763@amdahl.UUCP> chuck@amdahl.UUCP (Charles Simmons) writes:
>I agree with this, so I'm going to change the subject slightly. Lately
>I've been thinking about operating systems which would perform a complete
>context switch on every interrupt. In particular, code for handling
>interrupts would run in a restricted environment (ie user mode) where
>their ability to accidentally trash other interrupt handlers, the operating
>system, and device drivers would be minimized.
>
>My biggest problem with this concept is that context switching, even
>on a machine with "only" 16 registers is extremely expensive.
>I'd be interested in hearing about architectures that minimize the
>expense of a context switch.

In Embos (Elxsi Message Based Operating System), I/O drivers (called access managers) are user processes. Everything is a "user process" -- there are no privileged instructions. While there are intra-process interrupts, they are not used to implement I/O. I/O is done through normal use of the message system. I/O controllers look like software processes, and the access managers just send and receive messages with them. This makes debugging AMs for new controllers easy, because you can simulate the controller in software transparently to the AM.

The Elxsi architecture has 16 (64 bit) GPRs per process. The current implementation has 15 register sets (each set is a GPR file plus other process context). A process in each CPU called the register set manager tries to keep all register sets filled with ready-to-run processes. Context switches between processes in register sets are done by the microcode in about 10 microseconds. This number, while true, isn't very significant, since the first thing an AM does when dispatched to handle an I/O completion message is a receive instruction, and that takes about another 60 microseconds. You could speed this up without affecting the context switch time, though.
Even in a machine without multiple register sets, you could easily get a suitably fast context switch given appropriate hardware and/or microcode support. Multiple register sets just turned out to be the best way for our situation. So Elxsi is toward the other end of the spectrum of the lots of registers per process context vs. lots of process contexts tradeoff. But of course those things are influenced by the kind of machine too. You'd expect different choices for a large, lots-of-users, multi-CPU machine implemented on 3 giant ECL boards than for a single chip single-or-few-user machine. Russell Williams ..!{sun|styx}!elxsi!rw
jay@voder.UUCP (02/27/87)
In article <14986@amdcad.UUCP>, bcase@amdcad.UUCP (Brian Case) writes:
>In article <7284@boring.mcvax.cwi.nl> jack@boring.UUCP (Jack Jansen) writes:
>>Something that I would be very interested in is whether there
>>is any research done into machines with an unlimited register
>>file.
>
>Close, but not quite, the C Machine Stack Cache. See SIGARCH/SIGPLAN
>Notices, vol. 17, no. 4, April 1982.

It seems like a waste to have a machine with unlimited registers. At any given point there is only one set of registers active, and the other registers are just sitting there. Instead, a better idea would be to limit the number of windows to, say, two or three, and do loads and stores if you exceed the number of available windows. It might be a worthwhile experiment to find out how many windows are optimal for running UNIX.

Jay
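Jay's experiment is easy to approximate in software with the standard save-one-on-overflow, restore-one-on-underflow policy: drive a window file with a call/return trace and count the spill traffic as the number of windows varies. The trace below is a synthetic random walk over call depth, purely illustrative, not a measured UNIX workload.

```python
import random

def spill_traffic(trace, num_windows):
    """Count window spills + refills for a call ('C') / return ('R') trace,
    saving/restoring exactly one window on demand."""
    resident = 1          # windows on chip (the running procedure's)
    saved = 0             # windows pushed out to memory
    spills = refills = 0
    for op in trace:
        if op == 'C':
            if resident == num_windows:
                saved += 1; spills += 1     # overflow: save oldest window
            else:
                resident += 1
        else:
            if resident == 1 and saved:
                saved -= 1; refills += 1    # underflow: restore caller
            elif resident > 1:
                resident -= 1
    return spills + refills

random.seed(1987)
trace, depth = [], 0
for _ in range(10000):                      # random walk over call depth
    if depth == 0 or random.random() < 0.5:
        trace.append('C'); depth += 1
    else:
        trace.append('R'); depth -= 1

for w in (2, 4, 8, 16):
    print(w, spill_traffic(trace, w))       # traffic falls as windows grow
```

On traces like this, the traffic drops steeply over the first few windows and then flattens, which is consistent with Howard's earlier remark that about 6 or 7 windows capture most of the benefit for the UNIX programs the Berkeley group measured.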
mason@tmsoft.UUCP (02/28/87)
In article <7284@boring.mcvax.cwi.nl> jack@boring.UUCP (Jack Jansen) writes:
>What I have in mind is a stackish machine, with an intelligent
>stack cache that will delay writes around the stack pointer
>(since there's a very good chance that they will never have
>to be written at all).

I have thought a fair amount about this. The problem for really high performance is that, as all the activity takes place around the stack pointer, it is difficult to get much pipelining in progress. This means the cache must be very fancy if you want to do Cray-class machines, although RISC class seems pretty easy. I have toyed with this direction for a Ph.D. thesis.

>One of the advantages of this seems to be that compilers become
>simpler (no special cases for routines that run out of registers),

(But then all that fancy compiler technique the RISC people cite would be a waste :-) (But the compiler-induced bug rate would certainly go down.)
--
../Dave Mason, TM Software Associates (Compilers & System Consulting)
..!{utzoo seismo!mnetor utcsri utgpu lsuc}!tmsoft!mason
markp@valid.UUCP (03/01/87)
> Register window machines such as the Berkeley RISCs have been discussed
> recently in this group. I have some questions about these architectures:
>
> (1) Has anyone tried making the window block size be just one register,
> i.e. the window can have an arbitrary alignment in relation to memory?
> I would expect this to be more efficient in terms of registers used (and
> therefore also memory), but register access time might suffer.

I believe that the "Strategies for Managing the Register File in RISC" paper, mentioned earlier, concluded that a fully-programmable register window size had very low marginal utility over a small variety of fixed sizes (i.e. 4/8/12 registers), and would probably be a lose considering the additional complexity and critical path effects (i.e. you're better off making a larger file with less control complexity).

The other significant conclusion that I recall they made was that the most effective overflow/underflow handling mechanism (averaged across many applications) is to save exactly one context on demand (subroutine call overflows), and to restore exactly one context on demand (subroutine return underflows).

Overflow/underflow problems are correlated with the subroutine nesting depth over time of an application. Programs which go down and come back up several levels with few repetitions at any given level (i.e. Ackermann's function) tend to be pathological overflowers, while programs which spend a great deal of time at or near a particular nesting depth practically never overflow. The point here is that sensible thinking about the calling pattern of your program can dramatically improve performance in a register-window machine. This isn't really a very difficult concept, but people don't always think about it.

> (3) Are there machines with both dynamic (windowed) and static (normal)
> registers? For instance it might be useful for quick O.S.
> interrupts to use only statics and not go to the trouble of setting up a
> register window stack (which would involve lots of reads and writes moving
> the old stack completely out to memory and then back again later). Static
> registers could also be useful for non-local variables.

Absolutely. A register window implementation generally maintains a set of global registers which are accessible from any context (i.e. rM-rN are always global registers). Non-register-window RISC machines must make a distinction between globals, locals, and parameter registers in order to make the best use of the register file, though the allocation is done by the smart compiler/optimizer. I'm sure John can speak more specifically.

Mark Papamarcos
Valid Logic
{ihnp4,hplabs}!pesnta!valid!markp
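The "pathological overflower" point can be made concrete by instrumenting Ackermann's function with the save-one/restore-one policy described above and counting the spill traffic. This is a toy sketch; the window count of 8 and the instrumentation style are my own assumptions.

```python
W = 8                              # on-chip windows (illustrative size)
onchip, saved, spills, refills = 1, 0, 0, 0

def call():
    global onchip, saved, spills
    if onchip == W:                # file full: save one window on demand
        saved += 1; spills += 1
    else:
        onchip += 1

def ret():
    global onchip, saved, refills
    if onchip == 1 and saved:      # file empty: restore one window on demand
        saved -= 1; refills += 1
    else:
        onchip -= 1

def ack(m, n):
    call()
    try:
        if m == 0:
            return n + 1
        if n == 0:
            return ack(m - 1, 1)
        return ack(m - 1, ack(m, n - 1))
    finally:
        ret()

result = ack(2, 3)
print(result, spills, refills)     # result is 9; every spill is later refilled
```

Even this tiny Ackermann call drives the nesting depth past an 8-window file, and because the depth oscillates up and down rather than settling, every overflow is paid for again as an underflow on the way back -- exactly the behavior that makes it a worst case for register windows.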
markp@valid.UUCP (03/02/87)
Incidentally, another good reference in the area of register windows is "Analyzing Multiple Register Sets" by Hitchcock and Sprunt from CMU. Unfortunately, I neglected to write on my copy the publication from which it came, but judging from "1985 IEEE," it's either the 1985 Architecture conference or Transactions.

At any rate, the premise behind the paper is to decouple the effects of multiple register sets and register windows from the architecture of the machine. To do this, the authors constructed simulators for RISC I, a 68000, and a VAX and compared the number of memory references generated by applications running with no multiple register sets, with multiple register sets (no overlap), and with overlapped register windows. The decrease in memory references was as great as 2:1 for the VAX with windows as compared to the normal VAX. In one case (Ackermann's function, a pathological program for any subroutine-call benchmark), register windows actually behaved WORSE than a conventional register set due to the constant overflows and underflows. The severity of this problem obviously depends on the number of windows provided in the register file and the size of each window.

The reduction in memory references for the 3 architectures varied from very little (25%) to a factor of 7, depending on the application. The RISC I fared particularly poorly without its register windows. The actual payoff is influenced by the frequency of register variable access, overflow/underflow rate, and compiler smartness. In addition, some program constructs such as up-level references in Pascal code can generate tremendous headaches for register-window architectures.

Considering the results of this study, it is a rational conclusion that register windows do not necessarily imply RISC, and should be considered as a viable architectural parameter in any machine, subject to the usual tradeoffs.
BTW- I could not find any indication in the paper of what the relative sizes were of the normal and overlapped register files. If indeed (as I suspect) the overlapped register file was substantially larger than the normal register file, then similar performance gains could probably be attained through use of a large register file and a very good register allocation scheme in the compiler/optimizer. Actual performance comparison of these 2 approaches with the same size register file is WAY beyond the scope of this posting! (Although I would LOVE to see some numbers and/or references, with an indication of just how bad [in practice] the compiler complexity issues are between the two approaches.)

An earlier poster maintained that a primary purpose of register windows was to allow operation without a cache. This is not strictly true. In fact, register windows can function even better WITH a cache, since they generate block write and read requests on overflows and underflows. Furthermore, it is impractical to maintain a large enough register file to function as an effective data cache.

In the interest of calling a certain class of gardening tools "spades", register windows should be regarded as no more and no less than they are: a clever architectural mechanism to decrease the register-related overhead of subroutine call and return.

Mark Papamarcos
Valid Logic
{ihnp4,hplabs}!pesnta!valid!markp
jjw@celerity.UUCP (03/04/87)
In article <4376@columbia.UUCP> eppstein@tom.columbia.edu (David Eppstein) writes:
>Register window machines such as the Berkeley RISCs have been discussed
>recently in this group. I have some questions about these architectures:

I can't answer questions 1 or 2, but:

>(3) Are there machines with both dynamic (windowed) and static (normal)
>registers? For instance it might be useful for quick O.S. interrupts to
>use only statics and not go to the trouble of setting up a register
>window stack (which would involve lots of reads and writes moving the
>old stack completely out to memory and then back again later). Static
>registers could also be useful for non-local variables.

The Celerity "Accel" (a.k.a. C1200, C1260) provides both types of registers. This is because of the NCR-32 chip we use. That chip has a RISC-like instruction set (register-to-register operations, "off-line" fetch and store, pipeline with delayed jumps, etc.) but provides a set of 16 static, on-chip registers. The chip also provides access to off-chip registers, and Celerity has configured 48 of those into a register stack (16 overlapped with caller, 16 overlapped with callee, and 16 local).

Because the chip allows more and faster operations on the static registers, they are used whenever possible. The stack registers are used mostly for parameter passing and static-register saving, but they are also used for local variables when all the static registers are allocated.

The system provides several sets (banks) of register stacks (number varies by model) which are allocated to running processes to reduce memory saving and restoring on process switch. For fast OS operation the kernel just switches to a separate bank allocated for this purpose. Other off-chip registers have been allocated as privileged "kernel scratch pad" registers. These registers can be used to hold kernel information about the executing process and are used to quickly save some of the static registers.
--
J. J.
Whelan, Celerity Computing
mash@mips.UUCP (03/05/87)
In article <495@apple.UUCP> baum@apple.UUCP (Allen Baum) writes: >Ditzel et. al. have finally published details of the CRISP processor, >which is a chip implements a stack cache with 32 registers.... >.... They claim exceptional high hit >ratios on the stack cache, and performance of about 9x a VAX780 >running at 16Mhz (for some number of benchmarks including PCC-- >smaller benchmarks show 30x!). There are a number of other >architectural innovations, most importantly a zero cycle branch. >There are papers in both this weeks Compcon and ISSCC proceedings. Unfortunately, I wasn't able to catch either COMPCON, or Dave's later CRISP talk at Stanford, but I do believe that was X a 750, not a 780 [BTL-Research usually compares that way since they have 750s!]. From reports of the Stanford talk, I also heard (maybe somebody else can confirm): 13X (750) Dhrystone (11,000? hard to know what their 750s do?) 20X wc 30X Ackerman's function -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
adb@alice.UUCP (03/10/87)
In article <115@tmsoft.UUCP>, mason@tmsoft.UUCP writes:
>>What I have in mind is a stackish machine, with an intelligent
>>stack cache that will delay writes around the stack pointer
>The problem for really high performance is that: as all the activity takes
>place around the stack pointer it is difficult to get much pipelining in
>progress. ...

For the CRISP microprocessor, we avoid the problem associated with stack pointer writes by designing the architecture so stack pointer writes are hardly ever done -- only on procedure entry and exit. The two instructions that modify the SP are carefully managed so addresses are still calculated properly, and the stack cache only flushes on procedure entry.

>>One of the advantages of this seems to be that compilers become
>>simpler (no special cases for routines that run out of registers),
>(But the compiler induced bug rate would certainly go down.)

We hope so.

Alan Berenbaum, AT&T-Something, ..!ihnp4!homxa!adb, ..!research!adb
adb@alice.UUCP (03/10/87)
In article <14@winchester.mips.UUCP>, mash@mips.UUCP writes: > >Ditzel et. al. have finally published details of the CRISP processor, > >which is a chip implements a stack cache with 32 registers.... > >.... They claim exceptional high hit > >ratios on the stack cache, and performance of about 9x a VAX780 > >running at 16Mhz (for some number of benchmarks including PCC-- > >smaller benchmarks show 30x!). There are a number of other > I do believe that was X a 750, not a 780 ... Sorry, John, it really was 9 times a 780, not a 750. I suspected that posting a slide with 750 numbers would confuse people. > 13X (750) Dhrystone (11,000? hard to know what their 750s do?) The Dhrystone number we published was 13,560, and the 750 we used ran about 996. The precise numbers for wc and Ackerman's are 19.5 and 33.6 times a 750, respectively. Alan Berenbaum, AT&T-Something ..!ihnp4!homxa!adb ..!research!adb