dwc@homxc.UUCP (01/24/87)
>There is some ongoing discussion about register usage in programs.
>While it is true that one generally wants to keep as many variables
>as possible (especially such things as loop indices!) there are times
>when using a lot of registers can actually slow things down.
>This arises when a subroutine shares registers in common with its
>parent.  If the routine is called many times the cost of saving and
>restoring registers during the call and return can offset any
>speed savings thru register usage.  I have actually seen this
>phenomenon occur quite strikingly in programs that require
>multi-precision arithmetic.  The cost of saving the registers for
>a multi-precision add or multiply routine (say) when they are called
>~10^8 or 10^9 times can greatly slow down a program.

am i missing something or does the frequency of calls not matter?
if it is slower to save the registers over many calls, is it not
slower to save the registers for just one call?  isn't it just an
issue of memory references within the routine?  if the register
saves result in a greater number of references than without the
register saves (i.e. register variables), then it is not worth it
no matter how many times the routine is called.

and doesn't this only show up with assembly level programming?
do compilers take the time to keep track of which registers have
been used and only save the 'dirty' ones, or do most call and
return mechanisms save the entire register set on the stack?
anyone know?  if not, i guess this is still a valid issue in c,
since you can declare register variables within any block (is this
the correct term?).

danny chen
ihnp4!homxc!dwc
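[Illustration: a minimal sketch of the kind of routine being discussed.
The routine name, the 4-word size, and the call count are invented here,
not taken from the post; whether the register declarations pay off
depends entirely on how the compiler saves registers across calls.]

    #include <stdio.h>

    #define LIMBS 4   /* hypothetical 4-word multi-precision numbers */

    /* A hot leaf routine: the register declarations ask for the loop
     * index and the carry to live in registers.  Under a callee-save
     * convention those registers get saved and restored on every one
     * of the ~10^8 calls made below. */
    void mp_add(unsigned long *r, unsigned long *a, unsigned long *b)
    {
        register int i;
        register unsigned long carry = 0;

        for (i = 0; i < LIMBS; i++) {
            unsigned long s = a[i] + carry;
            carry = (s < carry);        /* carry out of a[i] + carry */
            r[i] = s + b[i];
            carry += (r[i] < s);        /* carry out of s + b[i] */
        }
    }

    int main(void)
    {
        unsigned long a[LIMBS] = {1, 2, 3, 4};
        unsigned long b[LIMBS] = {5, 6, 7, 8};
        unsigned long r[LIMBS];
        long n;

        for (n = 0; n < 100000000L; n++)    /* ~10^8 calls */
            mp_add(r, a, b);
        printf("%lu %lu %lu %lu\n", r[0], r[1], r[2], r[3]);
        return 0;
    }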
gwu@vlsi.cs.cmu.edu.UUCP (01/26/87)
> am i missing something or does the frequency of calls not matter?
> if it is slower to save the registers over many calls, is it not
> slower to save the registers for just one call?  isn't it just an
> issue of memory references within the routine?  if the register
> saves result in a greater number of references than without the
> register saves (i.e. register variables), then it is not worth it
> no matter how many times the routine is called.

It IS a matter of references to memory.  Picture a routine which has
several accesses to its variables, all of which are declared as
register variables.  If in the middle there is one call to a routine
which requires the registers to be saved, it is probably still
advantageous to keep the variables in registers: the total number of
memory references is pretty much just that of the procedure call
mechanism.  If, on the other hand, the calling routine rarely accesses
its variables, which are not in registers, but calls another routine
very often, the number of memory references is merely that of the
calling routine's variable accesses.

Whether the program saves only those registers actually in use also
makes a difference.  I suspect that this is compiler implementation
dependent.  If all registers are automatically saved with every
procedure invocation, there's obviously going to be some waste.  Even
if the compiler is intelligent about saving registers, the above
example still shows that the number of calls to other routines can
vary the effectiveness of declaring register variables.

George J Wu
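[Illustration: the two cases George contrasts, sketched as C.  The
routine names are invented; the point is how the ratio of variable
accesses to calls shifts the cost.]

    #include <stdio.h>

    static int leaf(int x) { return x + 1; }

    /* Many register accesses, one call: even if that call forces a
     * save/restore, keeping sum and i in registers is a net win. */
    static int heavy_use(int n)
    {
        register int sum = 0, i;

        for (i = 0; i < n; i++)
            sum += i * i;        /* many cheap register accesses */
        return sum + leaf(sum);  /* one call, at most one save/restore */
    }

    /* One variable access per call, many calls: a register variable
     * here could cost a save/restore around every iteration while
     * buying almost nothing. */
    static int mostly_calls(int n)
    {
        int acc = 1;             /* deliberately not a register variable */
        int i;

        for (i = 0; i < n; i++)
            acc = leaf(acc);     /* the calls dominate the cost */
        return acc;
    }

    int main(void)
    {
        printf("%d %d\n", heavy_use(10), mostly_calls(10));
        return 0;
    }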
jgp@moscom.UUCP (01/29/87)
In article <1881@homxc.UUCP> dwc@homxc.UUCP writes:
>In some other article someone else writes:
>>If the routine is called many times the cost of saving and
>>restoring registers during the call and return can offset any
>>speed savings thru register usage.
>am i missing something or does the frequency of calls not matter?

I agree that the frequency shouldn't matter.  The real question is
whether any compiler is smart enough to ignore a register declaration
if the resulting code would run slower (e.g. if a parameter is only
accessed once, outside of a loop, the compiler would probably be
better off ignoring a register declaration).

>do compilers take the time to keep track of which registers have
>been used and only save the 'dirty' ones or do most call and
>return mechanisms save the entire register set on the stack?

It depends on the architecture and the compiler; the three easy ways
to do it are:
        a) save all registers
        b) have the caller save only the registers it is using
        c) have the callee save only the registers it will use
pdp-11's use "a" since they only have 3 register variables anyway.  Most
68k compilers use "c" since you get about 12 register variables.  I don't
know of anyone who uses "b", but it should be about as efficient as "c".

The method used has a large effect on whether setjmp/longjmp can put the
correct values back into register variables (SYSVID says they may be
unpredictable :-(.
--
Jim Prescott	rochester!moscom!jgp
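[Illustration: Jim's "parameter accessed once, outside a loop" case,
written out.  The names are invented.  Honoring the register request on
'scale' buys one cheap access at the price of dedicating (and, under a
callee-save convention, saving and restoring) a register; a compiler
free to ignore the declaration could do better.]

    #include <stdio.h>

    static long sum_scaled(int *v, int n, register int scale)
    {
        long total = 0;
        register int i;           /* the loop index does earn a register */

        for (i = 0; i < n; i++)
            total += v[i];
        return total * scale;     /* the only use of 'scale' */
    }

    int main(void)
    {
        int v[5] = {1, 2, 3, 4, 5};

        printf("%ld\n", sum_scaled(v, 5, 10));   /* prints 150 */
        return 0;
    }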
guy@gorodish.UUCP (01/29/87)
> (SYSVID says they may be unpredictable :-(.
Yup, and you'll just have to live with it. Do you really want to pay
the penalty of extra glop in every call on a 68K just to make a very
rarely-used feature work?
Furthermore, note that there is no guarantee that something *won't*
be placed into a register if it can be; optimizing compilers are
perfectly free to stick variables into registers if they can. That's
why ANSI C says that the values of *all* automatic variables, except
those declared "volatile", are unpredictable when a "longjmp" is
done.
firth@sei.cmu.edu.UUCP (01/30/87)
In article <898@moscom.UUCP> jgp@moscom.UUCP (Jim Prescott) writes:
>It depends on the architecture and the compiler, the three easy ways
>to do it are:
>        a) save all registers
>        b) have the caller save only the registers it is using
>        c) have the callee save only the registers it will use
>pdp-11's use "a" since they only have 3 register variables anyway.  Most
>68k compilers use "c" since you get about 12 register variables.  I don't
>know of anyone who uses "b" but it should be about as efficient as "c".
>
>The method used has a large effect on whether setjmp/longjmp can put the
>correct values back into register variables (SYSVID says they may be
>unpredictable :-(.

The code generators I wrote for the PDP-11 and VAX-11 use method (b).
The main reason for this was precisely the longjump problem: if local
frames store non-local state, then that state can be restored only by
the very slow process of unwinding the stack.  If on the other hand
each frame keeps its own state, a longjump is just (reset stack
pointer; jump), or at worst, if you keep a closed stack, (reset frame
pointer; jump; destination resets stack front pointer).  The
alternative of having the longjump NOT restore state that happened to
be in registers was not permitted by the languages in question.

Well, I benchmarked this technique against the alternative of having
the callee save, and it came out better on both machines.  Surprising
for the VAX, since that has hardware to save and restore registers,
but in fact the special instructions are only marginally faster than
doing it by hand.  (Of course, one does not use CALLS under any
circumstances; using JSB, controlling the local-variable stack
yourself, and passing parameters in registers buys you a factor of
three in the procedure call overhead.)

The main reasons for the difference are interesting:

(a) fewer registers are involved.  This is because the callee must save
    every register it uses ANYWHERE in its body, whereas the caller need
    save only registers CURRENTLY LIVE.

(b) fewer memory accesses.  Callee must save and restore always; caller
    can restore the register from a declared variable some (~1/3) of the
    time, and so need not save it.

For other methodological reasons too boring to post here, I am a firm
believer in the "caller save always; callee save never" technique.

Robert Firth
franka@mntgfx.UUCP (01/30/87)
In article <898@moscom.UUCP> jgp@moscom.UUCP (Jim Prescott) writes:
>In article <1881@homxc.UUCP> dwc@homxc.UUCP writes:
>>In some other article someone else writes:
>>do compilers take the time to keep track of which registers have
>>been used and only save the 'dirty' ones or do most call and
>>return mechanisms save the entire register set on the stack?
>It depends on the architecture and the compiler, the three easy ways
>to do it are:
>        a) save all registers
>        b) have the caller save only the registers it is using
>        c) have the callee save only the registers it will use
>pdp-11's use "a" since they only have 3 register variables anyway.  Most
>68k compilers use "c" since you get about 12 register variables.  I don't
>know of anyone who uses "b" but it should be about as efficient as "c".
>
Actually b can be much less efficient than c, because the strategy in
c can conditionally save registers while b must save all which might
be used.  E.g., consider the routine

	int foo(a)
	struct foo_structure *a;
	{
		if (a) {
			/* long hairy computation which uses
			 * several register variables
			 */
		}
	}

Using scheme b, all registers which MAY be used in foo MUST be saved,
while scheme c allows you to save registers only if a is non-null.

Guy Steele goes into this very well in an MIT Tech Report called
"Debunking the Expensive Procedure Call Myth" (Maybe.  This may be the
subtitle after a dry formal title.).  It's written in the context of
LISP compilers, but talks about compiling for register machines.  It
also talks about tail recursion (something even C compiler writers
could probably learn from).  All in all, a good paper.

>The method used has a large effect on whether setjmp/longjmp can put the
>correct values back into register variables (SYSVID says they may be
>unpredictable :-(.

Sad, but true.  The non-orthogonality of "modern" machine design again
pokes through to bite the programmer.  So it goes.

Frank Adrian
Mentor Graphics, Inc.
gwu@vlsi.cs.cmu.edu.UUCP (01/31/87)
> I agree that the frequency shouldn't matter.  The real question would
> be if any compiler is smart enough to ignore a register declaration if
> the resulting code would run slower (eg. if a parameter is only accessed
> once, outside of a loop, the compiler would probably be better off to
> ignore a register declaration).

Such an intelligent compiler would need to know the number of times
that a given subroutine is called.  The number of times that a routine
will be called, if at all, is often determinable at run-time only, not
compile-time.

>        a) save all registers
>        b) have the caller save only the registers it is using
>        c) have the callee save only the registers it will use
> pdp-11's use "a" since they only have 3 register variables anyway.  Most
> 68k compilers use "c" since you get about 12 register variables.  I don't
> know of anyone who uses "b" but it should be about as efficient as "c".

In the case of the callee saving the registers it is planning to use,
the compiler won't know whether or not called subroutines use any
registers, and therefore will be unable to figure out whether it is
more efficient to use register variables in the main routine.  Of
course, there is always the possibility of combining the compiler and
linker....

George
sundar@cwruecmp.UUCP (02/01/87)
>do compilers take the time to keep track of which registers have
>been used and only save the 'dirty' ones or do most call and
>return mechanisms save the entire register set on the stack?

>It depends on the architecture and the compiler, the three easy ways
>to do it are
>        a) save all registers
>        b) have the caller save only the registers it is using
>        c) have the callee save only the registers it will use
>pdp-11's use "a" since they only have 3 register variables anyway.  Most
>68k compilers use "c" since you get about 12 register variables.  I don't
>know of anyone who uses "b" but it should be about as efficient as "c".

I am just curious.  Isn't it a security hole to use method c) above?
If the caller is a system routine and the callee is my program and I
am supposed to save and restore the registers that I intend to use, I
can have some fun by not saving the registers in the first place and,
in addition, destroying them.

sri
guy@gorodish.UUCP (02/01/87)
> I am just curious.  Isn't it a security hole to use the method c)
> above?  If the caller is a system routine and the callee is my program
> and I am supposed to save and restore registers that I intend to use,
> I can have some fun by not saving the registers in the first place and
> in addition destroying them.

In general, calls that leave one protection domain and enter another
domain that lacks certain privileges that the original domain had are
insecure unless some care is taken.  The callee can have some fun
simply by returning to a location other than the one it was supposed
to return to; you have to prevent this from happening.  If the caller
passes some datum to the callee by reference, the callee must not be
able to modify anything other than that datum.

Cross-domain calls of this sort would have to be treated differently,
whether by the instruction-set processor, the compiler, the run-time
support code, or some combination thereof - unless the overhead of
treating such calls differently is minimal, and the system can afford
to make all calls "secure" against the callee doing something
unpleasant to protected objects in the caller's domain or to the
caller itself.
mouse@mcgill-vision.UUCP (02/12/87)
In article <476@mntgfx.MENTOR.COM>, franka@mntgfx.MENTOR.COM (Frank A. Adrian) writes:
> In article <898@moscom.UUCP> jgp@moscom.UUCP (Jim Prescott) writes:
>> a) save all registers
>> b) have the caller save only the registers it is using
>> c) have the callee save only the registers it will use
> Actually b can be much less efficient than c, because [...]
> int foo(a)
> struct foo_structure *a;
> [example]
> Using scheme b, all registers which MAY be used in foo MUST be saved,
> while scheme c allows you to save registers only if a is non-null.
>> The method used has a large effect on whether setjmp/longjmp can put
>> the correct values back into register variables (SYSVID says they
>> may be unpredictable :-(.
> Sad, but true.

WHAT?!?!  I can't see any reason that longjmp can't get the registers
back - if it can't restore them easily on that architecture then just
have setjmp save them in the jmp_buf!  What's the problem?  This will
make longjmp basically useless, especially in the face of compilers
that optimize non-"register" variables into registers!

Note that the possibility of method (c) means that it won't even work
to avoid registers in the function calling setjmp: there have to be no
live registers at the time of the setjmp, which with method (c) means
no registers at all, because called functions might not save them -
they'd still be live at the setjmp()!

der Mouse

USA: {ihnp4,decvax,akgua,utzoo,etc}!utcsri!mcgill-vision!mouse
     think!mosart!mcgill-vision!mouse
Europe: mcvax!decvax!utcsri!mcgill-vision!mouse
ARPAnet: think!mosart!mcgill-vision!mouse@harvard.harvard.edu
guy@gorodish.UUCP (02/14/87)
>WHAT?!?!  I can't see any reason that longjmp can't get the registers
>back - if it can't restore them easily on that architecture then just
>have setjmp save them in the jmp_buf!  What's the problem?

The problem, to put it simply, is that if "longjmp" can't get the
values of the register variables *at the time the longjmp was called*,
getting them from values saved by "setjmp" would restore the values of
the register variables *at the time the setjmp was called*, which is
NOT what the SVID (or various UNIX manuals) claim that "longjmp" does.
On some architectures it is painful to try to get the values as of the
time "longjmp" was called; you can't just walk the stack to find the
values saved when the routine that called "setjmp" last made a
procedure call.
howard@cpocd2.UUCP (02/16/87)
In article <651@mcgill-vision.UUCP> mouse@mcgill-vision.UUCP (der Mouse) writes:
>In article <476@mntgfx.MENTOR.COM>, franka@mntgfx.MENTOR.COM (Frank A. Adrian) writes:
>> In article <898@moscom.UUCP> jgp@moscom.UUCP (Jim Prescott) writes:
>>> a) save all registers
>>> b) have the caller save only the registers it is using
>>> c) have the callee save only the registers it will use

All this discussion, and no one's suggested using hardware to avoid
saving any registers at all (most of the time)!  This has been done in
a few machines; I'm most familiar with the RISC I since I helped
design the chip.

Isn't this a computer ARCHITECTURE group?  Did you know that HALF of
the memory traffic on a VAX-11/780 is data being saved/restored?
--
	Howard A. Landman
	...!intelca!mipos3!cpocd2!howard
crowl@rochester.UUCP (02/17/87)
In article <425@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes:
>Isn't this a computer ARCHITECTURE group?  Did you know that HALF of the
>memory traffic on a VAX-11/780 is data being saved/restored?

Wait a second!  ALL the memory traffic on ANY machine is data that is
being saved and restored.  That is the purpose of memory.  (Picky,
picky :-)
--
  Lawrence Crowl		716-275-5766	University of Rochester
			crowl@rochester.arpa	Computer Science Department
...!{allegra,decvax,seismo}!rochester!crowl	Rochester, New York, 14627
greg@utcsri.UUCP (Gregory Smith) (02/18/87)
In article <651@mcgill-vision.UUCP> mouse@mcgill-vision.UUCP writes:
>>> The method used has a large effect on whether setjmp/longjmp can put
>>> the correct values back into register variables (SYSVID says they
>>> may be unpredictable :-(.
>> Sad, but true.
>
>WHAT?!?!  I can't see any reason that longjmp can't get the registers
>back - if it can't restore them easily on that architecture then just
>have setjmp save them in the jmp_buf!  What's the problem?  This will
>make longjmp basically useless, especially in the face of compilers
>that optimize non-"register" variables into registers!

#include <stdio.h>
#include <setjmp.h>

jmp_buf env;

main(){
	register int i;
	int j;
	i = j = 1;
	if(setjmp(env)){
		printf( "i==%d, j==%d\n", i, j );
		exit(0);
	}
	i = j = 2;
	func();
}

func(){
	longjmp( env, 1 );
}

The above program should ideally report that i=2.  Certainly it will
say that j is 2.  If i is saved in the jmp_buf, and restored by
longjmp, then i will be reported as 1.
--
----------------------------------------------------------------------
Greg Smith     University of Toronto       UUCP: ..utzoo!utcsri!greg
Have vAX, will hack...
howard@cpocd2.UUCP (02/18/87)
In article <24929@rochester.ARPA> crowl@rochester.UUCP (Lawrence Crowl) writes:
>In article <425@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes:
>>Isn't this a computer ARCHITECTURE group?  Did you know that HALF of the
>>memory traffic on a VAX-11/780 is data being saved/restored?
>
>Wait a second!  ALL the memory traffic on ANY machine is data that is being
>saved and restored.  That is the purpose of memory.  (Picky, picky :-)

Not really.  Save and restore refer specifically to memory traffic
generated by procedure calls and returns (respectively) in order to
prevent one routine (the callee) from trashing another routine's (the
caller's) variables that are in registers.  Many memory references do
not fall into this category.  Normal loads and stores are one example,
but also consider instruction fetches.  Are you seriously claiming
that an instruction fetch is "data being saved [or] restored"?

Procedure calls are grotesquely expensive on the most common machines.
A good rule of thumb for a VAX is that a procedure call takes as long
as executing about 100 computational instructions.  This is why so
much of the stdio package is macros instead of procedures.  There are
several ways to try to avoid this overhead.  One is hardware support
for call/return, as in some RISC machines.  Another way is to
eliminate procedures from your language; this is what Forth did with
its "threaded code", and is the main reason Forth is so fast.  On a
machine with multiple register windows the speed advantage of Forth
would evaporate; of course you could also implement a hardware stack,
which might increase its advantage!
--
	Howard A. Landman
	...!intelca!mipos3!cpocd2!howard
montnaro@sprite.UUCP (02/19/87)
In article <24929@rochester.ARPA> crowl@rochester.UUCP (Lawrence Crowl) writes:
>In article <425@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes:
>>Isn't this a computer ARCHITECTURE group?  Did you know that HALF of the
>>memory traffic on a VAX-11/780 is data being saved/restored?
>
>Wait a second!  ALL the memory traffic on ANY machine is data that is being
>saved and restored.  That is the purpose of memory.  (Picky, picky :-)

'Taint necessarily so.  Doesn't the VAX do memory<->memory transfers
as well as register<->register and register<->memory?

Skip Montanaro

ARPA: montanaro%desdemona.tcpip@ge-crd.arpa
UUCP: seismo!rochester!steinmetz!desdemona!montanaro
GE DECnet: csbvax::mrgate!montanaro@desdemona@smtp@tcpgateway
ad@uwmacc.UUCP (02/20/87)
In article <430@cpocd2.UUCP>, howard@cpocd2.UUCP (Howard A. Landman) writes:
>Procedure calls are grotesquely expensive on the most common machines.  A good
>rule of thumb for a VAX is that a procedure call takes as long as executing
>about 100 computational instructions.  This is why so much of the stdio package
>is macros instead of procedures.  There are several ways to try to avoid this
>overhead.  One is hardware support for call/return, as in some RISC machines.
>Another way is to eliminate procedures from your language; this is what Forth
>did with its "threaded code", and is the main reason Forth is so fast.  On a
>machine with multiple register windows the speed advantage of Forth would
>evaporate; of course you could also implement a hardware stack, which might
>increase its advantage!

Please explain what you mean by multiple register windows and how they
impact Forth's speed.  Thanks.

Alan Dickman    608-262-9421
Arpa: ad@unix.macc.wisc.edu		Bitnet: ad@wisvmacc
UUCP: ...{allegra,ihnp4,seismo}!uwvax!uwmacc!ad
1210 West Dayton Street, Madison, WI 53706
michael@crlt.UUCP (02/20/87)
[Yum yum!]

In article <898@moscom.UUCP>, jgp@moscom.UUCP (Jim Prescott) writes:
>
> >do compilers take the time to keep track of which registers have
> >been used and only save the 'dirty' ones or do most call and
> >return mechanisms save the entire register set on the stack?
>
> It depends on the architecture and the compiler, the three easy ways
> to do it are:
>        a) save all registers
>        b) have the caller save only the registers it is using
>        c) have the callee save only the registers it will use
> pdp-11's use "a" since they only have 3 register variables anyway.  Most
> 68k compilers use "c" since you get about 12 register variables.  I don't
> know of anyone who uses "b" but it should be about as efficient as "c".

In code generated by a brute-force compiler, "b" wastes a lot of CPU
by saving registers that will never be trashed.  Suppose you're using,
say, five of them, and make calls to subroutines that will only use
one.  Method "c" does one-fifth as much register save/restoration as
method "b".

The only place where method "c" saves an unused variable that method
"b" doesn't is near the top level of your program, where a given
register might not yet contain anything that must be preserved (unless
the author of the O.S. was a total idiot).  Here method "c" will waste
time saving and restoring junk.  But the top-level code would normally
be executed much less than the lower-level stuff.

On the other hand, method "b" could gain a global advantage when
subroutines with few register variables make many calls to subroutines
with many, which don't in turn make calls to another generation of
children (or at least not while their registers need preservation).
In cases like this, method "b" has saved the register once, while "c"
would save it many times.

Method "b" has other advantages as well:

 - It makes the caller, not the callee, responsible for the integrity
   of its own environment.  Thus, if a hand-coded routine makes an
   error in register preservation, it will break itself (and finding
   the error will be easy), not previously-debugged portions of the
   calling routine.

 - It offers the opportunity to save CPU by omitting the storage and
   restoration of register variables that no longer contain anything
   of value (and a smart enough compiler might be able to determine
   this).

In article <540@sei.cmu.edu.UUCP> firth@sei.cmu.edu.UUCP (Robert Firth) writes:
>In article <898@moscom.UUCP> jgp@moscom.UUCP (Jim Prescott) writes:
>>
>>The method used has a large effect on whether setjmp/longjmp can put the
>>correct values back into register variables (SYSVID says they may be
>>unpredictable :-(.
>
>The code generators I wrote for the PDP-11 and VAX-11 use method (b).  The
>main reason for this was precisely the longjump problem: if local frames
>store non-local state, then that state can be restored only by the very
>slow process of unwinding the stack. []

I seem to be missing something here.  Why can't setjmp save the entire
register set, plus as much of the stack as would be pushed by the
caller as it calls, in "env"?  All "longjmp" needs to do is restore
the state of the caller as of the "setjmp" call, and it is allowed,
and even encouraged, to know what kind of code its particular compiler
generates.

If the caller isn't doing such things as varying its own stack frame,
only saving >its< caller's registers when they might be used locally
(rather than saving them once on entry and restoring them on exit),
and shuttling register variables off to holding areas at unknowable
locations in its frame, this will be sufficient.  (And if the compiler
is smart enough to do this sort of thing, it can also be smart enough
to recognize that "setjmp" is being called, and provide as many hooks
as necessary.)

I would think that "b" could cause more problems for setjmp/longjmp
than "c", since setjmp would have to find and save a variable number
of stored registers, and longjmp restore them, without damaging other
things in the caller's frame.  What have I overlooked?

>Well, I benchmarked this technique [b] against the alternative of having the
>callee save, and it came out better on both machines. []  The main reasons
>for the difference are interesting:
>
>(a) fewer registers are involved.  This is because the callee must save
>    every register it uses ANYWHERE in its body, whereas the caller need
>    save only registers CURRENTLY LIVE.
>
>(b) fewer memory accesses.  Callee must save and restore always; caller can
>    restore the register from a declared variable some (~1/3) of the time,
>    and so need not save it.

I find this benchmark interesting.  Does (b) actually make up for
routines that use fewer register variables?  Or does that savings get
canceled by the reduced number of saves when the callee has more?

===========================================================================
 "I've got code in my node."	| UUCP: ...!ihnp4!itivax!node!michael
				| AUDIO: (313) 973-8787
	Michael McClary		| SNAIL: 2091 Chalmers, Ann Arbor MI 48104
---------------------------------------------------------------------------
Above opinions are the official position of McClary Associates.  Customers
may have opinions of their own, which are given all the attention paid for.
===========================================================================
faustus@ucbcad.UUCP (02/22/87)
In article <4160@utcsri.UUCP>, greg@utcsri.UUCP (Gregory Smith) writes:
> jmp_buf env;
>
> main(){
> 	register int i;
> 	int j;
> 	i = j = 1;
> 	if(setjmp(env)){
> 		printf( "i==%d, j==%d\n", i, j );
> 		exit(0);
> 	}
> 	i = j = 2;
> 	func();
> }
>
> func(){
> 	longjmp( env, 1 );
> }
>
> The above program should ideally report that i=2.  Certainly it will
> say that j is 2.  If i is saved in the jmp_buf, and restored by
> longjmp, then i will be reported as 1.

Put a "register int k = 3" in func() -- the program had better not say
that i is 3.  It shouldn't be hard for longjmp() to restore the
registers from the stack frame, as opposed to from the jmp_buf.

	Wayne
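[Illustration: Wayne's variant of Greg's program, written out as a
complete file.  The includes and the particular use of k are filled in
here; on a callee-save machine k may land in the same physical register
as main's i, which is exactly the hazard being discussed.]

    #include <stdio.h>
    #include <setjmp.h>

    jmp_buf env;

    void func(void)
    {
        register int k = 3;   /* may share a physical register with i */

        longjmp(env, k - 2);  /* k is live here, so it really occupies
                               * its register at the moment of the jump */
    }

    int main(void)
    {
        register int i;
        int j;

        i = j = 1;
        if (setjmp(env)) {
            /* Restoring registers from the jmp_buf would print i==1;
             * unwinding func()'s frame should print i==2; printing
             * i==3 (k's value) would be plainly broken. */
            printf("i==%d, j==%d\n", i, j);
            return 0;
        }
        i = j = 2;
        func();
        return 0;
    }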
howard@cpocd2.UUCP (02/24/87)
In article <1093@uwmacc.UUCP> ad@uwmacc.UUCP (Alan Lloyd Dickman) writes:
>In article <430@cpocd2.UUCP>, howard@cpocd2.UUCP (Howard A. Landman) writes:
>>Procedure calls are grotesquely expensive on the most common machines.
>>There are several ways to try to avoid this
>>overhead.  One is hardware support for call/return, as in some RISC machines.
>>Another way is to eliminate procedures from your language; this is what Forth
>>did with its "threaded code", and is the main reason Forth is so fast.  On a
>>machine with multiple register windows the speed advantage of Forth would
>>evaporate; of course you could also implement a hardware stack, which might
>>increase its advantage!
>
>Please explain what you mean by multiple register windows and how they
>impact Forth's speed.  Thanks.

Some RISC machines have multiple sets of registers, organized as
overlapping windows.  This approach to speeding up calls/returns was
suggested to the RISC I team at Berkeley by Forest Baskett;
simulations convinced us it would work for typical UNIX code, so we
implemented it.  It worked.  The average time for a procedure call was
only a few instruction cycles (occasionally, you overflow/underflow
the set of windows you have and must actually save/restore, but not on
every call and return).  Also, since the windows overlap (some
registers appear in two adjacent windows), we could pass parameters
and return results in registers with essentially no overhead.

The reason it's fast isn't hard to see.  If you can get a clean set of
registers by just incrementing a window-number register, then you
don't need to save all your old registers away in main memory.  If
that data can be reactivated by just decrementing a window-number
register, then you don't need to restore it.  This saves all the
overhead of save/restore, as long as everything fits into your
registers.  The costs include chip area and slowing down of the
register access time as more registers are added.

SO:

1) Procedure calls/returns are expensive (slow) on most machines.
2) Forth doesn't do procedure calls/returns, so on those machines it
   has a speed advantage (compared to C, Pascal, Fortran, Lisp, ...).
3) On a machine where calls/returns are fast, the advantage
   disappears.  Avoiding something which is inexpensive buys you
   nothing.
4) I believe that most of Forth's memory traffic is due to stack
   manipulations.  If this is true, hardware stack support could speed
   Forth up quite a bit; but it would have to be better than what
   Burroughs did in their early "stack machines", where only the top
   two elements were in registers.

REFERENCES:

Fitzpatrick et al, "A RISCy Approach to VLSI", VLSI Design, 4th
quarter 1981
	- OR -
Fitzpatrick et al, "VLSI Implementations of a Reduced Instruction Set
Computer", CMU Conference on VLSI Systems and Computations, 1981
	These are essentially the same article; the VLSI Design version
	has nicer graphics.  They both describe the RISC I.

Tamir & Sequin, "Strategies for Managing the Register File in RISC",
IEEE Trans. Comput., v.C-32, November 1983

Halbert & Kessler, "Windows of Overlapping Registers", CS292R Final
Report, U.C. Berkeley, June 1980
	This will be hard to get, but it's the most detailed reference.

Sherburne et al, "A 32-Bit NMOS Microprocessor with a Large Register
File", IEEE JSSC, v.SC-19 #5, October 1984
	Describes the RISC II.

Sherburne, "Processor Design Tradeoffs in VLSI", PhD dissertation,
Comp. Sci. Div., U.C. Berkeley, April 1984
	Describes the RISC II.
--
	Howard A. Landman
	...!intelca!mipos3!cpocd2!howard
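[Illustration: a toy software model of overlapping register windows.
All the sizes here are invented for the sketch - they are not RISC I's
actual parameters - and the spill/refill trap that a real machine takes
on window overflow/underflow is left out.]

    #include <stdio.h>

    #define NREGS   64        /* physical register file size */
    #define WINDOW  16        /* registers visible to one procedure */
    #define OVERLAP  4        /* registers shared with the callee */
    #define STEP    (WINDOW - OVERLAP)

    static int file[NREGS];   /* the physical register file */
    static int cwp = 0;       /* current window base pointer */

    /* Map a window-relative register number to a physical register. */
    static int phys(int r) { return (cwp + r) % NREGS; }

    /* A "call" just advances the window base; the caller's last
     * OVERLAP registers become the callee's first OVERLAP registers,
     * so parameters pass with no memory traffic at all. */
    static void call(void) { cwp = (cwp + STEP) % NREGS; }
    static void ret(void)  { cwp = (cwp - STEP + NREGS) % NREGS; }

    int main(void)
    {
        file[phys(STEP)] = 42;    /* caller writes its first "out" reg */
        call();                   /* increment the window pointer... */
        printf("callee reads %d from its register 0\n", file[phys(0)]);
        ret();                    /* ...decrement it to return */
        printf("caller's register %d is still %d\n", STEP, file[phys(STEP)]);
        return 0;
    }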
meissner@dg_rtp.UUCP (02/25/87)
In article <651@mcgill-vision.UUCP> mouse@mcgill-vision.UUCP (der Mouse) writes:
> /* stuff about when to save registers deleted */
> WHAT?!?!  I can't see any reason that longjmp can't get the registers
> back - if it can't restore them easily on that architecture then just
> have setjmp save them in the jmp_buf!  What's the problem?  This will
> make longjmp basically useless, especially in the face of compilers
> that optimize non-"register" variables into registers!

Because of this, ANSI says that even autos can be trashed by longjmp.
That's the biz....  (There was also a contingent that wanted to remove
longjmp/setjmp completely because they are such a wart, but that is
another story....)
--
Michael Meissner, Data General		Uucp: ...mcnc!rti-sel!dg_rtp!meissner
It is 11pm, do you know what your sendmail and uucico are doing?
chuck@amdahl.UUCP (02/25/87)
In article <443@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes:
>Some RISC machines have multiple sets of registers, organized as overlapping
>windows.  This approach to speeding up calls/returns was suggested to the
>RISC I team at Berkeley by Forest Baskett; simulations convinced us it would
>work for typical UNIX code, so we implemented it.  It worked.
>	Howard A. Landman

As I understand it, each register window is a fixed size, and each
window is divided into three fixed-size subwindows.  (I don't remember
what these three subwindows are meant to be used for.)  One particular
aspect of the implementation has been bothering me; namely, that the
amount of overlap between two windows is a constant size.  This has
two consequences.

First, subroutines which don't use all of their window waste register
storage.  Typical programs that I write tend to contain short routines
that use few variables and need few registers.  Thus, my programs
can't get the full benefit of the register windows; many of the
registers remain unused.

Second, subroutines that want to pass more parameters than will fit in
an overlapped window, but fewer parameters than would fit in a
complete window, are also penalized.  Not all parameters will be able
to be passed in registers, and some must be saved in memory.

So I'm wondering why the implementations of the RISC I and II don't
allow the programmer to specify the number of registers that should be
overlapped on the procedure call.  Obviously, this would make
subroutine calls more expensive and difficult to implement.  The
overlap size would have to be fetched from memory and added to the
window pointer.  Checking to see if the register stack had overflowed
might be more expensive...  However, these seem like trivial problems
when compared to the benefit of effectively having a larger register
stack.

Are there reasons I am missing that make fixed-size windows extremely
advantageous?

thanks, Chuck
bcase@amdcad.UUCP (02/25/87)
In article <5746@amdahl.UUCP> chuck@amdahl.UUCP (Charles Simmons) writes:
>In article <443@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes:
>>Some RISC machines have multiple sets of registers, organized as overlapping
>>windows.  This approach to speeding up calls/returns was suggested to the
>>RISC I team at Berkeley by Forest Baskett; simulations convinced us it would
>>work for typical UNIX code, so we implemented it.  It worked.
>>	Howard A. Landman
>
>As I understand it, each register window is a fixed size, and each window
>is divided into three fixed size subwindows.  (I don't remember what
>these three subwindows are meant to be used for.)  One particular aspect
>of the implementation has been bothering me; namely, the amount of
>overlap between two windows is a constant size.  This has two consequences.
> ...
>So I'm wondering why the implementation of the RISC I and II don't
>allow the programmer to specify the number of registers that should
>be overlapped on the procedure call.  Obviously, this would make
>subroutine calls more expensive and difficult to implement.  The
>overlap size would have to be fetched from memory and added to the
>window pointer.  Checking to see if the register stack had overflowed
>might be more expensive...  However, these seem like trivial problems
>when compared to the benefit of effectively having a larger
>register stack.

BTW, the Pyramid machine also has fixed-size windows (16 in, 16 local,
and 16 out, plus 16 globals).

Well, the main reason that arbitrary window size was not implemented
in the RISC I & II is that it requires an adder in the register
address decode path.  Using the technology available to them, this
would have slowed the machine and used more area.  Using the Berkeley
scheme, the "add" function is integrated right into the register file
address decoders.

Arbitrary window size is not that difficult to implement, assuming
that you can pay the cost of the adder.  The window size does not need
to be separately fetched from memory: it can be an immediate field in
an instruction.  In the implementation that I am thinking of, based on
my MS thesis work, the window is allocated at procedure entry time and
deallocated at procedure exit time, just as in the RISC I & II.  Thus,
the window which is allocated must be made large enough to contain the
outgoing parameters for any subsequent call.  Also, the bounds-checking
operation to detect overflow and underflow need only be done once at
call time and once at exit time.  See Ditzel's paper on the C Machine
Stack Cache for a more general implementation (but expensive!!).

Yes, arbitrary window size does make calls and returns more expensive,
but not by much (under the assumptions I have made).  The benefits of
better register utilization and accommodation of any argument list
outweigh the small cost.  The call and return are still an order of
magnitude faster (in terms of cycles) than the "calls"-style
exemplified by the VAX.

Again, a very interesting alternative to a stack cache is register
allocation at link time.  Wall from DEC Western Research has done/is
doing the first work I have seen in this area.

    bcase
----------
Keep watching this space just a little longer.
klein@gravity.UUCP (02/25/87)
In article <5746@amdahl.UUCP>, chuck@amdahl.UUCP (Charles Simmons) writes:
> Are there reasons I am missing that make fixed sized windows extremely
> advantageous?
>
> thanks, Chuck

Complexity.
--
Mike Klein					klein@sun.{arpa,com}
Sun Microsystems, Inc.		{ucbvax,hplabs,ihnp4,seismo}!sun!klein
Mountain View, CA
aglew@ccvaxa.UUCP (02/26/87)
>So I'm wondering why the implementation of the RISC I and II don't
>allow the programmer to specify the number of registers that should
>be overlapped on the procedure call.  Obviously, this would make
>subroutine calls more expensive and difficult to implement.  The
>overlap size would have to be fetched from memory and added to the
>window pointer.  Checking to see if the register stack had overflowed
>might be more expensive...  However, these seem like trivial problems
>when compared to the benefit of effectively having a larger
>register stack.

You've just gone and made procedure calls more expensive.  Since
register stack overflows rarely occur(*), and the fixed-size window
handles the 90% case, the numbers don't necessarily work out.

>Are there reasons I am missing that make fixed sized windows extremely
>advantageous?

Register windows require that a "logical register number" be combined
with a "register window pointer" to produce the actual "physical
register number" in the register file.  Usually, this is a simple
addition.  But then an extra addition is required on all register
accesses.  This adds a delay equal to that of the fastest carry chain
you can provide to the critical path.  It may not be possible to
absorb this in a pipeline stage.  Fixed-size windows can mean that you
don't have to add all the bits together.  Fewer bits => shorter
carries => faster register cycle.  Note that this was apparently NOT a
concern for RISC.

---
(*) Register stack overflows rarely occur: for a fixed-size register
stack, this is true only because routines typically have a small
dynamic range in procedure calling depth.  The pertinent studies were
made on UNIX.  I'm not so sure that UNIX is a good example for future
systems, since UNIX contains a lot of low-level code without benefit
of abstraction.  Higher-level languages, or systems that encourage the
use of procedure calls, may well have a greater dynamic range.

Andy "Krazy" Glew.  Gould CSD-Urbana.
USEnet: ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801
ARPAnet: aglew@gswd-vms.arpa
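[Illustration: aglew's addressing point in miniature.  The sizes here
are invented, and this models simple non-overlapping, power-of-two
windows; Berkeley's overlapping scheme instead folds the offset into
the register file decoders.  With window bases aligned to a
power-of-two step, forming physical = base + logical never generates a
carry, so the "add" degenerates to pasting bit fields together; with
arbitrary bases, a full adder sits in the register decode path.]

    #include <stdio.h>

    #define STEP 8                    /* registers per window, 2^3 */

    /* General case: a real addition, i.e. a carry chain in hardware. */
    static unsigned phys_add(unsigned base, unsigned logical)
    {
        return base + logical;
    }

    /* Aligned case: window number and register number occupy disjoint
     * bit fields, so OR-ing them together is enough - no carries. */
    static unsigned phys_concat(unsigned window, unsigned logical)
    {
        return (window << 3) | (logical & (STEP - 1));
    }

    int main(void)
    {
        unsigned w, r;

        for (w = 0; w < 4; w++)
            for (r = 0; r < STEP; r++)
                if (phys_add(w * STEP, r) != phys_concat(w, r))
                    printf("mismatch at w=%u r=%u\n", w, r);
        printf("add and concatenate agree for aligned windows\n");
        return 0;
    }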