rfg@ics.uci.edu (Ron Guilmette) (11/17/89)
In article <2631@yogi.oakhill.UUCP> marvin@yogi.UUCP (Marvin Denman) writes: > >In article <PFEIFFER.89Nov15213136@puye.nmsu.edu> pfeiffer@nmsu.edu (Joe Pfeiffer) writes: >> >>This is probably getting a bit far afield, but I'm very puzzled by >>this statement. It has always been my impression that it doesn't much >>matter whether the registers are preserved by the caller or the >>callee, just so somebody does it. Why is there a greater burden on >>the compiler writer if it's the callee? >> >>-Joe Pfeiffer. > > >The answer probably lies in the fact that many leaf calls will not require >enough registers that they need to save anything if they know which registers >are guaranteed to be "dead". Of course, but the leaf routines never do know which ones are dead, do they? >These [leaf routines] are by frequency a large percentage of >calls so they should not be penalized. Using the current standard a leaf >routine can use all of the temporary set of registers plus all of the >parameter passing registers without any saving. Why should the caller save >"all" of his live registers when only a small fraction will be used? My original question was "Why should the caller save *any* of his live registers when he has absolutely no knowledge of whether or not *any* of them will be used (i.e. clobbered) in the called routine?" I don't think that I have yet seen a good answer to this question. >The converse argument is why should the callee be required to save dead >registers? What if there are *never* any dead registers? Consider this. Given the typical program (which uses at least 30 or more "values" throughout its execution lifetime) it may be possible to use *all* available registers for at least *some* productive purpose at *all* points throughout a program. Even without having global data-flow information, at the very least you could decide to keep several simple global variables in registers, and the odds are very good that you would get at least some performance increase as a result of doing this. On the other hand, if you *do* have good global data flow analysis, then there is no doubt whatsoever that you should be able to use all registers for "live" values at all points throughout a program. Thus, you should always have registers filled with live values (in any case). Thus, you should always have a "pure" caller-saves convention. QED >The current scheme is a fairly efficient compromise between the two, that will >do in the absence of global register allocation. Ah, ha! So if I understand you correctly, you are saying that this parameter passing convention was intentionally designed to be a "compromise" which would not make stupid compilers look too bad (even at the expense of decreasing performance in the presence of a good modern compiler with data-flow analysis). Is that correct? // rfg
phillip@oakhill.UUCP (Mike Phillip) (11/21/89)
In article <1989Nov16.212149.9770@paris.ics.uci.edu> Ron Guilmette <rfg@ics.uci.edu> writes: > >My original question was "Why should the caller save *any* of his live >registers when he has absolutely no knowledge of whether or not *any* of >them will be used (i.e. clobbered) in the called routine?" > >I don't think that I have yet seen a good answer to this question. > I think that the issues of parameter passing and register allocation have become somewhat confused in this discussion thus far, as have the relative merits of callee vs caller saved registers. To make matters even more confusing, the "meta"-issue of compiler technology has been introduced without clearly distinguishing between local, global and inter-procedural register allocation (Note that these are THREE different cases). First of all, the parameter passing registers (r2-r9) are really irrelevant to the caller vs. callee discussion, since they are effectively treated as "caller save" in ALL cases. It just so happens that OCS designates that these registers may be used to hold subroutine parameters during calling sequences. The callee is then free to use these registers as it sees fit without preserving the values/variables/temporaries (choose your favorite term) associated with them. Thus, combining these registers with registers r10-r13 provides a total of 12 "caller save" registers. The OCS also provided 12 "callee save" registers, r14-r25. The main point of debate seems to be the rationale behind the chosen division between caller and callee registers. [... deletion of some discussion ...] > >What if there are *never* any dead registers? Consider this. Given the >typical program (which uses at least 30 or more "values" throughout >its execution lifetime) it may be possible to use *all* available registers >for at least *some* productive purpose at *all* points throughout a program. > >Even without having global data-flow information, at the very least you could >decide to keep several simple global variables in registers, and the odds >are very good that you would get at least some performance increase as a >result of doing this. > >On the other hand, if you *do* have good global data flow analysis, then >there is no doubt whatsoever that you should be able to use all registers >for "live" values at all points throughout a program. > >Thus, you should always have registers filled with live values (in any case). > >Thus, you should always have a "pure" caller-saves convention. QED > This is where things get tricky... The key point to recognize is that regardless of compiler technology, there will likely always need to be a distinction between caller and callee saved registers. When global (but not inter-procedural) register allocation is performed, it is not possible to determine the register usage of called subroutines at the point of the call (i.e. caller cannot see callee). If all registers were designated as being in "caller save" mode, code would need to be inserted by the compiler to "save" the contents of all registers whose associated variables have a live range that spans the subroutine call EACH TIME such a call was made. (Note that this does not necessarily imply that ALL registers need to be saved... some registers may be "dead" at the point of the call). In cases where the callee needs only a few registers, such a calling convention would likely result in unnecessary load and store instructions which use up valuable instruction issuing and memory bandwidth. At the other extreme, if all registers were used in "callee save" mode (with the exception of the parameter passing registers), the callee would generate load and store instructions for every register that it uses, regardless of whether such overhead is warranted by the "live" use of the same register in the caller. Thus, some division is needed between caller and callee saved registers. Why did OCS choose 12/12 ? I wasn't involved in the decision making process, but the issue only becomes significant when the following criteria are met: 1) The callee requires more than 12 registers to perform adequate optimizations. 2) The additional "callee save" registers used actually did NOT need to be saved and restored because their live range did not span the subroutine call in the caller. How often does this occur? Empirically speaking, beats me. (I don't have any such data at my disposal at the moment). But your above argument that good overall optimization in a compiler will result in most registers holding live values across calls actually MINIMIZES the effect of having "too many" caller or callee saved registers. In fact, if the register being used in a callee needs to be saved SOMEWHERE (i.e. the caller has a live value in it across the call), it can be argued that "callee" saved registers are preferable, since the callee can insert the saves and restores only in those flow paths where the register contents are clobbered. In such cases, "caller save" conventions can result in unnecessary overhead. Now onward to inter-procedural analysis... (A good reference for this topic is "Minimizing Register Usage Penalty at Procedure Calls" by Fred Chow from SIGPLAN Notices, July 88) As described in the above paper, "callee save" registers can effectively be treated as "caller save" registers by performing a depth-first traversal of the program call graph. (i.e. by the time the caller is analyzed, it has all relevant register usage info about the callee). The callee can then defer saving and restoring to the caller, making the register appear as though it was operating in "caller save" mode. There are limitations to this approach, however. These limitations occur when the compiler does not have complete register usage information for a particular subroutine (libraries are a commonly cited example). Other examples include: 1) recursive calls 2) calls made through subroutine pointers 3) separate compilation In such cases, even "globally-optimizing, inter-procedural analyzing" compilers need to resort to a default convention, and the trade-offs of caller vs. callee once again become relevant. Thus, OCS does not prevent optimizing compilers from taking advantage of inter-procedural register allocation, but provides the "default" for those cases when such optimizations cannot be made. >>The current scheme is a fairly efficient compromise between the two, that will >>do in the absence of global register allocation. > >Ah, ha! So if I understand you correctly, you are saying that this parameter >passing convention was intentionally designed to be a "compromise" which >would not make stupid compilers look too bad (even at the expense of >decreasing performance in the presence of a good modern compiler with >data-flow analysis). Is that correct? > Yes, the inclusion of both caller and callee save registers is a compromise... always has been, and probably always will... I disagree, however, that the compromise is in place to mask compiler deficiencies or that it hinders the development of state-of-the-art optimizing compilers. (And yes, I know, the Moto compilers can use improvement :^) As pointed out above, the distinction is needed even when inter-procedural analysis is used for register allocation. When such compiler technology is implemented, the relative distribution of caller vs. callee registers becomes LESS critical, since the compiler can essentially "convert" callee-saved registers into caller-saved registers by "pushing" them up in the call graph. Geez, I figured most of the register allocation debates would involve r26-r29, which are the so-called "linker registers". ;^) Mike Phillip Motorola 88000 Development E-mail: oakhill!bushwood!phillip@cs.utexas.edu or cs.utexas.edu!oakhill!bushwood!phillip
mash@mips.COM (John Mashey) (11/21/89)
In article <2647@bushwood.oakhill.UUCP> oakhill!bushwood!phillip@cs.utexas.edu (Mike Phillip) writes: >In article <1989Nov16.212149.9770@paris.ics.uci.edu> Ron Guilmette <rfg@ics.uci.edu> writes: >> >>My original question was "Why should the caller save *any* of his live >>registers when he has absolutely no knowledge of whether or not *any* of >>them will be used (i.e. clobbered) in the called routine?" >Yes, the inclusion of both caller and callee save registers is a >compromise... always has been, and probably always will... > >I disagree, however, that the compromise is in place to mask compiler >deficiencies or that it hinders the development of state-of-the-art >optimizing compilers. (And yes, I know, the Moto compilers can use >improvement :^) In defense of Mr. Phillip, both HP PA and MIPS R3000s split the integer registers in approximately the same ratio as does the 88K, from looking at program statistics. Allocation behavior for integer programs, given similar compiler technology, should not be too different. (For floating-point, may not be true, as both HP PA and MIPS have a separateset of FP registers). -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
rfg@ics.uci.edu (Ron Guilmette) (11/21/89)
In article <2647@bushwood.oakhill.UUCP> oakhill!bushwood!phillip@cs.utexas.edu (Mike Phillip) writes: > >First of all, the parameter passing registers (r2-r9) are really >irrelevant to the caller vs. callee discussion, since they are >effectively treated as "caller save" in ALL cases. Right. The're just more "caller save" registers. >... The main point of debate seems to be the >rationale behind the chosen division between caller and callee registers. I didn't realize that I had started a "debate"! I though I was was just giving voice to self-obvious truths! :-) (Note simley face) >>What if there are *never* any dead registers? I *still* haven't seen a good answer to *this* question. >>... it may be possible to use *all* available registers >>for at least *some* productive purpose at *all* points throughout a program. I wish I had said that! (Oops... I did!) :-) > This is where things get tricky... The key point to recognize is that >regardless of compiler technology, there will likely always need to be a >distinction between caller and callee saved registers. Yes. The distinction is valuable. It is needed so that I can tell you what type of saving convention is harmful. "Caller saving consider harmful!" > If all registers were designated as being in "caller save" mode, code... >... some registers >may be "dead" at the point of the call). Perhaps you didn't read the whole of my previous posting. In any case, the important part (where I assert that its possible to have all registers live at all points) is given above. >In cases where the callee needs only a few registers, [a caller saves] > convention would likely result in unnecessary load and store >instructions... Right. Now if you can just see your way clear to generalize this Obvious Truth from one particular register to the set of *all* registers, then my point will have been proven. All I'm trying to say is that what is bad for one register (i.e. being saved/restored unnecessarily) is bad for *all* registers. > At the other extreme, if all registers were used in "callee save" >mode ... the callee >would generate load and store instructions for every register that it >uses, regardless of whether such overhead is warranted by the "live" use >of the same register in the caller. Right, but you are still begging the question and avoiding the issue. What if the saves & restores are *always* warranted by virtue of the fact that *all* registers contain live values at *all* points? In that case, there would be no unnecessary loads or stores in a strictly callee-saves convention. What I'm having trouble understanding is why various people (Mike P. included) seem to be unwilling to accept that this convention (used for decades on CICS machines) may actually be simply *the* best possible convention, regardless of whether or not you seem to have lots of registers to waste or whether or not you have a reduced instruction set. >Thus, some division is needed >between caller and callee saved registers. What you mean to say is that you need to have both kinds. You certainly have not justified that conclusion with any evidence that addresses my "all live, all the time" argument. I'm willing to be convinced, but I have not been yet. >Why did OCS choose 12/12 ? My question exactly. I believe that 0/31 would have been a better choice. (Note that I classify r0 as a callee-saves register because the caller may assume that it is is preserved across calls without doing anything ;-) Also, I give the ratio as 0/31 because I assume C language where a reault comes back (typically) in one register (by convention r2). >I wasn't involved in the decision making process, but the issue only >becomes significant when the following criteria are met: > 1) The callee requires more than 12 registers to perform adequate > optimizations. > 2) The additional "callee save" registers used actually did NOT need > to be saved and restored because their live range did not span the > subroutine call in the caller. I don't believe that #1 even enters into the issue at all. Regarding #2, you are really missing my point. What I was trying (in my own futile way) to say was that the issue is *always* significant bacause callers can always insure that virtually *all* of their registers contain "live" values at all call points. Thus, any registers clobbered by the callee would necessarily have to be saved and restored. >How often does this occur? Empirically speaking, beats me. Who cares? The question is "Has the OCS effectively (and permanently) crippled an otherwise impressive machine, by ignoring the possibility that newer and better compilers can keep more useful live values in more registers over more calls?" >(I don't have any such data at my disposal at the moment). Yep. >But your above >argument that good overall optimization in a compiler will result in most >registers holding live values across calls actually MINIMIZES the effect >of having "too many" caller or callee saved registers. Wrong. It minimizes the negative implications of the possibility that the callee might save/restore too much stuff in a "callee-saves" convention (because it simply never will). On the other hand, having *lots* of live values (because you have an intelligent optimizing compiler) definitely will show a "caller-saves" convention to be what it is, i.e. a rotten way of doing things, and strictly counterproductive. >In fact, if the >register being used in a callee needs to be saved SOMEWHERE (i.e. the >caller has a live value in it across the call), it can be argued that >"callee" saved registers are preferable, since the callee can insert the >saves and restores only in those flow paths where the register contents >are clobbered. In such cases, "caller save" conventions can result in >unnecessary overhead. Good. You have seen the light. Now just apply this Obvious Truth iteratively to the set of *all* registers. > >Now onward to inter-procedural analysis... > > (A good reference for this topic is "Minimizing Register Usage > Penalty at Procedure Calls" by Fred Chow from SIGPLAN Notices, July 88) > > As described in the above paper, "callee save" registers can >effectively be treated as "caller save" registers by performing a >depth-first traversal of the program call graph. (i.e. by the time the >caller is analyzed, it has all relevant register usage info about the >callee). The callee can then defer saving and restoring to the caller, >making the register appear as though it was operating in "caller save" >mode. There are limitations to this approach, however. You can say that again! >These >limitations occur when the compiler does not have complete register >usage information for a particular subroutine (libraries are a commonly >cited example). Or any sort of external routine! In other words this serious limitation will apply to *most* calls in any large program. A callee-saves convention has no such limitations or problems of this sort. >In such cases, even "globally-optimizing, inter-procedural analyzing" >compilers need to resort to a default convention, and the trade-offs of >caller vs. callee once again become relevant. Right. >Thus, OCS does not prevent >optimizing compilers from taking advantage of inter-procedural register >allocation, but provides the "default" for those cases when such >optimizations cannot be made. Unfortunately compiler writers only want to support one "convention". As it is, I fear that many such people are effectively being forced to blindly conform this inadequate and mediocre default convention for *all* calls so that they may be assured of being able to have the code they generated *interoperate* with "standard" libraries and/or with code produced by other compilers. >Yes, the inclusion of both caller and callee save registers is a >compromise... always has been, and probably always will... If it seemed as though it was a political compromise in which everybody got a part of what they wanted, then that would be be one thing. But this is obviously *not* a political compromise. It is technical compromise between excelent performance and lousy performance. Nobody won and everybody lost. >I disagree, however, that the compromise is in place to mask compiler >deficiencies If what you are saying is true (i.e. if this is strictly a reasonable technical compromise) then please tell me if 88open collected any empirical evidence about *large* programs at various possible "compromise points" along the scale from 0/12 to 12/0. If they did not, perhaps they would like to arrange to do so before the final draft of the OCS is cast into stone. I know that the GNU C compiler allows the user to vary the number of registers in each category, so the tests (benchmarks of the calling conventions we could call them) should be quite simple to perform and would yield absolutely clear cut evidence of major significance to all 88open members. If I'm wrong that "callee-saves" always produces better performance (with smaller executables to boot) then such tests could also have another significant benefit. They could shut me up, and I'm sure everybody would be in favor of that! :-) >Geez, I figured most of the register allocation debates would involve >r26-r29, which are the so-called "linker registers". ;^) Well get to that. Later, later... // rfg
rfg@ics.uci.edu (Ron Guilmette) (11/21/89)
In article <31783@winchester.mips.COM> mash@mips.COM (John Mashey) writes: >In article <2647@bushwood.oakhill.UUCP> oakhill!bushwood!phillip@cs.utexas.edu (Mike Phillip) writes: >>In article <1989Nov16.212149.9770@paris.ics.uci.edu> Ron Guilmette <rfg@ics.uci.edu> writes: > >In defense of Mr. Phillip... I didn't realize that anybody was attacking him! >both HP PA and MIPS R3000s split the integer >registers in approximately the same ratio as does the 88K... Like lemmings to the sea. Is everyone making the same mistake in the same way? Inquiring minds want to know! // rfg
edwardm@hpcuhc.HP.COM (Edward McClanahan) (11/22/89)
Ron Guilmette *debates*: > If I'm wrong that "callee-saves" always produces better performance > (with smaller executables to boot) then such tests could also have another > significant benefit. They could shut me up, and I'm sure everybody > would be in favor of that! :-) I offer a simple example... Caller uses 2 registers... Callee uses 30 registers... Using the Caller-saved convention, every time a call is made, 2 registers are *saved*. Using the Callee-saved convention, every time a call is made, 30 registers are *saved*. Similarly, consider the opposite situation (where the caller uses 30 registers and the callee uses only 2). In this case, it would be better to have used the callee-saved convention. Which of the above is more common? This is the very question that the OCS (and HP and MIPS and ...) had to consider. Your arguments about other *special conditions* (such as a callee only needing particular registers in a infrequently executed path, etc...) are all valid, but what is the frequency of these conditions? =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Edward McClanahan Hewlett Packard Company Mail Stop 47UE -or- edwardm%hpda@hplabs.hp.com 19447 Pruneridge Avenue Cupertino, CA 95014 Phone: (408)447-5651
phillip@oakhill.UUCP (Mike Phillip) (11/22/89)
In article <1989Nov20.205338.11760@paris.ics.uci.edu>, rfg@ics.uci.edu (Ron Guilmette) writes: > In article <2647@bushwood.oakhill.UUCP> oakhill!bushwood!phillip@cs.utexas.edu (Mike Phillip) writes: > > > > > This is where things get tricky... The key point to recognize is that > >regardless of compiler technology, there will likely always need to be a > >distinction between caller and callee saved registers. > > Yes. The distinction is valuable. It is needed so that I can tell you > what type of saving convention is harmful. "Caller saving consider > harmful!" [deletion of point/counterpoint discussion] > > > At the other extreme, if all registers were used in "callee save" > >mode ... the callee > >would generate load and store instructions for every register that it > >uses, regardless of whether such overhead is warranted by the "live" use > >of the same register in the caller. > > Right, but you are still begging the question and avoiding the issue. > What if the saves & restores are *always* warranted by virtue of the > fact that *all* registers contain live values at *all* points? In that > case, there would be no unnecessary loads or stores in a strictly > callee-saves convention. > > What I'm having trouble understanding is why various people (Mike P. > included) seem to be unwilling to accept that this convention (used > for decades on CICS machines) may actually be simply *the* best possible > convention, regardless of whether or not you seem to have lots of registers > to waste or whether or not you have a reduced instruction set. > You have asked for a situation where "caller save" registers actually provide better performance than "callee save" registers... Consider the following situations: 1) In a language such as C, where "arbitrary" indirection is supported, there are often cases where registers MUST be presumed dead across function calls because the values they contain are "pointed to" by other variables. For example, in the following code... int x; int y; extern void foo(); . . . /* Assume at this point that x and y have been placed in registers */ y = x; foo (&x, &y); if (x == y ) { x = x + y; } else { x = x - y; } Effect on the caller: In such a case, the registers containing x and y are effectively treated as "caller save" by NECESSITY (assuming that foo is an external subroutine) because x and y must be presumed "clobbered" across the call. The contents of the registers holding the value of x and y MUST be stored back into the "home" memory locations for x and y before the call to foo() because the addresses of x and y are passed as parameters. Effect on the callee: If foo() happens to be a leaf routine (which roughly half of all dynamically measured subroutine calls are...), there is NO chance that it will call another subroutine. Thus, the choice of registers is independent of subsequent nested calling sequences. If, as in your scheme, all registers are treated as "callee save", spill code must be generated for EACH register that is used in the callee. In a code sequence like that shown above, it is POSSIBLE that some of the registers selected for use by foo() have already been freed by the caller, thus making the callee spill code redundant. In other words, if the compiler had a convention for prioritizing register usage based on a GUARANTEE that a certain set of registers were available without the need for spill code... more efficient code may result. 2) OCS was not written just for C programs. In Fortran code, anything declared in a COMMON block must be presumed dead across calls. Fortran also handles arrays differently than C, which affects calling sequences... Thus, there are LOTS of situations where "spill" code must be inserted by the compiler before a subroutine call. If the registers being "spilled" are "caller save" then this overhead is not duplicated in the callee... i.e. a good compiler may place aliased (or global) variable contents in "caller save" registers. Key point ====> I'm NOT advocating either extreme of the CALLER vs. CALLEE spectrum. I'm simply trying to point out why there may be some merit in a caller/callee split. I would agree that in a "perfect" programming world where all subroutines are visible to the compiler that the distinction becomes unnecessary. Unfortunately, as you pointed out, this is NOT the case, and I don't feel that the issue is as clear cut as you state it... Further discussion of this probably belongs in comp.compilers or comp.arch. --------------------- Mike Phillip Motorola 88000 Development
tom@ssd.harris.com (Tom Horsley) (11/22/89)
I can't resist entering this discussion, endless and pointless arguments have always appealed to me :-). I can point out one obvious flaw in the idea that the called routine should do the register saves. We have several benchmarks as well as several real programs (as opposed to the typical benchmark :-) in which it is possible to examine the code and see that the current conventions produce fantastic code. This occurs (quite frequently, I might add) when the outer routine contains loops, and the loops contain subroutine calls. Very often (because 12 registers is really an awful lot of registers) the leaf routine does not need to save any registers at all. (For instance, this is true of quite a lot of the low-level str and mem routines in the C library). This means that the subroutine (which, if you will recall, is called in a loop) runs much faster than it would if it was saving every register it used. Quite often, the outer routine is also not complex enough to require the use of all the registers (r14-r26 is still a lot more registers than I typically have available on a CISC machine). All this boils down to the fact that the current convention may not be perfect, but it is, in practice, pretty nice. This is real code I am talking about, which we have really spent some time examining for real quality issues, not some hypothetical situation in which some magic compiler has figured out how to keep something live in every register all the time. We have a good optimizing compiler with a good register allocator that can do neat tricks like make separate lifetime regions for globals around loops so they can be in memory in regions where there are aliased references and in registers in areas where there are not aliased references, and it turns out that more often than you might expect, there are already enough registers. As a practical matter having more registers would help sometimes, but not as often as you might think. This is all still operating without interprocedural analysis. In any system where each routine is compiled in a vacuum, some calling convention is necessary, and I don't have any difficulty making my highly optimizing compiler with its fancy register allocator deal with any convention. If you go into the domain of a global program database with interprocedural register allocation, you no longer need to be bound by OCS rules, you can make any rules you want because you can define your own interface for each routine to minimize global cost (don't ask me how to actually do this, mind you :-). Also, doing interprocedural register allocation at link time is not necessarily restricted because you have to link in external libraries. The 88k architecture is really quite simple to disassemble and analyze. It is not totally beyond the realm of possibility that the linker could even modify register usage for things like libc.a, even without special support from a global program database. (Essentially, there would be a database, it would just be in a format that is kind of hard to interpret). -- ===================================================================== domain: tahorsley@ssd.csd.harris.com USMail: Tom Horsley uucp: ...!novavax!hcx1!tahorsley 511 Kingbird Circle or ...!uunet!hcx1!tahorsley Delray Beach, FL 33444 ======================== Aging: Just say no! ========================
rhg@cpsolv.UUCP (Richard H. Gumpertz) (11/25/89)
THESIS 1: The caller knows what registers are live; the callee does not. Hence the caller should decide what needs saving. THESIS 2: The callee knows what registers he clobbers; the caller does not. Hence the callee should decide what needs saving. In other words, if the caller and callee are presumed to be compiled separately, neither has enough information to do the "right" thing with respect to register-saving. By splitting the registers in two classes, "leaf" (i.e., callees that never call anything) procedures can try to use just the clobberable registers and so avoid any register saving. Similarly, "root" (i.e., callers that are rarely called) can try to use just the preserved registers and so avoid saving anything around calls. Procedures that both do lots of calling and are called frequently are in a middle ground: that is where the compiler will have to work to figure out what goes in which register. Certainly not an easy task, but it can be done. Here is where the compiler writer should earn his keep. By the way, note that a smart compiler might even save registers around a block or loop of calls instead of around each call. Still, in many cases it will just be a trade-off between saving on entry and saving before calling out. If both are done with equal AVERAGE frequency (i.e. the procedure makes one call out, average, per time that it is called), it doesn't matter to execution speed which is used. (Code size may be effected, but in RISC architectures the decision has already been made to increase speed at the cost of ignoring code size anyway.) Clearly the OVERALL average across ALL procedures MUST be one call out per call in. The question is, what does it look like for a particular procedures? What is the distribution of called/call ratios? I personally think some sort of split of register saves is reasonable. Only measurement can tell whether 50-50 is appropriate for a particular level of compiler intelligence and program mix. Still, it's as good a place to start as any. -- =============================================================================== | Richard H. Gumpertz rhg%cpsolv@uunet.uu.NET -or- ...uunet!amgraf!cpsolv!rhg | | Computer Problem Solving, 8905 Mohawk Lane, Leawood, Kansas 66206-1749 | ===============================================================================
edwardm@hpcuhc.HP.COM (Edward McClanahan) (11/28/89)
Tom Horsley touches on an interesting side issue: > I can point out one obvious flaw in the idea that the called routine should > do the register saves. We have several benchmarks as well as several real > programs (as opposed to the typical benchmark :-) in which it is possible > to examine the code and see that the current conventions produce fantastic > code. This occurs (quite frequently, I might add) when the outer routine > contains loops, and the loops contain subroutine calls. Very often (because > 12 registers is really an awful lot of registers) the leaf routine does > not need to save any registers at all. (For instance, this is true of quite > a lot of the low-level str and mem routines in the C library). ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ As I recall, the VAX has TWO calling conventions: 1 - CALLS and CALLG explicitly require an "entry mask" in the called procedure indicating which registers to push on the stack. This mask is 16 bits, one bit per register, although you never save certain registers (e.g. R0, PC, etc...). The microcode interpreting the CALLx instruction actually does the PUSHes for you. 2 - JSR and BSR simply push the return PC on the stack and jump to the called procedure. The callee must then "protect" any registers it uses. Both of these schemes implement the callee-saved model, but the JSR/BSR is faster for "low-level...routines". In HP's RISC architecture, we have both callee and caller saved registers (as is the case in the m88k standard). Still, for some "low-level stuff", we needed a faster way to call a function. We implemented "millicode" to fill this gap. The caller simply doesn't have to save any caller-saved registers it happens to be using at the time. Similarly, the optimizer doesn't have to halt optimization across the millicode call. Those "low-level str and mem routines" are implemented in millicode. Is there any provision for this calling convention in the m88k standard? =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Edward McClanahan Hewlett Packard Company Mail Stop 47UE -or- edwardm%hpda@hplabs.hp.com 19447 Pruneridge Avenue Cupertino, CA 95014 Phone: (408)447-5651
thomas@uplog.se (Thomas Hameenaho) (11/28/89)
I have followed the debate about caller vs. callee saves registers with great interest. I have an idea that perhaps could be a bit of both worlds: What about having the caller supply a mask of live registers to the callee, ie. registers that must not be clobbered to the callee? This mask should be ANDed with a mask of the registers that the callee clobbers and handed to the equivalent of MOVEM (68k) or CALLx (VAX). The save/restore should of course be handled by microcode/hardware to make it fast. This way there would never be unnecessary saves. -- Real life: Thomas Hameenaho Email: thomas@uplog.se Snail mail: TeleLOGIC Uppsala AB Phone: +46 18 189406 Box 1218 Fax: +46 18 132039 S - 751 42 Uppsala, Sweden
meissner@dg-rtp.dg.com (Michael Meissner) (11/29/89)
In article <2647@bushwood.oakhill.UUCP> phillip@oakhill.UUCP (Mike Phillip) writes: ... | effectively be treated as "caller save" registers by performing a | depth-first traversal of the program call graph. (i.e. by the time the | caller is analyzed, it has all relevant register usage info about the | callee). The callee can then defer saving and restoring to the caller, | making the register appear as though it was operating in "caller save" | mode. There are limitations to this approach, however. These | limitations occur when the compiler does not have complete register | usage information for a particular subroutine (libraries are a commonly | cited example). Other examples include: | 1) recursive calls | 2) calls made through subroutine pointers | 3) separate compilation | | In such cases, even "globally-optimizing, inter-procedural analyzing" | compilers need to resort to a default convention, and the trade-offs of | caller vs. callee once again become relevant. Thus, OCS does not prevent | optimizing compilers from taking advantage of inter-procedural register | allocation, but provides the "default" for those cases when such | optimizations cannot be made. Separate compilation can be handled by having the compiler generate intermediate code instead of object code or some such, and having the linker generate the final code. Tail-recursive calls can be changed by the compiler into a loop or some such. Calls made through pointers or to different langauges are another problem. Calls made through pointers are somewhat rare in conventional C (except for things like EMACS and such), but they become much more mainline in object oriented languages like C++ (virtual functions). Dynamic shared libraries are another real big problem, where the function being called may not even exist when the function is called (yes I know the BCS/OCS currently do not support shared libraries directly today, but you can roll your own with memctl, and they are coming). IMHO, most users are not willing to pay the price of getting every last ounce of performance that doing global register allocation for the entire program would entail. Yes, there are some things (like the kernel itself, or X11 libraries) that it would be worth it to get every last drop of performance. I know some vendors and companies who never use the normal compiler optimizations, since they perceive that the performance is adequate (or the compilers are inadequate). -- Michael Meissner, Data General. Until 12/15: meissner@dg-rtp.DG.COM After 12/15: meissner@osf.org
pardo@cs.washington.edu (David Keppel) (11/29/89)
thomas@uplog.se (Thomas Hameenaho) writes: >[Idea: caller register mask, callee register mask, save only > registers indicated by AND of masks.] There is a potential problem. It takes extra code to decide whether or not to save certain registers. The time you spend executing that code could be spent naively saving registers and you'd come out about the same. On the VAX, where it already has save-by-mask microcode, the tme penalty might be OK. On the RISC-family processors, you'll probably lose a lot. Is it worth adding save-by-mask hardware? Probaly not. The last time this came around I saved a bunch of articles. You can get them via anonymous ftp: ftp june.cs.washington.edu (128.95.1.4) login anonymous passwd username cd tmp/pardo get proc.call.txt About 750 lines. ;-D on ( USENET by RPC ) Pardo -- pardo@cs.washington.edu {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo
ge@kunivv1.sci.kun.nl (Ge' Weijers) (11/29/89)
thomas@uplog.se (Thomas Hameenaho) writes: >I have an idea that perhaps could be a bit of both worlds: What about having >the caller supply a mask of live registers to the callee, ie. registers that >must not be clobbered to the callee? This mask should be ANDed with a mask of >the registers that the callee clobbers and handed to the equivalent of MOVEM >(68k) or CALLx (VAX). The save/restore should of course be handled by >microcode/hardware to make it fast. >This way there would never be unnecessary saves. Suppose I write a main program that uses all registers. Then your scheme degenerates to a callee-saves scheme. Especially leaf routines do not benefit from your scheme. Even worse, the performance of a program might depend critically on the register number chosen by the optimiser. This makes predicting performance quite hard. And every procedure call gets extra overhead. The VAX CALLS/G instruction does so much often unnecessary work that not using it is a viable option. A compiler I used to work on was faster than PCC only for this reason. (It used a caller saves convention BTW, and register variables were unknown) The same is true for the compiler I'm currently working on. In functional programming languages the procedure call is so common that the performance depends quite critically on the calling convention. I use a pure caller-saves convention, with parameter passing partially in registers. For things like fibonacci numbers and other recursive programs this runs rings around C, Fortran and what have you. (And don't mention the SPARC, because functional programs really exercise register windows to death :-). I think a functional language implementation is better off not using register windows at all, except if global analysis is done to keep the processor from saving and restoring dead register all the time. This is not trivial in recursive cases. Give me a store-multiple-register please?). I suspect the half-half (or perhaps 60-40) convention is one of the best when you don't have global knowledge. If you have global knowledge then you can implement parameter passing using registers, local variables and global variables, whatever suits the problem. Ge' Weijers Ge' Weijers Internet/UUCP: ge@cs.kun.nl Faculty of Mathematics and Computer Science, (uunet.uu.net!cs.kun.nl!ge) University of Nijmegen, Toernooiveld 1 6525 ED Nijmegen, the Netherlands tel. +3180612483 (UTC-2)
meissner@dg-rtp.dg.com (Michael Meissner) (11/30/89)
In article <100050002@hpcuhc.HP.COM> edwardm@hpcuhc.HP.COM (Edward McClanahan) writes: | Both of these schemes implement the callee-saved model, but the JSR/BSR | is faster for "low-level...routines". | | In HP's RISC architecture, we have both callee and caller saved registers | (as is the case in the m88k standard). Still, for some "low-level stuff", | we needed a faster way to call a function. We implemented "millicode" to | fill this gap. The caller simply doesn't have to save any caller-saved | registers it happens to be using at the time. Similarly, the optimizer | doesn't have to halt optimization across the millicode call. Those | "low-level str and mem routines" are implemented in millicode. Is there | any provision for this calling convention in the m88k standard? The OCS doesn't directly support millicode, and I'm not aware of any compilers that use internal calling sequences. However for optimizing the str* and mem* stuff, you really need 10 or more registers (I think our memcpy uses more than the 12 non-preserved registers to fully pipeline the loads and such). -- Michael Meissner, Data General. Until 12/15: meissner@dg-rtp.DG.COM After 12/15: meissner@osf.org
meissner@dg-rtp.dg.com (Michael Meissner) (11/30/89)
In article <THOMAS.89Nov28100512@uplog.uplog.se> thomas@uplog.se (Thomas Hameenaho) writes: | I have followed the debate about caller vs. callee saves registers with | great interest. | | I have an idea that perhaps could be a bit of both worlds: What about having | the caller supply a mask of live registers to the callee, ie. registers that | must not be clobbered to the callee? This mask should be ANDed with a mask of | the registers that the callee clobbers and handed to the equivalent of MOVEM | (68k) or CALLx (VAX). The save/restore should of course be handled by | microcode/hardware to make it fast. | This way there would never be unnecessary saves. Given that the machine is a RISC machine, it only has simple loads and stores. In order to implement the above scheme, some sort of load and/or conditional branches would have to be done, which would be an expensive operation. In general, the time taken to do the prologue stores is 1 cycle per register, and 1 cycle for the epilogue loads (yes I know that loads take 3 cycles if coming from the cache, but in most epilogues, you will have a lot of loads in a row, whose value is not needed immediately, so the loads will be pipelined). Assuming the procedure in question is not a monster procedure, and it does not invoke other large procedures, it is likely that the values being saved will still be in the cache when it comes time to load them again. -- Michael Meissner, Data General. Until 12/15: meissner@dg-rtp.DG.COM After 12/15: meissner@osf.org
tim@binky.sybase.com (Tim Wood) (11/30/89)
In article <9972@june.cs.washington.edu> pardo@june.cs.washington.edu (David Keppel) writes: >thomas@uplog.se (Thomas Hameenaho) writes: >>[Idea: caller register mask, callee register mask, save only >> registers indicated by AND of masks.] > >There is a potential problem. It takes extra code to decide whether >or not to save certain registers. The time you spend executing that >code could be spent naively saving registers and you'd come out about >the same. Hmm, seems like it would all be done at compile time. For every procedure call you compile, note which registers are read or written in the calling routine (or, if "-O" is set, try to determine which ones are not live across the call :-). Then, set the bits in the save mask. Compile in the mask value in your "call-with-mask" instruction sequence (hey, what a great idea for a single instruction! :-) Each compiled procedure would have its own mask, a la VAX. > >On the VAX, where it already has save-by-mask microcode, the tme >penalty might be OK. Except, VAX is oriented toward callee-save. Subroutines must have a 2-byte register-save mask up front in order to be called by the "call-subroutine" instructions. >On the RISC-family processors, you'll probably >lose a lot. Is it worth adding save-by-mask hardware? Probaly not. Please substantiate these assertions. > pardo@cs.washington.edu Sybase, Inc. / 6475 Christie Ave. / Emeryville, CA / 94608 415-596-3500 tim@sybase.com {pacbell,pyramid,sun,{uunet,ucbvax}!mtxinu}!sybase!tim This message is solely my personal opinion. It is not a representation of Sybase, Inc. OK.
dgb@cs.washington.edu (David Bradlee) (12/01/89)
In article <7263@sybase.sybase.com>, tim@binky.sybase.com (Tim Wood) writes: > Hmm, seems like it would all be done at compile time. For every > procedure call you compile, note which registers are read or written in > the calling routine (or, if "-O" is set, try to determine which ones are not > live across the call :-). Then, set the bits in the save mask. > Compile in the mask value in your "call-with-mask" instruction sequence > (hey, what a great idea for a single instruction! :-) > Each compiled procedure would have its own mask, a la VAX. > You seem to imply that the mask is a function of the CALLER, meaning the callee has to examine the mask at runtime to save the appropriate registers, which, as pardo indicated, would take extra code. If you knew all the callsites, the callee could save the union, but if you know all the callsites, you can explicitly save the approriate registers in either the callee or at the callsite, without any mask. Ergo, the mask doesn't help you. -- Dave Bradlee, University of Washington
pardo@cs.washington.edu (David Keppel) (12/01/89)
> pardo@june.cs.washington.edu (David Keppel) writes: >>On the RISC-family processors, you'll probably >>lose a lot. Is it worth adding save-by-mask hardware? Probaly not. In article <7263@sybase.sybase.com> tim@binky.UUCP (Tim Wood) writes: >Please subtantiate these assertions. A couple other people have already; allow me to summarize. A RISC has [is defined as having] a load/store architecture, one instruction issued per clock cycle. Save-by-mask in hardware is non-RISC. Implementing save-by-mask in software requires several instructions to be executed for each (potential) save. Between the blowup in code size and the extra instructions, you're probably better off blindly saving n registers. Remember that conditionally saved registers must also be conditionally restored at function return. A (long) example Suppose callee saves 8, caller saves 8. The compiler will leave caller saves registers empty (don't need saving) aroudn function calls. The caller could thus needlessly save up to 8 registers. If the callee needs registers it will try to use registers that were saved by the callee. If that fails it will save up to 8 registers on its own, but those registers might be dead in the caller, so up to 8 registers could be saved needlessly. The wasted caller-saves and wasted callee-saves can never overlap. At most 8 registers are saved needlessly. If load/store is pipelined, then the worst-case code looks like, say add sp 32 sp store r1 0[sp] store r2 4[sp] store r3 8[sp] store r4 12[sp] store r5 16[sp] store r6 20[sp] store r7 24[sp] store r8 28[sp] call nop load 28[sp] r8 ... subl sp 32 sp That's 19 clock ticks wasted. The symmetric case for callee saves similarly wastes 19 clock ticks. Now consider save-by-mask [r1 to r16 can are saved by mask.] [mask passed in rN, shadow is rM.] [Destination is rightmost operand.] [No branch delay slots.] function: jmp !mask done r1: and mask 1 shadow jmp !shadow r2 save r1 r2: and mask 2 shadow jmp !shadow r2 save r2 jmp !mask done r3: ... r16: and mask 2*^15 shadow jmp !shadow done save r16 done: (Note: you may well be able to do better than this. This is an example of one sequence for one hypothetical machine.) The worst case is 16*3 = 48 instructions, but if jumps introduce more than one bubble, then it could blow up to about 16*4 = 64 instructions. Let's go with 48. On average you should have to save about half the registers (a guess). A clever optimzier will put registers in low-number registers, so that you don't generally have to go all the way to r16 in order to save half the registers. So on average you'll hit about 24 instructions in the prolog and about 24 instructions in the epilog, or about 48 instructions. Of course the register saves are highly machine- and application-dependent. Suppose that generally you have to save only 2 registers and that they are r1 and r3. Then you'll execute 9 instructions on entry and 9 instructions on exit and you're only doing 1 save better than the *worst* case of the naive caller-saves-some and callee-saves-some! Of course this all needs to be parameterized on a machine-by-machine basis, and also on an application-by-application basis. * Average number of registers to save. * Average number of saves that are wasted. * Cost to blindly do split caller/callee saves * Cost to do save-by-mask. If you really care, you should start by reading the stuff availabile via anonymous ftp from june.cs.washington.edu (128.95.1.4) in tmp/pardo/proc.call.txt ;-D on ( Lies, damn lies, and USENET ) Pardo -- pardo@cs.washington.edu {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo
tim@binky.sybase.com (Tim Wood) (12/02/89)
In article <10007@june.cs.washington.edu> pardo@june.cs.washington.edu (David Keppel) writes: >In article <7263@sybase.sybase.com> tim@binky.UUCP (Tim Wood) writes: >>Please subtantiate these assertions [about save-by-mask hardware]. > >A couple other people have already; allow me to summarize.... And a very worthwhile summary it is. Thanks for the effort, it's nice to have it in one place. >If you really care, you should start by reading the stuff availabile >via anonymous ftp @ june.cs.washington.edu (128.95.1.4) tmp/pardo/proc.call.txt > pardo@cs.washington.edu I'll keep this reference for when we get Internet access. Thanks. -TW Sybase, Inc. / 6475 Christie Ave. / Emeryville, CA / 94608 415-596-3500 tim@sybase.com {pacbell,pyramid,sun,{uunet,ucbvax}!mtxinu}!sybase!tim This message is solely my personal opinion. It is not a representation of Sybase, Inc. OK.
rhg@cpsolv.UUCP (Richard H. Gumpertz) (12/05/89)
In article <THOMAS.89Nov28100512@uplog.uplog.se> thomas@uplog.se (Thomas Hameenaho) writes: >I have an idea that perhaps could be a bit of both worlds: What about having >the caller supply a mask of live registers to the callee, ie. registers that >must not be clobbered to the callee? This mask should be ANDed with a mask of >the registers that the callee clobbers and handed to the equivalent of MOVEM >(68k) or CALLx (VAX). The save/restore should of course be handled by >microcode/hardware to make it fast. >This way there would never be unnecessary saves. Gee, what a great idea! We could add microcode to our RISC machines and turn them into CISC machines. Moby sigh. :-) -- =============================================================================== | Richard H. Gumpertz rhg%cpsolv@uunet.uu.NET -or- ...uunet!amgraf!cpsolv!rhg | | Computer Problem Solving, 8905 Mohawk Lane, Leawood, Kansas 66206-1749 | ===============================================================================