firth@sei.cmu.edu (Robert Firth) (11/03/88)
An Example Procedure Protocol
-----------------------------

Readers of this newsgroup know that I'm rarely reticent about
criticising other people's hardware and software. Here is your chance
to do me the same service.

The following is the procedure call, entry and exit protocol I devised
to support traditional "Algol-like" languages on the MIPS R2000
hardware. These languages include Algol-60, BCPL, ISO Pascal, and
Modula-2. From the point of view of the procedural interface, the main
features are

. procedures can recurse
. all local objects and function results have static size
. there are procedure variables and parametric procedures
. procedures can access non-local variables
. there are constructs that jump out of procedures

Now, not all the above languages have all these features. So we must
be prepared to implement all of them, but if possible optimise them
away selectively. I give first the "canonical" form, and then discuss
how much can be selectively removed.

Local variables are allocated from a stack. The register RP is the
stack pointer (actual register numbers don't matter), and at any time
it designates the start of the stack frame of the current procedure.
The stack grows UPWARDS, and it is an open stack, ie there is no stack
front pointer and the CALLER moves RP so that it will be correct on
entry to the callee.

Parameters are passed in U0..U7, by value or reference as appropriate,
up to a maximum of 8 words. After that, they are stored on the stack
in the place where the callee expects to find them. Function results,
if small, are returned in <U0,U1>; if large, they are returned in an
object allocated by the caller and passed as an extra VAR parameter.

Non-local objects are accessed by following a static chain, the head
of which is passed by the caller. A procedure value in general
consists of a pair (code address, static link), and that is how a
parametric procedure is passed. A longjump is implemented by resetting
RP and then taking the jump.
The actual code of the BCPL "longjump" routine is

	move	RP,U0	; reset RP from first parameter
	j	U1	; jump to address passed as second parameter

and this is generated inline for things like Pascal GOTO.

Call Sequence
-------------

	; before you get here, save all live registers
	lw	U0, first_parameter
	lw	U1, second_parameter
	...
(A1)	lw	RL, callee_address
(A2)	lw	RE, static_link
(A3)	addi	RP, frame_size	; raise stack pointer
(A4)	jal	RL		; call procedure
(A5)	nop
(A6)	subi	RP, frame_size	; lower stack pointer
	; expect result in U0

Entry Sequence
--------------

(B1)	sw	RL,0(RP)	; save return link
(B2)	sw	RE,4(RP)	; save static link
	sw	U0,8(RP)	; save parameter 1
	...

Return Sequence
---------------

	; get result into U0
(C1)	lw	RL,0(RP)	; restore return link
(C2)	nop
(C3)	j	RL		; jump
(C4)	nop

The cost of a basic procedure call, entry and exit is therefore 12
cycles, ignoring any parameters or result. Not bad, when you reflect
that the Intel 80960 takes 15 cycles to do a good deal less (an unfair
comparison, of course, since that machine is handicapped by having
special call and return instructions).

Let us now optimise this code.

1. How often do we need to compute the callee address? Only if it is a
procedure variable or a parametric procedure. In the code I've seen,
that is about 30% of the time, so deduct [0.7 * A1].

2. How often do we have to pass a static link? Never in BCPL or
Modula-2, of course. But otherwise, only if the callee is a parametric
procedure or an explicit inner procedure. That happens less than 25%
of the time, so we can deduct [0.75 * A2]. (Happily, procedure
variables never need a static link in any of the above languages that
allow them.)

3. Similarly, how often do we have to STORE a static link that has
been passed us by the caller? Only if we are, indeed, an inner
procedure. That happens hardly ever - say 5% of the time - so deduct
[0.95 * B2]. (You must pass the link to a parametric procedure, since
you don't know whether the actual is an inner.
But the actual rarely needs to store the link, since it DOES know
whether it's inner.)

4. Now, how often do we have to raise and lower the stack? Not once
per call, clearly, because if we have two calls one after the other
then the SUBI after the first cancels the ADDI before the second. To
cut a long story short, we need ~45 moves in 100 calls, so deduct 1.10
instructions [0.55 * (A3+A6)].

5. What about that NOP after the call - how often can we fill that? In
general, almost always. The stack raise can go there (it has no delay,
and so RP will be valid at point B1). The load of a parameter can go
there, provided the callee is aware of the fact. Or an extraneous
instruction can go there. That adds up to over 97% of the time, so I'm
simply going to deduct A5 completely.

6. How often do we have to save and restore the return address?
Whenever we are a non-leaf procedure. Dynamically, about 25% of calls
are calls of leaf procedures, so deduct 0.5 [0.25 * (B1+C1)].

7. Finally, what about the two NOPs C2 and C4? It is almost always
possible to fill the first, for instance by the load of the result or
the other last instruction of the routine. The second is much harder:
I do not allow the load of the result to go there, since then U0 would
be invalid immediately upon return. The actual figure is under 10%, so
I shall be pessimistic here and say that exactly one of the two NOP
instructions can be removed.

Adding that all up, I find that, of a canonical 12 cycles, near enough
half can be removed by straightforward local optimisation. The result
is a protocol able to accommodate the languages given above, at an
average cost of SIX cycles per procedure call.

I'll offer two reasons for this result:

(a) a very clean, simple, and efficient basic machine design (thanks,
    MIPS)

(b) a protocol that has taken care to partition the work so that each
    piece is done by the party most able to optimise it.

This second reason is why I believe a special hardware instruction is
a bad idea.
Such an instruction must almost always implement almost all of the
protocol. But if half the protocol can be optimised away, then the
instruction, to win, must execute more than twice as fast as the
equivalent naive instructions. I don't believe that can be done.

Thank you for your attention.
mcg@omepd (Steven McGeady) (11/16/88)
In article <7583@aw.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
>			An Example Procedure Protocol
>			-----------------------------
>
>Readers of this newsgroup know that I'm rarely reticent about
>criticising other people's hardware and software. Here is your
>chance to do me the same service.

... [example of proposed MIPS R2000 linkage code] ...

>The cost of a basic procedure call, entry and exit is therefore 12 cycles,
>ignoring any parameters or result. Not bad, when you reflect that the
>Intel 80960 takes 15 cycles to do a good deal less (an unfair comparison,
>of course, since that machine is handicapped by having special call and
>return instructions).

In general I ignore Mr. Firth's gratuitous slams on the 960 (why does
he have it out for just our processor? Did Intel decline to offer him
a job at some point in the past?) ... but I feel a need to set the
record straight.

The 960 does a good deal *more* with its call and return instructions
than Mr. Firth has noted, not a good deal less. Further, the 960
architecture (as opposed to a single implementation) benefits from its
call and return instructions, since they (as I will show) allow the
processor to scale its performance to different implementation
techniques.

I'll start by saying that Mr. Firth is being generous. The current
implementation of the 960 has a call instruction that takes 9 cycles,
and a return that takes 7, for a total of 16, rather than 15 cycles.
His generosity stops at this point. During the 16 cycles of a
call/ret, the following operations take place:

1) the stack is adjusted for saving procedure-local registers, the
   frame pointer is adjusted, and the previous value of the frame
   pointer is saved
2) procedure-local registers are saved in the local register cache
3) the return instruction pointer is loaded with the address of the
   next instruction in the caller

"Well!" I hear you cry. "What if I don't want to do all of that?"
There is an easy answer. Mr.
Firth fails to mention the 'bal' (branch and link) instruction in the
960, the exact analog of the MIPS 'jal' instruction, with the
exception that the 960 does not have a delay slot. The bal instruction
executes in 2 cycles if the target is in the instruction cache (which,
incidentally, is the same restriction that MIPS has, though Mr. Firth
does not mention it). Since you also have 32 registers on the 960, one
can construct the same calling sequence on the 960 that Mr. Firth
constructs on the R2000. This is wise for leaf procedures, and we do
it.

It is generally unwise for non-leaf procedures: Mr. Firth wisely
neglects to count in his 12 cycle cost the cost of saving
procedure-local registers to the stack. Typical C programs use between
4 and 12 procedure-local variables (variables local to the procedure
which are not referenced by enclosed scopes and thus can be allocated
permanently to registers). This adds between 4 and 12 additional
memory references at call and return in either the calling site or the
called site (depending on whether you are using caller or callee save
or some combination thereof). If you (charitably) consider the average
to be 6, and (charitably) assume that Mr. Firth is running from
zero-wait-state cache and that he gets no cache misses, and
(charitably) assume that he can fill the delay slots on three of these
stores, you are left with an additional 18 clocks, making Mr. Firth's
"12 cycle" (unoptimized) call/return into a 30 cycle call/return.

Now, lest I be unfair, I will point out that the 80960 has a penalty
when the register cache spills to memory (24 clocks). However, since
procedures that oscillate within a relatively small call depth *never*
spill to memory, the effect is taken in only a fraction of all
routines.
Patterson has done some research to this effect, and our research
shows that 4 sets of 16 registers (non-overlapped), as currently
implemented in the 80960K*, handle about 80-85% of calls and returns
(8 sets would handle about 98%). So let's add 8 cycles to our average
80960 call/return time, making it 24 cycles - really not all that much
different from MIPS.

Now let us set about optimizing each of these. Dave Wall (see
SigPlan'88 Compiler Construction Conference Proceedings) would argue
that link-time allocation of a large (~58) global register set would
minimize the register spill overhead of a MIPS-style machine (though
the differences between 32 and 58 registers are not clear). However,
MIPS (see also SigPlan'88) does not use this technique, instead
"shrink-wrapping" register spills and pushing them up (and down) the
call tree. I don't have the papers handy, so I can't reproduce actual
data. Nevertheless, with these techniques and those Mr. Firth
outlines, you can further reduce this time, *if you have an extremely
sophisticated compiler*.

The 80960 gets comparable performance without this sophistication. The
80960 linker employs a very simple technique of promoting calls to
leaf procedures from the 'call' instructions the compiler outputs to
'bal' instructions: the called site declares that it is a leaf
procedure (with the '.leafproc' pseudo-op) and provides both leaf and
non-leaf entry prologues. The linker (at link time and across modules)
replaces call instructions with bal instructions to the same address.
If strange code somewhere has taken the address of the procedure and
calls it indirectly, it enters through the standard prologue.

I charge that, in general, and factoring out compiler and tool
differences, the difference between the MIPS calling sequence and the
current implementation of the 960 calling sequence is minimal.
This is supported by David Wall's paper (though I should point out
that Wall does not like register-window schemes, though I don't think
he has looked at the 960's non-overlapping register cache scheme). A
globally-optimizing compiler for the 960 would instead concentrate on
procedure inlining and more advanced call-tree depth-reduction
techniques.

So, if MIPS did just as well with no call/return, why did we implement
it? There are a number of reasons not touched above:

1) code density - the 960 call/return instructions typically occupy
   one word each (two for call if there is more than a 21-bit
   displacement involved); as Mr. Firth admirably showed, many
   architectures use 6-12 words for this. In call-intensive programs
   this adds up quickly (960 code density is about 15-25% worse than a
   VAX; MIPS code density is reported to be 40-50% worse for
   comparable compiler technology).

2) increased parallelism - stack-cache spills in the 960 can be
   overlapped with subsequent (target) instruction execution.
   Furthermore, the 960 call/return instructions in the current
   implementation are implemented with a minimum of hardware to keep
   chip cost low. These instructions (like all 960 "microcode" flows)
   are really on-chip macros of (mostly) normal instructions. Future
   implementations, however, will be able to exploit additional
   hardware to dramatically increase the speed of these operations.
   Ultimately, we expect call/ret to be a 2 clock operation (rendering
   the bal optimization obsolete). The MIPS linkage sequence will
   always be frozen as N separate instructions, hence N clocks (pick
   your favourite N in the range 6-30).

3) provision for a range of implementations - it may seem silly to
   some people to build less than the fastest possible chip, but to
   those people I suggest that they attempt to build a MIPS system on
   a 64 sq.in. board with 100ns DRAMs and see how fast it goes.
The 960 is architected so that we can provide processors with simple,
low-cost busses that work with slow or moderate-speed memory systems.
The speed of the register-cache spill is directly a function of the
bus architecture.

I have the greatest respect for the MIPS architects and designers.
They built an architecture using the technology and under the
limitations they had at hand, which included compiler technology (very
good), silicon technology (foundry-based and uncertain), and silicon
budget (limited). They produced a chip which is admirably well-honed
for the workstation market to which it is targeted (except, of course,
it doesn't run MS-DOS :-) ). The MIPS implementations will tend to
scale directly with increases in process technology and clock speed (I
believe that this is MIPS' contention), until they have to put
microwave waveguides on the package (apologies to Nick Tredennick).

In general, the 960 K-series implementation is a practical and
inexpensive one for small systems without large caches and wide
busses, and not the peak implementation of the architecture. (Read
Glen Myers' book *The 80960 Microprocessor Architecture*, Wiley, 1988,
ISBN 0-471-61857-8, for more details about some of the trade-offs.)
The 960 architecture will scale better than linearly, because the
architecture allows more opportunities for fine-grained parallelism.

I hope that the next time Mr. Firth sallies forth on such an
exposition, he either presents a more balanced view or confines
himself to commentary on a processor with which he is familiar. If he
has additional questions concerning the 960, I would be glad to answer
them.

S. McGeady
Intel Corp.
firth@sei.cmu.edu (Robert Firth) (11/18/88)
In article <3926@omepd> mcg@omepd (Steven McGeady) writes:
>In general I ignore Mr. Firth's gratuitous slams on the 960 (why does
>he have it out for just our processor?

The example I normally use in "slamming" processor designs is the DEC
Vax, whose makers I guess either don't read this newsgroup or are less
sensitive. However, it might be appropriate to say that I think
machine design is very hard, and I have great respect for the people
who do it. Moreover, many of the features I "slam" are, in my view,
not the result of poor hardware engineering but of good hardware
engineers who have been given bad advice by rather less good software
engineers.

>The 960 does a good deal *more* with its call and
>return instructions than Mr. Firth has noted, not a good deal less.

I don't think so. Let us take your detail

> 1) the stack is adjusted for saving procedure-local registers,
>    the frame pointer is adjusted, and the previous value of
>    the frame-pointer is saved
> 2) procedure-local registers are saved in the local register cache
> 3) the return instruction pointer is loaded with the address of
>    the next instruction in the caller

I agree that if you have register windows, you buy the extra work of
managing them. That is a different issue. However, the protocol I
posted did all of the above that was necessary, and in addition

. passed a static link for access to non-locals
. allocated procedure local storage
. included the overhead of calling a parametric procedure or
  procedure variable.

[I've shown in previous posts why I believe it is not necessary to
have both a frame pointer and a stack front pointer, and how one can
avoid having to save the caller's frame pointer.]

>"Well!" I hear you cry. "What if I don't want to do all of that?"
>There is an easy answer. Mr. Firth fails to mention the 'bal' (branch
>and link) instruction in the 960, the exact analog of the MIPS 'jal'

So I did.
The alternative protocol requires a bal at the point of call and a bx
at the point of return. That's absolutely all it does, at an average
cost of about five cycles for the pair. Since my post claimed that a
full protocol on the MIPS cost, on average, just one cycle more, I'm
not quite sure whose case is strengthened by this mention.

>Mr. Firth wisely neglects to count in his 12 cycle cost the cost of
>saving procedure-local registers to the stack.

Because it is a different issue. However, for the record:

On the MIPS, if you are passed parameters in registers, and if you are
not a leaf procedure, you must save the parameters to your local
memory, at a cost of one cycle per register saved.

On the 80960, the register windows do not overlap, so you are passed
parameters in the global registers. If you are not a leaf procedure,
you must save them by copying them from global to local registers, at
a cost of one cycle per register saved.

The cost is the same in both cases. For Mr. McGeady to add the cost
(and an inflated estimate, too) to the MIPS case only is
inappropriate.

>Nevertheless, with these techniques and those
>Mr. Firth outlines, you can further reduce this time, *if you have an
>extremely sophisticated compiler*.

Again, I believe that the issue of global optimisation is not germane.
For that reason, my original post confined the optimisations to purely
local ones that can be performed by a very simple code generator -
indeed, by a one-pass code generator, which is what does perform them.
Note, by contrast, that Mr. McGeady's call=>bal optimisation is a
non-local optimisation, since it requires the caller to have
information about the procedure called. For which reason, Intel were
right to do the work in the linker.

>So, if MIPS did just as well with no call/return, why did we implement
>it? There are a number of reasons not touched above:

Many thanks for that information, which I found interesting and
valuable.
It was my hope, in the original posting, that other readers would find
of value a detailed worked example of a single topic, implemented on a
single machine. If instead the result was to cause distress or anger,
I apologise.

Robert Firth
earl@wright.mips.com (Earl Killian) (11/20/88)
In article <3926@omepd>, mcg@omepd (Steven McGeady) writes:
>Mr. Firth wisely neglects to count in his 12 cycle cost the cost of saving
>procedure-local registers to the stack. Typical C programs use between 4 and
>12 procedure-local variables (variables local to the procedure which are not
>referenced by enclosed scopes and thus can be allocated permanantly to
>registers). This adds between 4 and 12 additional memory references at
>call and return in either the calling site or the called site (depending on
>whether you are using caller or callee save or some combination thereof).
>If you (charitably) consider the average to be 6, and (charitably) assume
>that Mr. Firth is running from zero-wait-state cache and that he gets no cache
>misses, and (charitably) assume that he can fill the delay slots on three
>of these stores, you are left with an additional 18 clocks, making Mr. Firth's
>"12 cycle" (unoptimized) call/return into a 30 cycle call/return.

Rather than "(charitably) considering the average to be 6", it would
be better to measure it. I took 15 programs, sorted them by number of
registers dynamically saved/restored, and picked the median. Here are
its register save/restore statistics:

Callee registers saved/restored at entry/exit:

   N     entry  entry%   cum%  regs%   cum% instr%   cum%
   0    229298   11.8%  11.8%   0.0%   0.0%   0.0%   0.0%
   1   1698070   87.0%  98.8%  95.3%  95.3%   3.3%   3.3%
   2      6636    0.3%  99.1%   0.7%  96.1%   0.0%   3.3%
   3     10637    0.5%  99.7%   1.8%  97.9%   0.1%   3.4%
   4      2648    0.1%  99.8%   0.6%  98.5%   0.0%   3.4%
   6        91    0.0%  99.8%   0.0%  98.5%   0.0%   3.4%
   7      3854    0.2% 100.0%   1.5% 100.0%   0.1%   3.5%
  10         1    0.0% 100.0%   0.0% 100.0%   0.0%   3.5%

Total procedure entries: 1951235
Total register save/restore instructions: 1951235 (3.5%)
Average registers saved per entry: 0.9