[comp.arch] Procedure Call Protocol

firth@sei.cmu.edu (Robert Firth) (11/03/88)

	An Example Procedure Protocol
	-----------------------------

Readers of this newsgroup know that I'm rarely reticent about
criticising other people's hardware and software.  Here is your
chance to do me the same service.

The following is the procedure call, entry and exit protocol I devised
to support traditional "Algol-like" languages on the MIPS R2000 hardware.
These languages include Algol-60, BCPL, ISO Pascal, and Modula-2. From
the point of view of the procedural interface, the main features are

. procedures can recurse

. all local objects and function results have static size

. there are procedure variables and parametric procedures

. procedures can access non-local variables

. there are constructs that jump out of procedures

Now, not all the above languages have all these features.  So we must be
prepared to implement all of them, but if possible optimise them away
selectively.  I give first the "canonical" form, and then discuss how
much can be selectively removed.

Local variables are allocated from a stack.  The register RP is the
stack pointer (actual register numbers don't matter), and at any
time it designates the start of the stack frame of the current procedure.
The stack grows UPWARDS, and it is an open stack, i.e. there is no stack
front pointer and the CALLER moves RP so that it will be correct on
entry to the callee.

Parameters are passed in U0..U7, by value or reference as appropriate,
up to a maximum of 8 words.  After that, they are stored on the stack in
the place where the callee expects to find them.  Function results, if
small, are returned in <U0,U1>; if large, they are returned in an object
allocated by the caller and passed as an extra VAR parameter.
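
The register/stack split can be sketched as follows (an illustrative
model only; the function name is invented):

```python
# Sketch of the parameter-passing convention described above: the
# first 8 words travel in registers U0..U7, and any further words go
# to the stack slots where the callee expects to find them.

def place_parameters(words):
    in_regs = words[:8]       # passed in U0..U7
    on_stack = words[8:]      # stored in the caller-prepared frame
    return in_regs, on_stack

regs, spill = place_parameters(list(range(10)))
assert len(regs) == 8 and spill == [8, 9]
```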

Non-local objects are accessed by following a static chain, the head of
which is passed by the caller.  A procedure value in general consists of
a pair (code address, static link), and that is how a parametric procedure
is passed.
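
Modelled concretely (an illustrative sketch; the frame layout and all
names are invented), non-local access follows the static chain one
link per level of nesting difference:

```python
# Toy model of static-chain addressing: each frame holds a link to the
# frame of its lexically enclosing procedure, and a non-local variable
# at nesting distance d is reached by following d links.

class Frame:
    def __init__(self, static_link, locals_):
        self.static_link = static_link   # enclosing frame, or None
        self.locals = locals_

def load_nonlocal(frame, depth, name):
    for _ in range(depth):               # follow the static chain
        frame = frame.static_link
    return frame.locals[name]

outer = Frame(None, {"x": 42})
middle = Frame(outer, {"y": 1})
inner = Frame(middle, {"z": 2})

# A procedure value is a (code address, static link) pair, e.g.:
proc_value = (load_nonlocal, middle)

assert load_nonlocal(inner, 2, "x") == 42
```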

A longjump is implemented by resetting RP and then taking the jump.  The
actual code of the BCPL "longjump" routine is

	move RP,U0	; reset RP from first parameter
	j U1		; jump to address passed as second parameter

and this is generated inline for things like Pascal GOTO.
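
The semantics can be sketched in a toy model (all names invented, and
the control transfer modelled with a Python exception): a jump buffer
is just the saved RP plus a target, and jumping discards every frame
above the saved RP:

```python
# Toy model of the BCPL-style longjump: reset RP from the jump buffer,
# then transfer control to the saved target.

class LongJump(Exception):
    def __init__(self, rp, label):
        self.rp, self.label = rp, label

stack = []

def setjmp_point():
    return (len(stack), "resume")        # (RP snapshot, target)

def longjump(buf):
    raise LongJump(*buf)

def run():
    buf = setjmp_point()
    stack.extend(["frame-a", "frame-b", "frame-c"])  # nested calls
    try:
        longjump(buf)                    # jump out of nested frames
    except LongJump as j:
        del stack[j.rp:]                 # reset RP: discard dead frames
        return j.label

assert run() == "resume" and stack == []
```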

Call Sequence
-------------

	; before you get here, save all live registers
	lw U0, first_parameter
	lw U1, second_parameter
	...
(A1)	lw RL, callee_address
(A2)	lw RE, static_link
(A3)	addi RP, frame_size	; raise stack pointer
(A4)	jal RL			; call procedure
(A5)	nop
(A6)	subi RP, frame_size	; lower stack pointer
	; expect result in U0

Entry Sequence
--------------

(B1)	sw RL,0(RP)		; save return link
(B2)	sw RE,4(RP)		; save static link
	sw U0,8(RP)		; save parameter 1
	...

Return Sequence
---------------

	; get result into U0
(C1)	lw RL,0(RP)		; restore return link
(C2)	nop
(C3)	j RL			; jump
(C4)	nop

The cost of a basic procedure call, entry and exit is therefore 12 cycles,
ignoring any parameters or result.  Not bad, when you reflect that the
Intel 80960 takes 15 cycles to do a good deal less (an unfair comparison,
of course, since that machine is handicapped by having special call and
return instructions).
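
For the record, the tally behind that figure:

```python
# Counting the canonical sequences above: call (A1-A6) is 6
# instructions, entry (B1-B2) is 2, and return (C1-C4) is 4,
# ignoring parameter moves and the result.

call_seq   = ["A1", "A2", "A3", "A4", "A5", "A6"]
entry_seq  = ["B1", "B2"]
return_seq = ["C1", "C2", "C3", "C4"]

total = len(call_seq) + len(entry_seq) + len(return_seq)
assert total == 12
```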

Let us now optimise this code.

1. How often do we need to compute the callee address?  Only if it is
   a procedure variable or a parametric procedure.  In the code I've seen,
   that is about 30% of the time, so deduct [0.7 * A1]

2. How often do we have to pass a static link?  Never in BCPL or Modula-2,
   of course.  But otherwise, only if the callee is a parametric procedure
   or an explicit inner procedure.  That happens less than 25% of the time,
   so we can deduct [0.75 * A2].  (Happily, procedure variables never need
   a static link in any of the above languages that allow them.)

3. Similarly, how often do we have to STORE a static link that has been
   passed us by the caller?  Only if we are, indeed, an inner procedure.
   That happens hardly ever - say 5% of the time - so deduct [0.95 * B2]
   (You must pass the link to a parametric procedure, since you don't
   know whether the actual is an inner.  But the actual rarely needs to
   store the link, since it DOES know whether it's inner.)

4. Now, how often do we have to raise and lower the stack?  Not once per
   call, clearly, because if we have two calls one after the other then
   the SUBI after the first cancels the ADDI before the second.  To cut
   a long story short, we still need ~45 raise/lower pairs per 100 calls,
   so deduct 1.10
   instructions [0.55 * (A3+A6)]

5. What about that NOP after the call?  How often can we fill it?
   In general, almost always. The stack raise can go there (it has
   no delay and so RP will be valid at point B1).  The load of a
   parameter can go there, provided the callee is aware of the fact.
   Or an extraneous instruction can go there.  That adds up to over
   97% of the time, so I'm simply going to deduct A5 completely.

6. How often do we have to save and restore the return address?  Whenever
   we are a non-leaf procedure.  Dynamically, about 25% of calls are
   calls of leaf procedures, so deduct 0.5 [0.25 * (B1+C1)]

7. Finally, what about the two NOPs C2 and C4?  It is almost always possible
   to fill the first, for instance by the load of the result or the other
   last instruction of the routine.  The second is much harder.  I do not
   allow the load of the result to go there, since then U0 would be invalid
   immediately upon return.  The actual figure is under 10%, so I shall
   be pessimistic here and say that exactly one of the NOP instructions
   can be removed.
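
For the arithmetic, here are the seven deductions tallied:

```python
# Fractions of instructions saved per call, as estimated in points
# 1-7 above (each term is probability-removed times instruction count).

deductions = {
    "A1 callee address":      0.70 * 1,
    "A2 pass static link":    0.75 * 1,
    "B2 store static link":   0.95 * 1,
    "A3+A6 stack moves":      0.55 * 2,
    "A5 call-delay nop":      1.00 * 1,
    "B1+C1 return link":      0.25 * 2,
    "C2/C4 one return nop":   1.00 * 1,
}
saved = sum(deductions.values())
assert abs(saved - 6.0) < 1e-9     # half of the canonical 12 cycles
```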

Adding that all up, I find that, of a canonical 12 cycles, near enough
half can be removed by straightforward local optimisation.  The
result is a protocol able to accommodate the languages given above,
at an average cost of SIX cycles per procedure call.  I'll offer two
reasons for this result:

(a) a very clean, simple, and efficient basic machine design (thanks, MIPS)

(b) a protocol that has taken care to partition the work so that
    each piece is done by the party most able to optimise it.

This second reason is why I believe a special hardware instruction is
a bad idea.  Such an instruction must almost always implement almost
all of the protocol.  But if half the protocol can be optimised away,
then the instruction, to win, must execute more than twice as fast as
the equivalent naive instructions.  I don't believe that can be done.

Thank you for your attention.

mcg@omepd (Steven McGeady) (11/16/88)

In article <7583@aw.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:

>	An Example Procedure Protocol
>	-----------------------------
>
>Readers of this newsgroup know that I'm rarely reticent about
>criticising other people's hardware and software.  Here is your
>chance to do me the same service.

...
[example of proposed MIPS R2000 linkage code]
...

>The cost of a basic procedure call, entry and exit is therefore 12 cycles,
>ignoring any parameters or result.  Not bad, when you reflect that the
>Intel 80960 takes 15 cycles to do a good deal less (an unfair comparison,
>of course, since that machine is handicapped by having special call and
>return instructions).

In general I ignore Mr. Firth's gratuitous slams on the 960 (why does
he have it out for just our processor?  Did Intel decline to offer him
a job at some point in the past?) ... but I feel a need to set the
record straight.  The 960 does a good deal *more* with its call and
return instructions than Mr. Firth has noted, not a good deal less.
Further, the 960 architecture (as opposed to a single implementation)
benefits from its call and return instructions, as they (as I will show)
allow the processor to scale its performance to different implementation
techniques.

I'll start by saying that Mr. Firth is being generous.  The current
implementation of the 960 has a call instruction that takes 9 cycles,
and a return that takes 7, for a total of 16, rather than 15 cycles.
His generosity stops at this point.  During 16 cycles of a call/ret,
the following operations take place:

	1) the stack is adjusted for saving procedure-local registers,
	   the frame pointer is adjusted, and the previous value of
	   the frame-pointer is saved
	2) procedure-local registers are saved in the local register cache
	3) the return instruction pointer is loaded with the address of
	   the next instruction in the caller

"Well!" I hear you cry.  "What if I don't want to do all of that?"
There is an easy answer.  Mr. Firth fails to mention the 'bal' (branch
and link) instruction in the 960, the exact analog of the MIPS 'jal'
instruction, with the exception that the 960 does not have a delay slot.
The bal instruction executes in 2 cycles if the target is in the instruction
cache (which, incidentally, is the same restriction that MIPS has, though
Mr. Firth does not mention it).  Since you also have 32 registers on the
960, one can construct the same calling sequence on the 960 that Mr. Firth
constructs on the R2000.  This is wise for leaf procedures, and we do it.
It is generally unwise for non-leaf procedures:

Mr. Firth wisely neglects to count in his 12 cycle cost the cost of saving
procedure-local registers to the stack.  Typical C programs use between 4 and
12 procedure-local variables (variables local to the procedure which are not
referenced by enclosed scopes and thus can be allocated permanently to
registers).  This adds between 4 and 12 additional memory references at
call and return in either the calling site or the called site (depending on
whether you are using caller or callee save or some combination thereof).
If you (charitably) consider the average to be 6, and (charitably) assume
that Mr. Firth is running from zero-wait-state cache and that he gets no cache
misses, and (charitably) assume that he can fill the delay slots on three
of these stores, you are left with an additional 18 clocks, making Mr. Firth's
"12 cycle" (unoptimized) call/return into a 30 cycle call/return.

Now, lest I be unfair, I will point out that the 80960 has a penalty when the
register cache spills to memory (24 clocks).  However, since procedures that
oscillate within a relatively small call depth *never* spill to memory, the
penalty is incurred in only a fraction of all calls.  Patterson has done some
research to this effect, and our research shows that 4 sets of 16 registers
(non-overlapped) as currently implemented in the 80960K*, handle about
80-85% of calls and returns (8 sets would handle about 98%).  So let's add
8 cycles to our average 80960 call/return time, making it 24 cycles - really
not all that much different from MIPS.
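
To make the arithmetic explicit (these are the figures as estimated
above, nothing more):

```python
# MIPS side: the 12-cycle base plus the (charitable) 18 extra clocks
# attributed to saving and restoring ~6 procedure-local registers.
# 80960 side: the 16-cycle call/ret plus ~8 cycles of amortized
# register-cache spill cost, given that 4 register sets absorb
# roughly 80-85% of calls and returns.

mips_base, local_save_cost = 12, 18
mips_total = mips_base + local_save_cost
assert mips_total == 30

i960_base, amortized_spill = 16, 8
i960_total = i960_base + amortized_spill
assert i960_total == 24
```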

Now let us set about optimizing each of these.  Dave Wall (see SigPlan'88
Compiler Construction Conference Proceedings) would argue that link-time
allocation of a large (~ 58) global register set would minimize the register
spill overhead of a MIPS-style machine (though the differences between 32
and 58 registers are not clear).  However, MIPS (see also SigPlan'88) does not
use this technique, instead "shrink-wrapping" register spills and pushing them
up (and down) the call tree.  I don't have the papers handy so I can't
reproduce actual data.  Nevertheless, with these techniques and those
Mr. Firth outlines, you can further reduce this time, *if you have an
extremely sophisticated compiler*.

The 80960 gets comparable performance without this sophistication.
The 80960 linker employs a very simple technique: it promotes calls to
leaf procedures from the 'call' instructions the compiler outputs to
'bal' instructions - the called site declares that it is a leaf
procedure (with the '.leafproc' pseudo-op) and provides both leaf and
non-leaf entry prologues.  The linker (at link time and across modules)
replaces call instructions with bal instructions to the same address.
If strange code somewhere has taken the address of the procedure and
calls it indirectly, it enters through the standard prologue.
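
In outline (a toy model; the function name and its encoding of the
decision rule are invented, though 'callx' is the 960's indirect call):

```python
# Linker-time promotion: a call site is rewritten from 'call' to 'bal'
# only when the call is direct and the target has declared itself a
# leaf procedure; indirect calls keep the standard prologue path.

def promote(instr, target, leaf_procs):
    if instr == "call" and target in leaf_procs:
        return "bal"                 # fast leaf entry point
    return instr                     # standard prologue path

leaf_procs = {"strlen"}
assert promote("call", "strlen", leaf_procs) == "bal"
assert promote("call", "qsort", leaf_procs) == "call"
assert promote("callx", "strlen", leaf_procs) == "callx"  # indirect
```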

I charge that, in general, and factoring out compiler and tool differences,
the difference between the MIPS calling sequence and the current implementation
of the 960 calling sequence is minimal.  This is supported by David Wall's
paper (though I should point out that Wall does not like register-window
schemes; he has, I suspect, not looked at the 960's non-overlapping
register cache scheme).  A globally-optimizing compiler for the 960 would
instead concentrate on procedure inlining and more advanced call-tree
depth-reduction techniques.

So, if MIPS did just as well with no call/return, why did we implement
it?  There are a number of reasons not touched above:

	1) code density - the 960 call/return instructions typically
	   occupy one word each (two for call if there is more than a
	   21-bit displacement involved);  as Mr. Firth admirably
	   showed, many architectures use 6-12 words for this.
	   In call-intensive programs this adds up quickly (960
	   code density is about 15-25% worse than a VAX, MIPS code
	   density is reported to be 40-50% worse for comparable
	   compiler technology).

	2) increased parallelism -  stack-cache spills in the 960
	   can be overlapped with subsequent (target) instruction
	   execution.  Furthermore, the 960 call/return instructions
	   in the current implementation are implemented with a
	   minimum of hardware to keep chip cost low.  These
	   instructions (like all 960 "microcode" flows) are really
	   on-chip macros of (mostly) normal instructions.  Future
	   implementations, however, will be able to exploit additional
	   hardware to dramatically increase the speed of these operations.
	   Ultimately, we expect call/ret to be a 2 clock operation
	   (rendering the bal optimization obsolete).  The MIPS
	   linkage sequence will always be frozen as N separate
	   instructions, hence N clocks (pick your favourite N in
	   the range 6-30).

	3) provision for a range of implementations - it may seem
	   silly to some people to build less than the fastest possible
	   chip, but to those people I suggest that they attempt to
	   build a MIPS system on a 64 sq.in. board with 100ns DRAMS
	   and see how fast it goes.  The 960 is architected so that
	   we can provide processors with simple, low-cost busses
	   that work with slow or moderate-speed memory systems.
	   The speed of the register-cache spill is directly a function
	   of the bus architecture.

I have the greatest respect for the MIPS architects and designers.
They built an architecture using the technology and under the limitations
they had at hand, which included compiler technology (very good), 
silicon technology (foundry-based and uncertain), and silicon budget
(limited).  They produced a chip which is admirably well-honed for
the workstation market to which it is targeted (except, of course,
it doesn't run MS-DOS :-) ).  The MIPS implementations will tend to
scale directly with increases in process technology and clock speed
(I believe that this is MIPS' contention), until they have to
put microwave waveguides on the package (apologies to Nick Tredennick).

In general, the 960 K-series implementation is a practical and inexpensive
one for small systems without large caches and wide busses, and not the peak
implementation of the architecture.  (Read Glen Myers' book
*The 80960 Microprocessor Architecture*, Wiley, 1988, ISBN 0-471-61857-8,
for more details about some of the trade-offs).  The 960 architecture will
scale better than linearly, because the architecture allows more
opportunities for fine-grained parallelism.

I hope that the next time Mr. Firth sallies forth on such an exposition,
he either presents a more balanced view, or confines himself to commentary
on a processor with which he is familiar.  If he has additional questions
concerning the 960, I would be glad to answer them.

S. McGeady
Intel Corp.

firth@sei.cmu.edu (Robert Firth) (11/18/88)

In article <3926@omepd> mcg@omepd (Steven McGeady) writes:

>In general I ignore Mr. Firth's gratuitous slams on the 960 (why does
>he have it out for just our processor?

The example I normally use in "slamming" processor designs is the DEC
Vax, whose makers I guess either don't read this newsgroup or are less
sensitive.  However, it might be appropriate to say that I think machine
design is very hard, and I have great respect for the people who do it.
Moreover, many of the features I "slam" are, in my view, not the result
of poor hardware engineering but of good hardware engineers who have
been given bad advice by rather less good software engineers.

>The 960 does a good deal *more* with its call and
>return instructions than Mr. Firth has noted, not a good deal less.

I don't think so.  Let us take your detail

>	1) the stack is adjusted for saving procedure-local registers,
>	   the frame pointer is adjusted, and the previous value of
>	   the frame-pointer is saved
>	2) procedure-local registers are saved in the local register cache
>	3) the return instruction pointer is loaded with the address of
>	   the next instruction in the caller

I agree that if you have register windows, you buy the extra work of
managing them.  That is a different issue.  However, the protocol I
posted did all of the above that was necessary, and in addition

	. passed a static link for access to non-locals
	. allocated procedure local storage
	. included the overhead of calling a parametric procedure
	  or procedure variable.

[I've shown in previous posts why I believe it is not necessary to have
 both a frame pointer and a stack front pointer, and how one can avoid
 having to save the caller's frame pointer.]

>"Well!" I hear you cry.  "What if I don't want to do all of that?"
>There is an easy answer.  Mr. Firth fails to mention the 'bal' (branch
>and link) instruction in the 960, the exact analog of the MIPS 'jal'

So I did.  The alternative protocol requires a bal at the point of call
and a bx at the point of return.  That's absolutely all it does, at an
average cost of about five cycles for the pair.  Since my post claimed
that a full protocol on the MIPS cost, on average, just one cycle more,
I'm not quite sure whose case is strengthened by this mention.

>Mr. Firth wisely neglects to count in his 12 cycle cost the cost of saving
>procedure-local registers to the stack.

Because it is a different issue.  However, for the record:

On the MIPS, if you are passed parameters in registers, and if you are not
a leaf procedure, you must save the parameters to your local memory, at a
cost of one cycle per register saved.

On the 80960, the register windows do not overlap, so you are passed
parameters in the global registers.  If you are not a leaf procedure,
you must save them by copying them from global to local registers, at
a cost of one cycle per register saved.

The cost is the same in both cases.  For Mr McGeady to add the cost
(and an inflated estimate, too) to the MIPS case only is inappropriate.

>Nevertheless, with these techniques and those
>Mr. Firth outlines, you can further reduce this time, *if you have an
>extremely sophisticated compiler*.

Again, I believe that the issue of global optimisation is not germane.
For that reason, my original post confined the optimisations to purely
local ones that can be performed by a very simple code generator; indeed
by a one-pass code generator, which is what does perform them.

Note, by contrast, that Mr McGeady's call=>bal optimisation is a non-local
optimisation, since it requires the caller to have information about the
procedure called.  For which reason, Intel were right to do the work in
the linker.

>So, if MIPS did just as well with no call/return, why did we implement
>it?  There are a number of reasons not touched above:

Many thanks for that information, which I found interesting and valuable.

It was my hope, in the original posting, that other readers would find
of value a detailed worked example of a single topic, implemented on a
single machine.  If instead the result was to cause distress or anger,
I apologise.

Robert Firth

earl@wright.mips.com (Earl Killian) (11/20/88)

In article <3926@omepd>, mcg@omepd (Steven McGeady) writes:
>Mr. Firth wisely neglects to count in his 12 cycle cost the cost of saving
>procedure-local registers to the stack.  Typical C programs use between 4 and
>12 procedure-local variables (variables local to the procedure which are not
>referenced by enclosed scopes and thus can be allocated permanently to
>registers).  This adds between 4 and 12 additional memory references at
>call and return in either the calling site or the called site (depending on
>whether you are using caller or callee save or some combination thereof).
>If you (charitably) consider the average to be 6, and (charitably) assume
>that Mr. Firth is running from zero-wait-state cache and that he gets no cache
>misses, and (charitably) assume that he can fill the delay slots on three
>of these stores, you are left with an additional 18 clocks, making Mr. Firth's
>"12 cycle" (unoptimized) call/return into a 30 cycle call/return.

Rather than "(charitably) considering the average to be 6", it would
be better to measure it.  I took 15 programs, sorted them by number of
registers dynamically saved/restored, and picked the median.  Here are
its register save/restore statistics:

Callee registers saved/restored at entry/exit:
   N      entry entry%   cum%  regs%   cum% instr%   cum%
   0     229298  11.8%  11.8%   0.0%   0.0%   0.0%   0.0%
   1    1698070  87.0%  98.8%  95.3%  95.3%   3.3%   3.3%
   2       6636   0.3%  99.1%   0.7%  96.1%   0.0%   3.3%
   3      10637   0.5%  99.7%   1.8%  97.9%   0.1%   3.4%
   4       2648   0.1%  99.8%   0.6%  98.5%   0.0%   3.4%
   6         91   0.0%  99.8%   0.0%  98.5%   0.0%   3.4%
   7       3854   0.2% 100.0%   1.5% 100.0%   0.1%   3.5%
  10          1   0.0% 100.0%   0.0% 100.0%   0.0%   3.5%
Total procedure entries: 1951235
Total register save/restore instructions: 1951235 (3.5%)
Average registers saved per entry: 0.9
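
The summary lines can be checked against the table's own rows:

```python
# Each row is (N registers saved at entry, number of procedure
# entries observed with that N), copied from the table above.

rows = [(0, 229298), (1, 1698070), (2, 6636), (3, 10637),
        (4, 2648), (6, 91), (7, 3854), (10, 1)]

entries = sum(count for _, count in rows)
regs_saved = sum(n * count for n, count in rows)

assert entries == 1951235                     # "Total procedure entries"
assert round(regs_saved / entries, 1) == 0.9  # "Average registers saved"
```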
--