[comp.sys.m88k] Register Allocation

rfg@ics.uci.edu (Ron Guilmette) (11/17/89)

In article <2631@yogi.oakhill.UUCP> marvin@yogi.UUCP (Marvin Denman) writes:
>
>In article <PFEIFFER.89Nov15213136@puye.nmsu.edu> pfeiffer@nmsu.edu (Joe Pfeiffer) writes:
>>
>>This is probably getting a bit far afield, but I'm very puzzled by
>>this statement.  It has always been my impression that it doesn't much
>>matter whether the registers are preserved by the caller or the
>>callee, just so somebody does it.  Why is there a greater burden on
>>the compiler writer if it's the callee?
>>
>>-Joe Pfeiffer.
>
>
>The answer probably lies in the fact that many leaf calls will not require 
>enough registers that they need to save anything if they know which registers 
>are guaranteed to be "dead".

Of course, but the leaf routines never do know which ones are dead, do they?

>These [leaf routines] are by frequency a large percentage of 
>calls so they should not be penalized.  Using the current standard a leaf 
>routine can use all of the temporary set of registers plus all of the 
>parameter passing registers without any saving.  Why should the caller save 
>"all" of his live registers when only a small fraction will be used?

My original question was "Why should the caller save *any* of his live
registers when he has absolutely no knowledge of whether or not *any* of
them will be used (i.e. clobbered) in the called routine?"

I don't think that I have yet seen a good answer to this question.

>The converse argument is why should the callee be required to save dead
>registers?

What if there are *never* any dead registers?  Consider this.  Given the
typical program (which uses at least 30 or more "values" throughout
its execution lifetime) it may be possible to use *all* available registers
for at least *some* productive purpose at *all* points throughout a program.

Even without having global data-flow information, at the very least you could
decide to keep several simple global variables in registers, and the odds
are very good that you would get at least some performance increase as a
result of doing this.

On the other hand, if you *do* have good global data flow analysis, then
there is no doubt whatsoever that you should be able to use all registers
for "live" values at all points throughout a program.

Thus, you should always have registers filled with live values (in any case).

Thus, you should always have a "pure" caller-saves convention.  QED

>The current scheme is a fairly efficient compromise between the two, that will
>do in the absence of global register allocation.  

Ah, ha!  So if I understand you correctly, you are saying that this parameter
passing convention was intentionally designed to be a "compromise" which
would not make stupid compilers look too bad (even at the expense of
decreasing performance in the presence of a good modern compiler with
data-flow analysis).  Is that correct?

// rfg

phillip@oakhill.UUCP (Mike Phillip) (11/21/89)

In article <1989Nov16.212149.9770@paris.ics.uci.edu> Ron Guilmette <rfg@ics.uci.edu> writes:
>
>My original question was "Why should the caller save *any* of his live
>registers when he has absolutely no knowledge of whether or not *any* of
>them will be used (i.e. clobbered) in the called routine?"
>
>I don't think that I have yet seen a good answer to this question.
>

I think that the issues of parameter passing and register allocation have
become somewhat confused in this discussion thus far, as have the
relative merits of callee vs caller saved registers.  To make matters
even more confusing, the "meta"-issue of compiler technology has been
introduced without clearly distinguishing between local, global and
inter-procedural register allocation (Note that these are THREE different
cases).

First of all, the parameter passing registers (r2-r9) are really
irrelevant to the caller vs. callee discussion, since they are
effectively treated as "caller save" in ALL cases.  It just so happens
that OCS designates that these registers may be used to hold subroutine
parameters during calling sequences.  The callee is then free to use
these registers as it sees fit without preserving the
values/variables/temporaries (choose your favorite term) associated with
them.   Thus, combining these registers with registers r10-r13 provides
a total of 12 "caller save" registers.  The OCS also provided 12 "callee
save" registers, r14-r25.  The main point of debate seems to be the
rationale behind the chosen division between caller and callee registers.

[... deletion of some discussion ...]

>
>What if there are *never* any dead registers?  Consider this.  Given the
>typical program (which uses at least 30 or more "values" throughout
>its execution lifetime) it may be possible to use *all* available registers
>for at least *some* productive purpose at *all* points throughout a program.
>
>Even without having global data-flow information, at the very least you could
>decide to keep several simple global variables in registers, and the odds
>are very good that you would get at least some performance increase as a
>result of doing this.
>
>On the other hand, if you *do* have good global data flow analysis, then
>there is no doubt whatsoever that you should be able to use all registers
>for "live" values at all points throughout a program.
>
>Thus, you should always have registers filled with live values (in any case).
>
>Thus, you should always have a "pure" caller-saves convention.  QED
>

    This is where things get tricky...  The key point to recognize is that
regardless of compiler technology, there will likely always need to be a
distinction between caller and callee saved registers.  When global (but
not inter-procedural) register allocation is performed, it is not
possible to determine the register usage of called subroutines at the
point of the call (i.e. caller cannot see callee).  

    If all registers were designated as being in "caller save" mode, code
would need to be inserted by the compiler to "save" the contents of all
registers whose associated variables have a live range that spans the
subroutine call EACH TIME such a call was made.  (Note that this does not
necessarily imply that ALL registers need to be saved... some registers 
may be "dead" at the point of the call). 
In cases where the callee needs only a few registers, such
a calling convention would likely result in unnecessary load and store
instructions which use up valuable instruction issuing and memory
bandwidth.

    At the other extreme, if all registers were used in "callee save"
mode (with the exception of the parameter passing registers), the callee
would generate load and store instructions for every register that it
uses, regardless of whether such overhead is warranted by the "live" use
of the same register in the caller.  Thus, some division is needed
between caller and callee saved registers.  Why did OCS choose 12/12 ?
I wasn't involved in the decision making process, but the issue only
becomes significant when the following criteria are met:
   1) The callee requires more than 12 registers to perform adequate
      optimizations.
   2) The additional "callee save" registers used actually did NOT need
      to be saved and restored because their live range did not span the
      subroutine call in the caller.

How often does this occur?  Empirically speaking, beats me.  (I don't
have any such data at my disposal at the moment).  But your above
argument that good overall optimization in a compiler will result in most
registers holding live values across calls actually MINIMIZES the effect
of having "too many" caller or callee saved registers.  In fact, if the
register being used in a callee needs to be saved SOMEWHERE (i.e. the
caller has a live value in it across the call), it can be argued that 
"callee" saved registers are preferable, since the callee can insert the
saves and restores only in those flow paths where the register contents
are clobbered.  In such cases, "caller save" conventions can result in
unnecessary overhead.

Now onward to inter-procedural analysis...

    (A good reference for this topic is "Minimizing Register Usage
    Penalty at Procedure Calls" by Fred Chow from SIGPLAN Notices, July 88)

    As described in the above paper, "callee save" registers can
effectively be treated as "caller save" registers by performing a
depth-first traversal of the program call graph.  (i.e. by the time the
caller is analyzed, it has all relevant register usage info about the
callee).  The callee can then defer saving and restoring to the caller,
making the register appear as though it was operating in "caller save"
mode.   There are limitations to this approach, however.  These
limitations occur when the compiler does not have complete register
usage information for a particular subroutine (libraries are a commonly
cited example).  Other examples include:
    1) recursive calls
    2) calls made through subroutine pointers
    3) separate compilation

In such cases, even "globally-optimizing, inter-procedural analyzing"
compilers need to resort to a default convention, and the trade-offs of
caller vs. callee once again become relevant.  Thus, OCS does not prevent
optimizing compilers from taking advantage of inter-procedural register
allocation, but provides the "default" for those cases when such
optimizations cannot be made.

>>The current scheme is a fairly efficient compromise between the two, that will
>>do in the absence of global register allocation.  
>
>Ah, ha!  So if I understand you correctly, you are saying that this parameter
>passing convention was intentionally designed to be a "compromise" which
>would not make stupid compilers look too bad (even at the expense of
>decreasing performance in the presence of a good modern compiler with
>data-flow analysis).  Is that correct?
>

Yes, the inclusion of both caller and callee save registers is a
compromise... always has been, and probably always will...

I disagree, however, that the compromise is in place to mask compiler
deficiencies or that it hinders the development of state-of-the-art
optimizing compilers.  (And yes, I know, the Moto compilers can use
improvement :^)

As pointed out above, the distinction is needed even when
inter-procedural analysis is used for register allocation.  When such
compiler technology is implemented, the relative distribution of caller
vs. callee registers becomes LESS critical, since the compiler can 
essentially "convert" callee-saved registers into caller-saved registers 
by "pushing" them up in the call graph.

Geez, I figured most of the register allocation debates would involve
r26-r29, which are the so-called "linker registers".  ;^)


Mike Phillip
Motorola 88000 Development

E-mail:    oakhill!bushwood!phillip@cs.utexas.edu     or
           cs.utexas.edu!oakhill!bushwood!phillip

mash@mips.COM (John Mashey) (11/21/89)

In article <2647@bushwood.oakhill.UUCP> oakhill!bushwood!phillip@cs.utexas.edu (Mike Phillip) writes:
>In article <1989Nov16.212149.9770@paris.ics.uci.edu> Ron Guilmette <rfg@ics.uci.edu> writes:
>>
>>My original question was "Why should the caller save *any* of his live
>>registers when he has absolutely no knowledge of whether or not *any* of
>>them will be used (i.e. clobbered) in the called routine?"
>Yes, the inclusion of both caller and callee save registers is a
>compromise... always has been, and probably always will...
>
>I disagree, however, that the compromise is in place to mask compiler
>deficiencies or that it hinders the development of state-of-the-art
>optimizing compilers.  (And yes, I know, the Moto compilers can use
>improvement :^)

In defense of Mr. Phillip, both HP PA and MIPS R3000s split the integer
registers in approximately the same ratio as does the 88K, from
looking at program statistics.  Allocation behavior for integer programs,
given similar compiler technology, should not be too different.
(For floating-point, may not be true, as both HP PA and MIPS have
a separateset of FP registers).
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

rfg@ics.uci.edu (Ron Guilmette) (11/21/89)

In article <2647@bushwood.oakhill.UUCP> oakhill!bushwood!phillip@cs.utexas.edu (Mike Phillip) writes:
>
>First of all, the parameter passing registers (r2-r9) are really
>irrelevant to the caller vs. callee discussion, since they are
>effectively treated as "caller save" in ALL cases.

Right.  The're just more "caller save" registers.

>...  The main point of debate seems to be the
>rationale behind the chosen division between caller and callee registers.

I didn't realize that I had started a "debate"!  I though I was was just
giving voice to self-obvious truths!  :-)  (Note simley face)

>>What if there are *never* any dead registers?

I *still* haven't seen a good answer to *this* question.

>>... it may be possible to use *all* available registers
>>for at least *some* productive purpose at *all* points throughout a program.

I wish I had said that!  (Oops... I did!) :-)

>    This is where things get tricky...  The key point to recognize is that
>regardless of compiler technology, there will likely always need to be a
>distinction between caller and callee saved registers.

Yes.  The distinction is valuable.  It is needed so that I can tell you
what type of saving convention is harmful.  "Caller saving consider
harmful!"

>    If all registers were designated as being in "caller save" mode, code...

>... some registers 
>may be "dead" at the point of the call). 

Perhaps you didn't read the whole of my previous posting.  In any case,
the important part (where I assert that its possible to have all registers
live at all points) is given above.

>In cases where the callee needs only a few registers, [a caller saves]
> convention would likely result in unnecessary load and store
>instructions...

Right.  Now if you can just see your way clear to generalize this Obvious
Truth from one particular register to the set of *all* registers, then
my point will have been proven.  All I'm trying to say is that what is bad
for one register (i.e. being saved/restored unnecessarily) is bad for
*all* registers.

>    At the other extreme, if all registers were used in "callee save"
>mode ... the callee
>would generate load and store instructions for every register that it
>uses, regardless of whether such overhead is warranted by the "live" use
>of the same register in the caller.

Right, but you are still begging the question and avoiding the issue.
What if the saves & restores are *always* warranted by virtue of the
fact that *all* registers contain live values at *all* points?  In that
case, there would be no unnecessary loads or stores in a strictly
callee-saves convention.

What I'm having trouble understanding is why various people (Mike P.
included) seem to be unwilling to accept that this convention (used
for decades on CICS machines) may actually be simply *the* best possible
convention, regardless of whether or not you seem to have lots of registers
to waste or whether or not you have a reduced instruction set.

>Thus, some division is needed
>between caller and callee saved registers.

What you mean to say is that you need to have both kinds.  You certainly
have not justified that conclusion with any evidence that addresses my "all
live, all the time" argument.  I'm willing to be convinced, but I have not
been yet.

>Why did OCS choose 12/12 ?

My question exactly.  I believe that 0/31 would have been a better choice.
(Note that I classify r0 as a callee-saves register because the caller
may assume that it is is preserved across calls without doing anything ;-)
Also, I give the ratio as 0/31 because I assume C language where a
reault comes back (typically) in one register (by convention r2).

>I wasn't involved in the decision making process, but the issue only
>becomes significant when the following criteria are met:
>   1) The callee requires more than 12 registers to perform adequate
>      optimizations.

>   2) The additional "callee save" registers used actually did NOT need
>      to be saved and restored because their live range did not span the
>      subroutine call in the caller.

I don't believe that #1 even enters into the issue at all.

Regarding #2, you are really missing my point.  What I was trying (in my
own futile way) to say was that the issue is *always* significant bacause
callers can always insure that virtually *all* of their registers contain
"live" values at all call points.  Thus, any registers clobbered by the
callee would necessarily have to be saved and restored.

>How often does this occur?  Empirically speaking, beats me.

Who cares?  The question is "Has the OCS effectively (and permanently)
crippled an otherwise impressive machine, by ignoring the possibility
that newer and better compilers can keep more useful live values in
more registers over more calls?"

>(I don't have any such data at my disposal at the moment).

Yep.

>But your above
>argument that good overall optimization in a compiler will result in most
>registers holding live values across calls actually MINIMIZES the effect
>of having "too many" caller or callee saved registers.

Wrong.  It minimizes the negative implications of the possibility that
the callee might save/restore too much stuff in a "callee-saves" convention
(because it simply never will).  On the other hand, having *lots* of live
values (because you have an intelligent optimizing compiler) definitely will
show a "caller-saves" convention to be what it is, i.e. a rotten way of doing
things, and strictly counterproductive.

>In fact, if the
>register being used in a callee needs to be saved SOMEWHERE (i.e. the
>caller has a live value in it across the call), it can be argued that 
>"callee" saved registers are preferable, since the callee can insert the
>saves and restores only in those flow paths where the register contents
>are clobbered.  In such cases, "caller save" conventions can result in
>unnecessary overhead.

Good.  You have seen the light.  Now just apply this Obvious Truth iteratively
to the set of *all* registers.

>
>Now onward to inter-procedural analysis...
>
>    (A good reference for this topic is "Minimizing Register Usage
>    Penalty at Procedure Calls" by Fred Chow from SIGPLAN Notices, July 88)
>
>    As described in the above paper, "callee save" registers can
>effectively be treated as "caller save" registers by performing a
>depth-first traversal of the program call graph.  (i.e. by the time the
>caller is analyzed, it has all relevant register usage info about the
>callee).  The callee can then defer saving and restoring to the caller,
>making the register appear as though it was operating in "caller save"
>mode.   There are limitations to this approach, however.

You can say that again!

>These
>limitations occur when the compiler does not have complete register
>usage information for a particular subroutine (libraries are a commonly
>cited example).

Or any sort of external routine!  In other words  this serious limitation
will apply to *most* calls in any large program.  A callee-saves convention
has no such limitations or problems of this sort.

>In such cases, even "globally-optimizing, inter-procedural analyzing"
>compilers need to resort to a default convention, and the trade-offs of
>caller vs. callee once again become relevant.

Right.

>Thus, OCS does not prevent
>optimizing compilers from taking advantage of inter-procedural register
>allocation, but provides the "default" for those cases when such
>optimizations cannot be made.

Unfortunately compiler writers only want to support one "convention".  As it
is, I fear that many such people are effectively being forced to blindly
conform this inadequate and mediocre default convention for *all* calls so
that they may be assured of being able to have the code they generated
*interoperate* with "standard" libraries and/or with code produced by
other compilers.

>Yes, the inclusion of both caller and callee save registers is a
>compromise... always has been, and probably always will...

If it seemed as though it was a political compromise in which everybody
got a part of what they wanted, then that would be be one thing.  But this
is obviously *not* a political compromise.  It is technical compromise
between excelent performance and lousy performance.  Nobody won and
everybody lost.

>I disagree, however, that the compromise is in place to mask compiler
>deficiencies

If what you are saying is true (i.e. if this is strictly a reasonable
technical compromise) then please tell me if 88open collected any
empirical evidence about *large* programs at various possible
"compromise points" along the scale from 0/12 to 12/0.

If they did not, perhaps they would like to arrange to do so before the
final draft of the OCS is cast into stone.  I know that the GNU C compiler
allows the user to vary the number of registers in each category, so the
tests (benchmarks of the calling conventions we could call them) should
be quite simple to perform and would yield absolutely clear cut evidence
of major significance to all 88open members.

If I'm wrong that "callee-saves" always produces better performance
(with smaller executables to boot) then such tests could also have another
significant benefit.  They could shut me up, and I'm sure everybody
would be in favor of that! :-)

>Geez, I figured most of the register allocation debates would involve
>r26-r29, which are the so-called "linker registers".  ;^)

Well get to that.  Later, later...

// rfg

rfg@ics.uci.edu (Ron Guilmette) (11/21/89)

In article <31783@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <2647@bushwood.oakhill.UUCP> oakhill!bushwood!phillip@cs.utexas.edu (Mike Phillip) writes:
>>In article <1989Nov16.212149.9770@paris.ics.uci.edu> Ron Guilmette <rfg@ics.uci.edu> writes:
>
>In defense of Mr. Phillip...

I didn't realize that anybody was attacking him!

>both HP PA and MIPS R3000s split the integer
>registers in approximately the same ratio as does the 88K...

Like lemmings to the sea.  Is everyone making the same mistake in the same
way?

Inquiring minds want to know!

// rfg

edwardm@hpcuhc.HP.COM (Edward McClanahan) (11/22/89)

Ron Guilmette *debates*:

> If I'm wrong that "callee-saves" always produces better performance
> (with smaller executables to boot) then such tests could also have another
> significant benefit.  They could shut me up, and I'm sure everybody
> would be in favor of that! :-)

I offer a simple example...

  Caller uses 2 registers...

  Callee uses 30 registers...

Using the Caller-saved convention, every time a call is made, 2 registers
are *saved*.

Using the Callee-saved convention, every time a call is made, 30 registers
are *saved*.

Similarly, consider the opposite situation (where the caller uses 30 registers
and the callee uses only 2).  In this case, it would be better to have used
the callee-saved convention.

Which of the above is more common?  This is the very question that the OCS
(and HP and MIPS and ...) had to consider.

Your arguments about other *special conditions* (such as a callee only needing
particular registers in a infrequently executed path, etc...) are all valid,
but what is the frequency of these conditions?

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

  Edward McClanahan
  Hewlett Packard Company
  Mail Stop 47UE              -or-     edwardm%hpda@hplabs.hp.com
  19447 Pruneridge Avenue
  Cupertino, CA  95014                 Phone: (408)447-5651

phillip@oakhill.UUCP (Mike Phillip) (11/22/89)

In article <1989Nov20.205338.11760@paris.ics.uci.edu>, rfg@ics.uci.edu (Ron Guilmette) writes:
> In article <2647@bushwood.oakhill.UUCP> oakhill!bushwood!phillip@cs.utexas.edu (Mike Phillip) writes:
> >
> 
> >    This is where things get tricky...  The key point to recognize is that
> >regardless of compiler technology, there will likely always need to be a
> >distinction between caller and callee saved registers.
> 
> Yes.  The distinction is valuable.  It is needed so that I can tell you
> what type of saving convention is harmful.  "Caller saving consider
> harmful!"

       [deletion of point/counterpoint discussion]
> 
> >    At the other extreme, if all registers were used in "callee save"
> >mode ... the callee
> >would generate load and store instructions for every register that it
> >uses, regardless of whether such overhead is warranted by the "live" use
> >of the same register in the caller.
> 
> Right, but you are still begging the question and avoiding the issue.
> What if the saves & restores are *always* warranted by virtue of the
> fact that *all* registers contain live values at *all* points?  In that
> case, there would be no unnecessary loads or stores in a strictly
> callee-saves convention.
> 
> What I'm having trouble understanding is why various people (Mike P.
> included) seem to be unwilling to accept that this convention (used
> for decades on CICS machines) may actually be simply *the* best possible
> convention, regardless of whether or not you seem to have lots of registers
> to waste or whether or not you have a reduced instruction set.
> 

   You have asked for a situation where "caller save" registers actually
   provide better performance than "callee save" registers...  Consider
   the following situations:

   1)  In a language such as C, where "arbitrary" indirection is
       supported, there are often cases where registers MUST be presumed
       dead across function calls because the values they contain are
       "pointed to" by other variables.  For example, in the following
       code...

               int x;
               int y;
               extern void foo();
                  .
                  .
                  .

       /* Assume at this point that x and y have been placed in registers */

               y = x;
               foo (&x, &y);	
               if (x == y ) {
                  x =  x + y;
               } else {
                  x = x - y;
               }

       Effect on the caller:

       In such a case, the registers containing x and y are effectively
       treated as "caller save" by NECESSITY (assuming that foo is an
       external subroutine) because x and y must be presumed "clobbered"
       across the call.  The contents of the registers
       holding the value of x and y MUST be stored back into the "home"
       memory locations for x and y before the call to foo() because the
       addresses of x and y are passed as parameters.

       Effect on the callee:

       If foo() happens to be a leaf routine (which roughly half of
       all dynamically measured subroutine calls are...), 
       there is NO chance that it will call
       another subroutine.  Thus, the choice of registers is independent
       of subsequent nested calling sequences.  If, as in your scheme, all
       registers are treated as "callee save", spill code must be
       generated for EACH register that is used in the callee.  In a code
       sequence like that shown above, it is POSSIBLE that some of the
       registers selected for use by foo()
       have already been freed by the caller, thus making the callee
       spill code redundant.  In other words, if the compiler had a
       convention for prioritizing register usage based on a GUARANTEE
       that a certain set of registers were available without the need
       for spill code... more efficient code may result.

    2) OCS was not written just for C programs.  In Fortran code,
       anything declared in a COMMON block must be presumed dead
       across calls.  Fortran also handles arrays differently than C,
       which affects calling sequences...
       
    Thus, there are LOTS of situations where "spill"
    code must be inserted by the compiler before a subroutine call.
    If the registers being "spilled" are "caller save" then this
    overhead is not duplicated in the callee... i.e. a good compiler
    may place aliased (or global) variable contents in "caller save" 
    registers.

    Key point ====>  I'm NOT advocating either extreme of the CALLER
    vs. CALLEE spectrum.  I'm simply trying to point out why there
    may be some merit in a caller/callee split.  I would agree that
    in a "perfect" programming world where all subroutines are visible
    to the compiler that the distinction becomes unnecessary.
    Unfortunately, as you pointed out, this is NOT the case, and I
    don't feel that the issue is as clear cut as you state it...

    Further discussion of this probably belongs in comp.compilers or comp.arch.


    ---------------------
    Mike Phillip
    Motorola 88000 Development

tom@ssd.harris.com (Tom Horsley) (11/22/89)

I can't resist entering this discussion, endless and pointless arguments
have always appealed to me :-).

I can point out one obvious flaw in the idea that the called routine should
do the register saves. We have several benchmarks as well as several real
programs (as opposed to the typical benchmark :-) in which it is possible
to examine the code and see that the current conventions produce fantastic
code. This occurs (quite frequently, I might add) when the outer routine
contains loops, and the loops contain subroutine calls. Very often (because
12 registers is really an awful lot of registers) the leaf routine does
not need to save any registers at all. (For instance, this is true of quite
a lot of the low-level str and mem routines in the C library). This means
that the subroutine (which, if you will recall, is called in a loop) runs
much faster than it would if it was saving every register it used. Quite
often, the outer routine is also not complex enough to require the use
of all the registers (r14-r26 is still a lot more registers than I typically
have available on a CISC machine). All this boils down to the fact that
the current convention may not be perfect, but it is, in practice, pretty
nice.

This is real code I am talking about, which we have really spent some time
examining for real quality issues, not some hypothetical situation in which
some magic compiler has figured out how to keep something live in every
register all the time. We have a good optimizing compiler with a good
register allocator that can do neat tricks like make separate lifetime
regions for globals around loops so they can be in memory in regions where
there are aliased references and in registers in areas where there are
not aliased references, and it turns out that more often than you might
expect, there are already enough registers. As a practical matter having
more registers would help sometimes, but not as often as you might think.

This is all still operating without interprocedural analysis. In any
system where each routine is compiled in a vacuum, some calling convention
is necessary, and I don't have any difficulty making my highly optimizing
compiler with its fancy register allocator deal with any convention.

If you go into the domain of a global program database with interprocedural
register allocation, you no longer need to be bound by OCS rules, you can
make any rules you want because you can define your own interface for each
routine to minimize global cost (don't ask me how to actually do this, mind
you :-).

Also, doing interprocedural register allocation at link time is not
necessarily restricted because you have to link in external libraries. The
88k architecture is really quite simple to disassemble and analyze. It is
not totally beyond the realm of possibility that the linker could even
modify register usage for things like libc.a, even without special support
from a global program database. (Essentially, there would be a database, it
would just be in a format that is kind of hard to interpret).
--
=====================================================================
domain: tahorsley@ssd.csd.harris.com  USMail: Tom Horsley
  uucp: ...!novavax!hcx1!tahorsley            511 Kingbird Circle
      or  ...!uunet!hcx1!tahorsley            Delray Beach, FL  33444
======================== Aging: Just say no! ========================

rhg@cpsolv.UUCP (Richard H. Gumpertz) (11/25/89)

THESIS 1: The caller knows what registers are live; the callee does not. 
Hence the caller should decide what needs saving. 

THESIS 2: The callee knows what registers he clobbers; the caller does
not.  Hence the callee should decide what needs saving. 

In other words, if the caller and callee are presumed to be compiled
separately, neither has enough information to do the "right" thing with
respect to register-saving.

By splitting the registers in two classes, "leaf" (i.e., callees that
never call anything) procedures can try to use just the clobberable
registers and so avoid any register saving.  Similarly, "root" (i.e.,
callers that are rarely called) can try to use just the preserved
registers and so avoid saving anything around calls.

Procedures that both do lots of calling and are called frequently are in
a middle ground: that is where the compiler will have to work to figure
out what goes in which register.  Certainly not an easy task, but it can
be done.  Here is where the compiler writer should earn his keep.  By
the way, note that a smart compiler might even save registers around a
block or loop of calls instead of around each call.  Still, in many
cases it will just be a trade-off between saving on entry and saving
before calling out.  If both are done with equal AVERAGE frequency (i.e. 
the procedure makes one call out, average, per time that it is called),
it doesn't matter to execution speed which is used.  (Code size may be
effected, but in RISC architectures the decision has already been made
to increase speed at the cost of ignoring code size anyway.)

Clearly the OVERALL average across ALL procedures MUST be one call out
per call in.  The question is, what does it look like for a particular
procedures?  What is the distribution of called/call ratios?

I personally think some sort of split of register saves is reasonable. 
Only measurement can tell whether 50-50 is appropriate for a particular
level of compiler intelligence and program mix.  Still, it's as good a
place to start as any. 


-- 
===============================================================================
| Richard H. Gumpertz rhg%cpsolv@uunet.uu.NET -or- ...uunet!amgraf!cpsolv!rhg |
| Computer Problem Solving, 8905 Mohawk Lane, Leawood, Kansas 66206-1749      |
===============================================================================

edwardm@hpcuhc.HP.COM (Edward McClanahan) (11/28/89)

Tom Horsley touches on an interesting side issue:

> I can point out one obvious flaw in the idea that the called routine should
> do the register saves. We have several benchmarks as well as several real
> programs (as opposed to the typical benchmark :-) in which it is possible
> to examine the code and see that the current conventions produce fantastic
> code. This occurs (quite frequently, I might add) when the outer routine
> contains loops, and the loops contain subroutine calls. Very often (because
> 12 registers is really an awful lot of registers) the leaf routine does
> not need to save any registers at all. (For instance, this is true of quite
> a lot of the low-level str and mem routines in the C library).
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As I recall, the VAX has TWO calling conventions:

  1 - CALLS and CALLG explicitly require an "entry mask" in the called
      procedure indicating which registers to push on the stack.  This
      mask is 16 bits, one bit per register, although you never save
      certain registers (e.g. R0, PC, etc...).  The microcode interpreting
      the CALLx instruction actually does the PUSHes for you.

  2 - JSR and BSR simply push the return PC on the stack and jump to the
      called procedure.  The callee must then "protect" any registers it
      uses.

Both of these schemes implement the callee-saved model, but the JSR/BSR
is faster for "low-level...routines".

In HP's RISC architecture, we have both callee and caller saved registers
(as is the case in the m88k standard).  Still, for some "low-level stuff",
we needed a faster way to call a function.  We implemented "millicode" to
fill this gap.  The caller simply doesn't have to save any caller-saved
registers it happens to be using at the time.  Similarly, the optimizer
doesn't have to halt optimization across the millicode call.  Those
"low-level str and mem routines" are implemented in millicode.  Is there
any provision for this calling convention in the m88k standard?

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

  Edward McClanahan
  Hewlett Packard Company
  Mail Stop 47UE              -or-     edwardm%hpda@hplabs.hp.com
  19447 Pruneridge Avenue
  Cupertino, CA  95014                 Phone: (408)447-5651

thomas@uplog.se (Thomas Hameenaho) (11/28/89)

I have followed the debate about caller vs. callee saves registers with
great interest.

I have an idea that perhaps could be a bit of both worlds: What about having
the caller supply a mask of live registers to the callee, ie. registers that
must not be clobbered to the callee? This mask should be ANDed with a mask of
the registers that the callee clobbers and handed to the equivalent of MOVEM
(68k) or CALLx (VAX). The save/restore should of course be handled by
microcode/hardware to make it fast.
This way there would never be unnecessary saves.
--
Real life:	Thomas Hameenaho		Email:	thomas@uplog.se
Snail mail:	TeleLOGIC Uppsala AB		Phone:	+46 18 189406
		Box 1218			Fax:	+46 18 132039
		S - 751 42 Uppsala, Sweden

meissner@dg-rtp.dg.com (Michael Meissner) (11/29/89)

In article <2647@bushwood.oakhill.UUCP> phillip@oakhill.UUCP (Mike Phillip) writes:

	...

|  effectively be treated as "caller save" registers by performing a
|  depth-first traversal of the program call graph.  (i.e. by the time the
|  caller is analyzed, it has all relevant register usage info about the
|  callee).  The callee can then defer saving and restoring to the caller,
|  making the register appear as though it was operating in "caller save"
|  mode.   There are limitations to this approach, however.  These
|  limitations occur when the compiler does not have complete register
|  usage information for a particular subroutine (libraries are a commonly
|  cited example).  Other examples include:
|      1) recursive calls
|      2) calls made through subroutine pointers
|      3) separate compilation
|  
|  In such cases, even "globally-optimizing, inter-procedural analyzing"
|  compilers need to resort to a default convention, and the trade-offs of
|  caller vs. callee once again become relevant.  Thus, OCS does not prevent
|  optimizing compilers from taking advantage of inter-procedural register
|  allocation, but provides the "default" for those cases when such
|  optimizations cannot be made.

Separate compilation can be handled by having the compiler generate
intermediate code instead of object code or some such, and having the
linker generate the final code.  Tail-recursive calls can be changed
by the compiler into a loop or some such.  Calls made through pointers
or to different langauges are another problem.  Calls made through
pointers are somewhat rare in conventional C (except for things like
EMACS and such), but they become much more mainline in object oriented
languages like C++ (virtual functions).  Dynamic shared libraries are
another real big problem, where the function being called may not even
exist when the function is called (yes I know the BCS/OCS currently do
not support shared libraries directly today, but you can roll your own
with memctl, and they are coming).

IMHO, most users are not willing to pay the price of getting every
last ounce of performance that doing global register allocation for
the entire program would entail.  Yes, there are some things (like the
kernel itself, or X11 libraries) that it would be worth it to get
every last drop of performance.  I know some vendors and companies who
never use the normal compiler optimizations, since they perceive that
the performance is adequate (or the compilers are inadequate).
--

Michael Meissner, Data General.
Until 12/15:	meissner@dg-rtp.DG.COM
After 12/15:	meissner@osf.org

pardo@cs.washington.edu (David Keppel) (11/29/89)

thomas@uplog.se (Thomas Hameenaho) writes:
>[Idea: caller register mask, callee register mask, save only
> registers indicated by AND of masks.]

There is a potential problem.  It takes extra code to decide whether
or not to save certain registers.  The time you spend executing that
code could be spent naively saving registers and you'd come out about
the same.

On the VAX, where it already has save-by-mask microcode, the tme
penalty might be OK.  On the RISC-family processors, you'll probably
lose a lot.  Is it worth adding save-by-mask hardware?  Probaly not.

The last time this came around I saved a bunch of articles.  You can
get them via anonymous ftp:

	ftp june.cs.washington.edu (128.95.1.4)
	login anonymous
	passwd username
	cd tmp/pardo
	get proc.call.txt

About 750 lines.

	;-D on  ( USENET by RPC )  Pardo
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo

ge@kunivv1.sci.kun.nl (Ge' Weijers) (11/29/89)

thomas@uplog.se (Thomas Hameenaho) writes:

>I have an idea that perhaps could be a bit of both worlds: What about having
>the caller supply a mask of live registers to the callee, ie. registers that
>must not be clobbered to the callee? This mask should be ANDed with a mask of
>the registers that the callee clobbers and handed to the equivalent of MOVEM
>(68k) or CALLx (VAX). The save/restore should of course be handled by
>microcode/hardware to make it fast.
>This way there would never be unnecessary saves.

Suppose I write a main program that uses all registers. Then your scheme
degenerates to a callee-saves scheme. Especially leaf routines do not benefit
from your scheme. Even worse, the performance of a program might depend
critically on the register number chosen by the optimiser. This makes
predicting performance quite hard. And every procedure call gets extra overhead.

The VAX CALLS/G instruction does so much often unnecessary work that
not using it is a viable option. A compiler I used to work on was faster
than PCC only for this reason. (It used a caller saves convention BTW,
and register variables were unknown)
The same is true for the compiler I'm currently working on. In functional
programming languages the procedure call is so common that the performance
depends quite critically on the calling convention. I use a pure caller-saves
convention, with parameter passing partially in registers. For things
like fibonacci numbers and other recursive programs this runs rings around
C, Fortran and what have you. (And don't mention the SPARC, because
functional programs really exercise register windows to death :-).
I think a functional language implementation is better off not using
register windows at all, except if global analysis is done to
keep the processor from saving and restoring dead register all the time.
This is not trivial in recursive cases. Give me a store-multiple-register
please?).

I suspect the half-half (or perhaps 60-40) convention is one of the best when
you don't have global knowledge. If you have global knowledge then you
can implement parameter passing using registers, local variables and global
variables, whatever suits the problem.

Ge' Weijers
Ge' Weijers                                    Internet/UUCP: ge@cs.kun.nl
Faculty of Mathematics and Computer Science,   (uunet.uu.net!cs.kun.nl!ge)
University of Nijmegen, Toernooiveld 1         
6525 ED Nijmegen, the Netherlands              tel. +3180612483 (UTC-2)

meissner@dg-rtp.dg.com (Michael Meissner) (11/30/89)

In article <100050002@hpcuhc.HP.COM> edwardm@hpcuhc.HP.COM (Edward McClanahan) writes:

|  Both of these schemes implement the callee-saved model, but the JSR/BSR
|  is faster for "low-level...routines".
|  
|  In HP's RISC architecture, we have both callee and caller saved registers
|  (as is the case in the m88k standard).  Still, for some "low-level stuff",
|  we needed a faster way to call a function.  We implemented "millicode" to
|  fill this gap.  The caller simply doesn't have to save any caller-saved
|  registers it happens to be using at the time.  Similarly, the optimizer
|  doesn't have to halt optimization across the millicode call.  Those
|  "low-level str and mem routines" are implemented in millicode.  Is there
|  any provision for this calling convention in the m88k standard?

The OCS doesn't directly support millicode, and I'm not aware of any
compilers that use internal calling sequences.  However for optimizing
the str* and mem* stuff, you really need 10 or more registers (I think
our memcpy uses more than the 12 non-preserved registers to fully
pipeline the loads and such).

--

Michael Meissner, Data General.
Until 12/15:	meissner@dg-rtp.DG.COM
After 12/15:	meissner@osf.org

meissner@dg-rtp.dg.com (Michael Meissner) (11/30/89)

In article <THOMAS.89Nov28100512@uplog.uplog.se> thomas@uplog.se (Thomas Hameenaho) writes:

|  I have followed the debate about caller vs. callee saves registers with
|  great interest.
|  
|  I have an idea that perhaps could be a bit of both worlds: What about having
|  the caller supply a mask of live registers to the callee, ie. registers that
|  must not be clobbered to the callee? This mask should be ANDed with a mask of
|  the registers that the callee clobbers and handed to the equivalent of MOVEM
|  (68k) or CALLx (VAX). The save/restore should of course be handled by
|  microcode/hardware to make it fast.
|  This way there would never be unnecessary saves.

Given that the machine is a RISC machine, it only has simple loads and
stores.  In order to implement the above scheme, some sort of load
and/or conditional branches would have to be done, which would be an
expensive operation.

In general, the time taken to do the prologue stores is 1 cycle per
register, and 1 cycle for the epilogue loads (yes I know that loads
take 3 cycles if coming from the cache, but in most epilogues, you
will have a lot of loads in a row, whose value is not needed
immediately, so the loads will be pipelined).  Assuming the procedure
in question is not a monster procedure, and it does not invoke other
large procedures, it is likely that the values being saved will still
be in the cache when it comes time to load them again.
--

Michael Meissner, Data General.
Until 12/15:	meissner@dg-rtp.DG.COM
After 12/15:	meissner@osf.org

tim@binky.sybase.com (Tim Wood) (11/30/89)

In article <9972@june.cs.washington.edu> pardo@june.cs.washington.edu (David Keppel) writes:
>thomas@uplog.se (Thomas Hameenaho) writes:
>>[Idea: caller register mask, callee register mask, save only
>> registers indicated by AND of masks.]
>
>There is a potential problem.  It takes extra code to decide whether
>or not to save certain registers.  The time you spend executing that
>code could be spent naively saving registers and you'd come out about
>the same.

Hmm, seems like it would all be done at compile time.  For every
procedure call you compile, note which registers are read or written in
the calling routine (or, if "-O" is set, try to determine which ones are not
live across the call :-).  Then, set the bits in the save mask.
Compile in the mask value in your "call-with-mask" instruction sequence
(hey, what a great idea for a single instruction! :-)
Each compiled procedure would have its own mask, a la VAX.

>
>On the VAX, where it already has save-by-mask microcode, the tme
>penalty might be OK.  

Except, VAX is oriented toward callee-save.  Subroutines must
have a 2-byte register-save mask up front in order to be called by the
"call-subroutine" instructions.

>On the RISC-family processors, you'll probably
>lose a lot.  Is it worth adding save-by-mask hardware?  Probaly not.

Please substantiate these assertions.

>		    pardo@cs.washington.edu

Sybase, Inc. / 6475 Christie Ave. / Emeryville, CA / 94608	  415-596-3500
tim@sybase.com          {pacbell,pyramid,sun,{uunet,ucbvax}!mtxinu}!sybase!tim
		This message is solely my personal opinion.
		It is not a representation of Sybase, Inc.  OK.

dgb@cs.washington.edu (David Bradlee) (12/01/89)

In article <7263@sybase.sybase.com>, tim@binky.sybase.com (Tim Wood) writes:
> Hmm, seems like it would all be done at compile time.  For every
> procedure call you compile, note which registers are read or written in
> the calling routine (or, if "-O" is set, try to determine which ones are not
> live across the call :-).  Then, set the bits in the save mask.
> Compile in the mask value in your "call-with-mask" instruction sequence
> (hey, what a great idea for a single instruction! :-)
> Each compiled procedure would have its own mask, a la VAX.
> 
You seem to imply that the mask is a function of the CALLER, meaning
the callee has to examine the mask at runtime to save the appropriate
registers, which, as pardo indicated, would take extra code.  If you
knew all the callsites, the callee could save the union, but if you
know all the callsites, you can explicitly save the approriate
registers in either the callee or at the callsite, without any mask.
Ergo, the mask doesn't help you.

	-- Dave Bradlee, University of Washington

pardo@cs.washington.edu (David Keppel) (12/01/89)

> pardo@june.cs.washington.edu (David Keppel) writes:
>>On the RISC-family processors, you'll probably
>>lose a lot.  Is it worth adding save-by-mask hardware?  Probaly not.

In article <7263@sybase.sybase.com> tim@binky.UUCP (Tim Wood) writes:
>Please subtantiate these assertions.

A couple other people have already; allow me to summarize.

A RISC has [is defined as having] a load/store architecture, one
instruction issued per clock cycle.  Save-by-mask in hardware is
non-RISC.  Implementing save-by-mask in software requires several
instructions to be executed for each (potential) save.

Between the blowup in code size and the extra instructions, you're
probably better off blindly saving n registers.  Remember that
conditionally saved registers must also be conditionally restored at
function return.

A (long) example

Suppose callee saves 8, caller saves 8.  The compiler will leave
caller saves registers empty (don't need saving) aroudn function
calls.  The caller could thus needlessly save up to 8 registers.
If the callee needs registers it will try to use registers that were
saved by the callee.  If that fails it will save up to 8 registers on
its own, but those registers might be dead in the caller, so up to 8
registers could be saved needlessly.  The wasted caller-saves and
wasted callee-saves can never overlap.  At most 8 registers are saved
needlessly.

If load/store is pipelined, then the worst-case code looks like, say

    add sp 32 sp
    store r1 0[sp]
    store r2 4[sp]
    store r3 8[sp]
    store r4 12[sp]
    store r5 16[sp]
    store r6 20[sp]
    store r7 24[sp]
    store r8 28[sp]
    call
    nop
    load 28[sp] r8
    ...
    subl sp 32 sp

That's 19 clock ticks wasted.  The symmetric case for callee saves
similarly wastes 19 clock ticks.

Now consider save-by-mask
  
    [r1 to r16 can are saved by mask.]
    [mask passed in rN, shadow is rM.]
    [Destination is rightmost operand.]
    [No branch delay slots.]
  
  function:
    jmp !mask done
  r1:
    and mask 1 shadow
    jmp !shadow r2
    save r1
  r2:
    and mask 2 shadow
    jmp !shadow r2
    save r2
    jmp !mask done
  r3:
  ...
  
  r16:
    and mask 2*^15 shadow
    jmp !shadow done
    save r16
  done:

(Note: you may well be able to do better than this.  This is an
example of one sequence for one hypothetical machine.)

The worst case is 16*3 = 48 instructions, but if jumps introduce more
than one bubble, then it could blow up to about 16*4 = 64
instructions.  Let's go with 48.  On average you should have to save
about half the registers (a guess).  A clever optimzier will put
registers in low-number registers, so that you don't generally have to
go all the way to r16 in order to save half the registers.  So on
average you'll hit about 24 instructions in the prolog and about 24
instructions in the epilog, or about 48 instructions.

Of course the register saves are highly machine- and
application-dependent.  Suppose that generally you have to save only 2
registers and that they are r1 and r3.  Then you'll execute 9
instructions on entry and 9 instructions on exit and you're only doing
1 save better than the *worst* case of the naive caller-saves-some and
callee-saves-some!

Of course this all needs to be parameterized on a machine-by-machine
basis, and also on an application-by-application basis.

* Average number of registers to save.
* Average number of saves that are wasted.
* Cost to blindly do split caller/callee saves
* Cost to do save-by-mask.

If you really care, you should start by reading the stuff availabile
via anonymous ftp from june.cs.washington.edu  (128.95.1.4) in
tmp/pardo/proc.call.txt

	;-D on  ( Lies, damn lies, and USENET )  Pardo
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo

tim@binky.sybase.com (Tim Wood) (12/02/89)

In article <10007@june.cs.washington.edu> pardo@june.cs.washington.edu (David Keppel) writes:
>In article <7263@sybase.sybase.com> tim@binky.UUCP (Tim Wood) writes:
>>Please subtantiate these assertions [about save-by-mask hardware].
>
>A couple other people have already; allow me to summarize....

And a very worthwhile summary it is.  Thanks for the effort, it's
nice to have it in one place.  

>If you really care, you should start by reading the stuff availabile
>via anonymous ftp @ june.cs.washington.edu (128.95.1.4) tmp/pardo/proc.call.txt
>		    pardo@cs.washington.edu

I'll keep this reference for when we get Internet access.  Thanks.

-TW

Sybase, Inc. / 6475 Christie Ave. / Emeryville, CA / 94608	  415-596-3500
tim@sybase.com          {pacbell,pyramid,sun,{uunet,ucbvax}!mtxinu}!sybase!tim
		This message is solely my personal opinion.
		It is not a representation of Sybase, Inc.  OK.

rhg@cpsolv.UUCP (Richard H. Gumpertz) (12/05/89)

In article <THOMAS.89Nov28100512@uplog.uplog.se> thomas@uplog.se (Thomas Hameenaho) writes:
>I have an idea that perhaps could be a bit of both worlds: What about having
>the caller supply a mask of live registers to the callee, ie. registers that
>must not be clobbered to the callee? This mask should be ANDed with a mask of
>the registers that the callee clobbers and handed to the equivalent of MOVEM
>(68k) or CALLx (VAX). The save/restore should of course be handled by
>microcode/hardware to make it fast.
>This way there would never be unnecessary saves.

Gee, what a great idea!  We could add microcode to our RISC machines
and turn them into CISC machines.  Moby sigh.        :-)

-- 
===============================================================================
| Richard H. Gumpertz rhg%cpsolv@uunet.uu.NET -or- ...uunet!amgraf!cpsolv!rhg |
| Computer Problem Solving, 8905 Mohawk Lane, Leawood, Kansas 66206-1749      |
===============================================================================