[comp.arch] register save/restore

grunwald@m.cs.uiuc.edu (10/28/88)

This question follows the line of recent questions concerning register
save/restore conventions.

Many (most UNIX) systems apply the convention that the callee must save
any registers used in a procedure. Other systems dictate that the caller
must save the registers.

Question One: Is there an advantage? I can think of many practical advantages
to the former method (callee saves) vs. caller saves, but I can also think
of advantages to the latter.

Now, the second question concerns a saving convention. I'd like to know if
this has been implemented/modelled anywhere, and what the advantages are.

Presume that we have a mask of dirty bits for the register file. Presume
that each procedure specifies a bit mask of registers that are used (as is
already done on the 680x0 and NS32x32).

The only registers that need to be saved are those denoted by the AND-ed
product of the two bit masks. The initial dirty mask of the called procedure
would contain those bits that didn't get AND-ed (i.e. those registers
that aren't used by the current procedure but still contain live data).
When registers get over-written, their dirty bit is set.

This has the advantage of saving only those registers that actually need to
be saved. The cost is similar to current register load/unload masks, for
common architectures. There's a cost/perf. tradeoff as register sets get
larger. For 64 registers, you'd need a double-longword of bits for
the masks (although you'd probably split it into two 32 bit sets, since
very few procedures would use more than 32 registers, and those procedures
could just execute two instructions).

It would require some slight changes to compilers. You'd like to randomize
register accesses across subroutines, or perhaps record information about
register accesses in other subroutines so you don't select the same
registers they used.

So, has this been done before? Modelled? Any data? 

Dirk Grunwald
Univ. of Illinois
grunwald@m.cs.uiuc.edu

henry@utzoo.uucp (Henry Spencer) (10/30/88)

In article <3300037@m.cs.uiuc.edu> grunwald@m.cs.uiuc.edu writes:
>Many (most UNIX ) systems apply the convention that the callee must save
>any registers used in a procedure. Other systems dictate that the caller
>must save the registers. ... Is there an advantage?

As usual, it depends.  Ideally, one wants to save and restore registers
as little as possible, because it costs time and memory.  The caller
knows which registers don't have to be saved because they don't contain
anything interesting.  The callee knows which registers don't have to
be saved because he's going to leave them alone anyway.  Neither is
clearly superior for all situations.  It's not unheard-of to split the
available registers into a callee-saved group and a caller-saved group.
(What does MIPS do?)

The callee-saves bias in Unix is basically historical.  On the 11, there
were so few free registers that the calling sequence simply saved and
restored all of them, and doing it in the callee saved code space.
(This is a slight oversimplification.)  On the VAX, the wonderful all-
singing all-dancing standard calling sequence provided by the hardware
encouraged callee-saves.  Not everyone has bothered to rethink the issues
when changing processors.
-- 
The dream *IS* alive...         |    Henry Spencer at U of Toronto Zoology
but not at NASA.                |uunet!attcan!utzoo!henry henry@zoo.toronto.edu

yuval@taux02.UUCP (Gideon Yuval) (10/31/88)

In article <1988Oct30.013510.16861@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>......................................  On the VAX, the wonderful all-
>singing all-dancing standard calling sequence provided by the hardware
>encouraged callee-saves.  Not everyone has bothered to rethink the issues
>when changing processors.

The VAX all-singing CALLS/CALLG was also VERY  slow;  but  the  only  way  to
find  this  out was to time it by (e.g.) the Unix "time" command -- DEC never
published any Vax timing-notes (that I know of).
-- 
Gideon Yuval, yuval@taux01.nsc.com, +972-2-690992 (home) ,-52-522255(work)
 Paper-mail: National Semiconductor, 6 Maskit St., Herzliyah, Israel
                                                TWX: 33691, fax: +972-52-558322

dietz@cs.rochester.edu (Paul Dietz) (10/31/88)

"Callee-saves" and "caller-saves" are just two instances of a
more general register saving strategy.  Suppose we know the
call graph of the program, and we know which registers each
procedure uses.  Register save/restore code can be thought of
as occurring on the edges of this call graph.  If we know the
execution frequency of each arc in the graph, and if the time
to save/restore a register is independent of the number of
registers being saved (a big if), the optimal code for saving/
restoring each register can be found using an algorithm for
maximum network flow (finding the cut of minimum total frequency
that disconnects all procedures using the register).

Paul F. Dietz
dietz@cs.rochester.edu

bcase@cup.portal.com (Brian bcase Case) (11/01/88)

>The VAX all-singing CALLS/CALLG was also VERY  slow;  but  the  only  way  to
>find  this  out was to time it by (e.g.) the Unix "time" command -- DEC never
>published any Vax timing-notes (that I know of).

Yes, this never ceased to amaze me.  How in the world would you write
a good compiler for this machine?  It seems impossible to make good
code selection decisions without knowing how long things take.  I
guess they thought the only consideration was code size; at least the
architecture manual *had* to tell you the instruction encodings!  :-)
Oh, yeah, I forgot:  smaller code is faster code.  Sigh.

lindsay@k.gp.cs.cmu.edu (Donald Lindsay) (11/02/88)

In article <228@taux02.UUCP> yuval@taux02.UUCP (Gideon Yuval) writes:
>The VAX all-singing CALLS/CALLG was also VERY  slow;

On the 11/780, it was slow because the write-to-memory FIFO wasn't very
deep. The typical CALLS, with a typical register save mask, wrote more
words than the FIFO could absorb. So, the CPU stalled, waiting for the
memory to make more room in the FIFO. 

Of course, there are engineering reasons for avoiding deep FIFOs.  Since
this single design decision caused a bottleneck, I assume that it was
an oversight.

The Nautilus (8700, etc.) was carefully tuned against real instruction
traces.  I believe that CALLS runs somewhat better on these machines.

-- 
Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

firth@sei.cmu.edu (Robert Firth) (11/03/88)

	Register Saving across Procedure Calls
	--------------------------------------

Which is better - caller saves or callee saves?


A. Is this the right question?

First, and most important, if you are designing a professional-quality
production compiler, this is the wrong question.  Such a compiler must
perform interprocedural optimisation if it is to be respectably state
of the art.  

However, if you want to design a prototype, amateur, or deliberately
low-cost compiler, the issue is probably one worth considering.  To
keep this note short, I'm going to assume you understand the basic
issue and are familiar with current hardware and software technology.


B. Are the strategies equally sound?

The point I consider most important, is that there is a definite
semantic asymmetry between the two strategies.  If the caller saves,
then the caller is saving, locally, his own local state.  This seems
to me basically correct.  If the callee saves, then the callee is
saving, local to him, state that belongs to someone else.  Moreover,
he is saving state of greater extent - the caller's registers - in
space of lesser extent - his own stack frame.  This seems to me
semantically unsound.

Now, I tend to let few things get in the way of efficiency, especially
efficiency of something as crucial as the procedure call, but semantic
correctness is one of those things.  So in this case, I'm going to
come out and say that "callee saves" is fundamentally wrong, and should
be avoided if possible, even at some cost.  


C. Which is more efficient?

Happily, however, the efficiency arguments, in my experience, support
the "caller saves" strategy, so one can indeed do well by doing good.

The most blatant case is that of the longjump, which appears in other
languages as a GOTO or RAISE statement.  This causes a jump out of a
procedure to somewhere further up the call chain, and so must reset
the environment of the destination.  If the caller saves state, then
this is simple: the jump is a jump, and the destination knows where
all the state has been saved.  In most implementations, one need only
reset the frame pointer to the current incarnation of the destination
procedure, and take the jump to the label.

But if the callee saves, then the caller has no idea how to recover
his saved state, which may be buried any number of stack frames further
down.  It is therefore necessary to unwind the entire stack before taking
the jump.  The difference in cost can easily be a factor of 100 or more.

I do not regard this as a marginal point.  The exception is beginning to
be used as a normal programming tool; it is a feature of several modern
languages and will probably be a feature of most new ones.  Its efficient
implementation is as desirable as the efficient implementation of, say,
for-loops or array assignments.

Turning though to the main topic - which is the faster strategy for a
normal call and return - I see two issues here: the number of registers
to be saved and restored, and the cost of each save and restore.


D. Some facts and guesses

In my experience, there is almost no difference between the number
of registers used by the caller and the number used by the callee.

Small procedures tend to use fewer registers than larger ones, and leaf
procedures tend to be a bit smaller, so on balance it seems marginally
better for the callee to save. (What this also tells us is that
interprocedural optimisation of leaves and leaf-callers only will give
you big returns.)

But this is outweighed by two factors:

* The callee must save all registers it will use throughout the body;
  the caller need save only the registers that are live at the point of
  call.

* When two or more calls occur in succession, both callees must save,
  but the caller need save only once.

Rough guesses I have accumulated over time are

* at any call point, caller is using ~ 2/3 of the registers it will
  use at all (though this is partly due to defects in register
  allocation strategies)

* on average, a procedure call is (almost) immediately followed by
  another about 2/3 of the time. This implies that if the caller
  saves, it will have to save ~ 45 times for every 100 calls.

These two factors together imply that the cost of a caller-saves protocol
is about 1/3 that of a callee-saves protocol.  (Do you believe that?)

Now consider the cost of a save and restore.  There are two factors that
make it cheaper for the caller to save

* the register may be slaving a known value.  The caller then need not
  save at all, merely restore.  I find this is true of at least 20% of
  live registers. (Consider for instance the MC68020.  If you are working
  hard, you probably have 5 or 6 live D registers, of which at least one
  is holding a constant, and 4 or 5 live A registers, of which perhaps
  two are holding pointers to data structures.  That's 3 out of 10)

* the store may be combined with another operation.  For example, the
  last operation on the register may have been an add

    ADDL2 X,R1

  This can be changed into

    ADDL3 X,R1,save_place

  A small saving, admittedly, but a saving.

There is a third factor I have not assessed satisfactorily, which applies
especially to RISC machines.  Is the caller or the callee better able to
distribute loads and stores through the code, so as to overlap any load
or store delays?  I suspect it is the callee, but that is a hunch.


E. What about Hardware Help?

On the question of a "high-level procedure call" implemented in hardware,
such as the VAX CALLS, I think my opinion is known: such instructions
are worthless.  But what about a simpler instruction, such as a hardware
register save, or save-under-mask?  A good example is provided by the
MOVEM instruction of the MC68020, or the LDM/STM of the PE3200.  These
are typical, and typically their break-even point is at about 4
registers.  This suggests to me that they are of marginal value - after
all, if you are already passing parameters in registers, how many are
left to save around the typical call?

The instructions are probably still worth having for things like a task
context save (if only the PE3200 permitted them to be used that way!),
but their contribution to a procedure protocol is slight.


F. Possibly Offensive Remark

I agree with Henry, that procedure calling protocols need to be thought
afresh for each fresh machine.  Unfortunately, very few compiler shops
seem prepared to do this.  I see over and again the model of a closed,
downward-growing stack; caller pushes parameters; callee saves registers
and allocates local space.  One can do better than this by a factor of
two or three, on almost any machine I know, with a different model more
amenable to local optimisation.

We are used to making up in software for badly designed hardware.  Now
we have the reverse: making up in hardware for badly designed software.
At least, that is my explanation for register windows.

daryl@hpcllla.HP.COM (Daryl Odnert) (11/03/88)

Dirk Grunwald (grunwald@m.cs.uiuc.edu) asks:
> Many (most UNIX ) systems apply the convention that the callee must save
> any registers used in a procedure. Other systems dictate that the caller
> must save the registers.

Some systems use a mixed strategy.  For example, both the HP Precision
Architecture (HPPA) and the MIPS R2000 split up the register set into
two partitions, a caller-saves set and a callee-saves set.

The compiler is free to use any register in the caller-saves set without
saving or restoring that register.  These registers cannot be used to hold
live values across procedure calls.  A register in the callee-saves
(or entry-saves) set can only be used if the procedure saves its value
on entry and restores it at exit.  Of course, these registers can be
used to hold live values across calls.

> Question One: Is there an advantage? I can think of many practical advantages
> to the former method (callee saves) vs. caller saves, but I can also think
> of advantages to the latter.

The mixed strategy seems to work well.  The difficult question is determining
the right number of registers to put in each of the two partitions.
Some benchmarks favor larger caller-saves partitions; others could take
advantage of a larger callee-saves set.

> Now, the second question concerns a saving convention. I'd like to know if
> this has been implemented/modelled anywhere, and what the advantages are.

Two relevant references:

"Minimizing Register Usage Penalty at Procedure Calls" by Fred C. Chow
of MIPS Computer Systems.  It is published in the "Proceedings of the
SIGPLAN '88 Conference on Programming Language Design and Implementation"
(pp. 85-94).

"Register Windows vs. Register Allocation" by David W. Wall of
DEC Western Research Lab.  Same conference proceedings (pp. 67-78).


Daryl Odnert   daryl%hpda@hplabs.hp.com
Hewlett Packard Information Software Division

johnl@ima.ima.isc.com (John R. Levine) (11/03/88)

In article <7580@aw.sei.cmu.edu> firth@bd.sei.cmu.edu (Robert Firth) writes:
>Which is better - caller saves or callee saves?
>
>[he argues that for a variety of reasons, caller saves is faster.]

The argument I most often heard for the callee saving the registers is that
it makes the object code smaller.  Most subroutines are called from more than
one place (otherwise, why make it a subroutine?  I know, libraries, modularity,
etc.), so you would need N copies of the caller-saves code, one for each call,
but only one copy of the callee-saves code.
-- 
John R. Levine, IECC, PO Box 349, Cambridge MA 02238-0349, +1 617 492 3869
{ bbn | think | decvax | harvard | yale }!ima!johnl, Levine@YALE.something
Rome fell, Babylon fell, Scarsdale will have its turn.  -G. B. Shaw

bsy@PLAY.MACH.CS.CMU.EDU (Bennet Yee) (11/03/88)

In article <7580@aw.sei.cmu.edu> firth@bd.sei.cmu.edu (Robert Firth) writes:
}	Register Saving across Procedure Calls
}
}Which is better - caller saves or callee saves?
}A. Is this the right question?
}
}First, and most important, if you are designing a professional-quality
}production compiler, this is the wrong question.  Such a compiler must
}perform interprocedural optimisation if it is to be respectably state
}of the art.  
} ...

You must also consider the problems of separate compilation and multiple
language applications.  If register saving differs from module to module,
you'd better have language extensions that let you specify that external
routines you call use some standard procedure call mechanism, as
well as ways to specify that a function you're writing may be called by
some external module and must likewise use a standard convention.
The alternative is to require smart linkers.

This is probably a religious issue, much like network byte ordering versus
swap-only-as-required.

}B. Are the strategies equally sound?
}
}The point I consider most important, is that there is a definite
}semantic asymmetry between the two strategies.  If the caller saves,
}then the caller is saving, locally, his own local state.  This seems
}to me basically correct.  If the callee saves, then the callee is
}saving, local to him, state that belongs to someone else.
} ...
}This seems to me
}semantically unsound.
}
} ... I'm going to
}come out and say that "callee saves" is fundamentally wrong, and should
}be avoided if possible, even at some cost.  
}
}C. Which is more efficient?
}
}Happily, however, the efficiency arguments, in my experience, support
}the "caller saves" strategy, so one can indeed do well by doing good.
}
}The most blatant case is that of the longjump, which appears in other
}languages as a GOTO or RAISE statement.  This causes a jump out of a
}procedure to somewhere further up the call chain, and so must reset
}the environment of the destination.  If the caller saves state, then
}this is simple: the jump is a jump, and the destination knows where
}all the state has been saved.  In most implementations, one need only
}reset the frame pointer to the current incarnation of the destination
}procedure, and take the jump to the label.
}
}But if the callee saves, then the caller has no idea how to recover
}his saved state, which may be buried any number of stack frames further
}down.  It is therefore necessary to unwind the entire stack before taking
}the jump.  The difference in cost can easily be a factor of 100 or more.

It's interesting to examine the ACIS implementation of longjmp/setjmp for
the IBM RTs.  The standard procedure call convention is callee-save, and
longjmp does NOT unwind the stack.  Contrast this with the Vaxen BSD
implementation of longjmp/setjmp, which DOES unwind the stack.  Vaxen BSD,
of course, uses callee-save too.  What is the difference?  Well, for the IBM
RTs, your registers have the same values as when they returned from the
setjmp.  On Vaxen, your registers have the same values as they had when you
called the next function from within the same function that called the
setjmp.  So depending on one or the other behaviour for your register
variables is not safe.  It's a minor but significant semantic difference.
[Anybody know what POSIX decided for this?]  Now, how to avoid unwinding the
stack and still retain the same semantics?  It's actually not hard -- given
that, for those ``other'' languages at least, GOTO and RAISE are part of the
language, the compiler can just always save the contents of register
variables before calling other functions _only for those functions that
contain GOTO or RAISE_, and restore the register variables when the
exception occurs.  Thus, you can get the efficiency of callee-saving (a big
win for those often-used, little leaf functions that use only a few scratch
registers) and retain the semantics that you want.

Of course, it's hard to argue a similar case for C, since setjmp/longjmp is
NOT part of the language....  And for those super-duper-smart compilers that
puts a variable into a register for the first half of a function and another
variable into the same register for the second half, unwinding the stack to
restore registers from stack frames isn't quite enough either!

-bsy
-- 
Internet:	bsy@cs.cmu.edu		Bitnet:	bsy%cs.cmu.edu%smtp@interbit
CSnet:	bsy%cs.cmu.edu@relay.cs.net	Uucp:	...!seismo!cs.cmu.edu!bsy
USPS:	Bennet Yee, CS Dept, CMU, Pittsburgh, PA 15213-3890
Voice:	(412) 268-7571

jkjl@munnari.oz (John Lim) (11/03/88)

One issue about the caller/callee saves argument that hasn't been brought
up is code size.  Callee saves minimizes code size.  For example, in C:

	a() {} 
	b() {}
	c() {}
	f() {a();b();c();}
	g() {b();a();c();}

If the caller saves, assuming that saves and restores take 1 instruction
each, we get 12 save/restore instructions for the above code.  If the
callee saves, we need only 6.

Not too important, you might think, but I remember that M'soft used the
pascal calling convention in Windows to save 5% (if I remember right) of
the code size, which is similar in principle to the caller/callee argument.

Luckily, this isn't so much of an issue when you aren't confined to 640K
of mem...

	john lim

amos@taux02.UUCP (Amos Shapir) (11/03/88)

I haven't seen anyone mention the idea of mixing caller-saves and callee-saves
methods: the caller hands to the callee a mask of live registers; the callee
ANDs this with a mask of the registers it uses and saves the registers whose
corresponding bits are set.  The mask the callee hands to the routines it calls
is the OR of these two masks. (I hope that's clear).
-- 
	Amos Shapir				amos@nsc.com
National Semiconductor (Israel) P.O.B. 3007, Herzlia 46104, Israel
Tel. +972 52 522261  TWX: 33691, fax: +972-52-558322
34 48 E / 32 10 N			(My other cpu is a NS32532)

cik@l.cc.purdue.edu (Herman Rubin) (11/04/88)

In article <7580@aw.sei.cmu.edu>, firth@sei.cmu.edu (Robert Firth) writes:
> 
> 	Register Saving across Procedure Calls
> 	--------------------------------------
> 
> Which is better - caller saves or callee saves?
> 
> 
> A. Is this the right question?

It is, if library subroutines are to be used.  A compiler cannot change
the procedure to be followed in this case.

			..................

| B. Are the strategies equally sound?
| 
| The point I consider most important, is that there is a definite
| semantic asymmetry between the two strategies.  If the caller saves,
| then the caller is saving, locally, his own local state.  This seems
| to me basically correct.  If the callee saves, then the callee is
| saving, local to him, state that belongs to someone else.  Moreover,
| he is saving state of greater extent - the caller's registers - in
| space of lesser extent - his own stack frame.  This seems to me
| semantically unsound.
| 
| Now, I tend to let few things get in the way of efficiency, especially
| efficiency of something as crucial as the procedure call, but semantic
| correctness is one of those things.  So in this case, I'm going to
| come out and say that "callee saves" is fundamentally wrong, and should
| be avoided if possible, even at some cost.  
| 
| 
| C. Which is more efficient?
| 
| Happily, however, the efficiency arguments, in my experience, support
| the "caller saves" strategy, so one can indeed do well by doing good.
| 
| The most blatant case is that of the longjump, which appears in other
| languages as a GOTO or RAISE statement.  This causes a jump out of a
| procedure to somewhere further up the call chain, and so must reset
| the environment of the destination.  If the caller saves state, then
| this is simple: the jump is a jump, and the destination knows where
| all the state has been saved.  In most implementations, one need only
| reset the frame pointer to the current incarnation of the destination
| procedure, and take the jump to the label.
| 
| But if the callee saves, then the caller has no idea how to recover
| his saved state, which may be buried any number of stack frames further
| down.  It is therefore necessary to unwind the entire stack before taking
| the jump.  The difference in cost can easily be a factor of 100 or more.

The only problem is in a dropback over several calls; it would then be
necessary to have the callee place the information about what was saved and
where, so the caller could find it quickly.  Possibly this should be used
instead of an automatic restore on return.

> I do not regard this as a marginal point.  The exception is beginning to
> be used as a normal programming tool; it is a feature of several modern
> languages and will probably be a feature of most new ones.  Its efficient
> implementation is as desirable as the efficient implementation of, say,
> for-loops or array assignments.
> 
> Turning though to the main topic - which is the faster strategy for a
> normal call and return - I see two issues here: the number of registers
> to be saved and restored, and the cost of each save and restore.
> 
> 
> D. Some facts and guesses
> 
> In my experience, there is almost no difference between the number
> of registers used by the caller and the number used by the callee.
> 
| Small procedures tend to use fewer than less small, and leaf procedures
| tend to be a bit smaller, so on balance it seems marginally better for
| the callee to save. (What this also tells us is that interprocedural
| optimisation of leaves and leaf-callers only will give you big returns)
| 
| But this is outweighed by two factors
| 
| * The callee must save all registers it will use throughout the body;
|   the caller need save only the registers that are live at the point of
|   call.
| 
| * When two or more calls occur in succession, both callees must save,
|   but the caller need save only once.
| 
| Rough guesses I have accumulated over time are
| 
> * at any call point, caller is using ~ 2/3 of the registers it will
>   use at all (though this is partly due to defects in register
>   allocation strategies)
> 
> * on average, a procedure call is (almost) immediately followed by
>   another about 2/3 of the time. This implies that if the caller
>   saves, it will have to save ~ 45 times for every 100 calls.
> 
> These two factors together imply that the cost of a caller-saves protocol
> is about 1/3 that of a callee-saves protocol.  (Do you believe that?)
> 
No.  My experience is quite different.  The cases where I do this are because
the callee does not save; if I have a conditional subroutine call (not that
uncommon) it may be necessary to do a save before the call--the save is the
second call.

> Now consider the cost of a save and restore.  There are two factors that
> make it cheaper for the caller to save
> 
> * the register may be slaving a known value.  The caller then need not
>   save at all, merely restore.  I find this is true of at least 20% of
>   live registers. (Consider for instance the MC68020.  If you are working
>   hard, you probably have 5 or 6 live D registers, of which at least one
>   is holding a constant, and 4 or 5 live A registers, of which perhaps
>   two are holding pointers to data structures.  That's 3 out of 10)
> 
> * the store may be combined with another operation.  For example, the
>   last operation on the register may have been an add
> 
>     ADDL2 X,R1
> 
>   This can be changed into
> 
>     ADDL3 X,R1,save_place
> 
>   A small saving, admittedly, but a saving.
> 
> There is a third factor I have not assessed satisfactorily, which applies
> especially to RISC machines.  Is the caller or the callee better able to
> distribute loads and stores through the code, so as to overlap any load
> or store delays?  I suspect it is the callee, but that is a hunch.

This assumes that there are only a few registers.  Try this on a machine
with many registers, such as the CYBER 205 with 256 registers.  And consider
the problem on a machine with vector registers.  In most cases, most of the
vector registers will be in use; I believe no machine has nearly enough of
them.  A vector register is likely to be at least 4096 bytes; the number of
words depends on the word length.  The problem is that any convention may or
may not be right for the given application.

> 
> E. What about Hardware Help?
> 
> On the question of a "high-level procedure call" implemented in hardware,
> such as the VAX CALLS, I think my opinion is known: such instructions
> are worthless.  But what about a simpler instruction, such as a hardware
> register save, or save-under-mask? 

The VAX has register save (and restore) under mask.

			....................

Probably the best help that hardware can give is to have a "dirty" bit,
or a mask on call of what it may be necessary to save.  The bit is
cleared on saving and set again on restoring.  If a mask is used, the
subroutine would remove the bits corresponding to saved registers to
recompute its mask.  The problem with a mask is that if there are a large
number of registers, the mask is long.

It seems from the references above that the correspondent is ignoring the
saving of floating-point registers where they are separate.  They fall into
the same category.  Floating-point registers are much less likely than
pointers to be of the restore-only type.
> 
> F. Possibly Offensive Remark
> 
> I agree with Henry, that procedure calling protocols need to be thought
> afresh for each fresh machine.  Unfortunately, very few compiler shops
> seem prepared to do this.  I see over and again the model of a closed,
> downward-growing stack; caller pushes parameters; callee saves registers
> and allocates local space.  One can do better than this by a factor of
> two or three, on almost any machine I know, with a different model more
> amenable to local optimisation.

			......................

It is necessary to consider the problem for each machine.  And with planned
exceptions, and especially if an interrupt occurs (and we should also have
programmed interrupts on conditions, instead of having to test at each
occasion), I see no good alternative to having the callee save.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

pardo@june.cs.washington.edu (David Keppel) (11/04/88)

bcase@cup.portal.com (Brian bcase Case) writes:
>[ Amazing: DEC never gave instruction timings ]
>[ How are you supposed to write a good compiler? ]

In both defense of DEC and support of RISC designers, most CISCs have
great variability in timing *for a given instruction*.  Some relevant
considerations:

* Addressing modes
* Current state of the pipeline
* Cache hit/miss for instruction & memory operands
* State of the cache (if it is busy handling coherency it
  will respond slower on a hit)

Note, for example, that based on subtle variations in usage a single
80386 instruction may execute a factor of 3 faster/slower according to
the manual.  That doesn't count pipeline stalls, and the '386 doesn't
have any 3-memory-operand instructions.

One reason compiler-writers like RISCs is that you can *use* the
machine description meaningfully.

	;-D on  ( Anybody got a 7-operand instruction? )  Pardo
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo

baum@Apple.COM (Allen J. Baum) (11/04/88)

[]
>In article <7580@aw.sei.cmu.edu> firth@bd.sei.cmu.edu (Robert Firth) writes:
.
.
>F. Possibly Offensive Remark
>
>I agree with Henry, that procedure calling protocols need to be thought
>afresh for each fresh machine.  Unfortunately, very few compiler shops
>seem prepared to do this.  I see over and again the model of a closed,
>downward-growing stack; caller pushes parameters; callee saves registers
>and allocates local space.  One can do better than this by a factor of
>two or three, on almost any machine I know, with a different model more
>amenable to local optimisation.

Would you care to elaborate on how to gain a factor of two or three?
I'd like to see an example or two..

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

earl@wright (Earl Killian) (11/04/88)

In article <960006@hpcllla.HP.COM>, daryl@hpcllla (Daryl Odnert) writes:
>Some systems use a mixed strategy.  For example, both the HP Precision
>Architecture (HPPA) and the MIPS R2000 split up the register set into
>two partitions, a caller-saves set and a callee-saves set.

Most people don't realize it, but so does 4.3bsd on the VAX.  r0-r5
are caller saves, r6-r11 are callee-saves.  Some VAX compilers (but
not 4.3bsd cc!) take advantage of this, with good effect.
-- 

cprice@mips.COM (Charlie Price) (11/04/88)

In article <960006@hpcllla.HP.COM> daryl@hpcllla.HP.COM (Daryl Odnert) writes:

On callee saves versus caller saves for registers, Daryl gives the references:

>"Minimizing Register Usage Penalty at Procedure Calls" by Fred C. Chow
>of MIPS Computer Systems.  It is published in the "Proceedings of the
>SIGPLAN '88 Conference on Programming Language Design and Implementation"
>(pg 85-94.)
>
>"Register Windows vs. Register Allocation" by David W. Wall of
>DEC Western Research Lab.  Same conference proceedings (pg 67-78).

It is probably worth mentioning that these proceedings are published as:

SIGPLAN NOTICES
Volume 23, Number 7
July, 1988

-- 
Charlie Price    cprice@mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086

cantrell@Alliant.COM (Paul Cantrell) (11/04/88)

I'd like to make some minor comments on a really good article by Robert Firth
on register save procedures across procedure calls.

Having programmed several 680x0 systems where the callee saves registers, and
now working on a 680x0 architecture where the caller saves its own registers,
I've had the chance to program the same instruction set under both
conventions.

In article <7580@aw.sei.cmu.edu> firth@bd.sei.cmu.edu (Robert Firth) writes:
>First, and most important, if you are designing a professional-quality
>production compiler, this is the wrong question.  Such a compiler must
>perform interprocedural optimisation if it is to be respectably state
>of the art.  
>
>However, if you want to design a prototype, amateur, or deliberately
>low-cost compiler, the issue is probably one worth considering.  To
>keep this note short, I'm going to assume you understand the basic
>issue and are familiar with current hardware and software technology.

Well, you may have slightly overstated this - I'd guess that 98% of the
production quality compilers available today do not do interprocedural
optimizations. However, I agree that this is desirable.

>C. Which is more efficient?
>
>Happily, however, the efficiency arguments, in my experience, support
>the "caller saves" strategy, so one can indeed do well by doing good.
>
	[he goes on to describe the longjump case as being more efficient
	 when caller saves]

I would tend to ignore longjump since this is an infrequently used mechanism
compared to procedure calls in general. I think the efficiency of the basic
call/return is what needs to be looked at here. He strongly argues that I
shouldn't feel that way, but I'll leave it at that.

>In my experience, there is almost no difference between the number
>of registers used by the caller and the number used by the callee.
>
>Small procedures tend to use fewer than less small, and leaf procedures
>tend to be a bit smaller, so on balance it seems marginally better for
>the callee to save. (What this also tells us is that interprocedural
>optimisation of leaves and leaf-callers only will give you big returns)

Yes, this is one problem I have with caller saving - it substantially
increases the cost of calling small procedures that need very few registers.
The register save/restore done by the caller can easily outweigh the entire
cost of the procedure itself, if it is something simple like a queue
manipulation or an assembly language routine which gives you access to a
special instruction. I don't think the word 'marginal' applies here - from
doing code inspection I think this can account for a lot of wasted time. As
you point out, it simply argues strongly for interprocedural analysis. (An
obvious thing to do for such simple leaf procedures is to inline them, and
get rid of the procedure call overhead entirely).

A nasty side effect of our compiler (you could argue that this is simply
a bug in the register allocation, but I think it's a little more complicated
than that) is that for small C routines, adding 'register' statements may
actually slow the code down by causing many save/restores to be generated.
This obviously is impacted by where the variable is used, how often, and
where the procedure calls are in relation to usage of the register variable.
My only point is that the programmer expects that adding 'register' to
those variables which are used frequently should make his code run faster,
not slower. In the callee-saves convention, it is usually trivial for the
programmer to determine whether 'register' is called for - it is almost
certainly based on how many times he uses the variable within the procedure.
But for caller saves, it is almost impossible for him to tell.

>But this is outweighed by two factors
>
>* The callee must save all registers it will use throughout the body;
>  the caller need save only the registers that are live at the point of
>  call.
>
>* When two or more calls occur in succession, both callees must save,
>  but the caller need save only once.

From code inspection of typical C code, the first point doesn't seem to be
much of a win or loss. It's true that only the live registers need be saved
when the caller saves, but in 'good' C code there are typically enough
registers in use (if the compiler has done a decent job of register
allocation) that you always end up saving a large number of registers.

The second point that you can avoid multiple save/restores when you have
several procedure calls in a row is certainly true, but again, the code
inspection I have done shows that a fair amount of the time you end up
doing all the save/restoring on each one because of conditional branching
making the path through the calls unpredictable at compile time. However,
this sometimes can be a large win - I suspect that this is the single
largest reason that you can expect a performance gain with caller saving.

Anyway, here is a list of what I consider the pros and cons of caller saving
his own registers:

Pros:
	1) Avoids multiple save/restore operations across consecutive
	   procedure calls.

	2) Saved register state is local to owner, not buried on the stack
	   by the various called procedures.

	3) Only 'live' registers need be saved

	4) If a copy of the data exists and is easy to obtain, no save need
	   be done.

Cons:

	1) Often causes more saves than required when calling leaf procedures.
	   Since leaf procedures are small and calling them is the most common
	   operation, the penalty becomes large.

	2) Makes programs slightly larger. Instead of one copy of the register
	   save/restore, there has to be a copy at every invocation. This may
	   have performance impact because of cache size, main memory size.
	   However, Pro#1 may decrease the impact of this some.

	3) For assembly language programming, code may be slightly harder to
	   write and understand since determining which registers must be
	   saved/restored depends on how the thread of control can be affected
	   by conditional branching, etc. Typically, with the callee saves
	   convention, the registers would be saved/restored at entry/exit
	   time (I'm gonna get flamed on that one).

Conclusion:

Neither convention seems to be all that much better. I'd say that caller
saving has a slight edge performance-wise, while callee saving has a slight
edge in readability/maintainability (only if you are using assembly language).

I think interprocedural analysis would be enough of a win over either of these
two methods that it strongly argues for people to move in that direction.

					PC

hankd@pur-ee.UUCP (Hank Dietz) (11/05/88)

In article <1009@l.cc.purdue.edu>, cik@l.cc.purdue.edu (Herman Rubin) writes:
> In article <7580@aw.sei.cmu.edu>, firth@sei.cmu.edu (Robert Firth) writes:
...[stuff omitted]...
> > Which is better - caller saves or callee saves?
> > 
> > 
> > A. Is this the right question?
> 
> It is, if library subroutines are to be used.  A compiler cannot change
> the procedure to be followed in this case.
...[much more stuff omitted]...

This is actually a prime argument for caller saves.  The reason is simple:
the compiler optimizations, if constrained to be performed without changing
the code of the called (library) routine, can only be applied in the caller.
Hence, you want to do as much of the call processing as possible in the
caller, because that's the only way you have a shot at optimizing it.  It
turns out that this is actually near optimal, because even though the
compiler can't change the code inside called (library) routines, the
compiler can access summary information specifying things like which
registers are actually used in the called routine...  essentially the
"right" way to do it.

						-hankd@ee.ecn.purdue.edu

pardo@june.cs.washington.edu (David Keppel) (11/05/88)

cantrell@alliant.Alliant.COM (Paul Cantrell) writes:
>[ function inlining avoids leaf-register-allocation problems ]
>[ also saves procedure call costs! ]

If the leaf procedure is called often and the hardware has an I-cache,
then inlining the leaves may make the code *slower*.

At an extreme (cache-miss penalty is high and sequential instructions
are not prefetched), callee saves might win simply on the basis of
code and resulting miss-rate penalty.

	;-D on  ( The walking virus )  Pardo
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo

cprice@mips.COM (Charlie Price) (11/05/88)

In article <1009@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>In article <7580@aw.sei.cmu.edu>, firth@sei.cmu.edu (Robert Firth) writes:
>> 	Register Saving across Procedure Calls
>> 	--------------------------------------
>> Which is better - caller saves or callee saves?
>> 
>> A. Is this the right question?
>
>It is, if library subroutines are to be used.  A compiler cannot change
>the procedure to be followed in this case.

Not necessarily.

MIPS ships ucode libraries (an intermediate representation), and compiling C
with -O4 optimization uses these libraries to do interprocedural register
allocation that includes the library modules.  For some applications this
is worth doing.

On the other hand, what about shared-library routines with
dynamic linkage at runtime?
-- 
Charlie Price    cprice@mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086

hutchson@mozart.uucp (Stephen Hutcheson) (11/05/88)

In a previous incarnation, I was an interested party (compiler developer)
to a project that defined a calling sequence for a new architecture.  The
various arguments about code size and information available were bandied
about, to little effect.  The final decision was to let the caller save.
We had (and hoped to mix) code in several languages, including more than
the normal percentage of assembly-language code, some of it very crufty.
It was observed that if the caller saved his own context, the callee could
not easily foul it up.  The older architecture used callee-saves, and clever
callees often didn't save.  Half-clever callees didn't save as often as they
should have, and it had been a recurring problem.  The new "defensive driving"
approach would have that problem only within a subroutine; the old approach
had it across subroutines.

joe@modcomp.UUCP (11/05/88)

firth@bd.sei.cmu.edu (Robert Firth) writes:

> Which is better - caller saves or callee saves?
> [...]
> Which is more efficient?
> [...]
> The most blatant case is that of the longjump, which appears in other
> languages as a GOTO or RAISE statement.  This causes a jump out of a
> procedure to somewhere further up the call chain, and so must reset the
> environment of the destination.  If the caller saves state, then this is
> simple: the jump is a jump, and the destination knows where all the state
> has been saved.  In most implementations, one need only reset the frame
> pointer to the current incarnation of the destination procedure, and take
> the jump to the label.
> 
> But if the callee saves, then the caller has no idea how to recover his
> saved state, which may be buried any number of stack frames further down.
> It is therefore necessary to unwind the entire stack before taking the jump.
> The difference in cost can easily be a factor of 100 or more.
                                         ^^^^^^^^^^^^^^^^^^^^^

The callee-save implementations that I have seen all have a fast longjump
mechanism.  Typically, the setjmp(x) call saves (an adjusted version of) the
entire machine state in x, and longjmp(x) jumps simply by restoring that
state.  No attempt is made to depend on information which may or may not be
on the stack after the setjmp call.

Joe Korty		I'm suffering from a virus ...
uunet!modcomp!joe		... but my machine isn't.

aglew@urbsdc.Urbana.Gould.COM (11/06/88)

>I haven't seen anyone mention the idea of mixing caller-saves and callee-saves
>methods: the caller hands to the callee a mask of live registers; the callee
>ANDs this with a mask of the registers it uses and saves the registers whose
>corresponding bits are set.  The mask the callee hands to the routines it calls
>is the OR of these two masks. (I hope that's clear).
>
>	Amos Shapir				amos@nsc.com
>National Semiconductor (Israel) P.O.B. 3007, Herzlia 46104, Israel
>Tel. +972 52 522261  TWX: 33691, fax: +972-52-558322
>34 48 E / 32 10 N			(My other cpu is a NS32532)

There has been a paper on this; sorry, memory fails.

The best argument I heard against this sort of operation is that it makes
high performance instruction dispatch difficult - you can't dispatch 
instructions after the SAVE-MASK until the mask has been computed, because
you don't know which registers are used. In general, instructions that
do not have static register addressing imply an instruction dispatch stall.

This is, of course, not an issue on the current generation of
microprocessors, which do not really use any advanced instruction dispatch
techniques.

henry@utzoo.uucp (Henry Spencer) (11/06/88)

In article <3473@pt.cs.cmu.edu> bsy@PLAY.MACH.CS.CMU.EDU (Bennet Yee) writes:
>It's interesting to examine the ACIS implementation of longjmp/setjmp for
>the IBM RTs.  The standard procedure call convention is callee-save, and
>longjmp does NOT unwind the stack.  Contrast this with the Vaxen BSD
>implementation of longjmp/setjmp, which DOES unwind the stack.  Vaxen BSD,
>of course, uses callee-save too.  What is the difference?  Well, for the IBM
>RTs, your registers have the same values as when they returned from the
>setjmp.  On Vaxen, your registers have the same values as they had when you
>called the next function from within the same function that called the
>setjmp.  So depending on one or the other behaviour for your register
>variables is not safe.  It's a minor but significant semantic difference.
>[Anybody know what POSIX decided for this?]

X3J11, not POSIX, is the relevant group here.  And X3J11 has wimped out on
it, in a big way:  the values of *any* local variables (not just register
variables -- the compiler may be quietly promoting things into registers!)
that have changed since the setjmp are *indeterminate* after a longjmp,
unless the variables are declared "volatile".  Note, there is no guarantee
that you get *either* of the above cases!  The values may even be trash!

Some of us thought this was a damn stupid idea, since it invalidates
essentially every existing program that uses setjmp/longjmp, but we were
unable to convince X3J11 of this.  They claim that this is a "quality
of implementation" issue.

>Now, how to avoid unwinding the
>stack and still retain the same semantics?  It's actually not hard -- given
>that, for those ``other'' languages at least, GOTO and RAISE are part of the
>language, the compiler can just always save the contents of register
>variables before calling other functions _only for those functions that
>contain GOTO or RAISE_, and restore the registers variables when the
>exception occurs...
>
>Of course, it's hard to argue a similar case for C, since setjmp/longjmp is
>NOT part of the language...

Au contraire, it *is* part of the language.  Realistically, the major C 
library functions have to be considered part of the language.  Most every C
implementor has cursed this fact, since it puts significant constraints on
calling-sequence design.  X3J11 has put enough constraints on the invocation
of setjmp, in fact, that the compiler can do the same sorts of things as it
could for a language with built-in setjmp/longjmp.  This is important,
because it's the only sane way to handle setjmp/longjmp if you are doing
fancy register allocation and can't do a stack unwind.  (Fancy register
handling means non-register variables, which most users expect are safe,
are in fact in danger.  Stack unwinding relies on the stack layout being
either invariant or self-describing [pdp11 and vax respectively], and is
not possible with efficient calling sequences on most modern machines.)
-- 
The Earth is our mother.        |    Henry Spencer at U of Toronto Zoology
Our nine months are up.         |uunet!attcan!utzoo!henry henry@zoo.toronto.edu

csmith@mozart.uucp (Chris Smith) (11/07/88)

In article <7580@aw.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:

> Which is better - caller saves or callee saves?

Convex computers use caller saves; here are a few more observations to
toss in.


> B. Are the strategies equally sound?

> The point I consider most important, is that there is a definite
> semantic asymmetry between the two strategies.  If the caller saves,
> then the caller is saving, locally, his own local state.  

It's worth noting that this also gives the caller a chance to do a free
"context switch" of register contents -- a burst of register-memory
traffic is inevitable no matter who does the save, but putting it in the
caller allows him to capitalize on the opportunity to load up a different
-- more useful -- set of values after the call.


> Rough guesses I have accumulated over time are
>
> * at any call point, caller is using ~ 2/3 of the registers it will
>   use at all (though this is partly due to defects in register
>   allocation strategies)
>
> * on average, a procedure call is (almost) immediately followed by
>   another about 2/3 of the time. This implies that if the caller
>   saves, it will have to save ~ 45 times for every 100 calls.
>
> These two factors together imply that the cost of a caller-saves protocol
> is about 1/3 that of a callee-saves protocol.  (Do you believe that?)

Dynamically, our registers tend to be fuller than that -- they fill up at
the drop of a hat anyway, but if they don't, loop unrolling sees to it
that they do.  But this one:

> * the register may be slaving a known value.  The caller then need not
>   save at all, merely restore.  I find this is true of at least 20% of
>   live registers. 

operates very powerfully on a register machine, where loads are required
for every memory operand.  

All in all, on this machine and on *one* benchmark, the Fortran
validation tests, it looks like caller-saves is just under half the cost
of callee-saves.  (Counting only the register saves and restores.)


One other point: debuggers are up to hunting through saved frames to find a
variable allocated to R4, but when the variables flit around as they are
prone to do when the domain of register allocation is the intervals between
calls, it puts quite a strain on the debugger tables.

dave@micropen (David F. Carlson) (11/08/88)

In article <2557@munnari.oz>, jkjl@munnari.oz (John Lim) writes:
> 
> Not too important you might think, but i remember that M'soft used the pascal
> calling convention in Windows to save 5% (if i remember) of code, which is
> similar in principle to the caller/callee argument.
> 
> Luckily, this isn't so much of an issue when you arent confined to 640K
> of mem...
> 	
> 	john lim

I thought Microsoft did this because they had source to Apple's Lisa,
which is the object of current litigation, and the Lisa used Pascal calling
conventions because it was written in Pascal.  It made "duplicating" the
Apple easier, given the 1:1 transfer of their Macintosh software that
already used the Pascal conventions (M-Word, etc.).

Apple owns several Trademarks on the above.
Microsoft owns several Trademarks on the above.

-- 
David F. Carlson, Micropen, Inc.
micropen!dave@ee.rochester.edu

"The faster I go, the behinder I get." --Lewis Carroll

johnl@ima.ima.isc.com (John R. Levine) (11/08/88)

In article <578@micropen> dave@micropen (David F. Carlson) writes:
>In article <2557@munnari.oz>, jkjl@munnari.oz (John Lim) writes:
>> 
>> Not too important you might think, but i remember that M'soft used the pascal
>> calling convention in Windows to save 5% (if i remember) of code, ...
>
>I thought Microsoft did this because they had source to Apple's Lisa ...
>[which was written in Pascal.]

Probably not. A highly reliable source who wrote a lot of the Windows code
tells me that Windows is written in C and assembler. Considering that the MS C
compiler emits special code for Windows linkages and the Pascal compiler
doesn't, I believe him. The main difference between C and Pascal calling
sequences on an 8086 is that the C sequence has the caller pop the arguments
off the stack, while Pascal has the callee pop. There is a "return and pop N"
instruction that favors callee pop, thus a code saving. Naturally, if the
caller and callee disagree about the number of arguments passed, chaos ensues.
Pascal also passes arguments left to right while C passes them right to left;
except for the rare varargs function this makes no practical difference.

When I was working on Javelin, we also went from C to Pascal calling, and also
noticed about a 5% space saving. In that case, though, the compiler returned
pointer values in the ES:BX or BX, depending on size, which was a big win
because the BX register can be dereferenced directly, while the AX (the normal
value register) can't.

To return to the original point, before that we switched from Lattice C, which
was caller-save, to Wizard (father of Turbo) which was mixed, the callee
saving SI and DI and the caller saving anything else. The mixed convention
definitely saved a little space, though since the save and restore
instructions are only a byte apiece it wasn't much.
-- 
John R. Levine, IECC, PO Box 349, Cambridge MA 02238-0349, +1 617 492 3869
{ bbn | spdcc | decvax | harvard | yale }!ima!johnl, Levine@YALE.something
Disclaimer:  This is not a disclaimer.

srg@quick.COM (Spencer Garrett) (11/09/88)

In article <234@taux02.UUCP>, amos@taux02.UUCP (Amos Shapir) writes:
- I haven't seen anyone mention the idea of mixing caller-saves and callee-saves
- methods: the caller hands to the callee a mask of live registers; the callee
- ANDs this with a mask of the registers it uses and saves the registers whose
- corresponding bits are set.  The mask the callee hands to the routines it calls
- is the OR of these two masks. (I hope that's clear).

Problem is, the mask thus generated quickly tends toward all 1's if
it's saving any stores, and is equivalent to callee-saves if it isn't.

jjw@celerity.UUCP (Jim ) (11/10/88)

In article <6800006@modcomp> joe@modcomp.UUCP writes in response to
firth@bd.sei.cmu.edu (Robert Firth):
>The callee-save implementations that I have seen all have a fast longjump
>mechanism.  Typically, the setjmp(x) call saves (an adjusted version of) the
>entire machine state in x, and longjmp(x) jumps simply by restoring that
>state.  No attempt is made to depend on information which may or may not be
>on the stack after the setjmp call.

Of course, this is the mechanism which results in the situation described by
bsy@PLAY.MACH.CS.CMU.EDU (Bennet Yee) in article <3473@pt.cs.cmu.edu>:

> ... your registers have the same values as when they returned from the
>setjmp.  On Vaxen, your registers have the same values as they had when you
>called the next function from within the same function that called the
>setjmp.  So depending on one or the other behaviour for your register
>variables is not safe.  It's a minor but significant semantic difference.

This is more than a minor difference since optimizing compilers will keep any
variable in a register when necessary.

henry@utzoo.uucp (Henry Spencer) in article <17965@utzoo.uucp> points out
one solution:
>			... X3J11 has wimped out on
>it, in a big way:  the values of *any* local variables (not just register
>variables -- the compiler may be quietly promoting things into registers!)
>that have changed since the setjmp are *indeterminate* after a longjmp,
>unless the variables are declared "volatile".  Note, there is no guarantee
>that you get *either* of the above cases!  The values may even be trash!

As Henry Spencer stated, this "invalidates essentially every existing
program that uses setjmp/longjmp."

It is possible for the information saved in the "jump buffer" to indicate
where the register information is being saved on the stack so that it can be
restored by longjmp.  This is the mechanism used in the FPS Model 500.
There are some difficulties with signal handlers using longjumps since the
callee could be in the middle of saving the state when the signal occurs.

The use of setjmp does require that the compiler generate some additional
code on calls (especially if signal handlers can perform longjmp's) so there
are compiler optimization options which do not guarantee "non-volatile"
variable contents.

jjw@celerity.UUCP (Jim ) (11/10/88)

>In article <234@taux02.UUCP>, amos@taux02.UUCP (Amos Shapir) writes:
>I haven't seen anyone mention the idea of mixing caller-saves and
>callee-saves methods: the caller hands to the callee a mask of live
>registers ...

One "problem" with this is that managing and testing the bits in the mask
can cost more than just saving the registers.  This is definitely the case
in the FPS Model 500 where the registers can be saved on the register stack
at a cost of a machine cycle each.

This is what we refer to as a "smart cycle" problem -- you can spend more
cycles trying to be clever than it costs to do it the "dumb" way.

andrew@frip.gwd.tek.com (Andrew Klossner) (11/11/88)

>>> Which is better - caller saves or callee saves?
>>> Is this the right question?

>> It is, if library subroutines are to be used.  A compiler cannot change
>> the procedure to be followed in this case.

> This is actually a prime argument for caller saves.  The reason is simple:
> the compiler optimizations, if constrained to be performed without changing
> the code of the called (library) routine, can only be applied in the caller.

It is also a prime argument for callee saves.  When the library routine
is being compiled, the compiler optimizations are constrained to be
performed without changing the code of the calling routine.  The choice
between optimizing routines near the frontier of the call graph and
optimizing routines back toward the root of the graph should (IMHO) be
made in favor of the frontier.

  -=- Andrew Klossner   (uunet!tektronix!tekecs!frip!andrew)    [UUCP]
                        (andrew%frip.gwd.tek.com@relay.cs.net)  [ARPA]

hankd@pur-ee.UUCP (Hank Dietz) (11/12/88)

In article <10600@tekecs.TEK.COM>, andrew@frip.gwd.tek.com (Andrew Klossner) writes:
> >>> Which is better - caller saves or callee saves?
...
> >> It is, if library subroutines are to be used.  A compiler cannot change
> >> the procedure to be followed in this case.
>
> > This is actually a prime argument for caller saves.  The reason is simple:
> > the compiler optimizations, if constrained to be performed without changing
> > the code of the called (library) routine, can only be applied in the caller.
> 
> It is also a prime argument for callee saves.  When the library routine
> is being compiled, the compiler optimizations are constrained to be
> performed without changing the code of the calling routine.  The choice

Not true!  The fact that while compiling the library routines you can't
change the callers is not an argument for callee saves.

The reason is that no knowledge of the caller is available when compiling
the callee (the callee being the library routine), whereas complete info is
available about the callee when compiling the caller.  It is the
availability of information about the routine you can't change which makes
the best optimizations possible, and since library routines are generally
callee routines which predate the callers, you can't win with callee saves.

							-hankd

chris@mimsy.UUCP (Chris Torek) (11/13/88)

In article <196@celerity.UUCP> jjw@celerity.UUCP (Jim) writes:
>It is possible for the information saved in the "jump buffer" to indicate
>where the register information is being saved on the stack so that it can be
>restored by longjmp.  This is the mechanism used in the FPS Model 500.

This is probably the best approach.  It does affect optimisation,
obviously, but at least at the moment, calls to setjmp() are rare;
not too many functions should have much trouble here.

>There are some difficulties with signal handlers using longjumps since the
>callee could be in the middle of saving the state when the signal occurs.

This is not hard to solve:  Save all registers in memory before entry 
to setjmp(); have setjmp() note (in some fashion) when it is done saving
state; and have longjmp() check to be sure the state is done, and if not,
restore nothing.  The last will succeed since there are no live registers
around the setjmp call itself.

>The use of setjmp does require that the compiler generate some additional
>code on calls (especially if signal handlers can perform longjmp's) so there
>are compiler optimization options which do not guarantee "non-volatile"
>variable contents.

This is certainly acceptable.

The situation is much worse in GCC, which decides which variables
should be placed in registers, and adheres to the letter of X3J11 by
guaranteeing only volatile variables.  Under `-traditional', GCC
attempts to accept old PCC-based code, to the extent of turning off the
`volatile' keyword, but NOT to the extent of not promoting variables
into registers, so there is no way to guarantee any local variable!
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

mangler@cit-vax.Caltech.Edu (Don Speck) (11/17/88)

In article <10723@cup.portal.com>, bcase@cup.portal.com (Brian bcase Case) writes:
> Oh, yeah, I forgot:  smaller code is faster code.  Sigh.

On a machine with only a 4KB direct-mapped cache, this is TRUE.
My vax-750's spend 30% of their cycles in MEM STALL.  (The back
cover has a chart that shows where to find this pin).  That's
about one cache miss per instruction!

Has anybody tried putting larger cache RAM chips into a vax-750?