[comp.arch] register save

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/05/91)

  Here's an interesting thought. I have seen good arguments which say
that the program called should save registers because it knows which
ones it will use. I've seen a less persuasive argument that the caller
should do the save because it knows what must be saved. It came to me
that perhaps there's a way to "split the difference" and save only the
set which the caller and called routine agree must be saved.

  Consider a call instruction which includes a bitmapped register set.
I would include all those which will be used before reloading. Then
assume a "save as needed" instruction which takes a bitmap of all the
registers which must be used by the called routine. The registers which
were in both sets would be saved, possibly by a restore mask which would
enumerate which registers had been saved so they could be popped again.
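
  As a minimal C sketch of the intended arithmetic (the 16-register
width and all the names here are illustrative only, not any real
instruction set):

	typedef unsigned short regmask;	/* one bit per register; 16 assumed */

	/* The CALL carries the set of registers the caller still needs
	   after the return; "save as needed" carries the set the called
	   routine will clobber.  Only the intersection is pushed, and
	   the same mask drives the restore on return. */
	regmask save_as_needed(regmask caller_live, regmask callee_uses)
	{
		regmask to_save = caller_live & callee_uses;
		/* hardware would push one register per set bit here */
		return to_save;		/* kept around as the restore mask */
	}

	/* e.g. save_as_needed(0x00C3, 0x0183) pushes (and later pops)
	   only 0x0083: registers both live and about to be clobbered. */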

  Oh heavens, a CISC instruction! But the time to execute should be one
clock per register saved, at least as fast as saving all the registers
every time.

  While we're doing heresy, let's consider having the CPU able to
remap the registers on the fly. If you call it writable control store
it's CISC, and therefore bad. If you call it register sets it's RISC
and it's good. At any rate, in every case where N registers are
needed and there are N or more registers which the caller doesn't need
saved, zero registers have to be written to memory. And the one design
rule common to hardware and software is "it's quicker to not do it."

  I think interrupts fall out if you set all the "caller" bits and save
everything, or you could use other logic completely.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
  "I'll come home in one of two ways, the big parade or in a body bag.
   I prefer the former but I'll take the latter" -Sgt Marco Rodrigez

chased@rbbb.Eng.Sun.COM (David Chase) (03/05/91)

davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>  Here's an interesting thought. I have seen good arguments which say
>that the program called should save registers because it knows which
>ones it will use. I've seen a less persuasive argument that the caller
>should do the save because it knows what must be saved.

With a "good enough" compiler, the caller might profitably do the save
because the caller has the opportunity to optimize saves and restores.
For example, you might have a subroutine call in a loop. For some
registers, saves and restores are loop-invariant, meaning "these
registers are not used in the loop".  For another example, if there's
a series of calls, then the saves and restores might be redundant --
that is, not all registers are used between calls.
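
A C caricature of the loop-invariant case, writing the register spill
as an explicit stack-slot assignment (f() and all the names are
invented for illustration):

	extern void f(int);

	void caller(int n)
	{
		int x = 42;	/* imagine x living in a caller-saves register */
		int slot;	/* its memory home */
		int i;

		/* naive caller-saves: spill and reload around every call */
		for (i = 0; i < n; i++) {
			slot = x;	/* save */
			f(i);
			x = slot;	/* restore */
		}

		/* but x is not used inside the loop, so the save and
		   restore are loop-invariant and can move outside it: */
		slot = x;
		for (i = 0; i < n; i++)
			f(i);
		x = slot;
	}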

Just another reason why caller-saves might be worthwhile.

David Chase
Sun

jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) (03/05/91)

The combination of a decent compiler and library mechanism lets you do
optimal register saves fairly easily.  Let the compiler record, for each
routine it puts in the library, what registers the routine uses.  When
you compile a call to that routine, the register allocator can try to
avoid registers the routine uses, and when it uses one, it can generate
the needed saves and restores.
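
In C terms, the library side amounts to a table like this (format and
names purely illustrative), consulted by the allocator at each call site:

	/* what the library might record for each routine it contains */
	struct lib_entry {
		const char *name;
		unsigned    clobbers;	/* one bit per register */
	};

	static const struct lib_entry libmap[] = {
		{ "sqrt",   0x0007 },	/* say it clobbers r0-r2 */
		{ "printf", 0x00FF },	/* say it clobbers r0-r7 */
	};

	/* only registers both live in the caller and clobbered by the
	   callee need any save/restore code generated at the call site */
	unsigned saves_needed(unsigned caller_live, const struct lib_entry *e)
	{
		return caller_live & e->clobbers;
	}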

This works trivially within a single compilation unit (to use Ada
terminology), and it works with library routines in languages where
you can force recompilation of all users of a routine after it is
recompiled (again, the Ada rules work here).

This approach causes problems for old-fashioned languages like C, the
separate compilation rules of which are essentially the same as those
of FORTRAN (save your flames, I don't mind using C when I have to).

					Doug Jones
					jones@cs.uiowa.edu

khb@chiba.Eng.Sun.COM (Keith Bierman fpgroup) (03/05/91)

In article <4718@ns-mx.uiowa.edu> jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) writes:

... This (ed. saving register use info) approach causes problems for
    old-fashioned languages like C, the 
   separate compilation rules of which are essentially the same as those
   of FORTRAN (save your flames, I don't mind using C when I have to).

It also presents problems with various forms of dynamic linking ....
you don't necessarily know that the "universe" has changed until you
are in the middle of execution.... not that I'm against milking static
analysis for all that it's worth ;>
--
----------------------------------------------------------------
Keith H. Bierman    kbierman@Eng.Sun.COM | khb@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33			 | (415 336 2648)   
    Mountain View, CA 94043

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/05/91)

In article <9047@exodus.Eng.Sun.COM> chased@rbbb.Eng.Sun.COM (David Chase) writes:
>Just another reason why caller-saves might be worthwhile.

A rather long analysis of caller/callee/dataflow/hybrid register saving
appears in one of the recent Software Practice & Experience issues.
From my very brief skim there seemed little (by my definition <20%)
difference between many of the methods (apart from the dataflow
algorithm used in one of the methods taking bulk cycles for little or no
return -- but dataflow is _always_ part of your compiler, right?). This
was all done on a 68k.

-kym

hrubin@pop.stat.purdue.edu (Herman Rubin) (03/05/91)

In article <9047@exodus.Eng.Sun.COM>, chased@rbbb.Eng.Sun.COM (David Chase) writes:
> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
> >  Here's an interesting thought. I have seen good arguments which say
> >that the program called should save registers because it knows which
> >ones it will use. I've seen a less persuasive argument that the caller
> >should do the save because it knows what must be saved.
> 
> With a "good enough" compiler, the caller might profitably do the save
> because the caller has the opportunity to optimize saves and restores.
> For example, you might have a subroutine call in a loop. For some
> registers, saves and restores are loop-invariant, meaning "these
> registers are not used in the loop".  For another example, if there's
> a series of calls, then the saves and restores might be redundant --
> that is, not all registers are used between calls.
> 
> Just another reason why caller-saves might be worthwhile.

Most of the situations I have seen are of two types; either the subroutine
has few instructions, in which case it is likely to use few registers,
and these probably should be inlined; or the subroutine does a lot of
computing, in which case register save/restore, while not trivial, is
at least not a major part of the cost.  It is true that there are cases
of convoluted subroutines in which it is desirable to have many registers
available, although few will be used, but, except when a program is 
essentially doing, or has recently done, some sort of context switch,
many registers will be in use.

In addition, an important problem is that of conditional subroutine
calls, including "planned exceptions."  This is definitely handled 
best by having the callee do the saving.

Other articles have discussed ways of decreasing the amount of work
needed for the saving, such as bit arrays to indicate which registers
are busy, and dynamic renaming of registers.  There are also hardware
methods proposed.  But in too many situations I have seen, callee
saving is the most expensive way.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/06/91)

In article <7317@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:

| Most of the situations I have seen are of two types; either the subroutine
| has few instructions, in which case it is likely to use few registers,
| and these probably should be inlined; or the subroutine does a lot of
| computing, in which case register save/restore, while not trivial, is
| at least not a major part of the cost.  

  You're right, but there are cases which fall into neither category:
procedures which do a lot more work (and dirty more registers) for some
input data than for others. You can argue that the complex cases should
be handled by
another level of procedure calls, but explicit coding or compiler
inlining may prevent this.

  Given the case where paths of differing complexity occur, the "save if
needed" instruction might be used several times:

	SIN	1,2		; save regs 1,2 only if the caller flagged them
	CMP	FP[1],#2	; examine the argument
	BLE	TRIVIAL		; the cheap path dirties nothing more
	SIN	4,5,7,8,9	; complex path: save the rest on demand
COMPLX:
	[ ... ]

  Yes, I assume that the cost of the instruction is inherently low,
perhaps, with pipelining, zero if no registers are saved. Still, it's an
interesting thought, particularly if the "save" allows remapping of
registers rather than actual transfer to memory or cache.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
  "I'll come home in one of two ways, the big parade or in a body bag.
   I prefer the former but I'll take the latter" -Sgt Marco Rodrigez

ddean@rain.andrew.cmu.edu (Drew Dean) (03/06/91)

In article <3219@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>  Consider a call instruction which includes a bitmapped register set.
>I would include all those which will be used before reloading. Then
>assume a "save as needed" instruction which takes a bitmap of all the
>registers which must be used by the called routine. The registers which
>were in both sets would be saved, possibly by a restore mask which would
>enumerate which registers had been saved so they could be popped again.
>
>  Oh heavens, a CISC instruction! But the time to execute should be one
>clock per register saved, at least as fast as saving all the registers
>every time.

The VAX did this in its CALLS instruction.  Many people would rank this
instruction right up there with POLY for usefulness. :-) Levy & Eckhouse,
_Computer Programming and Architecture: The VAX_, 2nd Ed. says this on page
213: "On the other hand, CALLS, which accounts for only 2.54 percent of the
instructions, accounts for 21.59 percent of the time!" [emphasis theirs].
The surrounding text indicates this is a VAX 11/780, running the VMS Fortran
compiler in 1982, and the measurements were made with a hardware monitor.
Most functions don't need the complexity of CALLS (note that it also sets up
a frame pointer (or two?), and does a few other things as well).

Good idea, but experience is against it...oh well....
-- 
Drew Dean
Drew_Dean@rain.andrew.cmu.edu
[CMU provides my net connection; they don't necessarily agree with me.]

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/06/91)

In article <12234@pt.cs.cmu.edu> ddean@rain.andrew.cmu.edu (Drew Dean) writes:

| The VAX did this in its CALLS instruction.  

  Using this as an example of saving some registers is somewhat like
counting the boiler of a steam engine as overhead on the whistle. CALLS
set up stack frames, saved a bunch of stuff unconditionally, etc. 

  Also, CALLS saves all the registers named, even if the target doesn't
use them. This is not at all like the idea I proposed, which would save
only the set which (a) the caller wanted saved and (b) the target was
going to use.

  It's really not the same thing at all.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
  "I'll come home in one of two ways, the big parade or in a body bag.
   I prefer the former but I'll take the latter" -Sgt Marco Rodrigez

npw@eleazar.dartmouth.edu (Nicholas Wilt) (03/07/91)

In article <3219@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>  Consider a call instruction which includes a bitmapped register set.
>I would include all those which will be used before reloading. Then
>assume a "save as needed" instruction which takes a bitmap of all the
>registers which must be used by the called routine. The registers which
>were in both sets would be saved, possibly by a restore mask which would
>enumerate which registers had been saved so they could be popped again.
>
>  Oh heavens, a CISC instruction! But the time to execute should be one
>clock per register saved, at least as fast as saving all the registers
>every time.

The Motorola 68881 architecture's FMOVEM instruction has a mechanism to
do something like this.  I've never heard of anyone using it, though.

--Nick
  npw@eleazar.dartmouth.edu

cet1@cl.cam.ac.uk (C.E. Thompson) (03/07/91)

In article <12234@pt.cs.cmu.edu> ddean@rain.andrew.cmu.edu (Drew Dean) writes:
>In article <3219@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>>  Consider a call instruction which includes a bitmapped register set.
>>I would include all those which will be used before reloading. Then
>>assume a "save as needed" instruction which takes a bitmap of all the
>>registers which must be used by the called routine. The registers which
>>were in both sets would be saved, possibly by a restore mask which would
>>enumerate which registers had been saved so they could be popped again.
>> ...
>
>The VAX did this in its CALLS instruction.  ...
> {Diatribe (mostly deserved) against the CALLS instruction omitted}

CALLS/CALLG use only a callee-specified register mask: it is located in 
the callee's code. I understood Bill Davidsen to be proposing an instruction
that combined a caller-specified mask and a callee-specified mask.

By the way, it is not clear that the notorious overheads of CALLS/CALLG are 
due to the mask-driven register storing at all: they do many (far too many) 
other things as well. Nor are mask-specified load/stores in themselves 
particularly CISCy --- at least one nominally RISC processor (Acorn's ARM)  
uses them.

Chris Thompson
JANET:    cet1@uk.ac.cam.phx
Internet: cet1%phx.cam.ac.uk@nsfnet-relay.ac.uk

jbuck@galileo.berkeley.edu (Joe Buck) (03/07/91)

In article <3219@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
|> >  Consider a call instruction which includes a bitmapped register set.
|> >I would include all those which will be used before reloading. Then
|> >assume a "save as needed" instruction which takes a bitmap of all the
|> >registers which must be used by the called routine. The registers which
|> >were in both sets would be saved, possibly by a restore mask which would
|> >enumerate which registers had been saved so they could be popped again.
|> >
|> >  Oh heavens, a CISC instruction! But the time to execute should be one
|> >clock per register saved, at least as fast as saving all the registers
|> >every time.

In article <12234@pt.cs.cmu.edu>, ddean@rain.andrew.cmu.edu (Drew Dean) writes:
|> The VAX did this in its CALLS instruction.  Many people would rank this
|> instruction right up there with POLY for usefulness. :-) Levy & Eckhouse,
|> _Computer Programming and Architecture: The VAX_, 2nd Ed. says this on page
|> 213: "On the other hand, CALLS, which accounts for only 2.54 percent of the
|> instructions, accounts for 21.59 percent of the time!" [emphasis theirs].
|> The surrounding text indicates this is a VAX 11/780, running the VMS Fortran
|> compiler in 1982, and the measurements were made with a hardware monitor.
|> Most functions don't need the complexity of CALLS (note that it also sets up
|> a frame pointer (or two?), and does a few other things as well).

Dave Patterson (who invented the acronym "RISC") said semi-jokingly that
the whole RISC movement started as a rebellion against the kind of thinking
that produced the VAX CALLS instruction.

One of the worst inefficiencies about CALLS is that the register save mask
isn't saved in the instruction word itself; instead, the save mask appears
as a 16-bit quantity stored at the beginning of the function to be called.
This means that the machine must figure out at runtime, right in the critical
path for any pipelining you might want to do, information that is known at
compile time -- it must read back this word from external memory, then
parse it, then save the registers it indicates.

In Hennessy and Patterson's book, they stress the rule that you avoid
making the processor figure out at runtime any information that is already
known at compile time (in this case, what registers to save).


--
Joe Buck
jbuck@galileo.berkeley.edu	 {uunet,ucbvax}!galileo.berkeley.edu!jbuck	

hrubin@pop.stat.purdue.edu (Herman Rubin) (03/07/91)

In article <12234@pt.cs.cmu.edu>, ddean@rain.andrew.cmu.edu (Drew Dean) writes:
> In article <3219@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
> >  Consider a call instruction which includes a bitmapped register set.
> >I would include all those which will be used before reloading. Then
> >assume a "save as needed" instruction which takes a bitmap of all the
> >registers which must be used by the called routine. The registers which
> >were in both sets would be saved, possibly by a restore mask which would
> >enumerate which registers had been saved so they could be popped again.
> >
> >  Oh heavens, a CISC instruction! But the time to execute should be one
> >clock per register saved, at least as fast as saving all the registers
> >every time.
> 
> The VAX did this in its CALLS instruction.  Many people would rank this
> instruction right up there with POLY for usefulness. :-) Levy & Eckhouse,
> _Computer Programming and Architecture: The VAX_, 2nd Ed. says this on page
> 213: "On the other hand, CALLS, which accounts for only 2.54 percent of the
> instructions, accounts for 21.59 percent of the time!" [emphasis theirs].
> The surrounding text indicates this is a VAX 11/780, running the VMS Fortran
> compiler in 1982, and the measurements were made with a hardware monitor.
> Most functions don't need the complexity of CALLS (note that it also sets up
> a frame pointer (or two?), and does a few other things as well).

The relevant question is, "How bad is this from the standpoint of the possible
alternatives?"  In a subroutine call, the following is involved:

	Save the system state register and set up the new one.
	Save the modified program counter.
	Save those registers which should be saved.
	Save any other information needed to restore.

On return, this needs to be reversed.

So we have in the neighborhood of a half-dozen to more than a dozen things to
do.  In addition, we are likely to have cache failures.  I do not find it 
at all unusual that this instruction averages 10 times as long as other
instructions.

Now some of this could have been saved IF the compiler could have used the
jump to subroutine or branch to subroutine for subroutines in the same 
compilation, but this would have required more ingenuity in the compiler,
and might not have saved that much for non-simple subroutines.  Probably
more would have been saved by inlining small subroutines; abs (or even
better fabs) would have taken even less space to inline than to issue a
call.  However, library functions not inlined will always have separate
compilation.

But a subroutine call is not likely to be cheap if any register saving is
involved.  Nor will it be any cheaper if 10 instructions are used rather
than one 10 times as slow.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

sef@kithrup.COM (Sean Eric Fagan) (03/07/91)

In article <7403@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>The relevant question is, "How bad is this from the standpoint of the possible
>alternatives?"  In a subroutine call, the following is involved:
>
>	Save the system state register and set up the new one.
>	Save the modified program counter.
>	Save those registers which should be saved.
>	Save any other information needed to restore.
>
>On return, this needs to be reversed.
>
>So we have in the neighborhood of a half-dozen to more than a dozen things to
>do.  In addition, we are likely to have cache failures.  I do not find it 
>at all unusual that this instruction averages 10 times as long as other
>instructions.

Why, Herman?  Don't you like complex instructions?

One of the reasons the vax calls instruction is so slow is, as was pointed
out by another poster, that, in addition to what you describe, it also goes
out and fetches 16 bits at the specified address, uses that to determine
which registers to save, and then jumps to address + 2 bytes.  Ugh.  Too many
memory references for my taste.  Note, also, that there is no reason why you
should save the status register ("condition codes"); in fact, you may want
to use that (if present) to determine what you should do.  (E.g.,

	Boolean foo(integer);
	...
	if (foo(10) = True) then ... ...

could become

	push	10
	call	foo
	b.false	wherever
	...
No?)

The method which started this thread, also proposed about
this time last year, is that you do something like:

	call	foo, 0x1234

where foo does

	save	0x3

and the registers saved are described by the bitmask from 0x1234&0x3.  Now,
what do you do, however, when you call bar, which uses a mask of 0x1230?
foo does a

	call	bar, 0x3

and the resultant mask is 0, thus no registers are saved, *despite* bar
trashing some of foo's caller's registers.
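
Traced as a little C program over the masks above (pure arithmetic,
no claim about real hardware):

	#include <stdio.h>

	/* registers actually pushed: caller's mask ANDed with callee's */
	static unsigned saved(unsigned caller_mask, unsigned callee_mask)
	{
		return caller_mask & callee_mask;
	}

	int main(void)
	{
		printf("foo saves %04x\n", saved(0x1234, 0x0003)); /* 0000 */
		printf("bar saves %04x\n", saved(0x0003, 0x1230)); /* 0000 */
		/* yet bar clobbers 0x1230, which overlaps the 0x1234 that
		   foo's caller wanted preserved -- nobody saved them */
		return 0;
	}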

Any thoughts on that one?

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

ddean@rain.andrew.cmu.edu (Drew Dean) (03/07/91)

In article <7403@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>In article <12234@pt.cs.cmu.edu>, ddean@rain.andrew.cmu.edu (Drew Dean) writes:
[discussion about VAX CALLS instruction and what a procedure call does deleted]
>But a subroutine call is not likely to be cheap if any register saving is
>involved.  Nor will it be any cheaper if 10 instructions are used rather
>than one 10 times as slow.

Instead of "is not likely to be cheap", let's see what actual measurement
gives us.  I quote  Hennessy & Patterson, p. 137:

3.11 [30] <3.7, 3.9> Find a machine that has a powerful instruction set
feature, such as the CALLS instruction on the VAX.  Replace the powerful
instruction with a simpler sequence that accomplishes what is needed.
Measure the resultant running time.  How do the two compare?  Why might
they be different?  In the early 1980s, engineers at DEC performed a quick
experiment to evaluate the impact of replacing CALLS. They found a 30%
improvement in the run time on a very call-intensive program when the CALLS
was simply replaced (parameters remained on the stack).  How do your results
compare?

Also, under the first Pitfall of Ch. 3, (p. 125) we find a long discussion
of the CALLS instruction, which I omit for brevity, but quote "The vast
majority of calls in real programs do not require this amount of overhead."
[Any typos are mine, and I apologize....]

Can someone give a reference to Patterson's finding that most procedures are
relatively "simple" ?  (ie. I remember <= 6 local vars, and about the same
number of parameters dominates usage in real code.)  If you find, after
simulation, that a lot of time is being spent saving & restoring registers
from the stack, then investigate mechanisms like register windows.

I believe the biggest lesson from all of this is to build instruction sets
that compilers can use to the best advantage, be it a RISC or CISC design.
Anyone want to simulate a 32-bit Lilith and see what things look like ?
-- 
Drew Dean
Drew_Dean@rain.andrew.cmu.edu
[CMU provides my net connection; they don't necessarily agree with me.]

srg@quick.com (Spencer Garrett) (03/07/91)

In article <3219@crdos1.crd.ge.COM>,
	davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:

>   Consider a call instruction which includes a bitmapped register set.
> I would include all those which will be used before reloading. Then
> assume a "save as needed" instruction which takes a bitmap of all the
> registers which must be used by the called routine. The registers which
> were in both sets would be saved, possibly by a restore mask which would
> enumerate which registers had been saved so they could be popped again.

Alas, this approach sounds good, and would work well for very shallow
calling trees, but it degenerates to callee-saves if the calling
tree has any depth to it.  Consider:

	a) if you are successful in avoiding register saves, then
	the registers-in-use mask will tend toward all ones, and
	every register needed by the called routine will get saved.

	b) if the bitmask isn't filling up, then you aren't
	postponing any saves, and the technique isn't buying
	you anything.

You could get a lot more performance from this technique if you
could introduce some hysteresis into the process (ie - don't restore
a register until you get back to a routine that uses it), but
this would be a nightmare to implement, since registers would have
to be saved and restored from different stack frames.

The studies I have seen have not shown a huge difference in performance
between caller-saves and callee-saves *across a broad range of applications*.
There are, of course, specific instances in which either is clearly
much better than the other.  In general, callee-saves is easier to
implement, but caller-saves offers more potential for optimization.
I am unaware of any systems which are purely one or the other,
although most come very close to a callee-saves convention.

avg@hq.demos.su (Vadim Antonov) (03/07/91)

>In article <9047@exodus.Eng.Sun.COM> chased@rbbb.Eng.Sun.COM (David Chase) writes:
>>Just another reason why caller-saves might be worthwhile.

Do not forget that a register save instruction can enjoy burst mode
for moving register blocks to/from memory. IMHO that's the only reason
to introduce such instructions. A reasonable cache and instruction
prefetching eliminate any need for specialised frame saving instructions -
a sequence of PUSHs / POPs works fine if you have no blocked CPU<-->memory
transfers.

Vadim Antonov
DEMOS, Moscow, USSR

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/07/91)

In article <11727@pasteur.Berkeley.EDU> jbuck@galileo.berkeley.edu (Joe Buck) writes:

| In Hennessy and Patterson's book, they stress the rule that you avoid
| making the processor figure out at runtime any information that is already
| known at compile time (in this case, what registers to save).

  Your conclusion would be valid if your supposition was correct. The
original posting indicated that you would only save the registers which
the target would modify and the caller wanted saved. To have a map
at compile time of which registers are used in every called routine is
just within the limits of the current state of the art, i.e. we know how to
do it, but don't usually. To preserve that without recompiling every
calling routine when a target is recompiled is not practical in terms of
development time vs. time saved.

  Doing the operation at run time using modern design should take
    1 cycle in the call to set the bitmap
      (since this can overlap the cycle to do the call this is "free")
    1 cycle for the AND of the caller and target mask, or ENTER
    1 cycle per register to do the saves
      (this could be burst mode or something and would be at least
       as fast as the individual instructions)

  Since my proposal adds one cycle for each call, it would have to find
one register on the average which did not need to be saved. Looking at
some SPARC code I believe this is likely. A smart chip design might be
able to hide the cycle for instruction overhead, but I counted it
because I'm not positive it can be done.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
  "I'll come home in one of two ways, the big parade or in a body bag.
   I prefer the former but I'll take the latter" -Sgt Marco Rodrigez

peter@ficc.ferranti.com (Peter da Silva) (03/07/91)

In article <1991Mar6.163348.13003@dartvax.dartmouth.edu> npw@eleazar.dartmouth.edu (Nicholas Wilt) writes:
> The Motorola 68881 architecture's FMOVEM instruction has a mechanism to
> do something like this.  I've never heard of anyone using it, though.

I understand it can be used to implement a fast memmov() routine.
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

kers@hplb.hpl.hp.com (Chris Dollin) (03/07/91)

Joe Buck writes:

   One of the worst inefficiencies about CALLS is that the register save mask
   isn't saved in the instruction word itself; instead, the save mask appears
   as a 16-bit quantity stored at the beginning of the function to be called.
   This means that the machine must figure out at runtime, right in the 
   critical path for any pipelining you might want to do, information that is
   known at compile time -- it must read back this word from external memory, 
   then parse it, then save the registers it indicates.

I may be missing something ... but what makes you think that the registers used
by a called routine are known to the caller at *compile-time*?

I suppose that one could arrange for that information to be made available to
the linker, which would then patch up all the calls so that they'd know which
registers to save, but you still have to deal with procedure variables.

[In any case, it would seem sensible to decouple register saving from procedure
calling, as indeed seems to be the case with RISC architectures.]



--

Regards, Kers.      | "You're better off  not dreaming of  the things to come;
Caravan:            | Dreams  are always ending  far too soon."

peter@ficc.ferranti.com (Peter da Silva) (03/07/91)

> and the registers saved are described by the bitmask from 0x1234&0x3.  Now,
> what do you do, however, when you call bar, which uses a mask of 0x1230?
> foo does a

> 	call	bar, 0x3

> and the resultant mask is 0, thus no registers are saved, *despite* bar
> trashing some of foo's caller's registers.

You could solve this by keeping running track of "dirty" registers, which
get set by "call" and cleared by "save".

Let's see how this works here:

You have to have "call" or-in the mask to be used, and have "save" clear
the bits of the registers saved. That way any bits left over from foo's
caller would still be set, unless foo has already saved them.

Looks like it'd work. You could even use established VM techniques to
decide when to save registers, but that's probably overkill.
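
A C sketch of my reading of that (the running "dirty" mask; all the
names are invented):

	#include <stdio.h>

	static unsigned dirty;	/* registers live but not yet saved by anyone */

	static void call(unsigned caller_mask)	/* "call" ORs its mask in */
	{
		dirty |= caller_mask;
	}

	static unsigned save(unsigned callee_mask)
	{
		unsigned pushed = dirty & callee_mask;
		dirty &= ~pushed;	/* now safely on the stack; the matching
					   return would OR 'pushed' back in */
		return pushed;		/* doubles as the restore mask */
	}

	int main(void)
	{
		call(0x1234);		/* main calls foo */
		printf("foo saves %04x\n", save(0x0003));	/* 0000 */
		call(0x0003);		/* foo calls bar  */
		printf("bar saves %04x\n", save(0x1230));	/* 1230 */
		return 0;
	}

So bar now saves exactly the 0x1230 it would otherwise trash, because
main's unsaved bits are still sitting in the dirty mask.
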
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/08/91)

In article <1991Mar07.014907.11081@kithrup.COM> sef@kithrup.COM (Sean Eric Fagan) writes:

| and the registers saved are described by the bitmask from 0x1234&0x3.  Now,
| what do you do, however, when you call bar, which uses a mask of 0x1230?
| foo does a
| 
| 	call	bar, 0x3
| 
| and the resultant mask is 0, thus no registers are saved, *despite* bar
| trashing some of foo's caller's registers.
| 
| Any thoughts on that one?

  I had several, most of which revolved around a "preserved registers"
register. Unfortunately I think that will have to be saved, so it's not
a viable solution.

  I think the answer is that any procedure which calls another procedure
will have to assume saving all registers at its entry point, and
therefore the caller's save mask would be used. Where this becomes a win
is that the leaf procedures will not have to do that, and they get
called the most.

  All this got me thinking that another bit of memory access which
could be saved on a call is the return address. By putting it into a
(dedicated) register at call time the leaf procedures would avoid
several memory accesses, and the CALL instruction would be simpler,
since the IC would just go into the RETADR register and the stack
pointer would not be changed. That, in turn, leaves the ALU free on the
instruction, which clever software might find useful.

  We had this on the GE/Honeywell 36 bit machines. The common call
instruction was TSXn, which left the return address in index register n.
Another potential time saver.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

preston@ariel.rice.edu (Preston Briggs) (03/08/91)

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:
>>   Consider a call instruction which includes a bitmapped register set.
>> I would include all those which will be used before reloading. Then
>> assume a "save as needed" instruction which takes a bitmap of all the
>> registers which must be used by the called routine. The registers which

srg@quick.com (Spencer Garrett) writes:
>Alas, this approach sounds good, and would work well for very shallow
>calling trees, but it degenerates to callee-saves if the calling
>tree has any depth to it.

Consider also the following case

	Routine A calls B, specifying that register 1 must be saved.

	Routine B doesn't use register 1, so it's not pushed.

	Further, B calls C.  We need to be sure that r1 is preserved,
	but I assume we can accumulate a "preserve" set.

		AB preserve - AB preserved + BC preserve

	where "AB preserve" is those that A required at the callsite to B,
	"AB preserved" is those that B stored on the stack, and
	"BC preserve" is those that B requires at the callsite to C.

		Sort of ugly, but perhaps not terrible.

	If C requires r1, it'll be pushed on entrance to C
	and popped on exit from C.  Now, make B call C many times
	in a row.  It would have been much nicer to save r1 on
	the transition from A to B (by either the caller or callee).

On the same subject, interested parties might check a paper
by Guy Steele (and Sussman?) from the 1980 Conference on Lisp and
Functional Programming, called "Lazy Scoping, the Dream of a Lifetime"
(or thereabouts -- my proceedings aren't where they belong).

He introduces the notion of "racks" (sort of combining register and stack).
The idea is that each has a little state machine, a value, and a stack.
There are 4 operations: fetch, update, save, restore.
You "save" live racks before a call and "restore" them after the call.
If someone (arbitrarily deep in the call sequence) tries to assign
to a saved rack, the saved value is preserved on the rack's private stack and
a new value is established.  When restoring, you pop only if you've pushed.
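
For the curious, here is one toy C reading of a single rack, with the
state machine collapsed into a pending-save counter -- my guess at the
mechanism, not the paper's implementation:

	#include <assert.h>

	#define DEPTH 64

	struct rack {
		int value;
		int pending;		/* saves not yet backed by a push */
		int stk_val[DEPTH];	/* private stack of displaced values */
		int stk_cnt[DEPTH];	/* how many saves each one covers */
		int top;
	};

	int fetch(struct rack *r) { return r->value; }

	void save(struct rack *r) { r->pending++; }	/* lazy: no push yet */

	void update(struct rack *r, int v)
	{
		if (r->pending > 0) {	/* old value still owed to someone */
			assert(r->top < DEPTH);
			r->stk_val[r->top] = r->value;
			r->stk_cnt[r->top] = r->pending;
			r->top++;
			r->pending = 0;
		}
		r->value = v;
	}

	void restore(struct rack *r)	/* "pop only if you've pushed" */
	{
		if (r->pending > 0)
			r->pending--;	/* never overwritten; value intact */
		else {
			r->value = r->stk_val[r->top - 1];
			if (--r->stk_cnt[r->top - 1] == 0)
				r->top--;
		}
	}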

He explored several possible implementations of racks in the context
of a Scheme interpreter.  In the case of the interpreter, different
special-purpose racks could use different implementations for best
performance.

I dunno how to build the things, but it's a fun paper to read.

Preston Briggs

preston@ariel.rice.edu (Preston Briggs) (03/08/91)

I wrote:
>On the same subject, interested parties might check a paper
>by Guy Steele (and Sussman?) from the 1980 Conference on Lisp and
>Functional Programming, called "Lazy Scoping, the Dream of a Lifetime"

The correct reference is

	The Dream of a Lifetime: A Lazy Variable Extent Mechanism
	Guy L. Steele Jr. and Gerald J. Sussman
	Conference Record of the 1980 Lisp Conference

sef@kithrup.COM (Sean Eric Fagan) (03/08/91)

In article <3228@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>  All this got me thinking that another bit of memory access which
>could be saved on a call is the return address. 

Oh, almost everyone does that these days.  i860, MIPS, Sparc?, Cray, 88k,
etc.  (After all, if you don't have a stack, where are you going to put the
return address except in a register?)

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

torbenm@diku.dk (Torben Ægidius Mogensen) (03/08/91)

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:

>  All this got me thinking that another bit of memory access which
>could be saved on a call is the return address. By putting it into a
>(dedicated) register at call time the leaf procedures would avoid
>several memory accesses, and the CALL instruction would be simpler,
>since the IC would just go into the RETADR register and the stack
>pointer would not be changed. That, in turn, leaves the ALU free on the
>instruction, which clever software might find useful.

>  We had this on the GE/Honeywell 36 bit machines. The common call
>instruction was TSXn, which left the return address in index register n.
>Another potential time saver.

The ARM family of processors does this too: the BL (branch and link)
instruction saves the PC (R15) in R14, which is called the Link
register. There is no JSR instruction in the usual sense, so
procedures save and restore the Link register on the stack when
necessary. Return is just MOV R15,R14.

	Torben Mogensen (torbenm@diku.dk)

rpw3@rigden.wpd.sgi.com (Rob Warnock) (03/10/91)

In article <1991Mar7.063957.17197@quick.com> srg@quick.com
(Spencer Garrett) writes:
+---------------
| You could get a lot more performance from this technique if you
| could introduce some hysteresis into the process (ie - don't restore
| a register until you get back to a routine that uses it), but
| this would be a nightmare to implement, since registers would have
| to be saved and restored from different stack frames.
+---------------

This is *exactly* what the Am29000 stack-cache register windowing does
for you -- provides a *large* amount of hysteresis in the "spill/fill"
activity, so much so that register spill/fills generally account for well
under 1% of all cycles (Dhrystone/grep/diff/nroff/asm/Stanford, etc.).


-Rob

hrubin@pop.stat.purdue.edu (Herman Rubin) (03/11/91)

In article <1991Mar07.014907.11081@kithrup.COM>, sef@kithrup.COM (Sean Eric Fagan) writes:
> In article <7403@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> >The relevant question is, "How bad is this from the standpoint of the possible
> >alternatives?"  In a subroutine call, the following is involved:
> >
> >	Save the system state register and set up the new one.
> >	Save the modified program counter.
> >	Save those registers which should be saved.
> >	Save any other information needed to restore.
> >
> >On return, this needs to be reversed.
> >
> >So we have in the neighborhood of a half-dozen to more than a dozen things to
> >do.  In addition, we are likely to have cache failures.  I do not find it 
> >at all unusual that this instruction averages 10 times as long as other
> >instructions.
> 
> Why, Herman?  Don't you like complex instructions?
> 
> One of the reasons the vax calls instruction is so slow is, as was pointed
> out by another poster, that, in addition to what you describe, it also goes
> out and fetches 16 bits at the specified address, uses that to determine
> which registers to save, and then jumps to address+2bytes.  Ugh.  Too many
> memory references for my taste.  Note, also, that there is no reason why you
> should save the status register ("condition codes"); in fact, you may want
> to use that (if present) to determine what you should do.  (E.g.,

			............................

Unless one is going to save all registers, it is necessary to know which ones
to save.  I am used to situations (call them "planned exceptions", if you will)
in which the subroutine call is a rare occurrence.  Now there is a way around
this at some hardware cost if one would have a means of instructing the machine
to proceed in a particular manner unless a user interrupt occurs, and to be
able to reset in that case.

If a machine has a goodly number of registers, the "normal" situation will
be for a program to be using lots of them.  It is not that unusual for a
subroutine, even one doing a fair amount of work, to use many fewer.  For
the caller to save registers without regard for what the callee uses will
be expensive unless there is some automatic means of doing this such as
register blocks on the PYRAMID.  Otherwise there has to be some means of
finding out which registers the callee uses, and reading the mask is about
as quick a way as any.

I do not think our current instructions are complex enough, in many situations.
In a register saving situation without register blocks, the memory references
are needed.  Now one could have anticipatory reads of the save mask, so that
that reference will not slow things down.  If one wants to use an intersection
mask, an "old active" mask for subsequent calls will still be needed.

You do need to save such things as the status register.  The old status
register should still be available if one wants to use it.

--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/12/91)

In article <7613@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:

| I do not think our current instructions are complex enough, in many situations.

  I'm sure that will make it into other quote files besides mine... but
I know what you mean. To justify a complex (microcode) instruction, I
believe you have to satisfy the following:

	/- be faster than doing the same thing in multiple instructions
       <                - OR -
	\- be as fast as, and smaller than, the hardcoded sequence, to save space

	- AND -

	not use chip space which could be used to make the *overall*
	performance better by doing something else.

You can sum this up as "best use of the real estate" or "best overall
performance" and many other people have done so. I have no attachment to
hardcoded instructions, I just want top performance overall, without any
horrible "worst case" programs which run orders of magnitude slower than
you would expect.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

davec@nucleus.amd.com (Dave Christie) (03/12/91)

In article <MGY90JF@xds13.ferranti.com> peter@ficc.ferranti.com (Peter da Silva) writes:
>> and the registers saved are described by the bitmask from 0x1234&0x3.  Now,
>> what do you do, however, when you call bar, which uses a mask of 0x1230?
>> foo does a
>
>> 	call	bar, 0x3
>
>> and the resultant mask is 0, thus no registers are saved, *despite* bar
>> trashing some of foo's caller's registers.
>
>You could solve this by keeping running track of "dirty" registers, which
>get set by "call" and cleared by "save".
>
>Let's see how this works here:
>
>You have to have "call" or-in the mask to be used, and have "save" clear
>the bits of the registers saved. That way any bits left over from foo's
>caller would still be set, unless foo has already saved them.
>
>Looks like it'd work. You could even use established VM techniques to
>decide when to save registers, but that's probably overkill.

Yes, this does it.  In a former life I designed hardware support for a
mainframe with VAX-style everything-but-the-kitchen-sink call/return
instructions to do the register save/restores in essentially zero time.
The paper that Preston Briggs referred to seems very similar - a little
state machine + stack for each register; also a stack for the save masks
and maybe some other control state.  It had to be transparent to software 
for backwards compatibility.  Fun to design, but fraught with extra bits 
of complexity due to pipelining, branches, fault tolerance, context 
switches, etc. 

In the final analysis, it was hard to justify the complexity and real estate,
compared with the cost/performance of, say, R3000 call/return.  It was
never implemented.  In fact, it was one (of several) of the experiences
with CISC design that made me a RISC advocate... And added another straw
to their CISC camel's back.

IMHO, the 29000 stack cache is a much better way (no plug intended) 
(well, maybe a little...)

---------------
Dave Christie		My opinions only.

don@dgbt.doc.ca (Donald McLachlan) (03/12/91)

	The idea of saving the return address in a register, rather than
on the stack sounds nice to me, but ...

	This requires that the calling function knows that the called
function is a leaf function, not very practical from a high level language
point of view as far as I can tell.

	The only way I can think to generalise this would be to always
put the return address in a dedicated register. This would require that
the "call" would first push the old contents onto the stack and then
load in the new return address. The matching return would use in the
dedicated register as the return address. The function making the call
would then be responsible for grabbing the old return address off the
stack and loading it into the dedicated register.

	A further optimisation I could see is that any function which
performs a call could save the "return register" on the stack once at
invocation, and restore it prior to returning. (gee isn't this what is done
now:-)

	Now that all the mechanics are out of the way (the way I see them)
only one question remains. HOW MUCH DOES THIS ACTUALLY SAVE???
I interpret this as ... What is the ratio of calls to "leaf functions"
versus calls to "non-leaf functions"?

	Does anyone have any data, or a good guess at this last question?

dhoyt@vx.acs.umn.edu (DAVID HOYT) (03/12/91)

In article <1991Mar11.192116.1974@dgbt.doc.ca>, don@dgbt.doc.ca (Donald McLachlan) writes...
>	The only way I can think to generalise this would be to always
>put the return address in a dedicated register. This would require that
>the "call" would first push the old contents onto the stack and then
>load in the new return address. The matching return would use the
>dedicated register as the return address. The function making the call
>would then be responsible for grabbing the old return address off the
>stack and loading it into the dedicated register.

  Imagine you have, say, 32 registers.  On procedure call r7 is the return
address and r0..r6 are the first seven parameters.  Now if our code looks
like this

    save r7                            | sub::
    for( i = 0; i < 10000000; i++ )    |    ld r0, r1 ; add 1, r1; stor r0, r1
        sub( a + i )                   |    jmp r7
    restore r7

  We've done 10,000,001 reads and the same number of writes.  Now if we had
a vax jsr/rsb instruction (one write, one read per call) we would have 20M
reads and writes, doubling the memory accesses. Obviously even with a good
cache, we'd much rather do our subroutine calls the first way, rather than
the vax way.  And the normal CALLS instruction that the vax uses for
subroutine calls is much more expensive than the jsr/rsb calls.  We could
greatly speed up subroutine calls on a vax by using 4/16 registers this way,
or even 8/16 I suspect.

  In my example sub() is just a normal subroutine, so we don't need a smart
linker or code inliner.  In fact the only penalty with the all-register
procedure call is that we have an additional two (unconditional) branches.
With a small direct-mapped instruction buffer (a la Cray) and a
branch-intelligent pipeline, the execution time cost of the two branches
would be close to nothing.  Basically this gives you inline virtual
functions that we wouldn't even have to declare inline; everybody wins,
wee ha!

david | dhoyt@vx.acs.umn.edu | dhoyt@umnacvx.bitnet

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/12/91)

In article <1991Mar11.192116.1974@dgbt.doc.ca> don@dgbt.doc.ca (Donald McLachlan) writes:

| 	The only way I can think to generalise this would be to always
| put the return address in a dedicated register. This would require that
| the "call" would first push the old contents onto the stack and then
| load in the new return address. 

  The previous return info has to be saved, but not by the CALL
instruction. There are lots of non-recursive languages (FORTRAN) which
don't need to worry about another call and could save the return address
as needed. Also, a smart compiler would fake a CALL and let the target
procedure return directly to the caller when desirable/possible.

| 	Now that all the mechanics are out of the way (the way I see them)
| only one question remains. HOW MUCH DOES THIS ACTUALLY SAVE???
| I interpret this as ... What is the ratio of calls to "leaf functions"
| versus calls to "non-leaf functions"?

  Wish I had the time to do a few measurements. My guess is that it
would be over half the calls with common coding practices, but that's
pure gut feeling, so don't expect me to prove it. Any ability of the
compiler to inline procedures allows more procedures to be leaf, so it
will provide more benefit when optimized.

  I also note that some languages can pass the arguments inline in the
code, using various techniques to avoid executing them. No, it doesn't
require self modifying code, just an extra level of indirection (saying
they're bad languages doesn't mean you can't design hardware to run
them well). If you must have a pointer back to the caller in a register
anyway, you can use it in the return. The world is not *exclusively* C.

  One nice thing about it is that it shouldn't ever have a penalty in
performance... the worst case appears to be no worse than what we
usually use now, and the best case seems to save two memory accesses and
two changes in the stack pointer.

-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

turner@lance.tis.llnl.gov (Michael Turner) (03/12/91)

In article <12269@pt.cs.cmu.edu> ddean@rain.andrew.cmu.edu (Drew Dean) writes:

>Can someone give a reference to Patterson's finding that most procedures are
>relatively "simple" ?  (ie. I remember <= 6 local vars, and about the same
>number of parameters dominates usage in real code.)  If you find, after
>simulation, that a lot of time is being spent saving & restoring registers
>from the stack, then investigate mechanisms like register windows.

My first reaction to this was something like "the code *better* have
fewer than 6 parameters and 6 local variables per subroutine call on
average, or I'm not going to work on it."  Cooling down a little, I
thought: maybe there's an idea--instead of optimizing architectures
around how code *is*, perhaps it would make sense to look at how code
*should* be.  Personally, I wouldn't mind using a machine that
grossly penalized subroutines with more than 4 parameters, in exchange
for some slightly greater reward for using fewer, because I seldom write
code that uses more than 4 routinely, and *hate* inheriting code that
does.  More often than not, it seems to me, people use extra parameters
rather than splitting up code into separate functions to make it more
modular.  Register windows are a start on killing off this pernicious
tendency.

I'm not sure how you'd set up an evaluation, since there are so many
different views of that ineffable sense of quality about code.  Perhaps
you'd need to average out the responses of various programmers to various
kinds of code.  Preferably in some relatively objective way, like how long
it took them to find planted bugs, rather than according to their immediate
reactions to a code inspection.  Rewarding maintainability with extra
performance.  Why not?  Maybe compilers are being too kind....
---
Michael Turner
turner@tis.llnl.gov

jackk@leland.Stanford.EDU (Jack Kouloheris) (03/12/91)

In article <3241@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>In article <1991Mar11.192116.1974@dgbt.doc.ca> don@dgbt.doc.ca (Donald McLachlan) writes:
>
>| 	The only way I can think to generalise this would be to always
>| put the return address in a dedicated register. This would require that
>| the "call" would first push the old contents onto the stack and then
>| load in the new return address. 
>
>  The previous return info has to be saved, but not by the CALL
>instruction. There are lots of non-recursive languages (FORTRAN) which
>don't need to worry about another call and could save the return address
>as needed. Also, a smart compiler would fake a CALL and let the target
>procedure return directly to the caller when desirable/possible.
>

People seem to be talking about this call/return mechanism as if
it is something novel..... This is the standard mechanism in
the S/360, S/370, and S/390 systems and has been for more than 25 years.
All the languages that don't allow recursion have a standard call/return
convention via register 15. There is no hardware stack on the 370...
the call instruction is usually an assembler macro that does a BALR 14,15
(Branch and Link Register). A convention is followed for creating
a register save area, which is then filled via a Store Multiple
instruction. If one always saves all registers on a call (as some
compilers do), then a standard OS routine can traceback the call sequence
via register 15 (as well as the contents of each register on entry to
the call). Of course today optimizing compilers (as on the RS/6000)
can save just the required registers across a call.

mash@mips.com (John Mashey) (03/12/91)

In article <1991Mar11.192116.1974@dgbt.doc.ca> don@dgbt.doc.ca (Donald McLachlan) writes:
>	The idea of saving the return address in a register, rather than
>on the stack sounds nice to me, but ...
>	This requires that the calling function knows that the called
>function is a leaf function, not very practical from a high level language
>point of view as far as I can tell.
>
>	The only way I can think to generalise this would be to always
>put the return address in a dedicated register. This would require that

1) Many RISCs around do this, i.e., use the moral equivalent of
the S/360 BRANCH-AND-LINK instruction.

2) The callee:
	a) Saves the return address register (for sure)
OR
	b) The callee (if a leaf, and it certainly knows that)
	never saves the return address register
OR MAYBE
	c) If the callee has "leaf paths", i.e., ones where it doesn't
	call other subroutines, even though it has some where it does, the
	compiler may choose to rearrange the save/restores to avoid them
	on some of the paths where they would be unnecessary.  (Some compilers
	do this, i.e., cc -O3 on MIPS, for example, in certain cases.)

3) As has been posted here in the past, it has been observed that
the number of registers saved/restored per function call can be kept
small (like about 2, on the average, for suites of programs of
SPEC-like nature), if:
	a) You have a good optimizer with a good register allocator.
	b) You have ENOUGH registers that a reasonable number (say 50%)
	can be made scratch (caller save).  Typically, enough == 20..32
	available for general use.

4) In many environments, the frequency of leaf routines is high,
on the order of 50%.

5) There is plenty of analysis of this topic in Hennessy and Patterson,
including detailed statistics about the effects of more registers
and/or better optimization.

6) On any MIPS-based machine, assuming that it includes pixie, prof,
and pixstats (most do), it is trivial to do this kind of analysis:
	pixie  program		creates program.pixie
	program.pixie		run it
	pixstats program	get the statistics
and look for the section that lists the number of callee-saved
registers saved/restored: this is a table that goes from 0 to 10.
Perhaps easier, look for the line labeled "Average registers saved
per entry:"

To get % of calls to leaf routines is a little harder, but one of the
"prof" sections gives the number of calls to each function,a
and the percentage of total, although it doesn't identify whether
something is a leaf or not.  Also, this begsthe question of whether
or not something is clearly a leaf (i.e., zero calls to other functions),
or dynamically a leaf (i.e., some calls of it do NOT cause it to call others;
there are compilers that can move register save/restores around to
sometimes avoid saving them when unnecessary).

7) Finally, the really crucial thing in all of this is to understand
how big a percentage of run time is spent in subroutine call overhead,
and THEN, and ONLY THEN figure out if it's worth a lot of hardware
to do something about it, especially if that hardware is very difficult
to pipeline.  The evidence, so far, is that good compilers plus enough
registers get rid of an awful lot of the overhead.  One can argue,
or not, about whether register windows get rid of enough more to be
worth it.  Certainly, I'd prefer register windows to complex function
call instructions. Among other things, in looking at RISC code,
in which instructions are frequently re-ordered, I've quite often seen
the pieces of the subroutine call mechanism spread all over a piece of
code, in ways that fill cycles that would otherwise be stalls,
and in sequences for which hardware is unimaginable...
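
A concrete C sketch of the "leaf path" case in 2c (the names are
invented): a compiler that rearranges save/restores can leave the
return-address register untouched on the common path and save it only
inside the branch that actually calls out:

	extern void fixup(int *buf);	/* hypothetical callee */

	int get(int *buf, int n)
	{
		if (n < 100)		/* leaf path: no call, so no	*/
			return buf[n];	/* return-address save needed	*/
		fixup(buf);		/* calling path: the save/restore */
		return buf[0];		/* is sunk into this branch	*/
	}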
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086

sef@kithrup.COM (Sean Eric Fagan) (03/12/91)

In article <1991Mar11.192116.1974@dgbt.doc.ca> don@dgbt.doc.ca (Donald McLachlan) writes:
>	The idea of saving the return address in a register, rather than
>on the stack sounds nice to me, but ...
>	This requires that the calling function knows that the called
>function is a leaf function, not very practical from a high level language
>point of view as far as I can tell.

Uhm... that doesn't quite follow.  How did you come up with that conclusion?

The *called* function knows whether it's a leaf function or not; this
is fine, as it's the one that has to save the return address or not.  (That
is, if I call any other functions, I will save my return address.  I don't
need to know anything about the functions I call%.  If I don't call any
other functions, then, obviously, I don't need to save the return-address
register.)

----
% One thing you can do, with enough magic software, is to have the linker do
N-level searches on all called functions, and decide what set of registers
it can safely use without saving.
----
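
A minimal sketch of what that footnote's linker pass could compute,
assuming an acyclic call graph and a bitmask per routine of the
registers it touches (the encoding and the numbers are invented):

	#include <stdio.h>

	#define NFUNC 4

	/* registers each routine touches itself, as a bitmask */
	unsigned used[NFUNC] = { 0x03, 0x0c, 0x30, 0xc0 };
	/* calls[i][j] != 0 if routine i calls routine j */
	int calls[NFUNC][NFUNC];

	/* union of registers touched by f and everything it can reach;
	   assumes no recursion -- a real linker would need cycle handling */
	unsigned clobbered(int f)
	{
		unsigned set = used[f];
		int g;

		for (g = 0; g < NFUNC; g++)
			if (calls[f][g])
				set |= clobbered(g);
		return set;
	}

	int main(void)
	{
		calls[0][1] = calls[1][2] = calls[1][3] = 1;
		/* a caller of routine 0 may freely use any register
		   NOT in this mask, without saving it */
		printf("routine 0 clobbers mask %02x\n", clobbered(0));
		return 0;
	}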

>	The only way I can think to generalise this would be to always
>put the return address in a dedicated register. 

That's what happens in *lots* of machines:  crays, 88k's, MIPS's, i860's,
etc.  The call instruction only saves the return address in the register.
Between software and hardware conventions, it's pretty quick (note that they
don't save anything into memory; accessing memory is *slow*, and should be
avoided almost as much as divides should be 8-)).

>This would require that
>the "call" would first push the old contents onto the stack and then
>load in the new return address. 

If you call any functions, you are not a leaf function.  Therefore, as part
of your prologue, you save the return address you were given, and restore it
as part of the function epilogue.  Note that if you only have a couple of
function calls (i.e., not in a loop), then you can save the address around
each call instead, on the assumption that it might be faster.  E.g.:

	void foo(int i) {
		...
		if (foobar == i)
			blech(i&foobar);
		...
	}

might generate code that looks like:

	foo:
		...	; prologue
		load	r2, foobar
		load	r3, i	; probably already in a register, actually
		bne	r2, r3, skip.0
		and	r4, r2, r3	; i&foobar, filling the delay slot
		PUSH	r31	; PUSH being a macro
		PUSH	r4	; the argument to blech
		call	blech
		POP	r4	; discard the argument
		POP	r31	; restore return address
	skip.0:
		; whatever
	
>	Now that all the mechanics are out of the way (the way I see them)
>only one question remains. HOW MUCH DOES THIS ACTUALLY SAVE???

At worst, it costs nothing more than what is "traditionally" done.  Best
case, it saves a few memory accesses, which is a Good Thing.

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

bruce@cs.su.oz (Bruce Janson) (03/12/91)

In article <1353@ncis.tis.llnl.gov> turner@lance.tis.llnl.gov (Michael Turner) writes:
>..
>.. maybe there's an idea--instead of optimizing architectures
>around how code *is*, perhaps it would make sense to look at how code
>*should* be.  Personally, I wouldn't mind using a machine that
>grossly penalized subroutines with more than 4 parameters, in exchange
>for some slightly greater reward for using fewer, because I seldom write
>..

The MIPS compilers put the first 4 words of integer arguments into
registers $4..$7 (aka a0-a3) and the first 2 single or double precision
arguments into floating point registers $f12 and $f14.
[See pp. D-2 and D-3 of the mips RISC ARCHITECTURE book by Gerry Kane.]
Apparently the MIPS people do a lot of simulation and measurement;
they may even have done the tests that you were considering...
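
In other words (the declarations below are invented examples; the
register assignments are just the convention described above):

	int f(int a, int b, int c, int d, int e);
	/* a -> $4 (a0), b -> $5 (a1), c -> $6 (a2), d -> $7 (a3);
	   e goes to the stack */

	double g(double x, double y, double z);
	/* x -> $f12, y -> $f14; z is beyond the first two, so it
	   does not get a floating point argument register */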

Cheers,
bruce.

Bruce Janson
Basser Department of Computer Science
University of Sydney
Sydney, N.S.W., 2006
AUSTRALIA

Internet:	bruce@cs.su.oz.au
Telephone:	+61-2-692-3272
Fax:		+61-2-692-3838

ccplumb@rose.uwaterloo.ca (Colin Plumb) (03/12/91)

turner@lance.tis.llnl.gov (Michael Turner) wrote:
>  Cooling down a little, I
>thought: maybe there's an idea--instead of optimizing architectures
>around how code *is*, perhaps it would make sense to look at how code
>*should* be.  Personally, I wouldn't mind using a machine that
>grossly penalized subroutines with more than 4 parameters, in exchange
>for some slightly greater reward for using fewer, because I seldom write
>code that uses more than 4 routinely, and *hate* inheriting code that
>does.

Well, I can live with 4 parameters as long as you don't want them to fit in
one register each.  If you write graphics code, you pass things like 3-vectors
(3 registers), basis matrices (9 registers), affine transformations (12)
and bicubic patches (16) around.

In integers, you often pass rectangle boundaries (4 registers) around.

In general, I'd like to suggest that asking the hardware to enforce
coding style is like killing ants with a sledgehammer.  Both too
effective and not effective enough.  You'll piss off people who have
good reason to want to do what they're doing, and people who want to
write bad code will continue to write bad code.
-- 
	-Colin

tim@proton.amd.com (Tim Olson) (03/12/91)

In article <1991Mar11.192116.1974@dgbt.doc.ca> don@dgbt.doc.ca (Donald McLachlan) writes:

| 	Now that all the mechanics are out of the way (the way I see them)
| only one question remains. HOW MUCH DOES THIS ACTUALLY SAVE???
| I interpret this as ... What is the ratio of calls to "leaf functions"
| versus calls to "non-leaf functions"?

From a collection of benchmarks we have, we have run dynamic leaf-call
statistics, and see that leaf routines account for 40% - 60% of all
function calls.  The benchmark with the fewest leaf-routine calls was
"compress", with less than 1%; the most leaf-routine calls were in a
kalman filter benchmark (96%).  Because about half of all calls are to
leaf functions, leaf-procedure optimizations are important.


--
	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

spot@CS.CMU.EDU (Scott Draves) (03/13/91)

In article <1991Mar12.115845.20736@watdragon.waterloo.edu> ccplumb@rose.uwaterloo.ca (Colin Plumb) writes:


   Well, I can live with 4 parameters as long as you don't want them to fit in
   one register each.  If you write graphics code, you pass things like 3-vectors
   (3 registers), basis matrices (9 registers), affine transformations (12)
   and bicubic patches (16) around.

   In integers, you often pass rectangle boundaries (4 registers) around.

you are almost certainly passing *pointers* to these arrays and
matrices around.  at least i hope you are...
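
in C the natural spelling already does that, since array parameters
decay to pointers.  a trivial sketch (types and names invented):

	typedef double Matrix3[3][3];

	/* passes two pointers and writes through a third --
	   not 9 + 3 + 3 doubles */
	void apply(const Matrix3 m, const double v[3], double out[3])
	{
		int i;

		for (i = 0; i < 3; i++)
			out[i] = m[i][0]*v[0] + m[i][1]*v[1] + m[i][2]*v[2];
	}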

   In general, I'd like to suggest that asking the hardware to enforce
   coding style is like killing ants with a sledgehammer.

it's not enforcing anything, it just runs faster if you use its style.
--

			christianity is stupid
Scott Draves		communism is good
spot@cs.cmu.edu		give up

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/13/91)

In article <912@spim.mips.COM> mash@mips.com (John Mashey) writes:

| 7) Finally, the really crucial thing in all of this is to understand
| how big a percentage of run time is spent in subroutine call overhead,
| and THEN, and ONLY THEN figure out if it's worth a lot of hardware
| to do something about it, especially if that hardware is very difficult
| to pipeline.  The evidence, so far, is that good compilers plus enough
| registers get rid of an awful lot of the overhead.  One can argue,

  Well, I'm not about to argue, but I may pose a question: in systems
which use register windows, does someone have a good handle on the cost
of doing all the register-to-register moves I see just before a call? I
see some papers which show big gains by not moving data register to
memory to register to put arguments on a stack, but they seem to discount
the cost of the moves, and the reduced performance of the optimizer
having to treat a lot of registers as temps or plain unusable. I would
have to read some papers to clearly understand how to measure the
optimizer impact, so if someone has done it, speak up.

  Obviously there's a win in register windows if they're large enough, but
I suspect it isn't as big as the papers on RISC claim.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

mike@vlsivie.tuwien.ac.at (Michael K. Gschwind) (03/13/91)

In article <1991Mar11.192116.1974@dgbt.doc.ca> don@dgbt.doc.ca (Donald McLachlan) writes:
>
>	The idea of saving the return address in a register, rather than
>on the stack sounds nice to me, but ...
It is nice AND saves a lot of cycles ...
>
>	This requires that the calling function knows that the called
>function is a leaf function, not very practical from a high level language
>point of view as far as I can tell.
Simply not true. As you yourself explain later.
>	A further optimisation I could see for this is any function that
>performs a call could save the "return register" on the stack once at
>invocation, and restore it prior to returning. (gee isn't this what is done
>now:-)
Exactly.
>
>	Now that all the mechanics are out of the way (the way I see them)
>only one question remains. HOW MUCH DOES THIS ACTUALLY SAVE???
>I interpret this as ... What is the ratio of calls to "leaf functions"
>versus calls to "non-leaf functions"?
Dynamically, about 50% of all functions called are leaf functions.
Assuming you save 2 memory accesses AND 2 stack manipulations, this is
quite an interesting optimization. Also, the CALL instruction becomes
much simpler: just move the PC to some register and some register to the
PC - no stack handling involved. Everything is much smoother (and
simpler to design - read my thesis ;-)
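
In a toy simulator the whole mechanism is just this (assuming 4-byte
instructions and picking register 31 as the link register - both
arbitrary choices here):

	unsigned pc, reg[32];

	void call(unsigned target)
	{
		reg[31] = pc + 4;	/* return address to a register; */
		pc = target;		/* no memory traffic at all      */
	}

	void ret(void)
	{
		pc = reg[31];		/* and back */
	}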

			bye,
				mike



Michael K. Gschwind, Dept. of VLSI-Design, Vienna University of Technology
mike@vlsivie.tuwien.ac.at	1-2-3-4 kick the lawsuits out the door 
mike@vlsivie.uucp		5-6-7-8 innovate don't litigate         
e182202@awituw01.bitnet		9-A-B-C interfaces should be free
Voice: (++43).1.58801 8144	D-E-F-O look and feel has got to go!
Fax:   (++43).1.569697       

torek@elf.ee.lbl.gov (Chris Torek) (03/14/91)

In article <3255@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com
(bill davidsen) writes:
>... in systems which use register windows, does someone have a good
>handle on the cost of doing all the register-to-register moves I see
>just before a call?

In many (yeah sure :-/ ) cases you can put the values in the windowed
registers in the first place, and/or leaf functions (those again!) can
avoid using a new window (doing everything in the caller's window and
any `callee saves' temporary registers).

>  Obviously there's a win in register windows if they're large enough, but
>I suspect it isn't as big as the papers on RISC claim.

Depending on your compiler and applications, it may even be zero or
negative.  Applications which have a lot of `up-down' motion over a
small call stack range, and which thus fit perfectly in the windows,
will get either large gains, or (if your register allocation is good
enough) none at all, on machines with register windows (SPARC) vs
machines with `just a lot of registers' (AM29000).
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

tim@proton.amd.com (Tim Olson) (03/14/91)

In article <3255@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
|   Well, I'm not about to argue, but I may pose a question: in systems
| which use register windows, does someone have a good handle on the cost
| of doing all the register-to-register moves I see just before a call? I
| see some papers which show big gains by not moving data register to
| memory to register to put arguments on a stack, but they seem to discount
| the cost of the moves, and the reduced performance of the optimizer
| having to treat a lot of registers as temps or plain unusable. I would
| have to read some papers to clearly understand how to measure the
| optimizer impact, so if someone has done it, speak up.

The register-to-register movement that you mention is not due to a
windowed-register implementation; simply passing parameters in
registers will cause this, because a function may be called by more
than one parent, and they all must agree on what registers the
parameters must be passed in.

However, the register-to-register movement is not required all the
time; register allocators can bind local variables to the correct
outgoing parameter register through copy propagation.  For example,
the function:

----------------------------------------
void g(int, int, int);

void f(int j)
{
    int i1, i2;

    i1 = j+5;
    i2 = j-7;
    g(i2, j, i1);
}
----------------------------------------

compiles to the following, using the MetaWare C compiler for the 29K:

----------------------------------------
_f:
	sub	gr1,gr1,24
	asgeu	V_SPILL,gr1,gr126
	add	lr1,gr1,36
;5   |    int i1, i2;			<- local variables i1, i2
;6   |
;7   |    i1 = j+5;
	add	lr4,lr8,5		<- i1 is assigned local register 4
;8   |    i2 = j-7;				(3rd parameter reg)
	sub	lr2,lr8,7		<- i2 is assigned local register 2
;9   |    g(i2, j, i1);				(1st parameter reg)
	call	lr0,_g			<- so this call needs no copying,
	add	lr3,lr8,0		<- 	except incoming parameter j
;10  |}
	add	gr1,gr1,24
	nop
	jmpi	lr0
	asleu	V_FILL,lr1,gr127
----------------------------------------
--
	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

henry@zoo.toronto.edu (Henry Spencer) (03/14/91)

In article <10896@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>>... in systems which use register windows, does someone have a good
>>handle on the cost of doing all the register-to-register moves I see
>>just before a call?
>
>In many (yeah sure :-/ ) cases you can put the values in the windowed
>registers in the first place...

In fact, it is tempting to generalize this and say "register-to-register
moves are the sign of a defective compiler".  This isn't quite always true,
of course.  In particular, if you are passing the value of a register
variable, it often needs to be copied because the call/return sequence
will generally destroy the passed copy.  Still, r-t-r moves should be
viewed with suspicion.
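
That exception is easy to see in C (the callee is invented): v wants
to live in a callee-saves register across the loop, so each call needs
a copy of it into the caller-saves argument register:

	extern void emit(int);	/* hypothetical callee */

	void run(int v, int n)
	{
		int i;

		for (i = 0; i < n; i++)
			emit(v);	/* roughly: move v's register into
					   the argument register, then call */
	}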

Turn the question around:  if those moves weren't being done register
to register, how *would* they be done?  Register to memory?  That costs
more, not less.

>... Applications which have a lot of `up-down' motion over a
>small call stack range, and which thus fit perfectly in the windows,
>will get either large gains, or (if your register allocation is good
>enough) none at all, on machines with register windows (SPARC) vs
>machines with `just a lot of registers' (AM29000).

AHEM.  Chris, the Am29k has *both*.  You can use 128 of those registers
as a register-window bank if you want, with the added bonus that the
size of the windows is completely up to you and can be different for
each function.
-- 
"But this *is* the simplified version   | Henry Spencer @ U of Toronto Zoology
for the general public."     -S. Harris |  henry@zoo.toronto.edu  utzoo!henry

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/14/91)

In article <10896@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:

| In many (yeah sure :-/ ) cases you can put the values in the windowed
| registers in the first place, and/or leaf functions (those again!) can
| avoid using a new window (doing everything in the caller's window and
| any `callee saves' temporary registers).

  The obvious exception is pass-through values, where values you got from
your caller are passed on to a leaf. These come in in the wrong place, so
you can't avoid the move. And consecutive calls will always need a move
to get the args in the right place.
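
  For instance (the callee is invented), on a windowed machine like the
SPARC, x below arrives in an "in" register but must leave in an "out"
register, so the move can't be optimized away:

	extern int leaf(int);

	int middle(int x)
	{
		return leaf(x);	/* on SPARC, roughly:
				   mov %i0, %o0 ; call leaf ... */
	}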

  Maybe there's some joy to having the caller control the offset of the
windows. Then I could work with the things I will pass to leafA in x4..6,
and to leafB in x9..12, and specify that my x4 becomes x0 for leafA, but my
x9 becomes x0 for leafB. That would probably raise more problems than it
kills, though.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

torek@elf.ee.lbl.gov (Chris Torek) (03/15/91)

In article <1991Mar13.200326.29116@zoo.toronto.edu>
henry@zoo.toronto.edu (Henry Spencer) writes:
>AHEM.  Chris, the Am29k has *both*.  You can use 128 of those registers
>as a register-window bank if you want, with the added bonus that the
>size of the windows is completely up to you and can be different for
>each function.

True.  I was originally going to use SPARC and MIPS in my examples, but
I figured people would object to comparing a 127-register machine
(Sun's SPARC chips, 8 windows of which 1 is reserved for traps, 16
registers per window, plus 8 more overlapping into the trap window
[this makes the trap handlers somewhat painful], plus 8 `global'
registers of which one is wired zero) to a 31-register machine (MIPS
R[23]000, 32 registers, one wired zero).  So when I said:

>>... and which thus fit perfectly in the windows,
>>will get either large gains, or (if your register allocation is good
>>enough) none at all, on machines with register windows (SPARC) vs
>>machines with `just a lot of registers' (AM29000).

perhaps I should have made it `...with register windows (SPARC, AM29000)
vs machines with ``just a lot of registers'' (AM29000)'.

(Actually, the SPARC has 8 `user-usable' global registers, the last one
being the %y register, but it is a special case.  Still, it might make
a good holding register for leaf functions.  Now I am tempted to hack
gcc to do this. :-) )
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov