davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/05/91)
Here's an interesting thought. I have seen good arguments which say that the program called should save registers because it knows which ones it will use. I've seen a less persuasive argument that the caller should do the save because it knows what must be saved. It came to me that perhaps there's a way to "split the difference" and save only the set which the caller and the called routine agree must be saved.

Consider a call instruction which includes a bitmapped register set. I would include all those which will be used before reloading. Then assume a "save as needed" instruction which takes a bitmap of all the registers which must be used by the called routine. The registers which were in both sets would be saved, possibly with a restore mask which would enumerate which registers had been saved so they could be popped again.

Oh heavens, a CISC instruction! But the time to execute should be one clock per register saved, at least as fast as saving all the registers every time. While we're doing heresy, let's consider having the CPU able to remap the registers on the fly. If you call it writable control store it's CISC, and therefore bad. If you call it register sets it's RISC and it's good.

At any rate, in every case where there are N registers needed, and N or more registers which the caller doesn't need saved, zero registers have to be written to memory. And the one design rule common to hardware and software is "it's quicker to not do it." I think interrupts fall out if you set all the "caller" bits and save everything, or you could use other logic completely.

-- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"I'll come home in one of two ways, the big parade or in a body bag. I prefer the former but I'll take the latter" -Sgt Marco Rodrigez
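[A toy model of the proposed scheme, for concreteness. The register numbers and mask values below are invented for illustration; no real ISA encodes the instruction this way.]

```python
def save_as_needed(caller_live_mask, callee_use_mask):
    """Model of the proposed "save as needed" instruction.

    caller_live_mask: bit i set => caller needs register i preserved.
    callee_use_mask:  bit i set => callee will clobber register i.
    Only registers in BOTH sets are saved; the result doubles as the
    restore mask so the same registers can be popped on return.
    """
    return caller_live_mask & callee_use_mask

def saved_registers(restore_mask):
    """Enumerate the register numbers actually pushed."""
    return [i for i in range(16) if restore_mask & (1 << i)]

# Caller needs r1 and r4 live across the call; callee clobbers r1, r2, r3.
# Only r1 is in both sets, so only r1 is pushed and later popped.
mask = save_as_needed(0b10010, 0b01110)
```

The point of the model is the N-registers case in the post: when the callee's use mask and the caller's live mask are disjoint, the intersection is zero and nothing touches memory at all.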
chased@rbbb.Eng.Sun.COM (David Chase) (03/05/91)
davidsen@crdos1.crd.ge.com (bill davidsen) writes:
> Here's an interesting thought. I have seen good arguments which say
> that the program called should save registers because it knows which
> ones it will use. I've seen a less persuasive argument that the caller
> should do the save because it knows what must be saved.

With a "good enough" compiler, the caller might profitably do the save because the caller has the opportunity to optimize saves and restores. For example, you might have a subroutine call in a loop. For some registers, saves and restores are loop-invariant, meaning "these registers are not used in the loop". For another example, if there's a series of calls, then the saves and restores might be redundant -- that is, not all registers are used between calls.

Just another reason why caller-saves might be worthwhile.

David Chase
Sun
jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) (03/05/91)
The combination of a decent compiler and library mechanism lets you do optimal register saves fairly easily. Let the compiler record, for each routine it puts in the library, what registers the routine uses. When you compile a call to that routine, the register allocator can try to avoid registers the routine uses, and when it uses one, it can generate the needed saves and restores.

This works trivially within a single compilation unit (to use Ada terminology), and it works with library routines in languages where you can force recompilation of all users of a routine after it is recompiled (again, the Ada rules work here). This approach causes problems for old-fashioned languages like C, the separate compilation rules of which are essentially the same as those of FORTRAN (save your flames, I don't mind using C when I have to).

Doug Jones
jones@cs.uiowa.edu
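[Jones's scheme can be sketched as a toy allocator that consults per-routine clobber sets recorded when each library routine was compiled. The routine names and register counts here are invented for illustration.]

```python
# Clobber sets the compiler recorded when each library routine was built.
LIBRARY_CLOBBERS = {
    "sqrt": {0, 1},        # sqrt uses r0 and r1
    "memcpy": {0, 2, 3},   # memcpy uses r0, r2, r3
}

NUM_REGS = 8

def allocate(live_across_call, callee):
    """Pick registers for values live across a call to `callee`.

    Prefer registers the callee does not touch; fall back to clobbered
    ones, which then need an explicit save/restore pair around the call.
    """
    clobbered = LIBRARY_CLOBBERS[callee]
    safe = [r for r in range(NUM_REGS) if r not in clobbered]
    unsafe = [r for r in range(NUM_REGS) if r in clobbered]
    order = safe + unsafe
    assignment = {v: order[i] for i, v in enumerate(live_across_call)}
    needs_save = {v: r for v, r in assignment.items() if r in clobbered}
    return assignment, needs_save

# Two values live across a call to sqrt both fit in untouched registers,
# so no saves are generated at all.
assignment, saves = allocate(["x", "y"], "sqrt")
```

With six values live across a memcpy call, one value spills into a clobbered register and only that one gets a save/restore pair -- exactly the "generate the needed saves" case in the post.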
khb@chiba.Eng.Sun.COM (Keith Bierman fpgroup) (03/05/91)
In article <4718@ns-mx.uiowa.edu> jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) writes:
... This (ed. saving register use info) approach causes problems for
old-fashioned languages like C, the
separate compilation rules of which are essentially the same as those
of FORTRAN (save your flames, I don't mind using C when I have to).
It also presents problems with various forms of dynamic linking ....
you don't necessarily know that the "universe" has changed until you
are in the middle of execution.... not that I'm against milking static
analysis for all that it's worth ;>
--
----------------------------------------------------------------
Keith H. Bierman kbierman@Eng.Sun.COM | khb@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33 | (415 336 2648)
Mountain View, CA 94043
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/05/91)
In article <9047@exodus.Eng.Sun.COM> chased@rbbb.Eng.Sun.COM (David Chase) writes:
>Just another reason why caller-saves might be worthwhile.

A rather long analysis of caller/callee/dataflow/hybrid register saving appears in one of the recent Software Practice & Experience issues. From my very brief skim there seemed little (by my definition <20%) difference between many of the methods (apart from the dataflow algorithm used in one of the methods taking bulk cycles for little or no return -- but dataflow is _always_ part of your compiler, right?). This was all done on a 68k.

-kym
hrubin@pop.stat.purdue.edu (Herman Rubin) (03/05/91)
In article <9047@exodus.Eng.Sun.COM>, chased@rbbb.Eng.Sun.COM (David Chase) writes:
> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
> > Here's an interesting thought. I have seen good arguments which say
> > that the program called should save registers because it knows which
> > ones it will use. I've seen a less persuasive argument that the caller
> > should do the save because it knows what must be saved.
>
> With a "good enough" compiler, the caller might profitably do the save
> because the caller has the opportunity to optimize saves and restores.
> For example, you might have a subroutine call in a loop. For some
> registers, saves and restores are loop-invariant, meaning "these
> registers are not used in the loop". For another example, if there's
> a series of calls, then the saves and restores might be redundant --
> that is, not all registers are used between calls.
>
> Just another reason why caller-saves might be worthwhile.

Most of the situations I have seen are of two types; either the subroutine has few instructions, in which case it is likely to use few registers, and these probably should be inlined; or the subroutine does a lot of computing, in which case register save/restore, while not trivial, is at least not a major part of the cost.

It is true that there are cases of convoluted subroutines in which it is desirable to have many registers available, although few will be used, but, except when a program is essentially doing, or has recently done, some sort of context switch, many registers will be in use. In addition, an important problem is that of conditional subroutine calls, including "planned exceptions." This is definitely handled best by having the callee do the saving.

Other articles have discussed ways of decreasing the amount of work needed for the saving, such as bit arrays to indicate which registers are busy, and dynamic renaming of registers. There are also hardware methods proposed.
But in too many situations I have seen, callee saving is the most expensive way.

-- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!hrubin (UUCP)
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/06/91)
In article <7317@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
| Most of the situations I have seen are of two types; either the subroutine
| has few instructions, in which case it is likely to use few registers,
| and these probably should be inlined; or the subroutine does a lot of
| computing, in which case register save/restore, while not trivial, is
| at least not a major part of the cost.

You're right, but there are other cases which don't fall into those two categories: procedures which do a lot more work (and dirty more registers) for some input data than for others. You can argue that the complex cases should be handled by another level of procedure calls, but explicit coding or compiler inlining may prevent this. Given the case where paths of differing complexity occur, the "save if needed" instruction might be used several times:

	SIN	1,2
	CMP	FP[1],#2
	BLE	TRIVIAL
	SIN	4,5,7,8,9
COMPLX:	[ ... ]

Yes, I assume that the cost of the instruction is inherently low, perhaps zero with pipelining if no registers are saved. Still, it's an interesting thought, particularly if the "save" allows remapping of registers rather than actual transfer to memory or cache.

-- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"I'll come home in one of two ways, the big parade or in a body bag. I prefer the former but I'll take the latter" -Sgt Marco Rodrigez
ddean@rain.andrew.cmu.edu (Drew Dean) (03/06/91)
In article <3219@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
> Consider a call instruction which includes a bitmapped register set.
>I would include all those which will be used before reloading. Then
>assume a "save as needed" instruction which takes a bitmap of all the
>registers which must be used by the called routine. The registers which
>were in both sets would be saved, possibly by a restore mask which would
>enumerate which registers had been saved so they could be popped again.
>
> Oh heavens, a CISC instruction! But the time to execute should be one
>clock per register saved, at least as fast as saving all the registers
>every time.

The VAX did this in its CALLS instruction. Many people would rank this instruction right up there with POLY for usefulness. :-) Levy & Eckhouse, _Computer Programming and Architecture: The VAX_, 2nd Ed., says this on page 213: "On the other hand, CALLS, which accounts for only 2.54 percent of the instructions, accounts for 21.59 percent of the time!" [emphasis theirs]. The surrounding text indicates this is a VAX 11/780, running the VMS Fortran compiler in 1982, and the measurements were made with a hardware monitor. Most functions don't need the complexity of CALLS (note that it also sets up a frame pointer (or two?) and does a few other things as well). Good idea, but experience is against it... oh well....

-- Drew Dean
Drew_Dean@rain.andrew.cmu.edu
[CMU provides my net connection; they don't necessarily agree with me.]
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/06/91)
In article <12234@pt.cs.cmu.edu> ddean@rain.andrew.cmu.edu (Drew Dean) writes:
| The VAX did this in its CALLS instruction.

Using this as an example of saving some registers is somewhat like counting the boiler of a steam engine as overhead on the whistle. CALLS set up stack frames, saved a bunch of stuff unconditionally, etc. Also, CALLS saves all the registers named, even if the target doesn't use them. This is not at all like the idea I proposed, which would save only the set which (a) the caller wanted saved and (b) the target was going to use. It's really not the same thing at all.

-- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"I'll come home in one of two ways, the big parade or in a body bag. I prefer the former but I'll take the latter" -Sgt Marco Rodrigez
npw@eleazar.dartmouth.edu (Nicholas Wilt) (03/07/91)
In article <3219@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
> Consider a call instruction which includes a bitmapped register set.
>I would include all those which will be used before reloading. Then
>assume a "save as needed" instruction which takes a bitmap of all the
>registers which must be used by the called routine. The registers which
>were in both sets would be saved, possibly by a restore mask which would
>enumerate which registers had been saved so they could be popped again.
>
> Oh heavens, a CISC instruction! But the time to execute should be one
>clock per register saved, at least as fast as saving all the registers
>every time.

The Motorola 68881 architecture's FMOVEM instruction has a mechanism to do something like this. I've never heard of anyone using it, though.

--Nick
npw@eleazar.dartmouth.edu
cet1@cl.cam.ac.uk (C.E. Thompson) (03/07/91)
In article <12234@pt.cs.cmu.edu> ddean@rain.andrew.cmu.edu (Drew Dean) writes:
>In article <3219@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>> Consider a call instruction which includes a bitmapped register set.
>>I would include all those which will be used before reloading. Then
>>assume a "save as needed" instruction which takes a bitmap of all the
>>registers which must be used by the called routine. The registers which
>>were in both sets would be saved, possibly by a restore mask which would
>>enumerate which registers had been saved so they could be popped again.
>> ...
>
>The VAX did this in its CALLS instruction. ...
> {Diatribe (mostly deserved) against the CALLS instruction omitted}

CALLS/CALLG use only a callee-specified register mask: it is located in the callee's code. I understood Bill Davidsen to be proposing an instruction that combined a caller-specified mask and a callee-specified mask.

By the way, it is not clear that the notorious overheads of CALLS/CALLG are due to the mask-driven register storing at all: they do many (far too many) other things as well. Nor are mask-specified loads/stores in themselves particularly CISCy --- at least one nominally RISC processor (Acorn's ARM) uses them.

Chris Thompson
JANET: cet1@uk.ac.cam.phx
Internet: cet1%phx.cam.ac.uk@nsfnet-relay.ac.uk
jbuck@galileo.berkeley.edu (Joe Buck) (03/07/91)
In article <3219@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
|> > Consider a call instruction which includes a bitmapped register set.
|> >I would include all those which will be used before reloading. Then
|> >assume a "save as needed" instruction which takes a bitmap of all the
|> >registers which must be used by the called routine. The registers which
|> >were in both sets would be saved, possibly by a restore mask which would
|> >enumerate which registers had been saved so they could be popped again.
|> >
|> > Oh heavens, a CISC instruction! But the time to execute should be one
|> >clock per register saved, at least as fast as saving all the registers
|> >every time.

In article <12234@pt.cs.cmu.edu>, ddean@rain.andrew.cmu.edu (Drew Dean) writes:
|> The VAX did this in its CALLS instruction. Many people would rank this
|> instruction right up there with POLY for usefulness. :-) Levy & Eckhouse,
|> _Computer Programming and Architecture: The VAX_, 2nd Ed. says this on page
|> 213: "On the other hand, CALLS, which accounts for only 2.54 percent of the
|> instructions, accounts for 21.59 percent of the time!" [emphasis theirs].
|> The surrounding text indicates this is a VAX 11/780, running the VMS Fortran
|> compiler in 1982, and the measurements were made with a hardware monitor.
|> Most functions don't need the complexity of CALLS (note that it also sets up
|> a frame pointer (or 2?), and does a few other things as well).

Dave Patterson (who invented the acronym "RISC") said semi-jokingly that the whole RISC movement started as a rebellion against the kind of thinking that produced the VAX CALLS instruction. One of the worst inefficiencies of CALLS is that the register save mask isn't held in the instruction word itself; instead, the save mask appears as a 16-bit quantity stored at the beginning of the function to be called. This means that the machine must figure out at runtime, right in the critical path for any pipelining you might want to do, information that is known at compile time -- it must read back this word from external memory, then parse it, then save the registers it indicates. In Hennessy and Patterson's book, they stress the rule that you avoid making the processor figure out at runtime any information that is already known at compile time (in this case, what registers to save).

-- Joe Buck
jbuck@galileo.berkeley.edu {uunet,ucbvax}!galileo.berkeley.edu!jbuck
hrubin@pop.stat.purdue.edu (Herman Rubin) (03/07/91)
In article <12234@pt.cs.cmu.edu>, ddean@rain.andrew.cmu.edu (Drew Dean) writes:
> In article <3219@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
> > Consider a call instruction which includes a bitmapped register set.
> >I would include all those which will be used before reloading. Then
> >assume a "save as needed" instruction which takes a bitmap of all the
> >registers which must be used by the called routine. The registers which
> >were in both sets would be saved, possibly by a restore mask which would
> >enumerate which registers had been saved so they could be popped again.
> >
> > Oh heavens, a CISC instruction! But the time to execute should be one
> >clock per register saved, at least as fast as saving all the registers
> >every time.
>
> The VAX did this in its CALLS instruction. Many people would rank this
> instruction right up there with POLY for usefulness. :-) Levy & Eckhouse,
> _Computer Programming and Architecture: The VAX_, 2nd Ed. says this on page
> 213: "On the other hand, CALLS, which accounts for only 2.54 percent of the
> instructions, accounts for 21.59 percent of the time!" [emphasis theirs].
> The surrounding text indicates this is a VAX 11/780, running the VMS Fortran
> compiler in 1982, and the measurements were made with a hardware monitor.
> Most functions don't need the complexity of CALLS (note that it also sets up
> a frame pointer (or 2?), and does a few other things as well).

The relevant question is, "How bad is this from the standpoint of the possible alternatives?" In a subroutine call, the following is involved:

	Save the system state register and set up the new one.
	Save the modified program counter.
	Save those registers which should be saved.
	Save any other information needed to restore.

On return, this needs to be reversed. So we have in the neighborhood of a half-dozen to more than a dozen things to do. In addition, we are likely to have cache failures. I do not find it at all unusual that this instruction averages 10 times as long as other instructions.

Now some of this could have been saved IF the compiler could have used the jump to subroutine or branch to subroutine for subroutines in the same compilation, but this would have required more ingenuity in the compiler, and might not have saved that much for non-simple subroutines. Probably more would have been saved by inlining small subroutines; abs (or even better fabs) would have taken even less space to inline than to issue a call. However, library functions not inlined will always have separate compilation.

But a subroutine call is not likely to be cheap if any register saving is involved. Nor will it be any cheaper if 10 instructions are used rather than one 10 times as slow.

-- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet) {purdue,pur-ee}!l.cc!hrubin (UUCP)
sef@kithrup.COM (Sean Eric Fagan) (03/07/91)
In article <7403@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>The relevant question is, "How bad is this from the standpoint of the possible
>alternatives?" In a subroutine call, the following is involved:
>
> Save the system state register and set up the new one.
> Save the modified program counter.
> Save those registers which should be saved.
> Save any other information needed to restore.
>
>On return, this needs to be reversed.
>
>So we have in the neighborhood of a half-dozen to more than a dozen things to
>do. In addition, we are likely to have cache failures. I do not find it
>at all unusual that this instruction averages 10 times as long as other
>instructions.

Why, Herman? Don't you like complex instructions?

One of the reasons the vax calls instruction is so slow is, as was pointed out by another poster, that, in addition to what you describe, it also goes out and fetches 16 bits at the specified address, uses that to determine which registers to save, and then jumps to address+2bytes. Ugh. Too many memory references for my taste.

Note, also, that there is no reason why you should save the status register ("condition codes"); in fact, you may want to use that (if present) to determine what you should do. (E.g.,

	Boolean foo(integer);
	...
	if (foo(10) = True) then ...

could become

	push	10
	call	foo
	b.false	wherever

No?)

The method suggested which started this thread, which was proposed about this time last year, is that you do something like:

	call	foo, 0x1234

where foo does

	save	0x3

and the registers saved are described by the bitmask from 0x1234&0x3. Now, what do you do, however, when you call bar, which uses a mask of 0x1230? foo does a

	call	bar, 0x3

and the resultant mask is 0, thus no registers are saved, *despite* bar trashing some of foo's caller's registers.

Any thoughts on that one?

--
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.
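[Fagan's failure case is mechanical enough to check by simulation. The masks are the ones from his example; the 16-register file and the call/save model are invented for illustration.]

```python
def call(regs, caller_mask, callee):
    """Naive scheme: save only caller_mask & the callee's own use mask."""
    use_mask, body = callee
    to_save = caller_mask & use_mask
    saved = {r: regs[r] for r in range(16) if to_save & (1 << r)}
    body(regs)
    regs.update(saved)   # restore whatever was saved

def bar_body(regs):
    for r in range(16):
        if 0x1230 & (1 << r):
            regs[r] = "trash"      # bar clobbers r4, r5, r9, r12

def foo_body(regs):
    # foo itself only touches r0 and r1, so it calls bar with mask 0x3
    call(regs, 0x3, (0x1230, bar_body))

regs = {r: f"caller-{r}" for r in range(16)}
call(regs, 0x1234, (0x3, foo_body))
# 0x1234 & 0x3 == 0 and 0x3 & 0x1230 == 0, so nothing is ever saved:
# r4, r5, r9, r12 were live to foo's caller, but bar trashed them.
```

Running it shows exactly the hole: registers live in the outermost caller's mask (0x1234) get destroyed because the masks are intersected only one call level at a time.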
ddean@rain.andrew.cmu.edu (Drew Dean) (03/07/91)
In article <7403@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>In article <12234@pt.cs.cmu.edu>, ddean@rain.andrew.cmu.edu (Drew Dean) writes:
[discussion about VAX CALLS instruction and what a procedure call does deleted]
>But a subroutine call is not likely to be cheap if any register saving is
>involved. Nor will it be any cheaper if 10 instructions are used rather
>than one 10 times as slow.

Instead of "is not likely to be cheap", let's see what actual measurement gives us. I quote Hennessy & Patterson, p. 137:

	3.11 [30] <3.7, 3.9> Find a machine that has a powerful instruction
	set feature, such as the CALLS instruction on the VAX. Replace the
	powerful instruction with a simpler sequence that accomplishes what
	is needed. Measure the resultant running time. How do the two
	compare? Why might they be different? In the early 1980s, engineers
	at DEC performed a quick experiment to evaluate the impact of
	replacing CALLS. They found a 30% improvement in the run time on a
	very call-intensive program when the CALLS was simply replaced
	(parameters remained on the stack). How do your results compare?

Also, under the first Pitfall of Ch. 3 (p. 125) we find a long discussion of the CALLS instruction, which I omit for brevity, but quote: "The vast majority of calls in real programs do not require this amount of overhead." [Any typos are mine, and I apologize....]

Can someone give a reference to Patterson's finding that most procedures are relatively "simple"? (i.e. I remember <= 6 local vars, and about the same number of parameters, dominates usage in real code.) If you find, after simulation, that a lot of time is being spent saving and restoring registers from the stack, then investigate mechanisms like register windows. I believe the biggest lesson from all of this is to build instruction sets that compilers can use to the best advantage, be it a RISC or CISC design.

Anyone want to simulate a 32-bit Lilith and see what things look like?
-- Drew Dean Drew_Dean@rain.andrew.cmu.edu [CMU provides my net connection; they don't necessarily agree with me.]
srg@quick.com (Spencer Garrett) (03/07/91)
In article <3219@crdos1.crd.ge.COM>, davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:
> Consider a call instruction which includes a bitmapped register set.
> I would include all those which will be used before reloading. Then
> assume a "save as needed" instruction which takes a bitmap of all the
> registers which must be used by the called routine. The registers which
> were in both sets would be saved, possibly by a restore mask which would
> enumerate which registers had been saved so they could be popped again.

Alas, this approach sounds good, and would work well for very shallow calling trees, but it degenerates to callee-saves if the calling tree has any depth to it. Consider:

a) if you are successful in avoiding register saves, then the registers-in-use mask will tend toward all ones, and every register needed by the called routine will get saved.

b) if the bitmask isn't filling up, then you aren't postponing any saves, and the technique isn't buying you anything.

You could get a lot more performance from this technique if you could introduce some hysteresis into the process (i.e. don't restore a register until you get back to a routine that uses it), but this would be a nightmare to implement, since registers would have to be saved and restored from different stack frames.

The studies I have seen have not shown a huge difference in performance between caller-saves and callee-saves *across a broad range of applications*. There are, of course, specific instances in which either is clearly much better than the other. In general, callee-saves is easier to implement, but caller-saves offers more potential for optimization. I am unaware of any systems which are purely one or the other, although most come very close to a callee-saves convention.
avg@hq.demos.su (Vadim Antonov) (03/07/91)
>In article <9047@exodus.Eng.Sun.COM> chased@rbbb.Eng.Sun.COM (David Chase) writes:
>>Just another reason why caller-saves might be worthwhile.

Do not forget that a register-save instruction can enjoy burst mode for moving register blocks to and from memory. IMHO that's the only reason to introduce such instructions. A reasonable cache and instruction prefetching eliminate any need for specialised frame-saving instructions - a sequence of PUSHes/POPs works fine if you have no block CPU<-->memory transfers.

Vadim Antonov
DEMOS, Moscow, USSR
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/07/91)
In article <11727@pasteur.Berkeley.EDU> jbuck@galileo.berkeley.edu (Joe Buck) writes:
| In Hennessy and Patterson's book, they stress the rule that you avoid
| making the processor figure out at runtime any information that is already
| known at compile time (in this case, what registers to save).

Your conclusion would be valid if your supposition were correct. The original posting indicated that you would only save the registers which the target would modify and the caller wanted saved. To have a map at compile time of which registers are used in every called routine is just within the limits of the current state of the art, i.e. we know how to do it, but don't usually. To preserve that without recompiling every calling routine when a target is recompiled is not practical in terms of development time vs. time saved. Doing the operation at run time using modern design should take:

	1 cycle in the call to set the bitmap (since this can overlap the
	  cycle to do the call, this is "free")
	1 cycle for the AND of the caller and target mask, or ENTER
	1 cycle per register to do the saves (this could be burst mode or
	  something, and would be at least as fast as the individual
	  instructions)

Since my proposal adds one cycle for each call, it would have to find one register on the average which did not need to be saved. Looking at some SPARC code I believe this is likely. A smart chip design might be able to hide the cycle for instruction overhead, but I counted it because I'm not positive it can be done.

-- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"I'll come home in one of two ways, the big parade or in a body bag. I prefer the former but I'll take the latter" -Sgt Marco Rodrigez
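[Davidsen's break-even claim can be written out as arithmetic. The cycle counts are his assumptions, not measurements: the mask-set cycle is taken as free because it overlaps the call, leaving one cycle of overhead for the AND.]

```python
def cycles_save_all(k_live):
    """Baseline: push every live register unconditionally, one cycle each."""
    return k_live

def cycles_save_if_needed(k_live, k_avoided):
    """Proposed scheme: 1 cycle of overhead for the AND of the two masks,
    then one cycle per register actually saved."""
    return 1 + (k_live - k_avoided)
```

With one cycle of added overhead, the scheme breaks even exactly when one register per call is avoided on average, and wins for anything more, which is the sentence "it would have to find one register on the average which did not need to be saved" in numbers.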
peter@ficc.ferranti.com (Peter da Silva) (03/07/91)
In article <1991Mar6.163348.13003@dartvax.dartmouth.edu> npw@eleazar.dartmouth.edu (Nicholas Wilt) writes: > The Motorola 68881 architecture's FMOVEM instruction has a mechanism to > do something like this. I've never heard of anyone using it, though. I understand it can be used to implement a fast memmov() routine. -- Peter da Silva. `-_-' peter@ferranti.com +1 713 274 5180. 'U` "Have you hugged your wolf today?"
kers@hplb.hpl.hp.com (Chris Dollin) (03/07/91)
Joe Buck writes:
> One of the worst inefficiencies about CALLS is that the register save mask
> isn't saved in the instruction word itself; instead, the save mask appears
> as a 16-bit quantity stored at the beginning of the function to be called.
> This means that the machine must figure out at runtime, right in the
> critical path for any pipelining you might want to do, information that is
> known at compile time -- it must read back this word from external memory,
> then parse it, then save the registers it indicates.
I may be missing something ... but what makes you think that the registers used
by a called routine are known to the caller at *compile-time*?
I suppose that one could arrange for that information to be made available to
the linker, which would then patch up all the calls so that they'd know which
registers to save, but you still have to deal with procedure variables.
[In any case, it would seem sensible to decouple register saving from procedure
calling, as indeed seems to be the case with RISC architectures.]
--
Regards, Kers. | "You're better off not dreaming of the things to come;
Caravan: | Dreams are always ending far too soon."
peter@ficc.ferranti.com (Peter da Silva) (03/07/91)
> and the registers saved are described by the bitmask from 0x1234&0x3. Now,
> what do you do, however, when you call bar, which uses a mask of 0x1230?
> foo does a
>
>	call bar, 0x3
>
> and the resultant mask is 0, thus no registers are saved, *despite* bar
> trashing some of foo's caller's registers.

You could solve this by keeping a running track of "dirty" registers, which get set by "call" and cleared by "save". Let's see how this works here: you have to have "call" or-in the mask to be used, and have "save" clear the bits of the registers saved. That way any bits left over from foo's caller would still be set, unless foo has already saved them. Looks like it'd work. You could even use established VM techniques to decide when to save registers, but that's probably overkill.

--
Peter da Silva.  `-_-'  peter@ferranti.com  +1 713 274 5180.
                 'U`    "Have you hugged your wolf today?"
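[da Silva's fix -- "call" ORs the caller's mask into a running dirty mask, "save" stores and clears the bits it handles -- can be checked against Fagan's exact masks. The register-file model and function shapes are invented for illustration.]

```python
def call(regs, dirty, caller_mask, callee):
    """call ORs the caller's live mask into the running dirty mask;
    save then stores (and clears from dirty) dirty & callee-use bits."""
    use_mask, body = callee
    dirty |= caller_mask
    to_save = dirty & use_mask
    saved = {r: regs[r] for r in range(16) if to_save & (1 << r)}
    dirty &= ~to_save
    dirty = body(regs, dirty)
    regs.update(saved)          # restore on return
    return dirty | to_save      # restored registers are live again

def bar_body(regs, dirty):
    for r in range(16):
        if 0x1230 & (1 << r):
            regs[r] = "trash"   # bar clobbers r4, r5, r9, r12
    return dirty

def foo_body(regs, dirty):
    return call(regs, dirty, 0x3, (0x1230, bar_body))

regs = {r: f"caller-{r}" for r in range(16)}
call(regs, 0, 0x1234, (0x3, foo_body))
# The call to bar now sees dirty = 0x1234 | 0x3, so 0x1237 & 0x1230 =
# 0x1230 worth of registers get saved inside foo before bar runs.
```

With the running dirty mask, the bits left over from foo's caller survive down into the call to bar, and r4, r5, r9, r12 come back intact.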
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/08/91)
In article <1991Mar07.014907.11081@kithrup.COM> sef@kithrup.COM (Sean Eric Fagan) writes:
| and the registers saved are described by the bitmask from 0x1234&0x3. Now,
| what do you do, however, when you call bar, which uses a mask of 0x1230?
| foo does a
|
|	call bar, 0x3
|
| and the resultant mask is 0, thus no registers are saved, *despite* bar
| trashing some of foo's caller's registers.
|
| Any thoughts on that one?

I had several, most of which revolved around a "preserved registers" register. Unfortunately I think that will have to be saved, so it's not a viable solution. I think the answer is that any procedure which calls another procedure will have to assume saving all registers at its entry point, and therefore the caller's save mask would be used. Where this becomes a win is that the leaf procedures will not have to do that, and they get called the most.

All this got me thinking that another bit of memory access which could be saved on a call is the return address. By putting it into a (dedicated) register at call time the leaf procedures would avoid several memory accesses, and the CALL instruction would be simpler, since the IC would just go into the RETADR register and the stack pointer would not be changed. That, in turn, leaves the ALU free on the instruction, which clever software might find useful. We had this on the GE/Honeywell 36-bit machines. The common call instruction was TSXn, which left the return address in index register n. Another potential time saver.

-- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"Most of the VAX instructions are in microcode, but halt and no-op are in hardware for efficiency"
preston@ariel.rice.edu (Preston Briggs) (03/08/91)
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:
>> Consider a call instruction which includes a bitmapped register set.
>> I would include all those which will be used before reloading. Then
>> assume a "save as needed" instruction which takes a bitmap of all the
>> registers which must be used by the called routine.

srg@quick.com (Spencer Garrett) writes:
>Alas, this approach sounds good, and would work well for very shallow
>calling trees, but it degenerates to callee-saves if the calling
>tree has any depth to it.

Consider also the following case. Routine A calls B, specifying that register 1 must be saved. Routine B doesn't use register 1, so it's not pushed. Further, B calls C. We need to be sure that r1 is preserved, but I assume we can accumulate a "preserve" set:

	AB preserve - AB preserved + BC preserve

where "AB preserve" is the set that A required at the callsite to B, "AB preserved" is the set that B stored on the stack, and "BC preserve" is the set that B requires at the callsite to C. Sort of ugly, but perhaps not terrible. If C requires r1, it'll be pushed on entrance to C and popped on exit from C.

Now, make B call C many times in a row. It would have been much nicer to save r1 on the transition from A to B (by either the caller or callee).

On the same subject, interested parties might check a paper by Guy Steele (and Sussman?) from the 1980 Conference on Lisp and Functional Programming, called "Lazy Scoping, the Dream of a Lifetime" (or thereabouts -- my proceedings aren't where they belong). He introduces the notion of "racks" (sort of combining register and stack). The idea is that each rack has a little state machine, a value, and a stack. There are 4 operations: fetch, update, save, restore. You "save" live racks before a call and "restore" them after the call. If someone (arbitrarily deep in the call sequence) tries to assign to a saved rack, the saved value is preserved on the rack's private stack and a new value is established. When restoring, you pop only if you've pushed.
He explored several possible implementations of racks in the context of a Scheme interpreter. In the case of the interpreter, different special-purpose racks could use different implementations for best performance. I dunno how to build the things, but it's a fun paper to read. Preston Briggs
preston@ariel.rice.edu (Preston Briggs) (03/08/91)
I wrote: >On the same subject, interested parties might check a paper >by Guy Steele (and Sussman?) from the 1980 Conference on Lisp and >Functional Programming, called "Lazy Scoping, the Dream of a Lifetime" The correct reference is The Dream of a Lifetime: A Lazy Variable Extent Mechanism Guy L. Steele Jr. and Gerald J. Sussman Conference Record of the 1980 Lisp Conference
sef@kithrup.COM (Sean Eric Fagan) (03/08/91)
In article <3228@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes: > All this got me thinking that another bit of memory access which >could be saved on a call is the return address. Oh, almost everyone does that these days. i860, MIPS, Sparc?, Cray, 88k, etc. (After all, if you don't have a stack, where are you going to put the return address except in a register?) -- Sean Eric Fagan | "I made the universe, but please don't blame me for it; sef@kithrup.COM | I had a bellyache at the time." -----------------+ -- The Turtle (Stephen King, _It_) Any opinions expressed are my own, and generally unpopular with others.
torbenm@diku.dk (Torben Ægidius Mogensen) (03/08/91)
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes: > All this got me thinking that another bit of memory access which >could be saved on a call is the return address. By putting it into a >(dedicated) register at call time the leaf procedures would avoid >several memory accesses, and the CALL instruction would be simpler, >since the IC would just go into the RETADR register and the stack >pointer would not be changed. That, in turn, leaves the ALU free on the >instruction, which clever software might find useful. > We had this on the GE/Honeywell 36 bit machines. The common call >instruction was TSXn, which left the return address in index register n. >Another potential time saver. The ARM family of processors do this too: The BL (branch and link) instruction saves the PC (R15) in R14, which is called the Link register. There is no JSR instruction in the usual sense, so procedures save and restore the Link register on the stack when necessary. Return is just MOV R15,R14. Torben Mogensen (torbenm@diku.dk)
rpw3@rigden.wpd.sgi.com (Rob Warnock) (03/10/91)
In article <1991Mar7.063957.17197@quick.com> srg@quick.com (Spencer Garrett) writes: +--------------- | You could get a lot more performance from this technique if you | could introduce some hysteresis into the process (ie - don't restore | a register until you get back to a routine that uses it), but | this would be a nightmare to implement, since registers would have | to be saved and restored from different stack frames. +--------------- This is *exactly* what the Am29000 stack-cache register windowing does for you -- provides a *large* amount of hysteresis in the "spill/fill" activity, so much so that register spill/fills generally account for well under 1% of all cycles (Dhrystone/grep/diff/nroff/asm/Stanford, etc.). -Rob
hrubin@pop.stat.purdue.edu (Herman Rubin) (03/11/91)
In article <1991Mar07.014907.11081@kithrup.COM>, sef@kithrup.COM (Sean Eric Fagan) writes:
> In article <7403@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> >The relevant question is, "How bad is this from the standpoint of the possible
> >alternatives?" In a subroutine call, the following is involved:
> >
> >	Save the system state register and set up the new one.
> >	Save the modified program counter.
> >	Save those registers which should be saved.
> >	Save any other information needed to restore.
> >
> >On return, this needs to be reversed.
> >
> >So we have in the neighborhood of a half-dozen to more than a dozen things to
> >do. In addition, we are likely to have cache failures. I do not find it
> >at all unusual that this instruction averages 10 times as long as other
> >instructions.
> 
> Why, Herman? Don't you like complex instructions?
> 
> One of the reasons the vax calls instruction is so slow is, as was pointed
> out by another poster, that, in addition to what you describe, it also goes
> out and fetches 16 bits at the specified address, uses that to determine
> which registers to save, and then jumps to address+2bytes. Ugh. Too many
> memory references for my taste. Note, also, that there is no reason why you
> should save the status register ("condition codes"); in fact, you may want
> to use that (if present) to determine what you should do. (E.g., [...]

Unless one is going to save all registers, it is necessary to know which ones to save. I am used to situations (call them "planned exceptions", if you will) in which the subroutine call is a rare occurrence. Now there is a way around this at some hardware cost if one would have a means of instructing the machine to proceed in a particular manner unless a user interrupt occurs, and to be able to reset in that case.

If a machine has a goodly number of registers, the "normal" situation will be for a program to be using lots of them.
It is not that unusual for a subroutine, even one doing a fair amount of work, to use many fewer. For the caller to save registers without regard for what the callee uses will be expensive unless there is some automatic means of doing this, such as register blocks on the PYRAMID. Otherwise there has to be some means of finding out which registers the callee uses, and reading the mask is about as quick a way as any.

I do not think our current instructions are complex enough, in many situations. In a register-saving situation without register blocks, the memory references are needed. Now one could have anticipating reads of the save mask, so that that reference will not slow things down. If one wants to use an intersection mask, an "old active" mask for subsequent calls will still be needed.

You do need to save such things as the status register. The old status register should still be available if one wants to use it.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin (UUCP)
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/12/91)
In article <7613@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
| I do not think our current instructions are complex enough, in many situations.

I'm sure that will make it into other quote files besides mine... but I know what you mean. To justify a complex (microcode) instruction, I believe you have to satisfy the following:

   /- be faster than doing the same thing in multiple instructions
  <          - OR -
   \- be as fast and smaller than the hardcode, to save space

           - AND -

  not use chip space which could be used to make the *overall*
  performance better by doing something else.

You can sum this up as "best use of the real estate" or "best overall performance", and many other people have done so. I have no attachment to hardcoded instructions, I just want top performance overall, without any horrible "worst case" programs which run orders of magnitude slower than you would expect.
-- 
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
  "Most of the VAX instructions are in microcode, but halt and no-op are
  in hardware for efficiency"
davec@nucleus.amd.com (Dave Christie) (03/12/91)
In article <MGY90JF@xds13.ferranti.com> peter@ficc.ferranti.com (Peter da Silva) writes:
>You could solve this by keeping track of "dirty" registers, which
>get set by "call" and cleared by "save".
>
>You have to have "call" or-in the mask to be used, and have "save" clear
>the bits of the registers saved. That way any bits left over from foo's
>caller would still be set, unless foo has already saved them.
>
>Looks like it'd work. You could even use established VM techniques to
>decide when to save registers, but that's probably overkill.

Yes, this does it.

In a former life I designed hardware support for a mainframe with VAX-style everything-but-the-kitchen-sink call/return instructions to do the register save/restores in essentially zero time. The paper that Preston Briggs referred to seems very similar - a little state machine + stack for each register; also a stack for the save masks and maybe some other control state. It had to be transparent to software for backwards compatibility. Fun to design, but fraught with extra bits of complexity due to pipelining, branches, fault tolerance, context switches, etc.

In the final analysis, it was hard to justify the complexity and real estate, compared with the cost/performance of, say, R3000 call/return. It was never implemented. In fact, it was one (of several) of the experiences with CISC design that made me a RISC advocate... and added another straw to their CISC camel's back. IMHO, the 29000 stack cache is a much better way (no plug intended) (well, maybe a little...)
---------------
Dave Christie
My opinions only.
don@dgbt.doc.ca (Donald McLachlan) (03/12/91)
The idea of saving the return address in a register, rather than on the stack, sounds nice to me, but ...

This requires that the calling function knows that the called function is a leaf function -- not very practical from a high level language point of view, as far as I can tell.

The only way I can think to generalise this would be to always put the return address in a dedicated register. This would require that the "call" would first push the old contents onto the stack and then load in the new return address. The matching return would use the dedicated register as the return address. The function making the call would then be responsible for grabbing the old return address off the stack and loading it into the dedicated register.

A further optimisation I could see for this is that any function that performs a call could save the "return register" on the stack once at invocation, and restore it prior to returning. (gee, isn't this what is done now :-)

Now that all the mechanics are out of the way (the way I see them) only one question remains. HOW MUCH DOES THIS ACTUALLY SAVE??? I interpret this as: what is the ratio of calls to "leaf functions" versus calls to "non-leaf functions"?

Does anyone have any data, or a good guess at this last question?
dhoyt@vx.acs.umn.edu (DAVID HOYT) (03/12/91)
In article <1991Mar11.192116.1974@dgbt.doc.ca>, don@dgbt.doc.ca (Donald McLachlan) writes...
> The only way I can think to generalise this would be to always
>put the return address in a dedicated register. [...]

Imagine you have, say, 32 registers. On procedure call r7 is the return address and r0..r6 are the first seven parameters. Now if our code looks like this

	save r7                          | sub::
	for( i = 0; i < 10000000; i++ )  |     ld r0, r1 ; add 1, r1 ; stor r0, r1
	    sub( a + i )                 |     jmp r7
	restore r7                       |

we've done 10,000,001 reads and the same number of writes. Now if we had a vax jsr/rsb instruction (one write, one read per call) we would have 20m reads and writes, doubling the memory accesses. Obviously, even with a good cache, we'd much rather do our subroutine calls the first way, rather than the vax way. And the normal CALLS instruction that the vax uses for subroutine calls is much more expensive than the jsr/rsb calls. We could greatly speed up subroutine calls on a vax by using 4/16 registers this way, or even 8/16 I suspect.

In my example sub() is just a normal subroutine, so we don't need a smart linker or code inliner. In fact the only penalty with the all-register procedure call is that we have an additional two (unconditional) branches. With a small direct-map instruction buffer (ala Cray) and a branch-intelligent pipeline, the execution time cost of the two branches would be close to nothing. Basically giving you inline virtual functions that we wouldn't even have to declare inline; everybody wins, wee ha!

david | dhoyt@vx.acs.umn.edu | dhoyt@umnacvx.bitnet
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/12/91)
In article <1991Mar11.192116.1974@dgbt.doc.ca> don@dgbt.doc.ca (Donald McLachlan) writes:
| The only way I can think to generalise this would be to always
| put the return address in a dedicated register. This would require that
| the "call" would first push the old contents onto the stack and then
| load in the new return address.

The previous return info has to be saved, but not by the CALL instruction. There are lots of non-recursive languages (FORTRAN) which don't need to worry about another call and could save the return address as needed. Also, a smart compiler would fake a CALL and let the target procedure return directly to the caller when desirable.

| Now that all the mechanics are out of the way (the way I see them)
| only one question remains. HOW MUCH DOES THIS ACTUALLY SAVE???
| I interpret this as ... What is the ratio of calls to "leaf functions"
| versus calls to "non-leaf functions"?

Wish I had the time to do a few measurements. My guess is that it would be over half the calls with common coding practices, but that's pure gut feeling, so don't expect me to prove it. Any ability of the compiler to inline procedures allows more procedures to be leaf, so it will provide more benefit when optimized.

I also note that some languages can pass the arguments inline in the code, using various techniques to avoid executing them. No, it doesn't require self-modifying code, just an extra level of indirection (saying they're bad languages doesn't mean you can't design hardware to run them well). If you must have a pointer back to the caller in a register anyway, you can use it in the return. The world is not *exclusively* C.

One nice thing about it is that it shouldn't ever have a penalty in performance... the worst case appears to be no worse than what we usually use now, and the best case seems to save two memory accesses and two changes in the stack pointer.
-- bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen) "Most of the VAX instructions are in microcode, but halt and no-op are in hardware for efficiency"
turner@lance.tis.llnl.gov (Michael Turner) (03/12/91)
In article <12269@pt.cs.cmu.edu> ddean@rain.andrew.cmu.edu (Drew Dean) writes: >Can someone give a reference to Patterson's finding that most procedures are >relatively "simple" ? (ie. I remember <= 6 local vars, and about the same >number of parameters dominates usage in real code.) If you find, after >simulation that a lot of time is being spent saving & restoring registers >from the stack, then investigate mechanisms like register windows. My first reaction to this was something like "the code *better* have fewer than 6 parameters and 6 local variables per subroutine call on average, or I'm not going to work on it." Cooling down a little, I thought: maybe there's an idea--instead of optimizing architectures around how code *is*, perhaps it would make sense to look at how code *should* be. Personally, I wouldn't mind using a machine that grossly penalized subroutines with more than 4 parameters, in exchange for some slightly greater reward for using fewer, because I seldom write code that uses more than 4 routinely, and *hate* inheriting code that does. More often than not, it seems to me, people use extra parameters rather than splitting up code into separate functions to make it more modular. Register windows are a start on killing off this pernicious tendency. I'm not sure how you'd set up an evaluation, since there are so many different views of that ineffable sense of quality about code. Perhaps you'd need to average out the responses of various programmers to various kinds of code. Preferably in some relatively objective way, like how long it took them to find planted bugs, rather than according to their immediate reactions to a code inspection. Rewarding maintainability with extra performance. Why not? Maybe compilers are being too kind.... --- Michael Turner turner@tis.llnl.gov
jackk@leland.Stanford.EDU (Jack Kouloheris) (03/12/91)
In article <3241@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>In article <1991Mar11.192116.1974@dgbt.doc.ca> don@dgbt.doc.ca (Donald McLachlan) writes:
>| The only way I can think to generalise this would be to always
>| put the return address in a dedicated register. This would require that
>| the "call" would first push the old contents onto the stack and then
>| load in the new return address.
>
> The previous return info has to be saved, but not by the CALL
>instruction. There are lots of non-recursive languages (FORTRAN) which
>don't need to worry about another call and could save the return address
>as needed. Also, a smart compiler would fake a CALL and let the target
>procedure return directly to the caller when desirable.

People seem to be talking about this call/return mechanism as if it is something novel... This has been the standard mechanism in the S/360, S/370, and S/390 systems for more than 25 years. All the languages that don't allow recursion have a standard call/return convention via register 15. There is no hardware stack on the 370; the call instruction is usually an assembler macro that does a BALR 15 (Branch And Link Register). A convention is followed for creating a register save area, which is then filled via a Store Multiple instruction.

If one always saves all registers on a call (as some compilers do), then a standard OS routine can trace back the call sequence via register 15 (as well as the contents of each register on entry to the call). Of course, today optimizing compilers (as on the RS/6000) can save just the required registers across a call.
mash@mips.com (John Mashey) (03/12/91)
In article <1991Mar11.192116.1974@dgbt.doc.ca> don@dgbt.doc.ca (Donald McLachlan) writes:
> The idea of saving the return address in a register, rather than
>on the stack sounds nice to me, but ...
> This requires that the calling function knows that the called
>function is a leaf function, not very practical from a high level language
>point of view as far as I can tell.
>
> The only way I can think to generalise this would be to always
>put the return address in a dedicated register.

1) Many RISCs around do this, i.e., use the moral equivalent of the S/360 BRANCH-AND-LINK instruction.

2) The callee:
   a) saves the return address register (for sure), OR
   b) (if a leaf, and it certainly knows that) never saves the return address register, OR MAYBE
   c) if the callee has "leaf paths", i.e., ones where it doesn't call other subroutines, even though it has some where it does, the compiler may choose to rearrange the save/restores to avoid them on some of the paths where they would be unnecessary. (Some compilers do this, i.e., cc -O3 on MIPS, for example, in certain cases.)

3) As has been posted here in the past, it has been observed that the number of registers saved/restored per function call can be kept small (like about 2, on the average, for suites of programs of SPEC-like nature), if:
   a) you have a good optimizer with a good register allocator, and
   b) you have ENOUGH registers that a reasonable number (say 50%) can be made scratch (caller save). Typically, enough == 20..32 available for general use.

4) In many environments, the frequency of leaf routines is high, on the order of 50%.

5) There is plenty of analysis of this topic in Hennessy and Patterson, including detailed statistics about the effects of more registers and/or better optimization.
6) On any MIPS-based machine, assuming that it includes pixie, prof, and pixstats (most do), it is trivial to do this kind of analysis:

	pixie program		(creates program.pixie)
	program.pixie		(run it)
	pixstats program	(get the statistics)

and look for the section that lists the number of callee-saved registers saved/restored: this is a table that goes from 0 to 10. Perhaps easier, look for the line labeled "Average registers saved per entry:". To get the % of calls to leaf routines is a little harder, but one of the "prof" sections gives the number of calls to each function, and the percentage of total, although it doesn't identify whether something is a leaf or not. Also, this begs the question of whether something is clearly a leaf (i.e., zero calls to other functions), or dynamically a leaf (i.e., some calls of it do NOT cause it to call others; there are compilers that can move register save/restores around to sometimes avoid saving them when unnecessary).

7) Finally, the really crucial thing in all of this is to understand how big a percentage of run time is spent in subroutine call overhead, and THEN, and ONLY THEN, figure out if it's worth a lot of hardware to do something about it, especially if that hardware is very difficult to pipeline. The evidence, so far, is that good compilers plus enough registers get rid of an awful lot of the overhead. One can argue, or not, about whether register windows get rid of enough more to be worth it. Certainly, I'd prefer register windows to complex function call instructions. Among other things, in looking at RISC code, in which instructions are frequently re-ordered, I've quite often seen the pieces of the subroutine call mechanism spread all over a piece of code, in ways that fill cycles that would otherwise be stalls, and in sequences for which hardware is unimaginable...
-- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086
sef@kithrup.COM (Sean Eric Fagan) (03/12/91)
In article <1991Mar11.192116.1974@dgbt.doc.ca> don@dgbt.doc.ca (Donald McLachlan) writes:
> The idea of saving the return address in a register, rather than
>on the stack sounds nice to me, but ...
> This requires that the calling function knows that the called
>function is a leaf function, not very practical from a high level language
>point of view as far as I can tell.

Uhm... that doesn't quite follow. How did you come up with that conclusion? The *called* function knows whether or not it's a leaf function; this is fine, as it's the one that has to save the return address or not. (That is, if I call any other functions, I will save my return address. I don't need to know anything about the functions I call%. If I don't call any other functions, then, obviously, I don't need to save the return-address register.)

----
% One thing you can do, with enough magic software, is to have the linker do N-level searches on all called functions, and decide what set of registers it can safely use without saving.
----

> The only way I can think to generalise this would be to always
>put the return address in a dedicated register.

That's what happens in *lots* of machines: Crays, 88k's, MIPSes, i860's, etc. The call instruction only saves the return address in the register. Between software and hardware conventions, it's pretty quick (note that they don't save anything into memory; accessing memory is *slow*, and should be avoided almost as much as divides should be 8-)).

>This would require that
>the "call" would first push the old contents onto the stack and then
>load in the new return address.

If you call any functions, you are not a leaf function. Therefore, as part of your prologue, you save the return address you were given, and restore it as part of the function epilogue. Note that if you only have a couple of function calls (i.e., not in a loop), then you can save the address for each call, on the assumption that it might be faster. E.g.:

	void
	foo(int i) {
		...
		if (foobar == i)
			blech(i & foobar);
		...
	}

might generate code that looks like:

	foo:
		...			; prologue
		load	r2, foobar
		load	r3, i		; probably already in a register, actually
		bne	r2, r3, skip.0
		and	r2, r3, r4	; assuming a delay slot
		PUSH	r31		; push being a macro
		PUSH	r4
		call	blech
		POP	r31		; restore return address
	skip.0:
		; whatever

> Now that all the mechanics are out of the way (the way I see them)
>only one question remains. HOW MUCH DOES THIS ACTUALLY SAVE???

At worst, it costs nothing more than what is "traditionally" done. Best case, it saves a few memory accesses, which is a Good Thing.
-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.
bruce@cs.su.oz (Bruce Janson) (03/12/91)
In article <1353@ncis.tis.llnl.gov> turner@lance.tis.llnl.gov (Michael Turner) writes: >.. >.. maybe there's an idea--instead of optimizing architectures >around how code *is*, perhaps it would make sense to look at how code >*should* be. Personally, I wouldn't mind using a machine that >grossly penalized subroutines with more than 4 parameters, in exchange >for some slightly greater reward for using fewer, because I seldom write >.. The MIPS compilers put the first 4 words of integer arguments into registers $4..$7 (aka a0-a3) and the first 2 single or double precision arguments into floating point registers $f12 and $f14. [See pp. D-2 and D-3 of the mips RISC ARCHITECTURE book by Gerry Kane.] Apparently the MIPS people do a lot of simulation and measurement, they may even have done the tests that you were considering... Cheers, bruce. Bruce Janson Basser Department of Computer Science University of Sydney Sydney, N.S.W., 2006 AUSTRALIA Internet: bruce@cs.su.oz.au Telephone: +61-2-692-3272 Fax: +61-2-692-3838
ccplumb@rose.uwaterloo.ca (Colin Plumb) (03/12/91)
turner@lance.tis.llnl.gov (Michael Turner) wrote: > Cooling down a little, I >thought: maybe there's an idea--instead of optimizing architectures >around how code *is*, perhaps it would make sense to look at how code >*should* be. Personally, I wouldn't mind using a machine that >grossly penalized subroutines with more than 4 parameters, in exchange >for some slightly greater reward for using fewer, because I seldom write >code that uses more than 4 routinely, and *hate* inheriting code that >does. Well, I can live with 4 parameters as long as you don't want them to fit in one register each. If you write graphics code, you pass things like 3-vectors (3 registers), basis matrices (9 registers), affine transformations (12) and bicubic patches (16) around. In integers, you often pass rectangle boundaries (4 registers) around. In general, I'd like to suggest that asking the hardware to enforce coding style is like killing ants with a sledgehammer. Both too effective and not effective enough. You'll piss off people who have good reason to want to do what they're doing, and people who want to write bad code will continue to write bad code. -- -Colin
tim@proton.amd.com (Tim Olson) (03/12/91)
In article <1991Mar11.192116.1974@dgbt.doc.ca> don@dgbt.doc.ca (Donald McLachlan) writes: | Now that all the mechanics are out of the way (the way I see them) | only one question remains. HOW MUCH DOES THIS ACTUALLY SAVE??? | I interpret this as ... What is the ratio of calls to "leaf functions" | versus calls to "non-leaf functions"? From a collection of benchmarks we have, we have run dynamic leaf-call statistics, and see that leaf routines account for 40% - 60% of all function calls. The benchmark with the fewest leaf-routine calls was "compress", with less than 1%; the most leaf-routine calls were in a kalman filter benchmark (96%). Because about half of all calls are to leaf functions, leaf-procedure optimizations are important. -- -- Tim Olson Advanced Micro Devices (tim@amd.com)
spot@CS.CMU.EDU (Scott Draves) (03/13/91)
In article <1991Mar12.115845.20736@watdragon.waterloo.edu> ccplumb@rose.uwaterloo.ca (Colin Plumb) writes:
> Well, I can live with 4 parameters as long as you don't want them to fit in
> one register each. If you write graphics code, you pass things like 3-vectors
> (3 registers), basis matrices (9 registers), affine transformations (12)
> and bicubic patches (16) around.
> In integers, you often pass rectangle boundaries (4 registers) around.
you are almost certainly passing *pointers* to these arrays and
matrices around. at least i hope you are...
> In general, I'd like to suggest that asking the hardware to enforce
> coding style is like killing ants with a sledgehammer.
it's not enforcing anything, it just runs faster if you use its style.
--
christianity is stupid
Scott Draves communism is good
spot@cs.cmu.edu give up
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/13/91)
In article <912@spim.mips.COM> mash@mips.com (John Mashey) writes:
| 7) Finally, the really crucial thing in all of this is to understand
| how big a percentage of run time is spent in subroutine call overhead,
| and THEN, and ONLY THEN figure out if it's worth a lot of hardware
| to do something about it, especially if that hardware is very difficult
| to pipeline. The evidence, so far, is that good compilers plus enough
| registers get rid of an awful lot of the overhead. One can argue,

Well, I'm not about to argue, but I may pose a question: in systems which use register windows, does someone have a good handle on the cost of doing all the register-to-register moves I see just before a call? I see some papers which show big gains by not moving data register to memory to register to put arguments on a stack, but they seem to discount the cost of the moves, and the reduced performance of the optimizer having to treat a lot of registers as temps or plain unusable.

I would have to read some papers to clearly understand how to measure the optimizer impact, so if someone has done it, speak up. Obviously there's a win in register windows if they're large enough, but I don't know that it's as big as the papers on RISC claim.
-- 
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
  "Most of the VAX instructions are in microcode, but halt and no-op are
  in hardware for efficiency"
mike@vlsivie.tuwien.ac.at (Michael K. Gschwind) (03/13/91)
In article <1991Mar11.192116.1974@dgbt.doc.ca> don@dgbt.doc.ca (Donald McLachlan) writes:
>	The idea of saving the return address in a register, rather than
>on the stack sounds nice to me, but ...

It is nice AND saves a lot of cycles ...

>	This requires that the calling function knows that the called
>function is a leaf function, not very practical from a high level language
>point of view as far as I can tell.

Simply not true.  As you yourself explain later.

>	A further optimisation I could see for this is any function that
>performs a call could save the "return register" on the stack once at
>invocation, and restore it prior to returning. (gee isn't this what is done
>now :-)

Exactly.

>	Now that all the mechanics are out of the way (the way I see them)
>only one question remains.  HOW MUCH DOES THIS ACTUALLY SAVE???
>I interpret this as ... What is the ratio of calls to "leaf functions"
>versus calls to "non-leaf functions"?

Dynamically, about 50% of all functions called are leaf-functions.
Assuming you save 2 memory accesses AND 2 stack manipulations, this is
a quite interesting optimization.  Also, the CALL instruction becomes
much simpler: just move the PC to some register and some register to
the PC - no stack handling involved.  Everything is much smoother (and
simpler to design - read my thesis ;-)

			bye,
				mike

Michael K. Gschwind, Dept. of VLSI-Design, Vienna University of Technology
mike@vlsivie.tuwien.ac.at	1-2-3-4 kick the lawsuits out the door
mike@vlsivie.uucp		5-6-7-8 innovate don't litigate
e182202@awituw01.bitnet		9-A-B-C interfaces should be free
Voice: (++43).1.58801 8144	D-E-F-O look and feel has got to go!
Fax:   (++43).1.569697
torek@elf.ee.lbl.gov (Chris Torek) (03/14/91)
In article <3255@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>... in systems which use register windows, does someone have a good
>handle on the cost of doing all the register-to-register moves I see
>just before a call?

In many (yeah sure :-/ ) cases you can put the values in the windowed
registers in the first place, and/or leaf functions (those again!) can
avoid using a new window (doing everything in the caller's window and
any `callee saves' temporary registers).

> Obviously there's a win in register windows if they're large enough,
>but I suspect it isn't as big as the papers on RISC claim.

Depending on your compiler and applications, it may even be zero or
negative.  Applications which have a lot of `up-down' motion over a
small call stack range, and which thus fit perfectly in the windows,
will get either large gains, or (if your register allocation is good
enough) none at all, on machines with register windows (SPARC) vs
machines with `just a lot of registers' (AM29000).
--
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov
tim@proton.amd.com (Tim Olson) (03/14/91)
In article <3255@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
|   Well I'm not about to argue but I may pose a question, in systems
| which use register windows, does someone have a good handle on the cost
| of doing all the register-to-register moves I see just before a call? I
| see some papers which show big gains by not moving data register to
| memory to register putting arguments on a stack, but they seem to
| discount the cost of the moves, and the reduced performance of the
| optimizer having to treat a lot of registers as temps or plain
| unusable.  I would have to read some papers to clearly understand how
| to measure the optimizer impact, so if someone has done it speak up.

The register-to-register movement that you mention is not due to a
windowed-register implementation; simply passing parameters in
registers will cause this, because a function may be called by more
than one parent, and they all must agree on what registers the
parameters must be passed in.

However, the register-to-register movement is not required all the
time; register allocators can bind local variables to the correct
outgoing parameter register through copy propagation.
For example, the function:

----------------------------------------
void g(int, int, int);

void f(int j)
{
	int i1, i2;

	i1 = j+5;
	i2 = j-7;
	g(i2, j, i1);
}
----------------------------------------

compiles to the following, using the MetaWare C compiler for the 29K:

----------------------------------------
_f:
	sub	gr1,gr1,24
	asgeu	V_SPILL,gr1,gr126
	add	lr1,gr1,36
;5 |	int i1, i2;		<- local variables i1, i2
;6 |
;7 |	i1 = j+5;
	add	lr4,lr8,5	<- i1 is assigned local register 4
;8 |	i2 = j-7;		   (3rd parameter reg)
	sub	lr2,lr8,7	<- i2 is assigned local register 2
;9 |	g(i2, j, i1);		   (1st parameter reg)
	call	lr0,_g		<- so this call needs no copying,
	add	lr3,lr8,0	<- except incoming parameter j
;10 |}
	add	gr1,gr1,24
	nop
	jmpi	lr0
	asleu	V_FILL,lr1,gr127
----------------------------------------
--
	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
henry@zoo.toronto.edu (Henry Spencer) (03/14/91)
In article <10896@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>>... in systems which use register windows, does someone have a good
>>handle on the cost of doing all the register to register I see just
>>before a call?
>
>In many (yeah sure :-/ ) cases you can put the values in the windowed
>registers in the first place...

In fact, it is tempting to generalize this and say "register-to-register
moves are the sign of a defective compiler".  This isn't quite always
true, of course.  In particular, if you are passing the value of a
register variable, it often needs to be copied because the call/return
sequence will generally destroy the passed copy.  Still, r-t-r moves
should be viewed with suspicion.

Turn the question around: if those moves weren't being done register to
register, how *would* they be done?  Register to memory?  That costs
more, not less.

>... Applications which have a lot of `up-down' motion over a
>small call stack range, and which thus fit perfectly in the windows,
>will get either large gains, or (if your register allocation is good
>enough) none at all, on machines with register windows (SPARC) vs
>machines with `just a lot of registers' (AM29000).

AHEM.  Chris, the Am29k has *both*.  You can use 128 of those registers
as a register-window bank if you want, with the added bonus that the
size of the windows is completely up to you and can be different for
each function.
--
"But this *is* the simplified version  | Henry Spencer @ U of Toronto Zoology
for the general public."     -S. Harris| henry@zoo.toronto.edu  utzoo!henry
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (03/14/91)
In article <10896@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
| In many (yeah sure :-/ ) cases you can put the values in the windowed
| registers in the first place, and/or leaf functions (those again!) can
| avoid using a new window (doing everything in the caller's window and
| any `callee saves' temporary registers).

  The obvious exception is pass-through values, where values you got
from your caller are passed to a leaf.  These come in in the wrong
place, so you can't avoid the move.  And consecutive calls will always
need a move to get the args in the right place.

  Maybe there's some joy to having the caller control the offset of the
windows.  Then I can work with the things I will pass to leafA in
x4..6, and leafB in x9..12, and specify that my x4 becomes x0 for
leafA, but my x9 becomes x0 for leafB.  That would probably raise more
problems than it kills, though.
--
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
  "Most of the VAX instructions are in microcode, but halt and no-op
are in hardware for efficiency"
torek@elf.ee.lbl.gov (Chris Torek) (03/15/91)
In article <1991Mar13.200326.29116@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>AHEM.  Chris, the Am29k has *both*.  You can use 128 of those registers
>as a register-window bank if you want, with the added bonus that the
>size of the windows is completely up to you and can be different for
>each function.

True.  I was originally going to use SPARC and MIPS in my examples, but
I figured people would object to comparing a 127-register machine
(Sun's SPARC chips: 8 windows of which 1 is reserved for traps, 16
registers per window, plus 8 more overlapping into the trap window
[this makes the trap handlers somewhat painful], plus 8 `global'
registers of which one is wired zero) to a 31-register machine (MIPS
R[23]000: 32 registers, one wired zero).  So when I said:

>>... and which thus fit perfectly in the windows,
>>will get either large gains, or (if your register allocation is good
>>enough) none at all, on machines with register windows (SPARC) vs
>>machines with `just a lot of registers' (AM29000).

perhaps I should have made it `... with register windows (SPARC,
AM29000) vs machines with ``just a lot of registers'' (AM29000)'.

(Actually, the SPARC has 8 `user-usable' global registers, the last one
being the %y register, but it is a special case.  Still, it might make
a good holding register for leaf functions.  Now I am tempted to hack
gcc to do this. :-) )
--
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov