[comp.arch] subroutine frequency

dwc@homxc.UUCP (01/24/87)

>There is some ongoing discussion about register usage in programs.
>While it is true that one generally wants to keep as many variables
>as possible in registers (especially such things as loop indices!),
>there are times when using a lot of registers can actually slow
>things down.

>This arises when a subroutine shares registers in common with its
>parent. If the routine is called many times the cost of saving and
>restoring registers during the call and return can offset any
>speed savings thru register usage. I have actually seen this
>phenomenon occur quite strikingly in programs that require 
>multi-precision arithmetic. The cost of saving the registers for
>a multi-precision add or multiply routine (say) when it is called
>~10^8 or 10^9 times can greatly slow down a program.


am i missing something or does the frequency of calls not matter?
if it is slower to save the registers over many calls, is it not
slower to save the registers for just one call?  isn't it just an
issue of memory references within the routine?  if the register
saves result in a greater number of references than you would have
without them (i.e. without the register variables), then it is not
worth it no matter how many times the routine is called.

and doesn't this only show up with assembly level programming?
do compilers take the time to keep track of which registers have
been used and only save the 'dirty' ones or do most call and
return mechanisms save the entire register set on the stack?  anyone
know?

if not, i guess this is still a valid issue in c, since you can
declare register variables within any block (is this the correct term?).

danny chen
ihnp4!homxc!dwc

gwu@vlsi.cs.cmu.edu.UUCP (01/26/87)

> am i missing something or does the frequency of calls not matter?
> if it is slower to save the registers over many calls, is it not
> slower to save the registers for just one call?  isn't it just an
> issue of memory references within the routine?  if the register
> saves result in a greater number of references than you would have
> without them (i.e. without the register variables), then it is not
> worth it no matter how many times the routine is called.

     It IS a matter of references to memory. Picture a routine which makes
many accesses to its variables, all of which are declared as register
variables. If, in the middle, there is one call to a routine which requires
the registers to be saved, it is probably still advantageous to keep the
variables in registers: the total number of memory references is then
pretty much just that of the procedure call mechanism.

     If, on the other hand, the calling routine rarely accesses its
variables but calls another routine very often, then leaving the variables
out of registers means the only memory references are the few accesses to
those variables - no save/restore traffic is added to every call.

     Whether the program saves only those registers actually in use also
makes a difference. I suspect that this is compiler-dependent. If all
registers are automatically saved with every procedure invocation, there's
obviously going to be some waste. Even if the compiler is intelligent about
saving registers, the above example still shows that the number of calls to
other routines can change the effectiveness of declaring register variables.
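
     To make this concrete, here is a toy sketch of my own (not from
any of the postings above).  In sum(), the index i is referenced n
times per call but saved and restored at most once, so the register
pays for itself; if the body were instead a single add bracketed by a
call to another routine, the save/restore would dominate and the
register declaration would be a net loss.

/* hypothetical example: "register" wins because i is referenced far
 * more often than calls occur */
long sum(v, n)
long *v;
int n;
{
	register int i;		/* touched n times per call */
	long total = 0;

	for (i = 0; i < n; i++)
		total += v[i];
	return total;
}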

						George J Wu

jgp@moscom.UUCP (01/29/87)

In article <1881@homxc.UUCP> dwc@homxc.UUCP writes:
>In some other article someone else writes:
>>If the routine is called many times the cost of saving and
>>restoring registers during the call and return can offset any
>>speed savings thru register usage.
>am i missing something or does the frequency of calls not matter?
I agree that the frequency shouldn't matter.  The real question is
whether any compiler is smart enough to ignore a register declaration
when the resulting code would run slower (e.g. if a parameter is only
accessed once, outside of a loop, the compiler would probably be better
off ignoring the register declaration).
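
For instance (a made-up case, not from any real compiler):

/* honoring "register" here is a loss: p is read exactly once, outside
 * any loop, so moving it into a register (and saving/restoring that
 * register) costs more than the single memory reference it avoids */
int peek(p)
register int *p;
{
	return *p;
}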

>do compilers take the time to keep track of which registers have
>been used and only save the 'dirty' ones or do most call and
>return mechanisms save the entire register set on the stack?
It depends on the architecture and the compiler, the three easy ways
to do it are:
	a) save all registers
	b) have the caller save only the registers it is using
	c) have the callee save only the registers it will use
pdp-11's use "a" since they only have 3 register variables anyway.  Most
68k compilers use "c" since you get about 12 register variables.  I don't
know of anyone who uses "b" but it should be about as efficient as "c".

The method used has a large effect on whether setjmp/longjmp can put the
correct values back into register variables (SYSVID says they may be
unpredictable :-(.
-- 
Jim Prescott	rochester!moscom!jgp

guy@gorodish.UUCP (01/29/87)

> (SYSVID says they may be unpredictable :-(.

Yup, and you'll just have to live with it.  Do you really want to pay
the penalty of extra glop in every call on a 68K just to make a very
rarely-used feature work?

Furthermore, note that there is no guarantee that something *won't*
be placed into a register if it can be; optimizing compilers are
perfectly free to stick variables into registers if they can.  That's
why ANSI C says that the values of *all* automatic variables that are
modified between the "setjmp" and the "longjmp", except those declared
"volatile", are unpredictable when the "longjmp" is done.
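
A sketch of the rule (my example; "safe" and "risky" are made-up
names, and this assumes an ANSI-conforming implementation):

#include <stdio.h>
#include <setjmp.h>

jmp_buf env;

main()
{
	volatile int safe = 1;	/* must survive the longjmp intact */
	int risky = 1;		/* modified after setjmp: unpredictable */

	if (setjmp(env)) {
		/* safe is certainly 2; risky may print 1 or 2 */
		printf("safe==%d, risky==%d\n", safe, risky);
		exit(0);
	}
	safe = 2;
	risky = 2;
	longjmp(env, 1);
}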

firth@sei.cmu.edu.UUCP (01/30/87)

In article <898@moscom.UUCP> jgp@moscom.UUCP (Jim Prescott) writes:
>It depends on the architecture and the compiler, the three easy ways
>to do it are:
>	a) save all registers
>	b) have the caller save only the registers it is using
>	c) have the callee save only the registers it will use
>pdp-11's use "a" since they only have 3 register variables anyway.  Most
>68k compilers use "c" since you get about 12 register variables.  I don't
>know of anyone who uses "b" but it should be about as efficient as "c".
>
>The method used has a large effect on whether setjmp/longjmp can put the
>correct values back into register variables (SYSVID says they may be
>unpredictable :-(.

The code generators I wrote for the PDP-11 and VAX-11 use method (b).  The
main reason for this was precisely the longjump problem: if local frames
store non-local state, then that state can be restored only by the very
slow process of unwinding the stack.  If on the other hand each frame
keeps its own state, a longjump is just (reset stack pointer; jump), or
at worst, if you keep a closed stack, (reset frame pointer; jump;
destination resets stack front pointer).  The alternative of having the
longjump NOT restore state that happened to be in registers was not
permitted by the languages in question.

Well, I benchmarked this technique against the alternative of having the
callee save, and it came out better on both machines.  Surprising for the
Vax, since that has hardware to save and restore registers, but in fact
the special instructions are only marginally faster than doing it by hand.
(Of course, one does not use CALLS under any circumstances; using JSB,
 controlling the local-variable stack yourself, and passing parameters in
 registers buys you a factor of three in the procedure call overhead.)

The main reasons for the difference are interesting:

(a) fewer registers are involved.  This is because the callee must save
    every register it uses ANYWHERE in its body, whereas the caller need
    save only registers CURRENTLY LIVE.

(b) fewer memory accesses.  Callee must save and restore always; caller can
    restore the register from a declared variable some (~1/3) of the time,
    and so need not save it.
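
Written out at the source level, (b) amounts to the following (my
illustration, not the actual generated code; g and x are made-up
names):

extern int g();
int x;

int f()
{
	register int r;

	r = x;		/* r is a clean copy of the variable x */
	g();		/* caller-save can skip saving r here... */
	r = x;		/* ...and simply reload it from x afterward */
	return r + 1;
}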

For other methodological reasons too boring to post here, I am a firm
believer in the "caller save always; callee save never" technique.

Robert Firth

franka@mntgfx.UUCP (01/30/87)

In article <898@moscom.UUCP> jgp@moscom.UUCP (Jim Prescott) writes:
>In article <1881@homxc.UUCP> dwc@homxc.UUCP writes:
>>In some other article someone else writes:
>>do compilers take the time to keep track of which registers have
>>been used and only save the 'dirty' ones or do most call and
>>return mechanisms save the entire register set on the stack?
>It depends on the architecture and the compiler, the three easy ways
>to do it are:
>	a) save all registers
>	b) have the caller save only the registers it is using
>	c) have the callee save only the registers it will use
>pdp-11's use "a" since they only have 3 register variables anyway.  Most
>68k compilers use "c" since you get about 12 register variables.  I don't
>know of anyone who uses "b" but it should be about as efficient as "c".
>

Actually, b can be much less efficient than c, because under c the
callee can save registers conditionally, at the point where it knows
they will really be used, while under b the caller must save its live
registers at every call, whether or not this particular call will
touch them.  E.g., consider the routine

int foo(a)
struct foo_structure	*a;
{
	if (a) {
		/* long hairy computation which uses
		 * several register variables
		 */
	}
}
Using scheme b, all registers which MAY be used in foo MUST be saved,
while scheme c allows you to save registers only if a is non-null.
Guy Steele goes into this very well in an MIT Tech Report called
"Debunking the Expensive Procedure Call Myth" (Maybe.  This may be
the subtitle after a dry formal title.).  It's written in the context
of LISP compilers, but talks about compiling for register machines.
It also talks about tail recursion (something even C compiler writers
could probably learn from).  All in all, a good paper.
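
As a quick illustration of the tail recursion point (my example, not
one from the paper): the recursive call below is the last thing gcd()
does, so a compiler may turn it into a plain jump that reuses the
current frame, with no call/return overhead at all.

int gcd(a, b)
int a, b;
{
	if (b == 0)
		return a;
	return gcd(b, a % b);	/* tail call: compilable as a jump */
}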

>The method used has a large effect on whether setjmp/longjmp can put the
>correct values back into register variables (SYSVID says they may be
>unpredictable :-(.

Sad, but true.  The non-orthogonality of "modern" machine design again
pokes through to bite the programmer.  So it goes.

Frank Adrian
Mentor Graphics, Inc.

gwu@vlsi.cs.cmu.edu.UUCP (01/31/87)

> I agree that the frequency shouldn't matter.  The real question would
> be if any compiler is smart enough to ignore a register declaration if
> the resulting code would run slower (eg. if a parameter is only accessed
> once, outside of a loop, the compiler would probably be better off to
> ignore a register declaration).

     Such an intelligent compiler would need to know the number of times that
a given subroutine is called. The number of times that a routine will be
called, if at all, is often determinable at run-time only, not compile-time. 

> 	a) save all registers
> 	b) have the caller save only the registers it is using
> 	c) have the callee save only the registers it will use
> pdp-11's use "a" since they only have 3 register variables anyway.  Most
> 68k compilers use "c" since you get about 12 register variables.  I don't
> know of anyone who uses "b" but it should be about as efficient as "c".

     In the case of the callee saving the registers it is planning to use,
the compiler won't know whether or not called subroutines use any registers,
and therefore will be unable to figure out whether it is more efficient to
use register variables in the main routine. Of course, there is always the
possibility of combining the compiler and linker. . . .

						George

sundar@cwruecmp.UUCP (02/01/87)

>> do compilers take the time to keep track of which registers have
>> been used and only save the 'dirty' ones or do most call and
>> return mechanisms save the entire register set on the stack?
>
>  It depends on the architecture and the compiler, the three easy ways
>  to do it are:
>  a) save all registers
>  b) have the caller save only the registers it is using
>  c) have the callee save only the registers it will use
>  pdp-11's use "a" since they only have 3 register variables anyway.  Most
>  68k compilers use "c" since you get about 12 register variables.  I don't
>  know of anyone who uses "b" but it should be about as efficient as "c".

    I am just curious.  Isn't it a security hole to use the method c)
    above?  If the caller is a system routine and the callee is my program
    and I am supposed to save and restore registers that I intend to use,
    I can have some fun by not saving the registers in the first place and
    in addition destroying them. 

sri

guy@gorodish.UUCP (02/01/87)

>    I am just curious.  Isn't it a security hole to use the method c)
>    above?  If the caller is a system routine and the callee is my program
>    and I am supposed to save and restore registers that I intend to use,
>    I can have some fun by not saving the registers in the first place and
>    in addition destroying them. 

In general, calls that leave one protection domain and enter another
domain that lacks certain privileges that the original domain had are
insecure unless some care is taken.  The callee can have some fun
simply by returning to a location other than the one it was supposed
to return to.  You have to prevent this from happening.  If the
caller passes some datum to the callee by reference, the callee must
not be able to modify anything other than that datum.

Cross-domain calls of this sort would have to be treated differently,
whether by the instruction-set processor, the compiler, the run-time
support code, or some combination thereof - unless the overhead of
treating such calls differently is minimal, and the system can afford
to make all calls "secure" against the callee doing something
unpleasant to protected objects in the caller's domain or to the
caller itself.

mouse@mcgill-vision.UUCP (02/12/87)

In article <476@mntgfx.MENTOR.COM>, franka@mntgfx.MENTOR.COM (Frank A. Adrian) writes:
> In article <898@moscom.UUCP> jgp@moscom.UUCP (Jim Prescott) writes:
>>	a) save all registers
>>	b) have the caller save only the registers it is using
>>	c) have the callee save only the registers it will use
> Actually b can be much less efficient than c, because [...]

> int foo(a)
> struct foo_structure	*a;
> [example]

> Using scheme b, all registers which MAY be used in foo MUST be saved,
> while scheme c allows you to save registers only if a is non-null.

>> The method used has a large effect on whether setjmp/longjmp can put
>> the correct values back into register variables (SYSVID says they
>> may be unpredictable :-(.
> Sad, but true.

WHAT?!?!  I can't see any reason that longjmp can't get the registers
back - if it can't restore them easily on that architecture then just
have setjmp save them in the jmp_buf!  What's the problem?  This will
make longjmp basically useless, especially in the face of compilers
that optimize non-"register" variables into registers!

Note that the possibility of method (c) means that it won't even work
to avoid register variables in the function calling setjmp: that helps
only if there are no live registers at the time of the setjmp, and with
method (c) that means no registers in use anywhere, because functions
might not have saved them - they'd still be live at the setjmp()!

					der Mouse

USA: {ihnp4,decvax,akgua,utzoo,etc}!utcsri!mcgill-vision!mouse
     think!mosart!mcgill-vision!mouse
Europe: mcvax!decvax!utcsri!mcgill-vision!mouse
ARPAnet: think!mosart!mcgill-vision!mouse@harvard.harvard.edu

guy@gorodish.UUCP (02/14/87)

>WHAT?!?!  I can't see any reason that longjmp can't get the registers
>back - if it can't restore them easily on that architecture then just
>have setjmp save them in the jmp_buf!  What's the problem?

The problem, to put it simply, is that if "longjmp" can't get the
values of the register variables *at the time the longjmp was
called* getting them from values saved by "setjmp" would restore the
values of the register variables *at the time the setjmp was called*,
which is NOT what the SVID (or various UNIX manuals) claim that
"longjmp" does.  On some architectures it is painful to try to get
the values as of the time "longjmp" was called; you can't just walk
the stack to find the values saved when the routine that called
"setjmp" last made a procedure call.

howard@cpocd2.UUCP (02/16/87)

In article <651@mcgill-vision.UUCP> mouse@mcgill-vision.UUCP (der Mouse) writes:
>In article <476@mntgfx.MENTOR.COM>, franka@mntgfx.MENTOR.COM (Frank A. Adrian) writes:
>> In article <898@moscom.UUCP> jgp@moscom.UUCP (Jim Prescott) writes:
>>>	a) save all registers
>>>	b) have the caller save only the registers it is using
>>>	c) have the callee save only the registers it will use

All this discussion, and no one's suggested using hardware to avoid saving
any registers at all (most of the time)!  This has been done in a few
machines; I'm most familiar with the RISC-I since I helped design the chip.
Isn't this a computer ARCHITECTURE group?  Did you know that HALF of the
memory traffic on a VAX-11/780 is data being saved/restored?
-- 

	Howard A. Landman
	...!intelca!mipos3!cpocd2!howard

crowl@rochester.UUCP (02/17/87)

In article <425@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes:
>Isn't this a computer ARCHITECTURE group?  Did you know that HALF of the
>memory traffic on a VAX-11/780 is data being saved/restored?

Wait a second!  ALL the memory traffic on ANY machine is data that is being
saved and restored.  That is the purpose of memory.  (Picky, picky :-)
-- 
  Lawrence Crowl		716-275-5766	University of Rochester
			crowl@rochester.arpa	Computer Science Department
 ...!{allegra,decvax,seismo}!rochester!crowl	Rochester, New York,  14627

greg@utcsri.UUCP (Gregory Smith) (02/18/87)

In article <651@mcgill-vision.UUCP> mouse@mcgill-vision.UUCP writes:

>>> The method used has a large effect on whether setjmp/longjmp can put
>>> the correct values back into register variables (SYSVID says they
>>> may be unpredictable :-(.
>> Sad, but true.
>
>WHAT?!?!  I can't see any reason that longjmp can't get the registers
>back - if it can't restore them easily on that architecture then just
>have setjmp save them in the jmp_buf!  What's the problem?  This will
>make longjmp basically useless, especially in the face of compilers
>that optimize non-"register" variables into registers!

#include <stdio.h>
#include <setjmp.h>

jmp_buf env;

main(){
	register int i;
	int j;
	i = j = 1;
	if(setjmp(env)){
		printf( "i==%d, j==%d\n", i, j );
		exit(0);
	}
	i = j = 2;
	func();
}

func(){
	longjmp( env, 1 );
}

The above program should ideally report that i=2. Certainly it will say
that j is 2. If i is saved in the jmp_buf, and restored by longjmp,
then i will be reported as 1.
-- 
----------------------------------------------------------------------
Greg Smith     University of Toronto      UUCP: ..utzoo!utcsri!greg
Have vAX, will hack...

howard@cpocd2.UUCP (02/18/87)

In article <24929@rochester.ARPA> crowl@rochester.UUCP (Lawrence Crowl) writes:
>In article <425@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes:
>>Isn't this a computer ARCHITECTURE group?  Did you know that HALF of the
>>memory traffic on a VAX-11/780 is data being saved/restored?
>
>Wait a second!  ALL the memory traffic on ANY machine is data that is being
>saved and restored.  That is the purpose of memory.  (Picky, picky :-)

Not really.  Save and restore refer specifically to memory traffic generated
by procedure calls and returns (respectively) in order to prevent one
routine (the callee) from trashing another routine's (the caller's) variables
that are in registers.  Many memory references do not fall into this category.
Normal loads and stores are one example, but also consider instruction fetches.
Are you seriously claiming that an instruction fetch is "data being saved [or]
restored"?

Procedure calls are grotesquely expensive on the most common machines.  A good
rule of thumb for a VAX is that a procedure call takes as long as executing
about 100 computational instructions.  This is why so much of the stdio package
is macros instead of procedures.  There are several ways to try to avoid this
overhead.  One is hardware support for call/return, as in some RISC machines.
Another way is to eliminate procedures from your language; this is what Forth
did with its "threaded code", and is the main reason Forth is so fast.  On a
machine with multiple register windows the speed advantage of Forth would
evaporate; of course you could also implement a hardware stack, which might
increase its advantage!
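
To see where the speed comes from, here is a toy token-threaded
interpreter (my own sketch, not how any real Forth is coded): a
compiled "word" is just an array of opcodes, and executing one costs
a switch dispatch - no procedure call, no register save/restore, no
stack frame.

#include <stdio.h>

#define LIT	0
#define ADD	1
#define PRINT	2
#define HALT	3

int stack[32], sp;

run(code)
int *code;
{
	register int *ip = code;	/* threaded "instruction pointer" */

	for (;;) {
		switch (*ip++) {
		case LIT:	stack[sp++] = *ip++;		break;
		case ADD:	sp--; stack[sp-1] += stack[sp];	break;
		case PRINT:	printf("%d\n", stack[--sp]);	break;
		case HALT:	return;
		}
	}
}

main()
{
	static int prog[] = { LIT, 2, LIT, 3, ADD, PRINT, HALT };

	run(prog);	/* prints 5 */
	exit(0);
}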
-- 

	Howard A. Landman
	...!intelca!mipos3!cpocd2!howard

montnaro@sprite.UUCP (02/19/87)

In article <24929@rochester.ARPA> crowl@rochester.UUCP (Lawrence Crowl) writes:
>In article <425@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes:
>>Isn't this a computer ARCHITECTURE group?  Did you know that HALF of the
>>memory traffic on a VAX-11/780 is data being saved/restored?
>
>Wait a second!  ALL the memory traffic on ANY machine is data that is being
>saved and restored.  That is the purpose of memory.  (Picky, picky :-)
'Taint necessarily so. Doesn't the VAX do memory<->memory transfers as well
as register<->register and register<->memory?

Skip Montanaro

ARPA: montanaro%desdemona.tcpip@ge-crd.arpa
UUCP: seismo!rochester!steinmetz!desdemona!montanaro
GE DECnet: csbvax::mrgate!montanaro@desdemona@smtp@tcpgateway

ad@uwmacc.UUCP (02/20/87)

In article <430@cpocd2.UUCP>, howard@cpocd2.UUCP (Howard A. Landman) writes:
>Procedure calls are grotesquely expensive on the most common machines.  A good
>rule of thumb for a VAX is that a procedure call takes as long as executing
>about 100 computational instructions.  This is why so much of the stdio package
>is macros instead of procedures.  There are several ways to try to avoid this
>overhead.  One is hardware support for call/return, as in some RISC machines.
>Another way is to eliminate procedures from your language; this is what Forth
>did with its "threaded code", and is the main reason Forth is so fast.  On a
>machine with multiple register windows the speed advantage of Forth would
>evaporate; of course you could also implement a hardware stack, which might
>increase its advantage!

Please explain what you mean by multiple register windows and how they
impact Forth's speed.  Thanks.

Alan Dickman  608-262-9421
Arpa: ad@unix.macc.wisc.edu
Bitnet: ad@wisvmacc 
UUCP: ...{allegra,ihnp4,seismo}!uwvax!uwmacc!ad
1210 West Dayton Street, Madison, WI 53706 

michael@crlt.UUCP (02/20/87)

[Yum yum!]

In article <898@moscom.UUCP>, jgp@moscom.UUCP (Jim Prescott) writes:
> 
> >do compilers take the time to keep track of which registers have
> >been used and only save the 'dirty' ones or do most call and
> >return mechanisms save the entire register set on the stack?
>
> It depends on the architecture and the compiler, the three easy ways
> to do it are:
> 	a) save all registers
> 	b) have the caller save only the registers it is using
> 	c) have the callee save only the registers it will use
> pdp-11's use "a" since they only have 3 register variables anyway.  Most
> 68k compilers use "c" since you get about 12 register variables.  I don't
> know of anyone who uses "b" but it should be about as efficient as "c".

In code generated by a brute-force compiler, "b" wastes a lot of CPU,
by saving registers that will never be trashed.  Suppose you're using,
say, five of them, and make calls to subroutines that will only use one.
Method "c" does one-fifth as much register save/restoration as method "b".

The only place where method "c" saves an unused variable that method "b"
doesn't is near the top level of your program, where a given register might
not yet contain anything that must be preserved (unless the author of the
O.S. was a total idiot).  Here method "c" will waste time saving and
restoring junk.  But the top level code would normally be executed much
less than the lower level stuff.

On the other hand, method "b" could gain a global advantage when subroutines
with few register variables make many calls to subroutines with many, which
don't in turn make calls to another generation of children (or at least not
while their registers need preservation).  In cases like this, method "b"
has saved the register once, while "c" would save it many times.

Method "b" has other advantages as well:

 - It makes the caller, not the callee, responsible for the integrity of
   its own environment.  Thus, if a hand-coded routine makes an error in
   register preservation, it will break itself (and finding the error will
   be easy), not previously-debugged portions of the calling routine.

 - It offers the opportunity to save CPU by omitting the storage and
   restoration of register variables that no longer contain anything
   of value (and a smart enough compiler might be able to determine this).

In article <540@sei.cmu.edu.UUCP> firth@sei.cmu.edu.UUCP (Robert Firth) writes:
>In article <898@moscom.UUCP> jgp@moscom.UUCP (Jim Prescott) writes:
>>
>>The method used has a large effect on whether setjmp/longjmp can put the
>>correct values back into register variables (SYSVID says they may be
>>unpredictable :-(.
>
>The code generators I wrote for the PDP-11 and VAX-11 use method (b).  The
>main reason for this was precisely the longjump problem: if local frames
>store non-local state, then that state can be restored only by the very
>slow process of unwinding the stack. []

I seem to be missing something here.  Why can't setjmp save the entire
register set, plus as much of the stack as would be pushed by the
caller as it calls, in "env"?  All "longjmp" needs to do is restore
the state of the caller as of the "setjmp" call, and it is allowed,
and even encouraged, to know what kind of code its particular compiler
generates.  If the caller isn't doing such things as varying its own
stack frame, only saving >its< caller's registers when they might be
used locally (rather than saving them once on entry and restoring them
on exit), and shuttling register variables off to holding areas at
unknowable locations in its frame, this will be sufficient.  (And if
the compiler is smart enough to do this sort of thing, it can also be
smart enough to recognize that "setjmp" is being called, and provide
as many hooks as necessary.)

I would think that "b" could cause more problems for setjmp/longjmp
than "c", since setjmp would have to find and save a variable number
of stored registers, and longjmp restore them, without damaging other
things in the caller's frame.  What have I overlooked?

>Well, I benchmarked this technique [b] against the alternative of having the
>callee save, and it came out better on both machines.  []  The main reasons
>for the difference are interesting:
>
>(a) fewer registers are involved.  This is because the callee must save
>    every register it uses ANYWHERE in its body, whereas the caller need
>    save only registers CURRENTLY LIVE.
>
>(b) fewer memory accesses.  Callee must save and restore always; caller can
>    restore the register from a declared variable some (~1/3) of the time,
>    and so need not save it.

I find this benchmark interesting.  Does (b) actually make up for routines
that use fewer register variables?  Or does that saving get canceled by
the reduced number of saves when the callee has more?

===========================================================================
  "I've got code in my node."	| UUCP:  ...!ihnp4!itivax!node!michael
				| AUDIO: (313) 973-8787
	Michael McClary		| SNAIL: 2091 Chalmers, Ann Arbor MI 48104
---------------------------------------------------------------------------
Above opinions are the official position of McClary Associates.  Customers
may have opinions of their own, which are given all the attention paid for.
===========================================================================

faustus@ucbcad.UUCP (02/22/87)

In article <4160@utcsri.UUCP>, greg@utcsri.UUCP (Gregory Smith) writes:
> jmp_buf env;
> 
> main(){
> 	register int i;
> 	int j;
> 	i = j = 1;
> 	if(setjmp(env)){
> 		printf( "i==%d, j==%d\n", i, j );
> 		exit(0);
> 	}
> 	i = j = 2;
> 	func();
> }
> 
> func(){
> 	longjmp( env, 1 );
> }
> 
> The above program should ideally report that i=2. Certainly it will say
> that j is 2. If i is saved in the jmp_buf, and restored by longjmp,
> then i will be reported as 1.

Put a "register int k = 3" in func() -- the program had better not say that
i is 3.  It shouldn't be hard for setjmp() to restore the registers from 
the stack frame, as opposed to from the jmp_buf.

	Wayne

howard@cpocd2.UUCP (02/24/87)

In article <1093@uwmacc.UUCP> ad@uwmacc.UUCP (Alan Lloyd Dickman) writes:
>In article <430@cpocd2.UUCP>, howard@cpocd2.UUCP (Howard A. Landman) writes:
>>Procedure calls are grotesquely expensive on the most common machines.
>>There are several ways to try to avoid this
>>overhead.  One is hardware support for call/return, as in some RISC machines.
>>Another way is to eliminate procedures from your language; this is what Forth
>>did with its "threaded code", and is the main reason Forth is so fast.  On a
>>machine with multiple register windows the speed advantage of Forth would
>>evaporate; of course you could also implement a hardware stack, which might
>>increase its advantage!
>
>Please explain what you mean by multiple register windows and how they
>impact Forth's speed.  Thanks.

Some RISC machines have multiple sets of registers, organized as overlapping
windows.  This approach to speeding up calls/returns was suggested to the
RISC I team at Berkeley by Forest Baskett; simulations convinced us it would
work for typical UNIX code, so we implemented it.  It worked.  The average
time for a procedure call was only a few instruction cycles (occasionally, you
overflow/underflow the set of windows you have and must actually save/restore,
but not every call & return).  Also, since the windows overlap (some registers
appear in two adjacent windows), we could pass parameters and return results
in registers with essentially no overhead.

The reason it's fast isn't hard to see.  If you can get a clean set of
registers by just incrementing a window-number register, then you don't
need to save all your old registers away in main memory.  If that data
can be reactivated by just decrementing a window-number register, then
you don't need to restore it.  This saves all the overhead of save/restore,
as long as everything fits into your registers.  The costs include chip
area and slowing down of the register access time as more registers are
added.
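
In C terms, the effect is something like this toy model (the sizes
are invented for the example, not the actual RISC I parameters):

#define NWIN	8	/* windows in the register file */
#define LOCALS	8	/* private registers per window */
#define OVER	4	/* registers shared with the adjacent window */

static int file[NWIN * (LOCALS + OVER) + OVER];
static int cwp;		/* current window pointer */

/* logical register n of the current window; because adjacent windows
 * overlap by OVER registers, the caller's last OVER registers are the
 * same physical cells as the callee's first OVER - parameter passing
 * is free */
int *reg(n)
int n;
{
	return &file[cwp * (LOCALS + OVER) + n];
}

call()	{ cwp++;  /* on overflow, real hardware spills one window */ }
ret()	{ cwp--;  /* on underflow, it refills one from memory */ }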

SO:	1) Procedure calls/returns are expensive (slow) on most machines.
	2) Forth doesn't do procedure calls/returns, so on those machines
	   it has a speed advantage (compared to C, Pascal, Fortran, Lisp, ...).
	3) On a machine where calls/returns are fast, the advantage disappears.
	   Avoiding something which is inexpensive buys you nothing.
	4) I believe that most of Forth's memory traffic is due to stack
	   manipulations.  If this is true, hardware stack support could
	   speed Forth up quite a bit; but it would have to be better than
	   what Burroughs did in their early "stack machines" where only
	   the top two elements were in registers.

REFERENCES:

Fitzpatrick et al, "A RISCy Approach to VLSI", VLSI Design, 4th quarter 1981
				- OR -
Fitzpatrick et al, "VLSI Implementations of a Reduced Instruction Set Computer",
CMU Conference on VLSI Systems and Computations, 1981
	These are essentially the same article; the VLSI Design version has
	nicer graphics.  They both describe the RISC I.

Tamir & Sequin, "Strategies for Managing the Register File in RISC",
IEEE Trans. Comput., v.C-32, November 1983

Halbert & Kessler, "Windows of Overlapping Registers", CS292R Final Report,
U.C. Berkeley, June 1980
	This will be hard to get, but it's the most detailed reference.

Sherburne et al, "A 32-Bit NMOS Microprocessor with a Large Register File",
IEEE JSSC, v.SC-19 #5, October 1984
	Describes the RISC II.

Sherburne, "Processor Design Tradeoffs in VLSI", PhD dissertation, Comp.
Sci. Div., U.C. Berkeley, April 1984
	Describes the RISC II.
-- 

	Howard A. Landman
	...!intelca!mipos3!cpocd2!howard

meissner@dg_rtp.UUCP (02/25/87)

In article <651@mcgill-vision.UUCP> mouse@mcgill-vision.UUCP (der Mouse) writes:
> 	/* stuff about when to save registers deleted */
> WHAT?!?!  I can't see any reason that longjmp can't get the registers
> back - if it can't restore them easily on that architecture then just
> have setjmp save them in the jmp_buf!  What's the problem?  This will
> make longjmp basically useless, especially in the face of compilers
> that optimize non-"register" variables into registers!

Because of this, ANSI says that even autos can be trashed by longjmp.
That's the biz.... (there was also a contingent that wanted to remove
longjmp/setjmp completely because they are such a wart, but that is
another story....)
-- 
	Michael Meissner, Data General	Uucp: ...mcnc!rti-sel!dg_rtp!meissner

It is 11pm, do you know what your sendmail and uucico are doing?

chuck@amdahl.UUCP (02/25/87)

In article <443@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes:
>Some RISC machines have multiple sets of registers, organized as overlapping
>windows.  This approach to speeding up calls/returns was suggested to the
>RISC I team at Berkeley by Forest Baskett; simulations convinced us it would
>work for typical UNIX code, so we implemented it.  It worked.  
>	Howard A. Landman

As I understand it, each register window is a fixed size, and each window
is divided into three fixed size subwindows.  (I don't remember what
these three subwindows are meant to be used for.)  One particular aspect
of the implementation has been bothering me; namely, the amount of
overlap between two windows is a constant size.  This has two consequences.

First, subroutines which don't use all of their window waste register storage.
Typical programs that I write tend to contain short routines that use
few variables and need few registers.  Thus, my programs can't get the
full benefits of the register windows; many of the registers remain 
unused.

Second, subroutines that want to pass more parameters than will fit
in the overlap, but fewer than would fit in a complete window, are also
penalized.  Not all the parameters can be passed in registers, and some
must be placed in memory.

So I'm wondering why the implementations of the RISC I and II don't
allow the programmer to specify the number of registers that should
be overlapped on the procedure call.  Obviously, this would make
subroutine calls more expensive and difficult to implement.  The
overlap size would have to be fetched from memory and added to the
window pointer.  Checking to see if the register stack had overflowed
might be more expensive...  However, these seem like trivial problems
when compared to the benefit of effectively having a larger
register stack.

Are there reasons I am missing that make fixed sized windows extremely
advantageous?

thanks, Chuck

bcase@amdcad.UUCP (02/25/87)

In article <5746@amdahl.UUCP> chuck@amdahl.UUCP (Charles Simmons) writes:
>In article <443@cpocd2.UUCP> howard@cpocd2.UUCP (Howard A. Landman) writes:
>>Some RISC machines have multiple sets of registers, organized as overlapping
>>windows.  This approach to speeding up calls/returns was suggested to the
>>RISC I team at Berkeley by Forest Baskett; simulations convinced us it would
>>work for typical UNIX code, so we implemented it.  It worked.  
>>	Howard A. Landman
>
>As I understand it, each register window is a fixed size, and each window
>is divided into three fixed size subwindows.  (I don't remember what
>these three subwindows are meant to be used for.)  One particular aspect
>of the implementation has been bothering me; namely, the amount of
>overlap between two windows is a constant size.  This has two consequences.
> ...
>So I'm wondering why the implementation of the RISC I and II don't
>allow the programmer to specify the number of registers that should
>be overlapped on the procedure call.  Obviously, this would make
>subroutine calls more expensive and difficult to implement.  The
>overlap size would have to be fetched from memory and added to the
>window pointer.  Checking to see if the register stack had overflowed
>might be more expensive...  However, these seem like trivial problems
>when compared to the benefit of effectively having a larger
>register stack.

BTW, the Pyramid machine also has fixed-size windows (16 in, 16 local,
and 16 out, plus 16 globals).

Well, the main reason that arbitrary window size was not implemented in the
RISC I & II is that it requires an adder in the register address decode
path.  Using the technology available to them, this would have slowed the
machine and used more area.  Using the Berkeley scheme, the "add" function
is integrated right into the register file address decoders.

Arbitrary window size is not that difficult to implement, assuming that
you can pay the cost of the adder.  The window size does not need to be
separately fetched from memory:  it can be an immediate field in an
instruction.  In the implementation that I am thinking of, based on my
MS thesis work, the window is allocated at procedure entry time and 
deallocated at procedure exit time, just as in the RISC I & II.  Thus,
the window which is allocated must be made large enough to contain the
outgoing parameters for any subsequent call.  Also, the bounds checking
operation to detect overflow and underflow need only be done once at
call time and once at exit time.  See Ditzel's paper on the C Machine
Stack Cache for a more general implementation (but expensive!!).

Yes, arbitrary window size does make calls and returns more expensive, but
not by much (under the assumptions I have made).  The benefits of better
register utilization and accommodation of any argument list outweigh the
small cost.  The call and return are still an order of magnitude faster
(in terms of cycles) than the CALLS-style mechanism exemplified by the VAX.

Again, a very interesting alternative to a stack cache is register allocation
at link time.  Wall from DEC Western Research has done/is doing the first
work I have seen in this area.

    bcase
----------
Keep watching this space just a little longer.

klein@gravity.UUCP (02/25/87)

In article <5746@amdahl.UUCP>, chuck@amdahl.UUCP (Charles Simmons) writes:
> Are there reasons I am missing that make fixed sized windows extremely
> advantageous?
> 
> thanks, Chuck

Complexity.
--
	Mike Klein		klein@sun.{arpa,com}
	Sun Microsystems, Inc.	{ucbvax,hplabs,ihnp4,seismo}!sun!klein
	Mountain View, CA

aglew@ccvaxa.UUCP (02/26/87)

>So I'm wondering why the implementation of the RISC I and II don't
>allow the programmer to specify the number of registers that should
>be overlapped on the procedure call.  Obviously, this would make
>subroutine calls more expensive and difficult to implement.  The
>overlap size would have to be fetched from memory and added to the
>window pointer.  Checking to see if the register stack had overflowed
>might be more expensive...  However, these seem like trivial problems
>when compared to the benefit of effectively having a larger
>register stack.

	You've just gone and made procedure calls more expensive.
Since register stack overflows rarely occur(*), and the fixed size
window handles the 90% case, the numbers don't necessarily work out.


>Are there reasons I am missing that make fixed sized windows extremely
>advantageous?

	Register windows require that a "logical register number"
be combined with a "register window pointer" to produce the actual
"physical register number" in the register file.
Usually, this is a simple addition. 
	But then an extra addition is required on all register accesses.
This adds a delay equal to that of the fastest carry chain you can provide
to the critical path. It may not be possible to absorb this in a 
pipeline stage.
	Fixed size windows can mean that you don't have to add all the
bits together. Fewer bits => shorter carries => faster register cycle.
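
	For instance (an illustration of the arithmetic only, not of
any particular machine): with fixed power-of-two windows and no
overlap, the "addition" degenerates into bit concatenation, which is
just wiring:

/* physical register number from window pointer and logical number */
#define PHYS_FIXED(cwp, n)	(((cwp) << 4) | ((n) & 017))	/* no carry chain */
#define PHYS_VARIABLE(base, n)	((base) + (n))			/* a real adder */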

	Note that this was apparently NOT a concern for RISC.

---
(*) Register stack overflows rarely occur: for a fixed size register
	stack, this is true only because routines typically have a
	small dynamic range in procedure calling depth.

	The pertinent studies were made on UNIX. I'm not so sure that
	UNIX is a good example for future systems, since UNIX contains
	a lot of low level code without benefit of abstraction.
	Higher level languages, or systems that encourage the use
	of procedure calls, may well have a greater dynamic range.


Andy "Krazy" Glew. Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801    ARPAnet: aglew@gswd-vms.arpa