[comp.lang.c] function calls

henry@utzoo.uucp (Henry Spencer) (03/15/90)

In article <2000@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>> I was under the impression that C's function calls are, relative to other
>> high level languages, very inexpensive...
>
>Simple, yes.  Inexpensive, no.  I can conceive of a way in which hardware
>can be constructed which would make subroutine calls inexpensive, but I
>know of no machine which has it.  

Many people can conceive of such ways, and a good many of them have been
implemented, some quite successfully.  On the Mips machines, for example,
a C function call to a "leaf" function (one which calls no others) is
one instruction for the call and one for the return, total elapsed time
a small fraction of a microsecond.  Non-leaf functions have to do a little
bit more work at times.

However, as Steve Johnson said at the winter 1987 Usenix conference, talking
about the early days:  ``Dennis had a lot of credibility and lied very
convincingly... it was years before we found out the truth.''  On the early
Unix machines (pdp11 and VAX), function calls were not particularly cheap
(although it is true that C's calls are cheaper than those of many other
languages).  Machine designers have paid a lot more attention to the matter
in recent years, and the situation has improved greatly.

>A subroutine necessarily must find its arguments (or pointers to them, etc.)
>in standardized places.  It also must put its results in such places, or use
>pointers to know where to put them.  Occasionally one can put the arguments
>in those places or use the results that way, but not always...

Not always, no, but a lot of the time.  Many modern machines are organized
so that most function parameters simply get evaluated into a register
(usually necessary anyway) and left there for the callee to find.  There
is sometimes a small cost of having to save the previous value of that
register, depending on circumstances and architecture.

>...necessarily a context switch to some extent...

It can be a very small extent on a well-designed machine.  There *is* a
fundamental problem of instruction reference locality, which necessarily
slows things down a bit on machines with hardware optimized for sequential
fetches.
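To make this concrete, here is a minimal C sketch (dot2 is a made-up function; the comments describe the MIPS-style behaviour discussed above, and actual code generation varies by compiler and flags):

```c
/* A "leaf" function: it calls no other function.  On a MIPS-style
 * machine the arguments arrive in registers, the work is done in
 * caller-saved temporaries, and the whole call is "jal dot2" on the
 * way in and "jr $ra" on the way out -- no frame, no register saves. */
int dot2(int ax, int ay, int bx, int by)
{
    return ax * bx + ay * by;
}
```

A call such as dot2(1, 2, 3, 4) returns 11; the point is that, under the right convention, nothing beyond the two jump instructions happens at run time.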
-- 
MSDOS, abbrev:  Maybe SomeDay |     Henry Spencer at U of Toronto Zoology
an Operating System.          | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

jlg@lambda.UUCP (Jim Giles) (03/15/90)

From article <1990Mar14.172152.25436@utzoo.uucp>, by henry@utzoo.uucp (Henry Spencer):
> [... efficient procedure call interface ...]
> Many people can conceive of such ways, and a good many of them have been
> implemented, some quite successfully.  On the Mips machines, for example,
> a C function call to a "leaf" function (one which calls no others) is
> one instruction for the call and one for the return, total elapsed time
> a small fraction of a microsecond.  Non-leaf functions have to do a little
> bit more work at times.

Most machines implement call/return with single instructions.  However, this
tends to be the tip of the iceberg for procedure call overhead.  The interface
_MUST_ do the following:

1) Save/restore all register values that are still 'live' values for the
   caller.

2) Test the stack to make sure there is enough room for the callee's local
   variables (and call the memory manager for a new stack frame if there
   wasn't room).

3) Create (on the stack) a traceback entry so the debugger and/or the
   postmortem analyzer can find the current thread through the call tree.

A 'leaf' routine _might_ get away without doing number (3) by setting some
'leaf' flag.  A virtual memory machine might simplify number (2) to just a
test/fatal_error combination since stack overflow might be considered too
rare to justify much effort.  The problem is: _MOST_ of the procedure call
overhead is concentrated in number (1)!  And, this overhead applies to all
procedures - including 'leaf' routines.  Basically, 'leafness' has very
little to do with procedure call overhead.  Because modern machines tend
to have a _large_ number of registers, implementing (1) is usually quite
expensive - and it can't be made cheaper without interprocedural analysis
which in turn can't be done as long as separate compilation is possible.
Oh, well---
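A hedged C sketch of where the overhead in number (1) comes from (the functions are illustrative; exactly which registers spill depends on the compiler and the convention):

```c
int triple(int x) { return 3 * x; }

/* 'a' is live across the call to triple().  If the compiler keeps 'a'
 * in a caller-saved register, it must spill it (to the stack or to a
 * callee-saved register) before the call and recover it afterwards;
 * that traffic is the save/restore overhead number (1) describes. */
int live_across_call(int a)
{
    int t = triple(a);   /* 'a' must survive this call */
    return a + t;
}
```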

J. Giles

chris@mimsy.umd.edu (Chris Torek) (03/15/90)

In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
>Most machines implement call/return with single instructions.  However, this
>tends to be the tip of the iceberg for procedure call overhead.  The interface
>_MUST_ do the following:
>
>1) Save/restore all register values that are still 'live' values for the
>   caller.

Right.

>2) Test the stack to make sure there is enough room for the callee's local
>   variables (and call the memory manager for a new stack frame if there
>   wasn't room).

True, but (as Jim Giles points out) VM machines can cheat.  (He does not
say how to cheat, but it is very easy: do not test.  Simply try to use the
new stack address(es).  If they are invalid, the kernel takes an access
fault and decides whether to grow the stack or to terminate the program.)
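The cheat is visible from plain C: the recursion below never tests for stack room, yet it works, because the kernel grows the stack at the first fault (a sketch; depth_sum is a made-up example, and the depth chosen must stay within the system's stack limit):

```c
/* Each activation pushes a new frame, yet no instruction anywhere
 * tests for stack room.  On a VM system the first store into an
 * unmapped stack page simply faults, and the kernel grows the stack
 * before resuming -- the "don't test" cheat described above. */
long depth_sum(long n)
{
    if (n == 0)
        return 0;
    return n + depth_sum(n - 1);
}
```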

>3) Create (on the stack) a traceback entry so the debugger and/or the
>   postmortem analyzer can find the current thread through the call tree.

There is a way to cheat on this as well; MIPS use it in their compilers.
When a function does not need a frame pointer---most do not---it gets
instead a `virtual frame pointer' which is simply an offset from the
stack pointer.  The debugger reads the symbol table of the object file
to find out which register is the stack pointer, and what the offset from
this register is, so as to find the previous call (and any saved registers,
using a compiler-generated field also found in the object file).

>The problem is: _MOST_ of the procedure call overhead is concentrated in
>number (1)!  And, this overhead applies to all procedures - including
>'leaf' routines.

True; and this is where most of the effort at reducing function call
overhead must be applied.  Fortunately, some smart guys have figured
out some good methods to handle this.  On the MIPS, for instance, the
system designates some registers as `caller saves' and some registers
as `callee saves'.  A leaf procedure heads straight for the `caller
saves' registers; others head for the `callee saves' registers.

In other words, leaf routines can (provided they do not need a huge
number of registers) get away with saving and restoring nothing.  (They
need not even save and restore the stack pointer, if a `callee saves'
register is free: simply do something like

	add	#-40,standard_sp,temp_reg_3 // set temp_reg to sp-40

and then use temp_reg_3 as your `stack pointer'.)

>Basically, 'leafness' has very little to do with procedure call overhead.

As shown above, this is no longer true.  If the leaf uses a `large'
number of registers (more than are available as temporary computation
registers in non-leaf routines), this statement holds; if not, the
fact that the routine is a leaf makes the registers `free'.

(Of course, callers that use lots of registers, and store things in
the temporary registers, must spill those registers on subroutine
calls.  This may be what Jim Giles meant all along.  Perhaps someone
at MIPS can post statistics as to how often this is the case.)

>Because modern machines tend to have a _large_ number of registers,
>implementing (1) is usually quite expensive - and it can't be made
>cheaper without interprocedural analysis which in turn can't be done as
>long as separate compilation is possible.

The only problem with this last statement (`interprocedural analysis
cannot be done due to separate compilation') is that someone already
does it---namely, MIPS do it at the highest compilation level.  Again,
one does it by cheating: `compile' to `Ucode', an intermediate level
code that is neither source nor object.  When `linking' the Ucode,
put everything together, do global flow analysis, optimise, and then
generate machine code.

(Of course, linking starts to get very slow, encouraging people to
run with lower optimisation levels.  This is more or less as it should
be, though: you choose the trade-off of compilation time for run time.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

tim@nucleus.amd.com (Tim Olson) (03/16/90)

In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
| Most machines implement call/return with single instructions.  However, this
| tends to be the tip of the iceberg for procedure call overhead.  The interface
| _MUST_ do the following:
| 
| 1) Save/restore all register values that are still 'live' values for the
|    caller.
| 
| 2) Test the stack to make sure there is enough room for the callee's local
|    variables (and call the memory manager for a new stack frame if there
|    wasn't room).
| 
| 3) Create (on the stack) a traceback entry so the debugger and/or the
|    postmortem analyzer can find the current thread through the call tree.
| 
| The problem is: _MOST_ of the procedure call
| overhead is concentrated in number (1)!  And, this overhead applies to all
| procedures - including 'leaf' routines.  Basically, 'leafness' has very
| little to do with procedure call overhead.

If you partition your registers into a set that is live across a
function call and a set that is killed, then leaf routines can run
entirely out of the latter set, not having to save any registers.  For
example, in the Am29000 calling-convention, a function can allocate up
to 126 local registers which are saved across function calls and
also has use of 24 "temporary" registers which are not.  Most leaf
routines can run entirely out of these 24 temps.  The entire overhead
is simply a call instruction and a return instruction (2 cycles if the
delay slots are filled).

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

henry@utzoo.uucp (Henry Spencer) (03/16/90)

In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
>> ... On the Mips machines, for example,
>> a C function call to a "leaf" function (one which calls no others) is
>> one instruction for the call and one for the return, total elapsed time
>> a small fraction of a microsecond.  Non-leaf functions have to do a little
>> bit more work at times.
>
>Most machines implement call/return with single instructions.  However, this
>tends to be the tip of the iceberg for procedure call overhead.  The interface
>_MUST_ do the following:
>
>1) Save/restore all register values that are still 'live' values for the
>   caller.  ...  The problem is: _MOST_ of the procedure call
>overhead is concentrated in number (1)!  And, this overhead applies to all
>procedures - including 'leaf' routines.  Basically, 'leafness' has very
>little to do with procedure call overhead...

Um, do you want to go back and re-think that statement, Jim?  Leafness has
a lot to do with it, because a leaf procedure doesn't have to keep things
neat in case it calls somebody.  In particular, if you assign some of your
registers to be scratch registers, normally unused at call time, then a
leaf procedure just uses them for all its internal work and has to save
and restore *nothing* except in very unusual cases.  This is exactly what
the Mips compilers do.  When I said that a call to a leaf function on a
Mips machine was one instruction each way, that's *one* instruction,
total, end to end, no further overhead involved.  And those are RISC
instructions, normally one cycle apiece.

The correct statement is that the callee must save/restore all register
values that are still live for the caller *and that the callee is going
to overwrite*.  Careful register-allocation conventions (with an adequate
supply of registers) can minimize that number, usually to zero in the case
of leaf functions.
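A small C sketch of that statement (illustrative functions; the register assignments in the comments assume a MIPS-style caller-saves/callee-saves split):

```c
/* Leaf: everything fits in caller-saved temporaries, so under a
 * MIPS-style convention it overwrites nothing the caller cares about
 * and saves/restores no registers at all. */
int leaf_max(int a, int b) { return a > b ? a : b; }

/* Non-leaf: 'hi' is live across a call, so the compiler parks it in a
 * callee-saved register and must save/restore that one register (plus
 * the return address) -- only what it actually overwrites. */
int clamp_to_max(int a, int b, int x)
{
    int hi = leaf_max(a, b);
    int v  = x > hi ? hi : x;
    return leaf_max(v, 0);
}
```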

Most of the other overheads you mention can likewise be minimized or
eliminated with some thought.  Stack-overflow probing is needed only on
systems without MMUs, or with defective MMUs; even the pdp11, the original
C machine, did without it.  Traceback overhead for debuggers can be
reduced considerably with better cooperation between compiler and debugger,
and a modern compiler at a high level of optimization generally produces
stuff that a debugger can't handle anyway, so it can jettison that entirely.
Interprocedural optimization in the presence of separate compilation is
not merely possible, it's commercially available:  -O3 on a Mips compiler.

Much of the "common wisdom" on the subject of call overhead simply isn't
true if you are willing to work hard at falsifying it.  (Actually, the
same is true of many other performance "constraints"...)
-- 
MSDOS, abbrev:  Maybe SomeDay |     Henry Spencer at U of Toronto Zoology
an Operating System.          | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

preston@titan.rice.edu (Preston Briggs) (03/16/90)

In article <1990Mar15.173408.29622@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:

>>1) Save/restore all register values that are still 'live' values for the
>>   caller.  ...  The problem is: _MOST_ of the procedure call
>>overhead is concentrated in number (1)!  

>The correct statement is that the callee must save/restore all register
>values that are still live for the caller *and that the callee is going
>to overwrite*.  Careful register-allocation conventions (with an adequate
>supply of registers) can minimize that number, usually to zero in the case
>of leaf functions.

And of course there are register windows of various kinds.

They don't solve the register allocation problem, but they
do help reduce call overhead, at some expense to task
switching overhead (but we know which is more common).

On the other hand, Wall argues that link-time register
allocation can surpass register-window performance.
See his paper

	Register Windows vs. Register Allocation
	D. W. Wall
	Proceedings of the SIGPLAN '88 Conference on
		Programming Language Design and Implementation

--
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

jamiller@hpcupt1.HP.COM (Jim Miller) (03/16/90)

>
>Most machines implement call/return with single instructions.  However, this
>tends to be the tip of the iceberg for procedure call overhead.  The interface
>_MUST_ do the following:
>
>1) Save/restore all register values that are still 'live' values for the
>   caller.
...
>procedures - including 'leaf' routines.  Basically, 'leafness' has very
>little to do with procedure call overhead.  Because modern machines tend
>to have a _large_ number of registers, implementing (1) is usually quite
>expensive - and it can't be made cheaper without interprocedural analysis
>which in turn can't be done as long as separate compilation is possible.
>Oh, well---
>
>J. Giles

Not completely true.  The calling convention can designate some registers
"caller save" and others "callee save".  The caller then saves only those
registers that the callee is allowed to clobber without restoring them on
return.  So if the leaf only needs the argument registers and one or two
more, the callee doesn't have to save anything, and a good compiler
doesn't have to save much (if anything) in the caller.  The HP RISC
machine does this, with some success.  Mileage will vary with machine
and application.

    jim miller

jlg@lambda.UUCP (Jim Giles) (03/16/90)

From article <23113@mimsy.umd.edu>, by chris@mimsy.umd.edu (Chris Torek):
> [... 'caller-save' vs. 'callee-save' registers ...]
> As shown above, this is no longer true.  If the leaf uses a `large'
> number of registers (more than are available as temporary computation
> registers in non-leaf routines), this statement holds; if not, the
> fact that the routine is a leaf makes the registers `free'.
> 
> (Of course, callers that use lots of registers, and store things in
> the temporary registers, must spill those registers on subroutine
> calls.  This may be what Jim Giles meant all along.  Perhaps someone
> at MIPS can post statistics as to how often this is the case.)

This is exactly what I meant.  The Cray system has a similar mechanism
(and, in fact they even have special types of 'leaf' procedures called
'baselevel' routines).  The problem is that the 'caller' routine still
needs to save 'live' values around calls because the registers assigned
to the 'callee' are nearly always in use.

When I write in assembly, I tend to use all the registers I can in order
to avoid the memory overhead - memory costs about a dozen clocks per
reference while transfers to the temp regs only cost one.  Even with
memory pipelined and running in parallel with other functional units,
this extra delay is expensive.  If I were writing a compiler, I would
be similarly greedy with the registers for generated code.

All this trouble could be avoided if the register use of the 'callee'
were known in advance.  Then the code generator for the 'caller' could
do register scheduling with this extra information in mind.  Still
causes problems if the 'callee' uses a _lot_ of registers, but it's
better than nothing.  Of course, the best deal (if speed were para-
mount) would be to 'inline' the 'callee' completely.  Then the register
scheduling would take place across the call boundary (and the save/
restore could be hidden better under pipelining).
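A sketch in C of what inlining buys (square and sum_of_squares are hypothetical; whether the substitution actually happens is up to the compiler):

```c
/* Small enough to inline: once the body is substituted at each call
 * site, the compiler can schedule registers straight across the old
 * call boundary, and the call/save/restore overhead vanishes. */
static int square(int x) { return x * x; }

int sum_of_squares(int a, int b)
{
    return square(a) + square(b);  /* no calls at all once inlined */
}
```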

> [...]
> The only problem with this last statement (`interprocedural analysis
> cannot be done due to separate compilation') is that someone already
> does it---namely, MIPS do it at the highest compilation level.  Again,
> one does it by cheating: `compile' to `Ucode', an intermediate level
> code that is neither source nor object.  When `linking' the Ucode,
> put everything together, do global flow analysis, optimise, and then
> generate machine code.

I've often thought that code generation should be done by the loader for
this very reason.  Both inlining and register scheduling across calls
would be improvements that would be worth the loader slowdown.  In
addition, the compile step would be considerably faster.  This means
that syntax checking would be a breeze (a common use of the compiler,
like it or not, is as a form of 'lint' - at least for non-C languages).

J. Giles

jlg@lambda.UUCP (Jim Giles) (03/16/90)

From article <1990Mar15.173408.29622@utzoo.uucp>, by henry@utzoo.uucp (Henry Spencer):
> In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
> [...]
>>1) Save/restore all register values that are still 'live' values for the
>>   caller.  ...  The problem is: _MOST_ of the procedure call
>>overhead is concentrated in number (1)!  And, this overhead applies to all
>>procedures - including 'leaf' routines.  Basically, 'leafness' has very
>>little to do with procedure call overhead...
> 
> Um, do you want to go back and re-think that statement, Jim?  Leafness has
> a lot to do with it, because a leaf procedure doesn't have to keep things
> neat in case it calls somebody.  In particular, if you assign some of your
> registers to be scratch registers, normally unused at call time, then a
> leaf procedure just uses them for all its internal work and has to save
> and restore *nothing* except in very unusual cases.  This is exactly what
> the Mips compilers do.  [...]

This is also what the Cray compilers do for 'baselevel' routines.  The
problem arises with your assumption that there are such things as "scratch
registers, normally unused at call time."  If the register scheduling
algorithm in your compiler is sufficiently greedy, you will almost always
have some save/restores to do.  And note: it is almost always a good thing
for your register scheduler to be as greedy as possible.  In fact, the only
exception is the procedure call overhead.  So, I _HAVE_ rethought my first
statement - and I still stand by it!

> [...]
> The correct statement is that the callee must save/restore all register
> values that are still live for the caller *and that the callee is going
> to overwrite*.  [...]

Quite true.  But, in the context of separate compilation, how does the
compiler _KNOW_ which registers are to be used by the 'callee'?  Or, when
compiling the 'callee', how does the compiler know which values are still
'live' for the 'caller'?  Now, I'm all for the idea of interprocedural
analysis.  I'm even a supporter of having the loader do the code generation
for this very reason.  But right now, the procedure interface is still
one of the most costly parts of any program.

> [...]          Careful register-allocation conventions (with an adequate
> supply of registers) can minimize that number, usually to zero in the case
> of leaf functions.

Careful register-allocation conventions are usually the ones that use the
registers most greedily.  This is because, in general, the more you use
the registers, the faster your code goes.  If your "careful" approach to 
registers is not greedy, how much performance am I losing to it by not
getting full use out of the hardware available?  Further: what's "an
adequate supply of registers"?  I know code which can use up as many as
you give me.  In fact, if interprocedural analysis were commonplace
instead of rare, you would probably find that the register set was almost
always completely full (this wouldn't matter since it would be a logical
consequence of register allocation being done on the program instead of
a routine at a time).

These problems may all be solved in the future - even the very near future.
But at present, only MIPS and Cray (the only ones mentioned anyway) have
addressed this problem.  And these two 'solutions' rely on the 'callee'
not using lots of registers and the 'caller' deliberately leaving some
spare ones - but this, in itself, may have a negative impact on performance.

J. Giles

meissner@osf.org (Michael Meissner) (03/16/90)

In article <14272@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:

| Careful register-allocation conventions are usually the ones that use the
| registers most greedily.  This is because, in general, the more you use
| the registers, the faster your code goes.  If your "careful" approach to 
| registers is not greedy, how much performance am I losing to it by not
| getting full use out of the hardware available?  Further: what's "an
| adequate supply of registers"?  I know code which can use up as many as
| you give me.  In fact, if interprocedural analysis were commonplace
| instead of rare, you would probably find that the register set was almost
| always completely full (this wouldn't matter since it would be a logical
| consequence of register allocation being done on the program instead of
| a routine at a time).
| 
| These problems may all be solved in the future - even the very near future.
| But at present, only MIPS and Cray (the only ones mentioned anyway) have
| addressed this problem.  And these two 'solutions' rely on the 'callee'
| not using lots of registers and the 'caller' deliberately leaving some
| spare ones - but this, in itself, may have a negative impact on performance.

Umm, the 88k also partitions the register set into caller save and
callee save.  For the two machines that I'm familiar with, the
breakdown is as follows:

MIPS:	32 integer 32-bit registers
		 1 register hardwired to 0
		 2 registers used for return value and/or static link
		 4 registers used for arguments
		10 registers not preserved across calls
		 9 registers preserved across calls
		 1 register for stack pointer
		 1 register for return address
		 1 register for accessing small static/global data
		 1 register used by the assembler
		 2 registers used by the kernel

	16 floating point 64-bit registers
		 2 registers for return value
		 2 registers for arguments
		 6 registers not preserved across calls
		 6 registers preserved across calls

88K:	32 32-bit registers (double precision takes 2 regs)
		 1 register hardwired to 0
		 8 registers for arguments & return value
		 4 registers not preserved across calls
		13 registers preserved across calls
		 4 registers reserved for linker/OS
		 1 register for the stack pointer
		 1 register for the return address

Note that registers used for passing arguments and returning values
are also used as temps.  If the return address is stored on the stack,
the register that holds it can also be used as a temporary.  Neither
architecture requires the use of a frame pointer, though frame
pointers can be synthesized easily when variable-sized stack
allocations are done.  Finally, on both machines the software defines
static tables that describe where registers are stored on the stack,
and what register and offset from that register are to be used as a
virtual frame pointer by the library and the debugger.

The MIPS compilers also have a -O3 option which does global register
allocation.  Here is an fragment of the man page from a Decstation:

          -O3            Perform all optimizations, including global
                         register allocation.  This option must
                         precede all source file arguments.  With this
                         option, a ucode object file is created for
                         each C source file and left in a .u file.
                         The newly created ucode object files, the
                         ucode object files specified on the command
                         line, the runtime startup routine, and all
                         the runtime libraries are ucode linked.
                         Optimization is performed on the resulting
                         ucode linked file and then it is linked as
                         normal producing an a.out file. A resulting
                         .o file is not left from the ucode linked
                         result.  In fact -c cannot be specified with
                         -O3.

          -feedback file Use with the -cord option to specify the
                         feedback file.  This file is produced by with
                         its -feedback option from an execution of the
                         program produced by

          -cord          Run the procedure-rearranger on the resulting
                         file after linking.  The rearrangement is
                         performed to reduce the cache conflicts of
                         the program's text.  The output is left in
                         the file specified by the -o output option or
                         a.out by default.  At least one -feedback
                         file must be specified.


Because of the restriction of not specifying -c, I'm not sure how many
people use this in practice for large software.  I would imagine that
for programs which use runtime binding (ie, emacs, or C++ code
with virtual functions), it would default back to the standard calling
sequence.  I wonder how much it buys for typical software as opposed
to special cases.
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Catproof is an oxymoron, Childproof is nearly so

jlg@lambda.UUCP (Jim Giles) (03/17/90)

From article <29509@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
> [...]
> If you partition your registers into a set that is live across a
> function call and a set that is killed, then leaf routines can run
> entirely out of the latter set, not having to save any registers.  For
> example, in the Am29000 calling-convention, a function can allocate up
> to 126 local registers which are saved across function calls and
> also has use of 24 "temporary" registers which are not.  Most leaf
> routines can run entirely out of these 24 temps.  The entire overhead
> is simply a call instruction and a return instruction (2 cycles if the
> delay slots are filled).

It seems to me that if a procedure is so small that it can only find
use for 24 (or fewer) registers, then it is small enough to be a real
good candidate for inlining.  Clearly, _any_ procedure can be written
to only _use_ 24 registers (there were some machines in the past with
only _one_ register - they were completely functional).  However, most
procedures of any size can benefit from quite a few more registers than 24.

Similarly, any 'caller' routine that can afford to leave 24 registers
unused is fairly small as well.  For efficiency, all 'live' values
should be kept in registers if at all possible.  The problem is that
any large program (and some small ones) have quite a large number of
'live' values.  Procedures that manipulate arrays may have whole
arrays (or array sections) that are 'live'.   If there's _room_ in
the register set for all these values, that's where they should be
kept.

The fact is that this kind of 'temp' register idea only _appears_
to save time.  You probably actually lose performance by forcing
many of your 'live' values to memory when they needn't have been.

J. Giles

jlg@lambda.UUCP (Jim Giles) (03/17/90)

From article <MEISSNER.90Mar16103226@curley.osf.org>, by meissner@osf.org (Michael Meissner):
> [...]
> The MIPS compilers also have a -O3 option which does global register
> allocation.  Here is an fragment of the man page from a Decstation:
> [...]

Great!  Now if this ability were widespread in the _real_world_ (or,
the rest of the real world - for those of you who consider MIPS to
be real :-), then there would be much less waste in the procedure
interface.  Of course, once you have this capability, you would also
want an 'inline' keyword on procedure declarations, etc..

J. Giles

seanf@sco.COM (Sean Fagan) (03/17/90)

In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
>Most machines implement call/return with single instructions.  However, this
>tends to be the tip of the iceberg for procedure call overhead.  The interface
>_MUST_ do the following:

No, it must not.

>1) Save/restore all register values that are still 'live' values for the
>   caller.

What about callee saves?

>2) Test the stack to make sure there is enough room for the callee's local
>   variables (and call the memory manager for a new stack frame if there
>   wasn't room).

	sub	esp, 1000000

seems to work just fine here.  True, you will get into problems if you run
out of VM, or your stack-pointer wraps around, but I don't know of anyone
who checks for those.

>3) Create (on the stack) a traceback entry so the debugger and/or the
>   postmortem analyzer can find the current thread through the call tree.

It need not be on the stack.  Ever hear of saving the return address in a
register?  You know, like Crays do?  You only have to save the retaddr when
you call another function.

>The problem is: _MOST_ of the procedure call
>overhead is concentrated in number (1)!  And, this overhead applies to all
>procedures - including 'leaf' routines.

True.  That's why register windows were considered a "good thing" (I'm not
sure I like them).  It's also one of the problems you have if you have
large, large amounts of registers (say, 256 anyone?).

>Because modern machines tend
>to have a _large_ number of registers, implementing (1) is usually quite
>expensive - and it can't be made cheaper without interprocedural analysis
>which in turn can't be done as long as separate compilation is possible.

Huh?  The MIPS compiler does interprocedural analysis, and I'm led to
believe that the Pyramid compilers do as well.  It's not *easy* or cheap, to
be sure, but it can be done.

-- 

-----------------+
Sean Eric Fagan  | "Time has little to do with infinity and jelly donuts."
seanf@sco.COM    |    -- Thomas Magnum (Tom Selleck), _Magnum, P.I._
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

djones@megatest.UUCP (Dave Jones) (03/17/90)

From article <29509@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
> 
> If you partition your registers into a set that is live across a
> function call and a set that is killed, then leaf routines can run
> entirely out of the latter set, not having to save any registers.  For
> example, in the Am29000 calling-convention, a function can allocate up
> to 126 local registers which are saved across function calls and
> also has use of 24 "temporary" registers which are not.  ...

Some RISC machines (the SPARC, for example) have an internal register which
acts something like a frame-pointer with respect to the user registers,
yielding what they call 'overlapping register windows'.  Thus user registers
need not be saved and restored except when the window moves off one end of
the register set or the other.

Back in the 70's, TI's 990 and 99000 series chips had about the same thing,
but the "workspaces", as they called the register windows, were accessed
by way of the main-memory busses.  (That architecture had some bad
problems too: a poor jbsr/rtn implementation and 16-bit addresses.)

What I would like to see is a register-set that acts as the cache for
a virtual stack in a machine with true stack machine instructions. That
would be my idea of a really good time, although it might put some graph
colorers out of business, having no registers to allocate. Does anybody
know if such a scheme was proposed for SPARC, and if so, why it was
rejected? Seems like such a win, and not all that big a step
beyond the register-window approach.
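For concreteness, here is a minimal sketch of what "true stack machine
instructions" could look like: every arithmetic operation takes its operands
from the top of an operand stack and leaves its result there, so there are
no register numbers for a compiler to allocate.  The opcodes and encoding
are invented, not taken from any real machine.

```c
/* Opcodes and encoding invented for illustration. */
enum op { PUSH, ADD, MUL, HALT };

struct insn {
    enum op op;
    int     arg;        /* operand for PUSH; ignored otherwise */
};

/* Every arithmetic instruction pops its operands off the operand
   stack and pushes the result back: no register numbers anywhere. */
int run(const struct insn *code)
{
    int stack[64];
    int sp = 0;                                  /* operand stack pointer */

    for (;; code++) {
        switch (code->op) {
        case PUSH: stack[sp++] = code->arg;          break;
        case ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
        case HALT: return stack[sp - 1];
        }
    }
}

/* (2 + 3) * 4 */
static const struct insn example[] = {
    {PUSH, 2}, {PUSH, 3}, {ADD, 0}, {PUSH, 4}, {MUL, 0}, {HALT, 0}
};
```

Evaluating (2 + 3) * 4 is six instructions and never names a register.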

david.megginson@canremote.uucp (DAVID MEGGINSON) (03/17/90)

Turbo C on the Atari ST (M68000) passes the first three numeric 
arguments in D0, D1 and D2, and the first two address arguments in A0 
and A1. You can override this with the cdecl keyword. All floats and 
doubles get passed on the stack. You can buy a lot of speed this way, 
with a reasonably intelligent compiler. The moral: worry about logic and
readability, and let your compiler do the optimizing.

David Megginson, Centre for Medieval Studies
BITNET: meggin@vm.epas.utoronto.ca

PS. I know this is stupid, but in case anyone doesn't know, D0-D7
are the 32-bit data registers in a 68k chip, and A0-A7 are the address
registers, also 32-bit.  (A7 is the stack pointer.)
---
 * Via ProDoor 3.1R 

drc@cs.brown.edu (David R. Chase) (03/17/90)

In article <12350@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
>What I would like to see is a register-set that acts as the cache for
>a virtual stack in a machine with true stack machine instructions. That
>would be my idea of a really good time, although it might put some graph
>colorers out of business, having no registers to allocate.

Not at all.  No doubt the "hardware stack" will be of finite size, and
if that is exceeded then some of it must be saved to memory.  In this
situation, you'll win if stack frames aren't so large, and one way to
do that is to allocate stack slots using algorithms very much like
those used to allocate registers.  Spilling won't occur, of course,
since you can have as many "registers" as are needed, but they don't
come for free.  Use fewer and (ignoring scheduling problems caused by
spurious dependences on reused registers) do better.

Hard life, isn't it?

David
(really in Menlo Park, CA)

meissner@osf.osf.org (Michael Meissner) (03/18/90)

In article <14274@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
|From article <MEISSNER.90Mar16103226@curley.osf.org>, by
|meissner@osf.org (Michael Meissner): 
|> [...]
|> The MIPS compilers also have a -O3 option which does global register
|> allocation.  Here is an fragment of the man page from a Decstation:
|> [...]
|
|Great!  Now if this ability were widespread in the _real_world_ (or,
|the rest of the real world - for those of you who consider MIPS to
|be real :-), then there would be much less waste in the procedure
|interface.  Of course, once you have this capability, you would also
|want an 'inline' keyword on procedure declarations, etc..

The next revision of the MIPS compilers is supposed to have -O4, which
does automatic inlining as well.  It hasn't propagated through DEC
yet...
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Catproof is an oxymoron, Childproof is nearly so

tim@nucleus.amd.com (Tim Olson) (03/19/90)

In article <14273@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
| From article <29509@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
| > [...]
| > If you partition your registers into a set that is live across a
| > function call and a set that is killed, then leaf routines can run
| > entirely out of the latter set, not having to save any registers.  For
| > example, in the Am29000 calling-convention, a function can allocate up
| > to 126 local registers which are saved across function calls and
| > also has use of 24 "temporary" registers which are not.  Most leaf
| > routines can run entirely out of these 24 temps.  The entire overhead
| > is simply a call instruction and a return instruction (2 cycles if the
| > delay slots are filled).
| 
| It seems to me that if a procedure is so small that it can only find
| use for 24 (or fewer) registers, then it is small enough to be a real
| good candidate for inlining.  Clearly, _any_ procedure can be written to
| only _use_ 24 registers (there were some machines in the past with only
| _one_ register - they were completely functional).  However, most procedures
| of any size can benefit from quite a few more registers than 24.

It might be true that scientific routines written in FORTRAN may have
this many live, non-overlapping variables to keep in registers, but I
don't believe this is true in general.  Statistics from a large
collection of programs and library routines (a mix of general and
scientific applications written in C) show that of 782 functions (620
of which were non-leaf functions), an average of 6.5 registers per
function were live across function calls.

| Similarly, any 'caller' routine that can afford to leave 24 registers
| unused is fairly small as well.  For efficiency, all 'live' values
| should be kept in registers if at all possible.  The problem is that
| any large program (and some small ones) have quite a large number of
| 'live' values.  Procedures that manipulate arrays may have whole
| arrays (or array sections) that are 'live'.   If there's _room_ in
| the register set for all these values, that's where they should be
| kept.

Yes, this is true.  However, unless the array manipulation is quite
simple, it becomes tricky to use registers for them (the stats quoted
above were for scalar variables only).

| The fact is that this kind of 'temp' register idea only _appears_
| to save time.  You probably actually lose performance by forcing
| many of your 'live' values to memory when they needn't have been.

No live values were forced to memory (at least not ones in the active
function).  The Am29000 stack cache tries to keep live variables from
all functions in the register file as well as the variables of the
current function.  If the register file becomes full, live variables
from older functions are spilled to memory, then filled as they become
active again.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

tim@nucleus.amd.com (Tim Olson) (03/19/90)

In article <12350@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
| What I would like to see is a register-set that acts as the cache for
| a virtual stack in a machine with true stack machine instructions. That
| would be my idea of a really good time, although it might put some graph
| colorers out of business, having no registers to allocate. Does anybody
| know if such a scheme was proposed for SPARC, and if so, why it was
| rejected? Seems like such a win, and not all that big a step
| beyond the register-window approach.

This was the approach used in the AT&T CRISP processor.


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

albert@bsovax.UUCP (Albert van der Horst) (03/21/90)

In article <12350@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones)
writes (in a follow-up to Tim Olson):
>What I would like to see is a register-set that acts as the cache for
>a virtual stack in a machine with true stack machine instructions. That
>would be my idea of a really good time, although it might put some graph
>colorers out of business, having no registers to allocate. Does anybody
>know if such a scheme was proposed for SPARC, and if so, why it was
>rejected? Seems like such a win, and not all that big a step
>beyond the register-window approach.

This idea is not new.
It is about time it was used: it would provide an ideal machine
to write compilers for.

 There is a weird language that you may not know about.  It is called
FORTH, and it is based on a virtual machine.  This virtual machine is
IMHO exactly right.
  It has two stacks, one for return addresses and one for
calculation; the latter is visible to the FORTH programmer.
The designer of FORTH, Charles Moore, has cast this idea into silicon.
The result was a RISC chip (the NOVIX 4000) with some amazing properties:
for instance, a one-cycle call, and usually a FREE return from a
subroutine.  It also allocates registers as needed.
  I was the president of a small group in the Netherlands that explored
this idea independently.  We got far enough to create a working emulator
(written in FORTH...) for our chip, baptized FIETS (Fast Implementation
of Enhanced Translators and Systems, by sheer accident also the Dutch
word for bicycle).  Then came Moore, and interest in the project faded.
Two of the group members gained hands-on experience with the NOVIX,
resulting in, for example, a program that generates a video signal by
toggling an output pin, showing a picture with the name "NOVIX".

  Despite some coverage by BYTE, the NOVIX did not become a success at
all.  In my opinion, a real mistake was the lack of a C compiler for it:
the FIETS contained 16 stack-based general-purpose registers,
which would have made such a C compiler viable, while the NOVIX had
far fewer registers, a consequence of a too-narrow FORTH-based scope.  Also,
Charles Moore still insists that programs needing more than a 16-bit
address range are design errors....  Marketing was directed towards the
in-crowd of FORTH programmers.  Probably very few of you would
have recognized the value of the chip from the BYTE article; even to someone
FORTH-literate, the article was hard to digest.  In the end, NOVIX went
out of business.

   Fortunately, the basic ideas have been bought by Harris Semiconductor,
and they know better what makes a processor sing.  They have already
produced a few practically successful designs, and are still developing
the line further.  These chips are optimized for real-time signal
processing, and have extremely fast context switches (switching stacks,
with no registers to save).

I do hope we will eventually arrive at this kind of processor.
  The bottom line is this:
you have the choice between your code being inline, or a single-cycle
call/return overhead.
All instructions (well, almost all) do something useful, like an addition
that puts its result in a new register "on the fly" if required;
or, on the other hand, adding an offset to a base address will drop the
offset from the stack automatically.

I think "register caching via stacks" is much easier to implement than
the register-window stuff.  And as far as I can see, it is easier on the
compiler's optimizer too.
                              
                              
N.B. Those interested (Harris?) may mail me for the FIETS instruction set.
N.B. I am currently exploring transputers, which also have an "allocate
     registers as needed" mechanism, though without overflow handling, and
     very shallow.
                              
                              
Views expressed are my own { not all FIETS members would agree.
                           { my boss does not know what I am talking about

Albert van der Horst               {eunet!mcvax!hp4nl!bsovax!albert}
P.O. 8052                                                                    
3053 RB   Utrecht            " It's pitch dark. You might be eaten by a grue" 
The netherlands              " I hope you got more light than from a lamp"

jlg@lambda.UUCP (Jim Giles) (03/21/90)

From article <29551@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
> [...]
> It might be true that scientific routines written in FORTRAN may have
> this many live, non-overlapping variables to keep in registers, but I
> don't believe this is true in general.  Statistics from a large
> collection of programs and library routines (a mix of general and
> scientific applications written in C) show that of 782 functions (620
> of which were non-leaf functions), an average of 6.5 registers per
> function were live across function calls.

This statistic can only be interpreted in one way: the C compiler in
question didn't allocate registers very well.  Especially in scientific
packages, there are _HUGE_ numbers of 'live' _VALUES_ to deal with during
execution of even simple routines.  Vectors, arrays, lists, strings, etc,
are all being either produced or consumed.  The fact that none of these
_VALUES_ were in registers at the time of the call indicates one of two
things: 1) the code in question was fragmented to the point that most
procedures had only a few data items (and scalar at that), or 2) the
compiler simply wasn't using the registers to anywhere near their potential.
Since you imply the code was well written, I reject the first explanation.
That leaves the compiler.

My experience (I don't have statistics) with both Fortran and C is that
good compilers generally PACK the registers with as much live data as
possible.  Even an apparently pure scalar loop that does only 'simple'
operations may be 'unrolled' a few times to make better use of the
registers.  Compilers are becoming available that apply that optimization
automatically (so this isn't just a case which applies only to 'coder
enhanced' code).  If such a loop (and this is still the simple kind
mind you) had a procedure call embedded in it, all the registers that
the procedure might use would have to be spilled to memory - and then
reloaded on return.  On the Cray, spilling and reloading just _ONE_
vector register is over 150 clocks (64 elements at one clock each
to and from memory plus time for the memory pipeline plus a little
overhead to set up stride, address, etc.).  This is _not_ a tiny
problem.
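The shape of the problem can be sketched in C (the function f() and the
data here are invented; the spills described in the comments are what a
conservative compiler must assume, not something visible in the source):

```c
/* f() stands in for an externally compiled function the compiler
   cannot see into. */
int f(int x) { return x + 1; }

/* With the call in the loop body, a conservative compiler must
   assume f() clobbers all caller-saved registers, so partial sums
   kept in them are spilled to memory and reloaded around each call. */
int sum_calls(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += f(a[i]);
    return s;
}

/* Same computation with f() expanded by hand and the loop unrolled
   twice: the partial sums s0 and s1 can now live in registers for
   the whole loop. */
int sum_inline(const int *a, int n)
{
    int s0 = 0, s1 = 0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i] + 1;
        s1 += a[i + 1] + 1;
    }
    for (; i < n; i++)      /* leftover element when n is odd */
        s0 += a[i] + 1;
    return s0 + s1;
}

static const int demo_data[5] = {1, 2, 3, 4, 5};
```

Both versions compute the same sum; only the second gives the register
allocator an uninterrupted loop to work with.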

Again I say, this scheme of having 'preserved' vs. 'temp' registers
for procedure calls only _appears_ to save time.  In truth, it deprives
you of registers which could be put to use (at least if your compiler
was clever enough).  The only solution to the problem is to 'inline'
(or, at least, schedule registers globally).  At present, in most
environments, the only way to do this is by manually 'inlining' the
procedures.  Let's hope that more automatic solutions become generally
available in the next few years.

By the way, aside from C's aliasing problem, Fortran and C are
identical with respect to optimization, register allocation, etc.
So your implied put-down of Fortran is not relevant in this context.

J. Giles

dik@cwi.nl (Dik T. Winter) (03/21/90)

In article <584@bsovax.UUCP> albert@BSOVAX.UUCP (Albert van der Horst) writes:
 > In article <12350@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones)
 > writes (in a follow-up to Tim Olson):
 > >What I would like to see is a register-set that acts as the cache for
 > >a virtual stack in a machine with true stack machine instructions.
 > 
 > This idea is not new.

No, indeed.  Look at the English Electric KDF9 (of fifties vintage).
-- 
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

tim@nucleus.amd.com (Tim Olson) (03/22/90)

In article <14281@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
| From article <29551@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
| > [...]
| > It might be true that scientific routines written in FORTRAN may have
| > this many live, non-overlapping variables to keep in registers, but I
| > don't believe this is true in general.  Statistics from a large
| > collection of programs and library routines (a mix of general and
| > scientific applications written in C) show that of 782 functions (620
| > of which were non-leaf functions), an average of 6.5 registers per
| > function were live across function calls.
| 
| This statistic can only be interpreted in one way: the C compiler in
| question didn't allocate registers very well.  Especially in scientific
| packages, there are _HUGE_ numbers of 'live' _VALUES_ to deal with during
| execution of even simple routines.  Vectors, arrays, lists, strings, etc,
| are all being either produced or consumed.  The fact that none of these
| _VALUES_ were in registers at the time of the call indicates one of two
| things: 1) the code in question was fragmented to the point that most
| procedures had only a few data items (and scalar at that), or 2) the
| compiler simply wasn't using the registers to anywhere near their potential.
| Since you imply the code was well written, I reject the first explanation.
| That leaves the compiler.

Do you have any statistics to back up your assertions?  The values I
quoted match what other people have found as well.  Again, you may be
only looking at a small portion of scientific code and missing what is
going on in general code.

The compiler that I used to collect the stats will keep all live
scalars in the register file, but it doesn't do inlining of functions,
and it doesn't try to operate on arrays in registers (how do you generate
code to deal with runtime address calculations?).

| My experience (I don't have statistics) with both Fortran and C is that
		 ^^^^^^^^^^^^^^^^^^^^^^^
Oh.


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

jlg@lambda.UUCP (Jim Giles) (03/23/90)

From article <29585@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
> [...]
> The compiler that I used to collect the stats will keep all live
> scalars in the register file, [...]                    [^^^]

I don't believe a word of it.  The word I don't believe is 'all'.
_ANY_ scalar that is visible in the local scope and has a value
which might still be used by the program at some future point is 'live'.
(The same obviously applies to non-scalar values.)  That is the definition
of 'live'.  What you really mean is that the compiler breaks code into
'basic blocks' and keeps all active values from _that_ block in registers.
The use of 'basic blocks' and the lack of global dataflow analysis is
confusing you into thinking that only a few values are 'live' at a given
point.  Also, it is a considerable mistake not to regard non-scalar values
as important.
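In C terms, the definition looks like this (an invented example; the
comments mark where each value is live by exactly the criterion above):

```c
/* All values invented.  A value is live at a point if it may still
   be used later; it dies at its last use. */
int demo(int a, int b)
{
    int x = a + b;      /* x becomes live */
    int y = a - b;      /* last use of a and b: both die here */
    int z = x * y;      /* last use of x and y: both die here */
    return z + 1;       /* only z is live at this point */
}
```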

The bottom line is that registers are an important resource in any machine.
They are often more than 10 times faster to reference than memory.  Your
analysis indicates that your code (whether the compiler is responsible
or not) is using this resource at only a small fraction of its potential.
But instead of complaining, you are acting like this is a desirable way
for the code to behave.

> | My experience (I don't have statistics) with both Fortran and C is that
> 		 ^^^^^^^^^^^^^^^^^^^^^^^
> Oh.

Ok, now I have statistics.  I just did a quick survey of some expertly
crafted C code that I have source code to (it's actually part of the
X11 window package).  This is only a small program (it's xlock for those
interested in checking).  Still, there are over 50 scalar variables
declared global to the utility - these are _ALWAYS_ 'live'.  Further,
this code includes a lot of window related declarations (which also
contain a bunch of _ALWAYS_ 'live' global declarations).  This is
admittedly a small sample and it disregards local variables and non-scalars.
Still, since most of the program consists of small procedures which
manipulate these global variables, I suspect the code would be smaller and
faster if those values always occupied registers.  xlock probably doesn't
need to be faster, but if speed were important, register usage would
be too.

J. Giles

peter@ficc.uu.net (Peter da Silva) (03/23/90)

(for comp.arch folks, Jim is basically saying that the more registers you
 have, the better, because it lets you put more variables in registers)

But, Jim, registers aren't free. The more registers you have the more bits
you have to hide in your carefully crafted instruction set to specify
registers. Modern RISC processors have up to 4 arguments to a given
instruction, so even if they're all registers (no mixed-mode instructions)
and you only have 16 registers, that's 16 bits gone right there.

Now, let's add some more bits to specify constants and maybe some addressing
modes, and you can run out of bits pretty quickly. With 64 registers (about
the most I'd expect you to be happy with), and immediate operands that
would leave you with about 5 bits to specify the instruction. Now suppose
you have room on the chip for 512 registers... why *not* use register
windows, or a stack cache, or something similar instead of just making
the instruction wider?
-- 
 _--_|\  `-_-' Peter da Silva. +1 713 274 5180. <peter@ficc.uu.net>.
/      \  'U`
\_.--._/
      v

tim@nucleus.amd.com (Tim Olson) (03/23/90)

In article <14285@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
| From article <29585@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
| > [...]
| > The compiler that I used to collect the stats will keep all live
| > scalars in the register file, [...]                    [^^^]
| 
| I don't believe a word of it.  The word I don't believe is 'all'.
| _ANY_ scalar that is visible in the local scope and has a value
| which might still be used by the program at some future point is 'live'.
| (The same obviously applies to non-scalar values.)  That is the definition
| of 'live'.

Sorry -- I should have clarified that it keeps all live scalar
variables *that don't have potential aliasing problems* in the
register file.  Since global variables are visible to other functions,
they have the potential for having their address taken in some other
compilation unit (remember, we are talking about C code), and thus
cannot be kept in registers -- unless "universal" register allocation
and optimization is performed.
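A small sketch of the problem (names invented): because g is global,
another compilation unit could hand its address back in, so the compiler
must assume a store through an int pointer may change g, and must reload g
from memory instead of caching it in a register.

```c
int g = 10;       /* global: its address may escape to other files */

/* If p happens to point at g, the store below changes g, so the
   compiler cannot keep g cached in a register across the store;
   it must reload it.  The return value makes the reload visible:
   it is 1 exactly when p aliases g. */
int bump(int *p)
{
    int before = g;         /* first load of g */
    *p += 1;                /* may or may not modify g */
    return g - before;      /* g must be re-read from memory here */
}
```

Calling bump(&g) returns 1; calling it on any other int returns 0.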

| What you really mean is that the compiler breaks code into
| 'basic blocks' and keeps all active values from _that_ block in registers.
| The use of 'basic blocks' and the lack of global dataflow analysis is
| confusing you into thinking that only a few values are 'live' at a given
| point.

No, the compiler is performing register allocation on an entire
function, not on a basic block level.  It is performing "global"
dataflow analysis (meaning on a per-function basis), but not
"universal" dataflow analysis (on the entire program, including all
other compilation units).

| Also, it is a considerable mistake not to regard non-scalar values
| as important.

I never said they weren't important -- it is just that it is very hard
(sometimes impossible) to operate on structures and arrays that are
kept entirely in registers (runtime indexing, etc.).  If you mean that
*scalar elements* of an array that are used often should be kept in
the register file, then I agree, and we do that.
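For instance (an invented sketch): an element that is reused on every
iteration can be copied into a local scalar, which the compiler can then
keep in the register file; a runtime-indexed access like a[i] cannot be
handled that way.

```c
/* a[0] is used on every iteration, so it is copied once into the
   scalar w, which the compiler can keep in a register; the
   runtime-indexed a[i] still has to come from memory (or cache). */
int weighted_sum(const int *a, int n)
{
    int w = a[0];           /* hot element cached in a scalar */
    int s = 0;
    for (int i = 1; i < n; i++)
        s += w * a[i];      /* no repeated load of a[0] */
    return s;
}

static const int ws_data[4] = {2, 1, 2, 3};
```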

| The bottom line is that registers are an important resource in any machine.
| They are often more than 10 times faster to reference than memory.  Your
| analysis indicates that your code (whether the compiler is responsible
| or not) is using this resource at only a small fraction of its potential.
| But instead of complaining, you are acting like this is a desirable way
| for the code to behave.

No, I don't claim it is "desirable", just that a lot of
general-purpose C code tends to have a large number of small to
medium-sized functions which use < 20 local scalars.

| Ok, now I have statistics.  I just did a quick survey of some expertly
| crafted C code that I have source code to (it's actually part of the
| X11 window package).  This is only a small program (it's xlock for those
| interested in checking).  Still, there are over 50 scalar variables
| declared global to the utility - these are _ALWAYS_ 'live'.
	   ^^^^^^

OK -- now I see where you are getting your numbers and where we
differ.  The stats I collected were for local ("automatic")
scalars, because global scalars can have aliasing problems unless we
perform "universal" dataflow analysis.  I agree that if this were
done, the number of variables live across a function call would increase
dramatically.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) (03/26/90)

In article <14281@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:

   From article <29551@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
   > [...]
   > It might be true that scientific routines written in FORTRAN may have
   > this many live, non-overlapping variables to keep in registers, but I
   > don't believe this is true in general.  Statistics from a large
   > collection of programs and library routines (a mix of general and
   > scientific applications written in C) show that of 782 functions (620
   > of which were non-leaf functions), an average of 6.5 registers per
   > function were live across function calls.

   This statistic can only be interpreted in one way: the C compiler in
   question didn't allocate registers very well.  Especially in scientific
   packages, there are _HUGE_ numbers of 'live' _VALUES_ to deal with during
   execution of even simple routines.  Vectors, arrays, lists, strings, etc,
   are all being either produced or consumed.

This is an old fallacy: the number of useful registers is usually quite
low; the Wall paper and others say that for most codes, even floating
point intensive ones, 4-8-16 registers make do.  The problem that Giles
does not seem to consider is that caching values in registers is only
useful if the values are going to be used repeatedly, like all forms of
caching.  It is not difficult to produce examples of fairly common pieces
of code where on many machines register caching worsens performance.

Many registers are useful when:

1) Your so called 'optimizer' does not select values to cache on
expected dynamic frequency of use but on static frequency of use. Since
the two are poorly correlated, your so called 'optimizer' wants to cache
everything in sight.

2) You have extremely high latency to memory, and you want to use a
large register cache as a large cache, where even infrequently reused
values are insufferably expensive to refetch.

3) You have extremely high latency to memory, and you can prefetch
blocks of operands while other blocks of operands are being processed,
because you know which operands are going to be needed next, like with
vector machines.

4) You have multiple functional units, and each of them can then make
use of a set of registers.

Note that none of these really means that you need lots of registers:
1) means that your compiler is stupid, 2) that you are missing a proper
dynamic cache, and 3) and 4) that you actually have multiple threads of
control.

My aversion to large register caches and so-called clever optimizers
should be well known by now.  It stems from my opinion that stupid
compilers are to be avoided, that heavily optimizing compilers are
accident-prone and easily made unnecessary by careful coding where it
matters, and that I am only interested in general-purpose architectures.
It is always possible to design a specific architecture that isn't such...

In particular, there are two uses for registers: one is as inputs to
functional units, and the other is as a cache.  There are machines,
especially stack-oriented ones with caches, that have only specialized
registers, i.e. each register is there only as input to a functional
unit.  The Manchester mainframes were specialized-register machines; CRISP
is in some sense such a machine as well.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

preston@titan.rice.edu (Preston Briggs) (03/26/90)

In article <PCG.90Mar25230051@rupert.cs.aber.ac.uk> pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) writes:

>This is an old fallacy: the number of useful registers is usually quite
>low; the Wall paper and others say that for most codes, even floating
>point intensive ones, 4-8-16 registers make do. 

Experience papers that say "n is enough registers" should be read
with the caveat "for my optimizer and my benchmarks."  More aggressive
optimization will usually make profitable use of more registers.

>The problem that Giles
>does not seem to consider is that caching values in registers is only
>useful if the values are going to be used repeatedly, like all forms of
>caching. 

On many (most?) machines, 2 uses is enough to justify keeping a value
in a register.

>It is not difficult to produce examples of fairly common pieces
>of code where on many machines register caching worsens performance.

Of course, we can produce plenty of examples where many registers are
helpful.  Furthermore, could you post an example of some sort?  I spend a
lot of time thinking about this stuff, and hard cases would help.

>Many registers are useful when:
>1) Your so called 'optimizer' does not select values to cache on
>expected dynamic frequency of use but on static frequency of use.

Surely bad optimizers/register-allocators waste registers.
(Naturally, your reply will be "there's no such thing as a good optimizer.")

>2) You have extremely high latency to memory
>3) You have extremely high latency to memory, and you can prefetch
>blocks of operands while other blocks of operands are being processed,
>4) You have multiple functionals units

But how many chips does this describe?  Today, many; next year, many more.
The i860 has a fast cache, but it's small and a miss costs about 25 cycles.
Many chips have load pipelines, where it's profitable to fetch several
cycles in advance.  And how many chipsets have asynchronous FP processors?

CPUs outrun memory.  They have for years, and memory isn't catching up.
Hence the development of caches, multi-level caches,
wide data busses, and large register sets.

>1) means that your compiler is stupid, 
Right, but I don't like stupid compilers either.

>2) that you are missing a proper dynamic cache
Cache isn't a cure-all.  It's finite and fairly simplistic.
Long line sizes and limited set associativity tend to restrict
its ability to replace registers.

>and 3) and 4) that you have actually multiple threads of control.
Or perhaps the compiler might find enough low-level parallelism
to keep your chip busy.

>My aversion to large register caches
You like "regular" caches but not register caches...
Aren't registers just the top of the memory hierarchy?
They are rather more subject to control from software, but I'd
think that was a plus.  I'd rather see the systems extended so that
the other layers of the hierarchy were also under software control
(prefetches to cache, fetches around cache, prefetching pages of
virtual memory, and so forth).

--
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

jlg@lambda.UUCP (Jim Giles) (03/27/90)

From article <PCG.90Mar25230051@rupert.cs.aber.ac.uk>, by pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi):
> [...]                                           The problem that Giles
> does not seem to consider is that caching values in registers is only
> useful if the values are going to be used repeatedly, like all forms of
> caching. 

I don't see where I've been at fault here.  The _definition_ of 'live'
values is that they are to be used again.  The more of these that you
can place into registers the better.  The compiler should do enough
dataflow analysis to know which 'live' values are needed next, and should
schedule registers to preload these (and hold them until other values are
of more immediate use).  Procedure calls interrupt this analysis (at least,
if you don't have interprocedural analysis).  This problem doesn't go away
(even with your well known masochistic micro-management style of programming)
without hand 'inlining' or automatic interprocedural analysis.  In fact,
hand 'inlining' is just the sort of micro-optimization that your style of
programming would recommend!  So, in this case, we should be in agreement.

J. Giles