[comp.lang.misc] function calls

henry@utzoo.uucp (Henry Spencer) (03/15/90)

In article <2000@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>> I was under the impression that C's function calls are, relative to other
>> high level languages, very inexpensive...
>
>Simple, yes.  Inexpensive, no.  I can conceive of a way in which hardware
>can be constructed which would make subroutine calls inexpensive, but I
>know of no machine which has it.  

Many people can conceive of such ways, and a good many of them have been
implemented, some quite successfully.  On the Mips machines, for example,
a C function call to a "leaf" function (one which calls no others) is
one instruction for the call and one for the return, total elapsed time
a small fraction of a microsecond.  Non-leaf functions have to do a little
bit more work at times.

However, as Steve Johnson said at the winter 1987 Usenix conference, talking
about the early days:  ``Dennis had a lot of credibility and lied very
convincingly... it was years before we found out the truth.''  On the early
Unix machines (pdp11 and VAX), function calls were not particularly cheap
(although it is true that C's calls are cheaper than those of many other
languages).  Machine designers have paid a lot more attention to the matter
in recent years, and the situation has improved greatly.

>A subroutine necessarily must find its arguments (or pointers to them, etc.)
>in standardized places.  It also must put its results in such places, or use
>pointers to know where to put them.  Occasionally one can put the arguments
>in those places or use the results that way, but not always...

Not always, no, but a lot of the time.  Many modern machines are organized
so that most function parameters simply get evaluated into a register
(usually necessary anyway) and left there for the callee to find.  There
is sometimes a small cost of having to save the previous value of that
register, depending on circumstances and architecture.
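
A minimal C sketch of the idea (the function names are invented): under a
register-argument convention the caller evaluates each argument straight
into an argument register, and the callee simply uses it there.

	/* Illustrative only -- scale() is a made-up external function. */
	extern int scale(int x);

	int twice_scaled(int a, int b)
	{
	    return 2 * scale(a + b);   /* a+b is evaluated straight into the
	                                  first argument register; no argument
	                                  list is built in memory */
	}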

>...necessarily a context switch to some extent...

It can be a very small extent on a well-designed machine.  There *is* a
fundamental problem of instruction reference locality, which necessarily
slows things down a bit on machines with hardware optimized for sequential
fetches.
-- 
MSDOS, abbrev:  Maybe SomeDay |     Henry Spencer at U of Toronto Zoology
an Operating System.          | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

jlg@lambda.UUCP (Jim Giles) (03/15/90)

From article <1990Mar14.172152.25436@utzoo.uucp>, by henry@utzoo.uucp (Henry Spencer):
> [... efficient procedure call interface ...]
> Many people can conceive of such ways, and a good many of them have been
> implemented, some quite successfully.  On the Mips machines, for example,
> a C function call to a "leaf" function (one which calls no others) is
> one instruction for the call and one for the return, total elapsed time
> a small fraction of a microsecond.  Non-leaf functions have to do a little
> bit more work at times.

Most machines implement call/return with single instructions.  However, this
tends to be the tip of the iceberg for procedure call overhead.  The interface
_MUST_ do the following:

1) Save/restore all register values that are still 'live' values for the
   caller.

2) Test the stack to make sure there is enough room for the callee's local
   variables (and call the memory manager for a new stack frame if there
   wasn't room).

3) Create (on the stack) a traceback entry so the debugger and/or the
   postmortem analyzer can find the current thread through the call tree.

A 'leaf' routine _might_ get away without doing number (3) by setting some
'leaf' flag.  A virtual memory machine might simplify number (2) to just a
test/fatal_error combination since stack overflow might be considered too
rare to justify much effort.  The problem is: _MOST_ of the procedure call
overhead is concentrated in number (1)!  And, this overhead applies to all
procedures - including 'leaf' routines.  Basically, 'leafness' has very
little to do with procedure call overhead.  Because modern machines tend
to have a _large_ number of registers, implementing (1) is usually quite
expensive - and it can't be made cheaper without interprocedural analysis
which in turn can't be done as long as separate compilation is possible.
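
A small C fragment (purely illustrative, with invented names) shows where
that cost lands: anything still live across a call has to survive the
callee somehow, either spilled by the caller or saved by the callee.

	extern int g(int);

	int f(int a, int b)
	{
	    int t1 = a * a;      /* live across the call below */
	    int t2 = b * b;      /* live across the call below */
	    int c  = g(a + b);   /* t1 and t2 must be spilled here (caller-
	                            saves) or kept in registers that g()
	                            promises to preserve (callee-saves) */
	    return t1 + t2 + c;
	}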
Oh, well---

J. Giles

chris@mimsy.umd.edu (Chris Torek) (03/15/90)

In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
>Most machines implement call/return with single instructions.  However, this
>tends to be the tip of the iceberg for procedure call overhead.  The interface
>_MUST_ do the following:
>
>1) Save/restore all register values that are still 'live' values for the
>   caller.

Right.

>2) Test the stack to make sure there is enough room for the callee's local
>   variables (and call the memory manager for a new stack frame if there
>   wasn't room).

True, but (as Jim Giles points out) VM machines can cheat.  (He does not
say how to cheat, but it is very easy: do not test.  Simply try to use the
new stack address(es).  If they are invalid, the kernel takes an access
fault and decides whether to grow the stack or to terminate the program.)

>3) Create (on the stack) a traceback entry so the debugger and/or the
>   postmortem analyzer can find the current thread through the call tree.

There is a way to cheat on this as well; MIPS use it in their compilers.
When a function does not need a frame pointer---most do not---it gets
instead a `virtual frame pointer' which is simply an offset from the
stack pointer.  The debugger reads the symbol table of the object file
to find out which register is the stack pointer, and what the offset from
this register is, so as to find the previous call (and any saved registers,
using a compiler-generated field also found in the object file).

>The problem is: _MOST_ of the procedure call overhead is concentrated in
>number (1)!  And, this overhead applies to all procedures - including
>'leaf' routines.

True; and this is where most of the effort at reducing function call
overhead must be applied.  Fortunately, some smart guys have figured
out some good methods to handle this.  On the MIPS, for instance, the
system designates some registers as `caller saves' and some registers
as `callee saves'.  A leaf procedure heads straight for the `caller
saves' registers; others head for the `callee saves' registers.

In other words, leaf routines can (provided they do not need a huge
number of registers) get away with saving and restoring nothing.  (They
need not even save and restore the stack pointer, if a `caller saves'
register is free: simply do something like

	add	#-40,standard_sp,temp_reg_3 // set temp_reg to sp-40

and then use temp_reg_3 as your `stack pointer'.)

>Basically, 'leafness' has very little to do with procedure call overhead.

As shown above, this is no longer true.  If the leaf uses a `large'
number of registers (more than are available as temporary computation
registers in non-leaf routines), this statement holds; if not, the
fact that the routine is a leaf makes the registers `free'.

(Of course, callers that use lots of registers, and store things in
the temporary registers, must spill those registers on subroutine
calls.  This may be what Jim Giles meant all along.  Perhaps someone
at MIPS can post statistics as to how often this is the case.)

>Because modern machines tend to have a _large_ number of registers,
>implementing (1) is usually quite expensive - and it can't be made
>cheaper without interprocedural analysis which in turn can't be done as
>long as separate compilation is possible.

The only problem with this last statement (`interprocedural analysis
cannot be done due to separate compilation') is that someone already
does it---namely, MIPS do it at the highest compilation level.  Again,
one does it by cheating: `compile' to `Ucode', an intermediate level
code that is neither source nor object.  When `linking' the Ucode,
put everything together, do global flow analysis, optimise, and then
generate machine code.

(Of course, linking starts to get very slow, encouraging people to
run with lower optimisation levels.  This is more or less as it should
be, though: you choose the trade-off of compilation time for run time.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

tim@nucleus.amd.com (Tim Olson) (03/16/90)

In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
| Most machines implement call/return with single instructions.  However, this
| tends to be the tip of the iceberg for procedure call overhead.  The interface
| _MUST_ do the following:
| 
| 1) Save/restore all register values that are still 'live' values for the
|    caller.
| 
| 2) Test the stack to make sure there is enough room for the callee's local
|    variables (and call the memory manager for a new stack frame if there
|    wasn't room).
| 
| 3) Create (on the stack) a traceback entry so the debugger and/or the
|    postmortem analyzer can find the current thread through the call tree.
| 
| The problem is: _MOST_ of the procedure call
| overhead is concentrated in number (1)!  And, this overhead applies to all
| procedures - including 'leaf' routines.  Basically, 'leafness' has very
| little to do with procedure call overhead.

If you partition your registers into a set that is live across a
function call and a set that is killed, then leaf routines can run
entirely out of the later set, not having to save any registers.  For
example, in the Am29000 calling-convention, a function can allocate up
to 126 local registers which are saved across function calls and
also has use of 24 "temporary" registers which are not.  Most leaf
routines can run entirely out of these 24 temps.  The entire overhead
is simply a call instruction and a return instruction (2 cycles if the
delay slots are filled).
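
As a hedged C-level illustration (the function is invented), a typical
leaf looks something like this; everything it touches fits in the
killed-across-calls set, so nothing is saved or restored:

	/* Illustrative leaf routine: no calls, only a few temporaries,
	   so a compiler can keep everything in the "temporary" registers;
	   entry and exit cost just the call and return instructions. */
	int dot3(const int *x, const int *y)
	{
	    return x[0]*y[0] + x[1]*y[1] + x[2]*y[2];
	}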

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

henry@utzoo.uucp (Henry Spencer) (03/16/90)

In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
>> ... On the Mips machines, for example,
>> a C function call to a "leaf" function (one which calls no others) is
>> one instruction for the call and one for the return, total elapsed time
>> a small fraction of a microsecond.  Non-leaf functions have to do a little
>> bit more work at times.
>
>Most machines implement call/return with single instructions.  However, this
>tends to be the tip of the iceberg for procedure call overhead.  The interface
>_MUST_ do the following:
>
>1) Save/restore all register values that are still 'live' values for the
>   caller.  ...  The problem is: _MOST_ of the procedure call
>overhead is concentrated in number (1)!  And, this overhead applies to all
>procedures - including 'leaf' routines.  Basically, 'leafness' has very
>little to do with procedure call overhead...

Um, do you want to go back and re-think that statement, Jim?  Leafness has
a lot to do with it, because a leaf procedure doesn't have to keep things
neat in case it calls somebody.  In particular, if you assign some of your
registers to be scratch registers, normally unused at call time, then a
leaf procedure just uses them for all its internal work and has to save
and restore *nothing* except in very unusual cases.  This is exactly what
the Mips compilers do.  When I said that a call to a leaf function on a
Mips machine was one instruction each way, that's *one* instruction,
total, end to end, no further overhead involved.  And those are RISC
instructions, normally one cycle apiece.

The correct statement is that the callee must save/restore all register
values that are still live for the caller *and that the callee is going
to overwrite*.  Careful register-allocation conventions (with an adequate
supply of registers) can minimize that number, usually to zero in the case
of leaf functions.

Most of the other overheads you mention can likewise be minimized or
eliminated with some thought.  Stack-overflow probing is needed only on
systems without MMUs, or with defective MMUs; even the pdp11, the original
C machine, did without it.  Traceback overhead for debuggers can be
reduced considerably with better cooperation between compiler and debugger,
and a modern compiler at a high level of optimization generally produces
stuff that a debugger can't handle anyway, so it can jettison that entirely.
Interprocedural optimization in the presence of separate compilation is
not merely possible, it's commercially available:  -O3 on a Mips compiler.

Much of the "common wisdom" on the subject of call overhead simply isn't
true if you are willing to work hard at falsifying it.  (Actually, the
same is true of many other performance "constraints"...)
-- 
MSDOS, abbrev:  Maybe SomeDay |     Henry Spencer at U of Toronto Zoology
an Operating System.          | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

preston@titan.rice.edu (Preston Briggs) (03/16/90)

In article <1990Mar15.173408.29622@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:

>>1) Save/restore all register values that are still 'live' values for the
>>   caller.  ...  The problem is: _MOST_ of the procedure call
>>overhead is concentrated in number (1)!  

>The correct statement is that the callee must save/restore all register
>values that are still live for the caller *and that the callee is going
>to overwrite*.  Careful register-allocation conventions (with an adequate
>supply of registers) can minimize that number, usually to zero in the case
>of leaf functions.

And of course there are register windows of various kinds.

They don't solve the register allocation problem, but they
do help reduce call overhead, at some expense to task
switching overhead (but we know which is more common).

On the other hand, Wall argues that link-time register
allocation can surpass register-window performance.
See his paper

	Register Windows vs. Register Allocation
	D. W. Wall
	Proceedings of the SIGPLAN '88 Conference on
		Programming Language Design and Implementation

--
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

jlg@lambda.UUCP (Jim Giles) (03/16/90)

From article <23113@mimsy.umd.edu>, by chris@mimsy.umd.edu (Chris Torek):
> [... 'caller-save' vs. 'callee-save' registers ...]
> As shown above, this is no longer true.  If the leaf uses a `large'
> number of registers (more than are available as temporary computation
> registers in non-leaf routines), this statement holds; if not, the
> fact that the routine is a leaf makes the registers `free'.
> 
> (Of course, callers that use lots of registers, and store things in
> the temporary registers, must spill those registers on subroutine
> calls.  This may be what Jim Giles meant all along.  Perhaps someone
> at MIPS can post statistics as to how often this is the case.)

This is exactly what I meant.  The Cray system has a similar mechanism
(and, in fact they even have special types of 'leaf' procedures called
'baselevel' routines).  The problem is that the 'caller' routine still
needs to save 'live' values around calls because the registers assigned
to the 'callee' are nearly always in use.

When I write in assembly, I tend to use all the registers I can in order
to avoid the memory overhead - memory costs about a dozen clocks per
reference while transfers to the temp regs only cost one.  Even with
memory pipelined and running in parallel with other functional units,
this extra delay is expensive.  If I were writing a compiler, I would
be similarly greedy with the registers for generated code.

All this trouble could be avoided if the register use of the 'callee'
were known in advance.  Then the code generator for the 'caller' could
do register scheduling with this extra information in mind.  Still
causes problems if the 'callee' uses a _lot_ of registers, but it's
better than nothing.  Of course, the best deal (if speed were para-
mount) would be to 'inline' the 'callee' completely.  Then the register
scheduling would take place across the call boundary (and the save/
restore could be hidden better under pipelining).
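
A tiny C sketch (names invented) of what inlining buys: once the callee's
body is expanded in place, the register allocator sees a single routine,
and the save/restore at the former call boundary simply disappears.

	static int sq(int x) { return x * x; }   /* small callee */

	int sum_of_squares(int a, int b)
	{
	    /* If the compiler expands sq() inline, a, b, and both partial
	       results can be register-scheduled together; there is no call,
	       so nothing is spilled around one. */
	    return sq(a) + sq(b);
	}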

> [...]
> The only problem with this last statement (`interprocedural analysis
> cannot be done due to separate compilation') is that someone already
> does it---namely, MIPS do it at the highest compilation level.  Again,
> one does it by cheating: `compile' to `Ucode', an intermediate level
> code that is neither source nor object.  When `linking' the Ucode,
> put everything together, do global flow analysis, optimise, and then
> generate machine code.

I've often thought that code generation should be done by the loader for
this very reason.  Both inlining and register scheduling across calls
would be improvements that would be worth the loader slowdown.  In
addition, the compile step would be considerably faster.  This means
that syntax checking would be a breeze (a common use of the compiler,
like it or not, is as a form of 'lint' - at least for non-C languages).

J. Giles

jlg@lambda.UUCP (Jim Giles) (03/16/90)

From article <1990Mar15.173408.29622@utzoo.uucp>, by henry@utzoo.uucp (Henry Spencer):
> In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
> [...]
>>1) Save/restore all register values that are still 'live' values for the
>>   caller.  ...  The problem is: _MOST_ of the procedure call
>>overhead is concentrated in number (1)!  And, this overhead applies to all
>>procedures - including 'leaf' routines.  Basically, 'leafness' has very
>>little to do with procedure call overhead...
> 
> Um, do you want to go back and re-think that statement, Jim?  Leafness has
> a lot to do with it, because a leaf procedure doesn't have to keep things
> neat in case it calls somebody.  In particular, if you assign some of your
> registers to be scratch registers, normally unused at call time, then a
> leaf procedure just uses them for all its internal work and has to save
> and restore *nothing* except in very unusual cases.  This is exactly what
> the Mips compilers do.  [...]

This is also what the Cray compilers do for 'baselevel' routines.  The
problem arises with your assumption that there are such things as "scratch
registers, normally unused at call time."  If your register scheduling
algorithm in the compiler is sufficiently greedy, you will almost always
have some save/restores to do.  And note: it is almost always a good thing
for your register scheduler to be as greedy as possible.  In fact, the only
exception is the procedure call overhead.  So, I _HAVE_ rethought my first
statement - and I still stand by it!

> [...]
> The correct statement is that the callee must save/restore all register
> values that are still live for the caller *and that the callee is going
> to overwrite*.  [...]

Quite true.  But, in the context of separate compilation, how does the
compiler _KNOW_ which registers are to be used by the 'callee'?  Or, when
compiling the 'callee', how does the compiler know which values are still
'live' for the 'caller'?  Now, I'm all for the idea of interprocedural
analysis.  I'm even a supporter of having the loader do the code generation
for this very reason.  But right now, the procedure interface is still
one of the most costly parts of any program.

> [...]          Careful register-allocation conventions (with an adequate
> supply of registers) can minimize that number, usually to zero in the case
> of leaf functions.

Careful register-allocation conventions are usually the ones that use the
registers most greedily.  This is because, in general, the more you use
the registers, the faster your code goes.  If your "careful" approach to 
registers is not greedy, how much performance am I losing to it by not
getting full use out of the hardware available?  Further: what's "an
adequate supply of registers"?  I know code which can use up as many as
you give me.  In fact, if interprocedural analysis were commonplace
instead of rare, you would probably find that the register set was almost
always completely full (this wouldn't matter since it would be a logical
consequence of register allocation being done on the program instead of
a routine at a time).

These problems may all be solved in the future - even the very near future.
But at present, only MIPS and Cray (the only ones mentioned anyway) have
addressed this problem.  And these two 'solutions' rely on the 'callee'
not using lots of registers and the 'caller' deliberately leaving some
spare ones - but this, in itself, may have a negative impact on performance.

J. Giles

meissner@osf.org (Michael Meissner) (03/16/90)

In article <14272@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:

| Careful register-allocation conventions are usually the ones that use the
| registers most greedily.  This is because, in general, the more you use
| the registers, the faster your code goes.  If your "careful" approach to 
| registers is not greedy, how much performance am I losing to it by not
| getting full use out of the hardware available?  Further: what's "an
| adequate supply of registers"?  I know code which can use up as many as
| you give me.  In fact, if interprocedural analysis were commonplace
| instead of rare, you would probably find that the register set was almost
| always completely full (this wouldn't matter since it would be a logical
| consequence of register allocation being done on the program instead of
| a routine at a time).
| 
| These problems may all be solved in the future - even the very near future.
| But at present, only MIPS and Cray (the only ones mentioned anyway) have
| addressed this problem.  And these two 'solutions' rely on the 'callee'
| not using lots of registers and the 'caller' deliberately leaving some
| spare ones - but this, in itself, may have a negative impact on performance.

Umm, the 88k also partitions the register set into caller save and
callee save.  For the two machines that I'm familiar with, the
breakdown is as follows:

MIPS:	32 integer 32-bit registers
		 1 register hardwired to 0
		 2 registers used for return value and/or static link
		 4 registers used for arguments
		10 registers not preserved across calls
		 9 registers preserved across calls
		 1 register for stack pointer
		 1 register for return address
		 1 register for accessing small static/global data
		 1 register used by the assembler
		 2 registers used by the kernel

	16 floating point 64-bit registers
		 2 registers for return value
		 2 registers for arguments
		 6 registers not preserved across calls
		 6 registers preserved across calls

88K:	32 32-bit registers (double precision takes 2 regs)
		 1 register hardwired to 0
		 8 registers for arguments & return value
		 4 registers not preserved across calls
		13 registers preserved across calls
		 4 registers reserved for linker/OS
		 1 register for the stack pointer
		 1 register for the return address

Note that registers used for passing arguments and returning values
are also used as temps.  If the return address is stored on the stack,
the register that holds it can also be used as a temporary.  Neither
architecture requires the use of a frame pointer, though a frame
pointer can be synthesized easily if needed (for example, when
variable-sized stack allocations are done).  Finally, on both machines
the software defines static tables that describe where registers are
stored on the stack, and which register and offset from that register
are to be used as a virtual frame pointer by the library and the
debugger.

The MIPS compilers also have a -O3 option which does global register
allocation.  Here is a fragment of the man page from a Decstation:

          -O3            Perform all optimizations, including global
                         register allocation.  This option must
                         precede all source file arguments.  With this
                         option, a ucode object file is created for
                         each C source file and left in a .u file.
                         The newly created ucode object files, the
                         ucode object files specified on the command
                         line, the runtime startup routine, and all
                         the runtime libraries are ucode linked.
                         Optimization is performed on the resulting
                         ucode linked file and then it is linked as
                         normal producing an a.out file. A resulting
                         .o file is not left from the ucode linked
                         result.  In fact -c cannot be specified with
                         -O3.

          -feedback file Use with the -cord option to specify the
                         feedback file.  This file is produced by with
                         its -feedback option from an execution of the
                         program produced by

          -cord          Run the procedure-rearranger on the resulting
                         file after linking.  The rearrangement is
                         performed to reduce the cache conflicts of
                         the program's text.  The output is left in
                         the file specified by the -o output option or
                         a.out by default.  At least one -feedback
                         file must be specified.
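
For what it's worth, an illustrative invocation (not taken from the man
page) would hand all the sources to cc at once, since -c cannot be
combined with -O3:

	cc -O3 -o prog main.c solver.c io.c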


Because of the restriction of not specifying -c, I'm not sure how many
people use this in practice for large software.  I would imagine that
for programs which use runtime binding (i.e., emacs, or C++ code
with virtual functions), it would default back to the standard calling
sequence.  I wonder how much it buys for typical software as opposed
to special cases.
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Catproof is an oxymoron, Childproof is nearly so

jlg@lambda.UUCP (Jim Giles) (03/17/90)

From article <29509@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
> [...]
> If you partition your registers into a set that is live across a
> function call and a set that is killed, then leaf routines can run
> entirely out of the later set, not having to save any registers.  For
> example, in the Am29000 calling-convention, a function can allocate up
> to 126 local registers which are saved across function calls and
> also has use of 24 "temporary" registers which are not.  Most leaf
> routines can run entirely out of these 24 temps.  The entire overhead
> is simply a call instruction and a return instruction (2 cycles if the
> delay slots are filled).

It seems to me that if a procedure is so small that it can only find
use for 24 (or fewer) registers, then it is small enough to be a real
good candidate for inlining.  Clearly, _any_ procedure can be written to
only _use_ 24 registers (there were some machines in the past with only
_one_ register - they were completely functional).  However, most procedures
of any size can benefit from quite a few more registers than 24.

Similarly, any 'caller' routine that can afford to leave 24 registers
unused is fairly small as well.  For efficiency, all 'live' values
should be kept in registers if at all possible.  The problem is that
any large program (and some small ones) have quite a large number of
'live' values.  Procedures that manipulate arrays may have whole
arrays (or array sections) that are 'live'.   If there's _room_ in
the register set for all these values, that's where they should be
kept.

The fact is that this kind of 'temp' register idea only _appears_
to save time.  You probably actually lose performance by forcing
many of your 'live' values to memory when they needn't have been.

J. Giles

jlg@lambda.UUCP (Jim Giles) (03/17/90)

From article <MEISSNER.90Mar16103226@curley.osf.org>, by meissner@osf.org (Michael Meissner):
> [...]
> The MIPS compilers also have a -O3 option which does global register
> allocation.  Here is an fragment of the man page from a Decstation:
> [...]

Great!  Now if this ability were widespread in the _real_world_ (or,
the rest of the real world - for those of you who consider MIPS to
be real :-), then there would be much less waste in the procedure
interface.  Of course, once you have this capability, you would also
want an 'inline' keyword on procedure declarations, etc..

J. Giles

seanf@sco.COM (Sean Fagan) (03/17/90)

In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
>Most machines implement call/return with single instructions.  However, this
>tends to be the tip of the iceberg for procedure call overhead.  The interface
>_MUST_ do the following:

No, it must not.

>1) Save/restore all register values that are still 'live' values for the
>   caller.

What about callee saves?

>2) Test the stack to make sure there is enough room for the callee's local
>   variables (and call the memory manager for a new stack frame if there
>   wasn't room).

	sub	esp, 1000000

seems to work just fine here.  True, you will get into problems if you run
out of VM, or your stack-pointer wraps around, but I don't know of anyone
who checks for those.

>3) Create (on the stack) a traceback entry so the debugger and/or the
>   postmortem analyzer can find the current thread through the call tree.

It need not be on the stack.  Ever hear of saving the return address in a
register?  You know, like Crays do?  You only have to save the retaddr when
you call another function.

>The problem is: _MOST_ of the procedure call
>overhead is concentrated in number (1)!  And, this overhead applies to all
>procedures - including 'leaf' routines.

True.  That's why register windows were considered a "good thing" (I'm not
sure I like them).  It's also one of the problems you have if you have
large, large amounts of registers (say, 256 anyone?).

>Because modern machines tend
>to have a _large_ number of registers, implementing (1) is usually quite
>expensive - and it can't be made cheaper without interprocedural analysis
>which in turn can't be done as long as separate compilation is possible.

Huh?  The MIPS compiler does interprocedural analysis, and I'm led to
believe that the Pyramid compilers do as well.  It's not *easy* or cheap, to
be sure, but it can be done.

-- 

-----------------+
Sean Eric Fagan  | "Time has little to do with infinity and jelly donuts."
seanf@sco.COM    |    -- Thomas Magnum (Tom Selleck), _Magnum, P.I._
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

djones@megatest.UUCP (Dave Jones) (03/17/90)

From article <29509@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
> 
> If you partition your registers into a set that is live across a
> function call and a set that is killed, then leaf routines can run
> entirely out of the later set, not having to save any registers.  For
> example, in the Am29000 calling-convention, a function can allocate up
> to 126 local registers which are saved across function calls and
> also has use of 24 "temporary" registers which are not.  ...

Some RISC machines have an internal register which acts something
like a frame-pointer with respect to the user-registers, yielding
what they call 'overlapping register windows'.  Thus user-registers need
not be saved and restored except when the window moves off one end of
the register set or the other.

Back in the 70's, TI's 990 and 99000 series chips had about the same thing,
but the "work-spaces", as they called the register-windows, were accessed
by way of the main-memory busses. (But that architecture had some bad
problems also: poor jbsr/rtn implementation and 16-bit addresses.)

What I would like to see is a register-set that acts as the cache for
a virtual stack in a machine with true stack machine instructions. That
would be my idea of a really good time, although it might put some graph
colorers out of business, having no registers to allocate. Does anybody
know if such a scheme was proposed for SPARC, and if so, why it was
rejected? Seems like such a win, and not all that big a step
beyond the register-window approach.

drc@cs.brown.edu (David R. Chase) (03/17/90)

In article <12350@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
>What I would like to see is a register-set that acts as the cache for
>a virtual stack in a machine with true stack machine instructions. That
>would be my idea of a really good time, although it might put some graph
>colorers out of business, having no registers to allocate.

Not at all.  No doubt the "hardware stack" will be of finite size, and
if that is exceeded then some of it must be saved to memory.  In this
situation, you'll win if stack frames aren't so large, and one way to
do that is to allocate stack slots using algorithms very much like
those used to allocate registers.  Spilling won't occur, of course,
since you can have as many "registers" as are needed, but they don't
come for free.  Use fewer and (ignoring scheduling problems caused by
spurious dependences on reused registers) do better.

Hard life, isn't it?

David
(really in Menlo Park, CA)

meissner@osf.osf.org (Michael Meissner) (03/18/90)

In article <14274@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
|From article <MEISSNER.90Mar16103226@curley.osf.org>, by
|meissner@osf.org (Michael Meissner): 
|> [...]
|> The MIPS compilers also have a -O3 option which does global register
|> allocation.  Here is an fragment of the man page from a Decstation:
|> [...]
|
|Great!  Now if this ability were widespread in the _real_world_ (or,
|the rest of the real world - for those of you who consider MIPS to
|be real :-), then there would be much less waste in the procedure
|interface.  Of course, once you have this capability, you would also
|want an 'inline' keyword on procedure declarations, etc..

The next revision of the MIPS compilers is supposed to have -O4, which
does automatic inlining as well.  It hasn't propagated through DEC
yet...
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Catproof is an oxymoron, Childproof is nearly so

tim@nucleus.amd.com (Tim Olson) (03/19/90)

In article <14273@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
| From article <29509@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
| > [...]
| > If you partition your registers into a set that is live across a
| > function call and a set that is killed, then leaf routines can run
| > entirely out of the later set, not having to save any registers.  For
| > example, in the Am29000 calling-convention, a function can allocate up
| > to 126 local registers which are saved across function calls and
| > also has use of 24 "temporary" registers which are not.  Most leaf
| > routines can run entirely out of these 24 temps.  The entire overhead
| > is simply a call instruction and a return instruction (2 cycles if the
| > delay slots are filled).
| 
| It seems to me that if a procedure is so small that it can only find
| use for 24 (or fewer) registers, then it is small enough to be a real
| good candidate for inlining.  Clearly, _any_ procedure can be written
| only _use_ 24 registers (there were some machines in the past with only
| _one_ register - the were completely functional).  However, most procedures
| of any size can benefit from quite a few more registers than 24.

It might be true that scientific routines written in FORTRAN may have
this many live, non-overlapping variables to keep in registers, but I
don't believe this is true in general.  Statistics from a large
collection of programs and library routines (a mix of general and
scientific applications written in C) show that of 782 functions (620
of which were non-leaf functions), an average of 6.5 registers per
function were live across function calls.

| Similarly, any 'caller' routine that can afford to leave 24 registers
| unused is fairly small as well.  For efficiency, all 'live' values
| should be kept in registers if at all possible.  The problem is that
| any large program (and some small ones) have quite a large number of
| 'live' values.  Procedures that manipulate arrays may have whole
| arrays (or array sections) that are 'live'.   If there's _room_ in
| the register set for all these values, that's where they should be
| kept.

Yes, this is true.  However, unless the array manipulation is quite
simple, it becomes tricky to use registers for them (the stats quoted
above were for scalar variables only).

| The fact is that this kind of 'temp' register idea only _appears_
| to save time.  You probably actually lose performance by forcing
| many of your 'live' values to memory when they needn't have been.

No live values were forced to memory (at least not ones in the active
function).  The Am29000 stack cache tries to keep live variables from
all functions in the register file as well as the variables of the
current function.  If the register file becomes full, live variables
from older functions are spilled to memory, then filled as they become
active again.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

tim@nucleus.amd.com (Tim Olson) (03/19/90)

In article <12350@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
| What I would like to see is a register-set that acts as the cache for
| a virtual stack in a machine with true stack machine instructions. That
| would be my idea of a really good time, although it might put some graph
| colorers out of business, having no registers to allocate. Does anybody
| know if such a scheme was proposed for SPARC, and if so, why it was
| rejected? Seems like such a win, and not all that big a step
| beyond the register-window approach.

This was the approach used in the AT&T CRISP processor.


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

albert@bsovax.UUCP (Albert van der Horst) (03/21/90)

In article <12350@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones)
writes (in a follow-up to Tim Olson):
>What I would like to see is a register-set that acts as the cache for
>a virtual stack in a machine with true stack machine instructions. That
>would be my idea of a really good time, although it might put some graph
>colorers out of business, having no registers to allocate. Does anybody
>know if such a scheme was proposed for SPARC, and if so, why it was
>rejected? Seems like such a win, and not all that big a step
>beyond the register-window approach.

This idea is not new.
It is about time it was used to provide an ideal machine
to write compilers for.

 There is a weird language that you may not know about. It is called
FORTH, and it is based on a virtual machine. This virtual machine is 
IMHO exactly right. 
  It has two stacks, one for the return addresses, and one for
the calculation; the latter is visible to the FORTH programmer.
The designer of FORTH, Charles Moore, has cast this idea into silicon.
This resulted in a RISC chip (NOVIX 4000) with some amazing properties.
For instance, a one-cycle call and, usually, a FREE return from a
subroutine.  It also allocates registers as needed.
  I was the president of a small group in the Netherlands that explored
this idea independently.  We got far enough to create a working emulator
(written in FORTH...) for our chip, baptized FIETS (Fast Implementation
of Enhanced Translators and Systems, by sheer accident also the Dutch
word for bicycle).  Then came Moore, and interest in the project faded.
Two of the group members gained hands-on experience with the NOVIX,
resulting in, for example, a program that generates a video signal by
toggling an output pin, showing a picture with the name "NOVIX".

  Despite some coverage by BYTE, the NOVIX did not become a success at
all.  In my opinion, a real mistake was the lack of a C compiler for it
- the FIETS contained 16 stack-based general-purpose registers,
which would have made such a C compiler viable.  The NOVIX had far fewer
registers, the result of too narrow a Forth-based scope.  Also,
Charles Moore still insists that programs needing more than a 16-bit
address range are design errors....  Marketing was directed towards the
in-crowd of FORTH programmers.  Probably very few of you would
have recognized the value of the chip from the BYTE article.  Even to a
FORTH-literate reader the article in BYTE was hard to digest.  The end
was: NOVIX went out of business.

   Fortunately, the basic ideas have been bought by Harris Semiconductor,
and they know better what makes a processor sing.  They have already
produced a few practically successful designs, and are still developing
them further.  These chips are optimized for real-time signal processing,
and have extremely fast context switches (switch stacks; there are no
registers to save).

I do hope we will arrive at the processor bottom line.
  The bottom line is this:
you have the choice between your code being inlined, or a single-cycle
call/return overhead.
All instructions (let's say almost all) do something useful, like an
addition, putting the result in a new register "on the fly" if required.
Or, on the other hand, adding an offset to a base address will drop the
offset from the stack automatically.

I think " the register caching via stacks" is much easier to implement than
the register window stuff. And as far as I can see, it is easier on the
compiler optimizer too. 
                              
                              
N.B. Those interested (Harris?) may mail me for the FIETS instruction set.
N.B. I am currently exploring transputers, which also have an "allocate
     registers as needed" mechanism, though without overflow handling and very shallow.
                              
                              
Views expressed are my own { not all FIETS members would agree.
                          { my boss does not know what I am talking about

Albert van der Horst               {eunet!mcvax!hp4nl!bsovax!albert}
P.O. 8052                                                                    
3053 RB   Utrecht            " It's pitch dark. You might be eaten by a grue" 
The netherlands              " I hope you got more light than from a lamp"

jlg@lambda.UUCP (Jim Giles) (03/21/90)

From article <29551@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
> [...]
> It might be true that scientific routines written in FORTRAN may have
> this many live, non-overlapping variables to keep in registers, but I
> don't believe this is true in general.  Statistics from a large
> collection of programs and library routines (a mix of general and
> scientific applications written in C) show that of 782 functions (620
> of which were non-leaf functions), an average of 6.5 registers per
> function were live across function calls.

This statistic can only be interpreted in one way: the C compiler in
question didn't allocate registers very well.  Especially in scientific
packages, there are _HUGE_ numbers of 'live' _VALUES_ to deal with during
execution of even simple routines.  Vectors, arrays, lists, strings, etc.,
are all being either produced or consumed.  The fact that none of these
_VALUES_ were in registers at the time of the call indicates one of two
things: 1) the code in question was fragmented to the point that most
procedures had only a few data items (and scalar at that), or 2) the
compiler simply wasn't using the registers to anywhere near their potential.
Since you imply the code was well written, I reject the first explanation.
That leaves the compiler.

My experience (I don't have statistics) with both Fortran and C is that
good compilers generally PACK the registers with as much live data as
possible.  Even an apparently pure scalar loop that does only 'simple'
operations may be 'unrolled' a few times to make better use of the
registers.  Compilers are becoming available that apply that optimization
automatically (so this isn't just a case which applies only to 'coder
enhanced' code).  If such a loop (and this is still the simple kind
mind you) had a procedure call imbedded in it, all the registers that
the procedure might use would have to be spilled to memory - and then
reloaded on return.  On the Cray, spilling and reloading just _ONE_
vector register is over 150 clocks (64 elements at one clock each
to and from memory plus time for the memory pipeline plus a little
overhead to set up stride, address, etc.).  This is _not_ a tiny
problem.

Again I say, this scheme of having 'preserved' vs. 'temp' registers
for procedure calls only _appears_ to save time.  In truth, it deprives
you of registers which could be put to use (at least if your compiler
was clever enough).  The only solution to the problem is to 'inline'
(or, at least, schedule registers globally).  At present, in most
environments, the only way to do this is by manually 'inlining' the
procedures.  Let's hope that more automatic solutions become generally
available in the next few years.

By the way, aside from a problem with aliasing with C, Fortran and C are
identical with respect to optimization, register allocation, etc..
So, your implied put-down of Fortran is not relevant in this context.

J. Giles

tim@nucleus.amd.com (Tim Olson) (03/22/90)

In article <14281@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
| From article <29551@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
| > [...]
| > It might be true that scientific routines written in FORTRAN may have
| > this many live, non-overlapping variables to keep in registers, but I
| > don't believe this is true in general.  Statistics from a large
| > collection of programs and library routines (a mix of general and
| > scientific applications written in C) show that of 782 functions (620
| > of which were non-leaf functions), an average of 6.5 registers per
| > function were live across function calls.
| 
| This statistic can only be interpreted in one way: the C compiler in
| question didn't allocate registers very well.  Especially in scientific
| packages, there are _HUGE_ numbers of 'live' _VALUES_ to deal with during
| execution of even simple routines.  Vectors, arrays, lists, strings, etc,
| are alle being either produced or consumed.  The fact that none of these
| _VALUES_ were in registers at the time of the call indicates one of two
| things: 1) the code in question was fragmented to the point that most
| procedures had only a few data items (and scalar at that), or 2) the
| compiler simply wasn't using the registers to anywhere near their potential.
| Since you imply the code was well written, I reject the first explanation.
| That leaves the compiler.

Do you have any statistics to back up your assertions?  The values I
quoted match what other people have found as well.  Again, you may be
only looking at a small portion of scientific code and missing what is
going on in general code.

The compiler that I used to collect the stats will keep all live
scalars in the register file, but doesn't do inlining of functions,
and doesn't try to operate on arrays in registers (how do you generate
code to deal with runtime address calculations???)

| My experience (I don't have statistics) with both Fortran and C is that
		 ^^^^^^^^^^^^^^^^^^^^^^^
Oh.


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

robison@m.cs.uiuc.edu (03/22/90)

[People wondering whether stack machines have been commercially built.]

Burroughs (now part of UNISYS) has been selling stack machines since
at least the 60's.  I think they even have a stack CPU on a chip now.

I remember a strong similarity between the early Burroughs stack
machines and the early versions of FORTH.  Can anyone comment on whether
FORTH was inspired by the Burroughs stack machines?

Arch D. Robison
University of Illinois at Urbana-Champaign

UUCP: {pur-ee,convex}!uiucdcs!robison
Internet: robison@CS.UIUC.EDU

jdudeck@polyslo.CalPoly.EDU (John R. Dudeck) (03/22/90)

In article <5200053@m.cs.uiuc.edu> robison@m.cs.uiuc.edu writes:
>I remember a strong similarity between the early Burroughs stack
>machines and the early versions of FORTH.  Can anyone comment on whether
>FORTH was inspired by the Burroughs stack machines?

I took a course in Forth from polyForth back in 1981.  We were given a
dose of the history of Forth and even graced by a visit from Charlie Moore,
its creator.  The first incarnation of Forth was done on a PDP11 with 4k
of core memory.  I believe the main considerations were 1. compactness of
code (hence threaded implementation) and 2. simplicity of compilation
(hence reverse polish notation and stacks).

-- 
John Dudeck                           "You want to read the code closely..." 
jdudeck@Polyslo.CalPoly.Edu             -- C. Staley, in OS course, teaching 
ESL: 62013975 Tel: 805-545-9549          Tanenbaum's MINIX operating system.

preston@titan.rice.edu (Preston Briggs) (03/23/90)

In article <29585@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <14281@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:

>| question didn't allocate registers very well.  Especially in scientific
>| packages, there are _HUGE_ numbers of 'live' _VALUES_ to deal with during
>| execution of even simple routines.  Vectors, arrays, lists, strings, etc.,
>| are all being either produced or consumed.  The fact that none of these

>The compiler that I used to collect the stats will keep all live
>scalars in the register file, but doesn't do inlining of functions,
>and doesn't try to operate on arrays in registers (how do you generate
>code to deal with runtime address calculations???)

Well, you don't keep the whole array in registers, and you don't keep
it there all the time.  There's a paper coming in this summer's
compiler construction conference (Callahan, Carr, and Kennedy) that
shows how to do it.  In the meantime, I can sort of illustrate
the ideas.  We'll use matrix multiply

	DO i = 1, n
	    DO j = 1, n
		A(i, j) = 0
		DO k = 1, n
		    A(i, j) = A(i, j) + B(j, k) * C(k, i)
		END
	    END
	END

Everybody knows the trick of keeping A(i, j) in a scalar (so it ends
up in a register).

	DO i = 1, n
	    DO j = 1, n
		aij = 0
		DO k = 1, n
		    aij = aij + B(j, k) * C(k, i)
		END
		A(i, j) = aij
	    END
	END

So, an easy transformation that keeps a (tiny) part of an array in a register.
We can do lots more.  Let's unroll the j loop (we'll assume n is divisible
by 2.  if it isn't, the transformation is still possible, but extra code
must be inserted).

	DO i = 1, n
	    DO j = 1, n, 2
		aij = 0
		DO k = 1, n
		    aij = aij + B(j, k) * C(k, i)
		END
		A(i, j) = aij
		aij1 = 0
		DO k = 1, n
		    aij1 = aij1 + B(j+1, k) * C(k, i)
		END
		A(i, j+1) = aij1
	    END
	END

Now let's jam the 2 copies of the inner (k) loop together.

	DO i = 1, n
	    DO j = 1, n, 2
		aij = 0
		aij1 = 0
		DO k = 1, n
		    aij  = aij  + B(j,   k) * C(k, i)
		    aij1 = aij1 + B(j+1, k) * C(k, i)
		END
		A(i, j) = aij
		A(i, j+1) = aij1
	    END
	END

and notice the re-use of C(k, i)

	DO i = 1, n
	    DO j = 1, n, 2
		aij = 0
		aij1 = 0
		DO k = 1, n
		    cik = C(k, i)
		    aij  = aij  + B(j,   k) * cik
		    aij1 = aij1 + B(j+1, k) * cik
		END
		A(i, j) = aij
		A(i, j+1) = aij1
	    END
	END

Now we're keeping more pieces of arrays in registers, and our code
is getting faster and faster, especially on machines with fast FP
and (relatively) slow memory.

In the original loop, we had 2n^3 flops, 3n^3 loads, and n^3 stores.
In the last loop, we're down to 2n^3 flops, (3n^3)/2 loads, and n^2 stores.
Notice the number of floating point ops stays the same; we're just reducing
the amount of memory traffic by getting reuse from our registers.
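
For the C folks, here is the same final loop transliterated (illustrative
only; n assumed even, and of course C is row-major where Fortran is
column-major, so it's the register-reuse point that carries over, not the
exact memory layout):

	#define N 64                     /* any even dimension */

	void matmul(double a[N][N], double b[N][N], double c[N][N])
	{
	    int i, j, k;

	    for (i = 0; i < N; i++) {
	        for (j = 0; j < N; j += 2) {
	            double aij = 0.0, aij1 = 0.0;   /* A(i,j), A(i,j+1) */
	            for (k = 0; k < N; k++) {
	                double cik = c[k][i];       /* loaded once, used twice */
	                aij  += b[j][k]   * cik;
	                aij1 += b[j+1][k] * cik;
	            }
	            a[i][j]   = aij;
	            a[i][j+1] = aij1;
	        }
	    }
	}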

Of course, we could do the same unroll-and-jam trick many more times
on both the i and j loops, and we'll get more and more register reuse
(larger and larger chunks of the arrays kept in registers).
On a machine like the MIPS, we can get a factor of 3 improvement
over your favorite C code.  On the AMD 29000, with its really large
register set, we ought to do even better.  Imagine unrolling-and-jamming
both the i and j loops (say) 6 times each.  We'd need 36 registers
to hold pieces of A, and 6 more to hold a strip of C, and 6 more to hold
a strip of B; but the code would sing/scream.

Of course, you need to do some work in your optimizer.
And you need code with loops and arrays (scientific code).
And it's hard to do for C.

(I still don't like fortran.)

--
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

jlg@lambda.UUCP (Jim Giles) (03/23/90)

From article <29585@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
> [...]
> The compiler that I used to collect the stats will keep all live
> scalars in the register file, [...]                    [^^^]

I don't believe a word of it.  The word I don't believe is 'all'.
_ANY_ scalar that is visible in the local scope and has a value
which might still be used by the program at some future point is 'live'.
(The same obviously applies to non-scalar values.)  That is the definition
of 'live'.  What you really mean is that the compiler breaks code into
'basic blocks' and keeps all active values from _that_ block in registers.
The use of 'basic blocks' and the lack of global dataflow analysis is
confusing you into thinking that only a few values are 'live' at a given
point.  Also, it is a considerable mistake not to regard non-scalar values
as important.

The bottom line is that registers are an important resource in any machine.
They are often more than 10 times faster than memory to reference.  Your
analysis indicates that your code (whether the compiler is responsible
or not) is using this resource to only a small fraction of its potential.
But instead of complaining, you are acting like it is a desirable way
for the code to behave.

> | My experience (I don't have statistics) with both Fortran and C is that
> 		 ^^^^^^^^^^^^^^^^^^^^^^^
> Oh.

Ok, now I have statistics.  I just did a quick survey of some expertly
crafted C code that I have source code to (it's actually part of the
X11 window package).  This is only a small program (it's xlock for those
interested in checking).  Still, there are over 50 scalar variables
declared global to the utility - these are _ALWAYS_ 'live'.  Further,
this code includes a lot of window related declarations (which also
contain a bunch of _ALWAYS_ 'live' global declarations).  This is
admittedly a small sample and it disregards local variables and non-scalars.
Still, since most of the program consists of small procedures which
manipulate these global variables, I suspect it would be a smaller, faster
code if the values always occupied registers.  xlock probably doesn't
need to be faster, but if speed were important, register usage would
be too.

J. Giles

peter@ficc.uu.net (Peter da Silva) (03/23/90)

(for comp.arch folks, Jim is basically saying that the more registers you
 have, the better, because it lets you put more variables in registers)

But, Jim, registers aren't free. The more registers you have the more bits
you have to hide in your carefully crafted instruction set to specify
registers. Modern RISC processors have up to 4 arguments to a given
instruction, so even if they're all registers (no mixed-mode instructions)
and you only have 16 registers, that's 16 bits gone right there.

Now, let's add some more bits to specify constants and maybe some addressing
modes, and you can run out of bits pretty quickly. With 64 registers (about
the most I'd expect you to be happy with) and immediate operands, that
would leave you with about 5 bits to specify the instruction. Now suppose
you have room on the chip for 512 registers... why *not* use register
windows, or a stack cache, or something similar instead of just making
the instruction wider?
-- 
 _--_|\  `-_-' Peter da Silva. +1 713 274 5180. <peter@ficc.uu.net>.
/      \  'U`
\_.--._/
      v

oconnordm@CRD.GE.COM (Dennis M. O'Connor) (03/23/90)

peter@ficc (Peter da Silva) writes:
] (for comp.arch folks, Jim is basically saying that the more registers you
]  have, the better, because it lets you put more variables in registers)
] 
] But, Jim, registers aren't free. The more registers you have the more bits
] you have to hide in your carefully crafted instruction set ...
] ... Now suppose you have room on the chip for 512 registers... why *not*
] use register windows, or a stack cache ... instead of just making the
] instruction wider?

There are of course other costs that come with lots of registers :

    Unexpected Context Switch times : goes up, since you have more
    CPU state to save. If you know the context switch is coming (like
    before a call to a kernel "WAIT_FOR_SIG" entry) you can decrease
    this penalty. On some kinds of interrupts, you may not be able to.

    Processor Speed : more registers means more loading on the
    internal processor busses, and it means these busses are longer.
    So you get more capacitance, and higher resistance in the lines.
    So your busses get slower. If the busses are not part of a critical
    path, you're okay. But eventually, as you add more registers,
    your busses will become part of a critical path.
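
The scaling sketch promised above for the context-switch cost; the
one-store-plus-one-load-per-register figure is a deliberately crude
simplification:

    #include <stdio.h>

    int main(void)
    {
        int regs;

        /* Assume one store to save and one load to restore each
         * architected register on an unexpected context switch.
         */
        for (regs = 16; regs <= 512; regs *= 2)
            printf("%3d registers: ~%4d memory ops per switch\n",
                   regs, 2 * regs);
        return 0;
    }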

The bus-loading effect was a major point of RISC (vs. CISC) design
philosophy : generally, you can't tack more stuff onto a circuit without
possibly slowing it down.  So you should be able to show a definite
performance improvement to justify a potential slowing of the clock rate.

Some big-register-set opponents ( big >> 32 ) argue that the supporters
of this idea haven't conclusively demonstrated any such improvement.
--
  Dennis O'Connor      OCONNORDM@CRD.GE.COM      UUNET!CRD.GE.COM!OCONNOR
  "Let's take a little off the top ... a bit off the sides ...
    trim the back a bit ... Surprise ! You've been bald-ed !"

tim@nucleus.amd.com (Tim Olson) (03/23/90)

In article <14285@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
| From article <29585@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
| > [...]
| > The compiler that I used to collect the stats will keep all live
| > scalars in the register file, [...]                    [^^^]
| 
| I don't believe a word of it.  The word I don't believe is 'all'.
| _ANY_ scalar that is visible in the local scope and has a value
| which might still be used by the program at some future point is 'live'.
| (The same obviously applies to non-scalar values.)  That is the definition
| of 'live'.

Sorry -- I should have clarified that it keeps all live scalar
variables *that don't have potential aliasing problems* in the
register file.  Since global variables are visible to other functions,
they have the potential for having their address taken in some other
compilation unit (remember, we are talking about C code), and thus
cannot be kept in registers -- unless "universal" register allocation
and optimization is performed.
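
A minimal illustration of the hazard (the file names and contents here are
made up):

    /* file1.c */
    int counter;                    /* global scalar                     */
    extern void bump(void);         /* lives in another compilation unit */

    int twice(void)
    {
        int a, b;

        a = counter;                /* the compiler cannot prove bump()  */
        bump();                     /* leaves 'counter' alone, so it     */
        b = counter;                /* must reload it from memory here   */
        return a + b;               /* instead of reusing a register     */
    }

    /* file2.c -- compiled separately, invisible to file1.c's compiler */
    extern int counter;
    void bump(void) { counter++; }  /* and indeed it does touch it */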

| What you really mean is that the compiler breaks code into
| 'basic blocks' and keeps all active values from _that_ block in registers.
| The use of 'basic blocks' and the lack of global dataflow analysis is
| confusing you into thinking that only a few values are 'live' at a given
| point.

No, the compiler is performing register allocation on an entire
function, not on a basic block level.  It is performing "global"
dataflow analysis (meaning on a per-function basis), but not
"universal" dataflow analysis (on the entire program, including all
other compilation units).
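
For what it's worth, here is a small made-up example of what "per-function"
means: 'limit' is live across several basic blocks, and whole-function
allocation can keep it in one register for all of them, where a
block-at-a-time allocator would reload it in each block:

    int clamp(int x, int limit)
    {
        if (x > limit)              /* basic block 1 */
            x = limit;              /* basic block 2 */
        else if (x < -limit)        /* basic block 3 */
            x = -limit;             /* basic block 4 */
        return x;                   /* basic block 5 */
    }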

| Also, it is a considerable mistake not to regard non-scalar values
| as important.

I never said they weren't important -- it is just that it is very hard
(sometimes impossible) to operate on structures and arrays that are
kept entirely in registers (runtime indexing, etc.).  If you mean that
*scalar elements* of an array that are used often should be kept in
the register file, then I agree, and we do that.
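
For example (a hand-written sketch of that transformation; note it assumes
a[] and out[] do not overlap, which is exactly the aliasing question again):

    void scale_by_element(double a[], double out[], int i, int n)
    {
        double ai = a[i];           /* loaded once; can live in a register */
        int j;

        for (j = 0; j < n; j++)
            out[j] = ai * out[j];   /* no reload of a[i] per iteration */
    }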

| The bottom line is that registers are an important resource in any machine.
| They are often more than 10 times faster than memory to reference.  Your
| analysis indicates that your code (whether the compiler is responsible
| or not) is using this resource to only a small fraction of its potential.
| But instead of complaining, you are acting like it is a desirable way
| for the code to behave.

No, I don't claim it is "desirable", just that a lot of
general-purpose C code tends to have a large number of small to
medium-sized functions which use < 20 local scalars.

| Ok, now I have statistics.  I just did a quick survey of some expertly
| crafted C code that I have source code to (it's actually part of the
| X11 window package).  This is only a small program (it's xlock for those
| interested in checking).  Still, there are over 50 scalar variables
| declared global to the utility - these are _ALWAYS_ 'live'.
	   ^^^^^^

OK -- now I see where you are getting your numbers and where we are
differing.  The stats I collected were for local ("automatic")
scalars, because global scalars can have aliasing problems unless we
perform "universal" dataflow analysis.  I agree that if this were
done, the number of variables live across a function call increase
dramatically.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

jlg@lambda.UUCP (Jim Giles) (03/24/90)

peter@ficc (Peter da Silva) writes:
] (for comp.arch folks, Jim is basically saying that the more registers you
]  have, the better, because it lets you put more variables in registers)
] 
] But, Jim, registers aren't free. The more registers you have the more bits
] you have to hide in your carefully crafted instruction set ...
] ... Now suppose you have room on the chip for 512 registers... why *not*
] use register windows, or a stack cache ... instead of just making the
] instruction wider?

Actually, what I've been saying is that registers are an important resource
(no matter how many or few you have) and that not using them to the full
extent possible is wasteful.

J. Giles

peter@ficc.uu.net (Peter da Silva) (03/24/90)

In article <14288@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
> Actually, what I've been saying is that registers are an important resource
> (no matter how many or few you have) and that not using them to the full
> extent possible is wasteful.

So why are you against register windows, which allow you to put more of this
important resource on a chip without slowing it down by making the register
address incredibly wide?
-- 
 _--_|\  `-_-' Peter da Silva. +1 713 274 5180. <peter@ficc.uu.net>.
/      \  'U`
\_.--._/
      v

pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) (03/26/90)

In article <14281@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:

   From article <29551@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
   > [...]
   > It might be true that scientific routines written in FORTRAN may have
   > this many live, non-overlapping variables to keep in registers, but I
   > don't believe this is true in general.  Statistics from a large
   > collection of programs and library routines (a mix of general and
   > scientific applications written in C) show that of 782 functions (620
   > of which were non-leaf functions), an average of 6.5 registers per
   > function were live across function calls.

   This statistic can only be interpreted in one way: the C compiler in
   question didn't allocate registers very well.  Especially in scientific
   packages, there are _HUGE_ numbers of 'live' _VALUES_ to deal with during
   execution of even simple routines.  Vectors, arrays, lists, strings, etc,
   are all being either produced or consumed.

This is an old fallacy: the number of useful registers is usually quite
low; the Wall paper and others say that for most codes, even floating
point intensive ones, 4-8-16 registers make do. The problem that Giles
does not seem to consider is that caching values in registers is only
useful if the values are going to be used repeatedly, like all forms of
caching. It is not difficult to produce examples of fairly common pieces
of code where on many machines register caching worsens performance.

Many registers are useful when:

1) Your so-called 'optimizer' does not select values to cache on
expected dynamic frequency of use but on static frequency of use. Since
the two are poorly correlated, your so-called 'optimizer' wants to cache
everything in sight.

2) You have extremely high latency to memory, and you want to use a
large register cache as a large cache, where even infrequently reused
values are insufferably expensive to refetch.

3) You have extremely high latency to memory, and you can prefetch
blocks of operands while other blocks of operands are being processed,
because you know which operands are going to be needed next, like with
vector machines.

4) You have multiple functional units, and each of them can then make
use of a set of registers.

Note that all these do not really mean that you need lots of registers;
1) means that your compiler is stupid, 2) that you are missing a proper
dynamic cache, and 3) and 4) that you actually have multiple threads of
control.

My aversion to large register caches and so-called clever optimizers
should be well known by now, and stems from my opinion that stupid
compilers are to be avoided, that highly optimizing compilers are accident
prone and easily made unnecessary by careful coding where it matters,
and that I am only interested in general purpose architectures. It is
always possible to design a specific architecture that isn't such...

In particular there are two uses for registers: one is as inputs to
functional units, and the other as cache. There are machines, especially
ones with stack orientedness and caches, that have only specialized
registers, i.e. each register is only there as input to a functional
unit. The Manchester mainframes were specialized register machines. CRISP
is in some sense such a machine as well.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

preston@titan.rice.edu (Preston Briggs) (03/26/90)

In article <PCG.90Mar25230051@rupert.cs.aber.ac.uk> pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) writes:

>This is an old fallacy: the number of useful registers is usually quite
>low; the Wall paper and others say that for most codes, even floating
>point intensive ones, 4-8-16 registers make do. 

Experience papers that say "n is enough registers" should be read
with the caveat "for my optimizer and my benchmarks."  More aggressive 
optimization will usually make profitable use of more registers.

>The problem that Giles
>does not seem to consider is that caching values in registers is only
>useful if the values are going to be used repeatedly, like all forms of
>caching. 

On many (most?) machines, 2 uses is enough to justify keeping a value
in register.
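
The toy arithmetic behind that claim; the per-reference costs are invented,
roughly echoing the 10x figure quoted earlier in this thread:

    #include <stdio.h>

    int main(void)
    {
        int mem = 10, reg = 1;      /* hypothetical cycles per reference */
        int k;

        for (k = 1; k <= 4; k++)
            printf("%d uses: %2d cycles saved by keeping it in a register\n",
                   k, (k - 1) * (mem - reg));
        return 0;
    }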

>It is not difficult to produce examples of fairly common pieces
>of code where on many machine register caching worsens performance.

Of course, we can also produce plenty of examples where many registers
are helpful.  But can you post an example of the kind you mean?  I spend
a lot of time thinking about this stuff, and hard cases would help.

>Many registers are useful when:
>1) Your so called 'optimizer' does not select values to cache on
>expected dynamic frequency of use but on static frequency of use.

Surely bad optimizers/register-allocators waste registers.
(Naturally, your reply will be "there's no such thing as a good optimizer.")

>2) You have extremely high latency to memory
>3) You have extremely high latency to memory, and you can prefetch
>blocks of operands while other blocks of operands are being processed,
>4) You have multiple functionals units

But how many chips does this describe?  Today, many; next year, many more.
The i860 has a fast cache, but it's small and a miss is about 25 cycles.
Many chips have load pipelines, where it's profitable to fetch several
cycles in advance.  And how many chipsets have asynchronous FP processors?
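
A sketch of the "fetch in advance" idea at the source level; a scheduler
normally does this itself, and the code assumes n >= 1:

    double sum_pipelined(const double *a, int n)
    {
        double sum = 0.0;
        double cur = a[0];          /* prime the pipeline                */
        int i;

        for (i = 0; i < n - 1; i++) {
            double next = a[i + 1]; /* start the next load early...      */
            sum += cur;             /* ...while adding a value already   */
            cur = next;             /*    sitting in a register          */
        }
        return sum + cur;
    }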

CPU's outrun memory.  They have been for years, and memory isn't catching up.
Hence the development of caches, multi-level caches, 
wide data busses, and large register sets.

>1) means that your compiler is stupid, 
Right, but I don't like stupid compilers either.

>2) that you are missing a proper dynamic cache
Cache isn't a cure-all.  It's finite and fairly simplistic.
Long line sizes and limited set associativity tend to restrict
its ability to replace registers.

>and 3) and 4) that you have actually multiple threads of control.
Or perhaps the compiler might find enough low-level parallelism
to keep your chip busy.

>My aversion to large register caches
You like "regular" caches but not registers caches...
Aren't registers just the top of the memory hierarchy?
They are rather more subject to control from software, but I'd
think that was a plus.  I'd rather see the systems extended so that
the other layers of the hierarchy were also under software control
(prefetches to cache, fetches around cache, prefetching pages of
virtual memory, and so forth).

--
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

jlg@lambda.UUCP (Jim Giles) (03/27/90)

From article <PCG.90Mar25230051@rupert.cs.aber.ac.uk>, by pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi):
> [...]                                           The problem that Giles
> does not seem to consider is that caching values in registers is only
> useful if the values are going to be used repeatedly, like all forms of
> caching. 

I don't see where I've been at fault here.  The _definition_ of 'live'
values is that they are to be used again.  The more of these that you
can place into registers the better.  The compiler should do enough
dataflow analysis to know which 'live' values are needed next and should
schedule registers to preload these (and hold them until other values are
of more immediate use).  Procedure calls interrupt this analysis (at least,
if you don't have interprocedural analysis).  This problem doesn't go away
(even with your well known masochistic micro-management style of programming)
without hand 'inlining' or automatic interprocedural analysis.  In fact,
hand 'inlining' is just the sort of micro-optimization that your style of
programming would recommend!  So, in this case, we should be in agreement.
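
Here's a made-up example of the interruption: 'sum' and 'scale' are live
across the call, so without interprocedural analysis (or inlining, by hand
or by the compiler) they may be spilled and reloaded around every call to
step():

    static double step(double x, double scale)
    {
        return x * scale + 1.0;
    }

    double accumulate(const double *a, int n, double scale)
    {
        double sum = 0.0;
        int i;

        for (i = 0; i < n; i++)
            sum += step(a[i], scale);   /* live across the call: sum,
                                           scale, a, i, and n         */
        return sum;
    }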

J. Giles

jlg@lambda.UUCP (Jim Giles) (03/27/90)

From article <YEG2TM1xds13@ficc.uu.net>, by peter@ficc.uu.net (Peter da Silva):
> [...]
> So why are you against register windows, which allow you to put more of this
> important resource on a chip without slowing it down by making the register
> address incredibly wide?

I didn't say I was against them.  However, they do make a considerable
number of the registers inaccessible to the programmer at any given point
in the program.  I prefer explicit caches.  That is, 'temp' registers which
can be loaded/stored from the general purpose regs and/or memory by explicit
instructions.  Cray temporary registers are examples of this.  Your register
addresses remain manageably small, you just have a few additional instructions.

J. Giles

pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) (03/29/90)

In article <6076@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:

   On many (most?) machines, 2 uses is enough to justify keeping a value
   in register.

Only if you look at the relative cost of refetching vs. keeping in the
register. Speeding up a program by 0.001 percent by caching values that
are going to be reused only a few dozen or even a few thousand times (when
a machine's speed is rated in millions of instructions per second) is
pointless, and registers do not come free. You usually want to allocate
a register only to values that are used for a significant fraction of
the program's memory accesses. The others simply don't matter.

Which fraction of the total memory accesses you want to consider large
enough to warrant register'ing a value is of course dependent on the
cost of reloading from memory; if you have an architecture or a language
that makes this very expensive, you will want to cache more values, but
then it is all your fault, sheesh.

   >It is not difficult to produce examples of fairly common pieces
   >of code where on many machine register caching worsens performance.

   Of course, we can produce plenty of examples where many registers are
   helpful.

Yes, but only for very special codes or architectures.

   Further, can you post an example of some sort?

The obvious example of such an oddity is on traditional general purpose
architectures and argument passing; it may be worse than pointless to
pop an argument from the stack. That's why they have stack offset
address modes...

   CPU's outrun memory.  They have been for years, and memory
   isn't catching up. Hence the development of caches, multi-level caches, 
   wide data busses, and large register sets.

Von Neumann's bottleneck? Who said bottleneck? The guy from Thinking
Technologies down the hall? :->

   >My aversion to large register caches
   You like "regular" caches but not registers caches...

But regular (pseudo) associative caches are far less "stiff" than
statically managed ones. On general purpose architectures and codes, of
course. If you can otherwise predict access patterns, "registers" win
(can I say "vector codes"?).

   Aren't registers just the top of the memory hierarchy?

Yes and no. They are the interface to functional units, and/or caches.
The current mania for general purpose registers seems to obscure this.
There are machines that only have specialized registers, i.e. one per
functional unit (X index registers, Y integer accumulators, Z floating
registers, where X,Y,Z are the numbers of the respective functional
units), and a cache, usually organized as a stack, on top of each of
these.

Note that such a machine has a number of interesting advantages from the
language implementor point of view; for example, to return to the issue
from which our thread started, procedure calls, and funargs, and
coroutines, and any form of context switch, become much easier/faster.

My own dream machine is a machine with multiple cached stacks, each
stack served by a functional unit. Some hints are available that in
typical code you only have four overlappable threads of computation,
because you often find that you only need four spills to compute most
expressions. Well, a machine with four accumulators and cached stacks
behind them is my idea of a high-code-density superscalar.

My favourite examples are CRISP and NOVIX.

   They are rather more subject to control from software, but I'd
   think that was a plus.

If you know access patterns in advance...

   I'd rather see the systems extended so that
   the other layers of the hierarchy were also under software control
   (prefetches to cache, fetches around cache, prefetching pages of
   virtual memory, and so forth).

There is an uncertain tradeoff between generality and reliability and
complexity of hardware vs. software here. A particularly difficult
issue. My own rule is that complexity that is not supported by very big
sure advantages is to be regarded with the utmost suspicion, whether it
manifests itself as number of instructions, address modes, registers,
optimizers, and so on...
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

preston@titan.rice.edu (Preston Briggs) (03/30/90)

pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) writes
many things.  Among them, he responds to my comment that
>   CPU's outrun memory.  They have been for years, and memory
>   isn't catching up. Hence the development of caches, multi-level caches, 
>   wide data busses, and large register sets.
>
>Von Neumann's bottleneck? Who said bottleneck? The guy from Thinking
>Technologies down the hall? :->

Nobody I know knows about Thinking Technologies.  Who are they?
My initial guess was that you meant Danny Hillis at Thinking Machines.
A smart man, but his machines are a little out of my league.

Further, naming the problem doesn't make it go away.
The reality is that memory costs money and that fast memory costs
more money.  Using multiple levels in your hierarchy is an engineering
tradeoff that works.

>On general purpose architectures and codes, of
>course. If you can otherwise predict access patterns, "registers" win
>(can I say "vector codes"?).

Sure.  But the predictability and reuse arise from loops
and arrays, things that appear in many time-consuming chunks of code.
Not every program fits, but many do, including programs written
in Fortran, C, Pascal, Ada, ...

>There is an uncertain tradeoff between generality and reliability and
>complexity of hardware vs. software here. A particularly difficult
>issue. My own rule is that complexity that is not supported by very big
>sure advantages is to be regarded with the utmost suspicion, whether it
>manifests itself as number of instructions, address modes, registers,
>optimizers, and so on...

I have sympathy with this position.  I think most people outside the 
government do.  However, I also recognize that some of these 
complexities, mastered, achieve performance.

In particular, I think optimization is a Good Thing.  
It allows me to specify my programs in less time and with less complexity.  
It allows me to offload tedious, error-prone tasks onto the computer.

To make an optimizer effective though, we need a large supply of registers.
Without an adequate register set, our simplifying separation of concerns
(front end vs. optimization vs. instruction selection vs. register allocation)
breaks down.  Perhaps this doesn't appear simple!  It isn't entirely, 
and hasn't always been recognized or possible.  I think it's one of the
under-emphasized results of the whole 801/PL.8 project.

Hoare, in "Hints on Programming Language Design", wanted a language

	in which a simple straightforward "non-pessimising"
	compiler will produce straightforward object programs
	of acceptable compactness and efficiency...

I think this is a tremendous suggestion.  Further, I believe it's
what Grandi has been pressing for.  However, leaving out the
optimization doesn't achieve the desired result.  My optimizer
(not the world's best or worst) produces remarkably straightforward
code.  I think it's what most people would want to see out of their
compiler.  (I'm not including the aggressive source-to-source transformations
I've talked about before.)

I think it's just more difficult to produce "straightforward object programs
of acceptable compactness and efficiency" than most people recognize.
I believe it requires a fair amount of analysis and transformation
(that is, optimization and register allocation).
--
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu