henry@utzoo.uucp (Henry Spencer) (03/15/90)
In article <2000@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>> I was under the impression that C's function calls are, relative to other
>> high level languages, very inexpensive...
>
>Simple, yes.  Inexpensive, no.  I can conceive of a way in which hardware
>can be constructed which would make subroutine calls inexpensive, but I
>know of no machine which has it.

Many people can conceive of such ways, and a good many of them have been
implemented, some quite successfully.  On the Mips machines, for example,
a C function call to a "leaf" function (one which calls no others) is
one instruction for the call and one for the return, total elapsed time
a small fraction of a microsecond.  Non-leaf functions have to do a little
bit more work at times.

However, as Steve Johnson said at the winter 1987 Usenix conference,
talking about the early days:  ``Dennis had a lot of credibility and lied
very convincingly...  it was years before we found out the truth.''  On the
early Unix machines (pdp11 and VAX), function calls were not particularly
cheap (although it is true that C's calls are cheaper than those of many
other languages).  Machine designers have paid a lot more attention to the
matter in recent years, and the situation has improved greatly.

>A subroutine necessarily must find its arguments (or pointers to them, etc.)
>in standardized places.  It also must put its results in such places, or use
>pointers to know where to put them.  Occasionally one can put the arguments
>in those places or use the results that way, but not always...

Not always, no, but a lot of the time.  Many modern machines are organized
so that most function parameters simply get evaluated into a register
(usually necessary anyway) and left there for the callee to find.  There is
sometimes a small cost of having to save the previous value of that
register, depending on circumstances and architecture.

>...necessarily a context switch to some extent...
It can be a very small extent on a well-designed machine.  There *is* a
fundamental problem of instruction reference locality, which necessarily
slows things down a bit on machines with hardware optimized for sequential
fetches.
-- 
MSDOS, abbrev:  Maybe SomeDay   |     Henry Spencer at U of Toronto Zoology
an Operating System.            | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
jlg@lambda.UUCP (Jim Giles) (03/15/90)
From article <1990Mar14.172152.25436@utzoo.uucp>, by henry@utzoo.uucp (Henry Spencer):
> [... efficient procedure call interface ...]
> Many people can conceive of such ways, and a good many of them have been
> implemented, some quite successfully.  On the Mips machines, for example,
> a C function call to a "leaf" function (one which calls no others) is
> one instruction for the call and one for the return, total elapsed time
> a small fraction of a microsecond.  Non-leaf functions have to do a little
> bit more work at times.

Most machines implement call/return with single instructions.  However,
this tends to be the tip of the iceberg for procedure call overhead.  The
interface _MUST_ do the following:

1) Save/restore all register values that are still 'live' values for the
   caller.

2) Test the stack to make sure there is enough room for the callee's local
   variables (and call the memory manager for a new stack frame if there
   wasn't room).

3) Create (on the stack) a traceback entry so the debugger and/or the
   postmortem analyzer can find the current thread through the call tree.

A 'leaf' routine _might_ get away without doing number (3) by setting some
'leaf' flag.  A virtual memory machine might simplify number (2) to just a
test/fatal_error combination since stack overflow might be considered too
rare to justify much effort.  The problem is:  _MOST_ of the procedure call
overhead is concentrated in number (1)!  And, this overhead applies to all
procedures - including 'leaf' routines.  Basically, 'leafness' has very
little to do with procedure call overhead.  Because modern machines tend
to have a _large_ number of registers, implementing (1) is usually quite
expensive - and it can't be made cheaper without interprocedural analysis
which in turn can't be done as long as separate compilation is possible.

Oh, well---
J. Giles
chris@mimsy.umd.edu (Chris Torek) (03/15/90)
In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
>Most machines implement call/return with single instructions.  However, this
>tends to be the tip of the iceberg for procedure call overhead.  The interface
>_MUST_ do the following:
>
>1) Save/restore all register values that are still 'live' values for the
>   caller.

Right.

>2) Test the stack to make sure there is enough room for the callee's local
>   variables (and call the memory manager for a new stack frame if there
>   wasn't room).

True, but (as Jim Giles points out) VM machines can cheat.  (He does not
say how to cheat, but it is very easy:  do not test.  Simply try to use
the new stack address(es).  If they are invalid, the kernel takes an
access fault and decides whether to grow the stack or to terminate the
program.)

>3) Create (on the stack) a traceback entry so the debugger and/or the
>   postmortem analyzer can find the current thread through the call tree.

There is a way to cheat on this as well; MIPS use it in their compilers.
When a function does not need a frame pointer---most do not---it gets
instead a `virtual frame pointer' which is simply an offset from the
stack pointer.  The debugger reads the symbol table of the object file to
find out which register is the stack pointer, and what the offset from
this register is, so as to find the previous call (and any saved
registers, using a compiler-generated field also found in the object
file).

>The problem is:  _MOST_ of the procedure call overhead is concentrated in
>number (1)!  And, this overhead applies to all procedures - including
>'leaf' routines.

True; and this is where most of the effort at reducing function call
overhead must be applied.  Fortunately, some smart guys have figured out
some good methods to handle this.  On the MIPS, for instance, the system
designates some registers as `caller saves' and some registers as
`callee saves'.
A leaf procedure heads straight for the `caller saves' registers; others
head for the `callee saves' registers.  In other words, leaf routines can
(provided they do not need a huge number of registers) get away with
saving and restoring nothing.  (They need not even save and restore the
stack pointer, if a `callee saves' register is free:  simply do something
like

	add	#-40,standard_sp,temp_reg_3	// set temp_reg to sp-40

and then use temp_reg_3 as your `stack pointer'.)

>Basically, 'leafness' has very little to do with procedure call overhead.

As shown above, this is no longer true.  If the leaf uses a `large'
number of registers (more than are available as temporary computation
registers in non-leaf routines), this statement holds; if not, the fact
that the routine is a leaf makes the registers `free'.

(Of course, callers that use lots of registers, and store things in the
temporary registers, must spill those registers on subroutine calls.
This may be what Jim Giles meant all along.  Perhaps someone at MIPS can
post statistics as to how often this is the case.)

>Because modern machines tend to have a _large_ number of registers,
>implementing (1) is usually quite expensive - and it can't be made
>cheaper without interprocedural analysis which in turn can't be done as
>long as separate compilation is possible.

The only problem with this last statement (`interprocedural analysis
cannot be done due to separate compilation') is that someone already
does it---namely, MIPS do it at the highest compilation level.  Again,
one does it by cheating:  `compile' to `Ucode', an intermediate level
code that is neither source nor object.  When `linking' the Ucode, put
everything together, do global flow analysis, optimise, and then
generate machine code.

(Of course, linking starts to get very slow, encouraging people to run
with lower optimisation levels.  This is more or less as it should be,
though:  you choose the trade-off of compilation time for run time.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris
tim@nucleus.amd.com (Tim Olson) (03/16/90)
In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
| Most machines implement call/return with single instructions.  However, this
| tends to be the tip of the iceberg for procedure call overhead.  The interface
| _MUST_ do the following:
|
| 1) Save/restore all register values that are still 'live' values for the
|    caller.
|
| 2) Test the stack to make sure there is enough room for the callee's local
|    variables (and call the memory manager for a new stack frame if there
|    wasn't room).
|
| 3) Create (on the stack) a traceback entry so the debugger and/or the
|    postmortem analyzer can find the current thread through the call tree.
|
| The problem is:  _MOST_ of the procedure call
| overhead is concentrated in number (1)!  And, this overhead applies to all
| procedures - including 'leaf' routines.  Basically, 'leafness' has very
| little to do with procedure call overhead.

If you partition your registers into a set that is live across a function
call and a set that is killed, then leaf routines can run entirely out of
the latter set, not having to save any registers.  For example, in the
Am29000 calling convention, a function can allocate up to 126 local
registers which are saved across function calls and also has use of 24
"temporary" registers which are not.  Most leaf routines can run entirely
out of these 24 temps.  The entire overhead is simply a call instruction
and a return instruction (2 cycles if the delay slots are filled).
-- 
Tim Olson
Advanced Micro Devices
(tim@amd.com)
henry@utzoo.uucp (Henry Spencer) (03/16/90)
In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
>> ... On the Mips machines, for example,
>> a C function call to a "leaf" function (one which calls no others) is
>> one instruction for the call and one for the return, total elapsed time
>> a small fraction of a microsecond.  Non-leaf functions have to do a little
>> bit more work at times.
>
>Most machines implement call/return with single instructions.  However, this
>tends to be the tip of the iceberg for procedure call overhead.  The interface
>_MUST_ do the following:
>
>1) Save/restore all register values that are still 'live' values for the
>   caller.  ...  The problem is:  _MOST_ of the procedure call
>overhead is concentrated in number (1)!  And, this overhead applies to all
>procedures - including 'leaf' routines.  Basically, 'leafness' has very
>little to do with procedure call overhead...

Um, do you want to go back and re-think that statement, Jim?  Leafness has
a lot to do with it, because a leaf procedure doesn't have to keep things
neat in case it calls somebody.  In particular, if you assign some of your
registers to be scratch registers, normally unused at call time, then a
leaf procedure just uses them for all its internal work and has to save
and restore *nothing* except in very unusual cases.  This is exactly what
the Mips compilers do.  When I said that a call to a leaf function on a
Mips machine was one instruction each way, that's *one* instruction,
total, end to end, no further overhead involved.  And those are RISC
instructions, normally one cycle apiece.

The correct statement is that the callee must save/restore all register
values that are still live for the caller *and that the callee is going
to overwrite*.  Careful register-allocation conventions (with an adequate
supply of registers) can minimize that number, usually to zero in the case
of leaf functions.  Most of the other overheads you mention can likewise
be minimized or eliminated with some thought.
Stack-overflow probing is needed only on systems without MMUs, or with
defective MMUs; even the pdp11, the original C machine, did without it.
Traceback overhead for debuggers can be reduced considerably with better
cooperation between compiler and debugger, and a modern compiler at a
high level of optimization generally produces stuff that a debugger can't
handle anyway, so it can jettison that entirely.  Interprocedural
optimization in the presence of separate compilation is not merely
possible, it's commercially available:  -O3 on a Mips compiler.

Much of the "common wisdom" on the subject of call overhead simply isn't
true if you are willing to work hard at falsifying it.  (Actually, the
same is true of many other performance "constraints"...)
-- 
MSDOS, abbrev:  Maybe SomeDay   |     Henry Spencer at U of Toronto Zoology
an Operating System.            | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
preston@titan.rice.edu (Preston Briggs) (03/16/90)
In article <1990Mar15.173408.29622@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
>>1) Save/restore all register values that are still 'live' values for the
>>   caller.  ...  The problem is:  _MOST_ of the procedure call
>>overhead is concentrated in number (1)!
>
>The correct statement is that the callee must save/restore all register
>values that are still live for the caller *and that the callee is going
>to overwrite*.  Careful register-allocation conventions (with an adequate
>supply of registers) can minimize that number, usually to zero in the case
>of leaf functions.

And of course there are register windows of various kinds.  They don't
solve the register allocation problem, but they do help reduce call
overhead, at some expense to task switching overhead (but we know which
is more common).

On the other hand, Wall argues that link-time register allocation can
surpass register-window performance.  See his paper

	Register Windows vs. Register Allocation
	D. W. Wall
	Proceedings of the SIGPLAN '88 Conference on
	Programming Language Design and Implementation

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu
jamiller@hpcupt1.HP.COM (Jim Miller) (03/16/90)
>
>Most machines implement call/return with single instructions.  However, this
>tends to be the tip of the iceberg for procedure call overhead.  The interface
>_MUST_ do the following:
>
>1) Save/restore all register values that are still 'live' values for the
>   caller.
...
>procedures - including 'leaf' routines.  Basically, 'leafness' has very
>little to do with procedure call overhead.  Because modern machines tend
>to have a _large_ number of registers, implementing (1) is usually quite
>expensive - and it can't be made cheaper without interprocedural analysis
>which in turn can't be done as long as separate compilation is possible.
>Oh, well---
>
>J. Giles

Not completely true.  The calling convention can have some registers
"caller save" and others "callee save", so the caller only saves those
registers that the callee is allowed to overwrite without restoring them
on return.  So if the leaf only needs the argument registers and one or
two more, the callee doesn't have to save anything, and a good compiler
doesn't have to save much (if anything) in the caller.

The HP RISC machine does this, with some success.  Mileage will vary
with machine and application.

jim miller
jlg@lambda.UUCP (Jim Giles) (03/16/90)
From article <23113@mimsy.umd.edu>, by chris@mimsy.umd.edu (Chris Torek):
> [... 'caller-save' vs. 'callee-save' registers ...]
> As shown above, this is no longer true.  If the leaf uses a `large'
> number of registers (more than are available as temporary computation
> registers in non-leaf routines), this statement holds; if not, the
> fact that the routine is a leaf makes the registers `free'.
>
> (Of course, callers that use lots of registers, and store things in
> the temporary registers, must spill those registers on subroutine
> calls.  This may be what Jim Giles meant all along.  Perhaps someone
> at MIPS can post statistics as to how often this is the case.)

This is exactly what I meant.  The Cray system has a similar mechanism
(and, in fact, they even have special types of 'leaf' procedures called
'baselevel' routines).  The problem is that the 'caller' routine still
needs to save 'live' values around calls because the registers assigned
to the 'callee' are nearly always in use.  When I write in assembly, I
tend to use all the registers I can in order to avoid the memory
overhead - memory costs about a dozen clocks per reference while
transfers to the temp regs only cost one.  Even with memory pipelined
and running in parallel with other functional units, this extra delay is
expensive.  If I were writing a compiler, I would be similarly greedy
with the registers for generated code.

All this trouble could be avoided if the register use of the 'callee'
were known in advance.  Then the code generator for the 'caller' could
do register scheduling with this extra information in mind.  Still
causes problems if the 'callee' uses a _lot_ of registers, but it's
better than nothing.  Of course, the best deal (if speed were paramount)
would be to 'inline' the 'callee' completely.  Then the register
scheduling would take place across the call boundary (and the
save/restore could be hidden better under pipelining).

> [...]
> The only problem with this last statement (`interprocedural analysis
> cannot be done due to separate compilation') is that someone already
> does it---namely, MIPS do it at the highest compilation level.  Again,
> one does it by cheating:  `compile' to `Ucode', an intermediate level
> code that is neither source nor object.  When `linking' the Ucode,
> put everything together, do global flow analysis, optimise, and then
> generate machine code.

I've often thought that code generation should be done by the loader for
this very reason.  Both inlining and register scheduling across calls
would be improvements that would be worth the loader slowdown.  In
addition, the compile step would be considerably faster.  This means
that syntax checking would be a breeze (a common use of the compiler,
like it or not, is as a form of 'lint' - at least for non-C languages).

J. Giles
jlg@lambda.UUCP (Jim Giles) (03/16/90)
From article <1990Mar15.173408.29622@utzoo.uucp>, by henry@utzoo.uucp (Henry Spencer):
> In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
> [...]
>>1) Save/restore all register values that are still 'live' values for the
>>   caller.  ...  The problem is:  _MOST_ of the procedure call
>>overhead is concentrated in number (1)!  And, this overhead applies to all
>>procedures - including 'leaf' routines.  Basically, 'leafness' has very
>>little to do with procedure call overhead...
>
> Um, do you want to go back and re-think that statement, Jim?  Leafness has
> a lot to do with it, because a leaf procedure doesn't have to keep things
> neat in case it calls somebody.  In particular, if you assign some of your
> registers to be scratch registers, normally unused at call time, then a
> leaf procedure just uses them for all its internal work and has to save
> and restore *nothing* except in very unusual cases.  This is exactly what
> the Mips compilers do.  [...]

This is also what the Cray compilers do for 'baselevel' routines.  The
problem arises with your assumption that there are such things as
"scratch registers, normally unused at call time."  If the register
scheduling algorithm in your compiler is sufficiently greedy, you will
almost always have some save/restores to do.  And note:  it is almost
always a good thing for your register scheduler to be as greedy as
possible.  In fact, the only exception is the procedure call overhead.
So, I _HAVE_ rethought my first statement - and I still stand by it!

> [...]
> The correct statement is that the callee must save/restore all register
> values that are still live for the caller *and that the callee is going
> to overwrite*.  [...]

Quite true.  But, in the context of separate compilation, how does the
compiler _KNOW_ which registers are to be used by the 'callee'?  Or, when
compiling the 'callee', how does the compiler know which values are still
'live' for the 'caller'?
Now, I'm all for the idea of interprocedural analysis.  I'm even a
supporter of having the loader do the code generation for this very
reason.  But right now, the procedure interface is still one of the most
costly parts of any program.

> [...]  Careful register-allocation conventions (with an adequate
> supply of registers) can minimize that number, usually to zero in the case
> of leaf functions.

Careful register-allocation conventions are usually the ones that use the
registers most greedily.  This is because, in general, the more you use
the registers, the faster your code goes.  If your "careful" approach to
registers is not greedy, how much performance am I losing to it by not
getting full use out of the hardware available?  Further:  what's "an
adequate supply of registers"?  I know code which can use up as many as
you give it.  In fact, if interprocedural analysis were commonplace
instead of rare, you would probably find that the register set was almost
always completely full (this wouldn't matter since it would be a logical
consequence of register allocation being done on the whole program
instead of a routine at a time).

These problems may all be solved in the future - even the very near
future.  But at present, only MIPS and Cray (the only ones mentioned
anyway) have addressed this problem.  And these two 'solutions' rely on
the 'callee' not using lots of registers and the 'caller' deliberately
leaving some spare ones - but this, in itself, may have a negative impact
on performance.

J. Giles
meissner@osf.org (Michael Meissner) (03/16/90)
In article <14272@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
| Careful register-allocation conventions are usually the ones that use the
| registers most greedily.  This is because, in general, the more you use
| the registers, the faster your code goes.  If your "careful" approach to
| registers is not greedy, how much performance am I losing to it by not
| getting full use out of the hardware available?  Further:  what's "an
| adequate supply of registers"?  I know code which can use up as many as
| you give it.  In fact, if interprocedural analysis were commonplace
| instead of rare, you would probably find that the register set was almost
| always completely full (this wouldn't matter since it would be a logical
| consequence of register allocation being done on the program instead of
| a routine at a time).
|
| These problems may all be solved in the future - even the very near future.
| But at present, only MIPS and Cray (the only ones mentioned anyway) have
| addressed this problem.  And these two 'solutions' rely on the 'callee'
| not using lots of registers and the 'caller' deliberately leaving some
| spare ones - but this, in itself, may have a negative impact on performance.

Umm, the 88k also partitions the register set into caller save and
callee save.
For the two machines that I'm familiar with, the breakdown is as follows:

	MIPS:	32 integer 32-bit registers
		 1 register hardwired to 0
		 2 registers used for return value and/or static link
		 4 registers used for arguments
		10 registers not preserved across calls
		 9 registers preserved across calls
		 1 register for stack pointer
		 1 register for return address
		 1 register for accessing small static/global data
		 1 register used by the assembler
		 2 registers used by the kernel

		16 floating point 64-bit registers
		 2 registers for return value
		 2 registers for arguments
		 6 registers not preserved across calls
		 6 registers preserved across calls

	88K:	32 32-bit registers (double precision takes 2 regs)
		 1 register hardwired to 0
		 8 registers for arguments & return value
		 4 registers not preserved across calls
		13 registers preserved across calls
		 4 registers reserved for linker/OS
		 1 register for the stack pointer
		 1 register for the return address

Note that registers used for passing arguments and returning values are
also used as temps.  If the return address is stored on the stack, the
register that holds it can also be used for a temporary.  Neither
architecture requires the use of a frame pointer, though a frame pointer
can be synthesized easily if needed (because variable-sized stack
allocations are done).  Finally, on both machines, software-defined
static tables describe where registers are stored on the stack, and what
register and offset from that register are to be used as a virtual frame
pointer for use in the library and in the debugger.

The MIPS compilers also have a -O3 option which does global register
allocation.  Here is a fragment of the man page from a Decstation:

	-O3	Perform all optimizations, including global register
		allocation.  This option must precede all source file
		arguments.  With this option, a ucode object file is
		created for each C source file and left in a .u file.
		The newly created ucode object files, the ucode object
		files specified on the command line, the runtime startup
		routine, and all the runtime libraries are ucode linked.
		Optimization is performed on the resulting ucode linked
		file and then it is linked as normal producing an a.out
		file.  A resulting .o file is not left from the ucode
		linked result.  In fact -c cannot be specified with -O3.

	-feedback file
		Use with the -cord option to specify the feedback file.
		This file is produced by with its -feedback option from
		an execution of the program produced by

	-cord	Run the procedure-rearranger on the resulting file after
		linking.  The rearrangement is performed to reduce the
		cache conflicts of the program's text.  The output is
		left in the file specified by the -o output option or
		a.out by default.  At least one -feedback file must be
		specified.

Because of the restriction of not specifying -c, I'm not sure how many
people use this in practice for large software.  I would imagine that
for programs which use runtime binding (i.e., emacs, or C++ code with
virtual functions), it would default back to the standard calling
sequence.  I wonder how much it buys for typical software as opposed to
special cases.
-- 
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Catproof is an oxymoron, Childproof is nearly so
jlg@lambda.UUCP (Jim Giles) (03/17/90)
From article <29509@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
> [...]
> If you partition your registers into a set that is live across a
> function call and a set that is killed, then leaf routines can run
> entirely out of the latter set, not having to save any registers.  For
> example, in the Am29000 calling convention, a function can allocate up
> to 126 local registers which are saved across function calls and
> also has use of 24 "temporary" registers which are not.  Most leaf
> routines can run entirely out of these 24 temps.  The entire overhead
> is simply a call instruction and a return instruction (2 cycles if the
> delay slots are filled).

It seems to me that if a procedure is so small that it can only find use
for 24 (or fewer) registers, then it is small enough to be a real good
candidate for inlining.  Clearly, _any_ procedure can be written to _use_
only 24 registers (there were some machines in the past with only _one_
register - they were completely functional).  However, most procedures of
any size can benefit from quite a few more registers than 24.  Similarly,
any 'caller' routine that can afford to leave 24 registers unused is
fairly small as well.

For efficiency, all 'live' values should be kept in registers if at all
possible.  The problem is that any large program (and some small ones)
have quite a large number of 'live' values.  Procedures that manipulate
arrays may have whole arrays (or array sections) that are 'live'.  If
there's _room_ in the register set for all these values, that's where
they should be kept.  The fact is that this kind of 'temp' register idea
only _appears_ to save time.  You probably actually lose performance by
forcing many of your 'live' values to memory when they needn't have been.

J. Giles
jlg@lambda.UUCP (Jim Giles) (03/17/90)
From article <MEISSNER.90Mar16103226@curley.osf.org>, by meissner@osf.org (Michael Meissner):
> [...]
> The MIPS compilers also have a -O3 option which does global register
> allocation.  Here is a fragment of the man page from a Decstation:
> [...]

Great!  Now if this ability were widespread in the _real_world_ (or, the
rest of the real world - for those of you who consider MIPS to be real
:-), then there would be much less waste in the procedure interface.  Of
course, once you have this capability, you would also want an 'inline'
keyword on procedure declarations, etc..

J. Giles
seanf@sco.COM (Sean Fagan) (03/17/90)
In article <14268@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
>Most machines implement call/return with single instructions.  However, this
>tends to be the tip of the iceberg for procedure call overhead.  The interface
>_MUST_ do the following:

No, it must not.

>1) Save/restore all register values that are still 'live' values for the
>   caller.

What about callee saves?

>2) Test the stack to make sure there is enough room for the callee's local
>   variables (and call the memory manager for a new stack frame if there
>   wasn't room).

	sub	esp, 1000000

seems to work just fine here.  True, you will get into problems if you
run out of VM, or your stack-pointer wraps around, but I don't know of
anyone who checks for those.

>3) Create (on the stack) a traceback entry so the debugger and/or the
>   postmortem analyzer can find the current thread through the call tree.

It need not be on the stack.  Ever hear of saving the return address in a
register?  You know, like Crays do?  You only have to save the retaddr
when you call another function.

>The problem is:  _MOST_ of the procedure call
>overhead is concentrated in number (1)!  And, this overhead applies to all
>procedures - including 'leaf' routines.

True.  That's why register windows were considered a "good thing" (I'm
not sure I like them).  It's also one of the problems you have if you
have large, large amounts of registers (say, 256 anyone?).

>Because modern machines tend
>to have a _large_ number of registers, implementing (1) is usually quite
>expensive - and it can't be made cheaper without interprocedural analysis
>which in turn can't be done as long as separate compilation is possible.

Huh?  The MIPS compiler does interprocedural analysis, and I'm led to
believe that the Pyramid compilers do as well.  It's not *easy* or cheap,
to be sure, but it can be done.
-- 
-----------------+
Sean Eric Fagan  | "Time has little to do with infinity and jelly donuts."
seanf@sco.COM    |           -- Thomas Magnum (Tom Selleck), _Magnum, P.I._
(408) 458-1422   | Any opinions expressed are my own, not my employers'.
djones@megatest.UUCP (Dave Jones) (03/17/90)
From article <29509@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
>
> If you partition your registers into a set that is live across a
> function call and a set that is killed, then leaf routines can run
> entirely out of the latter set, not having to save any registers.  For
> example, in the Am29000 calling convention, a function can allocate up
> to 126 local registers which are saved across function calls and
> also has use of 24 "temporary" registers which are not.  ...

The RISC machines have an internal register which acts something like a
frame-pointer with respect to the user-registers, yielding what they call
'overlapping register windows'.  Thus user-registers need not be saved
and restored except when the window moves off one end of the register set
or the other.

Back in the 70's, TI's 990 and 99000 series chips had about the same
thing, but the "work-spaces", as they called the register-windows, were
accessed by way of the main-memory busses.  (But that architecture had
some bad problems also:  poor jbsr/rtn implementation and 16-bit
addresses.)

What I would like to see is a register-set that acts as the cache for a
virtual stack in a machine with true stack machine instructions.  That
would be my idea of a really good time, although it might put some graph
colorers out of business, having no registers to allocate.  Does anybody
know if such a scheme was proposed for SPARC, and if so, why it was
rejected?  Seems like such a win, and not all that big a step beyond the
register-window approach.
david.megginson@canremote.uucp (DAVID MEGGINSON) (03/17/90)
Turbo C on the Atari ST (M68000) passes the first three numeric arguments
in D0, D1 and D2, and the first two address arguments in A0 and A1.  You
can override this with the cdecl keyword.  All floats and doubles get
passed on the stack.  You can buy a lot of speed this way, with a
reasonably intelligent compiler.  The moral: worry about logic and
readability, and let your compiler do the optimizing.

David Megginson, Centre for Medieval Studies
BITNET: meggin@vm.epas.utoronto.ca

PS. I know this is stupid, but in case anyone doesn't understand, D0-D7
are the 32-bit data registers in a 68k chip, and A0-A7 are the address
registers, also 32-bit.  (A7 is the stack pointer.)
---
 * Via ProDoor 3.1R
drc@cs.brown.edu (David R. Chase) (03/17/90)
In article <12350@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
>What I would like to see is a register-set that acts as the cache for
>a virtual stack in a machine with true stack machine instructions.  That
>would be my idea of a really good time, although it might put some graph
>colorers out of business, having no registers to allocate.

Not at all.  No doubt the "hardware stack" will be of finite size, and if
that is exceeded then some of it must be saved to memory.  In this
situation, you'll win if stack frames aren't so large, and one way to do
that is to allocate stack slots using algorithms very much like those used
to allocate registers.  Spilling won't occur, of course, since you can
have as many "registers" as are needed, but they don't come for free.  Use
fewer and (ignoring scheduling problems caused by spurious dependences on
reused registers) do better.

Hard life, isn't it?

David (really in Menlo Park, CA)
meissner@osf.osf.org (Michael Meissner) (03/18/90)
In article <14274@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
|From article <MEISSNER.90Mar16103226@curley.osf.org>, by
|meissner@osf.org (Michael Meissner):
|> [...]
|> The MIPS compilers also have a -O3 option which does global register
|> allocation.  Here is a fragment of the man page from a Decstation:
|> [...]
|
|Great!  Now if this ability were widespread in the _real_world_ (or,
|the rest of the real world - for those of you who consider MIPS to
|be real :-), then there would be much less waste in the procedure
|interface.  Of course, once you have this capability, you would also
|want an 'inline' keyword on procedure declarations, etc..

The next revision of the MIPS compilers is supposed to have -O4, which
does automatic inlining as well.  It hasn't propagated through DEC yet...

Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Catproof is an oxymoron, Childproof is nearly so
tim@nucleus.amd.com (Tim Olson) (03/19/90)
In article <14273@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
| From article <29509@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
| > [...]
| > If you partition your registers into a set that is live across a
| > function call and a set that is killed, then leaf routines can run
| > entirely out of the latter set, not having to save any registers.  For
| > example, in the Am29000 calling-convention, a function can allocate up
| > to 126 local registers which are saved across function calls and
| > also has use of 24 "temporary" registers which are not.  Most leaf
| > routines can run entirely out of these 24 temps.  The entire overhead
| > is simply a call instruction and a return instruction (2 cycles if the
| > delay slots are filled).
|
| It seems to me that if a procedure is so small that it can only find
| use for 24 (or fewer) registers, then it is small enough to be a real
| good candidate for inlining.  Clearly, _any_ procedure can be written
| to only _use_ 24 registers (there were some machines in the past with
| only _one_ register - they were completely functional).  However, most
| procedures of any size can benefit from quite a few more registers
| than 24.

It might be true that scientific routines written in FORTRAN may have this
many live, non-overlapping variables to keep in registers, but I don't
believe this is true in general.  Statistics from a large collection of
programs and library routines (a mix of general and scientific
applications written in C) show that of 782 functions (620 of which were
non-leaf functions), an average of 6.5 registers per function were live
across function calls.

| Similarly, any 'caller' routine that can afford to leave 24 registers
| unused is fairly small as well.  For efficiency, all 'live' values
| should be kept in registers if at all possible.  The problem is that
| any large program (and some small ones) have quite a large number of
| 'live' values.  Procedures that manipulate arrays may have whole
| arrays (or array sections) that are 'live'.  If there's _room_ in
| the register set for all these values, that's where they should be
| kept.

Yes, this is true.  However, unless the array manipulation is quite
simple, it becomes tricky to use registers for them (the stats quoted
above were for scalar variables only).

| The fact is that this kind of 'temp' register idea only _appears_
| to save time.  You probably actually lose performance by forcing
| many of your 'live' values to memory when they needn't have been.

No live values were forced to memory (at least not ones in the active
function).  The Am29000 stack cache tries to keep live variables from all
functions in the register file as well as the variables of the current
function.  If the register file becomes full, live variables from older
functions are spilled to memory, then filled as they become active again.

--
	Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
tim@nucleus.amd.com (Tim Olson) (03/19/90)
In article <12350@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
| What I would like to see is a register-set that acts as the cache for
| a virtual stack in a machine with true stack machine instructions.  That
| would be my idea of a really good time, although it might put some graph
| colorers out of business, having no registers to allocate.  Does anybody
| know if such a scheme was proposed for SPARC, and if so, why it was
| rejected?  Seems like such a win, and not all that big a step
| beyond the register-window approach.

This was the approach used in the AT&T CRISP processor.

--
	Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
albert@bsovax.UUCP (Albert van der Horst) (03/21/90)
In article <12350@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
> (In a follow-up on Tim Olson)
>What I would like to see is a register-set that acts as the cache for
>a virtual stack in a machine with true stack machine instructions.  That
>would be my idea of a really good time, although it might put some graph
>colorers out of business, having no registers to allocate.  Does anybody
>know if such a scheme was proposed for SPARC, and if so, why it was
>rejected?  Seems like such a win, and not all that big a step
>beyond the register-window approach.

This idea is not new.  It is about time for this idea to provide the most
ideal machine to write compilers for.

There is a weird language that you may not know about.  It is called
FORTH, and it is based on a virtual machine.  This virtual machine is IMHO
exactly right.  It has two stacks, one for the return addresses, and one
for the calculation; the latter is visible to the FORTH programmer.

The designer of FORTH, Charles Moore, has cast this idea into silicon.
This resulted in a RISC chip (NOVIX 4000) with some amazing properties.
For instance, a one-cycle call, and mostly a FREE return from a
subroutine.  It also allocates registers as needed.

I was the president of a small group in the Netherlands that explored this
idea independently.  We got far enough to create a working emulator
(written in FORTH...) for our chip, baptized FIETS (Fast Implementation of
Enhanced Translators and Systems, by sheer accident also the Dutch word
for bicycle).  Then came Moore and interest in the project faded.  Two of
the group members gained hands-on experience with the NOVIX, resulting for
example in a program that generates a video signal by toggling an output
pin, showing a picture with the name "NOVIX".

Despite some coverage by BYTE the NOVIX did not become a success at all.
In my opinion, real mistakes were the lack of a C compiler for it - the
FIETS contained 16 stack-based general-purpose registers, which would have
made such a C compiler viable.  The NOVIX registers were far fewer in
number, caused by a too narrow FORTH-based scope.  Also Charles Moore
still insists that programs needing more than a 16-bit address range are
design errors....  Marketing was directed towards the in-crowd of FORTH
programmers.  Probably very few of you would have noted the value of the
chip from the BYTE article.  Even to a FORTH-literate the article in BYTE
was hard to digest.  The end was: NOVIX went out of business.

Fortunately, the basic ideas have been bought by Harris Semiconductor, and
they know better what makes a processor sing.  They have already produced
a few practically successful designs, and are still developing the line
further.  These chips are optimized for real-time signal processing, and
have extremely fast context switches (switching stacks, no registers to
save).

The bottom line is this.  You have the choice between your code being
inline, or a single-cycle call/return overhead.  All instructions (let's
say almost all) do something useful, like addition, putting the result in
a new register "on the fly" if required.  Or, on the other hand, adding an
offset to a base address will drop the offset from the stack
automatically.  I think "register caching via stacks" is much easier to
implement than the register-window stuff.  And as far as I can see, it is
easier on the compiler optimizer too.

N.B. Those interested (Harris?) may mail me for the FIETS instruction set.
N.B. I am currently exploring transputers, which also have an "allocate
registers as needed" mechanism, without overflowing however and very
shallow.

Views expressed are my own  { not all FIETS members would agree.
                            { my boss does not know what I am talking about.

Albert van der Horst  {eunet!mcvax!hp4nl!bsovax!albert}
P.O. 8052  3053 RB Utrecht  " It's pitch dark. You might be eaten by a grue"
The Netherlands             " I hope you got more light than from a lamp"
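[The two-stack virtual machine Albert describes - one stack for return addresses, one for calculation, so a call is a single push and a return a single pop with no register saving - can be sketched in a few lines of C.  The opcode names below are invented for illustration; they are not the NOVIX or FIETS instruction sets.]

```c
/* Minimal two-stack (FORTH-style) virtual machine.  CALL pushes the
 * return address on its own stack and jumps; RET pops it back.  There
 * is no register state to save or restore across a call. */

enum op { PUSH, ADD, MUL, CALL, RET, HALT };

struct insn { enum op op; int arg; };

static int run(const struct insn *prog)
{
    int data[64], rstack[64];            /* data stack, return stack */
    int dp = 0, rp = 0, pc = 0;

    for (;;) {
        struct insn i = prog[pc++];
        switch (i.op) {
        case PUSH: data[dp++] = i.arg; break;
        case ADD:  dp--; data[dp-1] += data[dp]; break;
        case MUL:  dp--; data[dp-1] *= data[dp]; break;
        case CALL: rstack[rp++] = pc; pc = i.arg; break;  /* one push */
        case RET:  pc = rstack[--rp]; break;              /* one pop */
        case HALT: return data[dp-1];
        }
    }
}

/* (2 + 3) * 4, with the addition done in a "subroutine" at index 6. */
static const struct insn demo_prog[] = {
    {PUSH, 2}, {PUSH, 3}, {CALL, 6},     /* 0-2 */
    {PUSH, 4}, {MUL, 0},  {HALT, 0},     /* 3-5 */
    {ADD, 0},  {RET, 0},                 /* 6-7: the subroutine */
};
```

Running `demo_prog` yields 20; the subroutine call costs exactly one return-stack push and one pop, which is the "one-cycle call, free return" property in hardware form.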
jlg@lambda.UUCP (Jim Giles) (03/21/90)
From article <29551@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
> [...]
> It might be true that scientific routines written in FORTRAN may have
> this many live, non-overlapping variables to keep in registers, but I
> don't believe this is true in general.  Statistics from a large
> collection of programs and library routines (a mix of general and
> scientific applications written in C) show that of 782 functions (620
> of which were non-leaf functions), an average of 6.5 registers per
> function were live across function calls.

This statistic can only be interpreted in one way: the C compiler in
question didn't allocate registers very well.  Especially in scientific
packages, there are _HUGE_ numbers of 'live' _VALUES_ to deal with during
execution of even simple routines.  Vectors, arrays, lists, strings, etc.
are all being either produced or consumed.  The fact that none of these
_VALUES_ were in registers at the time of the call indicates one of two
things: 1) the code in question was fragmented to the point that most
procedures had only a few data items (and scalar at that), or 2) the
compiler simply wasn't using the registers to anywhere near their
potential.  Since you imply the code was well written, I reject the first
explanation.  That leaves the compiler.

My experience (I don't have statistics) with both Fortran and C is that
good compilers generally PACK the registers with as much live data as
possible.  Even an apparently pure scalar loop that does only 'simple'
operations may be 'unrolled' a few times to make better use of the
registers.  Compilers are becoming available that apply that optimization
automatically (so this isn't just a case which applies only to 'coder
enhanced' code).  If such a loop (and this is still the simple kind, mind
you) had a procedure call embedded in it, all the registers that the
procedure might use would have to be spilled to memory - and then reloaded
on return.  On the Cray, spilling and reloading just _ONE_ vector register
is over 150 clocks (64 elements at one clock each to and from memory plus
time for the memory pipeline plus a little overhead to set up stride,
address, etc.).  This is _not_ a tiny problem.

Again I say, this scheme of having 'preserved' vs. 'temp' registers for
procedure calls only _appears_ to save time.  In truth, it deprives you of
registers which could be put to use (at least if your compiler was clever
enough).  The only solution to the problem is to 'inline' (or, at least,
schedule registers globally).  At present, in most environments, the only
way to do this is by manually 'inlining' the procedures.  Let's hope that
more automatic solutions become generally available in the next few years.

By the way, aside from a problem with aliasing in C, Fortran and C are
identical with respect to optimization, register allocation, etc..  So,
your implied put-down of Fortran is not relevant in this context.

J. Giles
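[The unrolling point above, sketched in C.  The four partial sums can each live in a register for the whole loop; under a caller-save convention, an opaque call in the body would force them to memory and back on every iteration.  The function name and unroll factor are chosen for illustration.]

```c
/* Unroll-by-4 summation: s0..s3 are all live across every iteration,
 * so the loop profitably occupies four registers instead of one.
 * Assumes n is divisible by 4, for brevity. */

static double sum_unrolled(const double *x, int n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i+1];
        s2 += x[i+2];
        s3 += x[i+3];
        /* a call to an opaque function here would spill s0..s3
         * to memory and reload them, every single iteration */
    }
    return (s0 + s1) + (s2 + s3);
}
```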
dik@cwi.nl (Dik T. Winter) (03/21/90)
In article <584@bsovax.UUCP> albert@BSOVAX.UUCP (Albert van der Horst) writes:
> In article <12350@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
> >What I would like to see is a register-set that acts as the cache for
> >a virtual stack in a machine with true stack machine instructions.
>
> This idea is not new.

No, indeed.  Look at the English Electric KDF9 (of fifties vintage).
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
tim@nucleus.amd.com (Tim Olson) (03/22/90)
In article <14281@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
| From article <29551@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
| > [...]
| > It might be true that scientific routines written in FORTRAN may have
| > this many live, non-overlapping variables to keep in registers, but I
| > don't believe this is true in general.  Statistics from a large
| > collection of programs and library routines (a mix of general and
| > scientific applications written in C) show that of 782 functions (620
| > of which were non-leaf functions), an average of 6.5 registers per
| > function were live across function calls.
|
| This statistic can only be interpreted in one way: the C compiler in
| question didn't allocate registers very well.  Especially in scientific
| packages, there are _HUGE_ numbers of 'live' _VALUES_ to deal with
| during execution of even simple routines.  Vectors, arrays, lists,
| strings, etc. are all being either produced or consumed.  The fact that
| none of these _VALUES_ were in registers at the time of the call
| indicates one of two things: 1) the code in question was fragmented to
| the point that most procedures had only a few data items (and scalar at
| that), or 2) the compiler simply wasn't using the registers to anywhere
| near their potential.  Since you imply the code was well written, I
| reject the first explanation.  That leaves the compiler.

Do you have any statistics to back up your assertions?  The values I
quoted match what other people have found as well.  Again, you may be only
looking at a small portion of scientific code and missing what is going on
in general code.  The compiler that I used to collect the stats will keep
all live scalars in the register file, but doesn't do inlining of
functions, and doesn't try to operate on arrays in registers (how do you
generate code to deal with runtime address calculations???)

| My experience (I don't have statistics) with both Fortran and C is that
                 ^^^^^^^^^^^^^^^^^^^^^^^
Oh.

--
	Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
jlg@lambda.UUCP (Jim Giles) (03/23/90)
From article <29585@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
> [...]
> The compiler that I used to collect the stats will keep all live
> scalars in the register file, [...]
                            [^^^]

I don't believe a word of it.  The word I don't believe is 'all'.  _ANY_
scalar that is visible in the local scope and has a value which might
still be used by the program at some future point is 'live'.  (The same
obviously applies to non-scalar values.)  That is the definition of
'live'.  What you really mean is that the compiler breaks code into 'basic
blocks' and keeps all active values from _that_ block in registers.  The
use of 'basic blocks' and the lack of global dataflow analysis is
confusing you into thinking that only a few values are 'live' at a given
point.  Also, it is a considerable mistake not to regard non-scalar values
as important.

The bottom line is that registers are an important resource in any
machine.  They are often more than 10 times faster than memory to
reference.  Your analysis indicates that your code (whether the compiler
is responsible or not) is using this resource to only a small fraction of
its potential.  But instead of complaining, you are acting like it is a
desirable way for the code to behave.

> | My experience (I don't have statistics) with both Fortran and C is that
>                  ^^^^^^^^^^^^^^^^^^^^^^^
> Oh.

Ok, now I have statistics.  I just did a quick survey of some expertly
crafted C code that I have source code to (it's actually part of the X11
window package).  This is only a small program (it's xlock for those
interested in checking).  Still, there are over 50 scalar variables
declared global to the utility - these are _ALWAYS_ 'live'.  Further, this
code includes a lot of window-related declarations (which also contain a
bunch of _ALWAYS_ 'live' global declarations).  This is admittedly a small
sample and it disregards local variables and non-scalars.  Still, since
most of the program consists of small procedures which manipulate these
global variables, I suspect it would be a smaller, faster code if the
values always occupied registers.  xlock probably doesn't need to be
faster, but if speed were important, register usage would be too.

J. Giles
peter@ficc.uu.net (Peter da Silva) (03/23/90)
(For comp.arch folks: Jim is basically saying that the more registers you
have, the better, because it lets you put more variables in registers.)

But, Jim, registers aren't free.  The more registers you have, the more
bits you have to hide in your carefully crafted instruction set to specify
registers.  Modern RISC processors have up to 4 arguments to a given
instruction, so even if they're all registers (no mixed-mode instructions)
and you only have 16 registers, that's 16 bits gone right there.  Now,
let's add some more bits to specify constants and maybe some addressing
modes, and you can run out of bits pretty quickly.  With 64 registers
(about the most I'd expect you to be happy with), and immediate operands,
that would leave you with about 5 bits to specify the instruction.

Now suppose you have room on the chip for 512 registers... why *not* use
register windows, or a stack cache, or something similar, instead of just
making the instruction wider?
--
 _--_|\  `-_-' Peter da Silva. +1 713 274 5180. <peter@ficc.uu.net>.
/      \  'U`
\_.--._/
      v
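[Peter's encoding arithmetic, worked out as a sketch.  `opcode_bits` is an invented helper; it just counts how much of an instruction word is left after naming the register operands, before carving out space for immediates and addressing modes.]

```c
/* Bits left in the instruction word for the opcode (and anything
 * else) after encoding `operands` register fields: doubling the
 * register count costs one more bit per operand field. */

static int opcode_bits(int word_bits, int operands, int nregs)
{
    int reg_bits = 0;
    while ((1 << reg_bits) < nregs)   /* bits needed to name one register */
        reg_bits++;
    return word_bits - operands * reg_bits;
}
```

For a 32-bit word with 4 register operands: 16 registers leave 16 bits, 64 registers leave only 8; carve immediates out of that and Peter's "about 5 bits" for the opcode follows.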
tim@nucleus.amd.com (Tim Olson) (03/23/90)
In article <14285@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
| From article <29585@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
| > [...]
| > The compiler that I used to collect the stats will keep all live
| > scalars in the register file, [...]
|
| I don't believe a word of it.  The word I don't believe is 'all'.
| _ANY_ scalar that is visible in the local scope and has a value
| which might still be used by the program at some future point is
| 'live'.  (The same obviously applies to non-scalar values.)  That is
| the definition of 'live'.

Sorry -- I should have clarified that it keeps all live scalar variables
*that don't have potential aliasing problems* in the register file.  Since
global variables are visible to other functions, they have the potential
for having their address taken in some other compilation unit (remember,
we are talking about C code), and thus cannot be kept in registers --
unless "universal" register allocation and optimization is performed.

| What you really mean is that the compiler breaks code into
| 'basic blocks' and keeps all active values from _that_ block in
| registers.  The use of 'basic blocks' and the lack of global dataflow
| analysis is confusing you into thinking that only a few values are
| 'live' at a given point.

No, the compiler is performing register allocation on an entire function,
not on a basic-block level.  It is performing "global" dataflow analysis
(meaning on a per-function basis), but not "universal" dataflow analysis
(on the entire program, including all other compilation units).

| Also, it is a considerable mistake not to regard non-scalar values
| as important.

I never said they weren't important -- it is just that it is very hard
(sometimes impossible) to operate on structures and arrays that are kept
entirely in registers (runtime indexing, etc.).  If you mean that *scalar
elements* of an array that are used often should be kept in the register
file, then I agree, and we do that.

| The bottom line is that registers are an important resource in any
| machine.  They are often more than 10 times faster than memory to
| reference.  Your analysis indicates that your code (whether the
| compiler is responsible or not) is using this resource to only a small
| fraction of its potential.  But instead of complaining, you are acting
| like it is a desirable way for the code to behave.

No, I don't claim it is "desirable", just that a lot of general-purpose C
code tends to have a large number of small to medium-sized functions which
use < 20 local scalars.

| Ok, now I have statistics.  I just did a quick survey of some expertly
| crafted C code that I have source code to (it's actually part of the
| X11 window package).  This is only a small program (it's xlock for
| those interested in checking).  Still, there are over 50 scalar
| variables declared global to the utility - these are _ALWAYS_ 'live'.
            ^^^^^^

OK -- now I see where you are getting your numbers and where we are
differing.  The stats I collected were for local ("automatic") scalars,
because global scalars can have aliasing problems unless we perform
"universal" dataflow analysis.  I agree that if this were done, the number
of variables live across a function call would increase dramatically.

--
	Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
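[The aliasing problem Tim raises, in miniature.  The names are invented, and both functions are shown in one file for brevity; the point is that `store_through` could just as well live in a separately compiled module, which is exactly why the compiler cannot cache the global in a register across the call without whole-program ("universal") analysis.]

```c
/* Why a C compiler can't blindly keep a global in a register across
 * a call: another compilation unit may reach it through a pointer. */

int counter = 0;                 /* global: its address may escape */

void store_through(int *p)       /* imagine this in another .c file */
{
    *p = 42;
}

int read_after_call(void)
{
    counter = 1;
    store_through(&counter);     /* might modify counter */
    return counter;              /* must reload from memory, not a
                                    register copy holding 1 */
}
```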
pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) (03/26/90)
In article <14281@lambda.UUCP> jlg@lambda.UUCP (Jim Giles) writes:
>From article <29551@amdcad.AMD.COM>, by tim@nucleus.amd.com (Tim Olson):
>> [...]
>> It might be true that scientific routines written in FORTRAN may have
>> this many live, non-overlapping variables to keep in registers, but I
>> don't believe this is true in general.  Statistics from a large
>> collection of programs and library routines (a mix of general and
>> scientific applications written in C) show that of 782 functions (620
>> of which were non-leaf functions), an average of 6.5 registers per
>> function were live across function calls.
>
>This statistic can only be interpreted in one way: the C compiler in
>question didn't allocate registers very well.  Especially in scientific
>packages, there are _HUGE_ numbers of 'live' _VALUES_ to deal with
>during execution of even simple routines.  Vectors, arrays, lists,
>strings, etc. are all being either produced or consumed.

This is an old fallacy: the number of useful registers is usually quite
low; the Wall paper and others say that for most codes, even
floating-point intensive ones, 4, 8, or 16 registers make do.  The problem
that Giles does not seem to consider is that caching values in registers
is only useful if the values are going to be used repeatedly, like all
forms of caching.  It is not difficult to produce examples of fairly
common pieces of code where on many machines register caching worsens
performance.

Many registers are useful when:

1) Your so-called 'optimizer' does not select values to cache on expected
dynamic frequency of use but on static frequency of use.  Since the two
are poorly correlated, your so-called 'optimizer' wants to cache
everything in sight.

2) You have extremely high latency to memory, and you want to use a large
register cache as a large cache, where even infrequently reused values
are insufferably expensive to refetch.

3) You have extremely high latency to memory, and you can prefetch blocks
of operands while other blocks of operands are being processed, because
you know which operands are going to be needed next, like with vector
machines.

4) You have multiple functional units, and each of them then can make use
of a set of registers.

Note that all these do not really mean that you need lots of registers;
1) means that your compiler is stupid, 2) that you are missing a proper
dynamic cache, and 3) and 4) that you actually have multiple threads of
control.

My aversion to large register caches and so-called clever optimizers
should be well known by now, and stems from my opinion that stupid
compilers are to be avoided, very optimizing compilers are accident-prone
and easily made unnecessary by careful coding where it matters, and that I
am only interested in general-purpose architectures.  It is always
possible to design a specific architecture that isn't such...

In particular there are two uses for registers: one is as inputs to
functional units, and the other as cache.  There are machines, especially
ones with stack orientation and caches, that have only specialized
registers, i.e. each register is only there as input to a functional
unit.  The Manchester mainframes were specialized-register machines.
CRISP is in some sense such a machine as well.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
preston@titan.rice.edu (Preston Briggs) (03/26/90)
In article <PCG.90Mar25230051@rupert.cs.aber.ac.uk> pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi) writes:
>This is an old fallacy: the number of useful registers is usually quite
>low; the Wall paper and others say that for most codes, even floating
>point intensive ones, 4-8-16 registers make do.

Experience papers that say "n is enough registers" should be read with the
caveat "for my optimizer and my benchmarks."  More aggressive optimization
will usually make profitable use of more registers.

>The problem that Giles does not seem to consider is that caching values
>in registers is only useful if the values are going to be used
>repeatedly, like all forms of caching.

On many (most?) machines, 2 uses is enough to justify keeping a value in a
register.

>It is not difficult to produce examples of fairly common pieces
>of code where on many machines register caching worsens performance.

Of course, we can produce plenty of examples where many registers are
helpful.  Further, can you post an example of some sort?  I spend a lot of
time thinking about this stuff, and hard cases would help.

>Many registers are useful when:
>1) Your so-called 'optimizer' does not select values to cache on
>expected dynamic frequency of use but on static frequency of use.

Surely bad optimizers/register-allocators waste registers.  (Naturally,
your reply will be "there's no such thing as a good optimizer.")

>2) You have extremely high latency to memory
>3) You have extremely high latency to memory, and you can prefetch
>blocks of operands while other blocks of operands are being processed,
>4) You have multiple functional units

But how many chips does this describe?  Today, many; next year, many more.
The i860 has a fast cache, but it's small and a miss is about 25 cycles.
Many chips have load pipelines, where it's profitable to fetch several
cycles in advance.  And how many chipsets have asynchronous FP processors?
CPUs outrun memory.  They have for years, and memory isn't catching up.
Hence the development of caches, multi-level caches, wide data busses, and
large register sets.

>1) means that your compiler is stupid,

Right, but I don't like stupid compilers either.

>2) that you are missing a proper dynamic cache

Cache isn't a cure-all.  It's finite and fairly simplistic.  Long line
sizes and limited set associativity tend to restrict its ability to
replace registers.

>and 3) and 4) that you have actually multiple threads of control.

Or perhaps the compiler might find enough low-level parallelism to keep
your chip busy.

>My aversion to large register caches

You like "regular" caches but not register caches...  Aren't registers
just the top of the memory hierarchy?  They are rather more subject to
control from software, but I'd think that was a plus.  I'd rather see the
systems extended so that the other layers of the hierarchy were also
under software control (prefetches to cache, fetches around cache,
prefetching pages of virtual memory, and so forth).
--
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu
jlg@lambda.UUCP (Jim Giles) (03/27/90)
From article <PCG.90Mar25230051@rupert.cs.aber.ac.uk>, by pcg@rupert.cs.aber.ac.uk (Piercarlo Grandi):
> [...] The problem that Giles
> does not seem to consider is that caching values in registers is only
> useful if the values are going to be used repeatedly, like all forms of
> caching.

I don't see where I've been at fault here.  The _definition_ of 'live'
values is that they are to be used again.  The more of these that you can
place into registers the better.  The compiler should do sufficient
dataflow analysis so that it knows which 'live' values are needed next and
should schedule registers to preload these (and hold them until other
values are of more immediate use).  Procedure calls interrupt this
analysis (at least, if you don't have interprocedural analysis).

This problem doesn't go away (even with your well-known masochistic
micro-management style of programming) without hand 'inlining' or
automatic interprocedural analysis.  In fact, hand 'inlining' is just the
sort of micro-optimization that your style of programming would recommend!
So, in this case, we should be in agreement.

J. Giles
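[The inlining that the whole thread keeps circling back to, as a sketch.  Note that `inline` was not yet standard C when this thread was written - Giles's "want an 'inline' keyword" was a wish - and C99 later added it; `static inline` is the portable modern spelling used here, with invented function names.]

```c
/* Inlining erases the call boundary, so the register allocator can
 * work across the old call site: dot3's temporaries and norm3sq's
 * values are allocated together, with no caller/callee-save traffic. */

static inline double dot3(const double *a, const double *b)
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

double norm3sq(const double *a)
{
    return dot3(a, a);   /* after inlining: no call, no spills */
}
```

This is exactly the "only solution" Giles names: absent interprocedural analysis, collapsing the procedure boundary is the one way to let the compiler schedule registers across it.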