krischan@truth.informatik.rwth-aachen.de (Christian Engel) (09/22/90)
Hei, Kim Kempf and all netlanders!

For more than a year now I have been wondering why the Microware C compiler
passes the first two arguments of a function call in the registers D0 and D1
(usually). I have discussed this question with several friends experienced in
compiler construction, but the more we think about the advantages of this
strategy, the less we see them. Let me explain the two most important reasons
why we can't see the advantage:

* The registers D0 and D1 are used as temporaries. Consequently, the
  parameters passed in those registers have to be moved to another, safe
  place by the called function.

* On the other hand, unless a parameter is declared of class register, it
  can't be held in a register (D0, D1 or any other), since the called
  function could take its address with the unary & operator. If and only if
  the parameter is not of class register and the compiler detects that the
  address operator is never applied to the parameter inside the function's
  block, the compiler is allowed to decide on its own to keep the variable in
  a register. But I think advanced features like automatically promoting a
  variable to class register aren't implemented in the Microware C compiler.

This leads to the conclusion that, first of all, a function has to move its
parameters passed in D0/D1 to a location on the stack. And that's what the
Microware C compiler does -- tell me if I'm wrong. And this means the run-time
system has to execute *two* move operations for every parameter passed in
registers: first the caller loads them into the registers, and second the
called function moves them to its local stack area. The other parameters, not
passed in D0/D1, are pushed onto the stack by the caller directly.

Main conclusion: passing parameters in registers is slower than passing on
the stack.

Exception: a compiler with sophisticated optimizations is able to recognize a
parameter that can truly be kept in a register -- that means a parameter that
is placed in a register that is free, that won't be used as a temporary, and
whose address is never taken with the & operator inside the function body.
And one more thing about that: every caller has to load the parameters into
the correct registers. So at the point where the code for the call is
generated, the compiler must already know about the code of the called
function.

Alright, netlanders, tell me if I'm wrong. But if not, I would suggest to
you, Kim Kempf and other compiler builders: forget passing parameters in
registers!!!

Krischan.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
krischan@informatik.rwth-aachen.de *** mcvax!unido!rwthinf!strange!krischan
Christian Engel, Lehrstuhl fuer Informatik I, RWTH Aachen
Ahornstr. 55, D-5100 Aachen, W. Germany
rbt@cernvax.UUCP (roberto divia) (09/27/90)
Sorry, I tend to disagree. It is true that allocating two registers for
parameter passing adds some overhead to the executable code, but on the other
hand it is *MUCH* faster to do things this way. I did some tests on a MC68030,
and the same routine called with parameters in registers was 10-15% faster
than the equivalent called with parameters on the stack. This comes from two
main reasons:

1) the stack push/pop/stack-pointer rearrangement takes time;
2) the bigger the data area used by the code, the easier it is to get a
   cache miss (in the case of processors with data caches).

The Microware convention of "caller saves" reduces the overhead of register
swapping/saving to the minimum necessary, so...
--
| Roberto Divia`  | Love at first sight is one of the greatest    |
| =============   | labor-saving devices the world has ever seen  |
| CERN : European Laboratory for Particle Physics, 1211 Geneva 23 |
| Switzerland (CH)                                                |
kim@mcrware.UUCP (Kim Kempf) (09/29/90)
In article <2820@cernvax.UUCP> rbt@cernvax.UUCP (roberto divia) writes:
>Sorry, I tend to disagree. It is true that allocating two registers for
>parameter passing adds some overhead to the executable code, but on the
>other hand it is *MUCH* faster to do things this way. I did some tests
>on a MC68030, and the same routine called with parameters in registers
>was 10-15% faster than the equivalent called with parameters on the
>stack. This comes from two main reasons:
>
>1) the stack push/pop/stack-pointer rearrangement takes time;
>2) the bigger the data area used by the code, the easier it is to get a
>   cache miss (in the case of processors with data caches).
>

Couldn't have stated it better myself, Roberto! The following is an
example, albeit trivial, of the savings that add up quickly in
non-trivial programs.

The example C program:

    main()
    {
        register int i, j;

        joe(1, 2);
        for (i = 1; i < 10; i++) {
            joe(i, 5);
        }
    }

The OS-9 C code in assembler. Notice the 2-byte argument loads and the
absence of stack cleanup after the function calls:

    00001                            psect
    00002                            ttl     main
    00003  0000 4e550000     main:   link    a5,#0
    00004  0004 48e7cc00             movem.l #_1!1,-(sp)
    00005  0008 203cffffffbc         move.l  #_3,d0          :6
    00006  000e=61000000             bsr     _stkcheck
    00007  0012 7202                 moveq.l #2,d1           :2
    00008  0014 7001                 moveq.l #1,d0           :2
    00009  0016=61000000             bsr     joe
    00010  001a 7801                 moveq.l #1,d4           :2
    00011  001c 6000000c             bra     _7
    00012                    _5
    00013  0020 7205                 moveq.l #5,d1           :2
    00014  0022 2004                 move.l  d4,d0           :2
    00015  0024=61000000             bsr     joe
    00016                    _8
    00017  0028 5284                 addq.l  #1,d4           :2
    00018                    _7
    00019  002a 700a                 moveq.l #10,d0          :2
    00020  002c b084                 cmp.l   d4,d0           :2
    00021  002e 6e00fff0             bgt     _5
    00022                    _6
    00023                    _4
    00024  0032 4ced0032fff4         movem.l -12(a5),#_1
    00025  0038 4e5d                 unlk    a5
    00026  003a 4e75                 rts                     :2
    00027  ffffffbc         _3       equ     0xffffffbc      :0
    00028  00000032         _1       equ     0x00000032      :0
    00029  00000018         _2       equ     0x00000018      :0
    00030  0000003c                  ends

Now the code from the SunOS 4.0 UNIX compiler.
Notice that the same argument loads take *6 bytes* each, plus the extra
2 bytes for the addq stack cleanup after each function call:

    00001                          psect
    00002                  _main:
    00003  0000 4e560000           link    a6,#0
    00004  0004 dffcfffffffc       add.l   #-4,sp
    00005  000a 48d70000           movem.l #0x0,(sp)
    00006  000e 487900000002       pea     0x2
    00007  0014 487900000001       pea     0x1
    00008  001a=61000000           bsr     _joe
    00009  001e 504f               addq.w  #0x8,sp
    00010  0020 7e01               moveq   #0x1,d7
    00011                  L17:
    00012  0022 0c870000000a       cmpi.l  #0xa,d7
    00013  0028 6c14               bge.s   L16
    00014  002a 487900000005       pea     0x5
    00015
    00016  0030 2f07               move.l  d7,-(sp)
    00017  0032=61000000           bsr     _joe
    00018  0036 508f               addq.l  #0x8,sp
    00019                  L15:
    00020  0038 5287               addq.l  #0x1,d7
    00021  003a 6000ffe6           bra     L17
    00022                  L16:
    00023  003e 7000               moveq.l #0,d0
    00024                  LE12:
    00025  0040 4cee00800004       movem.l 4(a6),#0x80
    00026  0046 4e5e               unlk    a6
    00027  0048 4e75               rts

So the OS-9 code is smaller and faster!

----------------
Kim Kempf, Microware Systems Corporation   {sun,uunet}!mcrware!kim
"Opinions!  We doan need no stinkink opinions!"
krischan@truth.informatik.rwth-aachen.de (Christian Engel) (09/30/90)
In article <2820@cernvax.UUCP> rbt@cernvax.UUCP (roberto divia) writes:
>1) the stack push/pop/stack-pointer rearrangement takes time;

In article <3117@mcrware.UUCP> kim@mcrware.UUCP (Kim Kempf) writes:
>The OS-9 C code in assembler. Notice the 2-byte argument loads and the
>absence of stack cleanup after the function calls:
> [...]
>So the OS-9 code is smaller and faster!

Alright, I've got it. But you forgot to tell me about the *called* function!
However, I think I have the explanation: sometimes you need an additional
move operation to save the parameters passed via D0/D1. This saving is done
by the movem operation at the beginning of every function entry, which also
saves all registers used for local variables of storage class register. If
and only if you pass parameters and there are no register variables inside
the function, the movem is required solely for saving D0/D1. In that case you
lose a little time and code compared with parameter passing via the stack.

I haven't made up my mind about what comes out in the end when you compare
the two models of parameter passing, but after considering your answers and
the explanation above, I think that statistically the passing via registers
will win clearly. Am I right?

In article <2820@cernvax.UUCP> rbt@cernvax.UUCP (roberto divia) writes:
>2) the bigger the data area used by the code, the easier it is to get a
>   cache miss (in the case of processors with data caches).

Now, that's too high for me. Why should the chance of a cache miss increase
when passing parameters via the stack? The first two parameters need space
in the stack frame of the called function anyway. That means there is no
difference! The only difference is that passing parameters via the stack
moves them into the stack frame *before* entering the called function (done
by the caller), while passing parameters via the registers D0/D1 moves them
into the stack frame after entering the called function (done by the callee).
krischan
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
krischan@informatik.rwth-aachen.de *** mcvax!unido!rwthinf!strange!krischan
Christian Engel, Lehrstuhl fuer Informatik I, RWTH Aachen
Ahornstr. 55, D-5100 Aachen, W. Germany
rbt@cernvax.UUCP (roberto divia) (09/30/90)
In article <3552@rwthinf.UUCP> krischan@truth.UUCP (Christian Engel) writes:
>In article <2820@cernvax.UUCP> rbt@cernvax.UUCP (roberto divia) writes:
>>2) the bigger the data area used by the code, the easier it is to get a
>>   cache miss (in the case of processors with data caches).
>Now, that's too high for me. Why should the chance of a cache miss increase
>when passing parameters via the stack?

Simply because they will HAVE to be stored in the data cache (this happens
during the PUSH), removing other lines from the cache itself, lines that will
have to be reloaded in case of new references. The data cache normally works
as a function of the "region" of data used by the program. Stack references
will fall outside this region, increasing the number of cache misses...
--
| Roberto Divia`  | Love at first sight is one of the greatest    |
| =============   | labor-saving devices the world has ever seen  |
| CERN : European Laboratory for Particle Physics, 1211 Geneva 23 |
| Switzerland (CH)                                                |
krischan@strange.informatik.rwth-aachen.de (Christian Engel) (10/01/90)
In article <2823@cernvax.UUCP> rbt@cernvax.UUCP (roberto divia) writes:
>In article <3552@rwthinf.UUCP> krischan@truth.UUCP (Christian Engel) writes:
>>In article <2820@cernvax.UUCP> rbt@cernvax.UUCP (roberto divia) writes:
>>>2) the bigger the data area used by the code, the easier it is to get a
>>>   cache miss (in the case of processors with data caches).
>>Now, that's too high for me. Why should the chance of a cache miss increase
>>when passing parameters via the stack?
>
>Simply because they will HAVE to be stored in the data cache (this happens
>during the PUSH), removing other lines from the cache itself, lines that
>will have to be reloaded in case of new references. The data cache normally
>works as a function of the "region" of data used by the program. Stack
>references will fall outside this region, increasing the number of cache
>misses...

Again: you need the same amount of stack storage in either model. In any
case, except for parameters of storage class register, space in the stack
frame of the called function is required for the parameters. I think it makes
no difference whether the parameters are moved to the stack before entering
the function or afterwards. The code executed in the meantime is only the bsr
and link instructions. If you get a cache miss in one model, you will get it
in the other too, won't you?
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
krischan@informatik.rwth-aachen.de *** mcvax!unido!rwthinf!strange!krischan
Christian Engel, Lehrstuhl fuer Informatik I, RWTH Aachen
Ahornstr. 55, D-5100 Aachen, W. Germany
krischan@strange.informatik.rwth-aachen.de (Christian Engel) (10/09/90)
Hi! Some more discussion on the chance of cache misses:

In article <3552@rwthinf.UUCP> I wrote:
>In article <2820@cernvax.UUCP> rbt@cernvax.UUCP (roberto divia) writes:
>>2) the bigger the data area used by the code, the easier it is to get a
>>   cache miss (in the case of processors with data caches).
>
>Now, that's too high for me. Why should the chance of a cache miss increase
>when passing parameters via the stack? The first two parameters need space
>in the stack frame of the called function anyway. That means there is no
>difference! The only difference is that passing parameters via the stack
>moves them into the stack frame *before* entering the called function (done
>by the caller), while passing parameters via the registers D0/D1 moves them
>into the stack frame after entering the called function (done by the callee).

In article <3555@rwthinf.UUCP> I wrote:
>In article <2823@cernvax.UUCP> rbt@cernvax.UUCP (roberto divia) writes:
>>Simply because they will HAVE to be stored in the data cache (this happens
>>during the PUSH), removing other lines from the cache itself, lines that
>>will have to be reloaded in case of new references. The data cache normally
>>works as a function of the "region" of data used by the program. Stack
>>references will fall outside this region, increasing the number of cache
>>misses...
>
>Again: you need the same amount of stack storage in either model. In any
>case, except for parameters of storage class register, space in the stack
>frame of the called function is required for the parameters. I think it
>makes no difference whether the parameters are moved to the stack before
>entering the function or afterwards. The code executed in the meantime is
>only the bsr and link instructions. If you get a cache miss in one model,
>you will get it in the other too, won't you?
Now some more aspects. Compare the stack frames of the two models, passing
parameters on the stack versus passing parameters in registers:

----------------------------------------------------------------------
Model of the stack frame when passing all parameters on the *stack*:

high |                 |
     |-----------------|
     |   parameters    | \
     |  1, 2, ... in   | |   <- pushed by caller
     |  reverse order  | |
     |-----------------| |
     | return address  | |   <- pushed by bsr
     |-----------------|  >  locally accessed by the function
     |  old link reg.  | |
     |-----------------| |   <- built by link
     |      space      | |
     |    for local    | |
     |    variables    | /
     |-----------------|
     | registers saved |
     |  to give space  |
     |    for local    |     <- by movem
     |     register    |
     |    variables    |
     |-----------------| <--- Top of stack
low  |                 |
----------------------------------------------------------------------
Model of the stack frame when passing parameters 1 and 2 in the
*registers* D0/D1:

high |                 |
     |-----------------|
     |   parameters    | \
     |  3, 4, ... in   | |   <- pushed by caller
     |  reverse order  | |
     |-----------------| |
     | return address  | |   <- pushed by bsr
     |-----------------| |
     |  old link reg.  | |
     |-----------------|  >  locally accessed by the function
     |      space      | |   <- built by link
     |    for local    | |
     |    variables    | |
     |-----------------| |
     | registers saved | |
     |  to give space  | |
     |    for local    | |   <- by movem
     |     register    | |
     |    variables    | |
     |-----------------| |
     | parameters 1, 2 | /   <- by movem
     |-----------------| <--- Top of stack
low  |                 |
----------------------------------------------------------------------

The probability of a cache miss rises as the area of accessed memory grows.
If you compare the two models you will see that the accessed area is larger
for the model of passing all parameters via the stack if and only if there
are one or two parameters and no local register variables.
It looks like this:

----------------------------------------------------------------------
Model of the stack frame when passing all parameters on the *stack*:

high |                 |
     |-----------------|
     | parameters 1, 2 | \
     | in reverse order| |   <- pushed by caller
     |-----------------| |
     | return address  | |   <- pushed by bsr
     |-----------------|  >  locally accessed by the function
     |  old link reg.  | |
     |-----------------| |   <- built by link
     |      space      | |
     |    for local    | |
     |    variables    | /
     |-----------------| <--- Top of stack
low  |                 |
----------------------------------------------------------------------
Model of the stack frame when passing parameters 1 and 2 in the
*registers* D0/D1:

high |                 |
     |-----------------|
     | return address  | \   <- pushed by bsr
     |-----------------| |
     |  old link reg.  | |
     |-----------------|  >  locally accessed by the function
     |      space      | |   <- built by link
     |    for local    | |
     |    variables    | |
     |-----------------| |
     | parameters 1, 2 | /   <- by movem
     |-----------------| <--- Top of stack
low  |                 |
----------------------------------------------------------------------

The difference in the sizes accessed is 8 bytes. Now we can discuss how high
the probability is that this case occurs. I think the probability of passing
exactly one or two parameters is very high. That gives an advantage to
passing by registers. But this advantage decreases with every register that
has to be freed for local register variables. It is lost with two register
variables and becomes a disadvantage with more than two. This probability
depends on your programming style (how often do you use local variables of
storage class register?) and on the compiler used (does it perform an
optimization promoting local variables from storage class auto to register?).

Krischan
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
krischan@informatik.rwth-aachen.de *** mcvax!unido!rwthinf!strange!krischan
Christian Engel, Lehrstuhl fuer Informatik I, RWTH Aachen
Ahornstr. 55, D-5100 Aachen, W. Germany
rbt@cernvax.UUCP (roberto divia) (10/11/90)
This might go on forever... I have looked at some of my code (mostly written
for real-time applications) and noticed how often I have used routines with
one or two input parameters and an output value. I also did some tests where
the same routine was called with parameters on the stack and with parameters
in registers, and found that the second case is MUCH faster. My tests were
executed on a MC68030 (program and data caches). The simple fact of passing
one or two of 6 parameters (or more) in registers reduced the execution time
by 10 to 20%.

To see why, I tried different combinations. In particular, when I turned OFF
the data cache, the speed gain was much smaller (2-3%). This is because the
data region used by the program was somehow "extended" by the use (or better,
the abuse) of the data stack made during subroutine calls. Try it and
believe...
--
| Roberto Divia`  | Love at first sight is one of the greatest    |
| =============   | labor-saving devices the world has ever seen  |
| CERN : European Laboratory for Particle Physics, 1211 Geneva 23 |
| Switzerland (CH)                                                |
dm@digiw.UUCP (Dick Mihalowsky) (10/13/90)
In article <3117@mcrware.UUCP> kim@mcrware.UUCP (Kim Kempf) writes:
>Couldn't have stated it better myself, Roberto! The following is an
>example, albeit trivial, of the savings that add up quickly in
>non-trivial programs:
[example deleted]

It is undeniably faster to use registers for passing parameters. But the
current Microware C compiler uses a rather complex register allocation
algorithm, and I am still waiting for Microware to implement <varargs.h>
(or <stdarg.h>, if you insist) and a working set of the vprintf(3) family.
Missing these, I'll take stack passing any time.

Still hacking on variable-argument functions for OSK!

Dick Mihalovsky
--------
Dick Mihalovsky, Jyvaskyla, Finland.  Email: dm@itumic.Itumic.FI
adp@moscom.UUCP (Alan Percy) (10/17/90)
In article <1546@digiw.UUCP> dm@digiw.UUCP (Dick Mihalowsky) writes:
>...
>But the current Microware C compiler uses a rather complex register
>allocation algorithm, and I am still waiting for Microware to implement
><varargs.h> (or <stdarg.h>, if you insist) and a working set of
>the vprintf(3) family. Missing these, I'll take stack passing any time.

/* varargs.h for os9/68k by Robert A. Larson                             */
/* version 0 for os9/68k C 2.0  04/20/86                                 */

/* varargs is a "portable" way to write a routine that takes a variable  */
/* number of arguments.  This implementation agrees with both the 4.2bsd */
/* and Sys V documentation of varargs.  Note that just because varargs.h */
/* is used does not mean that it is used properly.                       */

/* Ignore the "expression with little effect" warnings.  (Seems to be a  */
/* compiler bug.)                                                        */

#define va_alist _va_arg1, _va_arg2, _va_arg3
#define va_dcl unsigned _va_arg1, _va_arg2, _va_arg3;

typedef struct {
    unsigned _va_at;        /* number of arguments used, 0, 1, or more */
                            /* (double as first arg counts as 2)       */
    union {
        struct {
            unsigned _va_uns1, _va_uns2;
        } _va_uuns;
        double _va_udouble;
    } _va_union;
    char *_va_pointer;
} va_list;

#define va_start(pvar) ( (pvar)._va_at = 0, \
        (pvar)._va_union._va_uuns._va_uns1 = _va_arg1, \
        (pvar)._va_union._va_uuns._va_uns2 = _va_arg2, \
        (pvar)._va_pointer = (char *) &_va_arg3 \
        )

#define va_arg(pvar, type) ( \
        ((pvar)._va_at++) ? ( \
            ((pvar)._va_at == 2) ? ( \
                (sizeof(type) == 8) ? ( \
                    *(((type *)((pvar)._va_pointer))++) \
                ) : (type)( \
                    (pvar)._va_union._va_uuns._va_uns2 \
                ) \
            ) : ( \
                *(((type *)((pvar)._va_pointer))++) \
            ) \
        ) : ( \
            (sizeof(type) == 8) ? (type)( \
                (pvar)._va_at++, \
                (pvar)._va_union._va_udouble \
            ) : (type)( \
                (pvar)._va_union._va_uuns._va_uns1 \
            ) \
        ) \
    )

#define va_end(pvar)    /* va_end is simple */

Knock yourself out....
--
A nanosecond here, a nanosecond there, next thing you know you've got real time.
Alan Percy..........................{rutgers,ames,cmcl2}!rochester!moscom!adp