[comp.os.os9] Passing parameters by registers is bad use?!?!?

krischan@truth.informatik.rwth-aachen.de (Christian Engel) (09/22/90)

Hei, Kim Kempf and all netlanders!

For more than a year now I've been wondering why the Microware
C compiler (usually) passes the first two arguments of a function call
in the registers D0 and D1. I've discussed this question with
several friends experienced in compiler construction, but the more we
think about the advantages of this strategy, the fewer we see.

Let me explain the two most important reasons why we can't see any
advantage:

* The registers D0 and D1 are used as temporaries. Consequently, the
  parameters passed in those registers have to be moved to another,
  safe place by the called function.

* On the other hand, unless a parameter is declared with storage class
  register, it can't be held in a register (D0, D1 or any other), since
  the called function could take its address with the unary & operator.
  Only if the parameter isn't of class register and the compiler
  detects that the address operator is never applied to it inside the
  function's block is the compiler allowed to decide on its own to keep
  the variable in a register (see the small example below).  But I think
  advanced features like automatically promoting a variable to class
  register aren't implemented in the Microware C compiler.
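
To make the second point concrete, here is a minimal C sketch (my own
example, not taken from the Microware documentation or from compiler
output): a parameter whose address is taken must live in memory, so a
value arriving in D0 has to be spilled into the stack frame, while a
register-class parameter that is never address-taken may legally stay
in a register.

/* Minimal sketch (my example): why the & operator matters.  Under a
 * convention that delivers the first argument in D0, "a" below still
 * needs a real stack slot because its address is taken; "b" never has
 * its address taken and is declared register, so it could stay in a
 * register for its whole lifetime. */

void store_42(p)
int *p;
{
	*p = 42;
}

int needs_memory(a)		/* arrives in D0, but ...               */
int a;
{
	store_42(&a);		/* ... &a forces a spill to the frame   */
	return a;
}

int stays_in_register(b)	/* arrives in D0 and may stay there     */
register int b;
{
	return b + 1;		/* no &b anywhere in the body           */
}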

This leads to the conclusion that, first of all, a function has to
move the parameters passed in D0/D1 to a location on the stack.  And
that's what the Microware C compiler does -- tell me if I'm wrong.
This means the run time system has to execute *two* move operations
for every parameter passed in registers: first the caller loads it
into a register, and second the called function moves it to its local
stack area.  Parameters not passed in D0/D1 are pushed onto the stack
by the caller directly.  Main conclusion: passing parameters by
register is slower than passing by stack.

Exception: a compiler with sophisticated optimizations is able to
recognize a parameter that can truly be passed in a register, i.e. a
parameter that is placed in a register that is free, that won't be
used as a temporary, and whose address is never taken with the unary
& operator inside the function body.  And one more thing: every caller
has to place the parameters in the correct registers, so at the point
where the code for the call is generated the compiler must already
know about the code of the called function.
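
A hypothetical case (my example, not something I know the Microware
compiler to do) where that interprocedural knowledge is available: a
static function whose body is visible at every call site in the same
file and whose parameter is never address-taken, so an optimizing
compiler could in principle pick a register for it that the callers do
not need as a temporary.

/* Hypothetical example (mine): "scale" is static, so the compiler can
 * see its body at every call site in this file.  The parameter x is
 * never address-taken, so a sufficiently clever compiler could keep it
 * in one agreed-upon register from the call through the return. */

static int scale(x)
register int x;
{
	return 3 * x;		/* x never needs a memory location */
}

int twice_scaled(n)
int n;
{
	return scale(n) + scale(n + 1);
}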

Alright, netlanders, tell me if I'm wrong.  But if I'm not, I would
suggest that you, Kim Kempf and other compiler builders, forget about
passing parameters by registers!!!

Krischan.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
krischan@informatik.rwth-aachen.de *** mcvax!unido!rwthinf!strange!krischan
Christian Engel, Lehrstuhl fuer Informatik I, RWTH Aachen
Ahornstr. 55, D-5100 Aachen, W. Germany

rbt@cernvax.UUCP (roberto divia) (09/27/90)

Sorry, I tend to disagree. It is true that allocating two registers for
parameter passing means some overhead in the executable code, but on the
other hand it is *MUCH* faster to do things this way. I did some tests
on a MC68030, and the same routine called with parameters in registers
was 10-15% faster than the equivalent called with parameters on the
stack. This comes down to two main reasons:

1) the stack push/pop/stackPointerRearrangement takes time;
2) the bigger the data area used by the code, the easier it is to get a
   cache miss (in the case of processors with data caches).

The Microware convention of "caller saves" reduces the overhead of register
swapping/saving to the minimum necessary, so...
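
For anyone who wants to repeat this kind of measurement, here is a
minimal sketch (not Roberto's actual benchmark): the calling convention
is a property of the compiler, so the same source is compiled once with
the register-passing compiler and once with a stack-passing one, and
the reported times are compared.

/* Minimal sketch of such a test (assumptions: a library that provides
 * clock() in <time.h>; the 10-15% figure above is Roberto's, not a
 * result of this program).  Compile the same file with both compilers
 * and compare the tick counts; the trivial callee makes the call and
 * return overhead dominate. */
#include <stdio.h>
#include <time.h>

long add2(a, b)
long a, b;
{
	return a + b;
}

int main()
{
	long i, sum = 0;
	clock_t start = clock();

	for (i = 0; i < 1000000L; i++)
		sum += add2(i, 5L);	/* mostly call/return overhead */

	printf("sum = %ld, ticks = %ld\n", sum, (long)(clock() - start));
	return 0;
}
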
-- 
|   Roberto Divia`      | Love at  first sight  is one of the greatest |
|   =============       | labor-saving devices the world has ever seen |
|  CERN :  European Laboratory for Particle Physics,  1211  Geneva  23 |
|  Switzerland (CH)                                                    |

kim@mcrware.UUCP (Kim Kempf) (09/29/90)

In article <2820@cernvax.UUCP> rbt@cernvax.UUCP (roberto divia) writes:
>Sorry, I tend to disagree. It is true that allocating two registers for
>parameter passing means some overhead in the executable code, but on the
>other hand it is *MUCH* faster to do things this way. I did some tests
>on a MC68030, and the same routine called with parameters in registers
>was 10-15% faster than the equivalent called with parameters on the
>stack. This comes down to two main reasons:
>
>1) the stack push/pop/stackPointerRearrangement takes time;
>2) the bigger the data area used by the code, the easier it is to get a
>   cache miss (in the case of processors with data caches).
>

Couldn't have stated it better myself, Roberto! The following is an
example, albeit a trivial one, of the savings that add up quickly in
non-trivial programs:

The example C program:

main()
{
	register int i,j;
	
	joe(1,2);
	for (i=1; i < 10; i++) {
		joe(i,5);
	}
}


The OS-9 C code in assembler.  Notice the 2-byte argument loads and
the absence of the stack cleanup after the function call.

00001                      psect                   
00002                      ttl      main           
00003 0000 4e55 main:      link     a5,#0          
           0000
00004 0004 48e7            movem.l  #_1!1,-(sp)    
           cc00
00005 0008 203c            move.l   #_3,d0         :6
           ffffffbc
00006 000e=6100            bsr      _stkcheck      
           0000
00007 0012 7202            moveq.l  #2,d1          :2
00008 0014 7001            moveq.l  #1,d0          :2
00009 0016=6100            bsr      joe            
           0000
00010 001a 7801            moveq.l  #1,d4          :2
00011 001c 6000            bra      _7             
           000c
00012           _5                                 
00013 0020 7205            moveq.l  #5,d1          :2
00014 0022 2004            move.l   d4,d0          :2
00015 0024=6100            bsr      joe            
           0000
00016           _8                                 
00017 0028 5284            addq.l   #1,d4          :2
00018           _7                                 
00019 002a 700a            moveq.l  #10,d0         :2
00020 002c b084            cmp.l    d4,d0          :2
00021 002e 6e00            bgt      _5             
           fff0
00022           _6                                 
00023           _4                                 
00024 0032 4ced            movem.l  -12(a5),#_1    
           0032fff4
00025 0038 4e5d            unlk     a5             
00026 003a 4e75            rts                     :2
00027  ffffffbc _3         equ      0xffffffbc     :0
00028  00000032 _1         equ      0x00000032     :0
00029  00000018 _2         equ      0x00000018     :0
00030  0000003c            ends                    


Now the code from the SunOS 4.0 UNIX compiler.  Notice that the
corresponding immediate argument loads take *6 bytes*, plus the extra
2 bytes for the addq that cleans up the stack after each function call.


00001                      psect                   
00002           _main:                             
00003 0000 4e56            link     a6,#0          
           0000
00004 0004 dffc            add.l    #-4,sp         
           fffffffc
00005 000a 48d7            movem.l  #0x0,(sp)      
           0000
00006 000e 4879            pea      0x2            
           00000002
00007 0014 4879            pea      0x1            
           00000001
00008 001a=6100            bsr      _joe           
           0000
00009 001e 504f            addq.w   #0x8,sp        
00010 0020 7e01            moveq    #0x1,d7        
00011           L17:                               
00012 0022 0c87            cmpi.l   #0xa,d7        
           0000000a
00013 0028 6c14            bge.s    L16            
00014 002a 4879            pea      0x5            
           00000005
00015 
00016 0030 2f07            move.l   d7,-(sp)       
00017 0032=6100            bsr      _joe           
           0000
00018 0036 508f            addq.l   #0x8,sp        
00019           L15:                               
00020 0038 5287            addq.l   #0x1,d7        
00021 003a 6000            bra      L17            
           ffe6
00022           L16:                               
00023 003e 7000            moveq.l  #0,d0          
00024           LE12:                              
00025 0040 4cee            movem.l  4(a6),#0x80    
           00800004
00026 0046 4e5e            unlk     a6             
00027 0048 4e75            rts                     


Which is smaller and faster!


-- 
----------------
Kim Kempf, Microware Systems Corporation	{sun,uunet}!mcrware!kim
      "Opinions!  We doan need no stinkink opinions!"

krischan@truth.informatik.rwth-aachen.de (Christian Engel) (09/30/90)

In article <2820@cernvax.UUCP> rbt@cernvax.UUCP (roberto divia) writes:
>1) the stack push/pop/stackPointerRearrangement takes time;

In article <3117@mcrware.UUCP> kim@mcrware.UUCP (Kim Kempf) writes:
>The OS-9 C code in assembler.  Notice the 2-byte argument loads and
>the absence of the stack cleanup after the function call.
> [...]
>Which is smaller and faster!

Alright, I've got it. But you forgot to tell me about the *called*
function! However, I think I've got the explanation: sometimes you
need an additional move operation to save the parameters passed via
d0/d1. This saving is done by the movem operation at the beginning of
every function, which also saves all the registers used for local
variables of storage class register. Only if you pass parameters and
there are no register variables inside the function is the movem
required solely for the d0/d1 saving; in that case you lose a little
time and code compared with passing the parameters via the stack.
I haven't made up my mind yet about what comes out in the end when you
compare both models of parameter passing, but after considering your
answers and the explanations above, I think that statistically the
passing via registers will win clearly.

Am I right?

In article <2820@cernvax.UUCP> rbt@cernvax.UUCP (roberto divia) writes:
>2) the bigger the data area used by the code, the easier it is to get a
>   cache miss (in the case of processors with data caches).

Now, that one is over my head. Why should the chance of a cache miss
increase when passing parameters via the stack? The first two
parameters need space in the stack frame of the called function
anyway. That means there is no difference! The only difference is
that passing parameters via the stack moves them into the stack frame
*before* entering the called function (done by the caller), while
passing parameters via registers d0/d1 moves them into the stack frame
after entering the called function (done by the callee).

krischan

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
krischan@informatik.rwth-aachen.de *** mcvax!unido!rwthinf!strange!krischan
Christian Engel, Lehrstuhl fuer Informatik I, RWTH Aachen
Ahornstr. 55, D-5100 Aachen, W. Germany

rbt@cernvax.UUCP (roberto divia) (09/30/90)

In article <3552@rwthinf.UUCP> krischan@truth.UUCP (Christian Engel) writes:
>In article <2820@cernvax.UUCP> rbt@cernvax.UUCP (roberto divia) writes:
>>2) the bigger the data area used by the code, the easier it is to get a
>>   cache miss (in the case of processors with data caches).
>Now, that one is over my head. Why should the chance of a cache miss
>increase when passing parameters via the stack?

Simply because they will HAVE to be stored in the data cache (this is done
during the PUSH), removing other lines from the cache itself, lines that
will then have to be reloaded when they are referenced again. The data cache
normally works as a function of the "region" of data used by the program.
Stack references fall outside this region, increasing the number of cache
misses...
-- 
|   Roberto Divia`      | Love at  first sight  is one of the greatest |
|   =============       | labor-saving devices the world has ever seen |
|  CERN :  European Laboratory for Particle Physics,  1211  Geneva  23 |
|  Switzerland (CH)                                                    |

krischan@strange.informatik.rwth-aachen.de (Christian Engel) (10/01/90)

In article <2823@cernvax.UUCP> rbt@cernvax.UUCP (roberto divia) writes:
>In article <3552@rwthinf.UUCP> krischan@truth.UUCP (Christian Engel) writes:
>>In article <2820@cernvax.UUCP> rbt@cernvax.UUCP (roberto divia) writes:
>>>2) the bigger the data area used by the code, the easier it is to get a
>>>   cache miss (in the case of processors with data caches).
>>Now, that one is over my head. Why should the chance of a cache miss
>>increase when passing parameters via the stack?
>
>Simply because they will HAVE to be stored in the data cache (this is done
>during the PUSH), removing other lines from the cache itself, lines that
>will then have to be reloaded when they are referenced again. The data cache
>normally works as a function of the "region" of data used by the program.
>Stack references fall outside this region, increasing the number of cache
>misses...

Again: you need the same amount of stack storage area in either model.
In every case, except for parameters of storage class register, space
is required in the stack frame of the called function for the
parameters. I think it makes no difference whether the parameters are
moved to the stack before entering the function or afterwards. The
only code executed in the meantime is the bsr and the link
instruction. If you get a cache miss in one model you will get it in
the other, too, won't you?

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
krischan@informatik.rwth-aachen.de *** mcvax!unido!rwthinf!strange!krischan
Christian Engel, Lehrstuhl fuer Informatik I, RWTH Aachen
Ahornstr. 55, D-5100 Aachen, W. Germany

krischan@strange.informatik.rwth-aachen.de (Christian Engel) (10/09/90)

Hi!

Some more discussion about the chance of cache misses:

In article <3552@rwthinf.UUCP> I write:
>In article <2820@cernvax.UUCP> rbt@cernvax.UUCP (roberto divia) writes:
>>2) the bigger the data area used by the code, the easier it is to get a
>>   cache miss (in the case of processors with data caches).
>
>Now, that one is over my head. Why should the chance of a cache miss
>increase when passing parameters via the stack? The first two
>parameters need space in the stack frame of the called function
>anyway. That means there is no difference! The only difference is
>that passing parameters via the stack moves them into the stack frame
>*before* entering the called function (done by the caller), while
>passing parameters via registers d0/d1 moves them into the stack frame
>after entering the called function (done by the callee).

In article <3555@rwthinf.UUCP> I write:
>In article <2823@cernvax.UUCP> rbt@cernvax.UUCP (roberto divia) writes:
>>Simply because they will HAVE to be stored in the data cache (this is done
>>during the PUSH), removing other lines from the cache itself, lines that
>>will then have to be reloaded when they are referenced again. The data cache
>>normally works as a function of the "region" of data used by the program.
>>Stack references fall outside this region, increasing the number of cache
>>misses...
>
>Again: you need the same amount of stack storage area in either model.
>In every case, except for parameters of storage class register, space
>is required in the stack frame of the called function for the
>parameters. I think it makes no difference whether the parameters are
>moved to the stack before entering the function or afterwards. The
>only code executed in the meantime is the bsr and the link
>instruction. If you get a cache miss in one model you will get it in
>the other, too, won't you?

Now some more aspects:

Compare the stack frames of the two models, passing parameters by stack
and passing parameters by registers:

----------------------------------------------------------------------

Model of stack frame if passing all parameters by *stack*:

high                  |                 |
                      |-----------------|
                      | parameters      |  \
  pushed by caller    | 1, 2, ... in    |   |
                      | reverse order   |   |
                      |-----------------|   |
  pushed by bsr       | return address  |   |
                      |-----------------|   > locally accessed by the function
                    / | old link reg.   |   |
                   |  |-----------------|   |
  built by link    <  | space           |   |
                   |  | for local       |   |
                    \ | variables       |  /
                      |-----------------|
                    / | registers saved |
                   |  | to give space   |
  by movem         <  | for local       |
                   |  | register        |
                    \ | variables       |
                      |-----------------|  <--- Top of stack
low                   |                 |

----------------------------------------------------------------------

Model of stack frame if passing parameters 1 and 2 via *registers* D0/D1:

high                  |                 |
                      |-----------------|
                      | parameters      |  \
  pushed by caller    | 3, 4, ... in    |   |
                      | reverse order   |   |
                      |-----------------|   |
  pushed by bsr       | return address  |   |
                      |-----------------|   |
                    / | old link reg.   |   |
                   |  |-----------------|   |
  built by link    <  | space           |   |
                   |  | for local       |   > locally accessed by the function
                    \ | variables       |   |
                      |-----------------|   |
                    / | registers saved |   |
                   |  | to give space   |   |
                   |  | for local       |   |
  by movem         <  | register        |   |
                   |  | variables       |   |
                   |  |-----------------|   |
                    \ | parameters 1, 2 |   /
                      |-----------------|  <--- Top of stack
low                   |                 |

----------------------------------------------------------------------

The probability of a cache miss rises as the area of accessed memory
grows. If you compare both models you will see that the accessed area
is larger for the model of passing all parameters via the stack if and
only if there are one or two parameters and no local register
variables. It looks like this:

----------------------------------------------------------------------

Model of stack frame if passing all parameters by *stack*:

high                  |                 |
                      |-----------------|
                      | parameters      |  \
  pushed by caller    | 1, 2 in         |   |
                      | reverse order   |   |
                      |-----------------|   |
  pushed by bsr       | return address  |   |
                      |-----------------|   > locally accessed by the function
                    / | old link reg.   |   |
                   |  |-----------------|   |
  built by link    <  | space           |   |
                   |  | for local       |   |
                    \ | variables       |  /
                      |-----------------|  <--- Top of stack
low                   |                 |

----------------------------------------------------------------------

Model of stack frame if passing parameters 1 and 2 via *registers* D0/D1:

high                  |                 |
                      |-----------------|
  pushed by bsr       | return address  |
                      |-----------------|
                    / | old link reg.   |
                   |  |-----------------|
  built by link    <  | space           |  \
                   |  | for local       |   |
                    \ | variables       |   > locally accessed by the function
                      |-----------------|   |
  by movem            | parameters 1, 2 |  /
                      |-----------------|  <--- Top of stack
low                   |                 |

----------------------------------------------------------------------

The difference between the sizes of the accessed areas is 8 bytes.
Now we can discuss how high the probability is that this situation
occurs. I think the probability of passing exactly one or two
parameters is very high, which gives an advantage to passing by
registers. But this advantage shrinks with every register that has to
be freed for local register variables: it is lost with two register
variables and becomes a disadvantage with more than two. This
probability depends on your programming style (how often do you use
local variables of storage class register?) and on the compiler used
(does it perform an optimization that changes local variables from
storage class automatic to register?).

Krischan

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
krischan@informatik.rwth-aachen.de *** mcvax!unido!rwthinf!strange!krischan
Christian Engel, Lehrstuhl fuer Informatik I, RWTH Aachen
Ahornstr. 55, D-5100 Aachen, W. Germany

rbt@cernvax.UUCP (roberto divia) (10/11/90)

This might go on forever... I have looked at some of my code (mostly written
for real-time applications) and noticed how often I have used routines with
one or two input parameters plus a return value. I also did some tests where
the same routine was called with parameters on the stack and with parameters
in registers, and found that the second case is MUCH faster. My tests were
executed on a MC68030 (program and data caches). The simple fact of passing
one or two of 6 (or more) parameters in registers reduced the execution time
by 10 to 20%. To see why, I tried different combinations. In particular, when
I turned OFF the data cache the speed gain was much smaller (2-3%). This is
because the data region used by the program was somehow "extended" by the use
(or rather the abuse) of the data stack during subroutine calls. Try it and
see...
-- 
|   Roberto Divia`      | Love at  first sight  is one of the greatest |
|   =============       | labor-saving devices the world has ever seen |
|  CERN :  European Laboratory for Particle Physics,  1211  Geneva  23 |
|  Switzerland (CH)                                                    |

dm@digiw.UUCP (Dick Mihalowsky) (10/13/90)

In article <3117@mcrware.UUCP> kim@mcrware.UUCP (Kim Kempf) writes:
>Couldn't have stated it better myself, Roberto! The following is an
>example, albeit trival, of the savings that adds up quickly on
>non-trivial programs:
[example deleted]

It is undeniably faster to use registers for passing parameters.
But the current Microware C compiler uses a rather complex register
allocation algorithm, and I am still waiting for Microware to implement
<varargs.h> (or <stdarg.h> if you insist) and a working set of
the vprintf(3) family.  Lacking these, I'll take stack passing any time.
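
For what it's worth, this is roughly what the missing pieces would buy.
A standard-C sketch of <stdarg.h> plus vfprintf(3) usage (not Microware
code), showing that without vfprintf there is no clean way to hand the
variable part of an argument list on to printf-style formatting:

/* Standard-C sketch (not Microware's library): a printf-like wrapper
 * simply forwards its variable arguments to vfprintf(). */
#include <stdio.h>
#include <stdarg.h>

void error(const char *fmt, ...)
{
	va_list ap;

	fprintf(stderr, "error: ");
	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);	/* pass the variable part along */
	va_end(ap);
	fprintf(stderr, "\n");
}

/* e.g.  error("cannot open %s (unit %d)", name, unit); */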

Still hacking on variable argument functions for OSK!

	Dick Mihalovsky

--------
Dick Mihalovsky, Jyvaskyla, Finland.
Email: dm@itumic.Itumic.FI

adp@moscom.UUCP (Alan Percy) (10/17/90)

In article <1546@digiw.UUCP> dm@digiw.UUCP (Dick Mihalowsky) writes:
>...
>But the current Microware C compiler uses a rather complex register
>allocation algorithm, and I am still waiting for Microware to implement
><varargs.h> (or <stdarg.h> if you insist) and a working set of
>the vprintf(3) family.  Lacking these, I'll take stack passing any time.

/* varargs.h for os9/68k by Robert A. Larson	*/
/* version 0 for os9/68k C 2.0 04/20/86		*/
/* varargs is a "portable" way to write a routine that takes a variable */
/* number of arguments.  This implementation agrees with both the 4.2bsd*/
/* and Sys V documentation of varargs.  Note that just because varargs.h*/
/* is used does not mean that it is used properly.			*/
/* Ignore the "expression with little effect" warnings.  (Seems to be a */
/* compiler bug.)							*/

#define va_alist	_va_arg1, _va_arg2, _va_arg3

#define va_dcl		unsigned _va_arg1, _va_arg2, _va_arg3;

typedef	struct {
	  unsigned _va_at;  /* number of arguments used, 0, 1, or more  */
                            /* (double as first arg counts as 2)        */
	  union {
	    struct {
	      unsigned _va_uns1, _va_uns2;
	    } _va_uuns;
	    double _va_udouble;
	  } _va_union;
	  char *_va_pointer;
        } va_list;

#define	va_start(pvar)	( (pvar)._va_at = 0,				\
			  (pvar)._va_union._va_uuns._va_uns1 = _va_arg1,\
			  (pvar)._va_union._va_uuns._va_uns2 = _va_arg2,\
			  (pvar)._va_pointer = (char *) &_va_arg3	\
			)

#define va_arg(pvar, type)	(				\
	((pvar)._va_at++) ? (					\
		((pvar)._va_at == 2) ? (			\
			(sizeof(type) == 8) ? (			\
				*(((type *)((pvar)._va_pointer))++)	\
			) : (type)(				\
				(pvar)._va_union._va_uuns._va_uns2	\
			)					\
		) : (						\
			*(((type *)((pvar)._va_pointer))++)	\
		)						\
	) : (							\
		(sizeof(type) == 8) ? (type)(			\
			(pvar)._va_at++,			\
			(pvar)._va_union._va_udouble		\
		) : (type)(					\
			(pvar)._va_union._va_uuns._va_uns1	\
		)						\
	)							\
)

#define va_end(pvar)				/* va_end is simple */
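
And a minimal usage sketch (mine, not part of Robert Larson's header),
assuming the file above is installed as <varargs.h>: the classic
varargs idiom where the first argument tells the routine how many ints
follow.

/* Usage sketch for the varargs.h above (my example): sum a counted
 * list of ints, e.g. sum_ints(3, 10, 20, 30) should return 60. */
#include <varargs.h>

int sum_ints(va_alist)
va_dcl
{
	va_list ap;
	int n, i, total;

	va_start(ap);
	n = va_arg(ap, int);		/* first argument: the count */
	total = 0;
	for (i = 0; i < n; i++)
		total += va_arg(ap, int);	/* the n values that follow */
	va_end(ap);
	return total;
}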



Knock yourself out....

-- 
A nanosecond here, a nanosecond there, 
		next thing you know you've got real time.
Alan Percy..........................{rutgers,ames,cmcl2}!rochester!moscom!adp