[comp.lang.c] Bad optimizations

steve@vsi1.UUCP (Steve Maurer) (01/21/88)

[lineater]
SWARBRICK FRANCIS JOHN writes:
>
> #define swap(a,b) ((a) = ((b) = ((a) = (a) ^ (b)) ^ (b)) ^ (a))
>
> Try that, and tell me what you think.  A super-genius friend of mine figured
> it out.  He says it will work for anything (int, char *, etc.).  But it
> sure boggles my mind...
>
    It boggles mine as well.    Just from looking at it, it looks like
the resulting Assembly would be a mess, even in the simplistic case of
integers.    However, just to try this out, I decided to compile the
following section of code on our Sun3 (without optimization).

    #define swap(a,b) ((a) = ((b) = ((a) = (a) ^ (b)) ^ (b)) ^ (a))
    #define swap2(A, B, TYPE)  { TYPE temp; temp = A; A = B; B = temp; }

    main()
    {
    int a = 3, b = 5;

    swap(a, b);
    foo_marker();	/* marker to deleniate code sections */
    swap2(a, b, int);
    }

The resulting code (without headers or trailers) was as follows:

    movl	a6@(-0x4),d0		<-- xor method
    movl	a6@(-0x8),d1
    eorl	d1,d0
    movl	d0,a6@(-0x4)
    movl	a6@(-0x8),d1
    eorl	d1,d0
    movl	d0,a6@(-0x8)
    movl	a6@(-0x4),d1
    eorl	d1,d0
    movl	d0,a6@(-0x4)

    jbsr	_foo_marker

    movl	a6@(-0x4),a6@(-0xc)	<-- move method
    movl	a6@(-0x8),a6@(-0x4)
    movl	a6@(-0xc),a6@(-0x8)

The resulting code is approximately 3 times longer!   This is
partly due to the superflous storing of intermediate results that
the compiler makes, because it isn't sure in that portion of the
parse tree whether the new result is going to be used.

    In other words, "optimal moves" don't necessarily work in HLL's.


Here is the same exclusive or swap() macro, under Sun's optimizer:

    eorl	d1,d0
    movl	d0,a6@(-4)
    eorl	d1,d0
    movl	d0,a6@(-8)
    movl	a6@(-4),d1
    eorl	d1,d0
    movl	d0,a6@(-4)

As you can see, some, but not all that much better.

This isn't really Sun's fault, I might add.   Most compilers would do
just about the same, or perhaps worse.  Of course, an extrodinarily
optimized compiler (one in which i = 5; i = 6; would delete the i = 5;
statement altogether), might approximate the original intent of the
code, but I doubt it.

					Steve Maurer

tim@amdcad.AMD.COM (Tim Olson) (01/21/88)

In article <269@vsi1.UUCP>, steve@vsi1.UUCP (Steve Maurer) writes:
|  This isn't really Sun's fault, I might add.   Most compilers would do
|  just about the same, or perhaps worse.  Of course, an extrodinarily
|  optimized compiler (one in which i = 5; i = 6; would delete the i = 5;
|  statement altogether), might approximate the original intent of the
|  code, but I doubt it.
|  
|  					Steve Maurer

Well, you can stop doubting.  Here is the .s file generated by
the MetaWare Am29000 compiler (minus procedure prologue and epilogue):


	call	lr0,_foo_marker


As you can see, there are no swaps whatsoever.  This is because they
weren't needed.  I then modified the source to pass 'a' and 'b' to the
function 'foo_marker', so that the values were required.  The resulting
code is [with foo_marker(a,b)]:


	const	lr3,3			<-- pass '3' as second parameter
	call	lr0,_foo_marker
	const	lr2,5			<-- pass '5' as the first
					    (delayed branch)
	
( Ain't RISC machines + optimizing compilers great! ;-)

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)
	

dgh%dgh@Sun.COM (David Hough) (01/22/88)

In article <269@vsi1.UUCP>, steve@vsi1.UUCP (Steve Maurer) writes:
> to try this out, I decided to compile the
> following section of code on our Sun3 (without optimization).
> 
>     #define swap(a,b) ((a) = ((b) = ((a) = (a) ^ (b)) ^ (b)) ^ (a))
>     #define swap2(A, B, TYPE)  { TYPE temp; temp = A; A = B; B = temp; }
> 
>     main()
>     {
>     int a = 3, b = 5;
> 
>     swap(a, b);
>     foo_marker();	/* marker to deleniate code sections */
>     swap2(a, b, int);
>     }
> 
> The resulting code (without headers or trailers) was as follows:
> 
>     movl	a6@(-0x4),d0		<-- xor method
>     movl	a6@(-0x8),d1
>     eorl	d1,d0
>     movl	d0,a6@(-0x4)
>     movl	a6@(-0x8),d1
>     eorl	d1,d0
>     movl	d0,a6@(-0x8)
>     movl	a6@(-0x4),d1
>     eorl	d1,d0
>     movl	d0,a6@(-0x4)
> 
>     jbsr	_foo_marker
> 
>     movl	a6@(-0x4),a6@(-0xc)	<-- move method
>     movl	a6@(-0x8),a6@(-0x4)
>     movl	a6@(-0xc),a6@(-0x8)
> 
> Here is the same exclusive or swap() macro, under Sun's optimizer:
> 
>     eorl	d1,d0
>     movl	d0,a6@(-4)
>     eorl	d1,d0
>     movl	d0,a6@(-8)
>     movl	a6@(-4),d1
>     eorl	d1,d0
>     movl	d0,a6@(-4)

The above reflects what's called -O1 optimization in SunOS 4.0,
currently under development.  I tried the example program under
maximal optimization (-O4) and it optimized away to just about nothing,
so I revised the program:

#define swap(a,b) ((a) = ((b) = ((a) = (a) ^ (b)) ^ (b)) ^ (a))
#define swap2(A, B, TYPE)  { TYPE temp; temp = A; A = B; B = temp; }

main()
{
int a,b;

getab(&a,&b);		/* pretend that a and b get new values */
swap(a, b);
foo_marker(a,b);        /* marker to delineate code sections */
getab(&a,&b);		/* pretend that a and b get new values */
swap2(a, b, int);
foo_marker(a,b);        /* marker to delineate code sections */
}

and found:

	jbsr	_getab
	addqw	#8,sp
	movl	a6@(-8),d0			ex or macro
	eorl	d0,a6@(-4)
	movl	a6@(-4),d0
	eorl	d0,a6@(-8)
	movl	a6@(-8),d0
	eorl	d0,a6@(-4)
	movl	d0,sp@-
	movl	a6@(-4),sp@-
	jbsr	_foo_marker
...
	jbsr	_getab
	addqw	#8,sp
	movl	a6@(-4),d7			move via temp
	movl	a6@(-8),a6@(-4)
	movl	d7,a6@(-8)
	movl	d7,sp@-
	movl	a6@(-4),sp@-
	jbsr	_foo_marker

8 instructions instead of 5, the difference being the 3 eorl's.
In general, you wouldn't use the EOR method unless you were out of
space for temporaries, which might be the case if you were microcoding
within a small register file, but is hardly typical of C applications.

David Hough

ARPA: dhough@sun.com
UUCP: {ucbvax,decvax,allegra,decwrl,cbosgd,ihnp4,seismo}!sun!dhough

clay@oravax.UUCP (McFarland) (01/22/88)

In article <269@vsi1.UUCP> steve@vsi1.UUCP (Steve Maurer) writes:
>SWARBRICK FRANCIS JOHN writes:
>>
>> #define swap(a,b) ((a) = ((b) = ((a) = (a) ^ (b)) ^ (b)) ^ (a))
>>
>> He says it will work for anything (int, char *, etc.).  But it
>> sure boggles my mind...
>>
>    It boggles mine as well.    Just from looking at it, it looks like
>the resulting Assembly would be a mess, even in the simplistic case of
>integers.    However, just to try this out, I decided to compile the
>following section of code on our Sun3 (without optimization).
> (Code deleted)
>The resulting code is approximately 3 times longer!   This is
>partly due to the superflous storing of intermediate results that
>the compiler makes, because it isn't sure in that portion of the
>parse tree whether the new result is going to be used.
>
>    In other words, "optimal moves" don't necessarily work in HLL's.


	Or anyplace else where they don't fit. I first invented this
chestnut in 1966. I'm sure there were many prior inventors; it's
obvious once you think " wouldn't it be nice to store both a and b in
the same place for a while". I was using it to swap registers on the
IBM 360 ( it didn't need a register swap, it had General Registers
:-). In that context it worked perfectly but, as Maurer showed, it's
ridiculous in C. I think the definition of googol (googolplex??:-) is
"the number of computing constructs that were great for what they were
designed for, but are used for 20 years where they are kluges".


Clay Brooke-McFarland			Odyssey Research Associates


---------------------------------------------------------------------
"Wait! There's something we're all overlooking!" Chorus: "What is it,|
Doctor?" "I don't know. I'm overlooking it myself."		     |
----------------------------------------------------------------------






>>
>> #define swap(a,b) ((a) = ((b) = ((a) = (a) ^ (b)) ^ (b)) ^ (a))
>>
>> He says it will work for anything (int, char *, etc.).  But it
>> sure boggles my mind...
>>
>    It boggles mine as well.    Just from looking at it, it looks like
>the resulting Assembly would be a mess, even in the simplistic case of
>integers.    However, just to try this out, I decided to compile the
>following section of code on our Sun3 (without optimization).
> (Code deleted)
>The resulting code is approximately 3 times longer!   This is
>partly due to the superflous storing of intermediate results that
>the compiler makes, because it isn't sure in that portion of the
>parse tree whether the new result is going to be used.
>
>    In other words, "optimal moves" don't necessarily work in HLL's.
>
	Or anyplace else where they don't fit. I first invented this
chestnut in 1966. I'm sure there were many prior inventors; it's
obvious once you think " wouldn't it be nice to store both a and b in
the same place for a while". I was using it to swap registers on the
IBM 360 ( it didn't need a register swap, it had General Registers
:-). In that context it worked perfectly but, as Maurer showed, it's
ridiculous in C. I think the definition of googol (googolplex??:-) is
"the number of computing constructs that were great for what they were
designed for, but are used for 20 years where they are kluges".

Clay Brooke-McFarland			Odyssey Research Associates

"Wait! There's something we're all overlooking!" Chorus: "What is it,
Doctor?" "I don't know. I'm overlooking it myself."

hansen@mips.COM (Craig Hansen) (01/23/88)

In article <39616@sun.uucp>, dgh%dgh@Sun.COM (David Hough) writes:
> The above reflects what's called -O1 optimization in SunOS 4.0,
> currently under development.  I tried the example program under
> maximal optimization (-O4) and it optimized away to just about nothing,
> so I revised the program:
> 
> #define swap(a,b) ((a) = ((b) = ((a) = (a) ^ (b)) ^ (b)) ^ (a))
> #define swap2(A, B, TYPE)  { TYPE temp; temp = A; A = B; B = temp; }
> 
> main()
> {
> int a,b;
> 
> getab(&a,&b);		/* pretend that a and b get new values */
> swap(a, b);
> foo_marker(a,b);        /* marker to delineate code sections */
> getab(&a,&b);		/* pretend that a and b get new values */
> swap2(a, b, int);
> foo_marker(a,b);        /* marker to delineate code sections */
> }
> 
> and found:
> 
> 	jbsr	_getab
> 	addqw	#8,sp
> 	movl	a6@(-8),d0			ex or macro
> 	eorl	d0,a6@(-4)
> 	movl	a6@(-4),d0
> 	eorl	d0,a6@(-8)
> 	movl	a6@(-8),d0
> 	eorl	d0,a6@(-4)
> 	movl	d0,sp@-
> 	movl	a6@(-4),sp@-
> 	jbsr	_foo_marker
> ...
> 	jbsr	_getab
> 	addqw	#8,sp
> 	movl	a6@(-4),d7			move via temp
> 	movl	a6@(-8),a6@(-4)
> 	movl	d7,a6@(-8)
> 	movl	d7,sp@-
> 	movl	a6@(-4),sp@-
> 	jbsr	_foo_marker
> 
> 8 instructions instead of 5, the difference being the 3 eorl's.
> In general, you wouldn't use the EOR method unless you were out of
> space for temporaries, which might be the case if you were microcoding
> within a small register file, but is hardly typical of C applications.

Maybe a better compiler would keep this from getting out of hand.
Dave Hough's version of the example turns into the following code
on a MIPS R2000: (-O level optimization).

	main:
File 'swap.c':
0: #define swap(a,b) ((a) = ((b) = ((a) = (a) ^ (b)) ^ (b)) ^ (a))
1: #define swap2(A, B, TYPE)  { TYPE temp; temp = A; A = B; B = temp; }
2: 
3: main()
4: {
  [swap.c:   5] 0x0:	27bdffd8	addiu	sp,sp,-40
  [swap.c:   5] 0x4:	afbf0014	sw	ra,20(sp)
5: int a,b;
6: 
7: getab(&a,&b);		/* pretend that a and b get new values */
  [swap.c:   8] 0x8:	27a40024	addiu	a0,sp,36
  [swap.c:   8] 0xc:	0c000000	jal	getab
  [swap.c:   8] 0x10:	27a50020	addiu	a1,sp,32
  [swap.c:   8] 0x14:	8fa30024	lw	v1,36(sp)
  [swap.c:   8] 0x18:	8fa60020	lw	a2,32(sp)
8: swap(a, b);
  [swap.c:   9] 0x1c:	00000000	nop
  [swap.c:   9] 0x20:	00661826	xor	v1,v1,a2
  [swap.c:   9] 0x24:	00663026	xor	a2,v1,a2
  [swap.c:   9] 0x28:	00c31826	xor	v1,a2,v1
9: foo_marker(a,b);        /* marker to delineate code sections */
  [swap.c:  10] 0x2c:	00602021	move	a0,v1
  [swap.c:  10] 0x30:	afa30024	sw	v1,36(sp)
  [swap.c:  10] 0x34:	00c02821	move	a1,a2
  [swap.c:  10] 0x38:	0c000000	jal	foo_marker
  [swap.c:  10] 0x3c:	afa60020	sw	a2,32(sp)
10: getab(&a,&b);		/* pretend that a and b get new values */
  [swap.c:  11] 0x40:	27a40024	addiu	a0,sp,36
  [swap.c:  11] 0x44:	0c000000	jal	getab
  [swap.c:  11] 0x48:	27a50020	addiu	a1,sp,32
  [swap.c:  11] 0x4c:	8fa30024	lw	v1,36(sp)
  [swap.c:  11] 0x50:	8fa60020	lw	a2,32(sp)
11: swap2(a, b, int);
  [swap.c:  12] 0x54:	00601021	move	v0,v1
  [swap.c:  12] 0x58:	00c01821	move	v1,a2
  [swap.c:  12] 0x5c:	00403021	move	a2,v0
12: foo_marker(a,b);        /* marker to delineate code sections */
  [swap.c:  13] 0x60:	afa60020	sw	a2,32(sp)
  [swap.c:  13] 0x64:	00602021	move	a0,v1
  [swap.c:  13] 0x68:	afa30024	sw	v1,36(sp)
  [swap.c:  13] 0x6c:	0c000000	jal	foo_marker
  [swap.c:  13] 0x70:	00402821	move	a1,v0
13: }
  [swap.c:  14] 0x74:	8fbf0014	lw	ra,20(sp)
  [swap.c:  14] 0x78:	27bd0028	addiu	sp,sp,40
  [swap.c:  14] 0x7c:	03e00008	jr	ra
  [swap.c:  14] 0x80:	00000000	nop

This brings us back to reality. The exor version uses two registers to
swap the values, and the temp version uses three, and both use three
instructions. The exor version is slightly slower because the first
instruction uses both of the operands, and so a load-delay occurs
because b is used immediately upon loading it into a register; in more
typical code, this wouldn't be a factor.

-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...{ames,decwrl,prls}!mips!hansen or hansen@mips.com

eao@anumb.UUCP (e.a.olson) (01/23/88)

In article <179@oravax.UUCP> clay@oravax.odyssey.UUCP (Clay Brooke-McFarland) writes:
>In article <269@vsi1.UUCP> steve@vsi1.UUCP (Steve Maurer) writes:
>>SWARBRICK FRANCIS JOHN writes:
>>>
>>> #define swap(a,b) ((a) = ((b) = ((a) = (a) ^ (b)) ^ (b)) ^ (a))
>>>
>>>
>> [...] However, just to try this out, I decided to compile the
>>following section of code on our Sun3 (without optimization).
>> (Code deleted)
>>The resulting code is approximately 3 times longer!   This is
>>partly due to the superflous storing of intermediate results that
>>the compiler makes, [ ... ]

    I tried it on a 3B2, with and without optimization.
    In both cases the swap-macro version lost to the
    temp-variable version.  The macro version optimized
    into three XORW instructions, and was only eight
    bytes longer than the optimized temp-var version.

    However, in the temp-var version the optimizer realized that
    it was going to be pushing the variables on the stack for
    a printf and decided not to bother with any moves and
    just pushed the variables on the stack in the order that
    would have been proper IF it had done the swap


					mvuxq!eao