steve@vsi1.UUCP (Steve Maurer) (01/21/88)
[lineater] SWARBRICK FRANCIS JOHN writes: > > #define swap(a,b) ((a) = ((b) = ((a) = (a) ^ (b)) ^ (b)) ^ (a)) > > Try that, and tell me what you think. A super-genius friend of mine figured > it out. He says it will work for anything (int, char *, etc.). But it > sure boggles my mind... > It boggles mine as well. Just from looking at it, it looks like the resulting Assembly would be a mess, even in the simplistic case of integers. However, just to try this out, I decided to compile the following section of code on our Sun3 (without optimization). #define swap(a,b) ((a) = ((b) = ((a) = (a) ^ (b)) ^ (b)) ^ (a)) #define swap2(A, B, TYPE) { TYPE temp; temp = A; A = B; B = temp; } main() { int a = 3, b = 5; swap(a, b); foo_marker(); /* marker to deleniate code sections */ swap2(a, b, int); } The resulting code (without headers or trailers) was as follows: movl a6@(-0x4),d0 <-- xor method movl a6@(-0x8),d1 eorl d1,d0 movl d0,a6@(-0x4) movl a6@(-0x8),d1 eorl d1,d0 movl d0,a6@(-0x8) movl a6@(-0x4),d1 eorl d1,d0 movl d0,a6@(-0x4) jbsr _foo_marker movl a6@(-0x4),a6@(-0xc) <-- move method movl a6@(-0x8),a6@(-0x4) movl a6@(-0xc),a6@(-0x8) The resulting code is approximately 3 times longer! This is partly due to the superflous storing of intermediate results that the compiler makes, because it isn't sure in that portion of the parse tree whether the new result is going to be used. In other words, "optimal moves" don't necessarily work in HLL's. Here is the same exclusive or swap() macro, under Sun's optimizer: eorl d1,d0 movl d0,a6@(-4) eorl d1,d0 movl d0,a6@(-8) movl a6@(-4),d1 eorl d1,d0 movl d0,a6@(-4) As you can see, some, but not all that much better. This isn't really Sun's fault, I might add. Most compilers would do just about the same, or perhaps worse. Of course, an extrodinarily optimized compiler (one in which i = 5; i = 6; would delete the i = 5; statement altogether), might approximate the original intent of the code, but I doubt it. Steve Maurer
tim@amdcad.AMD.COM (Tim Olson) (01/21/88)
In article <269@vsi1.UUCP>, steve@vsi1.UUCP (Steve Maurer) writes: | This isn't really Sun's fault, I might add. Most compilers would do | just about the same, or perhaps worse. Of course, an extrodinarily | optimized compiler (one in which i = 5; i = 6; would delete the i = 5; | statement altogether), might approximate the original intent of the | code, but I doubt it. | | Steve Maurer Well, you can stop doubting. Here is the .s file generated by the MetaWare Am29000 compiler (minus procedure prologue and epilogue): call lr0,_foo_marker As you can see, there are no swaps whatsoever. This is because they weren't needed. I then modified the source to pass 'a' and 'b' to the function 'foo_marker', so that the values were required. The resulting code is [with foo_marker(a,b)]: const lr3,3 <-- pass '3' as second parameter call lr0,_foo_marker const lr2,5 <-- pass '5' as the first (delayed branch) ( Ain't RISC machines + optimizing compilers great! ;-) -- Tim Olson Advanced Micro Devices (tim@amdcad.amd.com)
dgh%dgh@Sun.COM (David Hough) (01/22/88)
In article <269@vsi1.UUCP>, steve@vsi1.UUCP (Steve Maurer) writes: > to try this out, I decided to compile the > following section of code on our Sun3 (without optimization). > > #define swap(a,b) ((a) = ((b) = ((a) = (a) ^ (b)) ^ (b)) ^ (a)) > #define swap2(A, B, TYPE) { TYPE temp; temp = A; A = B; B = temp; } > > main() > { > int a = 3, b = 5; > > swap(a, b); > foo_marker(); /* marker to deleniate code sections */ > swap2(a, b, int); > } > > The resulting code (without headers or trailers) was as follows: > > movl a6@(-0x4),d0 <-- xor method > movl a6@(-0x8),d1 > eorl d1,d0 > movl d0,a6@(-0x4) > movl a6@(-0x8),d1 > eorl d1,d0 > movl d0,a6@(-0x8) > movl a6@(-0x4),d1 > eorl d1,d0 > movl d0,a6@(-0x4) > > jbsr _foo_marker > > movl a6@(-0x4),a6@(-0xc) <-- move method > movl a6@(-0x8),a6@(-0x4) > movl a6@(-0xc),a6@(-0x8) > > Here is the same exclusive or swap() macro, under Sun's optimizer: > > eorl d1,d0 > movl d0,a6@(-4) > eorl d1,d0 > movl d0,a6@(-8) > movl a6@(-4),d1 > eorl d1,d0 > movl d0,a6@(-4) The above reflects what's called -O1 optimization in SunOS 4.0, currently under development. I tried the example program under maximal optimization (-O4) and it optimized away to just about nothing, so I revised the program: #define swap(a,b) ((a) = ((b) = ((a) = (a) ^ (b)) ^ (b)) ^ (a)) #define swap2(A, B, TYPE) { TYPE temp; temp = A; A = B; B = temp; } main() { int a,b; getab(&a,&b); /* pretend that a and b get new values */ swap(a, b); foo_marker(a,b); /* marker to delineate code sections */ getab(&a,&b); /* pretend that a and b get new values */ swap2(a, b, int); foo_marker(a,b); /* marker to delineate code sections */ } and found: jbsr _getab addqw #8,sp movl a6@(-8),d0 ex or macro eorl d0,a6@(-4) movl a6@(-4),d0 eorl d0,a6@(-8) movl a6@(-8),d0 eorl d0,a6@(-4) movl d0,sp@- movl a6@(-4),sp@- jbsr _foo_marker ... jbsr _getab addqw #8,sp movl a6@(-4),d7 move via temp movl a6@(-8),a6@(-4) movl d7,a6@(-8) movl d7,sp@- movl a6@(-4),sp@- jbsr _foo_marker 8 instructions instead of 5, the difference being the 3 eorl's. In general, you wouldn't use the EOR method unless you were out of space for temporaries, which might be the case if you were microcoding within a small register file, but is hardly typical of C applications. David Hough ARPA: dhough@sun.com UUCP: {ucbvax,decvax,allegra,decwrl,cbosgd,ihnp4,seismo}!sun!dhough
clay@oravax.UUCP (McFarland) (01/22/88)
In article <269@vsi1.UUCP> steve@vsi1.UUCP (Steve Maurer) writes: >SWARBRICK FRANCIS JOHN writes: >> >> #define swap(a,b) ((a) = ((b) = ((a) = (a) ^ (b)) ^ (b)) ^ (a)) >> >> He says it will work for anything (int, char *, etc.). But it >> sure boggles my mind... >> > It boggles mine as well. Just from looking at it, it looks like >the resulting Assembly would be a mess, even in the simplistic case of >integers. However, just to try this out, I decided to compile the >following section of code on our Sun3 (without optimization). > (Code deleted) >The resulting code is approximately 3 times longer! This is >partly due to the superflous storing of intermediate results that >the compiler makes, because it isn't sure in that portion of the >parse tree whether the new result is going to be used. > > In other words, "optimal moves" don't necessarily work in HLL's. Or anyplace else where they don't fit. I first invented this chestnut in 1966. I'm sure there were many prior inventors; it's obvious once you think " wouldn't it be nice to store both a and b in the same place for a while". I was using it to swap registers on the IBM 360 ( it didn't need a register swap, it had General Registers :-). In that context it worked perfectly but, as Maurer showed, it's ridiculous in C. I think the definition of googol (googolplex??:-) is "the number of computing constructs that were great for what they were designed for, but are used for 20 years where they are kluges". Clay Brooke-McFarland Odyssey Research Associates --------------------------------------------------------------------- "Wait! There's something we're all overlooking!" Chorus: "What is it,| Doctor?" "I don't know. I'm overlooking it myself." | ---------------------------------------------------------------------- >> >> #define swap(a,b) ((a) = ((b) = ((a) = (a) ^ (b)) ^ (b)) ^ (a)) >> >> He says it will work for anything (int, char *, etc.). But it >> sure boggles my mind... >> > It boggles mine as well. Just from looking at it, it looks like >the resulting Assembly would be a mess, even in the simplistic case of >integers. However, just to try this out, I decided to compile the >following section of code on our Sun3 (without optimization). > (Code deleted) >The resulting code is approximately 3 times longer! This is >partly due to the superflous storing of intermediate results that >the compiler makes, because it isn't sure in that portion of the >parse tree whether the new result is going to be used. > > In other words, "optimal moves" don't necessarily work in HLL's. > Or anyplace else where they don't fit. I first invented this chestnut in 1966. I'm sure there were many prior inventors; it's obvious once you think " wouldn't it be nice to store both a and b in the same place for a while". I was using it to swap registers on the IBM 360 ( it didn't need a register swap, it had General Registers :-). In that context it worked perfectly but, as Maurer showed, it's ridiculous in C. I think the definition of googol (googolplex??:-) is "the number of computing constructs that were great for what they were designed for, but are used for 20 years where they are kluges". Clay Brooke-McFarland Odyssey Research Associates "Wait! There's something we're all overlooking!" Chorus: "What is it, Doctor?" "I don't know. I'm overlooking it myself."
hansen@mips.COM (Craig Hansen) (01/23/88)
In article <39616@sun.uucp>, dgh%dgh@Sun.COM (David Hough) writes: > The above reflects what's called -O1 optimization in SunOS 4.0, > currently under development. I tried the example program under > maximal optimization (-O4) and it optimized away to just about nothing, > so I revised the program: > > #define swap(a,b) ((a) = ((b) = ((a) = (a) ^ (b)) ^ (b)) ^ (a)) > #define swap2(A, B, TYPE) { TYPE temp; temp = A; A = B; B = temp; } > > main() > { > int a,b; > > getab(&a,&b); /* pretend that a and b get new values */ > swap(a, b); > foo_marker(a,b); /* marker to delineate code sections */ > getab(&a,&b); /* pretend that a and b get new values */ > swap2(a, b, int); > foo_marker(a,b); /* marker to delineate code sections */ > } > > and found: > > jbsr _getab > addqw #8,sp > movl a6@(-8),d0 ex or macro > eorl d0,a6@(-4) > movl a6@(-4),d0 > eorl d0,a6@(-8) > movl a6@(-8),d0 > eorl d0,a6@(-4) > movl d0,sp@- > movl a6@(-4),sp@- > jbsr _foo_marker > ... > jbsr _getab > addqw #8,sp > movl a6@(-4),d7 move via temp > movl a6@(-8),a6@(-4) > movl d7,a6@(-8) > movl d7,sp@- > movl a6@(-4),sp@- > jbsr _foo_marker > > 8 instructions instead of 5, the difference being the 3 eorl's. > In general, you wouldn't use the EOR method unless you were out of > space for temporaries, which might be the case if you were microcoding > within a small register file, but is hardly typical of C applications. Maybe a better compiler would keep this from getting out of hand. Dave Hough's version of the example turns into the following code on a MIPS R2000: (-O level optimization). main: File 'swap.c': 0: #define swap(a,b) ((a) = ((b) = ((a) = (a) ^ (b)) ^ (b)) ^ (a)) 1: #define swap2(A, B, TYPE) { TYPE temp; temp = A; A = B; B = temp; } 2: 3: main() 4: { [swap.c: 5] 0x0: 27bdffd8 addiu sp,sp,-40 [swap.c: 5] 0x4: afbf0014 sw ra,20(sp) 5: int a,b; 6: 7: getab(&a,&b); /* pretend that a and b get new values */ [swap.c: 8] 0x8: 27a40024 addiu a0,sp,36 [swap.c: 8] 0xc: 0c000000 jal getab [swap.c: 8] 0x10: 27a50020 addiu a1,sp,32 [swap.c: 8] 0x14: 8fa30024 lw v1,36(sp) [swap.c: 8] 0x18: 8fa60020 lw a2,32(sp) 8: swap(a, b); [swap.c: 9] 0x1c: 00000000 nop [swap.c: 9] 0x20: 00661826 xor v1,v1,a2 [swap.c: 9] 0x24: 00663026 xor a2,v1,a2 [swap.c: 9] 0x28: 00c31826 xor v1,a2,v1 9: foo_marker(a,b); /* marker to delineate code sections */ [swap.c: 10] 0x2c: 00602021 move a0,v1 [swap.c: 10] 0x30: afa30024 sw v1,36(sp) [swap.c: 10] 0x34: 00c02821 move a1,a2 [swap.c: 10] 0x38: 0c000000 jal foo_marker [swap.c: 10] 0x3c: afa60020 sw a2,32(sp) 10: getab(&a,&b); /* pretend that a and b get new values */ [swap.c: 11] 0x40: 27a40024 addiu a0,sp,36 [swap.c: 11] 0x44: 0c000000 jal getab [swap.c: 11] 0x48: 27a50020 addiu a1,sp,32 [swap.c: 11] 0x4c: 8fa30024 lw v1,36(sp) [swap.c: 11] 0x50: 8fa60020 lw a2,32(sp) 11: swap2(a, b, int); [swap.c: 12] 0x54: 00601021 move v0,v1 [swap.c: 12] 0x58: 00c01821 move v1,a2 [swap.c: 12] 0x5c: 00403021 move a2,v0 12: foo_marker(a,b); /* marker to delineate code sections */ [swap.c: 13] 0x60: afa60020 sw a2,32(sp) [swap.c: 13] 0x64: 00602021 move a0,v1 [swap.c: 13] 0x68: afa30024 sw v1,36(sp) [swap.c: 13] 0x6c: 0c000000 jal foo_marker [swap.c: 13] 0x70: 00402821 move a1,v0 13: } [swap.c: 14] 0x74: 8fbf0014 lw ra,20(sp) [swap.c: 14] 0x78: 27bd0028 addiu sp,sp,40 [swap.c: 14] 0x7c: 03e00008 jr ra [swap.c: 14] 0x80: 00000000 nop This brings us back to reality. The exor version uses two registers to swap the values, and the temp version uses three, and both use three instructions. The exor version is slightly slower because the first instruction uses both of the operands, and so a load-delay occurs because b is used immediately upon loading it into a register; in more typical code, this wouldn't be a factor. -- Craig Hansen Manager, Architecture Development MIPS Computer Systems, Inc. ...{ames,decwrl,prls}!mips!hansen or hansen@mips.com
eao@anumb.UUCP (e.a.olson) (01/23/88)
In article <179@oravax.UUCP> clay@oravax.odyssey.UUCP (Clay Brooke-McFarland) writes: >In article <269@vsi1.UUCP> steve@vsi1.UUCP (Steve Maurer) writes: >>SWARBRICK FRANCIS JOHN writes: >>> >>> #define swap(a,b) ((a) = ((b) = ((a) = (a) ^ (b)) ^ (b)) ^ (a)) >>> >>> >> [...] However, just to try this out, I decided to compile the >>following section of code on our Sun3 (without optimization). >> (Code deleted) >>The resulting code is approximately 3 times longer! This is >>partly due to the superflous storing of intermediate results that >>the compiler makes, [ ... ] I tried it on a 3B2, with and without optimization. In both cases the swap-macro version lost to the temp-variable version. The macro version optimized into three XORW instructions, and was only eight bytes longer than the optimized temp-var version. However, in the temp-var version the optimizer realized that it was going to be pushing the variables on the stack for a printf and decided not to bother with any moves and just pushed the variables on the stack in the order that would have been proper IF it had done the swap mvuxq!eao