pokey@well.UUCP (Jef Poskanzer) (06/05/89)
SUMMARY
-------

I compiled the following different kinds of for loops:

	for ( i = 0; i < COUNT; i++ )
	for ( i = 0; i < COUNT; ++i )
	for ( i = 0; ++i <= COUNT; )
	for ( i = 0; i++ < COUNT; )
	for ( i = COUNT; i > 0; i-- )
	for ( i = COUNT; i > 0; --i )
	for ( i = COUNT; --i >= 0; )
	for ( i = COUNT; i-- > 0; )

on a Sun 3/50 with both SunOS 3.5 cc and gcc 1.35, and looked at the
generated code and the timings.  COUNT was a small (< 127) compile-time
constant.  The loop body did not reference i and had no side-effects.

In theory, all eight of these loops should have optimized to the most
efficient loop possible, ignoring the otherwise unreferenced variable i
and simply traversing the loop body the proper number of times.  On the
68020, this most efficient loop is a dbra instruction.  But in fact, cc
never generates dbra, and gcc generates it for only one of the above
loop types, and only when the -fstrength-reduce flag is specified.

In contrast, the BSD cc on a vax generates a single-instruction loop in
four out of the eight cases, including the two most commonly used ones.
This is not because the vax compiler is any smarter -- the fact that it
wins only half the time, instead of all the time, shows that it too is
failing to recognize that it can freely optimize the loop.  The only
reason the vax's cc does better is that the vax has a richer instruction
set.

Since the loop overhead for dbra is 1.7 times lower than for the code
from the common for loops, it is useful to know how to provoke your
compiler into generating it.  But it would be more useful for compilers
to be smarter.  And more generally, those of you who blithely assume
(as I used to) that compilers are smart enough to turn clearly-expressed
C into optimal assembler should perhaps look at your assembler output
more often.
FULL DATA
---------

Case 1)  First, let's look at the generic loops, counting up from zero
to the limit:

	for ( i = 0; i < COUNT; i++ )
	for ( i = 0; i < COUNT; ++i )

    cc -O                       gcc -O & gcc -O -fstrength-reduce
	moveq #0,d6                 clrl d0
    tag:                        tag:
	<loop body>                 <loop body>
	addql #1,d6                 addql #1,d0
	moveq #COUNT,d1             moveq #COUNT-1,d3
	cmpl d1,d6                  cmpl d0,d3
	jlt tag                     jge tag

    0.97 usecs                  0.97 usecs

The two compilers generate essentially the same instructions.  Pre or
post increment doesn't make any difference.  Note that the moveq of
COUNT could be moved above the loop -- the loop body doesn't call any
subroutines that might smash it.  This looks like a missed optimization.

Case 2a)  Second, let's try doing the increment and test in one C
statement.  Seems to me this shouldn't make any difference to a smart
optimizer, but...

	for ( i = 0; ++i <= COUNT; )

    cc -O                       gcc -O & gcc -O -fstrength-reduce
	moveq #0,d6                 clrl d0
	jra tag2                    jra tag2
    tag1:                       tag1:
	<loop body>                 <loop body>
    tag2:                       tag2:
	addql #1,d6                 addql #1,d0
	moveq #COUNT,d1             moveq #COUNT,d3
	cmpl d1,d6                  cmpl d0,d3
	jle tag1                    jge tag1

    0.97 usecs                  0.97 usecs

No important difference from case 1.

Case 2b)  But if we try the post increment version:

	for ( i = 0; i++ < COUNT; )

    cc -O                       gcc -O & gcc -O -fstrength-reduce
	moveq #0,d6                 clrl d0
	jra tag2                    jra tag2
    tag1:                       tag1:
	<loop body>                 <loop body>
    tag2:                       tag2:
	movl d6,d0                  addql #1,d0
	addql #1,d6                 moveq #COUNT+1,d3
	moveq #COUNT,d1             cmpl d0,d3
	cmpl d1,d0                  jgt tag1
	jlt tag1

    1.11 usecs                  0.97 usecs

Gcc is smart enough to bias the loop count, but cc doesn't see that
optimization and so adds an extra instruction.

Case 3)  Third, let's try a count-down loop:

	for ( i = COUNT; i > 0; i-- )
	for ( i = COUNT; i > 0; --i )

    cc -O                       gcc -O & gcc -O -fstrength-reduce
	moveq #COUNT,d6             moveq #COUNT,d0
    tag:                        tag:
	<loop body>                 <loop body>
	subql #1,d6                 subql #1,d0
	tstl d6                     tstl d0
	jgt tag                     jgt tag

    0.83 usecs                  0.83 usecs

Here the two compilers generate exactly the same instructions.
There is one less instruction than in the count-up case, because the
compilers are smart enough to use the tstl instruction to compare
against zero, and so they don't need the moveq #COUNT instruction
(which shouldn't have been in the loop in the first place).

Case 4a)  Fourth, let's try a count-down loop with the decrement
included in the test statement:

	for ( i = COUNT; --i >= 0; )

    cc -O               gcc -O              gcc -O -fstrength-reduce
	moveq #COUNT,d6     moveq #COUNT,d0     moveq #COUNT,d0
	jra tag2            jra tag2            jra tag2
    tag1:               tag1:               tag1:
	<loop body>         <loop body>         <loop body>
    tag2:               tag2:               tag2:
	subql #1,d6         subql #1,d0         dbra d0,tag1
	jge tag1            jpl tag1            clrw d0
	                                        subql #1,d0
	                                        jcc tag1

    0.70 usecs          0.70 usecs          0.57 usecs

Cc and gcc generate code similar to case 3, except that they suddenly
decide that subql sets the condition codes just fine by itself, and a
separate tstl is not necessary.  So they shave off an instruction.
Seems like a peephole optimizer should have picked this up in the
previous cases too; it's another missed optimization.

But the big win here is with the -fstrength-reduce option in gcc.  It
actually generates the optimal one-instruction loop, for a factor of
1.7 reduction in loop overhead from the generic loop.  Not bad.  But
wait!  What's that chud after the loop?  Let's see, clear d0 to zero,
subtract one from it giving -1 and setting carry, and jump if carry is
clear.  Hmm, looks like a three-instruction no-op to me!  I guess gcc
wants to leave the loop index with the right value, and isn't smart
enough to notice it isn't used.  (But why not simply moveq the final
value?)  Oh well, since it's not inside the loop it doesn't make much
difference.
Case 4b)  But if we try this with post decrement:

	for ( i = COUNT; i-- > 0; )

    cc -O                       gcc -O & gcc -O -fstrength-reduce
	moveq #COUNT,d6             moveq #COUNT,d0
	jra tag2                    jra tag2
    tag1:                       tag1:
	<loop body>                 <loop body>
    tag2:                       tag2:
	movl d6,d0                  subql #1,d0
	subql #1,d6                 moveq #-1,d3
	tstl d0                     cmpl d0,d3
	jgt tag1                    jlt tag1

    0.97 usecs                  0.97 usecs

Cc not only adds in the movl to save the undecremented value of i, it
also somehow fails to realize that subql sets the condition codes, so
the tstl is back.  And gcc gets totally confused and brings in a
literal -1.

CONCLUSION
----------

For both compilers and all levels of optimization, this loop:

	for ( i = COUNT; --i >= 0; )

gives the lowest overhead.  Using gcc -fstrength-reduce, this low
overhead means the optimal single-instruction loop; but even for cc
and for gcc without -fstrength-reduce, this loop is faster than the
others.  I don't show the results here, but this is also true for
large constant loops and for variable-length loops.

It is unfortunate that we have to use such a non-intuitive loop to get
the best results -- especially unfortunate since it is probably *not*
the best loop on some other architectures or compilers -- but that's
the facts.  I would be interested to see similar checks done on
different architectures; in particular RISC machines, which supposedly
are designed in cooperation with the compiler writers.
---
Jef

    Jef Poskanzer  {ucbvax, lll-crg, sun!pacbell, apple, hplabs}!well!pokey
  "Man invented language to satisfy his deep need to complain." -- Lily Tomlin
jima@hplsla.HP.COM (Jim Adcock) (06/06/89)
Unfortunately, the "favorite loop construct" is going to be very dependent on
the particular compiler used, and exactly *what* you do inside the loop.
Below are your examples, redone with a null loop body, plus *my* favorite
loop construct: for(i=COUNT; i--;). I used gcc and hp's cc 6.5 compiler
with -O -S and in gnu -O -S -fstrength-reduce -fcombine-regs -fforce-mem
-fforce-addr. The point is that these things are very unpredictable,
and optimizing compilers work well on linear code segments, not over
branch points in programs [including loops]. The "optimizing" compilers
I've seen do about 20% better than non-optimizing compilers, but are still
far from "optimal" by our human standards.
#define COUNT 100

main()
{
    int i;

    for ( i = 0; i < COUNT; i++ );
    xxxxxxxxxxxxxxxxxxx();
    for ( i = 0; i < COUNT; ++i );
    xxxxxxxxxxxxxxxxxxx();
    for ( i = 0; ++i <= COUNT; );
    xxxxxxxxxxxxxxxxxxx();
    for ( i = 0; i++ < COUNT; );
    xxxxxxxxxxxxxxxxxxx();
    for ( i = COUNT; i > 0; i-- );
    xxxxxxxxxxxxxxxxxxx();
    for ( i = COUNT; i > 0; --i );
    xxxxxxxxxxxxxxxxxxx();
    for ( i = COUNT; --i >= 0; );
    xxxxxxxxxxxxxxxxxxx();
    for ( i = COUNT; i-- > 0; );
    xxxxxxxxxxxxxxxxxxx();
    for ( i = COUNT; i--; );
    xxxxxxxxxxxxxxxxxxx();
}
#NO_APP
gcc_compiled.:
.text
.even
.globl _main
_main:
	link a6,#0
	moveq #99,d0
L4:
	dbra d0,L4
	clrw d0
	subql #1,d0
	jcc L4
	jbsr _xxxxxxxxxxxxxxxxxxx
	moveq #99,d0
L8:
	dbra d0,L8
	clrw d0
	subql #1,d0
	jcc L8
	jbsr _xxxxxxxxxxxxxxxxxxx
	moveq #100,d0
L10:
	dbra d0,L10
	clrw d0
	subql #1,d0
	jcc L10
	jbsr _xxxxxxxxxxxxxxxxxxx
	moveq #100,d0
L14:
	dbra d0,L14
	clrw d0
	subql #1,d0
	jcc L14
	jbsr _xxxxxxxxxxxxxxxxxxx
	moveq #100,d0
L20:
	subql #1,d0
	tstl d0
	jgt L20
	jbsr _xxxxxxxxxxxxxxxxxxx
	moveq #100,d0
L24:
	subql #1,d0
	tstl d0
	jgt L24
	jbsr _xxxxxxxxxxxxxxxxxxx
	moveq #100,d0
L26:
	dbra d0,L26
	clrw d0
	subql #1,d0
	jcc L26
	jbsr _xxxxxxxxxxxxxxxxxxx
	moveq #100,d0
L30:
	subql #1,d0
	moveq #-1,d1
	cmpl d0,d1
	jlt L30
	jbsr _xxxxxxxxxxxxxxxxxxx
	moveq #100,d0
L34:
	dbra d0,L34
	clrw d0
	subql #1,d0
	jcc L34
	jbsr _xxxxxxxxxxxxxxxxxxx
	unlk a6
	rts
[hp cc 6.5:] ==============================================
	global	_main
_main:
	link.w	%a6,&-8
	mov.l	%d7,-(%sp)
	movq	&99,%d7
L12:
	dbf	%d7,L12
	jsr	_xxxxxxxxxxxxxxxxxxx
	movq	&99,%d7
L16:
	dbf	%d7,L16
	jsr	_xxxxxxxxxxxxxxxxxxx
	movq	&0,%d7
L21:
	addq.l	&1,%d7
	movq	&100,%d0
	cmp.l	%d7,%d0
	ble.b	L21
	jsr	_xxxxxxxxxxxxxxxxxxx
	movq	&0,%d7
L24:
	mov.l	%d7,%d0
	addq.l	&1,%d7
	movq	&100,%d1
	cmp.l	%d0,%d1
	blt.b	L24
	jsr	_xxxxxxxxxxxxxxxxxxx
	movq	&100,%d7
L20001:
	subq.l	&1,%d7
	tst.l	%d7
	bgt.b	L20001
	jsr	_xxxxxxxxxxxxxxxxxxx
	movq	&100,%d7
L20003:
	subq.l	&1,%d7
	tst.l	%d7
	bgt.b	L20003
	jsr	_xxxxxxxxxxxxxxxxxxx
	movq	&100,%d7
L33:
	subq.l	&1,%d7
	bge.b	L33
	jsr	_xxxxxxxxxxxxxxxxxxx
	movq	&100,%d7
L36:
	mov.l	%d7,%d0
	subq.l	&1,%d7
	tst.l	%d0
	bgt.b	L36
	jsr	_xxxxxxxxxxxxxxxxxxx
	movq	&100,%d7
L39:
	mov.l	%d7,%d0
	subq.l	&1,%d7
	tst.l	%d0
	bne.b	L39
	jsr	_xxxxxxxxxxxxxxxxxxx
	mov.l	(%sp)+,%d7
	unlk	%a6
	rts
chris@mimsy.UUCP (Chris Torek) (06/06/89)
In article <11993@well.UUCP> pokey@well.UUCP (Jef Poskanzer) writes:
>... COUNT was a small (< 127) compile-time constant.
>	for ( i = COUNT; --i >= 0; )

[all but gcc -O -fstrength-reduce deleted]

>	moveq #COUNT,d0
>	jra tag2
>tag1:
>	<loop body>
>tag2:
>	dbra d0,tag1
>	clrw d0
>	subql #1,d0
>	jcc tag1

>... But wait!  What's that chud after the loop?  Let's see, clear d1
>to zero, subtract one from it giving -1 and setting carry, and jump
>if carry is clear.  Hmm, looks like a three-instruction no-op to me!

No---the problem is that `dbra' decrements a *word*, compares the
result against -1, and (if not -1) branches.  The semantics of the
loop demand a 32 bit comparison.  The only reason it is not necessary
in this particular case is the first quoted line above.

Still, it would be nice if gcc always used the dbra/clrw/subql/jcc
sequence for `--x >= 0' loops, since it does always work.  The `clrw'
fixes up the case where the 16-bit result has gone to -1:

	before decrement:	wxyz   0000
	after decrement:	wxyz   FFFF
	after clrw:		wxyz   0000
	after subql:		wxyz-1 FFFF

The dbra loop is so much faster that the extra time and space for one
`unnecessary' dbra+clrw (when the loop really does go from 0 to -1,
and at every 65536 trips when the loop counter is large and positive)
is a small price; I would make this optimisation unconditional.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris
suitti@haddock.ima.isc.com (Stephen Uitti) (06/06/89)
In article <11993@well.UUCP> Jef Poskanzer <pokey@well.com> writes:
>I compiled the following different kinds of for loops:
>	for ( i = 0; i < COUNT; i++ )
>	for ( i = 0; i < COUNT; ++i )
>	for ( i = 0; ++i <= COUNT; )
>	for ( i = 0; i++ < COUNT; )
>	for ( i = COUNT; i > 0; i-- )
>	for ( i = COUNT; i > 0; --i )
>	for ( i = COUNT; --i >= 0; )
>	for ( i = COUNT; i-- > 0; )
>on a Sun 3/50 with both SunOS 3.5 cc and gcc 1.35, and looked at the
>generated code and the timings.  COUNT was a small (< 127) compile-time
>constant.

A huge amount of time ago, i did this for the PDP-11 using the Ritchie
compiler.  This compiler is not awesomely bright, but i had the
impression that it wasn't quite as bad as PCC (or even less portable).
The optimal loop would use the "sob" (subtract one and branch if not
zero) instruction.  The code that caused it to be emitted was

	register i;
	i = COUNT;
	do {
		loop body;
	} while (--i);

This is much clearer than anything with for(), since it tends to get
(even stupidly) compiled to:

	register i;
	i = COUNT;		mov	$COUNT,r2
	do {		L1:
		loop body;		loop body...
	} while (--i);		sob	r2,L1

The compiler seemed to be smart enough to know not to use sob if the
loop was too long.  Thus, at worst, the sob would be replaced by:

	sub	$1,r2
	jne	L1

The trouble with "for" loops is that the construct is defined as "test
at the top", when most single-instruction loops are "test at bottom".

The VAX (PCC based) compiler's optimizer would convert many loops that
used multiple instructions to use a single complicated instruction.
Unfortunately, the VAX 780 (ubiquitous at the time) generally ran
these slower.  One case was the acbl (add, compare and branch
longword).  The optimizer would replace the three instructions
(something like:)

	add	$1,r2
	cmp	$END,r2
	bne	L1

with the "acbl" and all its arguments.  Both codings take the same
number of bytes (25!).  The multiple instructions are faster (on the
780).  Newer VAXen have different relative timing.
I recall this case coming from the prime number sieve benchmark.  The
loop could not be easily converted to the do-while.

I haven't gotten a chance to play with gcc (real work to do).  I've
heard it can do wonders, and thought it did better multi-statement
optimization.  Still, all the silly side-effect semantics of C make it
hard to tell a compiler how to take mediocre source and produce good
code from it.  It is best to start with good source.

Though i hardly ever write assembler anymore (and it seems never for
speed or size), i've still yet to bump into a C compiler that produces
code that i can't improve on at a glance.  This may change if/when i
start working with scoreboarded RISC processors (88K), in that
instruction cycle counting looks hard (requiring more than a glance).

Stephen.
david@sun.com (there IS no ``average'' SubGenius) (06/06/89)
The Sun 680x0 compiler has been able to generate dbcc loops for a long
time.  However, the loop has to be written in a particular form.  Here
are some examples compiled with the SunOS 3.5 cc:

1.	register short s;
	for (s = 127; s != -1; s--) ;

		moveq	#127,d7
	LY00001:
		dbra	d7,LY00001

2a.	register int i;
	for (i = 127; i != -1; i--) ;

		moveq	#127,d6
	LY00003:
		subql	#1,d6
		moveq	#-1,d7
		cmpl	d7,d6
		jne	LY00003

	(Oops!)

2b.	register int i;
	i = 127;
	while (--i != -1) ;

		moveq	#127,d6
	L20:
		dbra	d6,L20
		clrw	d6
		subql	#1,d6
		jcc	L20

3.	extern int x;
	register short s;
	s = 127;
	while (x >= 0 && --s != -1) ;

		moveq	#127,d7
	L15:
		tstl	_x
		dblt	d7,L15

P.S. Sun users might want to look at the PR_LOOP macros in
<pixrect/pr_impl_util.h>.
-- 
David DiGiacomo, Sun Microsystems, Mt. View, CA  sun!david david@sun.com
edler@cmcl2.NYU.EDU (Jan Edler) (06/07/89)
In article <17891@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <11993@well.UUCP> pokey@well.UUCP (Jef Poskanzer) writes:
>>... COUNT was a small (< 127) compile-time constant.
>>	for ( i = COUNT; --i >= 0; )
>
>[all but gcc -O -fstrength-reduce deleted]
>
>>	moveq #COUNT,d0
>>	jra tag2
>>tag1:
>>	<loop body>
>>tag2:
>>	dbra d0,tag1
>>	clrw d0
>>	subql #1,d0
>>	jcc tag1

I prefer

	moveq #COUNT,d0
	jra tag3
tag1:
	swap d0
tag2:
	<loop body>
tag3:
	dbra d0,tag2
	swap d0
	dbra d0,tag1

But it doesn't really make that much difference, I guess.  However,
the compiler shouldn't do either of these when d0 is initialized to a
value < 65536.

Jan Edler
NYU Ultracomputer Research Laboratory
edler@nyu.edu
pokey@well.UUCP (Jef Poskanzer) (06/07/89)
In the referenced message, edler@cmcl2.UUCP (Jan Edler) wrote:
}However, the compiler shouldn't do either of these when d0 is initialized
}to a value < 65536.

Not good enough, since the loop body might modify the index variable.
If the compiler is smart enough to do a global semantic check for this
sort of thing, then it can do this optimization; but we've already
established that cc and gcc aren't that smart.
---
Jef

    Jef Poskanzer  {ucbvax, lll-crg, sun!pacbell, apple, hplabs}!well!pokey
"A banker is a fellow who lends you his umbrella when the sun is shining
and wants it back the minute it begins to rain." -- Mark Twain
mikeb@coho.ee.ubc.ca (Mike Bolotski) (06/07/89)
In article <5260014@hplsla.HP.COM> jima@hplsla.HP.COM (Jim Adcock) writes:
>Below is your examples, redone with a null loop body, plus *my* favorite
>loop construct: for(i=COUNT; i--;).

This is my favorite loop, too.  But if i is defined to be short,
instead of int, the following results (Sun 3/260, gcc 1.35 -O -S).
Note: no -fstrength-reduce required.

	main()
	{
	    volatile int x;
	    short j;

	    for (j=5; j--; )
		x = 0;
	}

	#NO_APP
	gcc_compiled.:
	.text
		.even
	.globl _main
	_main:
		link a6,#-4
		moveq #5,d0
		jra L2
	L5:
		clrl a6@(-4)
	L2:
		dbra d0,L5
		unlk a6
		rts

Mike Bolotski, Department of Electrical Engineering,
University of British Columbia, Vancouver, Canada
mikeb@ee.ubc.ca | mikeb%ee.ubc.ca@relay.ubc.ca
ee.ubc.ca!mikeb@uunet.uu.net | ...!ubc-vision!ee.ubc.ca!mikeb
inst182@tuvie (Inst.f.Techn.Informatik) (06/09/89)
I tried the same kinds of for loops on an XT-compatible (8086, 8 MHz)
using Turbo-C 2.0 (tcc -G -O -Z -S).  I also tested

	for (i = COUNT; i --;);

because I thought it would be the fastest.  For integers, the fastest
loop was

	for (i = COUNT; -- i >= 0;);

which turned out to be 1.7 times as fast as the slowest.  It is
remarkable that this is the same factor as Jef Poskanzer has timed on
his Sun.

#define COUNT 30000

----------------------------------------------------------------
for (i = 0; i < COUNT; i ++);			127 ms

		xor	si,si
		jmp	short @5
	@4:
		inc	si
	@5:
		cmp	si,30000
		jl	@4

Comparison against a non-zero value.

----------------------------------------------------------------
for (i = 0; i < COUNT; ++ i);			115 ms

		xor	si,si
		jmp	short @9
	@8:
		inc	si
	@9:
		cmp	si,30000
		jl	@8

Can anybody explain why this is faster than the above????

----------------------------------------------------------------
for (i = 0; ++ i <= COUNT;);			132 ms

		xor	si,si
	@13:
		inc	si
		mov	ax,si
		cmp	ax,30000
		jle	@13

Moving si to ax seems rather useless to me.

----------------------------------------------------------------
for (i = 0; i++ < COUNT;);			132 ms

		xor	si,si
	@17:
		mov	ax,si
		inc	si
		cmp	ax,30000
		jl	@17

Ok, here it is necessary.

----------------------------------------------------------------
for (i = COUNT; i > 0; i --);			94 ms

		mov	si,30000
		jmp	short @21
	@20:
		dec	si
	@21:
		or	si,si
		jg	@20

The separation of decrement and comparison causes an additional or.

----------------------------------------------------------------
for (i = COUNT; i > 0; --i);			93 ms

		mov	si,30000
		jmp	short @25
	@24:
		dec	si
	@25:
		or	si,si
		jg	@24

Looks exactly like the version above.  The 1 ms difference could be a
timing error.
----------------------------------------------------------------
for (i = COUNT; --i >= 0;);			77 ms

		mov	si,30000
	@29:
		dec	si
		jge	@29

This surely is the optimal loop.

----------------------------------------------------------------
for (i = COUNT; i-- > 0;);			115 ms

		mov	si,30000
	@33:
		mov	ax,si
		dec	si
		or	ax,ax
		jg	@33

The former value of the counter has to be saved for the comparison.

----------------------------------------------------------------
for (i = COUNT; i--;);				115 ms

		mov	si,30000
	@37:
		mov	ax,si
		dec	si
		or	ax,ax
		jne	@37

================================================================

Then I tried the same with a long counter (instead of int), which
changed the results radically.  Not only is the difference between the
fastest and the slowest loop much smaller (the factor is now only
1.13), but version 7, which had been the fastest for integers, is now
the second slowest, and version 9 is now the fastest.  Versions 3 and
4 have remained at the slow end of the charts.

#define COUNT 60000

----------------------------------------------------------------
for (i = 0; i < COUNT; i ++);			938 ms

		mov	word ptr [bp-2],0
		mov	word ptr [bp-4],0
		jmp	short @5
	@4:
		add	word ptr [bp-4],1
		adc	word ptr [bp-2],0
	@5:
		cmp	word ptr [bp-2],0
		jl	@4
		jne	@38
		cmp	word ptr [bp-4],-5536
		jb	@4
	@38:

----------------------------------------------------------------
for (i = 0; i < COUNT; ++ i);			938 ms

		mov	word ptr [bp-2],0
		mov	word ptr [bp-4],0
		jmp	short @9
	@8:
		add	word ptr [bp-4],1
		adc	word ptr [bp-2],0
	@9:
		cmp	word ptr [bp-2],0
		jl	@8
		jne	@39
		cmp	word ptr [bp-4],-5536
		jb	@8
	@39:

----------------------------------------------------------------
for (i = 0; ++ i <= COUNT;);			1056 ms

		mov	word ptr [bp-2],0
		mov	word ptr [bp-4],0
	@13:
		add	word ptr [bp-4],1
		adc	word ptr [bp-2],0
		mov	dx,word ptr [bp-2]
		mov	ax,word ptr [bp-4]
		or	dx,dx
		jl	@13
		jg	@40
		cmp	ax,-5536
		jbe	@13
	@40:

Now the compiler seems to have forgotten that a memory location can be
compared against a constant.
----------------------------------------------------------------
for (i = 0; i++ < COUNT;);			1009 ms

		mov	word ptr [bp-2],0
		mov	word ptr [bp-4],0
	@17:
		mov	dx,word ptr [bp-2]
		mov	ax,word ptr [bp-4]
		add	word ptr [bp-4],1
		adc	word ptr [bp-2],0
		or	dx,dx
		jl	@17
		jne	@41
		cmp	ax,-5536
		jb	@17
	@41:

----------------------------------------------------------------
for (i = COUNT; i > 0; i --);			938 ms

		mov	word ptr [bp-2],0
		mov	word ptr [bp-4],-5536
		jmp	short @21
	@20:
		sub	word ptr [bp-4],1
		sbb	word ptr [bp-2],0
	@21:
		cmp	word ptr [bp-2],0
		jg	@20
		jne	@42
		cmp	word ptr [bp-4],0
		ja	@20
	@42:

Now it seems to have regained this knowledge and is as fast as before.
Comparing against zero is not faster here than comparing against
another constant.

----------------------------------------------------------------
for (i = COUNT; i > 0; -- i);			938 ms

		mov	word ptr [bp-2],0
		mov	word ptr [bp-4],-5536
		jmp	short @25
	@24:
		sub	word ptr [bp-4],1
		sbb	word ptr [bp-2],0
	@25:
		cmp	word ptr [bp-2],0
		jg	@24
		jne	@43
		cmp	word ptr [bp-4],0
		ja	@24
	@43:

----------------------------------------------------------------
for (i = COUNT; --i >= 0;);			1051 ms

		mov	word ptr [bp-2],0
		mov	word ptr [bp-4],-5536
	@29:
		sub	word ptr [bp-4],1
		sbb	word ptr [bp-2],0
		mov	dx,word ptr [bp-2]
		mov	ax,word ptr [bp-4]
		or	dx,dx
		jg	@29
		jl	@44
		or	ax,ax
		jae	@29
	@44:

Again, the compiler fetches the counter into registers before
comparing.  Here comparing against zero saves a little time.  But
leaving the counter in memory would have saved more.
----------------------------------------------------------------
for (i = COUNT; i-- > 0;);			1004 ms

		mov	word ptr [bp-2],0
		mov	word ptr [bp-4],-5536
	@33:
		mov	dx,word ptr [bp-2]
		mov	ax,word ptr [bp-4]
		sub	word ptr [bp-4],1
		sbb	word ptr [bp-2],0
		or	dx,dx
		jg	@33
		jne	@45
		or	ax,ax
		ja	@33
	@45:

----------------------------------------------------------------
for (i = COUNT; i--;);				934 ms

		mov	word ptr [bp-2],0
		mov	word ptr [bp-4],-5536
	@37:
		mov	dx,word ptr [bp-2]
		mov	ax,word ptr [bp-4]
		sub	word ptr [bp-4],1
		sbb	word ptr [bp-2],0
		or	dx,ax
		jne	@37

Checking a long for *equality* to zero saves lots of time, so the
overhead for loading the registers is more than compensated.

----------------------------------------------------------------
???						374 ms

		mov	si,0
		mov	di,30000
	@_l:
		mov	dx,si
		mov	ax,di
		sub	di,1
		sbb	si,0
		or	dx,ax
		jne	@_l

This would have been the optimal loop, but Turbo-C does not put long
variables into registers.

================================================================

Please email to the address below, because the account I was posting
this from is not mine.

Peter J. Holzer
Technische Universitaet Wien
...!uunet!mcvax!tuvie!asupa!honey!hjphr
"Think of it as evolution in action" -- Tony Rand
awd@dbase.UUCP (Alastair Dallas) (06/11/89)
In article <11993@well.UUCP>, pokey@well.UUCP (Jef Poskanzer) writes:
> SUMMARY
> -------
> 
> I compiled the following different kinds of for loops:
> 
> 	for ( i = 0; i < COUNT; i++ )
> 	for ( i = 0; i < COUNT; ++i )
> 	for ( i = 0; ++i <= COUNT; )
> 	for ( i = 0; i++ < COUNT; )
> 	for ( i = COUNT; i > 0; i-- )
> 	for ( i = COUNT; i > 0; --i )
> 	for ( i = COUNT; --i >= 0; )
> 	for ( i = COUNT; i-- > 0; )
> 
> on a Sun 3/50 with both SunOS 3.5 cc and gcc 1.35, and looked at the
> generated code and the timings.  COUNT was a small (< 127) compile-time
> constant.  The loop body did not reference i and had no side-effects.
> In theory, all eight of these loops should have optimized to the most
> efficient loop possible, ignoring the otherwise unreferenced variable i
> and simply traversing the loop body the proper number of times.  On the
> 68020, this most efficient loop is a dbra instruction.  But in fact, cc
> never generates dbra
> <details omitted>
> 
> CONCLUSION
> ----------
> 
> For both compilers and all levels of optimization, this loop:
> 
> 	for ( i = COUNT; --i >= 0; )
> 
> gives the lowest overhead.

I tried these eight loops on Microsoft C v5.1 and was very surprised
to get the same results, more or less.  Jef's fastest loop (--i >= 0)
on his Sun was also MSC's fastest loop on an 80386.  And for the same
reason: the compiler wakes up and realizes that the condition flags
are already set.

On 80x86, JCXZ is the fast loop instruction, but it is ignored by the
MSC optimizer as well.  (To be fair, I didn't turn on any special
optimization, so I can't say MSC won't do it under some conditions.)
Making i a register variable would obviously improve the timings, but
MSC uses the SI register, not CX, and therefore can't take advantage
of JCXZ.

The code MSC generates for (--i >= 0) uses approx. 47% of the t-states
used by the code it generates for the typical (i=0; i<COUNT; i++)
loop, and the typical case was not even the worst case.  What an
eye-opener!
I always assumed that it wasn't worth being fussy about such things in
C, that the compiler would optimize it perfectly.  Like Jef says, if
you want a particular part of your code to run fast, don't assume it
will be optimized; study the compiler's output.

Random thought: what are our "optimizing compilers" optimizing if they
leave loops alone?

/alastair/
trebor@biar.UUCP (Robert J Woodhead) (06/11/89)
In article <96@dbase.UUCP> awd@dbase.UUCP (Alastair Dallas) writes: >Random thought: What are our "optimizing compilers" optimizing if they >leave loops alone? Why, Corporate Quarterly Earnings Reports, of course! -- Robert J Woodhead, Biar Games, Inc. !uunet!biar!trebor | trebor@biar.UUCP ``The worst thing about being a vampire is that you can't go to matinees and save money anymore.''
jima@hplsla.HP.COM (Jim Adcock) (06/13/89)
>Random thought: What are our "optimizing compilers" optimizing if they
>leave loops alone?

It's not that they leave loops alone; they will munge either side of a
label as well as they can.  Optimizing across a label is hard, since
such optimizations might affect [bust] the code branching to that
label.  See the example that follows.

Again, one cannot just try some trivial loop cases, say "form X is
fastest", and use it blindly from there on.  Life is not that simple.
Compilers do not say: "aha, he's using loop form number 5, that calls
for a dbra."

Typically, a good optimizing compiler is going to start by naively
generating a "tree" representing one's program segment, said "tree"
unfortunately containing loops and branch statements which complicate
a compiler's life tremendously.  Then the tree is simplified using
simple rules of arithmetic:

	[add [const 4] [const 1]] --> [const 5]

for example.  Consider what loops and branches can easily do to
complicate a compiler's life:

	[reg0 = [const 4]]
	label401:
	[reg1 = [const 1]]
	[reg2 = [add reg0 reg1]]

Can the final statement be simplified to "[reg2 = [const 5]]"?  Maybe,
but you'd have to have knowledge of all the places that branch to
label401.  Clearly, compilers don't like to do this!  And situations
like this arise all the time when your for loops get converted to a
branch plus a label.

Finally, after these arithmetic simplifications, attempts are made to
map the tree onto machine instructions, or combinations of machine
instructions.  Weird but useful instructions like "dbra" are likely to
be handled as a peephole optimization -- where a final stage of the
compiler tries to substitute simpler instructions for specific
combinations of one or more more-expensive instructions.

So in real life programming, whether your favorite "dbra" instruction
gets used or not is close to blind luck!  It's just going to depend on
exactly what you do in that loop, what registers are needed at other
points in the routine, etc, etc.
In real life programming these things are just not very repeatable.
Consider, using gcc -O -S on:

	#define OUTERLOOP 10000
	#define COUNT 1000

	main()
	{
	    int i;
	    int j;
	    int ary[COUNT];

	    for (j=OUTERLOOP; j--;)
	    {
		for ( i = COUNT; i-- > 0; )
		    ary[i] = i;
	    }
	}

gives:

	_main:
		link a6,#-4000
		movel d2,sp@-
		movel #10000,d1
		jra L2
	L9:
		movel #1000,d0
		jra L5
	L8:
		lea a6@(d0:l:4),a0
		movel d0,a0@(-4000)
	L5:
		subql #1,d0
		moveq #-1,d2
		cmpl d0,d2
		jlt L8
	L2:
		dbra d1,L9
		clrw d1
		subql #1,d1
		jcc L9
		movel a6@(-4004),d2
		unlk a6
		rts

which executes in 15 seconds on my mid-range 68020 system.  Using the
"magic trick" of "short s; for (s = COUNT; s--;) ..." to try to match
a dbra instruction:

	#define OUTERLOOP 10000
	#define COUNT 1000

	main()
	{
	    short s;
	    int j;
	    int ary[COUNT];

	    for (j=OUTERLOOP; j--;)
	    {
		for ( s = COUNT; s-- > 0; )
		    ary[s] = s;
	    }
	}

generates:

	_main:
		link a6,#-4000
		movel #10000,d1
		jra L2
	L9:
		movew #1000,d0
		jra L5
	L8:
		movew d0,a1
		lea a6@(a1:l:4),a0
		movel a1,a0@(-4000)
	L5:
		subqw #1,d0
		cmpw #-1,d0
		jgt L8
	L2:
		dbra d1,L9
		clrw d1
		subql #1,d1
		jcc L9
		unlk a6
		rts

which takes 16 seconds to execute!

So if one has an optimizing compiler I think one is fooling oneself if
one thinks one can memorize a favorite loop structure and use it
blindly.  Write code using common sense, that is easy to read, and not
too tricky.  If you need more speed, find a different algorithm, or
profile it and work on just those areas that are too slow, re-writing
in assembly if necessary.

More importantly, try to get one's hands on the best optimizing
compiler available for one's machine.  Any damned compiler can be
advertised as an "optimizing" compiler -- the word means nothing.  Any
damned compiler can be tweaked to generate great-looking code for a
few do-nothing for loops -- it means nothing.  Try to get a couple of
the best compilers on your machine and then try them on a variety of
code indicative of the work you do.  They may vary greatly in areas
far more important to you than loop construction.
For example, 680x0 compilers I have seen vary *greatly* in the
approaches they take to using the 6888x coprocessor -- with many having
a rudimentary or non-existent concept of what it means to optimize
floating pt instructions.  Many only use registers 0 and 1 in the 6888x
coprocessor!  Many flotch floating pt numbers to and from the 6888x in
order to convert from single to double floating pt and back, etc,
etc......  [I have yet to see a compiler which is good at both 680x0
instruction generation and 6888x instruction generation.]