pokey@well.UUCP (Jef Poskanzer) (06/05/89)
SUMMARY
-------

I compiled the following different kinds of for loops:

	for ( i = 0; i < COUNT; i++ )
	for ( i = 0; i < COUNT; ++i )
	for ( i = 0; ++i <= COUNT; )
	for ( i = 0; i++ < COUNT; )
	for ( i = COUNT; i > 0; i-- )
	for ( i = COUNT; i > 0; --i )
	for ( i = COUNT; --i >= 0; )
	for ( i = COUNT; i-- > 0; )

on a Sun 3/50 with both SunOS 3.5 cc and gcc 1.35, and looked at the
generated code and the timings.  COUNT was a small (< 127) compile-time
constant.  The loop body did not reference i and had no side-effects.

In theory, all eight of these loops should have optimized to the most
efficient loop possible, ignoring the otherwise unreferenced variable i
and simply traversing the loop body the proper number of times.  On the
68020, this most efficient loop is a dbra instruction.  But in fact, cc
never generates dbra, and gcc generates it for only one of the above
loop types, and only when the -fstrength-reduce flag is specified.

In contrast, the BSD cc on a VAX generates a single-instruction loop in
four of the eight cases, including the two most commonly used ones.
This is not because the VAX compiler is any smarter -- the fact that it
wins only half the time, instead of all the time, shows that it too
fails to recognize that it can freely optimize the loop.  The only
reason the VAX's cc does better is that the VAX has a richer
instruction set.

Since the loop overhead with dbra is lower by a factor of 1.7 than with
the code generated for the common for loops, it is useful to know how
to provoke your compiler into generating it.  But it would be more
useful for compilers to be smarter.  And more generally, those of you
who blithely assume (as I used to) that compilers are smart enough to
turn clearly-expressed C into optimal assembler should perhaps look at
your assembler output more often.
FULL DATA
---------

Case 1)  First, let's look at the generic loops, counting up from zero
to the limit:

	for ( i = 0; i < COUNT; i++ )
	for ( i = 0; i < COUNT; ++i )

cc -O:
		moveq	#0,d6
	tag:
		<loop body>
		addql	#1,d6
		moveq	#COUNT,d1
		cmpl	d1,d6
		jlt	tag
	0.97 usecs

gcc -O & gcc -O -fstrength-reduce:
		clrl	d0
	tag:
		<loop body>
		addql	#1,d0
		moveq	#COUNT-1,d3
		cmpl	d0,d3
		jge	tag
	0.97 usecs

The two compilers generate essentially the same instructions.  Pre or
post increment doesn't make any difference.  Note that the moveq of
COUNT could be moved above the loop -- the loop body doesn't call any
subroutines that might smash it.  This looks like a missed
optimization.

Case 2a)  Second, let's try doing the increment and test in one C
statement.  Seems to me this shouldn't make any difference to a smart
optimizer, but...

	for ( i = 0; ++i <= COUNT; )

cc -O:
		moveq	#0,d6
		jra	tag2
	tag1:
		<loop body>
	tag2:
		addql	#1,d6
		moveq	#COUNT,d1
		cmpl	d1,d6
		jle	tag1
	0.97 usecs

gcc -O & gcc -O -fstrength-reduce:
		clrl	d0
		jra	tag2
	tag1:
		<loop body>
	tag2:
		addql	#1,d0
		moveq	#COUNT,d3
		cmpl	d0,d3
		jge	tag1
	0.97 usecs

No important difference from case 1.

Case 2b)  But if we try the post increment version:

	for ( i = 0; i++ < COUNT; )

cc -O:
		moveq	#0,d6
		jra	tag2
	tag1:
		<loop body>
	tag2:
		movl	d6,d0
		addql	#1,d6
		moveq	#COUNT,d1
		cmpl	d1,d0
		jlt	tag1
	1.11 usecs

gcc -O & gcc -O -fstrength-reduce:
		clrl	d0
		jra	tag2
	tag1:
		<loop body>
	tag2:
		addql	#1,d0
		moveq	#COUNT+1,d3
		cmpl	d0,d3
		jgt	tag1
	0.97 usecs

Gcc is smart enough to bias the loop count, but cc doesn't see that
optimization and so adds an extra instruction.

Case 3)  Third, let's try a count-down loop:

	for ( i = COUNT; i > 0; i-- )
	for ( i = COUNT; i > 0; --i )

cc -O:
		moveq	#COUNT,d6
	tag:
		<loop body>
		subql	#1,d6
		tstl	d6
		jgt	tag
	0.83 usecs

gcc -O & gcc -O -fstrength-reduce:
		moveq	#COUNT,d0
	tag:
		<loop body>
		subql	#1,d0
		tstl	d0
		jgt	tag
	0.83 usecs

Here the two compilers generate exactly the same instructions.
There is one less instruction than in the count-up case, because the
compilers are smart enough to use the tstl instruction to compare
against zero, and so they don't need the moveq #COUNT instruction
(which shouldn't have been in the loop in the first place).

Case 4a)  Fourth, let's try a count-down loop with the decrement
included in the test statement:

	for ( i = COUNT; --i >= 0; )

cc -O:
		moveq	#COUNT,d6
		jra	tag2
	tag1:
		<loop body>
	tag2:
		subql	#1,d6
		jge	tag1
	0.70 usecs

gcc -O:
		moveq	#COUNT,d0
		jra	tag2
	tag1:
		<loop body>
	tag2:
		subql	#1,d0
		jpl	tag1
	0.70 usecs

gcc -O -fstrength-reduce:
		moveq	#COUNT,d0
		jra	tag2
	tag1:
		<loop body>
	tag2:
		dbra	d0,tag1
		clrw	d0
		subql	#1,d0
		jcc	tag1
	0.57 usecs

Cc and gcc generate code similar to case 3, except that they suddenly
decide that subql sets the condition codes just fine by itself, and a
separate tstl is not necessary.  So they shave off an instruction.
Seems like a peephole optimizer should have picked this up in the
previous cases too; it's another missed optimization.

But the big win here is with the -fstrength-reduce option in gcc.  It
actually generates the optimal one-instruction loop, for a factor of
1.7 reduction in loop overhead from the generic loop.  Not bad.  But
wait!  What's that chud after the loop?  Let's see, clear the low word
of d0 to zero, subtract one from it giving -1 and setting carry, and
jump if carry is clear.  Hmm, looks like a three-instruction no-op to
me!  I guess gcc wants to leave the loop index with the right value,
and isn't smart enough to notice it isn't used.  (But why not simply
moveq the final value?)  Oh well, since it's not inside the loop it
doesn't make much difference.
Case 4b)  But if we try this with post decrement:

	for ( i = COUNT; i-- > 0; )

cc -O:
		moveq	#COUNT,d6
		jra	tag2
	tag1:
		<loop body>
	tag2:
		movl	d6,d0
		subql	#1,d6
		tstl	d0
		jgt	tag1
	0.97 usecs

gcc -O & gcc -O -fstrength-reduce:
		moveq	#COUNT,d0
		jra	tag2
	tag1:
		<loop body>
	tag2:
		subql	#1,d0
		moveq	#-1,d3
		cmpl	d0,d3
		jlt	tag1
	0.97 usecs

Cc not only adds in the movl to save the undecremented value of i, it
also somehow fails to realize that subql sets the condition codes, so
the tstl is back.  And gcc gets totally confused and brings in a
literal -1.

CONCLUSION
----------

For both compilers and all levels of optimization, this loop:

	for ( i = COUNT; --i >= 0; )

gives the lowest overhead.  Using gcc -fstrength-reduce, this low
overhead means the optimal single-instruction loop; but even for cc and
for gcc without -fstrength-reduce, this loop is faster than the others.
I don't show the results here, but this is also true for large constant
loops and for variable-length loops.

It is unfortunate that we have to use such a non-intuitive loop to get
the best results -- especially unfortunate since it is probably *not*
the best loop on some other architectures or compilers -- but those are
the facts.  I would be interested to see similar checks done on
different architectures; in particular RISC machines, which supposedly
are designed in cooperation with the compiler writers.
---
Jef

	Jef Poskanzer  {ucbvax, lll-crg, sun!pacbell, apple, hplabs}!well!pokey
	"Man invented language to satisfy his deep need to complain."  -- Lily Tomlin
chris@mimsy.UUCP (Chris Torek) (06/06/89)
In article <11993@well.UUCP> pokey@well.UUCP (Jef Poskanzer) writes:
>... COUNT was a small (< 127) compile-time constant.
>	for ( i = COUNT; --i >= 0; )

[all but gcc -O -fstrength-reduce deleted]

>	moveq #COUNT,d0
>	jra tag2
>tag1:
>	<loop body>
>tag2:
>	dbra d0,tag1
>	clrw d0
>	subql #1,d0
>	jcc tag1
>... But wait!  What's that chud after the loop?  Let's see, clear d1
>to zero, subtract one from it giving -1 and setting carry, and jump
>if carry is clear.  Hmm, looks like a three-instruction no-op to me!

No---the problem is that `dbra' decrements a *word*, compares the
result against -1, and (if not -1) branches.  The semantics of the loop
demand a 32 bit comparison.  The only reason it is not necessary in
this particular case is the first quoted line above.  Still, it would
be nice if gcc always used the dbra/clrw/subql/jcc sequence for
`--x >= 0' loops, since it does always work.  The `clrw' fixes up the
case where the 16-bit result has gone to -1:

	before decrement:	wxyz 0000
	after decrement:	wxyz FFFF
	after clrw:		wxyz 0000
	after subql:		wxyz-1 FFFF

The dbra loop is so much faster that, despite the extra time and space
for one `unnecessary' dbra+clrw (when the loop really does go from 0 to
-1, and at every 65536 trips when the loop counter is large and
positive), I would make this optimisation unconditional.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris
suitti@haddock.ima.isc.com (Stephen Uitti) (06/06/89)
In article <11993@well.UUCP> Jef Poskanzer <pokey@well.com> writes:
>I compiled the following different kinds of for loops:
>	for ( i = 0; i < COUNT; i++ )
>	for ( i = 0; i < COUNT; ++i )
>	for ( i = 0; ++i <= COUNT; )
>	for ( i = 0; i++ < COUNT; )
>	for ( i = COUNT; i > 0; i-- )
>	for ( i = COUNT; i > 0; --i )
>	for ( i = COUNT; --i >= 0; )
>	for ( i = COUNT; i-- > 0; )
>on a Sun 3/50 with both SunOS 3.5 cc and gcc 1.35, and looked at the
>generated code and the timings.  COUNT was a small (< 127) compile-time
>constant.

A huge amount of time ago, i did this for the PDP-11 using the Ritchie
compiler.  This compiler is not awesomely bright, but i had the
impression that it wasn't quite as bad as PCC (or even less portable).
The optimal loop would use the "sob" (subtract one and branch if not
zero) instruction.  The code that caused it to be emitted was

	register i;
	i = COUNT;
	do {
		loop body;
	} while (--i);

This is much clearer than anything with for(), since it tends to get
(even stupidly) compiled to:

	register i;
	i = COUNT;		mov	$COUNT,r2
	do {		L1:
		loop body;		loop body...
	} while (--i);		sob	r2,L1

The compiler seemed to be smart enough to know not to use it if the
loop was too long.  Thus, at worst, the sob would be replaced by:

	sub	$1,r2
	jne	L1

The trouble with "for" loops is that a for loop is defined to test at
the top, while most single-instruction loops test at the bottom.

The VAX (PCC based) compiler's optimizer would convert many loops that
used multiple instructions to use a single complicated instruction.
Unfortunately, the VAX 780 (ubiquitous at the time) generally ran these
slower.  One case was the acbl (add, compare and branch longword).  The
optimizer would replace the three instructions (something like:)

	add	$1,r2
	cmp	$END,r2
	bne	L1

with the "acbl" and all its arguments.  Both codings take the same
number of bytes (25!).  The multiple instructions are faster (on the
780).  Newer VAXen have different relative timing.
I recall this case coming from the prime number sieve benchmark.  The
loop could not be easily converted to the do-while.

I haven't gotten a chance to play with gcc (real work to do).  I've
heard it can do wonders, and thought it did better multi-statement
optimization.  Still, all the silly side-effect semantics of C make it
hard to tell a compiler how to take mediocre source and produce good
code from it.  It is best to start with good source.

Though i hardly ever write assembler anymore (and it seems never for
speed or size), i've still yet to bump into a C compiler that produces
code that i can't improve on at a glance.  This may change if/when i
start working with scoreboarded RISC processors (88K), in that
instruction cycle counting looks hard (requiring more than a glance).

	Stephen.
david@sun.com (there IS no ``average'' SubGenius) (06/06/89)
The Sun 680x0 compiler has been able to generate dbcc loops for a long
time.  However, the loop has to be written in a particular form.  Here
are some examples compiled with the SunOS 3.5 cc:

1.	register short s;
	for (s = 127; s != -1; s--)
		;

		moveq	#127,d7
	LY00001:
		dbra	d7,LY00001

2a.	register int i;
	for (i = 127; i != -1; i--)
		;

		moveq	#127,d6
	LY00003:
		subql	#1,d6
		moveq	#-1,d7
		cmpl	d7,d6
		jne	LY00003

	(Oops!)

2b.	register int i;
	i = 127;
	while (--i != -1)
		;

		moveq	#127,d6
	L20:
		dbra	d6,L20
		clrw	d6
		subql	#1,d6
		jcc	L20

3.	extern int x;
	register short s;
	s = 127;
	while (x >= 0 && --s != -1)
		;

		moveq	#127,d7
	L15:
		tstl	_x
		dblt	d7,L15

P.S. Sun users might want to look at the PR_LOOP macros in
<pixrect/pr_impl_util.h>.
-- 
David DiGiacomo, Sun Microsystems, Mt. View, CA  sun!david david@sun.com
shankar@hpclscu.HP.COM (Shankar Unni) (06/07/89)
> for ( i = 0; i < COUNT; i++ )
> for ( i = 0; i < COUNT; ++i )
> for ( i = 0; ++i <= COUNT; )
> for ( i = 0; i++ < COUNT; )
> for ( i = COUNT; i > 0; i-- )
> for ( i = COUNT; i > 0; --i )
> for ( i = COUNT; --i >= 0; )
> for ( i = COUNT; i-- > 0; )
> 
> I would be interested to see similar checks done on different
> architectures; in particular RISC machines, which supposedly are
> designed in cooperation with the compiler writers.

On the HP Precision Architecture (HP9000 series 800's), all the above
loops except the last one generate a two-instruction loop + set-up:

	COPY	0,23		; zero-out reg 23
	LDI	127,31		; load-immediate into reg 31
	LDO	1(23),23	; this is a copy of the instruction below
				; (should have been folded with the COPY).
	COMBF,<=,N 31,23,.	; compare and branch if false	[LOOP]
	LDO	1(23),23	; increment i (LDO = LoaD Offset)	[LOOP]

The compare-and-branch to itself, with the increment in the delay slot
of the branch, is the tightest possible loop.  In the last for loop
(only), a three-instruction loop is generated, because of the slightly
more complicated code for "i-- > 0".
----
Shankar Unni.
HP California Language Lab.
edler@cmcl2.NYU.EDU (Jan Edler) (06/07/89)
In article <17891@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <11993@well.UUCP> pokey@well.UUCP (Jef Poskanzer) writes:
>>... COUNT was a small (< 127) compile-time constant.
>>	for ( i = COUNT; --i >= 0; )
>
>[all but gcc -O -fstrength-reduce deleted]
>
>>	moveq #COUNT,d0
>>	jra tag2
>>tag1:
>>	<loop body>
>>tag2:
>>	dbra d0,tag1
>>	clrw d0
>>	subql #1,d0
>>	jcc tag1

I prefer

	moveq	#COUNT,d0
	jra	tag3
tag1:
	swap	d0
tag2:
	<loop body>
tag3:
	dbra	d0,tag2
	swap	d0
	dbra	d0,tag1

But it doesn't really make that much difference, I guess.  However, the
compiler shouldn't do either of these when d0 is initialized to a value
< 65536.

Jan Edler
NYU Ultracomputer Research Laboratory
edler@nyu.edu
pokey@well.UUCP (Jef Poskanzer) (06/07/89)
In the referenced message, edler@cmcl2.UUCP (Jan Edler) wrote:
}However, the compiler shouldn't do either of these when d0 is initialized
}to a value < 65536.

Not good enough, since the loop body might modify the index variable.
If the compiler is smart enough to do a global semantic check for this
sort of thing, then you can do this optimization; but we've already
established that cc and gcc aren't that smart.
---
Jef

	Jef Poskanzer  {ucbvax, lll-crg, sun!pacbell, apple, hplabs}!well!pokey
	"A banker is a fellow who lends you his umbrella when the sun is shining
	and wants it back the minute it begins to rain."  -- Mark Twain
neideck@49.1009.enet (Burkhard Neidecker-Lutz) (06/08/89)
Freshly off a DecStation 3100 with a MIPS 1.31 compiler:
#define COUNT 127
int main()
{
int i;
for ( i = 0; i < COUNT; ++i );
move v1,zero
addiu v1,v1,1
L1: slti at,v1,127 # 3 cycle loop
bne at,zero,L1
addiu v1,v1,1
for ( i = 0; ++i <= COUNT; );
li v1,1
addiu v1,v1,1
L2: slti at,v1,128 # 3 cycle loop
bne at,zero,L2
addiu v1,v1,1
for ( i = 0; i++ < COUNT; );
li v1,1
L3: slti v0,v1,127 # 3 cycle loop
bne v0,zero,L3
addiu v1,v1,1
for ( i = COUNT; i > 0; i-- );
li v1,127
addiu v1,v1,-1
L4: bgtz v1,L4 # 2 cycle loop
addiu v1,v1,-1
for ( i = COUNT; i > 0; --i );
li v1,127
addiu v1,v1,-1
L5: bgtz v1,L5 # 2 cycle loop
addiu v1,v1,-1
for ( i = COUNT; --i >= 0; );
li v1,126
addiu v1,v1,-1
L6: bgez v1,L6 # 2 cycle loop
addiu v1,v1,-1
for ( i = COUNT; i-- > 0; );
li v1,126
move v0,v1
L7: bgtz v0,L7 # 2 cycle loop
addiu v1,v1,-1
return(i);
jr ra
move v0,v1
}
As usual, counting towards zero is more efficient than counting
upward, but we knew that already, didn't we?
Burkhard Neidecker-Lutz, Digital CEC Karlsruhe, Project NESTOR
Disclaimer: I don't speak for Digital, etc. ...
rbj@dsys.ncsl.nist.gov (Root Boy Jim) (06/09/89)
? From: Jef Poskanzer <pokey@well.uucp>

? In the referenced message, edler@cmcl2.UUCP (Jan Edler) wrote:
? }However, the compiler shouldn't do either of these when d0 is initialized
? }to a value < 65536.

? Not good enough, since the loop body might modify the index variable.
? If the compiler is smart enough to do a global semantic check for this
? sort of thing, then you can do this optimization; but we've already
? established that cc and gcc aren't that smart.

Hey, aren't most programs *I/O bound* anyway?  And you can sometimes
unroll your loops.

I seem to remember Chris Torek stating something about using

	main()
	{
		register short n = 10;

		do {
			printf("n = %d\n");
		} while (--n != -1);
	}

On a Sun 3/180, SunOS 3.5 compiled with `cc -S -O' generates:

	.data
	.text
LL0:
	|#PROC# 04
	.data1
L18:
	.ascii	"n = %d\12\0"
	LF12 = 4
	LS12 = 128
	LFF12 = 0
	LSS12 = 0
	LP12 = 16
	.data
	.text
	.globl	_main
_main:
	|#PROLOGUE# 0
	link	a6,#-4
	movl	d7,sp@
	|#PROLOGUE# 1
	moveq	#10,d7
L16:
	movw	d7,d0
	extl	d0
	movl	d0,sp@-
	pea	L18
	jbsr	_printf
	addqw	#8,sp
	dbra	d7,L16
	movl	a6@(-4),d7
	unlk	a6
	rts

and prints "n = 10" thru "n = 0".  I even tried initializing n to -5
and adding `n += 2' before the printf.  Prints -3 thru 0.

? Jef
? Jef Poskanzer  {ucbvax, lll-crg, sun!pacbell, apple, hplabs}!well!pokey
? "A banker is a fellow who lends you his umbrella when the sun is shining and
? wants it back the minute it begins to rain."  -- Mark Twain

	Root Boy Jim is what I am
	Are you what you are or what?
pokey@well.UUCP (Jef Poskanzer) (06/09/89)
In the referenced message, rbj@dsys.ncsl.nist.gov (Root Boy Jim) wrote:
}Hey, aren't most programs *I/O bound* anyway? And you can sometimes
}unroll your loops.

Any statement about "most programs" is guaranteed false.  In my case,
I have a large suite of programs called PBM.  Most of them (note
difference) are I/O bound (but less so now that I've improved the
external data format).  Some are CPU bound.  I estimate that these
would go about 1% faster if they used a dbra in their inner loop.

}I seem to remember Chris Torek stating something about using
}	printf("n = %d\n");

Somehow I doubt Chris would make such a stupid mistake.

What is your point, anyway?  I didn't see it in your article.
---
Jef

	Jef Poskanzer  pokey@well.sf.ca.us  {ucbvax, apple, hplabs}!well!pokey
	"Sound impressive?  It should.  It is."
inst182@tuvie (Inst.f.Techn.Informatik) (06/09/89)
I tried the same kinds of for loops on an XT-compatible (8086, 8 MHz)
using Turbo-C 2.0.  (tcc -G -O -Z -S)  I also tested

	for (i = COUNT; i --;);

because I thought it would be the fastest.  For integers, the fastest
loop was

	for (i = COUNT; -- i >= 0;);

which turned out to be 1.7 times as fast as the slowest.  It is
remarkable that this is the same factor as Jef Poskanzer has timed on
his Sun.

#define COUNT 30000
----------------------------------------------------------------
for (i = 0; i < COUNT; i ++);			127 ms

	xor	si,si
	jmp	short @5
@4:
	inc	si
@5:
	cmp	si,30000
	jl	@4

Comparison against a non-zero value.
----------------------------------------------------------------
for (i = 0; i < COUNT; ++ i);			115 ms

	xor	si,si
	jmp	short @9
@8:
	inc	si
@9:
	cmp	si,30000
	jl	@8

Can anybody explain why this is faster than the above????
----------------------------------------------------------------
for (i = 0; ++ i <= COUNT;);			132 ms

	xor	si,si
@13:
	inc	si
	mov	ax,si
	cmp	ax,30000
	jle	@13

Moving si to ax seems rather useless to me.
----------------------------------------------------------------
for (i = 0; i++ < COUNT;);			132 ms

	xor	si,si
@17:
	mov	ax,si
	inc	si
	cmp	ax,30000
	jl	@17

Ok, here it is necessary.
----------------------------------------------------------------
for (i = COUNT; i > 0; i --);			94 ms

	mov	si,30000
	jmp	short @21
@20:
	dec	si
@21:
	or	si,si
	jg	@20

The separation of decrement and comparison causes an additional or.
----------------------------------------------------------------
for (i = COUNT; i > 0; --i);			93 ms

	mov	si,30000
	jmp	short @25
@24:
	dec	si
@25:
	or	si,si
	jg	@24

Looks exactly like the version above.  The 1 ms difference could be a
timing error.
----------------------------------------------------------------
for (i = COUNT; --i >= 0;);			77 ms

	mov	si,30000
@29:
	dec	si
	jge	@29

This surely is the optimal loop.
----------------------------------------------------------------
for (i = COUNT; i-- > 0;);			115 ms

	mov	si,30000
@33:
	mov	ax,si
	dec	si
	or	ax,ax
	jg	@33

The former value of the counter has to be saved for the comparison.
----------------------------------------------------------------
for (i = COUNT; i--;);				115 ms

	mov	si,30000
@37:
	mov	ax,si
	dec	si
	or	ax,ax
	jne	@37

================================================================

Then I tried the same with a long counter (instead of int), which
changed the results radically.  Not only is the difference between the
fastest and the slowest loop much smaller (the factor is now only
1.13), but version 7, which had been the fastest for integers, is now
the second slowest, and version 9 is now the fastest.  Versions 3 and 4
have remained at the slow end of the charts.

#define COUNT 60000
----------------------------------------------------------------
for (i = 0; i < COUNT; i ++);			938 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],0
	jmp	short @5
@4:
	add	word ptr [bp-4],1
	adc	word ptr [bp-2],0
@5:
	cmp	word ptr [bp-2],0
	jl	@4
	jne	@38
	cmp	word ptr [bp-4],-5536
	jb	@4
@38:
----------------------------------------------------------------
for (i = 0; i < COUNT; ++ i);			938 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],0
	jmp	short @9
@8:
	add	word ptr [bp-4],1
	adc	word ptr [bp-2],0
@9:
	cmp	word ptr [bp-2],0
	jl	@8
	jne	@39
	cmp	word ptr [bp-4],-5536
	jb	@8
@39:
----------------------------------------------------------------
for (i = 0; ++ i <= COUNT;);			1056 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],0
@13:
	add	word ptr [bp-4],1
	adc	word ptr [bp-2],0
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	or	dx,dx
	jl	@13
	jg	@40
	cmp	ax,-5536
	jbe	@13
@40:

Now the compiler seems to have forgotten that a memory location can be
compared against a constant.
----------------------------------------------------------------
for (i = 0; i++ < COUNT;);			1009 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],0
@17:
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	add	word ptr [bp-4],1
	adc	word ptr [bp-2],0
	or	dx,dx
	jl	@17
	jne	@41
	cmp	ax,-5536
	jb	@17
@41:
----------------------------------------------------------------
for (i = COUNT; i > 0; i --);			938 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
	jmp	short @21
@20:
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
@21:
	cmp	word ptr [bp-2],0
	jg	@20
	jne	@42
	cmp	word ptr [bp-4],0
	ja	@20
@42:

Now it seems to have regained this knowledge and is as fast as before.
Comparing against zero is not faster here than comparing against
another constant.
----------------------------------------------------------------
for (i = COUNT; i > 0; -- i);			938 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
	jmp	short @25
@24:
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
@25:
	cmp	word ptr [bp-2],0
	jg	@24
	jne	@43
	cmp	word ptr [bp-4],0
	ja	@24
@43:
----------------------------------------------------------------
for (i = COUNT; --i >= 0;);			1051 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
@29:
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	or	dx,dx
	jg	@29
	jl	@44
	or	ax,ax
	jae	@29
@44:

Again, the compiler fetches the counter into registers before
comparing.  Here comparing against zero saves a little time.  But
leaving the counter in memory would have saved more.
----------------------------------------------------------------
for (i = COUNT; i-- > 0;);			1004 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
@33:
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
	or	dx,dx
	jg	@33
	jne	@45
	or	ax,ax
	ja	@33
@45:
----------------------------------------------------------------
for (i = COUNT; i--;);				934 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
@37:
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
	or	dx,ax
	jne	@37

Checking a long for *equality* to zero saves lots of time, so the
overhead for loading the registers is more than compensated.
----------------------------------------------------------------
???						374 ms

	mov	si,0
	mov	di,30000
@_l:
	mov	dx,si
	mov	ax,di
	sub	di,1
	sbb	si,0
	or	dx,ax
	jne	@_l

This would have been the optimal loop, but Turbo-C does not put long
variables into registers.
================================================================

Please email to the address below, because the account I was posting
this from is not mine.

	Peter J. Holzer
	Technische Universitaet Wien
	...!uunet!mcvax!tuvie!asupa!honey!hjphr
	"Think of it as evolution in action" -- Tony Rand
awd@dbase.UUCP (Alastair Dallas) (06/11/89)
In article <11993@well.UUCP>, pokey@well.UUCP (Jef Poskanzer) writes:
> SUMMARY
> -------
> 
> I compiled the following different kinds of for loops:
> 
> 	for ( i = 0; i < COUNT; i++ )
> 	for ( i = 0; i < COUNT; ++i )
> 	for ( i = 0; ++i <= COUNT; )
> 	for ( i = 0; i++ < COUNT; )
> 	for ( i = COUNT; i > 0; i-- )
> 	for ( i = COUNT; i > 0; --i )
> 	for ( i = COUNT; --i >= 0; )
> 	for ( i = COUNT; i-- > 0; )
> 
> on a Sun 3/50 with both SunOS 3.5 cc and gcc 1.35, and looked at the
> generated code and the timings.  COUNT was a small (< 127) compile-time
> constant.  The loop body did not reference i and had no side-effects.
> In theory, all eight of these loops should have optimized to the most
> efficient loop possible, ignoring the otherwise unreferenced variable i
> and simply traversing the loop body the proper number of times.  On the
> 68020, this most efficient loop is a dbra instruction.  But in fact, cc
> never generates dbra
> <details omitted>
> 
> CONCLUSION
> ----------
> 
> For both compilers and all levels of optimization, this loop:
> 
> 	for ( i = COUNT; --i >= 0; )
> 
> gives the lowest overhead.

I tried these eight loops on Microsoft C v5.1 and was very surprised to
get the same results, more or less.  Jef's fastest loop (--i >= 0) on
his Sun was also MSC's fastest loop on an 80386, and for the same
reason: the compiler wakes up and realizes that the condition flags are
already set.  On 80x86, JCXZ is the fast loop instruction, but it is
ignored by the MSC optimizer as well.  (To be fair, I didn't turn on
any special optimization, so I can't say MSC won't do it under some
conditions.)  Making i a register variable would obviously improve the
timings, but MSC uses the SI register, not CX, and therefore can't take
advantage of JCXZ.  The code MSC generates for (--i >= 0) uses approx.
47% of the t-states used by the code it generates for the typical
(i=0; i<COUNT; i++) loop, and the typical case was not even the worst
case.  What an eye-opener!
I always assumed that it wasn't worth being fussy about such things in
C, that the compiler would optimize them perfectly.  Like Jef says, if
you want a particular part of your code to run fast, don't assume it
will be optimized; study the compiler's output.

Random thought: what are our "optimizing compilers" optimizing if they
leave loops alone?

/alastair/
trebor@biar.UUCP (Robert J Woodhead) (06/11/89)
In article <96@dbase.UUCP> awd@dbase.UUCP (Alastair Dallas) writes:
>Random thought: What are our "optimizing compilers" optimizing if they
>leave loops alone?

Why, Corporate Quarterly Earnings Reports, of course!
-- 
Robert J Woodhead, Biar Games, Inc.  !uunet!biar!trebor | trebor@biar.UUCP
``The worst thing about being a vampire is that you can't go to matinees
and save money anymore.''
rbj@dsys.ncsl.nist.gov (Root Boy Jim) (06/13/89)
? From: Jef Poskanzer <pokey@well.uucp>
? Date: 8 Jun 89 21:45:08 GMT

? In the referenced message, rbj@dsys.ncsl.nist.gov (Root Boy Jim) wrote:
? }Hey, aren't most programs *I/O bound* anyway? And you can sometimes
? }unroll your loops.

? Any statement about "most programs" is guaranteed false.  In my case,
? I have a large suite of programs called PBM.  Most of them (note
? difference) are I/O bound (but less so now that I've improved the
? external data format).  Some are CPU bound.  I estimate that these
? would go about 1% faster if they used a dbra in their inner loop.

OK, I'll give you that point.  It really *does* depend on your
environment.  Even if what I say is true, there are still cases where
it matters.  *But*, I would imagine that loop unrolling might beat the
difference that dbra makes.  Have you looked into that?

? }I seem to remember Chris Torek stating something about using
? }	printf("n = %d\n");

? Somehow I doubt Chris would make such a stupid mistake.

Well, I *did* actually run the code; I started off with something more
nebulous, then decided to investigate it.  Evidently my transcription
didn't make it back into the mail text.

? What is your point, anyway?  I didn't see it in your article.

Sorry, it was hidden in the generated code rather than stated.  If you
had looked past the obvious, trivial, unimportant error and got to the
meat of my posting, you would have noticed that I showed you how to get
a `dbra' generated by the compiler.

Hey, mellow out!  My first reaction was to say: SHPX LBH!, but I can
see where you're coming from.  I didn't make my points very well, and
my opening statement's flippancy evidently turned you off.  You will
notice that my posting was the only one that showed you how to actually
get a dbra generated.  Peace, and thanks for the bitmaps.

? Jef
? Jef Poskanzer  pokey@well.sf.ca.us  {ucbvax, apple, hplabs}!well!pokey
? "Sound impressive?  It should.  It is."

	Root Boy Jim is what I am
	Are you what you are or what?