torek@elf.ee.lbl.gov (Chris Torek) (02/24/91)
[the original loop: for (i = 0; i < 100; i++) x += i; ]

In article <409@ceco.ceco.com> garry@ceco.ceco.com (Garry Garrett)
suggests rewriting the loop as

>>>	for (i=99; i; i--)
>>>		x += i;

In article <339@smds.UUCP> rh@smds.UUCP (Richard Harter) suggests instead:

>>	for (i=100;--i>=0;)
>>		x += i;

In article <414@ceco.ceco.com> garry@ceco.ceco.com (Garry Garrett) writes:
>	I must disagree with you.  For starters, consider the situation
>where i == 0.  Your loop adds 0 to x.

(True; we could use

	for (i = 100; --i > 0;)
		x += i;

instead, or equivalently `--i != 0', etc.)

>Secondly, as my discussion that you deleted pointed out, performing
>"--i>=0" would take more time than "i"

`No it won't!  Yes it will!  No it won't!  Yes it will!' :-)

>The reason is that both i and 0 must be loaded into registers,
>compared, and a branch instruction executed for your version.  My
>version loads i into a register and performs a branch.

This turns out not to be the case.  There are a number of common `dumb'
implementations for the two loops

	for (i = 100; --i > 0;)		/* loop 1 */
		x += i;

and

	for (i = 99; i; i--)		/* loop 2 */
		x += i;

and they all depend on the compiler's and machine's characteristics.
(Note: for the discussion below I have assumed that `i' goes in a
machine register.  If not, the picture becomes even more complicated,
as the loops then depend on which ALU operations, if any, are permitted
on memory operands.)

Most currently-popular machines can be placed in exactly one of these
two categories: `has condition codes' or `does not have condition
codes'.  The former can be further divided into `has condition codes
that are or can be set on arithmetic operations such as decrement' and
`has condition codes that can only be set by test instructions'.  The
latter can be divided into `has branch-on-positive instruction' and
`requires comparison against register'.  (I am deliberately leaving
modern RISCs out of the first part of this article.)
On the condition-code machines, the first loop tends to compile to one
of:

	# cc machine, compiler leaves test at top
		mov	$100, r0	# i = 100
	loop:
		dec	r0		# decrement and set condition codes
		jle	end		# exit loop if i <= 0
		<code>			# x += i, or whatever
		jmp	loop		# continue loop
	end:

[3 instructions overhead] or:

		mov	$100, r0
	loop:
		dec	r0		# decrement but do not set cc's
		tst	r0		# set cc's
		jle	end
		<code>
		jmp	loop
	end:

[4 instructions overhead] or:

	# cc machine, compiler copies test to bottom
		mov	$100, r0
		dec	r0
		# optional: tst r0
		jle	end
	loop:
		<code>
		dec	r0
		# optional: tst r0
		jgt	loop
	end:

[2 or 3 instructions overhead]

Thus, the loop itself contains either three or four `overhead'
instructions if the compiler does not copy the test to the bottom, but
only two or three if it does.  If the compiler is smarter, it also
notices that i=100,--i gives i=99, which is > 0, which means the top
`jle' is unnecessary and the `mov $100, r0; dec r0' sequence can be
simplified to `mov $99, r0'.

	# non-cc, branch on 0-compare; test left at top
		mov	$100, r0
	loop:
		dec	r0
		jle	r0, end		# jump if r0 <= 0
		<code>
		jmp	loop
	end:

[3 instructions overhead] or:

	# non-cc, branch on reg-compare; test left at top
		mov	$100, r0
		mov	$0, r1
	loop:
		dec	r0
		jle	r0, r1, end	# jump if r0 <= r1
		<code>
		jmp	loop
	end:

[3 instructions overhead] or (if the compiler is seriously stupid):

		mov	$100, r0
	loop:
		dec	r0
		mov	$0, r1
		jle	r0, r1, end
		<code>
		jmp	loop
	end:

[4 instructions overhead] or:

	# non-cc, branch on 0-compare; test copied to bottom
		mov	$100, r0
		dec	r0
		jle	r0, end
	loop:
		<code>
		dec	r0
		jgt	r0, loop
	end:

[2 instructions overhead]

or the same with the `mov $0, r1' added and the branches changed to
`jxx r0, r1, end|loop' [3 instructions overhead].

Thus, the compiler tends to emit 2, 3, or 4 instructions of loop
overhead, but you only get 4 if the compiler does not copy the test to
the bottom of the loop.
For the second loop [for (i = 99; i; i--)], we get one of these:

	# cc machine, does not matter whether arithmetic sets cc's;
	# compiler leaves test at top
		mov	$99, r0
	loop:
		tst	r0		# set condition codes
		jeq	end		# exit loop if i == 0
		<code>
		dec	r0
		jmp	loop
	end:

[4 instructions of overhead] or:

	# cc machine; compiler copies test to bottom
		mov	$99, r0
		tst	r0
		jeq	end
	loop:
		<code>
		dec	r0
		# optional: tst r0
		jne	loop		# repeat if not 0
	end:

[2 or 3 instructions of overhead]

On the non-cc machine, we get one of:

	# non-cc, test left at top
		mov	$99, r0
		# optional: mov $0, r1
	loop:
		jeq	r0, end		# or: jeq r0, r1, end
		<code>
		dec	r0
		jmp	loop
	end:

[3 instructions overhead] or:

	# non-cc, test left at top, really dumb compiler
		mov	$99, r0
	loop:
		mov	$0, r1
		jeq	r0, r1, end
		<code>
		dec	r0
		jmp	loop
	end:

[4 instructions overhead] or:

	# non-cc, test copied to bottom
		mov	$99, r0
		# optional: mov $0, r1
		jeq	r0, end		# or: jeq r0, r1, end
	loop:
		<code>
		dec	r0
		jne	r0, loop	# or: jne r0, r1, loop
	end:

[2 instructions overhead] or:

		mov	$99, r0
		mov	$0, r1
		jeq	r0, r1, end
	loop:
		<code>
		dec	r0
		mov	$0, r1
		jne	r0, r1, loop
	end:

[3 instructions overhead]

Again, we get 2, 3, or 4 instructions of overhead, with 2 or 3
occurring when the compiler moves the test to the bottom and 4
occurring when it leaves the test at the top.

>2 steps vs. your 4 steps.

Nope. :-)

Now consider modern RISCs, in which pipeline delays are exposed, i.e.,
you get one `free' instruction after a branch.  Then:

	for (i = 99; i; i--)
		code;

can be compiled as:

		mov	$99, r1		# r0 always = 0
		# compiler does not bother to test 99 != 0
	loop:
		<code>
		jne	r1, r0, loop	# branch if not 0
		dec	r1		# do this whether we branch or not
	# out of the loop now, but r1 = -1;
	# if it is used again, we must first set it back to 0.
On the other hand,

	for (i = 100; --i > 0;)
		code;

can be compiled as:

		mov	$99, r1
		mov	$1, r2
	loop:
		<code>
		jgt	r1, r2, loop	# branch if i > 1
		dec	r1		# and always decrement i

Both of these loops have exactly two instructions of overhead.  (Note
that the replacement of `--i > 0' with `i-- > 1' is legal since the
only case where i-- < 1 but --i > 0 is when i==MININT before the
decrement; i becomes MAXINT after the decrement, and this is an
underflow, and the ANSI standard says all bets are off on underflow.
Additionally, in this case, i is a loop induction variable and is known
to occupy the range [99..1].  RISC compilers tend to do this sort of
optimization.)  A good RISC compiler may even choose to transform one
of these loops into the other.

Of course, the best optimization for:

	for (i = 1; i < 100; i++)
		x += i;

is:

	x += 4950;

which will typically be several hundred times faster than anyone's
micro-optimization.

(On another topic entirely:)

>	Bear in mind that C is one of the hardest languages in the world
>to optimize.

Not really.  Try optimizing ML, or Ada, or . . . .
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/25/91)
What Chris doesn't point out is that
	i = blah;
	do
	{
		...
	}
	while (--i);
will compile to the optimal code on most machines under most compilers,
whether they have condition codes or not, whether they have exposed jump
delays or not, and whether the optimizer is on or off. I don't know any
machines where this will produce worse code than any other expression of
the loop.
---Dan
rh@smds.UUCP (Richard Harter) (02/25/91)
I have little to add to Chris's summary except to note that some
machines have the equivalent of the PDP-11's SOB instruction, which
does both a decrement (increment) and a branch on condition code in
one instruction, e.g. the 68xxx series.  For these machines you have
a one instruction overhead in a loop.  Also the natural thing to do
is to jump to the end of the loop immediately, e.g.

		goto 2$
	1$:	loop body
		iteration code (i++, whatever)
	2$:	goto 1$ if test is satisfied
	3$:

Replicating the test, i.e.

		goto 3$ if test is not satisfied
	1$:	loop body
		iteration code
	2$:	goto 1$ if test is satisfied
	3$:

doesn't win much because the cost of a naked goto (as distinct from
the cost of a naked civil servant) is very cheap [for CISC machines].

For reasons that are not clear to me many optimizing compilers will
not collapse the two machine instructions

	dec	r1
	bge	1$

into the available single instruction to do the same thing.  Perhaps
some of our compiler writers can explain this to us.

Finally, the inimitable Dan B. points out that the do-while construct
gives the compiler its best shot at optimizing code.  Dan is right,
albeit a little sloppy.  [Hint 1 -- the do-while loop executes at
least once].

While I am at it, there is an important point that has not been made.
Generally speaking, when you are altering code (presumably for
optimization, since clarity is irrelevant in C :-)) it is desirable
that a transformation preserve semantics.  Or, to be more precise, it
should be clear what is being preserved by the transformation and what
is not.  For example, the construct `for (i=0;i<E;i++)' has the
properties that the loop body will be executed for i=<0,...E-1>, that
i assumes these values sequentially increasing, that i and E may be
signed or unsigned, and that the loop body will not be executed if E
is negative.
Now consider the three posted alternatives for a countdown loop:

	(A)	for (i=E; --i>=0;) {body}
	(B)	for (i=E-1; i; --i) {body}
	(C)	i=E-1; do {body} while (i--);

None of the three satisfies the condition that i assumes the values 0
to E-1 sequentially increasing.  Alternative B fails if i is unsigned.
Alternative C fails if E<=0 initially.  What this means is that B and
C are weaker and less valuable transformations than A.  On the other
hand, a corrected version of (C),

	(D)	i=E; if (i--) { do {body} while (i--); }

is equivalent to (A) and, as Dan observes, stands the best chance of
being completely optimized.

Some of you may think I am being fussy (and you would be right).
However, part of the trade of being a programmer is taking programs
which are (supposedly) correct and changing them into other
faster/smaller programs which are equivalent.  One's chances of doing
this are much better if one relies on proven transformations with
known semantics.
-- 
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh	Phone: 508-369-7398
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb.  This sentence short.  This signature done.
torek@elf.ee.lbl.gov (Chris Torek) (02/26/91)
In article <15396:Feb2419:32:5891@kramden.acf.nyu.edu>
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
>What Chris doesn't point out is that
>
>	i = blah;
>	do
>	{
>	...
>	}
>	while (--i);
>
>will compile to the optimal code on most machines under most compilers ...

This is, of course, because you have manually moved the loop test to
the bottom.  This almost invariably saves one `jump' instruction, if
nothing else.

>... whether they have exposed jump delays or not ...

Actually,

	do { ... } while (i--);

is a bit easier on some exposed-pipeline machines, simply because it
compiles directly to

	jne	r1, r0, loop	# branch if i != 0
	dec	r1		# but do i-- first

without any thought whatsoever.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov
torek@elf.ee.lbl.gov (Chris Torek) (02/26/91)
(with any luck this will die its own death after this...)

In article <344@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>For reasons that are not clear to me many optimizing compilers will not
>collapse the two machine instructions
>	dec	r1
>	bge	1$
>into the available single instruction to do the same thing.  Perhaps
>some of our compiler writers can explain this to us.

Certain machines (grr :-) ) that have subtract-and-branch-on-condition
instructions can only branch a very short distance; compilers for these
must figure out how far the branch goes, or else use assembler
pseudo-ops like `jsobgtr' which expand if necessary.  Unfortunately,
the VAX assemblers (for one) tend not to have `jsobgtr' pseudo-ops.
`Fixed in the next release....'

(Incidentally, I played with timing decl+jgeq vs. sobgeq on the VAX and
found that it rarely made any difference.  It is more compact, which
does not hurt, but not really any faster.  Other `fancy' VAX
instructions also turn out to be slower than equivalent sequences of
simpler instructions.  Which ones, and by how much, depend on the
particular model: 780s and 8250s have fairly different
characteristics.)
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov
timr@gssc.UUCP (Tim Roberts) (02/27/91)
In article <10250@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>(with any luck this will die its own death after this...)

Sorry.  I just had to comment.

In article <344@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>For reasons that are not clear to me many optimizing compilers will not
>collapse the two machine instructions
>	dec	r1
>	bge	1$
>into the available single instruction to do the same thing.  Perhaps
>some of our compiler writers can explain this to us.

From the Intel i486 Programmer's Reference Manual and the Department
of Counterintuitive Advice:

o	The LOOP instruction takes two more clock cycles to execute
	than the equivalent decrement and conditional jump
	instructions.

Even more strange:

o	The JECXZ (jump on ECX=0) instruction takes one more clock
	cycle to execute than the equivalent compare and conditional
	jump instructions.
-- 
timr@gssc.gss.com	Tim N Roberts, CCP	Graphic Software Systems
						Beaverton, OR

This is a very long palindrome.  .emordnilap gnol yrev a si sihT
conor@lion.inmos.co.uk (Conor O'Neill) (02/27/91)
In article <10191@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>Of course, the best optimization for:
>
>	for (i = 1; i < 100; i++)
>		x += i;
>
>is:
>
>	x += 4950;

No.  Try

	{ x += 4950; i = 100; }
---
Conor O'Neill, Software Group, INMOS Ltd., UK.
UK: conor@inmos.co.uk		US: conor@inmos.com
"It's state-of-the-art"  "But it doesn't work!"  "That is the state-of-the-art".
torek@elf.ee.lbl.gov (Chris Torek) (03/01/91)
>In article <10191@dog.ee.lbl.gov> I wrote:
>>Of course, the best optimization for:
>>	for (i = 1; i < 100; i++)
>>		x += i;
>>is:
>>	x += 4950;

In article <14522@ganymede.inmos.co.uk> conor@inmos.co.uk (Conor
O'Neill) writes:
>No.
>Try
>	{ x += 4950; i = 100; }

I was assuming (perhaps unwisely) that everyone understood that the
variable `i' was `dead' after the loop.  The alternative
`optimizations' under discussion were

	for (i = 100; --i > 0;)
		x += i;

and

	for (i = 99; i; i--)
		x += i;

both of which leave i==0.  For these to be suitable `optimizations'
for the original (count-from-1-to-99) loop, the final value of `i'
would have to be irrelevant.  (Actually, it is conceivable that the
final value of `i' might have to be exactly 0 or exactly 100; in real
code, as opposed to pedagogic examples, one should check.)

(You might also note that article <10191@dog.ee.lbl.gov> touched
lightly on compiler transformations that changed the final value of
`i' in a count-down loop from 0 to -1.  I *did* consider it....)

(Richard O'Keefe also pointed out some of the dangers in transforming
code sequences.  One must always be careful of semantics.  The final
value of `i' is such a semantic, and Conor O'Neill is right to be wary
of anything that alters it.  In this particular example, however, the
final value should be considered irrelevant.)
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov