[comp.lang.c] micro-optimizing loops

torek@elf.ee.lbl.gov (Chris Torek) (02/24/91)

[the original loop:
	for (i = 0; i < 100; i++)
		x += i;
]

In article <409@ceco.ceco.com> garry@ceco.ceco.com (Garry Garrett)
suggests rewriting the loop as
>>> 		for (i=99; i; i--)
>>> 			x += i;

>In article <339@smds.UUCP> rh@smds.UUCP (Richard Harter) suggests instead:
>> 		for (i=100;--i>=0;)
>> 			x += i;

In article <414@ceco.ceco.com> garry@ceco.ceco.com (Garry Garrett) writes:
>	I must disagree with you.  For starters, consider the situation where
>i == 0.  Your loop adds 0 to x.

(True; we could use

	for (i = 100; --i > 0;)
		x += i;

instead, or equivalently `--i != 0', etc.)

>Secondly, as my discussion that you deleted pointed out, performing
>"--i>=0" would take more time than "i"

`No it won't!  Yes it will!  No it won't!  Yes it will!'  :-)

>The reason is that both i and 0 must be loaded into registers,
>compared, and a branch instruction executed for your version.  My version
>loads i into a register and performs a branch.

This turns out not to be the case.

There are a number of common `dumb' implementations for the two loops

	for (i = 100; --i > 0;)	/* loop 1 */
		x += i;

and

	for (i = 99; i; i--)	/* loop 2 */
		x += i;

and they all depend on the compiler's and machine's characteristics.
(Note: for the discussion below I have assumed that `i' goes in a
machine register.  If not, the picture becomes even more complicated,
as the loops then depend on which ALU operations, if any, are permitted
on memory operands.)

Most currently-popular machines can be placed in exactly one of these
two categories: `has condition codes' or `does not have condition
codes'.  The former can be further divided into `has condition codes
that are or can be set on arithmetic operations such as decrement' and
`has condition codes that can only be set by test instructions'.  The
latter can be divided into `has branch on positive instruction' and
`requires comparison against register'.

(I am deliberately leaving modern RISCs out of the first part of this
article.)

On the condition-code machines, the first loop tends to compile to one of:

	# cc machine, compiler leaves test at top
		mov	$100, r0	# i = 100
	loop:
		dec	r0		# decrement and set condition codes
		jle	end		# exit loop if i <= 0
		<code>			# x += i, or whatever
		jmp	loop		# continue loop
	end:
	[3 instructions overhead]

or:

		mov	$100, r0
	loop:
		dec	r0		# decrement but do not set cc's
		tst	r0		# set cc's
		jle	end
		<code>
		jmp	loop
	end:
	[4 instructions overhead]

or:

	# cc machine, compiler copies test to bottom
		mov	$100, r0
		dec	r0
		# optional: tst r0
		jle	end
	loop:
		<code>
		dec	r0
		# optional: tst r0
		jgt	loop
	end:
	[2 or 3 instructions overhead]

Thus, the loop itself contains either three or four `overhead' instructions
if the compiler does not copy the test to the bottom, but only two or three
if it does.  If the compiler is smarter, it also notices that i=100,--i
gives i=99, which is > 0, which means the top `jle' is unnecessary and
the `mov $100, r0; dec r0' sequence can be simplified to `mov $99, r0'.
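
In C terms, the fully-simplified loop behaves as if it had been
written (roughly):

	i = 99;
	do
		x += i;
	while (--i > 0);

with the entry test folded away, since the compiler knows 99 > 0.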


On the non-cc machines, the first loop tends to compile to one of:

	# non-cc, branch on 0-compare; test left at top
		mov	$100, r0
	loop:
		dec	r0
		jle	r0, end		# jump if r0 <= 0
		<code>
		jmp	loop
	end:
	[3 instructions overhead]

or:

	# non-cc, branch on reg-compare; test left at top
		mov	$100, r0
		mov	$0, r1
	loop:
		dec	r0
		jle	r0, r1, end	# jump if r0 <= r1
		<code>
		jmp	loop
	end:
	[3 instructions overhead]

or (if the compiler is seriously stupid):

		mov	$100, r0
	loop:
		dec	r0
		mov	$0, r1
		jle	r0, r1, end
		<code>
		jmp	loop
	end:
	[4 instructions overhead]

or:

	# non-cc, branch on 0 compare; test copied to bottom:
		mov	$100, r0
		dec	r0
		jle	r0, end
	loop:
		<code>
		dec	r0
		jgt	r0, loop
	end:
	[2 instructions overhead]

or the same with the `mov $0, r1' added and the branches changed to
`jxx r0, r1, end|loop' [3 instructions overhead].

Thus, the compiler tends to emit 2, 3, or 4 instructions of loop
overhead, but you only get 4 if the compiler does not copy the
test to the bottom of the loop.

For the second loop [for (i = 99; i; i--)], we get one of these:

	# cc machine, does not matter whether arithmetic sets cc's;
	# compiler leaves test at top
		mov	$99, r0
	loop:
		tst	r0		# set condition codes
		jeq	end		# exit loop if i==0
		<code>
		dec	r0
		jmp	loop
	end:
	[4 instructions of overhead]

or:

	# cc machine; compiler copies test to bottom
		mov	$99, r0
		tst	r0
		jeq	end
	loop:
		<code>
		dec	r0
		# optional tst r0
		jne	loop		# repeat if not 0
	end:
	[2 or 3 instructions of overhead]

On the non-cc machine, we get one of:

	# non-cc, test left at top
		mov	$99, r0
		# optional mov $0, r1
	loop:
		jeq	r0, end		# or jeq r0, r1, end
		<code>
		dec	r0
		jmp	loop
	end:
	[3 instructions overhead]

or:

	# non-cc, test left at top, really dumb compiler
		mov	$99, r0
	loop:
		mov	$0, r1
		jeq	r0, r1, end
		<code>
		dec	r0
		jmp	loop
	end:
	[4 instructions overhead]

or:

	# non-cc, test copied to bottom
		mov	$99, r0
		# optional mov $0, r1
		jeq	r0, end		# or jeq r0, r1, end
	loop:
		<code>
		dec	r0
		jne	r0, loop	# or jne r0, r1, loop
	end:
	[2 instructions overhead]

or:

		mov	$99, r0
		mov	$0, r1
		jeq	r0, r1, end
	loop:
		<code>
		dec	r0
		mov	$0, r1
		jne	r0, r1, loop
	end:
	[3 instructions overhead]

Again, we get 2, 3, or 4 instructions of overhead, with 2 or 3 occurring
when the compiler moves the test to the bottom and 4 occurring when it
leaves the test at the top.

>2 steps vs. your 4 steps.

Nope. :-)

Now consider modern RISCs, in which pipeline delays are exposed, i.e.,
you get one `free' instruction after a branch.  Then:

	for (i = 99; i; i--)
		code;

can be compiled as:

		mov	$99, r1		# r0 always = 0
		# compiler does not bother to test 99 != 0
	loop:
		<code>
		jne	r1, r0, loop	# branch if not 0
		dec	r1		# do this whether we branch or not
		# out of the loop now, but r1 = -1;
		# if it is used again, we must first set it back to 0.
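
(In C terms, the compiler has effectively rewritten the loop as

	i = 99;
	do
		code;
	while (i--);

so the body also runs once with i == 0, which is harmless when the
body is `x += i', and i ends up as -1.)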

On the other hand,

	for (i = 100; --i > 0;)
		code;

can be compiled as:

		mov	$99, r1
		mov	$1, r2
	loop:
		<code>
		jgt	r1, r2, loop	# branch if i > 1
		dec	r1		# and always decrement i

Both of these loops have exactly two instructions of overhead.  (Note
that the replacement of `--i > 0' with `i-- > 1' is legal since the
only case where i-- < 1 but --i > 0 is when i==MININT before the
decrement; i becomes MAXINT after the decrement, and this is an
underflow, and the ANSI standard says all bets are off on underflow.
Additionally, in this case, i is a loop induction variable and is known
to occupy the range [99..1].  RISC compilers tend to do this sort of
optimization.)  A good RISC compiler may even choose to transform one
of these loops to the other.
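
Expressed back in C, that last compilation is effectively

	i = 99;
	do
		code;
	while (i-- > 1);

which runs `code' with i = 99..1 and leaves i == 0, just like the
original loop.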

Of course, the best optimization for:

	for (i = 1; i < 100; i++)
		x += i;

is:

	x += 4950;

which will typically be several hundred times faster than anyone's
micro-optimization.
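
(The constant is just the closed form 1 + 2 + ... + 99 = 99*100/2 =
4950.)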

(On another topic entirely:)
>	Bear in mind that C is one of the hardest languages in the world to
>optimize.

Not really.  Try optimizing ML, or Ada, or . . . .
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/25/91)

What Chris doesn't point out is that

  i = blah;
  do
   {
    ...
   }
  while (--i);

will compile to the optimal code on most machines under most compilers,
whether they have condition codes or not, whether they have exposed jump
delays or not, and whether the optimizer is on or off. I don't know any
machines where this will produce worse code than any other expression of
the loop.
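
Applied to the running example (assuming, as before, that the final
value of i does not matter):

	i = 99;
	do
		x += i;
	while (--i);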

---Dan

rh@smds.UUCP (Richard Harter) (02/25/91)

I have little to add to Chris's summary except to note that some machines
have the equivalent of the PDP-11's SOB instruction, which does both a
decrement (or increment) and a conditional branch in one instruction,
e.g. the 68xxx series.  For these machines you have a one-instruction
overhead in a loop.

Also the natural thing to do is to jump to the end of the loop immediately,
e.g.

		goto 2$
	1$:	loop body
		iteration code (i++, whatever)
	2$:	goto 1$ if test is satisfied
	3$:

Replicating the test really doesn't save you much, i.e.

		goto 3$ if test is not satisfied
	1$:	loop body
		iteration code
	2$:	goto 1$ if test is satisfied
	3$:
		
doesn't win much because a naked goto (as distinct from a naked civil
servant) is very cheap [for CISC machines].

For reasons that are not clear to me many optimizing compilers will not
collapse the two machine instructions

		dec r1
		bge 1$

into the available single instruction to do the same thing.  Perhaps
some of our compiler writers can explain this to us.

Finally, the inimitable Dan B. points out that the do-while construct
gives the compiler its best shot at optimizing code.  Dan is right,
albeit a little sloppy.  [Hint 1 -- the do-while loop executes at least
once].
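
(That is, in the general case the loop must be guarded,

	if (i = blah)
		do {
			...
		} while (--i);

so that the body is skipped entirely when blah is 0.)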

While I am at it there is an important point that has not been made.
Generally speaking, when you are altering code (presumably for
optimization since clarity is irrelevant in C :-)) it is desirable that a
transformation preserves semantics.  Or, to be more precise, it should
be clear what is being preserved by the transformation and what is not.

For example, the construct 'for (i=0;i<E;i++)' has the properties that
the loop body will be executed for i=<0,...E-1>, that i assumes these
values sequentially increasing, that i and E may be signed or unsigned,
and that the loop body will not be executed if E is negative.  Now consider
the three posted alternatives for a countdown loop:

(A)	for(i=E;--i>=0;) {body}
(B)	for(i=E-1;i;--i) {body}
(C)	i=E-1; do {body} while(i--);

None of the three satisfies the condition that i assumes the values 0 to
E-1 sequentially increasing.  Alternative B fails if i is unsigned, and
in any case never executes the body with i == 0.  Alternative C fails if
E<=0 initially.  What this means is that B and C are weaker and less
valuable transformations than A.  On the other hand a corrected version
of (C)

(D)	i=E; if (i--) {do {body} while(i--);}

is equivalent to (A), and, as Dan observes, stands the best chance of
being completely optimized.
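
For the running example (E == 100) that is

	i = 100;
	if (i--) {
		do {
			x += i;
		} while (i--);
	}

which adds 99, 98, ..., 0 to x and exits with i == -1, exactly as (A)
does.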

Some of you may think I am being fussy (and you would be right).  However,
part of the trade of being a programmer is taking programs which are
(supposedly) correct and changing them into other faster/smaller programs
which are equivalent.  One's chances of doing this are much better if one
relies on proven transformations with known semantics.
-- 
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh Phone: 508-369-7398 
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb.  This sentence short.  This signature done.

torek@elf.ee.lbl.gov (Chris Torek) (02/26/91)

In article <15396:Feb2419:32:5891@kramden.acf.nyu.edu>
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
>What Chris doesn't point out is that
>
>  i = blah;
>  do
>   {
>    ...
>   }
>  while (--i);
>
>will compile to the optimal code on most machines under most compilers ...

This is, of course, because you have manually moved the loop test to
the bottom.  This almost invariably saves one `jump' instruction, if
nothing else.

>... whether they have exposed jump delays or not ...

Actually,

	do { ... } while (i--);

is a bit easier on some exposed pipeline machines, simply because it
compiles directly to

	jne	r1, r0, loop	# branch if i != 0
	dec	r1		# but do i-- first

without any thought whatsoever.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

torek@elf.ee.lbl.gov (Chris Torek) (02/26/91)

(with any luck this will die its own death after this...)

In article <344@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>For reasons that are not clear to me many optimizing compilers will not
>collapse the two machine instructions
>		dec r1
>		bge 1$
>into the available single instruction to do the same thing.  Perhaps
>some of our compiler writers can explain this to us.

Certain machines (grr :-) ) that have subtract-and-branch-on-condition
instructions can only branch a very short distance; compilers for these
must figure out how far the branch goes, or else use assembler pseudo
ops like `jsobgtr' which expand if necessary.  Unfortunately, the VAX
(for one) assemblers tend not to have `jsobgtr' pseudo-ops.  `Fixed in
the next release....'

(Incidentally, I played with timing decl+jgeq vs sobgeq on the VAX and
found that it rarely made any difference.  It is more compact, which
does not hurt, but not really any faster.  Other `fancy' VAX
instructions also turn out to be slower than equivalent sequences of
simpler instructions.  Which ones, and how much, depend on the
particular model: 780s and 8250s have fairly different characteristics.)
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

timr@gssc.UUCP (Tim Roberts) (02/27/91)

In article <10250@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>(with any luck this will die its own death after this...)

Sorry.  I just had to comment.

In article <344@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>For reasons that are not clear to me many optimizing compilers will not
>collapse the two machine instructions
>		dec r1
>		bge 1$
>into the available single instruction to do the same thing.  Perhaps
>some of our compiler writers can explain this to us.

From the Intel i486 Programmer's Reference Manual and the Department of
Counterintuitive Advice:

  o The LOOP instruction takes two more clock cycles to execute than the
    equivalent decrement and conditional jump instructions.

Even more strange:

  o The JECXZ (jump on ECX=0) instruction takes one more clock cycle to 
    execute than the equivalent compare and conditional jump instructions.

-- 
timr@gssc.gss.com	Tim N Roberts, CCP	Graphic Software Systems
						Beaverton, OR

This is a very long palindrome. .emordnilap gnol yrev a si sihT

conor@lion.inmos.co.uk (Conor O'Neill) (02/27/91)

In article <10191@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>Of course, the best optimization for:
>
>	for (i = 1; i < 100; i++)
>		x += i;
>
>is:
>
>	x += 4950;


No.
Try
        { x += 4950; i = 100; }

---
Conor O'Neill, Software Group, INMOS Ltd., UK.
UK: conor@inmos.co.uk		US: conor@inmos.com
"It's state-of-the-art" "But it doesn't work!" "That is the state-of-the-art".

torek@elf.ee.lbl.gov (Chris Torek) (03/01/91)

>In article <10191@dog.ee.lbl.gov> I wrote:
>>Of course, the best optimization for:
>>	for (i = 1; i < 100; i++)
>>		x += i;
>>is:
>>	x += 4950;

In article <14522@ganymede.inmos.co.uk> conor@inmos.co.uk (Conor O'Neill)
writes:
>No.
>Try
>        { x += 4950; i = 100; }

I was assuming (perhaps unwisely) that everyone understood that the
variable `i' was `dead' after the loop.  The alternative `optimizations'
under discussion were

	for (i = 100; --i > 0;)
		x += i;

and

	for (i = 99; i; i--)
		x += i;

both of which leave i==0.  For these to be suitable `optimizations' for
the original (count-from-1-to-99) loop, the final value of `i' would
have to be irrelevant.  (Actually, it is conceivable that the final
value of `i' might have to be exactly 0 or exactly 100; in real code,
as opposed to pedagogic examples, one should check.)

(You might also note that article <10191@dog.ee.lbl.gov> touched lightly
on compiler transformations that changed the final value of `i' in a
count-down loop from 0 to -1.  I *did* consider it....)

(Richard O'Keefe also pointed out some of the dangers in transforming
code sequences.  One must always be careful of semantics.  The final
value of `i' is such a semantic, and Conor O'Neill is right to be wary
of anything that alters it.  In this particular example, however, the
final value should be considered irrelevant.)
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov