[comp.lang.c] Optimal for loop on the 68020.

pokey@well.UUCP (Jef Poskanzer) (06/05/89)

SUMMARY
-------

I compiled the following different kinds of for loops:

    for ( i = 0; i < COUNT; i++ )
    for ( i = 0; i < COUNT; ++i )
    for ( i = 0; ++i <= COUNT; )
    for ( i = 0; i++ < COUNT; )
    for ( i = COUNT; i > 0; i-- )
    for ( i = COUNT; i > 0; --i )
    for ( i = COUNT; --i >= 0; )
    for ( i = COUNT; i-- > 0; )

on a Sun 3/50 with both SunOS 3.5 cc and gcc 1.35, and looked at the
generated code and the timings.  COUNT was a small (< 127) compile-time
constant.  The loop body did not reference i and had no side-effects.
In theory, all eight of these loops should have optimized to the most
efficient loop possible, ignoring the otherwise unreferenced variable i
and simply traversing the loop body the proper number of times.  On the
68020, this most efficient loop is a dbra instruction.  But in fact, cc
never generates dbra, and gcc generates it for only one of the above
loop types, and only when the -fstrength-reduce flag is specified.

In contrast, the BSD cc on a vax generates a single-instruction loop in
four out of the eight cases, including the two most commonly-used
ones.  This is not because the vax compiler is any smarter -- the fact
that it wins only half the time, instead of all the time, shows that it
is also failing to recognize that it can freely optimize the loop.  The
only reason the vax's cc does better is that the vax has a richer
instruction set.

Since the loop overhead for dbra is a factor of 1.7 lower than for the
code generated for the common for loops, it is useful to know how to
provoke your compiler into generating it.  But it would be more useful
for compilers to be smarter.

And more generally, those of you who blithely assume (as I used to)
that compilers are smart enough to turn clearly-expressed C into
optimal assembler should perhaps look at your compiler's assembler
output more often.
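As a sanity check of the premise -- this is my addition, not part of the
original measurements -- all eight forms really do execute the body exactly
COUNT times, so a compiler would be free to emit identical code for each:

```c
#include <assert.h>

#define COUNT 100

/* Run each of the eight loop forms with a body that only counts
   iterations; return 1 if every form ran COUNT times. */
static int all_forms_run_count_times(void)
{
    int i, n[8] = {0};

    for ( i = 0; i < COUNT; i++ )  n[0]++;
    for ( i = 0; i < COUNT; ++i )  n[1]++;
    for ( i = 0; ++i <= COUNT; )   n[2]++;
    for ( i = 0; i++ < COUNT; )    n[3]++;
    for ( i = COUNT; i > 0; i-- )  n[4]++;
    for ( i = COUNT; i > 0; --i )  n[5]++;
    for ( i = COUNT; --i >= 0; )   n[6]++;
    for ( i = COUNT; i-- > 0; )    n[7]++;

    for ( i = 0; i < 8; i++ )
        if (n[i] != COUNT)
            return 0;
    return 1;
}
```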


FULL DATA
---------

Case 1) First, let's look at the generic loops, counting up from zero
to the limit:

    for ( i = 0; i < COUNT; i++ )
    for ( i = 0; i < COUNT; ++i )

       cc -O		gcc -O & gcc -O -fstrength-reduce

    moveq  #0,d6	    clrl   d0
tag:			tag:
    <loop body>		    <loop body>
    addql  #1,d6	    addql  #1,d0
    moveq  #COUNT,d1	    moveq  #COUNT-1,d3
    cmpl   d1,d6	    cmpl   d0,d3
    jlt    tag		    jge    tag

    0.97 usecs		    0.97 usecs

The two compilers generate essentially the same instructions.  Pre or
post increment doesn't make any difference.  Note that the moveq of
COUNT could be moved above the loop -- the loop body doesn't
call any subroutines that might smash it.  This looks like a missed
optimization.


Case 2a) Second, let's try doing the increment and test in one C
statement.  Seems to me this shouldn't make any difference to a smart
optimizer, but...

    for ( i = 0; ++i <= COUNT; )

       cc -O		gcc -O & gcc -O -fstrength-reduce

    moveq  #0,d6	    clrl   d0
    jra    tag2		    jra    tag2
tag1:			tag1:
    <loop body>		    <loop body>
tag2:			tag2:
    addql  #1,d6	    addql  #1,d0
    moveq  #COUNT,d1	    moveq  #COUNT,d3
    cmpl   d1,d6	    cmpl   d0,d3
    jle    tag1		    jge    tag1

    0.97 usecs		    0.97 usecs

No important difference from case 1.

Case 2b) But if we try the post increment version:

    for ( i = 0; i++ < COUNT; )

       cc -O		gcc -O & gcc -O -fstrength-reduce

    moveq  #0,d6	    clrl   d0
    jra    tag2		    jra    tag2
tag1:			tag1:
    <loop body>		    <loop body>
tag2:			tag2:
    movl   d6,d0	    addql  #1,d0
    addql  #1,d6	    moveq  #COUNT+1,d3
    moveq  #COUNT,d1	    cmpl   d0,d3
    cmpl   d1,d0	    jgt    tag1
    jlt    tag1

    1.11 usecs		    0.97 usecs

Gcc is smart enough to bias the loop count, but cc doesn't see that
optimization and so adds an extra instruction.


Case 3) Third, let's try a count-down loop:

    for ( i = COUNT; i > 0; i-- )
    for ( i = COUNT; i > 0; --i )

       cc -O		gcc -O & gcc -O -fstrength-reduce

    moveq  #COUNT,d6	    moveq  #COUNT,d0
tag:			tag:
    <loop body>		    <loop body>
    subql  #1,d6	    subql  #1,d0
    tstl   d6		    tstl   d0
    jgt    tag		    jgt    tag

    0.83 usecs		    0.83 usecs

Here the two compilers generate exactly the same instructions.  There
is one less instruction than in the count-up case, because the
compilers are smart enough to use the tstl instruction to compare
against zero, and so they don't need the moveq #COUNT instruction
(which shouldn't have been in the loop in the first place).


Case 4a) Fourth, let's try a count-down loop with the decrement
included in the test statement:

    for ( i = COUNT; --i >= 0; )

       cc -O		      gcc -O		gcc -O -fstrength-reduce
    
    moveq  #COUNT,d6	    moveq  #COUNT,d0	    moveq  #COUNT,d0
    jra    tag2		    jra    tag2		    jra    tag2
tag1:			tag1:			tag1:
    <loop body>		    <loop body>		    <loop body>
tag2:			tag2:			tag2:
    subql  #1,d6	    subql  #1,d0	    dbra   d0,tag1
    jge    tag1		    jpl    tag1		    clrw   d0
						    subql  #1,d0
						    jcc    tag1

    0.70 usecs		    0.70 usecs		    0.57 usecs

Cc and gcc generate code similar to case 3, except that they suddenly
decide that subql sets the condition codes just fine by itself, and a
separate tstl is not necessary.  So they shave off an instruction.
Seems like a peephole optimizer should have picked this up in the
previous cases too; it's another missed optimization.

But the big win here is with the -fstrength-reduce option in gcc.  It
actually generates the optimal one-instruction loop, for a factor of
1.7 reduction in loop overhead from the generic loop.  Not bad.  But
wait!  What's that chud after the loop?  Let's see, clear d0's low
word to zero, subtract one from it giving -1 and setting carry, and jump
if carry is clear.  Hmm, looks like a three-instruction no-op to me!  I guess gcc
wants to leave the loop index with the right value, and isn't smart
enough to notice it isn't used.  (But why not simply moveq the final
value?)  Oh well, since it's not inside the loop it doesn't make much
difference.

Case 4b) But if we try this with post decrement:

    for ( i = COUNT; i-- > 0; )

       cc -O		gcc -O & gcc -O -fstrength-reduce

    moveq  #COUNT,d6	    moveq  #COUNT,d0
    jra    tag2		    jra    tag2
tag1:			tag1:
    <loop body>		    <loop body>
tag2:			tag2:
    movl   d6,d0	    subql  #1,d0
    subql  #1,d6	    moveq  #-1,d3
    tstl   d0		    cmpl   d0,d3
    jgt    tag1		    jlt    tag1
    
    0.97 usecs		    0.97 usecs

Cc not only adds in the movl to save the unincremented value of i, it
also somehow fails to realize that subql sets the condition codes, so
the tstl is back.  And gcc gets totally confused and brings in a
literal -1.


CONCLUSION
----------

For both compilers and all levels of optimization, this loop:

    for ( i = COUNT; --i >= 0; )

gives the lowest overhead.  Using gcc -fstrength-reduce, this low
overhead means the optimal single-instruction loop; but even for cc and
for gcc without -fstrength-reduce, this loop is faster than the
others.  I don't show the results here, but this is also true for large
constant loops and for variable-length loops.

It is unfortunate that we have to use such a non-intuitive loop to get
the best results -- especially unfortunate since it is probably *not*
the best loop on some other architectures or compilers -- but that's
the facts.

I would be interested to see similar checks done on different
architectures; in particular RISC machines, which supposedly are
designed in cooperation with the compiler writers.
---
Jef

   Jef Poskanzer   {ucbvax, lll-crg, sun!pacbell, apple, hplabs}!well!pokey
 "Man invented language to satisfy his deep need to complain." -- Lily Tomlin

jima@hplsla.HP.COM (Jim Adcock) (06/06/89)

Unfortunately, the "favorite loop construct" is going to be very dependent on
the particular compiler used, and exactly *what* you do inside the loop.
Below are your examples, redone with a null loop body, plus *my* favorite
loop construct: for(i=COUNT; i--;).  I used gcc and hp's cc 6.5 compiler
with -O -S and in gnu -O -S -fstrength-reduce -fcombine-regs -fforce-mem
-fforce-addr.  The point is that these things are very unpredictable,
and optimizing compilers work well on linear code segments, not over
branch points in programs [including loops].  The "optimizing" compilers
I've seen do about 20% better than non-optimizing compilers, but are still
far from "optimal" by our human standards.


#define COUNT 100
main()
{
    int i;
    for ( i = 0; i < COUNT; i++ ); 
xxxxxxxxxxxxxxxxxxx();
    for ( i = 0; i < COUNT; ++i ); 
xxxxxxxxxxxxxxxxxxx();
    for ( i = 0; ++i <= COUNT; ); 
xxxxxxxxxxxxxxxxxxx();
    for ( i = 0; i++ < COUNT; ); 
xxxxxxxxxxxxxxxxxxx();
    for ( i = COUNT; i > 0; i-- ); 
xxxxxxxxxxxxxxxxxxx();
    for ( i = COUNT; i > 0; --i ); 
xxxxxxxxxxxxxxxxxxx();
    for ( i = COUNT; --i >= 0; ); 
xxxxxxxxxxxxxxxxxxx();
    for ( i = COUNT; i-- > 0; ); 
xxxxxxxxxxxxxxxxxxx();
    for ( i = COUNT; i--; ); 
xxxxxxxxxxxxxxxxxxx();
}


#NO_APP
gcc_compiled.:
.text
	.even
.globl _main
_main:
	link a6,#0


	moveq #99,d0
L4:
	dbra d0,L4
	clrw d0
	subql #1,d0
	jcc L4

	jbsr _xxxxxxxxxxxxxxxxxxx

	moveq #99,d0
L8:
	dbra d0,L8
	clrw d0
	subql #1,d0
	jcc L8

	jbsr _xxxxxxxxxxxxxxxxxxx

	moveq #100,d0
L10:
	dbra d0,L10
	clrw d0
	subql #1,d0
	jcc L10

	jbsr _xxxxxxxxxxxxxxxxxxx

	moveq #100,d0
L14:
	dbra d0,L14
	clrw d0
	subql #1,d0
	jcc L14

	jbsr _xxxxxxxxxxxxxxxxxxx

	moveq #100,d0
L20:
	subql #1,d0
	tstl d0
	jgt L20

	jbsr _xxxxxxxxxxxxxxxxxxx

	moveq #100,d0
L24:
	subql #1,d0
	tstl d0
	jgt L24

	jbsr _xxxxxxxxxxxxxxxxxxx

	moveq #100,d0
L26:
	dbra d0,L26
	clrw d0
	subql #1,d0
	jcc L26

	jbsr _xxxxxxxxxxxxxxxxxxx

	moveq #100,d0
L30:
	subql #1,d0
	moveq #-1,d1
	cmpl d0,d1
	jlt L30

	jbsr _xxxxxxxxxxxxxxxxxxx

	moveq #100,d0
L34:
	dbra d0,L34
	clrw d0
	subql #1,d0
	jcc L34

	jbsr _xxxxxxxxxxxxxxxxxxx

	unlk a6
	rts

[hp cc 6.5:] ==============================================

	global	_main
_main:
	link.w	%a6,&-8
	mov.l	%d7,-(%sp)

	movq	&99,%d7
L12:
	dbf	%d7,L12

	jsr	_xxxxxxxxxxxxxxxxxxx

	movq	&99,%d7
L16:
	dbf	%d7,L16

	jsr	_xxxxxxxxxxxxxxxxxxx

	movq	&0,%d7
L21:
	addq.l	&1,%d7
	movq	&100,%d0
	cmp.l	%d7,%d0
	ble.b	L21

	jsr	_xxxxxxxxxxxxxxxxxxx

	movq	&0,%d7
L24:
	mov.l	%d7,%d0
	addq.l	&1,%d7
	movq	&100,%d1
	cmp.l	%d0,%d1
	blt.b	L24

	jsr	_xxxxxxxxxxxxxxxxxxx

	movq	&100,%d7
L20001:
	subq.l	&1,%d7
	tst.l	%d7
	bgt.b	L20001

	jsr	_xxxxxxxxxxxxxxxxxxx

	movq	&100,%d7
L20003:
	subq.l	&1,%d7
	tst.l	%d7
	bgt.b	L20003

	jsr	_xxxxxxxxxxxxxxxxxxx

	movq	&100,%d7
L33:
	subq.l	&1,%d7
	bge.b	L33

	jsr	_xxxxxxxxxxxxxxxxxxx

	movq	&100,%d7
L36:
	mov.l	%d7,%d0
	subq.l	&1,%d7
	tst.l	%d0
	bgt.b	L36

	jsr	_xxxxxxxxxxxxxxxxxxx

	movq	&100,%d7
L39:
	mov.l	%d7,%d0
	subq.l	&1,%d7
	tst.l	%d0
	bne.b	L39

	jsr	_xxxxxxxxxxxxxxxxxxx

	mov.l	(%sp)+,%d7
	unlk	%a6
	rts

chris@mimsy.UUCP (Chris Torek) (06/06/89)

In article <11993@well.UUCP> pokey@well.UUCP (Jef Poskanzer) writes:
>... COUNT was a small (< 127) compile-time constant.
>    for ( i = COUNT; --i >= 0; )

[all but gcc -O -fstrength-reduce deleted]

>	moveq  #COUNT,d0
>	jra    tag2
>tag1:
>	<loop body>
>tag2:
>	dbra   d0,tag1
>	clrw   d0
>	subql  #1,d0
>	jcc    tag1

>... But wait!  What's that chud after the loop?  Let's see, clear d1
>to zero, subtract one from it giving -1 and setting carry, and jump
>if carry is clear.  Hmm, looks like a three-instruction no-op to me!

No---the problem is that `dbra' decrements a *word*, compares the
result against -1, and (if not -1) branches.  The semantics of the
loop demand a 32 bit comparison.  The only reason it is not necessary
in this particular case is the first quoted line above.

Still, it would be nice if gcc always used the dbra/clrw/subql/jcc
sequence for `--x >= 0' loops, since it does always work.  The `clrw'
fixes up the case where the 16-bit result has gone to -1:

	before decrement:	wxyz 0000
	after decrement:	wxyz FFFF
	after clrw:		wxyz 0000
	after subql:	      wxyz-1 FFFF

The dbra loop is so much faster that, despite the extra time and space
for one `unnecessary' dbra+clrw (when the loop really does go from 0 to
-1, and at every 65536 trips when the loop counter is large and
positive), I would make this optimisation unconditional.
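To convince myself that the fixup really does extend dbra to a full 32-bit
count, here is a rough C emulation of the dbra/clrw/subql/jcc sequence (my
sketch, not from any compiler), treating d0 as a 32-bit register whose low
word is what dbra decrements:

```c
#include <assert.h>
#include <stdint.h>

/* Emulate gcc's sequence for `for (i = COUNT; --i >= 0;)`:
   a 16-bit dbra plus the clrw/subql/jcc fixup that extends it to a
   full 32-bit count.  Returns the number of body executions. */
static long dbra_loop_bodies(uint32_t count)
{
    uint32_t d0 = count;
    long body = 0;

    goto tag2;                          /* jra  tag2 */
tag1:
    body++;                             /* <loop body> */
tag2:
    /* dbra d0,tag1: decrement the low word only;
       branch back unless it became 0xFFFF (-1 as a word) */
    d0 = (d0 & 0xFFFF0000u) | ((d0 - 1) & 0xFFFFu);
    if ((d0 & 0xFFFFu) != 0xFFFFu)
        goto tag1;
    d0 &= 0xFFFF0000u;                  /* clrw  d0 */
    {
        int carry = (d0 == 0);          /* subql #1,d0 borrows iff d0 < 1 */
        d0 -= 1;
        if (!carry)                     /* jcc   tag1 */
            goto tag1;
    }
    return body;
}
```

The counts below 65536 never reach the fixup's jump back, while 65536 and
70000 exercise it; all come out exact.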
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

suitti@haddock.ima.isc.com (Stephen Uitti) (06/06/89)

In article <11993@well.UUCP> Jef Poskanzer <pokey@well.com> writes:
>I compiled the following different kinds of for loops:
>    for ( i = 0; i < COUNT; i++ )
>    for ( i = 0; i < COUNT; ++i )
>    for ( i = 0; ++i <= COUNT; )
>    for ( i = 0; i++ < COUNT; )
>    for ( i = COUNT; i > 0; i-- )
>    for ( i = COUNT; i > 0; --i )
>    for ( i = COUNT; --i >= 0; )
>    for ( i = COUNT; i-- > 0; )
>on a Sun 3/50 with both SunOS 3.5 cc and gcc 1.35, and looked at the
>generated code and the timings.  COUNT was a small (< 127) compile-time
>constant.

A huge amount of time ago, i did this for the PDP-11 using the
Ritchie compiler.  This compiler is not awesomely bright, but i
had the impression that it wasn't quite as bad as PCC (though it is
less portable).  The optimal loop would use the "sob" (subtract
one and branch if not zero) instruction.  The code that caused it
to be emitted was
	register i;
	i = COUNT;
	do {
		loop body;
	} while (--i);

This is much clearer than anything with for(), since it tends to get
(even stupidly) compiled to:

	register i;
	i = COUNT;			mov $COUNT,r2
	do {			L1:
		loop body;		loop body...
	} while (--i);			sob r2,L1

The compiler seemed to be smart enough to know not to use it if
the loop was too long. Thus, at worst, the sob would be replaced by:
					sub $1,r2
					jne L1

The trouble with "for" loops is that they are defined as "test at
the top", when most single-instruction loops are "test at the bottom".
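A sketch of the transformation in C (the count > 0 caveat is the whole
point: a do-while tests at the bottom, so it always runs the body at least
once):

```c
#include <assert.h>

/* Stephen's count-down do-while form.  It runs the body `count' times,
   but is equivalent to `for (i = 0; i < count; i++)' only when
   count > 0 -- for count == 0 the do-while still runs once (and then
   wraps through the whole range). */
static long body_runs(long count)
{
    long body = 0;
    long i = count;

    do {
        body++;             /* loop body */
    } while (--i);
    return body;
}
```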

The VAX (PCC based) compiler's optimizer would convert many loops
that used multiple instructions to use a single complicated
instruction.  Unfortunately, the VAX 780 (ubiquitous at the time)
generally ran these slower.  One case was the acbl (add, compare
and branch longword).  The optimizer would replace the three
instructions (something like:)
	add $1,r2
	cmp $END,r2
	bne L1

with the "acbl" and all its arguments.  Both codings take the
same number of bytes (25!).  The multiple instructions are faster
(on the 780).  Newer VAXen have different relative timing.  I
recall this case coming from the prime number sieve benchmark.
The loop could not be easily converted to the do-while.

I haven't gotten a chance to play with gcc (real work to do).
I've heard it can do wonders, and thought it did better
multi-statement optimization.  Still, all the silly side-effects
semantics of C make it hard to tell a compiler how to take
mediocre source and produce good code from it.  It is best to
start with good source.  Though i hardly ever write assembler
anymore (and it seems never for speed or size), i've yet to
bump into a C compiler that produces code that i can't improve on
at a glance.  This may change if/when i start working with
scoreboarded RISC processors (88K), in that instruction cycle
counting looks hard (requiring more than a glance).

Stephen.

david@sun.com (there IS no ``average'' SubGenius) (06/06/89)

The Sun 680x0 compiler has been able to generate dbcc loops for a long
time.  However, the loop has to be written in a particular form.
Here are some examples compiled with the SunOS 3.5 cc:

1.
	register short s;
	for (s = 127; s != -1; s--) ;

	moveq	#127,d7
LY00001:
	dbra	d7,LY00001

2a.
	register int i;
	for (i = 127; i != -1; i--) ;

	moveq	#127,d6
LY00003:
	subql	#1,d6
	moveq	#-1,d7
	cmpl	d7,d6
	jne	LY00003

(Oops!)

2b.
	register int i;
	i = 127;
	while (--i != -1) ;

	moveq	#127,d6
L20:
	dbra	d6,L20
	clrw	d6
	subql	#1,d6
	jcc	L20

3.
	extern int x;
	register short s;
	s = 127;
	while (x >= 0 && --s != -1) ;

	moveq	#127,d7
L15:
	tstl	_x
	dblt	d7,L15


P.S. Sun users might want to look at the PR_LOOP macros in
<pixrect/pr_impl_util.h>.

-- 
David DiGiacomo, Sun Microsystems, Mt. View, CA  sun!david david@sun.com

edler@cmcl2.NYU.EDU (Jan Edler) (06/07/89)

In article <17891@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <11993@well.UUCP> pokey@well.UUCP (Jef Poskanzer) writes:
>>... COUNT was a small (< 127) compile-time constant.
>>    for ( i = COUNT; --i >= 0; )
>
>[all but gcc -O -fstrength-reduce deleted]
>
>>	moveq  #COUNT,d0
>>	jra    tag2
>>tag1:
>>	<loop body>
>>tag2:
>>	dbra   d0,tag1
>>	clrw   d0
>>	subql  #1,d0
>>	jcc    tag1

I prefer
	moveq	#COUNT,d0
	jra	tag3
tag1:	swap	d0
tag2:	<loop body>
tag3:	dbra	d0,tag2
	swap	d0
	dbra	d0,tag1

But it doesn't really make that much difference, I guess.
However, the compiler shouldn't do either of these when d0 is initialized
to a value < 65536.

Jan Edler
NYU Ultracomputer Research Laboratory
edler@nyu.edu

pokey@well.UUCP (Jef Poskanzer) (06/07/89)

In the referenced message, edler@cmcl2.UUCP (Jan Edler) wrote:
}However, the compiler shouldn't do either of these when d0 is initialized
}to a value < 65536.

Not good enough, since the loop body might modify the index variable.
If the compiler is smart enough to do a global semantic check for this
sort of thing, then you can do this optimization; but we've already
established that cc and gcc aren't that smart.
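A contrived example of what Jef means (the numbers are mine): the body can
smash the index past 65535 even though its initial value is tiny, so a
compile-time "initial value < 65536" test alone cannot justify dropping the
fixup:

```c
#include <assert.h>

/* The loop starts at 3, but the body pushes the index to 70000 on the
   first trip, so a word-sized dbra without the 32-bit fixup would exit
   roughly 65536 iterations early. */
static long runs_with_index_smashing(void)
{
    long bodies = 0;
    int i;

    for ( i = 3; --i >= 0; ) {
        if (bodies == 0)
            i = 70000;      /* loop body modifies the index */
        bodies++;
    }
    return bodies;         /* 1 trip at i=3, then 70000 more from i=70000 */
}
```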
---
Jef

   Jef Poskanzer   {ucbvax, lll-crg, sun!pacbell, apple, hplabs}!well!pokey
 "A banker is a fellow who lends you his umbrella when the sun is shining and
          wants it back the minute it begins to rain." -- Mark Twain

mikeb@coho.ee.ubc.ca (Mike Bolotski) (06/07/89)

In article <5260014@hplsla.HP.COM> jima@hplsla.HP.COM (Jim Adcock) writes:
>Below is your examples, redone with a null loop body, plus *my* favorite
>loop construct: for(i=COUNT; i--;).  

This is my favorite loop, too.  But if i is defined to be short instead
of int, the following code results (Sun 3/260, gcc 1.35 -O -S).

Note: no -fstrength-reduce required.

main()
{
  volatile int x;
  short j;

  for (j=5; j--; )
    x = 0;
}

#NO_APP
gcc_compiled.:
.text
        .even
.globl _main
_main:
        link a6,#-4
        moveq #5,d0
        jra L2
L5:
        clrl a6@(-4)
L2:
        dbra d0,L5
        unlk a6
        rts


Mike Bolotski, Department of Electrical Engineering,
               University of British Columbia, Vancouver, Canada 
mikeb@ee.ubc.ca                    | mikeb%ee.ubc.ca@relay.ubc.ca
ee.ubc.ca!mikeb@uunet.uu.net       | ...!ubc-vision!ee.ubc.ca!mikeb

inst182@tuvie (Inst.f.Techn.Informatik) (06/09/89)

I tried the same kinds of for loops on a XT-compatible (8086, 8MHz)
using Turbo-C 2.0. (tcc -G -O -Z -S)

I also tested

for (i = COUNT; i --;);

because I thought it would be the fastest.

For integers, the fastest loop was

for (i = COUNT; -- i >= 0;);

which turned out to be 1.7 times as fast as the slowest.  It is
remarkable that this is the same factor as Jef Poskanzer timed on
his Sun.

#define COUNT 30000

----------------------------------------------------------------
for (i = 0; i < COUNT; i ++);

127 ms

	xor	si,si
	jmp	short @5
@4:
	inc	si
@5:
	cmp	si,30000
	jl	@4

Comparison against a non-zero value
----------------------------------------------------------------
for (i = 0; i < COUNT; ++ i);

115 ms

	xor	si,si
	jmp	short @9
@8:
	inc	si
@9:
	cmp	si,30000
	jl	@8

Can anybody explain why this is faster than the above?
----------------------------------------------------------------
for (i = 0; ++ i <= COUNT;);

132 ms

	xor	si,si
@13:
	inc	si
	mov	ax,si
	cmp	ax,30000
	jle	@13

Moving si to ax seems rather useless to me.
----------------------------------------------------------------
for (i = 0; i++ < COUNT;);

132 ms

	xor	si,si
@17:
	mov	ax,si
	inc	si
	cmp	ax,30000
	jl	@17

OK, here it is necessary
----------------------------------------------------------------
for (i = COUNT; i > 0; i --);

94 ms

	mov	si,30000
	jmp	short @21
@20:
	dec	si
@21:
	or	si,si
	jg	@20

The separation of decrement and comparison causes an additional or
----------------------------------------------------------------
for (i = COUNT; i > 0; --i);

93 ms

	mov	si,30000
	jmp	short @25
@24:
	dec	si
@25:
	or	si,si
	jg	@24

Looks exactly like the version above. The 1 ms difference could be a
timing error.
----------------------------------------------------------------
for (i = COUNT; --i >= 0;);

77 ms

	mov	si,30000
@29:
	dec	si
	jge	@29

This surely is the optimal loop
----------------------------------------------------------------
for (i = COUNT; i-- > 0;);

115 ms

	mov	si,30000
@33:
	mov	ax,si
	dec	si
	or	ax,ax
	jg	@33

The former value of the counter has to be saved for the comparison.
----------------------------------------------------------------
for (i = COUNT; i--;);

115 ms

	mov	si,30000
@37:
	mov	ax,si
	dec	si
	or	ax,ax
	jne	@37
================================================================

Then I tried the same with a long counter (instead of int), which
changed the results radically.  Not only is the difference between the
fastest and the slowest loop much smaller (the factor is now only
1.13), but version 7, which was the fastest for integers, is now
the second slowest, and version 9 is now the fastest.

Versions 3 and 4 have remained at the slow end of the charts.

#define COUNT	60000
----------------------------------------------------------------
for (i = 0; i < COUNT; i ++);

938 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],0
	jmp	short @5
@4:
	add	word ptr [bp-4],1
	adc	word ptr [bp-2],0
@5:
	cmp	word ptr [bp-2],0
	jl	@4
	jne	@38
	cmp	word ptr [bp-4],-5536
	jb	@4
@38:
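(A side note from me, since the listing never explains it: the odd-looking
constant -5536 is just 60000 viewed as a signed 16-bit word.)

```c
#include <assert.h>

/* 60000 does not fit in a signed 16-bit word; the same bit pattern,
   read as two's complement, is 60000 - 65536 = -5536, which is why
   Turbo-C's listing compares the low word of the long against -5536. */
static int low_word_as_signed(long value)
{
    long w = value % 65536;
    return (int)(w >= 32768 ? w - 65536 : w);
}
```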

----------------------------------------------------------------
for (i = 0; i < COUNT; ++ i);

938 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],0
	jmp	short @9
@8:
	add	word ptr [bp-4],1
	adc	word ptr [bp-2],0
@9:
	cmp	word ptr [bp-2],0
	jl	@8
	jne	@39
	cmp	word ptr [bp-4],-5536
	jb	@8
@39:
----------------------------------------------------------------
	for (i = 0; ++ i <= COUNT;);

1056 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],0
@13:
	add	word ptr [bp-4],1
	adc	word ptr [bp-2],0
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	or	dx,dx
	jl	@13
	jg	@40
	cmp	ax,-5536
	jbe	@13
@40:

Now the compiler seems to have forgotten that a memory location can be
compared against a constant.
----------------------------------------------------------------
	for (i = 0; i++ < COUNT;);

1009 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],0
@17:
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	add	word ptr [bp-4],1
	adc	word ptr [bp-2],0
	or	dx,dx
	jl	@17
	jne	@41
	cmp	ax,-5536
	jb	@17
@41:
----------------------------------------------------------------
	for (i = COUNT; i > 0; i --);

938 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
	jmp	short @21
@20:
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
@21:
	cmp	word ptr [bp-2],0
	jg	@20
	jne	@42
	cmp	word ptr [bp-4],0
	ja	@20
@42:

Now it seems to have regained this knowledge and is as fast as before.
Comparing against zero is not faster here than comparing against
another constant.
----------------------------------------------------------------
	for (i = COUNT; i > 0; -- i);

938 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
	jmp	short @25
@24:
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
@25:
	cmp	word ptr [bp-2],0
	jg	@24
	jne	@43
	cmp	word ptr [bp-4],0
	ja	@24
@43:
----------------------------------------------------------------
	for (i = COUNT; --i >= 0;);

1051 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
@29:
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	or	dx,dx
	jg	@29
	jl	@44
	or	ax,ax
	jae	@29
@44:

Again, the compiler fetches the counter into registers before
comparing.  Here comparing against zero saves a little time.
But leaving the counter in memory would have saved more.
----------------------------------------------------------------
	for (i = COUNT; i-- > 0;);

1004 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
@33:
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
	or	dx,dx
	jg	@33
	jne	@45
	or	ax,ax
	ja	@33
@45:
----------------------------------------------------------------
	for (i = COUNT; i--;);

934 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
@37:
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
	or	dx,ax
	jne	@37

Checking a long for *equality* to zero saves lots of time, so the
overhead for loading the registers is more than compensated for.
----------------------------------------------------------------
	???

374 ms

	mov	si,0
	mov	di,30000
@_l:
	mov	dx,si
	mov	ax,di
	sub	di,1
	sbb	si,0
	or	dx,ax
	jne	@_l

This would have been the optimal loop, but Turbo-C does not put long
variables into registers.
================================================================

Please email to the address below, because the account I posted this
from is not mine.
 _______________________________________________________________
|	__   |							|
|  |   |  \  | Peter J. Holzer					|
|  |___|__/  | Technische Universitaet Wien			|
|  |   |     |							|
|  |   |     | ...!uunet!mcvax!tuvie!asupa!honey!hjphr		|
|  ____/     |--------------------------------------------------|
|	     | Think of it as evolution in action -- Tony Rand	|
|____________|__________________________________________________|

awd@dbase.UUCP (Alastair Dallas) (06/11/89)

In article <11993@well.UUCP>, pokey@well.UUCP (Jef Poskanzer) writes:
> SUMMARY
> -------
> 
> I compiled the following different kinds of for loops:
> 
>     for ( i = 0; i < COUNT; i++ )
>     for ( i = 0; i < COUNT; ++i )
>     for ( i = 0; ++i <= COUNT; )
>     for ( i = 0; i++ < COUNT; )
>     for ( i = COUNT; i > 0; i-- )
>     for ( i = COUNT; i > 0; --i )
>     for ( i = COUNT; --i >= 0; )
>     for ( i = COUNT; i-- > 0; )
> 
> on a Sun 3/50 with both SunOS 3.5 cc and gcc 1.35, and looked at the
> generated code and the timings.  COUNT was a small (< 127) compile-time
> constant.  The loop body did not reference i and had no side-effects.
> In theory, all eight of these loops should have optimized to the most
> efficient loop possible, ignoring the otherwise unreferenced variable i
> and simply traversing the loop body the proper number of times.  On the
> 68020, this most efficient loop is a dbra instruction.  But in fact, cc
> never generates dbra
> 
	<details omitted>
> 
> CONCLUSION
> ----------
> 
> For both compilers and all levels of optimization, this loop:
> 
>     for ( i = COUNT; --i >= 0; )
> 
> gives the lowest overhead.

I tried these eight loops on Microsoft C v5.1 and was very surprised to
get the same results, more or less.  Jef's fastest loop (--i >= 0) on his
Sun was also MSC's fastest loop on an 80386.  And for the same reason: the
compiler wakes up and realizes that the condition flags are already set.
On 80x86, JCXZ is the fast loop instruction, but it is ignored by
the MSC optimizer as well.  (To be fair, I didn't turn on any
special optimization, so I can't say MSC won't
do it under some conditions.)  Making i a register variable would obviously
improve the timings, but MSC uses the SI register, not CX, and therefore
can't take advantage of JCXZ.  The code MSC generates for (--i >= 0) is
approx. 47% of the t-states used by the code it generates for the typical
(i=0; i<COUNT; i++) loop, but the typical case was not the worst case.

What an eye-opener!  I always assumed that it wasn't worth being fussy
about such things in C, that the compiler would optimize it perfectly.
Like Jef says, if you want a particular part of your code to run fast,
don't assume it will be optimized; study the compiler's output.

Random thought: What are our "optimizing compilers" optimizing if they 
leave loops alone?

/alastair/

trebor@biar.UUCP (Robert J Woodhead) (06/11/89)

In article <96@dbase.UUCP> awd@dbase.UUCP (Alastair Dallas) writes:
>Random thought: What are our "optimizing compilers" optimizing if they 
>leave loops alone?

	Why, Corporate Quarterly Earnings Reports, of course!


-- 
Robert J Woodhead, Biar Games, Inc.  !uunet!biar!trebor | trebor@biar.UUCP
``The worst thing about being a vampire is that you can't go to matinees
  and save money anymore.''

jima@hplsla.HP.COM (Jim Adcock) (06/13/89)

>Random thought: What are our "optimizing compilers" optimizing if they 
>leave loops alone?

It's not that they leave loops alone; they will munge either side of a
label as well as they can.  Optimizing across a label is hard, since such
optimizations might affect [bust] the code branching to that label.
See example to follow.

Again, one cannot just try some trivial loop cases, say "form X is fastest",
and use it blindly from there on.  Life is not that simple.  Compilers do not
say: "aha, he's using loop form number 5, that calls for a dbra."  Typically,
a good optimizing compiler is going to start by naively generating a "tree"
representing one's program segment, said "tree" unfortunately containing
loops and branch statements which complicate a compiler's life tremendously.
Then the tree is simplified using simple rules of arithmetic:

"[add [const 4] [const 1]] --> [const 5]"

for example.  Consider what loops and branches can easily do to complicate
a compiler's life:

	[reg0 = [const 4]]
label401:
	[reg1 = [const 1]]
	[reg2 = [add reg0 reg1]]

Can the final statement be simplified to: "[reg2 = [const 5]]" ???

Maybe, but you'd have to have knowledge of all the areas that branch to 
label401.  Clearly, compilers don't like to do this!  And situations like
this arise all the time when your for loops get converted to a branch plus
a label.
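For what Jim's "[add [const 4] [const 1]] --> [const 5]" rule looks like in
code, here is a minimal constant folder over such a tree (my sketch; the
names are invented):

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node {
    enum { CONST, ADD } kind;
    int value;              /* valid when kind == CONST */
    struct Node *l, *r;     /* valid when kind == ADD */
} Node;

/* Fold constant subtrees bottom-up:
   [add [const 4] [const 1]] --> [const 5].
   An unknown branch into the middle of this region would make the
   rewrite unsafe, which is exactly why compilers stop at labels. */
static Node *fold(Node *n)
{
    if (n->kind == ADD) {
        n->l = fold(n->l);
        n->r = fold(n->r);
        if (n->l->kind == CONST && n->r->kind == CONST) {
            n->kind  = CONST;
            n->value = n->l->value + n->r->value;
            n->l = n->r = NULL;
        }
    }
    return n;
}
```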

Finally, after these arithmetic simplifications, attempts are made to map
the tree onto machine instructions, or combinations of machine instructions.
Weird but useful instructions like "dbra" are likely to be handled as
a peep-hole optimization -- where a final stage of the compiler tries
to substitute simpler instructions for specific combinations of one or
more expensive instructions.  So in real life programming, whether
your favorite "dbra" instruction gets used or not is close to blind luck!
It's just going to depend on exactly what you do in that loop, what registers
are needed at other points in the routine, etc, etc.  In real life programming
these things are just not very repeatable.

Consider, using gcc -O -S on:

#define OUTERLOOP 10000
#define COUNT 1000
main()
{
    int i;
    int j;
    int ary[COUNT];
    for (j=OUTERLOOP; j--;) 
    {
        for ( i = COUNT; i-- > 0; ) ary[i] = i; 
    } 
}

gives:

_main:
	link a6,#-4000
	movel d2,sp@-
	movel #10000,d1
	jra L2
L9:
	movel #1000,d0
	jra L5
L8:
	lea a6@(d0:l:4),a0
	movel d0,a0@(-4000)
L5:
	subql #1,d0
	moveq #-1,d2
	cmpl d0,d2
	jlt L8
L2:
	dbra d1,L9
	clrw d1
	subql #1,d1
	jcc L9
	movel a6@(-4004),d2
	unlk a6
	rts

which executes in 15 seconds on my mid-range 68020 system.  Using the
"magic trick" of: "short s; for (s = COUNT; s--;) ..." to try to match
a dbra instruction:

#define OUTERLOOP 10000
#define COUNT 1000
main()
{
    short s;
    int j;
    int ary[COUNT];
    for (j=OUTERLOOP; j--;) 
    {
        for ( s = COUNT; s-- > 0; ) ary[s] = s; 
    }
}
generates:


_main:
	link a6,#-4000
	movel #10000,d1
	jra L2
L9:
	movew #1000,d0
	jra L5
L8:
	movew d0,a1
	lea a6@(a1:l:4),a0
	movel a1,a0@(-4000)
L5:
	subqw #1,d0
	cmpw #-1,d0
	jgt L8
L2:
	dbra d1,L9
	clrw d1
	subql #1,d1
	jcc L9
	unlk a6
	rts


which takes 16 seconds to execute!  So if one has an optimizing compiler I
think one is fooling oneself if one thinks one can memorize a favorite
loop structure and use it blindly.  Write code using common sense, that
is easy to read, and not too tricky.  If you need more speed, find a 
different algorithm, or profile it, and work on just those areas that are
too slow, re-writing in assembly if necessary.  More importantly, try
to get one's hands on the best optimizing compiler available for one's
machine.  Any damned compiler can be advertised as an "optimizing" compiler --
the word means nothing.  Any damned compiler can be tweaked to generate 
great looking code for a few do-nothing for loops -- it means nothing.
Try to get a couple of the best compilers on your machine and then try them on
a variety of code indicative of the work you do.  They may vary greatly in
areas far more important to you than loop construction.  For example, 
680x0 compilers I have seen vary *greatly* in the approaches they take to
using the 6888x coprocessor -- with many having a rudimentary or non-existent
concept of what it means to optimize floating pt instructions.   Many
only use registers 0 and 1 in the 6888x coprocessor!  Many flotch floating
pt numbers to and from the 6888x in order to convert from single to double
floating pt and back, etc, etc...... [I have yet to see a compiler which
is good at both 680x0 instruction generation and 6888x instruction generation..]