[comp.unix.wizards] Optimal for loop on the 68020.

pokey@well.UUCP (Jef Poskanzer) (06/05/89)

SUMMARY
-------

I compiled the following different kinds of for loops:

    for ( i = 0; i < COUNT; i++ )
    for ( i = 0; i < COUNT; ++i )
    for ( i = 0; ++i <= COUNT; )
    for ( i = 0; i++ < COUNT; )
    for ( i = COUNT; i > 0; i-- )
    for ( i = COUNT; i > 0; --i )
    for ( i = COUNT; --i >= 0; )
    for ( i = COUNT; i-- > 0; )

on a Sun 3/50 with both SunOS 3.5 cc and gcc 1.35, and looked at the
generated code and the timings.  COUNT was a small (< 127) compile-time
constant.  The loop body did not reference i and had no side-effects.
In theory, all eight of these loops should have optimized to the most
efficient loop possible, ignoring the otherwise unreferenced variable i
and simply traversing the loop body the proper number of times.  On the
68020, this most efficient loop is a dbra instruction.  But in fact, cc
never generates dbra, and gcc generates it for only one of the above
loop types, and only when the -fstrength-reduce flag is specified.
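
(The test program itself is not shown above; each case was roughly of
the form sketched below, where the file and macro names are mine and
not necessarily the originals, and the empty statement stands in for
the real do-nothing body:

    /* loop.c -- sketch of one test case: the body does not reference
       i and has no side-effects, per the conditions described above */
    #define COUNT 100

    int main()
    {
        register int i;

        for ( i = COUNT; --i >= 0; )
            ;                   /* stand-in for the loop body */
        return 0;
    }

Substituting any of the other seven loop headers gives the other cases.)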

In contrast, the BSD cc on a vax generates a single instruction loop in
four out of the eight cases, including the two most commonly-used
ones.  This is not because the vax compiler is any smarter -- the fact
that it wins only half the time, instead of all the time, shows that it
is also failing to recognize that it can freely optimize the loop.  The
only reason the vax's cc does better is that the vax has a richer
instruction set.

Since the loop overhead with dbra is lower by a factor of 1.7 than
with the code generated for the common for loops, it is useful to know
how to provoke your compiler into generating it.  But it would be more
useful for compilers to be smarter.

And more generally, those of you who blithely assume (as I used to)
that compilers are smart enough to turn clearly-expressed C into
optimal assembler should perhaps look at your compiler's assembler
output more often.
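
(For anyone who has not tried it: both compilers will hand you their
assembler output with the -S flag, for example

    cc -O -S loop.c
    gcc -O -S loop.c

either of which leaves the generated code in loop.s for inspection.)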


FULL DATA
---------

Case 1) First, let's look at the generic loops, counting up from zero
to the limit:

    for ( i = 0; i < COUNT; i++ )
    for ( i = 0; i < COUNT; ++i )

       cc -O		gcc -O & gcc -O -fstrength-reduce

    moveq  #0,d6	    clrl   d0
tag:			tag:
    <loop body>		    <loop body>
    addql  #1,d6	    addql  #1,d0
    moveq  #COUNT,d1	    moveq  #COUNT-1,d3
    cmpl   d1,d6	    cmpl   d0,d3
    jlt    tag		    jge    tag

    0.97 usecs		    0.97 usecs

The two compilers generate essentially the same instructions.  Pre or
post increment doesn't make any difference.  Note that the moveq of
COUNT could be moved above the loop -- the loop body doesn't
call any subroutines that might smash it.  This looks like a missed
optimization.


Case 2a) Second, let's try doing the increment and test in one C
statement.  Seems to me this shouldn't make any difference to a smart
optimizer, but...

    for ( i = 0; ++i <= COUNT; )

       cc -O		gcc -O & gcc -O -fstrength-reduce

    moveq  #0,d6	    clrl   d0
    jra    tag2		    jra    tag2
tag1:			tag1:
    <loop body>		    <loop body>
tag2:			tag2:
    addql  #1,d6	    addql  #1,d0
    moveq  #COUNT,d1	    moveq  #COUNT,d3
    cmpl   d1,d6	    cmpl   d0,d3
    jle    tag1		    jge    tag1

    0.97 usecs		    0.97 usecs

No important difference from case 1.

Case 2b) But if we try the post increment version:

    for ( i = 0; i++ < COUNT; )

       cc -O		gcc -O & gcc -O -fstrength-reduce

    moveq  #0,d6	    clrl   d0
    jra    tag2		    jra    tag2
tag1:			tag1:
    <loop body>		    <loop body>
tag2:			tag2:
    movl   d6,d0	    addql  #1,d0
    addql  #1,d6	    moveq  #COUNT+1,d3
    moveq  #COUNT,d1	    cmpl   d0,d3
    cmpl   d1,d0	    jgt    tag1
    jlt    tag1

    1.11 usecs		    0.97 usecs

Gcc is smart enough to bias the loop limit: since i++ tests the old
value, testing i++ < COUNT is the same as testing the incremented i
against COUNT+1, which is why the moveq loads COUNT+1.  Cc doesn't see
that optimization and so adds an extra movl instruction.


Case 3) Third, let's try a count-down loop:

    for ( i = COUNT; i > 0; i-- )
    for ( i = COUNT; i > 0; --i )

       cc -O		gcc -O & gcc -O -fstrength-reduce

    moveq  #COUNT,d6	    moveq  #COUNT,d0
tag:			tag:
    <loop body>		    <loop body>
    subql  #1,d6	    subql  #1,d0
    tstl   d6		    tstl   d0
    jgt    tag		    jgt    tag

    0.83 usecs		    0.83 usecs

Here the two compilers generate exactly the same instructions.  There
is one less instruction than in the count-up case, because the
compilers are smart enough to use the tstl instruction to compare
against zero, and so they don't need the moveq #COUNT instruction
(which shouldn't have been in the loop in the first place).


Case 4a) Fourth, let's try a count-down loop with the decrement
included in the test statement:

    for ( i = COUNT; --i >= 0; )

       cc -O		      gcc -O		gcc -O -fstrength-reduce
    
    moveq  #COUNT,d6	    moveq  #COUNT,d0	    moveq  #COUNT,d0
    jra    tag2		    jra    tag2		    jra    tag2
tag1:			tag1:			tag1:
    <loop body>		    <loop body>		    <loop body>
tag2:			tag2:			tag2:
    subql  #1,d6	    subql  #1,d0	    dbra   d0,tag1
    jge    tag1		    jpl    tag1		    clrw   d0
						    subql  #1,d0
						    jcc    tag1

    0.70 usecs		    0.70 usecs		    0.57 usecs

Cc and gcc generate code similar to case 3, except that they suddenly
decide that subql sets the condition codes just fine by itself, and a
separate tstl is not necessary.  So they shave off an instruction.
Seems like a peephole optimizer should have picked this up in the
previous cases too; it's another missed optimization.

But the big win here is with the -fstrength-reduce option in gcc.  It
actually generates the optimal one-instruction loop, for a factor of
1.7 reduction in loop overhead from the generic loop.  Not bad.  But
wait!  What's that chud after the loop?  Let's see, clear d0 to zero,
subtract one from it giving -1 and setting carry, and jump if carry is
clear.  Hmm, looks like a three-instruction no-op to me!  I guess gcc
wants to leave the loop index with the right value, and isn't smart
enough to notice it isn't used.  (But why not simply moveq the final
value?)  Oh well, since it's not inside the loop it doesn't make much
difference.

Case 4b) But if we try this with post decrement:

    for ( i = COUNT; i-- > 0; )

       cc -O		gcc -O & gcc -O -fstrength-reduce

    moveq  #COUNT,d6	    moveq  #COUNT,d0
    jra    tag2		    jra    tag2
tag1:			tag1:
    <loop body>		    <loop body>
tag2:			tag2:
    movl   d6,d0	    subql  #1,d0
    subql  #1,d6	    moveq  #-1,d3
    tstl   d0		    cmpl   d0,d3
    jgt    tag1		    jlt    tag1
    
    0.97 usecs		    0.97 usecs

Cc not only adds in the movl to save the undecremented value of i, it
also somehow fails to realize that subql sets the condition codes, so
the tstl is back.  And gcc gets totally confused and brings in a
literal -1.


CONCLUSION
----------

For both compilers and all levels of optimization, this loop:

    for ( i = COUNT; --i >= 0; )

gives the lowest overhead.  Using gcc -fstrength-reduce, this low
overhead means the optimal single-instruction loop; but even for cc and
for gcc without -fstrength-reduce, this loop is faster than the
others.  I don't show the results here, but this is also true for large
constant loops and for variable-length loops.

It is unfortunate that we have to use such a non-intuitive loop to get
the best results -- especially unfortunate since it is probably *not*
the best loop on some other architectures or compilers -- but that's
the facts.
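
If the non-intuitive form bothers you, one way to hide it -- just a
sketch, not anything taken from PBM or from the code above -- is a
macro, which could also be redefined per machine:

    /* hypothetical convenience macro, not from any code in this thread */
    #define COUNT_DOWN(i, n)    for ( (i) = (n); --(i) >= 0; )

    register int i;

    COUNT_DOWN(i, 1000)
        whatever();             /* stand-in loop body */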

I would be interested to see similar checks done on different
architectures; in particular RISC machines, which supposedly are
designed in cooperation with the compiler writers.
---
Jef

   Jef Poskanzer   {ucbvax, lll-crg, sun!pacbell, apple, hplabs}!well!pokey
 "Man invented language to satisfy his deep need to complain." -- Lily Tomlin

chris@mimsy.UUCP (Chris Torek) (06/06/89)

In article <11993@well.UUCP> pokey@well.UUCP (Jef Poskanzer) writes:
>... COUNT was a small (< 127) compile-time constant.
>    for ( i = COUNT; --i >= 0; )

[all but gcc -O -fstrength-reduce deleted]

>	moveq  #COUNT,d0
>	jra    tag2
>tag1:
>	<loop body>
>tag2:
>	dbra   d0,tag1
>	clrw   d0
>	subql  #1,d0
>	jcc    tag1

>... But wait!  What's that chud after the loop?  Let's see, clear d0
>to zero, subtract one from it giving -1 and setting carry, and jump
>if carry is clear.  Hmm, looks like a three-instruction no-op to me!

No---the problem is that `dbra' decrements a *word*, compares the
result against -1, and (if not -1) branches.  The semantics of the
loop demand a 32 bit comparison.  The only reason the fixup is not
necessary in this particular case is the first quoted line above
(COUNT < 127).

Still, it would be nice if gcc always used the dbra/clrw/subql/jcc
sequence for `--x >= 0' loops, since it does always work.  The `clrw'
fixes up the case where the 16-bit result has gone to -1:

	before decrement:	wxyz 0000
	after decrement:	wxyz FFFF
	after clrw:		wxyz 0000
	after subql:	      wxyz-1 FFFF
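
Put in C terms, one dbra step does roughly this (a sketch of the
semantics, not what the silicon literally executes; the function name
is mine):

	/* one step of "dbra dN,label": decrement only the low 16 bits
	   of the register and report whether the branch back is taken */
	unsigned long
	dbra_step(d, taken)
		unsigned long d;
		int *taken;
	{
		unsigned long low = (d - 1) & 0xFFFF; /* word-sized decrement */

		*taken = (low != 0xFFFF);  /* branch unless low word hit -1 */
		return (d & 0xFFFF0000) | low;       /* high word untouched */
	}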

The dbra loop is so much faster, and the extra time and space for one
`unnecessary' dbra+clrw (when the loop really does go from 0 to -1,
and at every 65536 trips when the loop counter is large and positive)
are so small, that I would make this optimisation unconditional.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

suitti@haddock.ima.isc.com (Stephen Uitti) (06/06/89)

In article <11993@well.UUCP> Jef Poskanzer <pokey@well.com> writes:
>I compiled the following different kinds of for loops:
>    for ( i = 0; i < COUNT; i++ )
>    for ( i = 0; i < COUNT; ++i )
>    for ( i = 0; ++i <= COUNT; )
>    for ( i = 0; i++ < COUNT; )
>    for ( i = COUNT; i > 0; i-- )
>    for ( i = COUNT; i > 0; --i )
>    for ( i = COUNT; --i >= 0; )
>    for ( i = COUNT; i-- > 0; )
>on a Sun 3/50 with both SunOS 3.5 cc and gcc 1.35, and looked at the
>generated code and the timings.  COUNT was a small (< 127) compile-time
>constant.

A huge amount of time ago, i did this for the PDP-11 using the
Ritchie compiler.  This compiler is not awesomely bright, but i
had the impression that it wasn't quite as bad as PCC (or even
less portable).  The optimal loop would use the "sob" (subtract
one and branch if not zero) instruction.  The code that caused it
to be emitted was
	register i;
	i = COUNT;
	do {
		loop body;
	} while (--i);

This is much clearer than anything with for(), since it tends to get
(even stupidly) compiled to:

	register i;
	i = COUNT;			mov $COUNT,r2
	do {			L1:
		loop body;		loop body...
	} while (--i);			sob r2,L1

The compiler seemed to be smart enough to know not to use it if the
loop was too long (sob can only branch backward a short, fixed maximum
distance).  Thus, at worst, the sob would be replaced by:
					sub $1,r2
					jne L1

The trouble with "for" loops is that it is defined as "test at
the top", when most single instructions loops are "test at bottom".

The VAX (PCC based) compiler's optimizer would convert many loops
that used multiple instructions to use a single complicated
instruction.  Unfortunately, the VAX 780 (ubiquitous at the time)
generally ran these slower.  One case was the acbl (add, compare
and branch longword).  The optimizer would replace the three
instructions (something like:)
	add $1,r2
	cmp $END,r2
	bne L1

with the "acbl" and all its arguments.  Both codings take the
same number of bytes (25!).  The multiple instructions are faster
(on the 780).  Newer VAXen have different relative timing.  I
recall this case coming from the prime number sieve benchmark.
The loop could not be easily converted to the do-while.

I haven't gotten a chance to play with gcc (real work to do).
I've heard it can do wonders, and thought it did better
multi-statement optimization.  Still, all the silly side-effects
semantics of C make it hard to tell a compiler how to take
mediocre source and produce good code from it.  It is best to
start with good source.  Though i hardly ever write assembler
anymore (and it seems never for speed or size), i've still yet to
bump into a C compiler that produces code that i can't improve on
at a glance.  This may change if/when i start working with
scoreboarded RISC processors (88K), in that instruction cycle
counting looks hard (requiring more than a glance).

Stephen.

david@sun.com (there IS no ``average'' SubGenius) (06/06/89)

The Sun 680x0 compiler has been able to generate dbcc loops for a long
time.  However, the loop has to be written in a particular form.
Here are some examples compiled with the SunOS 3.5 cc:

1.
	register short s;
	for (s = 127; s != -1; s--) ;

	moveq	#127,d7
LY00001:
	dbra	d7,LY00001

2a.
	register int i;
	for (i = 127; i != -1; i--) ;

	moveq	#127,d6
LY00003:
	subql	#1,d6
	moveq	#-1,d7
	cmpl	d7,d6
	jne	LY00003

(Oops!)

2b.
	register int i;
	i = 127;
	while (--i != -1) ;

	moveq	#127,d6
L20:
	dbra	d6,L20
	clrw	d6
	subql	#1,d6
	jcc	L20

3.
	extern int x;
	register short s;
	s = 127;
	while (x >= 0 && --s != -1) ;

	moveq	#127,d7
L15:
	tstl	_x
	dblt	d7,L15


P.S. Sun users might want to look at the PR_LOOP macros in
<pixrect/pr_impl_util.h>.

-- 
David DiGiacomo, Sun Microsystems, Mt. View, CA  sun!david david@sun.com

shankar@hpclscu.HP.COM (Shankar Unni) (06/07/89)

>     for ( i = 0; i < COUNT; i++ )
>     for ( i = 0; i < COUNT; ++i )
>     for ( i = 0; ++i <= COUNT; )
>     for ( i = 0; i++ < COUNT; )
>     for ( i = COUNT; i > 0; i-- )
>     for ( i = COUNT; i > 0; --i )
>     for ( i = COUNT; --i >= 0; )
>     for ( i = COUNT; i-- > 0; )
> 
> I would be interested to see similar checks done on different
> architectures; in particular RISC machines, which supposedly are
> designed in cooperation with the compiler writers.
 
On the HP Precision Architecture (HP9000 series 800's), all the above loops
except the last one generate a two-instruction loop + set-up:

        COPY    0,23            ; zero-out reg 23
        LDI     127,31          ; load-immediate into reg 31
        LDO     1(23),23        ; this is a copy of the instruction below
                                ; (should have been folded with the COPY).
        COMBF,<=,N   31,23,.    ; compare and branch if false       [LOOP]
        LDO     1(23),23        ; increment i  (LDO = LoaD Offset)  [LOOP]

The compare-and-branch to itself, with the increment in the delay slot
of the branch, is the tightest possible loop.

In the last for loop (only), a three instruction loop is generated, because
of the slightly more complicated code for "i-- > 0".
----
Shankar Unni.
HP California Language Lab.

edler@cmcl2.NYU.EDU (Jan Edler) (06/07/89)

In article <17891@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <11993@well.UUCP> pokey@well.UUCP (Jef Poskanzer) writes:
>>... COUNT was a small (< 127) compile-time constant.
>>    for ( i = COUNT; --i >= 0; )
>
>[all but gcc -O -fstrength-reduce deleted]
>
>>	moveq  #COUNT,d0
>>	jra    tag2
>>tag1:
>>	<loop body>
>>tag2:
>>	dbra   d0,tag1
>>	clrw   d0
>>	subql  #1,d0
>>	jcc    tag1

I prefer
	moveq	#COUNT,d0
	jra	tag3
tag1:	swap	d0
tag2:	<loop body>
tag3:	dbra	d0,tag2
	swap	d0
	dbra	d0,tag1

But it doesn't really make that much difference, I guess.
However, the compiler shouldn't do either of these when d0 is initialized
to a value < 65536.

Jan Edler
NYU Ultracomputer Research Laboratory
edler@nyu.edu

pokey@well.UUCP (Jef Poskanzer) (06/07/89)

In the referenced message, edler@cmcl2.UUCP (Jan Edler) wrote:
}However, the compiler shouldn't do either of these when d0 is initialized
}to a value < 65536.

Not good enough, since the loop body might modify the index variable.
If the compiler is smart enough to do a global semantic check for this
sort of thing, then you can do this optimization; but we've already
established that cc and gcc aren't that smart.
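
For instance (a contrived body, not one of the loops measured above,
and the names are made up):

    f( flag )
        int flag;
    {
        register int i;

        for ( i = 10; --i >= 0; )
            if ( flag )         /* the body is free to do this... */
                i += 100000;    /* ...and blow any 16-bit assumption */
    }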
---
Jef

   Jef Poskanzer   {ucbvax, lll-crg, sun!pacbell, apple, hplabs}!well!pokey
 "A banker is a fellow who lends you his umbrella when the sun is shining and
          wants it back the minute it begins to rain." -- Mark Twain

neideck@49.1009.enet (Burkhard Neidecker-Lutz) (06/08/89)

Freshly off a DecStation 3100 with a MIPS 1.31 compiler:
 
#define COUNT 127
int main()
{
  int i;
     
  for ( i = 0; i < COUNT; ++i );
  	move	v1,zero
  	addiu	v1,v1,1
L1:  	slti	at,v1,127		# 3 cycle loop
  	bne	at,zero,L1
  	addiu	v1,v1,1
  for ( i = 0; ++i <= COUNT; );
  	li	v1,1
  	addiu	v1,v1,1
L2:  	slti	at,v1,128		# 3 cycle loop
  	bne	at,zero,L2
  	addiu	v1,v1,1
  for ( i = 0; i++ < COUNT; );
  	li	v1,1
L3:	slti	v0,v1,127		# 3 cycle loop
  	bne	v0,zero,L3
  	addiu	v1,v1,1
  for ( i = COUNT; i > 0; i-- );
  	li	v1,127
  	addiu	v1,v1,-1
L4:  	bgtz	v1,L4			# 2 cycle loop
  	addiu	v1,v1,-1
  for ( i = COUNT; i > 0; --i );
  	li	v1,127
  	addiu	v1,v1,-1
L5:  	bgtz	v1,L5			# 2 cycle loop
  	addiu	v1,v1,-1
  for ( i = COUNT; --i >= 0; );
  	li	v1,126
  	addiu	v1,v1,-1
L6:  	bgez	v1,L6			# 2 cycle loop
  	addiu	v1,v1,-1
  for ( i = COUNT; i-- > 0; );
  	li	v1,126
  	move	v0,v1
L7:	bgtz	v0,L7			# 2 cycle loop
  	addiu	v1,v1,-1
  return(i);
  	jr	ra
  	move	v0,v1
}
 
As usual, counting down towards zero is more efficient than counting
up, but we knew that already, didn't we?
 
 
	Burkhard Neidecker-Lutz, Digital CEC Karlsruhe, Project NESTOR
 
Disclaimer: I don't speak for Digital, etc. ...

rbj@dsys.ncsl.nist.gov (Root Boy Jim) (06/09/89)

? From: Jef Poskanzer <pokey@well.uucp>

? In the referenced message, edler@cmcl2.UUCP (Jan Edler) wrote:
? }However, the compiler shouldn't do either of these when d0 is initialized
? }to a value < 65536.

? Not good enough, since the loop body might modify the index variable.
? If the compiler is smart enough to do a global semantic check for this
? sort of thing, then you can do this optimization; but we've already
? established that cc and gcc aren't that smart.

Hey, aren't most programs *I/O bound* anyway? And you can sometimes
unroll your loops.

I seem to remember Chris Torek stating something about using

main()
{
	register short n = 10;
	do {
		printf("n = %d\n");
	} while (--n != -1);
}

On a Sun 3/180 running SunOS 3.5, compiling with `cc -S -O' generates:

	.data
	.text
LL0:
|#PROC# 04
	.data1
L18:
	.ascii	"n = %d\12\0"
	LF12	=	4
	LS12	=	128
	LFF12	=	0
	LSS12	=	0
	LP12	=	16
	.data
	.text
	.globl	_main
_main:
|#PROLOGUE# 0
	link	a6,#-4
	movl	d7,sp@
|#PROLOGUE# 1
	moveq	#10,d7
L16:
	movw	d7,d0
	extl	d0
	movl	d0,sp@-
	pea	L18
	jbsr	_printf
	addqw	#8,sp
	dbra	d7,L16
	movl	a6@(-4),d7
	unlk	a6
	rts

and prints "n = 10" thru "n = 0". I even tried initializing n to -5
and adding `n += 2' before the printf. Prints -3 thru 0.

? Jef

?    Jef Poskanzer   {ucbvax, lll-crg, sun!pacbell, apple, hplabs}!well!pokey
?  "A banker is a fellow who lends you his umbrella when the sun is shining and
?           wants it back the minute it begins to rain." -- Mark Twain

	Root Boy Jim is what I am
	Are you what you are or what?

pokey@well.UUCP (Jef Poskanzer) (06/09/89)

In the referenced message, rbj@dsys.ncsl.nist.gov (Root Boy Jim) wrote:
}Hey, aren't most programs *I/O bound* anyway? And you can sometimes
}unroll your loops.

Any statement about "most programs" is guaranteed false.  In my case,
I have a large suite of programs called PBM.  Most of them (note
difference) are I/O bound (but less so now that I've improved the
external data format).  Some are CPU bound.  I estimate that these
would go about 1% faster if they used a dbra in their inner loop.

}I seem to remember Chris Torek stating something about using
}		printf("n = %d\n");

Somehow I doubt Chris would make such a stupid mistake.

What is your point, anyway?  I didn't see it in your article.
---
Jef

    Jef Poskanzer  pokey@well.sf.ca.us  {ucbvax, apple, hplabs}!well!pokey
                    "Sound impressive? It should. It is."

inst182@tuvie (Inst.f.Techn.Informatik) (06/09/89)

I tried the same kinds of for loops on an XT-compatible (8086, 8MHz)
using Turbo-C 2.0. (tcc -G -O -Z -S)

I also tested

for (i = COUNT; i --;);

because I thought it would be the fastest.

For integers, the fastest loop was

for (i = COUNT; -- i >= 0;);

which turned out to be 1.7 times as fast as the slowest.  It is
remarkable that this is the same factor Jef Poskanzer measured on
his Sun.

#define COUNT 30000

----------------------------------------------------------------
for (i = 0; i < COUNT; i ++);

127 ms

	xor	si,si
	jmp	short @5
@4:
	inc	si
@5:
	cmp	si,30000
	jl	@4

Comparison against a non-zero value
----------------------------------------------------------------
for (i = 0; i < COUNT; ++ i);

115 ms

	xor	si,si
	jmp	short @9
@8:
	inc	si
@9:
	cmp	si,30000
	jl	@8

Can anybody explain why this is faster than the above?
----------------------------------------------------------------
for (i = 0; ++ i <= COUNT;);

132 ms

	xor	si,si
@13:
	inc	si
	mov	ax,si
	cmp	ax,30000
	jle	@13

Moving si to ax seems rather useless to me.
----------------------------------------------------------------
for (i = 0; i++ < COUNT;);

132 ms

	xor	si,si
@17:
	mov	ax,si
	inc	si
	cmp	ax,30000
	jl	@17

OK, here it is necessary.
----------------------------------------------------------------
for (i = COUNT; i > 0; i --);

94 ms

	mov	si,30000
	jmp	short @21
@20:
	dec	si
@21:
	or	si,si
	jg	@20

The separation of decrement and comparison causes an additional or
instruction.
----------------------------------------------------------------
for (i = COUNT; i > 0; --i);

93 ms

	mov	si,30000
	jmp	short @25
@24:
	dec	si
@25:
	or	si,si
	jg	@24

Looks exactly like the version above. The 1 ms difference could be a
timing error.
----------------------------------------------------------------
for (i = COUNT; --i >= 0;);

77 ms

	mov	si,30000
@29:
	dec	si
	jge	@29

This surely is the optimal loop
----------------------------------------------------------------
for (i = COUNT; i-- > 0;);

115 ms

	mov	si,30000
@33:
	mov	ax,si
	dec	si
	or	ax,ax
	jg	@33

The former value of the counter has to be saved for the comparison.
----------------------------------------------------------------
for (i = COUNT; i--;);

115 ms

	mov	si,30000
@37:
	mov	ax,si
	dec	si
	or	ax,ax
	jne	@37
================================================================

Then I tried the same with a long counter (instead of int), which
changed the results radically.  Not only is the difference between the
fastest and the slowest loop much smaller (the factor is now only
1.13), but version 7, which had been the fastest for integers, is now
the second slowest, and version 9 is now the fastest.

Versions 3 and 4 have remained at the slow end of the charts.

#define COUNT	60000
----------------------------------------------------------------
for (i = 0; i < COUNT; i ++);

938 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],0
	jmp	short @5
@4:
	add	word ptr [bp-4],1
	adc	word ptr [bp-2],0
@5:
	cmp	word ptr [bp-2],0
	jl	@4
	jne	@38
	cmp	word ptr [bp-4],-5536
	jb	@4
@38:

----------------------------------------------------------------
for (i = 0; i < COUNT; ++ i);

938 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],0
	jmp	short @9
@8:
	add	word ptr [bp-4],1
	adc	word ptr [bp-2],0
@9:
	cmp	word ptr [bp-2],0
	jl	@8
	jne	@39
	cmp	word ptr [bp-4],-5536
	jb	@8
@39:
----------------------------------------------------------------
	for (i = 0; ++ i <= COUNT;);

1056 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],0
@13:
	add	word ptr [bp-4],1
	adc	word ptr [bp-2],0
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	or	dx,dx
	jl	@13
	jg	@40
	cmp	ax,-5536
	jbe	@13
@40:

Now the compiler seems to have forgotten that a memory location can be
compared against a constant.
----------------------------------------------------------------
	for (i = 0; i++ < COUNT;);

1009 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],0
@17:
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	add	word ptr [bp-4],1
	adc	word ptr [bp-2],0
	or	dx,dx
	jl	@17
	jne	@41
	cmp	ax,-5536
	jb	@17
@41:
----------------------------------------------------------------
	for (i = COUNT; i > 0; i --);

938 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
	jmp	short @21
@20:
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
@21:
	cmp	word ptr [bp-2],0
	jg	@20
	jne	@42
	cmp	word ptr [bp-4],0
	ja	@20
@42:

Now it seems to have regained this knowledge and is as fast as before.
Comparing against zero is not faster here than comparing against
another constant.
----------------------------------------------------------------
	for (i = COUNT; i > 0; -- i);

938 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
	jmp	short @25
@24:
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
@25:
	cmp	word ptr [bp-2],0
	jg	@24
	jne	@43
	cmp	word ptr [bp-4],0
	ja	@24
@43:
----------------------------------------------------------------
	for (i = COUNT; --i >= 0;);

1051 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
@29:
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	or	dx,dx
	jg	@29
	jl	@44
	or	ax,ax
	jae	@29
@44:

Again, the compiler fetches the counter into registers before
comparing. Here comparing against zero saves a little time.
But leaving the counter in memory would have saved more.
----------------------------------------------------------------
	for (i = COUNT; i-- > 0;);

1004 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
@33:
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
	or	dx,dx
	jg	@33
	jne	@45
	or	ax,ax
	ja	@33
@45:
----------------------------------------------------------------
	for (i = COUNT; i--;);

934 ms

	mov	word ptr [bp-2],0
	mov	word ptr [bp-4],-5536
@37:
	mov	dx,word ptr [bp-2]
	mov	ax,word ptr [bp-4]
	sub	word ptr [bp-4],1
	sbb	word ptr [bp-2],0
	or	dx,ax
	jne	@37

Checking a long for *equality* to zero saves lots of time, so the
overhead of loading the registers is more than made up for.
----------------------------------------------------------------
	???

374 ms

	mov	si,0
	mov	di,30000
@_l:
	mov	dx,si
	mov	ax,di
	sub	di,1
	sbb	si,0
	or	dx,ax
	jne	@_l

This would have been the optimal loop, but Turbo-C does not put long
variables into registers.
================================================================

Please email to the address below, because the account I posted this
from is not mine.
 _______________________________________________________________
|	__   |							|
|  |   |  \  | Peter J. Holzer					|
|  |___|__/  | Technische Universitaet Wien			|
|  |   |     |							|
|  |   |     | ...!uunet!mcvax!tuvie!asupa!honey!hjphr		|
|  ____/     |--------------------------------------------------|
|	     | Think of it as evolution in action -- Tony Rand	|
|____________|__________________________________________________|

awd@dbase.UUCP (Alastair Dallas) (06/11/89)

In article <11993@well.UUCP>, pokey@well.UUCP (Jef Poskanzer) writes:
> SUMMARY
> -------
> 
> I compiled the following different kinds of for loops:
> 
>     for ( i = 0; i < COUNT; i++ )
>     for ( i = 0; i < COUNT; ++i )
>     for ( i = 0; ++i <= COUNT; )
>     for ( i = 0; i++ < COUNT; )
>     for ( i = COUNT; i > 0; i-- )
>     for ( i = COUNT; i > 0; --i )
>     for ( i = COUNT; --i >= 0; )
>     for ( i = COUNT; i-- > 0; )
> 
> on a Sun 3/50 with both SunOS 3.5 cc and gcc 1.35, and looked at the
> generated code and the timings.  COUNT was a small (< 127) compile-time
> constant.  The loop body did not reference i and had no side-effects.
> In theory, all eight of these loops should have optimized to the most
> efficient loop possible, ignoring the otherwise unreferenced variable i
> and simply traversing the loop body the proper number of times.  On the
> 68020, this most efficient loop is a dbra instruction.  But in fact, cc
> never generates dbra
> 
	<details omitted>
> 
> CONCLUSION
> ----------
> 
> For both compilers and all levels of optimization, this loop:
> 
>     for ( i = COUNT; --i >= 0; )
> 
> gives the lowest overhead.

I tried these eight loops on Microsoft C v5.1 and was very surprised to
get the same results, more or less.  Jef's fastest loop (--i >= 0) on his
Sun was also MSC's fastest loop on an 80386.  And for the same reason: the
compiler wakes up and realizes that the condition flags are already set.
On 80x86, LOOP (decrement CX and branch if non-zero) is the fast loop
instruction, but it is ignored by the MSC optimizer as well.  (To be
fair, I didn't turn on any special optimization, so I can't say MSC
won't do it under some conditions.)  Making i a register variable would
obviously improve the timings, but MSC uses the SI register, not CX,
and therefore can't take advantage of LOOP.  The code MSC generates
for (--i >= 0) is
approx. 47% of the t-states used by the code it generates for the typical
(i=0; i<COUNT; i++) loop, but the typical case was not the worst case.

What an eye-opener!  I always assumed that it wasn't worth being fussy
about such things in C, that the compiler would optimize it perfectly.
Like Jef says, if you want a particular part of your code to run fast,
don't assume it will be optimized; study the compiler's output.

Random thought: What are our "optimizing compilers" optimizing if they 
leave loops alone?

/alastair/

trebor@biar.UUCP (Robert J Woodhead) (06/11/89)

In article <96@dbase.UUCP> awd@dbase.UUCP (Alastair Dallas) writes:
>Random thought: What are our "optimizing compilers" optimizing if they 
>leave loops alone?

	Why, Corporate Quarterly Earnings Reports, of course!


-- 
Robert J Woodhead, Biar Games, Inc.  !uunet!biar!trebor | trebor@biar.UUCP
``The worst thing about being a vampire is that you can't go to matinees
  and save money anymore.''

rbj@dsys.ncsl.nist.gov (Root Boy Jim) (06/13/89)

? From: Jef Poskanzer <pokey@well.uucp>
? Date: 8 Jun 89 21:45:08 GMT

? In the referenced message, rbj@dsys.ncsl.nist.gov (Root Boy Jim) wrote:
? }Hey, aren't most programs *I/O bound* anyway? And you can sometimes
? }unroll your loops.

? Any statement about "most programs" is guaranteed false.  In my case,
? I have a large suite of programs called PBM.  Most of them (note
? difference) are I/O bound (but less so now that I've improved the
? external data format).  Some are CPU bound.  I estimate that these
? would go about 1% faster if they used a dbra in their inner loop.

OK, I'll give you that point. It really *does* depend on your environment.
Even if what I say is true, there are still cases where it matters.

*But*, I would imagine that loop unrolling might beat the difference
that dbra might make. Have you looked into that?
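
(By unrolling I mean something on the order of this sketch, assuming
the trip count is a multiple of four and work() stands in for the
real body:

	register int i;

	for ( i = COUNT / 4; --i >= 0; ) {
		work(); work(); work(); work();	/* body repeated 4 times */
	}

so that the loop overhead is paid once per four bodies.)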

? }I seem to remember Chris Torek stating something about using
? }		printf("n = %d\n");

? Somehow I doubt Chris would make such a stupid mistake.

Well, I *did* actually run the code; I started off with something
more nebulous, then decided to investigate it. Evidently my
transcription didn't make it back to the mail text.
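
For the record, the loop as intended, with the argument actually
supplied, was:

	register short n = 10;

	do {
		printf("n = %d\n", n);
	} while (--n != -1);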

? What is your point, anyway?  I didn't see it in your article.

Sorry, it was hidden in the generated code rather than stated.

If you had looked past the obvious, trivial, unimportant error, and
got to the meat of my posting, you would have noticed that I showed
you how to get a `dbra' generated by the compiler.

Hey, mellow out! My first reaction was to say: SHPX LBH!, but I can
see where you're coming from. I didn't make my points very well, and
my opening statement's flippancy evidently turned you off.

You will notice that my posting was the only one that showed you
how to actually get a dbra generated.

Peace, and thanks for the bitmaps.

? Jef

?     Jef Poskanzer  pokey@well.sf.ca.us  {ucbvax, apple, hplabs}!well!pokey
?                     "Sound impressive? It should. It is."

	Root Boy Jim is what I am
	Are you what you are or what?