[comp.lang.c] Efficient Coding Practices

g-rh@XAIT.XEROX.COM (Richard Harter) (10/01/88)

In article <836@proxftl.UUCP> bill@proxftl.UUCP (T. William Wells) writes:

>At one time, I spent a lot of my spare time improving other
>people's code.  One of the things I discovered is that almost all
>the code I was dealing with had better than 25% execution time
>fat.  What I mean is that when I merely changed all of the code
>to follow my coding practices (which are not anything special,
>just obvious things that every coder ought to know), it got that
>much better; I am not talking about mere tweaks, nor am I talking
>about optimizing particular sections of code.  I am talking about
>reading each line of code and making obvious changes (things like
>using temporaries when needed, adding appropriate register
>declarations, eliminating obviously unnecessary function calls,
>etc.)

	I would say that we all know what Bill is talking about
here, except that "we" all obviously don't.  Basically this is 
using hand optimization as a default in coding style.  One can take
the view that a good optimizer will do many of these things, e.g.
use of temporaries, moving static expressions outside loops, placing
the appropriate variables in registers, etc.  One can also take the
view that the capabilities and realities of optimizing compilers are
less than claimed.  Hand optimizing is safer across a variety of
environments.

	In my experience hand optimizing in original code is less
efficient, in the sense of code written per unit time, than writing
code initially in a "clear" style first, and then hand optimizing
afterwards.  In short, write it in the simplest way first, get it
working, and then improve performance.

	... Discussion of practical superiority of pointers deleted.

	Let me give a simple example.  Ignoring library routines,
inline routines, etc, suppose that we want to copy n bytes from one
place to another, say array src to array dst.  We might write

	for(i=0;i<n;i++) dst[i]=src[i];

Let's hand optimize this.  Instead of using indices (presumably less
efficient) we will use pointers.  Since we don't want to destroy dst
and src as pointers we need temporaries.  Thus we might write

	tmp1 = dst;
	tmp2 = src;
	for (i=0;i<n;i++) *tmp1++ = *tmp2++;

[Leave us not worry about whether there is a more efficient way to
handle the loop control.]  The "optimized" version requires more
statements; it requires more variables.  
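[For concreteness, here are the two forms as complete functions — a
sketch of mine, not from the post; the names copy_index and copy_ptr
are invented:]

```c
#include <stddef.h>

/* Index version: the "clear" form written first. */
void copy_index(char *dst, const char *src, size_t n)
{
	size_t i;
	for (i = 0; i < n; i++)
		dst[i] = src[i];
}

/* Pointer version: the hand-optimized form, with temporaries so that
 * dst and src themselves are not destroyed. */
void copy_ptr(char *dst, const char *src, size_t n)
{
	char *tmp1 = dst;
	const char *tmp2 = src;
	size_t i;
	for (i = 0; i < n; i++)
		*tmp1++ = *tmp2++;
}
```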
-- 

In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
	Richard Harter, SMDS  Inc.

francis@proxftl.UUCP (Francis H. Yu) (10/03/88)

In article <34112@XAIT.XEROX.COM> g-rh@XAIT.Xerox.COM (Richard Harter) writes:
!
!	Let me give a simple example.  Ignoring library routines,
!inline routines, etc, suppose that we want to copy n bytes from one
!place to another, say array src to array dst.  We might write
!
!	for(i=0;i<n;i++) dst[i]=src[i];
!
!Let's hand optimize this.  Instead of using indices (presumably less
!efficient) we will use pointers.  Since we don't want to destroy dst
!and src as pointers we need temporaries.  Thus we might write
!
!	tmp1 = dst;
!	tmp2 = src;
!	for (i=0;i<n;i++) *tmp1++ = *tmp2++;
!
The better code is 
	tmp1 = dst;
	tmp2 = src;
	tmp3 = dst + n;
	while (tmp1 != tmp3) {
		*tmp1++ = *tmp2++;
	}

g-rh@XAIT.Xerox.COM (Richard Harter) (10/03/88)

In article <846@proxftl.UUCP> francis@proxftl.UUCP (Francis H. Yu) writes:
>In article <34112@XAIT.XEROX.COM> g-rh@XAIT.Xerox.COM (Richard Harter) writes:

>!Let's hand optimize this.  Instead of using indices (presumably less
>!efficient) we will use pointers.  Since we don't want to destroy dst
>!and src as pointers we need temporaries.  Thus we might write

>!	tmp1 = dst;
>!	tmp2 = src;
>!	for (i=0;i<n;i++) *tmp1++ = *tmp2++;

>The better code is 
>	tmp1 = dst;
>	tmp2 = src;
>	tmp3 = dst + n;
>	while (tmp1 != tmp3) {
>		*tmp1++ = *tmp2++;
>	}

	The point of this little exercise was to illustrate that using
pointers quite often means extra lines of code and temporaries.  However
your "better" code is not necessarily better.  In many machines it is more
efficient to control loops by counting a register down to zero and escaping
on zero than it is to exit on a compare.  A good compiler will do exactly
that with the sample code.  If we are hand optimizing, we write

	register int i;
	...
	tmp1 = dst;
	tmp2 = src;
	for (i=n;i;--i) *tmp1++ = *tmp2++;

Your while loop, in pseudo machine language, runs something like this

	mv dst r1			# load register 1 with dst
	mv src r2			# load register 2 with src
	mv n,r3				# move n to register 3
	ble a2				# n already done, skip loop
	add r1,r3			# add n to get termination
	br a1				# loop test at bottom, go there
a0:	mv *r2++ *r1++			# Move src to dst, incrementing
a1:	compare r1,r3			# compare termination with dst ptr
	bne a0				# not eq, go to loop start
a2:

The corresponding code for a countdown loop is

	mv dst r1			# load register 1 with dst
	mv src r2			# load register 2 with src
	mv n r3				# load register 3 with n
	ble a2				# n already done, skip loop
	br a1				# loop test at bottom, go there
a0:	mv *r2++, *r1++			# Move src to dst, incrementing
a1:	dec r3				# Decrement count down index
	bge a0				# index not negative, do more
a2:

The loop bodies are the same except that the compare is replaced with
a decrement, which may be faster (it is on most machines) and is never
slower (as far as I know).  Moreover many machines have a decrement and
test instruction.  For example, the PDP-11 has an SOB instruction which
combines the two.
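[Filled out as a complete function — my sketch; the post gives only the
loop itself — the countdown form is:]

```c
#include <stddef.h>

/* Countdown form of the copy loop.  The loop test compares against
 * zero, which many machines handle with a single decrement-and-branch
 * instruction rather than a separate compare. */
void copy_countdown(char *dst, const char *src, size_t n)
{
	char *tmp1 = dst;
	const char *tmp2 = src;
	size_t i;
	for (i = n; i != 0; --i)
		*tmp1++ = *tmp2++;
}
```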

Note: It is more efficient to put the loop test at the bottom of the loop
and branch there to initiate the loop; it saves a branch instruction.

Lesson:  If efficiency is a concern, countdown loops are faster than
compare value loops.

Lesson:  If one is optimizing code one has to think about what the machine
will have to do in different implementations when comparing them.
-- 

In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
	Richard Harter, SMDS  Inc.

chris@mimsy.UUCP (Chris Torek) (10/04/88)

>In article <34112@XAIT.XEROX.COM> g-rh@XAIT.Xerox.COM (Richard Harter) writes:
>>Ignoring library routines, inline routines, etc, suppose that we want
>>to copy n bytes from one place to another, say array src to array dst.
>>We might write
>>	for(i=0;i<n;i++) dst[i]=src[i];

>>Let's hand optimize this.

[This was part of an argument *against*, but let that stand:]

>>	tmp1 = dst;
>>	tmp2 = src;
>>	for (i=0;i<n;i++) *tmp1++ = *tmp2++;

In article <846@proxftl.UUCP> francis@proxftl.UUCP (Francis H. Yu) writes:
>The better code is 
>	tmp1 = dst;
>	tmp2 = src;
>	tmp3 = dst + n;
>	while (tmp1 != tmp3) {
>		*tmp1++ = *tmp2++;
>	}

Better for whom?

On a 68010, the following is *much* better:

	register short n1;
	if ((n1 = n - 1) >= 0)
		do *tmp1++ = *tmp2++; while (--n1 != -1);

because it can compile into a `dbra' loop and take advantage of the
68010 loop mode.  But this is much less efficient on a VAX than the
`movc3' instruction that the compiler might generate for the original
loop.  But the second way is better for the Foobar CPU, which has a
`count-up' loop mode; but the third is better for the BazWoRKS chip.
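[The dbra-friendly loop wrapped in a complete function — a sketch of
mine, not Torek's; note the `short` counter, since dbra operates on
16-bit counts, so this form assumes n < 32768:]

```c
/* dbra-style copy: the counter runs from n-1 down through -1, matching
 * the 680x0 DBcc idiom.  Assumes n < 32768 (16-bit counter). */
void copy_dbra(char *dst, const char *src, int n)
{
	register short n1;
	char *tmp1 = dst;
	const char *tmp2 = src;
	if ((n1 = (short)(n - 1)) >= 0)
		do *tmp1++ = *tmp2++; while (--n1 != -1);
}
```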

This is micro-efficiency at its finest: you cannot characterise it
outside of its environment.  Which loop is `best' is heavily machine
dependent.  If that loop takes much time, go ahead and optimise it,
but if not, you might as well not bother, since everyone else will
just have to re-optimise it differently anyway.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

ok@quintus.uucp (Richard A. O'Keefe) (10/04/88)

In article <34196@XAIT.Xerox.COM> g-rh@XAIT.Xerox.COM (Richard Harter) writes:
>>!	tmp1 = dst;
>>!	tmp2 = src;
>>!	for (i=0;i<n;i++) *tmp1++ = *tmp2++;

>In many machines it is more
>efficient to control loops by counting a register down to zero and escaping
>on zero than it is to exit on a compare.  A good compiler will do exactly
>that with the sample code.

This is only true if the machine's loop control instructions fit the loop
in question *very* closely.  For example, at least some MC680x0 C compilers
will generate DBcc loops for neither of these forms:
	for (i = 0; i < n; i++) <stmt>
	for (i = n; --i >= 0; ) <stmt>
Instead, you have to write
	for (i = n; --i != -1; ) <stmt>
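[As a sanity check — mine, not the original post's — all three loop
shapes should agree on the iteration count:]

```c
/* Each function runs an empty loop n times in a different shape and
 * returns the number of iterations actually performed. */
int count_up(int n)   { int i, c = 0; for (i = 0; i < n; i++)       c++; return c; }
int count_down(int n) { int i, c = 0; for (i = n; --i >= 0; )       c++; return c; }
int count_dbcc(int n) { int i, c = 0; for (i = n; --i != -1; )      c++; return c; }
```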

>Lesson:  If efficiency is a concern, countdown loops are faster than
>compare value loops.

>Lesson:  If one is optimizing code one has to think about what the machine
>will have to do in different implementations when comparing them.

Lesson: the 2nd lesson above renders the 1st moot.  Countdown loops may be
faster than others.  Or they may be slower.  Or it might depend on the size
of the count variable.  And it depends on the compiler as well as the machine.

Lesson: looking for a way to avoid the loop entirely is going to pay off on
most machines; tweaking it for one machine may make it very bad for another.

johnl@ima.ima.isc.com (John R. Levine) (10/04/88)

In article <34196@XAIT.Xerox.COM> g-rh@XAIT.Xerox.COM (Richard Harter) writes:
>>! [ first allegedly optimal code ]
>>!	tmp1 = dst;
>>!	tmp2 = src;
>>!	for (i=0;i<n;i++) *tmp1++ = *tmp2++;
>
>> [second allegedly optimal code]
>>	tmp1 = dst;
>>	tmp2 = src;
>>	tmp3 = dst + n;
>>	while (tmp1 != tmp3) {
>>		*tmp1++ = *tmp2++;
> [ third allegedly optimal code]
>	register int i;
>	...
>	tmp1 = dst;
>	tmp2 = src;
>	for (i=n;i;--i) *tmp1++ = *tmp2++;

On an Intel 386, assuming your compiler isn't smart enough to recognize such
loops and generate string move instructions, and assuming the
two blocks don't overlap, your best bet probably is:

	register int i;
	register char *rdst = dst, *rsrc = src;

	for (i = n; --i >= 0; )
		rdst[i] = rsrc[i];

This uses the 386's scaled index modes and loop control instructions and
generates a loop two instructions long.  On non-Vax machines *p++ does
not generate particularly good code, after all.
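[Here is that indexed loop as a complete function — a sketch; note it
copies from the high index down to 0, which is fine for the
non-overlapping blocks assumed above:]

```c
/* Indexed, count-down copy: one index register serves both arrays,
 * exploiting scaled-index addressing on machines that have it.
 * Copies elements n-1 down through 0. */
void copy_indexed(char *dst, const char *src, int n)
{
	register int i;
	for (i = n; --i >= 0; )
		dst[i] = src[i];
}
```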

The message here is that unless you have a specific performance problem in
a specific environment, such micro-optimization is a waste of time since
the "best" code depends heavily on the particular instruction set and
addressing model in use.
-- 
John R. Levine, IECC, PO Box 349, Cambridge MA 02238-0349, +1 617 492 3869
{ bbn | think | decvax | harvard | yale }!ima!johnl, Levine@YALE.something
Rome fell, Babylon fell, Scarsdale will have its turn.  -G. B. Shaw

space@sns.UUCP (Lars Soltau) (10/05/88)

In article <34196@XAIT.Xerox.COM> g-rh@XAIT.Xerox.COM (Richard Harter) writes:
>Lesson:  If one is optimizing code one has to think about what the machine
>will have to do in different implementations when comparing them.

Lesson: never optimize C code if you have not written the compiler yourself;
it's far easier and safer to optimize the assembler code.
-- 
Lars Soltau	UUCP: uunet!unido!sns!spcnet!space	BIX: -- no bucks --

Here's looking at you, kid!
		-- the Medusa

g-rh@XAIT.Xerox.COM (Richard Harter) (10/08/88)

In article <222@sns.UUCP> space@sns.UUCP (Lars Soltau) writes:
>In article <34196@XAIT.Xerox.COM> g-rh@XAIT.Xerox.COM (Richard Harter) writes:
>>Lesson:  If one is optimizing code one has to think about what the machine
>>will have to do in different implementations when comparing them.

>Lesson: never optimize C code if you have not written the compiler yourself,
>it's far easier and safer to optimize the assembler code.

If only life were so simple.  If you are maintaining programs across a host
of machines and operating systems assembler code is a strong minus.  For many
of us the issue of interest is optimization within the constraints of
portability.
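[Within those portability constraints, one common answer — assuming an
ANSI C library is available — is to hand the whole problem to the
vendor's library, whose memcpy is typically hand-tuned per machine:]

```c
#include <string.h>

/* Portable copy: defer to the library's memcpy, which each vendor can
 * implement with whatever machine-specific tricks apply.  Behavior is
 * undefined if the blocks overlap. */
void copy_portable(char *dst, const char *src, size_t n)
{
	memcpy(dst, src, n);
}
```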
-- 

In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
	Richard Harter, SMDS  Inc.