[comp.lang.c] Some optimization results

rh@smds.UUCP (Richard Harter) (05/04/91)

This is a case example of code optimization.  The context is that a 1024
byte page was being written to a binary file.  For reasons having to do
with infelicities in some NFS implementations the decision was made to
read the page back in and compare what was written with what was read.
[I am not defending or propounding this as the best approach; expert
opinions are solicited.]  In any case, given that this was the functionality
of the page write routine, we wanted to improve its performance.  After
setting up a test suite and a profiled version (gprof), the timings were

self time 1.30 ms/call
total     2.58 ms/call
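
For orientation, the routine in question does roughly the following.  This
is a reconstruction for the sake of discussion, not the actual source; the
names are made up and error checking is omitted.

	#include <unistd.h>

	#define PAGESIZE 1024

	extern void error_handler(void);

	void page_write(int fd, long offset, char *page)
	{
		char readback[PAGESIZE];        /* automatic -- this matters below */
		int i;

		lseek(fd, offset, SEEK_SET);    /* seek to the page position */
		write(fd, page, PAGESIZE);      /* write the page            */
		lseek(fd, offset, SEEK_SET);    /* seek back                 */
		read(fd, readback, PAGESIZE);   /* read it back in           */

		for (i = PAGESIZE; --i >= 0;)   /* compare written vs. read  */
			if (page[i] != readback[i])
				error_handler();
	}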

[The total includes one read, one write, and two lseeks].  How much time
was spent in the compare loop?  Comment it out and we get

self time  .02 ms/call
total     1.34 ms/call

Interesting.  A one line compare loop seems to be inordinately expensive.
What's going on here?  A quick check on the code shows that the readback
page is declared as an automatic array of 1024 bytes on the stack.  I
read on the net that this is a bad thing to do -- it plays hob with the
optimization.  So we make it static and get

self time  .45 ms/call
total     1.73 ms/call
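
The change is nothing more than the storage class of the readback buffer
(again with an illustrative name):

	/* before:  char readback[1024];        automatic, on the stack */
	static char readback[1024];          /* after: static storage   */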

Very interesting indeed.  We get a factor of three just by not putting
an array on the stack.  There's one to remember.  At this point our work
is mostly done -- a reduction from 1.73 to 1.34 (best possible improvement)
would only be another 20%.  However, we might as well go for it.  The
loop code looks like this:

	register char *p, *q;
	register int i;
	...
	for (i=1024;--i>=0;) if (*p++ != *q++) error_handler();

[Not an exact transcription.]  We already have register variables and
a countdown loop; nothing to be gained there.  The next obvious thing
to try is loop unrolling.  This knocks us down to .35 ms.  Nice but can
we do better?  Yes -- everything is aligned, so we can do integer compares.
This pushes it down to .30 ms/call.  Can we do better?  Yes indeedy.
Since this loop never fails (unless bad things have happened) we don't
need to execute conditional code inside it; all we need to do is record
whether a failure occurred.  Our first try is something like this:

	register int *ip, *iq, i, check;
	...
	check = 0;
	for (i=32;--i>=0;) {
		check += (*ip++ != *iq++);
		... seven more replications ...
		}
	if (check > 0) error_handler();

This seems clever, but it doesn't buy us anything.  Why not?  Upon 
reflection it turns out that the expression, (*ip++ != *iq++), is being
forced to 1 or 0, so we haven't really gotten rid of the conditional code.
Aha.  We don't need a count, we need zero versus nonzero.  We can get zero
or nonzero out of the two values being compared with an xor (which produces
nonzero on inequality) and accumulate into check with a bitwise or (any
nonzero compare forces check to be nonzero thereafter).  So the replicated
loop line becomes

		check |= (*ip++ ^ *iq++);
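
Putting the pieces together, the final version of the compare is along
these lines (not an exact transcription; it assumes 4-byte ints and
int-aligned 1024-byte buffers):

	extern void error_handler(void);

	void page_check(int *ip, int *iq)       /* written page, readback page */
	{
		register int i, check;

		check = 0;
		for (i = 32; --i >= 0;) {           /* 32 passes x 8 ints = 256 ints */
			check |= (*ip++ ^ *iq++);       /*                  = 1024 bytes */
			check |= (*ip++ ^ *iq++);
			check |= (*ip++ ^ *iq++);
			check |= (*ip++ ^ *iq++);
			check |= (*ip++ ^ *iq++);
			check |= (*ip++ ^ *iq++);
			check |= (*ip++ ^ *iq++);
			check |= (*ip++ ^ *iq++);
		}
		if (check != 0)
			error_handler();
	}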

And our final timings are:

Final                       Original

self time  .20 ms/call      1.30 ms/call
total     1.46 ms/call      2.58 ms/call

What have we learned?  First of all we learned that careful optimization
of plausible looking code can reduce execution time of the code in question
by a big factor, e.g. a factor of 6.5 in this case.  Secondly the effect on
the total time is not nearly as significant (cost cut by 40%).  Thirdly
we have guidelines on code optimization:

(a)	Make arrays static or malloced; don't make them automatic.
(b)	Use loop unrolling and count down loops.
(c)	Avoid using conditional code within tight loops
(d)	It's often better to run a loop to completion and test for
	failure afterwards than it is to test with the loop.
(e)	Bitwise operators can make a significant difference

Just as important, there are limits to the value of local code optimization.
We could gain a factor of two just by removing the readback test (a safety
issue).  More importantly, the most significant improvements can only be
made by changing the underlying program so that fewer writes have to be
made.

Note: The test machine was a SUN 4/110.  The code presented has comments
removed and does not follow the actual layout and naming conventions.
-- 
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh Phone: 508-369-7398 
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb.  This sentence short.  This signature done.

rh@smds.UUCP (Richard Harter) (05/05/91)

This is a followup.  The timings referred to were done in situ, i.e.
the routine being tested was executed as part of the execution of a
large program running a fixed test suite.  For timing purposes this is
not entirely satisfactory because of the high variance of gprof timings.
To get better times, I set up a test bed program to time the loop itself.
All variables were register variables with the arrays and pointers being
ints.  Here are the loop times:

Statement			Unroll=1	Unroll=8

if (*ia++ != *ib++) break;	.58 ms		.48 ms
check += (*ia++ == *ib++);	.43 ms		.33 ms
check |= (*ia++ ^ *ib++);	.32 ms		.20 ms
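
The test bed is essentially a loop that repeats the compare many times and
divides out; something like this (the names and the repetition count are
illustrative, not the exact program, and the unrolled variants simply
repeat the statement in the loop body):

	#include <stdio.h>
	#include <time.h>

	#define NINTS 256               /* 1024 bytes viewed as 4-byte ints */
	#define REPS  100000

	int main(void)
	{
		static int a[NINTS], b[NINTS];
		register int *ia, *ib, i, check;
		int rep;
		clock_t start;

		start = clock();
		for (rep = 0; rep < REPS; rep++) {
			ia = a; ib = b; check = 0;
			for (i = NINTS; --i >= 0;)
				check |= (*ia++ ^ *ib++);
		}
		printf("%.3f ms per compare\n",
			1000.0 * (clock() - start) / CLOCKS_PER_SEC / REPS);
		return check;           /* keeps the loop from being optimized away */
	}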

These times are fairly repeatable.  Note that the time reduction due to
unrolling loops is constant.  This implies that loop unrolling only
provides significant improvements in relative times if the loop is very
tight.  Also note that loop unrolling past 8 times offers very little
advantage -- with an unroll factor of 8 you have already removed 7/8 of
the loop tests.
-- 
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh Phone: 508-369-7398 
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb.  This sentence short.  This signature done.

shap@shasta.Stanford.EDU (shap) (05/07/91)

In article <444@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>
>What have we learned?  First of all we learned that careful optimization
>of plausible looking code can reduce execution time of the code in question
>by a big factor, e.g. a factor of 6.5 in this case.  Secondly the effect on
>the total time is not nearly as significant (cost cut by 40%).  Thirdly
>we have guidelines on code optimization:
>
>(a)	Make arrays static or malloced; don't make them automatic.
>(b)	Use loop unrolling and count down loops.
>(c)	Avoid using conditional code within tight loops
>(d)	It's often better to run a loop to completion and test for
>	failure afterwards than it is to test with the loop.
>(e)	Bitwise operators can make a significant difference
>
>Note: The test machine was a SUN 4/110.  The code presented has comments
>removed and does not follow the actual layout and naming conventions.

It's good when someone describes a compiler experiment at the level of
detail you did.  A postmortem may be interesting to some readers, so I
am offering one.

The compiler you were using explains everything.  Of the five optimizations
you did, only two involve things the compiler cannot see for itself and
really did need to be done by hand.

The compiler is not free to remove the branch test within the loop
because it does not have enough information.  Suppose that the
array passed in to the function was at the end of memory.  If the
branch were deferred, the resulting program might cause an addressing
exception, which would be incorrect code.  The compiler must be
conservative and leave the branch alone.
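
To see why, consider a loop of this shape (my illustration, not the code
from the original posting):

	/* From the compiler's point of view, p and q might point at buffers
	   shorter than 1024 bytes. */
	int pages_differ(char *p, char *q)
	{
		int i;

		for (i = 0; i < 1024; i++)
			if (p[i] != q[i])
				return 1;       /* bytes past i are never touched */
		return 0;
	}

Deferring the test to after the loop would read all 1024 bytes of both
buffers unconditionally, and that is only safe if the compiler can prove
the buffers really are that long.  The programmer knows it; the compiler,
in general, does not.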

Similarly, the compiler in general has no way to know that you don't actually
care about the magnitude of the check variable.  It SHOULD have been
smart enough to turn the equality comparison into a plain zero/nonzero
test, and this is a very interesting special case.  You might be
interested to know that the GNU compiler for the SPARC handles this
case.

The compiler should have been able to handle auto arrays fine, and
should not have had any difficulty unrolling the final version of your
loop.  It might be an interesting experiment to check whether, once you
remove the branch, the compiler unrolls the loop correctly on its own.  Your
description didn't indicate that you tried this test, and on a conventional
SPARC, doing the unrolling with that conditional branch in the picture
doesn't help.  Note that your unrolling took advantage of knowing that
exceptions couldn't happen.

In this particular case, though, the unrolling is hard for the compiler
to spot.  Basically, the compiler has to notice that it can use the branch
below the loop as an early termination for the unrolled loop.  To do this,
it has to know that you only care about whether the check variable is
nonzero.

So, in understanding your conclusions, it's useful to look at what they
tell us.  I have taken the liberty of rewording them:

>(a)	Make arrays static or malloced; don't make them automatic.

Don't bother. Get a serious compiler.

>(b)	Use loop unrolling and count down loops.

In this specific case you were after early termination, which is a very,
very special case.  Once again, you know that terminating after a few extra
ops doesn't hurt, but the compiler cannot know that.  In general, the
compiler is probably good at unrolling.

More to the point, use the memcmp() or bcmp() routines.  They are probably
better suited to the problem because they are hand-tuned to the system
at hand.
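
For example, the whole check collapses into a library call (a sketch;
error_handler is from the original posting, the other names are
illustrative):

	#include <string.h>

	extern void error_handler(void);

	void verify_page(char *page, char *readback)
	{
		/* memcmp returns nonzero if the 1024 bytes differ anywhere;
		   bcmp is the older BSD equivalent. */
		if (memcmp(page, readback, 1024) != 0)
			error_handler();
	}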

>(c)	Avoid using conditional code within tight loops

Good idea.  This is something the compiler typically cannot help with
much.

>(d)	It's often better to run a loop to completion and test for
>	failure afterwards than it is to test with the loop.

Good idea, bearing in mind that you have to know that things like
exceptions won't get in the way.

>(e)	Bitwise operators can make a significant difference

True, and well worth remembering, but in your case you simply needed a
serious compiler.

Jonathan Shapiro

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (05/07/91)

In article <186@shasta.Stanford.EDU> shap@shasta.Stanford.EDU (shap) writes:
> >(a)	Make arrays static or malloced; don't make them automatic.
> Don't bother. Get a serious compiler.

``> (a) Write fast programs.'' ``Don't bother. Get a Cray.''

Unfortunately, buying a Cray will only help code that is never used on
another machine. Similarly, getting a better compiler will only help
code that is distributed in binary form for that machine, or is not
distributed at all.

I find both pieces of advice remarkably counterproductive, as they take
time away from serious thought about optimization.

> More to the point, use the memcmp() or bcmp() routines.  They are probably
> better suited to the problem because they are hand-tuned to the system
> at hand.

Piercarlo's study some time back proved otherwise.

> In general, the compiler is probably good at unrolling.

Most compilers can't unroll anything.

---Dan

oz@yunexus.yorku.ca (Ozan Yigit) (05/09/91)

Dan Bernstein writes:

>I find both pieces of advice remarkably counterproductive, as they take
>time away from serious thought about optimization.

If by "serious thought about optimization" you mean something like your
last attempt to optimize and make a mess out of what was already cleanly
done by Richard O'Keefe, I think you have yet to find out the meaning of
the word [counterproductive]. ;-)

To misquote Chip, microoptimization produces microsuccess.  Have fun.

oz

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (05/09/91)

In article <22683@yunexus.YorkU.CA> oz@yunexus.yorku.ca (Ozan Yigit) writes:
> If by "serious thought about optimization" you mean something like your
> last attempt to optimize and make a mess out of what was already cleanly
> done by Richard O'Keefe, I think you have yet to find out the meaning of
> the word [counterproductive]. ;-)

Yeah, that's exactly the sort of thing I mean, and it turned out to be
extremely productive.

The code after O'Keefe's optimizations would not be automatically
vectorized by the Cray compiler. The code after my optimizations would,
with no special directives or further work on the programmer's part.
In other words, the microoptimizations that you take such pains to
criticize ended up speeding up the code by a factor of fourteen.

That's right, a factor of fourteen.

FOURTEEN TIMES FASTER. When's the last time you saw that kind of speed?

Maybe you should review your definition of ``productive'' before you go
criticizing my programming techniques. I can't afford to ignore speedups
by a factor of fourteen. (Dik and I both thought the discussion was over
back then, so I didn't bother posting these results before.)

Even with vectorization off, my minor changes produced a 15% speedup,
and up to 50% on smaller machines. That's quite a lot if the code in
question is a bottleneck. Maybe my changes did make a ``mess'' out of
those fifteen lines, but I think it was well worth it.

Now that you know what you're talking about, Ozan, what conclusions
would you like to draw from that example?

---Dan

dik@cwi.nl (Dik T. Winter) (05/11/91)

In article <5850:May901:25:0491@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
 > In article <22683@yunexus.YorkU.CA> oz@yunexus.yorku.ca (Ozan Yigit) writes:
 > > If by "serious thought about optimization" you mean something like your
 > > last attempt to optimize and make a mess out of what was already cleanly
 > > done by Richard O'Keefe, I think you have yet to find out the meaning of
 > > the word [counterproductive]. ;-)
 > 
 > Yeah, that's exactly the sort of thing I mean, and it turned out to be
 > extremely productive.
Ah, bah.
 > 
 > The code after O'Keefe's optimizations would not be automatically
 > vectorized by the Cray compiler. The code after my optimizations would,
 > with no special directives or further work on the programmer's part.
 > In other words, the microoptimizations that you take such pains to
 > criticize ended up speeding up the code by a factor of fourteen.
You have forgotten something about our e-mail exchanges.  The reason the
Cray compiler did not vectorize was some silliness of the compiler writer
or somesuch.  The compiler saw dependencies that were not there.  And that
was the new compiler (scc); interestingly enough, with the standard compiler
(cc) your code would not vectorize either.  And as I mailed you, when
(given the proper directives) the compiler did vectorize, your code was
marginally faster or a bit slower than Richard O'Keefe's.

In general micro-optimization does not pay, but:
It is silly to call something an optimization when it is nothing more than
a work-around for a compiler bug.  And it is sillier still when the result
is slower without the compiler bug.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (05/14/91)

In article <3501@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
> It is silly to call something an optimization when it is nothing more than
> a work-around for a compiler bug.

Hold on. The ``bug'' here is simply that the compiler missed an
optimization. Do you mean to imply that any compiler which misses any
optimization is buggy?

I don't think so. In fact, I don't believe that compilers in the
foreseeable future will come even close to humans in optimization skill.
(That's why I want to see the optimizer work with the programmer, rather
than against him.)

In this case, you're right: someone who knew the Cray compiler could
introduce appropriate directives to vectorize O'Keefe's code. But what
about next time? And what about all the people who don't have a Cray?

On any given bottleneck, someone could either do the obvious big
optimizations and stop, or do every optimization that he can see that
will gain a few percent. Which strategy is better? The second has two
disadvantages: the programmer takes some time looking for simple, small
reorganizations, and that section of code will end up somewhat longer.
But it will make the code run faster on a whole bunch of machines---not
just once, but forever.

There will always be some compilers that don't see optimization X.
Someone who does X by hand will get the speedup---without even seeing
the compiler. In fact, when he writes the code, the compiler may not
even exist, but his code will still run fast when it's ported later.

I can keep pointing out how hand optimization gives huge speedups. Are
you going to say each time that the compiler is buggy? Okay, fine. Then
there are a shitload of ``buggy'' compilers in the world, and it makes
more sense to realize this fact than to pretend that hand optimization
doesn't work.

---Dan

oz@yunexus.yorku.ca (Ozan Yigit) (05/15/91)

Dan Bernstein writes:

>On any given bottleneck, someone could either do the obvious big
>optimizations and stop, or do every optimization that he can see that
>will gain a few percent.

An unreliable, non-universal few percent is almost never worth the
unreadable, unmaintainable mess that is produced as an end result of
"every optimization", given a sufficiently well-written program to
start with.

>I can keep pointing out how hand optimization gives huge speedups.

 wollman@emily.uvm.edu (Garrett Wollman) points out elsewhere:

 | [For example, yabba will not work for (some subset of the set of)
 | large binary files on our SGI 4D/340S.  Dan also spent so much effort
 | unrolling loops that the source is nearly incomprehensible, and breaks
 | a parallelizing preprocessor which can be used by the MIPS/SGI
 | compiler to great effect.]

... oz
---
In seeking the unattainable, simplicity | Internet: oz@nexus.yorku.ca
only gets in the way. -- Alan J. Perlis | Uucp: utai/utzoo!yunexus!oz

dik@cwi.nl (Dik T. Winter) (05/15/91)

In article <14077:May1321:14:2691@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
 > Hold on. The ``bug'' here is simply that the compiler missed an
 > optimization. Do you mean to imply that any compiler which misses any
 > optimization is buggy?
Yup.  If the compiler tells me it is not doing something for a reason
that is patently untrue.  (In this case, a non-existent dependency.)
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (05/15/91)

In article <22723@yunexus.YorkU.CA> oz@yunexus.yorku.ca (Ozan Yigit) writes:
  [ an implication that hand optimization only gains a few percent ]

Sorry, Ozan, but you're wrong. The yabba code after hand optimization
was slightly over two times faster than the code before. As another
example, I've taken the ``compress'' source and optimized it by hand;
the result is 30% faster for compression and 50% faster for
decompression. That is a lot more than a few percent.

> >I can keep pointing out how hand optimization gives huge speedups.
>  wollman@emily.uvm.edu (Garrett Wollman) points out elsewhere:

Where is that? All reports of yabbawhap problems should either be sent
to me or be crossposted to comp.sources.bugs. How am I supposed to fix
them otherwise?

>  | [For example, yabba will not work for (some subset of the set of)
>  | large binary files on our SGI 4D/340S.

Dunno about this. I've received five reports of yabba failures, and in
three of the cases it was simply misconfigured in ways that checkconf
(which is part of the package) would have detected had it been run as
per instructions. I haven't received word yet on the fourth and fifth.

>  | Dan also spent so much effort
>  | unrolling loops that the source is nearly incomprehensible,

I have no problem understanding it. I really didn't spend much effort on
the unrolling; it was just a matter of taking a subroutine like this:

#define DOWN0(n,p,h,sp,sch,no,FOUND,NOTFOUND) { \
 if ((no) = HANO(h,sch)) { if (PARENT(p,no) != (sp)) { \
   do (no) = NEXT(n,no); while ((no) && (PARENT(p,no) != (sp))); \
   if (no) { FOUND } else { NOTFOUND } \
 } else { FOUND } } else { NOTFOUND } \
}

and adding a parallel version like this:

#define DOWN1(n,p,h,sp,sch,no,FOUND,NOTFOUND) { \
 if ((no) = HANO(h,sch)) { if (PARENT(p,no) != (sp)) { \
 if ((no) = NEXT(n,no))  { if (PARENT(p,no) != (sp)) { \
   do (no) = NEXT(n,no); while ((no) && (PARENT(p,no) != (sp))); \
   if (no) { FOUND } else { NOTFOUND } \
 } else { FOUND } } else { NOTFOUND } \
 } else { FOUND } } else { NOTFOUND } \
}

plus a DOWN macro to select between DOWN0 and DOWN1 (as well as DOWN2
and DOWN5). Since this unrolling is all isolated inside a separate
routine with a well-defined interface, it doesn't hurt maintainability.
(Admittedly, I'm the only one with a copy of that interface definition
at the moment; see the top of huptrie.h for further comments.)

The most important hand optimizations in yabba were data structure
rearrangements: -DPTRS (though -UPTRS is faster on the Convex, and other
machines where a[i] is as fast as a[0]), combining two arrays into a
single array of structs, etc. It'll be a long time before compilers can
do anything like this.
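
The two-arrays-into-one change is the usual sort of thing (a sketch, not
the actual yabba declarations):

	#define MAX 65536               /* illustrative size */

	/* before: two parallel arrays, accessed as parent[i] and next[i] */
	int parent[MAX], next[MAX];

	/* after: one array of structs, accessed as nodes[i].parent and
	   nodes[i].next, so both fields of an entry sit together in memory */
	struct node { int parent; int next; };
	struct node nodes[MAX];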

>  | and breaks
>  | a parallelizing preprocessor which can be used by the MIPS/SGI
>  | compiler to great effect.]

There is an important issue here: If an optimization will slow down the
code on some machine, then you should always provide a way to turn the
optimization off. In this case, -DOPTCANT1 is documented to turn off the
unrolling, and if Wollman's preprocessor is buggy then he should have
followed the instructions and compiled with that option. -DPTRS is a
similar example---I made that optimization with the Convex in mind, and
made sure that it could be easily toggled on or off. And so on.
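
The mechanism is ordinary conditional compilation; roughly this shape (a
sketch of the idea only -- the real huptrie.h also selects among DOWN2 and
DOWN5 and should be consulted for the details):

	#ifdef OPTCANT1
	#define DOWN DOWN0              /* plain loop: safe everywhere */
	#else
	#define DOWN DOWN1              /* unrolled version */
	#endif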

On the bright side, note that yabbawhap has so far found at least two
optimizer bugs, one in a prerelease version. At least in the latter
case, the bug will never see the light of day.

---Dan