rh@smds.UUCP (Richard Harter) (05/04/91)
This is a case study in code optimization. The context is that a 1024-byte
page was being written to a binary file. Because of infelicities in some
NFS implementations, the decision was made to read the page back in and
compare what was written with what was read.
[I am not defending or propounding this as the best approach; expert
opinions are solicited] In any case, given that this was the functionality
of the page write routine, we wanted to improve its performance. After
setting up a test suite and a profiled version (gprof) the timings were
self time 1.30 ms/call
total 2.58 ms/call
[The total includes one read, one write, and two lseeks]. How much time
was spent in the compare loop? Comment it out and we get
self time .02 ms/call
total 1.34 ms/call
Interesting. A one-line compare loop seems to be inordinately expensive.
What's going on here? A quick check on the code shows that the readback
page is declared as an automatic array of 1024 bytes on the stack. I
read on the net that this is a bad thing to do -- it plays hob with the
optimization. So we make it static and get
self time .45 ms/call
total 1.73 ms/call
Very interesting indeed. We get a factor of three just by not putting
an array on the stack. There's one to remember. At this point our work
is mostly done -- a reduction from 1.73 to 1.34 (the best possible
improvement) would only be another 20%. However, we might as well go for
it. Let's look at the loop code, which looks like
register char *p, *q;
register int i;
...
for (i=1024;--i>=0;) if (*p++ != *q++) error_handler();
[Not an exact transcription.] We already have register variables and
a countdown loop; nothing to be gained there. The next obvious thing
to try is loop unrolling. This knocks us down to .35 ms. Nice but can
we do better? Yes -- everything is aligned, so we can do integer compares.
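Concretely, the unrolled integer-compare version comes out looking something
like this (a sketch rather than the actual code; the function and buffer
names are illustrative, and it assumes int-aligned buffers with four-byte
ints, so a 1024-byte page is 256 ints):

extern void error_handler();

static void compare_ints(char *wbuf, char *rbuf)
{
    register int *ip = (int *) wbuf;    /* page as written */
    register int *iq = (int *) rbuf;    /* page as read back */
    register int i;

    /* 32 passes x 8 compares per pass = 256 ints = 1024 bytes */
    for (i = 32; --i >= 0;) {
        if (*ip++ != *iq++) error_handler();
        if (*ip++ != *iq++) error_handler();
        if (*ip++ != *iq++) error_handler();
        if (*ip++ != *iq++) error_handler();
        if (*ip++ != *iq++) error_handler();
        if (*ip++ != *iq++) error_handler();
        if (*ip++ != *iq++) error_handler();
        if (*ip++ != *iq++) error_handler();
    }
}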
This pushes it down to .30 ms/call. Can we do better? Yes indeedy.
Since this loop never fails (unless bad things have already happened) we
don't need to execute conditional code inside it; all we need to do is
record the fact of a failure. Our first try is something like this:
register int *ip, *iq, i, check;
...
check = 0;
for (i=32;--i>=0;) {
    check += (*ip++ != *iq++);
    ... seven more replications ...
}
if (check > 0) error_handler();
This seems clever, but it doesn't buy us anything. Why not? Upon
reflection it turns out that the expression, (*ip++ != *iq++), is being
forced to 1 or 0, so we haven't really gotten rid of the conditional code.
Aha. We don't need a count, we need zero versus nonzero. We can get that
from the two things being compared with an xor (which produces nonzero on
inequality) and accumulate into check with a bitwise or (any nonzero compare
forces check to be nonzero thereafter.) So the replicated loop line becomes
check |= (*ip++ ^ *iq++);
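Putting the unrolling, the integer compares, and the xor/or trick together,
the loop in its final form is roughly this (again a sketch rather than a
transcription; names are illustrative):

extern void error_handler();

static void compare_final(char *wbuf, char *rbuf)
{
    register int *ip = (int *) wbuf;
    register int *iq = (int *) rbuf;
    register int i, check;

    check = 0;                          /* stays 0 unless some word differs */
    for (i = 32; --i >= 0;) {
        check |= (*ip++ ^ *iq++);
        check |= (*ip++ ^ *iq++);
        check |= (*ip++ ^ *iq++);
        check |= (*ip++ ^ *iq++);
        check |= (*ip++ ^ *iq++);
        check |= (*ip++ ^ *iq++);
        check |= (*ip++ ^ *iq++);
        check |= (*ip++ ^ *iq++);
    }
    if (check != 0) error_handler();    /* the only conditional, outside the loop */
}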
And our final timings are:
              Final           Original
self time     .20 ms/call     1.30 ms/call
total        1.46 ms/call     2.58 ms/call
What have we learned? First of all, we learned that careful optimization
of plausible-looking code can reduce execution time of the code in question
by a big factor, e.g. a factor of 6.5 in this case. Secondly, the effect on
the total time is not nearly as significant (cost cut by 40%.) Thirdly,
we have guidelines on code optimization:
(a) Make arrays static or malloced; don't make them automatic.
(b) Use loop unrolling and count down loops.
(c) Avoid using conditional code within tight loops
(d) It's often better to run a loop to completion and test for
failure afterwards than it is to test with the loop.
(e) Bitwise operators can make a significant difference
Just as important, there are limits to the value of local code optimization.
We could gain a factor of two just by removing the readback test (a safety
issue.) More importantly, the most significant improvements can only be
made by changing the underlying program so that fewer writes have to be
made.
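For concreteness, the overall routine being timed has roughly the following
shape. This is a sketch, not the actual routine: the function name, the
verify_page() helper, and the error handling are illustrative, and
return-value checks on the system calls are omitted.

#include <unistd.h>

#define PAGESIZE 1024

extern void error_handler();
extern void verify_page();              /* e.g. the compare loop shown above */

static char readback[PAGESIZE];         /* static, not automatic -- see above */

void put_page(int fd, long offset, char *page)
{
    lseek(fd, offset, SEEK_SET);        /* first lseek */
    write(fd, page, PAGESIZE);          /* one write */
    lseek(fd, offset, SEEK_SET);        /* second lseek */
    read(fd, readback, PAGESIZE);       /* one read */
    verify_page(page, readback);        /* calls error_handler() on mismatch */
}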
Note: The test machine was a SUN 4/110. The code presented has comments
removed and does not follow the actual layout and naming conventions.
--
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh Phone: 508-369-7398
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb. This sentence short. This signature done.
rh@smds.UUCP (Richard Harter) (05/05/91)
This is a followup. The timings referred to were done in situ, i.e. the
routine being tested was executed as part of the execution of a large
program running a fixed test suite. For timing purposes this is not
entirely satisfactory because of the high variance of gprof timings. To
get better times, I set up a test bed program to time the loop itself.
All variables were register variables, with the arrays and pointers being
ints. Here are the loop times:

Statement                        Unroll=1    Unroll=8
if (*ia++ != *ib++) break;        .58 ms      .48 ms
check += (*ia++ == *ib++);        .43 ms      .33 ms
check |= (*ia++ ^ *ib++);         .32 ms      .20 ms

These times are fairly repeatable. Note that the time reduction due to
unrolling the loops is constant. This implies that loop unrolling only
provides significant improvements in relative terms if the loop is very
tight. Also note that unrolling past 8 times offers very little
advantage -- with an unroll factor of 8 you have already removed 7/8 of
the loop tests.
--
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh Phone: 508-369-7398
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb. This sentence short. This signature done.
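A test bed along those lines might be set up roughly as follows. This is a
sketch, not the program that produced the numbers above: the buffer size,
iteration count, and gettimeofday-based timer are all assumptions.

#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define NINTS  256                      /* one 1024-byte page as 4-byte ints */
#define NCALLS 100000L                  /* enough repetitions to measure */

static int a[NINTS], b[NINTS];

static double seconds()
{
    struct timeval tv;
    gettimeofday(&tv, (struct timezone *) 0);
    return tv.tv_sec + tv.tv_usec / 1.0e6;
}

int main()
{
    register int *ia, *ib, i, check;
    long n;
    double t0, t1;

    memset(a, 1, sizeof a);
    memcpy(b, a, sizeof b);             /* identical buffers: compare never fails */

    t0 = seconds();
    for (n = 0; n < NCALLS; n++) {
        ia = a; ib = b; check = 0;
        for (i = NINTS; --i >= 0;)      /* the unrolled variants go here too */
            check |= (*ia++ ^ *ib++);
        if (check) return 1;            /* use the result so it isn't optimized away */
    }
    t1 = seconds();

    printf("check |= : %.2f ms/call\n", (t1 - t0) * 1000.0 / NCALLS);
    return 0;
}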
shap@shasta.Stanford.EDU (shap) (05/07/91)
In article <444@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>
>What have we learned? First of all, we learned that careful optimization
>of plausible-looking code can reduce execution time of the code in question
>by a big factor, e.g. a factor of 6.5 in this case. Secondly, the effect on
>the total time is not nearly as significant (cost cut by 40%.) Thirdly,
>we have guidelines on code optimization:
>
>(a) Make arrays static or malloced; don't make them automatic.
>(b) Use loop unrolling and count down loops.
>(c) Avoid using conditional code within tight loops
>(d) It's often better to run a loop to completion and test for
>    failure afterwards than it is to test with the loop.
>(e) Bitwise operators can make a significant difference
>
>Note: The test machine was a SUN 4/110. The code presented has comments
>removed and does not follow the actual layout and naming conventions.

It's good when someone describes a compiler experiment at the level of
detail you did. A postmortem may be interesting to some readers, so I am
offering one.

The compiler you were using explains everything. Of the 5 optimizations
you did, only two are not visible to the compiler and really did need to
be hand-generated.

The compiler is not free to remove the branch test within the loop because
it does not have enough information. Suppose that the array passed in to
the function was at the end of memory. If the branch were deferred, the
resulting program might cause an addressing exception, which would be
incorrect code. The compiler must be conservative and leave the branch
alone.

Similarly, the compiler in general has no way to know that you don't
actually care about the magnitude of the check variable. It SHOULD have
been smart enough to turn the equality comparison into a 1-or-0
comparison, and this is a very interesting special case. You might be
interested to know that the GNU compiler for the SPARC handles this case.

The compiler should have been able to handle auto arrays fine, and should
not have had any difficulty unrolling the final version of your loop. It
might be an interesting experiment to check whether, once you remove the
branch, the compiler unrolls it correctly. Your description didn't
indicate that you tried this test, and on a conventional SPARC, doing the
unrolling with that conditional branch in the picture doesn't help.

Note that your unrolling took advantage of knowing that exceptions
couldn't happen. In this particular case, though, the unrolling is hard
for the compiler to spot. Basically, the compiler has to notice that it
can use the branch below the loop as an early termination for the
unrolled loop. To do this, it has to know that you only care about the
nonzeroness of the check variable.

So, in understanding your conclusions, it's useful to look at what they
tell us. I have taken the liberty of rewording them:

>(a) Make arrays static or malloced; don't make them automatic.

Don't bother. Get a serious compiler.

>(b) Use loop unrolling and count down loops.

In this specific case you were after early termination, which is a very,
very special case. Once again, you know that terminating after a few
extra ops doesn't hurt, but the compiler cannot know that. In general,
the compiler is probably pretty good at unrolling.

More to the point, use the memcmp() or bcmp() routines. They are probably
better suited to the problem because they are hand-tuned to the system at
hand.
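For instance, the whole verify step could be written with the library
routine along these lines (a sketch; the function and buffer names are
made up):

#include <string.h>

#define PAGESIZE 1024

extern void error_handler();

/* Verify the readback against the page that was written, letting the
   library's (presumably hand-tuned) memcmp do the comparison. */
void verify_page(char *page, char *readback)
{
    if (memcmp(page, readback, PAGESIZE) != 0)
        error_handler();
}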
>(c) Avoid using conditional code within tight loops

Good idea. This is something the compiler typically cannot help with much.

>(d) It's often better to run a loop to completion and test for
>    failure afterwards than it is to test with the loop.

Good idea, bearing in mind that you have to know that things like
exceptions won't get in the way.

>(e) Bitwise operators can make a significant difference

True, and well worth remembering, but in your case you simply needed a
serious compiler.

Jonathan Shapiro
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (05/07/91)
In article <186@shasta.Stanford.EDU> shap@shasta.Stanford.EDU (shap) writes:
> >(a) Make arrays static or malloced; don't make them automatic.
> Don't bother. Get a serious compiler.

``> (a) Write fast programs.''

``Don't bother. Get a Cray.''

Unfortunately, buying a Cray will only help code that is never used on
another machine. Similarly, getting a better compiler will only help code
that is distributed in binary form for that machine, or is not distributed
at all. I find both pieces of advice remarkably counterproductive, as they
take time away from serious thought about optimization.

> More to the point, use the memcmp() or bcmp() routines. They are probably
> better suited to the problem because they are hand-tuned to the system
> at hand.

Piercarlo's study some time back proved otherwise.

> In general, the compiler is probably pretty good at unrolling.

Most compilers can't unroll anything.

---Dan
oz@yunexus.yorku.ca (Ozan Yigit) (05/09/91)
Dan Bernstein writes:
>I find both pieces of advice remarkably counterproductive, as they take
>time away from serious thought about optimization.

If by "serious thought about optimization" you mean something like your
last attempt to optimize and make a mess out of what was already cleanly
done by Richard O'Keefe, I think you have yet to find out the meaning of
the word [counterproductive]. ;-)

To misquote Chip: Microoptimization produces microsuccess.

Have fun.

oz
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (05/09/91)
In article <22683@yunexus.YorkU.CA> oz@yunexus.yorku.ca (Ozan Yigit) writes:
> If by "serious thought about optimization" you mean something like your
> last attempt to optimize and make a mess out of what was already cleanly
> done by Richard O'Keefe, I think you have yet to find out the meaning of
> the word [counterproductive]. ;-)

Yeah, that's exactly the sort of thing I mean, and it turned out to be
extremely productive.

The code after O'Keefe's optimizations would not be automatically
vectorized by the Cray compiler. The code after my optimizations would,
with no special directives or further work on the programmer's part. In
other words, the microoptimizations that you take such pains to criticize
ended up speeding up the code by a factor of fourteen.

That's right, a factor of fourteen. FOURTEEN TIMES FASTER. When's the
last time you saw that kind of speed? Maybe you should review your
definition of ``productive'' before you go criticizing my programming
techniques. I can't afford to ignore speedups by a factor of fourteen.

(Dik and I both thought the discussion was over back then, so I didn't
bother posting these results before.)

Even with vectorization off, my minor changes produced a 15% speedup, and
up to 50% on smaller machines. That's quite a lot if the code in question
is a bottleneck. Maybe my changes did make a ``mess'' out of those fifteen
lines, but I think it was well worth it.

Now that you know what you're talking about, Ozan, what conclusions would
you like to draw from that example?

---Dan
dik@cwi.nl (Dik T. Winter) (05/11/91)
In article <5850:May901:25:0491@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> In article <22683@yunexus.YorkU.CA> oz@yunexus.yorku.ca (Ozan Yigit) writes:
> > If by "serious thought about optimization" you mean something like your
> > last attempt to optimize and make a mess out of what was already cleanly
> > done by Richard O'Keefe, I think you have yet to find out the meaning of
> > the word [counterproductive]. ;-)
>
> Yeah, that's exactly the sort of thing I mean, and it turned out to be
> extremely productive.

Ah, bah.

> The code after O'Keefe's optimizations would not be automatically
> vectorized by the Cray compiler. The code after my optimizations would,
> with no special directives or further work on the programmer's part.
> In other words, the microoptimizations that you take such pains to
> criticize ended up speeding up the code by a factor of fourteen.

You have forgotten something about our e-mail exchanges. The reason the
Cray compiler did not vectorize was some silliness on the part of the
compiler writer or somesuch: the compiler saw dependencies that were not
there. And that was the new compiler (scc); interestingly enough, with
the standard compiler (cc) your code would not vectorize either. And as
I have mailed you, when (given the proper directives) the compiler would
vectorize, your code was marginally faster or a bit slower than Richard
O'Keefe's.

In general micro-optimization does not pay, but:

It is silly to call something an optimization when it is nothing more
than a work-around for a compiler bug. And still sillier when it is
slower without the compiler bug.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (05/14/91)
In article <3501@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
> It is silly to call something an optimization when it is nothing more than
> a work-around for a compiler bug.

Hold on. The ``bug'' here is simply that the compiler missed an
optimization. Do you mean to imply that any compiler which misses any
optimization is buggy? I don't think so. In fact, I don't believe that
compilers in the foreseeable future will come even close to humans in
optimization skill. (That's why I want to see the optimizer work with the
programmer, rather than against him.)

In this case, you're right: someone who knew the Cray compiler could
introduce appropriate directives to vectorize O'Keefe's code. But what
about next time? And what about all the people who don't have a Cray?

On any given bottleneck, someone could either do the obvious big
optimizations and stop, or do every optimization that he can see that
will gain a few percent. Which strategy is better? The second has two
disadvantages: the programmer takes some time looking for simple, small
reorganizations, and that section of code will end up somewhat longer.
But it will make the code run faster on a whole bunch of machines---not
just once, but forever.

There will always be some compilers that don't see optimization X.
Someone who does X by hand will get the speedup---without even seeing the
compiler. In fact, when he writes the code, the compiler may not even
exist, but his code will still run fast when it's ported later.

I can keep pointing out how hand optimization gives huge speedups. Are
you going to say each time that the compiler is buggy? Okay, fine. Then
there are a shitload of ``buggy'' compilers in the world, and it makes
more sense to realize this fact than to pretend that hand optimization
doesn't work.

---Dan
oz@yunexus.yorku.ca (Ozan Yigit) (05/15/91)
Dan Bernstein writes:
>On any given bottleneck, someone could either do the obvious big
>optimizations and stop, or do every optimization that he can see that
>will gain a few percent.

An unreliable, non-universal few percent is almost never worth the
unreadable, unmaintainable mess that is produced as an end result of
"every optimization", given a sufficiently well-written program to start
with.

>I can keep pointing out how hand optimization gives huge speedups.

wollman@emily.uvm.edu (Garrett Wollman) points out elsewhere:

| [For example, yabba will not work for (some subset of the set of)
| large binary files on our SGI 4D/340S. Dan also spent so much effort
| unrolling loops that the source is nearly incomprehensible, and breaks
| a parallelizing preprocessor which can be used by the MIPS/SGI
| compiler to great effect.]

... oz
---
In seeking the unattainable, simplicity | Internet: oz@nexus.yorku.ca
only gets in the way. -- Alan J. Perlis | Uucp: utai/utzoo!yunexus!oz
dik@cwi.nl (Dik T. Winter) (05/15/91)
In article <14077:May1321:14:2691@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> Hold on. The ``bug'' here is simply that the compiler missed an
> optimization. Do you mean to imply that any compiler which misses any
> optimization is buggy?

Yup -- if the compiler tells me it is not doing the optimization for a
reason that is patently untrue. (In this case, a non-existent dependency.)
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (05/15/91)
In article <22723@yunexus.YorkU.CA> oz@yunexus.yorku.ca (Ozan Yigit) writes:
  [ an implication that hand optimization only gains a few percent ]

Sorry, Ozan, but you're wrong. The yabba code after hand optimization was
slightly over two times faster than the code before. As another example,
I've taken the ``compress'' source and optimized it by hand; the result
is 30% faster for compression and 50% faster for decompression. That is
a lot more than a few percent.

> >I can keep pointing out how hand optimization gives huge speedups.
> wollman@emily.uvm.edu (Garrett Wollman) points out elsewhere:

Where is that? All reports of yabbawhap problems should either be sent to
me or be crossposted to comp.sources.bugs. How am I supposed to fix them
otherwise?

> | [For example, yabba will not work for (some subset of the set of)
> | large binary files on our SGI 4D/340S.

Dunno about this. I've received five reports of yabba failures, and in
three of the cases it was simply misconfigured in ways that checkconf
(which is part of the package) would have detected had it been run as per
instructions. I haven't received word yet on the fourth and fifth.

> Dan also spent so much effort
> | unrolling loops that the source is nearly incomprehensible,

I have no problem understanding it. I really didn't spend much effort on
the unrolling; it was just a matter of taking a subroutine like this:

#define DOWN0(n,p,h,sp,sch,no,FOUND,NOTFOUND) { \
  if ((no) = HANO(h,sch)) { if (PARENT(p,no) != (sp)) { \
    do (no) = NEXT(n,no); while ((no) && (PARENT(p,no) != (sp))); \
    if (no) { FOUND } else { NOTFOUND } \
  } else { FOUND } } else { NOTFOUND } \
}

and adding a parallel version like this:

#define DOWN1(n,p,h,sp,sch,no,FOUND,NOTFOUND) { \
  if ((no) = HANO(h,sch)) { if (PARENT(p,no) != (sp)) { \
    if ((no) = NEXT(n,no)) { if (PARENT(p,no) != (sp)) { \
      do (no) = NEXT(n,no); while ((no) && (PARENT(p,no) != (sp))); \
      if (no) { FOUND } else { NOTFOUND } \
    } else { FOUND } } else { NOTFOUND } \
  } else { FOUND } } else { NOTFOUND } \
}

plus a DOWN macro to select between DOWN0 and DOWN1 (as well as DOWN2 and
DOWN5). Since this unrolling is all isolated inside a separate routine
with a well-defined interface, it doesn't hurt maintainability.
(Admittedly, I'm the only one with a copy of that interface definition at
the moment; see the top of huptrie.h for further comments.)

The most important hand optimizations in yabba were data structure
rearrangements: -DPTRS (though -UPTRS is faster on the Convex, and other
machines where a[i] is as fast as a[0]), combining two arrays into a
single array of structs, etc. It'll be a long time before compilers can
do anything like this.

> and breaks
> | a parallelizing preprocessor which can be used by the MIPS/SGI
> | compiler to great effect.]

There is an important issue here: if an optimization will slow down the
code on some machine, then you should always provide a way to turn the
optimization off. In this case, -DOPTCANT1 is documented to turn off the
unrolling, and if Wollman's preprocessor is buggy then he should have
followed the instructions and compiled with that option. -DPTRS is a
similar example---I made that optimization with the Convex in mind, and
made sure that it could be easily toggled on or off. And so on.

On the bright side, note that yabbawhap has so far found at least two
optimizer bugs, one in a prerelease version. At least in the latter case,
the bug will never see the light of day.

---Dan