kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/22/91)
To recap the discussion so far: [Someone wondered how a specialized routine might compare with the general C library version.] [I agreed with another article _against_ micro-optimization as a general policy, but wrote a small `benchmark' for log() and found significant variation over the different machines I use and their standard cc's. The results ranged from parity through a good factor in either direction.]

In article <1991Mar20.173249.3819@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <1991Mar16.201655.6104@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>>So we see that on _some_ hardware (like 68k's) the library routines are
>>at an apparent _big_ disadvantage...
>No, actually, we see that on some hardware/software combinations the library
>routines are at a big disadvantage. In particular, on that Sun 3/60, did
>you compile with -f68881 and use the inlining facility for the math library?
>If not, you were timing the calling overhead, not the log function.

[I tried -f68881 on the Sun, with essentially the same results, except when -O4 `full' optimization was _also_ used; then the `roll your own' routine was much faster than the C library function.]

In article <1991Mar21.170830.24983@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>However, this doesn't generate in-line code for library functions, only for
>things like addition and multiplication. There is a separate inlining
>facility for library code, which is what's wanted if you're trying to
>evaluate the performance of log(). See the discussion of the .il suffix.

Now, to continue: I wasn't sure whether -f68881 was meant to provide the kind of inlining Henry was originally suggesting. (Although it seems to put the library routine at a further disadvantage than running without the `inline' -f68881 option.)
His original posting seems to imply he meant inlining of fp operations, since calling overhead will not have much effect on the original program, which includes the same number of calls to library routines as calls to routines of my own. In my original posting I was comparing my own `log' routine to the C library version. The original point of this was to show that the answer to the question ``is this a good thing to do'' is a definite ``maybe'', depending on your environment (perhaps I roused Henry's ire by using ``hardware'' at the wrong point :-).

Even though I used the -O4 `global optimize' switch to Sun's 3/60 cc, _no_ subroutines were inlined in my program; if I issue N calls to my log routine, N `jbsr' instructions are executed. If I issue N calls to the C library log routine, N `jbsr' instructions are executed.

So the question indirectly raised by Henry is ``does calling overhead make a big difference to the results obtained''. We can answer this in two ways. First, does calling make much of a difference to the overall running of a program in an _absolute_ sense (i.e. ``how much effect would the proposed subroutine inlining have'')? And secondly, does calling overhead affect the comparison performed in typical simple benchmarks like the one I posted?

To answer the first question: no, not much. To answer the second: no. I tend to define a _significant_ difference as anything >10% or 20%.

So, although admittedly I only have the time to get figures for my 3/60 and Vax 8530 in this case, let's take some _measurements_. Via the program appended at the end, I compare the averages of a small number of runs of a loop of several thousand calls to a polynomial evaluation -- fairly typical of sin(), cos(), etc. I've left out the usual argument reduction stages for clarity -- they should not make much difference, since they typically run much faster than an fp multiply (and the poly eval uses several fp muls).
Results: program `synth'

		Vax 8530	Sun 3/60
arith mean=	1.29085		1.40432
geo mean=	1.28371		1.4043

(For those who will ask -- geometric means are used to average ratios. They have one property some people don't like: they are usually smaller than the arithmetic mean of the same numbers. I therefore include both (and use whichever is _least_ convenient).)

The _subroutine_ turns out to be faster on both machines. :-) (This may not be true for other hardware -- Vax's and 68k's are fairly similar in some respects, and these figures indicate one of them.)

By examining the assembler code we can see why on the Vax -- apparently the optimizer can't eliminate the (i*M_SQRT2) common subexpression on the call! By hand-coding this into a temp register we obtain the program `synth2'.

Results: program `synth2'

		Vax 8530	Sun 3/60
arith mean=	0.821795	0.98481
geo mean=	0.815938	0.984798

From the results for this program, subroutine calls and argument passing are therefore seen to take about 20% of the cpu time on a Vax 8530. Interestingly, the Sun indicates there is now _no_ calling overhead! (And the subroutine call to `dummy' is still there -- I checked.)

Ok, this is a slight joke -- let's _also_ eliminate calls to the fp arithmetic routines and see what happens:

Sun 3/60 with -f68881 -O4
arith mean=0.677463
geo mean=0.676492

Presumably the now approx 30% difference is due only to the single remaining subroutine call. Therefore we can say inlining (by my definition) may produce a `significant' speedup in a program on some machines but _not_ on others. In any case, if a piece of software does other things besides call subroutines (e.g. it performs very long sequences of expensive fp operations, as typified by the 68k with fp subroutine calls) the speedup will probably _not_ be significant.

To get to the second question -- ``does calling overhead affect the original measurements, which involved a ratio of two running times''? The answer is -- no.
In the original `benchmark' I looked at the ratio of `mine' vs `theirs' in cpu time. If we say calls (which happened in equal numbers in both cases) take some fraction of the total instruction times -- 20%, say -- what do we find a ratio of two cpu times to be?

	  mine + .2 mine        1.2 mine       mine
	--------------------  = ----------  = --------
	theirs + .2 theirs      1.2 theirs     theirs

Even pipelining considerations should not affect this argument (but please correct me if I'm wrong) -- the same effect would be apparent in each measurement; the ratio would remain unchanged. We can see the calling overhead therefore doesn't matter! (Knowing this, I wondered why Henry would raise it.)

As a final point, even if we compare apples to oranges by having an optimizer expand your _own_ routine inline and _not_ the C library routine, the effect on the original results is also not marked on the Sun, since the `roll your own' version of log I tested is _inherently_ faster than the general library routine ON THAT PLATFORM (the fact that the library routine must take account of any argument it may be given is _one_ reason).

In fact, since inlining of C library routines is probably not that common yet across all architectures (it didn't happen on either the Sun or the Vax no matter what optimization level I used -- although on the Sun you _can_ make it happen), some might argue this provides _another_ reason to write your own tailored routines, which may be coded portably as macros -- not an idea I'd generally endorse 'tho, since the results overall (remember back that far? :-) indicated such a large (by my definition) variation across the different environments examined.
-kym

C code follows:

====================synth.c====================
#include <stdio.h>
#include <math.h>

#define NSAMP	10
#define NIT	10000

#define POLY(x) ((((x+M_PI)*x+M_PI)*x+M_PI)*x+M_PI)

double
dummy(x)
	register double x;
{
	return POLY(x);
}

main(){
	int samp;
	double prod=1;
	double sum=0;
	register double tot=0;

	for(samp=1;samp<=NSAMP;samp++){
		double t0,t1,t2;
		register i;

		t0=clock();
		for(i=0;i<NIT;i++) tot+=dummy(i*M_SQRT2);
		t1=clock();
		for(i=0;i<NIT;i++) {
			tot+=POLY(i*M_SQRT2);
		}
		t2=clock();

		prod *= (t2-t1)/(t1-t0);
		sum += (t2-t1)/(t1-t0);
	}

	printf("arith mean=%g\n", sum/NSAMP);
	printf("geo mean=%g\n", pow(prod, 1.0/NSAMP));
	printf("garbage=%g\n",tot);	/* need this in case the optimizer is _really_ clever! */
	exit(0);
}
====================synth2.c====================
#include <stdio.h>
#include <math.h>

#define NSAMP	10
#define NIT	10000

#define POLY(x) ((((x+M_PI)*x+M_PI)*x+M_PI)*x+M_PI)

double
dummy(x)
	register double x;
{
	return POLY(x);
}

main(){
	int samp;
	double prod=1;
	double sum=0;
	register double tot=0;

	for(samp=1;samp<=NSAMP;samp++){
		double t0,t1,t2;
		register i;

		t0=clock();
		for(i=0;i<NIT;i++) tot+=dummy(i*M_SQRT2);
		t1=clock();
		for(i=0;i<NIT;i++) {
			register double x=i*M_SQRT2;
			tot+=POLY(x);
		}
		t2=clock();

		prod *= (t2-t1)/(t1-t0);
		sum += (t2-t1)/(t1-t0);
	}

	printf("arith mean=%g\n", sum/NSAMP);
	printf("geo mean=%g\n", pow(prod, 1.0/NSAMP));
	printf("garbage=%g\n",tot);
	exit(0);
}
====================end end end====================