[comp.arch] Log Library - How is it done in the library code?

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/22/91)

To recap on the discussion so far:

[Someone wondered how a specialized routine might compare with the
general C library version.]

[I agreed with another article _against_ micro-optimization as a general
policy, but wrote a small `benchmark' for log() and found significant
variation across the different machines I use and their standard cc's. The
ratios ranged from roughly even to a sizeable factor in either direction.]

In article <1991Mar20.173249.3819@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <1991Mar16.201655.6104@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>>So we see that on _some_ hardware (like 68k's) the library routines are
>>at an apparent _big_ disadvantage...
>No, actually, we see that on some hardware/software combinations the library
>routines are at a big disadvantage.  In particular, on that Sun 3/60, did
>you compile with -f68881 and use the inlining facility for the math library?
>If not, you were timing the calling overhead, not the log function.

[I tried -f68881 on the Sun with essentially the same results except when -O4 
`full' optimization was _also_ used; then the `roll your own' routine 
was much faster than the C library function.]

In article <1991Mar21.170830.24983@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>However, this doesn't generate in-line code for library functions, only for
>things like addition and multiplication.  There is a separate inlining
>facility for library code, which is what's wanted if you're trying to
>evaluate the performance of log().  See the discussion of the .il suffix.

Now, to continue:

I wasn't sure whether -f68881 was meant to provide the kind of inlining Henry
was originally suggesting. (If anything, it seems to put the library routine
at a _greater_ disadvantage than compiling without -f68881 at all.)

His original posting seems to imply he meant inlining of fp operations, since
calling overhead should not have much effect on the original program: it
issues the same number of calls to the library routines as it does to
routines of my own.

In my original posting I was comparing my own `log' routine to the C library 
version. The original point of this was to show that the answer to the 
question ``is this a good thing to do'' is a definite ``maybe'' depending on 
your environment (perhaps I roused Henry's ire by using ``hardware'' at the 
wrong point :-).

Even though I used the -O4 `global optimize' switch to Sun's 3/60 cc, _no_
subroutines were inlined in my program: if I issue N calls to my log routine,
N `jbsr' instructions are executed, and if I issue N calls to the C library
log routine, N `jbsr' instructions are executed.

So the question indirectly raised by Henry is: ``does calling overhead make a
big difference to the results obtained?''

We can answer this in two ways. First, does calling overhead make much of a
difference to the overall running of a program in an _absolute_ sense (i.e.
``how much effect would the proposed subroutine inlining have'')? And second,
does calling overhead affect the comparison performed in typical simple
benchmarks like the one I posted?

To answer the first question: no, not much. To answer the second: no.

I tend to define a _significant_ difference as anything over 10% or 20%. So,
although admittedly I only have the time to get figures for my 3/60 and Vax
8530 in this case, let's take some _measurements_.

Via the program appended at the end, I compare the averages of a small number
of runs of a loop of several thousand calls to a polynomial evaluation --
fairly typical of the kernel of sin(), cos(), etc.  I've left out the usual
argument-reduction stages for clarity -- they should not make much
difference, since they typically run much faster than an fp multiply (and the
poly eval uses several fp muls).  (A sketch of the omitted stage appears
below, for anyone curious.)
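
(To be concrete about what was left out: here is a minimal sketch, of my own
devising rather than from any particular libm, of a sin()-style routine with
its reduction stage.  The names `reduce' and `sin_poly' and the truncated
Taylor coefficients are for illustration only -- nowhere near
production-accurate.)

	#include <math.h>

	/* fold x into [-pi,pi] by subtracting the nearest multiple
	   of 2*pi -- one divide, one floor(), one multiply */
	static double reduce(x) double x; {
		double k = floor(x/(2*M_PI) + 0.5);
		return x - k*2*M_PI;
		}

	/* truncated Taylor series for sin() in Horner form --
	   the reduction above is cheap next to these fp muls */
	static double sin_poly(x) double x; {
		double x2;
		x = reduce(x);
		x2 = x*x;
		return x*(1 + x2*(-1.0/6 + x2*(1.0/120 - x2/5040)));
		}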

Results:

program `synth' (figures are the ratio: inline-loop time / subroutine-loop time)

Vax 8530		Sun 3/60

arith mean=1.29085	1.40432
geo mean=1.28371	1.4043

(For those who will ask -- geometric means are used to average ratios.  They
have one property some people don't like: the geometric mean is never larger
than the arithmetic mean of the same numbers -- I therefore include both
(and use whichever is _least_ convenient)).
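
(A quick worked example of why ratios want the geometric mean: take two runs
whose ratios are 0.5 and 2.0 -- each side wins once by the same factor, so
overall they are even.  The arithmetic mean (0.5+2.0)/2 = 1.25 suggests a 25%
difference that isn't there; the geometric mean sqrt(0.5*2.0) = 1.0 gets it
right.)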

The _subroutine_ turns out to be faster on both machines. :-)
(This may not be true for other hardware -- Vaxen and 68k's are fairly
similar in some respects, and these figures point to one of them.)

By examining the assembler code we can see why, at least on the Vax --
apparently the optimizer can't eliminate the (i*M_SQRT2) common subexpression
across the call!  By hand-coding it into a temp register we obtain the
program `synth2'; the relevant change is excerpted below.
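
(From the full listings at the end -- the second loop is the only thing that
differs between the two programs:)

	/* synth.c: POLY's macro expansion repeats i*M_SQRT2 four
	   times, and the Vax optimizer fails to CSE it */
	for(i=0;i<NIT;i++) {
		tot+=POLY(i*M_SQRT2);
		}

	/* synth2.c: the common subexpression hoisted by hand */
	for(i=0;i<NIT;i++) {
		register double x=i*M_SQRT2;
		tot+=POLY(x);
		}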

Results:

program `synth2' (same ratio)

Vax 8530		Sun 3/60

arith mean=0.821795	0.98481
geo mean=0.815938	0.984798

From the results for this program, subroutine call and argument passing are
therefore seen to take about 20% of the cpu time on a Vax 8530 (the inline
loop now runs in about 0.82 of the time of the calling loop, so the calls
account for roughly 1 - 0.82 = 18% of it).  Interestingly, the Sun now
indicates _no_ calling overhead at all!
(And the subroutine call to `dummy' is still there -- I checked.)

Ok, this is a slight joke -- let's _also_ eliminate calls to the fp arithmetic 
routines and see what happens:

Sun 3/60 with -f68881 -O4

arith mean=0.677463
geo mean=0.676492

Presumably the approx 30% difference now seen (the inline loop runs in about
0.68 of the time of the loop that still calls dummy()) is due only to that
single remaining subroutine call.

Therefore we can say that inlining (by my definition) may produce a
`significant' speedup on some machines but _not_ on others.  In any case, if
a piece of software does much besides call subroutines (e.g. it performs very
long sequences of expensive fp operations, as typified by the 68k doing its
fp arithmetic via subroutine calls), the speedup will probably _not_ be
significant.  (A little arithmetic making the point is below.)
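
(The arithmetic: if calls take some fraction f of the total running time,
then eliminating them entirely -- the best inlining can do -- gives a
speedup factor of

	1 / (1 - f)

For the Vax's f = 0.2 that is 1.25; if expensive fp work dominates so that f
is small, the factor is barely above 1.)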

To get to the second question: ``does calling overhead affect the original
measurements, which involved a ratio of two running times?''  The answer
is -- no.

In the original `benchmark' I looked at the ratio of `mine' vs `theirs' in
cpu time.  If we say calls (which happened in equal numbers in both cases)
take some fraction of the total time -- 20%, say -- what do we find the
ratio of the two cpu times to be?

mine + .2 mine          1.2 mine          mine
------------------  =   ----------   =   ------
theirs + .2 theirs      1.2 theirs        theirs

The 1.2's cancel -- and they would cancel for any common overhead fraction,
not just 20%.

Even pipelining considerations should not affect this argument (but please
correct me if I'm wrong) -- the same effect would appear in both
measurements, so the ratio would remain unchanged.  We can see that the
calling overhead therefore doesn't matter!  (Knowing this, I wondered why
Henry raised it.)

As a final point: even if we compare apples to oranges by having an optimizer
expand your _own_ routine inline and _not_ the C library routine, the effect
on the original results is still not marked on the Sun, since the `roll your
own' version of log I tested is _inherently_ faster than the general library
routine ON THAT PLATFORM (one reason being that the library routine must cope
with any argument it may be given).

In fact, since inlining of C library routines is probably not that common yet
across architectures (it didn't happen on either the Sun or the Vax at any
optimization level I tried -- although on the Sun you _can_ make it happen
via the .il facility), some might argue this provides _another_ reason to
write your own tailored routines, coded portably as macros (a trivial sketch
of what I mean follows).  It's not an idea I'd generally endorse 'tho, since
the overall results (remember back that far? :-) showed such a large (by my
definition) variation across the different environments examined.
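
(The sketch -- my own illustration, with a made-up name: the polynomial
kernel coded as a macro, which any compiler `inlines' for free since there is
no call at all.

	/* 4th-degree polynomial in Horner form as a macro;
	   HORNER4 is an invented name for illustration */
	#define HORNER4(x,c0,c1,c2,c3,c4) \
		(((((c4)*(x)+(c3))*(x)+(c2))*(x)+(c1))*(x)+(c0))

POLY in the programs below is exactly HORNER4(x, M_PI, M_PI, M_PI, M_PI, 1.0)
-- with the same multiple-evaluation-of-x caveat that the synth2 change
worked around.)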

-kym

C code follows:

====================synth.c====================
#include <stdio.h>
#include <math.h>
#include <time.h>	/* for clock(); the units cancel in the ratios below */

#define NSAMP 10	/* timed samples to average */
#define NIT 10000	/* iterations per timed loop */

/* 4th-degree polynomial in Horner form -- the kernel of a typical
   sin()/cos() approximation, with M_PI as a stand-in coefficient */
#define POLY(x) ((((x+M_PI)*x+M_PI)*x+M_PI)*x+M_PI)

/* the same polynomial behind a subroutine call */
double dummy(x) register double x; {
	return POLY(x);
	}

main(){
	int samp;
	double prod=1;
	double sum=0;
	register double tot=0;

	for(samp=1;samp<=NSAMP;samp++){
		double t0,t1,t2;
		register i;

		t0=clock();
		for(i=0;i<NIT;i++) tot+=dummy(i*M_SQRT2);	/* via call */
		t1=clock();

		for(i=0;i<NIT;i++) {
			tot+=POLY(i*M_SQRT2);		/* inline via macro */
			}
		t2=clock();
		
		prod *= (t2-t1)/(t1-t0);	/* ratio: inline time / call time */
		sum += (t2-t1)/(t1-t0);
		}

	printf("arith mean=%g\n", sum/NSAMP);
	printf("geo mean=%g\n", pow(prod, 1.0/NSAMP));
	printf("garbage=%g\n",tot);/* need this in case the
					optimizer is _really_
					clever! */

	exit(0);
	}
====================synth2.c====================
#include <stdio.h>
#include <math.h>
#include <time.h>	/* for clock() */

#define NSAMP 10
#define NIT 10000

#define POLY(x) ((((x+M_PI)*x+M_PI)*x+M_PI)*x+M_PI)

double dummy(x) register double x; {
	return POLY(x);
	}

main(){
	int samp;
	double prod=1;
	double sum=0;
	register double tot=0;

	for(samp=1;samp<=NSAMP;samp++){
		double t0,t1,t2;
		register i;

		t0=clock();
		for(i=0;i<NIT;i++) tot+=dummy(i*M_SQRT2);
		t1=clock();

		for(i=0;i<NIT;i++) {
			register double x=i*M_SQRT2;	/* CSE hoisted by hand */
			tot+=POLY(x);
			}
		t2=clock();
		
		prod *= (t2-t1)/(t1-t0);
		sum += (t2-t1)/(t1-t0);
		}

	printf("arith mean=%g\n", sum/NSAMP);
	printf("geo mean=%g\n", pow(prod, 1.0/NSAMP));
	printf("garbage=%g\n",tot);

	exit(0);
	}
====================end end end====================