[comp.lang.c] Log Library - How is it done in the library code?

joshi@m.cs.uiuc.edu (Anil Joshi) (03/11/91)

I need to compute natural log for some numbers. I could use the c math.h library
routine but I do not want the accuracy of the library routine. A crude 
approximation would suffice. Does anyone know how this is done in the c library
routine? Is it possible to get the source code?
Thanks
Anil
-- 
"Come the (computer) revolution, all persons found guilty of such criminal
behaviour will be summarily executed, and their programs won't be!"
- Press, Flannery, Teukolsky and Vetterling

dave@cs.arizona.edu (Dave P. Schaumann) (03/11/91)

In article <1991Mar11.022141.12068@m.cs.uiuc.edu> joshi@m.cs.uiuc.edu (Anil Joshi) writes:
>I need to compute natural log for some numbers. I could use the c math.h
>library routine but I do not want the accuracy of the library routine. A crude 
>approximation would suffice. Does anyone know how this is done in the c
>library routine? Is it possible to get the source code?

Transcendental functions are generally modeled with a Taylor-series expansion.
(Check your favorite calculus books for what a Taylor series is, and your
favorite numerical analysis book on how to get the best (i.e. most accurate)
results).
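
For instance, a few terms of the series for log go a long way when the
argument is near 1 (a sketch only -- a real routine would first reduce
its argument into that range):

	/* Crude ln(x) for x near 1: first four terms of the Taylor
	   series ln(1+h) = h - h^2/2 + h^3/3 - h^4/4 + ...  The
	   truncation error grows quickly as |h| approaches 1. */
	double taylor_log(double x)
	{
		double h = x - 1.0;
		return h - h*h/2.0 + h*h*h/3.0 - h*h*h*h/4.0;
	}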

However, you might also consider using a table, and interpolating the
intermediate values.  Depending on the expected range of your arguments, and
the required accuracy of your result, this could be an effective (and quick)
method of gaining the results.

Of course, you would still have to initialize the table at the start of the
program... (yikes!  shades of another thread creeping in -- film at 11).
But if you're calling the function enough times that efficiency is a problem,
this should be all right.
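
Something like this, say (a quick sketch assuming the arguments have
already been reduced to [1,2); the names and table size are pulled out
of the air):

	#include <math.h>

	#define TBL 64			/* table resolution */

	static double ltab[TBL+1];

	/* fill in ltab[i] = log(1 + i/TBL) once at startup */
	void log_init(void)
	{
		int i;
		for (i = 0; i <= TBL; i++)
			ltab[i] = log(1.0 + (double)i/TBL);
	}

	/* crude log for x in [1,2): linear interpolation between
	   the two nearest table entries */
	double log_interp(double x)
	{
		double t = (x - 1.0) * TBL;
		int i = (int)t;
		double f = t - i;	/* fraction between entries */
		return ltab[i] + f*(ltab[i+1] - ltab[i]);
	}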

-- 
Dave Schaumann | dave@cs.arizona.edu | Short .sig's rule!

gwyn@smoke.brl.mil (Doug Gwyn) (03/12/91)

In article <1991Mar11.022141.12068@m.cs.uiuc.edu> joshi@m.cs.uiuc.edu (Anil Joshi) writes:
>I need to compute natural log for some numbers. I could use the c math.h library
>routine but I do not want the accuracy of the library routine. A crude 
>approximation would suffice. Does anyone know how this is done in the c library
>routine? Is it possible to get the source code?

I don't understand why you don't simply use the C library log() function.
It should be reliable and efficient, so what would you gain by trying to
roll your own cruder version?

joshi@m.cs.uiuc.edu (Anil Joshi) (03/12/91)

gwyn@smoke.brl.mil (Doug Gwyn) writes:

>In article <1991Mar11.022141.12068@m.cs.uiuc.edu> joshi@m.cs.uiuc.edu (Anil Joshi) writes:
>>I need to compute natural log for some numbers. I could use the c math.h library
>>routine but I do not want the accuracy of the library routine. A crude 
>>approximation would suffice. Does anyone know how this is done in the c library
>>routine? Is it possible to get the source code?

>I don't understand why you don't simply use the C library log() function.
>It should be reliable and efficient, so what would you gain by trying to
>roll your own cruder version?

I do not want the accuracy that might have been provided in the c library 
log(). It might be spending more time than necessary to calculate more
accurately than I want it to be. 

The suggestions I got were:

1. Use Taylor Series.
2. Use Chebyshev polynomials
3. Use a table and interpolate.
4. Harmonic Series (this is my/my advisor's idea). I am not sure whether this
   would give results.
5. Somebody mentioned Dr. Dobb's Journal (an old, old issue) which gave code
   for 2 above.

I am tending towards Idea 3, which seems to be fastest (my intuition).

If I get any more suggestions, I'll post.

Thanks everybody.
Anil

-- 
"Come the (computer) revolution, all persons found guilty of such criminal
behaviour will be summarily executed, and their programs won't be!"
- Press, Flannery, Teukolsky and Vetterling

dbrooks@osf.org (David Brooks) (03/12/91)

joshi@m.cs.uiuc.edu (Anil Joshi) writes:
|> I do not want the accuracy that might have been provided in the c library 
|> log(). It might be spending more time than necessary to calculate more
|> accurately than I want it to be. 

I agree with Doug.  It doesn't sound as though you have characterized the
speed of the library routine.

However, I'd point out that (a) speed can be helped by knowing how your
architecture stores FP numbers, (b) most architectures store an exponent
accessibly, and (c) some simple arithmetic on the exponent will give you
an approximation to the log.  How inaccurate do you want it?
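
For example, the standard frexp() call hands you (b) portably, and a
one-line patch-up for the mantissa gives a crude natural log (the
straight-line correction below is my own arbitrary choice):

	#include <math.h>

	/* Crude ln(x) for x > 0: frexp() splits x into m * 2^e with
	   0.5 <= m < 1; approximate ln(m) by the straight line through
	   the endpoints, ln(m) ~= 2*ln(2)*(m-1).  Worst error is
	   about 0.06. */
	double rough_log(double x)
	{
		int e;
		double m = frexp(x, &e);
		return (e + 2.0*(m - 1.0)) * M_LN2;
	}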

henry@zoo.toronto.edu (Henry Spencer) (03/13/91)

In article <1991Mar12.014416.4289@m.cs.uiuc.edu> joshi@m.cs.uiuc.edu (Anil Joshi) writes:
>I do not want the accuracy that might have been provided in the c library 
>log(). It might be spending more time than necessary to calculate more
>accurately than I want it to be. 

Let's be precise.  You don't *need* the accuracy of the library routine,
and you're guessing that it might be slow.  The first thing for you to do
is to find out whether that guess is correct.  That is *guaranteed* to be
simpler and easier than inventing an approximate-log function, and it
could very well solve your problem -- implementors often put considerable effort
into making library math functions run fast.

If, and only if, your library function turns out to be unacceptably slow,
note that the exponent of your floating-point number is already a good
estimate of the (binary) logarithm of the number; extracting the exponent
and multiplying to convert it to a natural log might be good enough.  It
is difficult to answer your question when you haven't said *how much*
accuracy or *how much* speed you need.
-- 
"But this *is* the simplified version   | Henry Spencer @ U of Toronto Zoology
for the general public."     -S. Harris |  henry@zoo.toronto.edu  utzoo!henry

shaunc@gold.gvg.tek.com (Shaun Case) (03/13/91)

Just as a point of minor interest, you can get the integer part of the log
of a positive integer from its string representation by doing a strlen().
(log base 10, of course.)

Example:

strlen("1") -1 ==   0   log == 0
strlen("10") -1 ==  1   log == 1
.
.
.
strlen("1000000") -1 == 7   log == 7

Not being a math weenie, I will venture an unsure guess that the
maximum error using this method is ~1.  If you are doing logs
of small numbers, this is obviously unacceptable.

I don't know if doing a trunc/round on a float, followed by a 
conversion to a string and then a strlen is any faster than 
doing a log, but I thought this might be interesting to someone
anyway.  Kind of neat to see a correlation between a number and
its representation in C.
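
If all you really want is that integer part, a divide loop avoids the
round trip through a string entirely (a trivial sketch):

	/* floor(log10(n)) for a positive integer n, by repeated
	   division -- no string conversion needed */
	int ilog10(unsigned long n)
	{
		int d = 0;
		while (n >= 10) {
			n /= 10;
			d++;
		}
		return d;
	}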

Now, if we could only use strupr() to calculate sin and cos...
:-)

// Shaun //
-- 
Shaun Case:  shaunc@gold.gvg.tek.com  or  atman%ecst.csuchico.edu@RELAY.CS.NET 
 or Shaun Case of 1:119/666.0 (Fidonet)  or  1@9651 (WWIVnet)
---
A bullet in the Bush is worth two in the hand.

joshi@m.cs.uiuc.edu (Anil Joshi) (03/13/91)

I got some more responses - some advice and some code.

After going through them I figured out that my initial posting as well as the
followup were not clear enough as to why I want to roll my own log.

1. It is a new algorithm to solve LP problems that needs the log to be
evaluated but allows for some leeway. The convergence rate of the algorithm
does not depend on the accuracy of the log function.
2. There is a need to calculate the log function several times over in the 
inner loop of the algorithm.
3. Even if I roll a new log function for this one program, I have no
intention of speeding up the library function itself, which would not be
easy to do.

All I needed was the info as to what is being done in the library
routine so that I can emulate the library routine. 

In fact, I do know that one cannot write a more efficient routine
than the system supplied one in a short time. That's why I asked
whether the code is available somewhere, so that I can make my
program efficient and fast.

Ironically, the need to evaluate log has disappeared between the time I
posted my request and now.

So, thanks to all for your advice/code.

Anil
joshi@cs.uiuc.edu
-- 
"Come the (computer) revolution, all persons found guilty of such criminal
behaviour will be summarily executed, and their programs won't be!"
- Press, Flannery, Teukolsky and Vetterling

bengsig@dk.oracle.com (Bjorn Engsig) (03/13/91)

Sorry, this isn't C any longer, but it needs clarification.

Article <1119@caslon.cs.arizona.edu> by dave@cs.arizona.edu (Dave P. Schaumann) says:
|
|Transcendental functions are generally modeled with a Taylor-series expansion.
|(Check your favorite calculus books for what a Taylor series is,
Taylor series are only of mathematical value.  As a computational tool, they
are absolutely worthless.
|and your
|favorite numerical analysis book on how to get the best (i.e. most accurate)
|results).
Which will tell you why you shouldn't use Taylor series, and will probably
tell you that Chebyshev polynomials are among the best ones for a large
number of functions.
-- 
Bjorn Engsig, ORACLE Corporation, E-mail: bengsig@oracle.com, bengsig@oracle.nl

henry@zoo.toronto.edu (Henry Spencer) (03/14/91)

In article <1991Mar12.211047.826@m.cs.uiuc.edu> joshi@m.cs.uiuc.edu (Anil Joshi) writes:
>All I needed was the info as to what is being done in the library
>routine so that I can emulate the library routine.   ... why I asked
>whether the code is available somewhere, so that I can make my
>program efficient and fast.

Unfortunately, you didn't really supply enough information for this.
There is no single source for the log function; the answer is vendor-
specific.  As a particular case in point, on a 68xxx machine with a
68881, the best log routine consists of (ignoring calling-sequence
overhead):

	FLOGN r1, r2

or the equivalent.  That's right, one instruction.  On most more modern
machines the sequence will be more complex. :-)  When you post such
queries, you need to supply details on things like machine, compiler,
and requirements.
-- 
"But this *is* the simplified version   | Henry Spencer @ U of Toronto Zoology
for the general public."     -S. Harris |  henry@zoo.toronto.edu  utzoo!henry

cjkuo@locus.com (Chengi Jimmy Kuo) (03/14/91)

shaunc@gold.gvg.tek.com (Shaun Case) writes:

>Just as a point of minor interest, you can get the integer part of the log
>of a positive integer from its string representation by doing a strlen().
>(log base 10, of course.)

>Example:

>strlen("1") -1 ==   0   log == 0
>strlen("10") -1 ==  1   log == 1
>.
>.
>.
>strlen("1000000") -1 == 7   log == 7

>Not being a math weenie, I will venture an unsure guess that the
>maximum error using this method is ~1.  If you are doing logs
>of small numbers, this is obviously unacceptable.

For this case (and you can extend this algorithm), you could use a series
of if-thens.  That way, you can determine your own accuracy and rounding.
This is a slightly different way of doing table lookup and interpolation.
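
For instance (a sketch -- choose the thresholds and returned values to
suit the accuracy you want):

	/* integer part of log10(n), as a ladder of comparisons;
	   add branches to widen the range, or return in-between
	   values for a little more accuracy */
	int log10_ladder(unsigned n)
	{
		if (n < 10)	return 0;
		if (n < 100)	return 1;
		if (n < 1000)	return 2;
		if (n < 10000)	return 3;
		return 4;	/* and so on */
	}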

strlen isn't that fast.

Jimmy Kuo
-- 
cjkuo@locus.com
"The correct answer to an either/or question is both!"

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (03/14/91)

In article <1991Mar12.014416.4289@m.cs.uiuc.edu>,
joshi@m.cs.uiuc.edu (Anil Joshi) writes:
> I do not want the accuracy that might have been provided in the C library
> log().  It might be spending more time than necessary to calculate more
> accurately than I want it to be. 

Have you *measured* it?  If you are using a system with a floating-point
coprocessor chip (80*87, 6888[12], 32*81, Weitek, ...) there is a strong
likelihood that the C function log() will map onto a single coprocessor
instruction, and that in turn is likely to be considerably faster than
anything you can cobble together.  Even if your machine has no log()
instruction or other direct support for log(), remember that the people
who work on floating point libraries are usually GOOD at it.

Just now as a check I took a look at the code for one implementation of
log().  Unless your program is spending almost all its time computing
logarithms, there isn't going to be a lot of point in trying to do better.
Basically, the kind of thing that goes on in a function like this is:
    1.  Check that the argument is in range (handle infinity, NaN,
        subnormals, 0, negative numbers, &c).
    2.  Argument reduction:  re-express the number as 2**k * (1+f).
    3.  Calculate t = log2(1+f) using some clever method.
    4.  Return (k + t) * log(2).

If you know that your numbers are in range, you can eliminate step 1, but
the people who wrote your library already knew that it was a good idea
to make the "everything in order" path as fast as possible.
If you know a _lot_ about your numbers, you may be able to eliminate step 2,
but there are C library functions for that step so it probably isn't
_worth_ eliminating.
Step 3 is the tricky bit, but unless you are a numerical analyst or have the
help of one (and from the postings this does not seem to be the case)
it may not be obvious just how far to take your series or table to get
the accuracy you do want.  Perhaps surprisingly, you aren't likely to
save very much.
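
To make the shape concrete, here is a toy version of steps 2-4, using
frexp() for the reduction and a too-short series for step 3 (nowhere
near library quality, which is rather the point):

	#include <math.h>

	/* toy log(): the caller must guarantee x > 0 and finite,
	   i.e. step 1 is omitted */
	double toy_log(double x)
	{
		int k;
		double f, t;

		/* step 2: x = 2**k * (1+f).  frexp gives a mantissa
		   in [0.5,1); fold it into [1,2) so that 0 <= f < 1 */
		f = 2.0 * frexp(x, &k) - 1.0;
		k--;

		/* step 3: log2(1+f) from three series terms -- far
		   too few for double precision; a real routine uses
		   a carefully fitted minimax polynomial here */
		t = (f - f*f/2.0 + f*f*f/3.0) / M_LN2;

		/* step 4 */
		return (k + t) * M_LN2;
	}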

As a particular point:  interpolating in a table is very likely to be
SLOWER than your C system's log() function.

I really can't stress enough that for the elementary transcendental
functions, you should only consider writing your own once you have
MEASURED them and found them to be too slow in context or too inaccurate.
And you should by no means use your own log() until you have measured its
accuracy:  it might be much worse than you think!  For this purpose, the
ELEFUNT (ELEmentary FUNction Testing) library is extremely valuable; you
can get it from netlib.

-- 
The purpose of advertising is to destroy the freedom of the market.

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (03/14/91)

In article <1991Mar12.211047.826@m.cs.uiuc.edu>,
joshi@m.cs.uiuc.edu (Anil Joshi) writes:
> After going through them I figured out that my initial posting as well as
> the followup were not clear enough as to why I want to roll my own log.
 
> 1. It is a new algorithm to solve LP problems that needs the log to be
> evaluated but allows for some leeway.  The convergence rate of the 
> algorithm does not depend on the accuracy of the log function.

In that case, why not just use 0?  If neither the correctness of the results
nor the rate of convergence depends on the accuracy of the logarithm, why
compute it at all?  It sounds as though frexp() would have done everything
you need (or logb() if you have it).

-- 
The purpose of advertising is to destroy the freedom of the market.

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/15/91)

In article <1294@dkunix9.dk.oracle.com> bengsig@dk.oracle.com (Bjorn Engsig) writes:
> Article <1119@caslon.cs.arizona.edu> by dave@cs.arizona.edu (Dave P. Schaumann) says:
> |Transcendental functions are generally modeled with a Taylor-series expansion.
> |(Check your favorite calculus books for what a Taylor series is,
> Taylor series are only of mathematical value.  As a computational tool, they
> are absolutely worthless.

This is not true. Contrary to popular myth, Taylor series can be used
quite effectively for numerical solutions of differential equations.
They'll even tell you when the solution is becoming unstable. See
Moore's interval analysis book for more details.

In any case, it sounds like the original poster only needs a few digits
of accuracy; I'd use a large table.

---Dan

john@newave.UUCP (John A. Weeks III) (03/16/91)

In article <1991Mar12.014416.4289@m.cs.uiuc.edu> joshi@m.cs.uiuc.edu (Anil Joshi) writes:
> I do not want the accuracy that might have been provided in the c library 
> log(). It might be spending more time than necessary to calculate more
> accurately than I want it to be. 

> I am tending towards Idea 3, which seems to be fastest (my intuition).

As an experiment, why not try running your program using log, then
try one of the alternate solutions.  Compare the time it takes to run
each example.  Then post your results to the net.

What I expect you to find is that the difference is rather small--an
optimized, hand-coded assembly routine or a math coprocessor call as
compared to a C routine.  I have often found that the program algorithm
makes a far greater impact on performance than optimizing small pieces
of code.  But since you sound like you will be using log extensively, it
might make a great deal of difference.  If you do find a difference, I bet
that the difference will vary greatly between different machines.

-john-


-- 
-----------------------------------------------------------------------------
John A. Weeks III               (612) 942-6969             john@newave.mn.org
NeWave Communications                       ...uunet!tcnet!wd0gol!newave!john

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/17/91)

In article <702@newave.UUCP> john@newave.mn.org (John A. Weeks III) writes:
>In article <1991Mar12.014416.4289@m.cs.uiuc.edu> joshi@m.cs.uiuc.edu (Anil Joshi) writes:
[complains about library logs]
>As an experiment, why not try running your program using log, then
>try one of the alternate solutions.  Compare the time it takes to run
>each example.  Then post your results to the net.

While I would generally agree with John -- that micro-optimization of
one or two instructions or function calls seldom (if ever) improves the
overall performance of typical programs -- I'm going to play devil's
advocate here.

The following is a simple-minded routine that uses a table lookup to
compute an approximate natural log for 16-bit integers. It could be
easily recoded to get logs of doubles or floats, and additionally you
might like to remove the floating-point operations and replace them with
fixed-point. (In fact, in a similar problem posted in comp.arch
regarding division, speeds for a similar routine approached almost twice
the speed of the FLOATING POINT HARDWARE on some machines.) Another
point to consider -- the initial `find first 1' logic might be coded as
a single instruction on some machines.

The results vary over different machines -- some with fp hardware;
some with `no' hardware :-). 

The results, as the ratio of my routine's time to the library's time
(smaller means mine is faster), are as follows:

Sun 3/60:	0.368515
Vax 8530:	0.421875
Sun SS1:	.6
MIPS Magnum:	1
Vax 6440:	1.64706

So we see that on _some_ hardware (like 68k's) the library routines are
at an apparent _big_ disadvantage; on some other platforms (the 6440
running VMS) the situation is reversed. I particularly like the MIPS,
where everything's equal. :-)

Although I realize I'm comparing apples to oranges here (my log routine
only handles 16-bit integers, not the whole spectrum of double-precision
arguments as does the C library routine) -- that _was_ the point of the
original posting: can we ever take advantage of the fact that what we
wish to do is somewhat less general than what the library designer
intended? As John said in his post referenced above, the answer is a
categorical ``it depends''. You be the judge from the figures above.

-kym

C program follows:

==================== log.c ====================
#define NDEBUG

#include <assert.h>
#include <stdio.h>
#include <math.h>
/*

Taylor series:

f(x+h) = f(x) + h f'(x) + 1/2 h^2 f''(x) + ...

For f(x) = log(x) we have:

f(x) = log(x)
f'(x) = 1/x
f''(x) = -1/x^2
f'''(x) = 2/x^3

Evaluate each of f's at x=1.

f(1) = 0
f'(1) = 1
f''(1) = -1
f'''(1) = 2

Therefore:

log(1+h)
	= 0 + h 1 + 1/2 h^2 (-1) + 1/6 h^3 (2) + 1/24 h^4 (-6) ...
	= h - h^2/2 + h^3/3 - h^4/4 + ...

If we wish to find:

log(L a + b) = log(La (1 + b/(La))) = log(L) + log(a) + log(1+b/(La))

where L = 2^k we have:

log(2^k) = k log(2)

and therefore:

log(x) = k*log(2) + log(a) + log(1+b/(La))

But as we know
	log(1+b/(La)) =about b/(La) - 1/2 (b/(La))^2

therefore
log(x) =about k*log(2) + log(a) + 1/L b/a - 1/2 1/L^2 (b/a)^2

*/

#define NTAB 256	/* size of table */
#define NSAMP 10000L	/* number of samples in test */

double atab[NTAB];

init(){
	int i;

	/* atab[i] holds log(i*NTAB); entries below NTAB/2 are
	   never used (see mylog) */
	for(i=1;i<NTAB;i++) atab[i]=log((double)i*NTAB);
	}

double mylog(x) register int x; {
	register int a,b,n;

	/* normalize: shift x up until it lies in [NTAB*NTAB/2, NTAB*NTAB);
	   each shift doubles x, so n*log(2) is subtracted back at the end */
	for(n=0; x<NTAB*NTAB/2; x<<=1,n++);
	a=x/NTAB;	/* high part: index into the log table */
	b=x%NTAB;	/* low part: first-order correction */
	assert(a>=0);
	assert(a<NTAB);
	return atab[a] + (double)b/(a*NTAB) - n*M_LN2;
	}

main(){
	double sum1,sum2;
	double t0,t1,t2;
	int x;

	init();
	t0=clock();

	sum1=0;
	for(x=1;x<NSAMP;x++) sum1+=log((double)x);
	t1=clock();

	sum2=0;
	for(x=1;x<NSAMP;x++) sum2+=mylog(x);
	t2=clock();

	printf("%g	%g\n",sum1,sum2);
	printf("%g\n",(t2-t1)/(t1-t0));
	exit(0);
	}
==================== END END END ====================

henry@zoo.toronto.edu (Henry Spencer) (03/21/91)

In article <1991Mar16.201655.6104@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>So we see that on _some_ hardware (like 68k's) the library routines are
>at an apparent _big_ disadvantage...

No, actually, we see that on some hardware/software combinations the library
routines are at a big disadvantage.  In particular, on that Sun 3/60, did
you compile with -f68881 and use the inlining facility for the math library?
If not, you were timing the calling overhead, not the log function.
-- 
"[Some people] positively *wish* to     | Henry Spencer @ U of Toronto Zoology
believe ill of the modern world."-R.Peto|  henry@zoo.toronto.edu  utzoo!henry

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/21/91)

In article <1991Mar20.173249.3819@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <1991Mar16.201655.6104@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>>So we see that on _some_ hardware (like 68k's) the library routines are
>>at an apparent _big_ disadvantage...
>
>No, actually, we see that on some hardware/software combinations the library
>routines are at a big disadvantage.  In particular, on that Sun 3/60, did
>you compile with -f68881 and use the inlining facility for the math library?
>If not, you were timing the calling overhead, not the log function.

No, usually I just say `-O4' and let it go at that. However, if you
_wanna_ see what happens with `inlining', on a Sun 3 you get this
(I was surprised):

-O4		-f68881		-O4/-f68881	-fsoft		-O4/-fsoft
0.356631	0.414894	0.280899	0.360215	0.355993

Apparently the `inline' option -f68881 does cut down somewhat
on (presumably) the overhead of calls to (essentially only) the log
function.

The global analysis (+ a few fancy other things that don't really
apply to this program) done by -O4 is _almost_ as good as inlining (i.e.
calling overhead of about (0.415-0.357)/0.415 = 14% seems to have been
eliminated).

However, look at the combination of -O4 and -f68881! A bit hard to
understand how things can go _backwards_ for the library routine by
simply doing both things. Instead of my little subroutine running only
about 3 times faster, combining both switches makes it run almost 4
times faster than the library routine! Amazing! Perhaps Henry might
explain this one (my brain is hurting at the moment)?

As a kind of joke -- and a slight counterexample to Henry's statement
above -- I tried the -fsoft option which, I presume (from the man page
anyway), restricts everything to using software fp routines.
Almost the same comparison as the original -O4 result. Perhaps there
_isn't_ that much variation on given hardware, despite the various fancy
compiler options. (Although I presume a DIFFERENT compiler and library on
the same platform might behave quite differently.)

Cheers,

-kym

henry@zoo.toronto.edu (Henry Spencer) (03/21/91)

In article <1991Mar20.204034.28931@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>Apparently the `inline' option -f68881 ...

Uh, you're a bit confused here.  There are *two* separate facilities that
are relevant, -f68881 and the inlining facility.  They're not the same thing.
-- 
"[Some people] positively *wish* to     | Henry Spencer @ U of Toronto Zoology
believe ill of the modern world."-R.Peto|  henry@zoo.toronto.edu  utzoo!henry

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/21/91)

In article <1991Mar20.223644.16769@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <1991Mar20.204034.28931@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>>Apparently the `inline' option -f68881 ...
>Uh, you're a bit confused here.  There are *two* separate facilities that
>are relevant, -f68881 and the inlining facility.  They're not the same thing.

Yes I may be confused. The following comes from the man page (I sure
hope those nice guys from Sun don't get me for any (c) violations):

Floating-point code generation option.  Can be one
of:

-f68881   Generate  in-line  code   for   Motorola
		 MC68881  floating-point  processor (sup-
		 ported only on Sun-3 systems).

what does this mean? :-)

-kym

henry@zoo.toronto.edu (Henry Spencer) (03/22/91)

In article <1991Mar21.000655.28999@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>>... There are *two* separate facilities that
>>are relevant, -f68881 and the inlining facility.  They're not the same thing.
>
>Yes I may be confused. The following comes from the man page ...
>
>-f68881   Generate  in-line  code   for   Motorola
>		 MC68881  floating-point  processor ...

However, this doesn't generate in-line code for library functions, only for
things like addition and multiplication.  There is a separate inlining
facility for library code, which is what's wanted if you're trying to
evaluate the performance of log().  See the discussion of the .il suffix.
-- 
"[Some people] positively *wish* to     | Henry Spencer @ U of Toronto Zoology
believe ill of the modern world."-R.Peto|  henry@zoo.toronto.edu  utzoo!henry

steve@taumet.com (Stephen Clamage) (03/22/91)

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:

>Yes I may be confused. The following comes from the man page...

>Floating-point code generation option.  Can be one of:

>-f68881   Generate  in-line  code   for   Motorola
>		 MC68881  floating-point  processor (sup-
>		 ported only on Sun-3 systems).

>what does this mean? :-)

The Sun-3 has three options for floating point calculations.  It can
use the MC68881 co-processor directly, use subroutines to do
floating-point calculations, or use subroutines which choose one of
the other two options based on the presence or absence of the MC68881.

If you know your code will only be run on machines with the MC68881,
you can tell the compiler this with the -f68881 switch.  When you then
write a floating-point expression, the compiler will generate code to
use the coprocessor directly (in line) instead of calling a subroutine
to determine how to perform the calculations.
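
So the choice is made when you compile.  If I have the SunOS details
right (check cc(1) -- the exact paths may well differ on your release),
the two facilities Henry distinguishes are invoked something like this:

	cc -O4 -f68881 prog.c -lm
		# in-line 68881 code for +, *, etc.
	cc -O4 -f68881 /usr/lib/f68881/libm.il prog.c -lm
		# additionally expands log() and friends from the
		# .il inline-template file
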
-- 

Steve Clamage, TauMetric Corp, steve@taumet.com

kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/22/91)

To recap on the discussion so far:

[Someone wondered how a specialized routine might compare with the
general C library version.]

[I agreed with another article _against_ micro-optimization as a general
policy, but wrote a small `benchmark' for log() and found a significant
variation over the different machines I use & their standard cc's. These
ranged from dead even to a good-sized factor in either direction.]

In article <1991Mar20.173249.3819@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <1991Mar16.201655.6104@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>>So we see that on _some_ hardware (like 68k's) the library routines are
>>at an apparent _big_ disadvantage...
>No, actually, we see that on some hardware/software combinations the library
>routines are at a big disadvantage.  In particular, on that Sun 3/60, did
>you compile with -f68881 and use the inlining facility for the math library?
>If not, you were timing the calling overhead, not the log function.

[I tried -f68881 on the Sun with essentially the same results except when -O4 
`full' optimization was _also_ used; then the `roll your own' routine 
was much faster than the C library function.]

In article <1991Mar21.170830.24983@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>However, this doesn't generate in-line code for library functions, only for
>things like addition and multiplication.  There is a separate inlining
>facility for library code, which is what's wanted if you're trying to
>evaluate the performance of log().  See the discussion of the .il suffix.

Now, to continue:

I wasn't sure whether -f68881 was meant to provide the kind of inlining Henry
was originally suggesting. (Although it seems to put the library routine at
a further disadvantage compared to running without the `inline' -f68881
option.)

His original posting seems to imply he meant inlining of fp operations, since
calling overhead will not have much of an effect on the original program,
which makes the same number of calls to library routines as to routines of
my own.

In my original posting I was comparing my own `log' routine to the C library 
version. The original point of this was to show that the answer to the 
question ``is this a good thing to do'' is a definite ``maybe'' depending on 
your environment (perhaps I roused Henry's ire by using ``hardware'' at the 
wrong point :-).

Even though I used the -O4 `global optimize' switch to Sun's 3/60 cc, _no_
subroutines were inlined in my program; if I issue N calls to my log routine,
N `jbsr' instructions are executed, and likewise for N calls to the C library
log routine.

So the question indirectly raised by Henry is ``does calling overhead make a
big difference to the results obtained?''

We can answer this in two ways -- does calling make much of a difference
in the overall running of a program in an _absolute_ sense (i.e. ``how much
effect does the proposed subroutine inlining have'')? And secondly, does
calling overhead affect the comparison performed in typical simple benchmarks
like the one I posted?

To answer the first question: no, not much. To answer the second: no.

I tend to define a _significant_ difference as anything >10% or 20%. So, 
although admittedly I only have the time to get figures for my 3/60
and Vax 8530 in this case, let's take some _measurements_.

Via the program appended at the end, I compare the averages of a small number
of runs of a loop of several thousand calls to a polynomial evaluation --
fairly typical of sin(), cos(), etc.  I've left out the usual argument
reduction stages for clarity -- they should not make much difference since
they typically run much faster than an fp multiply (and the poly eval uses
several fp muls).

Results:

program `synth'

Vax 8530		Sun 3/60

arith mean=1.29085	1.40432
geo mean=1.28371	1.4043

(For those who will ask -- geometric means are used to average ratios:
the geometric mean of r1..rN is (r1*r2*...*rN)^(1/N).  They have one
property some people don't like; they are never larger than the arithmetic
mean of the same numbers -- I therefore include both
(and use whichever is _least_ convenient).)

The _subroutine_ turns out to be faster on both machines. :-)
(This may not be true for other hardware -- Vaxes and 68k's are fairly
similar in some respects, and these figures indicate one of them.)

By examining the assembler code we can see why on the Vax -- apparently the
optimizer can't eliminate the (i*M_SQRT2) common subexpression in the expanded
macro! By hand-coding this into a temp register we obtain the program `synth2'.

Results:

program `synth2'

Vax 8530		Sun 3/60

arith mean=0.821795	0.98481
geo mean=0.815938	0.984798

From the results for this program, subroutine calls and argument passing are
therefore seen to take about 20% of the cpu time on a Vax 8530.
Interestingly the Sun indicates there is now _no_ calling overhead!
(And the subroutine call to `dummy' is still there -- I checked.)

Ok, this is a slight joke -- let's _also_ eliminate calls to the fp arithmetic 
routines and see what happens:

Sun 3/60 with -f68881 -O4

arith mean=0.677463
geo mean=0.676492

Presumably the now approx 30% difference is due only to the single remaining 
subroutine call.

Therefore we can say inlining (by my definition) may produce a `significant'
speedup in a program on some machines but _not_ on others.  In any case, if a
piece of software does other things besides call subroutines (e.g. it performs
very long sequences of expensive fp operations, as typified by the 68k with
fp subroutine calls) the speedup will probably _not_ be significant.

To get to the second question -- ``does calling overhead affect the original 
measurements which involved a ratio of two running times''? The answer 
is -- no.

In the original `benchmark' I looked at the ratio of `mine' vs `theirs'
in cpu time.  If we say calls (which happened in equal numbers in both cases)
take some fraction of the total instruction time -- 20%, say -- what do we
find the ratio of the two cpu times to be?

mine + .2 mine			1.2 mine		 mine
------------------	=	----------	=	------
theirs + .2 theirs		1.2 theirs		theirs

Even pipelining considerations should not affect this argument (but please 
correct me if I'm wrong) -- the same effect would be apparent in each 
measurement; the ratio would remain unchanged.  We can see the calling 
overhead therefore doesn't matter! (Knowing this I wondered why Henry 
would raise it).

As a final point, even if we compare apples to oranges by having an optimizer
expand your _own_ routine inline and _not_ the C library routine, the effect
on the original results is also not marked on the Sun, since the
`roll your own' version of log I tested is _inherently_ faster than the
general library routine ON THAT PLATFORM (the fact that the library routine
must take account of any argument it may be given is _one_ reason).

In fact, since inlining of C library routines is probably not that common yet
across all architectures (it didn't happen on either the Sun or Vax no matter
what optimization level I used -- although on the Sun you _can_
make it happen), some might argue this provides _another_ reason to write
your own tailored routines, which may be coded portably as macros -- not an
idea I'd generally endorse, though, since the results overall (remember back
that far? :-) indicated such a large (by my definition) variation across the
different environments examined.

-kym

C code follows:

====================synth.c====================
#include <stdio.h>
#include <math.h>

#define NSAMP 10
#define NIT 10000

#define POLY(x) ((((x+M_PI)*x+M_PI)*x+M_PI)*x+M_PI)

double dummy(x) register double x; {
	return POLY(x);
	}

main(){
	int samp;
	double prod=1;
	double sum=0;
	register double tot=0;

	for(samp=1;samp<=NSAMP;samp++){
		double t0,t1,t2;
		register i;

		t0=clock();
		/* timed loop 1: polynomial evaluated via a subroutine call */
		for(i=0;i<NIT;i++) tot+=dummy(i*M_SQRT2);
		t1=clock();

		/* timed loop 2: the same polynomial expanded in line */
		for(i=0;i<NIT;i++) {
			tot+=POLY(i*M_SQRT2);
			}
		t2=clock();
		
		prod *= (t2-t1)/(t1-t0);
		sum += (t2-t1)/(t1-t0);
		}

	printf("arith mean=%g\n", sum/NSAMP);
	printf("geo mean=%g\n", pow(prod, 1.0/NSAMP));
	printf("garbage=%g\n",tot);/* need this in case the
					optimizer is _really_
					clever! */

	exit(0);
	}
====================synth2.c====================
#include <stdio.h>
#include <math.h>

#define NSAMP 10
#define NIT 10000

#define POLY(x) ((((x+M_PI)*x+M_PI)*x+M_PI)*x+M_PI)

double dummy(x) register double x; {
	return POLY(x);
	}

main(){
	int samp;
	double prod=1;
	double sum=0;
	register double tot=0;

	for(samp=1;samp<=NSAMP;samp++){
		double t0,t1,t2;
		register i;

		t0=clock();
		/* timed loop 1: as in synth.c, via a subroutine call */
		for(i=0;i<NIT;i++) tot+=dummy(i*M_SQRT2);
		t1=clock();

		/* timed loop 2: in-line, with i*M_SQRT2 hand-hoisted
		   into a temp register */
		for(i=0;i<NIT;i++) {
			register double x=i*M_SQRT2;
			tot+=POLY(x);
			}
		t2=clock();
		
		prod *= (t2-t1)/(t1-t0);
		sum += (t2-t1)/(t1-t0);
		}

	printf("arith mean=%g\n", sum/NSAMP);
	printf("geo mean=%g\n", pow(prod, 1.0/NSAMP));
	printf("garbage=%g\n",tot);

	exit(0);
	}
====================end end end====================