joshi@m.cs.uiuc.edu (Anil Joshi) (03/11/91)
I need to compute the natural log for some numbers.  I could use the C
math.h library routine, but I do not need the accuracy of the library
routine; a crude approximation would suffice.  Does anyone know how this
is done in the C library routine?  Is it possible to get the source code?

Thanks, Anil
--
"Come the (computer) revolution, all persons found guilty of such criminal
behaviour will be summarily executed, and their programs won't be!"
	- Press, Flannery, Teukolsky and Vetterling
dave@cs.arizona.edu (Dave P. Schaumann) (03/11/91)
In article <1991Mar11.022141.12068@m.cs.uiuc.edu> joshi@m.cs.uiuc.edu (Anil Joshi) writes:
>I need to compute the natural log for some numbers.  I could use the C
>math.h library routine but I do not need the accuracy of the library
>routine.  A crude approximation would suffice.  Does anyone know how this
>is done in the C library routine?  Is it possible to get the source code?

Transcendental functions are generally modeled with a Taylor-series
expansion.  (Check your favorite calculus book for what a Taylor series
is, and your favorite numerical analysis book for how to get the best
(i.e. most accurate) results.)

However, you might also consider using a table and interpolating the
intermediate values.  Depending on the expected range of your arguments
and the required accuracy of your result, this could be an effective (and
quick) method of getting the results.  Of course, you would still have to
initialize the table at the start of the program... (yikes! shades of
another thread creeping in -- film at 11).  But if you're calling the
function enough times that efficiency is a problem, this should be all
right.
--
Dave Schaumann | dave@cs.arizona.edu | Short .sig's rule!
gwyn@smoke.brl.mil (Doug Gwyn) (03/12/91)
In article <1991Mar11.022141.12068@m.cs.uiuc.edu> joshi@m.cs.uiuc.edu (Anil Joshi) writes:
>I need to compute the natural log for some numbers.  I could use the C
>math.h library routine but I do not need the accuracy of the library
>routine.  A crude approximation would suffice.  Does anyone know how this
>is done in the C library routine?  Is it possible to get the source code?

I don't understand why you don't simply use the C library log() function.
It should be reliable and efficient, so what would you gain by trying to
roll your own cruder version?
joshi@m.cs.uiuc.edu (Anil Joshi) (03/12/91)
gwyn@smoke.brl.mil (Doug Gwyn) writes:
>I don't understand why you don't simply use the C library log() function.
>It should be reliable and efficient, so what would you gain by trying to
>roll your own cruder version?

I do not want the accuracy that might have been provided in the C library
log().  It might be spending more time than necessary to calculate more
accurately than I need.  The suggestions I got were:

1. Use a Taylor series.
2. Use Chebyshev polynomials.
3. Use a table and interpolate.
4. Use a harmonic series (this is my/my advisor's idea).  I am not sure
   whether this would give results.
5. Somebody mentioned Dr. Dobb's Journal (an old, old issue) which gave
   code for 2 above.

I am tending towards idea 3, which seems to be fastest (my intuition).
If I get any more suggestions, I'll post.  Thanks everybody.

Anil
--
"Come the (computer) revolution, all persons found guilty of such criminal
behaviour will be summarily executed, and their programs won't be!"
	- Press, Flannery, Teukolsky and Vetterling
dbrooks@osf.org (David Brooks) (03/12/91)
joshi@m.cs.uiuc.edu (Anil Joshi) writes:
|> I do not want the accuracy that might have been provided in the c library
|> log(). It might be spending more time than necessary to calculate more
|> accurately than I want it to be.

I agree with Doug.  It doesn't sound as though you have characterized the
speed of the library routine.

However, I'd point out that (a) speed can be helped by knowing how your
architecture stores FP numbers, (b) most of them store an exponent
accessibly, and (c) some simple arithmetic on the exponent will give you
an approximation to the log.  How inaccurate do you want it?
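As a sketch of points (a)-(c), here is one way to do that exponent
arithmetic, ASSUMING IEEE 754 doubles (an assumption -- the machines in
this thread vary); the function name and tolerances are mine:

```c
#include <stdint.h>
#include <string.h>

/* Pull the biased exponent field out of the stored representation of an
 * IEEE 754 double and scale it by ln 2.  For positive normal x this is
 * the log to within about +/- ln 2 (~0.69), since the mantissa's
 * contribution (between 0 and ln 2) is simply dropped. */
static double crude_log(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);                 /* reinterpret safely */
    int exp2 = (int)((bits >> 52) & 0x7ff) - 1023;  /* unbiased exponent */
    return exp2 * 0.69314718055994531;              /* exp2 * ln 2 */
}
```

On a 1991 machine without memcpy-friendly optimizers, the same thing
would more likely be done with a union or a pointer cast, at the cost of
portability.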
henry@zoo.toronto.edu (Henry Spencer) (03/13/91)
In article <1991Mar12.014416.4289@m.cs.uiuc.edu> joshi@m.cs.uiuc.edu (Anil Joshi) writes:
>I do not want the accuracy that might have been provided in the c library
>log(). It might be spending more time than necessary to calculate more
>accurately than I want it to be.

Let's be precise.  You don't *need* the accuracy of the library routine,
and you're guessing that it might be slow.  The first thing for you to do
is to find out whether that guess is correct.  That is *guaranteed* to be
simpler and easier than inventing an approximate-log function, and it
could very well solve your problem -- implementors often put considerable
effort into making library math functions run fast.

If, and only if, your library function turns out to be unacceptably slow,
note that the exponent of your floating-point number is already a good
estimate of the (binary) logarithm of the number; extracting the exponent
and multiplying to convert it to a natural log might be good enough.

It is difficult to answer your question when you haven't said *how much*
accuracy or *how much* speed you need.
--
"But this *is* the simplified version | Henry Spencer @ U of Toronto Zoology
for the general public."  -S. Harris  | henry@zoo.toronto.edu  utzoo!henry
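A minimal version of this exponent trick can be done portably with
frexp() rather than by poking at the representation.  The one-term
correction and the function name below are my own choices, not anything
from a real library:

```c
#include <math.h>

#define LN2 0.69314718055994531

/* frexp() gives x = m * 2^e with m in [0.5, 1), so e*LN2 is already a
 * rough natural log.  The 2z term is the first term of the identity
 * log(m) = 2*atanh(z) with z = (m-1)/(m+1); adding it brings the
 * worst-case error under about 0.03. */
static double approx_log(double x)
{
    int e;
    double m = frexp(x, &e);
    double z = (m - 1.0) / (m + 1.0);
    return e * LN2 + 2.0 * z;
}
```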
shaunc@gold.gvg.tek.com (Shaun Case) (03/13/91)
Just as a point of minor interest, you can get the integer part of the
log of a string representation of a positive integer by doing a strlen().
(Log base 10, of course.)  Example:

	strlen("1")       - 1 == 0	log == 0
	strlen("10")      - 1 == 1	log == 1
	.
	.
	.
	strlen("1000000") - 1 == 6	log == 6

Not being a math weenie, I will venture an unsure guess that the maximum
error using this method is ~1.  If you are doing logs of small numbers,
this is obviously unacceptable.  I don't know if doing a trunc/round on a
float, followed by a conversion to a string and then a strlen, is any
faster than doing a log, but I thought this might be interesting to
someone anyway.  Kind of neat to see a correlation between a number and
its representation in C.  Now, if we could only use strupr() to calculate
sin and cos... :-)

// Shaun //
--
Shaun Case: shaunc@gold.gvg.tek.com or atman%ecst.csuchico.edu@RELAY.CS.NET
or Shaun Case of 1:119/666.0 (Fidonet) or 1@9651 (WWIVnet)
--- A bullet in the Bush is worth two in the hand.
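Spelled out as code, the trick above might look like this; sprintf()
stands in for "a string representation", and the function name is my own:

```c
#include <stdio.h>
#include <string.h>

/* floor(log10(n)) for positive n, via the length of its decimal
 * representation: "1000000" has 7 digits, so the result is 6. */
static int strlen_log10(unsigned long n)
{
    char buf[sizeof(unsigned long) * 3 + 2];  /* room for any decimal ulong */
    sprintf(buf, "%lu", n);
    return (int)strlen(buf) - 1;
}
```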
joshi@m.cs.uiuc.edu (Anil Joshi) (03/13/91)
I got some more responses -- some advice and some code.  After going
through them, I figured out that my initial posting as well as the
followup were not clear enough as to why I want to roll my own log.

1. It is a new algorithm to solve LP problems that needs the log to be
   evaluated but allows for some leeway.  The convergence rate of the
   algorithm does not depend on the accuracy of the log function.
2. The log function needs to be calculated several times over in the
   inner loop of the algorithm.
3. Even if I roll a new log function for this one program, I do not have
   any intention of speeding up the library function, which won't be
   easy to do.

All I needed was the info as to what is being done in the library routine
so that I can emulate it.  In fact, I do know that one cannot write a
more efficient routine than the system-supplied one in a short time.
That's why I asked whether the code is available somewhere, so that I can
make my program efficient and fast.

Ironically, the need to evaluate the log has disappeared between the time
I posted my request and now.  So, thanks to all for your advice/code.

Anil
joshi@cs.uiuc.edu
--
"Come the (computer) revolution, all persons found guilty of such criminal
behaviour will be summarily executed, and their programs won't be!"
	- Press, Flannery, Teukolsky and Vetterling
bengsig@dk.oracle.com (Bjorn Engsig) (03/13/91)
Sorry, this isn't C any longer, but it needs clarification.
Article <1119@caslon.cs.arizona.edu> by dave@cs.arizona.edu (Dave P. Schaumann) says:
|Transcendental functions are generally modeled with a Taylor-series
|expansion.  (Check your favorite calculus books for what a Taylor series
|is,

Taylor-series are only of mathematical value.  As a computational tool,
they are absolutely worthless.

|and your favorite numerical analysis book on how to get the best (i.e.
|most accurate) results).

Which will tell you why you shouldn't use Taylor series, and will
probably tell you that Chebyshev polynomials are among the best choices
for a large number of functions.
--
Bjorn Engsig, ORACLE Corporation, E-mail: bengsig@oracle.com, bengsig@oracle.nl
henry@zoo.toronto.edu (Henry Spencer) (03/14/91)
In article <1991Mar12.211047.826@m.cs.uiuc.edu> joshi@m.cs.uiuc.edu (Anil Joshi) writes:
>All I needed was the info as to what is being done in the library
>routine so that I can emulate the library routine. ... why I asked
>whether the code is available somewhere, so that I can make my
>program efficient and fast.

Unfortunately, you didn't really supply enough information for this.
There is no single source for the log function; the answer is vendor-
specific.  As a particular case in point, on a 68xxx machine with a
68881, the best log routine consists of (ignoring calling-sequence
overhead):

	FLOGN	r1, r2

or the equivalent.  That's right, one instruction.  On most more modern
machines the sequence will be more complex. :-)

When you post such queries, you need to supply details on things like
machine, compiler, and requirements.
--
"But this *is* the simplified version | Henry Spencer @ U of Toronto Zoology
for the general public."  -S. Harris  | henry@zoo.toronto.edu  utzoo!henry
cjkuo@locus.com (Chengi Jimmy Kuo) (03/14/91)
shaunc@gold.gvg.tek.com (Shaun Case) writes:
>Just as a point of minor interest, you can get the integer part of the
>log of a string representation of a positive integer by doing a strlen().
>(Log base 10, of course.)
>
>not being a math weenie, I will venture an unsure guess that the
>maximum error using this method is ~1.  If you are doing logs
>of small numbers, this is obviously unacceptable.

For this case -- and you can extend this algorithm -- you could use a
series of if-thens.  This way, you can determine your own accuracy and
rounding.  This is a slightly different way of doing table lookup and
interpolation.  strlen isn't that fast.

Jimmy Kuo
--
cjkuo@locus.com
"The correct answer to an either/or question is both!"
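Jimmy's if-then variant might look like this for the same digit-count
problem (the range covered and the cutoffs are arbitrary choices of
mine); it avoids the string conversion entirely:

```c
/* floor(log10(n)) for positive n up to the range of the ladder,
 * by comparing against powers of ten instead of calling strlen(). */
static int ladder_log10(unsigned long n)
{
    if (n < 10UL)        return 0;
    if (n < 100UL)       return 1;
    if (n < 1000UL)      return 2;
    if (n < 10000UL)     return 3;
    if (n < 100000UL)    return 4;
    if (n < 1000000UL)   return 5;
    if (n < 10000000UL)  return 6;
    return 7;  /* extend the ladder as far as your range requires */
}
```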
ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (03/14/91)
In article <1991Mar12.014416.4289@m.cs.uiuc.edu>, joshi@m.cs.uiuc.edu (Anil Joshi) writes:
> I do not want the accuracy that might have been provided in the C library
> log(). It might be spending more time than necessary to calculate more
> accurately than I want it to be.

Have you *measured* it?  If you are using a system with a floating-point
coprocessor chip (80*87, 6888[12], 32*81, Weitek, ...) there is a strong
likelihood that the C function log() will map onto a single coprocessor
instruction, and that in turn is likely to be considerably faster than
anything you can cobble together.  Even if your machine has no log()
instruction or other direct support for log(), remember that the people
who work on floating-point libraries are usually GOOD at it.

Just now, as a check, I took a look at the code for one implementation of
log().  Unless your program is spending almost all its time computing
logarithms, there isn't going to be a lot of point in trying to do
better.  Basically, the kind of thing that goes on in a function like
this is:

1. Check that the argument is in range (handle infinity, NaN, subnormals,
   0, negative numbers, &c).
2. Argument reduction: re-express the number as 2**k * (1+f).
3. Calculate t = log2(1+f) using some clever method.
4. Return (k + t) * log(2).

If you know that your numbers are in range, you can eliminate 1, but the
people who wrote your library already knew that it was a good idea to
make the "everything in order" path as fast as possible.  If you know a
_lot_ about your numbers, you may be able to eliminate 2, but there are C
library functions for that step so it probably isn't _worth_ eliminating.
3 is the tricky bit, but unless you are a numerical analyst or have the
help of one (and from the postings this does not seem to be the case) it
may not be obvious just how far to take your series or table to get the
accuracy you do want.  Perhaps surprisingly, you aren't likely to save
very much.
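A toy version of steps 2-4 (with step 1 omitted) could be sketched as
follows; frexp() does the argument reduction, and the particular
three-term series used for step 3 is just one crude choice of mine, not
what any real library does:

```c
#include <math.h>

#define LN2 0.69314718055994531

/* Steps 2-4 above: frexp() re-expresses x as m * 2^e with m in [0.5, 1)
 * (step 2); a truncated series based on log(m) = 2*atanh(z), where
 * z = (m-1)/(m+1), estimates log(m) to roughly 1e-4 (step 3); and the
 * pieces are recombined (step 4).  Natural log throughout. */
static double sketch_log(double x)
{
    int e;
    double m  = frexp(x, &e);                          /* step 2 */
    double z  = (m - 1.0) / (m + 1.0);
    double z2 = z * z;
    double t  = 2.0 * z * (1.0 + z2/3.0 + z2*z2/5.0);  /* step 3 */
    return e * LN2 + t;                                /* step 4 */
}
```

Taking the series further, or replacing it with a minimax polynomial, is
exactly the part where (as noted above) a numerical analyst earns their
keep.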
As a particular point: interpolating in a table is very likely to be
SLOWER than your C system's log() function.  I really can't stress enough
that for the elementary transcendental functions, you should only
consider writing your own once you have MEASURED them and found them to
be too slow in context or too inaccurate.  And you should by no means use
your own log() until you have measured its accuracy: it might be much
worse than you think!  For this purpose, the ELEFUNT (ELEmentary FUNction
Testing) library is extremely valuable; you can get it from netlib.
--
The purpose of advertising is to destroy the freedom of the market.
ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (03/14/91)
In article <1991Mar12.211047.826@m.cs.uiuc.edu>, joshi@m.cs.uiuc.edu (Anil Joshi) writes:
> After going through them I figured out that my initial posting as well as
> the followup were not clear enough as to why I want roll my own log.
> 1. It is a new algorithm to solve LP problems that needs the log to be
> evaluated but allows for some lee-way. The convergence rate of the
> algorithm does not change[sic] on the accuracy of the log function.

In that case, why not just use 0?  If neither the correctness of the
results nor the rate of convergence depends on the accuracy of the
logarithm, why compute it at all?  It sounds as though frexp() would have
done everything you need (or logb() if you have it).
--
The purpose of advertising is to destroy the freedom of the market.
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (03/15/91)
In article <1294@dkunix9.dk.oracle.com> bengsig@dk.oracle.com (Bjorn Engsig) writes:
> Taylor-series are only of mathematical value.  As a computational tool,
> they are absolutely worthless.

This is not true.  Contrary to popular myth, Taylor series can be used
quite effectively for numerical solutions of differential equations.
They'll even tell you when the solution is becoming unstable.  See
Moore's interval analysis book for more details.

In any case, it sounds like the original poster only needs a few digits
of accuracy; I'd use a large table.

---Dan
john@newave.UUCP (John A. Weeks III) (03/16/91)
In article <1991Mar12.014416.4289@m.cs.uiuc.edu> joshi@m.cs.uiuc.edu (Anil Joshi) writes:
> I do not want the accuracy that might have been provided in the c library
> log(). It might be spending more time than necessary to calculate more
> accurately than I want it to be.
> I am tending towards Idea 3, which seems to be fastest (my intuition).

As an experiment, why not try running your program using log(), then try
one of the alternate solutions?  Compare the time it takes to run each
example.  Then post your results to the net.

What I expect you will find is that the difference is rather small -- an
optimized hand-coded assembly routine or a math coprocessor call as
compared to a C routine.  I have often found that the program algorithm
makes a far greater impact on performance than optimizing small pieces of
code.  But since you sound like you will be using log() extensively, it
might make a great deal of difference.  If you do find a difference, I
bet that it will vary greatly between different machines.

-john-
--
-----------------------------------------------------------------------------
John A. Weeks III               (612) 942-6969              john@newave.mn.org
NeWave Communications                      ...uunet!tcnet!wd0gol!newave!john
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/17/91)
In article <702@newave.UUCP> john@newave.mn.org (John A. Weeks III) writes:
>In article <1991Mar12.014416.4289@m.cs.uiuc.edu> joshi@m.cs.uiuc.edu (Anil Joshi) writes:
[complains about library logs]
>As an experiment, why not try running your program using log, then
>try one of the alternate solutions. Compare the time it takes to run
>each example. Then post your results to the net.

While I would generally agree with John -- that micro-optimization of one
or two instructions or function calls seldom (if ever) improves the
overall performance of typical programs -- I'm going to play devil's
advocate here.

The following is a simple-minded routine that uses a table lookup to
compute an approximate natural log for 16-bit integers.  It could easily
be recoded to get logs of doubles or floats, and additionally you might
like to remove the floating-point operations and replace them by
fixed-point.  (In fact, in a similar problem posted in comp.arch
regarding division, speeds for a similar routine approached almost twice
the speed of the FLOATING POINT HARDWARE on some machines.)  Another
point to consider -- the initial `find first 1' logic might be coded as a
single instruction on some machines.

The results vary over different machines -- some with fp hardware, some
with `no' hardware :-).  The figures are the ratio of my routine's time
to the library routine's time:

	Sun 3/60:	0.368515
	Vax 8530:	0.421875
	Sun SS1:	0.6
	MIPS Magnum:	1
	Vax 6440:	1.64706

So we see that on _some_ hardware (like 68k's) the library routines are
at an apparent _big_ disadvantage; on some other platforms (the 6440
running VMS) the situation is reversed.  I particularly like the MIPS,
where everything's equal. :-)
Although I realize I'm comparing apples to oranges here (my log routine
only handles 16-bit integers, not the whole spectrum of double-precision
arguments as does the C library routine) -- that _was_ the point of the
original posting: can we ever take advantage of the fact that what we
wish to do is somewhat less general than what the library designer
intended?  As John said in his post referenced above, the answer is a
categorical ``it depends''.  You be the judge from the figures above.

	-kym

C program follows:

==================== log.c ====================
#define NDEBUG
#include <assert.h>
#include <stdio.h>
#include <math.h>
#include <time.h>	/* for clock() */

/* Taylor series:

	f(x+h) = f(x) + h f'(x) + 1/2 h^2 f''(x) + ...

   For f(x) = log(x) we have:

	f(x)    = log(x)
	f'(x)   = 1/x
	f''(x)  = -1/x^2
	f'''(x) = 2/x^3

   Evaluating each of these at x=1:

	f(1) = 0,  f'(1) = 1,  f''(1) = -1,  f'''(1) = 2

   Therefore:

	log(1+h) = 0 + h (1) + 1/2 h^2 (-1) + 1/6 h^3 (2) + 1/24 h^4 (-6) ...
	         = h - h^2/2 + h^3/3 - h^4/4 + ...

   If we wish to find:

	log(L a + b) = log(La (1 + b/(La)))
	             = log(L) + log(a) + log(1+b/(La))

   where L = 2^k, we have log(2^k) = k log(2) and therefore:

	log(x) = k*log(2) + log(a) + log(1+b/(La))

   But as we know log(1+b/(La)) =about b/(La) - 1/2 (b/(La))^2, therefore:

	log(x) =about k*log(2) + log(a) + 1/L b/a - 1/2 1/L^2 (b/a)^2
*/

#define NTAB	256	/* size of table */
#define NSAMP	10000L	/* number of samples in test */

double atab[NTAB];

init(){
	int i;
	for(i=1;i<NTAB;i++) atab[i]=log((double)i*NTAB);
	}

double mylog(x) register x; {
	register a,b,n;
	for(n=0; x<NTAB*NTAB/2; x<<=1,n++);
	a=x/NTAB;
	b=x%NTAB;
	assert(a>=0);
	assert(a<NTAB);
	return atab[a] + (double)b/(a*NTAB) - n*M_LN2;
	}

main(){
	double sum1,sum2;
	double t0,t1,t2;
	int x;
	init();
	t0=clock();
	sum1=0;
	for(x=1;x<NSAMP;x++) sum1+=log((double)x);
	t1=clock();
	sum2=0;
	for(x=1;x<NSAMP;x++) sum2+=mylog(x);
	t2=clock();
	printf("%g %g\n",sum1,sum2);
	printf("%g\n",(t2-t1)/(t1-t0));
	exit(0);
	}
==================== END END END ====================
henry@zoo.toronto.edu (Henry Spencer) (03/21/91)
In article <1991Mar16.201655.6104@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>So we see that on _some_ hardware (like 68k's) the library routines are
>at an apparent _big_ disadvantage...

No, actually, we see that on some hardware/software combinations the
library routines are at a big disadvantage.  In particular, on that Sun
3/60, did you compile with -f68881 and use the inlining facility for the
math library?  If not, you were timing the calling overhead, not the log
function.
--
"[Some people] positively *wish* to     | Henry Spencer @ U of Toronto Zoology
believe ill of the modern world."-R.Peto| henry@zoo.toronto.edu  utzoo!henry
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/21/91)
In article <1991Mar20.173249.3819@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>No, actually, we see that on some hardware/software combinations the library
>routines are at a big disadvantage. In particular, on that Sun 3/60, did
>you compile with -f68881 and use the inlining facility for the math library?
>If not, you were timing the calling overhead, not the log function.

No, usually I just say `-O4' and let it go at that.  However, if you
_wanna_ see what happens with `inlining', on a Sun 3 you get this (I was
surprised):

	-O4		0.356631
	-f68881		0.414894
	-O4/-f68881	0.280899
	-fsoft		0.360215
	-O4/-fsoft	0.355993

Apparently the `inline' option -f68881 does cut down somewhat on
(presumably) the calling overhead to (essentially only) the log function.
The global analysis (plus a few fancy other things that don't really
apply to this program) done by -O4 is _almost_ as good as inlining (i.e.
calling overhead of about (0.415-0.357)/0.415 = 14% seems to have been
eliminated).

However, look at the combination of -O4 and -f68881!  It's a bit hard to
understand how things can go _backwards_ for the library routine by
simply doing both things.  Instead of my little subroutine running only
about 3 times faster, combining both switches makes it run almost 4 times
faster than the library routine!  Amazing!  Perhaps Henry might explain
this one (my brain is hurting at the moment)?

As a kind of joke -- and a slight counterexample to Henry's statement
above -- I tried the -fsoft option that, I presume (from the man page
anyway), restricts everything to using software fp routines.  Almost the
same comparison as the original -O4 result.  Perhaps there _isn't_ that
much variation on a given hardware, despite the various fancy compiler
options.
(Although I presume a DIFFERENT compiler and library on the same platform might behave quite differently). Cheers, -kym
henry@zoo.toronto.edu (Henry Spencer) (03/21/91)
In article <1991Mar20.204034.28931@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes: >Apparently the `inline' option -f68881 ... Uh, you're a bit confused here. There are *two* separate facilities that are relevant, -f68881 and the inlining facility. They're not the same thing. -- "[Some people] positively *wish* to | Henry Spencer @ U of Toronto Zoology believe ill of the modern world."-R.Peto| henry@zoo.toronto.edu utzoo!henry
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/21/91)
In article <1991Mar20.223644.16769@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>Uh, you're a bit confused here. There are *two* separate facilities that
>are relevant, -f68881 and the inlining facility. They're not the same thing.

Yes, I may be confused.  The following comes from the man page (I sure
hope those nice guys from Sun don't get me for any (c) violations):

     Floating-point code generation option.  Can be one of:

     -f68881	Generate in-line code for Motorola MC68881
		floating-point processor (supported only on
		Sun-3 systems).

What does this mean? :-)

	-kym
henry@zoo.toronto.edu (Henry Spencer) (03/22/91)
In article <1991Mar21.000655.28999@bingvaxu.cc.binghamton.edu> kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>>... There are *two* separate facilities that
>>are relevant, -f68881 and the inlining facility. They're not the same thing.
>
>Yes I may be confused. The following comes from the man page ...
>
>-f68881	Generate in-line code for Motorola
>		MC68881 floating-point processor ...

However, this doesn't generate in-line code for library functions, only
for things like addition and multiplication.  There is a separate
inlining facility for library code, which is what's wanted if you're
trying to evaluate the performance of log().  See the discussion of the
.il suffix.
--
"[Some people] positively *wish* to     | Henry Spencer @ U of Toronto Zoology
believe ill of the modern world."-R.Peto| henry@zoo.toronto.edu  utzoo!henry
steve@taumet.com (Stephen Clamage) (03/22/91)
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) writes:
>Yes I may be confused. The following comes from the man page...
>
>-f68881	Generate in-line code for Motorola
>		MC68881 floating-point processor (supported
>		only on Sun-3 systems).
>
>what does this mean? :-)

The Sun-3 has three options for floating-point calculations.  It can use
the MC68881 co-processor directly, use subroutines to do floating-point
calculations, or use subroutines which choose one of the other two
options based on the presence or absence of the MC68881.  If you know
your code will only be run on machines with the MC68881, you can tell the
compiler this with the -f68881 switch.  When you then write a
floating-point expression, the compiler will generate code to use the
coprocessor directly (in line) instead of calling a subroutine to
determine how to perform the calculations.
--
Steve Clamage, TauMetric Corp, steve@taumet.com
kym@bingvaxu.cc.binghamton.edu (R. Kym Horsell) (03/22/91)
To recap the discussion so far:

[Someone wondered how a specialized routine might compare with the
general C library version.]

[I agreed with another article _against_ micro-optimization as a general
policy, but wrote a small `benchmark' for log() and found a significant
variation over the different machines I use and their standard cc's.
These ranged from even through a good factor in either direction.]

In article <1991Mar20.173249.3819@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>No, actually, we see that on some hardware/software combinations the library
>routines are at a big disadvantage. In particular, on that Sun 3/60, did
>you compile with -f68881 and use the inlining facility for the math library?
>If not, you were timing the calling overhead, not the log function.

[I tried -f68881 on the Sun with essentially the same results except when
-O4 `full' optimization was _also_ used; then the `roll your own' routine
was much faster than the C library function.]

In article <1991Mar21.170830.24983@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>However, this doesn't generate in-line code for library functions, only for
>things like addition and multiplication. There is a separate inlining
>facility for library code, which is what's wanted if you're trying to
>evaluate the performance of log(). See the discussion of the .il suffix.

Now, to continue: I wasn't sure whether -f68881 was meant to provide the
kind of inlining Henry was originally suggesting.  (Although it seems to
put the library routine at a further disadvantage than without the
`inline' -f68881 option.)
His original posting seems to imply he meant inlining of fp operations,
since calling overhead will not have much of an effect on the original
program, which includes the same number of calls to library routines as
calls to routines of my own.

In my original posting I was comparing my own `log' routine to the C
library version.  The original point of this was to show that the answer
to the question ``is this a good thing to do'' is a definite ``maybe'',
depending on your environment (perhaps I roused Henry's ire by using
``hardware'' at the wrong point :-).

Even though I used the -O4 `global optimize' switch to Sun's 3/60 cc,
_no_ subroutines were inlined in my program; if I issue N calls to my log
routine, N `jbsr' instructions are done.  If I issue N calls to the C
library log routine, N `jbsr' instructions are done.

So the question indirectly raised by Henry is: ``does calling overhead
make a big difference to the results obtained?''  We can answer this in
two ways.  First, does calling make much of a difference in the overall
running of a program in an _absolute_ sense (i.e. ``how much effect does
the proposed subroutine inlining have'')?  And secondly, does calling
overhead affect the comparison performed in typical simple benchmarks
like the one I posted?  To answer the first question: no, not much.  To
answer the second: no.  I tend to define a _significant_ difference as
anything >10% or 20%.

So, although admittedly I only have the time to get figures for my 3/60
and Vax 8530 in this case, let's take some _measurements_.  Via the
program appended at the end, I compare the averages of a small number of
runs of a loop of several thousand calls to a polynomial evaluation --
fairly typical of sin(), cos(), etc.  I've left out the usual argument
reduction stages for clarity -- they should not make much difference,
since they typically run much faster than an fp multiply (and the poly
eval uses several fp muls).
Results for program `synth':

			Vax 8530	Sun 3/60
	arith mean	1.29085		1.40432
	geo mean	1.28371		1.4043

(For those who will ask -- geometric means are used to average ratios.
They have one property some people don't like: they are usually smaller
than the arithmetic mean of the same numbers -- I therefore include both
(and use whichever is _least_ convenient).)

The _subroutine_ turns out to be faster on both machines. :-)  (This may
not be true for other hardware -- Vaxen and 68k's are fairly similar in
some respects, these figures indicating one.)

By examining the assembler code we can see why on the Vax -- apparently
the optimizer can't eliminate the (i*M_SQRT2) common subexpression on the
call!  By hand-coding this into a temp register we obtain the program
`synth2'.

Results for program `synth2':

			Vax 8530	Sun 3/60
	arith mean	0.821795	0.98481
	geo mean	0.815938	0.984798

From the results for this program, subroutine calls and argument passing
are seen to take about 20% of the cpu time on a Vax 8530.  Interestingly,
the Sun indicates there is now _no_ calling overhead!  (And the
subroutine call to `dummy' is still there -- I checked.)

Ok, this is a slight joke -- let's _also_ eliminate calls to the fp
arithmetic routines and see what happens:

	Sun 3/60 with -f68881 -O4
	arith mean	0.677463
	geo mean	0.676492

Presumably the now approx 30% difference is due only to the single
remaining subroutine call.

Therefore we can say inlining (by my definition) may produce a
`significant' speedup in a program on some machines but _not_ on others.
In any case, if a piece of software does other things besides call
subroutines (e.g. it performs very long sequences of expensive fp
operations, as typified by the 68k with fp subroutine calls) the speedup
will probably _not_ be significant.

To get to the second question -- ``does calling overhead affect the
original measurements, which involved a ratio of two running times?''
The answer is: no.
In the original `benchmark' I looked at the ratio of `mine' vs `theirs'
in cpu time.  If we say calls (which happened in equal numbers in both
cases) take some fraction of the total instruction times -- 20%, say --
what do we find a ratio of two cpu times to be?

	mine   + .2 mine     1.2 mine
	------------------ = ---------- = ...
	theirs + .2 theirs   1.2 theirs

Even pipelining considerations should not affect this argument (but
please correct me if I'm wrong) -- the same effect would be apparent in
each measurement; the ratio would remain unchanged.  We can see the
calling overhead therefore doesn't matter!  (Knowing this, I wondered why
Henry would raise it.)

As a final point, even if we compare apples to oranges by having an
optimizer expand your _own_ routine inline and _not_ the C library
routine, the effect on the original results is also not marked on the
Sun, since the `roll your own' version of log I tested is _inherently_
faster than the general library routine ON THAT PLATFORM (the library
routine must take account of any arguments it may be given, for _one_
reason).

In fact, since inlining of C library routines is probably not that common
yet across all architectures (it didn't happen on either the Sun or the
Vax no matter what optimization level I used -- although on the Sun you
_can_ make it happen), some might argue this provides _another_ reason to
write your own tailored routines, which may be coded portably as macros
-- not an idea I'd generally endorse 'tho, since the results overall
(remember back that far? :-) indicated such a large (by my definition)
variation across the different environments examined.
	-kym

C code follows:

====================synth.c====================
#include <stdio.h>
#include <math.h>
#include <time.h>	/* for clock() */

#define NSAMP	10
#define NIT	10000

#define POLY(x) ((((x+M_PI)*x+M_PI)*x+M_PI)*x+M_PI)

double dummy(x) register double x; {
	return POLY(x);
	}

main(){
	int samp;
	double prod=1;
	double sum=0;
	register double tot=0;
	for(samp=1;samp<=NSAMP;samp++){
		double t0,t1,t2;
		register i;
		t0=clock();
		for(i=0;i<NIT;i++) tot+=dummy(i*M_SQRT2);
		t1=clock();
		for(i=0;i<NIT;i++) {
			tot+=POLY(i*M_SQRT2);
			}
		t2=clock();
		prod *= (t2-t1)/(t1-t0);
		sum += (t2-t1)/(t1-t0);
		}
	printf("arith mean=%g\n", sum/NSAMP);
	printf("geo mean=%g\n", pow(prod, 1.0/NSAMP));
	printf("garbage=%g\n",tot); /* need this in case the optimizer is _really_ clever! */
	exit(0);
	}
====================synth2.c====================
#include <stdio.h>
#include <math.h>
#include <time.h>	/* for clock() */

#define NSAMP	10
#define NIT	10000

#define POLY(x) ((((x+M_PI)*x+M_PI)*x+M_PI)*x+M_PI)

double dummy(x) register double x; {
	return POLY(x);
	}

main(){
	int samp;
	double prod=1;
	double sum=0;
	register double tot=0;
	for(samp=1;samp<=NSAMP;samp++){
		double t0,t1,t2;
		register i;
		t0=clock();
		for(i=0;i<NIT;i++) tot+=dummy(i*M_SQRT2);
		t1=clock();
		for(i=0;i<NIT;i++) {
			register double x=i*M_SQRT2;
			tot+=POLY(x);
			}
		t2=clock();
		prod *= (t2-t1)/(t1-t0);
		sum += (t2-t1)/(t1-t0);
		}
	printf("arith mean=%g\n", sum/NSAMP);
	printf("geo mean=%g\n", pow(prod, 1.0/NSAMP));
	printf("garbage=%g\n",tot);
	exit(0);
	}
====================end end end====================