rwl@umree.ee.umr.edu (Wayne Little) (09/22/89)
One of the professors here at UMR discovered that he gained a 5 times performance increase on some of his floating point intensive code with complex numbers by using in-line expansion. The problem was with the *humongous* amount of stack traffic that was being generated by the 68881 coprocessor calls. The fix is to designate the inline expansion library for the 68881 on the compile line, e.g.:

	f77 -c -f68881 code.c -O code /usr/lib/68881/libm.il

The problem and solution are described in detail in section G (Assembly-Level In-line Expansion) of Sun's Floating-Point Programmer's Guide, pp. 105-113.

I hope this will be of help to others.

--
Wayne Little
Internet: rwl@ee.umr.edu
UUCP: uunet!umree!rwl
Phone: (314) 341-4546
USPS: Univ. of Missouri-Rolla, EE Dept., Rolla, MO 65401
prl@uunet.uu.net (09/24/89)
>From iis!prl Sun Sep 24 12:13:05 MET 1989 remote from ethz

> One of the professors here at UMR discovered that he gained a 5 times
> performance increase on some of his floating point intensive code with
> complex numbers by using in-line expansion.

This is not a new phenomenon; I posted earlier about the slowness of the SunOS 4.0 maths library (v7, around issues 195-200, Subject: `libm for 68881 and Sun fpa is incredibly slow').

> The problem was with the *humongous* amount of stack traffic that was
> being generated by the 68881 coprocessor calls.

This is *NOT*, repeat *NOT*, the problem. The problem is that the Sun -lm maths library doesn't use any of the high-level builtins that are available on the 68881 or in the Sun3 FPA. The proof of this:

1) Look at the code for sqrt(), cos() or similar from the library using adb. You'll see that sqrt() does a coded Newton iteration and doesn't use the builtin fsqrt operation of the 68881. Similarly, sin() and cos() do a coded series expansion.

2) If you recode the functions as C-callable functions in assembly language, you get almost the same speedup as using the inline library. That the inline library does no better is partly due to poor coding in it, as detailed by David Hough from Sun in response to my posting. Unfortunately I don't have an exact reference to David's article either.

If you want to experiment with this yourself, try the following test program (sqrttst.c):

	#include <stdio.h>
	#include <stdlib.h>
	#include <math.h>	/* declares double sqrt(double); without it, sqrt() defaults to returning int */

	int
	main(void)
	{
		register int i;
		register double a, b;

		/* Check that at least one result is correct */
		printf("%g\n", sqrt(2.0));
		for (i = 0, a = 0; i < 50000; i++, a += 0.00001)
			b = sqrt(a);	/* result deliberately unused */
		exit(0);
	}

and another implementation of sqrt (sqrt.s):

		.globl	_sqrt
	_sqrt:
		fsqrtd	sp@(4),fp0	| do the sqrt of the double argument at sp@(4)
		fmoved	fp0,sp@-	| push the result onto the stack...
		movel	sp@+,d0		| ...and pop it into d0/d1, where
		movel	sp@+,d1		| the double result is returned
		rts

	Command                                  User CPU time
	cc sqrttst.c -lm                         19.36 sec
	cc sqrttst.c sqrt.s                       1.78 sec
	cc sqrttst.c /usr/lib/68881/libm.il       1.63 sec

All times on a 3/60 running SunOS 4.0.3. Be wary of compiling the test routine with high levels of optimisation.
It is not a benchmark that is designed to be robust in the face of good global optimisation!

> The fix is to designate the inline expansion library for the 68881 on the
> compile line -
> e.g.
> f77 -c -f68881 code.c -O code /usr/lib/68881/libm.il

This will work much better if you use:

	-O4	(provided iropt doesn't run out of stack space and dump core :-( )
	/usr/lib/f68881/libm.il	(typo in original article)

> The problem and solution are described in detail in section G
> (Assembly-Level In-line Expansion) of Sun's Floating-Point Programmer's
> Guide, pp. 105-113.

Um, sort of. Sun's Floating-Point Programmer's Guide was written for 3.x, where the implementation of -lm was considerably faster. In 3.x, the speedup *is* due to reduced stack traffic, but it is nowhere near as spectacular; in 4.x, the speedup comes from switching to a completely different implementation of the functions, and it is considerable. Depending on the function, using the 68881 (or Sun3 FPA) implementation of the maths library functions rather than the slow code in -lm gives a speedup of between about 3 and 10 times.

For the pedantic: neither my sqrt.s implementation above nor the implementations in /usr/lib/f68881/libm.il satisfy SVID. This is documented (somewhere in the FM). It is relatively easy, however, to implement a sqrt() function that is more than 10 times faster than the one in -lm and also satisfies SVID.

I have had long and wearing discussions with the engineers responsible for the Sun maths library, and they were never willing to accept the poor performance of the standard maths library as a bug. The performance enhancements are now registered as Request For Enhancement RFE#1021706 (SO#310960). If you need a faster maths library, please contact your Sun support people and quote these references. This may help make the improvements happen sooner.
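Peter's point 1) says the SunOS 4.0 sqrt() is a coded Newton iteration rather than a single fsqrt instruction. A minimal sketch of that technique in C follows. It is a hypothetical illustration (the name newton_sqrt, the frexp/ldexp starting guess and the fixed count of 8 steps are all assumptions of this sketch), not Sun's actual library code. Each call performs eight floating-point divides, which suggests why such code loses so badly to one fsqrtd instruction.

```c
#include <math.h>

/* Hypothetical sketch of a software ("coded") square root using Newton's
   iteration y <- (y + x/y) / 2.  Illustration only, not Sun's libm code. */
double newton_sqrt(double x)
{
    if (x < 0.0)
        return NAN;                 /* no real root */
    if (x == 0.0 || isinf(x) || isnan(x))
        return x;                   /* +-0, +inf and NaN are returned unchanged */

    /* Starting guess: write x = m * 2^e with m in [0.5, 1) and take
       2^(e/2), which is within a factor of 2 of the true root. */
    int e;
    (void)frexp(x, &e);
    double y = ldexp(1.0, e / 2);

    /* Newton roughly doubles the number of correct bits per step, so a
       fixed 8 steps is ample from a guess this close. */
    for (int k = 0; k < 8; k++)
        y = 0.5 * (y + x / y);
    return y;
}
```

A production library would pick a better starting approximation and fewer steps, but the shape (a loop of divides and adds) is the same.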
For those of you with Sun4s: of the maths library functions, the FP chip there implements only sqrt in hardware, but there is *NO* sqrt in the inline library. Recoding the sqrt() function for the Sun4 in assembly to use the fsqrtd builtin gives you a factor of 5 speedup. For the curious, this is how to do a faster sqrt on a Sun4; again, it is *not* SVID-conformant:

		.seg	"text"
		.proc	7
		.global	_sqrt
	_sqrt:
		save	%sp,-72,%sp
		st	%i0,[%fp+68]	! spill the two halves of the double
		ld	[%fp+68],%f0	! argument (passed in %i0/%i1) and
		st	%i1,[%fp+72]	! reload them into %f0/%f1
		ld	[%fp+72],%f1
		fsqrtd	%f0,%f0		! the result is returned in %f0/%f1
		ret
		restore

BTW, for X11 hackers, the dreaded ARC function can be sped up by a factor of about 2.5-3 on a Sun3 by either compiling server/ddx/mi/miarc.c with the inline library (/usr/lib/68881/libm.il) or using my sqrt.s hack above!!

Peter Lamb
uucp: uunet!mcvax!ethz!prl    eunet: prl@iis.ethz.ch    Tel: +411 256 5241
Integrated Systems Laboratory
ETH-Zentrum, 8092 Zurich
prl@uunet.uu.net (09/26/89)
>From iis!prl Tue Sep 26 10:09:59 MET 1989 remote from ethz

There was a fundamental error in part of my earlier reply to this message. rwl@umree.ee.umr.edu (Wayne Little) suggests compiling:

> f77 -c -f68881 code.c -O code /usr/lib/68881/libm.il

I don't know why I saw the ".c" here and not the "f77". For code.f files, using the inline library *will* improve code performance by placing code inline without changing the essentials of the implementation. What I said about the poor performance of the Sun3 C library still holds, though.

Peter Lamb
uucp: uunet!mcvax!ethz!prl    Tel: (01) 256 5241 (Switzerland)
eunet: prl@iis.ethz.ch             +411 256 5241 (International)
Integrated Systems Laboratory
ETH-Zentrum
8092 Zurich