[comp.sys.sun] libm for 68881 and Sun fpa is incredibly slow

prl@eiger.uucp (03/01/89)

If you use libm (especially transcendental functions sin(), cos(), log(),
exp() etc.) you can get nearly a 10* speedup by using the clumsy code
inlining facility provided by Sun's C compiler.

For example (on Sun 3/60, SunOS 4.0.1)

The following code:

	#include <math.h>

	main()
	{
		register int i;
		register double x, y;
		for(i = 0, x = 0; i < 100000; i++, x += 2*M_PI/100000.0)
			y = cos(x);
	}

Compiled with:
	cc -O -f68881 -o cos cos.c -lm

Runs in:
	real    0m30.16s
	user    0m24.56s
	sys     0m0.58s

Compiled with (but how incredibly *UGLY*):
	cc -O -f68881 -o cos cos.c /usr/lib/f68881/libm.il

Runs in:
	real    0m4.33s
	user    0m3.65s
	sys     0m0.20s
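
As a rough check on the two sets of figures above, the speedup is simply the
ratio of the run times. A minimal sketch of that arithmetic follows; the
helper names speedup() and print_cos_speedups() are made up for illustration
and come from no Sun library.

```c
#include <stdio.h>

/* Hypothetical helper: how many times faster `fast_secs` is than `base_secs`. */
double speedup(double base_secs, double fast_secs)
{
    return base_secs / fast_secs;
}

/* Prints the ratios for the two runs quoted above (-lm versus libm.il). */
void print_cos_speedups(void)
{
    printf("user: %.2f\n", speedup(24.56, 3.65));
    printf("real: %.2f\n", speedup(30.16, 4.33));
}
```

Called from main(), this should report about 6.7 for user time and about 7.0
for real time on these particular figures.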

REASON:
	Although Sun went to the trouble of making the assembly inline
	file /usr/lib/f68881/libm.il, and a 68881 version of the
	maths library, they *DID NOT* make assembly versions of
	the maths functions to put into the maths library!
	(and similarly for the FPA)

	THIS IS STUPID!

NOTA BENE, Sun:
	We discovered this while benchmarking Sony workstations. Trivial
	loops with sin(), cos() etc. calls run 10* faster on the Sony
	(20MHz, 68020 version) than on the Sun 3/60, without the need for
	the inlining muck.

	If Sony can do it right first time, why can't Sun get it right
	on their Nth release after introducing the 68881?

Peter Lamb				uucp:  seismo!mcvax!ethz!prl
Tel: (01) 256 5241 (Switzerland)	eunet: prl@iis.ethz.ch
     +411 256 5241 (International)

Integrated Systems Laboratory
ETH-Zentrum
8092 Zurich

prl@eiger.uucp (03/02/89)

I have constructed a replacement for libm.a in which the code of most of
the C library routines has been replaced by code taken from the inline
replacement library.

I will be making this code available in sun-spots or in comp.sources.sun
(as appropriate), in a form which requires no distribution of Sun source
or binaries, as soon as it has been tested locally.

You can expect sqrt() to be more than 10 times faster in this library than
in the standard -lm!

The speedups are as follows (Sun3/280, SunOS 4.0.1):

Motorola 68881

Func	-lm	-lmfast		libm.il
	secs	secs	rel	secs	rel	rel
			-lm		-lm	-lmfast

cos()	9.43	1.85	5.10	1.67	5.65	1.11
sin()	8.87	1.77	5.01	1.65	5.38	1.07
tan()	12.98	1.97	6.59	1.83	7.09	1.08
acos()	5.13	2.37	2.16	2.27	2.26	1.04
asin()	6.40	2.37	2.70	2.20	2.91	1.08
atan()	9.22	1.82	5.07	1.70	5.42	1.07
log()	10.25	2.25	4.56	2.10	4.88	1.07
log10()	6.47	2.35	2.75	2.22	2.91	1.06
log2()	5.33	2.33	2.29	1.48	3.60	1.57
exp()	8.42	1.95	4.32	1.80	4.68	1.08
exp10()	9.37	2.12	4.42	2.02	4.64	1.05
exp2()	6.50	2.10	3.10	1.97	3.30	1.07
sqrt()	14.32	1.20	11.93	1.03	13.90	1.17
cosh()	4.82	2.23	2.16	2.10	2.30	1.06
sinh()	5.32	2.15	2.47	1.98	2.69	1.09
tanh()	5.57	2.33	2.39	2.23	2.50	1.04
atanh()	3.95	2.52	1.57	1.63	2.42	1.55



Weitek FPA

Func	-lm	-lmfast		libm.il
	secs	secs	rel	secs	rel	rel
			-lm		-lm	-lmfast

cos()	3.45	0.83	4.16	0.82	4.21	1.01
sin()	3.23	0.75	4.31	0.73	4.42	1.03
tan()	4.60	1.63	2.82	1.57	2.93	1.04
acos()	3.55	2.00	1.77	1.97	1.80	1.02
asin()	3.95	2.05	1.93	1.90	2.08	1.08
atan()	2.97	1.23	2.41	1.23	2.41	1.00
log()	4.42	1.27	3.48	1.28	3.45	0.99
log10()	4.17	2.07	2.01	1.93	2.16	1.07
log2()	3.43	1.93	1.78	1.95	1.76	0.99
exp()	3.15	1.42	2.22	1.25	2.52	1.14
exp10()	4.90	1.75	2.80	1.67	2.93	1.05
exp2()	3.32	1.75	1.90	1.68	1.98	1.04
sqrt()	12.37	1.18	10.48	1.18	10.48	1.00
cosh()	2.75	1.95	1.41	1.82	1.51	1.07
sinh()	3.03	1.75	1.73	1.68	1.80	1.04
tanh()	3.33	2.00	1.67	1.93	1.73	1.04
atanh()	2.32	2.12	1.09	2.10	1.10	1.01

NOTES:

1) -lmfast is my modified libm.a; libm.il means compiling with the
   Sun-supplied inline code file.

2) Columns entitled `rel' give the speedup relative to the library named
   beneath them.

3) The times for each routine are for 50000 calls, with parameter values in the
   range from slightly more than 0.0 to slightly more than 1.0, spaced linearly
   by 1.0/50000.0.

4) Loop overhead has been subtracted, but not subroutine call overhead.
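
The measurement described in notes 3 and 4 could be sketched roughly as
follows. This is not the program used for the tables above, only an
illustration under assumptions: the names NCALLS, START, arg_for() and
time_calls() are invented here, the starting offset is a guess (the notes say
only "slightly more than 0.0"), and clock() stands in for whatever timer was
really used.

```c
#include <time.h>

#define NCALLS 50000L
#define START  0.0001   /* assumed offset, not taken from the post */

/* Storing through a volatile object keeps both loops from being optimized away. */
static volatile double sink;

/* Argument for call i: START, START + 1/NCALLS, ... (linear spacing, note 3). */
double arg_for(long i)
{
    return START + (double)i / (double)NCALLS;
}

/* CPU seconds for NCALLS calls of f with the cost of the bare loop
   subtracted (note 4). The cost of the call itself is still included. */
double time_calls(double (*f)(double))
{
    clock_t t0, t1, t2;
    long i;

    t0 = clock();
    for (i = 0; i < NCALLS; i++)
        sink = arg_for(i);          /* loop and argument generation only */
    t1 = clock();
    for (i = 0; i < NCALLS; i++)
        sink = f(arg_for(i));       /* same loop plus the function call */
    t2 = clock();

    return ((double)(t2 - t1) - (double)(t1 - t0)) / (double)CLOCKS_PER_SEC;
}
```

With <math.h> included and the math library linked, time_calls(cos) would
time cos() in this fashion. On a current machine 50000 calls take very little
time, so the count would probably need raising to get a stable figure.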


Peter Lamb
uucp:  uunet!mcvax!ethz!prl	eunet: prl@ethz.uucp	Tel:   +411 256 5241
Integrated Systems Laboratory
ETH-Zentrum, 8092 Zurich

self@bayes.arc.nasa.gov (Matthew Self) (04/19/89)

John Schultz compiled the following timings for Sun's math libraries using
GCC and CC with various options:

> My results, running on Sun 3/60, Sun OS 3.5, GNU CC 1.32 built using
> default switches were
> 
> gcc -lm                  ===   4.6 real         4.2 user         0.0 sys
> gcc -m68881  -lm         ===   4.4 real         4.2 user         0.0 sys
> gcc -O -m68881 -lm       ===   4.4 real         4.1 user         0.0 sys  
> gcc -O -g -m68881 -lm    ===   5.4 real         4.2 user         0.1 sys
> cc -lm                   === 159.4 real       146.7 user         0.6 sys
> cc -O -lm                === 155.4 real       146.2 user         0.4 sys
> cc -f68881 -lm           ===   6.6 real         4.6 user         0.1 sys 
> cc  /usr/lib/f68881.il   ===   9.9 real         6.7 user         0.1 sys
> cc -O /usr/lib/f68881.il ===   6.5 real         6.4 user         0.0 sys 
> **********************************************************************/
> #include <math.h>
> 
> main()
> {
>   register int i;
>   register double x, y;
>   for(i = 0, x = 0; i < 100000; i++, x += 2*M_PI/100000.0)
>     y = cos(x);
> }

I have written an inline math library for GCC which is more than twice as
fast as any of these options for this test program.  In fact, it permits
GCC to determine that the program does nothing at all, so it optimizes it
away entirely!  I modified the test program slightly to make the return
value depend on the computations in the loop so this won't happen.  Even
with the extra addition I introduced, the program now executes in only
2.5s, more than twice as fast as before.  Here is the new test program:

#include <math.h>  /* my inline ANSI math library */

#define M_PI 3.14159265358979323846  /* this isn't defined in ANSI C's math.h */

main()
{
  int i;		/* GCC doesn't need register declarations */
  double x, y = 0;
  for(i = 0, x = 0; i < 100000; i++, x += 2*M_PI/100000.0)
    y += cos(x);
  if (y == 0)
    return 0;
  else
    return 1;
}
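
The effect described above can be reproduced without any particular math
library. In the sketch below, half_square() is a made-up stand-in for an
inlined math routine (it is not part of the inline library discussed here):
its body is visible to the compiler and it has no side effects. The function
names loop_discarded() and loop_consumed() are likewise invented for
illustration.

```c
/* Stand-in for an inlined math routine: pure, with its body visible to the compiler. */
static double half_square(double x)
{
    return 0.5 * x * x;
}

/* The result is never used, so an optimizer is free to delete the whole loop. */
void loop_discarded(long n)
{
    double x = 0.0, y = 0.0;
    long i;

    for (i = 0; i < n; i++, x += 1.0)
        y = half_square(x);
    (void)y;
}

/* The accumulated result is returned, so the work cannot simply be dropped. */
double loop_consumed(long n)
{
    double x = 0.0, y = 0.0;
    long i;

    for (i = 0; i < n; i++, x += 1.0)
        y += half_square(x);
    return y;
}
```

Whether loop_discarded() is really removed depends on the compiler and the
optimization level; comparing the assembly that "gcc -O -S" produces for the
two functions is one way to check.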

Availability of this inline math library will be announced soon on the
info-gcc mailing list.  Mail to info-gcc-request@prep.ai.mit.edu to
subscribe.

			Matthew Self
		  NASA Ames Research Center
		   self@bayes.arc.nasa.gov

self@bayes.arc.nasa.gov (Matthew Self) (04/21/89)

This inline math library was recently posted to the info-gcc mailing list
(gnu.gcc newsgroup).  If you can't obtain a copy there, let me know and I
will send you a copy.

			Matthew Self
		  NASA Ames Research Center
		   self@bayes.arc.nasa.gov

dav@hplabs.hp.com (David L. Markowitz) (05/06/89)

self@bayes.arc.nasa.gov (Matthew Self) writes:
> John Schultz compiled the following timings for Sun's math libraries using
> GCC and CC with various options:
> 
> > My results, running on Sun 3/60, Sun OS 3.5, GNU CC 1.32 built using
> > default switches were
> > 
[gcc timings deleted]
> > cc -lm                   === 159.4 real       146.7 user         0.6 sys
> > cc -O -lm                === 155.4 real       146.2 user         0.4 sys
> > cc -f68881 -lm           ===   6.6 real         4.6 user         0.1 sys 
                                            Is this ^^^ a Typo?  Maybe 6.4?
> > cc  /usr/lib/f68881.il   ===   9.9 real         6.7 user         0.1 sys
> > cc -O /usr/lib/f68881.il ===   6.5 real         6.4 user         0.0 sys 
[program deleted]
> 
[discussion about GCC inline library reducing user time to 2.5s deleted]

I would like to point out an error in the timing tests done above.  The
inline expansions only help in math library function calls - not in
built-in floating point operations (like * and /).  The -f68881 option
does the opposite - it helps built-in math, but not math library function
calls.  Both are needed here.  The optimal compiler command is therefore
"cc -O -f68881 /usr/lib/f68881.il cos.c", which on my Sun 3/60 under SunOS
3.4 yields 3.7s of user time, which - while not as good as GCC with
inlines - is still a lot better than the above matrix.

Will this GCC inline stuff make it into a future GCC distribution?

-- 
	David L. Markowitz		Rockwell International
	...!sun!sunkist!arcturus!dav	dav@arcturus.UUCP
	The above opinions are merely that, and only mine.