[comp.sys.sun] Fortran 300 times faster than C for computing cosine

achhabra@uceng.uc.edu (atul k chhabra) (03/21/89)

I chanced upon a segment of code that runs approximately 300 times faster
in FORTRAN than in C. I have tried the code on Sun3(OS3.5) and on
Sun4(OS4.0) (of course, on Sun4 the -f68881 flag was not used.) The
results are similar on both machines. Can anyone enlighten me on this
bizzare result?

Listing of cosc.c:
__________

/*
 * Compile using:
 *	cc -f68881 -O -o cosc cosc.c -lm.
 */

#include <math.h>

main()
{
    int i;
    float tmp;

    for(i=0;i<262144;i++)
	tmp=cos(2.5)*cos(2.5)*cos(2.5)*cos(2.5);
}

Listing of cosf.f
__________

c
c	Compile using:
c		f77 -f68881 -O -o cosf cosf.f
c
	program cosf
	integer i
	real tmp

	do 10 i=1,262144
		tmp=cos(2.5)*cos(2.5)*cos(2.5)*cos(2.5)
10	continue
	end

Timings on Sun3(OS3.5):

% time cosc
55.6u 1.0s 1:49 51% 24+8k 12+1io 0pf+0w
^^^^^
% time cosf
0.2u 0.0s 0:00 75% 16+8k 4+0io 0pf+0w
^^^^

[[ This is a good one!  You've gotten tricked by one of the most common
optimization snares around.  For f77, "-O" defaults to "-O3", which
includes global optimization.  Presumably it also does dead code
elimination.  But what constitutes dead code?  Well, code that performs
calculations that aren't used.  How do you define a value that has been
used?  Ultimately, if it has an effect on the outside world, or more
specifically, if it or something it effects is written or printed.  Your
cosf.f does neither; "tmp" is never used for anything important, so your
entire program gets optimized away!  Add "print *,tmp" after the body of
the loop and cosf.f takes alot longer.  Interestingly enough, it still
appears to be quicker than cosc.c.  --wnl ]]

Atul Chhabra, Dept. of Electrical & Computer Engineering, ML 030,
University of Cincinnati, Cincinnati, OH 45221-0030.

voice: (513)556-4766  INTERNET: achhabra@ucesp1.ece.uc.edu [129.137.33.114]
                                OR achhabra@uceng.uc.edu [129.137.33.1]

chuck@trantor.harris-atd.com (Chuck Musciano) (04/19/89)

Our moderator muses: "Interestingly enough, it [computing a function
involving four calls to cos()] still appears to be quicker [in FORTRAN]
than cosc.c [the C version]".

The answer is that in FORTRAN, cos() is an intrinsic function, and the
compiler uses the direct floating point opcode.  In C, cos() is an
external function, and the compiler generates a call to an external named
"cos".

Compare the code generated for "sin(x) + cos(x)" in FORTRAN and C.  In
FORTRAN (a5@(-0x7ffc) is x):

        fcoss   a5@(-0x7ffc),fp0
        fsins   a5@(-0x7ffc),fp1
        faddx   fp1,fp0

In C (a6@(-12) and a6@(-16) are x):

        movl    a6@(-12),sp@-
        movl    a6@(-16),sp@-
        jbsr    _sin
        addqw   #8,sp
        movl    d0,a6@(-24)
        movl    d1,a6@(-20)
        movl    a6@(-12),sp@-
        movl    a6@(-16),sp@-
        jbsr    _cos
        addqw   #8,sp
        movl    d1,sp@-
        movl    d0,sp@-
        fmoved  sp@+,fp0
        faddd   a6@(-24),fp0

Does anyone know of an optimizer which will convert calls to the standard
math routines to their 68881 instruction counterparts?

Chuck Musciano
Advanced Technology Department
Harris Corporation
(407) 727-6131
ARPA: chuck@trantor.harris-atd.com

henry@uunet.uu.net (04/19/89)

>	do 10 i=1,262144
>		tmp=cos(2.5)*cos(2.5)*cos(2.5)*cos(2.5)

This one got hashed out at some length in comp.lang.c a while ago.  The
major reason for the difference in speed is as our moderator mentioned:
the Fortran compiler knows that cos() is a builtin function, notices that
the value is never used, and optimizes out the body of the loop.  C
compilers *could not* do this until recently, since it's only the
(imminent) ANSI C standard that has declared certain library functions to
be potentially built-in.

>... Add "print *,tmp" after the body of
>the loop and cosf.f takes alot longer.  Interestingly enough, it still
>appears to be quicker than cosc.c.  --wnl ]]

It may only be computing cos(2.5) once per loop, since the compiler is
entitled to notice the duplication.  Also, note that the Fortran version
is using single-precision arithmetic where the C one is using double.

	Henry Spencer at U of Toronto Zoology
	uunet!attcan!utzoo!henry henry@zoo.toronto.edu

cudcv@uunet.uu.net (Rob McMahon) (04/22/89)

In article <764@uceng.UC.EDU> you write:
>I chanced upon a segment of code that runs approximately 300 times faster
>in FORTRAN than in C ...
>
>Timings on Sun3(OS3.5):
>
>% time cosc
>55.6u 1.0s 1:49 51% 24+8k 12+1io 0pf+0w
>^^^^^
>% time cosf
>0.2u 0.0s 0:00 75% 16+8k 4+0io 0pf+0w
>^^^^
>
>[[ ... [valid comments about results being optimised away when not used,
>because of the higher level of optimisation used by f77] ... --wnl ]]

The compiler might also be realising that cos returns the same value for
the same argument.  Here's a script using gcc, given the following
definition of cos, run on a Sun3/160 SunOS 4.0, and using gcc 1.34:

inline static const double
cos(double x)
{
    double result;
    asm("fcosx %1,%0" : "=f" (result) : "f" (x));
    return (result);
}

Script started on Wed Mar 29 22:06:38 1989
cudcv (6) >> cat x.c
#include <math.h>

main()
{
    int i;
    float tmp;

    for(i=0;i<262144;i++)
        tmp=cos(2.5)*cos(2.5)*cos(2.5)*cos(2.5);
}
cudcv (7) >> gcc -O x.c
cudcv (8) >> time ./a.out
0.2u 0.0s 0:00 141% 0+8k 0+0io 0pf+0w
cudcv (9) >> ed x.c
126
/}/i
    printf("%f\n", tmp);

stevo@elroy.jpl.nasa.gov (Steve Groom) (05/06/89)

chuck@trantor.harris-atd.com (Chuck Musciano) writes:
>Does anyone know of an optimizer which will convert calls to the standard
>math routines to their 68881 instruction counterparts?

/usr/lib/inline.  Try "cc -c foo.c /usr/lib/f68881/libm.il".  Better yet,
take a look at the Sun Floating Point Programmer's Guide (in the 4.0
manual set).

-steve

Steve Groom, Jet Propulsion Laboratory, Pasadena, CA 91109
Internet: stevo@elroy.jpl.nasa.gov   UUCP: {ames,cit-vax}!elroy!stevo
Disclaimer: (thick German accent) "I know noothingg! Noothingg!"

stpeters@uunet.uu.net (05/06/89)

In article <8903221250.AA03853@trantor.harris-atd.com> you write:
>Does anyone know of an optimizer which will convert calls to the standard
>math routines to their 68881 instruction counterparts?

In effect, yes.  The mechanism Sun uses is to have in-line code (.il)
templates that cause the compiler to generate in-line code in place of the
math library calls.

If you're running SunOS 3.X:

	cc -c -f68881 program.c /usr/lib/f68881.il
	cc -o program -f68881 program.o -lm

If you're running 4.0, it's even easier:

	cc -o program program.c /usr/lib/f68881/libm.il

Substitute soft, switch, or fpa for f68881 as appropriate.

These can make differences of HUGE factors in how fast floating point
programs run.  That floating point manual really is worth reading :-)

Dick St.Peters
GE Corporate R&D, Schenectady, NY
stpeters@dawn.crd.ge.com
uunet!steinmetz!dawn!stpeters

paul@uunet.uu.net (Paul Hudson) (05/06/89)

Whatever the question, the answer is GNU!

Using GNU gcc wonderful extended asm statement + inline functions it's
quite possible to have cos just generate the fp opcode. A header file to
do it was recently posted to bug-gcc.  I think I have it squirreled away
somewhere ....

Paul Hudson 

Snail mail: Monotype ADG	Email:	...!ukc!acorn!moncam!paul
	    Science Park,		paul@moncam.co.uk
	    Milton Road,
	    Cambridge,
	    CB4 4FQ

joerg%berthold.UUCP%TUB.BITNET@mitvma.mit.edu (Joerg Schilling) (05/06/89)

The sun C compiles does convert calls to the *external* c function sin() ...
into 68881 routines if you tell him to use the inline expander.

Try the folling c program (t.c):

extern double sin();

double t()
{
        return sin(123.0);
}


use the command :
joerg > cc -v -O2 -f68881 -S /usr/lib/f68881/libm.il t.c
..
Verbose output removed.
joerg > cat t.s
..
Several lines removed.
        .text
        .globl  _t
_t:
|#PROLOGUE# 0
        link    a6,#-8
|#PROLOGUE# 1
        movl    L2000000+4,sp@-
        movl    L2000000,sp@-
        fsind   sp@,fp0
        fmoved  fp0,sp@
        movl    sp@+,a6@(-8)
        movl    sp@+,a6@(-4)
        movl    a6@(-8),d0
        movl    a6@(-4),d1
|#PROLOGUE# 2
        unlk    a6
|#PROLOGUE# 3
        rts
L2000000:
        .long   1079951360,0
joerg >

As you see the call to the *external* function is converted into the
correct mc68881 code.  See the manual of inline(1) for more information.

        J. Schilling
        H. Berthold AG
        Teltowkanalstr. 1-4
        D 1000 Berlin 46
        +49 30  7795 - 400


joerg@berthold.de
.. tub!berthold!joerg
.. unido!berthold!joerg

khb@sun.com (Keith Bierman - SPD Languages Marketing -- MTS) (05/10/89)

I'm sorry for the very tardy reply, but I was out of town, my machine (and
office) were moved, and my dog ate my transmittal :>

chuck@trantor.harris-atd.com (Chuck Musciano) writes:
>X-Sun-Spots-Digest: Volume 7, Issue 230, message 1 of 16
>
>Compare the code generated for "sin(x) + cos(x)" in FORTRAN and C.  In
>FORTRAN (a5@(-0x7ffc) is x):
>
>        fcoss   a5@(-0x7ffc),fp0
>        fsins   a5@(-0x7ffc),fp1
>        faddx   fp1,fp0
>

One can do better. Here is some source:

      real x
      x = 10.0
      y = sin(x)+cos(x)
      print*,y
      end

and we compile (f77v1.2)

f77 -O3 -f68881 -S

.... stuff removed

|#PROLOGUE# 1
	fmoves	L1D20,fp7
	fsincosx	fp7,fp1:fp0
	faddx	fp1,fp0
	fmoves	fp0,VAR_SEG1+4
	pea	v.16
	jbsr	_s_wsle


since we can compute sin and cos together with a big savings.


Don't forget to use the inline library

	/usr/lib/f77/f68881/libm.il

for your floating point needs.


Cheers.
-- 
Keith H. Bierman          |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault         |	Marketing Technical Specialist 
-- I Voted for Bill & Opus|   Languages and Performance Tools. 
(* strange as it may seem, I do more engineering now     *)

igp@cambridge-consultants.co.uk (Ian Phillipps) (05/11/89)

chuck@trantor.harris-atd.com (Chuck Musciano) writes:

>Compare the code generated for "sin(x) + cos(x)" in FORTRAN and C.  In
>FORTRAN (a5@(-0x7ffc) is x):

[ 3 x 68881 instructions ]

>In C (a6@(-12) and a6@(-16) are x):

[ 13 x 68020 + 1 x 68881 instructions]

>Does anyone know of an optimizer which will convert calls to the standard
>math routines to their 68881 instruction counterparts?

Sun provide one! (/usr/lib/inline)

Using
	cc -S -O -f68881 trig.c /usr/lib/f68881.il

We get:

	movl    a6@(12),sp@-	; x
	movl    a6@(8),sp@-
	fcosd   sp@,fp0
	fmoved  fp0,sp@
	movl    sp@+,a6@(-8)
	movl    sp@+,a6@(-4)
	movl    a6@(12),sp@-
	movl    a6@(8),sp@-
	fsind   sp@,fp0
	fmoved  fp0,sp@
	movl    sp@+,d0
	movl    sp@,d1
	movl    d0,sp@-
	fmoved  sp@+,fp0
	faddd   a6@(-8),fp0
	fmoved  fp0,a6@(16)	; assign result

Not very optimised, but at least it does fcosd!
[Note - C gives you DCOS by default]
-- 
UUCP:  igp@camcon.co.uk   | Cambridge Consultants Ltd  |  Ian Phillipps
or:    igp@camcon.uucp    | Science Park, Milton Road  |-----------------
Phone: +44 223 420024     | Cambridge CB4 4DW, England |