[comp.sys.next] 68040 Math / Why so much system time?

finn@batcomputer.tn.cornell.edu (Lee Samuel Finn) (02/15/91)

Ok, I've just finished upgrading to OS 2.0 and the 68040 logic board.
Recalling some discussion earlier, which was never resolved on the
net, I wrote a stupid program to exercise some transcendental
arithmetic and then timed the puppy:

#include <math.h>
#define n 10000
#define m 100
#define pi 3.141592653589

void test2(double *a) { return; }

int main() {
    int i, j;
    double a[n];

    for (j = 0; j < m; j++) {
        for (i = 0; i < n; i++) a[i] = sin(i*pi/n);
        test2(a);
    }
    return 0;
}

The results are astonishing:

% cc -O testc.c -o testc -lm
% time testc
4.8u 27.7s 0:38 84% 0+0k 3+0io 0pf+0w

About 85% of the execution time of this program is system time, and
only 15% is user time. If the call to sin() is changed to sqrt(), the
statistics are _very_ different:

% cc -O testc.c -o testc -lm
% time testc
9.0u 0.0s 0:09 99% 0+0k 1+0io 0pf+0w

Now, all the execution time is user time.

Why is all that system time spent doing a sin()? Just FYI, the same
program (with the call to sin()) on a Sun SPARCstation 1+ runs in 16.1
sec of user time, a factor of 2 faster, but with the call to sqrt() it
runs in 23.7 sec of user time, a factor of roughly 3 _slower_. Just
what is going on in doing transcendental functions on the 68040?

Please reply directly to me, as I am splitting tomorrow morning and
won't be able to see news for about a week. 

Thanks,

lsf@astrosun.tn.cornell.edu

P.S. If this or a similar posting has been seen in the last 24 hours,
apologies; I tried posting once before and got a bunch of errors
suggesting we were out of space to post messages, so I'm trying again now.

cdl@chiton.ucsd.edu (Carl Lowenstein) (02/15/91)

In article <1991Feb14.160749.19048@batcomputer.tn.cornell.edu> finn@batcomputer.tn.cornell.edu (Lee Samuel Finn) writes:
>Ok, I've just finished upgrading to OS 2.0 and the 68040 logic board.
> I wrote a stupid program to exercise some transcendental
>arithmetic and then timed the puppy:

>4.8u 27.7s 0:38 84% 0+0k 3+0io 0pf+0w

>Nearly 86% of the execution time of this program is system, and only
>14% is user. If the call to sin() is changed to sqrt(), the statistics
>are _very_ different:

>9.0u 0.0s 0:09 99% 0+0k 1+0io 0pf+0w

>Now, all the execution time is user time.

>Why all the system time spent doing a sin()? 

1) the 68030 processor does not have built-in floating point instructions
    therefore its floating point is handled by a 68882 FPU
      which has built-in transcendental functions
        which are pretty fast.
2) the 68040 processor has built-in floating point instructions
     which do not include transcendental functions
       which must then be handled by kernel exception traps
         which are slow and take up system time.

One hopes that eventually libm.a will be re-written to not call
the built-in transcendentals of the 68882, computing transcendental
functions using only the floating point instructions native to the
68040, and thus avoiding the exception traps.

-- 
        carl lowenstein         marine physical lab     u.c. san diego
        {decvax|ucbvax} !ucsd!mpl!cdl                 cdl@mpl.ucsd.edu
                                                  clowenstein@ucsd.edu

news@NeXT.COM (news) (02/15/91)

In article <1991Feb14.160749.19048@batcomputer.tn.cornell.edu>  
finn@batcomputer.tn.cornell.edu (Lee Samuel Finn) writes:
> [ Text deleted ] 
> Why all the system time spent doing a sin()? Just fyi, the same prog
> lsf@astrosun.tn.cornell.edu
> 
On our 68030 cubes with the 68882 FPU, all of the floating point arithmetic
instructions were handled on the 68882.  For the 68040, Motorola decided to 
incorporate a subset of the 68882 on chip.  In following the RISC philosophy,
they didn't dump the whole 68882 on chip, but rather concentrated on the most
used instructions that they could make the fastest, and put those on the 040.

The instructions that weren't put on the chip (transcendentals mostly) are
trapped by the '040 into the kernel, where they are handled by floating point
software that emulates that instruction.  The time spent inside the kernel
doing sin(), cos(), atanh(), etc. is accrued to your task's system time, 
which is what you noticed.  Last time I timed the transcendentals on the 68040,
they were faster than the instructions run in hardware on the 68882.

The 040 floating point hardware instructions are
	FADD, FSUB, FMUL, FDIV, FSQRT, FMOVE (most addressing modes), 
	FABS, FNEG, FMOVEM, FCMP, FSAVE, FScc, FDBcc, FBcc, FRESTORE

The 040 floating point software instructions are
	FACOS, FASIN, FATAN, FATANH, FCOS, FCOSH, FETOX, FETOXL, FGETEXP,
	FGETMAN, FINT, FINTRZ, FLOG10, FLOG2, FLOGN, FLOGNP1, FMOD, FMOVECR,
	FREM, FSCALE, FSIN, FSINCOS, FSINH, FTAN, FTANH, FTENTOX, FTWOTOX

Hope this helps.

		--morris

Morris Meyer          NeXT, Inc.
Software Engineer     900 Chesapeake Drive
NeXT OS Group         Redwood City, CA   94063
mmeyer@next.com

keen@ee.ualberta.ca (Jeff '876393' Keen) (02/15/91)

In article <SCOTT.91Feb14212146@erick.gac.edu> scott@erick.gac.edu (Scott Hess) writes:
>In article <702@chiton.ucsd.edu> cdl@chiton.ucsd.edu (Carl Lowenstein) writes:
>   One hopes that eventually libm.a will be re-written to not call
>   the built-in transcendentals of the 68882, computing transcendental
>   functions using only the floating point instructions native to the
>   68040, and thus avoiding the exception traps.
>
>Hmm. That is a very good idea.  Using shared libraries, NeXT could
>presumably fix this up so that it would work under either the '030
>or the '040 by simply (yes, that is "simply" :-) using a different
>shared library depending on the processor on the machine.
>
>I would recommend just that.  Actually changing libm.a without such
>a scheme would more than likely break on an '030 machine, which is
>something none of us want.  Well, I guess it would just run slooowwww.

No.  That is not a good idea.  It may seem like one, but it is that way
for a reason, the same reason it is this way on any of the 680x0 chips.
Any instruction that is legal but not implemented causes a trap to a
vector.  If other hardware is designed to implement these features
(like the 68882), the trap can be used to hand the instruction off to
the extra hardware.  This speeds up the whole affair without a change
in any software.  Anyway, the 68040 executes the floating point faster,
so the minimal overhead in the trap wouldn't change the time much.
This is based on my personal knowledge of the 68040; if I had a spec
sheet handy I could confirm it, but I don't, so I won't.

-- 
-----------------------------------------------------
Jeff Keen    University of Alberta
keen@bode@alberta                  
-----------------------------------------------------

scott@erick.gac.edu (Scott Hess) (02/15/91)

In article <702@chiton.ucsd.edu> cdl@chiton.ucsd.edu (Carl Lowenstein) writes:
   One hopes that eventually libm.a will be re-written to not call
   the built-in transcendentals of the 68882, computing transcendental
   functions using only the floating point instructions native to the
   68040, and thus avoiding the exception traps.

Hmm. That is a very good idea.  Using shared libraries, NeXT could
presumably fix this up so that it would work under either the '030
or the '040 by simply (yes, that is "simply" :-) using a different
shared library depending on the processor on the machine.

I would recommend just that.  Actually changing libm.a without such
a scheme would more than likely break on an '030 machine, which is
something none of us want.  Well, I guess it would just run slooowwww.

Later,
--
scott hess                      scott@gac.edu
Independent NeXT Developer	GAC Undergrad
<I still speak for nobody>
"Tried anarchy, once.  Found it had too many constraints . . ."
"Buy `Sweat 'n wit '2 Live Crew'`, a new weight loss program by
Richard Simmons . . ."

bchen@pooh.caltech.edu (02/15/91)

From article <1991Feb15.045235.5984@ee.ualberta.ca>, by keen@ee.ualberta.ca (Jeff '876393' Keen):
> 
> to the extra hardware.  This speeds up the whole affair without a
> change in any software.  Anyway the 68040 executes the floating point
> faster, so the minimal overhead in the trap wouldn't change the time
> much.  This is based on my personal knowledge of the 68040, if I had a
> spec sheet handy I could confirm this, but I don't so I won't.

The problem is that the 68040's floating point is 30%-100% slower than
that of a 25 MHz 68882 when emulating transcendental functions. The
following table shows the execution speed of both the 68040 and the
68882 for some fairly common transcendental functions (I use them a
lot), in units of kilo-operations per second.

Operation       68040 (Kops/sec)   68882 (Kops/sec)
fetoxx (exp)    36.092             49.448
flognx (log)    37.290             45.510
fsinx           35.361             64.281
fcosx           35.950             64.641
ftanx           37.323             53.201

I like the idea of providing different shared libraries depending on
the processor. I see no reason why this would cause any problems; it
requires no change in software at all if you don't compile the program
with inline math. Or one may hope Motorola can provide a more efficient
emulation library.

------------
Bing Chen
NeXT Mail: bchen@pooh.caltech.edu

toon@news.sara.nl (02/15/91)

In article <290@rosie.NeXT.COM>, news@NeXT.COM (news) writes:
> In article <1991Feb14.160749.19048@batcomputer.tn.cornell.edu>  
> finn@batcomputer.tn.cornell.edu (Lee Samuel Finn) writes:
>> [ Text deleted ] 
>> Why all the system time spent doing a sin()? Just fyi, the same prog
>> lsf@astrosun.tn.cornell.edu
>> 
[... explanation that the software-emulated 68882 instructions are
	handled by traps on the 68040 deleted ... ]
> Hope this helps.
Sort of. Is trapping these instructions really faster than implementing
sin(), cos(), tan() and friends as run-time library subroutines for the
(Objective-) C(++) compiler?  I would doubt it.
> 
> 		--morris
-- 

Toon Moene, SARA - Amsterdam (NL)
Internet: TOON@SARA.NL

/usr/lib/sendmail.cf: Do.:%@!=/

gessel@ilium.cs.swarthmore.edu (Daniel Mark Gessel) (02/16/91)

sin() is emulated in software; sqrt() is implemented in hardware. It
was an issue of die space on the '040. They ran a lot of traces of
which instructions were used, and chose to fit sqrt (and, I assume,
multiply, add, subtract, and divide, and maybe a few others) on chip.
The 68882 has more operations on chip, but the speedup gained by having
the FP unit on chip apparently makes up for that in a big way, even
though some operations are not implemented in hardware.

Don't take this info as gospel. It's what I remember from a talk I
heard from the 68040 project leader.

Dan
--
Daniel Mark Gessel
Internet: gessel@cs.swarthmore.edu
I do not speak (nor type) representing Swarthmore College.

sef@kithrup.COM (Sean Eric Fagan) (02/17/91)

In article <1991Feb15.135340.2808@news.sara.nl> toon@news.sara.nl writes:
>Sort of. Is trapping these instructions really faster than implementing
>sin(), cos(), tan() and friends as run time library subroutines to the
>(Objective-) C(++) compiler ? I would doubt it.

Very good!  Perhaps you noticed the comment in this thread, a while ago,
about 68040 fp-intensive programs being faster *when recompiled*?

Or aren't you aware that there is no easy way to replace a single
instruction, e.g., fsin?

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

cdl@chiton.ucsd.edu (Carl Lowenstein) (02/18/91)

In article <1991Feb17.113656.5876@kithrup.COM> sef@kithrup.COM (Sean Eric Fagan) writes:
>In article <1991Feb15.135340.2808@news.sara.nl> toon@news.sara.nl writes:
>>Sort of. Is trapping these instructions really faster than implementing
>>sin(), cos(), tan() and friends as run time library subroutines to the
>>(Objective-) C(++) compiler ? I would doubt it.
>
>Very good!  Perhaps you noticed the comment in this thread, a while ago,
>about 68040 fp-intensive programs being faster *when recompiled*?
>
>Or aren't you aware that there is no easy way to replace a single
>instruction, e.g., fsin?

I don't understand what point you are making here.  If there does exist
a math library that uses only fp instructions that are native to the
68040, then recompiling (or at least relinking) should make fp-intensive
programs run faster.

But the word from Morris Meyer at NeXT is that they didn't do it this
way, choosing to trap the un-implemented instructions and interpret them
in the kernel.  This does make it possible to run 68030/68882 programs
without relinking.  Conversely, it makes it possible to develop fp
programs on a 68040 system and have them work on the 68030/68882 with
no changes.

My proposal for a compromise solution is a math library that contains
two versions of the affected transcendental routines, one which is a
stub to call the 68882 hardware, and one using 68040 instructions.  The
68882 version can be the default.  If this traps as an un-implemented
instruction, the kernel trap handler can change a software switch so
that all subsequent calls to the routine get the 68040 version.

At the moment, I am not sufficiently fluent in 68040 to do this, but I
have seen very similar things done for PDP-11 floating point.


-- 
        carl lowenstein         marine physical lab     u.c. san diego
        {decvax|ucbvax} !ucsd!mpl!cdl                 cdl@mpl.ucsd.edu
                                                  clowenstein@ucsd.edu

eps@toaster.SFSU.EDU (Eric P. Scott) (02/18/91)

In article <712@chiton.ucsd.edu> cdl@chiton (Carl Lowenstein) writes:
>I don't understand what point you are making here.  If there does exist
>a math library that uses only fp instructions that are native to the
>68040, then recompiling (or at least relinking) should make fp-intensive
>programs run faster.

No, NeXT changed the compiler (as described in the online Release
Notes).

Anyone care to post benchmarks using the 68040-ONLY -mieee-math
option?

					-=EPS=-

bchen@owl.caltech.edu (Bing-Qing Chen) (02/18/91)

In article <1331@toaster.SFSU.EDU> eps@cs.SFSU.EDU (Eric P. Scott) writes:
>
>No, NeXT changed the compiler (as described in the online Release
>Notes).
>
>Anyone care to post benchmarks using the 68040-ONLY -mieee-math
>option?
>
>					-=EPS=-

There is no change in performance at all using the -mieee-math flag.
The only difference between the 2.0 compiler and the 1.0 compiler that
caused some performance increase is that the 2.0 compiler no longer
generates the fintrzx instruction, which is terribly slow on the 68040.


- Bing Chen

madler@pooh.caltech.edu (Mark Adler) (02/18/91)

Bing Chen notes:

>> The only difference between the 2.0 compiler and the 1.0 compiler that
>> caused some performance increase is that the 2.0 compiler no longer
>> generates the fintrzx instruction, which is terribly slow on the 68040.

Though the compiler no longer generates it, methinks it still lurks in
the library, since floor() on the 040 takes twice as long as it does on
the 030 (both running 2.0).
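One way a library could compute floor() without an FINTRZ-style instruction is an ordinary double-to-integer conversion plus a fixup for negative non-integers. This sketch is only valid while the argument fits in a long; a full libm would cover the whole double range (e.g. via exponent manipulation):

```c
/* Sketch: floor() without FINTRZ, via double-to-long conversion
 * (which truncates toward zero) and a fixup for negatives.
 * Undefined for |x| beyond the range of long. */
double floor_noint(double x)
{
    long i = (long)x;               /* truncate toward zero */
    double t = (double)i;
    return (t > x) ? t - 1.0 : t;   /* e.g. -2.5: t = -2.0 > -2.5 -> -3.0 */
}
```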

Mark Adler
madler@pooh.caltech.edu