finn@batcomputer.tn.cornell.edu (Lee Samuel Finn) (02/15/91)
Ok, I've just finished upgrading to OS 2.0 and the 68040 logic board. Recalling some discussion earlier, which was never resolved on the net, I wrote a stupid program to exercise some transcendental arithmetic and then timed the puppy:

#include <math.h>
#define n 10000
#define m 100
#define pi 3.141592653589

void test2(double *a) { return; }

int main()
{
    int i, j;
    double a[n];

    for (j = 0; j < m; j++) {
        for (i = 0; i < n; i++)
            a[i] = sin(i*pi/n);
        test2(a);
    }
    return 0;
}

The results are astonishing:

% cc -O testc.c -o testc -lm
% time testc
4.8u 27.7s 0:38 84% 0+0k 3+0io 0pf+0w

Nearly 86% of the execution time of this program is system, and only 14% is user. If the call to sin() is changed to sqrt(), the statistics are _very_ different:

% cc -O testc.c -o testc -lm
% time testc
9.0u 0.0s 0:09 99% 0+0k 1+0io 0pf+0w

Now, all the execution time is user time. Why is all the system time spent doing a sin()?

Just fyi, the same prog (with the call to sin()) on a Sun SPARC 1+ runs in 16.1 sec of user time, a factor of 2 faster; but with a call to sqrt() it runs in 23.7 sec of user time, a factor of 3 _slower_. Just what is going on in doing transcendental functions on the 68040?

Please reply directly to me, as I am splitting tomorrow morning and won't be able to see news for about a week. Thanks,

lsf@astrosun.tn.cornell.edu

P.S. If this or a similar posting has been seen in the last 24 hours, apologies; I tried posting once before and got a bunch of errors that suggested we were out of space to post messages, so I'm trying again now.
cdl@chiton.ucsd.edu (Carl Lowenstein) (02/15/91)
In article <1991Feb14.160749.19048@batcomputer.tn.cornell.edu> finn@batcomputer.tn.cornell.edu (Lee Samuel Finn) writes:
>Ok, I've just finished upgrading to OS 2.0 and the 68040 logic board.
>I wrote a stupid program to exercise some transcendental
>arithmetic and then timed the puppy:
>4.8u 27.7s 0:38 84% 0+0k 3+0io 0pf+0w
>Nearly 86% of the execution time of this program is system, and only
>14% is user. If the call to sin() is changed to sqrt(), the statistics
>are _very_ different:
>9.0u 0.0s 0:09 99% 0+0k 1+0io 0pf+0w
>Now, all the execution time is user time.
>Why all the system time spent doing a sin()?

1) The 68030 processor does not have built-in floating point instructions, so its floating point is handled by a 68882 FPU, which has built-in transcendental functions that are pretty fast.

2) The 68040 processor has built-in floating point instructions, which do not include the transcendentals; those must instead be handled by kernel exception traps, which are slow and show up as system time.

One hopes that eventually libm.a will be re-written to not call the built-in transcendentals of the 68882, computing transcendental functions using only the floating point instructions native to the 68040, and thus avoiding the exception traps.

--
carl lowenstein     marine physical lab     u.c. san diego
{decvax|ucbvax} !ucsd!mpl!cdl   cdl@mpl.ucsd.edu   clowenstein@ucsd.edu
news@NeXT.COM (news) (02/15/91)
In article <1991Feb14.160749.19048@batcomputer.tn.cornell.edu> finn@batcomputer.tn.cornell.edu (Lee Samuel Finn) writes:
> [ Text deleted ]
> Why all the system time spent doing a sin()? Just fyi, the same prog
> lsf@astrosun.tn.cornell.edu

On our 68030 cubes with the 68882 FPU, all of the floating point arithmetic instructions were handled on the 68882. For the 68040, Motorola decided to incorporate a subset of the 68882 on chip. Following the RISC philosophy, they didn't dump the whole 68882 on chip, but rather concentrated on the most-used instructions that they could make the fastest, and put those on the '040. The instructions that weren't put on the chip (transcendentals, mostly) are trapped by the '040 into the kernel, where they are handled by floating point software that emulates each instruction. The time spent inside the kernel doing sin(), cos(), atanh(), etc. is accrued to your task's system time, which is what you noticed. Last time I timed the transcendentals on the 68040, they were faster than the instructions run in hardware on the 68882.

The '040 floating point hardware instructions are:

    FADD, FSUB, FMUL, FDIV, FSQRT, FMOVE (most addressing modes),
    FABS, FNEG, FMOVEM, FCMP, FSAVE, FScc, FDBcc, FBcc, FRESTORE

The '040 floating point software instructions are:

    FACOS, FASIN, FATAN, FATANH, FCOS, FCOSH, FETOX, FETOXM1,
    FGETEXP, FGETMAN, FINT, FINTRZ, FLOG10, FLOG2, FLOGN, FLOGNP1,
    FMOD, FMOVECR, FREM, FSCALE, FSIN, FSINCOS, FSINH, FTAN, FTANH,
    FTENTOX, FTWOTOX

Hope this helps.

--morris
Morris Meyer                NeXT, Inc.
Software Engineer           900 Chesapeake Drive
NeXT OS Group               Redwood City, CA  94063
mmeyer@next.com
keen@ee.ualberta.ca (Jeff '876393' Keen) (02/15/91)
In article <SCOTT.91Feb14212146@erick.gac.edu> scott@erick.gac.edu (Scott Hess) writes:
>In article <702@chiton.ucsd.edu> cdl@chiton.ucsd.edu (Carl Lowenstein) writes:
>    One hopes that eventually libm.a will be re-written to not call
>    the built-in transcendentals of the 68882, computing transcendental
>    functions using only the floating point instructions native to the
>    68040, and thus avoiding the exception traps.
>
>Hmm. That is a very good idea. Using shared libraries, NeXT could
>presumably fix this up so that it would work under either the '030
>or the '040 by simply (yes, that is "simply" :-) using a different
>shared library depending on the processor on the machine.
>
>I would recommend just that. Actually changing libm.a without such
>a scheme would more than likely break on an '030 machine, which is
>something none of us want. Well, I guess it would just run slooowwww.

No, that is not a good idea. It may seem like one, but it is this way for a reason, and it is the same reason on any of the 680x0 chips. Any instruction that is legal but not implemented causes a trap to a vector. If other hardware is designed to implement those features (like the 68882), the trap can be used to hand off the instruction to the extra hardware. This speeds up the whole affair without a change in any software. Anyway, the 68040 executes the floating point faster, so the minimal overhead of the trap wouldn't change the time much. This is based on my personal knowledge of the 68040; if I had a spec sheet handy I could confirm it, but I don't, so I won't.
-- ----------------------------------------------------- Jeff Keen University of Alberta keen@bode@alberta -----------------------------------------------------
scott@erick.gac.edu (Scott Hess) (02/15/91)
In article <702@chiton.ucsd.edu> cdl@chiton.ucsd.edu (Carl Lowenstein) writes:
    One hopes that eventually libm.a will be re-written to not call
    the built-in transcendentals of the 68882, computing transcendental
    functions using only the floating point instructions native to the
    68040, and thus avoiding the exception traps.
Hmm. That is a very good idea. Using shared libraries, NeXT could
presumably fix this up so that it would work under either the '030
or the '040 by simply (yes, that is "simply" :-) using a different
shared library depending on the processor on the machine.
I would recommend just that. Actually changing libm.a without such
a scheme would more than likely break on an '030 machine, which is
something none of us want. Well, I guess it would just run slooowwww.
Later,
--
scott hess scott@gac.edu
Independent NeXT Developer GAC Undergrad
<I still speak for nobody>
"Tried anarchy, once. Found it had too many constraints . . ."
"Buy `Sweat 'n wit '2 Live Crew'`, a new weight loss program by
Richard Simmons . . ."
bchen@pooh.caltech.edu (02/15/91)
From article <1991Feb15.045235.5984@ee.ualberta.ca>, by keen@ee.ualberta.ca (Jeff '876393' Keen):
> to the extra hardware. This speeds up the hole affair without a
> change in any software. Anyway the 68040 executes the floating point
> faster, so the minimal overhead in the trap wouldn't change the time
> much. This is based on my personal knowledge of the 68040, if I had a
> spec sheet handy I could confirm this, but I don't so I won't.

The problem is that the 68040's floating point is 30%-100% slower than a 25 MHz 68882 when emulating transcendental functions. The following table shows the execution speed, on both the 68040 and the 68882, of some fairly common transcendental functions (I use them a lot), in units of kilo-operations per second:

    Operation       68040 (Kops/sec)   68882 (Kops/sec)
    fetoxx (exp)        36.092             49.448
    flognx (log)        37.290             45.510
    fsinx               35.361             64.281
    fcosx               35.950             64.641
    ftanx               37.323             53.201

I like the idea of providing different shared libraries depending on the processor. I see no reason why this would cause any problems; it requires no change in software at all if you don't compile the program with inline math. Or one may hope Motorola can provide a more efficient emulation library.

------------
Bing Chen
NeXT Mail: bchen@pooh.caltech.edu
toon@news.sara.nl (02/15/91)
In article <290@rosie.NeXT.COM>, news@NeXT.COM (news) writes:
> In article <1991Feb14.160749.19048@batcomputer.tn.cornell.edu>
> finn@batcomputer.tn.cornell.edu (Lee Samuel Finn) writes:
>> [ Text deleted ]
>> Why all the system time spent doing a sin()? Just fyi, the same prog
>> lsf@astrosun.tn.cornell.edu
> [... stuff explaining that the software-emulated 68882 instructions
>  are handled by traps on the 68040 deleted ...]
> Hope this helps.

Sorts of. Is trapping these instructions really faster than implementing sin(), cos(), tan() and friends as run-time library subroutines in the (Objective-) C(++) compiler? I would doubt it.

--
Toon Moene, SARA - Amsterdam (NL)
Internet: TOON@SARA.NL
/usr/lib/sendmail.cf: Do.:%@!=/
gessel@ilium.cs.swarthmore.edu (Daniel Mark Gessel) (02/16/91)
Sin is simulated in software; sqrt is implemented in hardware. It was an issue of space on the '040. They ran a lot of traces of what instructions were used, and chose to fit sqrt (and, I assume, multiply, add, subtract, divide, and maybe a few others) on chip. The 68882 has more operations on chip, but the speedup gained by having the FP unit on chip apparently makes up for that in a big way, even though some operations are not implemented in hardware.

Don't take this info as gospel; it's what I remember from a talk given by the 68040 project leader.

Dan
--
Daniel Mark Gessel          Internet: gessel@cs.swarthmore.edu
I do not speak (nor type) representing Swarthmore College.
sef@kithrup.COM (Sean Eric Fagan) (02/17/91)
In article <1991Feb15.135340.2808@news.sara.nl> toon@news.sara.nl writes:
>Sorts of. Is trapping these instructions really faster than implementing
>sin(), cos(), tan() and friends as run time library subroutines to the
>(Objective-) C(++) compiler ? I would doubt it.

Very good! Perhaps you noticed the comment in this thread, a while ago, about 68040 fp-intensive programs being faster *when recompiled*?

Or aren't you aware that there is no easy way to replace a single instruction, e.g., fsin?

--
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.
cdl@chiton.ucsd.edu (Carl Lowenstein) (02/18/91)
In article <1991Feb17.113656.5876@kithrup.COM> sef@kithrup.COM (Sean Eric Fagan) writes:
>In article <1991Feb15.135340.2808@news.sara.nl> toon@news.sara.nl writes:
>>Sorts of. Is trapping these instructions really faster than implementing
>>sin(), cos(), tan() and friends as run time library subroutines to the
>>(Objective-) C(++) compiler ? I would doubt it.
>
>Very good! Perhaps you noticed the comment in this thread, a while ago,
>about 68040 fp-intensive programs being faster *when recompiled*?
>
>Or aren't you aware that there is no easy way to replace a single
>instruction, e.g., fsin?

I don't understand the point you are making here. If there does exist a math library that uses only fp instructions native to the 68040, then recompiling (or at least relinking) should make fp-intensive programs run faster. But the word from Morris Meyer at NeXT is that they didn't do it this way, choosing instead to trap the unimplemented instructions and interpret them in the kernel. This does make it possible to run 68030/68882 programs without relinking. Conversely, it makes it possible to develop fp programs on a 68040 system and have them work on the 68030/68882 with no changes.

My proposal for a compromise solution is a math library that contains two versions of each affected transcendental routine: one that is a stub calling the 68882 hardware, and one using only 68040 instructions. The 68882 version can be the default. If it traps as an unimplemented instruction, the kernel trap handler can flip a software switch so that all subsequent calls to the routine get the 68040 version. At the moment I am not sufficiently fluent in 68040 to do this, but I have seen very similar things done for PDP-11 floating point.

--
carl lowenstein     marine physical lab     u.c. san diego
{decvax|ucbvax} !ucsd!mpl!cdl   cdl@mpl.ucsd.edu   clowenstein@ucsd.edu
eps@toaster.SFSU.EDU (Eric P. Scott) (02/18/91)
In article <712@chiton.ucsd.edu> cdl@chiton (Carl Lowenstein) writes:
>I don't understand what point you are making here. If there does exist
>a math library that uses only fp instructions that are native to the
>68040, then recompiling (or at least relinking) should make fp-intensive
>programs run faster.

No, NeXT changed the compiler (as described in the online Release Notes).

Anyone care to post benchmarks using the 68040-ONLY -mieee-math option?

					-=EPS=-
bchen@owl.caltech.edu (Bing-Qing Chen) (02/18/91)
In article <1331@toaster.SFSU.EDU> eps@cs.SFSU.EDU (Eric P. Scott) writes:
>No, NeXT changed the compiler (as described in the online Release
>Notes).
>
>Anyone care to post benchmarks using the 68040-ONLY -mieee-math
>option?
>
>	-=EPS=-

There is no change in performance at all using the -mieee-math flag. The only difference between the 2.0 compiler and the 1.0 compiler that caused a performance increase is that the 2.0 compiler no longer generates the fintrzx instruction, which is terribly slow on the 68040.

- Bing Chen
madler@pooh.caltech.edu (Mark Adler) (02/18/91)
Bing Chen notes: >> The only difference between 2.0 compiler and 1.0 compiler that caused some >> performance increase is that 2.0 compiler no longer generates fintrzx >> instructions which is terribly slow on 68040. Though the compiler no longer generates it, methinks it still lurks in the library, since floor() on the 040 takes twice as long as it does on the 030 (both running 2.0). Mark Adler madler@pooh.caltech.edu