holmer@ucbarpa.Berkeley.EDU (Bruce K. Holmer) (01/18/91)
I copied a number crunching application of mine from an '030 Next cube to my new '040 Nextstation, and was shocked by the relatively poor performance. After some experimentation, I've found the cause: the '040 floating point unit does not support the entire 68882 instruction set in hardware, so the unimplemented instructions must be done in software. That's fine with me, since Motorola assured us (IEEE Micro, February 1990, p. 77) that:

    A software emulator of all unimplemented instructions is available
    from Motorola.... Execution time of the software emulation for
    elementary functions including all trap overhead (running on a
    25 MHz 68040) is 13 to 130 percent faster than the equivalent
    instructions on the 68882 running at 25 MHz.

However, whatever turned up in the Nextstation is certainly not what Motorola was promising (whether the blame is Motorola's or Next's I don't know).

For your amusement here is a small assembly language program:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% tmp.s %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
#NO_APP
gcc_compiled.:
.text
LC0:
	.even
.globl _main
_main:
	link a6,#0
	clrl d1
L61:
	fmovex #0r0.5,fp0
	fQQQx fp0,fp0
	addql #1,d1
	cmpl #999999,d1
	jle L61
	unlk a6
	rts
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Here are the timings (using /bin/time) for different QQQ's:

Next cube (25 MHz '030)
-----------------------
QQQ       user (sec.)   sys (sec.)
<none>        1.8          0.0      (instruction removed)
sqrt          4.5          0.1      (square root)
cos          13.9          0.4      (cosine)
etox         19.8          0.7      (e to the x power)
int           2.8          0.0      (integer part)
intrz         2.8          0.1      (integer part/round to zero)

Nextstation (25 MHz '040)
-------------------------
QQQ       user (sec.)   sys (sec.)
<none>        0.4          0.0      (instruction removed)
sqrt          4.4          0.0      (square root)
cos           0.8         27.6      (cosine)
etox          0.9         27.0      (e to the x power)
int           0.9         82.9      (integer part)
intrz         1.0         81.9      (integer part/round to zero)

Note that sqrt is implemented in hardware on the '040 (I threw it in for a reality check).
Also, I ran each case only once, so I didn't average out the variations in the timing. However, the numbers do make the point that the emulation is done at great expense on the Nextstation. The real shock is the float-to-integer conversion (fint/fintrz, roughly 2000 cycles per instruction!). That's the one that hurt my application.

I do not know if the Nextstep 2.0 C compiler still uses fetoxx, fintrzx, etc. (it may use subroutine calls to faster emulation software), but my alarm is still valid for programs that are copied over as binaries or Sun executables that are converted using atom.

Can someone clarify this situation? Will it be fixed soon?

--Bruce Holmer
holmer@ucbarpa.berkeley.edu