moses@hao.ucar.edu (Julie Moses) (09/02/90)
I have recently installed a math coprocessor in my Mega system and have
now linked my code with libraries supporting a 68881 chip in an ST.
While more complex math functions, such as transcendentals and square
root, really show a speed increase, adding or multiplying two floating
point numbers shows no speed increase over just doing it with the
68000.  As a matter of fact, some of my functions that do repetitive
multiplies and additions with floats (without more complex functions)
are significantly slower when using the math coprocessor.

Q- Can someone explain why?

My best guess is that the time the processor takes to transfer the
floating point numbers, plus the time spent waiting for the math
coprocessor to be ready to receive them, makes the simpler math
operations slower than just doing them inside the 68000.  I am using
the Prospero math libraries, which check once for the math chip, so
that one must link for either a 68000 with or without the math chip,
exclusively.

Julie Moses
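A minimal sketch of the kind of timing loop that exposes the asymmetry
described above, assuming only the standard C clock() routine; the loop
count and constants are arbitrary, and nothing here is specific to the
Prospero libraries.  Build it twice, once against each library, and
compare the tick counts:

#include <stdio.h>
#include <math.h>
#include <time.h>

#define N 20000L

int main(void)
{
    float a = 1.0001f, b = 1.0f;
    long i;
    clock_t t0, t1;

    t0 = clock();
    for (i = 0; i < N; i++)
        b = b * a;                     /* simple multiply           */
    t1 = clock();
    printf("multiply loop: %ld ticks (b=%g)\n", (long)(t1 - t0), b);

    t0 = clock();
    for (i = 0; i < N; i++)
        b = (float)sqrt(b + 2.0);      /* "complex" function        */
    t1 = clock();
    printf("sqrt loop    : %ld ticks (b=%g)\n", (long)(t1 - t0), b);

    return 0;
}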
kbad@atari.UUCP (Ken Badertscher) (09/03/90)
moses@hao.ucar.edu (Julie Moses) writes:
| While more complex math functions, such as transcendentals and
| square root, really show a speed increase, adding or multiplying two
| floating point numbers shows no speed increase over just doing it
| with the 68000.  As a matter of fact, some of my functions that do
| repetitive multiplies and additions with floats (without more
| complex functions) are significantly slower when using the math
| coprocessor.

Sounds to me like the libraries were not properly implemented to take
best advantage of the peripheral 68881.  In a series of tests that I
did while working on a homemade set of bindings for Megamax Laser C
some time ago, I sometimes noticed that the speedups weren't all that
sensational...  The timings I did were /never/ slower when using the
68881, though.  And the Megamax floating point routines are /fast/.
--
   |||   Ken Badertscher  (ames!atari!kbad)
   |||   Atari R&D System Software Engine
  / | \  #include <disclaimer>
muts@fysaj.fys.ruu.nl (Peter Mutsaers /100000) (09/03/90)
moses@hao.ucar.edu (Julie Moses) writes:
>Q- Can someone explain why?
>
> My best guess is that the time the processor takes to transfer the
>floating point numbers, plus the time spent waiting for the math
>coprocessor to be ready to receive them, makes the simpler math
>operations slower than just doing them inside the 68000.  I am using
>the Prospero math libraries, which check once for the math chip, so
>that one must link for either a 68000 with or without the math chip,
>exclusively.
>
>Julie Moses

Maybe the Prospero routines are not very fast, or they are only single
precision.

The 68881 works in 80 bits; in software, double precision generally
takes about 4 times longer than single precision.  In Turbo C, which
has the fastest floating point library available to my knowledge, the
80-bit software routines take 3 times as long as the 68881 does.  So
single precision software routines could well be a bit faster than the
68881.
--
Peter Mutsaers                     email: muts@fysaj.fys.ruu.nl
Rijksuniversiteit Utrecht                 nmutsaer@ruunsa.fys.ruu.nl
Princetonplein 5                   tel:   (+31)-(0)30-533880
3584 CG Utrecht, Netherlands
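The single- versus double-precision ratio of a given library is easy to
check by timing the same loop with both types; a sketch of the two
loops (time each with whatever clock routine is handy, loop counts and
constants arbitrary):

/* On a software-FP build, the first loop exercises the single
   precision multiply routine, the second the double precision one.   */
float single_loop(long n)
{
    float s = 1.0f, a = 1.00001f;
    long i;
    for (i = 0; i < n; i++)
        s *= a;                 /* single precision multiplies        */
    return s;
}

double double_loop(long n)
{
    double s = 1.0, a = 1.00001;
    long i;
    for (i = 0; i < n; i++)
        s *= a;                 /* double precision multiplies        */
    return s;
}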
t68@nikhefh.nikhef.nl (Jos Vermaseren) (09/03/90)
Some years ago I wrote my own set of floating point routines.  Then a
little later I got the chance to test a 68881.  My findings were that
single precision addition was not any faster on the 68881, but all the
other operations were.  And those were very good floating point
routines.

On the whole, however, the speed increase with the floating point chip
wasn't really spectacular.  Over the original Absoft library I
measured, on one of my computational programs, a factor of 4 with the
68881 (and a factor of 2 with the home-made library).  Most of the time
is lost in the transmission and the negotiations with the chip.  This
won't be the case as much on the TT, as it has a more direct channel.
A compiler that can use the floating point registers will also make an
enormous difference, because then nearly all transmission delays are
reduced by a factor of 4 to 5.

If you find that multiplication is faster without the chip, you have
either an incredibly fast FP library or an inefficient link to the
68881.

Jos Vermaseren
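To see where the transmission cost goes at the source level, compare a
per-operation binding style (each call ships operands to the 68881 and
fetches the result back through memory) with an expression that a
register-allocating FPU compiler can keep on the chip.  The binding
names below are invented for illustration; they are not from any
particular library:

/* Per-operation binding style: three calls, each moving two operands
   out to the 68881 and one result back into memory, so t1 and t2 make
   a needless round trip through RAM.                                  */
extern void fp_mul(float *dst, const float *x, const float *y);
extern void fp_add(float *dst, const float *x, const float *y);

float via_bindings(float a, float b, float c, float d)
{
    float t1, t2, r;
    fp_mul(&t1, &a, &b);
    fp_mul(&t2, &c, &d);
    fp_add(&r, &t1, &t2);
    return r;
}

/* A compiler that emits 68881 code directly can hold t1 and t2 in FP
   registers, so only the four inputs go out and one result comes
   back; this is where the reduction in transmission delays comes
   from.                                                               */
float via_inline_fpu(float a, float b, float c, float d)
{
    return a * b + c * d;
}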
moses@hao.ucar.edu (Julie Moses) (09/04/90)
| From (Ken Badertscher)
|
| Sounds to me like the libraries were not properly implemented to
| take best advantage of the peripheral 68881.  In a series of tests
| that I did while working on a homemade set of bindings for Megamax
| Laser C some time ago, I sometimes noticed that the speedups weren't
| all that sensational...  The timings I did were /never/ slower when
| using the 68881, though.  And the Megamax floating point routines
| are /fast/.

| From (Peter Mutsaers)
|
| Maybe the Prospero routines are not very fast, or they are only
| single precision.
|
| The 68881 works in 80 bits; in software, double precision generally
| takes about 4 times longer than single precision.  In Turbo C, which
| has the fastest floating point library available to my knowledge,
| the 80-bit software routines take 3 times as long as the 68881 does.
| So single precision software routines could well be a bit faster
| than the 68881.

The above messages and some others point to a solution to the question:
why are simpler F.P. math functions slower with the 68881 than with the
68000?

Ken and Peter,

Yes, the Prospero libraries are not highly optimized compared to Turbo
C or Megamax C, but they are a solid group of functions supported by a
good working environment.  However, I would wager that the Prospero
68881 libraries are faster than Megamax C's for two reasons: 1)
Prospero's check once, at program startup, for the 68881, while Laser C
looks for it every time it wants to do F.P. math; 2) Prospero comes
with two 68881 libraries, and the second has no error checking, which
eliminates some overhead (though you had better know what the ranges of
the solutions will be).

The solution is that I was comparing single precision (32 bit) F.P.
math done by the 68000 to 80 bit math done by the math coprocessor.
Complex F.P. functions, such as Tangent(x), are always faster when done
by the 68881 math coprocessor, but simple functions such as add,
subtract and multiply are <slower> because of the time taken: 1)
waiting for the 68881 chip to be ready to receive, 2) moving 32 bits to
the math chip, 3) the math chip converting the 32 bit floats to 80 bit
floats, 4) returning the solution back to the 68000.

I am doing my single precision F.P. math in Fortran subroutines and
linking them into my Pro-C.  Double precision F.P. math, such as done
by C, I would agree, is probably always slower than the math done by
the 68881.

Having looked at the Alcyon 68881 assembly source code, there does not
seem to be much one can do to further optimize the F.P. routines.
Prospero's are probably based on Alcyon's.  The TT's coprocessor will
probably run circles around any single precision done by a 68xxx CPU
(I hope).

Julie Moses
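One practical consequence of those four per-operation costs: anything
that keeps intermediate results from being narrowed back to 32 bits and
re-sent for the next operation helps the simple-operation case.  A
sketch in plain C (not a Prospero-specific call; the point is only the
shape of the loop):

/* With a per-operation binding, every add pays the wait/transfer/
   convert/return costs listed above, and the running sum is narrowed
   back to 32 bits each time.  Accumulating in the widest type inside
   one routine lets a 68881-aware compiler keep the sum on the chip
   and convert back only once at the end.                              */
float dot_product(const float *x, const float *y, long n)
{
    double acc = 0.0;               /* running sum kept at full width  */
    long i;
    for (i = 0; i < n; i++)
        acc += (double)x[i] * (double)y[i];
    return (float)acc;              /* one narrowing, at the end       */
}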
marten@tpki.toppoint.de (M.Feldtmann) (09/04/90)
In article <8379@ncar.ucar.edu> moses@hao.ucar.edu (Julie Moses) writes:
> While more complex math functions, such as transcendentals and
>square root, really show a speed increase, adding or multiplying two
>floating point numbers shows no speed increase over just doing it
>with the 68000.

Amazing!  Of course, for '+' or '-' there will not be much of a speed
increase (because of the data transfer via supervisor mode), but for
'*' or '/' you should expect at least a factor of about 2.  Perhaps the
library is not so good?

Marten

Marten Feldtmann, Eckernfoerder Str. 83, 2300 Kiel 1, West Germany
DNET/EUNET/USENET/SUBNET: marten@toppoint.de
Please keep your replies short - I have to pay for them
rehrauer@apollo.HP.COM (Steve Rehrauer) (09/04/90)
In article <989@nikhefh.nikhef.nl> t68@nikhefh.nikhef.nl (Jos Vermaseren) writes:
>If you find that multiplication is faster without the chip, you have
>either an incredibly fast FP library or an inefficient link to the
>68881.

It's also worth noting that the 68882 can overlap execution of f.p.
instructions, assuming there aren't any operand dependencies.  The
68040 (which doesn't implement the full '881/'882 instruction set, but
does do the "core" -- FMUL, FDIV, etc.) also overlaps, and (since it
implements the instructions directly itself) doesn't incur any of the
coprocessor overhead of an '881/'882.

In other words, there are good reasons not to do floating point in
software, even if you can shave a few clocks in a few instances on your
current hardware by doing so.
--
>>"Aaiiyeeee!  Death from above!"<<         | (Steve) rehrauer@apollo.hp.com
"Spontaneous human combustion - what luck!" | Apollo Computer (Hewlett-Packard)
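A small illustration of the operand-dependency point, in C.  Whether
overlap actually occurs depends on the compiler emitting back-to-back
FP instructions for an '882 or '040, so this is only a sketch of the
idea, not a guaranteed win:

/* In the chained form every multiply must wait for the previous
   result.  The split form keeps two independent partial products, so
   consecutive multiplies have no operand dependency and an '882/'040
   is free to overlap them.                                            */
double product_chained(const double *v, long n)
{
    double p = 1.0;
    long i;
    for (i = 0; i < n; i++)
        p *= v[i];                 /* each step depends on the last    */
    return p;
}

double product_split(const double *v, long n)
{
    double p0 = 1.0, p1 = 1.0;     /* two independent accumulators     */
    long i;
    for (i = 0; i + 1 < n; i += 2) {
        p0 *= v[i];                /* these two multiplies do not      */
        p1 *= v[i + 1];            /* depend on each other             */
    }
    if (i < n)
        p0 *= v[i];                /* odd element, if any              */
    return p0 * p1;
}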
leo@ehviea.ine.philips.nl (Leo de Wit) (09/10/90)
In article <8387@ncar.ucar.edu> moses@hao.ucar.edu (Julie Moses) writes:
| Having looked at the Alcyon 68881 assembly source code, there does
|not seem to be much one can do to further optimize the F.P. routines.
|Prospero's are probably based on Alcyon's.  The TT's coprocessor will
|probably run circles around any single precision done by a 68xxx CPU
|(I hope).

If you're looking for fast math, you could perhaps represent real
numbers by integer quotients; this is a common technique when precision
and/or range allow it (which is the case for a lot of real-life
applications).  Each real (or float if you want) is represented by a
pair of integers (choose either longs or shorts) whose quotient is
(approximately) the real.  Especially in cases where shorts can be used
for the representation (low precision), and on a processor that prefers
integers to floats, this can speed up calculations dramatically.

Another way to possibly increase performance is to put often-needed
function values in an array, e.g. for sin(x) calculate sin(i*pi/180)
for i = 0..90.  Intermediate values can then be found - for example -
by linear interpolation.

For an example of quotient calculation, you can take a look at the
sources of the 3D demo program I wrote a while ago.

Cheers,

    Leo.
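A minimal sketch of the table-plus-interpolation idea for sin(),
assuming angles in degrees in the range 0..90 (anything outside that
range needs the usual quadrant reductions, not shown); the table size
and linear interpolation are just what is suggested above:

#include <math.h>

#define PI 3.14159265358979

static float sintab[91];           /* sin() at one-degree steps, 0..90 */

void init_sintab(void)             /* fill the table once at startup   */
{
    int i;
    for (i = 0; i <= 90; i++)
        sintab[i] = (float)sin(i * PI / 180.0);
}

float fast_sin_deg(float deg)      /* deg assumed to lie in [0, 90]    */
{
    int   i = (int)deg;            /* lower table index                */
    float frac = deg - (float)i;   /* fraction of a degree             */

    if (i >= 90)
        return sintab[90];
    return sintab[i] + frac * (sintab[i + 1] - sintab[i]);
}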