[comp.sys.mac.programmer] How to Cosine: some results

cak3g@astsun7.astro.Virginia.EDU (Colin Klipsch) (11/11/89)

"I can see a time when there will be one in every city!"

                                          -- anonymous American mayor,
                                             on the future of the telephone
___________________________________________________________________________

(WARNING: Assembly language discussion ahead!  Run for your lives!!)

As you may recall, I was in the process of writing an assembly language
subroutine which does the following:

   - takes an arbitrary angle in milli-arcseconds (1,296,000,000 mas/circle)
     as a long integer in register D0
   - returns the sine or cosine of that angle in fract format (which is
     signed fixed point, 2b.30b) in D0

This subroutine will exist in the context of a library of 3D graphics
routines I'm writing for my own use in future applications.  (Although
anyone who is patient enough to wait for the finished product is welcome
to have it also.)  I present in this article some discussion and empirical
timing results in my quest to find the "perfect" sine and cosine routines.

First, since my previous article on this subject, I have changed my mind
on the angular units.  For simplicity of argument reduction, and to appeal
to a certain sense of binary math aesthetics (if there is such a thing),
I have decided to take the angle as a fraction of a circle in fract format.
Alternatively, you can think of this as a long integer, where the angular
units are 1/(2^30) of a circle.  (The fract format is documented in
Inside Macintosh IV under "Toolbox Utilities".)

You may be thinking that dividing a circle into 2^30 units is ridiculously
precise.  However, I may be using these 3D routines to write an astronomical
program (perhaps similar to Orion), and since the coordinates of stars
are measured to the nearest arcsecond, you need greater than arcsecond
accuracy to rotate those buggers freely without the graininess of your
angles showing through.  This format allows you to specify angles to
about the nearest 800th of an arcsecond.  (Good enough for me anyway.)
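
To make the units concrete, here is a throwaway C snippet (not part of
the library; the names are mine) which converts milli-arcseconds to
these 1/(2^30)-circle units and prints the resolution of one unit.
(Recall that fract simply means 30 fractional bits, so 1.0 is
0x40000000.)

   #include <stdio.h>

   #define MAS_PER_CIRCLE   1296000000.0    /* 360 * 3600 * 1000 */
   #define UNITS_PER_CIRCLE 1073741824.0    /* 2^30              */

   long mas_to_circle_units(double mas)     /* double only for clarity here */
   {
       return (long)(mas / MAS_PER_CIRCLE * UNITS_PER_CIRCLE + 0.5);
   }

   int main(void)
   {
       /* One unit is 1296000/2^30 arcsec, about 1/828 of an arcsecond. */
       printf("resolution = %f arcsec\n", 1296000.0 / UNITS_PER_CIRCLE);
       printf("45 degrees = %ld units\n",
              mas_to_circle_units(45.0 * 3600.0 * 1000.0));
       return 0;
   }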

Also, my coordinates will be specified by 64b integers (like the SANE
comp format), and for a point very distant from the origin, a rotation
of even this small angle can still move it many millions of units.  Thus,
I would like to make my angular divisions as small as is convenient in
order to reduce this problem.
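
(For instance, at a radius of 2^50 units, even the smallest
representable rotation, 2*pi/(2^30) radians, sweeps a point through
roughly 2*pi*(2^50)/(2^30) units, which is about 6.6 million.)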

So how to do this?  I have received a surprising number of responses,
and I am grateful for all your thoughts.  Many people provided source
samples in C, which, however well-intentioned, were unhelpful for
the most part.  Most of these routines used floating point numbers,
which I am trying to avoid for the sake of speed.  (Remember, I'm
trying to do some real-time animation.)  If my performance improvement
is only a factor of two or three above SANE, then I might as well
use SANE.  My goals are to avoid: 1) floating point math, 2) ROM calls,
3) anything more than a few multiplications, 4) any divisions by
non-powers of two.

Several people suggested using a method called CORDIC, which apparently
rotates a point around an axis directly, without explicitly computing
any sines or cosines.  I am intrigued by this, since all of the computer
graphics literature I have read to date says to set up a 3x3 or 4x4
matrix and multiply it by the given point.  If the CORDIC algorithm proves
to be faster, while maintaining my desired accuracy, then I will use it
instead.  However, I have doubts that CORDIC will suit my purpose, since
I never see it mentioned in the computer graphics literature on
3-dimensional rotation.

(I am not claiming to be an expert on these matters; merely describing
what I've personally researched.  I refer specifically to Foley & Van Dam,
Rogers, and Artwick.)

Let's assume then, for the moment, that we must calculate sines and cosines.
I finally realized that the only way to REALLY know the best method
was to roll up my sleeves and try everything.  These are the algorithms
currently under consideration:

SUBROUTINES FOR CALCULATING THE COSINE OF AN ANGLE:

I) Direct table lookup:  Yes, let's pretend we have pre-calculated the
cosine of every angle from 0 to 90 degrees.  All 268,435,456 of them.
Map the given angle to this range and just look up the answer.
Nice and fast.  Disadvantage?  Well, such a table would occupy over
a gigabyte of memory, and most people just don't seem to have that
much.  However, it is useful to include this in our tests,
since this, it seems to me, is the theoretically fastest way of
getting the cosine.  (A rough C sketch of the fold-and-index step
appears just after this list of routines.)

II) Table with quadratic interpolation:  Quadratic because linear
interpolation would require too many table entries for 30b accuracy.
Not cubic or higher because that requires dividing by non-powers
of two and doing many more multiplies.  Given our table entries are
separated by a distance h, use the Gregory-Newton formula:

   f(x) = f0 + t(f1 - f0) + (1/2)t(t-1)(f2 - 2f1 + f0)

   where f0 = f(x0), f1 = f(x0 + h), f2 = f(x0 + 2h), x0 is the
   nearest multiple of h under x, and t = (x-x0)/h

To speed up our calculations, let h be a power of two.  (h = 2^18 is
the largest you can get away with.)  The primary disadvantage of this
subroutine is that it requires a table with over a thousand entries,
which amounts to over 4kB.  I note also that this routine makes use
of the long multiply instructions on the 68020, which are unavailable
on the 68000.  (A C sketch of this interpolation step also appears
after the list.)

III) Call SANE:  Convert the angle to radians in SANE extended
format (80b), call FCOSX, convert the result to fract format.

IV) Call the M68881:  Convert the angle to radians in M68881
format (96b), call FSINCOS, convert the result to fract.  Call FSINCOS
because we need both sine and cosine in the transformation matrix
anyway.  The fact that this calculates both will be taken into account.

V) Chebyshev polynomial:  Do integer multiplication and addition using
a Chebyshev expansion of the cosine function.  In general, Chebyshev
polynomials are better than Taylor series, since you can achieve the
same precision with fewer terms.  (A small C illustration of the idea
appears after the list.)

VI) Just for kicks, call _FracCos.  This routine doesn't do what I require,
but as long as we're doing these tests, we might as well see how fast
it really goes.  (It takes radians as a 16.16 Fixed value, and
is only reliable out to about 24b.)

VII) Nothing: to help eliminate timing overhead, let there be a subroutine
which consists of nothing but an "rts".
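
Before the numbers, here are rough C sketches of routines I, II, and V,
purely as illustration: they are not the assembly that was actually
timed, and the table layouts and names are mine.  First, the
fold-and-index step of the direct lookup (I give the imaginary table one
extra entry for the 90-degree endpoint):

   /* Pretend quarter-circle table: 2^28 + 1 fract entries (about 1 GB). */
   extern long cos_table[];

   long cos_lookup(unsigned long u)        /* u = angle, 2^30 units/circle */
   {
       const unsigned long QUARTER = 1UL << 28;   /* 90 degrees            */
       unsigned long q, i;

       u &= (1UL << 30) - 1;               /* reduce to one full circle    */
       q = u >> 28;                        /* quadrant number, 0..3        */
       i = u & (QUARTER - 1);              /* offset within the quadrant   */

       switch (q) {
       case 0:  return  cos_table[i];               /*   0.. 90 degrees    */
       case 1:  return -cos_table[QUARTER - i];     /*  90..180 degrees    */
       case 2:  return -cos_table[i];               /* 180..270 degrees    */
       default: return  cos_table[QUARTER - i];     /* 270..360 degrees    */
       }
   }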
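
Second, the Gregory-Newton step of routine II, with h = 2^18 and the
angle already folded into the first quadrant (the folding is the same as
above).  A 64-bit intermediate stands in for the 68020's 32x32 -> 64-bit
multiply:

   #define H_SHIFT 18                   /* h = 2^18 angle units */

   extern long cos_tab[];               /* fract cosines at multiples of h:
                                           1024 + 2 entries, a bit over 4kB */

   long cos_quad(unsigned long u)       /* u = angle in 0 .. 2^28 - 1 */
   {
       unsigned long k = u >> H_SHIFT;                 /* table interval      */
       long t  = (long)(u & ((1UL << H_SHIFT) - 1));   /* x - x0, angle units */
       long f0 = cos_tab[k];
       long f1 = cos_tab[k + 1];
       long f2 = cos_tab[k + 2];
       long d1 = f1 - f0;                              /* first difference    */
       long d2 = f2 - 2 * f1 + f0;                     /* second difference   */
       long long acc;

       /* f0 + t*d1 + (1/2)t(t-1)*d2, with everything scaled by h. */
       acc  = (long long)f0 << H_SHIFT;
       acc += (long long)t * d1;
       acc -= ((((long long)t * ((1L << H_SHIFT) - t)) >> H_SHIFT) * d2) / 2;
       return (long)(acc >> H_SHIFT);
   }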
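
And finally the idea behind routine V.  The Chebyshev coefficients are
computed once, offline, using floating point; the evaluation itself
(Clenshaw's recurrence) is nothing but multiplies and adds, which is
what would be translated into fract arithmetic.  About a dozen terms
should be in the neighborhood of 30-bit accuracy over a quarter circle
(my rough estimate):

   #include <stdio.h>
   #include <math.h>

   #ifndef M_PI
   #define M_PI 3.14159265358979323846
   #endif

   #define NTERMS 12        /* roughly enough for ~30 bits (my estimate) */

   static double c[NTERMS]; /* coefficients of cos(pi/2 * x), x in [-1,1] */

   static void make_coeffs(void)        /* done once, "at compile time" */
   {
       int j, k;
       for (j = 0; j < NTERMS; j++) {
           double s = 0.0;
           for (k = 0; k < NTERMS; k++) {
               double th = M_PI * (k + 0.5) / NTERMS;
               s += cos(M_PI / 2.0 * cos(th)) * cos(j * th);
           }
           c[j] = 2.0 * s / NTERMS;
       }
   }

   static double cheb_cos(double x)     /* x = angle / quarter circle, 0..1 */
   {
       double b1 = 0.0, b2 = 0.0;       /* Clenshaw's recurrence            */
       int k;
       for (k = NTERMS - 1; k >= 1; k--) {
           double b0 = 2.0 * x * b1 - b2 + c[k];
           b2 = b1;
           b1 = b0;
       }
       return x * b1 - b2 + 0.5 * c[0];
   }

   int main(void)
   {
       make_coeffs();
       printf("%.10f  vs  %.10f\n", cheb_cos(0.25), cos(M_PI / 8.0));
       printf("%.10f  vs  %.10f\n", cheb_cos(0.75), cos(3.0 * M_PI / 8.0));
       return 0;
   }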

TIMING RESULTS:

These routines were run on a Mac II with no other INITs running that would
take up interrupt time, and with the cursor left perfectly still.  They were
each called in turn as many times as possible within a hundred second interval
using code like the following:

       clr.l    D6              ;D6 will count the number of calls made
       move.l   Ticks,D7        ;Ticks is a global var which acts as timer
       addi.l   #100*60,D7      ;D7 <- time in the future when we'll stop
loop   sub.w    #2,SP           ;create a random 32b number on the stack
       _Random
       sub.w    #2,SP
       _Random
       move.l   (SP)+,D0        ;D0 <- random argument
       jsr      TheRoutine      ;calculate the cosine
       addq.l   #1,D6           ;another successful call, up the counter
       cmp.l    Ticks,D7        ;time's up?
       bhi.s    loop            ;nope, keep chugging away...
                                ;yes, D6 holds number of calls made

And the results:
                         average #   raw time   corrected time
       Routine           calls made  (nanosec)    (nanosec)
----------------------   ----------  ---------  --------------
  I)  Direct lookup      1,580,286      63,280        7,400
 II)  Quadratic interp   1,250,297      79,981       24,100
III)  SANE (FCOSX)          48,178   2,075,636    2,020,000
 IV)  M68881 (FSINCOS)     817,654     122,301       66,400  (33,200)
  V)  Chebyshev            768,355     130,148       74,300
 VI)  _FracCos             516,633     193,561      137,700
VII)  Nothing            1,789,729      55,874          ---

where (raw time) = 100 sec / (# calls),
      (corrected time) = (raw time) - (Nothing's raw time),
      1,000,000,000 nanoseconds = 1 second
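
(For example, the direct lookup: 100 sec / 1,580,286 calls = 63,280 ns
raw, and 63,280 - 55,874 = 7,406 ns, rounded to 7,400 in the table.)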

DISCUSSION:

The time it takes to do "nothing" includes the time required to
generate the random angle, the actual branching to the subroutine,
updating the counter, and the general overhead associated with
the loop.  Hopefully most of this effect can be eliminated by
subtracting Nothing's raw time as is done above.  This is probably
not going to be completely accurate, but it should be accurate
enough for purposes of making comparisons.

To honorably compare the M68881 call with the others, we should
either multiply all the other times by two (they would have to
be called twice anyway, though maybe some fast trick could be used),
or equivalently divide its time by two.  The actual time is listed
in the "corrected time" column, with half of this time off to the side
in parentheses.

In case you're wondering, I didn't actually create a one gigabyte
lookup table for the first routine.  (My Mac only has half a gigabyte! :)
I did the necessary pointer arithmetic to find the pretend table entry
and fetched the value as if it were the real thing.

Despite the format of this article, I am not so bold as to claim that
it is a rigorous scientific analysis.  (I'm sure many of you are
finding nits to pick already.)  My purpose is to find the fastest
routine possible within reasonable constraints, and for this I believe
my method is fairly sound.
 
CONCLUSIONS:

As I see it, the whole mess boils down to a competition between
calling the math coprocessor and doing a quadratic interpolation.
The interpolation went slightly faster (by roughly 30%), but it
does require a 4kB table.  Other interpolation methods are possible, and I
intend to investigate them to see if I can reduce the table size
while still keeping the arithmetic fast.  The difference in speed would
probably be even less using the M68882, which I understand is substantially
faster than the M68881.  (By about 25% ?  50% ?)

Thus I conclude that the overall best strategy is to do what perhaps should
have been obvious in the first place: call the friendly math coprocessor.
It's simple, it's clean, it's fast, and it's short.

However, it still remains to be seen whether CORDIC is better at rotating
points than matrix multiplication, in which case this entire treatise
on generating trig values has been a purely academic exercise.  (Not
that there's anything wrong with purely academic exercises.)  My
experiments so far with a hand calculator indicate that it rotates only
approximately, and a scaling correction needs to be made afterward,
since the micro-rotations are not length-preserving (in the usual
formulation each step stretches the vector by a known, fixed gain).
For small angles CORDIC might be a good idea.  For general angles,
I'm not sure it would be.
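
For the curious, here is a rough C sketch of the rotation-mode CORDIC
iteration (floating point and the library atan() for clarity; a real
version would use integers, shifts, and a small table of arctangents).
It shows the scaling behavior: each step stretches the vector slightly,
and the accumulated gain, about 1.6468, is divided out at the end (or
folded into a pre-scale):

   #include <stdio.h>
   #include <math.h>

   #define ITER 32

   void cordic_rotate(double *x, double *y, double angle)  /* radians */
   {
       double cx = *x, cy = *y, z = angle;
       double pow2 = 1.0, gain = 1.0;
       int i;

       for (i = 0; i < ITER; i++) {
           double d  = (z >= 0.0) ? 1.0 : -1.0;   /* rotate toward z = 0 */
           double nx = cx - d * cy * pow2;        /* "shift" by 2^-i     */
           double ny = cy + d * cx * pow2;
           z    -= d * atan(pow2);                /* table of atan(2^-i) */
           cx    = nx;
           cy    = ny;
           gain *= sqrt(1.0 + pow2 * pow2);       /* each step stretches */
           pow2 *= 0.5;
       }
       *x = cx / gain;                            /* undo the fixed gain */
       *y = cy / gain;
   }

   int main(void)
   {
       double x = 1.0, y = 0.0;
       cordic_rotate(&x, &y, 3.14159265358979 / 3.0);  /* 60 degrees     */
       printf("%.9f %.9f\n", x, y);                    /* 0.5, 0.866...  */
       return 0;
   }

(Convergence only holds out to roughly 99 degrees, so a quadrant
reduction would still be needed first.)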

Oh yes, is there anyone from Apple listening out there?  I call to
your attention the fact that _FracCos runs much more slowly and inaccurately 
than it needs to.  _FracCos and _FracSin should be calling the coprocessor
on those machines which have it.  (Unless something else is going on which
I am unaware of?  Perhaps so, since that phenomenon happens frequently.
Nevertheless, y'all are doing a great job as far as I'm concerned.)

Anyway, I welcome any discussion this might generate, and my pessimistic
guess is that this issue has not really been resolved yet. . .
______________________________________________________________________
"May the forces of evil become confused on the way to your house."
                                                      -- George Carlin
Sincerely,                           | DISCLAIMER:
  Colin Klipsch                      |   Every word in this text is
  University of Virginia             |   actually a horrendous misspelling
  (cak3g@astsun.astro.virginia.edu)  |   of the word "fettuccini".

amanda@intercon.com (Amanda Walker) (11/12/89)

Well, you look like you're off to a good start.  Actually measuring execution
times is something that a lot of people don't do as often as they should :-).
Anyway, there are a couple of comments I have:

 - Even in a tight memory situation, a 4kB table is not much space, and the
   quadratic interpolation technique is enough faster than
   anything else (for this purpose) that I'd go with it.  That being said,
   I'm also a firm believer in using hardware to do what it can.  The 68882
   is indeed faster than the 68881, and as you noticed, Motorola does good
   stuff with silicon.  The increased accuracy of floating point may actually
   make up for the speed difference.  You also gain automatically when
   the FPU gets faster, which is nice.  Then again, it won't run on a Mac
   Plus or SE.  Depends on the specific application...

 - Among the reasons that matrix multiplication is so popular in 3D graphics
   work are: (a) you can concatenate transformations (rotation, translation,
   scaling, etc.) in any order and still end up with one matrix that you
   then multiply against your points (rough sketch below), (b) if you
   build a hardware matrix
   multiplier, having a single task ("multiply this point by this matrix")
   makes the design *much* simpler, and (c) using homogeneous coordinates
   is a nifty mathematical trick :-).
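
Just to make (a) concrete, here's a rough sketch in C (off the top of my
head, so season to taste): concatenate as many 4x4 transforms as you
like into one matrix, and from then on each point costs a single
matrix-vector multiply.

   typedef double Mat4[4][4];
   typedef double Vec4[4];               /* homogeneous point (x, y, z, 1) */

   /* out = a * b; out must not be the same array as a or b. */
   void mat_mul(Mat4 out, Mat4 a, Mat4 b)
   {
       int i, j, k;
       for (i = 0; i < 4; i++)
           for (j = 0; j < 4; j++) {
               out[i][j] = 0.0;
               for (k = 0; k < 4; k++)
                   out[i][j] += a[i][k] * b[k][j];
           }
   }

   /* out = m * p: one multiply per point, no matter how many
      transformations went into m. */
   void transform(Vec4 out, Mat4 m, Vec4 p)
   {
       int i, k;
       for (i = 0; i < 4; i++) {
           out[i] = 0.0;
           for (k = 0; k < 4; k++)
               out[i] += m[i][k] * p[k];
       }
   }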

I wish you luck with your project,
--
Amanda Walker <amanda@intercon.com>

d88-jwa@nada.kth.se (Jon Watte) (11/14/89)

In article <2256@hudson.acc.virginia.edu> cak3g@astsun7.astro.Virginia.EDU (Colin Klipsch) writes:

    (WARNING: Assembly language discussion ahead!  Run for your lives!!)

Various ramblings deleted

    II) Table with quadratic interpolation:  Quadratic because linear
    interpolation would require too many table entries for 30b accuracy.

[...]

    the largest you can get away with.)  The primary disadvantage of this
    subroutine is that it requires a table with over a thousand entries,
    which amounts to over 4kB.  I note also that this routine makes use
    of the long multiply instructions on the 68020, which are unavailable
    on the 68000.

[...]

    IV) Call the M68881:  Convert the angle to radians in M68881
    format (96b), call FSINCOS, convert the result to fract.  Call FSINCOS
    because we need both sine and cosine in the transformation matrix
    anyway.  The fact that this calculates both will be taken into account.

[...]

    And the results:
     II)  Quadratic interp   1,250,297      79,981       24,100
     IV)  M68881 (FSINCOS)     817,654     122,301       66,400  (33,200)


Hey, are you running a Mac 128K or what?  4kB for speed FASTER than an
MC68881?  That's NOTHING at all!  And consider: more people have the
68020 (or -30) than have the 68881 (or -2).

Those 4kB will probably take up less space than an average "about" box 8)

You could have the table in a resource (hell, you could have
EIGHT tables in a resource 8) and load it only when you need it,
if you're that concerned about memory.  This would also save you the
trouble of typing the table data into your source by hand :-)

All in all, you have produced a VERY GOOD post (maybe my candidate
for "post of the year" :-) and the Net should have more of these...

Happy Hacking !
							h+
-- 
This .signature is longer than 4 lines. If you'd like to see it in whole,
please send a note to me. I'm h+@nada.kth.se and also h+@proxxi.se    8')