[comp.sys.mac.programmer] How to Cosine

cak3g@astsun8.astro.Virginia.EDU (Colin Klipsch) (10/23/89)

It is difficult to produce a television documentary that is both
incisive and probing when every twelve minutes one is interrupted
by twelve dancing rabbits singing about toilet paper.

                                                  -- Rod Serling
__________________________________________________________________

WARNING: assembly language discussion ahead. . .

(Ah, I can hear even now the sound of a million fingers pressing
 the "N" key. . .)

I am writing a library of three-dimensional graphics routines
similar to the Graf3D package supplied by Apple.  However, I
desire more precision than these routines use.  I have chosen
to represent my angles as long integers, with the angle being
expressed in milli-arcseconds (with 360*3600*1000 milli-arcseconds
to a circle).  I am also using 64 bit integers (like the SANE
comp format) to represent my coordinates.

In the interest of speed, I desire to write a subroutine with
the following properties:

   1) on entry, D0 is an angle in milli-arcseconds
      (signed long integer)
   2) on exit, D0 is the cosine of the given angle
      (fract format, accurate to the last bit)
   3) a lookup table is allowable, but it should be no
      larger than 10KB, and preferably only a few KB

The past few months of research have yielded these thoughts:

Pretty obviously we can use the fact that the cosine function
is periodic and symmetric.  Thus the problem reduces to finding the
cosine of an angle between 0 and 90 degrees (or between 0 and
324,000,000 marcsec).  The cosines of all other angles can easily
be derived once this sub-problem has been solved.
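
(A quick sketch of that folding in C, assuming a hypothetical helper
 cos0_90() that handles the reduced range:)

     #include <stdlib.h>                    /* labs() */

     #define FULL_CIRCLE 1296000000L        /* 360*3600*1000 marcsec */
     #define HALF_CIRCLE  648000000L        /* 180 degrees */
     #define QUARTER      324000000L        /*  90 degrees */

     extern long cos0_90(long a);           /* hypothetical: cos on [0, 90 deg] */

     long FastCos(long angle)               /* angle in marcsec, result fract */
     {
         long a = labs(angle % FULL_CIRCLE);    /* cosine is even */

         if (a > HALF_CIRCLE)                   /* cos(360 - a) = cos(a)  */
             a = FULL_CIRCLE - a;
         if (a > QUARTER)                       /* cos(a) = -cos(180 - a) */
             return -cos0_90(HALF_CIRCLE - a);
         return cos0_90(a);
     }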

Although I expect to use a lookup table, linear interpolation
seems unfruitful.  To get thirty bits of accuracy (as stipulated
in Property 2) from linear interpolation would require many
thousands of table entries, even if the table density varied
over the range.  These table entries will likely be six bytes
each, in the form:

       b47 b46 . b45 b44 b43 b42 b41 . . . b3 b2 b1 b0
        |   |     |   |
        |   |     |   ---- weight 1/4
        |   |     -------- weight 1/2
        |   -------------- weight 1
        ------------------ weight -2 (sign bit)

Why this format?  If I want an accurate result in fract format,
I must do my scratch calculations in a higher precision and
truncate the final result.  The above format has the advantages
that: a) it's an even number of bytes, which fits nicely into
memory, b) to reduce the final result to fract format, all I
have to do is lop off the lower 16 bits.
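
(Sketching that conversion in C, with a modern 64-bit integer type
 standing in for the 48-bit scratch value, and assuming an arithmetic
 right shift for negative values:)

     #include <stdint.h>

     typedef int64_t Scratch;    /* b47 sign (weight -2), b46 weight 1,
                                    then 46 fraction bits               */
     typedef int32_t Fract;      /* 2.30 fixed point, as in the Toolbox */

     Fract ScratchToFract(Scratch x)
     {
         return (Fract)(x >> 16);    /* 46 - 16 = 30 fraction bits remain */
     }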

Anyway, at six bytes per entry, this limits our table to about
two thousand entries at most.

Extending linear interpolation to the general case, we can use
the Gregory-Newton interpolation method, which goes:

     Given a table of values for a function f, with each
     table entry separated by a constant distance h, approximate
     f(x) by:

           f(x) = D0 + D1*t + D2*t*(t-1)/2! + D3*t*(t-1)*(t-2)/3! + ...

     where x0 is the nearest multiple of h below x, t = (x-x0)/h, and:

     D0 = f(x0)
     D1 = f(x0+h) - f(x0)
     D2 = f(x0+2h) - 2*f(x0+h) + f(x0)
     D3 = f(x0+3h) - 3*f(x0+2h) + 3*f(x0+h) - f(x0)
     (and so on, using Pascal's triangle)

Using any terms higher than the third, however, requires dividing
by three, which (you'll note) is not a power of 2.  Thus this
series should probably be limited to the first three terms for
fast use in machine language.  So far, this method seems to be my most
promising route.
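
(A sketch of the three-term version in C; SHIFT, the table, and its two
 guard entries past 90 degrees are all hypothetical, and arithmetic
 right shifts are assumed.  With h = 2^18 marcsec the table runs about
 1,240 entries, around 7.3KB at six bytes each, inside the 10KB budget:)

     #include <stdint.h>

     #define SHIFT 18                     /* h = 2^18 marcsec (illustrative) */
     #define H     (1L << SHIFT)

     extern int64_t costab[];             /* costab[i] = cos(i*H) in 2.46
                                             fixed point, + 2 guard entries */

     int64_t cos_interp(long a)           /* a in [0, 324000000] marcsec */
     {
         long    i  = a >> SHIFT;         /* nearest table entry below a  */
         int64_t t  = a & (H - 1);        /* t as a 0.SHIFT fixed fraction */
         int64_t d0 = costab[i];
         int64_t d1 = costab[i+1] - costab[i];
         int64_t d2 = costab[i+2] - 2*costab[i+1] + costab[i];

         /* t*(t-1)/2!; "t - 1" is t - H in this representation */
         int64_t q  = (t * (t - H)) >> (SHIFT + 1);

         return d0 + ((d1 * t) >> SHIFT) + ((d2 * q) >> SHIFT);
     }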

But why use a table at all?  For the sake of speed, remember.  I
could probably plug the given angle into some kind of polynomial,
but if I'm doing this on my own, then all the multiplications
will be multiple precision (because mulu.l returns a 64b result),
and I'll have to worry about multiplying a signed 64b number
by a 32b number, keeping track of the binary point, accuracy
of the coefficients, etc.

I have already tried doing this, but my routine (even though it
wasn't quite finished) was only 2.3 times faster than converting
the angle to SANE extended (in radians) and calling FCOSX.  This
either says that my routine is surprisingly slow or that SANE
is surprisingly fast.  Of course, this was on a Mac II, but still...

So how do the professionals do it?  Good question.  FracCos and
FracSin are two routines which quickly calculate those two
trig functions, although they take their angles in radians (fixed
format) and their results are accurate to about 24 fraction bits,
not 30.  I've tried disassembling these routines, but their method
eludes me.  (Trying to understand raw machine code is not a
pleasant experience...)

And again, I'm avoiding SANE and floating point numbers in general,
so there's no point in investigating how FCOSX works, although I would
be curious anyway.  SANE undoubtedly uses a polynomial of some kind
(Chebyshev, perhaps?) and does all of its calculations in floating point.

By the way, FracCos runs about 9 times faster than FCOSX (assuming
equivalent angles) on a Mac II, and 20 times faster on a Mac SE.
Just thought you might like to know.

But is it even possible for me to beat FracCos?  I'm hoping so,
because FracCos doesn't seem to use any lookup tables, and a lookup
table is (generally) a fast way of getting the values of a function.

Sources I have read: _The Art of Computer Programming_ by Donald Knuth
                     _Microcomputer Animation_ (?) by Bruce Artwick
                     _Apple Numerics Manual_ by Apple themselves
                     _Numerical Recipes_ by Press et al.
                     various abstruse textbooks on numerical analysis

And then of course, there's the rest of this damn project, but God
knows when I'll ever have the time. . .
______________________________________________________________________
"May the forces of evil become confused on the way to your house."

                                                         -- George Carlin
Sincerely,                            | DISCLAIMER:
   Colin Klipsch                      |   Every word in this text is
   University of Virginia             |   actually a horrendous misspelling
   (cak3g@astsun.astro.virginia.edu)  |   of the word "fettuccini".

earleh@eleazar.dartmouth.edu (Earle R. Horton) (10/23/89)

In article <2157@hudson.acc.virginia.edu> cak3g@astsun8.astro.Virginia.EDU
	(Colin Klipsch) writes:
...
>I am writing a library of three-dimensional graphics routines
>similar to the Graf3D package supplied by Apple.  However, I
>desire more precision than these routines use.  I have chosen
>to represent my angles as long integers, with the angle being
>expressed in milli-arcseconds (with 360*3600*1000 milli-arcseconds
>to a circle).  I am also using 64 bit integers (like the SANE
>comp format) to represent my coordinates.
>
>In the interest of speed, I desire to write a subroutine with
>the following properties:
...
This may not be what Colin wants, but in the interest of speed, I
would think the best thing to do would be to use floating point
coordinates exclusively, and to compile the program to use the
floating point coprocessor directly.  If you can convince your end
users that they require something with a 68020/68881 combo or better,
then you have your performance problems licked.  I seriously doubt
that you can come close to the performance of the hardware math unit
using any kind of integer-based package, particularly if you have to
do trigonometry.

>So how do the professionals do it?  Good question.

This depends on what kind of professionals we are talking about.

Professional programmers might try to approach the performance level
of a hardware math unit using highly optimized integer routines, and
they might get within the same order of magnitude at the expense of
huge outlays of intellectual effort.

Professional engineers, physicists, chemists, accountants, etc. will
say "Gimme that floating point chip!" and spend their intellectual
efforts in solving the problem that they are actually interested in.

That is why I wrote my 3d graphics library to use the 68881 directly.
If I ever have to run it on a Plus or SE, well then I plan to byte
the bullet and get a Radius or Daystar board with hardware floating
point.

Earle R. Horton

chewy@apple.com (Paul Snively) (10/23/89)

In article <2157@hudson.acc.virginia.edu> cak3g@astsun8.astro.Virginia.EDU 
(Colin Klipsch) writes:
> I have already tried doing this, but my routine (even though it
> wasn't quite finished) was only 2.3 times faster than converting
> the angle to SANE extended (in radians) and calling FCOSX.  This
> either says that my routine is surprisingly slow or that SANE
> is surprisingly fast.  Of course, this was on a Mac II, but still...

The reason for this is that SANE isn't as stupid as people think; since 
ALL Mac IIs come with an FPU, SANE uses it, so you're pretty close to the 
hardware when you use SANE on a Mac II.

__________________________________________________________________________
Just because I work for Apple Computer, Inc. doesn't mean that they 
believe what I believe or vice-versa.
__________________________________________________________________________
C++ -- The language in which only friends can access your private members.
__________________________________________________________________________

earleh@eleazar.dartmouth.edu (Earle R. Horton) (10/24/89)

In article <4831@internal.Apple.COM> chewy@apple.com (Paul Snively) writes:
...
>The reason for this is that SANE isn't as stupid as people think; since 
>ALL Mac IIs come with an FPU, SANE uses it, so you're pretty close to the 
>hardware when you use SANE on a Mac II.

I measure a 100-fold speed increase on a Mac II using the MC68881
over SANE, with a trig-intensive application that uses doubles (64
bits) for the primary data storage type.  SANE does not use the
MC6888[12] for transcendental functions such as sine and cosine, but
it does offer 1 bit more accuracy than the FPU.  You take your pick, 1
bit more accuracy out of 80, or 100-fold speed increase.  The switches
to turn on the FPU from MPW C are -mc68881, -elems881, and -x149.  Use
the whole set.
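
(For example, from the MPW Shell, assuming a source file named trig.c:)

     C trig.c -mc68881 -elems881 -x149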

The Aztec C floating point library (software) is less accurate than
SANE, but 2-3 times faster in my experience.  This is a software
solution that does not use an FPU at all.

No one is saying SANE is stupid, but it is slow.

Earle R. Horton

h+@nada.kth.se (10/24/89)

If you want a cosine, try the Maclaurin polynomial.  (This really
is a special case of the Taylor polynomial.)  I'm sure you're
familiar with it, and cleverly coded (i.e. Horner's rule: begin at the
term with the highest power and work towards the constant term, all of
it done "inline", with no loop) it should be quite fast.

Happy Hacking !
						h+@nada.kth.se
-- 
When you don't know what you're doing, do it gently

wetter@cup.portal.com (Pierce T Wetter) (10/27/89)

   You want to use something called a CORDIC algorithm.  If I remember
correctly, it only ever requires division and multiplication by powers
of two, which it does with shifts.  CORDIC is the name of the machine
the algorithm was implemented for.

Pierce

viking@silver.bacs.indiana.edu (Jon W. Backstrom) (10/27/89)

In article <23401@cup.portal.com> wetter@cup.portal.com (Pierce Wetter) writes:
>   You want to use something called a CORDIC algorithm.  If I remember
>correctly, it only ever requires division and multiplication by powers
>of two, which it does with shifts.  CORDIC is the name of the machine
>the algorithm was implemented for.
>Pierce

Where is Jim Cathey when you really need him?!  :-)   Jim converted a
fractal landscape program, originally published in Creative Computing,
to use cordic rotators and the speedup is indeed amazing.  The actual
work involves multiplication by powers of two and this required only
shifting bits in the code.

As a naive programmer, not heavily into math theory, I was fascinated
to see how it worked.  Made me want to learn more about math again.  (True)

-------------------------------------------------------------------------------
 Jon W. Backstrom                 "Yah sure...we gonna have fun, you bet!"
 Institute for Digital Arts
 P.O. Box 176                     Internet: viking@silver.bacs.indiana.edu
 Bloomington, IN  47402-0176      UUCP: {ames,rutgers,att}!iuvax!silver!viking
-------------------------------------------------------------------------------

jimc@iscuva.ISCS.COM (Jim Cathey) (10/31/89)

In article <28548@iuvax.cs.indiana.edu> viking@silver.bacs.indiana.edu (Jon W. Backstrom) writes:
>In article <23401@cup.portal.com> wetter@cup.portal.com (Pierce Wetter) writes:
>>   You want to use something called a CORDIC algorithm.  If I remember
>>correctly, it only ever requires division and multiplication by powers
>>of two, which it does with shifts.  CORDIC is the name of the machine
>>the algorithm was implemented for.

>Where is Jim Cathey when you really need him?!  :-)   Jim converted a
>fractal landscape program, originally published in Creative Computing,
>to use cordic rotators and the speedup is indeed amazing.  The actual
>work involves multiplication by powers of two and this required only
>shifting bits in the code.

I be here!  Though I am a staunch supporter of CORDIC, I don't think it
is what the originator of this thread needs to solve his cosine
problem.  Raw CORDIC (Coordinate Rotating DIgital Computer or somesuch)
is very good at rotating points around an axis as it never needs to
compute trig values at all.  It's capable of rotating an integer point
atan(1/2^integer_shiftcount) [I think] per step -- multiple steps at
differing shift counts can get you to more exact rotations.  I used
CORDIC in the fractals program to rotate an essentially 3D set of
integer data points about the Z and X axes prior to plotting it to the
screen.  The exact amount of shifting was immaterial to me so long as
the result looked good.  

For those who care, counter-clockwise CORDIC is:

	y' = y + (x >> shiftcount)
	x' = x - (y' >> shiftcount)

which is 6 instructions (per step) on the 68000.  Note well the use of y'
and not y in the second equation -- use of y will result in a rotator
that spins out into the walls instead of whirling quietly in place.  Best
results are obtained when you're using a fixed-point data type that is
normalized so that the magnitudes are large when considered as a signed
integer (so that the delta terms don't degenerate to zero).  CORDIC will
wobble a little bit as it spins, so don't normalize to +/- 32767, +/- 32000
will give a little breathing room.  (Same argument for long integers.)
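
(One step of the rotator, transcribed to C; this assumes an arithmetic
 right shift for negative values, which 68000 compilers provide:)

     void cordic_step(long *x, long *y, int shift)
     {
         *y += *x >> shift;       /* y' = y + (x >> shiftcount)  */
         *x -= *y >> shift;       /* x' = x - (y' >> shiftcount) */
     }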

There was an article in BYTE in '77 or '78 about using CORDIC for trig
functions, but it looked overly complicated compared to raw CORDIC.  Someone
also posted to the net some CORDIC trig functions, but I've never looked
at them (squirrel instinct made me grab them).

I believe HP used CORDIC for the trig functions in the HP-35 calculator,
but all later models use Chebyshev polynomials because they are faster
and more accurate.

I think the best results for medium-precision cosine might be a combination
of table lookup, fast inline integer math, and (most importantly) good
range reduction so that the table isn't too big.  I'm pretty sure you only
need a 0-45 degree table; the rest can be deduced from trig reductions.
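
(The reduction itself is one comparison.  A sketch in the original
 poster's marcsec units, with hypothetical table-backed helpers cos45()
 and sin45() that are accurate on [0, 45 degrees]:)

     #define DEG45 162000000L             /* 45*3600*1000 marcsec */
     #define DEG90 324000000L

     extern long cos45(long a);           /* hypothetical: cos on [0, 45 deg] */
     extern long sin45(long a);           /* hypothetical: sin on [0, 45 deg] */

     long cos0_90(long a)                 /* a already folded to [0, 90 deg] */
     {
         /* cos(a) = sin(90 deg - a) for a in (45, 90] */
         return (a <= DEG45) ? cos45(a) : sin45(DEG90 - a);
     }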

+----------------+
! II      CCCCCC !  Jim Cathey
! II  SSSSCC     !  ISC-Bunker Ramo
! II      CC     !  TAF-C8;  Spokane, WA  99220
! IISSSS  CC     !  UUCP: uunet!iscuva!jimc (jimc@iscuva.iscs.com)
! II      CCCCCC !  (509) 927-5757
+----------------+
			"With excitement like this, who is needing enemas?"