cak3g@astsun7.astro.Virginia.EDU (Colin Klipsch) (11/11/89)
"I can see a time when there will be one in every city!" -- anonymous American mayor, on the future of the telephone ___________________________________________________________________________ (WARNING: Assembly language discussion ahead! Run for your lives!!) As you may recall, I was in the process of writing an assembly language subroutine which does the following: - takes an arbitrary angle in milli-arcseconds (1,296,000,000 mas/circle) as a long integer in register D0 - returns the sine or cosine of that angle in fract format (which is signed fixed point, 2b.30b) in D0 This subroutine will exist in the context of a library of 3D graphics routines I'm writing for my own use in future applications. (Although anyone who is patient enough to wait for the finished product is welcome to have it also.) I present in this article some discussion and empirical timing results in my quest to find the "perfect" sine and cosine routines. First, since my previous article on this subject, I have changed my mind on the angular units. For simplicity of argument reduction, and to appeal to a certain sense of binary math aesthetics (if there is such a thing), I have decided to take the angle as a fraction of a circle in fract format. Alternatively, you can think of this as a long integer, where the angular units are 1/(2^30) of a circle. (The fract format is documented in Inside Macintosh IV under "Toolbox Utilities".) You may be thinking that dividing a circle into 2^30 units is ridiculously precise. However, I may be using these 3D routines to write an astronomical program (perhaps similar to Orion), and since the coordinates of stars are measured to the nearest arcsecond, you need greater than arcsecond accuracy to rotate those buggers freely without the grainyness of your angles showing through. This format allows you to specify angles to about the nearest 800th of an arcsecond. (Good enough for me anyway.) 
Also, my coordinates will be specified by 64b integers (like the SANE comp
format), and for a point very distant from the origin, a rotation of even
this small angle can still move it many millions of units.  Thus, I would
like to make my angular divisions as small as is convenient in order to
reduce this problem.

So how to do this?  I have received a surprising number of responses, and
I am grateful for all your thoughts.  Many people provided source samples
in C, which, however well-intentioned, were unhelpful for the most part.
Most of these routines used floating point numbers, which I am trying to
avoid for the sake of speed.  (Remember, I'm trying to do some real-time
animation.)  If my performance improvement is only a factor of two or
three above SANE, then I might as well use SANE.  My goals are to avoid:

  1) floating point math,
  2) ROM calls,
  3) anything more than a few multiplications,
  4) any divisions by non-powers of two.

Several people suggested using a method called CORDIC, which apparently
rotates a point around an axis directly, without using any trig
transformation.  I am intrigued by this, since all of the computer
graphics literature I have read to date says to set up a 3x3 or 4x4 matrix
and multiply it by the given point.  If the CORDIC algorithm proves to be
faster, while maintaining my desired accuracy, then I will use it instead.
However, I have doubts that CORDIC will suit my purpose, since I never
hear of it in the computer graphics literature on 3-dimensional rotation.
(I am not claiming to be an expert on these matters; merely describing
what I've personally researched.  I refer specifically to Foley & Van Dam,
Rogers, and Artwick.)

Let's assume then, for the moment, that we must calculate sines and
cosines.  I finally realized that the only way to REALLY know the best
method was to roll up my sleeves and try everything.
These are the algorithms currently under consideration:

SUBROUTINES FOR CALCULATING THE COSINE OF AN ANGLE:

I) Direct table lookup: Yes, let's pretend we have pre-calculated the
cosine of every angle from 0 to 90 degrees.  All 268,435,456 of them.
Map the given angle to this range and just look up the answer.  Nice and
fast.  Disadvantage?  Well, such a table would occupy over a gigabyte of
memory, and most people just don't seem to have that much.  However, it
is useful to include this in our tests, since this, it seems to me, is
the theoretically fastest way of getting the cosine.

II) Table with quadratic interpolation: Quadratic because linear
interpolation would require too many table entries for 30b accuracy.  Not
cubic or higher because that requires dividing by non-powers of two and
doing many more multiplies.  Given our table entries are separated by a
distance h, use the Gregory-Newton formula:

  f(x) = f0 + t(f1 - f0) + (1/2)t(t-1)(f2 - 2f1 + f0)

  where f0 = f(x0), f1 = f(x0 + h), f2 = f(x0 + 2h),
        x0 is the nearest multiple of h under x, and t = (x - x0)/h

To speed up our calculations, let h be a power of two.  (h = 2^18 is the
largest you can get away with.)  The primary disadvantage of this
subroutine is that it requires a table with over a thousand entries, which
amounts to over 4kB.  I note also that this routine makes use of the long
multiply instructions on the 68020, which are unavailable on the 68000.

III) Call SANE: Convert the angle to radians in SANE extended format
(80b), call FCOSX, convert the result to fract format.

IV) Call the M68881: Convert the angle to radians in M68881 format (96b),
call FSINCOS, convert the result to fract.  Call FSINCOS because we need
both sine and cosine in the transformation matrix anyway.  The fact that
this calculates both will be taken into account.

V) Chebyshev polynomial: Do integer multiplication and addition using a
Chebyshev expansion of the cosine function.
In general, Chebyshev polynomials are better than Taylor series, since you
can achieve the same precision with fewer terms.

VI) Just for kicks, call _FracCos.  This routine doesn't do what I
require, but as long as we're doing these tests, we might as well see how
fast it really goes.  (It takes radians in generic fixed point format, and
is only reliable out to about 24b.)

VII) Nothing: to help eliminate timing overhead, let there be a subroutine
which consists of nothing but an "rts".

TIMING RESULTS:

These routines were run on a Mac II with no other INITs running that would
take up interrupt time, and with the cursor left perfectly still.  They
were each called in turn as many times as possible within a hundred-second
interval using code like the following:

      clr.l   D6            ;D6 will count the number of calls made
      move.l  Ticks,D7      ;Ticks is a global var which acts as timer
      addi.l  #100*60,D7    ;D7 <- time in the future when we'll stop
loop
      sub.w   #2,SP         ;create a random 32b number on the stack
      _Random
      sub.w   #2,SP
      _Random
      move.l  (SP)+,D0      ;D0 <- random argument
      jsr     TheRoutine    ;calculate the cosine
      addq.l  #1,D6         ;another successful call, up the counter
      cmp.l   Ticks,D7      ;time's up?
      bhi.s   loop          ;nope, keep chugging away...
      ;yes, D6 holds number of calls made

And the results:

                            average #     raw time   corrected time
  Routine                   calls made   (nanosec)       (nanosec)
  ----------------------    ----------   ---------   --------------
  I)   Direct lookup         1,580,286      63,280            7,400
  II)  Quadratic interp      1,250,297      79,981           24,100
  III) SANE (FCOSX)             48,178   2,075,636        2,020,000
  IV)  M68881 (FSINCOS)        817,654     122,301  66,400 (33,200)
  V)   Chebyshev               768,355     130,148           74,300
  VI)  _FracCos                516,633     193,561          137,700
  VII) Nothing               1,789,729      55,874              ---

  where (raw time) = 100 sec / (# calls),
        (corrected time) = (raw time) - (Nothing's raw time),
        1,000,000,000 nanoseconds = 1 second

DISCUSSION:

The time it takes to do "nothing" includes the time required to generate
the random angle, the actual branching to the subroutine, updating the
counter, and the general overhead associated with the loop.  Hopefully
most of this effect can be eliminated by subtracting Nothing's raw time as
is done above.  This is probably not going to be completely accurate, but
it should be accurate enough for purposes of making comparisons.

To honorably compare the M68881 call with the others, we should either
multiply all the other times by two (they would have to be called twice
anyway, though maybe some fast trick could be used), or equivalently
divide its time by two.  The actual time is listed in the "corrected time"
column, with half of this time off to the side in parentheses.

In case you're wondering, I didn't actually create a one-gigabyte lookup
table for the first routine.  (My Mac only has half a gigabyte! :)  I did
the necessary pointer arithmetic to find the pretend table entry and
fetched the value as if it were the real thing.

Despite the format of this article, I am not so bold as to claim that it
is a rigorous scientific analysis.  (I'm sure many of you are finding nits
to pick already.)  My purpose is to find the fastest routine possible
within reasonable constraints, and for this I believe my method is fairly
sound.
CONCLUSIONS:

As I see it, the whole mess boils down to a competition between calling
the math coprocessor and doing a quadratic interpolation.  The
interpolation went slightly faster (by roughly 30%), but it does require a
4kB table.  Other interpolation methods are possible, and I intend to
investigate them to see if I can reduce the table size while still keeping
the arithmetic fast.  The difference in speed would probably be even less
using the M68882, which I understand is substantially faster than the
M68881.  (By about 25%?  50%?)

Thus I conclude that the overall best strategy is to do what perhaps
should have been obvious in the first place: call the friendly math
coprocessor.  It's simple, it's clean, it's fast, and it's short.

However, it still remains to be seen whether CORDIC is better at rotating
points than matrix multiplication, in which case this entire treatise on
generating trig values has been a purely academic exercise.  (Not that
there's anything wrong with purely academic exercises.)  My experiments so
far with a hand calculator indicate that it rotates only approximately,
and a scaling correction needs to be made afterward, since the point tends
to move toward the origin.  For small angles CORDIC might be a good idea.
For general angles, I'm not sure it would.

Oh yes, is there anyone from Apple listening out there?  I call to your
attention the fact that _FracCos runs much more slowly and inaccurately
than it needs to.  _FracCos and _FracSin should be calling the coprocessor
on those machines which have it.  (Unless something else is going on which
I am unaware of?  Perhaps so, since that phenomenon happens frequently.
Nevertheless, y'all are doing a great job as far as I'm concerned.)

Anyway, I welcome any discussion this might generate, and my pessimistic
guess is that this issue has not really been resolved yet. . .
______________________________________________________________________
"May the forces of evil become confused on the way to your house."
                                                  -- George Carlin

Sincerely,                          | DISCLAIMER:
Colin Klipsch                       |   Every word in this text is
University of Virginia              |   actually a horrendous misspelling
(cak3g@astsun.astro.virginia.edu)   |   of the word "fettuccini".
amanda@intercon.com (Amanda Walker) (11/12/89)
Well, you look like you're off to a good start.  Actually measuring
execution times is something that a lot of people don't do as often as
they should :-).  Anyway, there are a couple of comments I have:

- Even in a tight memory situation, a 4Kb table is not much space, and
  the quadratic interpolation technique is significantly enough faster
  than anything else (for this purpose) that I'd go with it.  That being
  said, I'm also a firm believer in using hardware to do what it can.
  The 68882 is indeed faster than the 68881, and as you noticed, Motorola
  does good stuff with silicon.  The increased accuracy of floating point
  may actually make up for the speed difference.  You also gain
  automatically when the FPU gets faster, which is nice.  Then again, it
  won't run on a Mac Plus or SE.  Depends on the specific application...

- Among the reasons that matrix multiplication is so popular in 3D
  graphics work are: (a) you can concatenate transformations (rotation,
  translation, scaling, etc.) in any order and still end up with one
  matrix that you then multiply against your points, (b) if you build a
  hardware matrix multiplier, having a single task ("multiply this point
  by this matrix") makes the design *much* simpler, and (c) using
  homogeneous coordinates is a nifty mathematical trick :-).

I wish you luck with your project,

--
Amanda Walker <amanda@intercon.com>
d88-jwa@nada.kth.se (Jon Watte) (11/14/89)
In article <2256@hudson.acc.virginia.edu> cak3g@astsun7.astro.Virginia.EDU (Colin Klipsch) writes:
> (WARNING: Assembly language discussion ahead! Run for your lives!!)

[Various ramblings deleted]

> II) Table with quadratic interpolation: Quadratic because linear
> interpolation would require too many table entries for 30b accuracy.
> [...]
> the largest you can get away with.) The primary disadvantage of this
> subroutine is that it requires a table with over a thousand entries,
> which amounts to over 4kB. I note also that this routine makes use
> of the long multiply instructions on the 68020, which are unavailable
> on the 68000.
> [...]
> IV) Call the M68881: Convert the angle to radians in M68881
> format (96b), call FSINCOS, convert the result to fract. Call FSINCOS
> because we need both sine and cosine in the transformation matrix
> anyway. The fact that this calculates both will be taken into account.
> [...]
> And the results:
> II) Quadratic interp 1,250,297 79,981 24,100
> IV) M68881 (FSINCOS) 817,654 122,301 66,400 (33,200)
Hey, are you running a Mac 128K or what?  4kB for speed FASTER than an
MC68881?  That's NOTHING at all!  And consider: more people have the
68020 (or -30) than have the 68881 (or -2).
Probably those 4kB will be less than an average "about" box 8)
You could have the table in a resource (hell, you could have
EIGHT tables in a resource 8) and load it only when you need it,
if you're so concerned for memory. This would also save you the
trouble of feeding the data by hand to the program editor :-)
All in all, you have produced a VERY GOOD post (maybe my candidate
for "post of the year" :-) and the Net should have more of these...
Happy Hacking !
h+
--
This .signature is longer than 4 lines. If you'd like to see it in whole,
please send a note to me. I'm h+@nada.kth.se and also h+@proxxi.se 8')