[comp.sys.amiga.programmer] Fast 3d Graphics

mmblom@cs.vu.nl (Marc Blom) (04/03/91)

'm wondering how fast 3D graphics can really go. I've written some rotation
and perspective routines (in assembly) and profiling shows that 60 % of the
time is spend in my routine RotatePoint. This routine rotates 1 point around
all three axes in only about 900 cycles minimum and about 1100 cycles maximum.
Is this fast ? Can someone do better ? (If so PLEASE mail me how you did it).
I won't include the source here because it's about 100 lines long but instead
i'll tell you what it does: It rotates a point x.word, y.word, z.word (signed)
using fixed point (14 bits behind the point) integer calculation. rotating
around each ax uses 4 muls, giving a total of 12 muls. (This is the variation
in clock cycles I mentioned above, about 900 for best case muls, about 1100
for worst case muls) By the way, these clock cycles are only valid for a plain
68000. (I don't know about 680x0 with x=1,2,3,(4?) I believe you then have to
play with cache-sizes etc. to get the best results).
I want to have the fastest possible rotation routine so I can concentrate on
kicking the Graphics Library AreaMove/Draw/End out (In a system friendly manner,
don't worry :-) ) And write a fast blitter routine.
The 60 % mentioned above was only relative to my routines part of the execution
time, of the total time involved, 15 % was spend in my routines, including
RotatePoint, Perspective etc and 85 % was consumed by those wonderfull (yuk,
did I say wonderfull ?) Graphics Library Calls. No wonder games writers kick
the OS out ! (I hope this doesn't provoke another TO OS OR NOT TO OS
discussion !)
If someone is interested in the source of RotatePoint, just drop me a line and
I will mail it to you.
 
	Marc Blom
 
+--------------------------------------------+------------------------+
| Marc Blom        | Email: mmblom@cs.vu.nl  | What IS the question ? |
| Gondel 14/13     |                         |                        |
| 8243 BL Lelystad | Phone: 03200-46233 (NL) | TO OS OR NOT TO OS,    |
| The Netherlands  |                         | that is the question ! |
+------------------+-------------------------+------------------------+

rjc@geech.gnu.ai.mit.edu (Ray Cromwell) (04/03/91)

In article <9529@star.cs.vu.nl> mmblom@cs.vu.nl (Marc Blom) writes:
>'m wondering how fast 3D graphics can really go. I've written some rotation
>and perspective routines (in assembly) and profiling shows that 60 % of the
>time is spend in my routine RotatePoint. This routine rotates 1 point around
>all three axes in only about 900 cycles minimum and about 1100 cycles maximum.
>Is this fast ? Can someone do better ? (If so PLEASE mail me how you did it).
>I won't include the source here because it's about 100 lines long but instead
>i'll tell you what it does: It rotates a point x.word, y.word, z.word (signed)
>using fixed point (14 bits behind the point) integer calculation. rotating
>around each ax uses 4 muls, giving a total of 12 muls. (This is the variation
>in clock cycles I mentioned above, about 900 for best case muls, about 1100
>for worst case muls) By the way, these clock cycles are only valid for a plain
>68000. (I don't know about 680x0 with x=1,2,3,(4?) I believe you then have to
>play with cache-sizes etc. to get the best results).
>I want to have the fastest possible rotation routine so I can concentrate on
>kicking the Graphics Library AreaMove/Draw/End out (In a system friendly manner,
>don't worry :-) ) And write a fast blitter routine.
>The 60 % mentioned above was only relative to my routines part of the execution
>time, of the total time involved, 15 % was spend in my routines, including
>RotatePoint, Perspective etc and 85 % was consumed by those wonderfull (yuk,
>did I say wonderfull ?) Graphics Library Calls. No wonder games writers kick
>the OS out ! (I hope this doesn't provoke another TO OS OR NOT TO OS
>discussion !)
>If someone is interested in the source of RotatePoint, just drop me a line and
>I will mail it to you.
> 
>	Marc Blom
> 
>+--------------------------------------------+------------------------+
>| Marc Blom        | Email: mmblom@cs.vu.nl  | What IS the question ? |
>| Gondel 14/13     |                         |                        |
>| 8243 BL Lelystad | Phone: 03200-46233 (NL) | TO OS OR NOT TO OS,    |
>| The Netherlands  |                         | that is the question ! |
>+------------------+-------------------------+------------------------+
> 

 Well, this all depends on what you are trying to do. Do you want to do
a flight simulator, or are you doing an animation/demo/model rotation?
If the latter, you can precompute all the rotations before run time, and
just render them via lookup table.

 As for the OS, it really has nothing to do with your project. What needs
to be optimized is the critical section of your rotation routine. I'd suggest
caching the points and computing all the rotations first, then render
them later all at once. I have the feeling you're doing something like this:

loop {
  calculate point
  render point
  get next point
} end loop

 You should (IHMO) calculate all the pointers first, and if space
isn't a problem, unwind the loop and use lots of lookup tables. You can
also add in special case optimization for known rotations like 0, 45  and 90
degrees. 

 As for rendering, you don't need to kick out the OS. Just raise your
priority, OwnBlitter, and render all the points.

 The procedure looks like this:

start
  calculate points
  OwnBlitter()
  render points
  DisownBlitter()
loop to start

  This is friendly because while you are calculating (not using the blitter)
other programs can use the blitter. Of course Mike and the other
'kill the OS' people will suggest that you Disable() all interupts
and do your calculations while the blitter is running in parallel.
This is fine if you doing a game, but if it's something like a 3d modeler
I'd advise against it. Owning the blitter for long periods of time is
very unfriendly.

  I seem to recall someone posted a 'really fast' line drawing 
routine a while back. It was highly optmized for speed using the
CPU (not the blitter. It also was very long because loops were unrolled
and it had lots of special cases for different line slopes.)

--
/~\_______________________________________________________________________/~\
|n|   rjc@albert.ai.mit.edu   Amiga, the computer for the creative mind.  |n|
|~|                                .-. .-.                                |~|
|_|________________________________| |_| |________________________________|_|

peterk@cbmger.UUCP (Peter Kittel GERMANY) (04/03/91)

In article <9529@star.cs.vu.nl> mmblom@cs.vu.nl (Marc Blom) writes:
>'m wondering how fast 3D graphics can really go. I've written some rotation
>and perspective routines (in assembly) and profiling shows that 60 % of the
>time is spend in my routine RotatePoint. This routine rotates 1 point around
>all three axes in only about 900 cycles minimum and about 1100 cycles maximum.
>Is this fast ? Can someone do better ? (If so PLEASE mail me how you did it).

All I hear permanently is, use tables, tables, tables instead of real
(or integer) calculations. At least for the sin and cos calculations.
Estimations about real-life situations show that you don't need tremendous
precision normally (but that depends on your real project). Perhaps even
2D or 3D rotations could be simplified into one (more or less) big table.
It's worth considering in any case.

-- 
Best regards, Dr. Peter Kittel  // E-Mail to  \\  Only my personal opinions... 
Commodore Frankfurt, Germany  \X/ {uunet|pyramid|rutgers}!cbmvax!cbmger!peterk

chrisg@cbmvax.commodore.com (Chris Green) (04/04/91)

In article <1047@cbmger.UUCP> peterk@cbmger.UUCP (Peter Kittel GERMANY) writes:
>In article <9529@star.cs.vu.nl> mmblom@cs.vu.nl (Marc Blom) writes:
>>'m wondering how fast 3D graphics can really go. I've written some rotation
>>and perspective routines (in assembly) and profiling shows that 60 % of the
>>time is spend in my routine RotatePoint. This routine rotates 1 point around
>>all three axes in only about 900 cycles minimum and about 1100 cycles maximum.
>>Is this fast ? Can someone do better ? (If so PLEASE mail me how you did it).

	Doing a quick cycle count on my routine, it looks like 485 worst case.
60% is really high to spend on rotation. 10% is more likely for an amiga.
You ARE using a 3x3 matrix and not 3 separate axis rotations, right?

-- 
*-------------------------------------------*---------------------------*
|Chris Green - Graphics Software Engineer   - chrisg@commodore.COM      f
|                  Commodore-Amiga          - uunet!cbmvax!chrisg       n
|My opinions are my own, and do not         - killyouridolssonicdeath   o
|necessarily represent those of my employer.- itstheendoftheworld       r
*-------------------------------------------*---------------------------d

zap@lysator.liu.se (Zap Andersson) (04/04/91)

peterk@cbmger.UUCP (Peter Kittel GERMANY) writes:

>In article <9529@star.cs.vu.nl> mmblom@cs.vu.nl (Marc Blom) writes:
>>'m wondering how fast 3D graphics can really go. I've written some rotation
>>and perspective routines (in assembly) and profiling shows that 60 % of the
>>time is spend in my routine RotatePoint. This routine rotates 1 point around
>>all three axes in only about 900 cycles minimum and about 1100 cycles maximum.
Instead of rotation around one axis at the time, you may create a rotation
matrix, that rotates around all three axis simultaneously, with 9 multiplies.
The tecnique is picked up from any elementary cumputer graphics book, but
simply, it is:

NewX = OldX * XX + OldY * XY + OldZ * XZ
NewY = OldY * YX + OldY * YY + OldZ * YZ
New>Z = OldZ * ZX + OldY * ZY + OldZ * ZZ

where XX,XY,XZ,YX,YY,YZ,ZX,ZY,ZZ is your rotation matrix. You can think of
XX,XY,XZ as the (new) direction of the old X axis, and YX,YY,YZ as the
(new) direction of the old Y axis and so forth. You may easily build these
by taking the points:

  XX = 1, XY = 0, XZ = 0
 YX = 0, YY = 1, YZ = 00
 ZX = 0, ZY = 0, ZZ = 1
And rotate these using "your" method. Then Use these to rotate the other
points.

>-- 

>Best regards, Dr. Peter Kittel  // E-Mail to  \\  Only my personal opinions... 
>Commodore Frankfurt, Germany  \X/ {uunet|pyramid|rutgers}!cbmvax!cbmger!peterk
-- 
* * * * * * * * * * * * * * * * *          (This rent for space)
* My signature is smaller than  * Be warned! The letter 'Z' is Copyright 1991
* yours!  - zap@lysator.liu.se  * by Zap Inc. So are the colors Red, Green and
* * * * * * * * * * * * * * * * * Greenish-yellow (Blue was taken by IBM) 
--
* * * * * * * * * * * * * * * * *          (This rent for space)
* My signature is smaller than  * Be warned! The letter 'Z' is Copyright 1991
* yours!  - zap@lysator.liu.se  * by Zap Inc. So are the colors Red, Green and
* * * * * * * * * * * * * * * * * Greenish-yellow (Blue was taken by IBM)

mmblom@cs.vu.nl (Marc Blom) (04/05/91)

rjc@geech.gnu.ai.mit.edu (Ray Cromwell) wrote:
> Well, this all depends on what you are trying to do. Do you want to do
>a flight simulator, or are you doing an animation/demo/model rotation?
>If the latter, you can precompute all the rotations before run time, and
>just render them via lookup table.

Nope, a lookup table won't do, i'm hoping to write a elite like space
game some day so i don't know how everything has to be rotated in advance.

> As for the OS, it really has nothing to do with your project. What needs
>to be optimized is the critical section of your rotation routine. I'd suggest
>caching the points and computing all the rotations first, then render

The OS has everything to do with it !, profiling shows that 85 % of the time
is spend in the OS, since i'm only using AreaMove, AreaDraw, AreaEnd & SetAPen,
optimising my 3d-routines alone won't do me much good. I have to come up with
something faster to replace these OS calls.

>them later all at once. I have the feeling you're doing something like this:
>
>loop {
>  calculate point
>  render point
>  get next point
>} end loop

What I'm doing looks something like this:

loop {
  discard objects that are not inside sight-pyramid
  get sin/cos values for this object out of a large table
  loop {
    rotate normal of next surface
    discard surfaces with normal pointing the wrong way (ie. not to screen)
    loop {
      rotate next point (of this surface)
      put point in perspective
    } end loop
  } end loop
}end loop
Sort remaining surfaces (comparing average Z-coords, qsort,
                         only pointers are swapped)	
loop {
  Draw next surface    /* closest is drawn last */
} end loop

>  I seem to recall someone posted a 'really fast' line drawing 
>routine a while back. It was highly optmized for speed using the
>CPU (not the blitter. It also was very long because loops were unrolled
>and it had lots of special cases for different line slopes.)

Love to hear from the one that posted this ! (or someone who has it)

peterk@cbmger.UUCP (Peter Kittel GERMANY) wrote:
>All I hear permanently is, use tables, tables, tables instead of real
>(or integer) calculations. At least for the sin and cos calculations.
>Estimations about real-life situations show that you don't need tremendous
>precision normally (but that depends on your real project). Perhaps even
>2D or 3D rotations could be simplified into one (more or less) big table.
>It's worth considering in any case.

I wish I knew how to use tables without taking too much memory: I'm using
360 degrees around each angle, so to multiply a coordinate with a sin/cos
(this is one table) value i would need 360 words PER coordinate, I'm now
using coordinates that range from -16K to + 16K. I don't even want to
calculate this ! (22.5 meg !) Offcourse i could use less percision, for
instance only rotating by an even number of degrees (0,2,4...,360) and
don't use such a large a coordinate system, but i don't want to compromise
between speed/presicion just yet !

+--------------------------------------------+------------------------+
| Marc Blom        | Email: mmblom@cs.vu.nl  | What IS the question ? |
| Gondel 14/13     |                         |                        |
| 8243 BL Lelystad | Phone: 03200-46233 (NL) | TO OS OR NOT TO OS,    |
| The Netherlands  |                         | that is the question ! |
+------------------+-------------------------+------------------------+

peterk@cbmger.UUCP (Peter Kittel GERMANY) (04/05/91)

In article <9554@star.cs.vu.nl> mmblom@cs.vu.nl (Marc Blom) writes:
>
>peterk@cbmger.UUCP (Peter Kittel GERMANY) wrote:
>>All I hear permanently is, use tables, tables, tables instead of real
>>(or integer) calculations. At least for the sin and cos calculations.
>>Estimations about real-life situations show that you don't need tremendous
>>precision normally (but that depends on your real project). 
> 
>I wish I knew how to use tables without taking too much memory: I'm using
>360 degrees around each angle, so to multiply a coordinate with a sin/cos
>(this is one table) value i would need 360 words PER coordinate,

No, you can cut down this by a factor of 4 by making use of the identities
of sin and cos for offsets of n*90 degrees. This is common technique.

> I'm now
>using coordinates that range from -16K to + 16K. I don't even want to
>calculate this ! (22.5 meg !) Offcourse i could use less percision, for
>instance only rotating by an even number of degrees (0,2,4...,360) and
>don't use such a large a coordinate system, but i don't want to compromise
>between speed/presicion just yet !

Hmm, why is the size of the coordinate system important?
In first attempt I would only make a table of sin and cos values in some
arbitrary stepping. If you need high precision, then for only one quadrant
in small steps; else for the whole circle in bigger steps. In this system
you save the sin/cos evaluation, but you have still to multiply your
coordinates with these factors.

Another thought: If you don't have so many objects, would it be of
advantage to build tables of every object coordinate already multiplied
with all these sin/cos values? That should save you some more multipli-
cations in the innermost loops. But the number of stored table entries
has squared, now there you would have to choose very carefully.
Anyone already done this?

-- 
Best regards, Dr. Peter Kittel  // E-Mail to  \\  Only my personal opinions... 
Commodore Frankfurt, Germany  \X/ {uunet|pyramid|rutgers}!cbmvax!cbmger!peterk

bombadil@diku.dk (Kristian Nielsen) (04/06/91)

mmblom@cs.vu.nl (Marc Blom) writes:

> 
>>  I seem to recall someone posted a 'really fast' line drawing 
>>routine a while back. It was highly optmized for speed using the
>>CPU (not the blitter. It also was very long because loops were unrolled
>>and it had lots of special cases for different line slopes.)
> 
>Love to hear from the one that posted this ! (or someone who has it)
> 
> 
  I think this probably refers to a rutine posted by John Schultz; I'll try
and see if I've still got it, or he might repost (again :-) ). However,
meanwhile, you might want to take a look at this. It implements line draws
to a 4-plane bitmap, and includes a small demo. The main trick is not to
operate on x/y coords in the drawing loop, but instead manipulate screen
addresses and masks; also, plotting points is done while avoiding looping
through each bitplane.
  Note that the slope of a line is calculated using a divide instruction -
this could be optimised by using a slightly different algorithm. Also, I
agree - this rutine IS a bit long.


	Kristian

Following long source (for Hisoft DevPac 2):

* Test-call.

         INCDIR   Include/

         INCLUDE  exec/exec_lib.i
         INCLUDE  intuition/intuition.i
         INCLUDE  intuition/intuition_lib.i

TEST
         moveq    #0,d0
         lea      iname(pc),a1
         movea.l  4.w,a6
         jsr      _LVOOpenLibrary(a6)
         move.l   d0,ibase
         beq      noint
         movea.l  d0,a6

         lea      ns(pc),a0
         jsr      _LVOOpenScreen(a6)
         move.l   d0,screen
         beq      noscreen
         move.l   d0,scrptr

         lea      nw(pc),a0
         jsr      _LVOOpenWindow(a6)
         move.l   d0,window
         beq      nowind
         movea.l  d0,a4
         movea.l  wd_RPort(a4),a4
         movea.l  rp_BitMap(a4),a4
         lea      bm_Planes(a4),a4

         moveq    #5,d7
loop
         movea.l  screen(pc),a0
         move.w   sc_MouseX(a0),d0
         move.w   sc_MouseY(a0),d1

         move.w   #160,d2
         move.w   #128,d3
         movem.l  (a4),a0-a3
         addq     #1,d7
         andi.w   #$000f,d7
         movem.l  d0-d7/a0-a6,-(a7)
         bsr      CpuLine
         movem.l  (a7)+,d0-d7/a0-a6

         movea.l  4.w,a6
         movea.l  window(pc),a0
         movea.l  wd_UserPort(a0),a0
         jsr      _LVOGetMsg(a6)
         tst.l    d0
         beq.s    loop

         movea.l  d0,a1
         jsr      _LVOReplyMsg(a6)


         movea.l  window(pc),a0
         movea.l  ibase(pc),a6
         jsr      _LVOCloseWindow(a6)
nowind
         movea.l  screen(pc),a0
         jsr      _LVOCloseScreen(a6)
noscreen
         movea.l  4.w,a6
         movea.l  ibase(pc),a1
         jsr      _LVOCloseLibrary(a6)
noint
         rts

ns
         dc.w     0,0,320,256,4
         dc.b     -1,-1
         dc.w     0
         dc.w     CUSTOMSCREEN
         dc.l     0,0,0,0

nw       dc.w     0,1,320,255
         dc.b     -1,-1
         dc.l     CLOSEWINDOW
         dc.l     WINDOWCLOSE|WINDOWDRAG
         dc.l     0,0
         dc.l     windowtitle
scrptr   ds.l     1
         dc.l     0
         dc.w     0,0,0,0,CUSTOMSCREEN

window   ds.l     1
screen   ds.l     1
ibase    ds.l     1
iname    dc.b     'intuition.library',0
windowtitle
         dc.b     '*FAST*! Cpu line rutine - demo.',0
         EVEN

************************************************************************
* Fast CPU line rutine.
*
* Main Entry point: Draw a line from (d0,d1) to (d2,d3).
* a0-a3 are plane pointers.
* d7 is line color 0-15.
************************************************************************
CpuLine
         cmp.w    d1,d3             ;For simplicy, we don't ever draw a line
         bge.s    CpuLine2          ;upwards; instead, the points are swapped.

         exg      d0,d2
         exg      d1,d3

;Secondary entry point: Enter here if you know for sure that d1<=d3.
CpuLine2
* Calculate jump-offset from color-value.
         add.w    d7,d7
         add.w    d7,d7

* Calculation of the memory address of the starting point.
* NOTE! This is specific for a 40-byte wide screen.
* (It could be argued that this is not very general... also I am not certain
* that a mulu wouldn't be quicker, at least on some members of the 680xx
* family. Still room for improvement....
* Also, I use only WORD access to display memory. This makes for full speed
* in old 16bit chipmem, but is slower (I guess) than longword access on an
* A3000, should you be so lucky as to own one of those beasts. I guess that
* there are just too many different amigas out there for the perfect optimi-
* zation.

         move.w   d1,d4
         add.w    d4,d4
         add.w    d4,d4             ;y*4
         add.w    d1,d4             ;y*5
         lsl.w    #3,d4             ;*40
         move.w   d0,d5
         lsr.w    #4,d5
         add.w    d5,d5
         add.w    d5,d4             ;d4=offset into bitplanes.
         add.w    d4,a0
         add.w    d4,a1
         add.w    d4,a2
         add.w    d4,a3

         move.w   d0,d4
         andi.w   #$000f,d4         ;d4=bit no (from the left) of first point

* Now determine whether the line slants to the left or to the right.
         
         sub.w    d1,d3             ;The end-point isn't really usefull, but
         sub.w    d0,d2             ;the vector from (d0,d1) is!
         blt      lineleft          ;Our line is slanting to the left.

* Initialising some values.
         move.w   #$8000,d0         ;Mask for set pixel.
         move.w   #$7fff,d1         ;Mask for clear pixel.
* (Yes I know - d0 is not needed for color 0 + same for d1. Feel free to
* optimize further if you use these colors a lot...)

         move.w   d1,d5             ;Store just less than 0.5 for rounding.
         ror.w    d4,d0
         ror.w    d4,d1             ;Get bit in correct position.
;NOTE - Change this for different plane width.
         moveq    #40,d4            ;Width of planes in bytes.

* To draw the line with no 'holes' in it (and no overlap of points either),
* either the x-coord or the y-coord has to be changed by one for each point
* plotted, while the other is only changed occationally.

         cmp.w    d2,d3
         blt      alwaysright       ;Jump if x is to be changed always.


* Now we have to calculate the 'slant' of the line, ie. what persentage
* of the time do we have to increment the x-coord.
* Now if only i knew which of divu and divs were faster....
* First, however, avoid division by zero.

         tst.w    d3
         beq      singlepoint       ;Branch to draw only a single point.
* Need special case for d2=d3 to avoid overflow.
         cmp.w    d2,d3
         beq.s    deg45A
         swap     d2
         clr.w    d2                ;Need to calculate x*2^16/y.
         divu     d3,d2             ;d2 now contains line-slant.

         jmp      jump_tabA(pc,d7.w);Pick right loop from color.
deg45A
         move.l   #$FFFF,d2
         jmp      jump_tabA(pc,d7.w);Pick right loop from color.

jump_tabA
* This junk just creates 16 branches to labels tightloopA0..15.
* The most tricky bit is the 'label\<num>' bit. This expands to a label
* with the decimal representation of num at its end, so if num=5, it
* gives 'label5'. With my Devpac 2, this works nicely; hopefully it is
* also supported by other assemblers (else buy Devpac - it's great!)
bradefA  MACRO
         bra      tightloopA\<num>
         ENDM

num      SET      0
         REPT     16
         bradefA
num      SET      num+1
         ENDR

* This creates the 16 loops, one for each color.

loopdefA MACRO
* This is the loop that needs to be optimized real hard.
* It outputs one point with only
*  always  11 instructions, sometimes 14, occasionally 18.
*  this is 12 words,                  15 words,        19 words.

tightloopA\<num>
* Handle bit 0.
         IFEQ     (num)&1
         and.w    d1,(a0)
         ELSEIF
         or.w     d0,(a0)
         ENDC
* Handle bit 1.
         IFEQ     (num)&2
         and.w    d1,(a1)
         ELSEIF
         or.w     d0,(a1)
         ENDC
* Handle bit 2.
         IFEQ     (num)&4
         and.w    d1,(a2)
         ELSEIF
         or.w     d0,(a2)
         ENDC
* Handle bit 3.
         IFEQ     (num)&8
         and.w    d1,(a3)
         ELSEIF
         or.w     d0,(a3)
         ENDC

         add.w    d4,a0
         add.w    d4,a1
         add.w    d4,a2
         add.w    d4,a3             ;increase y-coord.

         add.w    d2,d5
         bcc.s    gonextA\<num>     ;Check if x is to be increased.

         IFNE     num-15            ;Clear bit not needed for color15.
         ror.w    #1,d1
         ENDC
         IFNE     num               ;Set bit not needed for color 0.
         ror.w    #1,d0
         bcc.s    gonextA\<num>     ;Check for word-boundary.
         ELSEIF
         bcs.s    gonextA\<num>
         ENDC

         addq.w   #2,a0             ;Need to handle word boundary.
         addq.w   #2,a1
         addq.w   #2,a2
         addq.w   #2,a3
gonextA\<num>
         dbra     d3,tightloopA\<num>
         rts

         ENDM

num      SET      0
         REPT     16
         loopdefA
num      SET      num+1

         ENDR


* Following are the remaining 3 octants.

alwaysright

* Now we have to calculate the 'slant' of the line, ie. what persentage
* of the time do we have to increment the y-coord.
* Now if only i knew which of divu and divs were faster....
* First, however, avoid division by zero.

         tst.w    d2
         beq      singlepoint       ;Branch to draw only a single point.
         swap     d3
         clr.w    d3                ;Need to calculate y*2^16/x.
         divu     d2,d3             ;d2 now contains line-slant.

         jmp      jump_tabC(pc,d7.w);Pick right loop from color.
jump_tabC
* This junk just creates 16 branches to labels tightloopC0..15.
* The most tricky bit is the 'label\<num>' bit. This expands to a label
* with the decimal representation of num at its end, so if num=5, it
* gives 'label5'. With my Devpac 2, this works nicely; hopefully it is
* also supported by other assemblers (else buy Devpac - it's great!)
bradefC  MACRO
         bra      tightloopC\<num>
         ENDM

num      SET      0
         REPT     16
         bradefC
num      SET      num+1
         ENDR

* This creates the 16 loops, one for each color.

loopdefC MACRO
* This is the loop that needs to be optimized real hard.
* It outputs one point with only
*  always  11 instructions, sometimes 14, occasionally 18.
*  this is 12 words,                  15 words,        19 words.
*
* This could be optimized by plotting points in 4 registers instead of
* going directly to memory. This would improve very flat lines, but would
* slow down lines nearer to 45 degrees unless another special case were
* added; I guess I was just too lazy.
tightloopC\<num>
* Handle bit 0.
         IFEQ     (num)&1
         and.w    d1,(a0)
         ELSEIF
         or.w     d0,(a0)
         ENDC
* Handle bit 1.
         IFEQ     (num)&2
         and.w    d1,(a1)
         ELSEIF
         or.w     d0,(a1)
         ENDC
* Handle bit 2.
         IFEQ     (num)&4
         and.w    d1,(a2)
         ELSEIF
         or.w     d0,(a2)
         ENDC
* Handle bit 3.
         IFEQ     (num)&8
         and.w    d1,(a3)
         ELSEIF
         or.w     d0,(a3)
         ENDC

         IFNE     num-15            ;Clear bit not needed for color15.
         ror.w    #1,d1
         ENDC
         IFNE     num               ;Set bit not needed for color 0.
         ror.w    #1,d0
         bcc.s    checkC\<num>      ;Check for word-boundary.
         ELSEIF
         bcs.s    checkC\<num>
         ENDC
         addq.w   #2,a0             ;Need to handle word boundary.
         addq.w   #2,a1
         addq.w   #2,a2
         addq.w   #2,a3

checkC\<num>
         add.w    d3,d5
         bcc.s    gonextC\<num>     ;Check if x is to be increased.

         adda.w   d4,a0
         adda.w   d4,a1
         adda.w   d4,a2
         adda.w   d4,a3             ;increase y-coord.

gonextC\<num>
         dbra     d2,tightloopC\<num>
         rts

         ENDM

num      SET      0
         REPT     16
         loopdefC
num      SET      num+1

         ENDR




lineleft
         neg.w    d2                ;get abs. value of x.
* Initialising some values.
         move.w   #$8000,d0         ;Mask for set pixel.
         move.w   #$7fff,d1         ;Mask for clear pixel.
* (Yes I know - d0 is not needed for color 0 + similar for d1. Feel free to
* optimize further if you use these colors a lot...)

         move.w   d1,d5             ;Store just less than 0.5 for rounding.
         ror.w    d4,d0
         ror.w    d4,d1             ;Get bit in correct position.
;NOTE - Change this for different plane width.
         moveq    #40,d4            ;Width of planes in bytes.

* To draw the line with no 'holes' in it (and no overlap of points either),
* either the x-coord or the y-coord has to be changed by one for each point
* plotted, while the other is only changed occationally.

         cmp.w    d2,d3
         blt      alwaysleft        ;Jump if x is to be changed always.


* Now we have to calculate the 'slant' of the line, ie. what persentage
* of the time do we have to increment the x-coord.
* Now if only i knew which of divu and divs were faster....
* First, however, avoid division by zero.

         tst.w    d3
         beq      singlepoint       ;Branch to draw only a single point.
         cmp.w    d3,d2
         beq.s    deg45B
         swap     d2
         clr.w    d2                ;Need to calculate x*2^16/y.
         divu     d3,d2             ;d2 now contains line-slant.

         jmp      jump_tabB(pc,d7.w);Pick right loop from color.
deg45B
         move.w   #$FFFF,d2
         jmp      jump_tabB(pc,d7.w);Pick right loop from color.

jump_tabB
* This junk just creates 16 branches to labels tightloopA0..15.
* The most tricky bit is the 'label\<num>' bit. This expands to a label
* with the decimal representation of num at its end, so if num=5, it
* gives 'label5'. With my Devpac 2, this works nicely; hopefully it is
* also supported by other assemblers (else buy Devpac - it's great!)
bradefB  MACRO
         bra      tightloopB\<num>
         ENDM

num      SET      0
         REPT     16
         bradefB
num      SET      num+1
         ENDR

* This creates the 16 loops, one for each color.

loopdefB MACRO
* This is the loop that needs to be optimized real hard.
* It outputs one point with only
*  always  11 instructions, sometimes 14, occasionally 18.
*  this is 12 words,                  15 words,        19 words.

tightloopB\<num>
* Handle bit 0.
         IFEQ     (num)&1
         and.w    d1,(a0)
         ELSEIF
         or.w     d0,(a0)
         ENDC
* Handle bit 1.
         IFEQ     (num)&2
         and.w    d1,(a1)
         ELSEIF
         or.w     d0,(a1)
         ENDC
* Handle bit 2.
         IFEQ     (num)&4
         and.w    d1,(a2)
         ELSEIF
         or.w     d0,(a2)
         ENDC
* Handle bit 3.
         IFEQ     (num)&8
         and.w    d1,(a3)
         ELSEIF
         or.w     d0,(a3)
         ENDC

         add.w    d4,a0
         add.w    d4,a1
         add.w    d4,a2
         add.w    d4,a3             ;increase y-coord.

         add.w    d2,d5
         bcc.s    gonextB\<num>     ;Check if x is to be increased.

         IFNE     num-15            ;Clear bit not needed for color15.
         rol.w    #1,d1
         ENDC
         IFNE     num               ;Set bit not needed for color 0.
         rol.w    #1,d0
         bcc.s    gonextB\<num>     ;Check for word-boundary.
         ELSEIF
         bcs.s    gonextB\<num>
         ENDC

         subq.w   #2,a0             ;Need to handle word boundary.
         subq.w   #2,a1
         subq.w   #2,a2
         subq.w   #2,a3
gonextB\<num>
         dbra     d3,tightloopB\<num>
         rts

         ENDM

num      SET      0
         REPT     16
         loopdefB
num      SET      num+1

         ENDR


alwaysleft

* Now we have to calculate the 'slant' of the line, ie. what persentage
* of the time do we have to increment the y-coord.
* Now if only i knew which of divu and divs were faster....
* First, however, avoid division by zero.

         tst.w    d2
         beq      singlepoint       ;Branch to draw only a single point.
         swap     d3
         clr.w    d3                ;Need to calculate y*2^16/x.
         divu     d2,d3             ;d2 now contains line-slant.

         jmp      jump_tabD(pc,d7.w);Pick right loop from color.
jump_tabD
* This junk just creates 16 branches to labels tightloopD0..15.
* The most tricky bit is the 'label\<num>' bit. This expands to a label
* with the decimal representation of num at its end, so if num=5, it
* gives 'label5'. With my Devpac 2, this works nicely; hopefully it is
* also supported by other assemblers (else buy Devpac - it's great!)
bradefD  MACRO
         bra      tightloopD\<num>
         ENDM

num      SET      0
         REPT     16
         bradefD
num      SET      num+1
         ENDR

* This creates the 16 loops, one for each color.

loopdefD MACRO
* This is the loop that needs to be optimized real hard.
* It outputs one point with only
*  always  11 instructions, sometimes 14, occasionally 18.
*  this is 12 words,                  15 words,        19 words.
*
* This could be optimized by plotting points in 4 registers instead of
* going directly to memory. This would improve very flat lines, but would
* slow down lines nearer to 45 degrees unless another special case were
* added; I guess I was just too lazy.
tightloopD\<num>
* Handle bit 0.
         IFEQ     (num)&1
         and.w    d1,(a0)
         ELSEIF
         or.w     d0,(a0)
         ENDC
* Handle bit 1.
         IFEQ     (num)&2
         and.w    d1,(a1)
         ELSEIF
         or.w     d0,(a1)
         ENDC
* Handle bit 2.
         IFEQ     (num)&4
         and.w    d1,(a2)
         ELSEIF
         or.w     d0,(a2)
         ENDC
* Handle bit 3.
         IFEQ     (num)&8
         and.w    d1,(a3)
         ELSEIF
         or.w     d0,(a3)
         ENDC

         IFNE     num-15            ;Clear bit not needed for color15.
         rol.w    #1,d1
         ENDC
         IFNE     num               ;Set bit not needed for color 0.
         rol.w    #1,d0
         bcc.s    checkD\<num>      ;Check for word-boundary.
         ELSEIF
         bcs.s    checkD\<num>
         ENDC
         subq.w   #2,a0             ;Need to handle word boundary.
         subq.w   #2,a1
         subq.w   #2,a2
         subq.w   #2,a3

checkD\<num>
         add.w    d3,d5
         bcc.s    gonextD\<num>     ;Check if x is to be increased.

         adda.w   d4,a0
         adda.w   d4,a1
         adda.w   d4,a2
         adda.w   d4,a3             ;increase y-coord.

gonextD\<num>
         dbra     d2,tightloopD\<num>
         rts

         ENDM

num      SET      0
         REPT     16
         loopdefD
num      SET      num+1

         ENDR



* Our (silly) caller passed us the same point for start and end - just give
* him the single lousy point he wants.
singlepoint
         jmp      jump_tabE(pc,d7.w);Pick right plotrutine from color.
jump_tabE
* This junk just creates 16 branches to labels plotE0..15.
* The most tricky bit is the 'label\<num>' bit. This expands to a label
* with the decimal representation of num at its end, so if num=5, it
* gives 'label5'. With my Devpac 2, this works nicely; hopefully it is
* also supported by other assemblers (else buy Devpac - it's great!)
bradefE  MACRO
         bra      plotE\<num>
         ENDM

num      SET      0
         REPT     16
         bradefE
num      SET      num+1
         ENDR

plotdefE MACRO
* Plot a point in color num.
* Handle bit 0.
plotE\<num>
         IFEQ     (num)&1
         and.w    d1,(a0)
         ELSEIF
         or.w     d0,(a0)
         ENDC
* Handle bit 1.
         IFEQ     (num)&2
         and.w    d1,(a1)
         ELSEIF
         or.w     d0,(a1)
         ENDC
* Handle bit 2.
         IFEQ     (num)&4
         and.w    d1,(a2)
         ELSEIF
         or.w     d0,(a2)
         ENDC
* Handle bit 3.
         IFEQ     (num)&8
         and.w    d1,(a3)
         ELSEIF
         or.w     d0,(a3)
         ENDC

         rts

         ENDM

num      SET      0
         REPT     16
         plotdefE
num      SET      num+1

         ENDR

holgerl@amiux.UUCP (Holger Lubitz) (04/06/91)

In article <1991Apr3.085847.14287@mintaka.lcs.mit.edu> rjc@geech.gnu.ai.mit.edu (Ray Cromwell) writes:
>  I seem to recall someone posted a 'really fast' line drawing
>routine a while back. It was highly optmized for speed using the
>CPU (not the blitter. It also was very long because loops were unrolled
>and it had lots of special cases for different line slopes.)

I probably missed it. Could any kind soul who archived it email it to me or
simply repost ?

Thanks in advance,
Holger

--
Holger Lubitz            | holgerl@amiux.uucp
Kl. Drakenburger Str. 24 | holgerl@amiux.han.de
D-W-3070 Nienburg        | cbmvax.commodore.com!cbmehq!cbmger!amiux!holgerl

baker@wbc.enet.dec.com (04/06/91)

	Actually, back in the late 70's or early 80's, I saw
	a paper by someone who was doing realtime rotations
	of a 3-D data-plot on an 8-bit micro.  Using a table
	or two, he had reduced the rotation calculations to 
	a couple integer multiplies and a shift.  I don't
	remember the algorithm, but I can probably find the
	reference.

	Regards,

	*********************************************************
	* Art Baker			| "Read ME,		*
	* baker@wbc.enet.dec.com	|   Dr Memory..."	*
	* PLINK: A*BAKER		|			*
	*********************************************************

jonabbey@cs.utexas.edu (Jonathan David Abbey) (04/06/91)

In article <9554@star.cs.vu.nl> mmblom@cs.vu.nl (Marc Blom) writes:
|I wish I knew how to use tables without taking too much memory: I'm using
|360 degrees around each angle, so to multiply a coordinate with a sin/cos
|(this is one table) value i would need 360 words PER coordinate, I'm now
|using coordinates that range from -16K to + 16K. I don't even want to
|calculate this ! (22.5 meg !) Offcourse i could use less percision, for
|instance only rotating by an even number of degrees (0,2,4...,360) and
|don't use such a large a coordinate system, but i don't want to compromise
|between speed/presicion just yet !
| 

Use quanta of rotation 1/256th of a circle.  That way you can do rotations
with a single indexed offset into your trig tables.  Short, sweet, and oh
so quick.

>+--------------------------------------------+------------------------+
>| Marc Blom        | Email: mmblom@cs.vu.nl  | What IS the question ? |
>| Gondel 14/13     |                         |                        |
>| 8243 BL Lelystad | Phone: 03200-46233 (NL) | TO OS OR NOT TO OS,    |
>| The Netherlands  |                         | that is the question ! |
>+------------------+-------------------------+------------------------+

-- 
-------------------------------------------------------------------------------
Jonathan David Abbey              \"Take your place on the great Mandela" P,P&M
the university of texas at austin  \  jonabbey@cs.utexas.edu     "Love me, love
computer science/math?/psychology?  \ (512) 472-2052              my Amiga" -Me

limonce@pilot.njin.net (Tom Limoncelli +1 201 408 5389) (04/06/91)

In article <9529@star.cs.vu.nl> mmblom@cs.vu.nl (Marc Blom) writes:

> 'm wondering how fast 3D graphics can really go. I've written some rotation
> and perspective routines (in assembly) and profiling shows that 60 % of the
> time is spend in my routine RotatePoint. This routine rotates 1 point around
> all three axes in only about 900 cycles minimum and about 1100 cycles max.

Yow!  You're using the wrong algorithm!  I recommend "Computer
Graphics, principles and practice" by Foley, vanDam, Feiner & Hughes.
Chap 5 has some better ways of doing this kind of stuff.  It really
makes you appreciate those linear algebra classes that were so
tempting to sleep through.  They bring a big rotate/scale/move routine
down to one precomputation then a matrix-multiple for every point in
the display.  By doing one precomputation you save a LOT of work for
all those other points.

Also, comp.graphics has a Monthly Posting Of Frequently Asked
Questions (and their Answers).  It includes *many* good tips.  Some of
them are actual simple formulas, some are snotty "here's a list of
books... go to the library!" but, hey, the net isn't a replacement for
good research.

Tom
-- 
One thousand, one hundred, seventy five people died of AIDS
last week.  Did someone mention a war in Iraq?

Alex_Topic@resbbs.UUCP (Alex Topic) (04/06/91)

  Well talking about 3d graphics.. Which the BLITTER should be used for this
type of thing on the Amiga. Do you have any idea if CBM is working on a new
CUSTOM CHIP set.. I mean with a fast blitter, more sprites...ect
        I know the Amiga is a nice put together computer.. But they could of
made it a bit better. It should of had upto 256 colors in Low res, plus the
SPRITES could of atleast been alittle more powerfull.   
        ONe system I like is the NEO GEO.. it has 300+ Sprites.. 2048 on
screen colors... God knows if they have a BLITTER type chip.. 
         Oh well dont take me as a complainer, I think the Amiga has alot
power for the price, but it could have had more abilities.. oh well Cya all
around..
  
A.t.

ben@epmooch.UUCP (Rev. Ben A. Mesander) (04/06/91)

>In article <9576@star.cs.vu.nl> mmblom@cs.vu.nl (Marc Blom) writes:
[...]
>I think i misunderstood you, I am allready using tables for the sin/cos
>values, I thought you meant tables for the multiplies, then the coordinate
>system's size is very important, the larger your coordinate system, the
>larger the multable.

I think often you can reduce the size of the multable by carefully looking
at your algorithms. I've done something like this for graphics in Z-80
assembler, where I used properties of the screen addressing to reduce my
multable (everything had to fit in 32K) similarly to the trig identities
for reducing the sin/cos table. Look at your problem very carefully...

>Marc Blom, Vrije Universiteit Amsterdam, Email: mmblom@cs.vu.nl | short .sig

--
| ben@epmooch.UUCP   (Ben Mesander)       | "Cash is more important than |
| ben%servalan.UUCP@uokmax.ecn.uoknor.edu |  your mother." - Al Shugart, |
| !chinet!uokmax!servalan!epmooch!ben     |  CEO, Seagate Technologies   |

mmblom@cs.vu.nl (Marc Blom) (04/07/91)

Newsgroups: comp.sys.amiga.programmer
Subject: Re: Fast 3d Graphics
Summary: 
Expires: 
References: <20338@cbmvax.commodore.com> <9554@star.cs.vu.nl> <1073@cbmger.UUCP>
Sender: 
Followup-To: 
Distribution: 
Organization: Fac. Wiskunde & Informatica, VU, Amsterdam
Keywords: fast 3d optimizing assembly cycles

In article <1073@cbmger.UUCP> peterk@cbmger.UUCP (Peter Kittel GERMANY) writes:
>In article <9554@star.cs.vu.nl> mmblom@cs.vu.nl (Marc Blom) writes:
>>
>> 
>>I wish I knew how to use tables without taking too much memory: I'm using
>>360 degrees around each angle, so to multiply a coordinate with a sin/cos
>>(this is one table) value i would need 360 words PER coordinate,
>
>No, you can cut down this by a factor of 4 by making use of the identities
>of sin and cos for offsets of n*90 degrees. This is common technique.
>

Yes, I know and i am using this, i just forgot it.


>> I'm now
>>using coordinates that range from -16K to + 16K. I don't even want to
>>calculate this ! (22.5 meg !) Offcourse i could use less percision, for
>>instance only rotating by an even number of degrees (0,2,4...,360) and
>>don't use such a large a coordinate system, but i don't want to compromise
>>between speed/presicion just yet !
>
>Hmm, why is the size of the coordinate system important?
>In first attempt I would only make a table of sin and cos values in some
>arbitrary stepping. If you need high precision, then for only one quadrant
>in small steps; else for the whole circle in bigger steps. In this system
>you save the sin/cos evaluation, but you have still to multiply your
>coordinates with these factors.
>

I think i misunderstood you, I am allready using tables for the sin/cos
values, I thought you meant tables for the multiplies, then the coordinate
system's size is very important, the larger your coordinate system, the
larger the multable.

>-- 
>Best regards, Dr. Peter Kittel  // E-Mail to  \\  Only my personal opinions... 
>Commodore Frankfurt, Germany  \X/ {uunet|pyramid|rutgers}!cbmvax!cbmger!peterk


--
Marc Blom, Vrije Universiteit Amsterdam, Email: mmblom@cs.vu.nl | short .sig

galetti@uservx.afwl.af.mil (04/09/91)

In article <20338@cbmvax.commodore.com>, chrisg@cbmvax.commodore.com (Chris Green) writes:
> 
> 	Doing a quick cycle count on my routine, it looks like 485 worst case.
> 60% is really high to spend on rotation. 10% is more likely for an amiga.
> You ARE using a 3x3 matrix and not 3 separate axis rotations, right?
> 

Does your routine rotate about an arbitrary axis or about one of the principle
axes?  If it rotates about an arbitrary axis, do you assume the axis is fixed
or do you allow it to vary?  If you allow it to vary, there are a few
additional multiplications you have to perform to set up the rotation matrix,
not to mention a few additional trig. functions.  For an arbitrary axis, 485
cycles sounds awfully low.

> -- 
> *-------------------------------------------*---------------------------*
> |Chris Green - Graphics Software Engineer   - chrisg@commodore.COM      f
> |                  Commodore-Amiga          - uunet!cbmvax!chrisg       n
> |My opinions are my own, and do not         - killyouridolssonicdeath   o
> |necessarily represent those of my employer.- itstheendoftheworld       r
> *-------------------------------------------*---------------------------d
  ___________________________________________________________________________
 /   Ralph Galetti                  Internet:   galetti@uservx.afwl.af.mil   \
|    PL/LITT                        Interests:  computers, music, computers   |
|    Kirtland AFB, NM 87117-6008                and music, golf, sleep.       |
 \__"No, they couldn't actually prove that it was HIS vomit" - Nigel Tufnel__/