[rec.games.programmer] Need hard core help doing some 3d optimizations

jdb@reef.cis.ufl.edu (Brian K. W. Hook) (04/12/91)

I just ran Turbo Profiler on some code of mine, and it looks like roughly
90% of my CPU time is being shoved into one function.  Any help on the
optimizations would be really appreciated.

E-mail would be preferred, however posts are fine too.

void Calc3D ( int WorldX, int WorldY, int WorldZ,
	      int MX, int MY, int MZ,
	      int *DisplayX, int *DisplayY)
{
float xa, ya, za;

   WorldX=-WorldX;
   xa=_yawCosFactor*WorldX-_yawSinFactor*WorldZ;
   za=_yawSinFactor*WorldX+_yawCosFactor*WorldZ;
   WorldX=_rollCosFactor*xa+_rollSinFactor*WorldY;
   ya=_rollCosFactor*WorldY-_rollSinFactor*xa;
   WorldZ=_pitchCosFactor*za-_pitchSinFactor*ya;
   WorldY=_pitchSinFactor*za+_pitchCosFactor*ya;
   WorldX+=MX;
   WorldY+=MY;
   WorldZ+=MZ;
   *DisplayX=AngPerspFactor*WorldX/WorldZ;
   *DisplayY=AngPerspFactor*WorldY/WorldZ;
}

AngPerspFactor is a float, as are the assorted factors.  I have already
used a COS and SIN lookup table in another part of the module to expedite
things, but this is really slowing things down considerably.

Any help is appreciated.

Brian

2fmlcalls@kuhub.cc.ukans.edu (04/12/91)

> I just ran Turbo Profiler on some code of mine, and it looks like roughly
> 90% of my CPU time is being shoved into one function.  Any help on the
> optimizations would be really appreciated.

snip

> float xa, ya, za;

snip

> AngPerspFactor is a float, as are the assorted factors.  I have already
> used a COS and SIN lookup table in another part of the module to expedite
> things, but this is really slowing things down considerably.
> 
> Any help is appreciated.
> 
> Brian

I don't fully follow C, but I know float.  Go with scaled integers or longs and
use integer division.  As well, use scaled integers for your sin/cos look-up
tables (i.e. sin(45 degrees) = 707 rather than 0.707).  When you are through
multiplying, do a div 1000 (or C equivalent of integer division).
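
A minimal sketch of the scaled-integer idea in C, assuming the table values are
sin and cos times 1000.  The names SCALE, SIN45_X1000, COS45_X1000 and rotate45
are made up for illustration; a real table would have one entry per angle step.

```c
#define SCALE 1000L

/* Sine and cosine of 45 degrees, pre-multiplied by 1000 and rounded. */
static const long SIN45_X1000 = 707L;
static const long COS45_X1000 = 707L;

/* Rotate the point (x, z) by 45 degrees using integer math only.
   The products are formed in long because a 16-bit int would overflow
   for all but very small coordinates. */
void rotate45(int x, int z, int *out_x, int *out_z)
{
    long rx = COS45_X1000 * x - SIN45_X1000 * z;
    long rz = SIN45_X1000 * x + COS45_X1000 * z;

    *out_x = (int)(rx / SCALE);   /* undo the scaling once, at the end */
    *out_z = (int)(rz / SCALE);
}
```

Dividing once after the sums, rather than after each multiply, keeps more
precision and costs fewer divides.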

General optimizations - use fewer points.  If you get creative you may find you
can represent your planes/tanks/whatnots with fewer points (and thus fewer lines
to draw as well).

john calhoun

rcb@shaman.cc.ncsu.edu (Randy Buckland) (04/12/91)

jdb@reef.cis.ufl.edu (Brian K. W. Hook) writes:
>void Calc3D ( int WorldX, int WorldY, int WorldZ,
>	      int MX, int MY, int MZ,
>	      int *DisplayX, int *DisplayY)
>{
>float xa, ya, za;

>   WorldX=-WorldX;
>   xa=_yawCosFactor*WorldX-_yawSinFactor*WorldZ;
>   za=_yawSinFactor*WorldX+_yawCosFactor*WorldZ;
>   WorldX=_rollCosFactor*xa+_rollSinFactor*WorldY;
>   ya=_rollCosFactor*WorldY-_rollSinFactor*xa;
>   WorldZ=_pitchCosFactor*za-_pitchSinFactor*ya;
>   WorldY=_pitchSinFactor*za+_pitchCosFactor*ya;
>   WorldX+=MX;
>   WorldY+=MY;
>   WorldZ+=MZ;
>   *DisplayX=AngPerspFactor*WorldX/WorldZ;
>   *DisplayY=AngPerspFactor*WorldY/WorldZ;
>}

Try this

Store your cos/sin values as (int)(cos(angle)*1024)
Then do the matrix multiply as

	c(0,0) = (a(0,0)*b(0,0) + a(0,1)*b(1,0)
		+ a(0,2)*b(2,0) + a(0,3)*b(3,0)) >> 10;

where a is your initial 4x4 coordinate matrix and b is your precomputed
transformation matrix. ALL values in the b matrix should be integers
that are 1024 times their proper values, while a is an integer matrix with
the proper values and c is an integer matrix. This results in 4
integer multiplies, 3 integer adds and 1 shift (the reason for a scale
factor of 1024 instead of 1000). No floating point at all.
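
The formula above can be sketched as a C function that handles one row of a.
This is only an illustration: the names FIX_SHIFT and transform_row are
hypothetical, and it assumes the compiler uses an arithmetic right shift for
negative sums, which C leaves implementation-defined.

```c
#define FIX_SHIFT 10                 /* 1 << 10 == 1024 */

/* Multiply a row of plain integer coordinates (x, y, z, 1) by a 4x4
   matrix whose entries are stored as value * 1024.  Each output element
   costs 4 multiplies, 3 adds and 1 shift. */
void transform_row(const long a[4], const long b[4][4], long c[4])
{
    int col;

    for (col = 0; col < 4; col++) {
        long sum = a[0] * b[0][col] + a[1] * b[1][col]
                 + a[2] * b[2][col] + a[3] * b[3][col];
        c[col] = sum >> FIX_SHIFT;   /* drop the 1024 scale factor */
    }
}
```

With b set to 1024 times the identity plus a translation of (10, 20, 30) in the
last row, the row (1, 2, 3, 1) comes out as (11, 22, 33, 1).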

Also, try to determine if you are really rotating about all 3 axes. If not,
you can drastically simplify the computations and increase the speed.
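
As a sketch of that simplification, assume only yaw ever changes, so roll and
pitch stay at zero.  Y then passes through untouched and the rotation needs 4
multiplies.  The function name and the 1024 scaling of yaw_cos and yaw_sin are
assumptions for illustration, and sign conventions (such as the negated X in
the original routine) are left to the caller.

```c
#define FIX_SHIFT 10   /* trig factors are stored as value * 1024 */

/* Yaw-only rotation: with roll and pitch fixed at zero, only X and Z
   need to be rotated.  Assumes an arithmetic right shift for negative
   values, which C leaves implementation-defined. */
void rotate_yaw_only(long x, long y, long z, long yaw_cos, long yaw_sin,
                     long *rx, long *ry, long *rz)
{
    *rx = (yaw_cos * x - yaw_sin * z) >> FIX_SHIFT;
    *ry = y;                                  /* unchanged by yaw */
    *rz = (yaw_sin * x + yaw_cos * z) >> FIX_SHIFT;
}
```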

--
Randy Buckland						"It's hard to work
North Carolina State University				in a group when you're
randy@ncsu.edu (919) 737-2517				omnipotent"	-- Q

ahodgson@athena.mit.edu (Antony Hodgson) (04/12/91)

In article <1991Apr11.202429.29648@kuhub.cc.ukans.edu> 2fmlcalls@kuhub.cc.ukans.edu writes:
>> I just ran Turbo Profiler on some code of mine, and it looks like roughly
>> 90% of my CPU time is being shoved into one function.  Any help on the
>> optimizations would be really appreciated.
>> ... float xa, ya, za;
>
>I don't fully follow C, but I know float.  Go with scaled integers or longs and
>use integer division.  

Check out the latest Dr. Dobb's journal on using memory-mapped math
coprocessors.  They claim that float division is now faster than 386
integer division on some PC hardware platforms.

>As well, use scaled integers for your sin/cos look-up
>tables (i.e. sin(45 degrees) = 707 rather than 0.707).  When you are through
>multiplying, do a div 1000 (or C equivalent of integer division).

If you go this route, don't scale in base 10;  scale in base 2 and use
shift operations instead of division when possible.
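
One caveat worth a small sketch: a shift and a divide agree for non-negative
values but can differ for negative ones.  The helper names below are made up,
and the behaviour of >> on a negative long is implementation-defined in C, so
check what your compiler does.

```c
#define FIX_SHIFT 10   /* scale factor 1 << 10 == 1024 */

/* Remove the scale factor with a divide.  Under C99 rules this rounds
   toward zero, so -1500 becomes -1. */
long unscale_div(long v)
{
    return v / (1L << FIX_SHIFT);
}

/* Remove the scale factor with a shift.  On compilers that use an
   arithmetic shift this rounds toward negative infinity, so -1500
   would become -2 rather than -1. */
long unscale_shift(long v)
{
    return v >> FIX_SHIFT;
}
```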

Tony Hodgson
ahodgson@hstbme.mit.edu

ahodgson@athena.mit.edu (Antony Hodgson) (04/12/91)

In article <1991Apr11.202429.29648@kuhub.cc.ukans.edu> 2fmlcalls@kuhub.cc.ukans.edu writes:
>> I just ran Turbo Profiler on some code of mine, and it looks like roughly
>> 90% of my CPU time is being shoved into one function.  Any help on the
>> optimizations would be really appreciated.
>
>I don't fully follow C, but I know float.  Go with scaled integers or longs and
>use integer division.  

Check out the latest Dr. Dobb's Journal.  In an article on memory mapped
numeric coprocessors, they claim that floating point division is now faster
than 386 integer division with the appropriate math coprocessor attached.

>As well, use scaled integers for your sin/cos look-up
>tables (i.e. sin(45 degrees) = 707 rather than 0.707).  When you are through
>multiplying, do a div 1000 (or C equivalent of integer division).

If you go this route, consider scaling by a power of 2 rather than a power of
10.  You may be able to carry out divisions by shift operations rather than
division operations.

Tony Hodgson
ahodgson@hstbme.mit.edu

shaunc@gold.gvg.tek.com (Shaun Case) (04/13/91)

In article <27979@uflorida.cis.ufl.EDU> jdb@reef.cis.ufl.edu (Brian K. W. Hook) writes:
>I just ran Turbo Profiler on some code of mine, and it looks like roughly
>90% of my CPU time is being shoved into one function.  Any help on the
>optimizations would be really appreciated.

On a related note, I got some odd results with the profiler once or twice until
I realized that my screen saver was interrupting the profiler and throwing the
profile WAY off.  While my screen saver is running, the profiler is suspended,
so when you exit the screen saver, all the time spent in it gets assigned to
whatever area the profiler was looking at when the screen saver went active.
Routines that previously had 13-15% suddenly went up to 30%, then 90% the next
time around...

Beware TSRs in all their myriad forms.

// Shaun //

-- 
Shaun Case:  shaunc@gold.gvg.tek.com  or  atman%ecst.csuchico.edu@RELAY.CS.NET 
 or Shaun Case of 1:119/666.0 (Fidonet)  or  1@9651 (WWIVnet)
---
It's enough to destroy a young moose's faith!

Ordania-DM@cup.portal.com (Charles K Hughes) (04/13/91)

  Does this discussion really need to be in 3 newsgroups?

Brian writes:
>
>I just ran Turbo Profiler on some code of mine, and it looks like roughly
>90% of my CPU time is being shoved into one function.  Any help on the

  I don't know what the rest of your program looks like, but I don't think
this code should take up 90% of it.  
  (Before following my mediocre suggestions, follow the others that have been
posted - TSR removal, use Ints, etc.)

>optimizations would be really appreciated.
>
>E-mail would be preferred, however posts are fine too.
>
>void Calc3D ( int WorldX, int WorldY, int WorldZ,
>	      int MX, int MY, int MZ,
>	      int *DisplayX, int *DisplayY)
>{
>float xa, ya, za;
>
>   WorldX=-WorldX;
>   xa=_yawCosFactor*WorldX-_yawSinFactor*WorldZ;
>   za=_yawSinFactor*WorldX+_yawCosFactor*WorldZ;
>   WorldX=_rollCosFactor*xa+_rollSinFactor*WorldY;
>   ya=_rollCosFactor*WorldY-_rollSinFactor*xa;
>   WorldZ=_pitchCosFactor*za-_pitchSinFactor*ya;
>   WorldY=_pitchSinFactor*za+_pitchCosFactor*ya;
>   WorldX+=MX;
>   WorldY+=MY;
>   WorldZ+=MZ;
>   *DisplayX=AngPerspFactor*WorldX/WorldZ;
>   *DisplayY=AngPerspFactor*WorldY/WorldZ;

   Change these last two to:
    yes_a_temp=AngPerspFactor/WorldZ;
    *DisplayX=yes_a_temp*WorldX;
    *DisplayY=yes_a_temp*WorldY;

   Division costs a lot more than multiplication or storage.

>}
>
>AngPerspFactor is a float, as are the assorted factors.  I have already
>used a COS and SIN lookup table in another part of the module to expedite
>things, but this is really slowing things down considerably.
>
>Any help is appreciated.
>
>Brian
>
>   

  Now, for independent thoughts that don't quite fit into the program...
There are formulas for transforming the equations you have above.
Calculus and trig books should have them.
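
  One possible reading of this is that the yaw, roll and pitch steps can be
multiplied out into a single 3x3 matrix that is built once per frame, leaving
9 multiplies per point instead of 12.  The sketch below was expanded by hand
from the quoted code, assuming the WY in it means WorldY.  RotMatrix,
build_rotation and rotate_point are hypothetical names, and the translation
and perspective steps are left out.

```c
typedef struct { float m[3][3]; } RotMatrix;

/* Build the combined rotation from the six trig factors.
   yc/ys = yaw cos/sin, rc/rs = roll cos/sin, pc/ps = pitch cos/sin.
   Column 0 carries the minus sign from "WorldX=-WorldX". */
void build_rotation(RotMatrix *r, float yc, float ys,
                    float rc, float rs, float pc, float ps)
{
    r->m[0][0] = -(rc * yc);
    r->m[0][1] = rs;
    r->m[0][2] = -(rc * ys);

    r->m[1][0] = -(ps * ys - pc * rs * yc);
    r->m[1][1] = pc * rc;
    r->m[1][2] = ps * yc + pc * rs * ys;

    r->m[2][0] = -(pc * ys + ps * rs * yc);
    r->m[2][1] = -(ps * rc);
    r->m[2][2] = pc * yc - ps * rs * ys;
}

/* Rotate one point: 9 multiplies and 6 adds. */
void rotate_point(const RotMatrix *r, float x, float y, float z,
                  float *ox, float *oy, float *oz)
{
    *ox = r->m[0][0] * x + r->m[0][1] * y + r->m[0][2] * z;
    *oy = r->m[1][0] * x + r->m[1][1] * y + r->m[1][2] * z;
    *oz = r->m[2][0] * x + r->m[2][1] * y + r->m[2][2] * z;
}
```

  Call build_rotation once whenever the angles change, then rotate_point for
every vertex.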

Good luck.
Charles_K_Hughes@cup.portal.com

jmbj@grebyn.com (Jim Bittman) (04/13/91)

[stuff deleted]

>>> 90% of my CPU time is being shoved into one function.  Any help on the
>>> optimizations would be really appreciated.
>>> ... float xa, ya, za;
>>
>
>Check out the latest Dr. Dobb's journal on using memory-mapped math
               ^^^^^^^^^^^^^^^^^^^^^^^^^
Why not add a floating point DSP coprocessor, and be done with it?

>coprocessors.  They claim that float division is now faster than 386
>integer division on some PC hardware platforms.

Jim Bittman, jmbj@grebyn.com
(author of DSP article in above mentioned magazine |=)

uad1077@dircon.co.uk (Ian Kemmish) (04/13/91)

2fmlcalls@kuhub.cc.ukans.edu writes:

>> I just ran Turbo Profiler on some code of mine, and it looks like roughly
>> 90% of my CPU time is being shoved into one function.  Any help on the
>> optimizations would be really appreciated.

>snip

>> float xa, ya, za;

>snip

>> AngPerspFactor is a float, as are the assorted factors.  I have already
>> used a COS and SIN lookup table in another part of the module to expedite
>> things, but this is really slowing things down considerably.
>> 
>> Any help is appreciated.
>> 
>> Brian

>I don't fully follow C, but I know float.  Go with scaled integers or longs and
>use integer division.  As well, use scaled integers for your sin/cos look-up
>tables (i.e. sin(45 degrees) = 707 rather than 0.707).  When you are through
>multiplying, do a div 1000 (or C equivalent of integer division).

>General optimizations - use fewer points.  If you get creative you may find you
>can represent your planes/tanks/whatnots with fewer points (and thus fewer lines
>to draw as well).

>john calhoun

But don't get rid of floats completely.  If you can, parameterise the
decision to use floats or scaled ints in a config file.  On mips
systems, and with luck the civilised habit will spread, doing
arithmetic in floats is considerably faster than doing it in
integers (assuming it's not something simple like a long run
of adds and subtracts with just a couple of token multiplies and
divides thrown in), which in turn is faster than arithmetic using
shorts.  Scaled integers and shorts are slower, again.  I recently
saw a preliminary copy of the JPEG sample implementation, and
it suffered from this ``integer-centrism''....
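
A minimal sketch of one way to make that choice switchable in C, treating the
"config file" as a build-time setting.  USE_FLOAT, num_t and the NUM_* macros
are hypothetical names, and the fixed-point branch assumes 10 fractional bits
and an arithmetic right shift for negative values.

```c
/* Build with -DUSE_FLOAT for floating point; the default is fixed
   point with 10 fractional bits. */
#ifdef USE_FLOAT
typedef float num_t;
#define NUM_FROM_INT(i)  ((num_t)(i))
#define NUM_TO_INT(n)    ((int)(n))
#define NUM_MUL(a, b)    ((a) * (b))
#else
typedef long num_t;
#define NUM_SHIFT        10
#define NUM_FROM_INT(i)  ((num_t)(i) * (1L << NUM_SHIFT))
#define NUM_TO_INT(n)    ((int)((n) >> NUM_SHIFT))
#define NUM_MUL(a, b)    (((a) * (b)) >> NUM_SHIFT)
#endif

/* Written once, compiled either way. */
num_t scale_by(num_t value, num_t factor)
{
    return NUM_MUL(value, factor);
}
```

The rest of the code uses only num_t and the macros, so each target machine can
be built with whichever representation times better there.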

-- 
Ian D. Kemmish                    Tel. +44 767 601 361
18 Durham Close                   uad1077@dircon.UUCP
Biggleswade                       ukc!dircon!uad1077
Beds SG18 8HZ United Kingdom    uad1077@dircon.co.uk

conners@cs.fau.edu (Sean Conner) (04/17/91)

In article <1991Apr13.154112.11345@dircon.co.uk> uad1077@dircon.co.uk (Ian Kemmish) writes:

[  stuff about using scaled ints stuffed into bit bucket  ]

>
>But don't get rid of floats completely.  If you can, parameterise the
>decision to use floats or scaled ints in a config file.  On mips
>systems, and with luck the civilised habit will spread, doing
>arithmetic in floats is considerably faster than doing it in
>integers (assuming it's not something simple like a long run
>of adds and subtracts with just a couple of token multiplies and
>divides thrown in), which in turn is faster than arithmetic using
>shorts.  Scaled integers and shorts are slower, again.  I recently
>saw a preliminary copy of the JPEG sample implementation, and
>it suffered from this ``integer-centrism''....
>
>-- 
>Ian D. Kemmish                    Tel. +44 767 601 361
>18 Durham Close                   uad1077@dircon.UUCP
>Biggleswade                       ukc!dircon!uad1077
>Beds SG18 8HZ United Kingdom    uad1077@dircon.co.uk

  So, okay, I have a MIPS system.  No problem.  But what about us poor
programmers who can only afford a 386/387 system?  It is still faster to
use ints than floats, for two reasons:

	1.  On a 386, the longest an IDIV will take (this is with a memory
	    operand here) is 46 cycles.  And the longest an IMUL will take
	    is 41 cycles.  On the 387, the SHORTEST time it will take to
	    divide two reals is 91 cycles.  A floating point multiply takes
	    about the same as the IMUL instruction (depends on addressing
	    modes, though).  Drop down to a 286/287 combo, and the IMUL/IDIV
	    instructions take about 20-30 cycles, while the FMUL/FDIV take
	    about (for an FMUL) 130-170 or (for an FDIV) 190-230 cycles.  Shall
	    we go on to a 86/87 combo?

	2.  Do we REALLY need the ability to go from 1E-100 to 1E+100 in range
	    when doing graphics?  And what does it mean to add 1E-100 to 1E+100?


  For me, scaled ints are fine.

  -spc (Floats?  I don't need no stinking floats!  :-)

dmurdoch@watstat.waterloo.edu (Duncan Murdoch) (04/17/91)

In article <1991Apr16.210819.17873@cs.fau.edu> conners@cs.fau.edu (Sean Conner) writes:
>
>  So, okay, I have a MIPS system.  No problem.  But what about us poor
>programmers who can only afford a 386/387 system?  It is still faster to
>use ints than floats, for two reasons:
>
>	1.  On a 386, the longest a IDIV will take (this is with a memory
>	    operand here) is 46 cycles.  And the longest an IMUL will take
>	    is 41 cycles.  On the 387, the SHORTEST time it will take to
>	    divide two reals is 91 cycles.  A floating point multiply takes
>	    about the same as the IMUL instruction (depends on addressing
>	    modes, though).  

Just in case you're interested:  the times for the 486 are

  IDIV    19-44
  FDIV    73
  IMUL    13-42
  FMUL    11-16

So it looks as though floating point math would be a contender, even without
a MIPS, as long as you avoided the divides.  I don't know the algorithms
you want to use, but very often the denominator in a series of divides stays the
same for a long time, so you can replace it with a series of 
multiplies by the reciprocal.
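
A small sketch of the reciprocal trick, assuming a run of values that all share
one denominator.  The function name divide_all is made up for illustration, and
the results can differ from true division in the last bit because of rounding.

```c
/* Divide every element of v by d using one divide and n multiplies
   instead of n divides.  The caller must make sure d is not zero. */
void divide_all(float *v, int n, float d)
{
    float recip = 1.0f / d;   /* the only divide */
    int i;

    for (i = 0; i < n; i++)
        v[i] *= recip;
}
```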

Duncan Murdoch
dmurdoch@watstat.waterloo.edu

conners@cs.fau.edu (Sean Conner) (04/18/91)

In article <1991Apr17.133142.24149@maytag.waterloo.edu> dmurdoch@watstat.waterloo.edu (Duncan Murdoch) writes:
>
>Just in case you're interested:  the times for the 486 are
>
>  IDIV    19-44
>  FDIV    73
>  IMUL    13-42
>  FMUL    11-16
>
>So it looks as though floating point math would be a contender, even without
>a MIPS, as long as you avoided the divides.  I don't know the algorithms
>you want to use, but very often the denominator in a series of divides stays the
>same for a long time, so you can replace it with a series of 
>multiplies by the reciprocal.
>
>Duncan Murdoch
>dmurdoch@watstat.waterloo.edu

  So, we have the 386/387 covered.  What about the 68020/68881(2)?  I have
an Amiga at home :-)

 -spc (Why do I program in Assembly?  Speed.  Pure and simple.)

hpasanen@cs.hut.fi (Harri Pasanen) (04/18/91)

Just an idea:

Use C++ and define your own number class.  Then depending on the class
header included you can then have either ints or floats, and compile
the code for the machine at hand for its maximum performance.

Harri Pasanen

jcs@crash.cts.com (John Schultz) (04/23/91)

In <1991Apr17.235954.2334@cs.fau.edu> conners@cs.fau.edu (Sean Conner) writes:

[stuff deleted]

>  So, we have the 386/387 covered.  What about the 68020/68881(2)?  I have
>an Amiga at home :-)

  With the 68020/030/881/882 fixed point is much faster. Previous posts on
this topic only considered mul and div (?). Add and sub are *very* slow for
floats (slower than muls), whereas for ints, add is very fast. The best way to
check the speed difference in methods is to write two versions of a function,
one using floats, the other using ints, then compare times. If you mix
floats with ints on the 680x0/88x, you'll see some speed penalties for
processor to coprocessor moves (around 107 cycles in some cases). With
32 bit / 64 bit intermediate ints, you should be able to handle all computer
graphics math. Physics equations are a different story... Anyway, with
an assembler such as Adapt, you can disassemble your code and compare
instruction cycle times for various methods. The only way to know for sure
is to time both methods and compare.
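
  A rough sketch of such a comparison using the standard C clock() function.
scale_float, scale_fixed and time_calls are hypothetical names, and the
operation being timed (scaling by 0.75) is only a stand-in for a real routine.
The fixed-point version assumes non-negative input.

```c
#include <time.h>

/* Two versions of the same operation: scale v by 0.75. */
long scale_float(long v) { return (long)(v * 0.75f); }
long scale_fixed(long v) { return (v * 768L) >> 10; }  /* 768/1024 == 0.75 */

/* Return the CPU seconds taken to call fn(i) for i in [0, iterations). */
double time_calls(long (*fn)(long), long iterations)
{
    volatile long sink = 0;   /* keeps the calls from being optimised away */
    clock_t start = clock();
    long i;

    for (i = 0; i < iterations; i++)
        sink += fn(i);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

  Run time_calls on both functions with the same large iteration count and
compare the two numbers on the machine you actually care about.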

  John