jdb@reef.cis.ufl.edu (Brian K. W. Hook) (04/12/91)
I just ran Turbo Profiler on some code of mine, and it looks like roughly
90% of my CPU time is being shoved into one function. Any help on the
optimizations would be really appreciated.
E-mail would be preferred, however posts are fine too.
void Calc3D ( int WorldX, int WorldY, int WorldZ,
int MX, int MY, int MZ,
int *DisplayX, int *DisplayY)
{
float xa, ya, za;
WorldX=-WorldX;
xa=_yawCosFactor*WorldX-_yawSinFactor*WorldZ;
za=_yawSinFactor*WorldX+_yawCosFactor*WorldZ;
WorldX=_rollCosFactor*xa+_rollSinFactor*WY;
ya=_rollCosFactor*WorldY-_rollSinFactor*xa;
WorldZ=_pitchCosFactor*za-_pitchSinFactor*ya;
WorldY=_pitchSinFactor*za+_pitchCosFactor*ya;
WorldX+=MX;
WorldY+=MY;
WorldZ+=MZ;
*DisplayX=AngPerspFactor*WorldX/WorldZ;
*DisplayY=AngPerspFactor*WorldY/WorldZ;
}
AngPerspFactor is a float, as are the assorted factors. I have already
used a COS and SIN lookup table in another part of the module to expedite
things, but this is really slowing things down considerably.
Any help is appreciated.
Brian
2fmlcalls@kuhub.cc.ukans.edu (04/12/91)
> I just ran Turbo Profiler on some code of mine, and it looks like roughly > 90% of my CPU time is being shoved into one function. Any help on the > optimizations would be really appreciated. snip > float xa, ya, za; snip > AngPerspFactor is a float, as are the assorted factors. I have already > used a COS and SIN lookup table in another part of the module to expedite > things, but this is really slowing things down considerably. > > Any help is appreciated. > > Brian I don't fully follow C, but I know float. Go with scaled integers or longs and use integer division. As well, use scaled integers for your sin/co look=up tables (i.e. sin(45 degrees) = 707 rather than 0.707). When you are through multiplying, do a div 1000 (or C equivalent of integer division). General optimizations - use fewer points. If you get creative you may find you can represent your planes/tanks/whatnots with less points (and thus fewer lines to draw as well). john calhoun
rcb@shaman.cc.ncsu.edu (Randy Buckland) (04/12/91)
jdb@reef.cis.ufl.edu (Brian K. W. Hook) writes: >void Calc3D ( int WorldX, int WorldY, int WorldZ, > int MX, int MY, int MZ, > int *DisplayX, int *DisplayY) >{ >float xa, ya, za; > WorldX=-WorldX; > xa=_yawCosFactor*WorldX-_yawSinFactor*WorldZ; > za=_yawSinFactor*WorldX+_yawCosFactor*WorldZ; > WorldX=_rollCosFactor*xa+_rollSinFactor*WY; > ya=_rollCosFactor*WorldY-_rollSinFactor*xa; > WorldZ=_pitchCosFactor*za-_pitchSinFactor*ya; > WorldY=_pitchSinFactor*za+_pitchCosFactor*ya; > WorldX+=MX; > WorldY+=MY; > WorldZ+=MZ; > *DisplayX=AngPerspFactor*WorldX/WorldZ; > *DisplayY=AngPerspFactor*WorldY/WorldZ; >} Try this Store your cos/sin values as (int)(cos(angle)*1024) Then do the matrix multiply as c(0,0) = (a(0,0)*b(0,0) + a(0,1)*b(1,0) + a(0,2)*b(2,0) + (a(0,3)*b(3,0)) >> 10; where a is your initial 4x4 coordinate matrix and b is your translation matrix that is precomputed. ALL values in the b matrix should be integers that are 1024 times their proper values while a is an integer matrix with the proper values and c is an integer matrix. This results in 4 integer multiplies, 3 integer adds and 1 shift (the reason for 1024 instead of 1000 scale factor) No floating point at all. Also, try to determine if you are really rotating about all 3 axes. If not, you can drastically simplify the computations and increase the speed. -- Randy Buckland "It's hard to work North Carolina State University in a group when you're randy@ncsu.edu (919) 737-2517 omnipotent" -- Q
ahodgson@athena.mit.edu (Antony Hodgson) (04/12/91)
In article <1991Apr11.202429.29648@kuhub.cc.ukans.edu> 2fmlcalls@kuhub.cc.ukans.edu writes: >> I just ran Turbo Profiler on some code of mine, and it looks like roughly >> 90% of my CPU time is being shoved into one function. Any help on the >> optimizations would be really appreciated. >> ... float xa, ya, za; > >I don't fully follow C, but I know float. Go with scaled integers or longs and >use integer division. Check out the latest Dr. Dobb's journal on using memory-mapped math coprocessors. They claim that float division is not faster than 386 integer division on some PC hardware platforms. >As well, use scaled integers for your sin/co look=up >tables (i.e. sin(45 degrees) = 707 rather than 0.707). When you are through >multiplying, do a div 1000 (or C equivalent of integer division). If you go this route, don't scale in base 10; scale in base 2 and use shift operations instead of division when possible. Tony Hodgson ahodgson@hstbme.mit.edu
ahodgson@athena.mit.edu (Antony Hodgson) (04/12/91)
In article <1991Apr11.202429.29648@kuhub.cc.ukans.edu> 2fmlcalls@kuhub.cc.ukans.edu writes: >> I just ran Turbo Profiler on some code of mine, and it looks like roughly >> 90% of my CPU time is being shoved into one function. Any help on the >> optimizations would be really appreciated. > >I don't fully follow C, but I know float. Go with scaled integers or longs and >use integer division. Check out the latest Dr. Dobb's Journal. In an article on memory mapped numeric coprocessors, they claim that floating point division is now faster than 386 integer division with the appropriate math coprocessor attached. >As well, use scaled integers for your sin/co look=up >tables (i.e. sin(45 degrees) = 707 rather than 0.707). When you are through >multiplying, do a div 1000 (or C equivalent of integer division). If you go this route, consider scaling by a power of 2 rather than a power of 10. You may be able to carry out divisions by shift operations rather than division operations. Tony Hodgson ahodgson@hstbme.mit.edu
shaunc@gold.gvg.tek.com (Shaun Case) (04/13/91)
In article <27979@uflorida.cis.ufl.EDU> jdb@reef.cis.ufl.edu (Brian K. W. Hook) writes: >I just ran Turbo Profiler on some code of mine, and it looks like roughly >90% of my CPU time is being shoved into one function. Any help on the >optimizations would be really appreciated. On a related note, I got some odd results with the profiler once or twice until I realized that my screen saver was interrupting the profiler, and was throwing the profile WAY off. While the my screen saver is running, the profiler is suspended, so when you exit the screen saver, all the time spent in it gets assigned to whatever area the profiler was looking at whenever the screen saver went active. Routines that previously had 13-15% suddenly went up to 30%, then 90% the next time around... Beware TSRs in all their myriad forms. // Shaun // -- Shaun Case: shaunc@gold.gvg.tek.com or atman%ecst.csuchico.edu@RELAY.CS.NET or Shaun Case of 1:119/666.0 (Fidonet) or 1@9651 (WWIVnet) --- It's enough to destroy a young moose's faith!
Ordania-DM@cup.portal.com (Charles K Hughes) (04/13/91)
Does this discussion really need to be in 3 newsgroups? Brian writes: > >I just ran Turbo Profiler on some code of mine, and it looks like roughly >90% of my CPU time is being shoved into one function. Any help on the I don't know what the rest of your program looks like, but I don't think this code should take up 90% of it. (Before following my mediocre suggestions, follow the others that have been posted - TSR removal, use Ints, etc.) >optimizations would be really appreciated. > >E-mail would be preferred, however posts are fine too. > >void Calc3D ( int WorldX, int WorldY, int WorldZ, > int MX, int MY, int MZ, > int *DisplayX, int *DisplayY) >{ >float xa, ya, za; > > WorldX=-WorldX; > xa=_yawCosFactor*WorldX-_yawSinFactor*WorldZ; > za=_yawSinFactor*WorldX+_yawCosFactor*WorldZ; > WorldX=_rollCosFactor*xa+_rollSinFactor*WY; > ya=_rollCosFactor*WorldY-_rollSinFactor*xa; > WorldZ=_pitchCosFactor*za-_pitchSinFactor*ya; > WorldY=_pitchSinFactor*za+_pitchCosFactor*ya; > WorldX+=MX; > WorldY+=MY; > WorldZ+=MZ; > *DisplayX=AngPerspFactor*WorldX/WorldZ; > *DisplayY=AngPerspFactor*WorldY/WorldZ; Change these last two to: yes_a_temp=AngPerspFactor/WorldZ; *DisplayX=yes_a_temp*WorldX *DisplayY=yes_a_temp*WorldY Division costs a lot more than multiplication or storage. >} > >AngPerspFactor is a float, as are the assorted factors. I have already >used a COS and SIN lookup table in another part of the module to expedite >things, but this is really slowing things down considerably. > >Any help is appreciated. > >Brian > > Now, for independend thoughts that don't quite fit into the program... There are formulas for transforming the equations you have above. Calculus and trig books should have them. Good luck. Charles_K_Hughes@cup.portal.com
jmbj@grebyn.com (Jim Bittman) (04/13/91)
[stuff deleted] >>> 90% of my CPU time is being shoved into one function. Any help on the >>> optimizations would be really appreciated. >>> ... float xa, ya, za; >> > >Check out the latest Dr. Dobb's journal on using memory-mapped math ^^^^^^^^^^^^^^^^^^^^^^^^^ Why not add a floating point DSP coprocessor, and be done with it? >coprocessors. They claim that float division is not faster than 386 >integer division on some PC hardware platforms. Jim Bittman, jmbj@grebyn.com (author of DSP article in above mentioned magazine |=)
uad1077@dircon.co.uk (Ian Kemmish) (04/13/91)
2fmlcalls@kuhub.cc.ukans.edu writes: >> I just ran Turbo Profiler on some code of mine, and it looks like roughly >> 90% of my CPU time is being shoved into one function. Any help on the >> optimizations would be really appreciated. >snip >> float xa, ya, za; >snip >> AngPerspFactor is a float, as are the assorted factors. I have already >> used a COS and SIN lookup table in another part of the module to expedite >> things, but this is really slowing things down considerably. >> >> Any help is appreciated. >> >> Brian >I don't fully follow C, but I know float. Go with scaled integers or longs and >use integer division. As well, use scaled integers for your sin/co look=up >tables (i.e. sin(45 degrees) = 707 rather than 0.707). When you are through >multiplying, do a div 1000 (or C equivalent of integer division). >General optimizations - use fewer points. If you get creative you may find you >can represent your planes/tanks/whatnots with less points (and thus fewer lines >to draw as well). >john calhoun But don't get rid of floats completely. If you can, parameterise the decision to use floats or scaled ints in a config file. On mips systems, and with luck the civilised habit will spread, doing arithmetic in floats os considerably faster than doing it in integers (assuming its not something simple like a long run of adds and subtracts with just a couple of token multiplies and divides thrown in), which in turn is faster than arithmetic using shorts. Scaled integers and shorts are slower, again. I recently saw a preliminary copy of the JPEG sample implementation, and it suffered from this ``integer-centrism''.... -- Ian D. Kemmish Tel. +44 767 601 361 18 Durham Close uad1077@dircon.UUCP Biggleswade ukc!dircon!uad1077 Beds SG18 8HZ United Kingdom uad1077@dircon.co.uk
conners@cs.fau.edu (Sean Conner) (04/17/91)
In article <1991Apr13.154112.11345@dircon.co.uk> uad1077@dircon.co.uk (Ian Kemmish) writes: [ stuff about using scaled ints stuffed into bit bucket ] > >But don't get rid of floats completely. If you can, parameterise the >decision to use floats or scaled ints in a config file. On mips >systems, and with luck the civilised habit will spread, doing >arithmetic in floats os considerably faster than doing it in >integers (assuming its not something simple like a long run >of adds and subtracts with just a couple of token multiplies and >divides thrown in), which in turn is faster than arithmetic using >shorts. Scaled integers and shorts are slower, again. I recently >saw a preliminary copy of the JPEG sample implementation, and >it suffered from this ``integer-centrism''.... > >-- >Ian D. Kemmish Tel. +44 767 601 361 >18 Durham Close uad1077@dircon.UUCP >Biggleswade ukc!dircon!uad1077 >Beds SG18 8HZ United Kingdom uad1077@dircon.co.uk So, okay, I have a MIPS system. No problem. But what about us poor programmers who can only afford a 386/387 system? It is still faster to use ints than floats, for two reasons: 1. On a 386, the longest a IDIV will take (this is with a memory operand here) is 46 cycles. And the longest an IMUL will take is 41 cycles. On the 387, the SHORTEST time it will take to divide two reals is 91 cycles. A floating point multiply takes about the same as the IMUL instruction (depends on addressing modes, though). Drop down to a 286/287 combo, and the IMUL/IDIV instructions take about 20-30 cycles, while the FMUL/FDIV take about (for a FMUL) 130-170 or (for a FDIV) 190-230 cycles. Shall we go on to a 86/87 combo? 2. Do we REALLY need the ability to go from 1E-100 to 1E+100 in range when doing graphics? And what does it mean to add 1E-100 to 1E+100? For me, scaled ints are fine. -spc (Floats? I don't need no stinking floats! :-)
dmurdoch@watstat.waterloo.edu (Duncan Murdoch) (04/17/91)
In article <1991Apr16.210819.17873@cs.fau.edu> conners@cs.fau.edu (Sean Conner) writes: > > So, okay, I have a MIPS system. No problem. But what about us poor >programmers who can only afford a 386/387 system? It is still faster to >use ints than floats, for two reasons: > > 1. On a 386, the longest a IDIV will take (this is with a memory > operand here) is 46 cycles. And the longest an IMUL will take > is 41 cycles. On the 387, the SHORTEST time it will take to > divide two reals is 91 cycles. A floating point multiply takes > about the same as the IMUL instruction (depends on addressing > modes, though). Just in case you're interested: the times for the 486 are IDIV 19-44 FDIV 73 IMUL 13-42 FMUL 11-16 So it looks as though floating point math would be a contender, even without a MIPS, as long as you avoided the divides. I don't know the algorithms you want to use, but very often the denominator in a series of divides stays the same for a long time, so you can replace it with a series of multiplies by the reciprocal. Duncan Murdoch dmurdoch@watstat.waterloo.edu
conners@cs.fau.edu (Sean Conner) (04/18/91)
In article <1991Apr17.133142.24149@maytag.waterloo.edu> dmurdoch@watstat.waterloo.edu (Duncan Murdoch) writes: > >Just in case you're interested: the times for the 486 are > > IDIV 19-44 > FDIV 73 > IMUL 13-42 > FMUL 11-16 > >So it looks as though floating point math would be a contender, even without >a MIPS, as long as you avoided the divides. I don't know the algorithms >you want to use, but very often the denominator in a series of divides stays the >same for a long time, so you can replace it with a series of >multiplies by the reciprocal. > >Duncan Murdoch >dmurdoch@watstat.waterloo.edu So, we have the 386/387 covered. What about the 68020/68881(2)? I have an Amiga at home :-) -spc (Why do I program in Assembly? Speed. Pure and simple.)
hpasanen@cs.hut.fi (Harri Pasanen) (04/18/91)
Just an idea: Use C++ and define your own number class. Then depending on the class header included you can then have either ints or floats, and compile the code for the machine at hand for its maximum performance. Harri Pasanen
jcs@crash.cts.com (John Schultz) (04/23/91)
In <1991Apr17.235954.2334@cs.fau.edu> conners@cs.fau.edu (Sean Conner) writes: [stuff deleted] > So, we have the 386/387 covered. What about the 68020/68881(2)? I have >an Amiga at home :-) With the 68020/030/881/882 fixed point is much faster. Previous posts on this topic only considered mul and div (?). Add and sub are *very* slow for floats (slower than muls), whereas for ints, add is very fast. The best way to check the speed difference in methods is to write two versions of a function, one using floats, the other using ints, then compare times. If you mix floats with ints on the 680x0/88x, you'll see some speed penalties for processor to coprocessor moves (around 107 cycles in some cases). With 32 bit / 64 bit intermediate ints, you should be able to handle all computer graphics math. Physics equations are a different story... Anyway, with an assembler such as Adapt, you can dissassemble your code and compare instruction cycle times for various methods. The only way to know for sure it to time both methods and compare. John