[comp.lang.c] Execution time bottleneck: How

mazur@inmet.inmet.com (02/16/91)

In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca (Jeff Racine) writes:
>   for (j=1; j <= n; j++) {
>      sum = 0.0;
>      for (i=1; i<= n; i++) {
>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>      }
>      a[j] = k*sum;
>   }
>This part of the algorithm is a bottleneck. I want to make this as
>quick as possible. What are my options? I have heard that assembler
>would be faster. How much faster would assembler be? Would it be worth
>the effort? Are there other alternatives that you can see?

I missed the original article, so I'm not sure what you are compiling this
code with, but rewriting the loop so that you decrement from (n-1) to 0 
may reduce the number of instructions necessary to do the comparison (i.e.,
you may be able to do a "tst; beq" instead of "ld i, tst, beq").  Small
savings, but they add up.

Beth Mazur               
mazur@inmet.inmet.com  
-..!uunet!inmet!mazur 

mcdonald@aries.scs.uiuc.edu (Doug McDonald) (02/21/91)

In article <492@tin.crc.ac.uk> dcurtis@crc.ac.uk (Dr. David Curtis) writes:
>
>>In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca (Jeff Racine) writes:
>>>   for (j=1; j <= n; j++) {
>>>      sum = 0.0;
>>>      for (i=1; i<= n; i++) {
>>>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>>>      }
>>>      a[j] = k*sum;
>>>   }
>>>This part of the algorithm is a bottleneck. I want to make this as
>>>quick as possible. What are my options? I have heard that assembler
>>>would be faster. How much faster would assembler be? Would it be worth
>>>the effort? Are there other alternatives that you can see?
>
>Now what follows I'm less sure of. The main aim is to avoid calculating
>array offsets and just increment pointers instead. Can anyone say if 
>that kind of thing produces significant speed-ups?
>   
>   float *ap,*xjp,*xip,*a1p,*x1p;
>   a1p=&a[1];
>   x1p=&x[1];
>   for (ap=a1p, xjp=x1p, j=n; j ; --j, ++ap, ++xjp) {
>      float xj;
>      xj=*xjp;
>      sum = 0.0;
>      for (xip=x1p, i=n; i; --i, ++xip) {
>        float xfromj;
>        xfromj=*xip-xj; 
>        sum += exp( con * xfromj * xfromj);
>      }
>      *ap = k*sum;
>   }
>
>

Sometimes yes, sometimes no. I have tried benchmarking this sort of stuff
extensively. On many compilers, however, even a single exp is sufficiently
sluggish to swamp the benefit. On the other hand --- on MOST compilers
using two-dimensional arrays (either a[i][j] or a[i+m*j] style) is
sufficiently bad that the exp can't swamp it, and using pointers is a 
substantial help. This kind of optimization is more common in Fortran
compilers, but even there it is not universal.

Try it on your compiler and see. 

Doug McDonald

dcurtis@crc.ac.uk (Dr. David Curtis) (02/21/91)

>In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca (Jeff Racine) writes:
>>   for (j=1; j <= n; j++) {
>>      sum = 0.0;
>>      for (i=1; i<= n; i++) {
>>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>>      }
>>      a[j] = k*sum;
>>   }
>>This part of the algorithm is a bottleneck. I want to make this as
>>quick as possible. What are my options? I have heard that assembler
>>would be faster. How much faster would assembler be? Would it be worth
>>the effort? Are there other alternatives that you can see?

Yes, apart from decrementing instead of incrementing the loop, as has 
already been suggested:

   for (j=1; j <= n; j++) {
      float xj;
      xj=x[j];
      sum = 0.0;
      for (i=1; i<= n; i++) {
        float xfromj;
        xfromj=x[i]-xj; 
        sum += exp( con * xfromj * xfromj);
      }
      a[j] = k*sum;
   }

Now what follows I'm less sure of. The main aim is to avoid calculating
array offsets and just increment pointers instead. Can anyone say if 
that kind of thing produces significant speed-ups?
   
   float *ap,*xjp,*xip,*a1p,*x1p;
   a1p=&a[1];
   x1p=&x[1];
   for (ap=a1p, xjp=x1p, j=n; j ; --j, ++ap, ++xjp) {
      float xj;
      xj=*xjp;
      sum = 0.0;
      for (xip=x1p, i=n; i; --i, ++xip) {
        float xfromj;
        xfromj=*xip-xj; 
        sum += exp( con * xfromj * xfromj);
      }
      *ap = k*sum;
   }


Dave Curtis

Academic Department of Psychiatry,    Janet:       dc@UK.AC.UCL.SM.PSYCH
Middlesex Hospital,                   Elsewhere:   dc@PSYCH.SM.UCL.AC.UK
Mortimer Street, London W1N 8AA.      EARN/Bitnet: dc%PSYCH.SM.UCL@UKACRL
Tel 071-636 8333 Fax 071-323 1459     Usenet: ...!mcsun!ukc!mrccrc!D.Curtis

scs@adam.mit.edu (Steve Summit) (02/23/91)

In article <492@tin.crc.ac.uk> dcurtis@crc.ac.uk (Dr. David Curtis) writes:
> Now what follows I'm less sure of. The main aim is to avoid calculating
> array offsets and just increment pointers instead. Can anyone say if 
> that kind of thing produces significant speed-ups?

A pity; this comes up one week too soon.  Here is a sneak preview
of revision 1.24 of the comp.lang.c frequently-asked questions
list, coming soon to a spool directory near you:

100. Are pointers really faster than arrays?  Do function calls really
     slow things down?

A:   The answers to these and many similar questions depend of course on
     the processor and compiler in use.  If you simply must know, you'll
     have to time test programs carefully.  (Often the differences are
     so slight that tens or hundreds of thousands of iterations are
     required even to see them.  Check the compiler's assembly language
     output to see if two purported alternatives aren't compiled
     identically.)

     It is "usually" faster to march through large arrays with pointers
     rather than array subscripts, but for some processors the reverse
     is true.

     Function calls, though obviously incrementally slower than in-line
     code, contribute so much to modularity and code clarity that there
     is rarely good reason to avoid them.

                                            Steve Summit
                                            scs@adam.mit.edu

ADARTNEL@ESOC.BITNET (02/24/91)

You may be able to replace the exp function with an approximation,

     exp(x) = ( ( ( (x/5+1) * x/4 + 1 ) * x/3 + 1 ) * x/2 + 1 ) * x + 1

but this depends on the accuracy you require and the range of values of x.

dik@cwi.nl (Dik T. Winter) (02/25/91)

Numerical mathematics time.

In article <91053.093740ADARTNEL@ESOC.BITNET> ADARTNEL@ESOC.BITNET writes:
 > You may be able to replace the exp function with an approximation,
 >      exp(x) = ( ( ( (x/5+1) * x/4 + 1 ) * x/3 + 1 ) * x/2 + 1 ) * x + 1
 > but this depends on the accuracy you require and the range of values of x.

This requires 4 divisions, 4 multiplications and 5 additions.  Divisions can
be changed to multiplications; then we have 9 multiplications and
5 additions.  The maximal absolute error on (-0.5,0.5) is approximately
2.4e-5.  The code above can be rewritten to use only 5 multiplications and
5 additions.

Now take the following approximation:
    (((0.042190*x+0.169287)*x+0.499951)*x+0.999836)*x+1.0;
4 multiplications and 4 additions.  Maximal absolute error on (-0.5,0.5)
approximately 1.8e-5.  We can change to:
   ((((0.008421*x+0.042190)*x+0.166656)*x+0.499951)*x+1.0)*x+1.0;
with 5 multiplications and 5 additions and a maximal absolute error on
(-0.5,0.5) of approximately 1.5e-6.
But do not use these last two outside the interval.

The BSD 4.3-tahoe code takes (approximately) 1 division, 7 multiplications
and 8 additions to get 56 bit (VAX) precision.  (However that code does
some additional work to allow all arguments that do not result in overflow.)

Magic?  No, numerical mathematics.  Using Taylor series almost always leads
to slower routines.
--
No, there are no typos in those coefficients.  I just calculated them.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

neufeld@aurora.physics.utoronto.ca (Christopher Neufeld) (02/26/91)

In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca (Jeff Racine) writes:
>   for (j=1; j <= n; j++) {
>      sum = 0.0;
>      for (i=1; i<= n; i++) {
>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>      }
>      a[j] = k*sum;
>   }
>This part of the algorithm is a bottleneck. I want to make this as
>quick as possible. What are my options? I have heard that assembler
>would be faster. How much faster would assembler be? Would it be worth
>the effort? Are there other alternatives that you can see?
>
   Well, this might look obvious, but do those delta-x values form a
discrete set? If so, you can use a lookup table and calculate all the
exponentials out in front. It depends on your application, but when I
was doing Monte Carlo simulations which passed through an evaluation
step like this some 10^6 times per data point, I sped things up by
making a lookup table of some thousands of entries, all worked out ahead
of time, and then modified the algorithm so that it believed it was
working with integers (to simplify the algebra in calculating the table
index). If you've got enough memory, and that inner loop is executed
often enough, then an approach like this might help.


-- 
 Christopher Neufeld....Just a graduate student  | Note: new host.
 neufeld@aurora.physics.utoronto.ca    Ad astra! | helios will still
 cneufeld@{pnet91,pro-cco}.cts.com               | forward my mail to
 "Don't edit reality for the sake of simplicity" | me on aurora.