mazur@inmet.inmet.com (02/16/91)
In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca (Jeff Racine) writes:
>   for (j=1; j <= n; j++) {
>      sum = 0.0;
>      for (i=1; i<= n; i++) {
>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>      }
>      a[j] = k*sum;
>   }
>This part of the algorithm is a bottleneck. I want to make this as
>quick as possible. What are my options? I have heard that assembler
>would be faster. How much faster would assembler be? Would it be worth
>the effort? Are there other alternatives that you can see?

I missed the original article, so I'm not sure what you are compiling
this code with, but rewriting the loop so that you decrement from
(n-1) to 0 may reduce the number of instructions needed for the
comparison (i.e., you may be able to do a "tst; beq" instead of
"ld i, tst, beq").  Small savings, but they add up.

Beth Mazur
mazur@inmet.inmet.com
-..!uunet!inmet!mazur
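For concreteness, the count-down rewrite Beth describes would look
something like this. A sketch only: the function name and argument
list are invented here, since the posting shows only the loop, and
whether the zero-compare saves anything depends on your compiler and
CPU.

    #include <math.h>

    /* Count-down version of Jeff's loop; x[] and a[] are assumed
       indexed 1..n as in his original. */
    void smooth(double x[], double a[], int n, double con, double k)
    {
        int i, j;
        double sum;

        for (j = n; j > 0; j--) {      /* loop test compares against zero */
            sum = 0.0;
            for (i = n; i > 0; i--)
                sum += exp(con * (x[i] - x[j]) * (x[i] - x[j]));
            a[j] = k * sum;
        }
    }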
mcdonald@aries.scs.uiuc.edu (Doug McDonald) (02/21/91)
In article <492@tin.crc.ac.uk> dcurtis@crc.ac.uk (Dr. David Curtis) writes:
>
>>In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca (Jeff Racine) writes:
>>>   for (j=1; j <= n; j++) {
>>>      sum = 0.0;
>>>      for (i=1; i<= n; i++) {
>>>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>>>      }
>>>      a[j] = k*sum;
>>>   }
>>>This part of the algorithm is a bottleneck. I want to make this as
>>>quick as possible. What are my options? I have heard that assembler
>>>would be faster. How much faster would assembler be? Would it be worth
>>>the effort? Are there other alternatives that you can see?
>
>Now what follows I'm less sure of. The main aim is to avoid calculating
>array offsets and just increment pointers instead. Can anyone say if
>that kind of thing produces significant speed-ups?
>
>  float *ap,*xjp,*xip,*a1p,*x1p;
>  a1p=&a[1];
>  x1p=&x[1];
>  for (ap=a1p, xjp=x1p, j=n; j; --j, ++ap, ++xjp) {
>    float xj;
>    xj=*xjp;
>    sum = 0.0;
>    for (xip=x1p, i=n; i; --i, ++xip) {
>      float xfromj;
>      xfromj=*xip-xj;
>      sum += exp( con * xfromj * xfromj );
>    }
>    *ap = k*sum;
>  }

Sometimes yes, sometimes no. I have tried benchmarking this sort of
stuff extensively. On many compilers, though, even a single exp is
sufficiently sluggish to swamp the benefit. On the other hand, on MOST
compilers using two-dimensional arrays (either a[i][j] or a[i+m*j]
style) is sufficiently bad that the exp can't hide it, and using
pointers is a substantial help. This kind of optimization is more
common in Fortran compilers, but even there it is not universal. Try
it on your compiler and see.

Doug McDonald
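To illustrate the two-dimensional case Doug mentions, compare these
two ways of zeroing an m-by-n array stored flat. A sketch under the
assumption that the compiler does not already strength-reduce the
index expression; a good one may generate identical code for both, so
time them before committing.

    /* Subscripted form, a[i + m*j] style: the offset i + m*j is
       recomputed on every iteration unless the compiler optimizes it. */
    void zero_subscript(double *a, int m, int n)
    {
        int i, j;
        for (j = 0; j < n; j++)
            for (i = 0; i < m; i++)
                a[i + m*j] = 0.0;
    }

    /* Pointer form: one pointer marches straight through the same
       storage; count should be m*n. */
    void zero_pointer(double *a, long count)
    {
        double *p = a, *end = a + count;
        while (p < end)
            *p++ = 0.0;
    }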
dcurtis@crc.ac.uk (Dr. David Curtis) (02/21/91)
>In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca (Jeff Racine) writes:
>>   for (j=1; j <= n; j++) {
>>      sum = 0.0;
>>      for (i=1; i<= n; i++) {
>>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>>      }
>>      a[j] = k*sum;
>>   }
>>This part of the algorithm is a bottleneck. I want to make this as
>>quick as possible. What are my options? I have heard that assembler
>>would be faster. How much faster would assembler be? Would it be worth
>>the effort? Are there other alternatives that you can see?

Yes. Apart from decrementing instead of incrementing the loop, as has
already been suggested, you can hoist x[j] out of the inner loop and
form the difference only once:

  for (j=1; j <= n; j++) {
    float xj;
    xj=x[j];
    sum = 0.0;
    for (i=1; i<= n; i++) {
      float xfromj;
      xfromj=x[i]-xj;
      sum += exp( con * xfromj * xfromj );
    }
    a[j] = k*sum;
  }

Now what follows I'm less sure of. The main aim is to avoid calculating
array offsets and just increment pointers instead. Can anyone say if
that kind of thing produces significant speed-ups?

  float *ap,*xjp,*xip,*a1p,*x1p;
  a1p=&a[1];
  x1p=&x[1];
  for (ap=a1p, xjp=x1p, j=n; j; --j, ++ap, ++xjp) {
    float xj;
    xj=*xjp;
    sum = 0.0;
    for (xip=x1p, i=n; i; --i, ++xip) {
      float xfromj;
      xfromj=*xip-xj;
      sum += exp( con * xfromj * xfromj );
    }
    *ap = k*sum;
  }

Dave Curtis

Academic Department of Psychiatry,    Janet:       dc@UK.AC.UCL.SM.PSYCH
Middlesex Hospital,                   Elsewhere:   dc@PSYCH.SM.UCL.AC.UK
Mortimer Street, London W1N 8AA.      EARN/Bitnet: dc%PSYCH.SM.UCL@UKACRL
Tel 071-636 8333  Fax 071-323 1459    Usenet:      ...!mcsun!ukc!mrccrc!D.Curtis
scs@adam.mit.edu (Steve Summit) (02/23/91)
In article <492@tin.crc.ac.uk> dcurtis@crc.ac.uk (Dr. David Curtis) writes:
> Now what follows I'm less sure of. The main aim is to avoid calculating
> array offsets and just increment pointers instead. Can anyone say if
> that kind of thing produces significant speed-ups?

Schade; this comes up one week too soon. Here is a sneak preview of
revision 1.24 of the comp.lang.c frequently-asked questions list,
coming soon to a spool directory near you:

100. Are pointers really faster than arrays? Do function calls really
     slow things down?

A:   The answers to these and many similar questions depend of course
     on the processor and compiler in use. If you simply must know,
     you'll have to time test programs carefully. (Often the
     differences are so slight that tens or hundreds of thousands of
     iterations are required even to see them. Check the compiler's
     assembly language output to see if two purported alternatives
     aren't compiled identically.)

     It is "usually" faster to march through large arrays with
     pointers rather than array subscripts, but for some processors
     the reverse is true.

     Function calls, though obviously incrementally slower than
     in-line code, contribute so much to modularity and code clarity
     that there is rarely good reason to avoid them.

Steve Summit
scs@adam.mit.edu
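In the spirit of that answer, a minimal timing harness might look
like the following. The version_a/version_b names are placeholders
for whatever two variants you are comparing, and clock() granularity
is coarse, hence the large repeat count.

    #include <stdio.h>
    #include <time.h>

    /* Stand-ins: replace these bodies with the loops under test. */
    static void version_a(void) { /* subscripted loop here */ }
    static void version_b(void) { /* pointer loop here */ }

    static double time_one(void (*f)(void), long reps)
    {
        clock_t t0 = clock();
        long i;
        for (i = 0; i < reps; i++)
            f();
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        printf("version_a: %.2f sec\n", time_one(version_a, 100000L));
        printf("version_b: %.2f sec\n", time_one(version_b, 100000L));
        return 0;
    }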
ADARTNEL@ESOC.BITNET (02/24/91)
You may be able to replace the exp function with an approximation,

    exp(x) = ( ( ( (x/5+1) * x/4 + 1 ) * x/3 + 1 ) * x/2 + 1 ) * x + 1

but this depends on the accuracy you require and the range of values of x.
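As a C function, that nested form (the degree-5 Taylor polynomial for
exp) transcribes directly. A sketch only: the accuracy caveat stands,
so check the range of con*(x[i]-x[j])^2 in your data before
substituting this for the library exp().

    /* Degree-5 Taylor polynomial for exp(x), nested to avoid powers;
       error grows quickly away from zero. */
    double exp_taylor(double x)
    {
        return ((((x/5.0 + 1.0)*x/4.0 + 1.0)*x/3.0 + 1.0)*x/2.0 + 1.0)*x + 1.0;
    }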
dik@cwi.nl (Dik T. Winter) (02/25/91)
Numerical mathematics time.

In article <91053.093740ADARTNEL@ESOC.BITNET> ADARTNEL@ESOC.BITNET writes:
> You may be able to replace the exp function with an approximation,
>     exp(x) = ( ( ( (x/5+1) * x/4 + 1 ) * x/3 + 1 ) * x/2 + 1 ) * x + 1
> but this depends on the accuracy you require and the range of values of x.

This requires 4 divisions, 4 multiplications and 5 additions.
Divisions can be changed to multiplications; then we have 9
multiplications and 5 additions. The maximal absolute error on
(-0.5,0.5) is approximately 2.4e-5. The code above can be rewritten
to use only 5 multiplications and 5 additions.

Now take the following approximation:

    (((0.042190*x+0.169287)*x+0.499951)*x+0.999836)*x+1.0;

4 multiplications and 4 additions. Maximal absolute error on
(-0.5,0.5) approximately 1.8e-5. We can change to:

    ((((0.008421*x+0.042190)*x+0.166656)*x+0.499951)*x+1.0)*x+1.0;

with 5 multiplications and 5 additions and a maximal absolute error
on (-0.5,0.5) of approximately 1.5e-6. But do not use these last two
outside the interval.

The BSD 4.3-tahoe code takes (approximately) 1 division, 7
multiplications and 8 additions to get 56-bit (VAX) precision.
(However, that code does some additional work to allow all arguments
that do not result in overflow.)

Magic? No, numerical mathematics. Using Taylor series almost always
leads to slower routines.
--
No, there are no typos in those coefficients. I just calculated them.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
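For reference, Dik's two fits drop straight into C. Valid only on
(-0.5,0.5), as he stresses; the function names are just labels for
this sketch, not library code.

    double exp_fit4(double x)   /* |error| ~ 1.8e-5 on (-0.5,0.5) */
    {
        return (((0.042190*x + 0.169287)*x + 0.499951)*x + 0.999836)*x + 1.0;
    }

    double exp_fit5(double x)   /* |error| ~ 1.5e-6 on (-0.5,0.5) */
    {
        return ((((0.008421*x + 0.042190)*x + 0.166656)*x + 0.499951)*x + 1.0)*x + 1.0;
    }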
neufeld@aurora.physics.utoronto.ca (Christopher Neufeld) (02/26/91)
In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca (Jeff Racine) writes:
>   for (j=1; j <= n; j++) {
>      sum = 0.0;
>      for (i=1; i<= n; i++) {
>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>      }
>      a[j] = k*sum;
>   }
>This part of the algorithm is a bottleneck. I want to make this as
>quick as possible. What are my options? I have heard that assembler
>would be faster. How much faster would assembler be? Would it be worth
>the effort? Are there other alternatives that you can see?
>

Well, this might look obvious, but do those delta-x values form a
discrete set? If so, you can use a lookup table and calculate all the
exponentials out in front. It depends on your application, but when I
was doing Monte Carlo simulations which passed through an evaluation
step like this some 10^6 times per data point, I sped things up by
making a lookup table of some thousands of entries, all worked out
ahead of time, and then modified the algorithm so that it believed it
was working with integers (to simplify the algebra in calculating the
table index). If you've got enough memory, and that inner loop is
executed often enough, then an approach like this might help.
--
 Christopher Neufeld....Just a graduate student  | Note: new host.
 neufeld@aurora.physics.utoronto.ca   Ad astra!  | helios will still
 cneufeld@{pnet91,pro-cco}.cts.com               | forward my mail to
 "Don't edit reality for the sake of simplicity" | me on aurora.
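A minimal sketch of the table approach Christopher describes, under
the assumption that the arguments con*(x[i]-x[j])^2 are known to fall
in [LO, 0] (i.e., con is negative). TABLE_SIZE and LO are
illustrative choices, and the nearest-entry lookup trades accuracy
for speed.

    #include <math.h>

    #define TABLE_SIZE 4096
    #define LO (-20.0)          /* assumed lower bound; exp(-20) ~ 2e-9 */

    static double exp_table[TABLE_SIZE];

    /* Work out all the exponentials once, up front. */
    void build_exp_table(void)
    {
        int i;
        for (i = 0; i < TABLE_SIZE; i++)
            exp_table[i] = exp(LO * i / (TABLE_SIZE - 1));
    }

    /* Nearest-entry lookup, replacing exp(t) for LO <= t <= 0. */
    double exp_lookup(double t)
    {
        return exp_table[(int)(t / LO * (TABLE_SIZE - 1) + 0.5)];
    }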