racine@yunexus.yorku.ca (Jeff Racine) (02/11/91)
Execution time query:

I have the following nested loop in which I must sum (over all i), for each
x[j], e raised to the power x[i]-x[j] squared (times a constant).

This section of the program is of order n-squared (i.e. execution time
increases at the rate of the number of observations squared).

   for (j=1; j <= n; j++) {
      sum = 0.0;
      for (i=1; i<= n; i++) {
         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
      }
      a[j] = k*sum;
   }

This part of the algorithm is a bottleneck. I want to make this as
quick as possible. What are my options? I have heard that assembler
would be faster. How much faster would assembler be? Would it be worth
the effort? Are there other alternatives that you can see? Any
improvement would be welcome.

Any help would be greatly appreciated. Thanks for your time and effort.

--------------------------------------------------------------------
Jeff Racine              Tel (416)-656-2916
racine@nexus.yorku.ca
racine@yunexus.UUCP      /* Yo Mama */
mcdaniel@adi.com (Tim McDaniel) (02/13/91)
racine@yunexus.yorku.ca (Jeff Racine) asks about

   sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );

in a pair of nested loops.

   This part of the algorithm is a bottleneck.

Hallelujah, I just got religion. This is the *first* optimization
article I remember in comp.lang.c where the author knew that the code
fragment was a bottleneck! (If it's not a bottleneck, it doesn't really
*matter* what you do; don't put the effort into optimizing if you're
not sure.)

   What are my options? I have heard that assembler would be faster.
   How much faster would assembler be? Would it be worth the effort?
   Are there other alternatives that you can see?

Jeff, if we were telepaths, we wouldn't need netnews. 8-)

Environment: What hardware are you on? What O. S.? What C compiler?
What assembler? What other code is around this fragment?

Data: What's the typical value of N? Anything special about X[]
(integer or other known GCD, triangular)? Anything special about CON?

Effort: Will this program be run often? Is N large? In short, is the
execution time large, and does it matter if it is? (I presume "yes"
throughout.) Is the program going to massage the same data on many
runs (i.e. would precomputation help)?

I doubt that assembler would help very much on common systems, though
that's only an ignorant guess.
--
"I'd hang you from my nipples ... but it would shock the children."

Tim McDaniel              Applied Dynamics Int'l.; Ann Arbor, Michigan, USA
Internet: mcdaniel@adi.com   UUCP: {uunet,sharkey}!amara!mcdaniel
barmar@think.com (Barry Margolin) (02/13/91)
In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca (Jeff Racine) writes:
>   for (j=1; j <= n; j++) {
>      sum = 0.0;
>      for (i=1; i<= n; i++) {
>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>      }
>      a[j] = k*sum;
>   }

First a question about the correctness: do you really want to loop from
1 to n? In C, if the array has n elements, the indices run from 0 to
n-1, not from 1 to n.

>This part of the algorithm is a bottleneck. I want to make this as
>quick as possible. What are my options? I have heard that assembler
>would be faster. How much faster would assembler be? Would it be worth
>the effort? Are there other alternatives that you can see?

Assembler probably won't help much, unless you're already a pretty good
assembler programmer. However, it would probably be helpful to look at
the assembler code produced by your C compiler, to see whether there
are ways you could recode the C so that it produces better compiled
code.

You might want to look at the book "Writing Efficient Programs" by Jon
Louis Bentley (the same person who went on to write the "Programming
Pearls" column in CACM; those columns have since been collected into a
couple of books). It describes many general techniques for determining
why a program is slow and improving it.

As for your code, there are a couple of minor changes that may help a
little (small improvements inside inner loops can have noticeable
effects). They assume that your C compiler doesn't have a very good
optimizer (generally a pretty safe assumption). I'll reference the
efficiency rules that Bentley describes.

Register Allocation. You reference i and j very frequently, so they
should be declared "register". This may not be necessary, though; I
expect many compilers default to putting loop index variables into
registers.

Loop Rule 1 -- Code Motion Out of Loops. Your inner loop refers to x[j]
each time through the loop, although j doesn't change during that loop.
It would probably be better to do "register double x_sub_j_temp = x[j];"
before the inner loop, and then access x_sub_j_temp in the inner loop
instead of x[j]. You've already applied this rule in your assignments
to a[j], by the way. A slower, but equivalent, way to have written your
code would have been for the inner loop to contain:

   a[j] += k * exp (...);

Expression Rule 3 -- Common Subexpression Elimination. You perform the
subtraction of the two elements twice when squaring. Instead, store the
result of the subtraction into a temporary, and square that:

   temp_difference = x[i] - x[j];
   sum += exp (con * temp_difference * temp_difference);

Many compilers are able to do this optimization themselves (it's one of
the oldest and most well known), so it may not improve efficiency (but
you might need to use the "register" declaration to get exactly the
same code); look at the compiler's assembler output to see. However, I
personally find this good style even if it doesn't change run time: the
original code forced me to examine it carefully to be sure that both
subtractions were the same (it could have been (x[i]-x[j])*(x[j]-x[i]),
which looks almost the same). By simplifying the expression you make it
easier for the human reader to discern the structure of the formula (it
would also help to use a more descriptive name than "temp_difference",
especially if this difference has a name in the application domain).
--
Barry Margolin, Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar
ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (02/13/91)
In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca wants to calculate

                  n
   a[j] = k * Sigma  c**((x[i]-x[j])**2),    for j = 1..n
                 i=1

for given k, c, x[].

Assembler is most unlikely to be any help at all, because a large chunk
of the time is spent inside exp(), which isn't going to run any faster
when you call it from assembler.

One thing that *can* be done is to chop the time nearly in half by
noting that (x[i]-x[j])**2 = (x[j]-x[i])**2. We can also optimise the
i==j case, as x[i]-x[j] = 0 then, hence c**(et cetera) = 1.0.

   register int i, j;
   register double t;

   for (j = 1; j <= n; j++) a[j] = 1.0;
   for (j = 1; j <= n; j++)
      for (i = j+1; i <= n; i++) {
         t = x[i] - x[j];
         t = exp(t*t*con);   /* c**((x[i]-x[j])**2) */
         a[i] += t;
         a[j] += t;          /* symmetric! */
      }
   for (j = 1; j <= n; j++) a[j] *= k;

This should do roughly half the work of the original code.

A good rule of thumb is: FIRST look for algebraic identities that can
reduce the amount of computation you need, and only AFTER you have done
that should you even begin to think about hacking the code.
--
Professional programming is paranoid programming
gvr@cs.brown.edu (George V. Reilly) (02/13/91)
In article <1991Feb12.233522.5195@Think.COM>, barmar@think.com (Barry
Margolin) writes an excellent followup to Jeff Racine's question about
hand-optimizing a bottleneck in his code.

> Execution time query:
>
> I have the following nested loop in which I must sum (over all i), for each
> x[j], e raised to the power x[i]-x[j] squared (times a constant).
>
> This section of the program is of order n-squared (i.e. execution time
> increases at the rate of the number of observations squared).
>
>   for (j=1; j <= n; j++) {
>      sum = 0.0;
>      for (i=1; i<= n; i++) {
>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>      }
>      a[j] = k*sum;
>   }

I second Barry's recommendation of Bentley's `Writing Efficient
Programs'.

One additional suggestion: if you can make assertions about

   r = con * (x[i] - x[j]) * (x[i] - x[j])

such as

   r is an integer in the range 1 <= r <= 50

or

   r is a multiple of 0.125 in the range -3.75 <= r <= +7.875,

then you can precompute a table of all of the possible results.

Caveat: if r is a multiple of some number (e.g., 0.1) which cannot be
expressed as an exact binary fraction (assuming binary FP hardware),
then the results may be inaccurate.

In article <15884.27b919a3@levels.sait.edu.au> marwk@levels.sait.edu.au
(Ray) writes:
> (1) rewrite the sum as Sum = 2 * S{i=1,i=n-1} S{j=i+1,j=n} f(i,j)
>     this cuts the number of calculations in half.

I can make no sense of this statement. It's not equivalent to the
original code.
________________
George V. Reilly              `| Pas une pipe'
gvr@cs.brown.edu     +1 (401) 863-7684
uunet!brunix!gvr     gvr@browncs.bitnet
Box 1910, Brown U, Prov, RI 02912
kooijman@duteca (Richard Kooijman) (02/13/91)
racine@yunexus.yorku.ca (Jeff Racine) writes:

>   for (j=1; j <= n; j++) {
>      sum = 0.0;
>      for (i=1; i<= n; i++) {
>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>      }
>      a[j] = k*sum;
>   }

   exp( a * b * c ) = exp(a) ^ (b * c)
                    = pow( exp(a), b * c )
                    = pow( con2, b * c )

exp(a) with a=con makes a new constant con2 = exp(con), which only has
to be calculated once.

Although your C compiler should be smart enough to see it itself:

   replace: (x[i]-x[j])*(x[i]-x[j])
   with:    dummy = x[i]-x[j];
            dummy * dummy

This causes x[i]-x[j] to be evaluated only once. The expression
sum += ... will now be as follows:

   dummy = x[i] - x[j];
   sum += pow( con2, dummy * dummy );

Now, I hope that pow() works at least as fast as exp(), because
otherwise you won't have sped things up much beyond reducing the number
of operations.

I can't see more optimizations, but maybe I'll look into it some more
later.

Richard.
marwk@levels.sait.edu.au (02/13/91)
In article <21658@yunexus.YorkU.CA>, racine@yunexus.yorku.ca (Jeff Racine) writes:
> Execution time query:
>
> I have the following nested loop in which I must sum (over all i), for each
> x[j], e raised to the power x[i]-x[j] squared (times a constant).
>
> This section of the program is of order n-squared (i.e. execution time
> increases at the rate of the number of observations squared).
>
>   for (j=1; j <= n; j++) {
>      sum = 0.0;
>      for (i=1; i<= n; i++) {
>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>      }
>      a[j] = k*sum;
>   }
>
> This part of the algorithm is a bottleneck. I want to make this as
> quick as possible. What are my options? I have heard that assembler
> would be faster. How much faster would assembler be? Would it be worth
> the effort? Are there other alternatives that you can see? Any
> improvement would be welcome.
>
> Any help would be greatly appreciated. Thanks for your time and effort.

There are some redundant calculations here:

(1) rewrite the sum as Sum = 2 * S{i=1,i=n-1} S{j=i+1,j=n} f(i,j);
    this cuts the number of calculations in half.

(2) x[i] - x[j] is being calculated twice in the inner loop. Store the
    difference in a variable in the inner loop, then square that
    variable. On a Mac fx with a 68882 this will make little
    difference, but with a PC and 80287 this will provide a noticeable
    increase in speed.

(3) make the loop variables 'register int'.

(4) if allowed, make x[] a register variable too, and the difference
    variable; else make them global variables.

(5) if possible, make n a #defined constant rather than a variable.

I would not bother with assembler unless you can make better use of the
registers than the compiler, or you have very fast real multiplication,
difference and exponential routines you can use.

There may be other improvements, but I cannot think of any more at the
moment.

Ray
--
University of South Australia | Plus ca change, plus c'est la meme chose.
P.O. Box 1                    | Ghing thien me how, ming thien gung me how.
Ingle Farm                    | Knobs, knobs everywhere,
South Australia               |              just vary a knob to think!
pt@geovision.gvc.com (Paul Tomblin) (02/13/91)
barmar@think.com (Barry Margolin) writes:
>In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca (Jeff Racine) writes:
>>   for (j=1; j <= n; j++) {
>>      sum = 0.0;
>>      for (i=1; i<= n; i++) {
>>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>>      }
>>      a[j] = k*sum;
>>   }

>First a question about the correctness: do you really want to loop from 1
>to n? In C, if the array has n elements, the indices run from 0 to n-1,
>not from 1 to n.

>>This part of the algorithm is a bottleneck. I want to make this as
>>quick as possible. What are my options? I have heard that assembler
>>would be faster. How much faster would assembler be? Would it be worth
>>the effort? Are there other alternatives that you can see?

[Some good suggestions from Barry deleted]

>Register Allocation.
>Loop Rule 1 -- Code Motion Out of Loops.
>Expression Rule 3 -- Common Subexpression Elimination.

Putting in all the optimization I can think of off the top of my head,
and assuming that he *really* meant 0 to n-1, I come up with:

   register double *ptri, *ptrj, *ptra;
   double *ptr_to_end;
   register double sum, sub_expression;

   ptrj = &x[0];
   ptra = &a[0];
   ptr_to_end = &x[n];
   while (ptrj < ptr_to_end) {
      sum = 0.0;
      ptri = &x[0];
      while (ptri < ptr_to_end) {
         sub_expression = *ptri - *ptrj;
         sum += exp( con * sub_expression * sub_expression );
         ptri++;
      }
      *ptra++ = k*sum;
      ptrj++;
   }

This combines MOST of the techniques that I learnt from an article in
Microsoft Systems Journal talking about the optimizer in MS-C. In other
words, if you use MS-C, you don't need to do most of these things. If
you use pcc (the bundled compiler on Suns, Ultrix, and lots of other
unix boxen), you DO need to do all this in critical loops.

The techniques I used:

Pointer unraveling? (I think that's what it's called): Converting array
references to pointers eliminates a lot of calculations of array
offsets on lousy optimizers.

Register allocation: Tell the compiler that the variables are going to
be referenced a lot. If you have enough registers, it will help (again,
only if your compiler isn't smart enough to decide on its own).

Common subexpressions: Another easy one, but I don't think pcc does it.

Anybody think of anything else? Did I make any major boo-boos?
--
Paul Tomblin, Department of Redundancy Department.   ! My employer does
The Romanian Orphans Support Group needs your help,  ! not stand by my
Ask me for details.                                  ! opinions....
pt@geovision.gvc.com or {cognos,uunet}!geovision!pt  ! Me neither.
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/14/91)
In article <4763@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
>   register int i, j;
>   register double t;
>
>   for (j = 1; j <= n; j++) a[j] = 1.0;
>   for (j = 1; j <= n; j++)
>      for (i = j+1; i <= n; i++) {
>         t = x[i] - x[j];
>         t = exp(t*t*con);   /* c**((x[i]-x[j])**2) */
>         a[i] += t;
>         a[j] += t;          /* symmetric! */
>      }
>   for (j = 1; j <= n; j++) a[j] *= k;

Some random observations, starting from this point:

It's wasteful to repeatedly multiply a homogeneous formula by a
constant. First multiply all the x's by the square root of the
constant. Assuming there's space available (or you can trash the
original array, or you can divide afterwards and not worry about
roundoff error):

   register int i;
   register int j;
   register double t;
   register double sqrtc;

   sqrtc = sqrt(c);
   for (j = 1; j <= n; ++j) y[j] = x[j] * sqrtc;
   for (j = 1; j <= n; ++j) a[j] = 1.0;
   for (j = 1; j <= n; ++j)
      for (i = j + 1; i <= n; ++i) {
         t = y[i] - y[j];
         t = exp(t * t);
         a[i] += t;
         a[j] += t;
      }
   for (j = 1; j <= n; j++) a[j] *= k;

Next, the loops will run faster backwards on most machines.

   register int i;
   register int j;
   register double t;
   register double sqrtc;

   sqrtc = sqrt(c);
   j = n; do { y[j] = x[j] * sqrtc; } while(--j);
   j = n; do { a[j] = 1.0; } while(--j);
   j = n;
   do {
      i = n;
      do {
         t = y[i] - y[j];
         t = exp(t * t);
         a[i] += t;
         a[j] += t;
      } while(--i != j);
   } while(--j);
   j = n; do { a[j] *= k; } while(--j);

Next, the accumulated a[j] sum might as well be kept in a register, as
well as y[j]:

   register int i;
   register int j;
   register double t;
   register double sqrtc;
   register double yj;
   register double aj;

   sqrtc = sqrt(c);
   j = n; do { y[j] = x[j] * sqrtc; } while(--j);
   j = n; do { a[j] = 1.0; } while(--j);
   j = n;
   do {
      aj = a[j];
      yj = y[j];
      i = n;
      do {
         t = y[i] - yj;
         t = exp(t * t);
         aj += t;
         a[i] += t;
      } while(--i != j);
      a[j] = aj;
   } while(--j);
   j = n; do { a[j] *= k; } while(--j);

Under some compilers on some machines it will produce a slight speedup
to keep a + i and y + i in registers:

   register int i;
   register int j;
   register double t;
   register double sqrtc;
   register double yj;
   register double aj;
   register double *ypi;
   register double *api;

   sqrtc = sqrt(c);
   j = n; do { y[j] = x[j] * sqrtc; a[j] = 1.0; } while(--j);
   j = n;
   do {
      aj = a[j];
      yj = y[j];
      i = n;
      api = a + i;
      ypi = y + i;
      do {
         t = *ypi - yj;
         t = exp(t * t);
         aj += t;
         *api += t;
      } while(--api, --ypi, (--i != j));
      a[j] = aj;
   } while(--j);
   j = n; do { a[j] *= k; } while(--j);

On the other hand, the loop should be written somewhat differently for
vector machines:

   register int i;
   register int j;
   register double t;
   register double sqrtc;
   register double yj;
   register double aj;

   sqrtc = sqrt(c);
   j = n; do { y[j] = x[j] * sqrtc; a[j] = 1.0; } while(--j);
   j = n;
   do {
      yj = y[j];
      i = n;
      do {
         t = y[i] - yj;
         t = exp(t * t);
         a[i] += t;
         z[i] = t;
      } while(--i != j);
      aj = a[j];
      i = n;
      do { aj += z[i]; } while(--i != j);
      a[j] = aj;
   } while(--j);
   j = n; do { a[j] *= k; } while(--j);

If the machine can't vectorize exp then the loop should be split still
differently.

Another strategy to compute (yi - yj)^2 is to keep yi^2 + yj^2 and
2*yi*yj in registers, and precompute tables of y[i+1]^2 - y[i]^2 and
y[i+1]/y[i]. To update the registers requires an addition and a
multiplication; to get the final value requires a subtraction and an
exp. On some wacko machine where multiplications are faster than
additions, you could even keep exp(yi^2 + yj^2) and 2*yi*yj around,
with a table of exp(y[i+1]^2 - y[i]^2) and a table of y[i+1]/y[i].
Then updating needs two multiplications, and the final value requires
an exp and a multiplication. These won't be faster on any computer
I've used.

Caveat: These are all untested.

---Dan
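[Editor's note: the incremental-update strategy in the last paragraph rests on the identity (yi - yj)^2 = (yi^2 + yj^2) - 2*yi*yj. A small sketch checking that update arithmetic; the function is mine, and note the y values must be nonzero since the 2*yi*yj register is updated by the ratio y[i+1]/y[i].]

```c
#include <math.h>

/* Scan i over y[0..n-1] for a fixed yj, keeping
      s = y[i]^2 + yj^2   (updated by one addition per step)
      p = 2*y[i]*yj       (updated by one multiplication per step)
   and checking at each step that s - p equals (y[i]-yj)^2 computed
   directly. Returns 1 if every step agrees, 0 otherwise. */
int check_updates(const double *y, int n, double yj)
{
    double s = y[0] * y[0] + yj * yj;
    double p = 2.0 * y[0] * yj;
    for (int i = 0; i < n; i++) {
        double direct = (y[i] - yj) * (y[i] - yj);
        if (fabs((s - p) - direct) > 1e-9)
            return 0;
        if (i + 1 < n) {
            s += y[i + 1] * y[i + 1] - y[i] * y[i];  /* table entry */
            p *= y[i + 1] / y[i];                    /* table entry */
        }
    }
    return 1;
}
```

With the two tables precomputed, each step of the inner loop costs one add and one multiply before the exp, which is the count claimed above.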
gwyn@smoke.brl.mil (Doug Gwyn) (02/14/91)
In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca (Jeff Racine) writes:
>   for (j=1; j <= n; j++) {
>      sum = 0.0;
>      for (i=1; i<= n; i++) {
>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>      }
>      a[j] = k*sum;
>   }
>This part of the algorithm is a bottleneck. I want to make this as
>quick as possible. What are my options? I have heard that assembler
>would be faster. How much faster would assembler be? Would it be worth
>the effort? Are there other alternatives that you can see? Any
>improvement would be welcome.

Conversion to assembler would not help much, since the majority of the
time is spent in evaluating the exponential function. There are some
simple rearrangements that would help substantially, however:

(1) Factor out the e^con and fold it into the "k" multiplier.

(2) If your compiler is not highly optimizing, introduce a register
    double variable into which x[i]-x[j] is stored, then multiply that
    temporary by itself to produce the argument to exp().

(3) Consider precomputing the matrix exp((x[i]-x[j])^2), which is
    symmetric, so you need only compute a little more than half the
    [i,j] combinations.

If this is being used for some sort of matching procedure with integral
x[.] values, you might also consider using an information-statistics
procedure using n*log(n) terms instead; such terms can be precomputed
(or dynamically computed and cached) for a gain in efficiency.
gwyn@smoke.brl.mil (Doug Gwyn) (02/14/91)
In article <64745@brunix.UUCP> gvr@cs.brown.edu (George V. Reilly) writes:
-In article <15884.27b919a3@levels.sait.edu.au> marwk@levels.sait.edu.au (Ray)
-writes:
-> (1) rewrite the sum as Sum = 2 * S{i=1,i=n-1} S{j=i+1,j=n} f(i,j)
->     this cuts the number of calculations in half.
-I can make no sense of this statement. It's not equivalent to
-the original code.

The idea was right, but he handled the diagonal incorrectly.
dik@cwi.nl (Dik T. Winter) (02/14/91)
This is ludicrous.

In article <17664:Feb1319:36:1291@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> In article <4763@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:

(A perfectly good solution; so good that I mailed the same to the
original poster.)

What most people fail to see is that the bulk of the work in the loop
is the calculation of the exponential! So all solutions that fail to
reduce that number of calculations are doomed to fail to reduce the
execution time very much. (And the solution I saw that used pow rather
than exp will not help either, because pow will in general use exp, or
an algorithm that is roughly equivalent.)

Now to Dan's points:

> Some random observations, starting from this point: It's wasteful to
> repeatedly multiply a homogeneous formula by a constant. First multiply
> all the x's by the square root of the constant. Assuming there's space
> available (or you can trash the original array, or you can divide
> afterwards and not worry about roundoff error):

Considering the number of multiplications the exponential uses, this is
peanuts (what is it? I am too lazy to look at the source, but I presume
more than 10).

> Next, the loops will run faster backwards on most machines.

Ah yes, going backwards gets you a comparison to 0 rather than to a
non-zero value. Useful, *if* the body of the loop is small.

> Next, the accumulated a[j] sum might as well be kept in a register, as
> well as y[j]:

And because the exponential is called, this means one more register has
to be saved. Question: what do we gain?

> Under some compilers on some machines it will produce a slight speedup
> to keep a + i and y + i in registers:

And again.

> On the other hand, the loop should be written somewhat differently for
> vector machines:

Ha! Because of the call to exp, most C compilers for vector machines
will not vectorize it. You have to do quite a lot more for those.
(Yup, just two weeks ago I did some optimization on the Cray of a loop
involving logarithms.)

> If the machine can't vectorize exp then the loop should be split still
> differently.

Yes, loop splitting: do in scalar what cannot be vectorized. So now you
are going to vectorize 1-10% of the work involved. A waste of the
programmer's time.

Etc. I have seen similar other responses. In general, in cases like
this: *look where the time is going*. In this example the time is going
to the exponential function. If this function is not available in
hardware it will in general take about 10 to 20 times the time needed
for a multiplication (this is an estimate not based on fact; it is
probably too low). Even if this function (or its computational
workhorse) is available in hardware, it takes a lot of time:

   Intel 80387:    F2XM1  211-476 clocks (the workhorse);
   Motorola 68881: FETOX  466+ clocks.

So shaving off one or more subtractions, memory accesses etc. will not
help very much. The basic idea must be to reduce the number of
calculations of exp (which the solution of Richard A. O'Keefe did).

Another thing worth considering is reducing the time needed to
calculate the exp itself. In that case questions like argument range,
required precision etc. arise. It might be worthwhile to roll your own
version of exp (see for instance Cody & Waite, "Software Manual for the
Elementary Functions"). Especially if the required precision is
smaller, this might help.

But never ever do some micro-optimization that speeds up some part of
the algorithm by a factor of perhaps 100 if that part takes only a very
small percentage of the total time (see Amdahl's Law); that really is a
waste of the programmer's time.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/14/91)
In article <2934@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
> This is ludicrous.

Look, Richard's original took 15.6 seconds on a Sun, 14.2 with cc -O4,
all with n = 1000. Mine took 15.0 seconds, 13.8 with cc -O4. That's a
4% speedup.

On a Convex, Richard's took 3.91 without optimization, 3.03 with. Mine
took 3.56 without, 2.72 with. That's a 10% speedup.

With a fast exp I have lying around, the speedups become 33% and 52%
respectively. (This exp uses four tables of size 65536 and does table
lookup.)

I made one mistake in my transformations: j should have been
initialized to n - 1, not n, in the main loop. (Richard should have
done the same thing, but he was properly defensive and used <= rather
than !=. Oops.) That bug took me one minute to find. Writing the
article in the first place took about fifteen minutes, most of that for
explaining the transformations rather than doing them.

Dik, take a step back and look at your contribution to this discussion.
Who's trying to be more helpful, you or me? I'm saying ``Here are some
optimizations. I hope you find them useful.'' You're saying ``Don't
optimize your code. Even though you care about its speed enough to
spend a few minutes asking the net for help, surely you don't think
it's worth a few minutes of programming time to make it run noticeably
faster.'' Why would anyone want to take that advice?

---Dan
dik@cwi.nl (Dik T. Winter) (02/14/91)
In article <24587:Feb1411:32:5391@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> Look, Richard's original took 15.6 seconds on a Sun, 14.2 with cc -O4,
> all with n = 1000. Mine took 15.0 seconds, 13.8 with cc -O4. That's a
> 4% speedup.

Great.

> On a Convex, Richard's took 3.91 without optimization, 3.03 with. Mine
> took 3.56 without, 2.72 with. That's a 10% speedup.

Looks better.

> With a fast exp I have lying around, the speedups become 33% and 52%
> respectively. (This exp uses four tables of size 65536 and does table
> lookup.)

But that is what you are looking for.

(First, I had already mailed my solution to the original poster, which
was basically the same as Richard O'Keefe's and Doug Gwyn's, plus some
additional suggestions.)

> Dik, take a step back and look at your contribution to this discussion.
> Who's trying to be more helpful, you or me? I'm saying ``Here are some
> optimizations. I hope you find them useful.'' You're saying ``Don't
> optimize your code. Even though you care about its speed enough to
> spend a few minutes asking the net for help, surely you don't think
> it's worth a few minutes of programming time to make it run noticeably
> faster.''

I did not say that. What I said is that the transformations you gave
will not improve it very much (and indeed you find 4% and 10%
respectively; the 10% probably due to the vectorization). Further, I
said that it is worthwhile looking at the exponential, which you just
confirmed!

So who is more helpful? You gave just a few random solutions that
result in a 4% speedup on a Sun, while the suggestion I made can give
large improvements. Now if your suggestion had been to use your fast
exponential..... (And, sorry, I do not have one nearby, so I could not
offer one.)
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/15/91)
Dik, the optimizations I posted in a previous article are responsible
for between a 4% and a 50% speedup, depending on your machine, your
compiler, etc. This isn't the speedup you get from an optimizing
compiler. This isn't the speedup you get from buying a new machine.
This is the speedup from the ``ludicrous'' ``micro-optimization'' that
you took such pains to criticize.

In article <2940@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
> (and indeed you find 4% and 10% respectively; the
> 10% probably due to the vectorization).

As I said the first time, the 10% is the speedup you get on a Convex
with the standard math library exp() when you apply the ``ludicrous''
optimizations I pointed out. It is not due to vectorization.

> You gave just a few random solutions that result in a 4% speedup on a
> Sun, while the suggestion I did can give large improvements.

Yes, and I could equally well have said ``buy a Cray.'' If the original
poster didn't have a Cray this would result in ``large improvements.''
Similarly, the code becomes quite a lot faster if the poster uses a
fast exp()---but do you really think that ``use a fast exp()'' is any
more helpful than ``buy a Cray''? No. Neither one answers the question.

Furthermore, if the poster *does* have a fast exp() running on his
Cray, the optimizations I posted will give an even better speedup.

---Dan
rmartin@clear.com (Bob Martin) (02/15/91)
>Execution time query:
>
>   for (j=1; j <= n; j++) {
>      sum = 0.0;
>      for (i=1; i<= n; i++) {
>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>      }
>      a[j] = k*sum;
>   }
>
>This part of the algorithm is a bottleneck. I want to make this as
>quick as possible. What are my options? I have heard that assembler
>would be faster. How much faster would assembler be? Would it be worth
>the effort? Are there other alternatives that you can see? Any
>improvement would be welcome.

I am not a math whiz, and I couldn't see any way to get the exp out of
the inner loop. However there are a few things that can make the
program run significantly faster... Assume that x runs from 0 to n-1
instead of 1 to n.

   double x[MAXSIZE], a[MAXSIZE];
   register double *ip, *jp, *ap;
   double diff, sum;
   register int j, i;
   int n = MAXSIZE;   /* or whatever size you decide on */

   for (j=n, jp=x, ap=a; j; j--, jp++, ap++) {
      sum = 0.0;
      for (i=n, ip=x; i; i--, ip++) {
         diff = *ip - *jp;
         sum += exp( con * diff * diff );
      }
      *ap = k * sum;
   }

Array subscripting is often slower than direct pointer dereferences.
Also the subtraction x[i]-x[j] need only be done once. Putting the
pointers into registers may also help quite a bit.

The changes I have made here are similar to the trade-offs that an
assembly language programmer would make. Translating further into
assembly language might increase your performance yet again, but I
would guess it would not be dramatically more effective than the above.

-----

Here is another thought which may reduce the work in the algorithm from
n^2 terms to fewer than (n^2+n)/2, although it will make the code a bit
more complex. The matrix S[i][j] = exp(con * (x[i]-x[j])^2) is
symmetric about the i==j axis: S[i][j] == S[j][i]!!! Also, the diagonal
always evaluates to 1!!! This means that you need calculate less than
half of the S values.

   for (j=0; j<n; j++) {
      a[j] = 1;                  /* put in the diagonal */
   }
   for (j=0; j<(n-1); j++) {     /* don't need to do the last row */
      for (i=j+1; i<n; i++) {    /* this excludes the i==j axis */
         diff = x[i] - x[j];
         s = exp(con * diff * diff);
         a[j] += s;
         a[i] += s;
      }
   }
   for (j=0; j<n; j++) {
      a[j] *= k;                 /* final scaling by k */
   }

Now if you "pointerize" this algorithm, it will be much faster than
your original. Caveat: This is all theory, I haven't tested it at all.

Note: I am posting this since other netpeople more knowledgeable than I
may offer more help.
--
+-Robert C. Martin-----+:RRR:::CCC:M:::::M:| Nobody is responsible for |
| rmartin@clear.com    |:R::R:C::::M:M:M:M:| my words but me. I want   |
| uunet!clrcom!rmartin |:RRR::C::::M::M::M:| all the credit, and all   |
+----------------------+:R::R::CCC:M:::::M:| the blame. So there.      |
JRowe@exua.exeter.ac.uk (John Rowe) (02/15/91)
In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca (Jeff Racine) writes: Execution time query: I have the following nested loop in which I must sum (over all i), for each x[j], e raised to the power x[i]-x[j] squared (times a constant). for (j=1; j <= n; j++) { sum = 0.0; for (i=1; i<= n; i++) { sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) ); } a[j] = k*sum; } Jeff needs to speed it up. The C guys will tell you how to speed up the loops; being a physicist, ie really just a FORTRAN programmer, I would just point out to halve the number of them and maybe take out a multiply. With minimal recoding maybe something along the lines of: /* Just typed in - maybe typos */ foo = sqrt(con > 0 ? con : -con); for (j = 1; j <= n; ++j) { a[j] = 1.0; /* i = j case */ y[j] = x[j] * foo; } if ( con > 0.0 ) for (j = 1; j < n; ++j) for (i = j + 1; i <= n; ++i) { a[j] += foo = exp( (y[i]-y[j]) * (y[i]-y[j]) ); a[i] += foo; /* Two sums per exp */ } else for (j = 1; j < n; ++j) for (i = j + 1; i <= n; ++i) { a[j] += foo = exp( - (y[i]-y[j]) * (y[i]-y[j]) ); a[i] += foo; /* NB ^ only difference.*/ } for (j=1; j <= n; ++j) a[j] *= k; [pious platitudes about first improving the algorithm, taking advantage of symmetry, etc. omitted]. I don't know your machine but it's the halving of the number of exponentials that I would expect to give you the greatest gain, 50%. If you've got a modern, floating point orientated RISC machine (RS6000, i860??) floating point multiplies may be so fast it's not it may not be worth the hassle of the vector y and the if ( con < 0.0 ) bit. Any FORTRAN compiler would optimise out the (y[i]-y[j]) * (y[i]-y[j]) in its sleep, so should your C compiler. If not you could do it by hand as I see someone has sugested, but that's *probably* even less of a gain. I would guess that the question of c v assembler hangs upon how long your cpu takes to do a floating point multiply and exponential against the quality of your compiler's optimiser. 
Halving the number of exponentials is only a factor of two, but it's
zero effort and it's always there however you end up coding it. But I
may have really missed something obvious as to why you can't do that; I
had a late night this morning. Good luck and hope this helps.

John Rowe
Exeter University Computational Physics Group
Exeter
U.K.
dik@cwi.nl (Dik T. Winter) (02/15/91)
In article <26862:Feb1416:46:4391@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:

> Dik, the optimizations I posted in a previous article are responsible
> for between a 4% and a 50% speedup, depending on your machine, your
> compiler, etc.

The optimizations you gave in your first article gave between 4 and 10%
speedup, not up to 50%. To break down the different speedups on a Sun
SLC, comparing O'Keefe's variant with your 5 versions (done with n=1000,
5 times; time in seconds):

   version   time    variant
   OK        55.93   O'Keefe's variant
   1         55.15   1.4% speedup (using tmp for xi-xj, including c)
   2         53.37   4.6% speedup (looping down, not up)
   3         53.30   4.7% speedup (register variable for a[j])
   4         53.87   3.7% speedup (pointer for a+i and y+j)
   5         53.82   3.8% speedup (vectorizable code)

And yes, I consider this micro optimization.

> As I said the first time, the 10% is the speedup you get on a Convex
> with the standard math library exp() when you apply the ``ludicrous''
> optimizations I pointed out. It is not due to vectorization.

Might be. Does the Convex vectorize all five variants?

> Yes, and I could equally well have said ``buy a Cray.'' If the original
> poster didn't have a Cray this would result in ``large improvements.''

Sure. Times on a Cray Y/MP:

   version   time     remarks
   OK        0.3458   original
   1         0.3282   5.1% speedup
   2         0.3292   4.8% speedup
   3         0.3312   4.2% speedup
   4         0.3717   7.5% slowdown
   5         0.3469   0.3% slowdown

Although the compiler needed a bit of persuasion to vectorize the
original and versions 1 and 2. So also here: micro optimization.
(Calling versions 4 and 5 optimized is even stretching the meaning of
the term a bit!)

> Similarly, the code becomes quite a lot faster if the poster uses a fast
> exp()---but do you really think that ``use a fast exp()'' is any more
> helpful than ``buy a Cray''?

No. Neither one answers the question. But 'use a fast exp()' is much
cheaper than 'buy a Cray'! And it can be handled by an adequate
programmer.
> Furthermore, if the poster *does* have a fast exp() running on his Cray,
> the optimizations I posted will give an even better speedup.

I doubt it very much.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
oz@yunexus.yorku.ca (Ozan Yigit) (02/17/91)
In article <24587:Feb1411:32:5391@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:

>On a Convex, Richard's took 3.91 without optimization, 3.03 with. Mine
>took 3.56 without, 2.72 with. That's a 10% speedup.

On a MIPS 2000 (using -O3), Richard's took 1.08 seconds. Yours took 2.4
seconds. You figure out the percentages.

oz
cjkuo@locus.com (Chengi Jimmy Kuo) (02/17/91)
"Use a faster exp()" is the correct answer. Only you, the programmer,
know about such things as the digits of accuracy you need, whether you
can work in ints or fixed decimals, and other things that could be
designed into your private exp(). Also, in your exp() you can exploit
the identity e**(kx) => (e**k)**x, as well as reducing your expression.

Jimmy Kuo
--
cjkuo@locus.com
"The correct answer to an either/or question is both!"
csp@gtenmc.UUCP (02/23/91)
In article <1991Feb12.233522.5195@Think.COM> barmar@think.com (Barry Margolin) writes:
>In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca (Jeff Racine) writes:
>>   for (j=1; j <= n; j++) {
>>      sum = 0.0;
>>      for (i=1; i<= n; i++) {
>>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>>      }
>>      a[j] = k*sum;
>>   }

Here is my solution :-

   for ( p = x + 1 , b = a + 1 , j = 1 ; j <= n ; j++ , p++ , *b++ = k * sum )
      for ( p2 = x + 1 , sum = 0.0 , i = 1 ; i <= n ; p2++ , i++ )
         temp = *p2 - *p ,
         sum += exp( temp * temp * con );

[Editorial note: the posted initialisers read `p++ = x` and so on, which
is not valid C (p++ is not an lvalue); they have been corrected to plain
assignments, offset by one to match the 1-based loops.]

C S Palkar
--
garry@ceco.ceco.com (Garry Garrett) (02/24/91)
In article <1991Feb14.170400.8089@clear.com>, rmartin@clear.com (Bob Martin) writes:
> >Execution time query:
> >
> >   for (j=1; j <= n; j++) {
> >      sum = 0.0;
> >      for (i=1; i<= n; i++) {
> >         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
> >      }
> >      a[j] = k*sum;
> >   }
> >
> >This part of the algorithm is a bottleneck. I want to make this as
> >quick as possible. What are my options? I have heard that assembler
> >would be faster. How much faster would assembler be? Would it be worth
> >the effort? Are there other alternatives that you can see? Any
> >improvement would be welcome.
>
>   double x[MAXSIZE], a[MAXSIZE];
>   register double *ip, *jp, *ap;
>   double diff;
>   register int j, i;
>   int n = MAXSIZE;    /* or whatever size you decide on */
>
>   for(j=n, jp=x, ap=a; j; j--, jp++, ap++) {
>      sum = 0.0;
>      for(i=n, ip=x; i; i--, ip++) {
>         diff = *ip - *jp;
>         sum += exp( con * diff * diff );

   /* consider replacing the above 2 lines with:
    *    sum += exp( con * (diff = *ip - *jp) * diff );
    * the savings is in the fact that after diff = ... is calculated,
    * the value (written out to the memory location that contains diff)
    * is still in a register, and need not be read back into the
    * register. */

>      }
>      *ap = sum;
>   }

[... or]

>   for (j=0; j<n; j++) {
>      a[j] = 1;                  /* put in the diagonal */
>   }
>
>   for (j=0; j<(n-1); j++) {     /* don't need to do the last row */
>      for (i=j+1; i<n; i++) {    /* this excludes the i==j axis */
>         diff = x[i] - x[j];
>         s = exp(con * diff * diff);

   /* same comment applies here */

>         a[j] += s;
>         a[i] += s;
>      }
>   }

Hope it helps,

Garry Garrett

(sorry, I have to add a bunch of <CR>'s or my mailer won't accept this)