[comp.lang.c] Execution time bottleneck: How to speed up execution?

racine@yunexus.yorku.ca (Jeff Racine) (02/11/91)

Execution time query:

I have the following nested loop in which I must sum (over all i), for each
x[j], e raised to the power x[i]-x[j] squared (times a constant).

This section of the program is of order n-squared (i.e. execution time
increases at the rate of the number of observations squared).

   for (j=1; j <= n; j++) {
      sum = 0.0;
      for (i=1; i<= n; i++) {
         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
      }
      a[j] = k*sum;
   }

This part of the algorithm is a bottleneck. I want to make this as
quick as possible. What are my options? I have heard that assembler
would be faster. How much faster would assembler be? Would it be worth
the effort? Are there other alternatives that you can see?  Any
improvement would be welcome.

Any help would be greatly appreciated. Thanks for your time and effort.


--------------------------------------------------------------------
Jeff Racine             Tel (416)-656-2916
racine@nexus.yorku.ca   
racine@yunexus.UUCP     /* Yo Mama */

mcdaniel@adi.com (Tim McDaniel) (02/13/91)

racine@yunexus.yorku.ca (Jeff Racine) asks about 

   sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );

in a pair of nested loops.

   This part of the algorithm is a bottleneck.

Hallelujah, I just got religion.  This is the *first* optimization
article I remember in comp.lang.c where the author knew that the code
fragment was a bottleneck!  (If it's not a bottleneck, it doesn't
really *matter* what you do; don't put the effort into optimizing if
you're not sure.)

   What are my options? I have heard that assembler
   would be faster. How much faster would assembler be? Would it be worth
   the effort? Are there other alternatives that you can see?

Jeff, if we were telepaths, we wouldn't need netnews.  8-)

Environment: What hardware are you on?  What O. S.?  What C compiler?
What assembler?  What other code is around this fragment?

Data: What's the typical value of N?  Anything special about X[]
(integer or other known GCD, triangular)?  Anything special about CON?

Effort: Will this program be run often?  Is N large?  In short, is the
execution time large, and does it matter if it is?  (I presume "yes"
throughout.)  Is the program going to massage the same data on many
runs (i.e. would precomputation help)?

I doubt that assembler would help very much on common systems, though
that's only an ignorant guess.

--
"I'd hang you from my nipples ... but it would shock the children."

Tim McDaniel                 Applied Dynamics Int'l.; Ann Arbor, Michigan, USA
Internet: mcdaniel@adi.com                UUCP: {uunet,sharkey}!amara!mcdaniel

barmar@think.com (Barry Margolin) (02/13/91)

In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca (Jeff Racine) writes:
>   for (j=1; j <= n; j++) {
>      sum = 0.0;
>      for (i=1; i<= n; i++) {
>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>      }
>      a[j] = k*sum;
>   }

First a question about the correctness: do you really want to loop from 1
to n?  In C, if the array has n elements, the indices run from 0 to n-1,
not from 1 to n.

>This part of the algorithm is a bottleneck. I want to make this as
>quick as possible. What are my options? I have heard that assembler
>would be faster. How much faster would assembler be? Would it be worth
>the effort? Are there other alternatives that you can see?

Assembler probably won't help much, unless you're already a pretty good
assembler programmer.  However, it would probably be helpful if you looked
at the assembler code produced by your C compiler, to see whether there are
ways you could recode the C so that it would produce better compiled
code.

You might want to look at the book "Writing Efficient Programs" by Jon
Louis Bentley (he's the same person who went on to write the "Programming
Pearls" column in CACM; those columns have since been collected into a
couple of books).  It describes many general techniques for determining why
a program is slow and improving it.

As for your code, there are a couple of minor changes that may help a
little (but small improvements inside inner loops can have noticeable
effects).  They assume that your C compiler doesn't have a very good
optimizer (generally a pretty safe assumption).  I'll reference the
efficiency rules that Bentley describes.

Register Allocation.
  You reference i and j very frequently, so they should be declared
  "register".  This may not be necessary, though; I expect many compilers
  default to putting loop index variables into registers.

Loop Rule 1 -- Code Motion Out of Loops.
  Your inner loop refers to x[j] each time through the loop, although j
  doesn't change during that loop.  It would probably be better to do
  "register x_sub_j_temp = x[j];" before the inner loop, and then access
  x_sub_j_temp in the inner loop instead of x[j].

  You've already applied this rule in your assignments to a[j], by the way.
  A slower, but equivalent, way to have written your code would have been
  for the inner loop to contain:

	a[j] += k * exp (...);

Expression Rule 3 -- Common Subexpression Elimination.
  You perform the subtraction of the two elements twice when squaring.
  Instead, store the result of the subtraction into a temporary, and square
  that:

	temp_difference = x[i] - x[j];
	sum += exp (con * temp_difference * temp_difference);

  Many compilers are able to do this optimization themselves (it's one of
  the oldest and most well known), so it may not improve efficiency (but
  you might need to use the "register" declaration to get exactly the same
  code); look at the compiler's assembler output to see.  However, I
personally find this good style even if it doesn't change run time; the
original code forced me to examine it carefully to be sure that both
subtractions were the same (it could have been (x[i]-x[j])*(x[j]-x[i]),
  which looks almost the same); by simplifying the expression you make it
  easier for the human reader to discern the structure of the formula (it
  would also help to use a more descriptive name than "temp_difference",
  especially if this difference has a name in the application domain).
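
Putting the three rules together, the rewritten loop might look like
this (a minimal sketch, not tested; it assumes 1-based double arrays
and the con, k, and n of the original posting, and the wrapper
function is just for illustration):

	#include <math.h>

	void smooth(const double x[], double a[], double con, double k, int n)
	{
	    register int i, j;
	    register double x_sub_j_temp, temp_difference, sum;

	    for (j = 1; j <= n; j++) {
		sum = 0.0;
		x_sub_j_temp = x[j];	/* Loop Rule 1: hoisted out */
		for (i = 1; i <= n; i++) {
		    /* Expression Rule 3: subtract once, square the temp */
		    temp_difference = x[i] - x_sub_j_temp;
		    sum += exp(con * temp_difference * temp_difference);
		}
		a[j] = k * sum;
	    }
	}
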
--
Barry Margolin, Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (02/13/91)

In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca
wants to calculate
    a[j] = k * SUM{i=1..n} c**((x[i]-x[j])**2),    for j = 1..n
for given k, c, x[].  Assembler is most unlikely to be any help at all,
because a large chunk of the time is spent inside exp(), which isn't
going to run any faster when you call it from assembler.  One thing
that *can* be done is to chop the time nearly in half by noting that
(x[i]-x[j])**2 = (x[j]-x[i])**2.  We can also optimise the i==j case,
as x[i]-x[j] = 0 then, hence c**(et cetera) = 1.0.

	register int i, j;
	register double t;

	for (j = 1; j <= n; j++) a[j] = 1.0;
	for (j = 1; j <= n; j++)
	    for (i = j+1; i <= n; i++) {
		t = x[i] - x[j];
		t = exp(t*t*con);	/* c**((x[i]-x[j])**2) */
		a[i] += t;
		a[j] += t;		/* symmetric! */
	    }
	for (j = 1; j <= n; j++) a[j] *= k;

This should do roughly half the work of the original code.
A good rule of thumb is:  FIRST look for algebraic identities that
can reduce the amount of computation you need, and only AFTER you
have done that should you even begin to think about hacking the code.

-- 
Professional programming is paranoid programming

gvr@cs.brown.edu (George V. Reilly) (02/13/91)

In article <1991Feb12.233522.5195@Think.COM>, barmar@think.com
(Barry Margolin) writes an excellent followup to Jeff Racine's
question about hand-optimizing a bottleneck in his code.

> Execution time query:
>
> I have the following nested loop in which I must sum (over all i), for each
> x[j], e raised to the power x[i]-x[j] squared (times a constant).
>
> This section of the program is of order n-squared (i.e. execution time
> increases at the rate of the number of observations squared).
>
>    for (j=1; j <= n; j++) {
>       sum = 0.0;
>       for (i=1; i<= n; i++) {
>          sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>       }
>       a[j] = k*sum;
>    }
>

I second Barry's recommendation of Bentley's `Writing Efficient Programs'.

One additional suggestion: if you can make assertions about
	r = con * (x[i] - x[j]) * (x[i] - x[j])
such as r is an integer in the range 1 <= r <= 50 or 
r is a multiple of 0.125 in the range -3.75 <= r <= +7.875,
then you can precompute a table of all of the possible results.
Caveat: if r is a multiple of some number (e.g., 0.1) which
cannot be expressed as an exact binary fraction (assuming
binary FP hardware), then the results may be inaccurate.
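
For instance, under the second assumption above (r always a multiple
of 0.125 in [-3.75, +7.875] -- an assumption only you can check
against your data), a sketch:

	#include <math.h>

	#define R_LO   (-3.75)	/* assumed lower bound on r */
	#define R_STEP 0.125	/* assumed granularity of r */
	#define NSTEPS 94	/* (7.875 - (-3.75))/0.125 + 1 slots */

	static double exp_table[NSTEPS];

	void init_exp_table(void)
	{
	    int i;
	    for (i = 0; i < NSTEPS; i++)
		exp_table[i] = exp(R_LO + i * R_STEP);
	}

In the inner loop, exp(r) then becomes

	exp_table[(int)((r - R_LO) / R_STEP + 0.5)]

where the + 0.5 rounds to the nearest slot; since 0.125 is an exact
binary fraction, the index cannot drift.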

In article <15884.27b919a3@levels.sait.edu.au> marwk@levels.sait.edu.au (Ray)
writes:

> (1) rewrite the sum as Sum = 2 * S{i=1,i=n-1} S{j=i+1,j=n} f(i,j)
>     this cuts the number of calculations in half.

I can make no sense of this statement.  It's not equivalent to
the original code.
________________
George V. Reilly   `| Pas une pipe'	gvr@cs.brown.edu   +1 (401) 863-7684
uunet!brunix!gvr   gvr@browncs.bitnet	Box 1910, Brown U, Prov, RI 02912

kooijman@duteca (Richard Kooijman) (02/13/91)

racine@yunexus.yorku.ca (Jeff Racine) writes:

>Execution time query:

>I have the following nested loop in which I must sum (over all i), for each
>x[j], e raised to the power x[i]-x[j] squared (times a constant).

>This section of the program is of order n-squared (i.e. execution time
>increases at the rate of the number of observations squared).

>   for (j=1; j <= n; j++) {
>      sum = 0.0;
>      for (i=1; i<= n; i++) {
>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>      }
>      a[j] = k*sum;
>   }


exp ( a * b * c ) = exp (a) ^ (b * c) = pow ( exp(a), b * c) =
pow ( con2, b * c)


exp(a) with a = con gives a new constant con2 = exp(con), which only
has to be calculated once.
Also, although your C compiler should be smart enough to see it
itself, replace (x[i]-x[j])*(x[i]-x[j]) with

   dummy = x[i]-x[j];
   ... dummy * dummy ...

This causes x[i]-x[j] to be evaluated only once.

The expression sum+= ... will now be as follows:

dummy = x[i] - x[j];
sum += pow ( con2, dummy * dummy );

Now, I hope that pow() works at least as fast as exp(), because otherwise
you won't have sped it up much, beyond reducing the number of
operations.
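
Put together, the whole fragment might read (a sketch; i, j as before,
con2, dummy, and sum doubles):

	con2 = exp(con);	/* once, outside both loops */
	for (j = 1; j <= n; j++) {
	    sum = 0.0;
	    for (i = 1; i <= n; i++) {
		dummy = x[i] - x[j];
		sum += pow(con2, dummy * dummy);
	    }
	    a[j] = k * sum;
	}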

I can't see any more optimizations, but maybe I'll look into it some more
later.


Richard.

marwk@levels.sait.edu.au (02/13/91)

In article <21658@yunexus.YorkU.CA>, racine@yunexus.yorku.ca (Jeff Racine) writes:
> Execution time query:
>
> I have the following nested loop in which I must sum (over all i), for each
> x[j], e raised to the power x[i]-x[j] squared (times a constant).
>
> This section of the program is of order n-squared (i.e. execution time
> increases at the rate of the number of observations squared).
>
>    for (j=1; j <= n; j++) {
>       sum = 0.0;
>       for (i=1; i<= n; i++) {
>          sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>       }
>       a[j] = k*sum;
>    }
>
> This part of the algorithm is a bottleneck. I want to make this as
> quick as possible. What are my options? I have heard that assembler
> would be faster. How much faster would assembler be? Would it be worth
> the effort? Are there other alternatives that you can see?  Any
> improvement would be welcome.
>
> Any help would be greatly appreciated. Thanks for your time and effort.
>
>
> --------------------------------------------------------------------

There are some redundant calculations here:

(1) rewrite the sum as Sum = 2 * S{i=1,i=n-1} S{j=i+1,j=n} f(i,j)

    this cuts the number of calculations in half.

(2) x(i) - x(j) is being calculated twice in the inner loop.

    store these differences in a variable in the inner loop,
    then square this var.

    On a MAC fx with a 68882 this will make little difference, but
    with a PC and 80287 this will provide a noticeable increase in speed.

(3) make the loop variables 'register int'

(4) if allowed make x[] a register variable too.
               and the difference variable.

    else make them global variables.

(5) if possible make n a #defined constant rather than a variable.

I would not bother with assembler unless you can make better use of
the registers than the compiler, or you have very fast real multiplication,
difference and exponential routines you can use.

There may be other improvements, but I cannot think of any more at the
moment.

Ray
--
University of South Australia   | Plus ca change, plus c'est la meme chose.
P.O. Box 1                      | Ghing thien me how, ming thien gung me how.
Ingle Farm                      | Knobs, knobs everywhere,
South Australia                 |              just vary a knob to think!

pt@geovision.gvc.com (Paul Tomblin) (02/13/91)

barmar@think.com (Barry Margolin) writes:

>In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca (Jeff Racine) writes:
>>   for (j=1; j <= n; j++) {
>>      sum = 0.0;
>>      for (i=1; i<= n; i++) {
>>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>>      }
>>      a[j] = k*sum;
>>   }

>First a question about the correctness: do you really want to loop from 1
>to n?  In C, if the array has n elements, the indices run from 0 to n-1,
>not from 1 to n.

>>This part of the algorithm is a bottleneck. I want to make this as
>>quick as possible. What are my options? I have heard that assembler
>>would be faster. How much faster would assembler be? Would it be worth
>>the effort? Are there other alternatives that you can see?
[Some good suggestions from Barry deleted]
>Register Allocation.
>Loop Rule 1 -- Code Motion Out of Loops.
>Expression Rule 3 -- Common Subexpression Elimination.

Putting in all the optimizations I can think of off the top of my head, and
assuming that he *really* meant 0 to n-1, I come up with:

	register double	*ptri, *ptrj, *ptra;
	register double	sum, sub_expression;
	double *ptr_to_end;

	ptrj = &x[0];
	ptra = &a[0];
	ptr_to_end = &x[n];
	while (ptrj < ptr_to_end) {
		sum = 0.0;
		ptri = &x[0];
		while (ptri < ptr_to_end) {
			sub_expression = *ptri - *ptrj;
			sum += exp( con * sub_expression * sub_expression);
			ptri++;
		}
		*ptra++ = k*sum;
		ptrj++;
	}

This combines MOST of the techniques that I learnt from an article in
Microsoft Systems Journal talking about the optimizer in MS-C.  In other
words, if you use MS-C, you don't need to do most of these things.  If
you use pcc (the bundled compiler on Suns, Ultrix, and lots of other unix
boxen), you DO need to do all this in critical loops.

The techniques I used:

Pointer unraveling? (I think that's what it's called): Converting array
references to pointers eliminates a lot of calculations of array offsets
on lousy optimizers

Register allocation: Tell the compiler that the variables are going to
be referenced a lot.  If you have enough registers, it will help (again,
only if your compiler isn't smart enough to decide on its own)

Common subexpressions: Another easy one, but I don't think pcc does it.

Anybody think of anything else?  Did I make any major boo-boos?
-- 
Paul Tomblin, Department of Redundancy Department.       ! My employer does 
The Romanian Orphans Support Group needs your help,      ! not stand by my
Ask me for details.                                      ! opinions.... 
pt@geovision.gvc.com or {cognos,uunet}!geovision!pt      ! Me neither.

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/14/91)

In article <4763@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
> 	register int i, j;
> 	register double t;
> 	for (j = 1; j <= n; j++) a[j] = 1.0;
> 	for (j = 1; j <= n; j++)
> 	    for (i = j+1; i <= n; i++) {
> 		t = x[i] - x[j];
> 		t = exp(t*t*con);	/* c**((x[i]-x[j])**2) */
> 		a[i] += t;
> 		a[j] += t;		/* symmetric! */
> 	    }
> 	for (j = 1; j <= n; j++) a[j] *= k;

Some random observations, starting from this point: It's wasteful to
repeatedly multiply a homogeneous formula by a constant. First multiply
all the x's by the square root of the constant. Assuming there's space
available (or you can trash the original array, or you can divide
afterwards and not worry about roundoff error):

	register int i;
	register int j;
	register double t;
	register double sqrtc;

	sqrtc = sqrt(c);
	for (j = 1;j <= n;++j) y[j] = x[j] * sqrtc;
	for (j = 1;j <= n;++j) a[j] = 1.0;
	for (j = 1;j <= n;++j)
	  for (i = j + 1;i <= n;++i) {
	    t = y[i] - y[j];
	    t = exp(t * t);
	    a[i] += t;
	    a[j] += t;
	  }
 	for (j = 1;j <= n;j++) a[j] *= k;

Next, the loops will run faster backwards on most machines.

	register int i;
	register int j;
	register double t;
	register double sqrtc;

	sqrtc = sqrt(c);
	j = n; do { y[j] = x[j] * sqrtc; } while(--j);
	j = n; do { a[j] = 1.0; } while(--j);
	j = n; do {
	  i = n; do {
	    t = y[i] - y[j];
	    t = exp(t * t);
	    a[i] += t;
	    a[j] += t;
	  } while(--i != j);
	} while(--j);
	j = n; do { a[j] *= k; } while(--j);

Next, the accumulated a[j] sum might as well be kept in a register, as
well as y[j]:

	register int i;
	register int j;
	register double t;
	register double sqrtc;
	register double yj;
	register double aj;

	sqrtc = sqrt(c);
	j = n; do { y[j] = x[j] * sqrtc; } while(--j);
	j = n; do { a[j] = 1.0; } while(--j);
	j = n; do {
	  aj = a[j]; yj = y[j];
	  i = n; do {
	    t = y[i] - yj;
	    t = exp(t * t);
	    aj += t;
	    a[i] += t;
	  } while(--i != j);
	  a[j] = aj;
	} while(--j);
	j = n; do { a[j] *= k; } while(--j);

Under some compilers on some machines it will produce a slight speedup
to keep a + i and y + i in registers:

	register int i;
	register int j;
	register double t;
	register double sqrtc;
	register double yj;
	register double aj;
	register double *ypi;
	register double *api;

	sqrtc = sqrt(c);
	j = n; do { y[j] = x[j] * sqrtc; a[j] = 1.0; } while(--j);
	j = n; do {
	  aj = a[j]; yj = y[j];
	  i = n; api = a + i; ypi = y + i; do {
	    t = *ypi - yj;
	    t = exp(t * t);
	    aj += t;
	    *api += t;
	  } while(--api, --ypi, (--i != j));
	  a[j] = aj;
	} while(--j);
	j = n; do { a[j] *= k; } while(--j);

On the other hand, the loop should be written somewhat differently for
vector machines:

	register int i;
	register int j;
	register double t;
	register double sqrtc;
	register double yj;
	register double aj;

	sqrtc = sqrt(c);
	j = n; do { y[j] = x[j] * sqrtc; a[j] = 1.0; } while(--j);
	j = n; do {
	  yj = y[j];
	  i = n; do {
	    t = y[i] - yj;
	    t = exp(t * t);
	    a[i] += t;
	    z[i] = t;
	  } while(--i != j);
	  aj = a[j];
	  i = n; do {
	    aj += z[i];
	  } while(--i != j);
	  a[j] = aj;
	} while(--j);
	j = n; do { a[j] *= k; } while(--j);

If the machine can't vectorize exp then the loop should be split still
differently.

Another strategy to compute (yi - yj)^2 is to keep yi^2 + yj^2 and 2yiyj
in registers, and precompute tables of y[i+1]^2 - yi^2 and y[i+1]/yi. To
update the registers requires an addition and a multiplication; to get
the final value requires a subtraction and an exp. On some wacko machine
where multiplications are faster than additions, you could even keep
exp(yi^2 + yj^2) and 2yiyj around, with a table of exp(y[i+1]^2 - yi^2)
and a table of y[i+1]/yi. Then updating needs two multiplications, and
the final value requires an exp and a multiplication. These won't be
faster on any computer I've used.
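
To make the bookkeeping concrete, here is a sketch of the first
strategy, in the same 1-based style as above; it needs two scratch
arrays dsq[] and rat[] of doubles, and it assumes every y[] is
nonzero, since the ratio table divides by y[i]:

	register int i, j;
	register double s, p, t;

	for (i = 1; i < n; i++) {
	  dsq[i] = y[i+1] * y[i+1] - y[i] * y[i];
	  rat[i] = y[i+1] / y[i];
	}
	for (j = 1; j <= n; j++) a[j] = 1.0;	/* the diagonal terms */
	for (j = 1; j < n; j++) {
	  s = y[j+1] * y[j+1] + y[j] * y[j];	/* y[i]^2 + y[j]^2 at i = j+1 */
	  p = 2.0 * y[j+1] * y[j];		/* 2 y[i] y[j] at i = j+1 */
	  for (i = j + 1; ; ) {
	    t = exp(s - p);	/* (y[i]-y[j])^2 = y[i]^2 + y[j]^2 - 2 y[i] y[j] */
	    a[i] += t;
	    a[j] += t;
	    if (++i > n) break;
	    s += dsq[i-1];	/* the addition */
	    p *= rat[i-1];	/* the multiplication */
	  }
	}
	for (j = 1; j <= n; j++) a[j] *= k;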

Caveat: These are all untested.

---Dan

gwyn@smoke.brl.mil (Doug Gwyn) (02/14/91)

In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca (Jeff Racine) writes:
>   for (j=1; j <= n; j++) {
>      sum = 0.0;
>      for (i=1; i<= n; i++) {
>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>      }
>      a[j] = k*sum;
>   }
>This part of the algorithm is a bottleneck. I want to make this as
>quick as possible. What are my options? I have heard that assembler
>would be faster. How much faster would assembler be? Would it be worth
>the effort? Are there other alternatives that you can see?  Any
>improvement would be welcome.

Conversion to assembler would not help much, since the majority of
the time is spent in evaluating the exponential function.  There are
some simple rearrangements that would help substantially, however:
	(1)  Precompute c = e^con once; since e^(con*t) = (e^con)^t, the
	inner term can become pow(c, d*d) where d = x[i]-x[j], taking
	the constant out of the exp() argument.
	(2)  If your compiler is not highly optimizing, introduce a
	register double variable into which x[i]-x[j] is stored, then
	multiply that temporary by itself to produce the argument to
	exp().
	(3)  Consider precomputing the matrix exp(con*(x[i]-x[j])^2),
	which is symmetric, so you need only compute a little more than
	half the [i,j] combinations.
If this is being used for some sort of matching procedure with integral
x[.] values, you might also consider using an information-statistics
procedure using n*log(n) terms instead; such terms can be precomputed
(or dynamically computed and cached) for a gain in efficiency.
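
A sketch of the caching idea, assuming integral x[.] values whose
pairwise differences are bounded by some MAXD, and the same con on
every call (all the names here are invented for illustration):

	#include <math.h>

	#define MAXD 1024	/* assumed bound on |x[i]-x[j]| */

	static double term_cache[MAXD + 1];
	static char   have_term[MAXD + 1];

	double eterm(long d, double con)
	{
		if (d < 0) d = -d;	/* exp(con*d*d) depends only on |d| */
		if (!have_term[d]) {
			term_cache[d] = exp(con * (double)d * (double)d);
			have_term[d] = 1;
		}
		return term_cache[d];
	}

The inner sum then uses eterm(x[i]-x[j], con), and each distinct
difference costs one exp() over the whole run.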

gwyn@smoke.brl.mil (Doug Gwyn) (02/14/91)

In article <64745@brunix.UUCP> gvr@cs.brown.edu (George V. Reilly) writes:
-In article <15884.27b919a3@levels.sait.edu.au> marwk@levels.sait.edu.au (Ray)
-writes:
-> (1) rewrite the sum as Sum = 2 * S{i=1,i=n-1} S{j=i+1,j=n} f(i,j)
->     this cuts the number of calculations in half.
-I can make no sense of this statement.  It's not equivalent to
-the original code.

The idea was right, but he handled the diagonal incorrectly.

dik@cwi.nl (Dik T. Winter) (02/14/91)

This is ludicrous.

In article <17664:Feb1319:36:1291@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
 > In article <4763@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
(A perfectly good solution; so good that I mailed the same to the original
poster.)

What most people fail to see is that the bulk of the work in the loop is the
calculation of the exponential!  So all solutions that fail to reduce the
number of those calculations are doomed to fail to reduce the execution time
very much.  (And the solution I saw that used pow rather than exp will not
help either, because pow will in general use exp, or an algorithm that is
roughly equivalent.)  Now to Dan's points:
 > Some random observations, starting from this point: It's wasteful to
 > repeatedly multiply a homogeneous formula by a constant. First multiply
 > all the x's by the square root of the constant. Assuming there's space
 > available (or you can trash the original array, or you can divide
 > afterwards and not worry about roundoff error):
Considering the number of multiplications the exponential uses, this is
peanuts (what is it?  I am too lazy to look at the source, but I presume
more than 10.).
 > Next, the loops will run faster backwards on most machines.
Ah yes, going backwards gets you a comparison against 0 rather than against
a non-zero value.  Useful, *if* the body of the loop is small.
 > Next, the accumulated a[j] sum might as well be kept in a register, as
 > well as y[j]:
And because the exponential is called this means one more register has
to be saved.  Question: what do we gain?
 > Under some compilers on some machines it will produce a slight speedup
 > to keep a + i and y + i in registers:
And again.
 > On the other hand, the loop should be written somewhat differently for
 > vector machines:
Ha!  And because of the call to exp most C compilers for vector machines
will not vectorize it.  You have to do quite a lot more for those.
(Yup, just two weeks ago I did some optimization on the Cray of a loop
involving logarithms.)
 > If the machine can't vectorize exp then the loop should be split still
 > differently.
Yes, loop splitting.  Do in scalar what cannot be vectorized.  So now
you are going to vectorize 1-10% of the work involved.  A waste of
programmer's time.

Etc.  I have seen similar other responses.  In general, in cases like this:
*look where the time is going*.  In this example the time is going to
the exponential function.  If this function is not available in hardware it
will in general take about 10 to 20 times the time needed for a multiplication
(this is an estimate not based on fact; it is probably too low).  Even if
this function (or its computational workhorse) is available in hardware, it
takes a lot of time: on the Intel 80387 the workhorse F2XM1 takes 211-476
clocks; on the Motorola 68881 FETOX takes 466+ clocks.
So shaving off one or more subtractions, memory accesses etc. will not help
very much.  The basic idea must be to reduce the number of calculations of
exp (which the solution of Richard A. O'Keefe did).
Another thing that is worth considering is reducing the time needed to
calculate the exp.  In that case questions like argument range, required
precision etc. arise.  It might be worthwhile to roll your own version of
exp (see for instance Cody & Waite, Software Manual for the Elementary
Functions).  Especially if the required precision is smaller, this
might help.
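
For a taste of what rolling your own might look like -- a sketch only,
not Cody & Waite's algorithm, good to roughly 5-6 significant digits
and with no overflow or range checking:

	#include <math.h>

	#define LN2 0.69314718055994531

	/* exp(x) = 2^m * exp(r), where x = m*LN2 + r and |r| <= LN2/2;
	   a short Taylor polynomial then suffices for exp(r). */
	double rough_exp(double x)
	{
		int m = (int)(x / LN2 + (x >= 0.0 ? 0.5 : -0.5));
		double r = x - m * LN2;
		double p = 1.0 + r * (1.0 + r * (1.0/2.0 + r * (1.0/6.0
			 + r * (1.0/24.0 + r * (1.0/120.0)))));
		return ldexp(p, m);	/* scale by 2^m */
	}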

But never ever do some micro-optimization that speeds up some part of the
algorithm by a factor of perhaps 100 if it takes only a very small
percentage of the total time needed (see Amdahl's Law); that is really a
waste of programmer's time.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/14/91)

In article <2934@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
> This is ludicrious.

Look, Richard's original took 15.6 seconds on a Sun, 14.2 with cc -O4,
all with n = 1000. Mine took 15.0 seconds, 13.8 with cc -O4. That's a 4%
speedup.

On a Convex, Richard's took 3.91 without optimization, 3.03 with. Mine
took 3.56 without, 2.72 with. That's a 10% speedup.

With a fast exp I have lying around, the speedups become 33% and 52%
respectively. (This exp uses four tables of size 65536 and does table
lookup.)
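
(Not the four-table routine itself, but here is a simplified two-table
sketch of the same lookup idea, assuming arguments confined to [-16, 0]
as a negative con would give; quantizing the argument to 2^-16 limits
the accuracy to roughly 4-5 digits.  Presumably the four-table version
splits the argument more finely to keep full precision.)

	#include <math.h>

	static double ehi[17];		/* exp(-h),       h = 0..16    */
	static double elo[65536];	/* exp(-l/65536), l = 0..65535 */

	void init_exp_tables(void)
	{
		long i;
		for (i = 0; i <= 16; i++)   ehi[i] = exp((double)-i);
		for (i = 0; i < 65536; i++) elo[i] = exp(-i / 65536.0);
	}

	double lookup_exp(double x)	/* wants -16 <= x <= 0 */
	{
		unsigned long u = (unsigned long)(-x * 65536.0);
		return ehi[u >> 16] * elo[u & 0xFFFF];
	}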

I made one mistake in my transformations: j should have been initialized
to n - 1, not n, in the main loop. (Richard should have done the same
thing, but he was properly defensive and used <= rather than !=. Oops.)
That bug took me one minute to find. Writing the article in the first
place took about fifteen minutes, most of that for explaining the
transformations rather than doing them.

Dik, take a step back and look at your contribution to this discussion.
Who's trying to be more helpful, you or me? I'm saying ``Here are some
optimizations. I hope you find them useful.'' You're saying ``Don't
optimize your code. Even though you care about its speed enough to spend
a few minutes asking the net for help, surely you don't think it's worth
a few minutes of programming time to make it run noticeably faster.''

Why would anyone want to take that advice?

---Dan

dik@cwi.nl (Dik T. Winter) (02/14/91)

In article <24587:Feb1411:32:5391@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
 > Look, Richard's original took 15.6 seconds on a Sun, 14.2 with cc -O4,
 > all with n = 1000. Mine took 15.0 seconds, 13.8 with cc -O4. That's a 4%
 > speedup.
Great.
 > 
 > On a Convex, Richard's took 3.91 without optimization, 3.03 with. Mine
 > took 3.56 without, 2.72 with. That's a 10% speedup.
Looks better.
 > 
 > With a fast exp I have lying around, the speedups become 33% and 52%
 > respectively. (This exp uses four tables of size 65536 and does table
 > lookup.)
But that is what you are looking for.
 > 
(First, I had already mailed my solution to the original poster, which was
basically the same as Richard O'Keefe's and Doug Gwyn's, plus some additional
suggestions.)
 > Dik, take a step back and look at your contribution to this discussion.
 > Who's trying to be more helpful, you or me? I'm saying ``Here are some
 > optimizations. I hope you find them useful.'' You're saying ``Don't
 > optimize your code. Even though you care about its speed enough to spend
 > a few minutes asking the net for help, surely you don't think it's worth
 > a few minutes of programming time to make it run noticeably faster.''
I did not say that.  What I said is that the transformations you gave will
not improve it very much (and indeed you find 4% and 10% respectively; the
10% probably due to the vectorization).  Further I said that it is worthwhile
looking at the exponential, which you just confirmed!  So who is more helpful?
You gave just a few random solutions that result in a 4% speedup on a Sun,
while the suggestion I made can give large improvements.  Now if your
suggestion had been to use your fast exponential.....  (And, sorry, I do not
have one nearby, so I could not offer one.)
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (02/15/91)

Dik, the optimizations I posted in a previous article are responsible
for between a 4% and a 50% speedup, depending on your machine, your
compiler, etc. This isn't the speedup you get from an optimizing
compiler. This isn't the speedup you get from buying a new machine.
This is the speedup from the ``ludicrous'' ``micro-optimization'' that
you took such pains to criticize.

In article <2940@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
> (and indeed you find 4% and 10% respectively; the
> 10% probably due to the vectorization).

As I said the first time, the 10% is the speedup you get on a Convex
with the standard math library exp() when you apply the ``ludicrous''
optimizations I pointed out. It is not due to vectorization.

> You gave just a few random solutions that result in a 4% speedup on a Sun,
> while the suggestion I made can give large improvements.

Yes, and I could equally well have said ``buy a Cray.'' If the original
poster didn't have a Cray this would result in ``large improvements.''
Similarly, the code becomes quite a lot faster if the poster uses a fast
exp()---but do you really think that ``use a fast exp()'' is any more
helpful than ``buy a Cray''? No. Neither one answers the question.
Furthermore, if the poster *does* have a fast exp() running on his Cray,
the optimizations I posted will give an even better speedup.

---Dan

rmartin@clear.com (Bob Martin) (02/15/91)

>Execution time query:
>
>   for (j=1; j <= n; j++) {
>      sum = 0.0;
>      for (i=1; i<= n; i++) {
>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>      }
>      a[j] = k*sum;
>   }
>
>This part of the algorithm is a bottleneck. I want to make this as
>quick as possible. What are my options? I have heard that assembler
>would be faster. How much faster would assembler be? Would it be worth
>the effort? Are there other alternatives that you can see?  Any
>improvement would be welcome.

I am not a math whiz, so I couldn't see any way to get the exp out of the
inner loop.  However, there are a few things that can make the program
run significantly faster...

Assume that x runs from 0 to n-1 instead of 1 to n.

	double x[MAXSIZE], a[MAXSIZE];
	register double *ip, *jp, *ap;
	double diff, sum;
	register int j, i;
	int n = MAXSIZE; /* or whatever size you decide on */

	for(j=n, jp=x, ap=a; j; j--, jp++, ap++) {
		sum = 0.0;
		for(i=n, ip=x; i; i--, ip++) {
			diff = *ip - *jp;
			sum += exp( con * diff * diff );
		}
		*ap = k * sum;
	}

Array subscripting is often slower than direct pointer dereferences.
Also the subtraction of x[i]-x[j] need only be done once.
Putting the pointers into registers may also help quite a bit.

The changes I have made here are similar to the trade-offs that an assembly
language programmer would make.  Translating further to assembly language
may increase your performance yet again, but I would guess that it would not
be dramatically more effective than the above.

-----

Here is another thought which may cut the work roughly in half: the
number of exp() calls drops from n^2 to (n^2-n)/2 (still O(n^2), but
half the constant).  (Although it will make the code a bit more complex.)

The matrix S[i][j] = exp(con * (x[i]-x[j])^2) is symmetric about the
i==j axis.  S[i][j] == S[j][i]!!!  Also, the diagonal always evaluates
to 1!!!  This means that you need calculate less than half of the S
values.

	for (j=0; j<n; j++) {
		a[j] = 1;	/* put in the diagonal */
	}

	for (j=0; j<(n-1); j++) { /* don't need to do the last row */
		for (i=j+1; i<n; i++) {  /* this excludes the i==j axis */
			diff = x[i] - x[j];
			s = exp(con * diff * diff);
			a[j] += s;
			a[i] += s;
		}
	}

	for (j=0; j<n; j++) {
		a[j] *= k;	/* restore the constant factor */
	}

Now if you "pointerize" this algorithm, it will be much faster than
your original.
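
For example (again untested; con, k, x[], a[], and n as in the
fragments above):

	register double *ip, *jp, *api, *apj;
	double *end = x + n;
	double diff, s;

	for (apj = a; apj < a + n; apj++)
		*apj = 1.0;			/* the diagonal */

	for (jp = x, apj = a; jp < end - 1; jp++, apj++)
		for (ip = jp + 1, api = apj + 1; ip < end; ip++, api++) {
			diff = *ip - *jp;
			s = exp(con * diff * diff);
			*apj += s;
			*api += s;
		}

	for (apj = a; apj < a + n; apj++)
		*apj *= k;			/* the constant factor */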

Caveat:  This is all theory, I haven't tested it at all.

Note:  I am posting this since other netpeople more knowledgeable than
	   I may offer more help.

-- 
+-Robert C. Martin-----+:RRR:::CCC:M:::::M:| Nobody is responsible for |
| rmartin@clear.com    |:R::R:C::::M:M:M:M:| my words but me.  I want  |
| uunet!clrcom!rmartin |:RRR::C::::M::M::M:| all the credit, and all   |
+----------------------+:R::R::CCC:M:::::M:| the blame.  So there.     |

JRowe@exua.exeter.ac.uk (John Rowe) (02/15/91)

In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca (Jeff Racine) 
writes:

   Execution time query:

   I have the following nested loop in which I must sum (over all i), for each
   x[j], e raised to the power x[i]-x[j] squared (times a constant).

      for (j=1; j <= n; j++) {
	 sum = 0.0;
	 for (i=1; i<= n; i++) {
	    sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
	 }
	 a[j] = k*sum;
      }

Jeff needs to speed it up.  The C guys will tell you how to speed up
the loops; being a physicist, i.e. really just a FORTRAN programmer, I
would just point out that you can halve the number of exponentials and
maybe take out a multiply.  With minimal recoding, maybe something along
the lines of:
 /*	Just typed in - maybe typos */

      foo = sqrt(con > 0 ? con : -con);
      for (j = 1; j <= n; ++j) {
        a[j] = 1.0;                        /* i = j case */
        y[j] = x[j] * foo;
      }

      if ( con > 0.0 ) 
        for (j = 1; j < n; ++j) 
    	  for (i = j + 1; i <= n; ++i) {
	    a[j] += foo = exp( (y[i]-y[j]) * (y[i]-y[j]) );
	    a[i] += foo;                   /* Two sums per exp */
	  }
      else                     
        for (j = 1; j < n; ++j) 
	  for (i = j + 1; i <= n; ++i) {
	    a[j] += foo = exp( - (y[i]-y[j]) * (y[i]-y[j]) );
	    a[i] += foo; /* NB ^ only difference.*/
	 }

      for (j=1; j <= n; ++j) a[j] *= k;

[pious platitudes about first improving the algorithm, taking
advantage of symmetry, etc. omitted]. I don't know your machine but
it's the halving of the number of exponentials that I would expect to
give you the greatest gain, 50%.  If you've got a modern, floating
point orientated RISC machine (RS6000, i860??) floating point
multiplies may be so fast that it may not be worth the hassle of
the vector y and the if ( con < 0.0 ) bit.  Any FORTRAN compiler would
optimise out the (y[i]-y[j]) * (y[i]-y[j]) in its sleep, so should
your C compiler.  If not you could do it by hand as I see someone has
suggested, but that's *probably* even less of a gain.  I would guess
that the question of C vs. assembler hangs on how long your cpu takes
to do a floating point multiply and exponential against the quality of
your compiler's optimiser.

Halving the number of exponentials is only a factor of two but it's
zero effort and it's always there however you end up coding it.

But I may have missed something really obvious as to why you can't do
that; I had a late night this morning.

Good luck and hope this helps.

John Rowe
Exeter University Computational Physics Group
Exeter
U.K.

dik@cwi.nl (Dik T. Winter) (02/15/91)

In article <26862:Feb1416:46:4391@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
 > Dik, the optimizations I posted in a previous article are responsible
 > for between a 4% and a 50% speedup, depending on your machine, your
 > compiler, etc.
The optimizations you gave in your first article gave between 4 and 10% speedup,
not 50%.  To break down the different speedups on a Sun SLC, comparing
O'Keefe's variant with your 5 versions (done with n=1000, 5 times; time in
seconds):
version	time	variant
OK	55.93	O'Keefe's variant
1	55.15	1.4% speedup (using tmp for xi-xj, including c)
2	53.37	4.6% speedup (looping down, not up)
3	53.30	4.7% speedup (register variable for a[j])
4	53.87	3.7% speedup (pointer for a+i and y+j)
5	53.82	3.8% speedup (vectorizable code)
And yes, I consider this micro optimization.

 > As I said the first time, the 10% is the speedup you get on a Convex
 > with the standard math library exp() when you apply the ``ludicrous''
 > optimizations I pointed out. It is not due to vectorization.
Might be.  Does the Convex vectorize all five variants?
 > 
 > Yes, and I could equally well have said ``buy a Cray.'' If the original
 > poster didn't have a Cray this would result in ``large improvements.''
Sure.  Times on a Cray Y/MP:
version	time	remarks
OK	0.3458	original
1	0.3282	5.1% speedup
2	0.3292	4.8% speedup
3	0.3312	4.2% speedup
4	0.3717	7.5% slowdown
5	0.3469	0.3% slowdown
Although the compiler needed a bit of persuasion to vectorize the original
and versions 1 and 2.  So also here: micro optimization.  (Calling versions
4 and 5 optimized is even stretching the meaning of the term a bit!)

 > Similarly, the code becomes quite a lot faster if the poster uses a fast
 > exp()---but do you really think that ``use a fast exp()'' is any more
 > helpful than ``buy a Cray''? No. Neither one answers the question.
But 'use a fast exp()' is much cheaper than 'buy a Cray'!  And it can
be handled by an adequate programmer.

 > Furthermore, if the poster *does* have a fast exp() running on his Cray,
 > the optimizations I posted will give an even better speedup.
I doubt it very much.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

oz@yunexus.yorku.ca (Ozan Yigit) (02/17/91)

In article <24587:Feb1411:32:5391@kramden.acf.nyu.edu>
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:

>On a Convex, Richard's took 3.91 without optimization, 3.03 with. Mine
>took 3.56 without, 2.72 with. That's a 10% speedup.

On a MIPS 2000 (using -O3), Richard's took 1.08 seconds.  Yours took 2.4
seconds.  You figure out the percentages.

oz

cjkuo@locus.com (Chengi Jimmy Kuo) (02/17/91)

"Use a faster exp()" is the correct answer.  Only you the programmer know
about such things as digits of accuracy, whether you can work in ints or
fixed decimals, other things that could be designed into your private exp().
Also in your exp(), you can handle the concept that e**(kx) => (e**k)**x as
well as reducing your expression.

Jimmy Kuo
-- 
cjkuo@locus.com
"The correct answer to an either/or question is both!"

csp@gtenmc.UUCP (02/23/91)

In article <1991Feb12.233522.5195@Think.COM> barmar@think.com (Barry Margolin) writes:
>In article <21658@yunexus.YorkU.CA> racine@yunexus.yorku.ca (Jeff Racine) writes:
>>   for (j=1; j <= n; j++) {
>>      sum = 0.0;
>>      for (i=1; i<= n; i++) {
>>         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
>>      }
>>      a[j] = k*sum;
>>   }
>
      Here is my solution :-

	 for ( p = x + 1 , b = a + 1 , j = 1 ; j <= n ; j++ , p++ , *b++ = k * sum )
	    for ( p2 = x + 1 , sum = 0.0 , i = 1 ; i <= n ; p2++ , i++ )
	       temp = *p2 - *p ,
	       sum += exp( temp * temp * con );

      C S Palkar

--

garry@ceco.ceco.com (Garry Garrett) (02/24/91)

In article <1991Feb14.170400.8089@clear.com>, rmartin@clear.com (Bob Martin) writes:
> >Execution time query:
> >
> >   for (j=1; j <= n; j++) {
> >      sum = 0.0;
> >      for (i=1; i<= n; i++) {
> >         sum += exp( con * (x[i]-x[j]) * (x[i]-x[j]) );
> >      }
> >      a[j] = k*sum;
> >   }
> >
> >This part of the algorithm is a bottleneck. I want to make this as
> >quick as possible. What are my options? I have heard that assembler
> >would be faster. How much faster would assembler be? Would it be worth
> >the effort? Are there other alternatives that you can see?  Any
> >improvement would be welcome.
> 
> 	double x[MAXSIZE], a[MAXSIZE];
> 	register double *ip, *jp, *ap;
> 	double diff, sum;
> 	register int j, i;
> 	int n = MAXSIZE; /* or whatever size you decide on */
> 
> 	for(j=n, jp=x, ap=a; j; j--, jp++, ap++) {
> 		sum = 0.0;
> 		for(i=n, ip=x; i; i--, ip++) {
> 			diff = *ip - *jp;
> 			sum += exp( con * diff * diff );

/* consider replacing the above 2 lines with:
 *			sum += exp( con * (diff = *ip - *jp) * diff);
 * the savings is in the fact that after diff = ... is calculated,
 * the value (written out to the memory location that contains diff) is
 * still in a register, and need not be read back into the register.
 * (Beware, though: C leaves the evaluation order of the operands of *
 * unspecified, so the trailing diff may be fetched before the
 * assignment happens; a compiler is free to break this trick.)
 */

> 		}
> 		*ap = k * sum;
> 	}
> 
	[... or]
> 
> 	for (j=0; j<n; j++) {
> 		a[j] = 1;	/* put in the diagonal */
> 	}
> 
> 	for (j=0; j<(n-1); j++) { /* don't need to do the last row */
> 		for (i=j+1; i<n; i++) {  /* this excludes the i==j axis */
> 			diff = x[i] - x[j];
> 			s = exp(con * diff * diff);
/*	same comment applies here */
> 			a[j] += s;
> 			a[i] += s;
> 		}
> 	}
> 

	Hope it helps

Garry Garrett

(sorry, I have to add a bunch of <CR> 's or my mailer won't accept this)