[comp.lang.c] Standard deviation again

kat3@ihlpl.ATT.COM (Craig) (03/27/89)

> 	Sure; and you could just as easily have picked another data
> set that produced a correct result from the onepass program, and an
> incorrect result from twopass.  Which is why, as has been pointed out
> before, neither algorithm should be used in a general standard
> deviation program.

	After thinking about the response to my posting on the computation
of the standard deviation, I think a little background on my interest
in the twopass algorithm is needed.  The following provide cases where,
if it is numerically stable,the twopass algorithm should be considered.

	1)  If your machine does vector calculations, then the twopass
		algorithm would be faster than the numerically stable
		onepass algorithm.

	2)  If you already know the "mean," and just wish to estimate
		the standard deviation, you are using the second half of
		the twopass algorithm.

	3)  It is unclear, if all data resides in core, whether a numerically
		stable onpass algorithm, or the twopass algorithm is
		faster.

In any case, if the twopass algorithm is numerically unstable, I would like
an example of a dataset where it fails.

	The second reason for making an issue about claims that the twopass
algorithm is numerically unstable is the fact that it is not documented
in the literature.  The following recent statistics book claims the algorithm
is stable:

	Thisted, Ronald A., "Elements of Statistical Computing," 1988,
		Chapman and Hall, pp. 11.

	Finally, I give below, in pseudo-code, an extremely slow, but
very accurate algorithm for computing the standard deviation:

	3)
		sort(x);
		sum = 0.0;
		for (i=0; i<N; i++)
			sum += x[i];
		xbar = sum/N;
		for (i=0; i<N; i++)
			z[i] = (x[i] - xbar)*(x[i] - xbar);
		sort(z);
		sqsum = 0.0;
		for (i=0; i<N; i++)
			sqsum += z[i];
		std_dev = sqrt(sqsum/N);

Once again, note that this algorithm would be faster on a vectorized
machine.

	In any case, algorithms which are slow on a sequential machine
may have to be reconsidered with the advent of new hardware.  The only
question which must be addressed is whether the algorithm gives
numerically correct results.



					Bob Craig