[comp.lang.c] standard deviation

cs3b3aj@maccs.McMaster.CA (Stephen M. Dunn) (03/24/89)

   Well, of course, the standard textbook methods of calculating the
standard deviation are as follows:

1)  assuming values are in array x, there are n values, and the mean is 
    known and is in the variable mean: 

temp = 0;
for (i = 1; i <= n; i++)
   temp += (x [i] - mean) ^ 2;
std_dev = sqrt (temp / n);

    (note that my C is _so_ rusty, I'm not even sure if I did the
     exponentiation correctly!  Anyway, the algorithm is right.) 

2)  assuming that the SQUARES of the values are in array x, there are n 
    values, and the mean is known and is in the variable mean: 

temp = 0;
for (i = 1; i <= n; i++)
   temp += x [i] ^ 2;
std_dev = sqrt ((temp - (mean ^ 2) / n) / n);

   In both cases, temp, x and mean would likely be doubles, and n and i 
would be ints. 
   Depending on the magnitudes of your numbers, one of these algorithms may 
be more accurate than the other.
   Hope these help.

Regards,

-- 
======================================================================
! Stephen M. Dunn, cs3b3aj@maccs.McMaster.CA ! DISCLAIMER:           !
! I always wanted to be a lumberjack! - M.P. ! I'm only an undergrad !
======================================================================

vevea@paideia.uchicago.edu (Jack L. Vevea) (03/24/89)

In article <2221@maccs.McMaster.CA> cs3b3aj@maccs.UUCP (Stephen M. Dunn) writes:
>
>   Well, of course, the standard textbook methods of calculating the
>standard deviation are as follows:
>
>1)  assuming values are in array x, there are n values, and the mean is 
>    known and is in the variable mean: 
>
>temp = 0;
>for (i = 1; i <= n; i++)
>   temp += (x [i] - mean) ^ 2;
>std_dev = sqrt (temp / n);

And this is precisely what Doug Gwynn was warning against in his
earlier posting; if your numbers have variations of small magnitude,
you will end up with more rounding error than meaning.  Don't use this.



>2)  assuming that the SQUARES of the values are in array x, there are n 
>    values, and the mean is known and is in the variable mean: 
>
>temp = 0;
>for (i = 1; i <= n; i++)
>   temp += x [i] ^ 2;
>std_dev = sqrt ((temp - (mean ^ 2) / n) / n);
>

And if the array already contains the squares, why square them again, pray?

Try this:

for(i=0;i<n;i++) {
	temp1 += x * x;
	temp2 += x; }

sd = sqrt(temp1 - (temp2 * temp2 / n)) / (n-1);


(Note division by n-1, not n; although the original poster didn't
give much information on what he planned to do with this, he almost
certainly wants _sample_ sd's, not population as the above quoted
formulae assume.)








Saepe fidelis.

kat3@ihlpl.ATT.COM (Craig) (03/26/89)

There are two issues associated with the computation of the standard
deviation which have been brought up in the provious postings.  The
first of these is which of the following is the "correct" estimator
for the standard deviation from a sample:

	x[0], x[1], ..., x[N-1]

of size N.

	mean = 0.0;
	for (i=0; i<N; i++)
		mean += x[i];
	mean /= N;
	sum = 0.0;
	for (i=0; i<N; i++)
		sum += (x[i] - mean)*(x[i] - mean);

	1) std_dev = sqrt(sum/N);

	2) std_dev = sqrt(sum/(N-1.0));

I would like to emphasis here that both 1) and 2) are valid estimators
with both having people who recommend them.

	1) is what is called the moment estimator, and also the
		maximum likelyhood estimator.

	2) is sometimes called the unbiased estimator since the
		expected value of sum/(N-1) is the variance of
		the population.

In any case there is little difference in the computed value's.

The second issue, which is more important, is how to compute the value
accurately on a machine with finite precision.  There were two computation
formula's provided.  It should be noted that mathematically they are
equivalent, thus the only issue is which provides the more accurate result.

NOTE:  In all the computations below, I will use a denominator of N.  If
	you want the other estimator, just change the one term.

	1)
		mean = 0.0;
		for (i=0; i<N; i++)
			mean += x[i];
		mean /= N;
		sum = 0.0;
		for (i=0; i<N; i++)
			sum += (x[i] - mean)*(x[i] - mean);
		std_dev = sqrt(sum/N);

	2)
		xsum = xsq = 0.0;
		for (i=0; i<N; i++) {
			xsum += x[i];
			xsq += x[i]*x[i];
		}
		std_dev = sqrt((xsq - xsum*xsum/N)/N);

NOTE:  These are code segments, in the sense that above, the x[] array
	was declared and assigned values, i was declared an int, sqrt()
	returns a double, etc.

NOTE:  These coding segments are not the most efficient.

1) is generally called the two pass formula for computing the standard
	deviation.

2) is called a one-pass formula for computing the standard deviation.

At this point, we notice that 1) is definitely slower than 2) since it
requires two passes through the data.  HOWEVER, I claim that it is more
accurate.
Inorder to provide some data rather than words, below are two test programs
each of which:

		1) reads a data set
		2) computes the standard deviation
		3) prints out the result.

------------------- cut here -----------------------
#include <stdio.h>
extern char *malloc();


main(argc, argv)
int argc;
char **argv;
{
	double *x, std_dev, sum, mean, sqrt();
	int i, N;
	FILE *fopen(), *fp;

	if (argc != 2) {
		printf("std: requires an input file!\n");
		exit(1);
	} else if ((fp = fopen(argv[1], "r")) == NULL) {
		printf("std: cannot open %s\n", argv[1]);
		exit(1);
	}
	fscanf(fp, "%d", &N);
	x = (double *) malloc(N*sizeof(double));
	if (x == NULL) {
		printf("std:  not sufficient memory!\n");
		exit(1);
	}
	for (i=0; i<N; i++)
		fscanf(fp, "%lf", &x[i]);
	mean = 0.0;
	for (i=0; i<N; i++)
		mean += x[i];
	mean /= N;
	sum = 0.0;
	for (i=0; i<N; i++)
		sum += (x[i] - mean)*(x[i] - mean);
	std_dev = sqrt(sum/N);
	printf("std_dev = %24.16e\n", std_dev);
}
------------------------- cut here --------------------
#include <stdio.h>
extern char *malloc();


main(argc, argv)
int argc;
char **argv;
{
	double *x, std_dev, xsum, xsq, sqrt();
	int i, N;
	FILE *fopen(), *fp;

	if (argc != 2) {
		printf("std: requires an input file!\n");
		exit(1);
	} else if ((fp = fopen(argv[1], "r")) == NULL) {
		printf("std: cannot open %s\n", argv[1]);
		exit(1);
	}
	fscanf(fp, "%d", &N);
	x = (double *) malloc(N*sizeof(double));
	if (x == NULL) {
		printf("std:  not sufficient memory!\n");
		exit(1);
	}
	for (i=0; i<N; i++)
		fscanf(fp, "%lf", &x[i]);
	xsum = xsq = 0.0;
	for (i=0; i<N; i++) {
		xsum += x[i];
		xsq += x[i]*x[i];
	}
	std_dev = sqrt((xsq - xsum*xsum/N)/N);
	printf("std_dev = %24.16e\n", std_dev);
}
------------------------- cut here -------------------------
The input data is given below:
------------------------- cut here -------------------------
10
111111111.0
111111112.0
111111111.0
111111112.0
111111111.0
111111112.0
111111111.0
111111112.0
111111111.0
111111112.0
-------------------------- cut here ------------------------


I will compile the first program and call the executable "twopass,"
and the second program will be called "onepass."  The next section gives
the compile and output from the data:
-------------------------- cut here ------------------------
hrdcpy cc -O -o twopass test.c -lm
+ /etc/preroot /native /bin/cc -O -o twopass test.c -lm 
hrdcpy cc -O -o onepass test1.c -lm
+ /etc/preroot /native /bin/cc -O -o onepass test1.c -lm 
hrdcpy onepass data
std_dev =   1.7888543819998317e+00
hrdcpy twopass data
std_dev =   5.0000000000000000e-01
---------------------------- cut here ----------------------

Notice that the onepass standard deviation is 1.78, while the twopass
standard deviation is 0.5!!!!!!  Incidentally, the true value is 0.5!!



					Bob Craig

vevea@paideia.uchicago.edu (Jack L. Vevea) (03/27/89)

In article <9847@ihlpl.ATT.COM> kat3@ihlpl.UUCP (Craig,R.J.) writes:
>------------------------- cut here -------------------------
>The input data is given below:
>------------------------- cut here -------------------------
>10
>111111111.0
>111111112.0
>111111111.0
>111111112.0
>111111111.0
>111111112.0
>111111111.0
>111111112.0
>111111111.0
>111111112.0
>-------------------------- cut here ------------------------
>
>
>Notice that the onepass standard deviation is 1.78, while the twopass
>standard deviation is 0.5!!!!!!  Incidentally, the true value is 0.5!!
>


	Sure; and you could just as easily have picked another data
set that produced a correct result from the onepass program, and an
incorrect result from twopass.  Which is why, as has been pointed out
before, neither algorithm should be used in a general standard
deviation program.

	There was a posting recently to sci.math.stat, not crossposted
here, that gave a reference for better computational algorithms.








Saepe fidelis.

kat3@ihlpl.ATT.COM (Craig) (03/27/89)

> 	Sure; and you could just as easily have picked another data
> set that produced a correct result from the onepass program, and an
> incorrect result from twopass.  Which is why, as has been pointed out
> before, neither algorithm should be used in a general standard
> deviation program.

I don't think you can find such a data set.  The given onepass algorithm
is just "BAD."  There are other algorithms which are "better" in the
sense that they are faster, but all "good" algorithms for the computation
use "centering" to increase accuracy.

If you can provide such a data set, I would be very interested.


					Bob Craig