cs3b3aj@maccs.McMaster.CA (Stephen M. Dunn) (03/24/89)
Well, of course, the standard textbook methods of calculating the standard deviation are as follows: 1) assuming values are in array x, there are n values, and the mean is known and is in the variable mean: temp = 0; for (i = 1; i <= n; i++) temp += (x [i] - mean) ^ 2; std_dev = sqrt (temp / n); (note that my C is _so_ rusty, I'm not even sure if I did the exponentiation correctly! Anyway, the algorithm is right.) 2) assuming that the SQUARES of the values are in array x, there are n values, and the mean is known and is in the variable mean: temp = 0; for (i = 1; i <= n; i++) temp += x [i] ^ 2; std_dev = sqrt ((temp - (mean ^ 2) / n) / n); In both cases, temp, x and mean would likely be doubles, and n and i would be ints. Depending on the magnitudes of your numbers, one of these algorithms may be more accurate than the other. Hope these help. Regards, -- ====================================================================== ! Stephen M. Dunn, cs3b3aj@maccs.McMaster.CA ! DISCLAIMER: ! ! I always wanted to be a lumberjack! - M.P. ! I'm only an undergrad ! ======================================================================
vevea@paideia.uchicago.edu (Jack L. Vevea) (03/24/89)
In article <2221@maccs.McMaster.CA> cs3b3aj@maccs.UUCP (Stephen M. Dunn) writes: > > Well, of course, the standard textbook methods of calculating the >standard deviation are as follows: > >1) assuming values are in array x, there are n values, and the mean is > known and is in the variable mean: > >temp = 0; >for (i = 1; i <= n; i++) > temp += (x [i] - mean) ^ 2; >std_dev = sqrt (temp / n); And this is precisely what Doug Gwynn was warning against in his earlier posting; if your numbers have variations of small magnitude, you will end up with more rounding error than meaning. Don't use this. >2) assuming that the SQUARES of the values are in array x, there are n > values, and the mean is known and is in the variable mean: > >temp = 0; >for (i = 1; i <= n; i++) > temp += x [i] ^ 2; >std_dev = sqrt ((temp - (mean ^ 2) / n) / n); > And if the array already contains the squares, why square them again, pray? Try this: for(i=0;i<n;i++) { temp1 += x * x; temp2 += x; } sd = sqrt(temp1 - (temp2 * temp2 / n)) / (n-1); (Note division by n-1, not n; although the original poster didn't give much information on what he planned to do with this, he almost certainly wants _sample_ sd's, not population as the above quoted formulae assume.) Saepe fidelis.
kat3@ihlpl.ATT.COM (Craig) (03/26/89)
There are two issues associated with the computation of the standard deviation which have been brought up in the provious postings. The first of these is which of the following is the "correct" estimator for the standard deviation from a sample: x[0], x[1], ..., x[N-1] of size N. mean = 0.0; for (i=0; i<N; i++) mean += x[i]; mean /= N; sum = 0.0; for (i=0; i<N; i++) sum += (x[i] - mean)*(x[i] - mean); 1) std_dev = sqrt(sum/N); 2) std_dev = sqrt(sum/(N-1.0)); I would like to emphasis here that both 1) and 2) are valid estimators with both having people who recommend them. 1) is what is called the moment estimator, and also the maximum likelyhood estimator. 2) is sometimes called the unbiased estimator since the expected value of sum/(N-1) is the variance of the population. In any case there is little difference in the computed value's. The second issue, which is more important, is how to compute the value accurately on a machine with finite precision. There were two computation formula's provided. It should be noted that mathematically they are equivalent, thus the only issue is which provides the more accurate result. NOTE: In all the computations below, I will use a denominator of N. If you want the other estimator, just change the one term. 1) mean = 0.0; for (i=0; i<N; i++) mean += x[i]; mean /= N; sum = 0.0; for (i=0; i<N; i++) sum += (x[i] - mean)*(x[i] - mean); std_dev = sqrt(sum/N); 2) xsum = xsq = 0.0; for (i=0; i<N; i++) { xsum += x[i]; xsq += x[i]*x[i]; } std_dev = sqrt((xsq - xsum*xsum/N)/N); NOTE: These are code segments, in the sense that above, the x[] array was declared and assigned values, i was declared an int, sqrt() returns a double, etc. NOTE: These coding segments are not the most efficient. 1) is generally called the two pass formula for computing the standard deviation. 2) is called a one-pass formula for computing the standard deviation. At this point, we notice that 1) is definitely slower than 2) since it requires two passes through the data. HOWEVER, I claim that it is more accurate. Inorder to provide some data rather than words, below are two test programs each of which: 1) reads a data set 2) computes the standard deviation 3) prints out the result. ------------------- cut here ----------------------- #include <stdio.h> extern char *malloc(); main(argc, argv) int argc; char **argv; { double *x, std_dev, sum, mean, sqrt(); int i, N; FILE *fopen(), *fp; if (argc != 2) { printf("std: requires an input file!\n"); exit(1); } else if ((fp = fopen(argv[1], "r")) == NULL) { printf("std: cannot open %s\n", argv[1]); exit(1); } fscanf(fp, "%d", &N); x = (double *) malloc(N*sizeof(double)); if (x == NULL) { printf("std: not sufficient memory!\n"); exit(1); } for (i=0; i<N; i++) fscanf(fp, "%lf", &x[i]); mean = 0.0; for (i=0; i<N; i++) mean += x[i]; mean /= N; sum = 0.0; for (i=0; i<N; i++) sum += (x[i] - mean)*(x[i] - mean); std_dev = sqrt(sum/N); printf("std_dev = %24.16e\n", std_dev); } ------------------------- cut here -------------------- #include <stdio.h> extern char *malloc(); main(argc, argv) int argc; char **argv; { double *x, std_dev, xsum, xsq, sqrt(); int i, N; FILE *fopen(), *fp; if (argc != 2) { printf("std: requires an input file!\n"); exit(1); } else if ((fp = fopen(argv[1], "r")) == NULL) { printf("std: cannot open %s\n", argv[1]); exit(1); } fscanf(fp, "%d", &N); x = (double *) malloc(N*sizeof(double)); if (x == NULL) { printf("std: not sufficient memory!\n"); exit(1); } for (i=0; i<N; i++) fscanf(fp, "%lf", &x[i]); xsum = xsq = 0.0; for (i=0; i<N; i++) { xsum += x[i]; xsq += x[i]*x[i]; } std_dev = sqrt((xsq - xsum*xsum/N)/N); printf("std_dev = %24.16e\n", std_dev); } ------------------------- cut here ------------------------- The input data is given below: ------------------------- cut here ------------------------- 10 111111111.0 111111112.0 111111111.0 111111112.0 111111111.0 111111112.0 111111111.0 111111112.0 111111111.0 111111112.0 -------------------------- cut here ------------------------ I will compile the first program and call the executable "twopass," and the second program will be called "onepass." The next section gives the compile and output from the data: -------------------------- cut here ------------------------ hrdcpy cc -O -o twopass test.c -lm + /etc/preroot /native /bin/cc -O -o twopass test.c -lm hrdcpy cc -O -o onepass test1.c -lm + /etc/preroot /native /bin/cc -O -o onepass test1.c -lm hrdcpy onepass data std_dev = 1.7888543819998317e+00 hrdcpy twopass data std_dev = 5.0000000000000000e-01 ---------------------------- cut here ---------------------- Notice that the onepass standard deviation is 1.78, while the twopass standard deviation is 0.5!!!!!! Incidentally, the true value is 0.5!! Bob Craig
vevea@paideia.uchicago.edu (Jack L. Vevea) (03/27/89)
In article <9847@ihlpl.ATT.COM> kat3@ihlpl.UUCP (Craig,R.J.) writes: >------------------------- cut here ------------------------- >The input data is given below: >------------------------- cut here ------------------------- >10 >111111111.0 >111111112.0 >111111111.0 >111111112.0 >111111111.0 >111111112.0 >111111111.0 >111111112.0 >111111111.0 >111111112.0 >-------------------------- cut here ------------------------ > > >Notice that the onepass standard deviation is 1.78, while the twopass >standard deviation is 0.5!!!!!! Incidentally, the true value is 0.5!! > Sure; and you could just as easily have picked another data set that produced a correct result from the onepass program, and an incorrect result from twopass. Which is why, as has been pointed out before, neither algorithm should be used in a general standard deviation program. There was a posting recently to sci.math.stat, not crossposted here, that gave a reference for better computational algorithms. Saepe fidelis.
kat3@ihlpl.ATT.COM (Craig) (03/27/89)
> Sure; and you could just as easily have picked another data > set that produced a correct result from the onepass program, and an > incorrect result from twopass. Which is why, as has been pointed out > before, neither algorithm should be used in a general standard > deviation program. I don't think you can find such a data set. The given onepass algorithm is just "BAD." There are other algorithms which are "better" in the sense that they are faster, but all "good" algorithms for the computation use "centering" to increase accuracy. If you can provide such a data set, I would be very interested. Bob Craig