rolf@natmlab.OZ (Rolf Turner) (09/26/85)
I've noticed that the hist function in S does something which seems wrong to me. When the argument 'scale' is TRUE, the write-up says that hist produces counts on a "density scale". It seems to me that if you're using the word "density" this should imply that the histogram integrates to 1. Instead, hist produces counts that SUM to 1. (However the i-th count is NOT the estimated probability of an observation lying in the i-th interval.) Explicitly

                    c(i)
    count(i) = --------------
                  sum c(j)
                   j

where

              n(i)/n
    c(i) = ------------ ,    n = sum n(j),    w = sum w(j) .
              w(i)/w              j                j

Observe that the denominators n and w cancel in the final count, and so are irrelevant. (Note: n(i) is the i-th count; w(i) is the width of the i-th interval.)

I've decided to replace the old count(i)'s by the more sensible

               n(i)
    h(i) = ------------
              n*w(i)

so that the integral of the histogram,

    sum h(j)*w(j),
     j

is one. Moreover, the area of the i-th rectangle, h(i)*w(i), is now the estimated probability of an observation lying in the i-th interval. Observe that the old

                    h(i)
    count(i) = --------------
                  sum h(j)
                   j

The change was made by modifying the subroutine hhcntz, in hhcntz.r, by replacing the "if (idens)" clause by the following:

    if (idens) {    # change count into density if requested
        sumc = 0.
        do i = 1,nclass {
            sumc = sumc+class(i)
        }
        do i = 1,nclass {
            class(i) = class(i)/(sumc*(cbreak(i+1)-cbreak(i)))
        }
    }
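To make the two formulas in the post above concrete, here is a minimal Python sketch (not the S or Ratfor source) that computes both scalings for a small made-up histogram with unequal bin widths. The function names old_scale and density_scale and the sample counts and breaks are assumptions chosen for illustration; only the formulas come from the post.

```python
def bin_widths(breaks):
    """Width w(i) of each interval, from consecutive break points."""
    return [hi - lo for lo, hi in zip(breaks, breaks[1:])]


def old_scale(counts, breaks):
    """Scaling described in the post: c(i) = n(i)/w(i), normalised to SUM to 1.

    The constants n and w cancel in the normalisation, so they are omitted.
    """
    c = [n_i / w_i for n_i, w_i in zip(counts, bin_widths(breaks))]
    total = sum(c)
    return [c_i / total for c_i in c]


def density_scale(counts, breaks):
    """Proposed scaling: h(i) = n(i) / (n * w(i)), which INTEGRATES to 1."""
    n = sum(counts)
    return [n_i / (n * w_i) for n_i, w_i in zip(counts, bin_widths(breaks))]


# Hypothetical data: three bins of widths 1, 2, 1 holding 2, 6, 2 observations.
counts = [2, 6, 2]
breaks = [0.0, 1.0, 3.0, 4.0]

old = old_scale(counts, breaks)
h = density_scale(counts, breaks)
areas = [h_i * w_i for h_i, w_i in zip(h, bin_widths(breaks))]

print("old scale heights:", old, "sum =", sum(old))
print("density heights:  ", h, "integral =", sum(areas))
print("rectangle areas:  ", areas)
```

With unequal widths the old heights sum to one but are not the bin proportions n(i)/n, whereas the rectangle areas under the density scaling are exactly those proportions.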
rab@alice.UucP (Rick Becker) (10/10/85)
> From: rolf@natmlab.OZ (Rolf Turner)
>
> I've noticed that the hist function in S does something which
> seems wrong to me. When the argument 'scale' is TRUE, the
> write-up says that hist produces counts on a "density scale".
> It seems to me that if you're using the word "density" this
> should imply that the histogram integrates to 1. Instead,
> hist produces counts that SUM to 1.

The "scale" argument to hist was not meant to produce a density scaling. It was defined so that the y-axis would remain comparable when looking at histograms that had varying numbers of observations from the same underlying distribution, as well as remaining the same when the observations were changed by a scale factor. A density scale would have the first property but would change drastically if all of the observations were multiplied by 10, say. The idea was to decouple the y-axis scale from the x-axis scale with the "scale" argument.

In the old (1981) documentation, the argument was called "density", even though it was computed in this way. We changed the name to "scale" to avoid giving people the impression that it meant true density scaling.

In retrospect, it may have been a better idea to use the more familiar true density scaling. However, if you make the change that Rolf Turner suggested, it would probably be a good idea to name the argument "density" once again.

--
Rick Becker
alice!rab
research!rab
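The invariance argument in the reply above can be checked numerically. The following is a minimal Python sketch, not the S implementation: the function name scaled_heights, its density flag, and the sample data are assumptions for illustration, and the sketch further assumes that the break points are rescaled along with the observations.

```python
def scaled_heights(counts, breaks, density):
    """Bar heights under true density scaling (density=True) or under the
    sum-to-one "scale" behaviour discussed in this thread (density=False)."""
    widths = [hi - lo for lo, hi in zip(breaks, breaks[1:])]
    n = sum(counts)
    h = [n_i / (n * w_i) for n_i, w_i in zip(counts, widths)]
    if density:
        return h
    total = sum(h)
    return [h_i / total for h_i in h]


# Hypothetical histogram, then the same data with every observation
# multiplied by 10 (breaks assumed to scale too), then with 5 times
# as many observations in the same proportions.
counts = [2, 6, 2]
breaks = [0.0, 1.0, 3.0, 4.0]
breaks_x10 = [10.0 * b for b in breaks]
counts_x5 = [5 * n_i for n_i in counts]

scale_base = scaled_heights(counts, breaks, density=False)
scale_x10 = scaled_heights(counts, breaks_x10, density=False)
scale_x5 = scaled_heights(counts_x5, breaks, density=False)

dens_base = scaled_heights(counts, breaks, density=True)
dens_x10 = scaled_heights(counts, breaks_x10, density=True)
dens_x5 = scaled_heights(counts_x5, breaks, density=True)

print("scale:  ", scale_base, scale_x10, scale_x5)
print("density:", dens_base, dens_x10, dens_x5)
```

Under these assumptions both scalings are unaffected by the number of observations, but only the sum-to-one heights survive the change of units; the density heights shrink by the same factor of 10 that stretches the x-axis.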