[net.math.stat] The function 'hist' in S.

rolf@natmlab.OZ (Rolf Turner) (09/26/85)

I've noticed that the hist function in S does something which
seems wrong to me.  When the argument 'scale' is TRUE, the
write-up says that hist produces counts on a "density scale".
It seems to me that if you're using the word "density" this
should imply that the histogram integrates to 1.  Instead,
hist produces counts that SUM to 1.  (However, the i-th count
is NOT the estimated probability of an observation lying in the
i-th interval.)

Explicitly

                          c(i)
        count(i) =  -----------------
                        sum c(j)
                         j

where
                         n(i)/n
            c(i) =  -----------------,      n = sum n(j),   w = sum w(j)  .
                         w(i)/w                  j               j

Observe that the denominators n and w cancel in the final count, and so are
irrelevant.  (Note: n(i) is the i-th count; w(i) is the width of the i-th
interval.)

I've decided to replace the old count(i)'s by the more sensible

                          n(i)
            h(i) =  -----------------
                         n*w(i)

so that the integral of the histogram, sum h(j)*w(j), is one.
                                        j

Moreover, the area of the i-th rectangle, h(i)*w(i), is now the estimated
probability of an observation lying in the i-th interval.
Observe that the old count(i) is just h(i) renormalized so that the
heights, rather than the areas, sum to one:

                           h(i)
        count(i) =  -------------------
                         sum h(j)
                          j

The change was made by modifying the subroutine hhcntz, in hhcntz.r,
by replacing the "if (idens)" clause by the following:

if (idens) {	# change count into density if requested
	sumc = 0.
	do i = 1,nclass { sumc = sumc+class(i) }

	do i = 1,nclass {
		class(i) = class(i)/(sumc*(cbreak(i+1)-cbreak(i)))
	}
}
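
For anyone who wants to check the arithmetic without the S source, here
is a minimal sketch of the two scalings in Python.  It is not part of S;
the function names old_scale and density_scale and the sample counts and
breaks are made up for illustration.

```python
def old_scale(counts, breaks):
    """Old heights: c(i) = n(i)/w(i), renormalized so the heights sum to 1.

    The constants n and w cancel in the ratio, so they are left out.
    """
    widths = [b - a for a, b in zip(breaks, breaks[1:])]
    c = [n_i / w_i for n_i, w_i in zip(counts, widths)]
    total = sum(c)
    return [c_i / total for c_i in c]


def density_scale(counts, breaks):
    """Proposed heights: h(i) = n(i) / (n * w(i)), so the areas sum to 1."""
    widths = [b - a for a, b in zip(breaks, breaks[1:])]
    n = sum(counts)
    return [n_i / (n * w_i) for n_i, w_i in zip(counts, widths)]


# Made-up example: three classes of unequal width.
counts = [2, 5, 3]
breaks = [0.0, 1.0, 3.0, 4.0]
widths = [b - a for a, b in zip(breaks, breaks[1:])]

old = old_scale(counts, breaks)
new = density_scale(counts, breaks)

# Integral of the new histogram: sum of h(j)*w(j).
area = sum(h * w for h, w in zip(new, widths))

# The old heights are the new heights divided by their sum.
renormalized = [h / sum(new) for h in new]

print("old heights:", old, "sum =", sum(old))
print("new heights:", new, "area =", area)
```

With unequal widths the old heights sum to one but their areas do not,
while the new heights have total area one.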

rab@alice.UucP (Rick Becker) (10/10/85)

> From: rolf@natmlab.OZ (Rolf Turner)
> 
> I've noticed that the hist function in S does something which
> seems wrong to me.  When the argument 'scale' is TRUE, the
> write-up says that hist produces counts on a "density scale".
> It seems to me that if you're using the word "density" this
> should imply that the histogram integrates to 1.  Instead,
> hist produces counts that SUM to 1.

The "scale" argument to hist was not meant to produce a density
scaling.  It was defined so that the y-axis would remain comparable
when looking at histograms that had varying numbers of observations
from the same underlying distribution as well as remaining the same
when the observations were changed by a scale factor.  A density scale
would have the first property but would change drastically if all of
the observations were multiplied by 10, say.  The idea was to decouple
the y-axis scale from the x-axis scale with the "scale" argument.
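
The two properties can be checked numerically.  The following is a rough
sketch in Python, not code from S: the helper name "heights" and the data
are hypothetical, and it assumes the "scale" computation is the count(i)
formula quoted earlier in this thread.

```python
def heights(counts, breaks, density):
    """Bar heights under true density scaling or the 'scale' normalization."""
    widths = [b - a for a, b in zip(breaks, breaks[1:])]
    per_width = [n_i / w_i for n_i, w_i in zip(counts, widths)]
    # density: divide by n so the areas sum to 1;
    # 'scale': divide by the sum of n(i)/w(i) so the heights sum to 1.
    total = sum(counts) if density else sum(per_width)
    return [p / total for p in per_width]


counts = [2, 5, 3]
breaks = [0.0, 1.0, 3.0, 4.0]

# Multiplying every observation by 10 multiplies every break by 10.
breaks_x10 = [10.0 * b for b in breaks]
# Twice as many observations in the same proportions.
counts_x2 = [2 * c for c in counts]

scale_base = heights(counts, breaks, density=False)
scale_x10 = heights(counts, breaks_x10, density=False)
scale_n2 = heights(counts_x2, breaks, density=False)

dens_base = heights(counts, breaks, density=True)
dens_x10 = heights(counts, breaks_x10, density=True)
dens_n2 = heights(counts_x2, breaks, density=True)

print("scale:   base", scale_base, " x10", scale_x10)
print("density: base", dens_base, " x10", dens_x10)
```

In this example both scalings are unaffected by doubling the number of
observations, but only the "scale" heights stay the same when the data
are multiplied by 10; the density heights shrink by a factor of 10.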

In the old (1981) documentation, the argument was called "density", even
though it was computed in this way.  We changed the name to "scale" to
avoid giving people the impression that it meant true density scaling.

In retrospect, it may have been a better idea to use the more familiar
true density scaling.  However, if you make the change that Rolf Turner
suggested, it would probably be a good idea to name the argument
"density" once again.
-- 

    Rick Becker  alice!rab  research!rab