[news.groups] RESULTS of OPINION POLL on Renaming Newsgroups

mackeown@CompSci.Bristol.AC.UK (W. Mackeown) (11/22/89)

>I fear that there is one thing worse than bad statistics, and that is
>bad statistics with technical talk to make it sound authoritative.

I see that there is one thing funnier than r.h.f, and that is Brad Templeton
talking about statistics. But seriously, please read a textbook on statistics:

Brad Templeton commented <44075@looking.on.ca> on <1046@csisles.Bristol.AC.UK>:
>>I received 52 replies to Poll 1 before the deadline, 1 November, 24:00 GMT,
>>more than enough to get beyond small-sample statistics.
> ^^^^
>>95 % confidence interval for E[{YES, Q-YES}] is 66% +/- 6.47%  Clearly,
>>there is strong support for renaming the newsgroups using a more logical
>>name-structure. 
>
>First of all, 52 replies from an audience of this size is not enough
>for a good survey.                 [--- Brad Templeton ]

It is a very common misunderstanding among people unacquainted with sampling
theory to believe that the size of a sample >must< be large relative to the
size of the population, otherwise the sample will not accurately reflect the
population from which it is drawn: ">>only<< size 52 out of 37000 people?!"
A sample that is large relative to the population is a sufficient condition
for low sampling error (i.e. low random variation), but it is not necessary.
The necessary and sufficient condition for low sampling error is that the
>>absolute<< size of the sample be large.
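
As a sketch of this point (the 66% figure, n = 52, and N ~= 37000 are taken
from the poll; the other population sizes are invented for comparison), the
standard error of a sample proportion under sampling without replacement
depends almost entirely on the absolute n, not on the ratio n/N:

```python
import math

def se_proportion(p, n, N):
    """Standard error of a sample proportion when sampling without
    replacement: sqrt(p(1-p)/n) scaled by the finite-population
    correction sqrt((N-n)/(N-1))."""
    return math.sqrt(p * (1 - p) / n) * math.sqrt((N - n) / (N - 1))

p, n = 0.66, 52
for N in (1_000, 37_000, 10_000_000):
    # The standard errors come out nearly identical even though N
    # varies by four orders of magnitude.
    print(N, round(se_proportion(p, n, N), 4))
```

The finite-population correction only matters when n is an appreciable
fraction of N; here n/N < 0.01, so it is negligible.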

This poll is simply a case of binomial sampling without replacement from a
large but finite population. Since I received 52 replies, these poll results
allow me to say that there is only a 1 in 20 chance that the true opinion of
the Usenet news.groups readership is not within 6.47% of 66% in favour of
renaming.

The statistical results I am using here are:

(1) The accuracy (i.e. inverse variance) of estimates based on such samples
    is 1/Vs = (n/Vp) * (N-1)/(N-n), where Vs is the sample variance, Vp is
    the population variance, and N and n are the sizes of the population and
    the sample respectively (N ~= 37000, n/N < 0.01, hence Vs ~= Vp/n).

(2) The Central Limit Theorem, to approximate a binomial distribution by a
    normal distribution (the approximation is good for n >= 30 and very good
    for n >= 50: my n = 52!).

(3) The Raff condition on (2): for the binomial r.v. Z, iff n(p^1.5) > 1.07
    then max.|approximation error| < 0.05 for all values of r in the
    coefficients nCr (here n(p^1.5) = 27.88 >>> 1.07).

(4) The sample proportion has variance pq/n = p(1-p)/n about the mean p
    (hence 1.96 * standard deviation = 6.47% gives a 95% confidence limit).
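
The quantities quoted in (1) and (3) can be checked in a few lines (a sketch
using p = 0.66 and n = 52 from the poll; no other inputs are assumed):

```python
# (3) Raff condition for the normal approximation to the binomial:
# n * p^1.5 must exceed 1.07.
p, n, N = 0.66, 52, 37_000
raff = n * p ** 1.5
print(round(raff, 2))        # 27.88, well above the 1.07 threshold

# (1) the accuracy factor from sampling without replacement:
# 1/Vs = (n/Vp) * (N-1)/(N-n).  With n/N < 0.01 the factor
# (N-1)/(N-n) is essentially 1, so Vs ~= Vp/n as claimed.
correction = (N - 1) / (N - n)
print(round(correction, 4))  # 1.0014
```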

>                   In particular, because the sample was self-selected.
>A 95% confidence interval means *nothing* if there are no controls on the
>sample.  You simply can't say you have such a confidence interval.
>
>Self-selected polls can tell you something, and I have in fact referred
>to them from time to time [...] but when you use them you must be clear
>about the holes in their validity. To talk about confidence intervals in
>such cases doesn't mean a lot.

If by "self-selected" you mean that I selected only certain people to
participate in my poll, in order that I would bias the results, then I can
assure you that I did no such thing. I first posted my poll in news.groups.
That produced several replies.  I then e-mailed my poll to anyone who had not
yet replied but who had posted article(s) in news.groups related to the names
of newsgroups. I included everyone, i.e. people both for and against any
changes to the namespace. That produced the remainder
of the replies.

In news.groups during the 3 weeks from 1 October to 21 October, 97 people
posted 127 articles on subjects related to the structure of the namespace,
including 45 people who posted 83 articles about the sci.aquaria proposal. 
I received 52 replies from this group of people, which is sufficient to
ensure that with 95% probability there is no significant >random< bias in my
sample.

I did not control for any hypothetical demographic causes of systematic
bias because I did not and still do not see any plausible mechanism(s) to
explain how they might affect the result. Pollsters control for a demographic
variation in their samples ONLY where they have ALREADY proposed and validated
a plausible mechanism by which that particular variation might affect the
result. For example, you >>could<< control for shoe size but as there seems to
be no plausible mechanism by which it might affect opinions you would ignore
that variation in your sample. And by the same token, you would also not even
bother to measure it.

It is possible that the people who posted in news.groups might in some way
not be representative of the general readership of news.groups. That could
produce a systematic bias in my sample. However, I assumed that there was
unlikely to be any significant difference in opinion, either for or against
renaming, between people posting articles, and people reading articles, in
news.groups. If you can find a plausible and validated mechanism of bias then
I must of course question the validity of my poll. Until then, I believe and
support its conclusion.

>Between 7,000 and 20,000 folks read this group by my estimates.  That only
>58 cared enough about renaming to answer this poll says that not many
>people cared about the poll.  That's about all you can conclude!

Not necessarily. You assume that everyone sees every article. You also ignore
the possibility that some people who care about renaming (for or against) may
be too busy with other work to reply. Some readers of news.groups may not have
cared enough to answer the poll, but given the traffic level in news.groups at
the time, it seems likely that many more people may not have noticed or read
the poll, or may not even have read news.groups before the articles expired at
their site. Are you certain that you have never missed an article in a
high-traffic newsgroup?
-- 
William Mackeown (mackeown@compsci.bristol.ac.uk) 

To confuse belief with knowledge is to confirm human nature.

cik@l.cc.purdue.edu (Herman Rubin) (11/24/89)

In article <1199@csisles.Bristol.AC.UK>, mackeown@CompSci.Bristol.AC.UK (W. Mackeown) writes:
			...........................

> This poll is simply a case of binomial sampling without replacement from a
> large but finite population. Since I received 52 replies, the statistics for
> these poll results are such that I can say that there is only a 1 in 20 chance
> that the true opinion of the Usenet news.groups readership is not within 6.47%
> of 66% in favour of renaming.

For this to be the case without strong, usually untenable, assumptions,
requires that the probability of any set of k individuals being included
must be the same as that for any other set of k individuals.  [Most polls
do something better, namely, stratified random sampling, but the basic idea
is the same.]  The problem is whether the individuals responding voluntarily
are representative, and this is usually false.

Any inference based on voluntary responses is so fraught with danger as to
be totally untrustworthy without making very strong assumptions which are
extremely unlikely to be reasonable.  There are Bayesian ways of handling
the problem, but these are very sensitive to the assumptions, and the 
conclusion may very well depend more on the assumptions than on the data.
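
Rubin's point about voluntary responses can be illustrated with a toy
simulation (every number below is invented purely for illustration): if
willingness to respond correlates with opinion, the estimate stays biased no
matter how many replies arrive, and the stated confidence interval says
nothing about that bias.

```python
import random

random.seed(0)

# Hypothetical population: exactly 50% genuinely favour renaming.
N = 37_000
population = [True] * (N // 2) + [False] * (N - N // 2)

# Suppose supporters reply with probability 0.02 and opponents with
# probability 0.01 -- invented figures for illustration only.
replies = [v for v in population
           if random.random() < (0.02 if v else 0.01)]

estimate = sum(replies) / len(replies)
print(len(replies), round(estimate, 2))  # estimate lands near 0.67, not 0.50
```

Collecting more replies under the same response mechanism only makes the
biased estimate more precise, not more correct.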
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette, IN 47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)