mackeown@CompSci.Bristol.AC.UK (W. Mackeown) (11/22/89)
>I fear that there is one thing worse than bad statistics, and that is
>bad statistics with technical talk to make it sound authoritative.

I see that there is one thing funnier than r.h.f, and that is Brad
Templeton talking about statistics.  But seriously, please read a
textbook on statistics.

Brad Templeton commented <44075@looking.on.ca> on <1046@csisles.Bristol.AC.UK>:

>>I received 52 replies to Poll 1 before the deadline, 1 November, 24:00 GMT,
>>more than enough to get beyond small-sample statistics.
>                  ^^^^
>>95% confidence interval for E[{YES, Q-YES}] is 66% +/- 6.47%.  Clearly,
>>there is strong support for renaming the newsgroups using a more logical
>>name-structure.
>
>First of all, 52 replies from an audience of this size is not enough
>for a good survey.                              [--- Brad Templeton]

It is a very common misunderstanding for people unacquainted with
sampling theory to believe that the size of a sample >must< be large
relative to the size of the population, otherwise the sample will not
accurately reflect the population from which it is drawn: ">>only<<
size 52 out of 37000 people?!"

A sufficient condition for low sampling error (i.e. random bias) is
that the sample size be large relative to the population size.  The
necessary and sufficient condition for low sampling error is that the
>>absolute<< size of the sample be large.

This poll is simply a case of binomial sampling without replacement
from a large but finite population.  Since I received 52 replies, the
statistics for these poll results are such that I can say that there is
only a 1 in 20 chance that the true opinion of the Usenet news.groups
readership is not within 6.47% of 66% in favour of renaming.

The statistical results I am using here are (1) that the accuracy (i.e.
inverse variance) of estimates based on such samples is

	1/Vs = (n/Vp) * (N-1)/(N-n)

where Vs is the variance of the sample estimate, Vp is the population
variance, and N and n are the sizes of the population and the sample
respectively (N ~= 37000, n/N < 0.01, hence Vs ~= Vp/n);

(2) the Central Limit Theorem, to approximate a binomial distribution
by a normal distribution (the approximation is good for n >= 30 and
very good for n >= 50: my n = 52!);

(3) the Raff condition on (2): for the binomial r.v. Z, if
n * p^1.5 > 1.07 then max.|approximation error| < 0.05 for all values
of r in the coefficients nCr (here n * p^1.5 = 27.88 >>> 1.07);

(4) that the sample proportion has variance pq/n = p(1-p)/n about the
mean p (hence 1.96 * stan-dev. = 6.47% gives a 95% confidence limit).

> In particular, because the sample was self-selected.
>A 95% confidence interval means *nothing* if there are no controls on the
>sample.  You simply can't say you have such a confidence interval.
>
>Self-selected polls can tell you something, and I have in fact referred
>to them from time to time [...] but when you use them you must be clear
>about the holes in their validity.  To talk about confidence intervals in
>such cases doesn't mean a lot.

If by "self-selected" you mean that I selected only certain people to
participate in my poll, in order to bias the results, then I can assure
you that I did no such thing.

I first posted my poll in news.groups.  That produced several replies.
I then e-mailed my poll to anyone who had not yet replied but who had
posted article(s) in news.groups related to the names of newsgroups.  I
included everyone there, i.e. both the people for and those against any
changes to the namespace.  That produced the remainder of the replies.

In news.groups during the 3 weeks from 1 October to 21 October, 97
people posted 127 articles on subjects related to the structure of the
namespace, including 45 people who posted 83 articles about the
sci.aquaria proposal.
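As a sketch, results (1) and (3) above can be checked numerically from the round figures quoted in this article (N = 37000, n = 52, p = 0.66; the exact reply counts behind the 66% are not given here):

```python
# Spot-check of sampling results (1) and (3), using the figures
# quoted in the article: N = 37000 readers, n = 52 replies, p = 0.66.
N, n, p = 37000, 52, 0.66

# (1) Finite population correction: Vs = (Vp/n) * (N-n)/(N-1).
# With n/N < 0.01 the correction factor is essentially 1, so Vs ~= Vp/n.
fpc = (N - n) / (N - 1)
assert n / N < 0.01
assert abs(fpc - 1.0) < 0.002   # the correction is negligible

# (3) Raff condition for the normal approximation: n * p**1.5 > 1.07.
raff = n * p ** 1.5
assert round(raff, 2) == 27.88  # matches the article's figure
print(f"fpc = {fpc:.4f}, Raff statistic = {raff:.2f}")
```

The check confirms that at this sampling fraction the without-replacement correction changes the variance by less than 0.2%, and that the Raff statistic comfortably exceeds 1.07.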
I received 52 replies from this group of people, which is sufficient to
ensure that with 95% probability there is no significant >random< bias
in my sample.

I did not control for any hypothetical demographic causes of systematic
bias because I did not, and still do not, see any plausible
mechanism(s) to explain how they might affect the result.  Pollsters
control for a demographic variation in their samples ONLY where they
have ALREADY proposed and validated a plausible mechanism by which that
particular variation might affect the result.  For example, you
>>could<< control for shoe size, but as there seems to be no plausible
mechanism by which it might affect opinions, you would ignore that
variation in your sample.  And by the same token, you would not even
bother to measure it.

It is possible that the people who posted in news.groups might in some
way not be representative of the general readership of news.groups.
That could produce a systematic bias in my sample.  However, I assumed
that there was unlikely to be any significant difference in opinion,
either for or against renaming, between people posting articles and
people reading articles in news.groups.  If you can find a plausible
and validated mechanism of bias then I must of course question the
validity of my poll.  Until then, I believe and support its conclusion.

>Between 7,000 and 20,000 folks read this group by my estimates.  That only
>58 cared enough about renaming to answer this poll says that not many
>people cared about the poll.  That's about all you can conclude!

Not necessarily.  You assume that everyone sees every article.  You
also ignore the possibility that some people who care about renaming
(for or against) may be too busy with other work to reply.
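Mackeown's earlier claim that the *absolute* sample size, not the sampling fraction, governs random error can be illustrated with a small simulation (the population sizes and trial count below are arbitrary choices for illustration, not figures from the poll):

```python
import random

def empirical_se(N, p_true, n, trials=4000, seed=1):
    """Empirical standard deviation of the sample proportion when
    drawing n people without replacement from a population of size N
    in which a fraction p_true holds the YES opinion."""
    rng = random.Random(seed)
    population = [1] * round(N * p_true) + [0] * (N - round(N * p_true))
    props = [sum(rng.sample(population, n)) / n for _ in range(trials)]
    mean = sum(props) / trials
    return (sum((x - mean) ** 2 for x in props) / trials) ** 0.5

# Same absolute sample size n = 52; population sizes 37x apart.
se_small = empirical_se(N=1000, p_true=0.66, n=52)
se_large = empirical_se(N=37000, p_true=0.66, n=52)

# Both are close to the infinite-population value sqrt(pq/n) ~ 0.066;
# once n/N is small, the sampling fraction barely matters.
print(f"SE (N=1000): {se_small:.4f}, SE (N=37000): {se_large:.4f}")
```

The two empirical standard errors agree to within Monte Carlo noise, which is the content of the Vs ~= Vp/n approximation above.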
Some readers of news.groups may not have cared enough to answer the
poll, but given the traffic level in news.groups at the time, it seems
likely that many more people may not have noticed or read the poll, or
may not even have read news.groups before the articles expired at their
site.  Are you certain that you have never missed an article in a
high-traffic newsgroup?
--
William Mackeown  (mackeown@compsci.bristol.ac.uk)
To confuse belief with knowledge is to confirm human nature.
cik@l.cc.purdue.edu (Herman Rubin) (11/24/89)
In article <1199@csisles.Bristol.AC.UK>, mackeown@CompSci.Bristol.AC.UK
(W. Mackeown) writes:

	...........................

> This poll is simply a case of binomial sampling without replacement from a
> large but finite population.  Since I received 52 replies, the statistics for
> these poll results are such that I can say that there is only a 1 in 20 chance
> that the true opinion of the Usenet news.groups readership is not within 6.47%
> of 66% in favour of renaming.

For this to be the case without strong, usually untenable, assumptions
requires that the probability of any set of k individuals being
included be the same as that for any other set of k individuals.  [Most
polls do something better, namely stratified random sampling, but the
basic idea is the same.]  The problem is whether the individuals
responding voluntarily are representative, and this is usually false.

Any inference based on voluntary responses is so fraught with danger as
to be totally untrustworthy without making very strong assumptions
which are extremely unlikely to be reasonable.  There are Bayesian ways
of handling the problem, but these are very sensitive to the
assumptions, and the conclusion may very well depend more on the
assumptions than on the data.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317) 494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
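Rubin's warning about voluntary response can be made concrete with a hypothetical simulation: if supporters of a proposal are merely twice as likely to reply as opponents (the population split and response rates below are invented for illustration, not taken from the poll), a naive 95% interval around the observed proportion almost never covers the true one:

```python
import random

def miss_rate(n_yes=3700, n_no=3700, r_yes=0.10, r_no=0.05,
              trials=200, seed=2):
    """Fraction of simulated voluntary-response polls whose naive 95%
    interval fails to cover the true proportion (here 0.5), when
    YES-holders respond at twice the rate of NO-holders."""
    rng = random.Random(seed)
    p_true = n_yes / (n_yes + n_no)
    misses = 0
    for _ in range(trials):
        # Each person independently decides whether to reply.
        ry = sum(rng.random() < r_yes for _ in range(n_yes))
        rn = sum(rng.random() < r_no for _ in range(n_no))
        phat = ry / (ry + rn)
        half = 1.96 * (phat * (1 - phat) / (ry + rn)) ** 0.5
        misses += abs(phat - p_true) > half
    return misses / trials

rate = miss_rate()
print(f"naive 95% interval misses the truth in {rate:.0%} of polls")
```

The observed proportion concentrates near 2/3 even though the true split is 50/50, so the interval's stated 5% error rate is meaningless: the conclusion reflects the response mechanism, not the population.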