reid@glacier.ARPA (Brian Reid) (03/16/86)
Several points. First, the "arbitron" program does not count a newsreader unless he/she is current in at least one group. "Current" means that he/she has processed at least one message that is recent enough that it has not yet been expired. This means that if you run "expire" every week on your site, with no options, then it is possible to count a user who has not read news in 20 days, but it will never count a user who has not read news in 21 days.

Now, the big point, about statistics and sampling. Scientific sampling and profiling theory is based on the premise that you understand the characteristics of the target population, and you are trying to find a small sample that is representative of that population. Unfortunately, nobody has any real idea how big USENET is. There are 3277 hosts listed in the mod.map database. Greg Woods' most recent biweekly survey showed that messages were posted from 1355 different hosts. My own data, accumulated from Glacier's history file, shows that 2044 sites have posted messages in the last year. I am using 2044 as the "size of the network", so when I say that I have data from X percent of the network, I mean that X percent of 2044 sites have sent in data.

Now, since the nature of the target population is not known, there is no way to guarantee that the sampling is representative. What I can do, though, is see whether or not the sampling is uniform. I split the data up into various sets of quarters, and then see how much the results in one quarter of the data differ from the results in another. This is a way of measuring the width of the distribution's bell curve, and is a standard technique for checking the consistency of experimental data. What I found is that the data is remarkably consistent. I tried dividing it up into universities, AT&T, other commercial, and government, and looking for systematic differences between those groups. I tried dividing it into large sites and small sites.
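A minimal sketch of this kind of quartering check, for those who want to try it on their own data. The site names and reader counts below are made up for illustration; they are not my survey data.

```python
# Sketch of a consistency check: partition per-site reader counts into
# quarters and compare the quarter means. Data here is hypothetical.
import random
import statistics

random.seed(1986)
# Hypothetical counts: site name -> number of "current" readers there.
readers_per_site = {f"site{i}": random.randint(5, 120) for i in range(100)}

def partition_means(data, k=4):
    """Shuffle the per-site counts, split them into k equal quarters,
    and return the mean readers-per-site within each quarter."""
    values = list(data.values())
    random.shuffle(values)
    size = len(values) // k
    return [statistics.mean(values[i * size:(i + 1) * size]) for i in range(k)]

means = partition_means(readers_per_site)
spread = statistics.pstdev(means)        # spread across the quarter means
overall = statistics.mean(readers_per_site.values())
print(means, spread, overall)
```

If the sample is uniform, the spread of the quarter means is small relative to the overall mean; a quarter that differs wildly from the others would be evidence of a systematic bias in who sent in data.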
I tried dividing by CPU type. No matter how I partitioned the data, the pieces of the partition had very consistent statistics. That means I can be pretty confident that the sample is representative.

About 25% of the data sent in came not from SAs, but from ordinary users. I therefore don't think that there is a huge bias towards sites with active SAs. I think that the primary bias is towards sites having readers who care about net volume.

As of this moment I have data from 99 sites; that represents 5% of the presumed size of the network, and covers 4138 people who are current in some newsgroup. It is rare for a network survey to get 99 individual responses; I now have data for 4138 individuals. I claim that the results are significant enough to pay attention to at this point.

I will be happy to provide references to elementary or advanced statistics textbooks for those people who would like to learn more about the kind of analysis I am doing.

	Brian
-- 
Brian Reid	decwrl!glacier!reid
Stanford	reid@SU-Glacier.ARPA