[net.news] About the readership statistics. Some explanations.

reid@glacier.ARPA (Brian Reid) (03/16/86)

Several points. First, the "arbitron" program does not count a newsreader
unless he/she is current in at least one group. "Current" means that he/she
has processed at least one message that is recent enough that it has not yet
been expired. This means that if you run "expire" every week on your site,
with no options (articles are kept the default 14 days, so an article can be
as much as 14 + 7 = 21 days old before it is removed), then it is possible
to count a user who has not read news in 20 days, but it will never count a
user who has not read news in 21 days.

Now, the big point. About statistics and sampling. Scientific sampling and
polling theory is based on the premise that you understand the
characteristics of the target population, and you are trying to find a small
sample that is representative of that population. Unfortunately, nobody has
any real idea how big USENET is. There are 3277 hosts listed in the mod.map
database. Greg Woods' most recent biweekly survey showed that messages were
posted from 1355 different hosts. My own data, accumulated from Glacier's
history file, shows that 2044 sites have posted messages in the last year.

I am using 2044 as the "size of the network", and when I say that I have
got data from X percent of the network, I mean that X percent of 2044 sites
have sent in data.

Now, since the nature of the target population is not known, there is no
way to guarantee that the sampling is representative. What I can do, though,
is to see whether or not the sampling is uniform. I split the data up into
various sets of quarters, and then see how much the results in one
quarter of the data differ from the results in another. This is a way of
measuring the width of the distribution bell curve, and is a standard
technique for checking the consistency of experimental data. What I found is
that the data is remarkably consistent. I tried dividing it up into
universities, AT&T, other commercial, and government, and looking for
systematic differences between those groups. I tried dividing it into large
sites and small sites. I tried dividing by CPU type. No matter how I
partitioned the data, the pieces of the partition had very consistent
statistics. That means that I can be pretty confident that the sample is
representative.
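The partition check described above can be sketched as follows. The site
figures here are invented for illustration (the real input was the per-site
survey data), and a simple mean stands in for whatever statistic was
actually compared:

```python
import statistics

# Partition the per-site readership figures by some attribute,
# compute a statistic within each partition, and compare the
# partitions' results. A small spread between partitions, relative
# to the overall mean, suggests the sample is uniform.
sites = [  # invented example data, not survey results
    {"kind": "university", "readers": 52},
    {"kind": "university", "readers": 47},
    {"kind": "att",        "readers": 44},
    {"kind": "att",        "readers": 50},
    {"kind": "commercial", "readers": 41},
    {"kind": "commercial", "readers": 49},
    {"kind": "government", "readers": 46},
    {"kind": "government", "readers": 43},
]

def partition_means(sites, key):
    """Mean readership within each group defined by `key`."""
    groups = {}
    for s in sites:
        groups.setdefault(s[key], []).append(s["readers"])
    return {k: statistics.mean(v) for k, v in groups.items()}

means = partition_means(sites, "kind")
spread = max(means.values()) - min(means.values())
overall = statistics.mean(s["readers"] for s in sites)
print(means)
print(spread, overall)  # consistent if spread is small vs. overall
```

The same function can be rerun with a different key (site size, CPU type)
to repeat the check over other partitions, which is the procedure the
paragraph above describes.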

About 25% of the data sent in came not from system administrators (SAs), but
from ordinary users. I therefore don't think that there is a huge bias
towards sites with active SAs. I think that the primary bias is towards
sites having readers who care
about net volume.

As of this moment I have data from 99 sites; that represents 5% of the
presumed size of the network, and represents 4138 people who are current
in some newsgroup. It is rare for a network survey to get 99 individual
responses; I now have data for 4138 individuals. I claim that the results
are significant enough to pay attention to at this point. I will be happy to
provide references to elementary or advanced statistics textbooks for those
people who would like to learn more about the kind of analysis I am doing.
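As a quick arithmetic check of the figures in that paragraph (the numbers
are taken directly from the text above):

```python
# Coverage and per-site readership implied by the reported figures.
sites_responding = 99    # sites that have sent in data
network_size = 2044      # presumed size of the network
readers = 4138           # people current in some newsgroup

coverage = 100.0 * sites_responding / network_size
readers_per_site = readers / sites_responding
print(round(coverage, 1))       # 4.8 -- quoted as "5%" in the text
print(round(readers_per_site))  # 42 current readers per reporting site
```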

Brian
-- 
	Brian Reid	decwrl!glacier!reid
	Stanford	reid@SU-Glacier.ARPA