reid@decwrl.UUCP (01/01/87)


This is the first article in a monthly posting series from the Network
Measurement Project at the DEC Western Research Laboratory in Palo Alto,
California.

This survey is based on a sample of data taken from various USENET sites.
At the end of this message there is a short explanation of the measurement
techniques and the meaning of the various statistics. The messages that
follow this one show survey data sorted by various criteria.

The complete set of readership data (of which this is a summary) is posted
in mod.newslists. The software that will let your site participate in the
survey is in net.sources.

			Brian Reid

                             This            Estimated
                            Sample         for entire net
Sites:                      401                 5300
Fraction reporting:        7.57%                 100%
Users with accounts:      49773               657000
Netreaders:               11882               157000

Average readers per site:                          30
Percent of users who are netreaders:            23.87%
Average traffic per day (megabytes):            1.449
Average traffic per day (messages):               561
Traffic measurement interval:    last              21 days
Readership measurement interval: last              75 days
Sites used to measure propagation:                151

Valid data received from these sites:

3comvax a60 aaec abic abnji acctel acetes acornrc adelie akov68.dec.com
alberta alliant altos.dec.com alv amdahl amdcad ames argus
arthur.cs.purdue.edu ascvax asd.dec.com astro astrovax atari athena
aurora author.dec.com axis b-tech bartok.dec.com bass basser bcm5000
bdmrrr bemis.dec.com bene beno beta.dec.com bigbang bigtex
bizet.dec.com bms-at bnl bolt.dec.com brand brspyr1 btnix bu-cs cacilj
cadomin cae780 caip.rutgers.edu calgary carmel cascade casee.dec.com
cavell cbdkc1 cbosgd celica.dec.com cesare.dec.com cgfsv1.dec.com
cgfsv2.dec.com cgl.ucsf.edu cgoo01.dec.com chalmers charlie chas2
cheviot ci-dandelion circe cisden cisunx cit-vax clio cod cognos
concurrent.co.uk cookie.dec.com cooper cp1 cpro cpsc53 cpw.columbia.edu
crcge1 crin crvax1.dec.com cs.nott.ac.uk csadfa csustan cuae2 cuuxb cvl
cwruecmp cxsea daimi dalcs darth davasun dayton dciem dcl-cs dcl-csvax
debet.dec.com dec-marlboro.arpa decna.dec.com decwrl decwrl.dec.com
desint devon diamond.bbn.com dicome dinadan dlb dmcnh doshita drexel
dssdev.dec.com dukempd dycom ector.cs.purdue.edu edison
elbereth.rutgers.edu elroy elsie elwood.dec.com ems endor enmasse
entropy eros eta ethos euclid.dec.com fai felix firqb.dec.com fisher
flinders fortune fritz ganash garfield garfield.mun.cdn gargoyle gatech
geac genrad glacier gould9 gouldsd gramps.dec.com grc97 grebyn
gt-stratus haddock.isc.com hao hcx1 hdsvx1 hercules hjuxa hoptoad hpcea
hpldora hpscad.dec.com hqda-ai hscfvax ihnp4 ileaf ima imagen imt3b2
infinet infopro intrin invest ipso.oz iscuva ishtar isl ittvax
izimbra.css.gov jasper jimi jon.dec.com jplgodo kirk.dec.com korppi
kosman kpe labrea ldp.dec.com liuida lll-crg luke macbeth maccs macs
marlboro.dec.com marlin masscomp maynard mcc-pp mcgill-vision mck-csc
me-ncr meccts midacs mind mips mit-eddie mit-trillian mks mntgfx
mordred.cs.purdue.edu mormps.dec.com mss msudoc mtgzy mtgzz mulga
munnari munsell myrias naakka navajo nbires ncr-sd ncrcae nears nesterc
nexus.dec.com nike noao novavax nplpsg nsc nttlab nucsrl obiwan.dec.com
oblio ocean oddjob oktext omepd onecom opus orion osi3b2 osiris
osu-eddie osupyr panda pbhya pbhyc pdn pegasus penet percival peregrine
philtis phoenix phri pitt pixar plaid plus5 pogo polaris pompeo.dec.com
popeye poseidon potaru.dec.com potomac princeton psivax ptsfa ptsfb
ptsfc ptsfd qantel qnda01 quad1 ra raster rayssd reality1 rlvd
rochester rocky rosevax rsts32.dec.com rti-sel saber samira sandia
sandoz santra saturn sauron scgvaxd scicom sdcsvax se-sd shasta sicsten
sigma sjuvax smaug.dec.com soma spar sphinx sri-spam stb stride styx
sunybcs teddy teklds temvax termin tesla tilt tipple.dec.com
tkov58.dec.com tmsoft topaz.rutgers.edu tropix truman.dec.com trwhal
trwspf tuck tucos turtlevax tutctl tutor tymix ubc-cs uiucuxa uiucuxc
uiucuxe uiucuxf ujocs ulowell umd5.umd.edu umn-cs umndub uncle.dec.com
uokmax uqcspe.oz usc-oberon usceast usiv03.dec.com utacs utah-cs
utah-gr utcs utcsri uwmacc valmet vianet viking.dec.com vilya
vino.dec.com viper voder voodoo vrdxhq vu-vlsi vulcan walldata walrus
wang7 wanginst watale watarts watcal watcgl watdaisy watdcsu watdragon
wateng water watlion watmath watmum watnot watopt watpix watrose
watvlsi wjh12 wnuxb wolf wuphys xios yale yarra yetti yogi.dec.com zeus


Survey data is taken by having one person at each site run a program called
"arbitron", which looks at the news or notes files and determines the
newsgroups that the user has read within a recent interval. To "read" a
newsgroup means to have been presented with the opportunity to look at at
least one message in it. Going through a newsgroup with the "n" key counts
as reading it. For a news site, "user X reads group Y" means that user X's
.newsrc file has marked at least one unexpired message in Y. If there is no
traffic in a newsgroup for the measurement period, then the survey will show
that nobody reads the group. For a notes site, "user X reads group Y" means
that user X has been in the notesfile with the sequencer in the last 14 days.
The "14 days" interval for notesfiles corresponds to "unexpired" for news.
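
As a rough illustration of the news-site rule above (not the arbitron source
itself), the sketch below parses one .newsrc-style line and checks whether any
article number at or above the oldest unexpired article is marked as read. The
function names and the oldest_unexpired parameter are assumptions made for
this example, and the subscription flag is deliberately ignored.

```python
def parse_newsrc_line(line):
    """Split a .newsrc line into (group, subscribed, [(lo, hi), ...]).

    Assumes the common layout "group: 1-420,425" for subscribed groups
    and "group! 1-5" for unsubscribed ones. Returns None if neither
    separator is present.
    """
    line = line.strip()
    for sep in (":", "!"):
        if sep in line:
            group, _, rest = line.partition(sep)
            ranges = []
            for part in rest.replace(" ", "").split(","):
                if not part:
                    continue
                lo, _, hi = part.partition("-")
                ranges.append((int(lo), int(hi or lo)))
            return group, sep == ":", ranges
    return None


def reads_group(newsrc_line, oldest_unexpired):
    """True if at least one marked article is still unexpired.

    oldest_unexpired is a hypothetical input: the lowest article number
    that has not yet expired in this group at this site.
    """
    parsed = parse_newsrc_line(newsrc_line)
    if parsed is None:
        return False
    _group, _subscribed, ranges = parsed
    return any(hi >= oldest_unexpired for _lo, hi in ranges)
```

A group with no marked articles, or with only expired ones marked, counts as
unread under this sketch, which is consistent with the note above that a group
with no traffic in the measurement period shows no readers.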

The "arbitron" program is periodically posted to net.sources, or is available
from me (decwrl!reid). The notesfiles version of the program should be
available through standard notesfiles software distribution channels as well.


"This Sample" means the set of sites that have sent in an arbitron report
within the past "Readership measurement interval" days. In every case the
most recent report from each site is used. At the moment, some of the
readership reports are several months old. In future postings those reports
will have expired and will not be included.

One might argue that the sample is self-selected, like the famous Literary
Digest Landon-Roosevelt election poll sample. It does in fact have a certain
self-selection factor in it, because we only get data from sites at which
someone participates in the survey. However, we do not require the
participation of every user at a site, only one user. The survey program
returns data for every user on the system on which it was run. Since there
are an average of 30 people per site reading news, there is a certain amount
of randomness introduced that way. Of course, the sample is biased in favor
of large sites (they are more likely to have a user willing to run the survey
program) and software-development-oriented sites (more likely to have a user
*able* to run the survey program). I intend to post, reasonably soon, some
breakdowns of statistics about the sites that have responded.


I determine the network size by looking at the set of sites that are
mentioned in the Path lines of news articles arriving at decwrl. This number
is consistently higher than the number of sites that posted a message (as
measured and posted from Seismo) because it includes passive sites that are
on the paths between posting sites and decwrl. Each month I store the names
of the hosts that are named that month, and for this report I used the past
9 months' worth of data.

There are 5249 different sites in the Path lines of articles that
arrived at decwrl in the last 9 months. There are 4959 different sites in
the mod.map data, but mod.map includes every site that participates in uucp;
there is a considerable number of machines that exchange uucp mail but do not
get USENET. Of those 5249 sites, 46 (under 1%) are DEC E-net hosts that are
not part of uucp and are therefore not included in the 4959 figure.
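
The site count described above can be sketched as a set union over the hosts
named in Path values. This is a minimal illustration, assuming bang-separated
Path values (without the "Path: " prefix) whose last component is the poster's
login name rather than a host; the function names are made up for the example.

```python
def hosts_on_path(path_value):
    """Return the hosts named in a Path value, dropping the final component.

    Assumes the final bang-separated component is a user name, not a site.
    """
    parts = [p for p in path_value.strip().split("!") if p]
    return parts[:-1]


def count_sites(path_values):
    """Count the distinct hosts seen across many Path values."""
    sites = set()
    for path in path_values:
        sites.update(hosts_on_path(path))
    return len(sites)
```

Because every relay on the path is counted, a site that never posts still
appears in the total as long as some article passes through it.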

Despite these various difficulties, I believe that 5300 is the best
estimate for the size of USENET. Because it is actually a measurement of the
number of sites that have posted a message or that are on the path to a site
that has posted a message, it will be slightly smaller than the number of
sites that actually read netnews. Any site that believes it is not being
counted can just ensure that it posts at least one message a year, so that
it will be counted.


The number of users at each site is determined in a site-specific fashion.
Sometimes it is done by counting the number of user accounts that have
shells and login directories. Sometimes it is done by counting the number of
people who have logged in to the machine in some interval. Sometimes other
techniques are used. This number is probably not very accurate--certainly
not more accurate than to within a factor of two.


There are two sources of error in the estimated worldwide number of readers
of a group. The number is computed by
multiplying the number of people in the sample who actually read the group by
the ratio of estimated network size to sample size. The estimated total can
therefore be biased by errors in the network size estimate (see above) and
also by errors in the determination of whether or not someone reads a group.
Assuming that "reading a group" is roughly the same as "thumbing through a
magazine", in that you don't necessarily have to read anything, but you have
to browse through it and see what is there, then the measurement error will
come primarily from inability to locate .newsrc files, which can either be
protected or moved out of users' login directories. There is no way of measuring the
effect on the measurements from unlocated .newsrc files, but it is not likely
to be more than a few percent of the total news readers.
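
The extrapolation described above is a single scaling step. The sketch below
shows it with illustrative parameter names that are not taken from the
arbitron software.

```python
def estimate_worldwide_readers(sample_readers, sample_sites, estimated_net_sites):
    """Scale a reader count from the sample up to the whole net.

    Multiplies the sample count by the ratio of estimated network size
    to sample size, as described in the text.
    """
    return sample_readers * estimated_net_sites / sample_sites
```

With the figures from the summary table (11882 netreaders at 401 sample sites,
5300 estimated sites), this gives roughly 157000, which agrees with the
"Netreaders" estimate for the entire net.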


The propagation figure is the percentage of sites that receive this
newsgroup at all. The information necessary to compute propagation was not generated
by early versions of the arbitron program, so the "basis" (number of sites)
used to generate the Propagation figure is smaller than the "Sites in this
sample" figure. A site's data will be used to compute propagation if either
(a) it reports zero readers for at least one group, or (b) it is using an
arbitron with an explicit version number that is high enough. 
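
A minimal sketch of the basis rule and the propagation percentage follows. It
rests on several assumptions that are not in the survey software: each site
report is represented as a dict with a "readers" map from group name to reader
count, a group appearing in that map (even with zero readers) is taken to mean
the site receives it, and MIN_VERSION is a hypothetical threshold.

```python
MIN_VERSION = 2  # hypothetical threshold; the real cutoff is not stated here


def in_basis(report):
    """Apply rules (a) and (b): a zero-reader group, or a new enough version."""
    has_zero = any(n == 0 for n in report["readers"].values())
    version = report.get("version")
    return has_zero or (version is not None and version >= MIN_VERSION)


def propagation(reports, group):
    """Percentage of basis sites whose report lists the group at all.

    Returns None when no site qualifies for the basis.
    """
    basis = [r for r in reports if in_basis(r)]
    if not basis:
        return None
    receiving = sum(1 for r in basis if group in r["readers"])
    return 100.0 * receiving / len(basis)
```

Sites that fail both rules are left out of the denominator entirely, which is
why the basis is smaller than the number of sites in the sample.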


Traffic is measured at decwrl, in Palo Alto, California. Any message that has
arrived at decwrl within the last "Traffic measurement interval" days is
counted, regardless of when it was posted. Monthly rates are computed by
taking the total traffic, dividing by the number of days in the traffic
measurement interval, and multiplying by 30. Decwrl runs 2.10.3 news, which
does not store the "Date-Received", "Relay-version" or "Posting-version"
header lines; the amount of space occupied at your site might be higher, and
the number of bytes transmitted between machines is probably higher. By
definition this number is correct, because it is an exact measurement, but it
may differ from the traffic at your site by as much as 15% due to timing
differences and news version differences. Timing differences will be random,
but will average out in the long run. News version differences will cause a
systematic error that is additively uniform across all newsgroups, and which
therefore does not significantly affect ratios.

If a message is crossposted to several groups simultaneously, it is charged
only to the first-named group in the list.
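
The monthly-rate and crossposting rules above can be sketched together as
follows. The input format (pairs of a Newsgroups value and a size in bytes)
and the function name are assumptions made for the example.

```python
from collections import defaultdict


def monthly_traffic(messages, interval_days):
    """Return {group: (messages_per_month, bytes_per_month)}.

    messages is an iterable of (newsgroups_value, size_in_bytes) pairs for
    everything that arrived during the measurement interval. A crossposted
    message is charged only to the first-named group.
    """
    totals = defaultdict(lambda: [0, 0])
    for newsgroups, size in messages:
        first = newsgroups.split(",")[0].strip()
        totals[first][0] += 1
        totals[first][1] += size
    # Divide by the interval length and multiply by 30 to get a monthly rate.
    scale = 30.0 / interval_days
    return {g: (count * scale, nbytes * scale)
            for g, (count, nbytes) in totals.items()}
```

One consequence of the first-named-group rule is that a group receiving mostly
crossposts from elsewhere can show less traffic than its readers actually see.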


This number, the participation ratio, is exactly what it says: the number of
messages per month in that newsgroup, divided by the number of readers in
thousands. It is an indication of how involved the readers of the group are
in the traffic, of whether they are mostly listeners or mostly talkers. Its
accuracy is limited by the accuracy of its two components. The messages per
month figure is exact; the reader count is only as accurate as the network
size estimate, which is in the worst case accurate to within 40%. Therefore
you should treat this number as having an error margin of plus or minus 40%.
However, ratios between participation
ratios for different newsgroups are quite accurate, since the network-size
component divides out.
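
To illustrate why the network-size component divides out, the sketch below
computes messages per month per 1000 estimated readers for two groups under
two different network-size estimates. All figures in the usage are invented
for illustration, and the function name is an assumption.

```python
def participation_ratio(messages_per_month, sample_readers,
                        sample_sites, net_sites):
    """Messages per month per 1000 estimated worldwide readers."""
    readers = sample_readers * net_sites / sample_sites
    return messages_per_month / (readers / 1000.0)


# Two hypothetical groups, evaluated under two network-size estimates.
for net_sites in (5300, 4000):
    a = participation_ratio(300, 100, 401, net_sites)
    b = participation_ratio(600, 50, 401, net_sites)
    print(net_sites, a / b)
```

The individual values change with the network-size estimate, but the ratio
between the two groups stays the same, because the same scaling factor appears
in both.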


The most controversial field in the survey report is the "$US per month per
reader". It is the estimated number of dollars that are being spent on
behalf of each reader, worldwide, on telephone costs to transmit this
newsgroup. The cost ratio does not include the cost of disk storage to store
the news or of computer time to process it; both of those are assumed to be
negligible.

The cost ratio is computed as follows:

$US/month/reader = ($USPerMonthPerSite * numberOfSites) / numberOfReaders
$USPerMonthPerSite = KBytesTrafficPerMonth * $USPerKByte
$USPerKByte = ($USPerMinute / KBytesPerMinute) * (1 - CompressionFactor)
$USPerMinute = 0.10	[ten cents per minute avg phone cost]
KBytesPerMinute = 60 * BytesPerSecond / 1000
BytesPerSecond = 100	[average transfer rate over 1200-baud line]
CompressionFactor = 0.4 [40% compression is typical for netnews]

Combining all these gives

$USPerMonthPerSite =
    KBytesTrafficPerMonth * (0.10 / 6) * (1 - 0.4)
  = KBytesTrafficPerMonth / 100

and therefore

$US/month/reader =
    (KBytesTrafficPerMonth * numberOfSites) / (100 * numberOfReaders)
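
The same computation can be written out as a short sketch. The constants are
the ones given above; the function and variable names are chosen for the
example only.

```python
USD_PER_MINUTE = 0.10      # ten cents per minute average phone cost
BYTES_PER_SECOND = 100     # average transfer rate over a 1200-baud line
COMPRESSION_FACTOR = 0.4   # 40% compression assumed typical for netnews


def usd_per_kbyte():
    """Phone cost of moving one kilobyte of (compressed) news."""
    kbytes_per_minute = 60 * BYTES_PER_SECOND / 1000
    return USD_PER_MINUTE / kbytes_per_minute * (1 - COMPRESSION_FACTOR)


def usd_per_month_per_reader(kbytes_per_month, number_of_sites,
                             number_of_readers):
    """Cost ratio: total monthly phone cost divided by reader count."""
    per_site = kbytes_per_month * usd_per_kbyte()
    return per_site * number_of_sites / number_of_readers
```

With these constants the per-kilobyte cost works out to about one cent, which
is where the division by 100 in the combined formula comes from.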

The accuracy of this number is in fact better than the accuracy of the
participation ratio, because the source of error--the network size
estimate--is present both in the numerator and the denominator, and therefore
cancels out. The primary source of bias in this number comes from the bias in
the "estimated number of readers, worldwide", which is described above. Treat
this value as being accurate to within about 25%.


I would like to receive data from every site on USENET. The arbitron programs
(posted to net.sources along with this report) work on news 2.9, 2.10.[1-3],
2.11, and on many versions of notesfiles.

Brian Reid
DEC Western Research Laboratory, Palo Alto CA