[comp.sources.d] revised "arbitron"

news@becker.UUCP (UseNet News) (07/20/89)

I have decided to post a revised version of
"arbitron" for News statistics gathering due to
the recent discussion about "unix-pc" and other
groups being under-reported due to bugs in the
script.

 --------- 8< --------- 8< --------- 8< --------- 8< --------- 8< ---------

#! /bin/sh
# @(#)arbitron	2.4.3	Mon Jun  5 05:03:55 EDT 1989
# arbitron -- this program produces rating sweeps for USENET.
#
# Usage: arbitron
#
# To use this program, edit the "configuration" section below so that the
# information is correct for your site, and then run it. It will produce a
# readership survey for your machine and mail that survey to decwrl.dec.com,
# with a cc to you.
#
# To participate in the international monthly ratings sweeps, 
# run "arbitron" every month. I will run the statistics program on the first
# day of each month; it will include any report that has reached it by that
# time. To make sure your site's data is included, run the survey program no
# later than the 20th day of each month.
#
# Brian Reid, DEC Western Research Lab, reid@decwrl.dec.com
# Updated and bugfixed by 
#	Spencer Thomas, U.of Utah
#	Geoff Kuenning, SAH Consulting
# Updated to work with 2.10.1 and older news systems by
#	Lindsay Cleveland, AT&T Technologies/Bell Labs
# Made to work with 16-bit address spaces by
#	Andy Walker, Maths Dept., University of Nottingham, UK
# Nagging Bourne shell bug fixed by
#	Tom Donahue, Rabbit Software Corp
# Newsgroup inclusion bug fix by
#	Bruce Becker, G.T.S.
#
# Note that the results of this program are dependent on the rate at which
# you expire news.  If you are a small site that expires news rapidly, the
# results may indicate fewer active readers than you actually have.
#
###########################################################################
# Configuration information. Edit this section to reflect your site data. #
TMPDIR=/tmp
NEWS=/usr/lib/news
SPOOL=/usr/spool/news

# Make a crude stab at determining the system type. If your installation has
# only one type of system, you can edit out the "if" statement and just turn
# this into an assignment statement of the correct value.
if [ -d /usr/ucb ]; then
	STYPE="bsd"
else
	STYPE="usg"
fi

# Range of /etc/passwd UID's that represent actual people (rather than
# maintenance accounts or daemons or whatever)
lowUID=100
highUID=9999

# If you aren't running a distributed news system (nntpd & rrn, usually),
# leave NEWSHOST blank. Else set it to the name of the host from which you
# can rcp a copy of the active file.
NEWSHOST=

# uucp path: {sun, hplabs, pyramid, decvax, ucbvax}!decwrl!netsurvey
summarypath="netsurvey@decwrl.dec.com $USER"


# We need to find the uucp name of your host. If this code doesn't work,
# then just put it in literally like this:
#	hostname="ihnp4"

case $STYPE in
	bsd)	cmd='hostname || uuname -l';;
	sysv)	cmd='uname -n || uuname -l || hostname';;
	*)	cmd='uuname -l';;
esac

hostname=`sh -c "$cmd" 2>&-`

PATH=$NEWS:/usr/local/bin:/usr/ucb:/usr/bin:/bin
############################################################################
export PATH
# ---------------------------------------------------------------------------
trap "rm -f $TMPDIR/arb.*.$$; exit" 0 1 2 3 15
set `date`
dat="$2$6"
destination="${MAILER-mail} $summarypath"

################################
# Here are several expressions, each of which figures out approximately how
# many people use this machine. Comment out all but 1 of them; pick the one
# you like best. Initially the most universal but least reliable of them is
# uncommented.
# # ###### Scheme #1: fast but usually returns too big a number
nusers=`awk -F: '
BEGIN { N = 0 }
$3 >= '"$lowUID"' && $3 <= '"$highUID"' { N = N + 1 }
END { print N }' </etc/passwd`

# # ###### Scheme #2 (works with BSD systems)
#nusers=`last | sort -u +0 -1 | wc -l`

# # ###### Scheme #3 (works with USG systems)
#nusers=`who /etc/wtmp | sort -u +0 -1 | wc -l`

################################
#
# Set up awk scripts;  these are too large to pass as arguments on most
# systems.
#
# This awk script generates the actual output report.
# We use 'sed' to substitute in the shell variables to save ourselves
# endless hassle trying to find quoting/backslashing problems.
#
# The input to this script consists of two types of lines (pre-sorted):
#
#	(1) Active-file lines.  These have four fields:  newsgroup name,
#	    first existing article, last article number, 'y' or 'n'
#	    to allow/disallow posting.
#			mod.mac 00001 00001 y
#
#	(2) .newsrc-derived lines.  These have three fields:  the newsgroup
#	    name, the user name and the articles-read information.  The latter
#	    can be arbitrarily complex.  It can also be arbitrarily long;
#	    this can potentially break either awk or sed, in which
#	    case the script will not work.
#			mod.map joe 1-199
#
#	The script uses the type 1 lines to define the newsgroups
#	and their active article ranges.  The .newsrc (type 2) lines are
#	then used to deduce which users are reading that group (a group
#	is being read if the last article seen is in that group's active
#	article range).
#
sed    "/^#/d
	s/NUSERS/$nusers/g
	s/HOSTNAME/$hostname/g
	s/DATE/$dat/g" > $TMPDIR/arb.fmt.$$ << 'DOG'
# makereport -- utility for "arbitron". Early versions were copied from a
# similar script distributed with "subscribers.sh" by Blonder, McCreery, and
# Herron.
#
BEGIN	{
	grpcount  = 0
	realusers = 0
}
#
# Active file line:  dispose of previous group (if any), record group, and
# record first and last article numbers.  Set group's reader count to none.
NF == 4 {
	if (grpname != "") {
		printf("%d %s\n", grpcount, grpname)
	}
	grpname  = $1
	grpfirst = $3
	grplast  = $2
	grpcount = 0
}
#
# .newsrc line.  Break out the final number, which is the last article that
# has actually been read.  This is a pretty good indicator of the person's
# true interest in the group.  If 'lastread' for the group is a current
# (unexpired) article, record a reader for that group.  Finally, record
# the user as a "real" user of the news system.
#
NF == 3 {
	if ($1 != grpname) next;
	n1 = split($3, n2, "-")
	n3 = split(n2[n1], n4, ",")
	lastread = n4[n3]
	if ((grpfirst != grplast) && (lastread >= grpfirst) && \
		(lastread <= grplast)) {
		grpcount++
		if (realuser[$2] != 1) {
		    realuser[$2] = 1
		    realusers++
		}
	}
}
#
# End of file.  Print the report in 2 columns.
END	{
	# For reorganized network, report a group even if nobody reads it.
	# This will help us keep track of where the groups propagate.
	if (grpname != "") {
		printf("%d %s\n", grpcount, grpname)
	}
	printf("9999 Host\t\t%s\n", "HOSTNAME")
	printf("9998 Users\t\t%d\n", NUSERS)
	printf("9997 NetReaders\t%d\n", realusers)
	printf("9996 ReportDate\t%s\n", "DATE")
	printf("9995 SystemType\tnews-arbitron-2.4.3\n")
}
DOG

cat >$TMPDIR/arb.pwd.$$ <<'MOUSE'
BEGIN	{
	seen["/"] = 1
	seen[""]  = 1
}
{
	if (seen[$6] != 1)	{
		printf("if [ -r %s/.newsrc ] ; then ", $6)
		printf("sed -n '/: [0-9]/s/:/ %s/p' <%s/.newsrc; fi\n", $1, $6)
		seen[$6] = 1
	}
}
MOUSE

# First, make sure we have an active file
if [ -z "$NEWSHOST" ]; then
	ACTIVE=$NEWS/active
else
	ACTIVE=/tmp/arb.active.$$
	rcp $NEWSHOST:$NEWS/active $ACTIVE
fi

if [ ! -s $ACTIVE ]; then
	echo arbitron: ACTIVE file missing or empty. Cannot continue.
	exit 1
fi

# Next, get the list of .newsrc files with duplicates and unreadable files
# removed.
awk -F: -f $TMPDIR/arb.pwd.$$ </etc/passwd | sh >$TMPDIR/arb.tmp.$$

# Check to make sure that we found some
if [ -s $TMPDIR/arb.tmp.$$ ]; then
	# See if "active" file has 4 fields or only two (pre-2.10.2)
	set `sed 1q < $ACTIVE`
	if [ $# -eq 2 ]; then
		egrep  '^[a-z][-0-9_a-z]*\.' $ACTIVE |
		while read group last; do
			dir=`echo "$group" | sed 's;\.;/;g'`
			first=`ls $SPOOL/$dir | grep '^[0-9]*' | sort -n | sed 1q`
			case $STYPE in
				usg)	echo "$group $last ${first:-$last} X";;
		  		*)	echo "$group $last ${first-$last} X";;
		   	esac
		done
	else
		egrep '^[a-z][-0-9_a-z]*\.' $ACTIVE
	fi |
	sort - $TMPDIR/arb.tmp.$$ |
	awk -f $TMPDIR/arb.fmt.$$ |
	sort -nr |
	sed    '/^$/d
		/^[0-9]* to\./d
		s/^999[0-9] //' |
	$destination
else
	echo Unable to find any readable .newsrc files 2>&1
	exit 1
fi

 --------- 8< --------- 8< --------- 8< --------- 8< --------- 8< ---------

Cheers,
-- 
Bruce Becker	Toronto, Ont.
Internet: news@becker.UUCP	UUCP: ...!uunet!mnetor!becker!news
"'Comefrom' considered useful" - the Anti-Dijkstra

epsilon@wet.UUCP (Eric P. Scott) (07/23/89)

I'll be happy to run this ... once Brian Reid endorses it.
I'm not about to "break" his assumptions.

					-=EPS=-