[news.software.nntp] modified arbitron script for NNTP server/clients

loverso@Xylogics.COM (John Robert LoVerso) (11/17/89)

Here's an arbitron I hacked up to handle the case of clients reading
news via NNTP or NFS.  The problem is that if you use something
a la HIDDENNET, where you advertise only one name to the world
(`Xylogics.COM'), you can send in only one arbitron report, and so
the people who read via NNTP never get counted.  This solves the
problem only in the way that it affects me, so it is not as general
as it could be.  In particular, I assume you can `rsh' to the other
hosts to run part of the script.

Note that this only needs to be installed on one machine; when it
runs it pipes over to the other hosts the script that needs to get
run.  I did this because I didn't want to have to update the script
on a bunch of hosts (which I might not control, etc).  A second
reason was a desire to change as little of arbitron as I had to.
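
To illustrate the pipe-over idea, here is a minimal sketch.  It is not
lifted from arbitron: `run_on' and `ship_script' are names made up for
this example, and a local /bin/sh stands in for `rsh $host /bin/sh' so
that it can be tried without any remote hosts.

```shell
#! /bin/sh
# Sketch only, not arbitron itself.  run_on and ship_script are made-up names.

# Stand-in for "rsh $host /bin/sh": ignores the host argument and runs a
# local shell that reads its script from standard input.
run_on() {
	/bin/sh
}

ship_script() {
	greeting="expanded by sender"
	# Unquoted here document: $greeting is filled in locally before the text
	# is sent; \$where is passed through and expanded by the receiving shell.
	cat << EOF | run_on "$1"
where="expanded by receiver"
echo "$greeting / \$where"
EOF
}

ship_script localhost
```

The same split shows up in the script below: things like $lowUID are
filled in on the sending side, while anything written with a backslash
in front of the `$' is left for the other host to expand.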

It works best to install this on your news server, because embedded in
a comment below are a few lines that create a list of the NNTP clients
that have accessed you in the last month.

WARNING: somewhere below, I use a "<<-".  If you've an ancient
Bourne shell (v7, 4BSD), change it into a "<<" and then remove all
the tabs from the lines of the here-is document.
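
For anyone unsure what that conversion amounts to, here is a small
sketch (function names made up for the illustration).  The two here
documents are generated with printf so that every tab shows up as \t
and nothing gets lost if the whitespace is mangled in transit; it
assumes a /bin/sh that does understand "<<-".

```shell
#! /bin/sh
# Sketch only.  with_dash and without_dash are made-up names.

# A script using "<<-": leading tabs are stripped from the body lines and
# from the line holding the terminator, so the document may be indented.
with_dash() {
	printf 'cat <<- EOF\n\tone\n\ttwo\n\tEOF\n' | /bin/sh
}

# The same script rewritten for a shell without "<<-": plain "<<" with
# the tabs removed by hand.
without_dash() {
	printf 'cat << EOF\none\ntwo\nEOF\n' | /bin/sh
}

with_dash
without_dash
```

Both functions print the same two lines.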

One other minor change is to look for ~/News/.newsrc files.
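
A rough sketch of that lookup, pulled out on its own: `newsrc_lines' is
a made-up name, the user and home directory are passed in by hand
instead of coming from the passwd file, and the ~/.nn/rc case that the
real script checks first is left out.

```shell
#! /bin/sh
# Sketch only.  newsrc_lines is a made-up name for this illustration.

# Print "group user ranges" for each subscribed group that has a read
# range, preferring $home/News/.newsrc over $home/.newsrc.
newsrc_lines() {
	user=$1 home=$2
	if [ -r "$home/News/.newsrc" ]
	then sed -n "/: [0-9]/s/:/ $user/p" < "$home/News/.newsrc"
	elif [ -r "$home/.newsrc" ]
	then sed -n "/: [0-9]/s/:/ $user/p" < "$home/.newsrc"
	fi
}

# Try it on a throwaway home directory.
demo=${TMPDIR-/tmp}/newsrc-demo.$$
mkdir -p "$demo/News"
printf 'comp.lang.c: 1-199\ntalk.bizarre! 1-50\nalt.empty:\n' > "$demo/News/.newsrc"
newsrc_lines joe "$demo"
rm -rf "$demo"
```

Only the subscribed group with something read survives, as
"comp.lang.c joe 1-199"; the unsubscribed (`!') and never-read lines
are dropped by the sed pattern.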

As a side note: this looks like the type of script that someone (not
me) could write cleaner in perl.  I don't have the time, hence this
hacked ditty.

../John

#! /bin/sh
# @(#)arbitron	2.4.4	10/20/89 + jlv11/17/89
# arbitron -- this program produces rating sweeps for USENET.
#
# Usage: arbitron
#
# To use this program, edit the "configuration" section below so that the
# information is correct for your site, and then run it. It will produce a
# readership survey for your machine and mail that survey to decwrl.dec.com,
# with a cc to you.
#
# To participate in the international monthly ratings sweeps, 
# run "arbitron" every month. I will run the statistics program on the first
# day of each month; it will include any report that has reached it by that
# time. To make sure your site's data is included, run the survey program no
# later than the 20th day of each month.
#
# Brian Reid, DEC Western Research Lab, reid@decwrl.dec.com
# Updated and bugfixed by 
#	Spencer Thomas, U.of Utah
#	Geoff Kuenning, SAH Consulting
# Updated to work with 2.10.1 and older news systems by
#	Lindsay Cleveland, AT&T Technologies/Bell Labs
# Made to work with 16-bit address spaces by
#	Andy Walker, Maths Dept., University of Nottingham, UK
# Nagging Bourne shell bug fixed by
#	Tom Donahue, Rabbit Software Corp
# Various suggestions provided by:
#	Karl Tombre, CRIN, Vandoeuvre France
#	Dan Kolkowitz, Stanford
#	Paul Eggert, Unisys
# Fixes for YP and NN by
#	Dan Kolkowitz, Stanford
#	Brent Chapman, Capital Market Technology
# Hacks for gathering client stats (NNTP/NFS)
#	John Robert LoVerso, Xylogics
#
# Note that the results of this program are dependent on the rate at which
# you expire news.  If you are a small site that expires news rapidly, the
# results may indicate fewer active readers than you actually have.
#
###########################################################################
# Configuration information. Edit this section to reflect your site data. #
TMPDIR=/tmp
NEWS=/usr/lib/news
SPOOL=/usr/spool/news

# Make a crude stab at determining the system type. If your installation has
# only one type of system, you can edit out the "if" statement and just turn
# this into an assignment statement of the correct value.
if [ -d /usr/ucb ]
then
    STYPE="bsd"
else
    STYPE="usg"
fi

# Range of /etc/passwd UID's that represent actual people (rather than
# maintenance accounts or daemons or whatever)
lowUID=99
highUID=9999

# If you aren't running a distributed news system (nntpd & rrn, usually),
# leave NEWSHOST blank. Else set it to the name of the host from which you
# can rcp a copy of the active file.
NEWSHOST=

# Other hosts (where we can rsh) which use the same active file as us/NEWSHOST.
#
# This is for the case where news is on a master server, and accessed by
# clients using something like NFS or NNTP.  OTHERS should be defined to
# be all those client hosts.  If people read news on this machine, too,
# then be sure to include "localhost"(!).
#
# I create `nntp/readers' by running the following DAILY in /usr/lib/news/nntp:
#
#	# find local NNTP hosts
#	cd .readers
#	touch `awk '{ print $6 }' ../nntplog | grep -v : | sort | uniq`
#	find . -mtime +31 -exec rm {} ';'
#	touch localhost
#	echo * > ../readers
#
# Leave this UNDEFINED if this is the only host you want arbitron to base its
# report on.
#
#OTHERS=`cat ../nntp/readers`

# If usernames are not unique across the OTHERS, then set this to false.
# If this is set to true, then the newsrc lines of joe@foovax and joe@mysun
# are considered those of the same user; usually the first found will win
# out.  If false, then the above are considered separate users.
#
# This has no meaning unless OTHERS is set.
UNIQUE=true

# uucp path: {sun, hplabs, pyramid, decvax, ucbvax}!decwrl!netsurvey
summarypath="netsurvey@decwrl.dec.com usenet"


# We need to find the uucp name of your host. If this code doesn't work,
# then just put it in literally like this:
#	hostname="decwrl"

case $STYPE in
	bsd) cmd='hostname || uuname -l';;
	usg|sysv) cmd='uname -n || uuname -l || hostname';;
	*)   cmd='uuname -l';;
esac;

hostname=`sh -c "$cmd" 2>&-`

PATH=$NEWS:/usr/local/bin:$PATH
############################################################################
export PATH
# ---------------------------------------------------------------------------
trap exit 1 2 3 15
trap 's=$?; rm -f $TMPDIR/#arb.*.$$; exit $s' 0
set `date`
dat="$2$6"
destination="${MAILER-mail} $summarypath"

PASSWD=$TMPDIR/'#'arb.passwd.$$
FORMAT=$TMPDIR/'#'arb.fmt.$$
PWMUNGE=$TMPDIR/'#'arb.pwd.$$
GLEAMED=$TMPDIR/'#'arb.tmp.$$

################################
# Here are several expressions, each of which figures out approximately how
# many people use this machine. Comment out all but 1 of them; pick the one
# you like best. Initially the most universal but least reliable of them is
# uncommented.
#
# WARNING:
# This is unused if OTHERS is set.  In that case, a remote variation of
# scheme 1 is used (see below).
#
> $PASSWD
if [ -z "${OTHERS+set}" ]
then
	(ypcat passwd || cat /etc/passwd) > $PASSWD

	# # ###### Scheme #1: fast but usually returns too big a number
	nusers=`awk -F: "BEGIN {N=0}\\$3>=$lowUID && \\$3<=$highUID{N=N+1}END{print N}" <$PASSWD`

	# # ###### Scheme #2 (works with BSD systems)
	#nusers=`last | sed '/^wtmp begins/d; s/ .*//; /^$/d' | sort -u | wc -l`

	# # ###### Scheme #3 (works with USG systems)
	#nusers=`who /etc/wtmp | sort -u +0 -1 | wc -l`

fi

################################
#
# Set up awk scripts;  these are too large to pass as arguments on most
# systems.
#

# This awk script generates the actual output report.
# We use 'sed' to substitute in the shell variables to save ourselves
# endless hassle trying to find quoting/backslashing problems.
#
# The input to this script consists of two types of lines (pre-sorted):
#
#	(1) Active-file lines.  These have four fields:  newsgroup name,
#	    last article number, first existing article, 'y' or 'n'
#	    to allow/disallow posting.
#			mod.mac 00001 00001 y
#
#	(2) .newsrc-derived lines.  These have three fields:  the newsgroup
#	    name, the user name and the articles-read information.  The latter
#	    can be arbitrarily complex.  It can also be arbitrarily long;
#	    this can potentially break either awk or sed, in which
#	    case the script will not work.
#			mod.map joe 1-199
#
#	The script uses the type 1 lines to define the newsgroups
#	and their active article ranges.  The .newsrc (type 2) lines are
#	then used to deduce which users are reading that group (a group
#	is being read if the last article seen is in that group's active
#	article range).
#
sed "/^#/d
     s/HOSTNAME/$hostname/g
     s/DATE/$dat/g" > $FORMAT << 'DOG'
# makereport -- utility for "arbitron". Early versions were copied from a
# similar script distributed with "subscribers.sh" by Blonder, McCreery, and
# Herron.
#
	BEGIN	{ rdrcount = 0 ; reader = "" ; grpcount = 0 ; realusers = 0}
#
# Active file line:  dispose of previous group (if any), record group, and
# record first and last article numbers.  Set group's reader count to none.
	NF == 4 { if (grpname != "") {
			printf("%d %s\n",grpcount, grpname)
		  }
		  grpname = $1
		  grpfirst = $3
		  grplast = $2
		  grpcount = 0
		}
#
# .newsrc line.  Break out the final number, which is the last article that
# has actually been read.  This is a pretty good indicator of the person's
# true interest in the group.  If 'lastread' for the group is a current
# (unexpired) article, record a reader for that group.  Finally, record
# the user as a "real" user of the news system.
#
	NF == 3 { if ($1 != grpname) next;
		  n1 = split($3, n2, "-")
		  n3 = split(n2[n1], n4, ",")
		  lastread = n4[n3]
		  if ((grpfirst != grplast) && (lastread >= grpfirst) && (lastread <= grplast)) {
			grpcount++
			if (realuser[$2] != 1) {
			    realuser[$2] = 1
			    realusers++
			}
		  }
		}
#
# End of file.  Print the report in 2 columns.
	END	{ printf("9999 Host\t\t%s\n","HOSTNAME")
		  printf("9998 Users\t\t%s\n","NUSERS")
		  printf("9997 NetReaders\t%d\n",realusers)
		  printf("9996 ReportDate\t%s\n","DATE")
		  printf("9995 SystemType\tnews-arbitron-2.4\n")
# For reorganized network, report a group even if nobody reads it. This will
# help us keep track of where the groups propagate.
		  printf("%d %s\n",grpcount, grpname)
		}
DOG

cat >$PWMUNGE <<'MOUSE'
BEGIN	{ seen["/"]=1; seen[""] = 1; }
	{ if (seen[$6]!=1) {
		printf("if [ -r %s/.nn/rc ] ; then ", $6)
		printf("sed -n '/^+/s/^. \([0-9]*\) \(.*\)/\2 %s \1/p'",$1)
		printf(" <%s/.nn/rc;\n",$6)
		printf("elif [ -r %s/News/.newsrc ] ; then ", $6)
		printf("sed -n '/: [0-9]/s/:/ %s/p' <%s/News/.newsrc\n",$1,$6)
		printf("elif [ -r %s/.newsrc ] ; then ", $6)
		printf("sed -n '/: [0-9]/s/:/ %s/p' <%s/.newsrc; fi\n",$1,$6)
		seen[$6]=1;
	  }
}
MOUSE

# First, make sure we have an active file
if [ -z "$NEWSHOST" ]
then ACTIVE=$NEWS/active
else ACTIVE=$TMPDIR/'#'arb.active.$$
     rcp $NEWSHOST:$NEWS/active $ACTIVE
fi

if [ ! -s $ACTIVE ]
then
    echo arbitron: ACTIVE file missing or empty. Cannot continue. >&2
    (exit 1); exit
fi

# Next, get the contents of .newsrc files with duplicates and unreadable files
# removed.
if [ -n "${OTHERS+set}" ]
then
	> $GLEAMED
	for host in ${OTHERS-localhost}
	do
		( cat <<- EOF
			# copy over the PWMUNGE script
			cat > $PWMUNGE << 'MOUSE'
			`cat $PWMUNGE`
			MOUSE

			# get passwd information
			(ypcat passwd || cat /etc/passwd) 2>/dev/null |
			tee $PASSWD |

			# extract a list of usernames
			awk -F: '\$3>=$lowUID && \$3<=$highUID{print"user",\$1}'

			# find and extract .newsrc info
			awk -F: -f $PWMUNGE <$PASSWD |
			if ${UNIQUE-true}
			then
				sh
			else
				sh |
				awk '{print \$1,"'\`hostname\`':"\$2,\$3 }'
			fi

			rm $PASSWD $PWMUNGE
		EOF
		) | rsh $host /bin/sh >> $GLEAMED
	done
	nusers=`grep '^user ' $GLEAMED |
		if ${UNIQUE-true}
		then
			sort -u
		else
			sort
		fi | wc -l | tr -d ' '`
else
	awk -F: -f $PWMUNGE <$PASSWD | sh >$GLEAMED
fi


# Check to make sure that we found some
if [ -s $GLEAMED ]
then # See if "active" file has 4 fields or only two (pre-2.10.2)
     set `sed 1q < $ACTIVE`
     if [ $# -eq 2 ]
     then egrep  '^[a-z][-0-9_a-z]*\.' $ACTIVE |
	  while read group last
	  do dir=`echo "$group" | sed 's;\.;/;g'`
	     first=`ls $SPOOL/$dir | grep '^[0-9][0-9]*$' | sort -n | sed 1q`
	     case $STYPE in
		usg) echo "$group $last ${first:-$last} X";;
		  *) echo "$group $last ${first-$last} X"
	     esac
	  done
     else egrep '^[a-z][-0-9_a-z]*\.' $ACTIVE
     fi |
     sort - $GLEAMED |
     awk -f $FORMAT |
     sort -nr |
     sed "/^$/d
          s/NUSERS/$nusers/g
	  s/^999[0-9] //" |
     $destination
else echo arbitron: Unable to find any readable .newsrc files >&2
     (exit 1); exit
fi
-- 
John Robert LoVerso			Xylogics, Inc.  617/272-8140
loverso@Xylogics.COM			Annex Terminal Server Development Group