[alt.sources] arbitron.pl

pat@kla (02/11/90)

Here is a version of the usenet statistics gathering `arbitron'
program written in perl.  The advantages are listed in the comments at
the beginning of the file.  It should be run on a machine with read
access to all user's home directories (for access to their .newsrc
files.)

I consider this a Beta version, submitted for testing against a
variety configurations.  Most of this program should be compatible
with perl 2.18, but for full functionality, perl 3.0 is required (for
socket and getpwent() support).

For reference, we are running a mix of Sun hardware (Sun3,
Sun4, Sun386i) and system versions (3.5, LeftCoast 4.0.3, RightCoast
4.0.1).  We receive our news via uucp and maintain it under B-news
2.11 with relatively short expiration times.  News is distributed
within our LAN via NNTP & rrn.

------------------------- Cut Here -------------------------------------------
#! /bin/perl
$VERSION = "pl-$1"  if  ( '$Revision: 1.6 $' =~ /([0-9]*\.[0-9]*) \$/ ) ;
#
#
# arbitron.pl -- this program produces rating sweeps for USENET
#
#
# Usage: arbitron.pl
#    or: perl arbitron.pl
#
# This program is intended as a replacement for the sh/sed/awk/... based
# script in common use.  This version offers the following advantages:
#
#	1)  Greater portability.
#	    (Perl runs on a lot more systems than you can find compatible
#	    versions of Bourne shell, sed, awk, etc.)
#	2)  More accurate reporting if the local expiration rate is very high.
#	    The shell script compared .newsrc contents against the active
#	    file.  This version keeps an optional arb.last file to compare
#	    against.
#	3)  No temporary files.
#	4)  Can be run on systems which do not support multi-tasking.
#	5)  If your mailer allows the specification of the subject as
#	    a command line parameter, this program will generate a
#	    subject line of the form `arbitron data for MonYYYY'.
#	6)  Direct connection to NNTP socket in distributed environment.
#	    (Perl 3.0 is required for this option.)
#	7)  Much easier to read. (Unless you are an awk & sed wizard :-)
#
# Disadvantages:
#	1)  This version is slower than the Bourne shell version.
#	    (about 50% on a Sun3/60 running SunOS 3.5)
#
# Other differences:
#	1)  Obtaining count of valid users:
#
#	    The script defaults to accepting all users within the specified
#	    range of userids; but provides optional mechanisms for counting
#	    actual userids stored in the wtmp(5) file.
#
#	    This program provides a generalized filter capability which can
#	    easily check the value of any of the fields from the password
#	    file.  It was felt that this is a more useful mechanism in
#	    a networked environment.
#
#
# To use this program, edit the "configuration" section below so that the
# information is correct for your site, and then run it. It will produce a
# readership survey for your machine and mail that survey to decwrl.dec.com,
# with a cc to you.
#
# To participate in the international monthly ratings sweeps, run "arbitron"
# every month.  The statistics program is run on the first day of each month;
# it will include any report that has reached it by that time. To make sure 
# your site's data is included, run the survey program no later than the 20th
# day of each month.
#
# ---------------------------------------------------------------------------
# $Log:	arbitron.pl,v $
# Revision 1.6  90/02/10  18:53:02  pat
# Added support for direct NNTP access of active group list.
# Extended comments.
# Removed several unnecessary variables.
# Added debugging flags $noSave and $noMail.
# 
# Revision 1.5  90/01/11  20:27:32  pat
# Added get_user_info() subroutine to allow the use of getpwent() with
#     perl3.0, but retain compatibility with perl2.18.  Also allowed
#     cleanup of code at beginning of user processing loop.
# 
# Revision 1.4  90/01/11  19:01:40  pat
# Changed subroutine good_user and the loop that calls it to expect parameters
#    as supplied by the perl-3.0 getpwent() function.
# 
# Revision 1.3  90/01/11  17:26:41  pat
# Various clarifications in the comments.
# Moved date retrieval lines to just under initial comment cluster to obtain
#     better grouping of the sections which are most likely to be customized.
# Changed default summary path to check for LOGNAME, then USER, to determine
#     the local address to mail the report to.  This is to accomidate
#     alternate user names created purely for cron-activated jobs, etc.
# Removed intermediate variable used in determining host name.
# Changed generation of $destination to make it easier to remove/modify the
#     subject parameters.
# Explicitly look for /usr/ucb/mail when building $destination, instead of
#     relying upon path information.
# 
# Revision 1.2  90/01/10  09:59:15  pat
# Fixed automatic extraction of $VERSION from RCS revision information.
# Added RCS update log.
# 
# ---------------------------------------------------------------------------
# Debugging flags:
#	Set $noSave to prevent writing the .arblast file.
#	Set $noMail to write the report to stdout instead of mailing it.

$noSave	= ;
$noMail = ;

# ---------------------------------------------------------------------------
#
# Find out the date.  Sweeps are reported as MonYYYY.

@monthnames = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
	       "Jul", "Aug", "Sep", "Oct", "Nov", "Dec") ;

@tim = localtime(time) ;
$dat = $monthnames[$tim[4]] . $tim[5] ;

# ===========================================================================
# Configuration information. Edit this section to reflect your site data.
#
#	$ARBLAST is the file to which the `last-reported' information
#		is to be stored.
#	$ACTIVE is the path to the news system's active file.  It is only
#		used if $NNTPserver is not set.
#	$SORT_OUTPUT should be set if you want the report sorted by 
#		readership.  Setting it to null will improve performance.
#	$NNTPserver is the hostname or internet address of the local
#		NNTP server.  Direct NNTP connection requires a version
#		of perl which supports sockets.  If your version does
#		not support sockets, or you do not use NNTP, you may
#		simply set $NNTPserver to ''.
#	$NNTPport is the port number to use to connect to your NNTP server.
#		The standard allocation is 119.  Check your /etc/services
#		file or equivalent.
#	$Me is the mail address of the local user to whom a copy of the
#		report will be mailed.  It may be null.
#	$summarypath is a list of addresses to which a copy of the report
#		will be mailed.  It should be in the form expected by
#		the mailer.  To have your statistics included in the
#		published network-wide reports, this list should include
#		`netservey@decwrl.dec.com' (or the uucp equivalent
#		{sun, hplabs, pyramid, decvax, ucbvax}!decwrl!netsurvey )
#	$lowUID and $highUID delimit the range of /etc/passwd UID's which
#		represent actual people (rather than maintenance accounts,
#		daemons, or whatever.)
#

$ARBLAST = "/usr/lib/news/arb.last" ;
$ACTIVE  = '/usr/lib/news/active' ;

$SORT_OUTPUT = 1 ;		# Set to null if you don't want to sort

$SERVERfile = '/vol/local/lib/rn/server' ;

$NNTPserver = $ENV{'NNTPSERVER'} ;
if  ((! $NNTPserver) && (open  (server, "<$SERVERfile"))) {
    chop  ($NNTPserver = <server>) ;
    close (server) ;
} ;
if  (! $NNTPserver)	{ $NNTPserver = 'mailhost' ; } ;

$NNTPport = 119 ;


$Me = ( $ENV{'LOGNAME'}  ||  $ENV{'USER'}  ||  'news' ) ;
$summarypath	= "netsurvey@decwrl.dec.com $Me" ;

$lowUID	 = 100 ;
$highUID = 9999 ;


# This subroutine is passed an array consisting of a single entry from
# the password file. It should return true if that entry represents a
# valid user (to check for news usage), false otherwise.
# (Return value is value of last expression.)
#
# You may make this as complex as you deem appropriate.  The default will
# reject any user with the password '*'; or with a uid less than $lowUID,
# or greater than $highUID; or with no home directory.
# The commented out lines show how to also reject specific users by name.

sub good_user {
    local ($name, $passwd,  $uid,
    	   $gid,  $quota,   $comment,
	   $gcos, $homedir, $shell) = @_ ;
   
    return ( ($passwd ne '*')
	     &&  ($uid >= $lowUID)  &&  ($uid <= $highUID)
	     &&  ($homedir ne '')
	     &&  ($name ne "realtime")  &&  ($name ne "std_user")
	     &&  ($name ne "eglbnch1")
	   ) ;
}


# This is only used in the hostname stuff below.

# We need to find the uucp name of your host. If this code doesn't work,
# then just put it in literally like this:
#	$hostname="kla"
#
# `uuname -n' doesn't work on bsd systems.  `hostname' may not work on
# sysV or non-unix systems.  For bsd, the hostname output is preferred
# because `uuname -l' will truncate to 7 chars.  I'm not sure why it
# is not the preferred mechanism on sysV; but this is the sequence used
# in the script version.

if  ( -d '/usr/ucb' )	{ chop ($hostname = `hostname`)  ; }
else			{ chop ($hostname = `uuname -n`) ; } ;
	
chop ($hostname = `uuname -l`)	unless ($hostname) ;
chop ($hostname = `hostname`)	unless ($hostname) ;


# The name of the base password file.  This will not be referenced if
# passwords can be obtained via getpwent or `ypcat passwd'.
$PASSWD	= '/etc/passwd' ;


# ---------------------------------------------------------------------------
# If your mailer can accept a subject on the command line, set $Subject
# to the appropriate sequence, otherwise comment out the assignment below.
# The default below works with /usr/ucb/mail on bsd derived systems.

$Subject = sprintf ('-s "arbitron data for %s"', $dat) ;

# If you are running on a system which does not support the creation
# of a mailer subtask to which the report can be piped, replace
# the $destination value with the name of a file which can be mailed
# later.

$Mailer = '/usr/ucb/mail'	if  ( -x '/usr/ucb/mail' ) ;
$Mailer = 'mail'		unless ($Mailer) ;
$Mailer = $ENV{'MAILER'}  ||  $Mailer ;


$destination = ($noMail ? ">-" : "| $Mailer $Subject $summarypath") ;

# End portion most likely to require site customization.
# ===========================================================================
#
# Determine what newsgroups are active
#

if  ($NNTPserver)	{
    local ($sockaddr, $name, $aliases, $proto,
 	   $type, $len, $thisaddr, $thataddr, $this, $that) ;

    # If your perl does not support sockets, comment out the following
    # code and make sure that $NNTPserver is set to '' above.

    # Note: You may have to run makelib on sys/socket.h to create a
    # perl version.  Makelib should be in your perl distribution.

    do 'sys/socket.h' || die "Can't do sys/socket.h: $@" ;

    $sockaddr = 'S n a4 x8';
    chop ($hostname = `hostname`);

    ($name, $aliases, $proto)     = getprotobyname('tcp') ;
    ($name, $aliases, $NNTPport)  = getservbyname ($NNTPport, 'tcp')
	    unless $NNTPport =~ /^\d+$/ ;
    ($name, $aliases, $type, $len, $thisaddr) = gethostbyname ($hostname) ;
    ($name, $aliases, $type, $len, $thataddr) = gethostbyname ($NNTPserver) ;

    $this = pack ($sockaddr, &AF_INET,         0, $thisaddr) ;
    $that = pack ($sockaddr, &AF_INET, $NNTPport, $thataddr) ;

    socket  (active, &PF_INET, &SOCK_STREAM, $proto) || die "socket: $!" ;
    bind    (active, $this)			     || die "bind: $!" ;
    connect (active, $that)			     || die "connect: $!" ;

    select (active) ; $| = 1 ; select (stdout) ;

    $_ = <active> ;
    die $_	unless m/^2/ ;		# Extremely primitive result check.

    print active "LIST\r\n" ;		# Request list of newsgroups.

    $_ = <active> ;
    die $_	unless m/^215/ ;	# Extremely primitive result check.

    # End of socket-dependant stuff.
} else {
	open (active, "$NEWS/active")
	|| die "Could not open active file $NEWS/active" ;
} ;

sub get_active_line {
    local ($line) ;

    $line = <active> ;

    if  ($NNTPserver && ($line eq ".\r\n"))  {
	print active "QUIT\r\n" ;
	$line = '' ;
    } ;
    
    return ($line) ;
}

for  ($nactive = 0 ; $_ = do get_active_line() ; ) {
	if  (/(\S*)\s*([0-9]*)\s*([0-9]*)/) {
		$1 =~ tr/A-Z/a-z/ ;		# Canonicalize to lower case
		$last{$1}	= int ($2) ;
		$first{$1}	= int ($3) ;
		$readers{$1}	= 0 ;
		$nactive++ ;
	}
}

if  ($nactive <= 0)	{ die "No active newsgroups" ; }

if  (open (arbLast, $ARBLAST)) {
    while ( <arbLast> ) {
	if  ( ($groupname,$limit) = /(\S*)\s([0-9]*)/ ) {
    	    $first{$groupname} = int($limit) + 1 ;
        }
    }

    close (arbLast) ;
}


# ---------------------------------------------------------------------------
# Grovel through the password file, counting users and processing
# .newsrc files.

$nusers = 0 ;		# Total number of users
$newsusers = 0 ;	# Number of users who read news.


# Return information about the current user.  This routine is entirely to
# provide getpwent() emulation for older versions of perl.
#
sub get_user_info {
    # Use this line if your perl has getpwent(); otherwise comment it out
    return getpwent() ;

    # Un-comment the following lines if your perl does not have getpwent().
#   local ($name,$passwd,$uid,$gid,$gcos,$homedir,$shell) ;
#	
#
#   if  ( ! $PasswdIsOpen) {
#	open (passwd, 'ypcat passwd |')	|| open (passwd, $PASSWD) ;
#	$PasswdIsOpen = 1 ;
#   }
#
#   return () if eof (passwd) ;
#	
#   ($name,$passwd,$uid,$gid,$gcos,$homedir,$shell)
#	= (<passwd> =~ /(.*):(.*):(.*):(.*):(.*):(.*):(.*)\n/ ) ;
#	
#   return ($name,$passwd,$uid,$gid,'','',$gcos,$homedir,$shell) ;
}



# Grovel through the user list, eliminating those which do not appear
# to be `real users'. If  a `real user' has a .newsrc file, process it.
while (( @UserInfo = do get_user_info()) && ($#UserInfo > 0))
{
    if (do good_user (@UserInfo))
    {
	$nusers++ ;		# Increment the count of valid users.

	if  (open (newsrc, $UserInfo[7] . "/.newsrc")) {
	    local ($groups_read, $group_name, $last_read) ;
	    $groups_read = 0 ;
	    
	    while ( <newsrc> ) {
	        if  ( /(\S*)[:!].*[,-]([0-9]+)$/ ) {
	    	    ($group_name = $1) =~ tr/A-Z/a-z/ ;  # Make lower case
	    	    $last_read = int ($2) ;

	    	    if  ( ($last_read >= $first{$group_name})	&&
			  ($last_read <= $last{$group_name}) )
		    {
		        $groups_read++ ;
			$readers{$group_name}++ ;
		    }
	        }
	    }

	   $newsusers++		if  ($groups_read) ;
	}
    }
}


# Now build the report

open (Report, "$destination")	|| die  "Could not open mailer" ;


$noSave  ||  open (arbLast, ">$ARBLAST")  ||  die  "Could not open arb.last" ;



print Report "Host\t\t$hostname\n" ;
print Report "Users\t\t$nusers\n" ;
print Report "NetReaders\t$newsusers\n" ;
print Report "ReportDate\t$dat\n" ;
print Report "SystemType\tnews-arbitron-$VERSION\n" ;


if  ($SORT_OUTPUT) {
    sub by_readership {
	local ($val) ;

	$val = $readers{$b} - $readers{$a} ;
	return ( ($val != 0) ? $val : (($b le $a) ? -1 : 1) ) ;
    }


    @allgroups = keys (readers) ;
    foreach $groupname (sort by_readership @allgroups) {
	 print Report  $readers{$groupname} . " $groupname\n" ;
	 print arbLast "$groupname " . $last{$groupname} . "\n"
	     unless $noSave ;
    }
} else {
    while (($groupname, $Readers) = each (readers)) {
	print Report  "$Readers $groupname\n" ;
	print arbLast "$groupname " . $last{$groupname} . "\n"
	    unless $noSave ;
    }
}

exit ;
Copyright (c) 1989 PM Lashley under the terms of the GNU General Public License
PMLashley	...{sun | megatest | sts | zygot}!cohesive!kla!pat
<<< I haven't lost my mind.  It's backed up on tape somewhere... >>>