[news.admin] Arbitron "Users" count is very wrong under BSD

dave@onfcanim.UUCP (Dave Martindale) (01/18/88)

This morning, arbitron ran automatically and its output was mailed to me.
I was very surprised to find that we have 27 active users on this machine.

On BSD unix, arbitron generally uses the pipeline

  nusers=`last | sort -u +0 -1 | wc -l`

to count the number of users who have logged in this month, in order to
ignore dormant users in the passwd file.  Running the "last | sort"
component of the pipeline by hand did indeed produce 27 lines of output,
which consisted of:

  13 living, breathing users
  10 uucp logins
   2 dummy entries "reboot" and "shutdown"
   2 lines due to the printf("\nwtmp begins %s\n", ctime(whatever))
     that last does after printing all data in the file

So, 13 out of 27 reported users were real.  I expect figures to be similarly
inflated on most BSD systems.

I've replaced the pipeline above with:

  nusers=`last | sed -e '/^$/d' -e '/^wtmp begins/d' -e '/ ~ /d' -e '/^U/d' \
	| sort -u +0 -1 | wc -l`

Since it depends on the fact that our uucp logins all begin with 'U', you
would have to edit it as appropriate for your username conventions.
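
A minimal sketch of the filtered pipeline above, wrapped as a shell function so it can be tried on saved `last` output. The function name count_real_users and the UUCP_PREFIX variable are made up for illustration; the "U" default is this site's convention only. It extracts the first field with awk instead of using the old `+0 -1` sort key syntax, which some current sort implementations may no longer accept.

```shell
#!/bin/sh
# Count distinct login names in `last`-style output read from stdin.
# UUCP_PREFIX is an assumption: on this site every uucp login begins with "U".
UUCP_PREFIX=${UUCP_PREFIX:-U}

# Drops blank lines, the "wtmp begins" trailer, the reboot/shutdown
# pseudo-entries (their tty column is "~"), and uucp logins, then counts
# the unique names left in the first column.
count_real_users() {
    sed -e '/^$/d' \
        -e '/^wtmp begins/d' \
        -e '/ ~ /d' \
        -e "/^${UUCP_PREFIX}/d" |
    awk '{ print $1 }' |
    sort -u |
    wc -l |
    tr -d ' '
}

# Typical use inside arbitron:
#   nusers=`last | count_real_users`
```

The trailing tr is there only because some wc implementations pad the count with blanks.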

The pipeline used for USG systems probably needs patching in a similar manner.

If everyone fixed this, I wonder how much the "total users" count
would go down net-wide?

	Dave Martindale
	{musocs,watmath}!onfcanim!dave

rsk@s.cc.purdue.edu (Rock Wombat) (01/20/88)

In article <15538@onfcanim.UUCP> dave@onfcanim.UUCP (Dave Martindale) writes:
>On BSD unix, arbitron generally uses the pipeline
>  nusers=`last | sort -u +0 -1 | wc -l`
>to count the number of users who have logged in this month, in order to
>ignore dormant users in the passwd file.

This can also result in a count that is too low.  On our systems, we zero
the utmp and wtmp files every 24 hours, so this pipeline would only reveal
those users who have logged in since the last zap.  I'm not sure that there's
a good solution to this problem; is there a general way to answer the question
"How many active users does this machine have?" that avoids this difficulty?

-- 
Rich Kulawiec, rsk@s.cc.purdue.edu, s.cc.purdue.edu!rsk
PUCC Unix Staff

reid@decwrl.dec.com (Brian Reid) (01/20/88)

Counting the number of users on the system is probably the most error-prone
of all the measurements. The various counting procedures are extremely
sensitive to local administrative policies. This is the main reason why the
script mails a copy of the results to the system administrator as well as to
me. I've never believed them to be more accurate than maybe to within half or
double the posted amount, and the comments attached to each month's Arbitron
report say as much.

page@swan.ulowell.edu (Bob Page) (01/20/88)

YaBut...

# @(#)arbitron	2.4.2	06/05/87

# Range of /etc/passwd UID's that represent actual people (rather than
# maintenance accounts or daemons or whatever)
lowUID=90
highUID=20000

# # ###### Scheme #1: fast but usually returns too big a number
nusers=`awk -F: "BEGIN {N=0}\\$3>=$lowUID && \\$3<=$highUID{N=N+1}END{print N}" </etc/passwd`

Maybe arbitron needs to combine 1 & 2.  Do a 'last' to get the active
users, then filter out daemons etc.  Or awk the passwd file then
prune it based on active users (in 'last').

No, I'm not volunteering for anything.
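
For what it's worth, a rough sketch of the "combine 1 & 2" idea: take the names whose UID falls in the lowUID..highUID range, then keep only those that also show up in `last` output. This is not part of the distributed arbitron; the function name active_in_range and the two-file calling convention are assumptions made so the sketch can be run on saved copies of the data.

```shell
#!/bin/sh
# Usage: active_in_range PASSWD_FILE LAST_OUTPUT_FILE
# Prints how many accounts have a UID in [lowUID, highUID] AND appear
# as a login name in the saved `last` output.
lowUID=${lowUID:-90}
highUID=${highUID:-20000}

active_in_range() {
    awk -v lo="$lowUID" -v hi="$highUID" '
        # First file (passwd format): split on ":" by hand, because the
        # second file is blank-separated and uses the default FS.
        NR == FNR {
            split($0, f, ":")
            if (f[3] + 0 >= lo && f[3] + 0 <= hi)
                real[f[1]] = 1
            next
        }
        # Second file (last output): count each in-range name once.
        ($1 in real) && !seen[$1]++ { n++ }
        END { print n + 0 }
    ' "$1" "$2"
}

# Typical use:
#   last > /tmp/last.out
#   nusers=`active_in_range /etc/passwd /tmp/last.out`
```

The NR == FNR test assumes the passwd file is not empty. Sites that zero wtmp daily will still see a low count, as noted earlier in the thread.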

..Bob
-- 
Bob Page, U of Lowell CS Dept.  page@swan.ulowell.edu  ulowell!page
"I don't know such stuff.  I just do eyes."  -- from 'Blade Runner'

jerry@oliveb.olivetti.com (Jerry Aguirre) (01/21/88)

In article <1978@s.cc.purdue.edu> rsk@s.cc.purdue.edu.UUCP (Rock Wombat) writes:
>This can also result in a count that is too low.  On our systems, we zero
>the utmp and wtmp files every 24 hours, so this pipeline would only reveal
>those users who have logged in since the last zap.  I'm not sure that there's
>a good solution to this problem; is there a general way to answer the question
>"How many active users does this machine have?" that avoids this difficulty?

How about extracting home directories from the password file and then
checking to see if they have a .login or .profile that has been accessed
in the measurement interval?  This will filter out inactive users and
daemon logins (UUCP).
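
A sketch of that check, assuming the home directory is the sixth colon-separated field of the password file and that the filesystem actually maintains access times (backup programs that read the files, or mounts that suppress atime updates, will skew the result). The function name and the 30-day default are illustrative only.

```shell
#!/bin/sh
# Usage: count_by_dotfile_access PASSWD_FILE [DAYS]
# Counts home directories whose .login or .profile was accessed within
# the last DAYS days (default 30).
count_by_dotfile_access() {
    passwd_file=$1
    days=${2:-30}
    # Field 6 of each passwd entry is the home directory; sort -u so that
    # accounts sharing one home directory are counted only once.
    awk -F: '$6 != "" { print $6 }' "$passwd_file" | sort -u |
    while IFS= read -r home; do
        for f in "$home/.login" "$home/.profile"; do
            # find prints the name only if the access time is recent enough.
            if [ -f "$f" ] &&
               [ -n "$(find "$f" -atime "-$days" -print 2>/dev/null)" ]; then
                echo "$home"
                break
            fi
        done
    done | wc -l | tr -d ' '
}

# Typical use:
#   nusers=`count_by_dotfile_access /etc/passwd 30`
```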

Of course users with multiple accounts will still show up as more than
one person but I see no automatic way to avoid that.  That problem is
even worse here because I send in reports for 5 systems.  Many of the
users have secondary accounts on other systems so they get counted
multiple times.  What I really need is a way to merge the data from the
5 systems and submit a single report with duplicates merged.  I am sure
that other sites using NNTP for reading have the same problem.  It is
common at some sites for a new user to automatically get an account on
every system.

					Jerry Aguirre @ Olivetti ATC

wundt@wundt.psy.vu.nl (Wundt Administrator) (01/21/88)

There are several suggestions given other than last | sort
to count the number of users.

One of these requires that "active" users be grouped within
a contiguous range of uid numbers, but you can sort.

See arbitron.sh (the program as distributed) for all the suggestions
given.

The first example I read about (complaining) said that only
13 of the 27 users were real. If uucp daemons, etc. are kept
in the range 0-20 or 0-100 (then root doesn't get counted either),
this could solve one problem. Further, users who are not
allowed to read news could be given very high UIDs, e.g.
greater than 10,000 (or 1000).

The "problem" with this suggestion is that it requires the password
file to be "manageable". If "process/daemon users" are spread throughout
it, the count probably won't be fixed.

(I'd include the text here, but if you are really interested,
read the distribution next time around, if you haven't got it on hand.)

michael felt

matt@oddjob.UChicago.EDU ("Don't even know my real name!") (01/22/88)

Rock Wombat writes:

) is there a general way to answer the question
) "How many active users does this machine have?" that avoids this difficulty?

How ubiquitous and uniform is /usr/adm/lastlog?

			Matt

rsalz@bbn.com (Rich Salz) (01/22/88)

-In article rsk@s.cc.purdue.edu.UUCP (Rock Wombat) writes:
-...is there a general way to answer the question
-"How many active users does this machine have?"
Not really, short of hardcoding a number into your arbitron script;
which you might not want to rule out-of-hand...

In news.admin, jerry@oliveb.UUCP (Jerry Aguirre) writes:
>multiple times.  What I really need is a way to merge the data from the
>5 systems and submit a single report with duplicates merged.

Here's a script that basically does it; feed it a bunch of reports.
#! /bin/sh
##  arb-merge.  Read set of arbitron reports and merge them.
##  This needs to be made portable and configurable the way the real arbitron
##  stuff is, but for now...  it works.
HOST=`hostname`
DATE=`date | sed -e 's/....\(...\).*19\(..\)/\119\2/'`

cat "$@" | awk '\
BEGIN	{
    # Are we ignoring the current system (e.g., duplicate)?
    Ignore = 1
    # Total number of users and readers for all systems.
    Users = 0
    NetReaders = 0
    # List of systems whose reports we have processed.
    SysCount = 1
    SysList[0] = "--ERR--"
    # List of newsgroup names and count thereof.
    GroupCount = 1
    GroupName[0] = "--ERR--"
    # Associative array of number of readers, indexed by group name.
    GroupReaders[0] = "--ERR--"
    # Associative array of "seen this newsgroup?", indexed by group name.
    HaveGroup["--ERR--"] = "no"
}

$1 == "Host" {
    # We assume there are not lots of hosts, so no associative array.
    Ignore = 0
    for (i = 1; i < SysCount; i++)
	if (SysList[i] == $2)
	    Ignore = 1
    if (Ignore == 0) {
	SysList[SysCount] = $2
	SysCount++
    }
}

$1 == "Users" {
    if (Ignore == 0)
	Users += $2
}

$1 == "NetReaders" {
    if (Ignore == 0)
	NetReaders += $2
}

$1 == "ReportDate" {
    # We could (should?) check for a bad date here, and ignore reports,
    # but that means we might have to back out of bumping the Users and
    # NetReaders somehow -- not worth it.
}

$1 ~ /[0-9]+/ && $2 ~ /comp\.|misc\.|news\.|rec\.|sci\.|soc\.|talk\./ {
    if (Ignore == 0) {
	if (HaveGroup[$2] != "yes") {
	    # First time we see this group: remember its name and count.
	    HaveGroup[$2] = "yes"
	    GroupName[GroupCount] = $2
	    GroupCount++
	    GroupReaders[$2] = $1
	}
	else
	    GroupReaders[$2] += $1
    }
}

END {
    printf "99999 Host\t\tHOST\n"
    printf "99998 Users\t\t%d\n", Users
    printf "99997 NetReaders\t%d\n", NetReaders
    printf "99996 ReportDate\tDATE\n"
    printf "99995 SystemType\tnews-arbitron-2.4\n"
    for (i = 1; i < SysCount; i++)
	printf "99994 OtherHost\t%s\n", SysList[i]
    for (i = 1; i < GroupCount; i++)
	print GroupReaders[GroupName[i]], GroupName[i]
}' \
    | sort -nr \
    | sed -e "s/HOST/${HOST}/" -e "s/DATE/${DATE}/" -e "s/9999[0-9] //"

We don't use it at BBN yet, but eventually I'll have some daemon on
all major servers run arbitron and mail the results to me or a program,
or have clients just mail me their .newsrc anonymously...
-- 
For comp.sources.unix stuff, mail to sources@uunet.uu.net.

jc@minya.UUCP (01/23/88)

In article <14258@oddjob.UChicago.EDU>, matt@oddjob.UChicago.EDU ("Don't even know my real name!") writes:
> Rock Wombat writes:
> 
> ) is there a general way to answer the question
> ) "How many active users does this machine have?" that avoids this difficulty?
> 
> How ubiquitous and uniform is /usr/adm/lastlog?
> 
Well, here there is no such file.  This machine is used by only two 
users, and we aren't interested in running accounting.  We have much
better uses for the cycles and disk blocks.  With the growth of the
workstation market, we will see more and more machines run like this. 
I've explained to quite a lot of users how to free up space and time 
on their workstations by eliminating all the stuff that makes sense
only on a multi-user system.

Perhaps we need mods to readnews, vnews, rn, and so on that keep a
file of readership history?  Well, actually, we don't; I've determined
the number of readers on several systems by something like:
	find /usr -name ".newsrc" -mtime -10 -print | wc -l
How's that for a cpu-gobbling solution?  Does anyone know an easy
way to produce a list of the home directories of all users?  Maybe
awk could be used to chew up /etc/passwd and spit out the sixth
field of each, "/.newsrc" could be appended to each, the mtime of
each could be tested, and so on.  Now if I only understood awk
well enough to get anything other than "bailing out near line 1"
98% of the time...
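
One possible shape for that awk idea, as a sketch only. It assumes the home directory is the sixth colon-separated field of the password file; the function name count_newsrc_readers is invented here, and the 10-day default simply mirrors the find command above.

```shell
#!/bin/sh
# Usage: count_newsrc_readers PASSWD_FILE [DAYS]
# Counts users whose .newsrc was modified within the last DAYS days,
# looking only in home directories instead of walking all of /usr.
count_newsrc_readers() {
    passwd_file=$1
    days=${2:-10}
    # Build "<home>/.newsrc" from field 6 of every passwd entry.
    awk -F: '$6 != "" { print $6 "/.newsrc" }' "$passwd_file" | sort -u |
    while IFS= read -r rc; do
        # find prints the path only when the mtime test passes.
        if [ -f "$rc" ]; then
            find "$rc" -mtime "-$days" -print
        fi
    done | wc -l | tr -d ' '
}

# Typical use:
#   readers=`count_newsrc_readers /etc/passwd 10`
```

This touches one file per account, which should be much cheaper than a find over the whole /usr tree.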

-- 
John Chambers <{adelie,ima,maynard,mit-eddie}!minya!{jc,root}> (617/484-6393)

dave@lsuc.uucp (David Sherman) (01/25/88)

In article <181@wundt.psy.vu.nl> wundt@psy.vu.nl (Wundt Administrator) writes:
>There are several suggestions given other than last | sort
>to count the number of users.
>
>One of these requires that "active" users be grouped within
>a contiguous range of uid numbers, but you can sort.

That's OK, but a bit magical, and apt to be lost track of
at some point.  Better is to look into the last field of
/etc/passwd, and do something such as assuming everyone
without either a null entry (implying /bin/sh) or /bin/csh
is a non-real user.  That'll take care of uucico processes
and other non-shell users.

Of course, the various names for shells on your site may
vary from being just sh and csh.
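
A sketch of the shell-field test, assuming the login shell is the seventh colon-separated field and that a null entry means /bin/sh. REAL_SHELLS and the function name are hypothetical; edit the list to match the shells at your site.

```shell
#!/bin/sh
# REAL_SHELLS is an assumption: the login shells that mark a "real" user here.
REAL_SHELLS=${REAL_SHELLS:-"/bin/sh /bin/csh"}

# Usage: count_shell_users PASSWD_FILE
# Counts entries whose last field is empty or names one of REAL_SHELLS.
count_shell_users() {
    awk -F: -v shells="$REAL_SHELLS" '
        BEGIN {
            n = split(shells, s, " ")
            for (i = 1; i <= n; i++)
                ok[s[i]] = 1
        }
        # Require at least six fields so blank or damaged lines are skipped.
        NF >= 6 && ($7 == "" || ($7 in ok)) { count++ }
        END { print count + 0 }
    ' "$1"
}

# Typical use:
#   nusers=`count_shell_users /etc/passwd`
```

Note that root and other maintenance accounts with an ordinary shell are still counted, so this may need to be combined with a UID range.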

David Sherman
The Law Society of Upper Canada
Toronto
-- 
{ uunet!mnetor  pyramid!utai  decvax!utcsri  ihnp4!utzoo } !lsuc!dave

rwhite@nusdhub.UUCP (Robert C. White Jr.) (02/10/88)

Hi all,
	you are going about this thing all wrong..... [if I may.]

Each of us has a certain number of entries in our /etc/passwd files
which do not represent valid users, that is obvious,  it is also
relatively constant as a scalar quantity.  While the actual number
of valid users may vary greatly on some systems [schools especially]
the actual number of daemons is about the same on any given system
over any given time.
	Granted, when you add a SNA adapter and associated software,
or a new LAN, or some such, you do tend to add an "owner" or
daemon, but other than that, you get a one-user-one-entry relation.

	Arbitron should have a SYSDEPUSR=## line which contains
the number of system level user entries in the /etc/passwd file.
We then count up the number of system level entries our passwd
file contains when we install arbitron.  Every time thereafter
that arbitron runs, it can simply apply COUNT=`expr ${COUNT} - ${SYSDEPUSR}`
to the normal figuring method and BANG!!!! you get more reasonable numbers.
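
Roughly, and only as a sketch: the value 12 below is a placeholder to be counted by hand at install time, and adjusted_count is a made-up name. The raw count here is just the number of lines in the password file; any of the other counting schemes could be substituted.

```shell
#!/bin/sh
# SYSDEPUSR: number of system-level (non-person) passwd entries.
# The default of 12 is hypothetical; count your own at install time.
SYSDEPUSR=${SYSDEPUSR:-12}

# Usage: adjusted_count PASSWD_FILE
# Prints the number of passwd entries minus SYSDEPUSR, never below zero.
adjusted_count() {
    total=`wc -l < "$1" | tr -d ' '`
    count=`expr "$total" - "$SYSDEPUSR"`
    # Clamp at zero in case SYSDEPUSR has gone stale and exceeds the total.
    if [ "$count" -lt 0 ]; then
        count=0
    fi
    echo "$count"
}

# Typical use:
#   nusers=`adjusted_count /etc/passwd`
```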

	As far as this goes, an install script could even be
manufactured which would implement all those nasty suggestions
once and only once, and then leave the rest of the maintenance
to whoever cares. [i.e.  If I think my arbitron numbers are
all wrong, I simply rerun the install script, and answer a
few simple questions like ignore(y/n) ?]

Rob.

sa@ttidca.TTI.COM (Steve Alter) (02/19/88)

In article <585@nusdhub> rwhite@nusdhub (Robert C. White Jr.) writes:
} Each of us has a certain number of entries in our /etc/passwd files
} which do not represent valid users, that is obvious,  it is also
} relatively constant as a scalar quantity.  While the actual number
} of valid users may vary greatly on some systems [schools especially]
} the actual number of daemons is about the same on any given system
} over any given time.
 
(:-)  (:-)  (:-)  (:-)  (:-)  (:-)  (:-)  (:-)  (:-)  (:-)  (:-)  (:-)
Relatively constant as a scalar quantity?  My foot!  Fully
*two-thirds* of our password-file (which was around 1700 lines
this morning) is occupied by these non-user accounts!  They're
project accounts; they're scattered throughout the file, and
more of them appear every week.  I had to do some major hacking
on that little section of arbitron to get it to come up with a
semi-accurate count of our real users.

Of course, we're definitely the exception. {8-)

Smileys included because this is just a fun (non-informative)
and non-flaming posting, but it's still the truth.
(-:)  (-:)  (-:)  (-:)  (-:)  (-:)  (-:)  (-:)  (-:)  (-:)  (-:)  (-:)

-- Steve Alter
...!{csun,rdlvax,trwrb,psivax}!ttidca!alter  or  alter@tti.com
Citicorp/TTI, Santa Monica CA  (213) 452-9191 x2541