[news.misc] Statistics by article

msb@sq.UUCP (11/06/87)

Greg (greg@maypo.berkeley.edu) writes:
> ... I would like to propose a new statistics-gathering feature for
> rn in addition to the Arbitron ratings.  In this scheme, rn would
> monitor which articles each reader actually reads, and how long each
> reader spends on those articles.

This is an intriguing idea, but there are technical problems.
The basic problem is that you can't tell, just because the image
remains on the screen for a long time, that the reader is really
reading that article.  Similarly, you can't tell, just because they
spend 20 minutes in Rnmail composing a response, that they are really
responding to that article.  (In particular, they could be !'d out,
dealing with something more urgent.)

Some compensation for this could be achieved by finally reporting the
median time rather than the mean, or perhaps some other percentile if
(as I suspect) the median proves to be 0 for almost all articles.
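
To make the mean/median point concrete, here is a minimal Python sketch. The viewing times are invented for illustration, and the function name summarize and the nearest-rank percentile rule are assumptions, not anything rn does.

```python
import statistics

def summarize(times_s, pct=90):
    """Return (mean, median, pct-th percentile) of viewing times in seconds."""
    ordered = sorted(times_s)
    # Nearest-rank percentile: ceil(pct * n / 100), clamped to at least 1.
    rank = max(1, -(-pct * len(ordered) // 100))
    return statistics.mean(ordered), statistics.median(ordered), ordered[rank - 1]

# Hypothetical data: six readers skip the article, three read it,
# and one is !'d out for two hours with the article on the screen.
times = [0, 0, 0, 0, 0, 0, 45, 60, 90, 7200]
mean, median, p90 = summarize(times)
print(mean, median, p90)
```

With this made-up data the single two-hour outlier drags the mean far above anyone's real reading time, the median is 0 because most readers skipped the article, and a higher percentile lands on an actual reader.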

In addition, readers may not want such monitoring.  I don't think *I*
would, and I make no bones about the fact that I spend a lot of time
reading news.  My reaction to this might well be to stop using rn
and spend 20 minutes writing a Very Simple News Reading Program.

Besides that, admins might not care for the extra system load...
if I was in that position, I'd be inclined to turn it off, just like
other accounting logs.

I've fixed the vague Subject line of the article, but left it crossposted
to news.misc and news.admin, because I thought it was appropriate.
Followuppers beware.

Mark Brader		On our campus the UNIX system has proved to be not
SoftQuad Inc.		only an effective software tool, but an agent of
Toronto			technical and social change within the University.
utzoo!sq!msb, msb@sq.com					John Lions

chuq@plaid.Sun.COM (Chuq Von Rospach) (11/09/87)

>> In this scheme, rn would
>> monitor which articles each reader actually reads, and how long each
>> reader spends on those articles.

>This is an intriguing idea, but there are technical problems.

Erik Fair wrote an article a while back about this for Login:. The concept
was called an Accolade (design by Erik, cute name by me). The whole idea was
that you could keep track of what other people read, and only read those
articles that other folks (whose reading taste you trust) felt were worth
reading. Sort of like a net-wide kill file at a very high level of
abstraction.

One minor problem. If everyone starts sending out Accolade messages on
everything they read, what do you think will happen to network traffic? 
Even if you limit it to sending one message per rn session and sending it to
a single point (a la Arbitron) instead of broadcasting it to the net, the
amount of data being slogged around is astounding. If you figure 7,000
people read usenet once a day (very low numbers! VERY low numbers) and the
package is 1,000 bytes, the receiving end needs to handle seven megabytes of
data a day. And these numbers are ridiculously small -- you could triple
them and still not be realistic.
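
The arithmetic can be checked with a few lines of Python. The function name daily_bytes is made up for this sketch, and "triple them" is read here as tripling both the reader count and the package size, which is only one possible reading.

```python
def daily_bytes(readers, sessions_per_day, bytes_per_report):
    """Bytes arriving at the single collection point per day."""
    return readers * sessions_per_day * bytes_per_report

# Figures from the post: 7,000 readers, one session a day, 1,000-byte package.
base = daily_bytes(7_000, 1, 1_000)

# Assumption: "triple them" means tripling both figures.
tripled = daily_bytes(21_000, 1, 3_000)

print(base, tripled)
```

The first figure is 7,000,000 bytes, the seven megabytes quoted. Under the tripling assumption the second is 63,000,000 bytes a day.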

It's a very nice idea. But from a technical point of view, it is a cure much
worse than the disease.

chuq
---
Chuq "Fixed in 4.0" Von Rospach			chuq@sun.COM	Delphi: CHUQ

greg@jiff.berkeley.edu (Greg) (11/10/87)

Greg (greg@maypo.berkeley.edu) writes:
> rn would
> monitor which articles each reader actually reads, and how long each
> reader spends on those articles.

Mark Brader writes:
>The basic problem is that you can't tell, just because the image
>remains on the screen for a long time, that the reader is really
>reading that article.

A good point.  I propose some reasonable cutoff time, like five
minutes, i.e. if a reader spends two hours on one article, five minutes
is averaged in instead.  There will still be some error from readers
walking away from their terminals, performing shell escapes, and so
on.  I figure that the error will be roughly randomly distributed among
articles in proportion to their popularity anyway; to some extent this
variation will merely add a constant factor to the "true" statistics.

It may also be more appropriate to report the median reading time
rather than the mean.
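
A rough Python sketch of the cutoff idea follows. The five-minute limit is the one proposed above; the sample times and the names CUTOFF_S and capped are invented for illustration.

```python
import statistics

CUTOFF_S = 5 * 60  # the proposed five-minute cutoff, in seconds

def capped(times_s, cutoff_s=CUTOFF_S):
    """Replace any time longer than the cutoff with the cutoff itself."""
    return [min(t, cutoff_s) for t in times_s]

# Hypothetical data: three ordinary readers and one who left for two hours.
times = [30, 40, 50, 7200]
capped_times = capped(times)

raw_mean = statistics.mean(times)
capped_mean = statistics.mean(capped_times)
capped_median = statistics.median(capped_times)
print(raw_mean, capped_mean, capped_median)
```

In this example the two-hour session counts as 300 seconds, so the mean drops from 1830 to 105 seconds, while the median (45 seconds) is the same with or without the cap.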

>In addition, readers may not want such monitoring [because of invasion
>of privacy].

I find it difficult to fret over the fate of a list of numbers about me,
without my name attached, that are promptly averaged in with thousands
of other such numbers.  I think most readers are the same way.  The few
who object are free to turn off the feature.  I'd prefer the voting power
to privacy.  In any case, as with the Arbitron ratings, the Nielsens need
not poll EVERY user on the net.

>Besides that, admins might not care for the extra system load...

Chuq Von Rospach also brought up the load issue, in the context of network
load rather than system load:

>If you figure 7,000
>people read usenet once a day (very low numbers! VERY low numbers) and the
>package is 1,000 bytes, the receiving end needs to handle seven megabytes of
>data a day. And these numbers are [ridiculous underestimates]...
>It's a very nice idea. But from a technical point of view, it is a cure much
>worse than the disease.

I don't see why such an expensive scheme is necessary.  In my scheme
the network load for trading statistics is necessarily much smaller
than the load of the news groups themselves.  A news feed in a local
network doles out articles to rn programs running on various hosts; the
rn programs would reply with the stats for their news sessions.  The news
feed would then compress the statistics on its own by taking local
averages and sums.  Every week or so the Nielsen program would collect
data from all of the news feed hosts.  The data on all of the
articles would be in the same report; there would be about 20-40 bytes
of data per article.  Estimating that the articles themselves are 1K
long on average, the Nielsens would be only a small fraction of both
the global network load and the local system load.
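
As a sketch of what the news feed's local bookkeeping might look like, here is a small Python example. The class name FeedStats, the record layout, and the sample message-IDs are all hypothetical; neither rn nor the news software provides anything like this.

```python
from collections import defaultdict

class FeedStats:
    """Running per-article totals kept by a (hypothetical) news feed host."""

    def __init__(self):
        self.readers = defaultdict(int)   # message-id -> number of readers
        self.seconds = defaultdict(int)   # message-id -> total seconds read

    def add_session(self, session):
        """session: iterable of (message_id, seconds) pairs from one rn run."""
        for message_id, secs in session:
            self.readers[message_id] += 1
            self.seconds[message_id] += secs

    def report(self):
        """One short line per article: message-id, reader count, total seconds."""
        return [f"{mid} {self.readers[mid]} {self.seconds[mid]}"
                for mid in sorted(self.readers)]

feed = FeedStats()
feed.add_session([("<1@sq>", 30), ("<2@sq>", 10)])  # made-up message-IDs
feed.add_session([("<1@sq>", 50)])
lines = feed.report()
print(lines)
```

However many readers report in, the feed keeps and forwards only one line per article. At 20-40 bytes of statistics against a 1K article, that is roughly 2 to 4 percent of the article's own size.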
--
Greg

eric@snark.UUCP (Eric S. Raymond) (11/10/87)

In article <1987Nov6.124824.20374@sq.uucp>, msb@sq.UUCP writes:
> This is an intriguing idea, but there are technical problems.
> The basic problem is that you can't tell, just because the image
> remains on the screen for a long time, that the reader is really
> reading that article.

I finessed the problems you point out by not recording reading time, only the
fact that you've seen the text of an article and whether you praised or
condemned it.

> Besides that, admins might not care for the extra system load...
> if I was in that position, I'd be inclined to turn it off, just like
> other accounting logs.

Not necessary. It's really cheap. My news readers keep a trail of
where-you've-been locations (the '-' command works to *arbitrary* depth). At
the end of the session I run through the trail, appending a small record to a
logfile with write(2) (for atomicity, and to avoid the buffering overhead). It
all happens so fast you don't even notice the delay after typing 'q'.

As for space...I suppose the file could get large if you had lots of readers
reading lots of stuff -- but that's why you want to run an abstract generator
on it every night, truncating the logfile when you do it. Once a month you
optionally ship your results off to a central collection point. It's all very
painless, really.
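
For illustration only, the logging and nightly-abstract steps might look something like the Python sketch below. It is not the actual news reader code: the record format, the verdict strings, and the names log_trail and nightly_abstract are all assumptions. os.write is a thin wrapper over write(2), and O_APPEND makes each record land at the current end of the file.

```python
import os
from collections import Counter

def log_trail(path, trail):
    """Append one small record per visited article, one write(2) call each."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        for message_id, verdict in trail:
            # Unbuffered: each record goes out in a single os.write call.
            os.write(fd, f"{message_id} {verdict}\n".encode())
    finally:
        os.close(fd)

def nightly_abstract(path):
    """Count (message-id, verdict) records, then truncate the logfile."""
    counts = Counter()
    with open(path, "r+") as f:
        for line in f:
            message_id, verdict = line.split()
            counts[(message_id, verdict)] += 1
        # Assumption: no reader is appending right now. A record written
        # between the read loop and this truncate would be lost, so a real
        # version would need locking or a rename-then-process step.
        f.truncate(0)
    return counts
```

A reader would call log_trail once on exit with pairs such as ("<1@snark>", "seen") (a made-up message-ID and verdict), and a nightly job would call nightly_abstract on the same path.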
-- 
      Eric S. Raymond
      UUCP:  {{seismo,ihnp4,rutgers}!cbmvax,sdcrdcf!burdvax,vu-vlsi}!snark!eric
      Post:  22 South Warren Avenue, Malvern, PA 19355    Phone: (215)-296-5718