[net.news] Statistics, polls: honest, no flames

woods@hao.UUCP (Greg Woods) (03/21/86)

  I just want to correct a couple of misconceptions (which are probably due
  to my own poor writing anyway). I have been cast as the "bad guy" in this,
  because I have "opposed", in some sense, someone who is doing his best
  to gain information that everyone wants. Contrary to popular
  interpretation, I fully support this effort. What disturbs me is
  twofold: personal public criticism for daring to oppose this net
  'hero', and that people are so blind to the inaccuracies (very well
  explained by Lauren, so I have no need to repeat them) that they
  are ready to start using these results to determine what groups we
  keep and which we don't. THAT is all I object to: certain
  potential USES of the results, not the survey itself. I think
  the data are quite enlightening; that so many want mod.movies,
  for example, suggests that maybe, just maybe, there is more
  support for moderated groups than the public discussions on the
  subject would tend to show. But this is a very GENERAL conclusion.
  What scares ME is the possibility of axing certain groups based
  solely on these results, like 'hey, look, net.blah has the highest
  cost per reader, let's get rid of it'. Brian has partially  counteracted
  this by admitting that the margin for error is large; but he also 
  claims that his survey is exempt from the 'self-select sample' effects
  brought up by Lauren. I do not agree with that assessment. There are
  other self-select factors that everyone is ignoring. My favorite example
  :-) is the use of the Bourne shell. Many older versions of UNIX, and
  smaller systems, do not support this. How does that enter into the self-
  selection process?
     Well, this is already longer than I intended, and I promise it will
     be my last posting on the subject. I encourage Brian to continue
     the survey; I still wish to hell he had posted a C program instead
     of a shell script so more could participate, and I caution everyone
     not to make too many far-reaching decisions based on the results.

     --Greg

--
{ucbvax!hplabs | decvax!noao | mcvax!seismo | ihnp4!seismo}
       		        !hao!woods

CSNET: woods@ncar.csnet  ARPA: woods%ncar@CSNET-RELAY.ARPA

"If the game is lost, we're all the same; no one left to place or take the 
blame; Will we leave this place an empty stone, or a shining ball of earth,
we can call our home"

chuq@sun.uucp (Chuq Von Rospach) (03/22/86)

>   ... and that people are so blind to the inaccuracies (very well
>   explained by Lauren, so I have no need to repeat them) that they
>   are ready to start using these results to determine what groups we
>   keep and which we don't.

Lauren's article (as rebutted by Brian) was a LOT more inaccurate than
the statistics he attempted to discredit. Yes, I'm MORE than ready to use
the results of the statistics to try to streamline the net so that it will
benefit the majority of the readers. This, of course, has to be distressing
to people in the groups with exceptionally high volume and very few readers,
since what Brian has really done is blow away the USENET attitudes regarding
volume and utility -- there is now REAL evidence that volume and readership
are completely unconnected, and we can track down (and potentially
eliminate) the ego-based write mostly groups.


>I think
>   the data are quite enlightening; that so many want mod.movies,
>   for example, suggests that maybe, just maybe, there is more
>   support for moderated groups than the public discussions on the
>   subject would tend to show. But this is a very GENERAL conclusion.

Actually, I think this conclusion is incorrect. In many cases the mod groups
have so little volume that people haven't gotten around to unsubscribing
from them yet.

>   What scares ME is the possibility of axing certain groups based
>   solely on these results, like 'hey, look, net.blah has the highest
>   cost per reader, let's get rid of it'.

This may seem silly, but I think that it is logical for streamlining 
of the net to be done by getting rid of the high volume/low readership
groups -- the most effect for the least netwide trauma (except to the people
who like to hear themselves type).

>    but he also 
>   claims that his survey is exempt from the 'self-select sample' effects
>   brought up by Lauren. I do not agree with that assessment. There are
>   other self-select factors that everyone is ignoring.

I didn't realize you were trained in statistics. How would you recommend
improving the data then? No offense intended, but I prefer to listen to the
people trained in the discipline...

Now, before people accuse me of being too hard on Greg, let me make a few
points. I'm not bitching directly at Greg on this, but at some attitudes
that happen to be in his posting that seem to be generic on the net:

First, Brian's stats are showing some real fallacies in the way things are
done on Usenet. One is the assumption that volume == utility, which is
being shown to be definitely not true. In many groups, a few very
vociferous users can completely overwhelm the rest of the readership.

Second, there is the implied 'it isn't good for me, so we can't do it'.
Eventually we're going to have to make decisions about what the net is
really here for, as volume and costs continue to rise. The LOGICAL thing is
to streamline that which affects the least users, which is difficult to do
currently because we've never before known who is reading things -- only who
is writing. We can now change that.

Third, there is a consistent problem on the net because people say things
like "I disagree with this and so it is wrong". Well, Brian knows a LOT
about statistics. He has access to some of the best statisticians in the
world at Stanford, and he's put a LOT of work into convincing himself that
these stats are valid. Unless you know stats as well as him and know what he
has really done, I can't think of a way in the world that you could convince
me that he is wrong, especially when (as is typical on the net) you have NO
facts to back your assertions.

--
The ONLY problem I have with Brian's stats is the amount of work it takes
(on a net-wide basis) to implement. I don't think they are practical to 
use on a regular basis as a way of making decisions on the net. They are
definitely useful for occasionally figuring out what is going on out there,
though, and I'd love to see them run every six months or so.

I do think we need a new measure of group utility. Previously that measure
has been total volume. I suggest we consider using total volume divided by
the number of DIFFERENT posters over a given time. This could be implemented
easily as part of the newslist data at seismo, and will give us a good ratio
of total interest, assuming you believe that 1 megabyte posted by 20 people
is more useful than 1 megabyte posted by three people. There are some groups
where this breaks down (especially *.sources* and net.jokes, I would guess),
but elsewhere I would expect the total number of posters to be a good measure
of the total number of readers. Comments?
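
The ratio proposed above (total volume divided by the number of different
posters) can be sketched in a few lines. This is only an illustration: the
(newsgroup, poster, size) tuples are a made-up input format, not the layout
of the newslist data at seismo, and the group and user names are invented.

```python
from collections import defaultdict

def volume_per_poster(articles):
    """articles: iterable of (newsgroup, poster, size_in_bytes) tuples.

    Returns {newsgroup: total_bytes / number_of_distinct_posters}.
    The input layout is an assumption made for this sketch.
    """
    total_bytes = defaultdict(int)
    posters = defaultdict(set)
    for group, poster, size in articles:
        total_bytes[group] += size
        posters[group].add(poster)   # a set counts each poster only once
    return {g: total_bytes[g] / len(posters[g]) for g in total_bytes}

# Hypothetical sample data.
sample = [
    ("net.blah", "alice@sitea", 4000),
    ("net.blah", "alice@sitea", 6000),
    ("net.blah", "bob@siteb", 2000),
    ("net.foo", "carol@sitec", 3000),
]
ratios = volume_per_poster(sample)
```

Under the stated assumption that volume spread over many posters is more
useful, a lower ratio for the same total volume would suggest broader
participation.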

chuq


-- 
:From catacombs of a past participle:   Chuq Von Rospach 
chuqi%plaid@sun.ARPA			FidoNet: 125/84
CompuServe: 73317,635
{decwrl,decvax,hplabs,ihnp4,pyramid,seismo,ucbvax}!sun!plaid!chuq

I used to really worry about splitting my infinitives until I realized
that most people had never heard of them.

reid@glacier.ARPA (Brian Reid) (03/22/86)

I'm curious about Greg's bete noire, the missing /bin/sh.

I would very much like to hear from any other site that does not have a
Bourne Shell. In my experience with Unix systems, the Bourne Shell is the
one constant--other shells may come and go, but /bin/sh is always there.
That's why I used it.

Also, it's a lot more work to write a C program than to cobble together a
shell script. I don't have much experience at writing C programs, and I
wanted to get the arbitron program working quickly. I don't like programming
in C very much, but shell scripts are fun in a perverse kind of way.

Greg, just for you I will make a csh version of Arbitron as soon as I get my
grades handed in Monday morning. I'm grading finals at the moment.

Brian
-- 
	Brian Reid	decwrl!glacier!reid
	Stanford	reid@SU-Glacier.ARPA

gds@mit-eddie.MIT.EDU (Greg Skinner) (03/23/86)

I would like to caution that before any global decisions are made
regarding which groups to keep, etc., we wait for a lot more of the
net to report in.  For example, there have only been a few entries
from AT&T -- and none from any of the major sites (ihnp4, cbosgd,
etc.) and the sites they feed.  I believe that data will be critical
in determining net readership -- AT&T is the largest multiorganization
in Usenet and most probably has the most readers of any
multiorganization in Usenet.  I hope we get the AT&T data soon.

I think even more statistics could be taken.  For example, we could
figure out how many articles are cross-posted per newsgroup, how many
articles posted per site to a group, probably others as well.  Is the
newsstats data at seismo sufficient to extract that information?  If
not, I might write a program to take that data, if I can find the
time.
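
As a rough sketch of the two counts suggested above, the following tallies
cross-posted articles per newsgroup and articles per site per group. It
assumes each article has already been reduced to a (posting site, Newsgroups
header) pair, which is a hypothetical input format; whether the newsstats
data at seismo can supply that is exactly the open question. The only thing
relied on from the article format is that the Newsgroups header is a
comma-separated list.

```python
from collections import Counter

def crosspost_stats(articles):
    """articles: iterable of (posting_site, newsgroups_header) pairs.

    Returns (crossposted, per_site):
      crossposted[group]      = articles in group that also went elsewhere
      per_site[(site, group)] = articles posted from site to group
    """
    crossposted = Counter()
    per_site = Counter()
    for site, header in articles:
        groups = [g.strip() for g in header.split(",") if g.strip()]
        for g in groups:
            per_site[(site, g)] += 1
            if len(groups) > 1:       # more than one group: a cross-post
                crossposted[g] += 1
    return crossposted, per_site

# Hypothetical sample data.
sample = [
    ("hao", "net.news,net.news.group"),
    ("sun", "net.news"),
    ("hao", "net.news"),
]
crossposted, per_site = crosspost_stats(sample)
```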

This may be a premature guess but I think one of the outcomes of this
poll will encourage local unmoderated distributions, especially if the
data bears out that the readership and writership of certain groups is
localized to certain geographic or organizational areas.

-- 
It's like a jungle sometimes, it makes me wonder how I keep from goin' under.

Greg Skinner (gregbo)
{decvax!genrad, allegra, gatech, ihnp4}!mit-eddie!gds
gds@eddie.mit.edu

msc@saber.UUCP (Mark Callow) (03/24/86)

> 
> I'm curious about Greg's bete noire, the missing /bin/sh.
> 
So am I.  Here's the result of an "ls -C *.sh" on my 2.10.2 news source
directory.

/usr/src/usr.bin/news/src 
c2sendbatch.sh	csendbatch.sh	install.sh	makeactive.sh	sendbatch.sh
checkgroups.sh	cunbatch.sh	localize.sh	rmgroup.sh

I know most of these scripts are used for installation and one can run
news without batching.  Still if anyone without a Bourne shell installed news,
they must have gone to considerable effort.
-- 
From the TARDIS of Mark Callow
msc@saber.uucp,  sun!saber!msc@decwrl.dec.com ...{ihnp4,sun}!saber!msc
"Boards are long and hard and made of wood"

rees@apollo.uucp (Jim Rees) (03/24/86)

On the topic of self-selection, I would agree with Greg's assertion that
some sites won't run arbitron.  Usenet "site" apollo is actually about
1500 machines with 2000 users, about 200 of them news users.  The arbitron
script won't work here because of the diversity of machines, protections
on home directories, and people who shut down their node at night.  At one
time I was enough of a shell hacker to make it work, but I just don't have
the time or inclination for that stuff any more.

cda@ucbopal.berkeley.edu (Charlotte Allen) (03/25/86)

In article <3389@sun.uucp> chuq@sun.uucp (Chuq Von Rospach) writes:
> but I think that it is logical for streamlining 
>of the net to be done by getting rid of the high volume/low readership
>groups -- the most effect for the least netwide trauma (except to the people
>who like to hear themselves type).

Why don't we get rid of the high volume/low readership posters (guess who
comes to mind....)

msc@saber.UUCP (Mark Callow) (03/27/86)

> net to report in.  For example, there have only been a few entries
> from AT&T -- and none from any of the major sites (ihnp4, cbosgd,

This is the second posting implying that the data from the "major" sites
is important to this readership survey and must be gathered before any
decisions are made.  Why?

The volume of traffic that passes through a site is totally irrelevant
to a survey of newsgroup readership.  The important criterion is the number
of users, particularly news readers, on the machine.  Some of the most "major"
sites are simply mail and news store and forward machines. (e.g. decvax and
ucbvax)  As such they probably don't have any users let alone users who
read news on them.
-- 
From the TARDIS of Mark Callow
msc@saber.uucp,  sun!saber!msc@decwrl.dec.com ...{ihnp4,sun}!saber!msc
"Boards are long and hard and made of wood"

chapman@miro.berkeley.edu (Brent Chapman) (03/28/86)

In article <1960@saber.UUCP> msc@saber.UUCP (Mark Callow) writes:
>> net to report in.  For example, there have only been a few entries
>> from AT&T -- and none from any of the major sites (ihnp4, cbosgd,
>
>This is the second posting implying that the data from the "major" sites
>is important to this readership survey and must be gathered before any
>decisions are made.  Why?
>
>The volume of traffic that passes through a site is totally irrelevant
>to a survey of newsgroup readership.  The important criterion is the number
>of users, particularly news readers, on the machine.  Some of the most "major"
>sites are simply mail and news store and forward machines. (e.g. decvax and
>ucbvax)  As such they probably don't have any users let alone users who
>read news on them.

I can say from personal experience that this is true.  Here at Berkeley,
the load on ucbvax seldom drops below about 4, even when no-one is
logged in.  Very few people are willing to put up with the loads
on ucbvax just to read news.  We have an alternative, which I am
not sure is handled by the survey program.  (Please note:  the scheme
I'm about to describe may in fact be very common at large, multi-
machine sites, but I don't have any experience with sites other
than Berkeley, so I may be pointing out something trivial.  If that
is the case, I apologize for wasting your time.)

Here, there are a few machines (ucbvax, ucbcad, and ucbjade, I
believe) that actually have news on them.  These machines presumably
have the standard news programs available on them (I don't know;
I don't have an account on any of these machines).  They also have
a "news server".  The news server is much like the other Internet
servers, such as mail servers and ftp servers.  When another
machine (such as ucbmiro, the machine I'm using now) wants news
access, it opens a socket to a news server on ucbvax, ucbjade,
or ucbcad, and deals with articles through that interface (I just
LOVE EtherNets!).

We have a program called 'rrn' which is apparently 'rn' re-built
to deal with the server, instead of directly with the file system.
I'm not certain; I've never seen 'rn'.  In any case, my question
is whether or not the people on these 'non-news' machines are
included in the survey.  If they are not, then you are excluding
most of the news readers at Berkeley.

I'm not trying to run down the survey or the surveyor; I think
it is a good idea, and that a lot of thought has gone into it
to make the survey as accurate as possible.

Brent Chapman
ucbvax!miro!chapman
chapman@miro.berkeley.edu

fair@ucbarpa.berkeley.edu (Erik E. Fair) (03/29/86)

In point of fact, ucbvax has quite a few netnews readers, in spite of
the wildly variable load of the machine. I ran arbitron on ucbvax, and
sent off the results to netsurvey@su-glacier.arpa. (257 users, 99 net
readers).

However, since we run a distributed netnews system here at UCB (as
noted by Mr. Chapman), I also followed up the arbitron results with a
letter to Brian Reid explaining our system, and including a copy of the
weekly report that indicates which groups were accessed with what
frequency. While we don't have it broken out by user (or even by
machine; just a raw count of how many times each group was requested
for examination by all the clients of the server that week), I think it
is a good measure of what the UCB community is reading.

Plug: the software in question implements RFC977 (Network News Transfer
Protocol, [NNTP]), written by Phil Lapsley <ucbvax!phil>, and Brian
Kantor <sdcsvax!brian>, with some kibitzing from me. It is presently
available for public FTP from ucbvax (10.2.0.78, pub/nntp.tar), soon to
be posted to mod.sources.
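
For readers unfamiliar with the protocol, here is a small sketch of what a
client sees when it selects a group. In RFC 977 a successful reply to the
GROUP command has the form "211 n f l s ...", where n is the estimated
number of articles, f and l are the first and last article numbers, and s is
the group name. The function name and the sample numbers below are invented
for illustration; this is not code from the nntp.tar distribution.

```python
def parse_group_response(line):
    """Parse an RFC 977 reply to a GROUP command.

    Returns a dict for a "211 n f l s ..." reply, or None for any other
    status code (for example 411, no such news group).
    """
    fields = line.strip().split()
    if len(fields) < 5 or fields[0] != "211":
        return None
    count, first, last = (int(x) for x in fields[1:4])
    return {"group": fields[4], "count": count, "first": first, "last": last}

# Hypothetical reply lines.
reply = parse_group_response("211 104 10011 10125 net.news group selected")
missing = parse_group_response("411 no such news group")
```

A server that logs each successful GROUP request could produce the kind of
per-group access counts described in the weekly report.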

	keeper of the network news for ucbvax,

	Erik E. Fair	ucbvax!fair	fair@ucbarpa.berkeley.edu

chuq@sun.uucp (Chuq Von Rospach) (03/29/86)

> In article <3389@sun.uucp> chuq@sun.uucp (Chuq Von Rospach) writes:
> > but I think that it is logical for streamlining 
> >of the net to be done by getting rid of the high volume/low readership
> >groups -- the most effect for the least netwide trauma (except to the people
> >who like to hear themselves type).
> 
> Why don't we get rid of the high volume/low readership posters (guess who
> comes to mind....)

Well, I'd guess offhand just about anyone in a non-technical
group on the seismo top 25. Contrary to the snide insinuation, that ain't
me, since I've been on the top 25 once in the last six months.

On a practical level, getting rid of individual users is an administrative
impossibility for the net -- the only control at the user level is in the
hands of the SA. Getting rid of bloated newsgroups IS under net control, and
worth looking at.

Personally, if it was possible, I'd like to see us get rid of articles with
no factual content, useless repetitions, childish accusations and other
associated garbage. My biggest worry on this, though, is that there would be
no net left when we were done.

me
-- 
:From the lofty realms of Castle Plaid:          Chuq Von Rospach 
chuq%plaid@sun.COM	FidoNet: 125/84		 CompuServe: 73317,635
{decwrl,decvax,hplabs,ihnp4,pyramid,seismo,ucbvax}!sun!plaid!chuq

The first rule of magic is simple. Don't waste your time waving your hands
and hoping when a rock or a club will do -- McCloctnik the Lucid

tim@ism780c.UUCP (Tim Smith) (04/01/86)

In article <3417@sun.uucp> chuq@sun.uucp (Chuq Von Rospach) writes:
>
>Personally, if it was possible, I'd like to see us get rid of articles with
>no factual content, useless repetitions, childish accusations and other
>associated garbage. My biggest worry on this, though, is that there would be
>no net left when we were done.

What I would like to see is for everyone to be required to use Compuserve
for six months before being allowed to post to USENET.  One doesn't post
content-free flames when one is paying 12 bucks an hour for connect time!
Perhaps the habit of posting short, to the point, articles would carry over
to USENET.  Of course there is no way to actually implement this...
-- 

Tim Smith       sdcrdcf!ism780c!tim || ima!ism780!tim || ihnp4!cithep!tim

mwm@ucbopal.berkeley.edu (Mike (I'll be mellow when I'm dead) Meyer) (04/01/86)

In article <3417@sun.uucp> chuq@sun.uucp (Chuq Von Rospach) writes:
>Well, I'd guess offhand just about anyone in a non-technical
>group on the seismo top 25. Contrary to the snide insinuation, that ain't
>me, since I've been on the top 25 once in the last six months.
>
>On a practical level, getting rid of individual users is an administrative
>impossibility for the net -- the only control at the user level is in the
>hands of the SA. Getting rid of bloated newsgroups IS under net control, and
>worth looking at.

Ok, I can't resist. We went over this problem at lunch today (random chance,
that), and came up with the following two-step solution:

1) a hack to inews so that it refuses to accept articles posted by anyone on
	a list of user@site type names.

2) An awk script (or something similar) that takes the top 25 list, and
	turns it into a list for step one. Criteria should include
	newsgroups posted to and # of s.d. away from average.

If the backbone started running this code (or something like it), we would
have instant, objective deletion of high-volume users on a netwide level,
but only for a couple of weeks. And maybe, just maybe, the thought of being
censored that way would make people think before posting.
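
A minimal sketch of the selection in step 2, written in Python rather than
awk and covering only the "# of s.d. away from average" criterion (the
"newsgroups posted to" criterion is left out). The input dictionary, the
threshold k, and the user names are all assumptions for illustration; the
real top 25 list would need its own parser.

```python
from statistics import mean, pstdev

def over_threshold(volumes, k=2.0):
    """volumes: {poster: bytes_posted}.

    Returns the posters whose volume is more than k population standard
    deviations above the mean, sorted by name.
    """
    values = list(volumes.values())
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:                 # everyone posted the same amount
        return []
    return sorted(p for p, v in volumes.items() if (v - mu) / sigma > k)

# Hypothetical data: nine quiet posters and one loud one.
sample = {"user%d@site" % i: 10 for i in range(9)}
sample["loud@site"] = 100
blocked = over_threshold(sample)
```

The resulting list is what step 1 would consult before accepting an article.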

	<mike

chuq@sun.uucp (Chuq Von Rospach) (04/02/86)

> ME:
> >On a practical level, getting rid of individual users is an administrative
> >impossibility for the net -- the only control at the user level is in the
> >hands of the SA. Getting rid of bloated newsgroups IS under net control, and
> >worth looking at.
> 
> Ok, I can't resist. We went over this problem at lunch today (random chance,
> that), and came up with the following two-step solution:
> 
> 1) a hack to inews so that it refuses to accept articles posted by anyone on
> 	a list of user@site type names.
> 
> 2) An awk script (or something similar) that takes the top 25 list, and
> 	turns it into a list for step one. Criteria should include
> 	newsgroups posted to and # of s.d. away from average.
> 
> If the backbone started running this code (or something like it), we would
> have instant, objective deletion of high-volume users on a netwide level,
> but only for a couple of weeks. And maybe, just maybe, the thought of being
> censored that way would make people think before posting.

Problems:
    o you censor people silently -- if you allow a message to be posted
    and then make it silently go away downstream, how do they know they
    were deleted from the net? Don't assume any of these people read
    net.news or read anything but the group they are posting in. You may
    get rid of the articles, but you aren't solving the problem -- they
    don't know they are being censored and don't change their ways.

    o what happens when the data from seismo is WRONG? Without a human
    in the loop, problems will definitely occur. What happens when I start
    posting forged messages causing people I don't like to get knocked off
    the net (and being knocked off, can't even complain about it!)

    o How do you keep from knocking out the people making positive 
    contributions to netnews? Chris Torek writes a public domain version of
    4.2 in his spare time. He posts it, to the wonderment of all. He then
    gets kicked off the net for excessive volume. Isn't this a NEGATIVE
    inducement to doing good things?

It ain't as easy as it looks. Coming up with a fair way of cutting back the
dead weight sounds good, but it has a lot of practical problems. We just
don't have the administrative tools to do it right, I think.
-- 
:From the lofty realms of Castle Plaid:          Chuq Von Rospach 
chuq%plaid@sun.COM	FidoNet: 125/84		 CompuServe: 73317,635
{decwrl,decvax,hplabs,ihnp4,pyramid,seismo,ucbvax}!sun!plaid!chuq

The first rule of magic is simple. Don't waste your time waving your hands
and hoping when a rock or a club will do -- McCloctnik the Lucid