[news.config] new survey to supplement arbitron. Please run this program.

reid@decwrl.dec.com (Brian Reid) (05/19/89)

I'm escalating my network measurement surveys. Below is a short
C program that analyzes flow paths and reports them to a central location.

It has been alpha-tested at 25 sites, including Vax, Sun 3, Sun 4, Pyramid,
and some 3B machines. There may well be bugs, but I guarantee that if there
are, they will be obscure and difficult bugs.

This program takes about as long as "expire -r" to run, so it's not something
you want to do very often. Here at decwrl where we keep 45 or 50 days of news
online, it can take several hours to run. On smaller machines which keep only
a few days of news online, it finishes in several minutes.

Ideally I'd like everybody to run this program once a month and mail the
results to pathsurvey@decwrl.dec.com. It's important that small sites as well
as big ones participate.

Thank you. Once the data starts rolling in, I'll be posting the results once
a month, just like arbitron.

	Brian Reid
	DEC Western Research

------------------------------------


/* inpaths.c -- track the paths of incoming news articles and prepare
 *	      a report in a format suitable for decwrl pathsurveys
 *
 *
 * This program inputs a list of filenames of news articles, and outputs a
 * data report which should be mailed to the decwrl Network Monitoring
 * Project at address "pathsurvey@decwrl.dec.com".
 *
 *
 * Run it like this:
 *
 *  cd /usr/spool/news
 *  find . -type f -print | inpaths "yourhost" | mail pathsurvey@decwrl.dec.com
 *
 *  where "yourhost" is the host name of your computer.
 *
 * If you have a huge amount of news spooled and don't want to run 
 * all of it through inpaths, you can do something like
 *
 *   find . -type f -mtime -10 -print | ...
 * 
 * There are 3 options: -s, -m, and -l, for short, medium, and long reports.
 * The default is to produce a long report. If you are worried about mail
 * expenses you can send a shorter report. The long report is typically
 * about 50K bytes for a major site, and perhaps 25K bytes for a smaller
 * site. 
 *
 * Brian Reid
 *	V1.0	 Sep 1986
 *	V2.0	 May 1989
 *     
 */

#define VERSION "2.2"
#include <stdio.h>
#include <fcntl.h>
#include <ctype.h>
#include <sys/types.h>

#define SURVEYPERIOD 21		/* Maximum number of days in survey period */
#define	INTERVAL	SURVEYPERIOD*60*60*24
#define HEADBYTES 1024

main (argc,argv)
  int argc;
  char **argv;
 {
    char linebuf[1024], jc, *lptr, *cp, *cp1, *cp2;
    char rightdelim;
    char *pathfield;
    char artbuf[HEADBYTES];
    char * scanlimit;
    char *hostname;
    char hostString[128];
    int needHost;
    static int passChar[256];
    int isopen,columns,verbose,totalTraffic;

	/* definitions for getopt */
    extern int optind;
    extern char *optarg;

 /* structure used to tally the traffic between two hosts */
    struct trec {
	struct trec *rlink;
	struct nrec *linkid;
	int tally;
    } ;

 /* structure to hold the information about a host */
    struct nrec {
	struct nrec *link;
	struct trec *rlink;
	char *id;
	long sentto; /* tally of articles sent to somebody from here */
    } ;
    struct nrec *hosthash[128], *hnptr, *list, *relay;
    struct trec *rlist;
    int i, article, gotbytes, c;
    extern int errno;
    extern char *malloc(), *strcpy();

    hostname = "unknown";
    verbose = 2;
    while (( c=getopt(argc, argv, "sml" )) != EOF)
    switch (c) {
	case 's': verbose=0; break;
	case 'm': verbose=1; break;
	case 'l': verbose=2; break;
	case '?': fprintf(stderr,
	"usage: %s [-s] [-m] [-l] hostname\n",argv[0]);
	exit(1);
    }
    if (optind < argc) {
        hostname = argv[optind];
    } else {
	fprintf(stderr,"usage: %s [-s] [-m] [-l] `hostname`\n",argv[0]);
	exit(1);
    }

    fprintf(stderr,"computing %s inpaths for host %s\n",
	verbose==0 ? "short" : (verbose==1 ? "medium" : "long"),hostname);
    for (i = 0; i<128; i++) hosthash[i] = (struct nrec *) NULL;

/* precompute character types to speed up scan */
    for (i = 0; i<=255; i++) {
    	passChar[i] = 0;
	if (isalpha(i) || isdigit(i)) passChar[i] = 1;
	if (i == '-' || i == '.' || i == '_') passChar[i] = 1;
    }
    totalTraffic = 0;    

    while (gets(linebuf) != NULL) {
        lptr = linebuf;
	isopen = 0;

/* Skip files that do not have pure numeric names */
	i = strlen(lptr)-1;
	do {
	    if (!isdigit(linebuf[i])) {
	        if (linebuf[i]=='/') break;
		goto bypass;
	    }
	    i--;
	} while (i>=0);

/* Open the file for reading */
	article = open(lptr, O_RDONLY);
	isopen = (article >= 0);
	if (!isopen) goto bypass;

/* Read in the first few bytes of the article; find the end of the header */
	gotbytes = read(article, artbuf, HEADBYTES);
	if (gotbytes < 10) goto bypass;

/* Find "Path:" header field */
	pathfield = (char *) 0;
	scanlimit = &artbuf[gotbytes];
	for (cp=artbuf; cp < scanlimit; cp++) {
	    if (*cp == '\n') break;
	    if (pathfield) break;
	    if (strncmp(cp, "Path: ", 6) == 0) {
		pathfield = cp; goto gotpath;
	    }
	    while (cp < scanlimit && *cp != '\n') cp++;
	}
	fprintf(stderr,"%s: didn't find 'Path:' in 1st %d bytes.\n",
	    lptr,HEADBYTES);
	goto bypass; 

gotpath: ;

/* Extract all of the host names from the "Path:" field and put them in our
host table.								 */
	cp = pathfield;
	while (cp < scanlimit && *cp != '\n') cp++;
	if (cp >= scanlimit) {
	    fprintf(stderr,"%s: end of Path line not in buffer.\n",lptr);
	    goto bypass;
	}

	totalTraffic++;
	*cp = 0;
	pathfield += 5;	/* skip 'Path:' */
	cp1 = pathfield;
	relay = (struct nrec *) NULL;
	rightdelim = '!';
	while (cp1 < cp) {
	    /* get next field */
	    while (*cp1=='!') cp1++;
	    cp2 = ++cp1;
	    while (passChar[(int) (*cp2)]) cp2++;

	    rightdelim = *cp2; *cp2 = 0;
	    if (rightdelim=='!' && *cp1 != '\0') {
	    /* see if already in the table */
		list = hosthash[*cp1];
		while (list != NULL) {
		    /*
		     * Attempt to speed things up here a bit.  Since we hash
		     * on the first char, we see if the second char is a match
		     * before calling strcmp()
		     */
		    if (list->id[1] == cp1[1] && !strcmp(list->id, cp1)) {
			hnptr = list;
			break;		/* I hate unnecessary goto's */
		    }
		    list = list->link;
		}
		if(list == NULL) {
			/* get storage and splice in a new one */
			hnptr = (struct nrec *) malloc(sizeof (struct nrec));
			hnptr->id = (char *) strcpy(malloc(1+strlen(cp1)),cp1);
			hnptr->link = hosthash[*cp1];
			hnptr->rlink = (struct trec *) NULL;
			hnptr->sentto = (long) 0;
			hosthash[*cp1] = hnptr;
		}
	    }
/* 
At this point "hnptr" points to the host record of the current host. If
there was a relay host, then "relay" points to its host record (the relay
host is just the previous host on the Path: line). Since this Path means
that news has flowed from host "hnptr" to host "relay", we want to tally
one message in a data structure corresponding to that link. The tally
record hangs off the receiving host "relay" and is keyed by the sending
host "hnptr"; we bump that tally and also bump "hnptr"'s sent-to count.
*/

	    if (relay != NULL && relay != hnptr) {
		rlist = relay->rlink;
		while (rlist != NULL) {
		    if (rlist->linkid == hnptr) goto have2;
		    rlist = rlist->rlink;
		}
		rlist = (struct trec *) malloc(sizeof (struct trec));
		rlist->rlink = relay->rlink;
		relay->rlink = rlist;
		rlist->linkid = hnptr;
		rlist->tally = 0;

    have2:      rlist->tally++;
		hnptr->sentto++;
	    }

	    cp1 = cp2;
	    relay = hnptr;
	    if (rightdelim == ' ' || rightdelim == '(') break;
	}
bypass: if (isopen) close(article) ;
    }
/* Now dump the host table */
    printf("ZCZC begin inhosts %s %s %d %d %d\n",
    	VERSION,hostname,verbose,totalTraffic,SURVEYPERIOD);
    for (jc=0; jc<127; jc++) {
	list = hosthash[jc];
	while (list != NULL) {
	    if (list->rlink != NULL) {
		if (verbose > 0 || (100*list->sentto > totalTraffic))
		    printf("%d\t%s\n",list->sentto, list->id);
	    }
	    list = list->link;
	}
    }
    printf("ZCZC end inhosts %s\n",hostname);

    printf("ZCZC begin inpaths %s %s %d %d %d\n",
        VERSION,hostname,verbose,totalTraffic,SURVEYPERIOD);
    for (jc=0; jc<127; jc++) {
	list = hosthash[jc];
	while (list != NULL) {
	    if (verbose > 1 || (100*list->sentto > totalTraffic)) {
		if (list->rlink != NULL) {
		    columns = 3+strlen(list->id);
		    sprintf(hostString,"%s H ",list->id);
		    needHost = 1;
		    rlist = list->rlink;
		    while (rlist != NULL) {
		        if (
			     (100*rlist->tally > totalTraffic)
			  || ((verbose > 1)&&(5000*rlist->tally>totalTraffic))
			   ) {
			    if (needHost) printf("%s",hostString);
			    needHost = 0;
			    relay = rlist->linkid;
			    if (columns > 70) {
				printf("\n%s",hostString);
				columns = 3+strlen(list->id);
			    }
			    printf("%d Z %s U ", rlist->tally, relay->id);
			    columns += 9+strlen(relay->id);
			}
			rlist = rlist->rlink;
		    }
		    if (!needHost) printf("\n");
		}
	    }
	    list = list->link;
	}
    }
    printf("ZCZC end inpaths %s\n",hostname);
    fclose(stdout);
    exit(0);
}

jbuck@epimass.EPI.COM (Joe Buck) (05/20/89)

Brian, your program, if invoked in the way you request, will process
crossposted articles N times, where N is the number of groups present.
Please, let's not waste net resources by conducting a large-scale survey
with a basic error in it.

Rather than do a "find" to locate article names, you can count
crossposted articles only once by reading the history file to obtain
article filenames.  Since this is going to alt.sources, I obviously
need to include a source: here is a perl program that eats a history
file and spits out a sorted list of host pairs, showing the links your
news has travelled through.

------------------------------ cut here ------------------------------
#! /usr/bin/perl

# This perl program scans through all the news on your spool
# (using the history file to find the articles) and prints
# out a sorted list of frequencies that each pair of hosts
# appears in the Path: headers.  That is, it determines how,
# on average, your news gets to you.
#
# If an argument is given, it is the name of a previous output
# of this program.  The figures are read in, and host pairs
# from articles newer than the input file are added in.
# So that this will work, the first line of the output of the
# program is of the form
# Last-ID: <5679@chinet.UUCP>
# (without the # sign).  It records the last Message-ID in the
# history file; to add new articles, we skip in the history file
# until we find the message-ID that matches "Last-ID".

$skip = 0;
if ($#ARGV >= 0) {
    $ofile = $ARGV[0];
    die "Can't open $ofile!\n" unless open (of, $ofile);
# First line must contain last msgid to use.
    $_ = <of>;
    ($key, $last_id) = split (' ');
    die "Invalid input file format!\n" if ($key ne "Last-ID:");
    $skip = 1;
# Read in the old file.
    while (<of>) {
	($cnt, $pair) = split(' ');
	$pcount{$pair} = $cnt;
    }
}
# Let's go.

die "Can't open history file!\n" unless open (hist, "/usr/lib/news/history");
die "Can't cd to news spool directory!\n" unless chdir ("/usr/spool/news");

$np = $nlocal = 0;
while (<hist>) {
#
# $_ contains a line from the history file.  Parse it.
# Skip it if the article has been cancelled or expired
# If the $skip flag is true, we skip until we have the right msgid
#
    ($id, $date, $time, $file) = split (' ');
    next if ($file eq 'cancelled' || $file eq '');
    if ($skip) {
	if ($id eq $last_id) { $skip = 0; }
	next;
    }
#
# format of field is like comp.sources.unix/2345 .  Get ng and filename.
#
    ($ng, $n) = split (/\//, $file);
    $file =~ tr%.%/%;
#
# The following may be used to skip any local groups.  Here, we
# skip group names beginning with "epi" or "su".  Change to suit taste.
#
    next if $ng =~ /^epi|^su/;
    next unless open (art, $file);	# skip if cannot open file
#
# Article OK.  Get its path.
    while (<art>) {
        ($htype, $hvalue) = split (' ');
	if ($htype eq "Path:") {
# We have the path, in hvalue.
	    $np++;
	    @path = split (/!/, $hvalue);
# Handle locally posted articles.
	    if ($#path < 2) { $nlocal++; last;}
# Create and count pairs.
	    for ($i = 0; $i < $#path - 1; $i++) {
		$pair = $path[$i] . "!" . $path[$i+1];
		$pcount{$pair} += 1;
	    }
	    last;
	}
    }
}
# Make sure print message comes out before sort data.
$| = 1;
print "Last-ID: $id\n";
$| = 0;
# write the data out, sorted.  Open a pipe.
die "Can't exec sort!\n" unless open (sortf, "|sort -nr");

while (($pair, $n) = each (%pcount)) {
    printf sortf ("%6d %s\n", $n, $pair);
}
close sortf;
-- 
-- Joe Buck	jbuck@epimass.epi.com, uunet!epimass.epi.com!jbuck

reid@decwrl.dec.com (Brian Reid) (05/20/89)

Joe,
   It's been my experience that there's too much variation in history file
formats out there for programs based on the history file to be very
universal. Also I claim it is not an error to doublecount a crossposted
message. Also perl is not universal, though it is certainly nice. If anybody
is willing to massage that perl script into a format that my automated
processing programs can use, I'd be happy, but I don't want to encourage
people to run software that takes advantage of undocumented features of the
news system (e.g. the history file format). That will make it much more
difficult to make future changes to that format.

Brian

rsalz@bbn.com (Rich Salz) (05/21/89)

(Excuse the light-hearted tone; too many Saturday-morning cartoons...)

On our last visit Joe Buck <jbuck@epimass.EPI.COM> pointed out that Brian
Reid's "inpath" screws up with cross-posted articles.  He then provided a
perl program to read the history file.  Brian <everyone knows brian>
subsequently pointed out that not everyone has perl, and that reading
internal data files is not a good thing.

Fear not, gentle fellows.  How about this?  Use the :F: flag in your news
sys file to write the article pathnames into a file, then run a small
script monthly out of cron to invoke inpaths.  Something like this:
    /usr/lib/news/sys:
	survey:all:F:/usr/spool/batch/survey
    /usr/lib/crontab:
	2 10 1 * * /usr/lib/news/doinpaths
    /usr/lib/news/doinpaths:
	cd /usr/spool/batch
	touch survey.work
	cat survey >>survey.work
	cp /dev/null survey
	/usr/lib/news/inpaths -l `hostname` <survey.work | mail pathsurvey@decwrl.dec.com
	rm survey.work

The only question:  do you have enough space to store the pathnames of
a month's worth of news?
	/rich $alz
-- 
Please send comp.sources.unix-related mail to rsalz@uunet.uu.net.

simpson@poseidon.uucp (Scott Simpson) (05/21/89)

NNTP sites cannot run this.  We don't get news.
	Scott Simpson
	TRW Space and Defense Sector
	oberon!trwarcadia!simpson  	(UUCP)
	trwarcadia!simpson@usc.edu	(Internet)

david@indetech.UUCP (David Kuder) (05/22/89)

In article <1735@papaya.bbn.com> rsalz@bbn.com (Rich Salz) writes:
=(Excuse the light-hearted tone; too many Saturday-morning cartoons...)
= ...
!Fear not, gentle fellows.  How about this?  Use the :F: flag in your news
!sys file to write the article pathnames into a file, then run a small
!script monthly out of cron to invoke inpaths.  Something like this:
! ...
> The only question:  do you have enough space to store the pathnames of
>a months worth of news?
>	/rich $alz
	Who has enough disk space to store a month's worth of articles?
And as long as I'm asking, does not having a month's worth lessen
the value of the data Brian is collecting?  I'm lucky to have enough disk
to keep 3 weeks of history file and 2 weeks of most news groups.

bin@primate.wisc.edu (Brain in Neutral) (05/22/89)

From article <2836@emerald.indetech.uucp>, by david@indetech.UUCP (David Kuder):
> In article <1735@papaya.bbn.com> rsalz@bbn.com (Rich Salz) writes:
>> The only question:  do you have enough space to store the pathnames of
>>a months worth of news?
>>	/rich $alz
> 	Who has enough disk space to store a months worth of articles?

A month's worth of >> path names <<.  Much different!

Paul DuBois
dubois@primate.wisc.edu		rhesus!dubois
bin@primate.wisc.edu		rhesus!bin

spaf@cs.purdue.edu (Gene Spafford) (05/22/89)

Keep a month's worth of news?  Ha!  That stopped being possible years
ago.  Here, we keep 1 week of comp,news,sci,rec,misc,soc,gnu & bionet.
We keep 1 day worth of talk and comp.binaries.*, and we don't get alt.
Even with those restrictions, the disk has filled up a few times with
too much cruft.  I doubt we're the only site in such a situation.

I'm sure that Brian's numbers will be skewed some by this.
-- 
Gene Spafford
NSF/Purdue/U of Florida  Software Engineering Research Center,
Dept. of Computer Sciences, Purdue University, W. Lafayette IN 47907-2004
Internet:  spaf@cs.purdue.edu	uucp:	...!{decwrl,gatech,ucbvax}!purdue!spaf

reid@decwrl.dec.com (Brian Reid) (05/22/89)

Part of what I am measuring in this survey is the amount of news that people
keep online. However much you have, be it 2 days (the lowest I've seen so
far) or 75 days (the highest I've seen so far), the data is very worthwhile.

Thank you all for your cooperation. I now have reports from about 60 sites.
I'm hoping to get reports from 200 sites by June 1, so that the monthly
report posted that day will have a high enough statistical significance that
it can be included in long-term trend analyses.

david@ms.uky.edu (David Herron -- One of the vertebrae) (05/22/89)

In article <4981@wiley.UUCP> simpson@poseidon.UUCP (Scott Simpson) writes:
>NNTP sites cannot run this.  We don't get news.

Eh?  Excuuuse me?  NNTP means Network News Transfer Protocol
				      ^^^^
				      ^^^^
				      ^^^^
				      You most certainly DO get news if you use NNTP
-- 
<- David Herron; an MMDF guy                              <david@ms.uky.edu>
<- ska: David le casse\*'      {rutgers,uunet}!ukma!david, david@UKMA.BITNET
<- By all accounts, Cyprus (or was it Crete?) was covered with trees at one time
<- 		-- Until they discovered Bronze

reid@decwrl.dec.com (Brian Reid) (05/23/89)

In article <6796@medusa.cs.purdue.edu> spaf@cs.purdue.edu (Gene Spafford) writes:
>
>Keep a month's worth of news?  Ha!  That stopped being possible years
>ago.

We keep 50 days of news online. I need at least a month online to do the
arbitron analysis, and the rest is a safety factor.

>I'm sure that Brian's numbers will be skewed some by this.

No they won't at all. It's easy to tell how much news is there, and the flows
are all percentages anyhow. If you keep 100 days of news online, then the
aggregate flows can be computed to a greater accuracy than if you keep 1 day
of news online, but there's plenty of unskewed information in any amount of
stored-up news. Having 100 sites report on the contents of their 1-day
spooling directory has a lot more information in it, unskewed, than having 1
site report on 100 days.

jbuck@epimass.EPI.COM (Joe Buck) (05/23/89)

In article <85@jove.dec.com> reid@decwrl.dec.com (Brian Reid) writes:
>Joe,
>   It's been my experience that there's too much variation in history file
>formats out there for programs based on the history file to be very
>universal.

As far as I know, both the 2.11 news and C news history file formats
give the article file name(s) in the third tab-separated field, and the
2.11 format looks the same as it has since 2.10.2 at least.  I do not
know what format TMNN uses.  I know of no other formats.

> Also I claim it is not an error to doublecount a crossposted message.

Depends on what you intend to accomplish.  If all news categories had
the same propagation, and if cross-posting were equally prevalent in
all categories, it wouldn't make a difference in any attempt to measure
the news topology.

Unfortunately, this isn't true.  Many sites don't get the talk and soc
groups; however, crossposting seems more common in those groups.  Result:
distribution links that send talk groups will be emphasized more than
those that send only comp and news.

Actually, I think the arbitron statistics are getting increasingly
distorted due to this phenomenon.  I know there are a large number of
"comp"-only sites out there, and I submit that such sites are far less
likely to run arbitron.  Result: comp is more popular, and rec, soc,
and talk less popular than the extrapolated statistics show.  Can't
prove this, of course.

> Also perl is not universal, though it is certainly nice.

Perhaps my communication wasn't clear here.  I just intended that
program as an example of how to process all news articles by scanning
the (2.11) history file.  I wasn't necessarily saying you should use
it instead.

> If anybody
>is willing to massage that perl script into a format that my automated
>processing programs can use, I'd be happy, but I don't want to encourage
>people to run software that takes advantage of undocumented features of the
>news sytem (e.g. the history file format). That will make it much more
>difficult to make future changes to that format.

I see your point, but exactly what features of the news system can we
say are documented?  That the news articles are each in separate files
and have all-numeric names (an assumption your program makes, which is
not true of notes, for example)?  There's a discussion over on
news.software.nntp on precisely this point, over a proposal to extend
the LIST command to display more files.  About the only absolutely
standard program you could write would assume nothing
but the NNTP protocol and access all articles that way (slow, slow, slow).

Another way of eliminating double-counted crossposts using your method
is to check the Newsgroups: header and not count the article if the
first newsgroup doesn't match the file name (so articles are only
counted in the first group posted to).
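
For concreteness, here is a rough sketch in C of the check I have in mind,
as it might bolt onto inpaths.c right after the header bytes have been read.
The function name (first_group_matches), the buffer sizes, and the filename
handling are purely illustrative -- none of this is from Brian's posted code:

    /*
     * Sketch: return 1 if this article should be counted, i.e. if the
     * directory part of its spool-relative file name matches the first
     * group on its Newsgroups: line; crossposted copies filed under a
     * later group return 0.
     */
    #include <stdio.h>
    #include <string.h>

    int first_group_matches(char *fname, char *hdr, int hdrlen)
    {
        char group[256], dir[256];
        char *cp, *end, *slash;
        int i;

        end = hdr + hdrlen;

        /* find "Newsgroups:" at the start of a header line */
        cp = hdr;
        for (;;) {
            if (cp >= end) return 1;            /* none found: count it anyway */
            if (*cp == '\n') return 1;          /* blank line ends the header */
            if (end - cp >= 11 && strncmp(cp, "Newsgroups:", 11) == 0) break;
            while (cp < end && *cp != '\n') cp++;
            cp++;                               /* step past the newline */
        }
        cp += 11;
        while (cp < end && (*cp == ' ' || *cp == '\t')) cp++;

        /* copy the first group, mapping '.' to '/' to match the spool layout */
        for (i = 0; i < 255 && cp < end && *cp != ',' && *cp != '\n' && *cp != ' '; i++, cp++)
            group[i] = (*cp == '.') ? '/' : *cp;
        group[i] = '\0';

        /* directory part of the file name, minus any leading "./" */
        if (fname[0] == '.' && fname[1] == '/') fname += 2;
        strncpy(dir, fname, 255);
        dir[255] = '\0';
        slash = strrchr(dir, '/');
        if (slash != NULL) *slash = '\0';

        return strcmp(group, dir) == 0;
    }

    /* tiny test: prints 1 for the first call, 0 for the second */
    int main()
    {
        static char hdr[] =
            "Path: a!b!c\nNewsgroups: comp.sources.unix,alt.sources\n\n";
        int n = (int) strlen(hdr);

        printf("%d\n", first_group_matches("./comp/sources/unix/2345", hdr, n));
        printf("%d\n", first_group_matches("./alt/sources/881", hdr, n));
        return 0;
    }

The natural place to call it is right after the read() of the header, before
the Path: scan; an article whose file name lives under a directory other than
its first newsgroup would simply be bypassed.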


-- 
-- Joe Buck	jbuck@epimass.epi.com, uunet!epimass.epi.com!jbuck

grr@cbmvax.UUCP (George Robbins) (05/23/89)

In article <11733@s.ms.uky.edu> david@ms.uky.edu (David Herron -- One of the vertebrae) writes:
> In article <4981@wiley.UUCP> simpson@poseidon.UUCP (Scott Simpson) writes:
> >NNTP sites cannot run this.  We don't get news.
> 
> Eh?  Excuuuse me?  NNTP means Network News Transfer Protocol
> 				      ^^^^
> 				      You most certainly DO get news if you use NNTP

Except if they're NNTP clients running only rrn, in which case they may
have no local spool, which is probably what Scott refers to...

-- 
George Robbins - now working for,	uucp: {uunet|pyramid|rutgers}!cbmvax!grr
but no way officially representing	arpa: cbmvax!grr@uunet.uu.net
Commodore, Engineering Department	fone: 215-431-9255 (only by moonlite)

bin@primate.wisc.edu (Brain in Neutral) (05/23/89)

From article <300@indri.primate.wisc.edu>, by bin@primate.wisc.edu (Brain in Neutral):
> From article <2836@emerald.indetech.uucp>, by david@indetech.UUCP (David Kuder):
>> In article <1735@papaya.bbn.com> rsalz@bbn.com (Rich Salz) writes:
>>> The only question:  do you have enough space to store the pathnames of
>>>a months worth of news?
>>>	/rich $alz
>> 	Who has enough disk space to store a months worth of articles?
> 
> A month's worth of >> path names <<.  Much different!

It's been pointed out to me that it's not so different after all,
since inpaths opens the files to read the articles.  Um.


Paul DuBois
dubois@primate.wisc.edu		rhesus!dubois
bin@primate.wisc.edu		rhesus!bin

grr@cbmvax.UUCP (George Robbins) (05/24/89)

In article <80@jove.dec.com> reid@decwrl.dec.com (Brian Reid) writes:
> I'm escalating my network measurement surveys. below is a short
> C program that analyzes flow paths and reports them to a central location.
 
OK, I've let this fine program run for hours on my spool area and mailed
off the results.  How about posting a copy of the analysis program so I
can interpret my own data?

-- 
George Robbins - now working for,	uucp: {uunet|pyramid|rutgers}!cbmvax!grr
but no way officially representing	arpa: cbmvax!grr@uunet.uu.net
Commodore, Engineering Department	fone: 215-431-9255 (only by moonlite)

grr@cbmvax.UUCP (George Robbins) (05/24/89)

In article <304@indri.primate.wisc.edu> bin@primate.wisc.edu writes:
> From article <300@indri.primate.wisc.edu>, by bin@primate.wisc.edu (Brain in Neutral):
> > From article <2836@emerald.indetech.uucp>, by david@indetech.UUCP (David Kuder):
> >> In article <1735@papaya.bbn.com> rsalz@bbn.com (Rich Salz) writes:
> >>> The only question:  do you have enough space to store the pathnames of
> >>>a months worth of news?
> >> 	Who has enough disk space to store a months worth of articles?

heh heh heh ... telebit modems and CDC sabre 1.2Gbyte drives -
    a news admin's crutches as technology struggles to keep up with usenet...

> > A month's worth of >> path names <<.  Much different!
> It's been pointed out to me that it's not so different after all,
> since inpaths opens the files to read the articles.  Um.

Still, it is conceptually easy to "feed" all incoming/outgoing news articles
to a dummy site that is really a "sed" script to strip off the path lines and
store them for later analysis.  Brian's program would have to be improved
to simply process a file of header lines, but this seems trivial, in exchange
for eliminating any need to retain articles and the bizarre overhead that
searching every article in the news spool often requires.
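
As a sketch only (the names and details here are mine, and it assumes one
saved "Path:" line per incoming article), the whole front end of inpaths
could collapse to something like:

    #include <stdio.h>
    #include <string.h>

    int main()
    {
        char linebuf[1024];
        char *path;
        long total = 0;

        /* one saved header line per incoming article, on stdin */
        while (fgets(linebuf, sizeof linebuf, stdin) != NULL) {
            linebuf[strcspn(linebuf, "\n")] = '\0';   /* drop the newline */
            path = linebuf;
            if (strncmp(path, "Path: ", 6) == 0)      /* tolerate the header tag */
                path += 6;
            if (*path == '\0')
                continue;                             /* skip empty lines */
            total++;
            /* ... the host-by-host scan of "path" from inpaths.c goes here ... */
        }
        fprintf(stderr, "%ld paths processed\n", total);
        return 0;
    }

All of the per-host tallying from the posted inpaths.c would drop in where
the comment sits; no find, no per-article open(), and the spool could then
be expired as aggressively as you like.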

It depends somewhat on what Brian is trying to track, "reception" of news
or "long term" availabilty of news online...

-- 
George Robbins - now working for,	uucp: {uunet|pyramid|rutgers}!cbmvax!grr
but no way officially representing	arpa: cbmvax!grr@uunet.uu.net
Commodore, Engineering Department	fone: 215-431-9255 (only by moonlite)

emv@math.lsa.umich.edu (Edward Vielmetti) (05/25/89)

In article <6964@cbmvax.UUCP> grr@cbmvax.UUCP (George Robbins) writes:
>
>It depends somewhat on what Brian is trying to track, "reception" of news
>or "long term" availabilty of news online...
>
From looking at the output, it appears to be fodder for 'netmap'.  All you
need then is a dictionary of where sites are (pathalias data) and voila!
you can provide genuine, current information on the real 'backbone'
paths by which news flows, relative to any one site in particular or
when aggregated to the whole world.

Very interesting data, glad to see that it's being collected.