[mod.sources] v06i009: Tools for analyzing netnews paths

sources-request@mirror.UUCP (06/19/86)

Submitted by: talcott!seismo!epiwrl!epimass!jbuck
Mod.sources: Volume 6, Issue 9
Archive-name: getpaths

[ I had to reformat the AWK script in histog, strangely enough;
  otherwise AWK crashed.  My version is in the file histog.mod
	--r$ ]

# This is a shell archive.  Remove anything before this line,
# then unpack it by saving it in a file and typing "sh file".
# Contents:  6sites README backbone-list closure getopt.c getpaths.1 getpaths.c
#	histog histog.mod hostcount makefile makefile.42 makefile.sysv
#	makefile.v7 oscantree.c scantree.c threelink twolink
 
echo x - 6sites
sed 's/^XX//' > "6sites" <<'@//E*O*F 6sites//'
XXBACKBONE!decvax!seismo!ihnp4!tektronix!sdcrdcf!decwrl
@//E*O*F 6sites//
chmod u=rw,g=rw,o=rw 6sites
 
echo x - README
sed 's/^XX//' > "README" <<'@//E*O*F README//'
XX		A Collection of Usenet path-counting tools
XX		Joseph T. Buck - Entropic Processing, Inc.

XXThis is a set of tools for computing statistics on the Usenet paths
XXon your machine.

XXThe program 'getpaths' gathers paths from your news spool, taking cross
XXpostings into account (each path appears in the output once).
XXSee the man page supplied.

XXThere are three makefiles provided; I hope getpaths can be made to
XXrun on anything.  Use makefile.42 for 4.2bsd or 4.3bsd; makefile.sysv
XXfor System III or System V; use makefile.v7 for Version 7 or 4.1bsd.

XXIf you have trouble, there are three questions you have to answer:

XX	1) Do you have the 4.2 directory access library?  If so, use
XX	   scantree.c, otherwise use oscantree.c.

XX	2) v7 and bsd systems have functions "index" and "rindex".
XX	   SysIII and SysV changed the names to strchr and strrchr,
XX	   for no good reason.  See the definition of CFLAGS in
XX	   makefile.sysv.

XX	3) Do you have getopt? (SysIII,V do, bsd, v7 don't).  I'm
XX	   including Henry Spencer's public domain version, since
XX	   it's short and valuable and Berkeley folks can't run the
XX	   program without it.

XXYou may find the scantree functions useful in other programs.  You're
XXwelcome to them, but leave my name in, please.

XXWith no options, getpaths prints the paths in every article on the standard
XXoutput.  If you expire news after one week, this will result in about 300 Kbytes
XXof output.  To restrict yourself to a subtree, you may give an option, which
XXis a pathname that uses /usr/spool/news as a base.

XX	getpaths mod
XX	getpaths net/politics

XXOnce you've used getpaths to gather some paths, you can use the shell scripts
XXto compute some statistics on them:

XXhistog -
XX	Computes the average path length and prints a histogram.
XXhostcount -
XX	Counts hostnames that occur anywhere in a path and prints them
XX	in declining order of frequency.
XXtwolink -
XX	Prints connected pairs of sites in declining order of frequency.
XX	I used twolink to prepare my most recent map.
XXthreelink -
XX	Prints connected triplets of sites in declining order of frequency.

XXFinally, there is "closure", which will give "awk" a good workout on your
XXsystem.  It is an automated version of the procedure that I used to build
XXan extended backbone map.

XXclosure takes three arguments.  The first argument is the name of a
XXfile containing the "core" sites you wish to start with (you can
XXthink of these as "backbone" sites, but I'm not claiming to define what
XX"backbone" really means).

XXThe file must be in a somewhat peculiar format; each line must begin
XXwith the word BACKBONE (in caps), followed by the site names,
XXseparated by !  marks.  For example, if you want to start with hosts
XXalpha, beta, and gamma, the backbone file could be 

XXBACKBONE!alpha!beta!gamma

XXor

XXBACKBONE!alpha
XXBACKBONE!beta
XXBACKBONE!gamma

XXYes, it's ugly.

XXThe second argument is the name of the file containing a list of paths
XX(as output by getpaths, perhaps).  Finally, the third argument, an
XXinteger, is a threshold.

XXclosure searches the paths file for chains of paths that connect two
XX"core" sites; a chain is disregarded unless it occurs at least
XX"threshold" times, where "threshold" is the third argument.  Every
XXsite in each such chain is added to the "core".  This is repeated
XXuntil no more sites can be added to the "core".  A report of each
XXiteration and what paths are added to the core is printed on the
XXstandard output.

XXI've supplied two "initial cores" you can use; "6sites" includes
XXihnp4, seismo, decvax, decwrl, tektronix, and sdcrdcf, and
XX"backbone-list" contains all sites on Gene Spafford's backbone map.

XXTry the commands

XXgetpaths net > net.paths
XXclosure 6sites net.paths 20

XXor

XXclosure backbone-list net.paths 20

XXI'm interested in seeing the results for sites far from the SF Bay
XXArea.  You can write me at

XX		oliveb-\
XX			\
XX	ihnp4---pesnta--epimass!jbuck
XX		/
XX	hplabs-/
@//E*O*F README//
chmod u=rw,g=rw,o=rw README
 
echo x - backbone-list
sed 's/^XX//' > "backbone-list" <<'@//E*O*F backbone-list//'
XXBACKBONE!cbosgd!clyde!watmath!utzoo
XXBACKBONE!ihnp4!cuae2!alberta!ubc-vision!burl!ulysses!bellcore
XXBACKBONE!uw-beaver!tektronix!decvax!linus
XXBACKBONE!qantel!hplabs!oliveb!glacier!decwrl
XXBACKBONE!sdcrdcf!hao!sdcsvax!drillsys!gatech!akgua!mcnc!philabs
XXBACKBONE!lll-lcc!lll-crg!cmcl2!seismo
@//E*O*F backbone-list//
chmod u=rw,g=rw,o=rw backbone-list
 
echo x - closure
sed 's/^XX//' > "closure" <<'@//E*O*F closure//'
XX#! /bin/sh
XX# closure: Joseph T. Buck, Entropic Processing, Inc.
XX# Usage: closure startfile paths thresh
XX# This rather awk-ward script builds a "transitive closure" of a backbone.
XXif test $# != 3
XXthen echo Usage: $0 backbone-list paths threshold >&2; exit 1
XXfi
XX# The file /tmp/$$c contains the current backbone list; /tmp/$$a contains
XX# The latest set of additions.
XXcp $1 /tmp/$$c
XXecho "------ Original backbone: ------"
XXawk -F! '{ for (i=1; i<=NF; i++) if ($i != "BACKBONE") printf ("%s ",$i); }
XXEND { printf ("\n");}' /tmp/$$c
XXi=1
XXwhile true
XXdo
XX   awk -F! '{
XX    if ($1 == "BACKBONE") {
XX	for (i = 2; i <= NF; i++)
XX	    backbone[$i] = 1;
XX    }
XX    else {
XX	i = 1;
XX	while (i <= NF && !backbone[$i])
XX	    i++;
XX	if (i > NF)
XX	    next;
XX	while (1) {
XX	    while (i <= NF && backbone[$i])
XX		i++;
XX	    if (i > NF)
XX		break;
XX	    first = i - 1;
XX	    while (i <= NF && !backbone[$i])
XX		i++;
XX	    if (i > NF)
XX		break;
XX	    for (j = first; j < i; j++)
XX		printf ("%s!", $j);
XX	    printf ("%s\n", $i);
XX	}
XX    }
XX}' /tmp/$$c $2 | sort | uniq -c |\
XX   awk '{ if ($1 >= thr) printf("BACKBONE!%s\n", $2); }' thr=$3 - > /tmp/$$a
XX   if test ! -s /tmp/$$a
XX     then echo "------ Final backbone: ------"
XX	  awk -F! '{for (i=2; i<NF; i++) print $i}' /tmp/$$c | sort -u
XX	  rm /tmp/$$c /tmp/$$a
XX	  exit 0
XX   fi
XX   echo "------ Pass $i: add the following: ------"
XX   awk -F! '{ for (i=2; i<NF; i++) printf("%s!", $i); printf("%s\n", $NF);}' /tmp/$$a
XX   cat /tmp/$$a >> /tmp/$$c
XX   i=`expr $i + 1`
XXdone

@//E*O*F closure//
chmod u=rwx,g=rwx,o=rwx closure
 
echo x - getopt.c
sed 's/^XX//' > "getopt.c" <<'@//E*O*F getopt.c//'
XX/*
XX * getopt - get option letter from argv
XX * by Henry Spencer
XX * posted to Usenet net.sources list
XX */
XX#include <stdio.h>

XXchar	*optarg;	/* Global argument pointer. */
XXint	optind = 0;	/* Global argv index. */

XXstatic char	*scan = NULL;	/* Private scan pointer. */

XXextern char	*index();

XXint
XXgetopt (argc, argv, optstring)
XXint     argc;
XXchar   *argv[];
XXchar   *optstring;
XX{
XX    register char   c;
XX    register char  *place;

XX    optarg = NULL;

XX    if (scan == NULL || *scan == '\0') {
XX	if (optind == 0)
XX	    optind++;

XX	if (optind >= argc || argv[optind][0] != '-' || argv[optind][1] == '\0')
XX	    return (EOF);
XX	if (strcmp (argv[optind], "--") == 0) {
XX	    optind++;
XX	    return (EOF);
XX	}
XX	scan = argv[optind] + 1;
XX	optind++;
XX    }
XX    c = *scan++;
XX    place = index (optstring, c);

XX    if (place == NULL || c == ':') {
XX	fprintf (stderr, "%s: unknown option -%c\n", argv[0], c);
XX	return ('?');
XX    }
XX    place++;
XX    if (*place == ':') {
XX	if (*scan != '\0') {
XX	    optarg = scan;
XX	    scan = NULL;
XX	}
XX	else {
XX	    optarg = argv[optind];
XX	    optind++;
XX	}
XX    }
XX    return (c);
XX}
@//E*O*F getopt.c//
chmod u=rw,g=rw,o=rw getopt.c
 
echo x - getpaths.1
sed 's/^XX//' > "getpaths.1" <<'@//E*O*F getpaths.1//'
XX.TH GETPATHS LOCAL 6/2/86
XX.SH NAME
XXgetpaths \- gather Usenet article paths
XX.SH SYNOPSIS
XX.B getpaths
XX[
XX.BI \-n " file"
XX] [
XX.BI \-u
XX] [
XX.I distrib
XX]
XX.SH DESCRIPTION
XX.PP
XX.I Getpaths
XXscans through the news spool and prints out all article paths.  Crossposted
XXarticles have their paths printed only once.
XX.PP
XXBy default, usernames are removed from the paths.  If the
XX.B \-u
XXflag is given, usernames are left on.
XX.PP
XXThe
XX.BI \-n " file"
XXoption causes paths to be included only from articles that are newer
XX(in terms of file modification time) than
XX.IR file .
XX.PP
XXIf no argument is given, the entire news spool will be scanned.  The
XX.I distrib
XXargument is a relative path from /usr/spool/news.  For example,
XX.PP
XXgetpaths net
XX.PP
XXgets paths from net.*, and
XX.PP
XXgetpaths net/micro/atari
XX.PP
XXdoes what you think.
XX.SH BUGS
XXThe algorithm for eliminating crosspostings only prints a path if
XXthe first newsgroup on the Newsgroups: line matches the directory currently
XXbeing scanned.  So if you say "getpaths na", an article with the header
XXline
XX.PP
XXNewsgroups: net.wanted,na.forsale
XX.PP
XXwill not have its path printed (though it will print if the groups are given
XXin the reverse order).  In other words, the first group is always considered
XXthe primary group, and other groups are always considered crossposts.
XX.SH AUTHOR
XXJoseph T. Buck, Entropic Processing, Inc.
XX.br
XX{pesnta,oliveb}!epimass!jbuck
@//E*O*F getpaths.1//
chmod u=rw,g=rw,o=rw getpaths.1
 
echo x - getpaths.c
sed 's/^XX//' > "getpaths.c" <<'@//E*O*F getpaths.c//'
XX/*
XX	getpaths.c
XX	Program to snarf up lots of Usenet paths
XX	Joseph T. Buck, Entropic Processing, Inc.
XX*/
XX#define SPOOLDIR "/usr/spool/news"
XX#include <stdio.h>
XX#define LINLEN 512
XXchar *strcpy();

XXint stripuser = 1;

XX/* This function assumes the current directory is SPOOLDIR, and that
XX   the paths are of the form net/news/group/2345.
XX*/
XXvoid perror(), exit();

XXwp (name, ino)
XXchar *name;
XXint ino;
XX{
XX    int status = 0, lng;
XX    char line[LINLEN], savpath[LINLEN];
XX    FILE *fd;
XX    char *p, *q, *rindex();
XX    char curngr[40];	/* current newsgroup */

XX/* Figure out the current newsgroup, by deleting whatever is
XX   after the last /, and changing '/' to '.'. Skip initial "./"
XX   if present
XX*/
XX    if (name[0] == '.' && name[1] == '/') name += 2;
XX    (void) strcpy (curngr, name);
XX    p = rindex (curngr, '/');
XX    if (p == NULL) return;
XX    *p = 0;
XX    while (p >= curngr) {
XX	if (*p == '/') *p = '.';
XX	p--;
XX    }
XX    lng = strlen (curngr);
XX/* Open the file */
XX    if ((fd = fopen (name, "r")) == NULL) {
XX	perror (name);
XX	return;
XX    }
XX    while (status != 3) {
XX	if (fgets (line, LINLEN, fd) == NULL) break;
XX	line[strlen(line)-1] = 0; /* delete \n */
XX	if (strncmp (line, "Newsgroups:", 11) == 0) {
XX	    p = line + 11;
XX	    while (*p == ' ' || *p == '\t') p++;
XX/* Detect cases like curngr == "net.micro", NG == "net.micro.pc,net.micro" */
XX	    if (p[lng] && p[lng] != ',' && p[lng] != ' ' && p[lng] != '\t')
XX		break;
XX	    else if (strncmp (p, curngr, lng) != 0) break;
XX	    status |= 2;
XX	}
XX	else if (strncmp (line, "Path:", 5) == 0) {
XX	    p = line + 5;
XX	    while (*p == ' ' || *p == '\t') p++;
XX/* Strip the username from the end of the path. Don't write anything
XX   if there's no ! in the path (a locally posted article)
XX*/
XX	    if (stripuser) {
XX		q = rindex (line, '!');
XX		if (q == NULL) break;
XX		*q = '\0'; /* Delete the tail */
XX	    }
XX	    (void) strcpy (savpath, p);
XX	    status |= 1;
XX	}
XX    }
XX    if (status == 3) (void) printf ("%s\n", savpath);
XX    (void) fclose (fd);
XX    return;
XX}

XXmain (argc, argv)
XXchar **argv;
XX{
XX    if (chdir (SPOOLDIR) != 0) {
XX	(void) fprintf (stderr, "Can't chdir to ");
XX	perror ("/usr/spool/news");
XX	exit (1);
XX    }
XX    if (argc > 1 && strcmp (argv[1], "-u") == 0) {
XX	stripuser = 0;
XX	argv++;
XX	argc--;
XX    }
XX    if (argc !=2) argv[1] = ".";
XX    scan_tree (argv[1], wp);
XX    return 0;
XX}

XX#include <sys/types.h>
XX#include <sys/stat.h>

XXis_dir (name)
XXchar *name;
XX{
XX    struct stat sbuf;
XX    return (stat (name, &sbuf) == 0 && (sbuf.st_mode & S_IFMT) == S_IFDIR);
XX}
@//E*O*F getpaths.c//
chmod u=rw,g=rw,o=rw getpaths.c
 
echo x - histog
sed 's/^XX//' > "histog" <<'@//E*O*F histog//'
XX#! /bin/sh
XXawk -F! '{ count[NF] += 1; if (NF > mx) mx = NF; sum += NF; }
XXEND { printf ("Path length statistics\nAverage length: %f\nHistogram:\n",
XX		sum / NR);
XX      for (i = 1; i <= mx; i++) printf ("%d %d\n",i,count[i]); }'
@//E*O*F histog//
chmod u=rwx,g=rwx,o=rwx histog
 
echo x - histog.mod
sed 's/^XX//' > "histog.mod" <<'@//E*O*F histog.mod//'
XX#! /bin/sh
XXawk -F! '{ count[NF] += 1; if (NF > mx) mx = NF; sum += NF; }
XXEND {
XX    printf "Path length statistics\nAverage length: %f\nHistogram:\n", sum / NR;
XX    for (i = 1; i <= mx; i++) printf ("%d %d\n",i,count[i]);
XX}'
@//E*O*F histog.mod//
chmod u=rwx,g=rwx,o=rwx histog.mod
 
echo x - hostcount
sed 's/^XX//' > "hostcount" <<'@//E*O*F hostcount//'
XX#! /bin/sh
XX# This script takes a list of paths as input.
XX# It prints a list of the sites encountered in the paths, sorted
XX# according to their frequency.
XXawk -F! '{ for (i = 1; i <= NF; i++) print $i }' $* |\
XXsort | uniq -c | sort -nr



@//E*O*F hostcount//
chmod u=rwx,g=rwx,o=rwx hostcount
 
echo x - makefile
sed 's/^XX//' > "makefile" <<'@//E*O*F makefile//'
XX# This makefile is appropriate for 4.2bsd.  For SysV or V7,
XX# replace "scantree" by "oscantree" everywhere.
XXCFLAGS = -O
XX# Use the following for Sys V.
XX# CFLAGS = -O -Dindex=strchr -Drindex=strrchr

XXgetpaths: getpaths.o scantree.o 
XX	cc $(CFLAGS)  getpaths.o scantree.o  -o getpaths
@//E*O*F makefile//
chmod u=rw,g=rw,o=rw makefile
 
echo x - makefile.42
sed 's/^XX//' > "makefile.42" <<'@//E*O*F makefile.42//'
XX# This makefile is appropriate for 4.2bsd or 4.3bsd.
XXCFLAGS = -O

XXgetpaths: getpaths.o scantree.o getopt.o
XX	cc $(CFLAGS)  getpaths.o scantree.o getopt.o  -o getpaths
@//E*O*F makefile.42//
chmod u=rw,g=rw,o=rw makefile.42
 
echo x - makefile.sysv
sed 's/^XX//' > "makefile.sysv" <<'@//E*O*F makefile.sysv//'
XX# This makefile is appropriate for System III and System V.
XXCFLAGS = -O -Dindex=strchr -Drindex=strrchr

XXgetpaths: getpaths.o oscantree.o 
XX	cc $(CFLAGS)  getpaths.o oscantree.o  -o getpaths
@//E*O*F makefile.sysv//
chmod u=rw,g=rw,o=rw makefile.sysv
 
echo x - makefile.v7
sed 's/^XX//' > "makefile.v7" <<'@//E*O*F makefile.v7//'
XX# This makefile is appropriate for Version 7, or any system that
XX# has neither getopt nor the 4.2 directory library.

XXCFLAGS = -O

XXgetpaths: getpaths.o oscantree.o 
XX	cc $(CFLAGS)  getpaths.o oscantree.o  -o getpaths
@//E*O*F makefile.v7//
chmod u=rw,g=rw,o=rw makefile.v7
 
echo x - oscantree.c
sed 's/^XX//' > "oscantree.c" <<'@//E*O*F oscantree.c//'
XX/*
XX   scan_tree: Joseph T. Buck, Entropic Processing, Inc.
XX	      Old-style version

XX   scan_tree applies a function to each file in a directory tree
XX   (excluding directories), and returns the number of files processed.
XX   This version reads directories directly and will not work with 4.2bsd.
XX*/
XX#include <sys/types.h>
XX#include <sys/dir.h>
XX#define MAXL 256


XXscan_tree (path, func)
XXchar *path;
XXint (*func)();
XX{
XX    char name[MAXL], *strcpy(), *strncpy(), *pos;
XX    struct direct dirent;
XX    int fd = open (path), nf = 0, len;

XX    if (fd < 0) return 0;
XX    
XX    (void) strcpy (name, path);
XX    len = strlen (path);
XX    name[len] = '/';
XX    pos = name + len + 1;
XX    pos[DIRSIZ] = '\0'; /* req'd in case dirent has length DIRSIZ */
XX    while (read (fd, (char *) &dirent, sizeof dirent) == sizeof dirent) {
XX	if (dirent.d_ino == 0) continue;
XX	(void) strncpy (pos, dirent.d_name, DIRSIZ);
XX	if (strcmp (pos, ".") == 0 || strcmp (pos, "..") == 0) continue;
XX	if (is_dir (name)) nf += scan_tree (name, func);
XX	else {
XX	    (func) (name, dirent.d_ino);
XX	    nf += 1;
XX	}
XX    }
XX    (void) close (fd);
XX    return nf;
XX}
@//E*O*F oscantree.c//
chmod u=rw,g=rw,o=rw oscantree.c
 
echo x - scantree.c
sed 's/^XX//' > "scantree.c" <<'@//E*O*F scantree.c//'
XX/*
XX   scan_tree: Joseph T. Buck, Entropic Processing, Inc.
XX	      New-style version

XX   scan_tree applies a function to each file in a directory tree
XX   (excluding directories), and returns the number of files processed.
XX   This version uses the 4.2bsd directory access library; another version
XX   that reads directories directly is available.
XX*/
XX#include <sys/types.h>
XX#include <sys/dir.h>
XX#define MAXL 256
XXscan_tree (path, func)
XXchar *path;
XXint (*func)();
XX{
XX    char name[MAXL], *strcpy(), *pos;
XX    DIR *dirp = opendir (path);
XX    struct direct *de;
XX    int nf = 0, len;

XX    if (dirp == NULL) return 0;
XX    (void) strcpy (name, path);
XX    len = strlen (path);
XX    name[len] = '/';
XX    pos = name + len + 1;
XX    while ((de = readdir (dirp)) != NULL) {
XX	if (strcmp (de->d_name, ".") == 0 || strcmp (de->d_name, "..") == 0)
XX	    continue;
XX	(void) strcpy (pos, de->d_name); 
XX	if (is_dir (name)) nf += scan_tree (name, func);
XX	else {
XX	    (func) (name,de->d_ino);
XX	    nf += 1;
XX	}
XX    }
XX    (void) closedir (dirp);
XX    return nf;
XX}
@//E*O*F scantree.c//
chmod u=rw,g=rw,o=rw scantree.c
 
echo x - threelink
sed 's/^XX//' > "threelink" <<'@//E*O*F threelink//'
XX#! /bin/sh
XX# This script takes a list of paths as input.
XXcat $* |\
XXawk -F! '{ for (i=1; i<NF-1; i++) printf ("%s!%s!%s\n", $i,$(i+1),$(i+2)); }' |\
XXsort | uniq -c | sort -nr
@//E*O*F threelink//
chmod u=rwx,g=rwx,o=rwx threelink
 
echo x - twolink
sed 's/^XX//' > "twolink" <<'@//E*O*F twolink//'
XX#! /bin/sh
XX# This script takes a list of paths as input.
XXcat $* |\
XXawk -F! '{ for (i = 1; i < NF; i++) printf ("%s!%s\n", $i, $(i+1)); }' |\
XXsort | uniq -c | sort -nr



@//E*O*F twolink//
chmod u=rwx,g=rwx,o=rwx twolink
 
exit 0