[news.software.b] Canadian region analysis software

mason@tmsoft.UUCP (07/08/87)
Here's the stuff I promised.  I would like to hear of non-portabilities in the
code, and any sites that you KNOW are in Canada that aren't in the list.
I will be on vacation until early August.  At that time I will re-post this
with fixes to the whole net (in case other people want to analyse their
gateways) and ask uucp/news admins around the country to send me their results.
So, play with these for a month, look at the list of your benefactors (you
could even send them a thank-you note :-) & I'll talk to you in August.
	Have a good July
	../Dave Mason,	TM Software Associates	(Compilers & System Consulting)
	..!{utzoo seismo!mnetor utcsri utgpu lsuc}!tmsoft!mason

#! /bin/sh
# This is a shell archive, meaning:
# 1. Remove everything above the #! /bin/sh line.
# 2. Save the resulting text in a file.
# 3. Execute the file with /bin/sh (not csh) to create the files:
#	README.stats
#	Makefile
#	can.sites
#	stats.c
# This archive created: Wed Jul  8 09:11:53 1987
export PATH; PATH=/bin:$PATH
if test -f 'README.stats'
then
	echo shar: will not over-write existing file "'README.stats'"
else
cat << \SHAR_EOF > 'README.stats'
This program is written to analyse news and mail paths in order to
determine the mail and news gateways into a region.  It takes
a list of file names on standard input, and will scan each of the files
looking for headers of the form:
	Path:
	Date:
	Posted:
	Received:
	From_
	>From_
and extracts dates and site-names from these headers.  It does a fairly
simple analysis on the dates & splits the transit period into various
categories (note that it only understands 2 date formats, but fortunately
almost all the dates we have here fit one of the formats).
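
As a rough illustration of the transit categories, here is a minimal
sketch.  The function name transit_bucket and the limit table are
made up for this example; the cut-off values are the ones stats.c uses,
and both times are assumed to be seconds since the epoch, already
converted to GMT.

```c
#include <time.h>

/* Hypothetical helper (not part of stats.c): map a sent/received pair to a
 * transit category.  0 means "time warp" (received before it was sent),
 * 1 means under an hour, and 10 means a week or more. */
int transit_bucket(time_t sent, time_t recvd)
{
    /* Upper limits, in minutes, for categories 1 through 9. */
    static const long limit[] = {
        60, 4*60, 8*60, 12*60, 24*60, 48*60, 72*60, 120*60, 7*24*60
    };
    long minutes;
    int i;

    if (recvd < sent)
        return 0;
    minutes = (long)(recvd - sent) / 60;
    for (i = 0; i < 9; ++i)
        if (minutes < limit[i])
            return i + 1;
    return 10;
}
```

An article that arrives exactly one hour after it was sent therefore
lands in the "1-3 hours" category, not "<1 hour".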

A new option (thanks to Rayan's hassling) is the -p option.  This says
that standard input is a sequence of paths, one per line.  If the 'p' is
immediately followed by a character, everything up to and including
the first occurrence of that character on each input line will be ignored.
This means that if you have a list of mail paths that you have collected
somewhere you can see an analysis of them.  A neat use of this is to analyse
your pathalias database (if you have one) to see who your benefactors are
for outgoing mail.  A command of the form:
	./stats -p'	' can.sites </usr/lib/uucp/paths
will do this.
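
The prefix-skipping rule can be sketched as follows.  The name
path_part is invented for this example, and the sample line in the
usage note assumes a paths file laid out as a name, a tab, and then
the path.

```c
#include <string.h>

/* Hypothetical helper (not part of stats.c): return the part of `line`
 * that follows the first occurrence of `delim`.  If `delim` is 0, or does
 * not occur in the line, the whole line is returned unchanged. */
const char *path_part(const char *line, int delim)
{
    const char *cp;

    if (delim != 0 && (cp = strchr(line, delim)) != NULL)
        return cp + 1;
    return line;
}
```

With a tab as the delimiter, a line such as "lsuc<TAB>utzoo!lsuc!%s"
would be analysed as the path "utzoo!lsuc!%s".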

It analyses the paths to determine how much mail/news is local, how much
is entirely within the region, how much was brought into the region
directly, and how much came via each of the up to 28 gateways.
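
The classification can be sketched roughly as below.  This is a
simplified illustration, not the code in stats.c: the names classify,
in_region and region are invented, the three-site table stands in for
can.sites, and the sketch skips the case folding, ".uucp" stripping and
domain matching that the real program does.

```c
#include <string.h>

enum path_class { PATH_LOCAL, PATH_REGIONAL, PATH_DIRECT, PATH_GATEWAY };

/* Tiny stand-in for can.sites; the real list is much longer. */
static const char *region[] = { "lsuc", "mnetor", "utzoo" };

/* Return 1 if the n-character name at s is in the region table. */
static int in_region(const char *s, size_t n)
{
    size_t i;

    for (i = 0; i < sizeof region / sizeof region[0]; ++i)
        if (strlen(region[i]) == n && strncmp(region[i], s, n) == 0)
            return 1;
    return 0;
}

/* Classify a bang path, nearest site first.  For PATH_GATEWAY the name of
 * the last in-region site before the first outside site is copied to gw. */
enum path_class classify(const char *path, char *gw, size_t gwsize)
{
    const char *cp = path, *bang, *prev = NULL;
    size_t prevlen = 0;

    if (gwsize > 0)
        gw[0] = '\0';
    if (strchr(path, '!') == NULL)
        return PATH_LOCAL;              /* no hops at all */
    while ((bang = strchr(cp, '!')) != NULL) {
        size_t n = (size_t)(bang - cp);

        if (!in_region(cp, n)) {
            if (prev == NULL)
                return PATH_DIRECT;     /* first hop is already outside */
            if (gwsize > 0) {
                if (prevlen >= gwsize)
                    prevlen = gwsize - 1;
                memcpy(gw, prev, prevlen);
                gw[prevlen] = '\0';
            }
            return PATH_GATEWAY;
        }
        prev = cp;
        prevlen = n;
        cp = bang + 1;
    }
    return PATH_REGIONAL;               /* every hop was in the region */
}
```

With this table, "utzoo!seismo!ihnp4!user" is counted against the
gateway utzoo, while "seismo!mnetor!mason" counts as brought in
directly.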

The point of all this is to see if an alternate organization of the network
in the region would be in some sense "better".

The data file can.sites must have the list of sites in any order, one per
line.  Blank lines, and lines starting with '#', are ignored.  Domains that
are entirely within the region can be listed.
The program looks sites up with a binary search, so it first sorts the list.
(The search routine is, unfortunately, included in the source because BSD
Unix doesn't come with the search library.)
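
On a system whose C library does supply both qsort and bsearch (an
assumption; see the remark about BSD above), the sort-then-search
step, including the retry on trailing domains, might look like the
sketch below.  The names sort_sites, site_known and cmp_name are
invented for this example and do not appear in stats.c.

```c
#include <stdlib.h>
#include <string.h>

#define SITESIZE 20     /* fixed record width, as in stats.c */

/* Compare two fixed-width name records (or a key and a record). */
static int cmp_name(const void *a, const void *b)
{
    return strcmp((const char *)a, (const char *)b);
}

/* Sort n fixed-width site names in place so they can be searched. */
void sort_sites(char (*sites)[SITESIZE], size_t n)
{
    qsort(sites, n, SITESIZE, cmp_name);
}

/* Return 1 if `name`, or any trailing domain of it such as ".toronto.edu",
 * appears in the sorted list; otherwise return 0. */
int site_known(char (*sites)[SITESIZE], size_t n, const char *name)
{
    const char *p = name;

    while (p != NULL) {
        if (bsearch(p, sites, n, SITESIZE, cmp_name) != NULL)
            return 1;
        /* Move on to the next '.' so that the shorter suffix is tried. */
        p = (*p != '\0') ? strchr(p + 1, '.') : NULL;
    }
    return 0;
}
```

Listing ".toronto.edu" once in can.sites then covers a name such as
"ai.toronto.edu" without every host having to be listed.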

This is part of an effort by /usr/group/cdn to determine if it should support
a Canadian Hub node.  (Note that mail/news comes from many sites that are
not listed in the maps; you may therefore want to edit the can.sites file
to make it reflect your local reality.  I'd still like to know paths to
such sites so that any future comments can go out to all.)

If you know of errors in the can.sites file (either sites I have listed
as Canadian that aren't, or sites that I have omitted that are) please
send me these amendments.  I will send this out again, with an amended
list, and any bug fixes, to comp.sources.unix (or something similar)
in August, so please try this.
SHAR_EOF
fi # end of overwriting check
if test -f 'Makefile'
then
	echo shar: will not over-write existing file "'Makefile'"
else
cat << \SHAR_EOF > 'Makefile'
#debugging on SysV: 
CFLAGS=-g -O
#for BSD: CFLAGS=-O -DBSD
#for SysV: CFLAGS=-O

test:	stats
#	find /u/*/Mail -type f -print | ./stats can.sites
#	find /usr/spool/news -type f -print | ./stats can.sites
	./stats -p'	' can.sites </usr/lib/uucp/paths

stats:	stats.c
	${CC} ${CFLAGS} -o stats stats.c
SHAR_EOF
fi # end of overwriting check
if test -f 'can.sites'
then
	echo shar: will not over-write existing file "'can.sites'"
else
cat << \SHAR_EOF > 'can.sites'
# assembled by Dave Mason <mason@tmsoft> from map data 87.07.07
# don't hold me responsible, but I think these are most of
# the Canadian sites that are directly reachable

# the following sites are 'DIRECT' calls to listed sites
# so presumably they are Canadian sites
# In any case they're not well connected, so shouldn't mess up the stats

# local to alberta sites
acsedm
astotin
cadomin
ggc0

# local to british columbia sites
attvcr
bby-bc
fornax
#all these are linked to mprg, I guess
dssmv0
handel
joplin
liszt
mprc
mprd
mprott
waters

# local to ontario sites
actel
alias
bml
bnr-ai
bnr-mtl
bnr-rsc
bvax
# there's a problem here. zorac talks to daq, but there's a daq in texas
daq
cmpscr

# local to quebec sites
cambs

# miscellany
mkv020
utihs
redvax
utmars
sunedm
cavell
uwocc1
uvicar
winston
uogvax2
shoshin
vending
bruno
yup
wildcan
attcan
jasper
spyvan
manitou
orchid

# domains - all I know about
.cdn
.sq.com
.toronto.edu
.unicus.com
.waterloo.edu
# this isn't official, but it is referenced somewhere
.can

sq.com
unicus.com
aesat
alberta
alpha0
amigpx
aquila
arcsun
arlene
array
attila
aucs
auvax
biomel
blues.db
bnr-di
bnr-vpa
braegen
cae
calgary
cdl
chp
clan
cle
clouso
clunk
cmq1
cnrail
cognos
cott01
crcmar
csb
dalcs
dalcsug
dalstat
danger
darwin
dataspan
daver
dciem
deepthot
dmnhack
dvlmarv
eclectic
edm
electro
elora
ers
et
force10
forgen
gandalf
garfield
gass
geac
gen1
genat
hcr
hcrvax
hcrvx1
hcrvx2
idacom
image.me
iros1
jazz.db
julian
kimnovax
lathe.me
lethe
lightning
looking
loyalist
lsuc
maccs
marcel
mars.math
math
matrix
mcgill-vision
melody.db
methods
mgvax
micomvax
mill.me
mks
mmainc
mnetor
molihp
moore
mosart
mosca
mprg
mprvaxa
mprvaxb
ms
msitxt
munucs
musocs
ncc
nermal
nrcaer
nrcctis
nte-scg
nvanbc
odie
odyssee
ois.db
onfcanim
ontmoh
orcisi
oscvax
othervax
parkridge
parkwood
pcchui.db
pcfred.db
pcs
pcssun
per
pmbrc
psddevl
quality
qucis
radha
ragno
regina
rhodnius
rhythm.db
rom
ryenat
ryesone
sask
sfucmpt
sickkids
skalar
skatter
skeng
skerth
skorpio
skul
skvlsi
skyblu
spectrix
spycal
sq
sqrt
squad
squat
squawk
squeak
squish
stars
stjoes
strat
syntron
tango.db
tap.me
tcc3b1
teknica
teletron
telly
thunder
tmsoft
trigraph
tsgfred
tslanpar
tunscs
unicus
uottawa
uqv-mts
uvicctr
van-bc
vivarium
xicom
xios
yetti
yugauss
yumath
zap
zaphod
zorac

# U of British Columbia
ubc-andrew
ubc-bdcvax
ubc-cryos
ubc-cs
ubc-cs4
ubc-csfs1
ubc-csgrads
ubc-dsrg
ubc-ean
ubc-mts
ubc-rtec
ubc-ug1
ubc-ug2
ubc-ug3
ubc-ug4
ubc-ug5
ubc-ugserver
ubc-vision

# U of Toronto
# lots of these weren't listed, but they're pretty obvious
# though 'me', 'ecf' shouldn't be on news path names
# and I'm not sure about these *.ai names
ephemeral.ai
graeme.ai
maria.ai
ray.ai
thera.ai
utai
utas1
utas2
utas3
utcdfa
utcdfb
utcdfc
utcga
utcgb
utcseri
utcsri
utcssca
utcsscb
utdgp
ecf
utecf16k
utecfb
utecfe
utecfmv01
utecfmv02
utecfmv03
utecfmv04
utegc
uteuler
utfang
utflis
utfyzx
utglg
utgpu
# and by its old name
utcs
uthub
utjaws
utmanitou
me
utme
utmolar
utradio
utrim
utscar
utstat
utteeth
utterly.ai
uturing
utubrutus
utworm
utzoo

# University of Waterloo
watcgl
watdcsu
watdragon
wateng
water
watmath
wataco
watacs
watale
watarts
watbank
watbun
watcal
watcsg
watdaisy
watdcs
watimp
watlager
watlion
watmad
watmsg
watmum
watnot
watrose
watsos1
watsos2
watstat
watsup1
watvlach
watvlsi
watwml
SHAR_EOF
fi # end of overwriting check
if test -f 'stats.c'
then
	echo shar: will not over-write existing file "'stats.c'"
else
cat << \SHAR_EOF > 'stats.c'
/* Copyright 1987 TM Software Associates Inc, Toronto, Canada */
/* Cannot be sold for profit without permission from the copyright holder */

/* Scan mail and news to gather traffic statistics */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <ctype.h>
#include <strings.h>
extern char *strchr(),*strrchr();	/* Rayan says BSD needs this */
#define NNODES 400
#define NTYPES 32
#define PREST 35
#ifdef BSD
#define strchr index
#define strrchr rindex
#endif
char buff[1024];
#define SITESIZE 20
struct node{char name[SITESIZE];};
struct node can_sites[NNODES+5];
int datestats[NTYPES*2];
int pathstats[NTYPES*2];
char *sitenames[NTYPES]={"local path","Canadian path","Direct from Abroad"};
char pathrest[NTYPES][PREST+1];
int warning,nsites=3;
unsigned nnodes;
struct stat filestat;
main(argc,argv) int argc; char **argv; {
	char filename[512];
	register char *cp;
	register int ismail,i,total,pathflag=0;
	static char *time_periods[NTYPES] = {
		"time warp",
		"<1 hour",
		"1-3 hours",
		"4-7 hours",
		"8-11 hours",
		"12-24 hours",
		"2 days",
		"3 days",
		"4-5 days",
		"6-7 days",
		"a week or more",
		"?????11???",
		"?????12???",
		"?????13???",
		"?????14???"};
	time_periods[NTYPES-1]="invalid times";
	if (argc==3 && argv[1][0]=='-' && argv[1][1]=='p') {
		--argc;
		++argv;
		pathflag=argv[0][2];
		if (!pathflag)
			pathflag= -1;
	  }
	if (argc!=2) {
		fprintf(stderr,"Usage: stats [-p[c]] can.sites\n");
		exit(1);
	  }
	get_sites(argv[1]);
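	while (fgets(filename,sizeof filename,stdin)!=NULL) {
		if (cp=strchr(filename,'\n'))	/* fgets keeps the newline, gets did not */
			*cp='\0';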
		if (pathflag) {
			if (pathflag>0 && (cp=strchr(filename,pathflag)))
				++cp;
			else
				cp=filename;
			do_pathstats(1,cp);
			++datestats[NTYPES*2-1];
		  }
		else {
			if (stat(filename,&filestat)==0 &&
				(filestat.st_mode&S_IFMT)!=S_IFDIR)
				scan(filename);
		  }
	  }
	for (ismail=0;ismail<=NTYPES;ismail+=NTYPES) {
		total=0;
		for (i=0;i<NTYPES;++i)
			total += datestats[ismail+i];
		if (total) {
			if (pathflag)
				printf("\nPath Analysis\n");
			else {
			    if (ismail)
				printf("\nMail\n");
			    else
				printf("\nNews\n");
			    for (i=0;i<NTYPES;++i) {
				if (datestats[ismail+i])
					printf("	%20.20s %4d %3d%%\n",
						time_periods[i],
						datestats[ismail+i],
						datestats[ismail+i]*100/total);
			      }
			  }
			printf("   Gateways from outside Canada\n");
			for (i=0;i<NTYPES;++i) {
				if (pathrest[i][0] && (cp=strchr(pathrest[i],' ')))
					*cp='\0';
				if (pathstats[ismail+i])
					printf("        %20.20s %4d %3d%% %s\n",
						sitenames[i],
						pathstats[ismail+i],
						pathstats[ismail+i]*100/total,
						pathrest[i]);
			  }
		  }
	  }
	return(0);
}
scan(filename) char *filename; {
    register FILE *fp;
    char pathbuf[512],datebuf[512],postbuf[512],recvbuf[512],
	 frombuf[512],lastbuf[512];
    register char *cp;
    register int ismail,realmail=0;
    if ((fp=fopen(filename,"r"))!=NULL) {
	frombuf[0]='\0';
	for(;;) {
		pathbuf[0]='\0';
		datebuf[0]='\0';
		postbuf[0]='\0';
		recvbuf[0]='\0';
		lastbuf[0]='\0';	/* otherwise scandate may read an unset buffer */
		while (inline(fp)) {
			if (strncmp(buff,"Path:",5)==0) {
				if (cp=strchr(&buff[6],'!'))
					++cp;
				else
					cp = &buff[6];
				strcpy(pathbuf,cp);
			  }
			if (strncmp(buff,"Date:",5)==0)
				strcpy(datebuf,&buff[6]);
			if (strncmp(buff,"Posted:",7)==0)
				strcpy(postbuf,&buff[8]);
			if (!recvbuf[0] && strncmp(buff,"Received:",9)==0)
				strcpy(recvbuf,&buff[10]);
			if (strncmp(buff,"Received:",9)==0)
				strcpy(lastbuf,&buff[10]);
			if (strncmp(buff,">From ",6)==0)
				strcpy(lastbuf,&buff[6]);
			if (strncmp(buff,"From",4)==0 && (buff[4]==' ' || buff[4]=='\t')) {
				strcpy(frombuf,&buff[5]);
				++realmail;
			  }
			if (!frombuf[0] && strncmp(buff,"Unix-From:",10)==0)
				strcpy(frombuf,&buff[11]);
		  }
		if (!datebuf[0])
			strcpy(datebuf,postbuf);
		ismail=0;
		if (frombuf[0]) {
			++ismail;
			strcpy(pathbuf,frombuf);
		  }
		if (pathbuf[0]) {
			do_datestats(ismail,frombuf,recvbuf,datebuf,lastbuf);
			do_pathstats(ismail,pathbuf);
		  }
		if (!realmail)
			break;
		while (!feof(fp)) {
			(void) inline(fp);
			if (strncmp(buff,"From",4)==0 && (buff[4]==' ' || buff[4]=='\t')) {
				strcpy(frombuf,&buff[5]);
				break;
			  }
		  }
		if (feof(fp))
			break;
	  }
	fclose(fp);
      }
}
get_sites(fname) char *fname; {
	register char *cp;
	register FILE *fp;
	nnodes=0;
	if (fp=fopen(fname,"r")) {
		while(fgets(buff,sizeof buff,fp)) {
			/* read whole lines so long comments aren't split into bogus nodes */
			if (cp=strchr(buff,'\n'))
				*cp='\0';
			if (!buff[0] || buff[0]=='#')
				continue;
			if (nnodes>=NNODES) {
				fprintf(stderr,"stats: too many nodes in %s\n",fname);
				exit(1);
			  }
			if (strlen(buff)>=SITESIZE) {
				fprintf(stderr,"stats: node name too long: %s\n",buff);
				exit(1);
			  }
			strcpy(can_sites[nnodes].name,buff);
			++nnodes;
		  }
		fclose(fp);
		qsort((char *)can_sites,nnodes,SITESIZE,strcmp);
	  }
}		
char *bsearch(cp) char *cp; {
	register struct node *hp,*lp,*mp;
	register int i;
	hp = &can_sites[nnodes];
	lp = can_sites;
	while (hp>lp) {
		mp = ((hp-lp)>>1) + lp;
		if ((i=strcmp(mp->name,cp))==0)
			return (mp->name);
		else if (i<0)
			lp=mp+1;
		else
			hp=mp;
	  }
	return (NULL);
}
do_pathstats(ismail,pathbuf) char *pathbuf; {
	register int patht;
	register char *cp,*tp,*sp,*pp,*np;
	if (cp=strchr(pathbuf,'!')) {
#ifdef DEBUG
printf("%s",pathbuf);
#endif
		cp=pathbuf;
		sp=pp=NULL;
		while (np=strchr(cp,'!')) {
			*np='\0';
			for (tp=cp;tp<np;++tp)
				if (isupper(*tp))
					*tp=tolower(*tp);
			if ((tp=strrchr(cp,'.')) && strcmp(tp,".uucp")==0)
				*tp='\0';
			pp=sp;
			sp=bsearch(cp);
			tp=cp;
			while (!sp && (tp=strchr(tp+1,'.')))
				sp=bsearch(tp);
			if (!sp)
				break;
			cp=np+1;
		  }
		if (!pp) {
			if (np) {
				patht=2;
				if (!pathrest[patht][0])
					strncpy(pathrest[patht],cp,PREST);
			  }
			else
				patht=1;
		  }
		else if (np) {
			for (patht=3;patht<nsites;++patht)
				if (sitenames[patht]==pp)
					break;
			if (patht==nsites)
			    if (nsites<NTYPES) {
				sitenames[patht]=pp;
				*np='!';
				strncpy(pathrest[patht],cp,PREST);
				++nsites;
			      }
			    else
				sitenames[--patht]="other gateways";
#ifdef DEBUG
printf("%d %d %s\n",patht,nsites,pp);
#endif
		  }
		else
			patht=1;
	  }
	else
		patht=0;
	++pathstats[ismail*NTYPES+patht];
}
do_datestats(ismail,frombuf,recvbuf,datebuf,lastbuf)
    char *frombuf,*recvbuf,*datebuf,*lastbuf; {
	register int i;
	time_t sent,recvd;
	recvd=filestat.st_mtime;
	if (!scandate(frombuf,&recvd,"From"))
		(void) scandate(recvbuf,&recvd,"Recv");
	if (!datebuf[0] || (	!scandate(datebuf,&sent,"Date") &&
				!scandate(lastbuf,&sent,">From")))
		i=NTYPES-1;
	else if (recvd<sent)
		i=0;
	else {
		i=(recvd-sent)/60;
		if (i<60)
			i=1;
		else if (i<4*60)
			i=2;
		else if (i<8*60)
			i=3;
		else if (i<12*60)
			i=4;
		else if (i<24*60)
			i=5;
		else if (i<48*60)
			i=6;
		else if (i<72*60)
			i=7;
		else if (i<120*60)
			i=8;
		else if (i<7*24*60)
			i=9;
		else
			i=10;
	  }
#ifdef DEBUG
	printf("%d %ld %ld\n",i,(long)sent,(long)recvd);
#endif
	++datestats[ismail*NTYPES+i];
}
scandate(datebuf,when,what) char *datebuf,*what; time_t *when; {
	register char *ap,*cp=datebuf;
	register long i,j;
	char *mp,*dp,*yp,*tp;
	struct namenum{char name[3];int num;};
	static struct namenum timezones[] = {
		{"GMT",0},	{"NST",-4*60+30}, {"NDT",-3*60+30},
				{"AST",-4*60},	{"ADT",-3*60},
				{"EST",-5*60},	{"EDT",-4*60},
				{"CST",-6*60},	{"CDT",-5*60},
				{"MST",-7*60},	{"MDT",-6*60},
				{"PST",-8*60},	{"PDT",-7*60},
				{"YST",-9*60},	{"YDT",-8*60}};
	static struct namenum months[12] = {
		{"Jan",0},	{"Feb",31},	{"Mar",59},	{"Apr",90},
		{"May",120},	{"Jun",151},	{"Jul",181},	{"Aug",212},
		{"Sep",243},	{"Oct",273},	{"Nov",304},	{"Dec",334}};
	if (!*cp)
		return(0);
	while (*cp) {
		mp=NULL;
		if (!isdigit(*cp)) {
			if (isdigit(cp[5]) && cp[9]==':' && isupper(*cp) &&
				isalpha(cp[1]) && isalpha(cp[2]) &&
				isdigit(cp[17]) && isdigit(cp[18])) {
				mp=cp;
				dp=cp+4;
				if (*dp==' ')
					++dp;
				yp=cp+17;
				tp=cp+7;
			  }
		  }
		else {
			ap=cp+1;
			if (*ap==' ' && isdigit(*cp))
				ap=cp;
			if (isdigit(*ap) && ap[1]==' ' && isupper(ap[2]) &&
				isalpha(ap[3]) && isalpha(ap[4]) && ap[5]==' ' &&
				isdigit(ap[6]) && isdigit(ap[7])) {
					mp=ap+2;
					dp=cp;
					yp=ap+6;
					tp=ap+9;
				  }
		  }
		if (mp==NULL) {
			++cp;
			continue;
		  }
		j=atoi(yp);
		if (!islower(mp[1])) mp[1]=tolower(mp[1]);
		if (!islower(mp[2])) mp[2]=tolower(mp[2]);
		for (i=0;i<12;++i)
			if (strncmp(months[i].name,mp,3)==0) {
				i=months[i].num;
				if (((j&3)==0) && i>50)
					++i;
				break;
			  }
		if (i==12) {
			++cp;
			continue;
		  }
		if (j<100) j+=1900;
		i += (j-1969)/4;
		i += (j-1970)*365 + atoi(dp) - 1;
		i *= 24*60*60;
#ifdef DEBUG
printf("%s:%s ",what,cp);
#endif
		cp=ap+8;
		if (isdigit(tp[0]))
			i+=atoi(tp)*60*60;
		else
			i+=atoi(&tp[1])*60*60;
		i+=atoi(&tp[3])*60;
		ap = tp+5;
		if (*ap==':')
			ap += 3;
		while (*ap==' ')
			++ap;
		if (islower(*ap)) *ap=toupper(*ap);
		if (islower(ap[1])) ap[1]=toupper(ap[1]);
		if (islower(ap[2])) ap[2]=toupper(ap[2]);
		for (j=0;j<(sizeof timezones)/(sizeof(struct namenum));++j)
			if (strncmp(timezones[j].name,ap,3)==0) {
				/* offsets are minutes relative to GMT, so subtract to get GMT */
				*when = i - timezones[j].num*60;
#ifdef DEBUG
printf("%ld %s",(long)*when,asctime(gmtime(when)));
#endif
				return(1);
			  }
		++cp;
	  }
	if (warning)
		printf("No recognized date:%s\n",datebuf);
	return(0);
}
int inline(fp) register FILE *fp;{
	register char *cp=buff;
	register int c,flag=0;
	while ((c=getc(fp))!=EOF) {
		if (flag==0) {
			if (c=='\n')
				return(0);
			else
				++flag;
		  }
		else if (flag<0) {
			if (c!=' ' && c!='\t') {
				ungetc(c,fp);
				*cp='\0';
				return(1);
			  }
			else
				flag= 1-flag;
		  }
		if (c=='\n')
			flag= -flag;
		else
			*cp++ = c;
	  }
	*cp='\0';
	return(flag!=0);
}
SHAR_EOF
fi # end of overwriting check
#	End of shell archive
exit 0