[net.sources] New usite prog. w/ reg. exprs.

davy@ecn-ee.UUCP (06/04/84)

#N:ecn-ee:15100005:000:17206
ecn-ee!davy    Jun  3 21:02:00 1984



This program is used for reading the USENET map and finding specific
entries.  It was inspired by Russ Herman's (aesat!rwh) program, but
is somewhat more robust.

A site may be searched for by searching the entire entry, looking only
at a specific field of an entry, or by and'ing together searches through
multiple fields.  All searches support full regular expression matching.

To use this program, you'll have to:

	1. Run 4.2 BSD, have the directory(3) compatibility
	   routines, or write the scandir() routine yourself
	   (see the comments in the program).
	
	2. Have the PWB regular expression routines, get them,
	   or hack the program to use the "standard" ones.
	
	3. Have all the USENET map files extracted and contained
	   in a directory by themselves.

--Dave Curry
{decvax, ucbvax, ihnp4}!pur-ee!davy

/*---------------------- CUT HERE ------------------------------*/
echo x - usite.1
cat > usite.1 << !Funky!Stuff!
.TH USITE 1
.SH NAME
usite \- print information about \s-2USENET\s+2 sites
.SH SYNOPSIS
usite [- pat] [-n name] [-o organization] [-c contact]
.ti +5
[-P phone] [-p postal address] [-e elect. address]
.ti +5
[-N news] [-m mail]
.SH DESCRIPTION
.PP
.B Usite
reads the files composing the \s-2USENET\s+2 map,
in order to find specific sites as specified by its arguments.
The \s-2USENET\s+2 map is posted as a series of \`\`shell archives''
to the newsgroup
.I net.news.map
on the first of each month.
.PP
A map entry consists of the following fields:
.nf
.sp
.ta 5m
	\fBName:\fP the official site name, with optional nicknames
	\fBOrganization:\fP the name of the company, university, or group
	\fBContact:\fP the name of the Usenet contact person for the site
	\fBPhone:\fP the telephone number of the contact person
	\fBPostal-Address:\fP the paper mailing address of the contact person
	\fBElectronic-Address:\fP for the contact person
	\fBNews:\fP neighbors this site sends net.general to
	\fBMail:\fP sites for which there is a UUCP link or other ! syntax link.
.fi
.LP
Each of the flags above tells
.B usite
to attempt to match the following pattern against one of these
fields.
The \`\`\-'' argument means to try the match against the entire
map entry,
regardless of the separate fields.
The patterns are standard
.I "regular expressions" ,
as described in
.B ed(1) .
With the exception of the \`\`\-'' flag,
the flags are and'ed together to form a single expression.
.SH NOTE
.PP
The \s-2USENET\s+2 map lists only sites which receive the
.I net.announce
newsgroup.
It is
.I not
a map of the entire \s-2UUCP\s+2 network.
.SH EXAMPLES
.IP "usite -n pur-ee"
Print the information about any site with \`\`pur-ee'' as part
of its name.
.IP "usite -o 'A *T *& *T'"
Print the information about all sites owned by AT & T.
The space-asterisk combinations indicate that any number
of spaces (including zero) may separate the letters.
Note the use of single quotes,
which are necessary in order to prevent the shell from
interpreting the asterisks.
.IP "usite -o University -p Lafayette"
Print information about any site with the word \`\`University'' in
its organization name
.I and
the word \`\`Lafayette'' in its postal address.
.SH SEE ALSO
ed(1), grep(1), news(5)
.SH LIMITATIONS
.PP
A single map entry may be no longer than 32,768 characters.
.SH AUTHOR
David A. Curry ({decvax, ucbvax, ihnp4}!pur-ee!davy)
!Funky!Stuff!
echo x - usite.c
cat > usite.c << !Funky!Stuff!
/*
 * usite - print information about USENET sites
 *
 * This program assumes that a directory exists (USENETDIR) which
 * contains the various files extracted from the "shar"-format
 * USENET maps posted to net.news.map by Karen Summers-Horton.
 *
 * It reads the files in alphabetical order, searching for matching
 * entries.  An entry is requested by using the flags, followed
 * by regular expressions, one regular expression per flag.
 *
 * The flags and expressions, with the exception of the "-" flag,
 * are and'ed together, thus, by specifying two flags it is possible
 * to "narrow down" the information printed.  The "-" flag stands
 * for "match the whole entry", and thus cannot be narrowed down.
 *
 * The flags are:
 *	-n	Match the Name: field
 *	-o	Match the Organization: field
 *	-c	Match the Contact: field
 *	-P	Match the Phone: field
 *	-p	Match the Postal-Address: field
 *	-e	Match the Electronic-Address: field
 *	-N	Match the News: field
 *	-m	Match the Mail: field
 *	-	Match the entire entry
 *
 * This program is written for 4.2 BSD; it uses the scandir(3) routine.
 * This routine is passed the name of a directory, the address of an
 * argv-like array of pointers to type struct direct, the address of a
 * routine which returns non-zero if the entry is wanted, and the
 * address of a routine for qsort.  It builds an array of structures
 * containing the directory entries, using malloc(), and returns the number
 * of entries in the array.  If you don't have 4.2 or the 4.1
 * compatibility routines, you'll have to write this one.
 *
 * Also, I chose to use the PWB regular expression routines.  On Bell
 * UNIX systems, these routines should be in the -lPW library.  If you
 * run Berkeley, you'll have to either hack the program to use the
 * Berkeley regular expression routines (see regex(3)) or else get a
 * copy of the PWB routines.  
 *
 * (I have the routines, but I don't know about posting them, because of
 * licensing restrictions.  If someone can set me straight on what
 * license is required to see them, I might consider posting them.)
 *
 * David A. Curry, 6/3/84
 * {decvax, ucbvax, ihnp4}!pur-ee!davy
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/dir.h>
#include <stdio.h>

#define NINFO		3			/* no. of info strucs	*/
#define HEADERS		0
#define REGEXPR		1
#define POINTER		2
#define BUFSIZE		(32 * 1024)		/* max. entry length	*/
#define USENETDIR	"/e/davy/progs/usenet"	/* directory of files	*/

#define SRCH_NAME	0001
#define SRCH_ORGA	0002
#define SRCH_CONT	0004
#define SRCH_PHON	0010
#define SRCH_POST	0020
#define SRCH_ELEC	0040
#define SRCH_NEWS	0100
#define SRCH_MAIL	0200
#define SRCH_EVERY	0377

struct info {
	char *name;
	char *orga;
	char *cont;
	char *phon;
	char *post;
	char *elec;
	char *news;
	char *mail;
	char *every;
} ibuf[NINFO];

char *pname;
int srch_mode;
char linebuf[BUFSIZ];
char buffer[BUFSIZE];

main(argc, argv)
int argc;
char **argv;
{
	char *regcmp();
	
	/*
	 * Set up the header names.
	 */
	ibuf[HEADERS].name = "Name:";
	ibuf[HEADERS].orga = "Organization:";
	ibuf[HEADERS].cont = "Contact:";
	ibuf[HEADERS].phon = "Phone:";
	ibuf[HEADERS].post = "Postal-Address:";
	ibuf[HEADERS].elec = "Electronic-Address:";
	ibuf[HEADERS].news = "News:";
	ibuf[HEADERS].mail = "Mail:";
	
	pname = *argv;
	
	if (argc < 3)
		usage();

	/*
	 * Process the arguments.  For each flag/re pair, we
	 * or the appropriate flag into srch_mode and compile
	 * the regular expression.
	 */
	while (--argc > 0) {
		++argv;

		/*
		 * Every flag must be followed by a pattern.
		 */
		if ((*argv)[0] != '-' || argc < 2)
			usage();
		
		switch ((*argv)[1]) {
		case '\0':
			++argv; --argc;
			srch_mode |= SRCH_EVERY;
			if ((ibuf[REGEXPR].every = regcmp(*argv)) == NULL)
				regerror("- r.e.");
			break;
		case 'n':
			++argv; --argc;
			srch_mode |= SRCH_NAME;
			if ((ibuf[REGEXPR].name = regcmp(*argv)) == NULL)
				regerror("-n r.e.");
			break;
		case 'o':
			++argv; --argc;
			srch_mode |= SRCH_ORGA;
			if ((ibuf[REGEXPR].orga = regcmp(*argv)) == NULL)
				regerror("-o r.e.");
			break;
		case 'c':
			++argv; --argc;
			srch_mode |= SRCH_CONT;
			if ((ibuf[REGEXPR].cont = regcmp(*argv)) == NULL)
				regerror("-c r.e.");
			break;
		case 'P':
			++argv; --argc;
			srch_mode |= SRCH_PHON;
			if ((ibuf[REGEXPR].phon = regcmp(*argv)) == NULL)
				regerror("-P r.e.");
			break;
		case 'p':
			++argv; --argc;
			srch_mode |= SRCH_POST;
			if ((ibuf[REGEXPR].post = regcmp(*argv)) == NULL)
				regerror("-p r.e.");
			break;
		case 'e':
			++argv; --argc;
			srch_mode |= SRCH_ELEC;
			if ((ibuf[REGEXPR].elec = regcmp(*argv)) == NULL)
				regerror("-e r.e.");
			break;
		case 'N':
			++argv; --argc;
			srch_mode |= SRCH_NEWS;
			if ((ibuf[REGEXPR].news = regcmp(*argv)) == NULL)
				regerror("-N r.e.");
			break;
		case 'm':
			++argv; --argc;
			srch_mode |= SRCH_MAIL;
			if ((ibuf[REGEXPR].mail = regcmp(*argv)) == NULL)
				regerror("-m r.e.");
			break;
		default:
			usage();
			break;
		}
	}
	
	doit();
}

/*
 * doit - builds the listing of files, and runs through each one.
 */
doit()
{
	register int nent;
	register FILE *fp;
	struct direct **dbase;
	extern int select(), alphasort();
	
	/*
	 * Change into the directory so we don't have
	 * to construct pathnames.
	 */
	if (chdir(USENETDIR) < 0) {
		fprintf(stderr, "%s: cannot chdir into %s\n", pname, USENETDIR);
		exit(1);
	}
	
	/*
	 * Construct an array of struct direct's, pointed to
	 * by dbase.
	 */
	if ((nent = scandir(USENETDIR, &dbase, select, alphasort)) < 0) {
		fprintf(stderr, "%s: cannot open %s\n", pname, USENETDIR);
		exit(1);
	}
	
	/*
	 * For each file....
	 */
	for (; nent; nent--, dbase++) {
		/*
		 * Open the file.
		 */
		if ((fp = fopen((*dbase)->d_name, "r")) == NULL) {
			fprintf(stderr, "%s: cannot open %s\n", pname, (*dbase)->d_name);
			continue;
		}
		
		/*
		 * Position ourselves at the first Name: line.
		 * This is to skip over the initial Comments:
		 * line, etc.
		 */
		if (position(fp) == EOF) {
			fclose(fp);
			continue;
		}
		
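		/*
		 * Read an entry at a time, try to get
		 * a match, and if we do, print it.  The
		 * last entry in a file is the one readent()
		 * returns EOF on, so it has to be checked
		 * before we quit.  (This assumes the last
		 * entry is complete.)
		 */
		for (;;) {
			int status = readent(fp);

			if (matchent())
				printent();
			if (status == EOF)
				break;
		}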
		
		fclose(fp);
	}
}

/*
 * select - returns non-zero if we want this directory entry to
 *	    be stored, 0 otherwise.  We only want regular files.
 */
int select(d)
register struct direct *d;
{
	struct stat sbuf;
	
	if (stat(d->d_name, &sbuf) < 0)
		return(0);
	
	if ((sbuf.st_mode & S_IFMT) != S_IFREG)
		return(0);
	
	return(1);
}

/*
 * position - reads lines from fp until we get a line that starts
 *	      with "Name:".
 */
position(fp)
register FILE *fp;
{
	register int len;
	
	*linebuf = '\0';
	len = strlen(ibuf[HEADERS].name);
	
	while (getline(linebuf, BUFSIZ, fp) != NULL) {
		if (!strncmp(linebuf, ibuf[HEADERS].name, len))
			return(0);
	}
	
	return(EOF);
}

/*
 * readent - reads an entry into buffer.  Each field will be null
 *	     terminated; newlines may be embedded in the field.
 */
readent(fp)
register FILE *fp;
{
	char *concat();
	register int len;
	
	/*
	 * Name is the first field, it starts the
	 * buffer.  We first copy linebuf, which
	 * is left over from the position() call.
	 *
	 * The concat() routine is just like strcat(),
	 * except for two things:
	 *	- the first argument is a pointer to the
	 *	  END of the array to be added on to
	 *	- it returns a pointer to the NEW END of
	 *	  the array concatenated onto
	 */
	ibuf[POINTER].name = buffer;
	ibuf[POINTER].orga = concat(buffer, linebuf);
	
	/*
	 * Now we run through the file each time until
	 * we find the next field.  Each one of these
	 * while's is the same, except for the pointers
	 * and fields used.
	 */
	len = strlen(ibuf[HEADERS].orga);
	while (getline(linebuf, BUFSIZ, fp) != NULL) {
		if (!strncmp(linebuf, ibuf[HEADERS].orga, len)) {
			ibuf[POINTER].orga++;
			ibuf[POINTER].cont = concat(ibuf[POINTER].orga, linebuf);
			break;
		}
		else {
			ibuf[POINTER].orga = concat(ibuf[POINTER].orga, linebuf);
		}
	}

	len = strlen(ibuf[HEADERS].cont);
	while (getline(linebuf, BUFSIZ, fp) != NULL) {
		if (!strncmp(linebuf, ibuf[HEADERS].cont, len)) {
			ibuf[POINTER].cont++;
			ibuf[POINTER].phon = concat(ibuf[POINTER].cont, linebuf);
			break;
		}
		else {
			ibuf[POINTER].cont = concat(ibuf[POINTER].cont, linebuf);
		}
	}

	len = strlen(ibuf[HEADERS].phon);
	while (getline(linebuf, BUFSIZ, fp) != NULL) {
		if (!strncmp(linebuf, ibuf[HEADERS].phon, len)) {
			ibuf[POINTER].phon++;
			ibuf[POINTER].post = concat(ibuf[POINTER].phon, linebuf);
			break;
		}
		else {
			ibuf[POINTER].phon = concat(ibuf[POINTER].phon, linebuf);
		}
	}

	len = strlen(ibuf[HEADERS].post);
	while (getline(linebuf, BUFSIZ, fp) != NULL) {
		if (!strncmp(linebuf, ibuf[HEADERS].post, len)) {
			ibuf[POINTER].post++;
			ibuf[POINTER].elec = concat(ibuf[POINTER].post, linebuf);
			break;
		}
		else {
			ibuf[POINTER].post = concat(ibuf[POINTER].post, linebuf);
		}
	}

	len = strlen(ibuf[HEADERS].elec);
	while (getline(linebuf, BUFSIZ, fp) != NULL) {
		if (!strncmp(linebuf, ibuf[HEADERS].elec, len)) {
			ibuf[POINTER].elec++;
			ibuf[POINTER].news = concat(ibuf[POINTER].elec, linebuf);
			break;
		}
		else {
			ibuf[POINTER].elec = concat(ibuf[POINTER].elec, linebuf);
		}
	}

	len = strlen(ibuf[HEADERS].news);
	while (getline(linebuf, BUFSIZ, fp) != NULL) {
		if (!strncmp(linebuf, ibuf[HEADERS].news, len)) {
			ibuf[POINTER].news++;
			ibuf[POINTER].mail = concat(ibuf[POINTER].news, linebuf);
			break;
		}
		else {
			ibuf[POINTER].news = concat(ibuf[POINTER].news, linebuf);
		}
	}

	len = strlen(ibuf[HEADERS].mail);
	while (getline(linebuf, BUFSIZ, fp) != NULL) {
		if (!strncmp(linebuf, ibuf[HEADERS].mail, len)) {
			ibuf[POINTER].mail++;
			ibuf[POINTER].every = concat(ibuf[POINTER].mail, linebuf);
			break;
		}
		else {
			ibuf[POINTER].mail = concat(ibuf[POINTER].mail, linebuf);
		}
	}

	/*
	 * Now we have a complete buffer, except for extra
	 * lines in the Mail: field, and possibly a Comments:
	 * line or two.  When we find Name: again, we null
	 * terminate the buffer.  The Name: line will be left
	 * in linebuf.
	 */
	len = strlen(ibuf[HEADERS].name);
	while (getline(linebuf, BUFSIZ, fp) != NULL) {
		if (!strncmp(linebuf, ibuf[HEADERS].name, len)) {
			*(ibuf[POINTER].every) = '\0';
			break;
		}
		else if (!strncmp(linebuf, "Comments:", 9)) {
			ibuf[POINTER].every = concat(ibuf[POINTER].every, "\n");
			ibuf[POINTER].every = concat(ibuf[POINTER].every, linebuf);
		}
		else {
			ibuf[POINTER].every = concat(ibuf[POINTER].every, linebuf);
		}
	}

	if (feof(fp))
		return(EOF);
	
	return(0);
}

/*
 * concat - like strcat, except that s points to the end of the buffer,
 *	    and the end of the buffer is returned.
 */
char *concat(s, t)
register char *s, *t;
{
	/*
	 * If the line starts with whitespace, skip it
	 * and replace it with a newline and a tab.
	 */
	if ((*t == ' ') || (*t == '\t')) {
		while ((*t) && ((*t == ' ') || (*t == '\t')))
			t++;
		*s++ = '\n';
		*s++ = '\t';
	}
	
	while (*s++ = *t++)
		;
	
	return(--s);
}

/*
 * matchent - matches the regular expressions and returns non-zero
 *	      if we succeed.
 */
matchent()
{
	char *regex();
	
	/*
	 * If we're searching the entire entry, we have to
	 * change nulls to newlines, and then we get one
	 * shot at a match.
	 */
	if (srch_mode == SRCH_EVERY) {
		convert(buffer, ibuf[POINTER].every);
		
		if (regex(ibuf[REGEXPR].every, buffer) == NULL)
			return(0);
		
		return(1);
	}

	/*
	 * For each match asked for, try it.  If any one fails,
	 * we return no match.  This produces an "and"-like
	 * effect.
	 */
	if (srch_mode & SRCH_NAME)
		if (regex(ibuf[REGEXPR].name, ibuf[POINTER].name) == NULL)
			return(0);

	if (srch_mode & SRCH_ORGA)
		if (regex(ibuf[REGEXPR].orga, ibuf[POINTER].orga) == NULL)
			return(0);

	if (srch_mode & SRCH_CONT)
		if (regex(ibuf[REGEXPR].cont, ibuf[POINTER].cont) == NULL)
			return(0);

	if (srch_mode & SRCH_PHON)
		if (regex(ibuf[REGEXPR].phon, ibuf[POINTER].phon) == NULL)
			return(0);

	if (srch_mode & SRCH_POST)
		if (regex(ibuf[REGEXPR].post, ibuf[POINTER].post) == NULL)
			return(0);

	if (srch_mode & SRCH_ELEC)
		if (regex(ibuf[REGEXPR].elec, ibuf[POINTER].elec) == NULL)
			return(0);

	if (srch_mode & SRCH_NEWS)
		if (regex(ibuf[REGEXPR].news, ibuf[POINTER].news) == NULL)
			return(0);

	if (srch_mode & SRCH_MAIL)
		if (regex(ibuf[REGEXPR].mail, ibuf[POINTER].mail) == NULL)
			return(0);

	return(1);
}

/*
 * convert - runs through a buffer from s to t and changes nulls
 *	     to newlines.
 */
convert(s, t)
register char *s, *t;
{
	while (s != t) {
		if (*s == '\0')
			*s = '\n';
		s++;
	}
}

/*
 * printent - print an entry.
 */
printent()
{
	/*
	 * If we were searching the whole buffer, the nulls
	 * are now newlines, so we dump the whole buffer.
	 */
	if (srch_mode == SRCH_EVERY) {
		printf("%s\n\n", buffer);
		return;
	}
	
	/*
	 * Otherwise, we have to dump each field
	 * separately.
	 */
	printf("%s\n", ibuf[POINTER].name);
	printf("%s\n", ibuf[POINTER].orga);
	printf("%s\n", ibuf[POINTER].cont);
	printf("%s\n", ibuf[POINTER].phon);
	printf("%s\n", ibuf[POINTER].post);
	printf("%s\n", ibuf[POINTER].elec);
	printf("%s\n", ibuf[POINTER].news);
	printf("%s\n\n", ibuf[POINTER].mail);
}

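/*
 * getline - like fgets(), but strips the trailing newline.  Returns
 *	     zero at end of file, non-zero otherwise.
 */
getline(buf, len, fp)
char *buf;
int len;
FILE *fp;
{
	register int n;

	if (fgets(buf, len, fp) == NULL)
		return(0);
	
	/*
	 * Only strip the last character if it really is a newline.
	 */
	n = strlen(buf);
	if (n > 0 && buf[n-1] == '\n')
		buf[n-1] = '\0';

	return(1);
}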
	
/*
 * regerror - print a regular expression error and exit.
 */
regerror(s)
char *s;
{
	fprintf(stderr, "%s: regular expression error on %s\n", pname, s);
	exit(1);
}

usage()
{
	fprintf(stderr, "Usage: %s [- pat] [-n name] [-o organization] [-c contact] [-P phone]\n", pname);
	fprintf(stderr, "\t\t[-p postaddr] [-e elecaddr] [-N news] [-m mail]\n");
	exit(1);
}
!Funky!Stuff!