[net.news] RFC usenet article archive program with keyword lookup

chuqui@nsc.UUCP (Chuqui Q. Koala) (01/08/85)
My thanks to all of the people out there that responded to my request for 
ways of dealing with directory structures and large numbers of files. Most
responses were either variations of the /f/o/o/foo format or suggested 
machine dependent things like hashing and playing with inode numbers. The
RFC below incorporates a way of dealing with a directory structure that 
I think is rather elegant. I've got parts of the code for the following
program definition written (mostly the keyword generation stuff) so I'm
interested in hearing about potential problems, features I might have
missed and general comments and technical discussions before I find them
out the hard way. Please respond directly to me, I'll post updates to 
the document or other comments as neccessary.

thanks, as they say, in advance.

	chuq

-----

	Archiving and accessing usenet articles with keyword lookup
			    chuq von rospach
			national semiconductor
			    (nsc!chuqui)
			last revision: 1/7/85

This is a preliminary description of a program that can generate and 
maintain an archive of usenet articles and allow looking up articles
based on the article-id, subject lines, or keywords pulled out of the
article itself. 

The basic calling sequence of this program is:

	arc [-q | -v ] <command_string>

<command string is the actualy command to execute. Available commands
are the following (minimum typing needed to be unique can be used):

add <filename> [<filename> [...]]
	adds the given unix filenames to the database. Must be done by
	the archive superuser. All files must be in usenet article format.
	If the filename is a directory all files in it (including more
	directories) are handled appropriately. This command is only 
	available to the archive superuser.

totape <id> [<id> [...]]
	the given id numbers are taken out of the database area and 
	placed in a directory so that they can be backed to secondary
	storage. All pointers to the articles are updated to show that
	they are not accessible from the system. This command is only 
	available to the archive superuser.
	
fromtape <id> [<id> [...]]
	takes the given id numbers out of the secondary storage directory
	and puts them back in the database, updating the pointers as 
	appropriate to make them accessible. This command is only available
	to the archive superuser.

ignore <keyword> 
	removes any references from the system for the given keyword and
	adds the keyword to the IGNORE file so that it will no longer be 
	used. Removing a keyword from IGNORE allows it to be used for 
	lookups, but does not regenerate keywords from the existing 
	database. There ought to be a way of doing this, but I'm lazy.

lookup [-i] <keyword>
	returns the filenames (article id numbers for the -i flag) of
	all articles which are referenced by the given keyword. This is
	useful for pulling references out of the database, using commands
	like "print `arc lookup f77`". Case is not significant.
	
	Future enhancement (maybe): allowing booleans for the keywords,
	as in "arc lookup (net.unix-wizards | net.unix) & (f77 | fortran)"
	or some such syntax. This can be done without a lot of problem 
	with sort, uniq, and other tools, I think, using lists of article
	id's in various combinations.

subject [-i] <subject_string>
	does for subject lines what keywords do. <subject_string> can
	be a substring, i.e. a string such as "vms" will match all
	subjects with "vms" in it. Case is not significant.

article [-i] <article_id>
	does for the article-id of an article what keywords do. <article_id>
	can be a substring, so "orca" will return filenames to all articles
	posted by site "orca" (also "orcan", etc...).
	
idfile <id> [<id> [...]]
	returns the filename needed to access an article with a given
	id number. 

keywords
	Prints a full list of keywords known by the system. This list is 
	in 'dbm' format, which means that the order they are printed out
	will look random to all logical beings.


DEFINITIONS

<id> - an article id, guaranteed to be unique in the system. This is given 
	to the article when entered into the system, and is simply an
	incremented counter (there is, unfortunately, no data in the 
	usenet header guaranteed to be unique without a lot of
	pain, such as an article-id and a posting time or some such
	garbage). Basic definition of <id> is "typedef id long;"

<keyword> - a keyword is any string of characters allowed by the function
	isalnum() (see ctype(3)) bounded by any non-allowed characters.
	the underbar (_) and period(.) characters are also allowed.
	All uppercase characters are mapped to lower case. Newsgroups
	the article was posted to and any words in the Keywords header
	line will also be stored as keywords.

<subject_string> - any legal Unix string. All upper case characters are
	mapped to lower case.

<article_id> - a string of the format <article_number>@<site>.<domain>
	as defined in RFC822, or any substring. All upper case characters
	are mapped to lower case.

FILENAME DEFINITIONS

The following defines are used for filenames in the system. Hardcoded
filenames will NOT be used. Period. (This is a hint to those that might
try to 'fix' this software later).

Define		default value		use

ARC		/usr/spool/newsarc	home directory for this thing.
IGNORE		ignore			Keywords to not build lookup tables
					for in the database.
ARTICLES	articles		stores <id> <article_id> one per
					line for each article in database.
SUBJECTS	subjects		stores <id> <subject> one per line
					for each article in the database.
KEYWORDS	keywords		Directory used to store keyword
					lookup table. All files accessed
					only through the FILEMAP pointers.
TAPE		Tape			Directory used to store articles
					that will be moved to magtape or
					other secondary storage.
DATA		data			Directory used to store the article
					database. All files are accessed 
					only through the FILEMAP pointers.
FILEMAP		filemap			dbm pointer file to access keywords
					and article database. See below.
					

The Filemap pointer file

Filemap is a dbm format file that is used to map keywords and id numbers
to the filenames needed to access them. The basic format is as follows:

	key:  I<id>
	data: <filename>
	
	key:  K<keyword>
	data: <filename>
	
<id> filenames are relative to DATA, <keyword> filenames are relative to
KEYWORD so that the system can be moved around easily by simply changing
a few defines and recompiling. The reason I'm doing this pointer lookup
at all is because I expect to have to change the internal structures of
both KEYWORD and DATA as they grow. Unix directory lookups are quadratic
in nature so if the directories get too large things slow down 
significantly. By using the lookup file I can change one pointer in a 
single place-- otherwise I'd have to track down references all over
the place when I want to change it. As I'm setting things up all I need
to do is change one function in the program and write a quick and dirty
program to relink the files appropriately. 

DATA internal structure

For now, I'm going to use the following algorithm to store files in the DATA
directory. Because I only need to generate a filename once (all other
references are made through FILEMAP) generation can be expensive. We need
to be careful to keep directories at a reasonable size, so a fairly bushy
result is needed. The current algorithm should also grow the number of
directories based on the number of files so the inodes lost to directories
will stay proportional to the number of files in the system.

char *idfile(id_num)
id id_num;
{
	char map[] = "abcdefghij";
	static char filename[MAXLEN];
	char id_string[MAXLEN];
	int i = 0, j = 0;
	
	filename[0]="\0";
	sprintf(id_string,"%d",id_num);
	p = id_string[0];
	
	while (id_string[i])
	{
		filename[j++] = map[id_string[i++] - '0'];
		filename[j++] = '/';
	}
	filename[j] = '\0';
	strcpy(filename,id_string);
	return(filename);
}

KEYWORD filename generation

Because I expect to see many fewer keywords than articles (I HOPE!!!)
the generation of filenames is much simpler. Basically it is generated
by taking keyword[0] as a subdirectory, as in f/foo.

-- 
From the ministry of silly talks:		Chuq Von Rospach
{allegra,cbosgd,decwrl,hplabs,ihnp4,seismo}!nsc!chuqui  nsc!chuqui@decwrl.ARPA

Now look here Mister "I'm not just a word processor"...