chuqui@nsc.UUCP (Chuqui Q. Koala) (01/08/85)
My thanks to all of the people out there that responded to my request for ways of dealing with directory structures and large numbers of files. Most responses were either variations of the /f/o/o/foo format or suggested machine dependent things like hashing and playing with inode numbers. The RFC below incorporates a way of dealing with a directory structure that I think is rather elegant. I've got parts of the code for the following program definition written (mostly the keyword generation stuff) so I'm interested in hearing about potential problems, features I might have missed and general comments and technical discussions before I find them out the hard way. Please respond directly to me, I'll post updates to the document or other comments as neccessary. thanks, as they say, in advance. chuq ----- Archiving and accessing usenet articles with keyword lookup chuq von rospach national semiconductor (nsc!chuqui) last revision: 1/7/85 This is a preliminary description of a program that can generate and maintain an archive of usenet articles and allow looking up articles based on the article-id, subject lines, or keywords pulled out of the article itself. The basic calling sequence of this program is: arc [-q | -v ] <command_string> <command string is the actualy command to execute. Available commands are the following (minimum typing needed to be unique can be used): add <filename> [<filename> [...]] adds the given unix filenames to the database. Must be done by the archive superuser. All files must be in usenet article format. If the filename is a directory all files in it (including more directories) are handled appropriately. This command is only available to the archive superuser. totape <id> [<id> [...]] the given id numbers are taken out of the database area and placed in a directory so that they can be backed to secondary storage. All pointers to the articles are updated to show that they are not accessible from the system. This command is only available to the archive superuser. fromtape <id> [<id> [...]] takes the given id numbers out of the secondary storage directory and puts them back in the database, updating the pointers as appropriate to make them accessible. This command is only available to the archive superuser. ignore <keyword> removes any references from the system for the given keyword and adds the keyword to the IGNORE file so that it will no longer be used. Removing a keyword from IGNORE allows it to be used for lookups, but does not regenerate keywords from the existing database. There ought to be a way of doing this, but I'm lazy. lookup [-i] <keyword> returns the filenames (article id numbers for the -i flag) of all articles which are referenced by the given keyword. This is useful for pulling references out of the database, using commands like "print `arc lookup f77`". Case is not significant. Future enhancement (maybe): allowing booleans for the keywords, as in "arc lookup (net.unix-wizards | net.unix) & (f77 | fortran)" or some such syntax. This can be done without a lot of problem with sort, uniq, and other tools, I think, using lists of article id's in various combinations. subject [-i] <subject_string> does for subject lines what keywords do. <subject_string> can be a substring, i.e. a string such as "vms" will match all subjects with "vms" in it. Case is not significant. article [-i] <article_id> does for the article-id of an article what keywords do. <article_id> can be a substring, so "orca" will return filenames to all articles posted by site "orca" (also "orcan", etc...). idfile <id> [<id> [...]] returns the filename needed to access an article with a given id number. keywords Prints a full list of keywords known by the system. This list is in 'dbm' format, which means that the order they are printed out will look random to all logical beings. DEFINITIONS <id> - an article id, guaranteed to be unique in the system. This is given to the article when entered into the system, and is simply an incremented counter (there is, unfortunately, no data in the usenet header guaranteed to be unique without a lot of pain, such as an article-id and a posting time or some such garbage). Basic definition of <id> is "typedef id long;" <keyword> - a keyword is any string of characters allowed by the function isalnum() (see ctype(3)) bounded by any non-allowed characters. the underbar (_) and period(.) characters are also allowed. All uppercase characters are mapped to lower case. Newsgroups the article was posted to and any words in the Keywords header line will also be stored as keywords. <subject_string> - any legal Unix string. All upper case characters are mapped to lower case. <article_id> - a string of the format <article_number>@<site>.<domain> as defined in RFC822, or any substring. All upper case characters are mapped to lower case. FILENAME DEFINITIONS The following defines are used for filenames in the system. Hardcoded filenames will NOT be used. Period. (This is a hint to those that might try to 'fix' this software later). Define default value use ARC /usr/spool/newsarc home directory for this thing. IGNORE ignore Keywords to not build lookup tables for in the database. ARTICLES articles stores <id> <article_id> one per line for each article in database. SUBJECTS subjects stores <id> <subject> one per line for each article in the database. KEYWORDS keywords Directory used to store keyword lookup table. All files accessed only through the FILEMAP pointers. TAPE Tape Directory used to store articles that will be moved to magtape or other secondary storage. DATA data Directory used to store the article database. All files are accessed only through the FILEMAP pointers. FILEMAP filemap dbm pointer file to access keywords and article database. See below. The Filemap pointer file Filemap is a dbm format file that is used to map keywords and id numbers to the filenames needed to access them. The basic format is as follows: key: I<id> data: <filename> key: K<keyword> data: <filename> <id> filenames are relative to DATA, <keyword> filenames are relative to KEYWORD so that the system can be moved around easily by simply changing a few defines and recompiling. The reason I'm doing this pointer lookup at all is because I expect to have to change the internal structures of both KEYWORD and DATA as they grow. Unix directory lookups are quadratic in nature so if the directories get too large things slow down significantly. By using the lookup file I can change one pointer in a single place-- otherwise I'd have to track down references all over the place when I want to change it. As I'm setting things up all I need to do is change one function in the program and write a quick and dirty program to relink the files appropriately. DATA internal structure For now, I'm going to use the following algorithm to store files in the DATA directory. Because I only need to generate a filename once (all other references are made through FILEMAP) generation can be expensive. We need to be careful to keep directories at a reasonable size, so a fairly bushy result is needed. The current algorithm should also grow the number of directories based on the number of files so the inodes lost to directories will stay proportional to the number of files in the system. char *idfile(id_num) id id_num; { char map[] = "abcdefghij"; static char filename[MAXLEN]; char id_string[MAXLEN]; int i = 0, j = 0; filename[0]="\0"; sprintf(id_string,"%d",id_num); p = id_string[0]; while (id_string[i]) { filename[j++] = map[id_string[i++] - '0']; filename[j++] = '/'; } filename[j] = '\0'; strcpy(filename,id_string); return(filename); } KEYWORD filename generation Because I expect to see many fewer keywords than articles (I HOPE!!!) the generation of filenames is much simpler. Basically it is generated by taking keyword[0] as a subdirectory, as in f/foo. -- From the ministry of silly talks: Chuq Von Rospach {allegra,cbosgd,decwrl,hplabs,ihnp4,seismo}!nsc!chuqui nsc!chuqui@decwrl.ARPA Now look here Mister "I'm not just a word processor"...