science@nems.ARPA (Mark Zimmermann) (02/03/88)
Comp.sources.misc: Volume 2, Issue 41 Submitted-By: "Mark Zimmerman" <science@nems.ARPA> Archive-Name: qndxr Comp.sources.misc: Volume 2, Issue 41 Submitted-By: "Mark Zimmerman" <science@nems.ARPA> Archive-Name: qndxr [Yes, indeed, another un-shared program! Sigh. ++bsa] Following C code builds inverted index files for big text files ... seems to work on Sun, VAX, and Macintosh in my tests. Heavily commented in a helter-skelter style.... ^z /* revised main indexer program 'qndxr.c' by ^z -- 870820-... * * revised 870930-1008 to give the user more options and control of how * delimiters are defined and handled. Now can choose: * - the default: any non-alphanumeric characters are delimiters (my * original approach, and perhaps the simplest to understand and use); * - option "-e": keep punctuation characters ONLY when they are embedded * in between non-delimiter characters, otherwise turn them into * delimiters (approach invented by Bill Hole of MicroDynamics Ltd.; * very nice for displaying 'words' such as don't, non-ASCII, 3.14159, * 87/10/02, etc. without splitting them up and requiring proximity * searching); * - option "-a": keep ALL punctuation characters, whether embedded in * words or not (best for indexing FORTH programs and such, but does * make spurious 'distinct' words at ends of sentences, etc.); * - option "-s": keep all special (non-ASCII) characters (with values * in the range 128-255; in the Apple character set, these are the * symbols that carry umlauts, tilde, accents, etc., making this * option the best for foreign-language and symbol-heavy files); * default is to remove the special characters. * * Note that option "-s" can be combined with any of the other options; * options "-e" and "-a" are mutually exclusive. (Actually, "-a" in my * implementation overrides "-e".) The "-e" option does require * peeking ahead one character before deciding on the fate of * a punctuation symbol, but that isn't too hard to arrange.... * * --------------------------------------------------------------- * * Synopsis of the qndxr.c program's approach to sorting an index: * - load a chunk of the input file into memory; during the loading, * filter the characters appropriately, turning lower-case * into capitals, changing delimiters into \0's, and noting down * the locations of the beginnings of keys (words) in a ptr array; * - do a quicksort on the ptr array, using the keys in the buffer * to sort on; * - write the resulting sorted list of pointers along with their keys * to disk as files, named x0k0 and x0p0, in the *.k and *.p formats * used in ndxr.15; * - repeat the above steps with another chunk of the input file, and * name the resulting files x0k1 and x0p1; repeat, etc., until the * input file is all sorted into chunks; * - begin to merge the sorted files in an N-way merge ... for the * specific case of N=2, for example, merge x0k0/x0p0 and * x0k1/x0p1 into x1k0/x1p0; merge x0k2/x0p2 and x0k3/x0p3 into * x1k1/x1p1; etc., deleting the 0th-generation files as they are * merged; * - merge x1k0/x1p0 and x1k1/x1p1 into x2k0/x2p0, etc., and continue * merging subsequent generations until everything has become a * single pair of index files, xnk0/xnp0, which is then renamed * to be the original document name with '.k' and '.p' endings. * * --------------------------------------------------------------- * * Comparison with the older radix sort approach: * The new quicksort/mergesort method gains a bit in speed (since so * much of the work is done in memory rather than streaming from disk * back and forth) and also saves on disk space requirements (since * the k and p files are significantly compressed relative to the raw * index files formerly used during index sorting). The old radix sort * tended to only achieve about 2 MB/hour indexing on my Mac Plus setup, * and it required 5-6 times the size of the original file in free * disk space during indexing. This new approach achieves >4 MB/hour * in tests on the same Mac Plus, and only requires about 1.9 times * the original file size in free space! * * The new approach also allows for greater key length -- try recompiling * with KEY_LENGTH of 28, for instance -- without such severe disk space * penalties as the old radix sort would have incurred, and without severe * time penalties. (Duplicate words in the chunks are only stored once * in the key file, though each must have an entry in the pointer file.) * * The only obvious disadvantage of the quicksort/mergesort approach is * that it is an O(NlogN) procedure rather than O(N), and thus gets slower * when files get very large. (Radix sorting is strictly linear in N, * in theory anyway.) * * --------------------------------------------------------------- * * For further details, contact the author: * Mark Zimmermann science (at) NEMS.ARPA * 9511 Gwyndale Dr. [75066,2044] on CompuServe * Silver Spring, MD 20910 * 301-565-2166 * --------------------------------------------------------------- */ #include <stdio.h> #include <strings.h> #include <ctype.h> /* --------------header file--------------- */ /* header file for project qndxr -- by ^z -- 870820-0913-... */ /* tell what compiler we're using ... Lightspeed C by Think, if the * following line is #defined ... essentially, when LIGHTSPEED is here, * assume that we have only 16-bit int variables and that Macintosh * toolbox traps are available ... when LIGHTSPEED is not #defined, * assume that ints can hold more like 32 bits without error, so more * can be done using standard I/O routines from <stdio> */ /* #define LIGHTSPEED */ /* preprocessor 'function' to turn on/off debug printing of detailed info * during program runs ... when debugging, use a statement: * #define DEBUG(fmt, arg) printf (fmt, arg) * ... and when not debugging, just use: #define DEBUG(fmt, arg) * to effectively remove those print commands.... */ /* #define DEBUG(fmt, arg) printf (fmt, arg) */ #define DEBUG(fmt, arg) /* sort on KEY_LENGTH characters ... make it 28 rather arbitrarily as an * experiment ... want it long enough to avoid truncation of legitimate * words, but short enough to save some space in the *.k files ... an * alternative value is 12, for compatibility with the older ndxr.c program * and brwsr.c, if they haven't been revised to allow for longer keys.... */ #define KEY_LENGTH 28 /* define a structure for the index_keys file */ typedef struct { char kkey[KEY_LENGTH]; long ccount; } KEY_REC; /* define a structure for my simpleminded I/O buffers ... */ typedef struct { char *zbufbase; char *zbufp; long zbufcounter; long zbufsize; FILE *zbuffilep; int zbufitemsize; } ZBUF; /* choose this size to be the default memory size used for total buffer * space ... for convenience on the Mac Plus, I have picked 512 kB, which * leaves me a bit of space to spare .... */ #define DEFAULT_MEMSIZ 524288 /* merge this many files at each stage of the merging operation for index * building ... 2 means a binary merge, etc. ... one needs to have at least * 5 + 2 * NMERGE I/O buffers around: for each of NMERGE files, there is * a *.k keys file and a *.p pointers file; plus there must be a single * output *.k and a single output *.p file; plus there is the need for stdin, * stdout, and stderr to be open as well. Thus, I have found that a 4-way * merge (NMERGE = 4) works pretty nicely.... */ #define NMERGE 4 #ifndef TRUE #define TRUE 1 #endif #ifndef FALSE #define FALSE 0 #endif /* CMASK makes sure that a returned character isn't sign-extended */ #ifndef CMASK #define CMASK 0xFF #endif /* put the prototypes for my functions here... assume that if LIGHTSPEED * is #define'd, then we want to use prototypes.... */ #ifdef LIGHTSPEED /* in qndxr unix main.c */ void punt(void); void openfile(FILE *file, char *mode); void main(void); /* in qndxr.c */ void _main(int argc, char *argv[]); /* in bufio.c */ void create_zbuffer (int n, long bufsize, FILE *buffile, int bufitemsize); void free_zbuffer (int n); char *next_input_item (int n); void load_zbuffer (int n); char *next_output_item (int n); void flush_zbuffer (int n); /* in build_indices.c */ int build_indices (void); /* in doc_buf.c */ char *make_buf (long bufsiz); long load_doc_buffer (char *doc, long doc_bufsiz, char **ptr); int filtered_getc (void); /* in merge_files.c */ void nway_merge_kpfiles (FILE *ink[], FILE *inp[], FILE *outk, FILE *outp, int n); void copy_ptr_recs (int inzbuf, long count, int outzbuf); void copy_key_rec (char *kkey, long ccount, int outzbuf); int merge_not_finished (KEY_REC *kr[], int n); int smallest_key (KEY_REC *kr[], int n); /* in merge_indices.c */ int merge_indices (char *doc_filename); /* in file_utils.c */ void write_sorted_files (char *doc, char **ptr, long nwords, int pass_number, long offset); int is_new_word (char *w0, char *w1); void write_new_key (char *p, char *kp); /* in merge_utils.c */ FILE *open_inkfile (int generation_number, int file_number); FILE *open_inpfile (int generation_number, int file_number); void fix_oddball_file_name (int generation_number, int file_number); void fix_final_file_names (int generation_number, char *doc_filename); FILE *open_outkfile (int generation_number, int file_number); FILE *open_outpfile (int generation_number, int file_number); void remove_used_infiles (int generation_number, int file_number, int n); void close_infiles (FILE *ink[], FILE *inp[], int n); /* in misc_utils.c */ long set_params (int argc, char *argv[]); long set_zbufsiz (long zb); void set_parser_options (char *str); void check_interrupt (void); long file_size (FILE *fp); /* in open_files.c */ FILE *open_docfile (int argc, char *argv[]); FILE *open_kfile (int pass_number); FILE *open_pfile (int pass_number); /* in zqsort.c */ void zqsort (char **first, char **last); int compare_ptrs (char **p1, char **p2); int zstrcmp (char *s1, char *s2); #endif /* ----------------------main program file---------------- */ /* define a global variable to hold the chosen size of a fundamental * quantum of buffer space ... experiments with dynamically choosing * it at runtime seem to cause occasional problems on the Macintosh, * and have other risks with virtual memory systems, so default to * DEFAULT_MEMSIZ total buffer space but let the user override with * a command-line choice to suit ... see function set_zbufsiz() for * further discussion.... */ long zbufsiz; /* define a global variable to tell whether or not all punctuation chars * are kept... */ int keep_all_punct = FALSE; /* define a global variable to tell whether or not only embedded punctuation * characters are preserved... */ int keep_embedded_punct = FALSE; /* define a global variable to tell whether or not to keep special, * non-ASCII characters... */ int keep_special_chars = FALSE; /* define a global variable to hold the (FILE *) for the document file */ FILE *doc_file; /* main program to do the work ... */ void main(argc, argv) int argc; char *argv[]; { unsigned long start_time, end_time, time(); long set_params(), file_size(); char *ctime(); FILE *open_docfile(); time (&start_time); printf ("Starting work: %s\n", ctime (&start_time)); printf ("\nOpening document file \"%s\"...\n", argv[1]); doc_file = open_docfile (argc, argv); zbufsiz = set_params (argc, argv); printf ("Using buffers of size %ld each...\n", zbufsiz); printf ("Beginning to build index...\n"); while (build_indices ()); printf ("Beginning to merge index subfiles...\n"); while (merge_indices (argv[1])); time (&end_time); printf ("\nEnding work: %s\n", ctime (&end_time)); printf ("Elapsed time = %ld seconds.\n", end_time - start_time); printf ("Indexing rate = %.2f megabytes/hour.\n", file_size (doc_file) * 0.003600 / ( end_time - start_time )); printf ("Input database file \"%s\" was %ld bytes long.\n", argv[1], file_size (doc_file)); fclose (doc_file); } /* -------------------bufio.c file------------------- */ /* This is the file 'bufio.c' ... by ^z, 870830-0913- ... it includes * definitions of some buffer words to accumulate information on its * way to/from the disks ... use these to speed up operations and reduce * disk head movements, in place of calls to fputc(), fread(), fwrite(), * etc. ... try to make them very general so that they will simplify * other tasks.... * */ /* set up some buffers here to save on disk head movement and speed up * operations ... use my simple ZBUF structure for both input and * output operations ... keep everything static, local here to this file * for safety's sake ... allocate enough items here for all the buffers * that we may need for an NMERGE-way merge operation, taking into account * the need for input buffers for all the NMERGE *.k and *.p files plus * the output files *.k and *.p as well.... */ static ZBUF zbuffer[2 + 2 * NMERGE]; /* function to create a zbuffer and set its parameters appropriately */ void create_zbuffer (n, bufsize, buffile, bufitemsize) int n, bufitemsize; long bufsize; FILE *buffile; { char *make_buf(); zbuffer[n].zbufbase = make_buf (bufsize); zbuffer[n].zbufp = zbuffer[n].zbufbase; zbuffer[n].zbufcounter = 0; zbuffer[n].zbufsize = bufsize; zbuffer[n].zbuffilep = buffile; zbuffer[n].zbufitemsize = bufitemsize; } /* function to free a zbuffer ... */ void free_zbuffer (n) int n; { free (zbuffer[n].zbufbase); } /* function to return a pointer to the next item in a chosen input * buffer 'n'; it reloads the buffer if necessary ... returns NULL * pointer when there's nothing left in the file ... */ char *next_input_item (n) int n; { char *result; void load_zbuffer(); if (zbuffer[n].zbufcounter == 0) load_zbuffer (n); zbuffer[n].zbufcounter -= zbuffer[n].zbufitemsize; if (zbuffer[n].zbufcounter >= 0) { result = zbuffer[n].zbufp; zbuffer[n].zbufp += zbuffer[n].zbufitemsize; return (result); } else return (NULL); } /* function to load the nth zbuffer appropriately ... it resets the buffer * pointers, etc. ... might as well give the user a chance to interrupt (in * the Macintosh version) here, as long as we have to go to the disk anyway.... */ void load_zbuffer (n) int n; { long nread; void exit(), check_interrupt(); #ifdef LIGHTSPEED nread = zbuffer[n].zbufsize; FSRead (zbuffer[n].zbuffilep->refnum, &nread, zbuffer[n].zbufbase); #else nread = fread (zbuffer[n].zbufbase, sizeof(char), (int)zbuffer[n].zbufsize, zbuffer[n].zbuffilep); #endif zbuffer[n].zbufp = zbuffer[n].zbufbase; zbuffer[n].zbufcounter = nread; if (ferror (zbuffer[n].zbuffilep)) { printf ("\nFatal error in load_zbuffer while reading in data!\n"); printf ("(n=%d, nread=%ld)\n", n, nread); exit (1); } #ifdef LIGHTSPEED check_interrupt (); #endif } /* function to return a pointer to the right place to store the next * output item for the nth zbuffer ... when the buffer becomes full, * it flushes it to disk, resets pointers, etc. */ char *next_output_item (n) int n; { char *result; void flush_zbuffer(); if (zbuffer[n].zbufcounter == zbuffer[n].zbufsize) flush_zbuffer (n); result = zbuffer[n].zbufp; zbuffer[n].zbufp += zbuffer[n].zbufitemsize; zbuffer[n].zbufcounter += zbuffer[n].zbufitemsize; return (result); } /* function to flush out a zbuffer to the disk ... might as well give user * a chance to interrupt here (in the Macintosh version), as long as we * have to go to the disk anyway.... */ void flush_zbuffer (n) int n; { long chars_written; void exit(); if (zbuffer[n].zbufcounter == 0) return; #ifdef LIGHTSPEED chars_written = zbuffer[n].zbufcounter; FSWrite (zbuffer[n].zbuffilep->refnum, &chars_written, zbuffer[n].zbufbase); #else chars_written = fwrite (zbuffer[n].zbufbase, sizeof(char), (int)zbuffer[n].zbufcounter, zbuffer[n].zbuffilep); #endif if (ferror(zbuffer[n].zbuffilep) || chars_written == 0) { printf ("\nFatal error in flush_zbuffer while writing out data!\n"); printf ("(n=%d, zbufcounter=%ld, chars_written=%ld)\n", n, zbuffer[n].zbufcounter, chars_written); exit(1); } zbuffer[n].zbufp = zbuffer[n].zbufbase; zbuffer[n].zbufcounter = 0; #ifdef LIGHTSPEED check_interrupt (); #endif } /* ---------------build_indices.c file---------------- */ /* file build_indices.c ... by ^z, 870820-0913-... * * revised 870930-871007 to allow more user options on keeping/discarding * punctuation, etc. -- ideas based on Bill Hole's suggestions * * contains subroutine to build indices for each chunk of input document * (database) text file for program qndxr ... * * Strategy is as outlined in qndxr: take in a big chunk of the doc_file, * generate the pointers to each word in the buffer as the buffer contents * are converted into appropriate (all caps, delimiters filtered into * spaces) form for sorting; sort the pointers in memory; and then write * out to disk the pointers and keys in the ndxr.c format for *.k and *.p * files. * * Allocate space for the doc and ptr buffers here so as to maximize use * of available memory ... note that we need to have room for the doc * buffer, for a ptr buffer that might (in the worst case of a file full * of 1-letter words) be twice as long as the doc buffer, and also space * for two standard zbuffers to accumulate the output *.k and *.p file * info in before sending it to disk. * * Note that for speed, while they are being sorted the pointers just point * directly to the key strings in the input buffer; they must be converted * into true offset pointers relative to the 0th byte of the document file * as they are written to disk in the *.p file! Make sure that all of * the delimiters in the document/database buffer are converted into * '\0' characters so that string comparison functions will work right. * * Also note that to avoid edge effects at the end of the buffer, an extra * amount of space is required at the end of the buffer, of length * KEY_LENGTH, to accomodate the end of the last word in the buffer. * * Use static local variables in the function here to keep track of where * we are in the document file from one chunk to the next, what chunk * number we are on now, etc. * * Give the user a chance to interrupt operations (in the Macintosh * version of this program) at intervals here, as long as * there are time-consuming I/O or sorting or scanning operations * to be done ... */ int build_indices () { static int pass_number = 0; long doc_bufsiz, offset, load_doc_buffer(), nwords, ftell(); extern long zbufsiz; extern FILE *doc_file; char *doc, **ptr, *malloc(), *mlalloc(), *calloc(), *clalloc(); void zqsort(), write_sorted_files(); doc_bufsiz = 2 * NMERGE * zbufsiz / 3; DEBUG ("--allocating doc buffer of size %ld\n", doc_bufsiz + KEY_LENGTH); doc = make_buf (doc_bufsiz + KEY_LENGTH); DEBUG ("--allocating ptr buffer of size %ld\n", doc_bufsiz * 2); ptr = (char **)make_buf (doc_bufsiz * 2); #ifdef LIGHTSPEED check_interrupt (); #endif offset = ftell (doc_file); DEBUG ("--loading doc buffer beginning at offset %ld\n", offset); nwords = load_doc_buffer (doc, doc_bufsiz, ptr); if (nwords == 0) { DEBUG ("--Building done ... now freeing doc & ptr buffers\n", NULL); free (doc); free ((char *)ptr); return (FALSE); } printf ("Index subfile #%d contains %ld words...\n", pass_number, nwords); #ifdef LIGHTSPEED check_interrupt (); #endif DEBUG ("--sorting ptr array\n", NULL); zqsort (ptr, ptr + nwords); #ifdef LIGHTSPEED check_interrupt (); #endif DEBUG ("--writing sorted keys and ptrs to disk\n", NULL); write_sorted_files (doc, ptr, nwords, pass_number, offset); #ifdef LIGHTSPEED check_interrupt (); #endif DEBUG ("--freeing doc & ptr buffers\n", NULL); free (doc); free ((char *)ptr); ++pass_number; return (TRUE); } /* ---------------doc_buf.c file------------------ */ /* file doc_buf.c ... by ^z 870830-0919-... * functions to load in text from the document file and then * save out keys and pointers to the *.k and *.p files ... * modified 871007-... to unify the doc-buffer loading with the * character-filtering and the pointer array building.... */ /* function to create a buffer for me to use... */ char *make_buf (bufsiz) long bufsiz; { char *buf, *malloc(), *mlalloc(); void exit(); DEBUG ("--allocating a buffer, size = %ld\n", bufsiz); #ifdef LIGHTSPEED buf = mlalloc (bufsiz); #else buf = malloc ((unsigned int)bufsiz); #endif if (buf == NULL) { printf ("\nFatal error in attempt to allocate a buffer!\n"); printf ("(bufsiz=%ld)\n", bufsiz); exit(1); } return (buf); } /* function to load the document buffer ... bring in doc_bufsiz * characters, and then enough more to finish out the final word, * followed by a terminal delimiter .... as the characters are read * in, filter them appropriately (depending on user choices) and * build the pointer array in memory to the first character of each * word ... return the total number of words that were * read in to the buffer (zero if we're finished with the file) * * ... note that one must be sure to pull in and throw away * any excess characters beyond KEY_LENGTH in the final word, so that * the remaining fragment doesn't show up as the first "word" in the * next chunk of the file.... * * Routine modified 871007-... in order to unify the buffer-loading and * character-filtering and pointer-array-building operations, and to go * back to using getc() from <stdio> rather than Macintosh-specific * operations for loading the buffer.... */ long load_doc_buffer (doc, doc_bufsiz, ptr) char *doc, **ptr; long doc_bufsiz; { int c, i, in_a_word = FALSE; char **ptr0, *end_doc_buf; extern FILE *doc_file; DEBUG ("--Loading document buffer...\n", NULL); ptr0 = ptr; end_doc_buf = doc + doc_bufsiz; while (doc < end_doc_buf) { c = filtered_getc (); DEBUG ("--filtered character = \"%c\"\n", c); if (c == EOF) { *doc++ = '\0'; in_a_word = FALSE; break; } if (! c) in_a_word = FALSE; else if (! in_a_word) { *ptr++ = doc; in_a_word = TRUE; DEBUG ("--adding new ptr = %ld\n", doc); } *doc++ = c; } if (doc == end_doc_buf && in_a_word) { DEBUG ("--finishing off a final buffer word...\n", NULL); for (i = 0; i < KEY_LENGTH; ++i) { c = filtered_getc (); if (c == EOF) { *doc++ = '\0'; break; } if (! c) { *doc++ = '\0'; break; } *doc++ = c; } if (i == KEY_LENGTH) while (filtered_getc ()) ; } return (ptr - ptr0); } /* function to get the next character from the document file and filter it * as the user desires ... return: * EOF if end of file encountered; * '/0' if the character is a delimiter; * otherwise, the character itself (filtered into upper-case, * if it was lower-case) */ int filtered_getc () { static int prevc, c = '\0'; int nextc; extern int keep_all_punct, keep_embedded_punct, keep_special_chars; extern FILE *doc_file; prevc = c; c = getc (doc_file); if (c == EOF) return (EOF); if (islower (c)) return (c = toupper (c)); if (isupper (c) || isdigit (c)) return (c); if (isspace (c)) return (c = '\0'); if (keep_special_chars && ! isascii (c)) return (c); if (keep_all_punct && ispunct (c)) return (c); if (keep_embedded_punct && ispunct (c)) { if (prevc == '\0') return (c = '\0'); nextc = getc (doc_file); ungetc (nextc, doc_file); if (nextc == EOF) return (c = '\0'); if (isalnum (nextc) || (keep_special_chars && ! isascii (nextc))) return (c); else return (c = '\0'); } return (c = '\0'); } /* ------------------------file_utils.c------------- */ /* file file_utils.c ... by ^z -- 870820-0913-... * some utility routines for qndxr project, associated with files... */ /* function to write out sorted k & p files based on the doc and ptr * arrays in memory.... * * The kfile format is as described in detail elsewhere: * the key word, turned into all capital letters and with spaces * afterward, of fixed length KEY_LENGTH; and * the cumulative count of how many words have passed before, including * the current word, a long integer. * * Function revised 870907-... by ^z to use zbuffer method.... */ void write_sorted_files (doc, ptr, nwords, pass_number, offset) char *doc, **ptr; long nwords, offset; int pass_number; { extern long zbufsiz; FILE *kfile, *pfile, *open_kfile(), *open_pfile(); char *prev_word, *next_output_item(); KEY_REC *outk; long *outp, i, file_size (); void create_zbuffer(), write_new_key(); DEBUG ("--Entering write_sorted_files with nwords %ld\n", nwords); if (nwords == 0) return; DEBUG ("--Opening kfile & pfile for pass_number = %d\n", pass_number); kfile = open_kfile (pass_number); pfile = open_pfile (pass_number); DEBUG ("--Creating buffers for keys & ptrs, size = %ld\n", zbufsiz); create_zbuffer (0, zbufsiz, kfile, sizeof(KEY_REC)); create_zbuffer (1, zbufsiz, pfile, sizeof(long)); DEBUG ("--Beginning to write keys and ptrs; first key=%.28s\n", ptr[0]); prev_word = ptr[0]; outk = (KEY_REC *)next_output_item (0); write_new_key (ptr[0], outk->kkey); for (i = 0; i < nwords; ++i) { if (is_new_word (prev_word, ptr[i])) { outk->ccount = i; outk = (KEY_REC *)next_output_item (0); write_new_key (ptr[i], outk->kkey); prev_word = ptr[i]; } outp = (long *)next_output_item (1); *outp = (ptr[i] - doc) + offset; } outk->ccount = i; flush_zbuffer (0); flush_zbuffer (1); DEBUG ("--Getting rid of key and ptr buffers...\n", NULL); free_zbuffer (0); free_zbuffer (1); printf (" ...%ld distinct words\n", file_size (kfile) / sizeof(KEY_REC)); fclose (kfile); fclose (pfile); } /* function to determine if the current word is the same as or different * from the previous word -- if it is different, we'll need to write an * entry out to the key file kfile -- compare the words up to the first * '\0', or for a maximum distance of KEY_LENGTH, and return TRUE * if they differ, FALSE if they are identical that far. Thus, a simple * call to zstrcmp() does the job.... but keep ours as a function instead * of a macro call for the moment, for safety and readability.... */ int is_new_word (w0, w1) char *w0, *w1; { return (zstrcmp (w0, w1)); } /* function to write out a new key entry in the key_file: * KEY_LENGTH letters consisting of the key word (which will be found * delimited by a '\0'), followed by enough blanks to fill out the * record to total length KEY_LENGTH ... */ void write_new_key (p, kp) register char *p, *kp; { register int i, c; for (i = 0; i < KEY_LENGTH; ++i) { c = *p++; if (c == '\0') break; *kp++ = c; } for ( ; i < KEY_LENGTH; ++i) *kp++ = ' '; } /* ---------------------merge_files.c--------------- */ /* file merge_files.c ... 870902-... ^z * more functions to do the work of merging subindices together ... * with buffering of inputs and outputs ... */ /* function to do the real work of merging sorted k & p * files into a single sorted file... * * Procedure is as follows: * * Compare the current key records from each of the N files to be merged. * Take the smallest of the keys (or, if there is a tie, take the one * from the earlier file number) and write its pointer records out to * the *.p file for the next generation. If the key is a new one, that * is, if it differs from the previous key, write out the previous key * word and the value for cumulative_counts to the *.k file, and reset * the previous key's value to that of the current key. Then repeat * this whole procedure, until all the input files are exhausted. * * The above works provided that we are careful in choosing the smallest * key and never let a file that has been exhausted (reached EOF) win a * comparison. The function smallest_key does that properly below, since * a file that is at EOF has returned a NULL pointer for its key_rec... * * For each file, the variable prev_cc[i] holds the previous value of * cumulative_counts for that file, so that we can determine how * many ptr_recs to write out by doing a simple subtraction. * * Buffer numbering scheme: the Kth input file has zbuffer #K * for its keys and zbuffer #(K+n) for its pointers; the output * buffers are zbuffers #(2*n) for keys and #(2*n+1) for pointers. */ void nway_merge_kpfiles (ink, inp, outk, outp, n) FILE *ink[], *inp[], *outk, *outp; register int n; { register int i; KEY_REC *kr[NMERGE], prev_key; long prev_cc[NMERGE], out_cc; extern long zbufsiz; char *strncpy(); void copy_ptr_recs(), copy_key_rec(); DEBUG ("--entering nway_merge_kpfiles with n=%d\n", n); for (i = 0; i < n; ++i) prev_cc[i] = 0; out_cc = 0; for (i = 0; i < n; ++i) { create_zbuffer (i, zbufsiz, ink[i], sizeof(KEY_REC)); create_zbuffer (i + n, zbufsiz, inp[i], sizeof(long)); } create_zbuffer (2 * n, zbufsiz, outk, sizeof(KEY_REC)); create_zbuffer (2 * n + 1, zbufsiz, outp, sizeof(long)); for (i = 0; i < n; ++i) kr[i] = (KEY_REC *)next_input_item (i); i = smallest_key (kr, n); strncpy (prev_key.kkey, kr[i]->kkey, KEY_LENGTH); DEBUG ("--first key is %.28s\n", prev_key.kkey); while (merge_not_finished (kr, n)) { i = smallest_key (kr, n); if (zstrcmp (prev_key.kkey, kr[i]->kkey)) { copy_key_rec (prev_key.kkey, out_cc, 2 * n); strncpy (prev_key.kkey, kr[i]->kkey, KEY_LENGTH); } copy_ptr_recs (i + n, kr[i]->ccount - prev_cc[i], 2 * n + 1); out_cc += kr[i]->ccount - prev_cc[i]; prev_cc[i] = kr[i]->ccount; kr[i] = (KEY_REC *)next_input_item (i); } copy_key_rec (prev_key.kkey, out_cc, 2 * n); DEBUG ("--finished nway merge ... final key=%.28s\n", prev_key.kkey); flush_zbuffer (2 * n); flush_zbuffer (2 * n + 1); for (i = 0; i < 2 * n + 2; ++i) free_zbuffer (i); } /* function to copy a chosen number of ptr_recs (longs) from one file * to another ... used in merging two files by merge_kpfiles() above.... * revised and simplified to use zbuffers....870902 ... ^z */ void copy_ptr_recs (inzbuf, count, outzbuf) register int inzbuf, outzbuf; register long count; { register long *inp, *outp; char *next_input_item(), *next_output_item(); for ( ; count > 0; --count) { inp = (long *)next_input_item (inzbuf); outp = (long *)next_output_item (outzbuf); *outp = *inp; } } /* function to write a key record including kkey (key word) and ccount * (cumulative_count) to an output file... */ void copy_key_rec (kkey, cc, outzbuf) char *kkey; long cc; int outzbuf; { KEY_REC *outp; char *strncpy(), *next_output_item(); outp = (KEY_REC *)next_output_item (outzbuf); strncpy (outp->kkey, kkey, KEY_LENGTH); outp->ccount = cc; } /* simple function to see if we are not yet finished with all of the input * files coming in to the merge operation ... signified by at least one of * the key pointers being non-NULL.... */ int merge_not_finished (kr, n) KEY_REC *kr[]; register int n; { register int i; for (i = 0; i < n; ++i) if (kr[i] != NULL) return (TRUE); return (FALSE); } /* function to determine the smallest of the n keys pointed to by the * kr[] vector of pointers to KEY_RECs ... note that a NULL ptr ranks * after any other...also note that in case of a tie, the smaller * value of i is the one to return (for a stable merging sort) */ smallest_key (kr, n) KEY_REC *kr[]; register int n; { register int i, smallest; for (i = 0; i < n; ++i) if (kr[i] != NULL) { smallest = i; break; } for (i = smallest + 1; i < n; ++i) if (kr[i] != NULL) if (zstrcmp (kr[smallest]->kkey, kr[i]->kkey) > 0) smallest = i; return (smallest); } /* -------------------------merge_indices.c-------------- */ /* file "merge_indices.c" ... by ^z -- 870823-... * function to merge sorted indices together repeatedly until finished * with them all in a single set of *.k/*.p files ... * * The merging strategy is straightforward enough: * Let "g" denote the generation_number and "f" denote the file_number. * Temporary file names begin with the letter x, which is then followed * by a generation number (decimal), the letter k or p (standing for * key or ptr, respectively), and then the file number (decimal). Thus, * the file "x0k0" is the keys file #0 for generation #0 (the starting, * pre-merging, generation), file "x2p3" is the ptr file #3 for generation * #2, etc. * * (The following discussion is specifically for a 2-way merge ... but * the generalization for N-way merging is straightforward.) * * On a given call to merge_indices, the following may happen: * - files xgkf/xgpf and xgk(f+1)/xgp(f+1) are merged into files * x(g+1)k(f/2)/x(g+1)p(f/2), and then the parent files are * deleted; * - file xgkf isn't found, which means we are done with this * generation and must go on to the next; * - file xgk(f+1) isn't found, which means that either we are * completely done with the merging work (if f=0) and just * have to rename the files xgkf/xgpf into the correct final * names (that is, doc_filename.k/doc_filename.p), or else * (if f>0) we have an odd number of files for this level * of merging, and therefore just have to rename xgkf/xgpf * to x(g+1)k(f/2)/x(g+1)p(f/2). */ int merge_indices (doc_filename) char *doc_filename; { static int generation_number = 0, file_number = 0; FILE *ink[NMERGE], *inp[NMERGE], *outk, *outp, *open_inkfile(), *open_inpfile(), *open_outkfile(), *open_outpfile(); void fix_oddball_file_name(), fix_final_file_names(), merge_kpfiles(), remove_used_infiles(), close_infiles(); long inwords, indistinctwords, outdistinctwords, file_size(); int i, n; for (n = 0; n < NMERGE; ++n) { ink[n] = open_inkfile (generation_number, file_number + n); if (ink[n] == NULL) break; inp[n] = open_inpfile (generation_number, file_number + n); } if (file_number + n == 1) { DEBUG ("--All finished with merging files!\n", NULL); printf ("\nFinished with index sorting! Final file sizes:\n"); printf ("%9ld bytes of distinct key words\n", file_size (ink[0])); printf ("%9ld bytes of pointers to words\n", file_size (inp[0])); close_infiles (ink, inp, n); fix_final_file_names (generation_number, doc_filename); return (FALSE); } if (n < 2) { DEBUG ("--finished generation_number=%d ", generation_number); DEBUG ("-- file_number=%d\n", file_number); if (n == 1) { close_infiles (ink, inp, n); fix_oddball_file_name (generation_number, file_number); } ++generation_number; file_number = 0; return (TRUE); } outk = open_outkfile (generation_number, file_number); outp = open_outpfile (generation_number, file_number); inwords = 0; indistinctwords = 0; for (i = 0; i < n; ++i) { inwords += file_size (inp[i]) / sizeof(long); indistinctwords += file_size (ink[i]) / sizeof(KEY_REC); } printf ("Merge pass #%d.%d\n", generation_number, file_number / NMERGE); printf ("Input files contain %ld words -- %ld distinct words...\n", inwords, indistinctwords); nway_merge_kpfiles (ink, inp, outk, outp, n); outdistinctwords = file_size (outk) / sizeof(KEY_REC); printf (" ... merged result has %ld distinct words.\n", outdistinctwords); close_infiles (ink, inp, n); fclose (outk); fclose (outp); remove_used_infiles (generation_number, file_number, n); file_number += NMERGE; return (TRUE); } /* --------------------merge_utils.c---------------- */ /* file "merge_utils.c" ... 870902-... ^z * misc. utilities for the merge_indices functions... */ /* function to open an input key file for this generation and file number */ FILE *open_inkfile (generation_number, file_number) int generation_number, file_number; { FILE *fopen(); char fname[32]; sprintf (fname, "x%dk%d", generation_number, file_number); return (fopen (fname, "rb")); } /* function to open an input ptr file for this generation and file number */ FILE *open_inpfile (generation_number, file_number) int generation_number, file_number; { FILE *fopen(); char fname[32]; sprintf (fname, "x%dp%d", generation_number, file_number); return (fopen (fname, "rb")); } /* function to open an output key file for this generation and file number */ FILE *open_outkfile (generation_number, file_number) int generation_number, file_number; { FILE *fopen(); char fname[32]; sprintf (fname, "x%dk%d", generation_number + 1, file_number / NMERGE); return (fopen (fname, "wb")); } /* function to open an output ptr file for this generation and file number */ FILE *open_outpfile (generation_number, file_number) int generation_number, file_number; { FILE *fopen(); char fname[32]; sprintf (fname, "x%dp%d", generation_number + 1, file_number / NMERGE); return (fopen (fname, "wb")); } /* function to rename the remaining last unpaired key file & ptr file * from one generation to the next... */ void fix_oddball_file_name (generation_number, file_number) int generation_number, file_number; { char oldname[32], newname[32]; sprintf (oldname, "x%dk%d", generation_number, file_number); sprintf (newname, "x%dk%d", generation_number + 1, file_number / NMERGE); if (rename (oldname, newname)) printf ("Sorry -- error in attempt to rename %s to %s!\n", oldname, newname); sprintf (oldname, "x%dp%d", generation_number, file_number); sprintf (newname, "x%dp%d", generation_number + 1, file_number / NMERGE); if (rename (oldname, newname)) printf ("Sorry -- error in attempt to rename %s to %s!\n", oldname, newname); } /* function to give the final key and ptr files their proper ultimate * names ... */ void fix_final_file_names (generation_number, doc_filename) int generation_number; char *doc_filename; { char oldname[32], finalname[65]; sprintf (oldname, "x%dk0", generation_number); sprintf (finalname, "%s.k", doc_filename); if (rename (oldname, finalname)) printf ("Sorry -- error in renaming file %s to %s!\n", oldname, finalname); sprintf (oldname, "x%dp0", generation_number); sprintf (finalname, "%s.p", doc_filename); if (rename (oldname, finalname)) printf ("Sorry -- error in renaming file %s to %s!\n", oldname, finalname); } /* function to get rid of the superfluous k & p files now that they * have been merged into the next generation.... */ void remove_used_infiles (generation_number, file_number, n) int generation_number, file_number, n; { char fname[32]; register int i; for (i = 0; i < n; ++i) { sprintf (fname, "x%dk%d", generation_number, file_number + i); DEBUG ("--removing %s\n", fname); if (unlink (fname)) printf ("Sorry -- unable to delete file %s!\n", fname); sprintf (fname, "x%dp%d", generation_number, file_number + i); DEBUG ("--removing %s\n", fname); if (unlink (fname)) printf ("Sorry -- unable to delete file %s!\n", fname); } } /* function to close out the ink and inp files that have been opened... */ void close_infiles (ink, inp, n) FILE *ink[], *inp[]; register int n; { register int i; for (i = 0; i < n; ++i) { fclose (ink[i]); fclose (inp[i]); } } /* ---------------------misc_utils.c----------------- */ /* file "misc_utils.c" -- miscellaneous utilities for the qndxr project... * by ^z -- 870821-... */ /* this function returns a value for the size of the internal buffers, * zbufsiz, and it also takes care of setting the other global parameters, * keep_all_punct, keep_embedded_punct, and keep_special_chars, * which govern how punctuation and special characters are handled. * They are set based on flags such as -e, -a, and -s in the input line. * The input parameters are taken one after another, and any that do * not convert to a nonzero number are scanned for the letters "e", "a", and * "s". If a parameter is not reset, the default is to leave keep_all_punct, * keep_embedded_punct, and keep_special_chars as FALSE. * The default memory size is DEFAULT_MEMSIZ, set in the header file. */ long set_params (argc, argv) int argc; char *argv[]; { int i; void set_parser_options(); long zb = 0, n, atol(), set_zbufsiz(); for (i = 2; i < argc; ++i) { n = atol (argv[i]); if (n != 0) zb = set_zbufsiz (n); else set_parser_options (argv[i]); } if (zb == 0) zb = set_zbufsiz (DEFAULT_MEMSIZ); return (zb); } /* determine how big we should make the big buffers -- want to be sure * to have room for at least 2*NMERGE+2 of them at one time in memory, * so that the N-way merge function can hold all the input files * (2N) plus the output files (2). * * NOTE that the chosen buffer size must be a multiple of both sizeof(long) * and sizeof(KEY_REC); I ensure that very simply in what follows by * a little modular arithmetic. I also take care of the sign and of the * requirement that the buffer size be non-zero -- thus, UNIX aficionados * can precede the parameter with a "-" with no ill effects. I have put in * various safeguards against picking illegal buffer sizes, along with an * ultimate safety net small minimum value for zbufsiz. */ long set_zbufsiz (zb) long zb; { char *testb; if (zb < 0) zb = -zb; zb /= (2 * NMERGE + 2); zb = zb - zb % (sizeof(KEY_REC) * sizeof(long)); if (zb < sizeof(KEY_REC) * sizeof(long)) zb = sizeof(KEY_REC) * sizeof(long); DEBUG ("--Checking for memory to support buffer size zb=%ld\n", zb); testb = make_buf ((2 * NMERGE + 2) * zb); free (testb); return (zb); } /* function to set the parser options based on a character string ... * 'a' turns on keep_all_punct, 'e' turns on keep_embedded_punct, * and 's' turns on keep_special_chars */ void set_parser_options (str) char *str; { extern int keep_all_punct, keep_embedded_punct, keep_special_chars; while (*str) { if (*str == 'a') { keep_all_punct = TRUE; printf ("All punctuation characters will be kept.\n"); } else if (*str == 'e') { keep_embedded_punct = TRUE; printf ("Embedded punctuation characters will be kept.\n"); } else if (*str == 's') { keep_special_chars = TRUE; printf ("Special characters will be kept.\n"); } ++str; } } /* function to check for user interruption of operations (for use in the * Macintosh version of this program only) ... call SystemTask() to give * desk accessories a bit of time, and then check for non-null events * with GetNextEvent() ... if something comes along (a mouse click, or key * press, or whatever) let the user choose to exit the program or to * carry on .... */ #ifdef LIGHTSPEED void check_interrupt () { EventRecord myEvent; char cmd[256], *gets(); void exit(); SystemTask (); if (GetNextEvent (everyEvent, &myEvent)) { fprintf (stderr, "\Quit indexing now?\n"); gets (cmd); if (cmd[0] == 'y' || cmd[0] == 'Y') exit (0); } } #endif /* a very simple function to return the size of a file ... leave the file * pointer positioned at wherever it was when the routine was entered.... * should work fine with stdio methods, but not guaranteed compatible when * mixed in with Mac-specific FSRead() function! */ long file_size (fp) FILE *fp; { long fpos, result, ftell(); DEBUG ("--finding the size of file fp=%ld\n", fp); fpos = ftell (fp); fseek (fp, 0L, 2); result = ftell (fp); fseek (fp, fpos, 0); return (result); } /* ----------------------open_files.c----------------- */ /* functions to open files as needed in qndxr work... */ FILE *open_docfile (argc, argv) int argc; char *argv[]; { FILE *fp, *fopen(); void exit(); if (argc < 2) { printf ("\nToo few command line arguments!\n"); printf ("\nEnter a text file name (no embedded spaces allowed)\n"); printf ("and the program will build and sort a complete inverted\n"); printf ("index to that file. Use \">\" in front of a file name to\n"); printf ("redirect console output (UNIX-fashion) if desired.\n"); printf ("Give the program a value for available memory (in bytes)\n"); printf ("if the default value of 512kB is unsatisfactory ... larger\n"); printf ("memory allows for faster index building and sorting.\n"); printf ("Other command-line arguments:\n"); printf (" \"-e\" preserves embedded punctuation\n"); printf (" \"-a\" preserves all punctuation characters\n"); printf (" \"-s\" preserves special (non-ASCII) characters\n"); printf ("\nContact Mark Zimmermann, 9511 Gwyndale Drive, Silver Spring\n"); printf ("Maryland 20910 USA; 301-565-2166; science (at) nems.arpa;\n"); printf ("[75066,2044] on CompuServe for further information. - ^z\n"); exit (1); } if ((fp = fopen (argv[1], "r")) == NULL) { printf ("\nFatal error opening document file\"%s\"!\n", argv[1]); exit (1); } return (fp); } /* open the key file with proper name for this pass ... file will be * named "x0kN" where N represents the pass number .... */ FILE *open_kfile (pass_number) int pass_number; { FILE *fopen(); char fname[32]; sprintf (fname, "x0k%d", pass_number); return (fopen (fname, "wb")); } /* open the ptr file with proper name for this pass ... file will be * named "x0pN" where N represents the pass number .... */ FILE *open_pfile (pass_number) int pass_number; { FILE *fopen(); char fname[32]; sprintf (fname, "x0p%d", pass_number); return (fopen (fname, "wb")); } /* ----------------------zqsort.c-------------------- */ /* file zqsort.c -- by ^z, 870823-... * my quicksort to sort out the ptr array ... based, at least initially, * on the Lightspeed C library qsort routine, specialized to the task * at hand here ... */ /* sort elements "first" through "last" */ void zqsort (first, last) register char **first, **last; { register char **i, **j, *tmp; while (last - first > 1) { i = first; j = last; for (;;) { while (++i < last && compare_ptrs (i, first) < 0) ; while (--j > first && compare_ptrs (j, first) > 0) ; if (i >= j) break; tmp = *i; *i = *j; *j = tmp; } tmp = *first; *first = *j; *j = tmp; if (j - first < last - (j + 1)) { zqsort (first, j); first = j + 1; } else { zqsort (j + 1, last); last = j; } } } /* function to compare two index ptrs and give a result appropriate * for quicksort to use in alphabetizing them.... * * Since the words pointed to have already been turned into all capital * letters and delimiters have been filtered out, simply doing zstrcmp() * for KEY_LENGTH letters works fine! * * Slight modification to make the quicksort stable: if two words tie, * then we want to compare their pointers to make the lesser one come * out first in the sort ... */ int compare_ptrs (p1, p2) register char **p1, **p2; { register int diff; diff = zstrcmp (*p1, *p2); if (diff == 0) diff = ((*p1 - *p2) > 0) ? 1 : -1; return (diff); } /* my function to compare two strings and give a result as to who is * alphabetically earlier. Note that this is almost the same as strncmp() * with the fixed value of KEY_LENGTH as the maximum comparison distance, * except that I must be sure to mask the characters to make them positive * (since we want to be able to handle the non-ASCII funny letters in * the Apple character set properly/consistently). If the masking isn't * done, then inconsistent results can occur with those non-ASCII chars! */ int zstrcmp (s1, s2) register char *s1, *s2; { register int n = KEY_LENGTH; for (; --n && ((*s1 & CMASK) == (*s2 & CMASK)); s1++, s2++) if (!*s1) break; return ((*s1 & CMASK) - (*s2 & CMASK)); } -------