ked@garnet.berkeley.edu (Earl H. Kinmonth) (07/20/89)
STOP! Before you read the rest of this posting: if vague, difficult-to-answer questions offend you, hit the 'j' key now and go on to less complex problems!

============================================================

Dennis Allen, "Text Retrieval with a Twist," Byte (1989: July): 201-204, reviews a program called "Folio Views" for indexing every word in large text data bases. I have several questions pertaining to this program and to indexing schemes in general. Specifically, I am trying to improve the indexing scheme for Bibliofile, my free-form (unlimited number of fields, unlimited field size) data base.

As currently structured, Bibliofile has one program that uses an index. The index is made by taking every n records (granularity, a variable) and creating a hash table for each word in those n records (optionally with a user-designated set of words excluded). To be more specific, a simplified version of the index looks like:

    head [info about files, size, when modified, etc.]
    pointer to first granule of n records
        hash table
    pointer to second granule of n records
        hash table
    pointer to third granule of n records
        hash table
    ...

A search is made by hashing each word in the search string, then reading through the hash tables for the granules of n records. If the hash value does NOT occur, move on to the next hash table; otherwise search that granule for the word. (Currently words are mapped to lower case with all punctuation removed.)

With a granularity of 128, a modulo of 4091 for the CRC hash, and an input file of about 900K, the index is about 50K. (Each hash table is roughly modulo / 8 bytes in size.) Returning any one record takes anywhere from 4 to 40 seconds on a 10 MHz AT clone running Xenix. Generally, the MORE words in the search (word1 && word2 && word3 && wordNN), the FASTER it runs. Although performance is adequate for the intended purpose and even so-so by commercial standards (Bibliofile in **IX and MSDOS versions is still only $00.00 to all who ask for it), I feel I must be missing something.
I was pleased to find that I had anticipated one "feature" of "Folio Views," something they call "sparse" indexing (what I call granularity). Nevertheless, I'm an historian (specializing in Japan), not a programmer, and I feel there must be something obvious that I'm missing. Perhaps those familiar with "Folio Views" or with indexing schemes in general would be GRACIOUS enough to make a few suggestions on strategies I might explore.

(Note: I've looked at various b-tree and avl packages. They don't seem to fit. I've also experimented with dbm. Again, it doesn't fit the design parameters. If you think RTFM is the appropriate response, please indicate WHICH FM you have in mind. I've 3 meters of shelf space given over to FM of various flavors.)

Here are some of the constraints the current system works with:

(a) 64K or less for data space. Although I have (free) access to a medium horse-power SUN and several high horse-power VAXen, both for intellectual and historical reasons I feel obliged to live with the original PDP 11/70 limits.

(b) At least the potential to index all "words" in all fields.

(c) Acceptance of existing files for indexing. "Folio Views" requires that files be rewritten with markers; kwik (the Bibliofile index program) does not. You can index to the line, paragraph (nroff/troff macro), or Bibliofile "card" level. If you want to write your own input routine, you can index to whatever unit you choose.

(d) The number of files, the number of records, and the size of records should be constrained only by the hardware.

(e) The index should not exceed 10-15 percent of the data file(s) being indexed. Although the BYTE review mentioned that "typical" indexes range from 30-100 percent of the size of the data bases they index, it gave no numbers for Folio Views. Can any net readers supply this? Also, I'm interested in knowing the overhead/raw-data ratios for HYPErtext systems. Basically, I'm following the rule for ordinary books.
Once the indexing goes above 10-15 percent, you might as well skim the whole book!

(f) Generating the index should be "reasonably" fast, which I would define as faster than my current system, or about five minutes per meg of data on a 10 MHz clone.

(g) Etc. Make your own suggestions here.

(h) Any suggestions should be made under the same conditions Bibliofile is distributed under: charge = $00.00, support = charge. If your suggestions prove useful, or even if they do not but provoke me into something that does prove useful, you will get (a) recognition and (b) up to 100 percent of the sale price of one copy of Bibliofile!

(Save your strength. Any response made using the r/R commands of the news reader WILL BOUNCE. Use the address(es) below, preferably the uucp form!)

Earl H. Kinmonth
History Department
University of California, Davis
916-752-1636 (voice, fax [2300-0800 PDT])
916-752-0776 secretary
(bitnet) ehkinmonth@ucdavis.edu
(uucp) ucbvax!ucdavis!ucdked!cck
(telnet or 916-752-7920) cc-dnet.ucdavis.edu [128.120.2.251]
    request ucdked, login as guest, no password