ked@garnet.berkeley.edu (Earl H. Kinmonth) (07/20/89)
STOP! Before you read the rest of this posting: if vague, difficult-to-answer questions offend you, hit the 'j' key now and go on to less complex problems!

============================================================

Dennis Allen, "Text Retrieval with a Twist," Byte (1989: July): 201-204, reviews a program called "Folio Views" for indexing every word in large text data bases. I have several questions pertaining to this program and to indexing schemes in general. Specifically, I am trying to improve the indexing scheme for Bibliofile, my free-form (unlimited number of fields, unlimited field size) data base.

As currently structured, Bibliofile has one program that uses an index. The index is made by taking every n records (granularity, a variable) and creating a hash table for each word in those n records (optionally with a user-designated set of words excluded). To be more specific, a simplified version of the index looks like:

    head [info about files, size, when modified, etc.]
    pointer to first granule of n records
        hash table
    pointer to second granule of n records
        hash table
    pointer to third granule of n records
        hash table
    ...

A search is made by hashing each word in the search string, then reading through the hash tables for the granules of n records. If the hash value does NOT occur, move on to the next hash table; otherwise search that granule for the word. (Currently words are mapped to lower case with all punctuation removed.)

With a granularity of 128, a modulo of 4091 for the CRC hash, and an input file of about 900K, the index is about 50K. (Each hash table is roughly modulo / 8 bytes in size.) Returning any one record takes anywhere from 4 to 40 seconds on a 10 MHz AT clone running Xenix. Generally, the MORE words in the search (word1 && word2 && word3 && wordNN), the FASTER it runs. Although performance is adequate for the intended purpose and even so-so by commercial standards (Bibliofile in **IX and MSDOS versions is still only $00.00 to all who ask for it), I feel I must be missing something.
I was pleased to find that I had anticipated one "feature" of "Folio Views," something they call "sparse" indexing (what I call granularity). Nevertheless, I'm an historian (specializing in Japan), not a programmer, and I feel there must be something obvious that I'm missing. Perhaps those familiar with "Folio Views" or with indexing schemes in general would be GRACIOUS enough to make a few suggestions on strategies I might explore.

(Note: I've looked at various b-tree and avl packages. They don't seem to fit. I've also experimented with dbm. Again, it doesn't fit the design parameters. If you think RTFM is the appropriate response, please indicate WHICH FM you have in mind. I've 3 meters of shelf space given over to FM of various flavors.)

Here are some of the constraints the current system works with:

(a) 64K or less for data space. Although I have (free) access to a medium horse-power SUN and several high horse-power VAXen, both for intellectual and historical reasons I feel obliged to live with the original PDP 11/70 limits.

(b) At least the potential to index all "words" in all fields.

(c) Acceptance of existing files for indexing. "Folio Views" requires that files be rewritten with markers; kwik (the Bibliofile index program) does not. You can index to the line, paragraph (nroff/troff macro), or Bibliofile "card" level. If you want to write your own input routine, you can index to whatever unit you choose.

(d) The number of files, the number of records, and the size of records should be constrained only by the hardware.

(e) The index should not exceed 10-15 percent of the data file(s) being indexed. Although the BYTE review mentioned that "typical" indexes range from 30-100 percent of the size of the data bases they index, it gave no numbers for Folio Views. Can any net readers supply this? Also, I'm interested in knowing the overhead/raw-data ratios for HYPErtext systems. Basically, I'm following the rule for ordinary books.
Once the indexing goes above 10-15 percent, you might as well skim the whole book!

(f) Generating the index should be "reasonably" fast, which I would define as faster than my current system, or about five minutes per meg of data on a 10 MHz clone.

(g) Etc. Make your own suggestions here.

(h) Any suggestions should be made under the same conditions Bibliofile is distributed under: charge = $00.00, support = charge. If your suggestions prove useful, or even if they do not but provoke me into something that does prove useful, you will get (a) recognition and (b) up to 100 percent of the sale price of one copy of Bibliofile!

(Save your strength. Any response made using the r/R commands of the news reader WILL BOUNCE. Use the address(es) below, preferably the uucp form!)

Earl H. Kinmonth
History Department
University of California, Davis
916-752-1636 (voice, fax [2300-0800 PDT])
916-752-0776 secretary
(bitnet) ehkinmonth@ucdavis.edu
(uucp) ucbvax!ucdavis!ucdked!cck
(telnet or 916-752-7920) cc-dnet.ucdavis.edu [128.120.2.251]
    request ucdked, login as guest, no password