[comp.lang.icon] Icon text-database

goer@ellis.uchicago.edu (Richard L. Goerwitz) (06/24/91)

This is the README file from a package I've been using now, on and off,
for about a year.  Some parts are better tested than others.  I'll be
happy to mail a shell archive of the whole package to anyone who asks.
This is not a finished product, but rather a collection of tools that
I'd enjoy having a few people play with, if they are so inclined.

-Richard

________________________________________________________________________


Name: retrieve
Language: Icon
Contents: tools for word-based, indexed access to text files
Requires: up-to-date Icon Program Library, up-to-date iconc/icont, UNIX


--------


Overview:

	Scholars have traditionally split so-called "Classics," the
Quran, the Bible, and generally any closely studied literary or
religious text, into hierarchically arranged divisions (in the case of
the Bible, these are books, chapters, and verses).  Such divisions
drastically simplify the process of citation and reference.
Fortunately for those of us who need electronic access to these files,
this hierarchical system of divisions permits easy representation
using bit-fields, i.e. fixed-width series of binary digits.  Such
representations are compact, and allow the programmer to implement
high-level boolean operations and range-based searches using simple
shifts, additions, and subtractions.
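
To make the bit-field idea concrete, here is a minimal Python sketch
of the arithmetic (purely illustrative -- the package itself is
written in Icon, and its internals may differ); it assumes three
10-bit fields, as a book:chapter:verse reference might use:

```python
# Pack a book:chapter:verse reference into one integer built from
# three 10-bit fields (each field can hold values up to 1023).
FIELD_WIDTH = 10

def pack(book, chapter, verse):
    return (book << 2 * FIELD_WIDTH) | (chapter << FIELD_WIDTH) | verse

def unpack(bitmap):
    mask = (1 << FIELD_WIDTH) - 1
    return ((bitmap >> 2 * FIELD_WIDTH) & mask,
            (bitmap >> FIELD_WIDTH) & mask,
            bitmap & mask)

# Range tests reduce to plain integer comparisons: every verse in
# chapters 2-4 of book 1 lies between pack(1,2,0) and pack(1,4,1023).
loc = pack(1, 3, 16)
assert pack(1, 2, 0) <= loc <= pack(1, 4, 1023)
assert unpack(loc) == (1, 3, 16)
```

Because the packed values are ordinary integers, boolean operations
on sets of them, and range checks between them, need nothing more
than comparisons, shifts, and masks.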
	The package in which this README file comes - "retrieve" -
offers a naive, but generalized and fairly high-level tool for
indexing texts which are divided up in the manner just described, and
for performing word-based searches on them.  These word-based searches
offer wildcard-based access to word patterns (e.g. "give me every
passage containing a word with the letters 'NIX'").  The search
facilities also permit boolean and range-based specifications (e.g.
"give me every instance of word X occurring within eleven sections of
the word Y").  One can also access passages both by absolute location
(e.g. "give me book 1, chapter 3, verse 4") and by relative location
(e.g. "give me the passage occurring before/after the one I just
looked at").
	Retrieve does no compression of any kind, and is written
entirely in Icon.  As a result it is something of a disk hog, and
takes a long time to index files.  Surprisingly, though, once set up,
files incorporated into the retrieve package can be accessed quite
rapidly.  After a brief initialization process (takes 2-4 seconds on a
Sun4), absolute locations can be retrieved with no perceptible delay.
The same is true of relative locations (again, after a lag on first
invocation).  Regular expression-based searches appear instantaneous
on a fast machine (there is a just perceptible delay on a Sun4 for a
four megabyte indexed file, five to ten seconds on a Xenix/386 box
with a relatively slow disk).  Boolean and range-based searches take
the longest, varying widely according to their complexity and the
number of "hits."


--------


Installation:

	Retrieve is really not a program as such.  It is a set of
routines for indexing, and accessing indexed, files.  Installation
consists of four basic steps:

	1) creating an indexable file
	2) indexing that file
	3) writing a program using the retrieve interface
	4) compiling and running what you wrote in (3)

These steps are discussed in detail in the following sections.


--------


Step 1:  Creating an Indexable File

	The format for indexable files must conform to a simple, but
strict, set of guidelines.  Basically, it must interleave a series of
location designators (internally represented by so-called "bitmaps")
with actual text:

	::001:001:001
	This is text.
	::001:001:002
	This is more text.

The initial :: (double colon) delimits lines containing the location
designators.  These designators translate into integers divisible
internally into (in this case) three bit-fields of length 10 (enough
to handle 999:999:999), which serve as location markers for the text
that goes with them.  Note that the translation process is invisible.
All you need to do is make sure,

	a) that the location designators are correctly paired with
	   blocks of text, and
	b) that the fields are numbered consistently, beginning with
	   the same low value (usually 1 or 0), and continuing in
	   ascending order until they roll over again to their low
	   value

	Rather than speak merely in the abstract about the format, let
me offer a simple illustration taken from the King James Bible.  The
first verse in the Bible is Genesis chapter 1 verse 1.  This passage
might be designated 1:1:1.  Verses in Genesis chapter 1 would continue
in ascending order to verse 31 (1:1:31), after which chapter 2 would
begin (i.e. 1:2:1).  The resulting text would look like:

	::1:1:1
	In the beginning God created the heaven and the earth.
	::1:1:2
	And the earth was without form, and void; and darkness was
	upon the face of the deep. And the Spirit of God moved upon
	the face of the waters.
	::1:1:3
	And God said, Let there be light: and there was light.
	::1:1:4
	And God saw the light, that it was good: and God divided the
	light from the darkness.
	::1:1:5
	And God called the light Day, and the darkness he called
	Night. And the evening and the morning were the first day.
	...
	::1:2:1
	Thus the heavens and the earth were finished, and all the host
	of them.

Although you can use any number of fields you want or need, and can
use any nonnumeric separator (e.g. 01-01-01-05-03), lines containing
location designators *must* begin with "::", and must be ordered
sequentially throughout the input file, paired with the correct text
block in each instance.
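
These rules are easy to check mechanically.  The following Python
sketch (a hypothetical helper, not part of the package) pairs each
"::" designator with its text block:

```python
def parse_indexable(lines, sep=":"):
    """Yield (fields, text) pairs from an indexable file's lines.
    Designator lines begin with '::'; everything up to the next
    designator is that location's text."""
    fields, text = None, []
    for line in lines:
        if line.startswith("::"):
            if fields is not None:
                yield fields, "\n".join(text)
            # Split '::1:1:1' into its numeric fields.
            fields = [int(f) for f in line[2:].split(sep)]
            text = []
        elif fields is not None:
            text.append(line)
    if fields is not None:
        yield fields, "\n".join(text)

sample = ["::1:1:1", "In the beginning...", "::1:1:2", "And the earth..."]
pairs = list(parse_indexable(sample))
assert pairs[0] == ([1, 1, 1], "In the beginning...")
```

A loop over the pairs can then verify that the fields ascend
consistently and roll over to their low value at the right points.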
		

--------


Step 2:  Indexing the File

	Indexing the file created in step (1) entails compiling and
invoking a program called "makeind."  The compilation end of this
process would typically be achieved by typing:

	icont -o makeind makeind.icn gettokens.icn indexutl.icn

One of the files listed just above, gettokens.icn, is of particular
interest.  It contains the tokenizing routine to be used in creating
the main word index.  Should this routine prove unsatisfactory for one
reason or another, you are free to replace it with something more to
your liking.  Just comment out the old gettokens() routine, and insert
the new one in its place.  Then recompile.
	Once you have compiled makeind, you must invoke it for the
text file you created in step (1).  Invoking makeind involves
specifying a file to be indexed, the number of fields in location
markers for that file, and the maximum value for fields.  If you plan
on invoking passages by relative location, you must also use the -l
option, which tells makeind to build a .LIM file, which records the
high values for a specific field throughout the file being indexed.
Let us say you have examined Genesis 1:31 in the Bible, and want to
look at the next verse.  The only easy way the procedure which
handles this particular chore can know the maximum verse value for
Genesis chapter 1 (31) is to store this maximum value in a file.  By
supplying makeind with an -l argument, you are telling it to create
such a file.
	Just for illustration's sake, let us suppose you want to index
the King James Bible.  How might you invoke makeind to accomplish
this?  First you would need to determine the maximum field value for
your text.  In the case of the Christian English Bible, this is 176.
The English Bible (including Apocrypha) contains 73 books.  The
Protestant KJV contains 66.  The maximum number of chapters in any
book is 150 (Psalms).  The maximum number of verses in any one chapter
in any one book is 176 (Psalm 119).  176 would therefore be the
maximum value any field would have to contain.  You would pass this
information to makeind via the -m option.  The total number of fields
is three, naturally (book, chapter, and verse).  This value would be
passed using the -n option.  As noted above, in order to use relative
locations you would need to tell makeind what field to record max
values for.  In our hypothesized scenario, you would want makeind to
store the max value for the verse field for every chapter of every
book in the input file.  The verse field (field #3), in other words,
is your "rollover" field, and would be passed to makeind using the -l
option.  Assuming "kjv" to be the name of your input file, this set of
circumstances would imply the following invocation for makeind:

	makeind -f kjv -m 176 -n 3 -l 3

If you were to want a case-sensitive index (not a good idea), you
would add "-s" to the argument list above.
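
Determining -n and -m by hand is tedious for a large text; a quick
scan of the designator lines yields both.  This Python sketch is a
hypothetical helper, not something makeind itself supplies:

```python
def count_fields_and_max(lines, sep=":"):
    """Scan designator lines ('::1:1:31', ...) and return the number
    of fields (for -n) and the largest value any field takes (for -m)."""
    nfields, maxval = 0, 0
    for line in lines:
        if line.startswith("::"):
            fields = [int(f) for f in line[2:].split(sep)]
            nfields = max(nfields, len(fields))
            maxval = max(maxval, max(fields))
    return nfields, maxval

# For the KJV example in the text, the scan would report (3, 176),
# since Psalm 119 (book 19 in the Protestant ordering) has 176 verses.
sample = ["::1:1:1", "text", "::19:119:176", "text"]
assert count_fields_and_max(sample) == (3, 176)
```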
	Actual English Bible texts usually take up 4-5 megabytes.
Indexing one would require at least twice that much core memory, and
would take at least an hour on a fast machine.  The end result would
be a set of data files occupying about 2 megabytes plus the 4-5
megabytes of the original file.  Once these data files were created,
they could be moved, along with the original source file, to any
platform you desired.
	Having indexed, and having moved the files to wherever you
wanted them, you would then be ready for step 3.


--------


Step 3:  Writing a Program to Access Indexed Files

	When accessing text files such as the Bible, the most useful
unit for searches is normally the word.  Let us suppose you are a
zealous lay-speaker preparing a talk on fire imagery and divine wrath
in the Bible.  You would probably want to look for every passage in
the text that contained words like

	fire, firy
	burn
	furnace
	etc.

To refine the search, let us say that you want every instance of one
of these fire words that occurs within one verse of a biblical title
for God:

	God
	LORD
	etc.

The searches for fire, firy, burn, etc. would be accomplished by
calling a routine called retrieve().  Retrieve takes three arguments:

	retrieve(pattern, filename, invert_search)

The first argument should be a string containing a regular expression
based pattern, such as

	fir(y|e|iness)|flam(e|ing)|burn.*?

Note that the pattern must match words IN THEIR ENTIRETY.  So, for
instance, "fir[ie]" would not catch "firiness," but rather only
"fire."  Likewise, if you want every string beginning with the
sequence "burn," the string "burn" will not work.  Use "burn.*"
instead.  The filename argument supplies retrieve() with the name of
the original text file.  The last argument, if nonnull, inverts the
sense of the search (a la egrep -v).  In the case of the fire words
mentioned above, one would invoke retrieve() as follows:

	hits1 := retrieve("fir(y|e|iness)|flam(e|ing)|burn.*?", "kjv")

For the divine names, one would do something along these lines:

	hits2 := retrieve("god|lord", "kjv")
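
The whole-word rule is worth testing against your patterns before
launching a long search.  Python's re module can mimic the semantics
(illustration only; retrieve() uses its own Icon matcher):

```python
import re

# retrieve() requires the pattern to match a word IN ITS ENTIRETY,
# like re.fullmatch, not a substring search like re.search.
assert re.fullmatch("fir[ie]", "fire")              # whole word matches
assert re.fullmatch("fir[ie]", "firiness") is None  # no partial match
assert re.fullmatch("burn.*", "burning")            # trailing .* needed
assert re.fullmatch("burn", "burning") is None      # bare "burn" fails
```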

	Having finished the basic word searches, one would then
perform a set intersection on them.  If we are looking for fire words
which occur at most one verse away from a divine name, then we would
specify 1 as our range (as opposed to, say, zero), and the verse as
our unit.  The utility you would use to carry out the search is
r_and().  R_and() would be invoked as follows:

	hits3 := r_and(hits1, hits2, "kjv", 3, 1)

The last two arguments, 3 and 1, specify field three (the "verse"
field) as the unit, and 1 as the range.
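
A range intersection like this reduces, internally, to arithmetic on
the packed location integers.  The sketch below (Python, and
hypothetical -- r_and()'s actual Icon code may differ, and real
chapter rollover is handled via the .LIM data) shows the idea for
10-bit fields:

```python
# With locations packed as integers built from three 10-bit fields,
# "X within N verses of Y" is simple arithmetic.  Chapter rollover is
# ignored here for brevity.
FIELD_WIDTH = 10
MASK = (1 << FIELD_WIDTH) - 1

def within(loc1, loc2, n):
    """True if loc1 and loc2 share book and chapter and their verse
    fields differ by at most n."""
    return (loc1 >> FIELD_WIDTH == loc2 >> FIELD_WIDTH and
            abs((loc1 & MASK) - (loc2 & MASK)) <= n)

def r_and_sketch(hits1, hits2, n):
    """Keep every location in either hit list lying within n verses
    of some location in the other."""
    keep = set()
    for a in hits1:
        for b in hits2:
            if within(a, b, n):
                keep.update((a, b))
    return keep

fire = {(1 << 20) | (1 << 10) | 3}   # e.g. 1:1:3
gods = {(1 << 20) | (1 << 10) | 4}   # e.g. 1:1:4
assert r_and_sketch(fire, gods, 1) == fire | gods
```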
	To display the text for your "hit list" (hits3 above), you
would call bitmap_2_text():

	every write(!bitmap_2_text(hits3, "kjv"))

Bitmap_2_text converts the location designators contained in hits3
into actual text.
	The three basic functions mentioned above - retrieve(),
r_and(), and bitmap_2_text() - are contained in the three files
retrieve.icn, retrops.icn, and bmp2text.icn, respectively.  Other
useful routines are included in these files, and also in whatnext.icn.
If you are planning on writing a retrieval engine for serious work of
some kind, you would probably want to construct a mini interpreter,
which would convert strings typed in by the user at run-time into
internal search and retrieval operations.
	Note that I have included no routine to parse or expand
human-readable input (the nature of which will naturally vary from
text to text).  For instance, it would be very useful to be able to
ask for every passage in, say, Genesis chapters 2 through 4 in a
biblical text, and to be able to print these to the screen.  Doing
this would require a parsing routine to break down the references, and
map them to retrieve-internal format.  The routine would then have to
generate all valid locations from the minimum value in chapter 2 above
to the max in chapter 4.  See the file whatnext.icn for an
illustration of how to generate location designators in a suitably
step-like fashion.
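
Stepping through locations just means incrementing the rollover field
and consulting the stored maxima.  A Python sketch of the idea (the
package's own version lives in whatnext.icn and reads the .LIM file;
all names here are hypothetical):

```python
def next_verse(loc, maxima):
    """Return the location after (book, chapter, verse), rolling the
    verse over to 1 when it passes the chapter's stored maximum.
    `maxima` maps (book, chapter) to that chapter's highest verse,
    as a .LIM-style file would record it."""
    book, chapter, verse = loc
    if verse < maxima[(book, chapter)]:
        return book, chapter, verse + 1
    return book, chapter + 1, 1   # chapter rollover; book rollover omitted

maxima = {(1, 1): 31}             # Genesis chapter 1 has 31 verses
assert next_verse((1, 1, 30), maxima) == (1, 1, 31)
assert next_verse((1, 1, 31), maxima) == (1, 2, 1)
```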


--------


Step 4:  Compiling and Running Your Program

	Assuming you have written a search/retrieval program using the
routines contained in retrieve.icn, retrops.icn, bmp2text.icn, and
whatnext.icn, you would now be ready to compile it.  In order to
function properly, these routines would need to be linked with
initfile.icn and indexutl.icn.  Specific dependencies are noted in the
individual files in case there is any confusion.
	If you have made significant use of this package, you probably
should not worry about the exact dependencies, though.  Just link
everything in together, and worry about what isn't needed after you
have fully tested your program:

	icont -o yourprog yourprog.icn initfile.icn indexutl.icn \
		retrieve.icn retrops.icn bmp2text.icn binsrch.icn

--------


Problems, bugs:

	This is really an alpha release of the retrieve package.  I
use it for various things.  For instance, I recently retrieved a text
file containing informal reviews of a number of Science Fiction works.
My father likes SciFi, and it was close to Father's Day, so I indexed
the file, and performed cross-referenced searches for words like "very
good," "brilliant," and "excellent," omitting authors my father has
certainly read (e.g. Herbert, Asimov, etc.).  I also had occasion to
write a retrieval engine for the King James Bible (hence the many
examples from this text), and to construct a retrieval package for the
Hebrew Bible, which I am now using to gather data for various chapters
of my dissertation.
	If anyone else finds these routines useful, then great.
Obviously, they could be written/maintained in C or something that
might offer much better performance.  They would, however, lose a lot
of flexibility, and would have taken much, much longer to write.
Right now, they occupy about 60k of basic source files, probably most
of which consists of comments.  When compiled together with a
moderate-size user interface, the total package typically comes to
about 120k.  In-core size typically runs about 350k on my home machine
here (a Xenix/386 box), with the basic run-time interpreter taking up
a good chunk of that space all on its own.  It's not a small package,
but I've found it a nice base for rapid prototyping and development of
small search and retrieval engines.

-- 

   -Richard L. Goerwitz              goer%sophist@uchicago.bitnet
   goer@sophist.uchicago.edu         rutgers!oddjob!gide!sophist!goer