goer@ellis.uchicago.edu (Richard L. Goerwitz) (06/24/91)
This is the README file from a package I've been using now, on and off, for about a year. Some parts are better tested than others. I'll be happy to mail a shell archive of the whole package to anyone who asks. This is not a finished product, but rather a collection of tools that I'd enjoy having a few people play with, if they are so inclined.

-Richard

________________________________________________________________________

Name:     retrieve
Language: Icon
Contents: tools for word-based, indexed access to text files
Requires: up-to-date Icon Program Library, up-to-date iconc/icont, UNIX

--------
Overview:

Scholars have traditionally split so-called "Classics," the Quran, the Bible, and generally any closely studied literary or religious text, into hierarchically arranged divisions (in the case of the Bible, these are books, chapters, and verses). Such divisions drastically simplify the process of citation and reference. Fortunately for those of us who need electronic access to these files, this hierarchical system of divisions permits easy representation using bit-fields, i.e. fixed-width series of binary digits. Such representations are compact, and allow the programmer to implement high-level boolean operations and range-based searches using simple shifts, additions, and subtractions.

The package this README file comes with - "retrieve" - offers a naive, but generalized and fairly high-level, tool for indexing texts which are divided up in the manner just described, and for performing word-based searches on them. These word-based searches offer wildcard-based access to word patterns (e.g. "give me every passage containing a word with the letters 'NIX'"). The search facilities also permit boolean and range-based specifications (e.g. "give me every instance of word X occurring within eleven sections of word Y"). One can also access passages both by absolute location (e.g. "give me book 1, chapter 3, verse 4") and by relative location (e.g.
"give me the passage occurring before/after the one I just looked at").

Retrieve does no compression of any kind, and is written entirely in Icon. As a result it is something of a disk hog, and takes a long time to index files. Surprisingly, though, once set up, files incorporated into the retrieve package can be accessed quite rapidly. After a brief initialization process (2-4 seconds on a Sun4), absolute locations can be retrieved with no perceptible delay. The same is true of relative locations (again, after a lag on first invocation). Regular expression-based searches appear instantaneous on a fast machine (there is a just perceptible delay on a Sun4 for a four-megabyte indexed file, five to ten seconds on a Xenix/386 box with a relatively slow disk). Boolean and range-based searches take the longest, varying widely according to their complexity and the number of "hits."

--------
Installation:

Retrieve is really not a program as such. It is a set of routines for indexing files, and for accessing indexed files. Installation consists of four basic steps:

    1) creating an indexable file
    2) indexing that file
    3) writing a program using the retrieve interface
    4) compiling and running what you wrote in (3)

These steps are discussed in detail in the following sections.

--------
Step 1: Creating an Indexable File

The format for indexable files must conform to a simple, but strict, set of guidelines. Basically, a file must interleave a series of location designators (internally represented by so-called "bitmaps") with actual text:

    ::001:001:001
    This is text.
    ::001:001:002
    This is more text.

The initial :: (double colon) delimits lines containing the location designators. These designators translate into integers divisible internally into (in this case) three bit-fields of length 10 (enough to handle 999:999:999), which serve as location markers for the text that goes with them. Note that the translation process is invisible.
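To make the bit-field idea concrete, here is a rough Python sketch of the scheme (my own illustration, not the package's Icon code; all names here are mine): a designator like ::001:001:002 packs into a single integer built from three 10-bit fields, after which ordering and adjacency reduce to plain integer arithmetic.

```python
FIELD_BITS = 10                      # wide enough for 999:999:999
FIELD_MASK = (1 << FIELD_BITS) - 1

def pack(fields):
    """Pack e.g. [1, 1, 2] (book, chapter, verse) into one integer."""
    n = 0
    for f in fields:
        n = (n << FIELD_BITS) | f
    return n

def unpack(n, nfields=3):
    """Recover the field values from a packed designator."""
    out = []
    for _ in range(nfields):
        out.append(n & FIELD_MASK)
        n >>= FIELD_BITS
    return out[::-1]

loc = pack([1, 1, 2])                # "::001:001:002" as one integer
assert unpack(loc) == [1, 1, 2]

# Because the fields are fixed-width, comparisons and range tests
# reduce to simple shifts and subtractions:
assert pack([1, 1, 2]) - pack([1, 1, 1]) == 1      # next verse
assert pack([1, 2, 1]) > pack([1, 1, 31])          # chapter 2 follows
```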
All you need to do is make sure:

    a) that the location designators are correctly paired with blocks
       of text, and
    b) that the fields are numbered consistently, beginning with the
       same low value (usually 1 or 0), and continuing in ascending
       order until they roll over again to their low value.

Rather than speak merely in the abstract about the format, let me offer a simple illustration taken from the King James Bible. The first verse in the Bible is Genesis chapter 1, verse 1. This passage might be designated 1:1:1. Verses in Genesis chapter 1 would continue in ascending order to verse 31 (1:1:31), after which chapter 2 would begin (i.e. 1:2:1). The resulting text would look like:

    ::1:1:1
    In the beginning God created the heaven and the earth.
    ::1:1:2
    And the earth was without form, and void; and darkness was upon
    the face of the deep. And the Spirit of God moved upon the face
    of the waters.
    ::1:1:3
    And God said, Let there be light: and there was light.
    ::1:1:4
    And God saw the light, that it was good: and God divided the light
    from the darkness.
    ::1:1:5
    And God called the light Day, and the darkness he called Night.
    And the evening and the morning were the first day.
    ...
    ::1:2:1
    Thus the heavens and the earth were finished, and all the host
    of them.

Although you can use any number of fields you want or need, and can use any nonnumeric separator (e.g. 01-01-01-05-03), lines containing location designators *must* begin with "::", and must be ordered sequentially throughout the input file, paired with the correct text block in each instance.

--------
Step 2: Indexing the File

Indexing the file created in step (1) entails compiling and invoking a program called "makeind." The compilation end of this process would typically be achieved by typing:

    icont -o makeind makeind.icn gettokens.icn indexutl.icn

One of the files listed just above, gettokens.icn, is of particular interest. It contains the tokenizing routine to be used in creating the main word index.
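As a rough illustration of the job that tokenizing routine performs, here is a Python stand-in (the package's real gettokens() is Icon code; the name, the word pattern, and the lowercasing here are my assumptions): split a text block into the word tokens that go into the index.

```python
import re

def gettokens_sketch(text):
    """Hypothetical stand-in for gettokens(): break a block of text
    into lowercase word tokens for the main word index.  (Lowercasing
    mirrors the package's default case-insensitive index; a pattern
    like this would be what you swap out if the stock tokenizer did
    not suit your text.)"""
    return [w.lower() for w in re.findall(r"[A-Za-z']+", text)]

print(gettokens_sketch("In the beginning God created the heaven."))
# ['in', 'the', 'beginning', 'god', 'created', 'the', 'heaven']
```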
Should this routine prove unsatisfactory for one reason or another, you are free to replace it with something more to your liking. Just comment out the old gettokens() routine, insert the new one in its place, and recompile.

Once you have compiled makeind, you must invoke it on the text file you created in step (1). Invoking makeind involves specifying a file to be indexed, the number of fields in that file's location markers, and the maximum value a field can take. If you plan on invoking passages by relative location, you must also use the -l option, which tells makeind to build a .LIM file recording the high values for a specific field throughout the file being indexed. Say you have examined Genesis 1:31 in the Bible, and want to look at the next verse. The only easy way the procedure which handles this particular chore can know the maximum verse value for Genesis chapter 1 (31) is to store this maximum value in a file. By supplying makeind with an -l argument, you are telling it to create such a file.

Just for illustration's sake, let us suppose you want to index the King James Bible. How might you invoke makeind to accomplish this? First you would need to determine the maximum field value for your text. In the case of the Christian English Bible, this is 176. The English Bible (including Apocrypha) contains 73 books; the Protestant KJV contains 66. The maximum number of chapters in any book is 150 (Psalms). The maximum number of verses in any one chapter of any one book is 176 (Psalm 119). 176 would therefore be the maximum value any field would have to hold. You would pass this information to makeind via the -m option. The total number of fields is three, naturally (book, chapter, and verse). This value would be passed using the -n option. As noted above, in order to use relative locations you would also need to tell makeind which field to record max values for.
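The numbers fed to -n and -m, and the per-chapter high values that -l stores in the .LIM file, can all be recovered in a single pass over the designator lines. A Python sketch of that bookkeeping (my own illustration of the idea, not makeind's actual logic or file format):

```python
import re

def scan_designators(lines, rollover_field):
    """One pass over an indexable file, recovering the numbers makeind
    needs: the field count (-n), the largest value any field reaches
    (-m), and, for the chosen rollover field (-l), the high value it
    hits under each prefix -- the kind of information a .LIM file
    exists to record."""
    nfields = maxval = 0
    lims = {}
    for line in lines:
        if not line.startswith("::"):
            continue                      # skip the text blocks
        fields = [int(s) for s in re.findall(r"[0-9]+", line)]
        nfields = max(nfields, len(fields))
        maxval = max(maxval, max(fields))
        prefix = tuple(fields[:rollover_field - 1])
        lims[prefix] = max(lims.get(prefix, 0),
                           fields[rollover_field - 1])
    return nfields, maxval, lims

sample = ["::1:1:1", "text", "::1:1:31", "text", "::1:2:1", "text"]
print(scan_designators(sample, 3))
# (3, 31, {(1, 1): 31, (1, 2): 1})
```

Run over the whole KJV with rollover field 3, a scan like this is what justifies the -m 176 and -n 3 arguments used below.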
In our hypothesized scenario, you would want makeind to store the max value of the verse field for every chapter of every book in the input file. The verse field (field #3), in other words, is your "rollover" field, and would be passed to makeind using the -l option. Assuming "kjv" to be the name of your input file, this set of circumstances would imply the following invocation of makeind:

    makeind -f kjv -m 176 -n 3 -l 3

If you wanted a case-sensitive index (not a good idea), you would add "-s" to the argument list above.

Actual English Bible texts usually take up 4-5 megabytes. Indexing one would require at least twice that much core memory, and would take at least an hour on a fast machine. The end result would be a set of data files occupying about 2 megabytes, on top of the 4-5 megabytes of the original file. Once these data files were created, they could be moved, along with the original source file, to any platform you desired. Having indexed the file, and having moved everything wherever you wanted it, you would then be ready for step 3.

--------
Step 3: Writing a Program to Access Indexed Files

When accessing text files such as the Bible, the most useful unit for searches is normally the word. Let us suppose you are a zealous lay-speaker preparing a talk on fire imagery and divine wrath in the Bible. You would probably want to look for every passage in the text that contained words like fire, firy, burn, furnace, etc. To refine the search, let us say that you want every instance of one of these fire words that occurs within one verse of a biblical title for God: God, LORD, etc.

The searches for fire, firy, burn, etc. would be accomplished by calling a routine called retrieve(). Retrieve() takes three arguments:

    retrieve(pattern, filename, invert_search)

The first argument should be a string containing a regular expression-based pattern, such as:

    fir(y|e|iness)|flam(e|ing)|burn.*?

Note that the pattern must match words IN THEIR ENTIRETY.
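The whole-word requirement is easy to trip over. In Python terms (an analogy of mine, assuming retrieve's regex dialect behaves like anchored matching over whole tokens), it is the difference between re.search and re.fullmatch:

```python
import re

words = ["fire", "firiness", "burn", "burning", "furnace"]

# "fir[ie]" can only consume four letters, so it takes "fire" in its
# entirety but cannot cover all of "firiness":
assert [w for w in words if re.fullmatch("fir[ie]", w)] == ["fire"]

# Bare "burn" likewise matches only the word "burn" itself; append
# ".*" to take in every word beginning with that sequence:
assert [w for w in words if re.fullmatch("burn", w)] == ["burn"]
assert [w for w in words if re.fullmatch("burn.*", w)] == ["burn", "burning"]
```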
So, for instance, "fir[ie]" would not catch "firiness," but rather only "fire." Likewise, if you want every string beginning with the sequence "burn," the string "burn" will not work. Use "burn.*" instead. The filename argument supplies retrieve() with the name of the original text file. The last argument, if nonnull, inverts the sense of the search (a la egrep -v).

In the case of the fire words mentioned above, one would invoke retrieve() as follows:

    hits1 := retrieve("fir(y|e|iness)|flam(e|ing)|burn.*?", "kjv")

For the divine names, one would do something along these lines:

    hits2 := retrieve("god|lord", "kjv")

Having finished the basic word searches, one would then perform a set intersection on them. If we are looking for fire words which occur at most one verse away from a divine name, then we would specify 1 as our range (as opposed to, say, zero), and the verse as our unit. The utility used to carry out the search is r_and(), which would be invoked as follows:

    hits3 := r_and(hits1, hits2, "kjv", 3, 1)

The last two arguments, 3 and 1, specify the unit (field 3, the "verse" field) and the range (1).

To display the text for your "hit list" (hits3 above), you would call bitmap_2_text():

    every write(!bitmap_2_text(hits3, "kjv"))

Bitmap_2_text() converts the location designators contained in hits3 into actual text.

The three basic functions mentioned above - retrieve(), r_and(), and bitmap_2_text() - are contained in the files retrieve.icn, retrops.icn, and bmp2text.icn, respectively. Other useful routines are included in these files, and also in whatnext.icn. If you are planning on writing a retrieval engine for serious work of some kind, you will probably want to construct a mini interpreter, which would convert strings typed in by the user at run time into internal search and retrieval operations. Note that I have included no routine to parse or expand human-readable input (the nature of which will naturally vary from text to text).
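For the curious, a range intersection like r_and()'s can be sketched in a few lines of Python (again my own illustration, not the package's code): given two hit lists of packed locations, keep each location in the first list lying within a given distance of some location in the second, the distance being counted in units of a chosen field.

```python
FIELD_BITS = 10        # three 10-bit fields: book, chapter, verse

def pack(book, chapter, verse):
    return (book << 2 * FIELD_BITS) | (chapter << FIELD_BITS) | verse

def r_and_sketch(hits1, hits2, field, dist):
    """Naive analogue of r_and(hits1, hits2, "kjv", 3, 1).  Shifting
    away the fields below the unit field turns the range test into a
    plain subtraction, as promised in the Overview.  (Naive: this
    sketch ignores rollover at chapter and book boundaries, which the
    real package handles with its .LIM data.)"""
    shift = (3 - field) * FIELD_BITS
    return sorted(a for a in hits1
                  if any(abs((a >> shift) - (b >> shift)) <= dist
                         for b in hits2))

fire_hits = [pack(1, 1, 3), pack(2, 5, 9)]   # hypothetical fire words
name_hits = [pack(1, 1, 2)]                  # hypothetical divine name
print(r_and_sketch(fire_hits, name_hits, 3, 1) == [pack(1, 1, 3)])
# True: 1:1:3 is within one verse of 1:1:2; 2:5:9 is not
```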
For instance, it would be very useful to be able to ask for every passage in, say, Genesis chapters 2 through 4 of a biblical text, and to be able to print these to the screen. Doing this would require a parsing routine to break the references down and map them to retrieve's internal format. The routine would then have to generate all valid locations from the minimum value in chapter 2 to the maximum in chapter 4. See the file whatnext.icn for an illustration of how to generate location designators in a suitably step-like fashion.

--------
Step 4: Compiling and Running Your Program

Assuming you have written a search/retrieval program using the routines contained in retrieve.icn, retrops.icn, bmp2text.icn, and whatnext.icn, you would now be ready to compile it. In order to function properly, these routines need to be linked with initfile.icn and indexutl.icn. Specific dependencies are noted in the individual files in case there is any confusion. If you have made significant use of this package, though, you probably should not worry about the exact dependencies. Just link everything in together, and worry about what isn't needed after you have fully tested your program:

    icont -o yourprog yourprog.icn initfile.icn indexutl.icn \
          retrieve.icn retrops.icn bmp2text.icn binsrch.icn

--------
Problems, bugs:

This is really an alpha release of the retrieve package. I use it for various things. For instance, I recently retrieved a text file containing informal reviews of a number of science fiction works. My father likes SciFi, and it was close to Father's Day, so I indexed the file, and performed cross-referenced searches for words like "very good," "brilliant," and "excellent," omitting authors my father has certainly read (e.g. Herbert, Asimov, etc.).
I also had occasion to write a retrieval engine for the King James Bible (hence the many examples from this text), and to construct a retrieval package for the Hebrew Bible, which I am now using to gather data for various chapters of my dissertation. If anyone else finds these routines useful, then great.

Obviously, these routines could be written/maintained in C or something else that might offer much better performance. They would, however, lose a lot of flexibility, and would have taken much, much longer to write. Right now, they occupy about 60k of basic source files, probably most of which consists of comments. When compiled together with a moderate-size user interface, the total package typically comes to about 120k. In-core size typically runs about 350k on my home machine here (a Xenix/386 box), with the basic run-time interpreter taking up a good chunk of that space all on its own. It's not a small package, but I've found it a nice base for rapid prototyping and development of small search and retrieval engines.

-- 
-Richard L. Goerwitz
goer%sophist@uchicago.bitnet
goer@sophist.uchicago.edu
rutgers!oddjob!gide!sophist!goer