jaw@ames.UUCP (James A. Woods) (03/12/85)
#  "If you don't like the news, go out and make some of your own."
#      -- Scoop Nisker, local (S.F.) radio commentator

     Lest everyone forget Mike Lesk's utilities for fast text searching,
you are hereby referred to the neglected paper "Some Applications of
Inverted Indexes on the UNIX System" in the old V7 docs.  Full text
searching using inverted files is a fairly painless way of automating
indexing and retrieval; the legal community has known this for quite
some time, using similar systems for case law databases.

     Lesk's paper mentions the never-distributed "lookall" command, which
built a hashed index of 35 MB of all English files on the !research
machine.  The idea is thought-provoking.  Space overhead is only a few
percent, assuming that the first 100 tokens (not counting common words)
of each file go into the database.  (We ignore, for the moment, construing
netnews as "English".)  Retrieval time is constant (a few disk seeks),
and indexing is done, possibly incrementally, by cron during the wee hours.

     Regarding the John Bass comment about 'namei' overhead, Kirk McKusick
has done an admirable job optimizing the function by a factor of six
(25 msec -> 4 msec) for BSD 4.3.  'Namei' now takes up only a small
fraction of system time.  I doubt that news article access (creation,
retrieval, indexing) will figure much into load for systems fortunate
enough to be 4.3-based.  The Salt Lake USENIX proceedings will provide
details.

     -- James A. Woods   {ihnp4,hplabs}!ames!jaw   (or, jaw@riacs)
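
P.S.  For the curious, here is a minimal sketch of the inverted-file idea described above: take the first 100 tokens of each file, skip common words, and hash each token to the files that contain it. It is written in Python for brevity and is not Lesk's code. The stop list, the tokenizing rule, and the names index_tokens, build_index, and lookup are all assumptions made for illustration, and an in-memory dictionary stands in for the on-disk hashed index.

```python
import re
from collections import defaultdict

# Hypothetical stop list; the post does not say which common words are skipped.
COMMON_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for"}
MAX_TOKENS = 100  # "first 100 tokens" of each file, as stated in the post


def index_tokens(text, limit=MAX_TOKENS):
    """Return the first `limit` lowercase word tokens that are not common words."""
    tokens = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word in COMMON_WORDS:
            continue
        tokens.append(word)
        if len(tokens) == limit:
            break
    return tokens


def build_index(documents):
    """Map each indexed token to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for token in index_tokens(text):
            index[token].add(name)
    return index


def lookup(index, *words):
    """Return the names of documents whose indexed tokens include every query word."""
    postings = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*postings) if postings else set()
```

A lookup is one hash probe per query word followed by a set intersection, which is the in-memory analogue of the "few disk seeks" mentioned above. An incremental nightly run would call build_index only on files changed since the last pass and merge the result into the existing table.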