[net.news] Two comments on keyword-based news

jaw@ames.UUCP (James A. Woods) (03/12/85)

#  "If you don't like the news, go out and make some of your own."
	-- Scoop Nisker, local (S.F.) radio commentator

     In case everyone has forgotten Mike Lesk's utilities for fast text
searching, I refer you to his neglected paper "Some Applications of
Inverted Indexes on the UNIX System" in the old V7 docs.
Full text searching using inverted files is a fairly painless way of
automating indexing and retrieval; the legal community has known
this for quite some time, using similar systems for case law databases.

     Lesk's paper makes a thought-provoking mention of the never-distributed
"lookall" command, which builds a hashed index of all the English files
(35 MB) on the !research machine.  Space overhead is only a few percent,
assuming that the first 100 tokens (not counting common words) of each file
go into the database.  (We set aside, for the moment, whether netnews
counts as "English".)  Retrieval time is constant (a few disk seeks), and
indexing is done, possibly incrementally, by cron during the wee hours.
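
     For anyone who wants to play with the idea, here is a minimal sketch
of the scheme in Python.  It is not Lesk's code.  The stop list
(COMMON_WORDS), the tokenizer, the function names (key_tokens, build_index,
lookup) and the article names are all my own assumptions, and an in-memory
dictionary stands in for the on-disk hashed file, so it illustrates the
indexing rule only and says nothing about disk seeks.

```python
import re
from collections import defaultdict

# Hypothetical stop list; the real common-word list is not given in the post.
COMMON_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}
MAX_TOKENS = 100  # index only the first 100 non-common tokens of each file


def key_tokens(text, limit=MAX_TOKENS):
    """Return the first `limit` lowercased words of text, skipping common words."""
    tokens = []
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word in COMMON_WORDS:
            continue
        tokens.append(word)
        if len(tokens) == limit:
            break
    return tokens


def build_index(articles):
    """Map each key token to the set of article names that contain it."""
    index = defaultdict(set)
    for name, text in articles.items():
        for token in key_tokens(text):
            index[token].add(name)
    return index


def lookup(index, query):
    """Return the articles whose indexed tokens include every query keyword."""
    words = key_tokens(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    # Made-up article names and bodies, for illustration only.
    articles = {
        "net.news/101": "Keyword-based news reading with inverted indexes",
        "net.unix/202": "Optimizing namei in the kernel",
    }
    index = build_index(articles)
    print(sorted(lookup(index, "inverted indexes")))
```

Each keyword costs one hash probe, which is the in-memory analogue of the
"few disk seeks" above; a cron job would simply rerun build_index (or add
only the new articles) overnight.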

     Regarding John Bass's comment about 'namei' overhead, Kirk
McKusick has done an admirable job of speeding up the function by a factor
of about six (25 msec -> 4 msec) for BSD 4.3.  'Namei' now takes up only
a small fraction of system time.  I doubt that news article
access (creation, retrieval, indexing) will add much to the load
on systems fortunate enough to be 4.3-based.  The Salt Lake USENIX
proceedings will provide details.

     -- James A. Woods   {ihnp4,hplabs}!ames!jaw   (or, jaw@riacs)