[news.software.b] Searching news.group archives

hybl@mbph.UUCP (Albert Hybl Dept of Biophysics SM) (11/27/89)

In message <2179@prune.bbn.com> (Re: Modifying news storage for
fast searches, dated 22 Nov 89), rsalz@bbn.com (Rich Salz)
writes:
>In <51195@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:
>>...Another idea is to store articles in a special compressed form
>>that lists the dictionary first (ie. the list of words) followed
>>by the text expressed as indices into the word list.
>
>Free-text retrieval is basically a solved problem.  ...[See]
>"Some Examples of Inverted Indices on the Unix System" by
>Mike Lesk (USD:30 in the BSD docs, ...for Version 7, ...).
>
>There will be a [relevant posting] in c.s.unix in a couple of weeks.
>	/r$

I have been using the M. Lesk inverted indexes to maintain
bibliographic citations on several general subjects.  For
example, I search the MaryMed database located at the UMAB
Health Science Library for all citations on "cholera"; download
the complete list containing Title, Author, Subject, Keywords,
Abstract, and a few other categories for each reference; filter
each citation into the Lesk format; and then produce the inverted
reference files.  The HSL does not provide a means to search for
words in the Abstracts, but with the inverted files I can search
for "X-ray crystal structure toxin" and find citations that the
MaryMed software would ignore.  In addition, of course, refer can
be used to insert citations into a document being produced
with the aid of (n/t)roff.
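
To make the workflow above concrete, here is a minimal sketch of the
idea in Python.  It is not the Lesk programs themselves: it keeps the
inverted index in memory instead of in hashed index files, and it
assumes refer-style records (blank-line separated, with %A, %T, %K and
%X lines for author, title, keywords and abstract).  The sample
records and the function names are made up for illustration.

```python
import re
from collections import defaultdict

# Made-up sample records in an assumed refer-style layout.
SAMPLE = """%A A. Author
%T A made-up title about cholera toxin
%K cholera
%X The X-ray crystal structure of the toxin is described.

%A B. Writer
%T Another made-up title
%K cholera epidemiology
%X A survey of outbreaks.
"""

def words_of(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def parse_records(text):
    """Split blank-line separated records into lists of field values."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        values = []
        for line in block.splitlines():
            if line.startswith("%"):
                values.append(line[2:].strip())        # new field
            elif values:
                values[-1] += " " + line.strip()       # continuation line
        if values:
            records.append(values)
    return records

def build_index(records):
    """Map every word to the set of record numbers that contain it."""
    index = defaultdict(set)
    for number, values in enumerate(records):
        for value in values:
            for word in words_of(value):
                index[word].add(number)
    return index

def search(index, query):
    """Return the record numbers that contain every word of the query."""
    words = words_of(query)
    if not words:
        return []
    return sorted(set.intersection(*(index.get(w, set()) for w in words)))

records = parse_records(SAMPLE)
index = build_index(records)
print(search(index, "X-ray crystal structure toxin"))
print(search(index, "cholera"))
```

Because the abstract (%X) lines are indexed along with everything
else, a query can match words that appear only in an abstract.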

I would like to see something like this available on the USENET,
not to complement or supplant my daily reading technique but to
search news.group archives for specific information.  I would
like to use commands such as:
    seekinfo -n news.all -k "history expire failed write"
    seekinfo -y 198[7-9] -n news.admin -k "expire date unparsable"
because I have been annoyed by:

     >Thu Nov 16 21:19:00 EST 1989
     >expire: history write failed
     >expire: history write failed
     >expire: history write failed
     >
     >Fri Nov 17 21:19:00 EST 1989
     >expire: Unparsable date "31 Dec 69 23:59:59 GMT"
     >expire: history write failed
     >expire: history write failed
     >expire: history write failed
     >expire: history write failed

I think I remember that these questions have been posted before,
and rather than reposting them I want to locate what already
exists in some archive somewhere.  The AT&T refer package
contains the programs mkey, inv, hunt, refer, deliv and others;
I don't think they are in the public domain.  The refer program
is not needed, and the other programs in the package are not
entirely ideal.  However, they work well enough to persuade me
that the technique could be applied to large data bases like
MedLine citations and an archive of USENET postings.
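
As a rough illustration of what the seekinfo requests earlier in this
article might do, here is a small Python sketch.  seekinfo does not
exist; the function below and its parameters are hypothetical, and it
scans the articles one by one instead of consulting an inverted index.
It assumes that "all" in a newsgroup pattern stands for any name, that
the -y argument is a shell-style pattern matched against the year of
the Date: header, and that every keyword must occur in the Subject:
line or the body.

```python
import fnmatch
import re
from email import message_from_string
from email.utils import parsedate

def group_matches(pattern, group):
    """Assumption: 'all' is a wildcard, so 'news.all' acts like 'news.*'."""
    glob = ".".join("*" if part == "all" else part
                    for part in pattern.split("."))
    return fnmatch.fnmatchcase(group, glob)

def seekinfo(articles, newsgroups, keywords, year="*"):
    """Return the Message-IDs of articles that satisfy all three tests.

    articles   -- iterable of raw article texts (headers, blank line, body)
    newsgroups -- pattern for -n, e.g. "news.all"
    keywords   -- words for -k; all of them must be present
    year       -- pattern for -y, e.g. "198[7-9]"
    """
    wanted = set(re.findall(r"[a-z0-9]+", keywords.lower()))
    hits = []
    for raw in articles:
        msg = message_from_string(raw)
        groups = [g.strip() for g in msg.get("Newsgroups", "").split(",")]
        if not any(group_matches(newsgroups, g) for g in groups):
            continue
        date = parsedate(msg.get("Date", ""))
        if date is None or not fnmatch.fnmatchcase(str(date[0]), year):
            continue
        text = msg.get("Subject", "") + " " + msg.get_payload()
        if wanted <= set(re.findall(r"[a-z0-9]+", text.lower())):
            hits.append(msg.get("Message-ID", ""))
    return hits
```

With an archive loaded as a list of article texts, the first request
would be written seekinfo(articles, "news.all", "history expire failed
write") and the second seekinfo(articles, "news.admin", "expire date
unparsable", year="198[7-9]").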

----------------------------------------------------------------------
Albert Hybl, PhD.              Office UUCP: uunet!mimsy!mbph!hybl
Department of Biophysics       Home   UUCP: uunet!mimsy!mbph!hybl!ah
University of Maryland                CoSy: ahybl
School of Medicine             Office Phone: (301) 328-7940
Baltimore, MD  21201           Home   Phone: (301) 243-1710
----------------------------------------------------------------------
Responders--DO NOT USE:  hybl@cs.umd.edu  or  ah@cs.umd.edu