[news.software.b] Modifying news storage for fast searches

brad@looking.on.ca (Brad Templeton) (11/22/89)

Well, if news.software.b is any indication, a fast search system for
News might not be too bad.

The average article in news.software.b this week had only 140 different
words in it.  Usually at least half of those words are 'common' words that
would not be used in a search.  To be conservative, a high estimate for the
average number of important words in an article would be under 100.  Better
handling of common words could drop this even further.
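
For anyone who wants to check those figures, here is a rough sketch of the
kind of count involved -- the stop-list and the word pattern below are just
guesses for illustration, not necessarily what was actually used:

    import re, sys

    # A small, arbitrary stop-list; real software would use a much longer one.
    COMMON = set("""the a an and or of to in on is it that this for as with
        be are was not you i at by from have has but if so we they""".split())

    text = sys.stdin.read().lower()
    words = set(re.findall(r"[a-z']+", text))
    print(len(words), "distinct words,",
          len(words - COMMON), "after dropping the common ones")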

Using a hashing algorithm, if those 100 words were hashed into an array of
650 bits (which can be expressed in about 100 printable chars), we get an
error rate of 15% or less, meaning that if we ask whether word X is in the
document and it says yes, there is a less than 15% chance it is wrong.  (If
it says no, we can be sure X is not in the document.)

15% ain't so hot, but if you are doing a two-word search (e.g. "Henry Spencer")
the odds improve a lot.  If you get two yes answers on those words, the chance
that the article contains neither is only around 2.25%, although if one of the
words really is there, the chance that the other is a false hit is still the
full 15%.
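
Roughly, the scheme looks like this (one hash per word; md5 is picked only
for illustration, any stable hash with a decent spread would do):

    import hashlib

    BITS = 650                      # about what 100 printable characters can carry

    def bit_for(word):
        # any stable, well-spread hash will do; md5 here purely for illustration
        h = hashlib.md5(word.lower().encode()).digest()
        return int.from_bytes(h[:4], "big") % BITS

    def signature(words):
        sig = 0
        for w in words:
            sig |= 1 << bit_for(w)  # with ~100 words, roughly 15% of the bits get set
        return sig

    def maybe_contains(sig, word):
        # a "no" is definite; a "yes" is wrong about 15% of the time for absent words
        return bool(sig & (1 << bit_for(word)))

    sig = signature("an article in which henry spencer is mentioned".split())
    print(maybe_contains(sig, "henry") and maybe_contains(sig, "spencer"))  # True
    print(0.15 * 0.15)              # two independent false hits: 0.0225, i.e. 2.25%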

I suspect there are better algorithms, but my request for info
hasn't drawn anything.  I will have to search further.

Another idea is to store articles in a special compressed form that lists
the dictionary first (i.e. the list of words) followed by the text expressed
as indices into the word list.
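
A minimal sketch of such an encoder, just to make the format concrete (the
details are illustrative, not a proposed standard):

    def encode(text):
        tokens = text.split()            # a real encoder would keep punctuation too
        words = sorted(set(tokens))      # the per-article dictionary, stored first
        number = {w: i for i, w in enumerate(words)}
        return words, [number[t] for t in tokens]

    words, body = encode("the quick brown fox jumps over the lazy dog")
    print(words)    # 8 distinct words
    print(body)     # 9 small integers; 'the' is stored once, referenced twice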

It turns out that for this group the average word list would be 870 bytes
long.  Not too bad.  This doesn't cost any extra disk space either, since the
method actually reduces the file size -- although I don't know how much, if
at all, it hurts compress.  You would also want to store the 870-byte word
list in an index file outside the article.  For typical news flow (roughly
3,000 articles a day), that's a cost of 2.6 meg/day, which is a bit high,
although you could compress it a lot as well.

But the advantage would be a completely accurate search facility.  You could
ask it to show you all the articles that mention Henry Spencer and get an
answer almost right away.

Any further ideas are welcome.

A decoder for this is simple.
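
Roughly the inverse of the encoder sketched above (again, just a sketch):

    def decode(words, body):
        # a real decoder would also restore punctuation and spacing
        return " ".join(words[i] for i in body)
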
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

rsalz@bbn.com (Rich Salz) (11/23/89)

In <51195@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:
>Well, if news.software.b is any indication, a fast search system for
>News might not be too bad.

>Another idea is to store articles in a special compressed form that lists
>the dictionary first (i.e. the list of words) followed by the text expressed
>as indices into the word list.

Free-text retrieval is basically a solved problem.  Go buy books by
Gerard Salton.

Check out your Unix documentation for "Some Examples of Inverted Indices
on the Unix System" by Mike Lesk (USD:30 in the BSD docs, I don't know
where for other systems -- 2B for Version 7, I think).
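
For anyone who hasn't met them, an inverted index is just a map from each
word to the articles containing it.  A toy sketch (nothing to do with Lesk's
actual code):

    from collections import defaultdict

    def build_index(articles):
        # articles: {article_id: text}; maps each word to the ids that use it
        index = defaultdict(set)
        for art_id, text in articles.items():
            for word in set(text.lower().split()):
                index[word].add(art_id)
        return index

    def search(index, *words):
        # intersect the posting lists -- exact answers, unlike a hashed signature
        sets = [index.get(w.lower(), set()) for w in words]
        return set.intersection(*sets) if sets else set()

    idx = build_index({101: "Henry Spencer on C news expiry",
                       102: "B news inode shortage"})
    print(search(idx, "henry", "spencer"))   # {101}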

There was a mini text-retrieval system that appeared in comp.sources.misc --
qndxr, I think the name was.  There will be a bigger system in c.s.unix in
a couple of weeks.

Associative retrieval -- "give me more articles like THIS one" -- was first
proposed in the 1950s.  Thinking Machines has one hell of a sexy demo on
it.

To follow the Usenet trend of "I said it first," I guess I should say
that I proposed this on the news-interfaces list nearly a year ago.
	/r$
-- 
Please send comp.sources.unix-related mail to rsalz@uunet.uu.net.
Use a domain-based address or give alternate paths, or you may lose out.

lee@sq.sq.com (Liam R. E. Quin) (11/26/89)

In article <2179@prune.bbn.com> rsalz@bbn.com (Rich Salz) writes:
>In <51195@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:
>>Well, if news.software.b is any indication, a fast search system for
>>News might not be too bad.
Do you mean fast for users or for programs, I wonder...

>>Another idea is to store articles in a special compressed form that lists
>>the dictionary first (i.e. the list of words) followed by the text expressed
>>as indices into the word list.
This turns out to have problems.
The average word-length in English is about 4.5 characters, plus 1 or more
for the space between words, so each stored word-number has to come in well
under about 5.5 bytes to gain anything, and you still have to retain the
punctuation (which is very important in fragments of C code, of course!).
Now, there are an estimated million users, each with a distinct host and
username, so the vocabulary would be a little large.  And if peeple make
as many speling mistakes and typos as I doo [:-)], the vocabulary gets even
larger.  Most "words" are used less than ten times!
So you would probably need 4 bytes' worth of word-number, and the saving
turns out to be less than you would like.
See below for some other approaches, though.
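
To put rough numbers on it, using the figures above and a guessed 4-byte
word-number:

    avg_token = 4.5 + 1.0     # average word plus the space after it, in bytes
    word_number = 4           # bytes needed to number a net-wide vocabulary
    print("%.0f%% saved" % (100 * (1 - word_number / avg_token)))
    # about 27% -- and that is before charging for the word list itself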

>Free-text retrieval is basically a solved problem.  Go buy books by
>Gerard Salton.
I should point out that there is still a lot of research in this area.

>Check out your Unix documentation for "Some Examples of Inverted Indices
>on the Unix System" by Mike Lesk (USD:30 in the BSD docs, I don't know
>where for other systems -- 2B for Version 7, I think).
Refer is not distributed with System V.

>There was a mini text-retrieval system that appeared in comp.sources.misc --
>qndxr, I think the name was.  There will be a bigger system in c.s.unix in
>a couple of weeks.
Well, as soon as I finish the Sun port.  It is ready now on 386/ix, except
that I need to take the Sun version and check that it still works on the 386.
MIPS and Cray patches to follow :-) :-) :-)

I have not been able to try it on news because of disk space, however.
If anyone cares to donate an eagle or three, I will make a news-reader
interface.  [0.5 :-)]

>Associative retrieval -- "give me more articles like THIS one" -- was first
>proposed in the 1950s.  Thinking Machines has one hell of a sexy demo [...]
This is rather tricky.  TOPIC is a commercial package which does this.  It can
cost of the order of UK £250,000 for a typical 64-user Sequent Symmetry.

The difficulty is in finding unusual words and phrases.
If you ask for articles containing "unix" anywhere in the body, for instance,
you will get an awful lot of responses!
There are also disk space issues.  Most articles expire after less than a day
or so here as it is.

>To follow the Usenet trend of "I said it first," I guess I should say
>that I proposed this on the news-interfaces list nearly a year ago.
>	/r$
Well, what happened to "knews" back in 1985 or so?

* Footnote: Some other approaches to retrieval

(1) signatures -- this is basically hashing, and can dramatically reduce
   search time.  There is generally a 10% space overhead.
   Most current schemes (PAT et al.) do not cope with deleting and adding
   files at 10-minute intervals.  PAT takes a weekend on a Sun 4 to index
   one ~500 Meg file, which would be a little irritating for six months'
   news.

(2) refer uses something like superimposed signatures; see Mike Lesk's paper.
   But it turns out that grep technology has caught up with refer now.

(3) Fulcrum use a byte index, which allows *much* more sophisticated
    searching.  Unfortunately, this algorithm is not (as I understand it)
    publicly available.

If the original poster wants to discuss the algorithms I chose in implementing
a fairly simple prototype text retrieval package, I would be happy to do so
in mail.

Lee
-- 
Liam R. Quin, Unixsys (UK) Ltd [note: not an employee of "sq" - a visitor!]
lee@sq.com (Whilst visiting Canada from England, until Christmas)
utai!anduk.uucp!lee (after Christmas)
 ...striving to promote the interproduction of epimorphistic conformability

lee@sq.sq.com (Liam R. E. Quin) (11/27/89)

Key-word news might not be as useful as it sounds.
The searching is fun, but it turns out that we often want to look for
unusual permutations of frequent words.
The savings in space are probably minimal compared to the extra CPU,
program complexity and programmer effort.

>In article <2179@prune.bbn.com> rsalz@bbn.com (Rich Salz) writes:
>>There was a mini text-retrieval system that appeared in comp.sources.misc --
>>qndxr, I think the name was.  There will be a bigger system in c.s.unix in
>>a couple of weeks.

Canada-dwellers who can't wait can now ftp the sources of my lq-text
package from
	radio.astro.toronto.edu
which is (according to telnet)
	128.100.75.4

The file is in utils/lq-text.shar.Z

This is an awfully early release, but I have less than a month left on
the net, so I thought it better to post it while I could still receive
and send patches and fixes, etc...
It was only really effective on news when I ran a filter to remove most
of the headers.  The same is true for mail -- consider all of the words
in the Path: header line...
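
The filter was nothing clever -- something of this flavour (not the actual
code, and it ignores folded continuation lines):

    import sys

    KEEP = ("From:", "Subject:", "Newsgroups:", "Keywords:")  # worth indexing

    def strip_headers(lines):
        in_headers = True
        for line in lines:
            if in_headers:
                if line.strip() == "":       # blank line ends the header block
                    in_headers = False
                    yield line
                elif line.startswith(KEEP):  # Path:, Message-ID: etc. are dropped
                    yield line
            else:
                yield line

    if __name__ == "__main__":
        sys.stdout.writelines(strip_headers(sys.stdin))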

For transmission, consider that the words-file must never be separated
from the article, or the index-coded text that remains is in effect
encrypted without a key!

Lee
-- 
Liam R. Quin, Unixsys (UK) Ltd [note: not an employee of "sq" - a visitor!]
lee@sq.com (Whilst visiting Canada from England, until Christmas)
utai!anduk.uucp!lee (after Christmas)
 ...striving to promote the interproduction of epimorphistic conformability