[news.software.b] Not Checksums, how about string search hash codes?

brad@looking.on.ca (Brad Templeton) (11/19/89)

As I understand it (it's been a while) there are some nifty algorithms
that can be used to generate hash numbers that can make string searches
in regions of text very fast.

In particular, as I understand it, using such hash numbers, you can tell
right away 95% of the time if a given string is NOT in a body of text.
I.e., if the function returns false, you can be sure the string is not found.
If it's true, it might be found, and you have to search further -- different
hash functions, and eventually a full text search to verify things.

Anybody know more about this and care to post it?

Anyway, it seems that it would be great to calculate the hash numbers for
news articles as they are generated and put them in the header.  Then
article-kill (or find) programs would be many times more efficient, as well
as programs that ask, 'where was that article that talked about hashing?'

In particular, things would be far more efficient for NNTP over slower links,
since you could usually operate on an article just from the header, even if
you are searching for strings in the body.
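
I'm imagining a header line along the lines of (the name and layout
here are invented on the spot):

	X-Body-Hash: 2; 1500; <bit table in some printable encoding>

i.e. the number of hash functions and the table size up front, so a
reader knows how to probe the bits that follow.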

This would be far more useful than a checksum.
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

bill@twwells.com (T. William Wells) (11/20/89)

In article <49602@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:
: As I understand it (it's been a while) there are some nifty algorithms
: that can be used to generate hash numbers that can make string searches
: in regions of text very fast.
:
: In particular, as I understand it, using such hash numbers, you can tell
: right away 95% of the time if a given string is NOT in a body of text.
: I.e., if the function returns false, you can be sure the string is not found.
: If it's true, it might be found, and you have to search further -- different
: hash functions, and eventually a full text search to verify things.
:
: Anybody know more about this and care to post it?

This is easy enough. Just take any hashing function and use it to
set a bit in a bit table for each word in the article. If a word
you are searching for hashes to a bit that is not set in the table,
the word is not in the article.

As an extension, you can use multiple hash codes and OR their
respective bit maps. You then test each hash function in turn.

Unfortunately, I don't recall references or any more details,
though I could find out more at work.
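
But off the top of my head, and untested, the scheme would look
something like this in C (the table size, hash multipliers, and
function names here are all made up; anything that mixes the bits
reasonably well would do):

	#define NBITS 1500		/* size of the bit table */

	static unsigned char table[(NBITS + 7) / 8];

	/* Two cheap string hashes; the multipliers are arbitrary. */
	static unsigned long
	hash1(const char *s)
	{
		unsigned long h = 0;

		while (*s)
			h = h * 31 + (unsigned char)*s++;
		return h % NBITS;
	}

	static unsigned long
	hash2(const char *s)
	{
		unsigned long h = 5381;

		while (*s)
			h = h * 131 + (unsigned char)*s++;
		return h % NBITS;
	}

	/* Called once for each word in the article. */
	void
	add_word(const char *s)
	{
		unsigned long b1 = hash1(s), b2 = hash2(s);

		table[b1 / 8] |= 1 << (b1 % 8);
		table[b2 / 8] |= 1 << (b2 % 8);
	}

	/* 0 means the word is definitely not in the article;
	 * 1 means it might be, so go search the text.
	 */
	int
	maybe_present(const char *s)
	{
		unsigned long b1 = hash1(s), b2 = hash2(s);

		return (table[b1 / 8] & (1 << (b1 % 8))) != 0
		    && (table[b2 / 8] & (1 << (b2 % 8))) != 0;
	}

The poster runs add_word() over every word in the article and ships
the table; a reader calls maybe_present() and only falls back to a
real text search on a "might be".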

---
Bill                    { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com

brad@looking.on.ca (Brad Templeton) (11/21/89)

In article <1989Nov20.073448.18953@twwells.com> bill@twwells.com (T. William Wells) writes:
>This is easy enough. Just take any hashing function and use it to
>set a bit in a bit table for each word in the article. If a word
>you are searching for hashes to a bit that is not set in the table,
>the word is not in the article.

Is that all it is?  Seems that would require a lot of bits.  There were
151 unique words in your article, so you would need a 1500-bit table
(188 bytes, or 226 bytes of printable characters) to get a 10%
false-hit ratio using this algorithm.
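
(Back of the envelope: with one hash and a sparse table, the chance
that a word you look up lands on a set bit is about m/n, so a 10%
false-hit rate wants n around 10 * 151, call it 1500 bits, and
1500/8 rounds up to 188 bytes.)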

Of course you could have the false-hit ratio vary according to the
article, but 226 bytes seems like a lot.  I thought there were more
compact algorithms.
-- 
Brad Templeton, ClariNet Communications Corp. -- Waterloo, Ontario 519/884-7473

bill@twwells.com (T. William Wells) (11/22/89)

In article <50572@looking.on.ca> brad@looking.on.ca (Brad Templeton) writes:
: In article <1989Nov20.073448.18953@twwells.com> bill@twwells.com (T. William Wells) writes:
: >This is easy enough. Just take any hashing function and use it to
: >set a bit in a bit table for each word in the article. If a word
: >you are searching for hashes to a bit that is not set in the table,
: >the word is not in the article.
:
: Is that all it is?  Seems that would require a lot of bits.  There were
: 151 unique words in your article, so you would need a 1500-bit table
: (188 bytes, or 226 bytes of printable characters) to get a 10%
: false-hit ratio using this algorithm.
:
: Of course you could have the false-hit ratio vary according to the
: article, but 226 bytes seems like a lot.  I thought there were more
: compact algorithms.

Well, that is what having several hash codes ORed into the same
table gets you. If you have k independent (and perfectly
randomizing) hash codes, n bits in the table, and m words in the
article, the probability that all k hash codes of a word NOT in
the article find a 1 in the table is (if I did the math right)

	(1 - (1 - 1 / n) ** (m * k)) ** k

So, for one hash code, a word that isn't in the article has a 9.6%
chance of coming up as a (false) hit. But with two hash codes, this
drops to 3.3%. Alternatively, one could shrink the table to obtain
the same failure rate as with one hash code.
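
For anyone who wants to play with the numbers, the formula is a
one-liner (n and m here are from Brad's 151-word example; untested):

	#include <stdio.h>
	#include <math.h>

	int
	main(void)
	{
		double n = 1500.0, m = 151.0;	/* table bits, words */
		int k;

		/* chance that a word NOT in the article has all k of
		 * its hash codes land on set bits; should print about
		 * 9.6% for k = 1 and 3.3% for k = 2
		 */
		for (k = 1; k <= 2; k++)
			printf("k = %d: %.1f%%\n", k,
			    100.0 * pow(1.0 - pow(1.0 - 1.0 / n, m * k), k));
		return 0;
	}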

This formula breaks down when the table has high occupancy,
because it ignores quantization. And hash codes are never
perfect. So experimentation is in order.

---
Bill                    { uunet | novavax | ankh | sunvice } !twwells!bill
bill@twwells.com