[comp.text] fast string searching?

henry@utzoo.UUCP (Henry Spencer) (04/26/87)
I am currently interested in fast search algorithms suitable for use in
things like text editors.  I'm not thinking so much of things like Boyer-
Moore string matching, as of ways of avoiding string matching altogether
when you expect that the string being searched for will be rare.

I haven't looked very hard, but the one thing that my files did turn up was
Malcolm Harrison's paper in the Dec 1971 CACM.  He describes an algorithm
in which you hash character pairs in each line into a short bit vector, keep
the bit vectors on hand, compute a bit vector for the string you are after,
and then check its bit vector against the one for a line before trying the
string match on that line.  Just what I'd like, except that it doesn't work
well enough to be worth bothering about.  In my application, the costs of
the string match are dominated by fetching the line from disk.  To be worth
having, the algorithm must reduce the number of disk fetches significantly.
This means a low rate of false alarms, because there are quite a few lines
in a disk block, and a false-alarm rate higher than one per block doesn't
save much over a straight string-match-on-each-line method.  The Harrison
algorithm just isn't good enough:  for reasonable-sized bit vectors (like
32 bits) and normal text files (averaging 30-50 chars/line), hashing
character pairs (or triples, or any reasonable number) turns on most of
the bits and the false-alarm rate climbs.  A longer bit vector ultimately
gets this under control, but the bit vectors start to get bulky enough
that intelligent caching of the lines themselves would probably be better.
Hashing on larger units, like words, also gets the bit density in the line
bit vectors down, but it doesn't turn *on* enough bits in the search-string
bit vector to avoid high false-alarm rates for typical (short) search strings.
Assorted variations haven't helped enough.

Any thoughts?  Whatever it is needs to have low maintenance overhead for
frequently-changing text, and must be relatively cheap and simple to
apply -- I'd *like* fast searches, but they're not worth a lot of complexity
in the application.
-- 
"If you want PL/I, you know       Henry Spencer @ U of Toronto Zoology
where to find it." -- DMR         {allegra,ihnp4,decvax,pyramid}!utzoo!henry