henry@utzoo.UUCP (Henry Spencer) (04/26/87)
I am currently interested in fast search algorithms suitable for use in things like text editors. I'm not thinking so much of things like Boyer- Moore string matching, as of ways of avoiding string matching altogether when you expect that the string being searched for will be rare. I haven't looked very hard, but the one thing that my files did turn up was Malcolm Harrison's paper in the Dec 1971 CACM. He describes an algorithm in which you hash character pairs in each line into a short bit vector, keep the bit vectors on hand, compute a bit vector for the string you are after, and then check its bit vector against the one for a line before trying the string match on that line. Just what I'd like, except that it doesn't work well enough to be worth bothering about. In my application, the costs of the string match are dominated by fetching the line from disk. To be worth having, the algorithm must reduce the number of disk fetches significantly. This means a low rate of false alarms, because there are quite a few lines in a disk block, and a false-alarm rate higher than one per block doesn't save much over a straight string-match-on-each-line method. The Harrison algorithm just isn't good enough: for reasonable-sized bit vectors (like 32 bits) and normal text files (averaging 30-50 chars/line), hashing character pairs (or triples, or any reasonable number) turns on most of the bits and the false-alarm rate climbs. A longer bit vector ultimately gets this under control, but the bit vectors start to get bulky enough that intelligent caching of the lines themselves would probably be better. Hashing on larger units, like words, also gets the bit density in the line bit vectors down, but it doesn't turn *on* enough bits in the search-string bit vector to avoid high false-alarm rates for typical (short) search strings. Assorted variations haven't helped enough. Any thoughts? Whatever it is needs to have low maintenance overhead for frequently-changing text, and must be relatively cheap and simple to apply -- I'd *like* fast searches, but they're not worth a lot of complexity in the application. -- "If you want PL/I, you know Henry Spencer @ U of Toronto Zoology where to find it." -- DMR {allegra,ihnp4,decvax,pyramid}!utzoo!henry