amead@s.psych.uiuc.edu (alan mead) (02/07/91)
Here is a file from TUG002.ZIP which can be had at chyde. The other files
mentioned in this document (the source code) are there as well as many
other useful files. It describes what I find to be a clever search
algorithm (it would only be worth it to use this for searching many
characters):
TUG PDS CERT 1.01 (Documentation)
==========================================================================
TUG PUBLIC DOMAIN SOFTWARE CERTIFICATION
The Turbo User Group (TUG) is recognized by Borland International as the
official support organization for Turbo languages. This file has been
verified by the TUG library staff. We are reasonably certain that the
information contained in this file is public domain material, but
it is also subject to any restrictions applied by its author.
This diskette contains information determined to be in the PUBLIC
DOMAIN, provided as a service of TUG for the use of its members. The
Turbo User Group will not be liable for any damages, including any lost
profits, lost savings or other incidental or consequential damages arising
out of the use of or inability to use the contents, even if TUG has been
advised of the possibility of such damages, or for any claim by any
other party.
To the best of our knowledge, the information in this file is accurate.
If you discover an error in this file, we would appreciate it if you would
report it to us. To report bugs, or to request information on membership
in TUG, please contact us at:
Turbo User Group
PO Box 1510
Poulsbo, Washington USA 98370
--------------------------------------------------------------------------
F i l e I n f o r m a t i o n
* DESCRIPTION
Documentation file explaining about Boyer-MOORE.
* ASSOCIATED FILES
SEARCHES.PAS
BOYER.DOC
SEARCHES.DOC
* CHECKED BY
DRM - 08/14/88
* KEYWORDS
TURBO PASCAL V4.0 DOCUMENTATION SEARCH INLINE
==========================================================================
}
W. Vann Hall
Pala Designs
P.O. Box 10804
Arlington, VA 22210
CIS 70346,1144
MCI WVHALL
ABOUT Boyer-MOORE:
Boyer-Moore is an attempt to speed up the usually serial text search.
Traditionally, text searches proceed from the first character onward.
(In these examples, "string" refers to string we're trying to locate,
"text" the array of characters we're trying to find the string in.
These all ignore case sensitivity, matching around line boundaries,
punctuation, etc.)
String to find: STRING
Text to find it in: HOW YOU SEARCH FOR A STRING
Our StringPointer and TextPointer both start at 1. We compare
String[1] and Text[1]. Since "S" and "H" don't match, we increment the
text pointer and compare again. At TextPointer = 9, the "S" of
"STRING" and the "S" of "SEARCH" match, so we increment TextPointer
*and* StringPointer and compare String[2] and Text[10]. "T" and "E"
don't match, so we reset StringPointer and increment TextPointer again.
And so on. You can see that it takes 22 comparisons before we locate
the correct "S" and a full 27 before we know that we have located
"STRING."
Boyer-Moore, on the other hand, works from the end of the string (but
still the beginning of the text). First, it creates a look-up table of
values corresponding to the distance of the first occurence of each
character from the end of the string. For instance, the table for
"STRING" would read:
A=6 B=6 C=6 D=6 E=6 F=6 G=6 H=6 I=2 J=6 K=6 L=6 M=6
N=1 O=6 P=6 Q=6 R=3 S=5 T=4 U=6 V=6 W=6 X=6 Y=6 Z=6
Note that characters not located in the string are set to
Length(String), as it the final character.
Since a 6-character string can't be located entirely within the first 5
characters, we start by comparing the String[6] -- "G" -- with Text[6]:
"O". They don't match, so we increment TextPointer by the value
associated with "O" in the table; that is, by 6. Our next compare is
"G" with the "R" in "SEARCH". We now increment TextPointer by "R"'s
value, 3, and compare "S" to " ". And so on. Here's a tracing of the
progression; the letter to be compared is highlighted by lower casing:
STRINg
HOW YoU SEARCH FOR A STRING
STRINg
HOW YOU SEArCH FOR A STRING
STRINg
HOW YOU SEARCH^FOR A STRING
STRINg
HOW YOU SEARCH FOR A STRING
STRINg
HOW YOU SEARCH FOR A STRINg
STRIng
HOW YOU SEARCH FOR A STRIng
STRing
HOW YOU SEARCH FOR A STRing
STring
HOW YOU SEARCH FOR A STring
String
HOW YOU SEARCH FOR A String
string
HOW YOU SEARCH FOR A string
There; 10 comparisons total. You can see how this would speed
searching a long text -- enough to make up for the additional overhead
of compiling the look-up table.
amead@s.psych.uiuc.edu (alan mead) (02/07/91)
I also have some shareware compiled code that works quite quickly. I undoubtedly ftp'd it from SIMTEL or chyde, but if somone really wants it, I can probably make it available. -alan