[comp.lang.pascal] efficient string searches

amead@s.psych.uiuc.edu (alan mead) (02/07/91)

Here is a file from TUG002.ZIP which can be had at chyde.  The other files 
mentioned in this document (the source code) are there as well as many 
other useful files.  It describes what I find to be a clever search
algorithm (it would only be worth it to use this for searching many 
characters):

TUG PDS CERT 1.01 (Documentation)

==========================================================================

                  TUG PUBLIC DOMAIN SOFTWARE CERTIFICATION

The Turbo User Group (TUG) is recognized by Borland International as the
official support organization for Turbo languages.  This file has been
verified by the TUG library staff.  We are reasonably certain that the
information contained in this file is public domain material, but
it is also subject to any restrictions applied by its author.

This diskette contains information determined to be in the PUBLIC
DOMAIN, provided as a service of TUG for the use of its members.  The
Turbo User Group will not be liable for any damages, including any lost
profits, lost savings or other incidental or consequential damages arising
out of the use of or inability to use the contents, even if TUG has been
advised of the possibility of such damages, or for any claim by any
other party.

To the best of our knowledge, the information in this file is accurate.

If you discover an error in this file, we would appreciate it if you would
report it to us.  To report bugs, or to request information on membership
in TUG, please contact us at:

             Turbo User Group
             PO Box 1510
             Poulsbo, Washington USA  98370

--------------------------------------------------------------------------
                       F i l e    I n f o r m a t i o n

* DESCRIPTION
Documentation file explaining about Boyer-MOORE.

* ASSOCIATED FILES
SEARCHES.PAS
BOYER.DOC
SEARCHES.DOC

* CHECKED BY
DRM - 08/14/88

* KEYWORDS
TURBO PASCAL V4.0 DOCUMENTATION SEARCH INLINE

==========================================================================
}
    W. Vann Hall
    Pala Designs
    P.O. Box 10804
    Arlington, VA 22210
    CIS 70346,1144
    MCI WVHALL

    ABOUT Boyer-MOORE:

    Boyer-Moore is an attempt to speed up the usually serial text search.
    Traditionally, text searches proceed from the first character onward.
    (In these examples, "string" refers to string we're trying to locate,
    "text" the array of characters we're trying to find the string in.  
    These all ignore case sensitivity, matching around line boundaries,
    punctuation, etc.)

    String to find:        STRING
    Text to find it in:    HOW YOU SEARCH FOR A STRING

    Our StringPointer and TextPointer both start at 1.  We compare 
    String[1] and Text[1].  Since "S" and "H" don't match, we increment the 
    text pointer and compare again.  At TextPointer = 9, the "S" of
    "STRING" and the "S" of "SEARCH" match, so we increment TextPointer 
    *and* StringPointer and compare String[2] and Text[10].  "T" and "E" 
    don't match, so we reset StringPointer and increment TextPointer again.  
    And so on.  You can see that it takes 22 comparisons before we locate 
    the correct "S" and a full 27 before we know that we have located 
    "STRING." 

    Boyer-Moore, on the other hand, works from the end of the string (but 
    still the beginning of the text).  First, it creates a look-up table of 
    values corresponding to the distance of the first occurence of each 
    character from the end of the string.  For instance, the table for
    "STRING" would read:

    A=6  B=6  C=6  D=6  E=6  F=6  G=6  H=6  I=2  J=6  K=6  L=6  M=6
    N=1  O=6  P=6  Q=6  R=3  S=5  T=4  U=6  V=6  W=6  X=6  Y=6  Z=6

    Note that characters not located in the string are set to 
    Length(String), as it the final character.  

    Since a 6-character string can't be located entirely within the first 5 
    characters, we start by comparing the String[6] -- "G" -- with Text[6]:
    "O".  They don't match, so we increment TextPointer by the value 
    associated with "O" in the table; that is, by 6.  Our next compare is 
    "G" with the "R" in "SEARCH".  We now increment TextPointer by "R"'s 
    value, 3, and compare "S" to " ".  And so on.  Here's a tracing of the 
    progression; the letter to be compared is highlighted by lower casing: 

    STRINg
    HOW YoU SEARCH FOR A STRING

          STRINg
    HOW YOU SEArCH FOR A STRING

             STRINg
    HOW YOU SEARCH^FOR A STRING

                   STRINg
    HOW YOU SEARCH FOR A STRING

                         STRINg
    HOW YOU SEARCH FOR A STRINg

                         STRIng
    HOW YOU SEARCH FOR A STRIng

                         STRing
    HOW YOU SEARCH FOR A STRing

                         STring
    HOW YOU SEARCH FOR A STring

                         String
    HOW YOU SEARCH FOR A String

                         string
    HOW YOU SEARCH FOR A string
    
    There; 10 comparisons total.  You can see how this would speed
    searching a long text -- enough to make up for the additional overhead
    of compiling the look-up table.

amead@s.psych.uiuc.edu (alan mead) (02/07/91)

I also have some shareware compiled code that works quite quickly.  I
undoubtedly ftp'd it from SIMTEL or chyde, but if somone really wants it, I 
can probably make it available.

-alan