[comp.sys.amiga.programmer] The Boyer-Moore string searching algorithm

bojsen@moria.uucp (Per Bojsen) (06/09/91)

In article <492@regina.uregina.ca>, Dave Plummer writes:

> Would anyone happen to have source code for a (very) fast search routine?
> Basically, what I need is something like strstr(), but I need to be able
> to modify it to make it non-case sensitive.  I'm basically looking through
> a small (512-2048 bytes) buffer for an ASCII string.
>
The fastest-known single-keyword string searching algorithm both in theory
and practice is the Boyer-Moore string matching algorithm described in
Boyer, R.S. and J.S. Moore, ``A Fast String Searching Algorithm'', Comm.
ACM 20(10) (1977) pp. 62-72.

I'll try to give a short description of the algorithm.  Consider the problem
of finding a pattern p consisting of a single keyword, i.e., not a regular
expression, in an input string s.  That is, p occurs in s if and only if
s = xpy for some (possibly empty) strings x and y.  Let p = p_1 p_2 ... p_m
and s = s_1 s_2 ... s_n, where p_i (there are no subscripts in ASCII, so
read p_i as p with subscript i) is the ith character of the pattern, and
s_j the jth character of the input string.

The basic idea of the Boyer-Moore algorithm is to look for a match by comparing
characters in the keyword from *right*to*left* against the input string.  We
start by comparing p_m with s_m.  Now, if s_m does not occur in p there can be
*no* match for the pattern beginning at any of the first m characters of s.
So we can immediately proceed by comparing p_m with s_2m, thereby avoiding m-1
character comparisons.

Now for the general case.  We have just shifted the keyword to the right and
are going to compare p_m with s_k:

                               |
                               v
                p_1    ...    p_m
        s_1 ... s_(k-m+1) ... s_k ... s_n
                               ^
                               |
                               j

1) We find that p_m and s_k do not match.  If the rightmost occurrence of
   s_k in the pattern is p_(m-g), we can shift the pattern g positions to
   the right.  This aligns p_(m-g) and s_k (which by definition match).  We
   then continue by comparing p_m with s_(k+g):

                                       |
                                       v
                p_1    ...    p_(m-g) p_m
        s_1 ... s_(k-m+1) ... s_k ... s_(k+g) ... s_n
                                       ^
                                       |
                                       j

   As a special case, if s_k does not occur in the pattern at all, the
   pattern is shifted m positions to the right.

2) Now assume the last m-i characters of the keyword match the last m-i
   characters of the input string ending at position k, i.e.:

        p_(i+1) p_(i+2) ... p_m = s_(k-m+i+1) s_(k-m+i+2) ... s_k

   If i == 0 we are finished!  We found a match.  If i > 0 and
   p_i != s_(k-m+i) we are left with two cases.

   (a) If the rightmost occurrence of the character s_(k-m+i) in the
       pattern is p_(i-g), then as in case 1 we shift the keyword g
       positions to the right so that p_(i-g) and s_(k-m+i) are aligned,
       and continue by comparing p_m with s_(k+g):

                                               |
                                               v
                p_1     ...     p_(i-g) ...   p_m
        s_1 ... s_(k-m+g+1) ... s_(k-m+i) ... s_(k+g) ... s_n
                                               ^
                                               |
                                               j

       If p_(i-g) is to the right of p_i, that is, if g < 0, then we just
       shift the pattern one position to the right and continue matching by
       comparing p_m with s_(k+1).

   (b) A longer shift than that obtained in the previous case may sometimes
       be possible.  Suppose the suffix p_(i+1) p_(i+2) ... p_m reoccurs as
       the substring p_(i+1-g) p_(i+2-g) ... p_(m-g) in the keyword, with
       p_i != p_(i-g) (if there is more than one such occurrence, take the
       rightmost one).  Then we may get a bigger shift than in the previous
       case by aligning p_(i+1-g) p_(i+2-g) ... p_(m-g) above
       s_(k-m+i+1) s_(k-m+i+2) ... s_k and continuing the search by
       comparing p_m with s_(k+g):

                                                         |
                                                         v
                p_1     ...     p_(i+1-g) ...  p_(m-g)  p_m
        s_1 ... s_(k-m+g+1) ... s_(k-m+i+1) ... s_k ... s_(k+g) ... s_n
                                                         ^
                                                         |
                                                         j

The details of this algorithm may be formulated as (in pseudocode):


        begin
          j := m
          while j <= n do
          begin
            i := m
            while i > 0 cand p_i == s_j do
            begin
              i := i - 1; j := j - 1
            end
            if i == 0 then return "yes"
            else j := j + max(d_1[s_j], d_2[i])
          end
          return "no"
        end


The `cand' is a conditional (short-circuit) `and', i.e., the && operator in C.
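For reference, here is one possible 0-based C rendering of the search loop
above (just a sketch; the function name `bm_search' and the table layout are
my own, and the two shift tables are assumed to be precomputed as described
in the text):

```c
/* Sketch of the Boyer-Moore search loop in 0-based C.  d1 is indexed
 * by character, d2 by the pattern position of the mismatch; both are
 * assumed to be precomputed.  Returns the index of the first match of
 * p[0..m-1] in s[0..n-1], or -1 if there is none. */
long bm_search(const char *s, long n, const char *p, long m,
               const long d1[256], const long d2[])
{
    long j = m - 1;                   /* position in s being compared */

    while (j < n) {
        long i = m - 1;               /* position in p being compared */

        while (i >= 0 && p[i] == s[j]) {
            i--; j--;                 /* scan right to left */
        }
        if (i < 0)
            return j + 1;             /* whole pattern matched */

        /* j := j + max(d1[s_j], d2[i]), as in the pseudocode */
        j += (d1[(unsigned char)s[j]] > d2[i])
                 ? d1[(unsigned char)s[j]]
                 : d2[i];
    }
    return -1;
}
```

Note that j is decremented together with i during the right-to-left scan,
exactly as in the pseudocode, so the shift tables can be added to j directly.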

The algorithm uses two tables, d_1 and d_2, to determine how far the keyword
should be shifted to the right when p_i != s_j.  The first table is indexed
by characters.  For every character c occurring in p, d_1[c] is m-i, where i
is the largest i such that p_i == c; for all c *not* occurring in p, d_1[c]
is m.  This table takes care of cases 1 and 2a.
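In 0-based C this first table can be sketched as follows (the function name
is my own, and 8-bit characters are assumed; note that the 1-based "m - i"
of the description becomes "m - 1 - i" with a 0-based i):

```c
/* Build the d1 (bad-character) table for a pattern p of length m. */
void build_d1(const char *p, long m, long d1[256])
{
    int c;
    long i;

    for (c = 0; c < 256; c++)
        d1[c] = m;                 /* c does not occur in p at all */

    /* Scanning left to right, a later (larger) i overwrites an
     * earlier one, so each entry ends up using the largest i with
     * p_i == c, as the definition requires. */
    for (i = 0; i < m; i++)
        d1[(unsigned char)p[i]] = m - 1 - i;
}
```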

The second table takes care of case 2b.  This table is indexed by positions
in the keyword.  For each 1 <= i <= m, d_2[i] gives the minimum shift g such
that when we position p_m above s_(k+g), the substring
p_(i+1-g) p_(i+2-g) ... p_(m-g) of the keyword matches the substring
s_(k-m+i+1) s_(k-m+i+2) ... s_k of the input string, under the assumption
that p_i did not match s_(k-m+i).

Formally:

        d_2[i] = min{g+m-i | g >= 1 and (g >= i or p_(i-g) != p_i)
                             and ((g >= k or p_(k-g) == p_k) for i < k <= m)}

To compute this table use the following algorithm due to Donald Knuth:


        begin
          for i := 1 to m do d_2[i] := 2m - i
          j := m; k := m + 1
          while j > 0 do
          begin
            f[j] := k
            while k <= m cand p_j != p_k do
            begin
              d_2[k] := min(d_2[k], m - j)
              k := f[k]
            end
            j := j - 1; k := k - 1
          end
          for i := 1 to k do d_2[i] := min(d_2[i], m + k - i)
          j := f[k]
          while k <= m do
          begin
            while k <= j do
            begin
              d_2[k] := min(d_2[k], j - k + m)
              k := k + 1
            end
            j := f[j]
          end
        end


The intermediate table f computed by the algorithm has the property that
f[m] == m + 1 and for 1 <= j < m,

        f[j] == min{i | j < i <= m and
                        p_(i+1) p_(i+2) ... p_m == p_(j+1) p_(j+2) ... p_(m+j-i)}

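Converting Knuth's routine above to 0-based C is fiddly.  As a sketch, here
is a different but well-known way to compute an equivalent table (the
two-pass `suffixes' formulation later popularized in the string-matching
literature, not taken from this thread; all names are mine).  gs[i] below is
the shift of the whole pattern after a mismatch at 0-based position i, so
the d_2 above corresponds to gs[i] + m - 1 - i in 0-based terms:

```c
#include <stdlib.h>

/* suff[i] = length of the longest suffix of p[0..i] that is also a
 * suffix of the whole pattern. */
static void suffixes(const char *p, int m, int *suff)
{
    int f = 0, g = m - 1, i;

    suff[m - 1] = m;
    for (i = m - 2; i >= 0; --i) {
        if (i > g && suff[i + m - 1 - f] < i - g)
            suff[i] = suff[i + m - 1 - f];   /* reuse earlier work */
        else {
            if (i < g) g = i;
            f = i;
            while (g >= 0 && p[g] == p[g + m - 1 - f])
                --g;                         /* extend match backwards */
            suff[i] = f - g;
        }
    }
}

/* gs[i] = how far the whole pattern may be shifted after a mismatch
 * at 0-based pattern position i.  Returns 0, or -1 on out of memory. */
int good_suffix(const char *p, int m, int *gs)
{
    int i, j;
    int *suff = malloc(m * sizeof *suff);

    if (!suff) return -1;
    suffixes(p, m, suff);

    for (i = 0; i < m; ++i) gs[i] = m;
    j = 0;
    for (i = m - 1; i >= 0; --i)             /* matching prefixes */
        if (suff[i] == i + 1)
            for (; j < m - 1 - i; ++j)
                if (gs[j] == m) gs[j] = m - 1 - i;
    for (i = 0; i <= m - 2; ++i)             /* reoccurring suffixes */
        gs[m - 1 - suff[i]] = m - 1 - i;

    free(suff);
    return 0;
}
```

For example, for the pattern GCAGAGAG this yields the shifts
7 7 7 2 7 4 7 1 for mismatch positions 0 through 7.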

So much for the description of the algorithm.  There is at least one
caveat when you try to implement this algorithm in C: all arrays in the
description have a first-element index of 1.  I have not had the time to
convert the algorithm to C's 0-based convention, yet.

As for the performance of the algorithm, it is a theorem that it solves
our problem in O(m+n) time and O(m) space.  This is the worst-case behavior,
though, and the average behavior is much better.  On average the algorithm
need only examine about n/m input string characters.  Compare this to
the obvious brute-force algorithm, where you would compare *all* characters
of the input string.

As a nice example of how finding a better algorithm is much more important
than trying to hand-optimize code, it has been experimentally observed that
the Boyer-Moore algorithm outperformed simpler algorithms that were
implemented to take advantage of special-purpose machine instructions.

The description above is adapted from A.V. Aho ``Algorithms for Finding
Patterns in Strings'', pp. 267--271 in ``Algorithms and Complexity'' Vol. A
of ``Handbook of Theoretical Computer Science''.

As for C source code, I have just finished a small program today that
uses the Boyer-Moore algorithm.  If I get the time I'll extract the Boyer-
Moore parts and post the source, if there's interest.  Otherwise you might
want to look at the GNU regex.c module, which includes code for the Boyer-
Moore algorithm.

Hope this helps :-)

--
.------------------------------------------------------------------------------.
|  Greetings from Per Bojsen.                                                  |
+------------------------------+-----------------------------------------------+
|  EMail: cbmehq!lenler!bojsen | "Names do have power, after all, that of      |
|     Or: bojsen@dc.dth.dk     |  conjuring images of places we have not seen" |
`------------------------------+-----------------------------------------------'

S_ASCHMIDT@iravcl.ira.uka.de (|S| Angela Schmidt) (06/10/91)

In <19453c80.ARN360b@moria.uucp> bojsen@moria.uucp writes:

> In article <492@regina.uregina.ca>, Dave Plummer writes:
> 
> > Would anyone happen to have source code for a (very) fast search routine?
> > Basically, what I need is something like strstr(), but I need to be able
> > to modify it to make it non-case sensitive.  I'm basically looking through
> > a small (512-2048 bytes) buffer for an ASCII string.
> >
> The fastest-known single-keyword string searching algorithm both in theory
> and practice is the Boyer-Moore string matching algorithm described in
> Boyer, R.S. and J.S. Moore, ``A Fast String Searching Algorithm'', Comm.
> ACM 20(10) (1977) pp. 62-72.
> 
Not totally correct.  Boyer-Moore works in O(n*m); another algorithm, KMP,
only needs O(n+m), where n is the length of the text you are searching in
and m is the length of the text you are searching for.  So Boyer&Moore is
not the fastest in theory, only in practice:

I've already implemented both algorithms.  I've also implemented the most
trivial one, which needs about O(n*m) and looks like this:
   for (i = 0; i <= n - m; i++) {
     j = 0;
     while (j < m && a[i + j] == b[j])
       j++;
     if (j == m)
       ...             /* match at position i */
   }
I timed all three cases.  I searched for ASCII strings on a hard disk,
using the device, so there were a lot of different characters.  KMP needed
about twice as much(!!!) time as the naive algorithm, which needs about
O(n*m) - like Boyer&Moore.  But when I used Boyer&Moore, it ran much
faster, depending on the length of the search string.  So Boyer&Moore is
really the fastest - in practice.

> [...]

Nice description, but a bit too long to quote once more ;-)

bojsen@moria.uucp (Per Bojsen) (06/11/91)

In article <1991Jun10.140028.9664@ira.uka.de>, |S| Angela Schmidt writes:

> In <19453c80.ARN360b@moria.uucp> bojsen@moria.uucp writes:
>
> > The fastest-known single-keyword string searching algorithm both in theory
> > and practice is the Boyer-Moore string matching algorithm described in
> > Boyer, R.S. and J.S. Moore, ``A Fast String Searching Algorithm'', Comm.
> > ACM 20(10) (1977) pp. 62-72.
> >
> Not totally correct. Boyer-Moore works in O(n*m), another algorithm, KMP only
> needs O(n+m), where n is the length of the text you are searching in and m is
> the length of the text you are searching for. So Boyer&Moore is not the
> fastest in theory, only in practice:
>
I'm sorry, but I believe Boyer-Moore is O(m+n) like KMP.  But Boyer-Moore has
a much better average behavior than KMP.  For sufficiently large alphabets
Boyer-Moore only needs to examine about n/m characters of the input string
on the average.

> I've already implemented both algorithms. I've also implemented the most
> trivialest one which needs about O(n*m) and looks like this:
>    for (i=0; i<n-m; i++) {
>      j=0;
>      while (j<m && a[i+j]==b[j]) {
>        ...
>        j++;
>      }
>    }
>
This must be what is known as the brute-force algorithm.

> I stopped the time in all three cases. I searched ASCII-Strings on a harddisk,
> using the device. So there were a lot of differnent characters. KMP needed
> about twice as much(!!!) time as the normal algorithm which needs about
> O(n*m) - like Boyer&Moore. But when I used Boyer&Moore, it run much faster -
> depending of the length of the searching string. So Boyer&Moore is really the
> fastest - in practice.
>
Yes, and in theory, since it's O(m+n) and *not* O(m*n).  You may want to
check the references in my original posting.

--
.------------------------------------------------------------------------------.
|  Greetings from Per Bojsen.                                                  |
+------------------------------+-----------------------------------------------+
|  EMail: cbmehq!lenler!bojsen | "Names do have power, after all, that of      |
|     Or: bojsen@dc.dth.dk     |  conjuring images of places we have not seen" |
`------------------------------+-----------------------------------------------'

S_ASCHMIDT@iravcl.ira.uka.de (|S| Angela Schmidt) (06/12/91)

In <1947e763.ARN36eb@moria.uucp> bojsen@moria.uucp writes:

> In article <1991Jun10.140028.9664@ira.uka.de>, |S| Angela Schmidt writes:
> 
> > In <19453c80.ARN360b@moria.uucp> bojsen@moria.uucp writes:
> >
> > > The fastest-known single-keyword string searching algorithm both in theory
> > > and practice is the Boyer-Moore string matching algorithm described in
> > > Boyer, R.S. and J.S. Moore, ``A Fast String Searching Algorithm'', Comm.
> > > ACM 20(10) (1977) pp. 62-72.
> > >
> > Not totally correct. Boyer-Moore works in O(n*m), another algorithm, KMP only
> > needs O(n+m), where n is the length of the text you are searching in and m is
> > the length of the text you are searching for. So Boyer&Moore is not the
> > fastest in theory, only in practice:
> >
> I'm sorry, but I believe Boyer-Moore is O(m+n) like KMP.  But Boyer-Moore has
> a much better average behavior than KMP.  For sufficiently large alphabets
> Boyer-Moore only needs to examin about n/m characters of the input string
> on the average.
> 
Sorry, but I'm *very* sure it has O(n*m) since - in the worst case - it
always tries to match the whole string and only recognizes at the last
character that there is a mismatch.  Doing this n times causes O(m*n).  Of
course, normally it only needs a little more than n/m checks - that's why
it's so quick in practice.

aaaaaaa String you search in called A
baaa    String you search for called B
^ means that the character matches
! means that the character does not match

Strings    Characters that will be checked     Number of checks
---------------------------------------------------------------
aaaaaaa
baaa
!^^^       A[3] A[2] A[1] - mismatch at A[0]   m checks in 1st round

aaaaaaa
 baaa      A[4] A[3] A[2] - mismatch at A[1]   m checks in 2nd round
 !^^^

...

(n-m+1 times)

This will be done n-m+1 times, so you need O(m*(n-m+1)) and since m << n
(otherwise nobody would implement Boyer&Moore) I said that it needs O(n*m).
That does not mean that I dislike Boyer&Moore, to the contrary - I like it
very much.  But O(..) discusses the worst case and there Boyer&Moore is not
very useful.

> [...]

---------------------------------------------------------+---------------------
Angela Schmidt   Internet: S_ASchmidt@iravcl.ira.uka.de  |        //
(Nessy in IRC)   BITNET:   UK8B@DKAUNI2.BITNET           | Amiga //
                 Phone:    +49 731 712316 & 721 6904-263 |   \\ // only 1000
                           MEZ: (10am-9pm)  (12pm-10pm)  |    \X/ the real one!
---------------------------------------------------------+---------------------

jbickers@templar.actrix.gen.nz (John Bickers) (06/13/91)

Quoted from <1991Jun12.143556.4127@ira.uka.de> by S_ASCHMIDT@iravcl.ira.uka.de (|S| Angela Schmidt):

> > I'm sorry, but I believe Boyer-Moore is O(m+n) like KMP.  But Boyer-Moore has

> Sorry, but I'm *very* shure it has O(n*m) since - in the worst case - it always

    Me too. BM's worst case IS O(n*m). KMP's worst case is optimal.

    BM's average case is much better, which makes it worth taking the
    risk of hitting a worst case (the user has only themselves to blame
    :).

--
*** John Bickers, TAP, NZAmigaUG.        jbickers@templar.actrix.gen.nz ***
***         "Endless variations, make it all seem new" - Devo.          ***

S_ASCHMIDT@iravcl.ira.uka.de (|S| Angela Schmidt) (06/13/91)

In <4418.tnews@templar.actrix.gen.nz> jbickers@templar.actrix.gen.nz writes:

> Quoted from <1991Jun12.143556.4127@ira.uka.de> by S_ASCHMIDT@iravcl.ira.uka.de (|S| Angela Schmidt):
> 
> > > I'm sorry, but I believe Boyer-Moore is O(m+n) like KMP.  But Boyer-Moore has
> 
> > Sorry, but I'm *very* shure it has O(n*m) since - in the worst case - it always
> 
>     Me too. BM's worst case IS O(n*m). KMP's worst case is optimal.
> 
>     BM's average case is much better, which makes it worth taking the
>     risk of hitting a worst case (the user has only themself to blame
>     :).
> 

Sorry, John, we were both wrong... :-(

I've looked into another book and I have to apologize.  Boyer&Moore only
needs linear time in the worst case.
Some books seem to simplify the algorithm.  Let me describe the difference
from the real algorithm.

If the character you test does not appear in the string, both algorithms
do the same thing: they skip m characters.
And if it does appear in the string, both begin to test whether it is the
string we search for.
If it is not the searched-for string, the real Boyer&Moore does something
that causes the linear time:
it does not skip one character to the right as the simple algorithm does,
it skips several characters, depending on the position where the
mismatch appeared.  This behavior is like that of the KMP algorithm.  So
the real Boyer&Moore needs the same time as KMP in the worst case.  But in
the normal case it is - of course - much better.

Apropos: in the normal case the simple algorithm is not slower than the
real one, but it is much easier to implement... ;-)

Bye

Angela

---------------------------------------------------------+---------------------
Angela Schmidt   Internet: S_ASchmidt@iravcl.ira.uka.de  |        //
(Nessy in IRC)   BITNET:   UK8B@DKAUNI2.BITNET           | Amiga //
                 Phone:    +49 731 712316 & 721 6904-263 |   \\ // only 1000
                           MEZ: (10am-9pm)  (12pm-10pm)  |    \X/ the real one!
---------------------------------------------------------+---------------------




bojsen@moria.uucp (Per Bojsen) (06/16/91)

In article <1991Jun12.143556.4127@ira.uka.de>, |S| Angela Schmidt writes:

> In <1947e763.ARN36eb@moria.uucp> bojsen@moria.uucp writes:
>
> > I'm sorry, but I believe Boyer-Moore is O(m+n) like KMP.  But Boyer-Moore
> > has a much better average behavior than KMP.  For sufficiently large
> > alphabets Boyer-Moore only needs to examin about n/m characters of the
> > input string on the average.
> >
> Sorry, but I'm *very* shure it has O(n*m) since - in the worst case - it
> always tries to match the whole string and only recognizes at the last
> character, that there is a mismatch. Doing this n times causes O(m*n).
                                       ~~~~~~~~~~~~~~~~~~
This is the fallacy in your argument!  We don't need to do this n times,
but rather just about n/m times giving a behavior of approximately O(n)
in this case.  See below:

> aaaaaaaa String you search in called A
> baaa     String you search for called B
> ^ means  that the character matches
> ! means  that the character does not match
>
> Strings    Characters that will be checked     Number of checks
> ---------------------------------------------------------------
> aaaaaaaa
> baaa
> !^^^       A[3] A[2] A[1] - mismatch at A[0]   m checks in 1st round
>
> aaaaaaaa
>  baaa      A[4] A[3] A[2] - mismatch at A[1]   m checks in 2nd round
>  !^^^
>
Nope!  The Boyer-Moore algorithm will skip the first m characters of the
input string entirely at this point:

  aaaaaaaa
      baaa      A[7] A[6] A[5] - mismatch at A[4]   m checks in 2nd round
      !^^^

If you read my article you would know why it is possible to do this.  The
reason is that there cannot possibly be a match starting from any of the
first four characters of the input string since the `b' does not occur
anywhere among the last three characters of the search string.

--
.------------------------------------------------------------------------------.
|  Greetings from Per Bojsen.                                                  |
+------------------------------+-----------------------------------------------+
|  EMail: cbmehq!lenler!bojsen | "Names do have power, after all, that of      |
|     Or: bojsen@dc.dth.dk     |  conjuring images of places we have not seen" |
`------------------------------+-----------------------------------------------'

bojsen@moria.uucp (Per Bojsen) (06/16/91)

In article <4418.tnews@templar.actrix.gen.nz>, John Bickers writes:

> Quoted from <1991Jun12.143556.4127@ira.uka.de> by S_ASCHMIDT@iravcl.ira.uka.de (|S| Angela Schmidt):
>
> > > I'm sorry, but I believe Boyer-Moore is O(m+n) like KMP
> > > [Knuth-Morris-Pratt]
>
> > Sorry, but I'm *very* shure it has O(n*m) since - in the worst case -
>
>     Me too. BM's worst case IS O(n*m). KMP's worst case is optimal.
>
Nope!  BM *is* O(m+n) in the worst case.  Are you sure we are talking about
the same algorithm?  I'm talking about the Boyer-Moore algorithm that is
described in the paper by A.V. Aho in ``Algorithms and Complexity'', Volume A
of ``Handbook of Theoretical Computer Science''.  The algorithm was first
published by R.S. Boyer and J.S. Moore in ``A Fast String Searching Algorithm'',
Comm. ACM 20(10) (1977) pp 62--72.  Citing from Aho:

    ``3.8 Theorem.  The Boyer-Moore algorithm solves [the string searching
      problem] in O(m+n) time and O(m) space.''

I think you're talking about Horspool's simplification of the Boyer-Moore
algorithm.  Could that be it?

--
.------------------------------------------------------------------------------.
|  Greetings from Per Bojsen.                                                  |
+------------------------------+-----------------------------------------------+
|  EMail: cbmehq!lenler!bojsen | "Names do have power, after all, that of      |
|     Or: bojsen@dc.dth.dk     |  conjuring images of places we have not seen" |
`------------------------------+-----------------------------------------------'

bojsen@moria.uucp (Per Bojsen) (06/16/91)

In article <1991Jun13.104318.24450@ira.uka.de>, |S| Angela Schmidt writes:

> In <4418.tnews@templar.actrix.gen.nz> jbickers@templar.actrix.gen.nz writes:
>
> >     Me too. BM's worst case IS O(n*m). KMP's worst case is optimal.
> >
> >     BM's average case is much better, which makes it worth taking the
> >     risk of hitting a worst case (the user has only themself to blame
> >     :).
> >
>
> Sorry, John, we were both wrong... :-(
>
Yes.  Please disregard my rebuttal to your previous post!

> I've looked into another book and I have to excuse me. Boyer&Moore only
> needs linear time in the worst case.
> Some books seem to simplify the algorythm. Let me descripe the difference
> to the real algorythm.
>
The simple algorithm seems to be Horspool's simplified BM.
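For the curious, here is a minimal sketch of Horspool's variant (my own
code, not from the thread): it keeps only a bad-character table, computed
from the first m-1 pattern characters, and always shifts on the text
character aligned with the pattern's last position.  The window itself is
checked here with memcmp rather than an explicit right-to-left scan.

```c
#include <string.h>

/* Horspool's simplified Boyer-Moore (sketch).  Returns the index of
 * the first occurrence of p[0..m-1] in s[0..n-1], or -1. */
long horspool(const char *s, long n, const char *p, long m)
{
    long skip[256];
    long i, j;

    for (i = 0; i < 256; i++)
        skip[i] = m;                  /* default: shift past the window */
    for (i = 0; i + 1 < m; i++)       /* last pattern char excluded */
        skip[(unsigned char)p[i]] = m - 1 - i;

    for (j = 0; j <= n - m; j += skip[(unsigned char)s[j + m - 1]]) {
        if (memcmp(s + j, p, (size_t)m) == 0)
            return j;                 /* match at window start */
    }
    return -1;
}
```

Because only one table is needed, the preprocessing is trivial, which is
presumably why the simplified version shows up in so many books.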

> A propos: In normal case the simple algorythm is not slower than the real one.
> but it is much easyer to implementate... ;-)
>
Well, the real BM wasn't that difficult to implement either!

--
.------------------------------------------------------------------------------.
|  Greetings from Per Bojsen.                                                  |
+------------------------------+-----------------------------------------------+
|  EMail: cbmehq!lenler!bojsen | "Names do have power, after all, that of      |
|     Or: bojsen@dc.dth.dk     |  conjuring images of places we have not seen" |
`------------------------------+-----------------------------------------------'