bojsen@moria.uucp (Per Bojsen) (06/09/91)
In article <492@regina.uregina.ca>, Dave Plummer writes: > Would anyone happen to have source code for a (very) fast search routine? > Basically, what I need is something like strstr(), but I need to be able > to modify it to make it non-case sensitive. I'm basically looking through > a small (512-2048 bytes) buffer for an ASCII string. > The fastest-known single-keyword string searching algorithm both in theory and practice is the Boyer-Moore string matching algorithm described in Boyer, R.S. and J.S. Moore, ``A Fast String Searching Algorithm'', Comm. ACM 20(10) (1977) pp. 62-72. I'll try to give a short description of the algorithm. Consider the problem of finding a pattern p consisting of a single keyword, i.e., not a regular expression, and an input string s. That is if s = xpy for some x and y p occurs in s, else not. Let p = p p ... p and s = s s ... s , where p 1 2 m 1 2 n i (there's no subscripts in ASCII) is the ith character of the pattern, and s j the jth character of the input string. The basic idea of the Boyer-Moore algorithm is to look for a match by comparing characters in the keyword from *right*to*left* against the input string. We start by comparing p with s . Now, if s does not occur in p there can be m m m *no* match for the pattern beginning at any of the first m characters of s. So we can immediately proceed by comparing p with s , thereby avoiding m-1 m 2m character comparisons. Now for the general case. We have just shifted the keyword to the right and are going to compare p with s : m k | V p ... p 1 m s ... s ... s ... s 1 k-m+1 k n ^ | j 1) We find that p and s do not match. If the rightmost occurence of s m k k in the pattern is p , we can shift the pattern g positions to the right. m-g This aligns p and s (which by definition match). We then continue m-g k with comparing p with s : m k+g | V p ... p ... p 1 m-g m s ... s ... s ... s ... s 1 k-m+1 k k+g n ^ | j As the special case, if sk did not occur in the pattern, the pattern is shifted m positions to the right. 2) Now assume the last m-i characters of the keyword matches the last m-i characters of the input string ending at position k. I.e.: p p ... p = s s ... s i+1 i+2 m k-m+i+1 k-m+i+2 k If i == 0 we are finished! We found a match. If i>0 and p != s i k-m+i we are left with two cases. (a) If the rightmost occurence of the character s in the pattern k-m+1 is p , then as in case 1 we shift the keyword g positions to the i-g right so that p and s are aligned and continue by comparing i-g k-m+i p with s : m k+g | V p ... p ... p 1 i-g m s ... s ... s ... s ... s 1 k-m+g+1 k-m+i k+g n ^ | j If p is to the right of p , that is g < 0, then we should just i-g i shift the pattern one position to the right and continue matching by comparing p with s . m k+1 (b) A longer shift than that obtained in the previous case may sometimes be possible. If the suffix p p ... p reoccurs as the sub- i+1 i+2 m string p p ...p in the keyword and p != p (if there're i+1-g i+2-g m-g i i-g more than one occurence then take the rightmost one). Then we may get a bigger shift than in the previous case by aligning p p ...p above s s ...s and continuing the i+1-g i+2-g m-g k-m+i+1 k-m+i+2 k search by comparing p with s : m k+g | V p ... p ... p ... p 1 i+1-g m-g m s ... s ... s ... s ... s ... s 1 k-m+g+1 k-m+i+1 k k+g n ^ | j The details of this algorithm may be formulated as (in pseudocode): begin j := m while j <= n do begin i := m while i > 0 cand p == s do i j begin i := i - 1; j := j - 1 end if i := 0 then return "yes" else j := j + max(d [s ], d [i]) 1 j 2 end return "no" end The `cand' is a conditional and, aka a shortcut and, or the && operator in C. The algorithm uses two tables d and d in determining how far the keyword 1 2 should be shifted to the right when p != s . The first table is indexed by i j characters. For every character c occuring in p, d [p] is m-i, where i is 1 the largest i such that p == c; for all c *not* occuring in p, d [p] is m. i 1 This table takes care of cases 1 and 2a. The second table takes care of case 2b. This table is indexed by positions in the keyword. For each 1 <= i <= m, d [i] gives the minimum shift g such 2 that when we position p above s , the substring p p ...p of m k+g i+1-g i+2-g m-g the keyword matches the substring s s ...s of the input string, k-m+i+1 k-m+i+2 k under the assumption that p did not match s . i k-m+i Formally: d [i] = min{g+m-i | g >= 1 and (g >= i or p != p ) 2 i-g i and ((g >= k or p == p ) for i < k <= m)} k-g k To compute this table use the following algorithm due to Donald Knuth: begin for i := 1 to n do d [i] := 2m - i 2 j := m; k := m + 1 while j > 0 do begin f[j] := k while k <= m cand p != p do j k begin d [k] := min(d [k], m - j) 2 2 k := f[k] end j := j - 1; k := k - 1 end for i := 1 to k do d [i] := min(d [i], m + k - i) 2 2 j := f[k] while k <= m do begin while k <= j do begin d [k] := min(d [k], j - k + m) 2 2 k := k + 1 end j := f[j] end end The intermediate table f computed by the algorithm has the property that f[m] == m + 1 and for 1 <= j < m, f[j] == min{i | j < i <= m and p p ...p == p p ...p } i+1 i+2 m j+1 j+2 m+j-i So much for the description of the algorithm. There are at least one caveat when you try to implement this algorithm in C. All arrays in the description has a first element index of 1. I have not had the time to convert the algorithm to the C standard, yet. As for the performance of the algorithm it is a theorem that it solves our problem in O(m+n) time and O(m) space. This is the worst case behavior, though, and the average behavior is much better. In average the algorithm need only compare about n/m input string characters. Compare this to the obvious brute-force algorithm, where you would compare *all* characters of the input string. As a nice example of how finding a better algorithm is much more important than trying to handoptimize code it has been experimentally observed how the Boyer-Moore algorithm outperformed simpler algorithms that were implemen- ted to take advantage of special-purpose machine instructions. The description above is adapted from A.V. Aho ``Algorithms for Finding Patterns in Strings'', pp. 267--271 in ``Algorithms and Complexity'' Vol. A of ``Handbook of Theoretical Computer Science''. As for C source code, I have just finished a small program today that uses the Boyer-Moore algorithm. If I get the time I'll extract the Boyer- Moore parts and post the source, if there's interest. Otherwise you might want to look at the GNU regex.c module, which includes code for the Boyer- Moore algorithm. Hope this helps :-) -- .------------------------------------------------------------------------------. | Greetings from Per Bojsen. | +------------------------------+-----------------------------------------------+ | EMail: cbmehq!lenler!bojsen | "Names do have power, after all, that of | | Or: bojsen@dc.dth.dk | conjuring images of places we have not seen" | `------------------------------+-----------------------------------------------'
S_ASCHMIDT@iravcl.ira.uka.de (|S| Angela Schmidt) (06/10/91)
In <19453c80.ARN360b@moria.uucp> bojsen@moria.uucp writes: > In article <492@regina.uregina.ca>, Dave Plummer writes: > > > Would anyone happen to have source code for a (very) fast search routine? > > Basically, what I need is something like strstr(), but I need to be able > > to modify it to make it non-case sensitive. I'm basically looking through > > a small (512-2048 bytes) buffer for an ASCII string. > > > The fastest-known single-keyword string searching algorithm both in theory > and practice is the Boyer-Moore string matching algorithm described in > Boyer, R.S. and J.S. Moore, ``A Fast String Searching Algorithm'', Comm. > ACM 20(10) (1977) pp. 62-72. > Not totally correct. Boyer-Moore works in O(n*m), another algorithm, KMP only needs O(n+m), where n is the length of the text you are searching in and m is the length of the text you are searching for. So Boyer&Moore is not the fastest in theory, only in practice: I've already implemented both algorithms. I've also implemented the most trivialest one which needs about O(n*m) and looks like this: for (i=0; i<n-m; i++) { j=0; while (j<m && a[i+j]==b[j]) { ... j++; } } I stopped the time in all three cases. I searched ASCII-Strings on a harddisk, using the device. So there were a lot of differnent characters. KMP needed about twice as much(!!!) time as the normal algorithm which needs about O(n*m) - like Boyer&Moore. But when I used Boyer&Moore, it run much faster - depending of the length of the searching string. So Boyer&Moore is really the fastest - in practice. > [...] Nice, your description, bit too long to quote it once more ;-) > > -- > .------------------------------------------------------------------------------. > | Greetings from Per Bojsen. | > +------------------------------+-----------------------------------------------+ > | EMail: cbmehq!lenler!bojsen | "Names do have power, after all, that of | > | Or: bojsen@dc.dth.dk | conjuring images of places we have not seen" | > `------------------------------+-----------------------------------------------'
bojsen@moria.uucp (Per Bojsen) (06/11/91)
In article <1991Jun10.140028.9664@ira.uka.de>, |S| Angela Schmidt writes: > In <19453c80.ARN360b@moria.uucp> bojsen@moria.uucp writes: > > > The fastest-known single-keyword string searching algorithm both in theory > > and practice is the Boyer-Moore string matching algorithm described in > > Boyer, R.S. and J.S. Moore, ``A Fast String Searching Algorithm'', Comm. > > ACM 20(10) (1977) pp. 62-72. > > > Not totally correct. Boyer-Moore works in O(n*m), another algorithm, KMP only > needs O(n+m), where n is the length of the text you are searching in and m is > the length of the text you are searching for. So Boyer&Moore is not the > fastest in theory, only in practice: > I'm sorry, but I believe Boyer-Moore is O(m+n) like KMP. But Boyer-Moore has a much better average behavior than KMP. For sufficiently large alphabets Boyer-Moore only needs to examin about n/m characters of the input string on the average. > I've already implemented both algorithms. I've also implemented the most > trivialest one which needs about O(n*m) and looks like this: > for (i=0; i<n-m; i++) { > j=0; > while (j<m && a[i+j]==b[j]) { > ... > j++; > } > } > This must be what is known as the brute-force algorithm. > I stopped the time in all three cases. I searched ASCII-Strings on a harddisk, > using the device. So there were a lot of differnent characters. KMP needed > about twice as much(!!!) time as the normal algorithm which needs about > O(n*m) - like Boyer&Moore. But when I used Boyer&Moore, it run much faster - > depending of the length of the searching string. So Boyer&Moore is really the > fastest - in practice. > Yes, and in theory, since it's O(m+n) and *not* O(m*n). You may want to check the references in my original posting. -- .------------------------------------------------------------------------------. | Greetings from Per Bojsen. | +------------------------------+-----------------------------------------------+ | EMail: cbmehq!lenler!bojsen | "Names do have power, after all, that of | | Or: bojsen@dc.dth.dk | conjuring images of places we have not seen" | `------------------------------+-----------------------------------------------'
S_ASCHMIDT@iravcl.ira.uka.de (|S| Angela Schmidt) (06/12/91)
In <1947e763.ARN36eb@moria.uucp> bojsen@moria.uucp writes: > In article <1991Jun10.140028.9664@ira.uka.de>, |S| Angela Schmidt writes: > > > In <19453c80.ARN360b@moria.uucp> bojsen@moria.uucp writes: > > > > > The fastest-known single-keyword string searching algorithm both in theory > > > and practice is the Boyer-Moore string matching algorithm described in > > > Boyer, R.S. and J.S. Moore, ``A Fast String Searching Algorithm'', Comm. > > > ACM 20(10) (1977) pp. 62-72. > > > > > Not totally correct. Boyer-Moore works in O(n*m), another algorithm, KMP only > > needs O(n+m), where n is the length of the text you are searching in and m is > > the length of the text you are searching for. So Boyer&Moore is not the > > fastest in theory, only in practice: > > > I'm sorry, but I believe Boyer-Moore is O(m+n) like KMP. But Boyer-Moore has > a much better average behavior than KMP. For sufficiently large alphabets > Boyer-Moore only needs to examin about n/m characters of the input string > on the average. > Sorry, but I'm *very* shure it has O(n*m) since - in the worst case - it always tries to match the whole string and only recognizes at the last character, that there is a mismatch. Doing this n times causes O(m*n). Of course, normally it only needs a little bit more than n/m - that's why in practice it's so quick. aaaaaaa String you search in called A baaa String you search for called B ^ means that the character matches ! means that the character does not match Strings Characters that will be checked Number of checks --------------------------------------------------------------- aaaaaaa baaa !^^^ A[3] A[2] A[1] - mismatch at A[0] m checks in 1st round aaaaaaa baaa A[4] A[3] A[2] - mismatch at A[1] m checks in 2nd round !^^^ ... (n-m+1 times) This will be done n-m+1 times, so you need O(m*(n-m+1)) and since m<<n (otherwise nobody would implementat Boyer&Moore) I said that it needs O(n*m). That does not mean that I dislike Boyer&Moore to the contrary - I like it very much. But O(..) discusses the worst case and there Boyer&Moore is not very useful. > [...] > -- > .------------------------------------------------------------------------------. > | Greetings from Per Bojsen. | > +------------------------------+-----------------------------------------------+ > | EMail: cbmehq!lenler!bojsen | "Names do have power, after all, that of | > | Or: bojsen@dc.dth.dk | conjuring images of places we have not seen" | > `------------------------------+-----------------------------------------------' ---------------------------------------------------------+--------------------- Angela Schmidt Internet: S_ASchmidt@iravcl.ira.uka.de | // (Nessy in IRC) BITNET: UK8B@DKAUNI2.BITNET | Amiga // Phone: +49 731 712316 & 721 6904-263 | \\ // only 1000 MEZ: (10am-9pm) (12pm-10pm) | \X/ the real one! ---------------------------------------------------------+---------------------
jbickers@templar.actrix.gen.nz (John Bickers) (06/13/91)
Quoted from <1991Jun12.143556.4127@ira.uka.de> by S_ASCHMIDT@iravcl.ira.uka.de (|S| Angela Schmidt): > > I'm sorry, but I believe Boyer-Moore is O(m+n) like KMP. But Boyer-Moore has > Sorry, but I'm *very* shure it has O(n*m) since - in the worst case - it always Me too. BM's worst case IS O(n*m). KMP's worst case is optimal. BM's average case is much better, which makes it worth taking the risk of hitting a worst case (the user has only themself to blame :). > Angela Schmidt Internet: S_ASchmidt@iravcl.ira.uka.de | // -- *** John Bickers, TAP, NZAmigaUG. jbickers@templar.actrix.gen.nz *** *** "Endless variations, make it all seem new" - Devo. ***
S_ASCHMIDT@iravcl.ira.uka.de (|S| Angela Schmidt) (06/13/91)
In <4418.tnews@templar.actrix.gen.nz> jbickers@templar.actrix.gen.nz writes: > Quoted from <1991Jun12.143556.4127@ira.uka.de> by S_ASCHMIDT@iravcl.ira.uka.de (|S| Angela Schmidt): > > > > I'm sorry, but I believe Boyer-Moore is O(m+n) like KMP. But Boyer-Moore has > > > Sorry, but I'm *very* shure it has O(n*m) since - in the worst case - it always > > Me too. BM's worst case IS O(n*m). KMP's worst case is optimal. > > BM's average case is much better, which makes it worth taking the > risk of hitting a worst case (the user has only themself to blame > :). > > > Angela Schmidt Internet: S_ASchmidt@iravcl.ira.uka.de | // > -- > *** John Bickers, TAP, NZAmigaUG. jbickers@templar.actrix.gen.nz *** > *** "Endless variations, make it all seem new" - Devo. *** Sorry, John, we were both wrong... :-( I've looked into another book and I have to excuse me. Boyer&Moore only needs linear time in the worst case. Some books seem to simplify the algorythm. Let me descripe the difference to the real algorythm. If the character you test does not appear in the string, both algorythms do the same thing: They skip m characters. And if it does appear in the string, both begin to test, if it's the string we search. If it's not the searched string, the real Boyer&Moore does something, that causes the linear time: It does not skip one character to the right as the simple algorythm does, it skips several characters - depending on the position where the mismatch appeared. This behavior is like that of the KMP-algorythm. So the real Boyer&Moore needs the same time as KMP in worst case. But in the normal case it is - of course - much better. A propos: In normal case the simple algorythm is not slower than the real one. but it is much easyer to implementate... ;-) Bye Angela ---------------------------------------------------------+--------------------- Angela Schmidt Internet: S_ASchmidt@iravcl.ira.uka.de | // (Nessy in IRC) BITNET: UK8B@DKAUNI2.BITNET | Amiga // Phone: +49 731 712316 & 721 6904-263 | \\ // only 1000 MEZ: (10am-9pm) (12pm-10pm) | \X/ the real one! ---------------------------------------------------------+--------------------- ---------------------------------------------------------+--------------------- Angela Schmidt Internet: S_ASchmidt@iravcl.ira.uka.de | // (Nessy in IRC) BITNET: UK8B@DKAUNI2.BITNET | Amiga // Phone: +49 731 712316 & 721 6904-263 | \\ // only 1000 MEZ: (10am-9pm) (12pm-10pm) | \X/ the real one! ---------------------------------------------------------+---------------------
bojsen@moria.uucp (Per Bojsen) (06/16/91)
In article <1991Jun12.143556.4127@ira.uka.de>, |S| Angela Schmidt writes: > In <1947e763.ARN36eb@moria.uucp> bojsen@moria.uucp writes: > > > I'm sorry, but I believe Boyer-Moore is O(m+n) like KMP. But Boyer-Moore > > has a much better average behavior than KMP. For sufficiently large > > alphabets Boyer-Moore only needs to examin about n/m characters of the > > input string on the average. > > > Sorry, but I'm *very* shure it has O(n*m) since - in the worst case - it > always tries to match the whole string and only recognizes at the last > character, that there is a mismatch. Doing this n times causes O(m*n). ~~~~~~~~~~~~~~~~~~ This is the fallacy in your argument! We don't need to do this n times, but rather just about n/m times giving a behavior of approximately O(n) in this case. See below: > aaaaaaaa String you search in called A > baaa String you search for called B > ^ means that the character matches > ! means that the character does not match > > Strings Characters that will be checked Number of checks > --------------------------------------------------------------- > aaaaaaaa > baaa > !^^^ A[3] A[2] A[1] - mismatch at A[0] m checks in 1st round > > aaaaaaaa > baaa A[4] A[3] A[2] - mismatch at A[1] m checks in 2nd round > !^^^ > Nope! The Boyer-Moore algorithm will skip the first m characters of the input string entirely at this point: aaaaaaaa baaa A[8] A[7] A[6] - mismatch at A[4] m checks in 2nd round !^^^ If you read my article you would know why it is possible to do this. The reason is that there cannot possibly be a match starting from any of the first four characters of the input string since the `b' does not occur anywhere among the last three characters of the search string. -- .------------------------------------------------------------------------------. | Greetings from Per Bojsen. | +------------------------------+-----------------------------------------------+ | EMail: cbmehq!lenler!bojsen | "Names do have power, after all, that of | | Or: bojsen@dc.dth.dk | conjuring images of places we have not seen" | `------------------------------+-----------------------------------------------'
bojsen@moria.uucp (Per Bojsen) (06/16/91)
In article <4418.tnews@templar.actrix.gen.nz>, John Bickers writes: > Quoted from <1991Jun12.143556.4127@ira.uka.de> by S_ASCHMIDT@iravcl.ira.uka.de (|S| Angela Schmidt): > > > > I'm sorry, but I believe Boyer-Moore is O(m+n) like KMP [Knuth-Morris- Pratt] > > > Sorry, but I'm *very* shure it has O(n*m) since - in the worst case - > > Me too. BM's worst case IS O(n*m). KMP's worst case is optimal. > Nope! BM *is* O(m+n) in the worst case. Are you sure we are talking about the same algorithm? I'm talking about the Boyer-Moore algorithm that is described in the paper by A.V. Aho in ``Algorithms and Complexity'', Volume A of ``Handbook of Theoretical Computer Science''. The algorithm was first published by R.S. Boyer and J.S. Moore in ``A Fast String Searching Algorithm'', Comm. ACM 20(10) (1977) pp 62--72. Citing from Aho: ``3.8 Theorem. The Boyer-Moore algorithm solves [the string searching problem] in O(m+n) time and O(m) space.'' I think you're talking about Horspool's simplification of the Boyer-Moore algorithm. Could that be it? -- .------------------------------------------------------------------------------. | Greetings from Per Bojsen. | +------------------------------+-----------------------------------------------+ | EMail: cbmehq!lenler!bojsen | "Names do have power, after all, that of | | Or: bojsen@dc.dth.dk | conjuring images of places we have not seen" | `------------------------------+-----------------------------------------------'
bojsen@moria.uucp (Per Bojsen) (06/16/91)
In article <1991Jun13.104318.24450@ira.uka.de>, |S| Angela Schmidt writes: > In <4418.tnews@templar.actrix.gen.nz> jbickers@templar.actrix.gen.nz writes: > > > Me too. BM's worst case IS O(n*m). KMP's worst case is optimal. > > > > BM's average case is much better, which makes it worth taking the > > risk of hitting a worst case (the user has only themself to blame > > :). > > > > Sorry, John, we were both wrong... :-( > Yes. Please disregard my rebuttal to your previous post! > I've looked into another book and I have to excuse me. Boyer&Moore only > needs linear time in the worst case. > Some books seem to simplify the algorythm. Let me descripe the difference > to the real algorythm. > The simple algorithm seems to be Horspools simplified BM. > A propos: In normal case the simple algorythm is not slower than the real one. > but it is much easyer to implementate... ;-) > Well, the real BM wasn't that difficult to implement either! -- .------------------------------------------------------------------------------. | Greetings from Per Bojsen. | +------------------------------+-----------------------------------------------+ | EMail: cbmehq!lenler!bojsen | "Names do have power, after all, that of | | Or: bojsen@dc.dth.dk | conjuring images of places we have not seen" | `------------------------------+-----------------------------------------------'