herbie@watdcsu.UUCP (Herb Chong [DCS]) (07/20/85)
In article <293@zaphod.UUCP> bobd@zaphod.UUCP (Bob Dalgleish) writes: >The Boyer-Moore pattern matching algorithm is slower than a naive >pattern matching algorithm in many cases (including all of the cases in >the "benchmark"). ....deleted material.... this statement contradicts the results given in the following paper, which i have a copy of lying around somewhere. as an implementor of this algorithm on VM/CMS and a consultant to Peter Bain's implementation bm, i would like to make some comments here about why bgrep is so slow when it should be much faster. G. de V. Smit,"A Comparison of Three String Matching Algorithms", Software - Practice and Experience, vol. 12, 1982, 57-66, if you look carefully at the bgrep code, you will notice that each character in the input line is scanned twice before the matching algorithm even gets to see the line. the fgets() reads a character at a time from the input buffer until it finds a newline. then a strlen() is done to find the length of the string. if a monocase search is done, a further pass is made to make everything lower case. after all that, the matching algorithm is applied. each character has been scanned at least two times. in the bm implementation of the BM algorithm, an entire block of 8K is read in using the read() system call. the pattern is matched against the block until a match is found. then the block is searched backwards and forwards for the occurence of newlines to delimit the string. if the pattern doesn't occur too many times in the file, the number of times each character is examined is on average less than once because of the way the BM algorithm skips through the file. bgrep's average is at least twice. the three algoithms presented in de V. Smit's paper are the straight forward, the Knuth-Morris-Pratt, and the Boyer-Strother-Moore. in his examples, the i/o implementations were identical, and the standard for comparison was the number of times that each character in the imput stream had to be examined averaged over an entire file. the rankings were BSM first, straight forward next, and KNP far behind. the only time that BSM was not first was when the pattern was a single character long. straight forward was faster then. in summary, the i/o and string handling in bgrep accounts for all of the differences between bgrep and bm. none of the overhead refered to in the article i'm replying to made a significant contribution to the slow execution time. special case patterns such as xxxxxxyyyyyxxxxx would have done so. in the pursuit of speed, after the correct algorithm is chosen, the underlying implementation must be considered. i should point out that Peter has been working on a replacement for bm which at this point benchmarks about 1/4 the user time of bm, but has about the same system time. it uses the MATCHC assembler instruction but the same i/o structure as bm. this measures about 1/20 the user time of egrep at it's fastest on literal strings. i helped Peter run some of the benchmarks with various test patterns. there is only one significant case where bm is slower than the other greps, and that is when searching for multiple patterns in fgrep where there are more than about four patterns. Herb Chong... I'm user-friendly -- I don't byte, I nybble.... UUCP: {decvax|utzoo|ihnp4|allegra|clyde}!watmath!water!watdcsu!herbie CSNET: herbie%watdcsu@waterloo.csnet ARPA: herbie%watdcsu%waterloo.csnet@csnet-relay.arpa NETNORTH, BITNET, EARN: herbie@watdcs, herbie@watdcsu