[net.unix] bgrep, bm, *grep, and benchmarks

herbie@watdcsu.UUCP (Herb Chong [DCS]) (07/20/85)

In article <293@zaphod.UUCP> bobd@zaphod.UUCP (Bob Dalgleish) writes:
>The Boyer-Moore pattern matching algorithm is slower than a naive
>pattern matching algorithm in many cases (including all of the cases in
>the "benchmark").

....deleted material....

this statement contradicts the results given in the following paper,
which i have a copy of lying around somewhere.  as an implementor of
this algorithm on VM/CMS and a consultant to Peter Bain's
implementation bm, i would like to make some comments here about why
bgrep is so slow when it should be much faster.

G. de V. Smit,"A Comparison of Three String Matching Algorithms", 
Software - Practice and Experience, vol. 12,  1982, 57-66, 

if you look carefully at the bgrep code, you will notice that each
character in the input line is scanned twice before the matching
algorithm even gets to see the line.  the fgets() reads a character at
a time from the input buffer until it finds a newline.  then a strlen()
is done to find the length of the string.  if a monocase search is
done, a further pass is made to make everything lower case.  after all
that, the matching algorithm is applied.  each character has been
scanned at least two times.  in the bm implementation of the BM
algorithm, an entire block of 8K is read in using the read() system
call.  the pattern is matched against the block until a match is
found.  then the block is searched backwards and forwards for the
occurence of newlines to delimit the string.  if the pattern doesn't
occur too many times in the file, the number of times each character is
examined is on average less than once because of the way the BM
algorithm skips through the file.  bgrep's average is at least twice.

the three algoithms presented in de V. Smit's paper are the straight
forward, the Knuth-Morris-Pratt, and the Boyer-Strother-Moore.  in his
examples, the i/o implementations were identical, and the standard for
comparison was the number of times that each character in the imput
stream had to be examined averaged over an entire file.  the rankings
were BSM first, straight forward next, and KNP far behind.  the only
time that BSM was not first was when the pattern was a single character
long.  straight forward was faster then.

in summary, the i/o and string handling in bgrep accounts for all of
the differences between bgrep and bm.  none of the overhead refered to
in the article i'm replying to made a significant contribution to the
slow execution time.  special case patterns such as xxxxxxyyyyyxxxxx
would have done so.  in the pursuit of speed, after the correct
algorithm is chosen, the underlying implementation must be considered.

i should point out that Peter has been working on a replacement for bm
which at this point benchmarks about 1/4 the user time of bm, but has
about the same system time.  it uses the MATCHC assembler instruction
but the same i/o structure as bm.  this measures about 1/20 the user
time of egrep at it's fastest on literal strings.  i helped Peter
run some of the benchmarks with various test patterns.  there is only
one significant case where bm is slower than the other greps, and that
is when searching for multiple patterns in fgrep where there are
more than about four patterns.

Herb Chong...

I'm user-friendly -- I don't byte, I nybble....

UUCP:  {decvax|utzoo|ihnp4|allegra|clyde}!watmath!water!watdcsu!herbie
CSNET: herbie%watdcsu@waterloo.csnet
ARPA:  herbie%watdcsu%waterloo.csnet@csnet-relay.arpa
NETNORTH, BITNET, EARN: herbie@watdcs, herbie@watdcsu