geoffw@elecvax.SUN (Geoff Whale) (05/23/84)
The following bug can be found in many releases, notably Level 7 and system V (fgrep version 1.2). Try matching the patterns "roar" and "arrow" against the subject "arroar". It will fail to match due to a problem in setting fail links in cfail() - this was probably never noticed because of the highly obscure and unstructured nature of the implementation, which looks more like Fortran than C. The expedient fix (and the poor quality of the original code) is evident from the diff listing below: 317c317 < *rear++ = s->nst; --- > *rear++ = s; 339,346c339,351 < floop: if (state == 0) state = w; < if (state->inp == c) { < qloop: q->fail = state->nst; < if ((state->nst)->out == 1) q->out = 1; < if ((q = q->link) != 0) goto qloop; < } < else if ((state = state->link) != 0) < goto floop; --- > while (state != NULL && state->inp != c) > if (state->link != NULL) > state = state->link; > else /* no more alternatives at this level */ > state = state->fail; > if (state == NULL) > state = w; /* first level node */ > else > state = state->nst; /* success link */ > do { > q->fail = state; > q->out |= state->out; > } while ((q = q->link) != NULL); The program has been completely rewritten at UNSW to use malloc() to obtain tree nodes when needed (the original version declares an incredible 96076 bytes of data!), to avoid useless leaf nodes in the tree, and to accept the other grep options -s and -y. -- Geoff Whale (geoffw:elecvax)
fnf@unisoft.UUCP (07/09/85)
*** REPLACE THIS LINE WITH YOUR LIFE HISTORY *** After grabbing the bgrep distribution off of mod.sources recently I decided to try a quick test of the various grep's on one our system V release 2 ports: Trial 1: *grep FOOBARBLETCH /etc/termcap Trial 2: *grep BARF /etc/termcap Trial 3: *grep MI /etc/termcap /etc/termcap approx 50kb trial 1 trial 2 trial 3 real user sys real user sys real user sys ---- ---- --- ---- ---- --- ---- ---- --- grep 1.9 1.1 0.6 1.8 1.1 0.6 1.8 1.1 0.6 bgrep 2.7 1.9 0.7 2.3 1.4 0.7 4.9 3.9 0.7 egrep 3.5 2.6 0.7 3.6 2.8 0.6 3.5 2.7 0.6 fgrep 9.6 8.8 0.7 9.7 9.0 0.6 9.8 8.9 0.7 Notice that plain old grep is the fastest of all, and fgrep is the slowest! =========================================================================== Fred Fish UniSoft Systems Inc, 739 Allston Way, Berkeley, CA 94710 USA {ucbvax,decvax}!unisoft!fnf (415) 644 1230 TWX 11 910 366-2145 ===========================================================================
bobd@zaphod.UUCP (Bob Dalgleish) (07/11/85)
In article <495@unisoft.UUCP> fnf@unisoft.UUCP writes: >After grabbing the bgrep distribution off of mod.sources recently >I decided to try a quick test of the various grep's on one our system >V release 2 ports: > > Trial 1: *grep FOOBARBLETCH /etc/termcap > Trial 2: *grep BARF /etc/termcap > Trial 3: *grep MI /etc/termcap > > [Table depicting /bin/time stats for "benchmarks"] >Notice that plain old grep is the fastest of all, and fgrep is the slowest! > >Fred Fish UniSoft Systems Inc, 739 Allston Way, Berkeley, CA 94710 USA The Boyer-Moore pattern matching algorithm is slower than a naive pattern matching algorithm in many cases (including all of the cases in the "benchmark"). It uses a lookahead set to decide how much to advance the pattern against the subject string. Using a one or two character pattern causes the pattern matching overhead (both the setup and runtime) to greatly exceed the matching operation. The BM algorithm works best on longer strings, *especially* strings that resemble the subject string, i.e., the front part of the pattern is in the subject string, with variations in the rest of the pattern. Since only one of the benchmark patterns actually occurs in the subject string file (at least I presume it does - it doesn't in mine), the partial match recovery that the BM algorithm uses will not come into effect more than a few times overall. A much better test would use (for instance) a misspelled word in a large document. The BM algorithm would then show some improvements. As mentioned in the documentation for the grep family, ideally there need only be one program that determines the best algorithm to use for the pattern. Since the best algorithm actually depends on both the pattern and the subject spaces, "best" is not possible in practice. REMEMBER, in benchmarking as in everything else: CHOOSE HORSES FOR COURSES. -- [Forgive me, Father, for I have signed ...] Bob Dalgleish ...!alberta!sask!zaphod!bobd ihnp4! (My company has disclaimed any knowledge of me and whatever I might say)
bruce@stride.UUCP (Bruce Robertson) (07/11/85)
In article <495@unisoft.UUCP> fnf@unisoft.UUCP writes: > >After grabbing the bgrep distribution off of mod.sources recently >I decided to try a quick test of the various grep's on one our system >V release 2 ports: > > trial 1 trial 2 trial 3 > real user sys real user sys real user sys > ---- ---- --- ---- ---- --- ---- ---- --- > >grep 1.9 1.1 0.6 1.8 1.1 0.6 1.8 1.1 0.6 >bgrep 2.7 1.9 0.7 2.3 1.4 0.7 4.9 3.9 0.7 >egrep 3.5 2.6 0.7 3.6 2.8 0.6 3.5 2.7 0.6 >fgrep 9.6 8.8 0.7 9.7 9.0 0.6 9.8 8.9 0.7 Here are my results of the same benchmarks: trial 1 trial 2 trial 3 real user sys real user sys real user sys ---- ---- --- ---- ---- --- ---- ---- --- grep 1.7 1.3 0.3 1.7 1.3 0.3 1.6 1.3 0.3 bm 0.5 0.1 0.3 0.8 0.4 0.7 1.3 0.9 0.3 egrep 1.7 1.2 0.3 1.7 1.2 0.3 1.7 1.3 0.4 fgrep 1.5 1.0 0.4 1.4 1.0 0.3 1.4 1.0 0.4 My results are more what I would expect to see. I used bm, because I ported that one, and not bgrep. As you can see, bm is MUCH faster than all of the others, and 'fgrep' is marginally faster than grep. These were run on a Stride Micro 460 computer, which is 68000 based at 10 MHz with no wait states, running the Motorola System-V/68 version of Unix, which is System V release 1. The file used for the test was the first 51000 bytes of /etc/termcap. I'm not sure why your times are so large, especially for egrep, but I'd take a close look at your C compiler and see what code it's generating. -- Bruce Robertson UUCP: {ucbvax!menlo70,seismo}!unr70!unrvax!stride!bruce
henry@utzoo.UUCP (Henry Spencer) (07/12/85)
> Notice that plain old grep is the fastest of all, and fgrep is the slowest!
Sigh, it is well-known that fgrep is slow when searching for a single string.
You forgot to test egrep, which is well-known to be the fastest for the
trivial case. Where fgrep makes a difference is when you want to search
for a number of strings simultaneously. Grep can't do that at all, and
egrep hits pattern-complexity limits quickly. Fgrep really screams along
in this case. I haven't tested bgrep et al yet.
--
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,linus,decvax}!utzoo!henry
henry@utzoo.UUCP (Henry Spencer) (07/14/85)
Oops, my apologies, my "you didn't test egrep" comment was incorrect; I wasn't paying attention properly. "It is wise to engage brain before putting mouth in motion." -- Henry Spencer @ U of Toronto Zoology {allegra,ihnp4,linus,decvax}!utzoo!henry