[net.unix] fgrep

fnf@unisoft.UUCP (07/09/85)

*** REPLACE THIS LINE WITH YOUR LIFE HISTORY ***

After grabbing the bgrep distribution off of mod.sources recently
I decided to try a quick test of the various grep's on one our system
V release 2 ports:

	Trial 1:	*grep FOOBARBLETCH /etc/termcap
	Trial 2:	*grep BARF /etc/termcap
	Trial 3:	*grep MI /etc/termcap
	
	/etc/termcap approx 50kb
	
	
		trial 1			trial 2			trial 3
	real	user	sys	real	user	sys	real	user	sys
	----	----	---	----	----	---	----	----	---

grep	1.9	1.1	0.6	1.8	1.1	0.6	1.8	1.1	0.6
bgrep	2.7	1.9	0.7	2.3	1.4	0.7	4.9	3.9	0.7
egrep	3.5	2.6	0.7	3.6	2.8	0.6	3.5	2.7	0.6
fgrep	9.6	8.8	0.7	9.7	9.0	0.6	9.8	8.9	0.7


Notice that plain old grep is the fastest of all, and fgrep is the slowest!


===========================================================================
Fred Fish    UniSoft Systems Inc, 739 Allston Way, Berkeley, CA  94710  USA
{ucbvax,decvax}!unisoft!fnf	(415) 644 1230 		TWX 11 910 366-2145
===========================================================================

charliem@shark.UUCP (Charlie Mills) (07/10/85)

Under 4.2 on a vax, egrep is always faster than grep, running in about
two thirds the time.  But notice that egrep uses more memory, and there
are one or two things grep can do that egrep can't (-v option).  Fgrep
is intended to be used with a string-list instead of a pattern.
Emphasis on *list*.  The string-list is usually taken from a file.

I wonder why grep is faster than egrep on the Unisoft SysVr2 ports.
Did AT&T or Unisoft speed up grep?  Did Berkeley speed up egrep?  Is
memory size the issue?

	-- Charlie Mills
UUCP: ..{ucbvax,decvax,uw-beaver,hplabs,ihnp4,allegra}!tektronix!shark!charliem
CSNET:	shark!charliem@tektronix
ARPA:	shark!charliem.tektronix@rand-relay
USMail: M/S 61-277
	Tektronix, Inc.
	P.O. Box 1000
	Wilsonville, OR 97070

bobd@zaphod.UUCP (Bob Dalgleish) (07/11/85)

In article <495@unisoft.UUCP> fnf@unisoft.UUCP writes:
>After grabbing the bgrep distribution off of mod.sources recently
>I decided to try a quick test of the various grep's on one our system
>V release 2 ports:
>
>	Trial 1:	*grep FOOBARBLETCH /etc/termcap
>	Trial 2:	*grep BARF /etc/termcap
>	Trial 3:	*grep MI /etc/termcap
>	
> [Table depicting /bin/time stats for "benchmarks"]
>Notice that plain old grep is the fastest of all, and fgrep is the slowest!
>
>Fred Fish    UniSoft Systems Inc, 739 Allston Way, Berkeley, CA  94710  USA

The Boyer-Moore pattern matching algorithm is slower than a naive
pattern matching algorithm in many cases (including all of the cases in
the "benchmark").  It uses a lookahead set to decide how much to advance
the pattern against the subject string.  Using a one or two character
pattern causes the pattern matching overhead (both the setup and
runtime) to greatly exceed the matching operation.  The BM algorithm
works best on longer strings, *especially* strings that resemble the
subject string, i.e., the front part of the pattern is in the subject
string, with variations in the rest of the pattern.  Since only one of
the benchmark patterns actually occurs in the subject string file (at
least I presume it does - it doesn't in mine), the partial match
recovery that the BM algorithm uses will not come into effect more than
a few times overall.  A much better test would use (for instance) a
misspelled word in a large document.  The BM algorithm would then show
some improvements.

As mentioned in the documentation for the grep family, ideally there
need only be one program that determines the best algorithm to use for
the pattern.  Since the best algorithm actually depends on both the
pattern and the subject spaces, "best" is not possible in practice.

REMEMBER, in benchmarking as in everything else: CHOOSE HORSES FOR COURSES.

-- 
[Forgive me, Father, for I have signed ...]

Bob Dalgleish		...!alberta!sask!zaphod!bobd
			      ihnp4!
(My company has disclaimed any knowledge of me and whatever I might say)

bruce@stride.UUCP (Bruce Robertson) (07/11/85)

In article <495@unisoft.UUCP> fnf@unisoft.UUCP writes:
>
>After grabbing the bgrep distribution off of mod.sources recently
>I decided to try a quick test of the various grep's on one our system
>V release 2 ports:
>
>		trial 1			trial 2			trial 3
>	real	user	sys	real	user	sys	real	user	sys
>	----	----	---	----	----	---	----	----	---
>
>grep	1.9	1.1	0.6	1.8	1.1	0.6	1.8	1.1	0.6
>bgrep	2.7	1.9	0.7	2.3	1.4	0.7	4.9	3.9	0.7
>egrep	3.5	2.6	0.7	3.6	2.8	0.6	3.5	2.7	0.6
>fgrep	9.6	8.8	0.7	9.7	9.0	0.6	9.8	8.9	0.7


Here are my results of the same benchmarks:

		trial 1			trial 2			trial 3
	real	user	sys	real	user	sys	real	user	sys
	----	----	---	----	----	---	----	----	---

grep	1.7	1.3	0.3	1.7	1.3	0.3	1.6	1.3	0.3
bm	0.5	0.1	0.3	0.8	0.4	0.7	1.3	0.9	0.3
egrep	1.7	1.2	0.3	1.7	1.2	0.3	1.7	1.3	0.4
fgrep	1.5	1.0	0.4	1.4	1.0	0.3	1.4	1.0	0.4

My results are more what I would expect to see.  I used bm, because I
ported that one, and not bgrep.  As you can see, bm is MUCH faster than
all of the others, and 'fgrep' is marginally faster than grep.

These were run on a Stride Micro 460 computer, which is 68000 based at 10
MHz with no wait states, running the Motorola System-V/68 version of Unix,
which is System V release 1.  The file used for the test was the first 51000
bytes of /etc/termcap.

I'm not sure why your times are so large, especially for egrep, but I'd take
a close look at your C compiler and see what code it's generating.
-- 

	Bruce Robertson
	UUCP: {ucbvax!menlo70,seismo}!unr70!unrvax!stride!bruce

henry@utzoo.UUCP (Henry Spencer) (07/12/85)

> Notice that plain old grep is the fastest of all, and fgrep is the slowest!

Sigh, it is well-known that fgrep is slow when searching for a single string.
You forgot to test egrep, which is well-known to be the fastest for the
trivial case.  Where fgrep makes a difference is when you want to search
for a number of strings simultaneously.  Grep can't do that at all, and 
egrep hits pattern-complexity limits quickly.  Fgrep really screams along
in this case.  I haven't tested bgrep et al yet.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

henry@utzoo.UUCP (Henry Spencer) (07/14/85)

Oops, my apologies, my "you didn't test egrep" comment was incorrect;
I wasn't paying attention properly.

"It is wise to engage brain before putting mouth in motion."
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

fnf@unisoft.UUCP (07/23/85)

In article <5785@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>> Notice that plain old grep is the fastest of all, and fgrep is the slowest!
>
>You forgot to test egrep, which is well-known to be the fastest for the
     ^^^^^^^^^^^^^^^^^^^^	    ^^^^^^^^^^
>trivial case.  Where fgrep makes a difference is when you want to search
>for a number of strings simultaneously.  Grep can't do that at all, and 
>egrep hits pattern-complexity limits quickly.  Fgrep really screams along
>in this case...

Thanks for the tidbit of Unix folklore Henry.  But no, I didn't forget
to test egrep, here is an extract from my original posting:

		real	user	sys
		----	----	---

	grep	1.9	1.1	0.6
	bgrep	2.7	1.9	0.7
	egrep	3.5	2.6	0.7
	fgrep	9.6	8.8	0.7


It would be nice if the documentation wasn't so misleading.  From the
From the system V User's Manual grep(1):

	"Grep patterns are limited regular *expressions* in the style
	of ed(1)..."

	"Egrep patterns are full regular *expressions*..."

	"Fgrep patterns are fixed *strings*; it is fast and compact."

This implies that fgrep should be the fastest for plain old strings
even if there is only one (or none? :-).

-Fred

===========================================================================
Fred Fish    UniSoft Systems Inc, 739 Allston Way, Berkeley, CA  94710  USA
{ucbvax,decvax}!unisoft!fnf	(415) 644 1230 		TWX 11 910 366-2145
===========================================================================

thomas@utah-gr.UUCP (Spencer W. Thomas) (07/31/85)

Not to continue flogging a dead horse, but I think (mixing my
metaphors), I may be able to shed a little more light here.  (Then
again, maybe not.)

In article <507@unisoft.UUCP> fnf@unisoft.UUCP (Fred Fish) writes:
>From the system V User's Manual grep(1):
>...
>	"Egrep patterns are full regular *expressions*..."

You didn't quote the relevant part about egrep (unless this has been
deleted from the SYS V manual).  The '...' expands to
	"it uses a fast deterministic algorithm..."
When read carefully, this indicates that egrep has the potential of
being faster.  The problem with egrep is
	"that [it] sometimes needs exponential space."

Timings of '*grep foo *' on a large directory yield the information

egrep	5.8u 1.6s 0:17 41% 19+40k 217+13io 2pf+0w
grep	7.0u 1.7s 0:34 25% 7+15k 225+22io 1pf+0w
fgrep	10.0u 1.8s 0:47 24% 7+15k 211+7io 3pf+0w

showing `clearly' the superiority of egrep in this case.

I think that in Fred's example, the setup time for egrep may have
overwhelmed the search time (that's the only theory I can think of,
anyway), so that perhaps grep is better for single file searches, but it
has been our experience that egrep is better when searching over a large
number of files.  (To forstall the obvious followups:)  It seems clear,
from previous postings, that 'bm' or 'bgrep' are better for the simple
case of searching for strings.

-- 
=Spencer   ({ihnp4,decvax}!utah-cs!thomas, thomas@utah-cs.ARPA)
	"You don't get to choose how you're going to die.  Or when.
	 You can only decide how you're going to live." Joan Baez