[comp.lang.perl] fast grep?

vixie@volition.pa.dec.com (Paul Vixie) (10/16/90)

I've a need to grep for a simple pattern in a large file from within a perl
script.  Simple means no metacharacters, large means many megabytes.  So
far it looks like "netgrep" -- the B M thing that later turned into GNU grep,
is the clear winner.

	net/bm/gnu grep...
		0.3u 1.5s 0:09 19% 20+20k 284+0io 0pf+0w
		0.3u 1.3s 0:11 15% 20+20k 285+1io 0pf+0w
		0.3u 0.8s 0:01 92% 20+20k 0+0io 0pf+0w

	perl  "while (<>) { print if (/$pat/o); }"...
		5.3u 1.3s 0:12 53% 151+63k 208+0io 3pf+0w
		5.3u 1.3s 0:09 68% 152+63k 63+0io 3pf+0w
		5.3u 2.1s 0:16 45% 152+63k 280+0io 3pf+0w

	perl  "print grep(/$pat/o, <>);"...
		7.6u 5.4s 0:18 69% 136+3894k 284+1io 3pf+0w
		7.8u 5.1s 0:19 67% 149+3878k 282+1io 3pf+0w
		7.7u 4.4s 0:17 67% 137+3920k 282+1io 3pf+0w

Have I found something that perl can't do as well as C?  With this
kind of variance, it's cheaper to fork/exec a bmgrep.
--
Paul Vixie
DEC Western Research Lab	<vixie@wrl.dec.com>
Palo Alto, California		...!decwrl!vixie

tchrist@convex.COM (Tom Christiansen) (10/16/90)

In article <1990Oct15.211219.8543@wrl.dec.com> vixie@volition.pa.dec.com (Paul Vixie) writes:
>I've a need to grep for a simple pattern in a large file from within a perl
>script.  Simple means no metacharacters, large means many megabytes.  So
>far it looks like "netgrep" -- the B M thing that later turned into GNU grep,
>is the clear winner.

>Have I found something that perl can't do as well as C?  With this
>kind of variance, it's cheaper to fork/exec a bmgrep.

Well, the perl code you used was slightly sub-optimal.  If you slap
an eval around the the whole thing, then $pat becomes a constant string, 
which is even better than the /o.  Somewhere I've an article in which 
Larry explains that this makes it use BM-style search routines.  


I've taken a 4-megabyte file, which is just 40 concatenated termcap files,
and I'm looking for the string /termcap/.  Here are some timings.  The two
perl codes are:

    $pat = shift; while  (<>) { print if /$pat/o; };

and 

    $pat = shift; eval "while  (<>) { print if /$pat/; }";

Now, in descending real-time order:  

(It's too bad that novices think fgrep stands for fast.)

fgrep
       14.649289 real       12.367839 user        1.743313 sys 
       14.582311 real       12.360291 user        1.745801 sys 
       14.707298 real       12.364471 user        1.760904 sys 
egrep
       12.672381 real        7.283316 user        3.333458 sys 
       12.693866 real        7.293646 user        3.369639 sys 
       11.212437 real        7.246659 user        3.312308 sys 
perl with pat/o
       11.755116 real       10.349145 user        0.414848 sys 
       11.619828 real       10.345177 user        0.428762 sys 
       11.650266 real       10.342087 user        0.420670 sys 
grep
       10.311009 real        9.389851 user        0.493640 sys 
       10.238396 real        9.389047 user        0.439769 sys 
       10.372305 real        9.392753 user        0.447918 sys 
perl with eval
        7.068890 real        5.988363 user        0.377409 sys 
        8.725602 real        6.019300 user        0.439920 sys 
        7.004554 real        5.983231 user        0.401990 sys 
gungrep
        2.050560 real        1.442311 user        0.494117 sys 
        2.164280 real        1.445398 user        0.494487 sys 
        2.082105 real        1.443624 user        0.488781 sys 

I still can't get it sufficently close to gnugrep to justify not running a
system on gnugrep, but even so, there's nothing truly wrong with that.  I
don't think Larry would have made popens and `foo` and system() so easy to
get at if he didn't want you do use them.  Note that perl DOES beat
everything except for gnugrep.  One of these days, I've got to see whether
I can't vectorize any of the regexp stuff in perl...

--tom

merlyn@iwarp.intel.com (Randal Schwartz) (10/17/90)

In article <107211@convex.convex.com>, tchrist@convex (Tom Christiansen) writes:
| I still can't get it sufficently close to gnugrep to justify not running a
| system on gnugrep, but even so, there's nothing truly wrong with that.  I
| don't think Larry would have made popens and `foo` and system() so easy to
| get at if he didn't want you do use them.  Note that perl DOES beat
| everything except for gnugrep.  One of these days, I've got to see whether
| I can't vectorize any of the regexp stuff in perl...

Tom, go back and rerun your test with some slightly-more-optimal (I hope)
Perl code:

##################################################

#!/usr/bin/perl
# @ARGV = ('/usr/dict/words'); # for testing

undef $/; # enable maximal slurp mode
$* = 1; # ^$ work in searches with newlines
while (<>) {
	s/string/&cheat($`,$&,$')/eg;
}

sub cheat {
	local($left,$center,$right) = @_;
	print	substr($left,rindex($left,"\n")+1,99999),
		$center,
		substr($right,0,index($right,"\n")+1);
	$center;
}
		
##################################################

I think you were getting beat up on how often you went around the
while loop.  This code more closely matches what bmgrep is doing
inside.

(Yes, I actually tested this code for a change. :-)

print "Just another Perl hacker," # with theoretically enough time to write a good JAPH now, but still not doing it.... :-)
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Welcome to Portland, Oregon, home of the California Raisins!"=/

tchrist@convex.COM (Tom Christiansen) (10/18/90)

In article <1990Oct17.044839.21709@iwarp.intel.com> merlyn@iwarp.intel.com 
(Randal Schwartz) writes, replying to me:

|Tom, go back and rerun your test with some slightly-more-optimal (I hope)
|Perl code:

No Randal, it's not so, unless I broke something whne I clipped your code.
It's 10 times slower, and that's on the 100k file.  On the 4Meg one
it gobbled up 30 meg (most resident) and spun for 40 minutes on my 
quarter-gig convex 220 before I killed it.

Larry, how come we only get BM speed on constant strings?  Can't it know?
Having to know to do the eval is too bad.

--tom

merlyn@iwarp.intel.com (Randal Schwartz) (10/19/90)

In article <107296@convex.convex.com>, tchrist@convex (Tom Christiansen) writes:
| No Randal, it's not so, unless I broke something whne I clipped your code.
| It's 10 times slower, and that's on the 100k file.  On the 4Meg one
| it gobbled up 30 meg (most resident) and spun for 40 minutes on my 
| quarter-gig convex 220 before I killed it.

Well, at least it ran.  That's better than *some* of the improvements
I make. :-)
-- 
/=Randal L. Schwartz, Stonehenge Consulting Services (503)777-0095 ==========\
| on contract to Intel's iWarp project, Beaverton, Oregon, USA, Sol III      |
| merlyn@iwarp.intel.com ...!any-MX-mailer-like-uunet!iwarp.intel.com!merlyn |
\=Cute Quote: "Intel put the 'backward' in 'backward compatible'..."=========/