[comp.lang.perl] How can I scan one file with a list of RE's from another efficently?.

greyham@hades.OZ (Greyham Stoney) (05/24/90)

A question for all ye perl hackers:

	Our site recieves a subset of a full news feed. I'd like to generate
a report based on the checkgroups message and/or the newsgroups file to say
which groups we actually recieve, and which groups we don't.

	Groups that we don't recieve are listed in a file named "dropped",
which is actually a list of regular expressions (bionet.*, vmsnet.* for
example) which match all the groups we don't want.

	I want the report to spit out the checkgroups message in two groups;
newgroups we get, and newsgroups we don't get. So it basically scans the
checkgroups message with a list of Regular Expressions from another file.

	I've put together a perl script which does the job, but BOY is it
slow!; I imagine becuase it has to compile that RE thousands of times.
Can anyone of severe perl guru wizard status suggest a better way of doing
it? [ doesn't have to use perl, I'm easy ]. I could just use fgrep -f, but the
list of groups dropped is too long for it to handle, and they're RE's anyway.

				thanks,
					Greyham.
--------------------------------- CUT HERE ------------------------------
#!/usr/local/bin/perl
# provide a report (from checkgroups) as to what newsgroups we still get,
# and what ones we don't get.

# slurp in the 'dropped' file.
open(DROPPED, 'dropped');
@dropped = <DROPPED>;
close(DROPPED);
chop (@dropped);		# nuke the \n off the end of each line.

# print it, just for checking.
#print @dropped;
#print $#dropped;


# slurp in the 'checkgroups' file (it's a news article).
open(CHECKGROUPS,'checkgrps.msg');

# skip the header business.
while (<CHECKGROUPS>)
{
	if (/^$/)
	{
		last;
	}
}

@checkgroups = <CHECKGROUPS>;
close (CHECKGROUPS);

# print it, just for checking.
#print @checkgroups;
#print $#checkgroups;

# go down each message in the checkgroups, and find whether we get it or not.
for ($group = 0; $group <= $#checkgroups; $group++)
{
	#print $checkgroups[$group], "\n";

	# see if this group is matched by anything in dropped.
	for ($drop = 0; $drop <= $#dropped; $drop++)
	{
		#print $checkgroups[$group], $dropped[$drop],"\n";

		if ($checkgroups[$group] =~ /^$dropped[$drop]/)
		{
			$nogo[$group] = 1;
			last;
		}
	}
}

print "***** The Following is a list of groups that we DO get:\n";

# spin down saying what we DO get:
for ($group = 0; $group <= $#checkgroups; $group++)
{
	if (!$nogo[$group])
	{
		print $checkgroups[$group];
	}
}

print "\n\n\n***** The Following is a list of groups that we DO NOT get:\n";

# spin down saying what we DONT get:
for ($group = 0; $group <= $#checkgroups; $group++)
{
	if ($nogo[$group])
	{
		print $checkgroups[$group];
	}
}
--------------------------------- CUT HERE -------------------------------

-- 
/*  Greyham Stoney:                            Australia: (02) 428 6476  *
 *    greyham@hades.oz  - Ausonics Pty Ltd, Lane Cove, Sydney, Oz.
 */

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (05/30/90)

In article <694@hades.OZ> greyham@hades.OZ (Greyham Stoney) writes:
: 	I've put together a perl script which does the job, but BOY is it
: slow!; I imagine becuase it has to compile that RE thousands of times.

That's the primary problem.  A secondary problem is the use of subscripts
to index into arrays.  Whenever you see subscripts in a Perl script, it's
a pretty strong indication that things aren't being done the Perl Way.
Iteration over an array should almost always be done with foreach.

: Can anyone of severe perl guru wizard status suggest a better way of doing
: it? [ doesn't have to use perl, I'm easy ]. I could just use fgrep -f, but the
: list of groups dropped is too long for it to handle, and they're RE's anyway.

RE's aside, a properly written Perl script will beat fgrep at its own game.
The trick is to use Perl's strengths rather than its weaknesses.  In the
following, we write a little bit of code that gets eval'ed.  This lets us
compile each pattern just once--a major savings.  Additionally, since we'll
be matching against multiple patterns, we do a study on each line, which
provides additional savings.

The script below is identical to yours, down to the #CHANGES line.

#!/usr/local/bin/perl
# provide a report (from checkgroups) as to what newsgroups we still get,
# and what ones we don't get.

# slurp in the 'dropped' file.
open(DROPPED, 'dropped');
@dropped = <DROPPED>;
close(DROPPED);
chop (@dropped);		# nuke the \n off the end of each line.

# print it, just for checking.
#print @dropped;
#print $#dropped;


# slurp in the 'checkgroups' file (it's a news article).
open(CHECKGROUPS,'checkgrps.msg');

# skip the header business.
while (<CHECKGROUPS>)
{
	if (/^$/)
	{
		last;
	}
}

@checkgroups = <CHECKGROUPS>;
close (CHECKGROUPS);

# print it, just for checking.
#print @checkgroups;
#print $#checkgroups;

# go down each message in the checkgroups, and find whether we get it or not.
#CHANGES BEGIN HERE

$prog = <<'EOF';
			    foreach $_ (@checkgroups) {
				study;
				$go = 1;	# Assume we get it.
EOF;

foreach $pat (@dropped) {
    $prog .= <<"EOF";
				next if /$pat/;
EOF
}

$prog .= <<'EOF';
				$go = 0;
			    }
			    continue {
				if ($go) {
				    push(@yes, $_);
				}
				else {
				    push(@no, $_);
				}
			    }
EOF

eval $prog; die $@ if $@;

print <<EOF;
***** The Following is a list of groups that we DO get:
@yes


***** The Following is a list of groups that we DO NOT get:
@no
EOF

Larry Wall
lwall@jpl-devvax.jpl.nasa.gov