greyham@hades.OZ (Greyham Stoney) (05/24/90)
A question for all ye perl hackers:
Our site receives a subset of a full news feed. I'd like to generate
a report based on the checkgroups message and/or the newsgroups file to say
which groups we actually receive, and which groups we don't.
Groups that we don't receive are listed in a file named "dropped",
which is actually a list of regular expressions (bionet.*, vmsnet.* for
example) which match all the groups we don't want.
I want the report to spit out the checkgroups message in two groups:
newsgroups we get, and newsgroups we don't get. So it basically scans the
checkgroups message with a list of Regular Expressions from another file.
I've put together a perl script which does the job, but BOY is it
slow!; I imagine because it has to compile that RE thousands of times.
Can anyone of severe perl guru wizard status suggest a better way of doing
it? [ doesn't have to use perl, I'm easy ]. I could just use fgrep -f, but the
list of groups dropped is too long for it to handle, and they're RE's anyway.
thanks,
Greyham.
--------------------------------- CUT HERE ------------------------------
#!/usr/local/bin/perl
# provide a report (from checkgroups) as to what newsgroups we still get,
# and what ones we don't get.
# slurp in the 'dropped' file.
open(DROPPED, 'dropped');
@dropped = <DROPPED>;
close(DROPPED);
chop (@dropped); # nuke the \n off the end of each line.
# print it, just for checking.
#print @dropped;
#print $#dropped;
# slurp in the 'checkgroups' file (it's a news article).
open(CHECKGROUPS,'checkgrps.msg');
# skip the header business.
while (<CHECKGROUPS>)
{
    if (/^$/)
    {
        last;
    }
}
@checkgroups = <CHECKGROUPS>;
close (CHECKGROUPS);
# print it, just for checking.
#print @checkgroups;
#print $#checkgroups;
# go down each message in the checkgroups, and find whether we get it or not.
for ($group = 0; $group <= $#checkgroups; $group++)
{
    #print $checkgroups[$group], "\n";
    # see if this group is matched by anything in dropped.
    for ($drop = 0; $drop <= $#dropped; $drop++)
    {
        #print $checkgroups[$group], $dropped[$drop],"\n";
        if ($checkgroups[$group] =~ /^$dropped[$drop]/)
        {
            $nogo[$group] = 1;
            last;
        }
    }
}
print "***** The Following is a list of groups that we DO get:\n";
# spin down saying what we DO get:
for ($group = 0; $group <= $#checkgroups; $group++)
{
    if (!$nogo[$group])
    {
        print $checkgroups[$group];
    }
}
print "\n\n\n***** The Following is a list of groups that we DO NOT get:\n";
# spin down saying what we DONT get:
for ($group = 0; $group <= $#checkgroups; $group++)
{
    if ($nogo[$group])
    {
        print $checkgroups[$group];
    }
}
--------------------------------- CUT HERE -------------------------------
--
/* Greyham Stoney: Australia: (02) 428 6476 *
* greyham@hades.oz - Ausonics Pty Ltd, Lane Cove, Sydney, Oz.
*/

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (05/30/90)
In article <694@hades.OZ> greyham@hades.OZ (Greyham Stoney) writes:
: I've put together a perl script which does the job, but BOY is it
: slow!; I imagine because it has to compile that RE thousands of times.

That's the primary problem. A secondary problem is the use of subscripts
to index into arrays. Whenever you see subscripts in a Perl script, it's
a pretty strong indication that things aren't being done the Perl Way.
Iteration over an array should almost always be done with foreach.

: Can anyone of severe perl guru wizard status suggest a better way of doing
: it? [ doesn't have to use perl, I'm easy ]. I could just use fgrep -f, but the
: list of groups dropped is too long for it to handle, and they're RE's anyway.

RE's aside, a properly written Perl script will beat fgrep at its own game.
The trick is to use Perl's strengths rather than its weaknesses.

In the following, we write a little bit of code that gets eval'ed. This
lets us compile each pattern just once--a major savings. Additionally,
since we'll be matching against multiple patterns, we do a study on each
line, which provides additional savings.

The script below is identical to yours, down to the #CHANGES line.

#!/usr/local/bin/perl
# provide a report (from checkgroups) as to what newsgroups we still get,
# and what ones we don't get.
# slurp in the 'dropped' file.
open(DROPPED, 'dropped');
@dropped = <DROPPED>;
close(DROPPED);
chop (@dropped); # nuke the \n off the end of each line.
# print it, just for checking.
#print @dropped;
#print $#dropped;
# slurp in the 'checkgroups' file (it's a news article).
open(CHECKGROUPS,'checkgrps.msg');
# skip the header business.
while (<CHECKGROUPS>)
{
    if (/^$/)
    {
        last;
    }
}
@checkgroups = <CHECKGROUPS>;
close (CHECKGROUPS);
# print it, just for checking.
#print @checkgroups;
#print $#checkgroups;
# go down each message in the checkgroups, and find whether we get it or not.

#CHANGES BEGIN HERE

$prog = <<'EOF';
foreach $_ (@checkgroups) {
    study;
    $go = 0;    # Assume it is dropped until every pattern has failed.
EOF

foreach $pat (@dropped) {
    $prog .= <<"EOF";
    next if /$pat/;
EOF
}

$prog .= <<'EOF';
    $go = 1;    # No dropped pattern matched, so we get it.
}
continue {
    if ($go) {
        push(@yes, $_);
    }
    else {
        push(@no, $_);
    }
}
EOF

eval $prog; die $@ if $@;

print <<EOF;
***** The Following is a list of groups that we DO get:
@yes

***** The Following is a list of groups that we DO NOT get:
@no
EOF

Larry Wall
lwall@jpl-devvax.jpl.nasa.gov
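
The same "compile each pattern once" idea can be sketched without eval on a Perl that has the qr// operator (5.005 or later, which postdates this thread). This is only an illustration under stated assumptions, not the method from either post: the sub name split_groups and the sample group lines are hypothetical, and the patterns are anchored with ^ as in the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# split_groups is a hypothetical helper name, not from the thread.
# Takes a ref to the list of dropped patterns and a ref to the list of
# checkgroups lines; returns refs to the lines we get and the lines we don't.
sub split_groups {
    my ($dropped, $lines) = @_;

    # Compile each dropped pattern exactly once, anchored at line start.
    my @compiled = map { qr/^$_/ } @$dropped;

    my (@yes, @no);
    LINE: foreach my $line (@$lines) {
        foreach my $re (@compiled) {
            if ($line =~ $re) {
                push @no, $line;    # matched a dropped pattern
                next LINE;
            }
        }
        push @yes, $line;           # no dropped pattern matched
    }
    return (\@yes, \@no);
}

# Made-up sample data standing in for 'dropped' and the checkgroups body.
my @dropped     = ('bionet.*', 'vmsnet.*');
my @checkgroups = (
    "bionet.general\tGeneral biology discussion.\n",
    "comp.lang.perl\tDiscussion of the Perl language.\n",
    "vmsnet.misc\tMiscellaneous VMS topics.\n",
);

my ($yes, $no) = split_groups(\@dropped, \@checkgroups);
print "***** Groups we DO get:\n", @$yes;
print "\n***** Groups we DO NOT get:\n", @$no;
```

Both loops use foreach rather than subscripts, and the inner loop stops at the first dropped pattern that matches, so most lines are tested against only part of the list.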