greyham@hades.OZ (Greyham Stoney) (05/24/90)
A question for all ye perl hackers: Our site recieves a subset of a full news feed. I'd like to generate a report based on the checkgroups message and/or the newsgroups file to say which groups we actually recieve, and which groups we don't. Groups that we don't recieve are listed in a file named "dropped", which is actually a list of regular expressions (bionet.*, vmsnet.* for example) which match all the groups we don't want. I want the report to spit out the checkgroups message in two groups; newgroups we get, and newsgroups we don't get. So it basically scans the checkgroups message with a list of Regular Expressions from another file. I've put together a perl script which does the job, but BOY is it slow!; I imagine becuase it has to compile that RE thousands of times. Can anyone of severe perl guru wizard status suggest a better way of doing it? [ doesn't have to use perl, I'm easy ]. I could just use fgrep -f, but the list of groups dropped is too long for it to handle, and they're RE's anyway. thanks, Greyham. --------------------------------- CUT HERE ------------------------------ #!/usr/local/bin/perl # provide a report (from checkgroups) as to what newsgroups we still get, # and what ones we don't get. # slurp in the 'dropped' file. open(DROPPED, 'dropped'); @dropped = <DROPPED>; close(DROPPED); chop (@dropped); # nuke the \n off the end of each line. # print it, just for checking. #print @dropped; #print $#dropped; # slurp in the 'checkgroups' file (it's a news article). open(CHECKGROUPS,'checkgrps.msg'); # skip the header business. while (<CHECKGROUPS>) { if (/^$/) { last; } } @checkgroups = <CHECKGROUPS>; close (CHECKGROUPS); # print it, just for checking. #print @checkgroups; #print $#checkgroups; # go down each message in the checkgroups, and find whether we get it or not. for ($group = 0; $group <= $#checkgroups; $group++) { #print $checkgroups[$group], "\n"; # see if this group is matched by anything in dropped. for ($drop = 0; $drop <= $#dropped; $drop++) { #print $checkgroups[$group], $dropped[$drop],"\n"; if ($checkgroups[$group] =~ /^$dropped[$drop]/) { $nogo[$group] = 1; last; } } } print "***** The Following is a list of groups that we DO get:\n"; # spin down saying what we DO get: for ($group = 0; $group <= $#checkgroups; $group++) { if (!$nogo[$group]) { print $checkgroups[$group]; } } print "\n\n\n***** The Following is a list of groups that we DO NOT get:\n"; # spin down saying what we DONT get: for ($group = 0; $group <= $#checkgroups; $group++) { if ($nogo[$group]) { print $checkgroups[$group]; } } --------------------------------- CUT HERE ------------------------------- -- /* Greyham Stoney: Australia: (02) 428 6476 * * greyham@hades.oz - Ausonics Pty Ltd, Lane Cove, Sydney, Oz. */
lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (05/30/90)
In article <694@hades.OZ> greyham@hades.OZ (Greyham Stoney) writes: : I've put together a perl script which does the job, but BOY is it : slow!; I imagine becuase it has to compile that RE thousands of times. That's the primary problem. A secondary problem is the use of subscripts to index into arrays. Whenever you see subscripts in a Perl script, it's a pretty strong indication that things aren't being done the Perl Way. Iteration over an array should almost always be done with foreach. : Can anyone of severe perl guru wizard status suggest a better way of doing : it? [ doesn't have to use perl, I'm easy ]. I could just use fgrep -f, but the : list of groups dropped is too long for it to handle, and they're RE's anyway. RE's aside, a properly written Perl script will beat fgrep at its own game. The trick is to use Perl's strengths rather than its weaknesses. In the following, we write a little bit of code that gets eval'ed. This lets us compile each pattern just once--a major savings. Additionally, since we'll be matching against multiple patterns, we do a study on each line, which provides additional savings. The script below is identical to yours, down to the #CHANGES line. #!/usr/local/bin/perl # provide a report (from checkgroups) as to what newsgroups we still get, # and what ones we don't get. # slurp in the 'dropped' file. open(DROPPED, 'dropped'); @dropped = <DROPPED>; close(DROPPED); chop (@dropped); # nuke the \n off the end of each line. # print it, just for checking. #print @dropped; #print $#dropped; # slurp in the 'checkgroups' file (it's a news article). open(CHECKGROUPS,'checkgrps.msg'); # skip the header business. while (<CHECKGROUPS>) { if (/^$/) { last; } } @checkgroups = <CHECKGROUPS>; close (CHECKGROUPS); # print it, just for checking. #print @checkgroups; #print $#checkgroups; # go down each message in the checkgroups, and find whether we get it or not. #CHANGES BEGIN HERE $prog = <<'EOF'; foreach $_ (@checkgroups) { study; $go = 1; # Assume we get it. EOF; foreach $pat (@dropped) { $prog .= <<"EOF"; next if /$pat/; EOF } $prog .= <<'EOF'; $go = 0; } continue { if ($go) { push(@yes, $_); } else { push(@no, $_); } } EOF eval $prog; die $@ if $@; print <<EOF; ***** The Following is a list of groups that we DO get: @yes ***** The Following is a list of groups that we DO NOT get: @no EOF Larry Wall lwall@jpl-devvax.jpl.nasa.gov