gnb@bby.oz.au (Gregory N. Bond) (04/24/91)
This is a little script I hacked up to rebuild our news system after
some nasty failures. It is basically a perl transcription if the
ngmatch.c file from Cnews.
Strange thing is, it is as slow as hell, about 2 match/sec on 20Mips
sparc. Doesn't worry me too much (as its a bit of a oncer), but
surprising.
Another thing that I found difficult was the equivalent of "fgrep -v
-f <file>", where file is large (say, 1,000 lines). I tried a perl
script that built a program then eval'd it, but with a 30k pattern
file, the perl program grew to >5Mb and ran like a dog. It was much
much faster to split the pat file and use a chain of fgreps. Again,
surprising. Any suggestions?
Greg.
#! /usr/local/bin/perl
# take a history file on stdin, and match them to the following
# newsgroups to see if we would collect it (ignore distributions here)
# Each line that matches is passed to the stdout
$sysfile_pattern = "
gnu,comp,news,sci,rec,misc,soc,talk,alt,to.melba,aus,melb,world,general,\
!talk.all,\
!soc.all,soc.culture.australian,soc.culture.jewish,\
soc.culture.arabic,\
soc.culture.esperanto,soc.religion.islam,\
soc.history,soc.men,soc.motss,\
soc.women,soc.feminism,\
!rec.all,rec.humor,rec.games.hack,rec.sport.cricket,rec.pets,\
rec.games.trivia,rec.arts.movies.reviews,\
!alt.all,alt.humor.oracle,alt.sources,alt.security,alt.sys.sun,\
alt.folklore,alt.religion,\
!sci.all,sci.space,sci.military,sci.med.aids,sci.econ,\
sci.environment,\
sci.energy,sci.skeptic,\
comp.all,!comp.binaries,!comp.all.mac,!comp.all.amiga,\
!comp.all.atari,\
!comp.mail.maps,!comp.os,!comp.sys,comp.sys.sun,comp.sys.ibm,\
!talk.all,all.misc:F:
";
#$sysfile_pattern = "comp,!comp.binaries.all";
$sysfile_pattern =~ s/\\.//g;
$sysfile_pattern =~ s/\s//g;
$sysfile_pattern =~ s/:.*//;
$sysfile_pattern =~ tr/A-Z/a-z/;
@patterns = split(/,/, $sysfile_pattern);
# ngmatch(grouplist)
# returns true is one of the groups in grouplist matched one of the
# patterns in patterns.
sub ngmatch {
local ($grouplist) = @_;
return scalar(grep(&onegroupmatch($_), split(/,/, $grouplist)));
}
#
# match the single group name in grp to the patterns. Keep track of
# depth! (see cnews file libcnews/ngmatch.c for comments)
#
sub onegroupmatch {
local ($grp) = @_;
local ($pat, $neg, $depth, $hitdeepest, $faildeepest);
foreach $_ (@patterns) {
$neg = (($pat = $_) =~ s/^!//);
if (($depth = &onepatmatch($grp, $pat)) > 0) {
$faildeepest = $depth if $depth > $faildeepest && $neg;
$hitdeepest = $depth if $depth > $hitdeepest && !$neg;
}
}
$hitdeepest > $faildeepest;
}
#
# match the single newsgroup against the single pattern
#
sub onepatmatch {
local($grp, $pat) = @_;
local (@gs, @ps);
local($p, $g, $incr, $depth);
@gs = split(/\./, $grp);
@ps = split(/\./, $pat);
$depth = 0;
for ($p = shift(@ps), $g = shift(@gs);
$p && $g; $p = shift(@ps), $g = shift(@gs)) {
$incr = 20;
return 0 if ($p ne $g && $p ne "all");
if ($p eq $g) {
$depth += $incr;
} else { # Is a wildcard match
$depth += --$incr;
}
}
($p && !$g) ? 0 : $depth;
}
if (0) {
$v = "comp.graphics";
$m = &ngmatch($v);
print "match $v: $m\n";
$v = "comp.graphics";
$m = &ngmatch($v);
print "match $v: $m\n";
}
#
#
# Now the main loop
#
#
if (1) {
open (GOOD, ">good.out") || die "can't open good.out: $!\n";
open (BAD, ">bad.out") || die "can't open bad.out: $!\n";
# open (EXPIRED, ">expired.out") || die "can't open expired.out: $!\n";
while (<>) {
@L = split(/\t/);
next unless ( ($ng = @L[2]) =~ s=/\d+==g );
$ng =~ s/\s+/,/g;
$ng =~ s/,+$//;
if (&ngmatch($ng)) {
print GOOD;
} else {
print BAD;
}
## last if $num++ > 100;
}
}
--
Gregory Bond, Burdett Buckeridge & Young Ltd, Melbourne, Australia
Internet: gnb@melba.bby.oz.au non-MX: gnb%melba.bby.oz@uunet.uu.net
Uucp: {uunet,pyramid,ubc-cs,ukc,mcvax,prlb2,nttlab...}!munnari!melba.bby.oz!gnbflee@cs.psu.edu (Felix Lee) (04/26/91)
>Strange thing is, it is as slow as hell, about 2 match/sec on 20Mips sparc.
This is because Perl is a terribly inefficient at string processing
(half smiley).
Below, the version of ngmatch in Perl that I use. It's at least an
order of magnitude faster. The code is quite ugly.
--
Felix Lee flee@cs.psu.edu
#! /bin/sh
# This is a shell archive. Remove anything before this line, then unpack
# it by saving it into a file and typing "sh file". To overwrite existing
# files, type "sh file -c". You can also feed this as standard input via
# unshar, or by typing "sh <file", e.g.. If this archive is complete, you
# will see the following message at the end:
# "End of shell archive."
# Contents: z.pl ngmatch.pl
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
if test -f 'z.pl' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'z.pl'\"
else
echo shar: Extracting \"'z.pl'\" \(874 characters\)
sed "s/^X//" >'z.pl' <<'END_OF_FILE'
Xshift, $ngmatch'debug = 1 if $ARGV[0] eq '-d';
Xdo 'ngmatch.pl';
X
X$pattern = "gnu,comp,news,sci,rec,misc,soc,talk,alt,to.melba,aus,melb,world,general,!talk.all,!soc.all,soc.culture.australian,soc.culture.jewish,soc.culture.arabic,soc.culture.esperanto,soc.religion.islam,soc.history,soc.men,soc.motss,soc.women,soc.feminism,!rec.all,rec.humor,rec.games.hack,rec.sport.cricket,rec.pets,rec.games.trivia,rec.arts.movies.reviews,!alt.all,alt.humor.oracle,alt.sources,alt.security,alt.sys.sun,alt.folklore,alt.religion,!sci.all,sci.space,sci.military,sci.med.aids,sci.econ,sci.env
ironment,sci.energy,sci.skeptic,comp.all,!comp.binaries,!comp.all.mac,!comp.all.amiga,!comp.all.atari,!comp.mail.maps,!comp.os,!comp.sys,comp.sys.sun,comp.sys.ibm,!talk.all,all.misc";
X
X&ngmatch'compile('match', $pattern);
X
X$| = 1;
Xwhile (<>) {
X if (&match($_)) {
X print '+';
X } else {
X print '-';
X }
X}
END_OF_FILE
if test 874 -ne `wc -c <'z.pl'`; then
echo shar: \"'z.pl'\" unpacked with wrong size!
fi
# end of 'z.pl'
fi
if test -f 'ngmatch.pl' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'ngmatch.pl'\"
else
echo shar: Extracting \"'ngmatch.pl'\" \(1453 characters\)
sed "s/^X//" >'ngmatch.pl' <<'END_OF_FILE'
Xpackage ngmatch;
X
X# &ngmatch_compile($name, $pattern)
X#
X# Create a new subroutine called $name:
X# &$name($group)
X# that returns true if $group matches $pattern.
X# $pattern is something like "comp,!comp.sys,comp.sys.mac".
X# "all" can be used to wildcard any single component, like "comp.all.mac".
X# Any pattern, like "comp", will also match "comp.all", "comp.all.all", etc.
X
Xsub ngmatch'compile {
X local($name, $pattern) = @_;
X local(@n, $_, $match, $flag, $package);
X
X @n = split(/,/, $pattern);
X
X grep((
X s/\./,/g, # change periods to commas
X s/([^!,\w])/\\$1/g, # quote special characters
X s/(,all)$/(,[^,]*|)/, # trailing ".all"
X s/$/(,.*|)$/, # implicit tail
X s/^(!?)all,/\1([^,]*),/,# leading "all."
X s/,all,/,([^,]*),/g, # embedded ".all."
X s/,/\\./g, # change commas to quoted periods.
X ), @n);
X
X # Now we build a series of tests.
X $match = '';
X for $_ (@n) {
X $flag = !s/^!//;
X s/^/^/; # anchor the pattern.
X $match .= '@x = /'.$_.'/;';
X $match .= "\nprint '".$_." '".',@x+0," @x\n";' if $debug;
X $match .= 'if (($x = @x) && ($x *= 19) < $best) {
X $x += ($x[$#x] =~ tr/././) * 20;
X $result = '.$flag.', $best = $x if $x < $best;
X };';
X $match .= 'print " $x $best $result\n";' if $debug;
X }
X
X local($package) = (caller)[0];
X
X $_ = '
X package '.$package.';
X sub '.$name.' {
X local($_) = @_;
X local($result, $best, $x, @x) = (0, 1e20);
X '.$match.'
X $result;
X }';
X eval "$_; 1" || die "ngmatch'compile: $@";
X}
END_OF_FILE
if test 1453 -ne `wc -c <'ngmatch.pl'`; then
echo shar: \"'ngmatch.pl'\" unpacked with wrong size!
fi
# end of 'ngmatch.pl'
fi
echo shar: End of shell archive.
exit 0