gnb@bby.oz.au (Gregory N. Bond) (04/24/91)
This is a little script I hacked up to rebuild our news system after some nasty failures. It is basically a perl transcription if the ngmatch.c file from Cnews. Strange thing is, it is as slow as hell, about 2 match/sec on 20Mips sparc. Doesn't worry me too much (as its a bit of a oncer), but surprising. Another thing that I found difficult was the equivalent of "fgrep -v -f <file>", where file is large (say, 1,000 lines). I tried a perl script that built a program then eval'd it, but with a 30k pattern file, the perl program grew to >5Mb and ran like a dog. It was much much faster to split the pat file and use a chain of fgreps. Again, surprising. Any suggestions? Greg. #! /usr/local/bin/perl # take a history file on stdin, and match them to the following # newsgroups to see if we would collect it (ignore distributions here) # Each line that matches is passed to the stdout $sysfile_pattern = " gnu,comp,news,sci,rec,misc,soc,talk,alt,to.melba,aus,melb,world,general,\ !talk.all,\ !soc.all,soc.culture.australian,soc.culture.jewish,\ soc.culture.arabic,\ soc.culture.esperanto,soc.religion.islam,\ soc.history,soc.men,soc.motss,\ soc.women,soc.feminism,\ !rec.all,rec.humor,rec.games.hack,rec.sport.cricket,rec.pets,\ rec.games.trivia,rec.arts.movies.reviews,\ !alt.all,alt.humor.oracle,alt.sources,alt.security,alt.sys.sun,\ alt.folklore,alt.religion,\ !sci.all,sci.space,sci.military,sci.med.aids,sci.econ,\ sci.environment,\ sci.energy,sci.skeptic,\ comp.all,!comp.binaries,!comp.all.mac,!comp.all.amiga,\ !comp.all.atari,\ !comp.mail.maps,!comp.os,!comp.sys,comp.sys.sun,comp.sys.ibm,\ !talk.all,all.misc:F: "; #$sysfile_pattern = "comp,!comp.binaries.all"; $sysfile_pattern =~ s/\\.//g; $sysfile_pattern =~ s/\s//g; $sysfile_pattern =~ s/:.*//; $sysfile_pattern =~ tr/A-Z/a-z/; @patterns = split(/,/, $sysfile_pattern); # ngmatch(grouplist) # returns true is one of the groups in grouplist matched one of the # patterns in patterns. sub ngmatch { local ($grouplist) = @_; return scalar(grep(&onegroupmatch($_), split(/,/, $grouplist))); } # # match the single group name in grp to the patterns. Keep track of # depth! (see cnews file libcnews/ngmatch.c for comments) # sub onegroupmatch { local ($grp) = @_; local ($pat, $neg, $depth, $hitdeepest, $faildeepest); foreach $_ (@patterns) { $neg = (($pat = $_) =~ s/^!//); if (($depth = &onepatmatch($grp, $pat)) > 0) { $faildeepest = $depth if $depth > $faildeepest && $neg; $hitdeepest = $depth if $depth > $hitdeepest && !$neg; } } $hitdeepest > $faildeepest; } # # match the single newsgroup against the single pattern # sub onepatmatch { local($grp, $pat) = @_; local (@gs, @ps); local($p, $g, $incr, $depth); @gs = split(/\./, $grp); @ps = split(/\./, $pat); $depth = 0; for ($p = shift(@ps), $g = shift(@gs); $p && $g; $p = shift(@ps), $g = shift(@gs)) { $incr = 20; return 0 if ($p ne $g && $p ne "all"); if ($p eq $g) { $depth += $incr; } else { # Is a wildcard match $depth += --$incr; } } ($p && !$g) ? 0 : $depth; } if (0) { $v = "comp.graphics"; $m = &ngmatch($v); print "match $v: $m\n"; $v = "comp.graphics"; $m = &ngmatch($v); print "match $v: $m\n"; } # # # Now the main loop # # if (1) { open (GOOD, ">good.out") || die "can't open good.out: $!\n"; open (BAD, ">bad.out") || die "can't open bad.out: $!\n"; # open (EXPIRED, ">expired.out") || die "can't open expired.out: $!\n"; while (<>) { @L = split(/\t/); next unless ( ($ng = @L[2]) =~ s=/\d+==g ); $ng =~ s/\s+/,/g; $ng =~ s/,+$//; if (&ngmatch($ng)) { print GOOD; } else { print BAD; } ## last if $num++ > 100; } } -- Gregory Bond, Burdett Buckeridge & Young Ltd, Melbourne, Australia Internet: gnb@melba.bby.oz.au non-MX: gnb%melba.bby.oz@uunet.uu.net Uucp: {uunet,pyramid,ubc-cs,ukc,mcvax,prlb2,nttlab...}!munnari!melba.bby.oz!gnb
flee@cs.psu.edu (Felix Lee) (04/26/91)
>Strange thing is, it is as slow as hell, about 2 match/sec on 20Mips sparc.
This is because Perl is a terribly inefficient at string processing
(half smiley).
Below, the version of ngmatch in Perl that I use. It's at least an
order of magnitude faster. The code is quite ugly.
--
Felix Lee flee@cs.psu.edu
#! /bin/sh
# This is a shell archive. Remove anything before this line, then unpack
# it by saving it into a file and typing "sh file". To overwrite existing
# files, type "sh file -c". You can also feed this as standard input via
# unshar, or by typing "sh <file", e.g.. If this archive is complete, you
# will see the following message at the end:
# "End of shell archive."
# Contents: z.pl ngmatch.pl
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
if test -f 'z.pl' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'z.pl'\"
else
echo shar: Extracting \"'z.pl'\" \(874 characters\)
sed "s/^X//" >'z.pl' <<'END_OF_FILE'
Xshift, $ngmatch'debug = 1 if $ARGV[0] eq '-d';
Xdo 'ngmatch.pl';
X
X$pattern = "gnu,comp,news,sci,rec,misc,soc,talk,alt,to.melba,aus,melb,world,general,!talk.all,!soc.all,soc.culture.australian,soc.culture.jewish,soc.culture.arabic,soc.culture.esperanto,soc.religion.islam,soc.history,soc.men,soc.motss,soc.women,soc.feminism,!rec.all,rec.humor,rec.games.hack,rec.sport.cricket,rec.pets,rec.games.trivia,rec.arts.movies.reviews,!alt.all,alt.humor.oracle,alt.sources,alt.security,alt.sys.sun,alt.folklore,alt.religion,!sci.all,sci.space,sci.military,sci.med.aids,sci.econ,sci.env
ironment,sci.energy,sci.skeptic,comp.all,!comp.binaries,!comp.all.mac,!comp.all.amiga,!comp.all.atari,!comp.mail.maps,!comp.os,!comp.sys,comp.sys.sun,comp.sys.ibm,!talk.all,all.misc";
X
X&ngmatch'compile('match', $pattern);
X
X$| = 1;
Xwhile (<>) {
X if (&match($_)) {
X print '+';
X } else {
X print '-';
X }
X}
END_OF_FILE
if test 874 -ne `wc -c <'z.pl'`; then
echo shar: \"'z.pl'\" unpacked with wrong size!
fi
# end of 'z.pl'
fi
if test -f 'ngmatch.pl' -a "${1}" != "-c" ; then
echo shar: Will not clobber existing file \"'ngmatch.pl'\"
else
echo shar: Extracting \"'ngmatch.pl'\" \(1453 characters\)
sed "s/^X//" >'ngmatch.pl' <<'END_OF_FILE'
Xpackage ngmatch;
X
X# &ngmatch_compile($name, $pattern)
X#
X# Create a new subroutine called $name:
X# &$name($group)
X# that returns true if $group matches $pattern.
X# $pattern is something like "comp,!comp.sys,comp.sys.mac".
X# "all" can be used to wildcard any single component, like "comp.all.mac".
X# Any pattern, like "comp", will also match "comp.all", "comp.all.all", etc.
X
Xsub ngmatch'compile {
X local($name, $pattern) = @_;
X local(@n, $_, $match, $flag, $package);
X
X @n = split(/,/, $pattern);
X
X grep((
X s/\./,/g, # change periods to commas
X s/([^!,\w])/\\$1/g, # quote special characters
X s/(,all)$/(,[^,]*|)/, # trailing ".all"
X s/$/(,.*|)$/, # implicit tail
X s/^(!?)all,/\1([^,]*),/,# leading "all."
X s/,all,/,([^,]*),/g, # embedded ".all."
X s/,/\\./g, # change commas to quoted periods.
X ), @n);
X
X # Now we build a series of tests.
X $match = '';
X for $_ (@n) {
X $flag = !s/^!//;
X s/^/^/; # anchor the pattern.
X $match .= '@x = /'.$_.'/;';
X $match .= "\nprint '".$_." '".',@x+0," @x\n";' if $debug;
X $match .= 'if (($x = @x) && ($x *= 19) < $best) {
X $x += ($x[$#x] =~ tr/././) * 20;
X $result = '.$flag.', $best = $x if $x < $best;
X };';
X $match .= 'print " $x $best $result\n";' if $debug;
X }
X
X local($package) = (caller)[0];
X
X $_ = '
X package '.$package.';
X sub '.$name.' {
X local($_) = @_;
X local($result, $best, $x, @x) = (0, 1e20);
X '.$match.'
X $result;
X }';
X eval "$_; 1" || die "ngmatch'compile: $@";
X}
END_OF_FILE
if test 1453 -ne `wc -c <'ngmatch.pl'`; then
echo shar: \"'ngmatch.pl'\" unpacked with wrong size!
fi
# end of 'ngmatch.pl'
fi
echo shar: End of shell archive.
exit 0