[comp.lang.perl] ngmatch in perl

gnb@bby.oz.au (Gregory N. Bond) (04/24/91)

This is a little script I hacked up to rebuild our news system after
some nasty failures.  It is basically a perl transcription if the
ngmatch.c file from Cnews. 

Strange thing is, it is as slow as hell, about 2 match/sec on 20Mips
sparc.  Doesn't worry me too much (as its a bit of a oncer), but
surprising.

Another thing that I found difficult was the equivalent of "fgrep -v
-f <file>", where file is large (say, 1,000 lines).  I tried a perl
script that built a program then eval'd it, but with a 30k pattern
file, the perl program grew to >5Mb and ran like a dog.  It was much
much faster to split the pat file and use a chain of fgreps.  Again,
surprising.  Any suggestions?

Greg.

#! /usr/local/bin/perl

# take a history file on stdin, and match them to the following
# newsgroups to see if we would collect it (ignore distributions here)
# Each line that matches is passed to the stdout


$sysfile_pattern = "
	gnu,comp,news,sci,rec,misc,soc,talk,alt,to.melba,aus,melb,world,general,\
	!talk.all,\
	!soc.all,soc.culture.australian,soc.culture.jewish,\
	soc.culture.arabic,\
	soc.culture.esperanto,soc.religion.islam,\
	soc.history,soc.men,soc.motss,\
	soc.women,soc.feminism,\
	!rec.all,rec.humor,rec.games.hack,rec.sport.cricket,rec.pets,\
	rec.games.trivia,rec.arts.movies.reviews,\
	!alt.all,alt.humor.oracle,alt.sources,alt.security,alt.sys.sun,\
	alt.folklore,alt.religion,\
	!sci.all,sci.space,sci.military,sci.med.aids,sci.econ,\
	sci.environment,\
	sci.energy,sci.skeptic,\
	comp.all,!comp.binaries,!comp.all.mac,!comp.all.amiga,\
	!comp.all.atari,\
	!comp.mail.maps,!comp.os,!comp.sys,comp.sys.sun,comp.sys.ibm,\
	!talk.all,all.misc:F:
";

#$sysfile_pattern = "comp,!comp.binaries.all";

$sysfile_pattern =~ s/\\.//g;
$sysfile_pattern =~ s/\s//g;
$sysfile_pattern =~ s/:.*//;
$sysfile_pattern =~ tr/A-Z/a-z/;

@patterns = split(/,/, $sysfile_pattern);


# ngmatch(grouplist)
# returns true is one of the groups in grouplist matched one of the
# patterns in patterns.

sub ngmatch {
    local ($grouplist) = @_;

    return scalar(grep(&onegroupmatch($_), split(/,/, $grouplist)));
}

#
# match the single group name in grp to the patterns.  Keep track of
# depth! (see cnews file libcnews/ngmatch.c for comments)
#
sub onegroupmatch {
    local ($grp) = @_;
    local ($pat, $neg, $depth, $hitdeepest, $faildeepest);

    foreach $_ (@patterns) {
	$neg = (($pat = $_) =~ s/^!//);
	if (($depth = &onepatmatch($grp, $pat)) > 0) {
	    $faildeepest = $depth if $depth > $faildeepest && $neg;
	    $hitdeepest = $depth if $depth > $hitdeepest && !$neg;
	}
    }
    $hitdeepest > $faildeepest;
}

#
# match the single newsgroup against the single pattern
#
sub onepatmatch {
    local($grp, $pat) = @_;
    local (@gs, @ps);
    local($p, $g, $incr, $depth);

    @gs = split(/\./, $grp);
    @ps = split(/\./, $pat);

    $depth = 0;
    
    for ($p = shift(@ps), $g = shift(@gs);
	 $p && $g; $p = shift(@ps), $g = shift(@gs)) {
	$incr = 20;
	return 0 if ($p ne $g && $p ne "all");
	if ($p eq $g) {
	    $depth += $incr;
	} else { # Is a wildcard match
	    $depth += --$incr;
	}
    }
    ($p && !$g) ? 0 : $depth;
}

if (0) {
    $v = "comp.graphics";
    $m = &ngmatch($v);
    print "match $v: $m\n";
    $v = "comp.graphics";
    $m = &ngmatch($v);
    print "match $v: $m\n";
}
#
#
# Now the main loop
#
#

if (1) {
    open (GOOD, ">good.out") || die "can't open good.out: $!\n";
    open (BAD, ">bad.out") || die "can't open bad.out: $!\n";
#    open (EXPIRED, ">expired.out") || die "can't open expired.out: $!\n";
    
    while (<>) {
	@L = split(/\t/);
	next unless ( ($ng = @L[2]) =~ s=/\d+==g );
	$ng =~ s/\s+/,/g;
	$ng =~ s/,+$//;
	
	if (&ngmatch($ng)) {
	    print GOOD;
	} else {
	    print BAD;
	}
	
##    last if $num++ > 100;
    }
    
}
--
Gregory Bond, Burdett Buckeridge & Young Ltd, Melbourne, Australia
Internet: gnb@melba.bby.oz.au    non-MX: gnb%melba.bby.oz@uunet.uu.net
Uucp: {uunet,pyramid,ubc-cs,ukc,mcvax,prlb2,nttlab...}!munnari!melba.bby.oz!gnb

flee@cs.psu.edu (Felix Lee) (04/26/91)

>Strange thing is, it is as slow as hell, about 2 match/sec on 20Mips sparc.

This is because Perl is a terribly inefficient at string processing
(half smiley).

Below, the version of ngmatch in Perl that I use.  It's at least an
order of magnitude faster.  The code is quite ugly.
--
Felix Lee	flee@cs.psu.edu

#! /bin/sh
# This is a shell archive.  Remove anything before this line, then unpack
# it by saving it into a file and typing "sh file".  To overwrite existing
# files, type "sh file -c".  You can also feed this as standard input via
# unshar, or by typing "sh <file", e.g..  If this archive is complete, you
# will see the following message at the end:
#		"End of shell archive."
# Contents:  z.pl ngmatch.pl
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
if test -f 'z.pl' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'z.pl'\"
else
echo shar: Extracting \"'z.pl'\" \(874 characters\)
sed "s/^X//" >'z.pl' <<'END_OF_FILE'
Xshift, $ngmatch'debug = 1 if $ARGV[0] eq '-d';
Xdo 'ngmatch.pl';
X
X$pattern = "gnu,comp,news,sci,rec,misc,soc,talk,alt,to.melba,aus,melb,world,general,!talk.all,!soc.all,soc.culture.australian,soc.culture.jewish,soc.culture.arabic,soc.culture.esperanto,soc.religion.islam,soc.history,soc.men,soc.motss,soc.women,soc.feminism,!rec.all,rec.humor,rec.games.hack,rec.sport.cricket,rec.pets,rec.games.trivia,rec.arts.movies.reviews,!alt.all,alt.humor.oracle,alt.sources,alt.security,alt.sys.sun,alt.folklore,alt.religion,!sci.all,sci.space,sci.military,sci.med.aids,sci.econ,sci.env






ironment,sci.energy,sci.skeptic,comp.all,!comp.binaries,!comp.all.mac,!comp.all.amiga,!comp.all.atari,!comp.mail.maps,!comp.os,!comp.sys,comp.sys.sun,comp.sys.ibm,!talk.all,all.misc";
X
X&ngmatch'compile('match', $pattern);
X
X$| = 1;
Xwhile (<>) {
X	if (&match($_)) {
X		print '+';
X	} else {
X		print '-';
X	}
X}
END_OF_FILE
if test 874 -ne `wc -c <'z.pl'`; then
    echo shar: \"'z.pl'\" unpacked with wrong size!
fi
# end of 'z.pl'
fi
if test -f 'ngmatch.pl' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'ngmatch.pl'\"
else
echo shar: Extracting \"'ngmatch.pl'\" \(1453 characters\)
sed "s/^X//" >'ngmatch.pl' <<'END_OF_FILE'
Xpackage ngmatch;
X
X# &ngmatch_compile($name, $pattern)
X#
X# Create a new subroutine called $name:
X#	&$name($group)
X# that returns true if $group matches $pattern.
X# $pattern is something like "comp,!comp.sys,comp.sys.mac".
X# "all" can be used to wildcard any single component, like "comp.all.mac".
X# Any pattern, like "comp", will also match "comp.all", "comp.all.all", etc.
X
Xsub ngmatch'compile {
X	local($name, $pattern) = @_;
X	local(@n, $_, $match, $flag, $package);
X
X	@n = split(/,/, $pattern);
X
X	grep((
X		s/\./,/g,		# change periods to commas
X		s/([^!,\w])/\\$1/g,	# quote special characters
X		s/(,all)$/(,[^,]*|)/,	# trailing ".all"
X		s/$/(,.*|)$/,		# implicit tail
X		s/^(!?)all,/\1([^,]*),/,# leading "all."
X		s/,all,/,([^,]*),/g,	# embedded ".all."
X		s/,/\\./g,		# change commas to quoted periods.
X	), @n);
X
X	# Now we build a series of tests.
X	$match = '';
X	for $_ (@n) {
X		$flag = !s/^!//;
X		s/^/^/;			# anchor the pattern.
X		$match .= '@x = /'.$_.'/;';
X		$match .= "\nprint '".$_." '".',@x+0," @x\n";' if $debug;
X		$match .= 'if (($x = @x) && ($x *= 19) < $best) {
X				$x += ($x[$#x] =~ tr/././) * 20;
X				$result = '.$flag.', $best = $x if $x < $best;
X		};';
X		$match .= 'print " $x $best $result\n";' if $debug;
X	}
X
X	local($package) = (caller)[0];
X
X	$_ = '
X	package '.$package.';
X	sub '.$name.' {
X		local($_) = @_;
X		local($result, $best, $x, @x) = (0, 1e20);
X		'.$match.'
X		$result;
X	}';
X	eval "$_; 1" || die "ngmatch'compile: $@";
X}
END_OF_FILE
if test 1453 -ne `wc -c <'ngmatch.pl'`; then
    echo shar: \"'ngmatch.pl'\" unpacked with wrong size!
fi
# end of 'ngmatch.pl'
fi
echo shar: End of shell archive.
exit 0