[comp.lang.perl] FSM's for large regex's in Perl or Awk

lfk@athena.mit.edu (Lee F Kolakowski) (07/06/90)

Hello out there in the land of perls.

I am thinking of moving an application from awk to perl, but wonder
what kind of speed up is possible and maybe of I am make a
pseudo-finite state machine out of problem.

I have a file with 350 regexs, which is expected to grow to 500 or
more by years end.

I want to search all of these on the input line. Currently, I read the
regexs into an array, and loop through all the regexs doing a match
each time. If it matches (I know where from the RSTART var), I cut off
the input up to that point and try match again.

so my code looks like this:

{
  if (FILENAME == "regex file") {
    accession[NR] = $1 ; regex[NR] = $2; name[NR] = $3; regex_num = NR;
  }
  else { 
    n = length($0)
    for (i = 1; i <= regex_num ; i++ ) {
      if (match($0, regex[i])) {

#	if it matches do stuff, then cut off the 
#	begining of string up to match and try again

	while (match(new,regex[i])) {

#	do some of the same stuff above

	}
      }
    }
  }
}


So the question is can I build a fsm using perl for this kind of
thing??

Wondering....


--

Frank Kolakowski 

======================================================================
|lfk@athena.mit.edu                     ||      Lee F. Kolakowski    |
|lfk@eastman2.mit.edu                   ||	M.I.T.		     |
|kolakowski@wccf.mit.edu                ||	Dept of Chemistry    |
|lfk@mbio.med.upenn.edu		        ||	Room 18-506	     |
|lfk@hx.lcs.mit.edu                     ||	77 Massachusetts Ave.|
|AT&T:  1-617-253-1866                  ||	Cambridge, MA 02139  |
|--------------------------------------------------------------------|
|                         #include <woes.h>         		     |
|		           One-Liner Here!                           |
======================================================================

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (07/06/90)

In article <1990Jul5.225427.12822@athena.mit.edu> lfk@athena.mit.edu (Lee F Kolakowski) writes:
: I am thinking of moving an application from awk to perl, but wonder
: what kind of speed up is possible and maybe of I am make a
: pseudo-finite state machine out of problem.
: 
: I have a file with 350 regexs, which is expected to grow to 500 or
: more by years end.

You might get a considerable speedup if you do it right.  And it might
run slower if you do it wrong.

: I want to search all of these on the input line. Currently, I read the
: regexs into an array, and loop through all the regexs doing a match
: each time. If it matches (I know where from the RSTART var), I cut off
: the input up to that point and try match again.

If I take you literally, we now have an infinite loop, because the same
match will match again.  Do you mean you chop it at RSTART+RLENGTH?

: so my code looks like this:
: 
[code omitted]
: 
: So the question is can I build a fsm using perl for this kind of
: thing??

Certainly.  At a minimum, you could just run your awk script through the
a2p translator and get a perl script.  But it produces rather awkish-looking
perl, and will run slowly since it's recompiling the patterns over and over.

A more idiomatic perl script for that would look like this:

#!/usr/bin/perl

$rfile = shift;

open(RFILE, $rfile) || die "Can't open regex file: $!\n";

# First we'll build a little subroutine to match all the patterns.  The
# embedded code is indented for readability:

$prog = <<EOF;
			sub matches {
				study;\n";
EOF

while (<RFILE>) {
    ($accession, $regex, $name) = split(' ');
    $accession[$.] = $accesssion;
    $name[$.] = $name;
    $prog .= <<EOF;
				&pullup($.) while /$regex/;
EOF
}
$prog .= <<EOF;
			}
EOF
close RFILE;

# now we eval the program we just built up to define the subroutine &matches.

print STDERR $prog if $verbose;
eval $prog;
die $@ if $@;

# this routine is called whenever there's a match on the current line

sub pullup {
    local($id) = @_;

    substr($_, 0, length($`) + length($&)) = '';# cut off front of $_
    study;					# restudy remainder of string

    # remainder of routine does something with $accession[$id] and $name[$id]
    # such as:
    print "Found $accession[$id] $name[$id]\n";
}

# now we actually iterate over the input lines

while (<>) {
    &matches;
}

Or something like that.

If your input text doesn't resemble normal written text or programs in
letter frequency, you might wish to delete the calls to "study;", since
it may not run any faster with them in.  I doubt study would help in
scanning DNA sequences, for instance.  But a list of chemical names
would probably be ok.

Whether perl will beat awk also depends on how complicated you get with
your regular expressions, since awk uses a DFA algorithm and Perl doesn't.
If you make Perl do a lot of backtracking, it'll get slow.  That's the
penalty for access to backreferences.

If you don't use study, it also depends on the length of the longest
constant substrings in your patterns, since Perl uses a Boyer-Moore kind
of prescan.

Larry Wall
lwall@jpl-devvax.jpl.nasa.gov

lfk@athena.mit.edu (Lee F Kolakowski) (07/07/90)

First, let me apologize for my posting appearing twice, my mistake.

Larry replied:

> If I take you literally, we now have an infinite loop, because the same
> match will match again.  Do you mean you chop it at RSTART+RLENGTH?
> 

Actually I chop at RSTART+1, RSTART+RLENGTH may allow patterns
starting between RSTART, and RSTART+RLENGTH to be missed.


Thanks for the posted code. I have to study() the code a while before
I can make any inteligent comments about it (being new to perl and
such).

Thanks.


--

Frank Kolakowski 

======================================================================
|lfk@athena.mit.edu                     ||      Lee F. Kolakowski    |
|lfk@eastman2.mit.edu                   ||	M.I.T.		     |
|kolakowski@wccf.mit.edu                ||	Dept of Chemistry    |
|lfk@mbio.med.upenn.edu		        ||	Room 18-506	     |
|lfk@hx.lcs.mit.edu                     ||	77 Massachusetts Ave.|
|AT&T:  1-617-253-1866                  ||	Cambridge, MA 02139  |
|--------------------------------------------------------------------|
|                         #include <woes.h>         		     |
|		           One-Liner Here!                           |
======================================================================