lfk@athena.mit.edu (Lee F Kolakowski) (07/06/90)
Hello out there in the land of perls. I am thinking of moving an application from awk to perl, but wonder what kind of speed up is possible and maybe of I am make a pseudo-finite state machine out of problem. I have a file with 350 regexs, which is expected to grow to 500 or more by years end. I want to search all of these on the input line. Currently, I read the regexs into an array, and loop through all the regexs doing a match each time. If it matches (I know where from the RSTART var), I cut off the input up to that point and try match again. so my code looks like this: { if (FILENAME == "regex file") { accession[NR] = $1 ; regex[NR] = $2; name[NR] = $3; regex_num = NR; } else { n = length($0) for (i = 1; i <= regex_num ; i++ ) { if (match($0, regex[i])) { # if it matches do stuff, then cut off the # begining of string up to match and try again while (match(new,regex[i])) { # do some of the same stuff above } } } } } So the question is can I build a fsm using perl for this kind of thing?? Wondering.... -- Frank Kolakowski ====================================================================== |lfk@athena.mit.edu || Lee F. Kolakowski | |lfk@eastman2.mit.edu || M.I.T. | |kolakowski@wccf.mit.edu || Dept of Chemistry | |lfk@mbio.med.upenn.edu || Room 18-506 | |lfk@hx.lcs.mit.edu || 77 Massachusetts Ave.| |AT&T: 1-617-253-1866 || Cambridge, MA 02139 | |--------------------------------------------------------------------| | #include <woes.h> | | One-Liner Here! | ======================================================================
lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (07/06/90)
In article <1990Jul5.225427.12822@athena.mit.edu> lfk@athena.mit.edu (Lee F Kolakowski) writes:
: I am thinking of moving an application from awk to perl, but wonder
: what kind of speed up is possible and maybe of I am make a
: pseudo-finite state machine out of problem.
:
: I have a file with 350 regexs, which is expected to grow to 500 or
: more by years end.

You might get a considerable speedup if you do it right. And it might
run slower if you do it wrong.

: I want to search all of these on the input line. Currently, I read the
: regexs into an array, and loop through all the regexs doing a match
: each time. If it matches (I know where from the RSTART var), I cut off
: the input up to that point and try match again.

If I take you literally, we now have an infinite loop, because the same
match will match again. Do you mean you chop it at RSTART+RLENGTH?

: so my code looks like this:
: [code omitted]
:
: So the question is can I build a fsm using perl for this kind of
: thing??

Certainly. At a minimum, you could just run your awk script through the
a2p translator and get a perl script. But it produces rather
awkish-looking perl, and will run slowly since it's recompiling the
patterns over and over. A more idiomatic perl script for that would
look like this:

#!/usr/bin/perl

$rfile = shift;
open(RFILE, $rfile) || die "Can't open regex file: $!\n";

# First we'll build a little subroutine to match all the patterns.
# The embedded code is indented for readability:

$prog = <<EOF;
sub matches {
    study;
EOF
while (<RFILE>) {
    ($accession, $regex, $name) = split(' ');
    $accession[$.] = $accession;
    $name[$.] = $name;
    $prog .= <<EOF;
    &pullup($.) while /$regex/;
EOF
}
$prog .= <<EOF;
}
EOF
close RFILE;

# now we eval the program we just built up to define the subroutine
# &matches.

print STDERR $prog if $verbose;
eval $prog;
die $@ if $@;

# this routine is called whenever there's a match on the current line

sub pullup {
    local($id) = @_;
    substr($_, 0, length($`) + length($&)) = '';  # cut off front of $_
    study;                  # restudy remainder of string
    # remainder of routine does something with $accession[$id] and
    # $name[$id], such as:
    print "Found $accession[$id] $name[$id]\n";
}

# now we actually iterate over the input lines

while (<>) {
    &matches;
}

Or something like that.

If your input text doesn't resemble normal written text or programs in
letter frequency, you might wish to delete the calls to "study;", since
it may not run any faster with them in. I doubt study would help in
scanning DNA sequences, for instance. But a list of chemical names
would probably be ok.

Whether perl will beat awk also depends on how complicated you get with
your regular expressions, since awk uses a DFA algorithm and Perl
doesn't. If you make Perl do a lot of backtracking, it'll get slow.
That's the penalty for access to backreferences. If you don't use
study, it also depends on the length of the longest constant substrings
in your patterns, since Perl uses a Boyer-Moore kind of prescan.

Larry Wall
lwall@jpl-devvax.jpl.nasa.gov
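The essential point of the eval-built &matches above is that every
pattern gets compiled exactly once, before the input loop, instead of
being recompiled on every line. A rough Python analogue of that
compile-once structure (regex-file contents and columns are invented
stand-ins):

```python
import re

# Stand-in for the regex file: accession, pattern, name on each line.
REGEX_LINES = """\
AC0001 GAATTC EcoRI
AC0002 GGATCC BamHI
"""

# Compile each pattern a single time up front, as the eval-built
# subroutine does; compiling inside the per-line loop is what makes
# the naive translation slow.
table = []
for row in REGEX_LINES.splitlines():
    accession, pattern, name = row.split()
    table.append((accession, re.compile(pattern), name))

def matches(line):
    found = []
    for accession, rx, name in table:
        pos = 0
        m = rx.search(line, pos)
        while m:
            found.append(f"Found {accession} {name}")
            # advancing pos past the match plays the role of &pullup
            # chopping $` and $& off the front of $_
            pos = m.end()
            m = rx.search(line, pos)
    return found

for msg in matches("xxGAATTCyyGGATCCzz"):
    print(msg)
```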
lfk@athena.mit.edu (Lee F Kolakowski) (07/07/90)
First, let me apologize for my posting appearing twice; my mistake.

Larry replied:
> If I take you literally, we now have an infinite loop, because the same
> match will match again. Do you mean you chop it at RSTART+RLENGTH?

Actually I chop at RSTART+1; chopping at RSTART+RLENGTH may allow
patterns starting between RSTART and RSTART+RLENGTH to be missed.

Thanks for the posted code. I have to study() the code a while before I
can make any intelligent comments about it (being new to perl and
such). Thanks.

--
Frank Kolakowski
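The overlap concern behind chopping at RSTART+1 rather than
RSTART+RLENGTH can be shown with a small Python sketch (the pattern and
text are contrived to overlap):

```python
import re

def count_hits(text, pattern, chop_full_match):
    """Count matches of one pattern, restarting the scan either just
    past RSTART (chop_full_match=False) or past RSTART+RLENGTH
    (chop_full_match=True)."""
    rx = re.compile(pattern)
    n, pos = 0, 0
    m = rx.search(text, pos)
    while m:
        n += 1
        pos = m.end() if chop_full_match else m.start() + 1
        m = rx.search(text, pos)
    return n

# "ABA" occurs at offsets 0 and 2 of "ABABA", and the two hits overlap:
print(count_hits("ABABA", "ABA", chop_full_match=True))   # -> 1
print(count_hits("ABABA", "ABA", chop_full_match=False))  # -> 2
```

Chopping past the whole match skips the second, overlapping occurrence;
advancing only one character past RSTART finds both.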