[comp.lang.perl] perl slowness with regexp

jbw@bucsf.bu.edu (Joe Wells) (03/06/90)

I ran into some interesting perl behavior recently.  The following short
program will illustrate it:

    for (1 .. 20)
    {
      $_ = 'Y' . ('x' x $_);
      print "_: $_\n";
      m/(.+)*Y/;
      print "`: $`, 1: $1, &: $&, ': $'\n";
    }

If you run this program, you will notice that each iteration of the loop
takes much longer than the previous iteration.  Am I correct in guessing
that this is perl's way of telling me don't do that?  Or is this kind of
regular expression optimizable?

For the curious, this is the original regexp from which I distilled the
one in the program above:

    ^\s*(\S+(\s*\S+)*)\s*<.*>$

I was trying to exclude the spaces on the end of what I was looking for.
Since then, I've done that afterwards.

-- 
Joe Wells <jbw@bu.edu>
jbw%bucsf.bu.edu@cs.bu.edu
...!harvard!bu-cs!bucsf!jbw

lwall@jpl-devvax.JPL.NASA.GOV (Larry Wall) (03/07/90)

In article <JBW.90Mar5221106@bucsf.bu.edu> jbw@bucsf.bu.edu (Joe Wells) writes:
: I ran into some interesting perl behavior recently.  The following short
: program will illustrate it:
: 
:     for (1 .. 20)
:     {
:       $_ = 'Y' . ('x' x $_);
:       print "_: $_\n";
:       m/(.+)*Y/;
:       print "`: $`, 1: $1, &: $&, ': $'\n";
:     }
: 
: If you run this program, you will notice that each iteration of the loop
: takes much longer than the previous iteration.  Am I correct in guessing
: that this is perl's way of telling me don't do that?  Or is this kind of
: regular expression optimizable?

What you've just discovered is that with a backtracking mechanism it's easy
to fall into combinatorial explosions.  Saying /.+/ gives you one level
of backtracking; saying /(.+)*/ backtracks for every state of the inner
backtrack.  As to whether it's optimizable or not /(.+)*/ is obviously
equivalent to /.*/, but I don't think it's a worthwhile optimization to
put in, because it will almost never arise in real life (your own problem
has other stuff in with the \S+).

There is, of course, the optimization of using a non-bactracking algorithm
like egrep, but then you wouldn't be able to isolate $1, because the way
the states are constructed in the DFA, in a particular state you could be
just starting or just ending or in the middle of or nowhere near any number
of potential $1's.  Well, maybe not any number.  It might just be possible
to construct your DFA such that it kept an array of potential $1's, but I
get a headache when I look at GNU egrep.  And when GNU egrep itself sees
backreferences it punts and goes to a backtracking algorithm.

So yes, don't do that.  There's almost always a much more efficient way,
and probably a more readable one at that.

Larry