[comp.compilers] right context in scanner generators

boehm@flora.rice.edu (Hans Boehm) (11/12/89)

A number of scanner generators, including lex, make the claim that
they can handle regular expressions specifying right context.
However, as was pointed out to me by a student in a compiler class
I was teaching,  the implementation that lex uses (namely that described
in the Aho and Ullman text) is wrong.  (I believe the same bug exists
in Aho, Sethi, and Ullman, but I don't have a copy handy at the moment.)
An example is:

(a|ab)/ba

on input "aba".  Lex will consider "ab" to be the first token instead of "a".
(It first looks for the concatenation of both regular expressions, then back-
tracks to the rightmost occurrence of a final state corresponding to the first
regular expression.)  Are there any scanner generators that actually
do this right?

I believe a correct solution is to build a finite automaton for the
reversal of the second regular expression, and run it during the backtracking
process.  This is not asymptotically worse than the standard (incorrect)
algorithm, but it is more complicated than I would like.

Hans-J. Boehm
boehm@rice.edu
[How about stepping across the matched token, trying to match the second
expression; when it matches all the way to the end you've found the right
context.  It's never been clear to me how useful lex-style right context
is.  Anyone have some real examples?  -John]
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus}!esegue.  Meta-mail to compilers-request@esegue.
Please send responses to the author of the message, not the poster.

boehm@flora.rice.edu (Hans Boehm) (11/13/89)

The moderator concluded my previous message with the following questions:

>[How about stepping across the matched token, trying to match the second
>expression; when it matches all the way to the end you've found the right
>context.  It's never been clear to me how useful lex-style right context
>is.  Anyone have some real examples?  -John]

Here are some partial answers:

  I may be wrong, but it has been my impression that lex is capable of
building Fortran scanners, which is certainly no easy task.  For example,
isn't it the case that DO can be recognized by looking for a `d' followed by
an `o' in a right context in which a `,' precedes the first left parenthesis?
My impression is that nobody has really built a Fortran scanner in this way,
but I assumed that was for performance reasons.

  Trying to match the context expression separately is, at least in some
theoretical sense, not very efficient.  If there is more than one possible
regular expression that may match, it may be necessary to look for several
different right contexts at several different starting positions.  This would
involve backtracking within the recognition of a single token.  The current
lex algorithm has the property that it needs to only simulate a single finite
automaton, independent of the number of regular expressions in the
specification, and independent of whether or not they include right context.
It does need to back up in the input at the end of every token recognition to
find the longest match, but this only happens once per token.  It's a simple
scan to find the last time a the FA was in a final state.  Right context is
implemented by filtering out certain final states. 

  I believe we can keep things down to a single forward and a single backward
scan by building a FA recognizing strings ending in the reversal of one of the
right context expressions, and running it in the backup phase.  Are there
simpler techniques that don't require repeated backtracking?

Hans-J. Boehm
boehm@rice.edu

[You can't build a Fortran scanner in lex without some external hackery.  In
particular, to identify a DO statement you need to look for a comma not inside
of parens to the right of the equal, and you can't do paren matching with
regular expressions.  I don't think you can do H constants such as 3Hfoo in
lex, either.  And in a statement like "DO10E5I=1,10" you need contextual help
to tell that the statement number is 10 rather than 10E5 and the variable is
E5I rather than I.  (Nearly everywhere else that an integer constant is
allowed, a real constant is, too, or at least there's no ambiguity.)

  When I wrote an F77 compiler (INfort, for anyone who may have used it) I did
the same thing as everyone else which is to prescan each statement to take out
the string constants and tell whether it was an assignment statement or not.
Having done that, scanning is easy although it needs a lot of feedback from
the parser.  For people who are interested, I am giving away a Fortran subset
parser that I wrote a while ago.  -John]
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus}!esegue.  Meta-mail to compilers-request@esegue.
Please send responses to the author of the message, not the poster.

vern@cs.cornell.edu (Vern Paxson) (11/14/89)

In article <1989Nov12.041025.12451@esegue.segue.boston.ma.us> boehm@flora.rice.edu (Hans Boehm) writes:
>A number of scanner generators, including lex, make the claim that
>they can handle regular expressions specifying right context.
>However, as was pointed out to me by a student in a compiler class
>I was teaching,  the implementation that lex uses (namely that described
>in the Aho and Ullman text) is wrong.  (I believe the same bug exists
>in Aho, Sethi, and Ullman, but I don't have a copy handy at the moment.)
>An example is:
>
>(a|ab)/ba
>
>on input "aba".  Lex will consider "ab" to be the first token instead of "a".
>...
>Are there any scanner generators that actually
>do this right?

Note that flex kind of does this correctly--it allows x/y whenever it can
get it right, i.e., if either x or y have fixed length or if both are
variable length and the end of x will be unambiguous.  It will correctly
match the above example.  If the example is changed to "(a|ab)/ba*" then
it warns "Dangerous trailing context in rule at line <whatever>".

		Vern

	Vern Paxson			      vern@cs.cornell.edu
	Computer Science Dept.		      decvax!cornell!vern
	Cornell University		      vern@LBL (bitnet)
[Vern is the author of flex.  -John]
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus}!esegue.  Meta-mail to compilers-request@esegue.
Please send responses to the author of the message, not the poster.

peterd@cs.washington.edu (Peter C. Damron) (11/14/89)

In article <1989Nov12.223642.14219@esegue.segue.boston.ma.us> boehm@flora.rice.edu (Hans Boehm) writes:
>  Trying to match the context expression separately is, at least in some
>theoretical sense, not very efficient.  If there is more than one possible
>regular expression that may match, it may be necessary to look for several
>different right contexts at several different starting positions.  This would
>involve backtracking within the recognition of a single token.  The current
>lex algorithm has the property that it needs to only simulate a single finite
>automaton, independent of the number of regular expressions in the
>specification, and independent of whether or not they include right context.
>It does need to back up in the input at the end of every token recognition to
>find the longest match, but this only happens once per token.  It's a simple
>scan to find the last time a the FA was in a final state.  Right context is
>implemented by filtering out certain final states. 

I don't know how lex matches right context, but here is my suggestion.

1.  Take all the regular expressions for a given meta-state, and tack on
    the right context, remembering where the division is.
2.  Build the finite automaton which has two types of final states,
    one for the end of a pattern and one for the end of a right context.
3.  Run the scanner over an input string until the automata hits an error
    state.  During the scan, push all the final states reached onto a stack
    along with the corresponding string position.
4.  Examine the stack to find the first (longest) end-of-pattern final state
    that also has a corresponding end-of-right-context final state.
5.  Restart the scan at the position of the end of pattern final state.

There is no need to build a reverse string matcher.  You do have to re-scan
the right context of the matching pattern.  However, you may be able to
build the automata in such a way that you can figure out the new final state
stack and the new starting state without any re-scan.  I have no idea how
difficult that might be to compute.

Hope this helps (and works),
--
Peter C. Damron, Dept. of Computer Science, FR-35
University of Washington, Seattle, WA  98195
peterd@cs.washington.edu, {ucbvax,decvax,etc.}!uw-beaver!uw-june!peterd
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus}!esegue.  Meta-mail to compilers-request@esegue.
Please send responses to the author of the message, not the poster.