boehm@flora.rice.edu (Hans Boehm) (11/12/89)
A number of scanner generators, including lex, make the claim that they can
handle regular expressions specifying right context. However, as was pointed
out to me by a student in a compiler class I was teaching, the implementation
that lex uses (namely, the one described in the Aho and Ullman text) is wrong.
(I believe the same bug exists in Aho, Sethi, and Ullman, but I don't have a
copy handy at the moment.) An example is:

(a|ab)/ba

on input "aba". Lex will consider "ab" to be the first token instead of "a".
(It first looks for the concatenation of both regular expressions, then
backtracks to the rightmost occurrence of a final state corresponding to the
first regular expression.)

Are there any scanner generators that actually do this right? I believe a
correct solution is to build a finite automaton for the reversal of the
second regular expression, and run it during the backtracking process. This
is not asymptotically worse than the standard (incorrect) algorithm, but it
is more complicated than I would like.

Hans-J. Boehm
boehm@rice.edu

[How about stepping across the matched token, trying to match the second
expression; when it matches all the way to the end you've found the right
context. It's never been clear to me how useful lex-style right context is.
Anyone have some real examples? -John]
--
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus}!esegue.  Meta-mail to compilers-request@esegue.
Please send responses to the author of the message, not the poster.
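[A small sketch of the buggy backtracking strategy described above, in Python.
`re.fullmatch` on a prefix stands in for "the combined DFA was in an end-of-x
final state here" -- an illustrative assumption, since lex actually compiles
everything into one automaton. Note `re.match` is leftmost rather than
longest-match; for this example the two coincide. -ed]

```python
import re

def lex_trailing_context(x_pat, y_pat, text):
    """Sketch of lex's (buggy) strategy for a rule x/y: match the
    concatenation x.y as far as possible, then back up to the RIGHTMOST
    end-of-x final state -- without checking that the characters given
    back actually match y.  That missing check is the bug."""
    m = re.match(f"(?:{x_pat})(?:{y_pat})", text)  # forward scan over x.y
    if m is None:
        return None
    whole = m.group(0)
    # back up to the last position that was an end-of-x final state
    for end in range(len(whole), 0, -1):
        if re.fullmatch(x_pat, whole[:end]):
            return whole[:end]      # token = everything up to that state
    return None

print(lex_trailing_context("a|ab", "ba", "aba"))  # -> ab  (should be a)
```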
boehm@flora.rice.edu (Hans Boehm) (11/13/89)
The moderator concluded my previous message with the following questions:

>[How about stepping across the matched token, trying to match the second
>expression; when it matches all the way to the end you've found the right
>context. It's never been clear to me how useful lex-style right context
>is. Anyone have some real examples? -John]

Here are some partial answers: I may be wrong, but it has been my impression
that lex is capable of building Fortran scanners, which is certainly no easy
task. For example, isn't it the case that DO can be recognized by looking for
a `d' followed by an `o' in a right context in which a `,' precedes the first
left parenthesis? My impression is that nobody has really built a Fortran
scanner this way, but I assumed that was for performance reasons.

Trying to match the context expression separately is, at least in some
theoretical sense, not very efficient. If there is more than one possible
regular expression that may match, it may be necessary to look for several
different right contexts at several different starting positions. This would
involve backtracking within the recognition of a single token. The current
lex algorithm has the property that it only needs to simulate a single finite
automaton, independent of the number of regular expressions in the
specification, and independent of whether or not they include right context.
It does need to back up in the input at the end of every token recognition to
find the longest match, but this happens only once per token. It's a simple
scan to find the last time the FA was in a final state. Right context is
implemented by filtering out certain final states.

I believe we can keep things down to a single forward and a single backward
scan by building an FA recognizing strings ending in the reversal of one of
the right context expressions, and running it in the backup phase. Are there
simpler techniques that don't require repeated backtracking?

Hans-J.
Boehm
boehm@rice.edu

[You can't build a Fortran scanner in lex without some external hackery. In
particular, to identify a DO statement you need to look for a comma not
inside of parens to the right of the equal, and you can't do paren matching
with regular expressions. I don't think you can do H constants such as 3Hfoo
in lex, either. And in a statement like "DO10E5I=1,10" you need contextual
help to tell that the statement number is 10 rather than 10E5 and the
variable is E5I rather than I. (Nearly everywhere else that an integer
constant is allowed, a real constant is, too, or at least there's no
ambiguity.) When I wrote an F77 compiler (INfort, for anyone who may have
used it) I did the same thing as everyone else, which is to prescan each
statement to take out the string constants and tell whether it was an
assignment statement or not. Having done that, scanning is easy, although it
needs a lot of feedback from the parser. For people who are interested, I am
giving away a Fortran subset parser that I wrote a while ago. -John]
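[Boehm's proposed fix -- record the end-of-x final states during the forward
scan, then make one backward scan driven by an automaton for the reversal of
y -- can be sketched as follows. Python `re.fullmatch` on the suffix stands
in for "the reversed-y automaton is in a final state", so this shows the
accepted positions rather than the automaton construction itself; names are
illustrative. -ed]

```python
import re

def scan_with_reversed_context(x_pat, y_pat, text):
    """Sketch of the proposed fix for a rule x/y: one forward scan over
    x.y that records every end-of-x final state, then one backward scan
    that accepts the rightmost end-of-x position whose suffix is in y.
    Reading whole[end:] from the right with an automaton for reverse(y)
    is what re.fullmatch on the suffix stands in for here."""
    m = re.match(f"(?:{x_pat})(?:{y_pat})", text)   # forward scan
    if m is None:
        return None
    whole = m.group(0)
    # positions where the forward automaton was in an end-of-x state
    x_ends = [i for i in range(1, len(whole) + 1)
              if re.fullmatch(x_pat, whole[:i])]
    # backward scan: stop at the first end-of-x position whose suffix
    # is accepted by the (conceptual) reversed-y automaton
    for end in reversed(x_ends):
        if re.fullmatch(y_pat, whole[end:]):
            return whole[:end]
    return None

print(scan_with_reversed_context("a|ab", "ba", "aba"))  # -> a
```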
vern@cs.cornell.edu (Vern Paxson) (11/14/89)
In article <1989Nov12.041025.12451@esegue.segue.boston.ma.us>
boehm@flora.rice.edu (Hans Boehm) writes:

>A number of scanner generators, including lex, make the claim that
>they can handle regular expressions specifying right context.
>However, as was pointed out to me by a student in a compiler class
>I was teaching, the implementation that lex uses (namely that described
>in the Aho and Ullman text) is wrong. (I believe the same bug exists
>in Aho, Sethi, and Ullman, but I don't have a copy handy at the moment.)
>An example is:
>
>(a|ab)/ba
>
>on input "aba". Lex will consider "ab" to be the first token instead of "a".
>...
>Are there any scanner generators that actually
>do this right?

Note that flex kind of does this correctly--it allows x/y whenever it can get
it right, i.e., if either x or y has a fixed length, or if both are variable
length and the end of x is unambiguous. It will correctly match the above
example. If the example is changed to "(a|ab)/ba*", then it warns "Dangerous
trailing context in rule at line <whatever>".

Vern

Vern Paxson                vern@cs.cornell.edu
Computer Science Dept.     decvax!cornell!vern
Cornell University         vern@LBL (bitnet)

[Vern is the author of flex. -John]
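[The fixed-length half of flex's safety test can be approximated by asking
whether a pattern has exactly one possible match length. A rough sketch using
Python's `sre_parse` regex parser -- an illustrative assumption, since flex's
real check runs on its own internal representation, and this ignores the
"unambiguous end of x" case entirely. -ed]

```python
import sre_parse  # regex parse trees; an internal-ish module, deprecated alias in 3.11+

def fixed_length(pattern):
    """Return the single possible match length of `pattern`, or None if
    the length can vary.  A rough stand-in for the test flex performs
    before allowing a trailing-context rule x/y."""
    def walk(nodes):
        total = 0
        for op, arg in nodes:
            name = str(op)
            if name in ('LITERAL', 'NOT_LITERAL', 'IN', 'ANY'):
                total += 1                      # exactly one character
            elif name == 'SUBPATTERN':
                sub = walk(arg[3])              # length of the group body
                if sub is None:
                    return None
                total += sub
            elif name == 'BRANCH':
                lens = {walk(branch) for branch in arg[1]}
                if None in lens or len(lens) != 1:
                    return None                 # alternatives differ in length
                total += lens.pop()
            elif name in ('MAX_REPEAT', 'MIN_REPEAT'):
                lo, hi, body = arg
                sub = walk(body)
                if lo != hi or sub is None:
                    return None                 # *, +, ?, {m,n} all vary
                total += lo * sub
            else:
                return None                     # anything else: punt
        return total
    return walk(sre_parse.parse(pattern))

# (a|ab)/ba is accepted: the context "ba" has fixed length 2
print(fixed_length("ba"))                         # -> 2
# (a|ab)/ba* draws the warning: both sides are variable length
print(fixed_length("a|ab"), fixed_length("ba*"))  # -> None None
```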
peterd@cs.washington.edu (Peter C. Damron) (11/14/89)
In article <1989Nov12.223642.14219@esegue.segue.boston.ma.us>
boehm@flora.rice.edu (Hans Boehm) writes:

> Trying to match the context expression separately is, at least in some
>theoretical sense, not very efficient. If there is more than one possible
>regular expression that may match, it may be necessary to look for several
>different right contexts at several different starting positions. This would
>involve backtracking within the recognition of a single token. The current
>lex algorithm has the property that it needs to only simulate a single finite
>automaton, independent of the number of regular expressions in the
>specification, and independent of whether or not they include right context.
>It does need to back up in the input at the end of every token recognition to
>find the longest match, but this only happens once per token. It's a simple
>scan to find the last time the FA was in a final state. Right context is
>implemented by filtering out certain final states.

I don't know how lex matches right context, but here is my suggestion:

1. Take all the regular expressions for a given meta-state, and tack on the
   right context, remembering where the division is.

2. Build the finite automaton, which has two types of final states: one for
   the end of a pattern and one for the end of a right context.

3. Run the scanner over the input string until the automaton hits an error
   state. During the scan, push every final state reached onto a stack, along
   with the corresponding string position.

4. Examine the stack to find the first (longest) end-of-pattern final state
   that also has a corresponding end-of-right-context final state.

5. Restart the scan at the position of that end-of-pattern final state.

There is no need to build a reverse string matcher. You do have to re-scan
the right context of the matching pattern.
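[The five steps above can be sketched like so, for a single rule x/y.
`re.fullmatch` stands in for the combined automaton's two kinds of final
states, and the correspondence test in step 4 is exactly the re-scan of the
right context; names are illustrative, and the forward loop checks every
prefix rather than running a real DFA to an error state. -ed]

```python
import re

def scan_with_context_stack(x_pat, y_pat, text):
    """Sketch of the stack scheme: forward scan pushing every final
    state (with its position), then search the stack for the longest
    end-of-pattern state with a corresponding end-of-context state."""
    combined = f"(?:{x_pat})(?:{y_pat})"       # step 1: x with y tacked on
    stack = []                                 # step 3: (position, kind)
    for i in range(1, len(text) + 1):
        if re.fullmatch(x_pat, text[:i]):      # step 2: end-of-pattern state
            stack.append((i, 'end-of-pattern'))
        if re.fullmatch(combined, text[:i]):   # step 2: end-of-context state
            stack.append((i, 'end-of-context'))
    # step 4: longest end-of-pattern with a corresponding end-of-context;
    # checking the correspondence re-scans the right context
    for pos, kind in reversed(stack):
        if kind == 'end-of-pattern' and any(
                k == 'end-of-context' and re.fullmatch(y_pat, text[pos:p])
                for p, k in stack):
            return text[:pos]                  # step 5: restart scan here
    return None

print(scan_with_context_stack("a|ab", "ba", "aba"))  # -> a
```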
However, you may be able to build the automaton in such a way that you can
figure out the new final-state stack and the new starting state without any
re-scan. I have no idea how difficult that might be to compute.

Hope this helps (and works),
--
Peter C. Damron, Dept. of Computer Science, FR-35
University of Washington, Seattle, WA 98195
peterd@cs.washington.edu, {ucbvax,decvax,etc.}!uw-beaver!uw-june!peterd