johnl@ima.UUCP (08/14/87)
> [I'd be interested in hearing what the recursive descent lexer feeds to the > parser when the input is something like a = b + ( ; -John] The basic approach in the lexer is (a) give the customer what he wants, and (b) if what he wants doesn't match what's in the input, apply some simple heuristics to try to resynchronize. Say we've just parsed the "b", so the parser knows we're in an expression that is part of a statement; here is the dialogue, slightly simplified: Parser: "Do you have any of <several expression continuations like '+'>?" Lexer: "Yes, I have '+'." P (to itself): "Aha, there's more to this expression." P: "Do you have any of <things that might start an operand, like '('>?" L: "Yes, I have '('." P (to itself): "Aha, a parenthesized expression." P: "Give me one of <things that would start an expression>." L (to itself): "<gurgling noises> That's not in my input! I'm going to have to fake it. Let's see... what I see is a line terminator, and he's not asking for a line terminator, so I will try to fake the rest of the line to resynchronize. What's the most harmless thing I can give him, out of the list he asked for?..." L: "I have '0'." P: "Do you have any of <things that might continue the expression>?" L (to itself): "To avoid infinite loops, I should try to cut this short." L: "No." P (to itself): "Must be the end of the expression." P: "Give me ')'." L (to itself): "Keep on faking it..." L: "Okay, I have ')'." P: "Do you have any of <things that might continue the outer expression>?" L (to itself): "Again, try to cut it short." L: "No." P (to itself): "Must be the end of the expression." P: "Give me ';'." L (to itself): "<sigh of relief> Resynchronized! Now I can stop lying." L: "I have ';'." ... and so forth. The details of resynchronization heuristics can get messy, especially in a language like C where clear-cut line-terminator tokens are hard to find (e.g. ';' also appears inside "for"), but the basic idea is simple and works well. It's not original with me; I think the first place where I saw it was in Dave Barnard's M.Sc. thesis, describing the SP/k parser and scanner. Henry Spencer @ U of Toronto Zoology {allegra,ihnp4,decvax,pyramid}!utzoo!henry -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
johnl@ima.UUCP (08/15/87)
In article <651@ima.ISC.COM> decvax!utzoo!henry writes: >> [I'd be interested in hearing what the recursive descent lexer feeds to the >> parser when the input is something like a = b + ( ; -John] > >The basic approach in the lexer is (a) give the customer what he wants, and >(b) if what he wants doesn't match what's in the input, apply some simple >heuristics to try to resynchronize. Say we've just parsed the "b", so the >parser knows we're in an expression that is part of a statement; here is >the dialogue, slightly simplified: > [Relatively long dialog deleted...] This seems like a pain. Recusrive descent with error recovery performed by the higher level entity would seem to be simpler, namely because the higher level entity knows more about what's going on. For example, suppose the expression parser has just parsed the paren. The conversation then proceeds as follows: Expression Parser: Give me a token. Lexical Analyzer: Okay, here's a semicolon. EP: A semicolon?! I didn't expect one of these. I expected an expression. Yo! User! I expected to see an expression after this parenthese at line 124. Okay... Now I've told the user it screwed up, let's recover from this sucker. The simplest thing to do, since I'm in an expression, is to toss tokens until I get to a synchronizing token like a semi. Hmmm... I've got a semi, so I'm now synchronized. Okay. Let's tell our caller that we parsed an expression successfully. Meanwhile... A while ago I suggested that one cannot write a reasonable compiler using YACC. I got no response. So let me attempt to generate some flames. I claim it is impossible to write a reasonable compiler using YACC. I base this claim on the fact that while it is important that a compiler produce a correct binary, it is nearly as important that the compiler produce reasonable error messages. I further claim it is impossible to produce reasonable error messages using YACC. So flame me and tell me why I don't know what I'm talking about. I feel the need for some edification... But what about developing the | Chuck Simmons @ Amdahl moon before we terraform Mars? | amdahl!chuck [Just because nobody ever does good error recovery in yacc parsers doesn't mean it's all that hard. What is hard is to figure out the right set of error productions to add to the grammar to catch the sorts of errors that will actually occur. Then there's the issue of what a good error message is. Some people like things like "statement required ID, CPAREN, SEMI, BLURCH, or BLAH but FNUMBER was encountered." Personally, I liked the ones in Fortran H which sounded like pronouncements from a slightly disdainful robot: "IEH247I THE STATEMENT CONTAINS A PROSCRIBED COMMON DECLARATION." -John] -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
johnl@ima.UUCP (08/16/87)
In article <655@ima.ISC.COM> chuck@amdahl.amdahl.com (Charles Simmons) writes: | I claim it is impossible to write a reasonable compiler | using YACC. I base this claim on the fact that ... it is | impossible to produce reasonable error messages using YACC. Well, it's certainly difficult, but that's not because YACC is a bottom-up parser. It's because you're trying to do the error recovery by hand. Bernard Dion's thesis work at Wisconsin (with Charlie Fischer) demonstrated how to perform locally least-cost *automatic* error *correction* in bottom-up parsers, with modest space and negligible time overhead. [Compiler writer gives the parser generator a table of insertion and deletion costs for tokens; error corrector makes the least cost set of insertions and deletions that allows one more real token to be shifted.] Good quality corrections make pretty good error messages. Dion's techniques are incorporated in the ECP (error correcting parser) package distributed by Fischer's group. -- Michael L. Scott University of Rochester (716) 275-7745 scott@cs.rochester.edu scott%rochester@CSNET-RELAY {decvax, allegra, seismo, cmcl2}!rochester!scott -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
johnl@ima.UUCP (08/17/87)
> This seems like a pain. Recusrive descent with error recovery performed > by the higher level entity would seem to be simpler, namely because the > higher level entity knows more about what's going on... On the contrary, doing it at the higher level is a horrendous pain, and the smooth simplicity of the low-level recovery (which has detailed guidance from the higher level, remember) is an enormous win. You have to experience the difference to fully appreciate it -- I've written parsers both ways. The problems with doing it at the higher level boil down to (a) it adds a lot of complexity to the code, and (b) error repair often has to cross the boundaries of syntactic structures, which is painful in recursive descent because those are function boundaries in the parser. By contrast, the low-level approach requires 100 lines or so of code to handle *all* syntactic error repair for the entire compiler, and it's all in one place rather than interspersed throughout the parser. > ... Okay... Now I've told the user it screwed > up, let's recover from this sucker. The simplest thing to do, since > I'm in an expression, is to toss tokens until I get to a synchronizing > token like a semi... Just where does the code that does this reside? Remember that the code for parsing an expression is big and complicated, and may be spread over several functions. (In a straightforward recursive-descent parser, it will be a dozen or more functions.) They all have to cooperate very carefully to make even such a crude algorithm work. This takes a lot more effort and code than you would think. Also, you've missed an (admittedly non-obvious) point in my contribution. The error repair need not be at anywhere near as gross a level as throwing away everything until the next semicolon. That is necessary as a "backstop" algorithm, but the more local resync heuristic can be something like "if the input token is punctuation and the requested one is not, throw away the input, otherwise keep it". This repairs many minor goofs promptly and *correctly*. Henry Spencer @ U of Toronto Zoology {allegra,ihnp4,decvax,pyramid}!utzoo!henry -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
johnl@ima.UUCP (08/17/87)
OK, he wants flames... chuck@amdahl.amdahl.com (Charles Simmons) writes: > A while ago I suggested that one cannot write a reasonable > compiler using YACC. ... I claim it is impossible to write a reasonable > compiler using YACC. ... ... I further claim it is impossible to produce > reasonable error messages using YACC. I've been working on a production compiler that uses Yacc for parsing, and I've given a bit of thought to this. I claim that it is indeed possible to produce good syntax error messages. The strategy is to print out several things: 1. a message that a syntax error has been found 2. a pointer to the token which caused the parser to barf (This is extremely important -- in practice this pointer is almost all a programmer needs. Compare this to the usual style: "line 37: invalid declaraction" when line 37 contains a nasty mess of structs and unions.) 3. a description of the token that was found (In case it tokenizes differently than the programmer expected.) 4. a listing of the tokens that are valid in this position (This gets long in some cases (expression operand expected, operator expected, and statement verb expected), so some special case code is needed to condense these forms.) This functions as a quick refresher of the grammar for the programmer. This isn't so important for languages like C with a small grammar, but its very useful for languages like ours that have over 50 different statements. (uck!) 5. a summary of where the parse stands now: statement-list ; l-value = expression + ( All of this stuff is immediately accessible from Yacc's tables. I'll admit that all of this (except 5) can be done in recursive descent, but usually people who code recursive descent don't bother. For example, in one popular C compiler, the construction struct { int a, reel /* <- note misspelling */ b } ; produces the message "closing brace expected", because when the parser sees 'reel' (which the lexer returns as 'identifier' rather than 'type name'), it says "Aha! He forgot to close off his structure declaraction." I'd much rather the compiler just tell me that I made an error, and not try to guess what I meant. As far as size and speed, I doubt that a recursive-descent parser for a complex grammar (say, 750 Yacc states) would be anywhere near as small, because the number of read-and-test code segments in an RD parser has to be about as large as the number of Yacc states. There are about 4 int's in Yacc's tables for each state, which is a lot more compact than the code you're going to generate to read-token-and-test. As far as needing an external stack for looping constructs, give me a break! Yacc maintains a stack, use it! /* generating the code for the expressions in the order they are * written requires more goto's than necessary, but otherwise I have * to write code to read and store them and then emit them later */ for_statement : FOR '(' expression ';' $remember_loc expression $goto_on_false $goto ';' $remember_loc expression $goto ')' $remember_loc statement $goto { /* if test is false, exit for */ backpatch($7, current_object_location); /* if test is true, go to body */ backpatch($8, $14); /* after step expression, go to test */ backpatch($12, $5); /* after body, go to step expression */ backpatch($16, $10); } ; $remember_loc : { $$ = current_object_location } ; $goto : { emit a goto statement, $$ = location where it's destination location will be filled in later by backpatch() }; $goto_on_false : { emit a conditional goto, $$ = destination location } ; -- Dale Worley Cullinet Software ARPA: cullvax!drw@eddie.mit.edu UUCP: ...!seismo!harvard!mit-eddie!cullvax!drw -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
johnl@ima.UUCP (08/22/87)
A friend of mine (Jim Cordy), who's better organized than I am, has supplied some properly-detailed references for people who are interested in error recovery in general and error recovery in top-down parsers in particular. The last reference in particular discusses the SP/k error recovery method that I've described in the past. D.T. Barnard and R.C. Holt; Hierarchic Syntax Error Repair for LR Grammars; International Journal of Computer and Information Sciences 11,4 (1982) pp.231-258. Also Technical Report 81-127, Department of Computing and Information Science, Queen's University (October 1981) pp.1-28. D.T. Barnard; <<this one's a survey of the field>> Syntax Error Handling Techniques; Technical Report 81-125, Department of Computing and Information Science, Queen's University (September 1981) pp.1-23. D.T. Barnard; Hierarchic Syntax Error Repair; Ph.D. Thesis, Department of Computer Science, University of Toronto (March 1981) pp.1-253. D.T. Barnard; Automatic Generation of Syntax-Repairing and Paragraphing Parsers; Technical Report CSRG-52, Computer Systems Research Group, University of Toronto (April 1975) pp.1-132. M.Sc. Thesis. Henry Spencer @ U of Toronto Zoology {allegra,ihnp4,decvax,pyramid}!utzoo!henry -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request