[comp.compilers] recursive-descent error recovery

johnl@ima.UUCP (08/14/87)

> [I'd be interested in hearing what the recursive descent lexer feeds to the
> parser when the input is something like a = b + ( ;    -John]

The basic approach in the lexer is (a) give the customer what he wants, and
(b) if what he wants doesn't match what's in the input, apply some simple
heuristics to try to resynchronize.  Say we've just parsed the "b", so the
parser knows we're in an expression that is part of a statement; here is
the dialogue, slightly simplified:

Parser:  "Do you have any of <several expression continuations like '+'>?"
Lexer:  "Yes, I have '+'."
P (to itself):  "Aha, there's more to this expression."
P:  "Do you have any of <things that might start an operand, like '('>?"
L:  "Yes, I have '('."
P (to itself):  "Aha, a parenthesized expression."
P:  "Give me one of <things that would start an expression>."
L (to itself):  "<gurgling noises> That's not in my input!  I'm going to
	have to fake it.  Let's see... what I see is a line terminator,
	and he's not asking for a line terminator, so I will try to fake
	the rest of the line to resynchronize.  What's the most harmless
	thing I can give him, out of the list he asked for?..."
L:  "I have '0'."
P:  "Do you have any of <things that might continue the expression>?"
L (to itself):  "To avoid infinite loops, I should try to cut this short."
L:  "No."
P (to itself):  "Must be the end of the expression."
P:  "Give me ')'."
L (to itself):  "Keep on faking it..."
L:  "Okay, I have ')'."
P:  "Do you have any of <things that might continue the outer expression>?"
L (to itself):  "Again, try to cut it short."
L:  "No."
P (to itself):  "Must be the end of the expression."
P:  "Give me ';'."
L (to itself):  "<sigh of relief> Resynchronized!  Now I can stop lying."
L:  "I have ';'."

... and so forth.  The details of resynchronization heuristics can get messy,
especially in a language like C where clear-cut line-terminator tokens are
hard to find (e.g. ';' also appears inside "for"), but the basic idea is
simple and works well.  It's not original with me; I think the first place
where I saw it was in Dave Barnard's M.Sc. thesis, describing the SP/k
parser and scanner.

				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.UUCP (08/15/87)

In article <651@ima.ISC.COM> decvax!utzoo!henry writes:
>> [I'd be interested in hearing what the recursive descent lexer feeds to the
>> parser when the input is something like a = b + ( ;    -John]
>
>The basic approach in the lexer is (a) give the customer what he wants, and
>(b) if what he wants doesn't match what's in the input, apply some simple
>heuristics to try to resynchronize.  Say we've just parsed the "b", so the
>parser knows we're in an expression that is part of a statement; here is
>the dialogue, slightly simplified:
>
[Relatively long dialog deleted...]

This seems like a pain.  Recusrive descent with error recovery performed
by the higher level entity would seem to be simpler, namely because the
higher level entity knows more about what's going on.  For example, suppose
the expression parser has just parsed the paren.  The conversation then
proceeds as follows:
 
Expression Parser:  Give me a token.
Lexical Analyzer:  Okay, here's a semicolon.
EP:  A semicolon?!  I didn't expect one of these.  I expected an
expression.  Yo!  User!  I expected to see an expression after this
parenthese at line 124.   Okay...  Now I've told the user it screwed
up, let's recover from this sucker.  The simplest thing to do, since
I'm in an expression, is to toss tokens until I get to a synchronizing
token like a semi.  Hmmm...  I've got a semi, so I'm now synchronized.
Okay.  Let's tell our caller that we parsed an expression successfully.
 
Meanwhile...  A while ago I suggested that one cannot write a reasonable
compiler using YACC.  I got no response.  So let me attempt to generate
some flames.  I claim it is impossible to write a reasonable compiler
using YACC.  I base this claim on the fact that while it is important
that a compiler produce a correct binary, it is nearly as important that
the compiler produce reasonable error messages.  I further claim it is 
impossible to produce reasonable error messages using YACC.

So flame me and tell me why I don't know what I'm talking about.  I feel
the need for some edification...

But what about developing the   |  Chuck Simmons @ Amdahl
moon before we terraform Mars?  |  amdahl!chuck
[Just because nobody ever does good error recovery in yacc parsers doesn't
mean it's all that hard. What is hard is to figure out the right set of error
productions to add to the grammar to catch the sorts of errors that will
actually occur. Then there's the issue of what a good error message is. Some
people like things like "statement required ID, CPAREN, SEMI, BLURCH, or BLAH
but FNUMBER was encountered." Personally, I liked the ones in Fortran H which
sounded like pronouncements from a slightly disdainful robot: "IEH247I THE
STATEMENT CONTAINS A PROSCRIBED COMMON DECLARATION." -John]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.UUCP (08/16/87)

In article <655@ima.ISC.COM> chuck@amdahl.amdahl.com (Charles Simmons) writes:
| I claim it is impossible to write a reasonable compiler
| using YACC.  I base this claim on the fact that ... it is
| impossible to produce reasonable error messages using YACC.

Well, it's certainly difficult, but that's not because YACC is a
bottom-up parser.  It's because you're trying to do the error recovery
by hand.  Bernard Dion's thesis work at Wisconsin (with Charlie
Fischer) demonstrated how to perform locally least-cost *automatic*
error *correction* in bottom-up parsers, with modest space and
negligible time overhead.  [Compiler writer gives the parser generator
a table of insertion and deletion costs for tokens; error corrector
makes the least cost set of insertions and deletions that allows one
more real token to be shifted.]  Good quality corrections make pretty
good error messages.  Dion's techniques are incorporated in the ECP
(error correcting parser) package distributed by Fischer's group.
-- 
Michael L. Scott
University of Rochester    (716) 275-7745
scott@cs.rochester.edu     scott%rochester@CSNET-RELAY
{decvax, allegra, seismo, cmcl2}!rochester!scott

--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.UUCP (08/17/87)

> This seems like a pain.  Recusrive descent with error recovery performed
> by the higher level entity would seem to be simpler, namely because the
> higher level entity knows more about what's going on...

On the contrary, doing it at the higher level is a horrendous pain, and the
smooth simplicity of the low-level recovery (which has detailed guidance
from the higher level, remember) is an enormous win.  You have to experience
the difference to fully appreciate it -- I've written parsers both ways.

The problems with doing it at the higher level boil down to (a) it adds a
lot of complexity to the code, and (b) error repair often has to cross
the boundaries of syntactic structures, which is painful in recursive
descent because those are function boundaries in the parser.  By contrast,
the low-level approach requires 100 lines or so of code to handle *all*
syntactic error repair for the entire compiler, and it's all in one place
rather than interspersed throughout the parser.

> ... Okay...  Now I've told the user it screwed
> up, let's recover from this sucker.  The simplest thing to do, since
> I'm in an expression, is to toss tokens until I get to a synchronizing
> token like a semi...

Just where does the code that does this reside?  Remember that the code
for parsing an expression is big and complicated, and may be spread over
several functions.  (In a straightforward recursive-descent parser, it
will be a dozen or more functions.)  They all have to cooperate very
carefully to make even such a crude algorithm work.  This takes a lot
more effort and code than you would think.

Also, you've missed an (admittedly non-obvious) point in my contribution.
The error repair need not be at anywhere near as gross a level as throwing
away everything until the next semicolon.  That is necessary as a "backstop"
algorithm, but the more local resync heuristic can be something like "if the
input token is punctuation and the requested one is not, throw away the input,
otherwise keep it".  This repairs many minor goofs promptly and *correctly*.

				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.UUCP (08/17/87)

OK, he wants flames...

chuck@amdahl.amdahl.com (Charles Simmons) writes:
> A while ago I suggested that one cannot write a reasonable
> compiler using YACC. ... I claim it is impossible to write a reasonable
> compiler using YACC. ... ... I further claim it is impossible to produce
> reasonable error messages using YACC.

I've been working on a production compiler that uses Yacc for parsing,
and I've given a bit of thought to this.  I claim that it is indeed
possible to produce good syntax error messages.  The strategy is to
print out several things:

	1. a message that a syntax error has been found
	2. a pointer to the token which caused the parser to barf
	(This is extremely important -- in practice this pointer is
	almost all a programmer needs.  Compare this to the usual
	style: "line 37: invalid declaraction" when line 37 contains
	a nasty mess of structs and unions.)
	3. a description of the token that was found
	(In case it tokenizes differently than the programmer expected.)
	4. a listing of the tokens that are valid in this position
	(This gets long in some cases (expression operand expected,
	operator expected, and statement verb expected), so some
	special case code is needed to condense these forms.)
	This functions as a quick refresher of the grammar for the
	programmer.  This isn't so important for languages like C with
	a small grammar, but its very useful for languages like ours that
	have over 50 different statements.  (uck!)
	5. a summary of where the parse stands now:
		statement-list ; l-value = expression + (

All of this stuff is immediately accessible from Yacc's tables.

I'll admit that all of this (except 5) can be done in recursive
descent, but usually people who code recursive descent don't bother.
For example, in one popular C compiler, the construction

	struct { int a, reel /* <- note misspelling */ b } ;

produces the message "closing brace expected", because when the parser
sees 'reel' (which the lexer returns as 'identifier' rather than 'type
name'), it says "Aha!  He forgot to close off his structure
declaraction."  I'd much rather the compiler just tell me that I made
an error, and not try to guess what I meant.

As far as size and speed, I doubt that a recursive-descent parser for
a complex grammar (say, 750 Yacc states) would be anywhere near as
small, because the number of read-and-test code segments in an RD
parser has to be about as large as the number of Yacc states.  There
are about 4 int's in Yacc's tables for each state, which is a lot more
compact than the code you're going to generate to read-token-and-test.

As far as needing an external stack for looping constructs, give me a
break!  Yacc maintains a stack, use it!

/* generating the code for the expressions in the order they are
 * written requires more goto's than necessary, but otherwise I have
 * to write code to read and store them and then emit them later */
for_statement : FOR '(' expression ';'
		$remember_loc expression $goto_on_false $goto ';'
		$remember_loc expression $goto ')'
		$remember_loc statement $goto
		{
		/* if test is false, exit for */
		backpatch($7, current_object_location);
		/* if test is true, go to body */
		backpatch($8, $14);
		/* after step expression, go to test */
		backpatch($12, $5);
		/* after body, go to step expression */
		backpatch($16, $10);
		} ;
$remember_loc	: { $$ = current_object_location } ;
$goto		: { emit a goto statement, $$ = location where it's
		destination location will be filled in later by backpatch() };
$goto_on_false	: { emit a conditional goto, $$ = destination location } ;

-- 
Dale Worley	Cullinet Software		ARPA: cullvax!drw@eddie.mit.edu
UUCP: ...!seismo!harvard!mit-eddie!cullvax!drw
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.UUCP (08/22/87)

A friend of mine (Jim Cordy), who's better organized than I am, has supplied
some properly-detailed references for people who are interested in error
recovery in general and error recovery in top-down parsers in particular.
The last reference in particular discusses the SP/k error recovery method
that I've described in the past.

    D.T. Barnard and R.C. Holt;
    Hierarchic Syntax Error Repair for LR Grammars;
    International Journal of Computer and Information
    Sciences 11,4 (1982) pp.231-258.
    Also Technical Report 81-127,
    Department of Computing and Information Science,
    Queen's University
    (October 1981)
    pp.1-28.

    D.T. Barnard; <<this one's a survey of the field>>
    Syntax Error Handling Techniques;
    Technical Report 81-125,
    Department of Computing and Information Science,
    Queen's University
    (September 1981)
    pp.1-23.

    D.T. Barnard;
    Hierarchic Syntax Error Repair;
    Ph.D. Thesis, Department of Computer Science,
    University of Toronto (March 1981) pp.1-253.

    D.T. Barnard;
    Automatic Generation of Syntax-Repairing and Paragraphing Parsers;
    Technical Report CSRG-52, Computer Systems Research Group,
    University of Toronto (April 1975) pp.1-132.
    M.Sc. Thesis.

				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request