[comp.lang.c] error recovery

djones@megatest.UUCP (Dave Jones) (04/21/89)

From article <1279@lzfme.att.com>, by jro@lzfme.att.com (J.OWENS):
> FROM:  James R. Owens, 38-C Elm Street, Summit, New Jersey 07901
> PHONE: 201-273-7360
> 
> I am looking for an algorithm for error-recovery to be used in
> a syntax analyzer.  The input stream is Pascal, and the parser
> is written in C under Unix 3.2.  My professor will not allow the use
> of Lex and Yacc.

Hmmmm...   Will he allow the use of some other parser-generator?
If not, your most practical solution would seem to be to re-implement
a subset of yacc.  But if that's not the solution your professor has in
mind, it's not likely to get you a very good grade.  He probably wants
you to write an ad hoc recursive descent parser. 

In any case, error-recovery is a little tricky. I use a commercial C compiler,
which I happen to know is written using yacc. It sometimes prints pages of
gibberish if you make a mistake.

I am currently writing a commercial Pascal compiler, using a BSD4.2 yacc
supplied by the same computer vender as the C compiler.  I had
to fix a bug in yacc in order to get error-recovery to work the way
it's supposed to.  We have a Unix source licence, and I happened to have
the sources to yacc laying around. But if I had not been able to fix the bug,
I would have used another parser-generator, or written one from scratch,
rather than doing a recursive descent parser.

Why? For one thing, error-recovery in a recursive descent parser is even
trickier than in an LR parser! The C activation stack gets to a point
where procedure activations are stacked up which must be discarded. A stack
of longjmp buffers is probably the best mechanism to use, but beware: there
are people who have strong religious objections to longjmp. Your prof might
be one of them.

Not to second guess your professor, but you seem to be new at this, and
it doesn't seem right to ask a student to implement error-recovery on the
first try without benefit of parser-building tools such as yacc. It could
take quite a bit of inspiration. Why not ask your instructor whether he
intends for you to write a recursive descent parser, and if so, whether he
wants you to attempt error-recovery.

If the answer to both is "yes," cancel all of your social engagements for
a couple of quarters.

> 
> Your comments, suggestions or CODE will be greatly appreciated.
> 

Let me get this right.  He won't let you use yacc, but he'll let you
use other people's suggestions and code?  Very curious.  I strongly 
suggest that you ask your professor for _his_ comments and suggestions.

> Thanks,
> James R. Owens

Hope this helped a little.

djones@megatest.UUCP (Dave Jones) (04/21/89)

P.S.

Try to get your instructor to give you the most complete specification
you can. Be sure that your instructor and you are in agreement on what is
required of your program. Then do the best job you can on that.  Don't
gratuitously add features like error-recovery which are not in the spec.

These are words to live by.

dg@lakart.UUCP (David Goodenough) (04/25/89)

djones@megatest.UUCP (Dave Jones) sez:
> Why? For one thing, error-recovery in a recursive descent parser is even
> trickier than in an LR parser!

The hell it is!!!

I suggest you dig out the source of the PDP 11, V6 UNIX C compiler. Written
in C, without YACC, uses recursive descent, and does error recovery so
gracefully, and correctly, it is a pleasure to use. It requires one longjmp
call to recover, but with the exception of a for loop (which could generate
a spurious error about imbalanced parentheses, due to overloading the ';'
operator) it never made a mistake. Would that pcc, the GreenHills family etc.
could do the same.

I'd suggest that all these guys talk to Leor Zohlman, the writer of
BDS-C. I suspect that he uses RD, whatever the case I rate his error
recovery way above _ANYTHING_ I have come across in UNIX since the V6
days. Are you listening guys????
-- 
	dg@lakart.UUCP - David Goodenough		+---+
						IHS	| +-+-+
	....... !harvard!xait!lakart!dg			+-+-+ |
AKA:	dg%lakart.uucp@xait.xerox.com		  	  +---+

henry@utzoo.uucp (Henry Spencer) (04/25/89)

In article <4389@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
>... error-recovery in a recursive descent parser is even
>trickier than in an LR parser!

Nonsense.  If you insist on doing it as part of the parser, it gets messy,
but there's an easy way around that.  Have the parser tell the scanner what
kind of tokens it wants at each point, rather than just asking for "the next
token", and do the error recovery in the scanner.  The parser always sees a
syntactically correct program, and never has to get into the messy business
of popping an activation stack.  With the necessary cooperation from the
parser, this is about a page of code all told.  It works well, too -- often
much better than the messy contortions in yacc.  (Yes, I've done both.)
-- 
Mars in 1980s:  USSR, 2 tries, |     Henry Spencer at U of Toronto Zoology
2 failures; USA, 0 tries.      | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

kers@otter.hpl.hp.com (Chris Dollin) (04/25/89)

Henry (Spencer) said:

| In article <4389@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
| >... error-recovery in a recursive descent parser is even
| >trickier than in an LR parser!
|
| Nonsense.  If you insist on doing it as part of the parser, it gets messy,
| but there's an easy way around that.  Have the parser tell the scanner what
| kind of tokens it wants at each point, rather than just asking for "the next
| token", and do the error recovery in the scanner.  The parser always sees a
| syntactically correct program, and never has to get into the messy business
| of popping an activation stack.  With the necessary cooperation from the
| parser, this is about a page of code all told.  It works well, too -- often
| much better than the messy contortions in yacc.  (Yes, I've done both.)

I second Henry's remarks. I've done such error recovery both in-parser and
in-scanner (even the in-parser one started doing it's work by calling the
scanner), and in-scanner is *nice*. In fact this was the way I did error
recovery - sorry, repair - for an automatically generated LL(1) parser
generator [*1] several years ago.

I've not tried yacc yet, but that dreadful day looms ever closer.

Regards, Kers.
"If anything anyone lacks, they'll find it all ready in stacks."

[*1] The front-end of the parser-generator was written in Pascal, which was
also the target language. The back-end was written in Pop11.

bv3456@leah.albany.edu (Victor @ SUNY Albany CS) (04/26/89)

I always thought, in my limited experience, that the VAX/VMS compilers had
pretty good error recovery.  Is this really so, relatively speaking?  Does
anybody know if they are recursive descent or LR?


			- Victor

ath@helios.prosys.se (Anders Thulin) (04/28/89)

In article <510@lakart.UUCP> dg@lakart.UUCP (David Goodenough) writes:
>djones@megatest.UUCP (Dave Jones) sez:
>> Why? For one thing, error-recovery in a recursive descent parser is even
>> trickier than in an LR parser!
>
>The hell it is!!!
>

I most humbly agree. 

Some suggestion to the original poster as to recursive-descent error recovery:

Al Hartmann: An implementation of Concurrent Pascal
Springer Verlag

	Contains a description of a way to produce RD parsers with error
	recovery, based on the methods used in the Zurich P4 Pascal.

Steven Pemberton: Comments on an error-recovery scheme by Hartmann
Software - Practice & Experience, 

	Corrects an error in Hartmann's description. This is probably
	the best place to start.

(I'm typing these references from memory. Titles are probably off by one,
authors and publishers are not. Hope it is of some help.)



-- 
Anders Thulin			INET : ath@prosys.se
Programsystem AB		UUCP : ...!{uunet,mcvax}!sunic!prosys!ath
Teknikringen 2A			PHONE: +46 (0)13 21 40 40
S-583 30 Linkoping, Sweden	FAX  : +46 (0)13 21 36 35

djones@megatest.UUCP (Dave Jones) (04/29/89)

From article <1989Apr24.203827.5969@utzoo.uucp>, by henry@utzoo.uucp (Henry Spencer):
> In article <4389@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
>>... error-recovery in a recursive descent parser is even
>>trickier than in an LR parser!
> 
> Nonsense.
>

Ahem.

The "sense" of it is that, to me, it's trickier.  Maybe you know
an untrickier way. If so, please do tell.  I can always stand a bit
of educating.

> If you insist on doing it as part of the parser, it gets messy,
> but there's an easy way around that.  Have the parser tell the scanner what
> kind of tokens it wants at each point, rather than just asking for "the 
> next token", and do the error recovery in the scanner.  The parser
> always sees a syntactically correct program, and never has to get into
> the messy business of popping an activation stack.

This corresponds to the action which an LR parser
takes as it gobbles tokens from the input until a token can legally be
shifted. In the scheme you describe, is there anything equivalent to poping
the LR-state stack?  That's done, as you may know, to cooperate
with the token-gobbling in an attempt to "snip out" sections of bad input,
often beginning somewhere before the first illegal token, and extending
somewhat beyond it. The grammar is written so that these snipped out
sections reduce to the nonterminal symbol which the code-writer probably
meant for them to produce.

The parser might undertake, for example, to snip out a defective statement:
When the parser is producing a statement, and a bad token is encountered, it
will first pop states until it has uncovered the state that started the 
statement, and it will produce an "error-statement". Only then 
will it gobble tokens until it finds a token that could legally
follow a statement.

Such recovery can be attempted at arbitrary levels: The parser might
first undertake to snip out some smaller part of the statement, perhaps the
expression on the right-hand-side of an assignment, etc.., and produce
an "error-assignment-statement". See the second example at the bottom
of this posting.

It would seem to me, that to accomplish this same kind of "snipping",
in an LL parser, some kind of longjump, or error induced short circuiting
would be necessary, in order to abort the productions which should not
be completed.  It seems wrong to force the completion of such productions,
willy nilly, with tokens from the input stream, and in doing so, perhaps
"stealing" tokens from other productions which might otherwise be completed
successfully. If I understand the proposed method correctly, I can even
think of situations in which rather simple mistakes would cause productions
to be done using tokens which the author obviously did not intend to go
together.

>  With the necessary cooperation from the
> parser, this is about a page of code all told.  It works well, too -- often
> much better than the messy contortions in yacc.  (Yes, I've done both.)

It would seem to be not only a matter of personal judgement, but also
a matter of Yungian "peak-experience", as to whether the yacc method
is "messy" or "contorted".   Although I have seen the method applied
very very badly, I find it to be extremely intuitive and straight forward,
once one is "in on the joke".  It's the catching on that seems to be
the big hurdle. That and the fact that most yaccs I have come across
will use default productions in states which can shift "error". That's
a bug (a typo) that screws up the whole deal. I've fixed it in my copy
of BSD4.2 yacc.

An example:

    identifier_list	
	        : IDENTIFIER
                     { $$ = List_new(); List_append($$, (Ptr)$1); }
                | identifier_list ',' IDENTIFIER
                      { $$ = $1; List_append($1, (Ptr)$3); }
                | error
                      { $$ = List_new(); }
                | identifier_list ',' error
		      /* $$ = $1; .. default */

;

Here's another...

expression      : IDENTIFIER
			{ $$ = Mpc_identifier_expr($1); }
                | structured_var_access

                | raw_constant	      
			{ $$ = Mpc_const_expr($1->name, $1->const); }
                | rvalue adding_operator rvalue	%prec '+'
			{$$ = Mpc_binary_oper_expr($1,$2,$3); }
		| rvalue multiplying_operator rvalue   	%prec '*'
			{$$ = Mpc_binary_oper_expr($1,$2,$3); }
		| rvalue relational_operator rvalue	%prec '='
			{$$ = Mpc_relational_oper_expr($1,$2,$3); }
		| unary_operator rvalue			%prec UNARY
			{$$ = Mpc_unary_oper_expr($1,$2); }
		| NOT rvalue				%prec NOT
			{$$ = Mpc_unary_oper_expr($1,$2); }
		| IDENTIFIER  actual_parameter_list 
			{ $$ = Mpc_function_expr($1,$2); }
		| set_constructor

		| '(' expression ')' { $$ = $2; }

		| error  
			{$$ = (Expr*)0; }

;

What could be easier? And it works really well.

Perhaps this discussion, if it is to continue, should move to comp.compilers,
or whatever.

djones@megatest.UUCP (Dave Jones) (04/29/89)

From article <510@lakart.UUCP>, by dg@lakart.UUCP (David Goodenough):
> djones@megatest.UUCP (Dave Jones) sez:
>> Why? For one thing, error-recovery in a recursive descent parser is even
>> trickier than in an LR parser!
> 
> The hell it is!!!
> 

Watch your language. (And lower your voice.)  There's programmers in
the audience.  Some of us are trying to sleep.

> I suggest you dig out the source of the PDP 11, V6 UNIX C compiler. Written
> in C, without YACC, uses recursive descent, and does error recovery so
> gracefully, and correctly, it is a pleasure to use.

I'd love to. Where can I dig it out?

> It requires one longjmp call to recover,

Ahah!  See previous posting.

> but with the exception of a for loop (which could generate
> a spurious error about imbalanced parentheses, due to overloading the ';'
> operator) it never made a mistake. Would that pcc, the GreenHills family
> etc. could do the same.

True. I would prefer that these two just stop on the first error.
That's the only message I ever read, anyway.  (The subject of
error-recovery, as applied to compilers, is becoming increasingly moot.
It was important when it took a day or more to get a listing back from
the sysop, but now that you can retry a compilation in a matter of a few
seconds, it's mostly accademic.)
 
> I'd suggest that all these guys talk to Leor Zohlman, the writer of
> BDS-C. I suspect that he uses RD, whatever the case I rate his error
> recovery way above _ANYTHING_ I have come across in UNIX since the V6
> days. Are you listening guys????

Attentively.

I would also be interested in people's opinions on the best parser
generator utilities, both LL and LR.


Should we move this party over to comp.compilers?

>
> 	dg@lakart.UUCP - David Goodenough		+---+
>

Love that name!




		Yours truly,
		Dave "Highly Adequate" Jones.

hankd@ee.ecn.purdue.edu (Hank Dietz) (04/30/89)

An MS student of mine named Terence Parr has been building a really spiffy
lexical analyzer generator and LL(1) parser generator program for his MS
thesis.  It accepts a yacc+lex-like input syntax, but generates BOTH parser
and lexer from a single unified description.  Beyond that, it accepts the
usual EBNF extensions for 0-or-more, alternative, and optional subrules and
supports dynamically-allocated/deallocated attributes (e.g., by default, all
$ variables are variable-length strings).

The lexers generated are DFAs very similar to those built by lex, but with a
more direct implementation (bigger but faster).  The parsers are recursive
descent C code, but use switches and mapping tables where appropriate, rather
than the usual binary comparisons, to find the next alternative to apply and
to give rather decent error messages (based on first sets).  In non-trivial
test cases, the recognizers it builds have run consistently upwards of 25K
lines per second....

We will be putting this stuff in PD within the next few months.  If anyone
is interested in testing "alpha" versions of the package, send me email.

						-hankd@ee.ecn.purdue.edu

PS: Some of you may recall me mentioning a lexer generator a while back...
    well, that's part of this package.
--
Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

henry@utzoo.uucp (Henry Spencer) (05/02/89)

In article <4595@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
>> ... Have the parser tell the scanner what
>> kind of tokens it wants at each point, rather than just asking for "the 
>> next token", and do the error recovery in the scanner.  The parser
>> always sees a syntactically correct program, and never has to get into
>> the messy business of popping an activation stack.
>
>This corresponds to the action which an LR parser
>takes as it gobbles tokens from the input until a token can legally be
>shifted. In the scheme you describe, is there anything equivalent to poping
>the LR-state stack?

Nothing direct; the parser always sees syntactically correct input and
never has to pop.  However, it's wise for the scanner to include some
sort of equivalent.  If there is an agreed-on notion of "line terminator"
tokens, e.g. semicolons, the scanner can generate false input or discard
real input to keep line terminators in sync with the parser's requests
for same.  Combined with a simple low-level heuristic or two, this works
surprisingly well.

In principle, doing this at multiple levels (which is trickier) is better,
but in practice it doesn't seem worthwhile.

>It would seem to me, that to accomplish this same kind of "snipping",
>in an LL parser, some kind of longjump, or error induced short circuiting
>would be necessary, in order to abort the productions which should not
>be completed.  It seems wrong to force the completion of such productions,
>willy nilly, with tokens from the input stream, and in doing so, perhaps
>"stealing" tokens from other productions which might otherwise be completed
>successfully...

It may seem "wrong", but on the other hand, it ensures that later phases
see a complete and consistent picture, and avoids a cascade of code
everywhere in the compiler to deal with the consequences of simple syntax
errors.

>If I understand the proposed method correctly, I can even
>think of situations in which rather simple mistakes would cause productions
>to be done using tokens which the author obviously did not intend to go
>together.

It can happen.  In practice it's not a serious problem.  That's actually
a good summation of the method as a whole:  it may sound inelegant but
it works well in reality.  It's simple to implement and it lets the rest
of the compiler ignore the possibility of syntax errors.  The rumors of
difficulty are greatly exaggerated.
-- 
Mars in 1980s:  USSR, 2 tries, |     Henry Spencer at U of Toronto Zoology
2 failures; USA, 0 tries.      | uunet!attcan!utzoo!henry henry@zoo.toronto.edu