[comp.compilers] error recovery in yacc

johnl@ima.UUCP (08/17/87)

We have been succesfully using for more than a year a modified version
of yacc that automatically recovers from errors.  Error recovery was
added by Julia Dain from Univ of Warwick.  Included find an ms
formated paper describing the work.

	Juha Heinanen
	Tampere Univ of Technology
	Finland
-----------------------------------------------------------------------
.tr #$
.RP
.EQ
delim $$
.EN
.TL
Error Recovery for Yacc Parsers
.AU
Julia Dain
.AI
Dept. of Computer Science
University of Warwick
Coventry CV4 7AL
UK
.AB
We aim to improve error recovery in parsers generated by the
LALR parser-generator \fIYacc\fP.
We describe an error recovery scheme which
a new version of
\fIYacc\fP automatically builds
into its parsers.
The scheme uses state information to attempt to repair
input which is syntactically incorrect.
Repair by alteration of a single token is attempted first,
followed by replacement of a phrase of the input.
A parser for the C language is generated from existing specifications
and tested on a collection of student programs.
The quality of error recovery and diagnostic messages is found to be
higher than that of the existing portable C compiler.
The new version of \fIYacc\fP may be used by any current user of
\fIYacc\fP, with minor modifications to their existing specifications,
to produce systems with enhanced syntax error recovery.
.AE
.NH
Introduction
.PP
The portable C compiler \fIpcc\fP [Johnson78b] is widely used in
.UX
environments but its diagnostic messages are poor.
The parser for \fIpcc\fP is built by the LALR parser-generator \fIYacc\fP
[Johnson78a]
which automatically generates error recovery routines.
Many other popular
.UX
utilities
contain syntax analysers built by \fIYacc\fP,
such as the pattern matchers \fIlex\fP, \fIawk\fP and \fIgrep\fP
and the FORTRAN 77 and C++ compilers,
and these utilities would also be easier to use if
they had improved diagnostics.
The aim of the work presented here is to improve the error recovery scheme
which \fIYacc\fP builds into its parsers and thus to improve the error handling
in \fIpcc\fP.
.PP
This paper describes the old method for error recovery in parsers built
by \fIYacc\fP, and a new general-purpose method which is independent
of source language and which may be used with existing
\fIYacc\fP input specifications with minor changes.
We present tests on the resulting C compiler which show an improvement
in error handling.
We assume familiarity with LR parsing as described in [Aho77] for example.
.NH
The portable C compiler and \fIYacc\fP
.PP
In some computing environments, for example a university where many students
are learning to use a new language,
the quality of error diagnostics produced by a compiler is at least as
important as the efficiency of generated code.
Students
using a
.UX
environment who
learn C after Pascal often ask why the portable C compiler is so poor
compared with the Berkeley Pascal system [Joy80].
A reason for their dissatisfaction is that \fIpcc\fP is unable to diagnose
many simple syntax errors and produces misleading error messages, whereas
the authors of the 
Berkeley Pascal compiler paid particular attention to developing a
good error recovery scheme, presented in [Graham79].
.PP
\fIYacc\fP produces LALR(1) parsers from a
set of grammar rules (productions) and actions.
The parsers contain default reductions,
that is any state of the parser which has a unique reduction in
its actions is given that reduction as entry for all symbols which
cannot be shifted.
To make use of the existing automatic error recovery scheme,
described in [Aho74], the productions of the grammar
should contain error productions of the form
$A~->~\(*a~\fBerror\fP~\(*b$,
where $A$ denotes a non-terminal,
$\(*a , \(*b$ denote strings of grammar symbols, and
\fBerror\fP denotes the token reserved by \fIYacc\fP for error handling.
When the parser is presented with an input token which is not a legal
symbol for the current state,
it enters error recovery mode and inserts the \fBerror\fP token
on the input.
The parser
pops states from the stack until the top state is one which can
shift \fBerror\fP.
Parsing then continues as dictated by the parse tables, except that any
token for which there is no parsing action is deleted from the
input.
When three input tokens have been shifted, the parser assumes recovery
is complete and leaves error recovery mode.
In effect the parser assumes that an error has occurred while looking for a
derivation of a non-terminal $A$ and that a series of tokens approximating
to a derivation of $A$ has been consumed from the input.
.PP
\fIYacc\fP allows the user some control over error recovery actions by permitting
error productions to have semantic actions associated with them.
These can be used to specify actions to be taken in particular cases.
\fIYacc\fP also allows the user to force the parser out of error recovery mode
before three tokens have been shifted, and to clear the lookahead token.
.PP
The grammar for \fIpcc\fP contains eight error productions, one for
the external definition construct
(the highest-level block of which C programs are composed,
that is function and data definitions), five for various forms of declarations
and two for the statement construct.
Only three of these productions have semantic actions, and these only
change local variables.
The productions for the statement construct are
.br
        $statement~->~\fBerror\fP~';'~|~\fBerror\fP~'"}"'$
.br
These productions mean that if the parser detects an error in a statement
it will skip all input to the next semi-colon or right curly bracket.
All the other error productions have the form
.br
        $declaration~->~\fBerror\fP$
.br
These cause the parser to skip input to anything which can follow the
declaration.
No use is made of the facilities to force the parser out of error recovery
mode or clear the lookahead token.
.PP
In general, the method for error recovery in \fIYacc\fP has some disadvantages.
The user has to write error productions which will control error recovery
to an extent which the user may not realise.
These productions may introduct ambiguities into the grammar.
During recovery, input and stack states are deleted silently.
No information about the nature of an error is available.
The advantages of the method are that it is simple to implement and
efficient to run.
In the particular case of \fIpcc\fP, the main disadvantage is the poor
quality of diagnostic messages, which is a result of the lack
of information about errors.
.NH
The new method
.PP
The new method for error recovery in parsers generated by \fIYacc\fP uses two
techniques, local correction with a forward move,
and phrase-level recovery as presented in [Leinius70].
When the parser meets an illegal input token, it first tries to make a
local correction to the input string by changes of a single token.
If no local correction is successful,
where success is judged by the number of moves which can then be
made by the parser,
a phrase-level
recovery is made by replacing a part of the input with a non-terminal.
Both already parsed input and input still remaining may be replaced.
.NH 2
Local correction
.PP
The set of tokens which are legal shift symbols for the current
configuration is determined by the current state.
The parser attempts to repair the input by
actions in the following order:
inserting a token from this
set on the input before the next token, deleting the next token,
or replacing it with one from the set of legal tokens.
In order to determine whether a repair is "good" the parser runs a
forward move on the repaired input.
This is achieved by copying some of the parse stack onto an error stack,
buffering the input and turning off the semantic actions.
The parser then restarts from the error state (the state
in which it detected error), with the altered input.
If the parser can continue to make moves
without detecting a further error before five input tokens are shifted,
or before accepting,
the alteration is taken to be a good repair.
The parser is returned to the
error state and the parse stack,
the input is backed up to the chosen alteration,
and semantic actions are turned on again.
If an alteration does not allow the parser to run a forward move which
consumes five tokens from the input, a forward move is run with the
next altered configuration from the set above.
.NH 2
Phrase-level recovery
.PP
If no local correction succeeds, the parser
is restored to the error state
and the input is backed up to the illegal token.
The parser chooses a goal non-terminal
from the set of kernel items for
the current state.
Its item has the form
.br
    $A~->~v sub 1 ... v sub m ~.~v sub m+1 ... v sub n$
.br
where the $v sub i$ are grammar symbols.
The phrase to be replaced by the goal non-terminal $A$ is
$v sub 1 ... v sub n$.
$v sub 1 ... v sub m$ have been parsed, so the parser 
pops $m$ states from the stack and pushes the goto state for the
new top of stack and $A$.
Further reductions may now take place.
To complete the recovery, input is discarded until the next input token
is legal for the current state.
In effect, a reduction by the production
$A~->~v sub 1 ... v sub m v sub m+1 ... v sub n$
has taken place.
.PP
A heuristic rule is used to choose the goal phrase from the kernel items
of the error state, namely the last item to have been added to the kernel
during construction of the item sets, except for the special case of
state 0, where the first item is chosen.
.PP
The scheme 
is guaranteed to terminate, because it always
consumes input tokens during a successful repair.
.NH 2
Changes to \fIYacc\fP input specifications
.PP
Error productions are no longer required for error recovery and so may
be deleted from grammar rules.
The user must supply a routine \fIyyerrlval\fP as part of the \fIYacc\fP
environment.
The purpose of this routine is to supply a default semantic value which
is required for tokens inserted during local correction and for non-terminals
used as goals for phrase-level recovery.
This semantic value will typically be a leaf of the syntax tree, suitably
tagged.
.NH 2
Error messages
.PP
The parser synthesises an error message from the recovery action taken
in each case.
It use the terminal and non-terminal names from the input grammar to \fIYacc\fP.
Examples of messages are
.br
.nf
.ce
SEMICOLON inserted before RIGHTCURLY
.br
.fi
for a successful local repair and
.br
.nf
.ce
e ASSIGN IF NAME replaced by e
.br
.fi
for replacement of a phrase.
.NH 2
Space requirements
.PP
Parsers generated by the new \fIYacc\fP require extra space for information
for phrase-level recovery and diagnostic messages.
No extra space is required for local recovery, as the information
required, the valid shift symbols for each state, is present in the
existing tables.
For phrase-level recovery, two extra words are required for each state,
the goal non-terminal and the number of symbols to pop from the stack.
Tables of strings are needed for synthesizing diagnostic messages;
one string is required for a meaningful name for each grammar symbol,
excluding literal tokens.
.PP
The parser generated by the new \fIYacc\fP
will have fewer states than the equivalent parser generated by the old
\fIYacc\fP, because there are no error productions in its grammar.
The space-saving device of
default reductions for all states with a single reduction is still used.
.NH
The C compiler
.PP
The existing C compiler \fIpcc\fP contains a syntax analyzer which is
generated by \fIYacc\fP.
We took the source of this compiler, removed the error productions from
the \fIYacc\fP specifications and included a new function \fIyyerrlval\fP
which returns a semantic value for inserted tokens and non-terminals.
This value is a new leaf of the syntax tree.
The only other changes made were to the names of some of the terminals,
such as changing SM to SEMICOLON and RC to RIGHTCURLY, to improve the
error messages.
.PP
The relative sizes of the old and new C compilers are shown in
Figure 1.
.sp
.KS
.TS
box center;
l | c | c
l | n | n .
 	pcc	new version
_
Size in bytes of binary (ccom)	86776	98312
_
Parser only:	 	 
  Number of grammar rules	187	179
  Number of states	312	303
  Size in chars of source (y.tab.c)	42980	54617
.TE
.ce
Figure 1. Space required by the C compilers
.sp
.KE
.PP
It is obvious that the compiler performs identically to \fIpcc\fP
on C programs which are syntactically correct.
Error recovery for incorrect programs consists of repairs to the input
and error messages.
For the new compiler, repairs may be simple changes of one token
or replacement of a phrase, and error messages describe the repairs.
For the old compiler, error messages do not describe the action taken
by the parser to recover, but are either uninformative
("Syntax error") or indicate what the parser finds incorrect.
.PP
In order to test the compiler's performance on incorrect programs, we made a
collection of all C programs submitted by undergraduate students
in the Department of Computer Science
at the University of Warwick
to \fIpcc\fP
for compilation over three twenty-four hour periods,
October 9, 10 and 16, 1984.
Duplicate programs were removed and the programs were run through
\fIpcc\fP and the new compiler.
The code generated for syntactically correct programs was identical.
Error recovery was evaluated according
to the criteria used by Sippu [Sippu83], rating a correction as
excellent if it was the same as a competent programmer might make,
good if it introduced no spurious errors and missed no actual errors,
fair if it introduced one spurious error or if it missed one error,
and poor otherwise, or if the error message
generated was meaningless.
Missed errors, that is
syntax errors that were present in the source code but not reported
by the compiler, were counted.
Also counted was the number of extra messages, that is messages
about errors introduced into the source by incorrect recovery action
taken by the compiler.
A comparison of the performances of the two compilers, evaluated
according to these criteria, is shown in Figure 2.
Figure 3 shows a sample C program and its diagnostics.
.sp
.KS
.TS
center box;
c | c s | c s
l | r l | r l .
 	pcc 	new version
_
Quality of recovery action:
  Excellent	1%	(1)	54%	(64)
  Good	3%	(3)	11%	(13)
  Fair	54%	(64)	13%	(15)
  Poor	27%	(32)	19%	(22)
_
Missed errors	15%	(18)	3%	(4)
_
Total number of errors	118	 	118
_
Extra messages	127	 	82
.TE
.ce
Figure 2. Comparison of the performance of the C compilers
.KE
.sp
.sp
.sp
.KS
.TS
n l .
1	/* Kernighan and Ritchie p. 102 - mutilated */
2	
3	strcmp(s, t)
4	char *s, *t
5	{
6	    for ( ; *s == *t; s++, t++)
7	        if *s == '\0'
8	            return(0)!
9	    return(*s - *t);
10	? }
.TE
.TS
l .
Diagnostics from pcc:
    Line 5: Syntax error
Diagnostics from new version:
    Line 6: SEMICOLON inserted before LCURLY
    Line 7: LPAREN inserted before MUL
    Line 8: RPAREN inserted before RETURN
    Line 9: UNOP replaced by SEMICOLON
    Line 11: QUEST deleted
.TE
.ce
Figure 3. A sample C program
.KE
.sp
.NH
Discussion
.PP
The C compiler generated by the new \fIYacc\fP performed better on the
collection of incorrect programs than \fIpcc\fP.
The majority of the errors were simple ones which occurred sparsely,
and were therefore amenable to repair by the local recovery tactic.
This pattern of occurrence of simple errors concurs with Ripley
and Druseikis' analysis of syntax errors in Pascal programs [Ripley78],
which showed that
the majority of these are
single-token errors and occur
infrequently.
Clusters of errors and complicated errors were not handled so well
by the phrase-level recovery, and these were responsible for the large
number of extra messages generated.
.PP
The diagnostic messages produced depend on the names for the terminals
and non-terminals, which should be carefully chosen by the grammar-writer.
Ideally, messages should be at source level rather than lexical token
level, as the user will understand a message of the form
.DS
Line 16:            x \(eq y 
Semi-colon inserted ...... |
.DE
better than one of the form
.DS
Line 16:  SEMICOLON inserted after ID
.DE
More communication between the lexical analyzer and the parser may be
needed for this sort of message.
Line numbers at present are occasionally out by one because of buffering
of the lexical tokens.
.PP
A disadvantage is that the scheme shows bias towards assuming correctness
of the left context.
Local recovery assumes a single error in the current input token,
and secondary recovery makes an arbitrary choice of item from the error
state which takes no account of the right context.
.PP
Several other error recovery schemes for LR(1) parsers have been described.
[Sippu81] and [Dain84]
contain recent reviews of the literature.
The scheme presented here bears resemblance to 
that devised by Graham, Haley and Joy for a Pascal compiler
[Graham79], in that a two-stage
recovery is attempted.
There are several differences to note.
Firstly, our scheme is a general-purpose recovery scheme incorporated
in a new version of
\fIYacc\fP, and is used by any parser generated by the new \fIYacc\fP.
Graham requires special purpose error recovery routines
and cost vectors to be supplied
for use by their parser generator \fIEyacc\fP
which contains no error recovery scheme.
Secondly, \fIEyacc\fP produces parsers with certain states calling
for reductions having their lookahead tokens enumerated, i.e. some
default reductions are not made.
Our \fIYacc\fP has the usual default reductions.
Thirdly, Graham requires the user to supply error productions in the
grammar, to control secondary recovery.
These are not required for our scheme.
.PP
The error recovery scheme
for the compiler-writing system HLP [Raiha83]
incorporates
a local recovery tactic into the phrase-level recovery scheme [Sippu83].
No forward move is made on the input and there is less check on the
"correctness" of a local correction;
the user must supply costs for deletion and insertion of each terminal
in local correction.
Different criteria are used for identifying and replacing the error phrase
in phrase-level recovery.
.PP
A two-stage recovery scheme for LL(1) and LALR(1) parsers which uses
the concept of \fIscope recovery\fP is implemented by Burke and Fisher
[Burke82].
The scheme cannot be used however in LR parsers with default reductions.
The user must supply additional language information such as constructs
which open and close scope in the language, lists of tokens which
cannot be inserted between a given pair of tokens, and lists of tokens
which cannot be substituted for a given token.
Pai and Kieburtz [Pai80] use \fIfiducial\fP (trustworthy) symbols,
typically reserved words, in a scheme for LL(1) parsers which they suggest
as suitable for extending to LR parsers.
.PP
Requiring the user of a parser-generator to supply information to
aid error recovery in addition to the grammar may result in recovery
which is more tailored to the language, but imposes an extra burden on
the user, who may not have a full understanding of the mechanism of
the parser and its error handling.
The scheme which we have implemented in \fIYacc\fP makes few demands of this
nature on its users, yet improves the quality of error recovery in
its parsers.
.NH
References
.LP
.br
[Aho74]  Aho, A. V. and Johnson, S. C.  LR Parsing.
.I
Computing Surveys 6,
.R
2 (June 1974), 99-124.
.br
[Aho77]  Aho, A. V. and Ullman, J. D.
.I
Principles of Compiler Design.
.R
Addison Wesley, 1977.
.br
[Burke82]  Burke, M. and Fisher, G. A.  A practical method for syntactic
error diagnosis and recovery.
.I
ACM SIGPLAN Notices 17, 6
.R
(June 1982), 67-78.
.br
[Dain84]  Dain, J. A.  Error recovery schemes in LR parsers.
Theory of Computation Report No. 71, Dept. of Computer Science,
University of Warwick, Coventry, December 1984.
.br
[Graham79]  Graham, S. L., Haley, C. B. and Joy, W. N.  Practical LR error
recovery.
.I
ACM SIGPLAN Notices 14,
.R
8 (August 1979), 168-175.
.br
[Johnson78a]  Johnson, S. C.  Yacc - Yet Another Compiler-Compiler.
Bell Laboratories, Murray Hill, New Jersey, 1978.
.br
[Johnson78b]  Johnson, S. C.  A portable compiler: theory and practice.
.I
Proc. 5th ACM Symp. on Principles of Programming Languages
.R
(January 1978), 97-104.
.br
[Joy80]  Joy, W. N., Graham, S. L. and Haley, C. B.  Berkeley Pascal User's
Manual.
Computer Science Division, Dept. of Electrical Engineering and Computer Science,
University of California, Berkeley, California, 1980.
.br
[Leinius70]  Leinius, R. P.  Error detection and recovery for syntax
directed compiler systems.
Ph. D. Th., Computer Science Dept., University of Wisconsin, Madison, 1970.
.br
[Pai80]  Pai, A. B. and Kieburtz, R. B.  Global context recovery: a new
strategy for parser recovery from syntax errors.
.I
ACM Trans. on Programming Languages and Systems 2,
.R
1 (January 1980), 18-41.
.br
[Raiha83]  Raiha, K-J., Saarinen, M., Sarjakoski, M., Sippu, S.,
Soisalon-Soininen, E. and Tienari, M.  Revised Report on the Compiler
Writing System HLP78.  Report A-1983-1, Dept. of Computer Science,
University of Helsinki, Finland, January 1983.
.br
[Ripley78]  Ripley, G. D. and Druseikis, F.  A statistical analysis
of syntax errors.
.I
Computer Languages 3,
.R
4 (1978), 227-240.
.br
[Sippu81]  Sippu, S.  Syntax Error Handling in Compilers.
Report A-1981-1, Dept. of Computer Science, University of Helsinki,
Finland, March 1981.
.br
[Sippu83]  Sippu, S. and Soisalon-Soininen, E.  A syntax-error-handling
technique and its experimental analysis.
.I
ACM Trans. on Programming Languages and Systems 5,
.R
4 (October 1983), 656-679.
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.UUCP (08/18/87)

About 1-2 years ago someone at UNC-Chapel Hill did a study
of error recovery mechanism in YACC. The paper is titled
"Error Recovery in LR Parsing: A Case Study Using YACC"
and is available as 'SoftLab Document No. 21' from UNC
Chapel hill, Dept. of Computer Science. Requests for the
report should be directed to  :

		Softlab Group,
		Dept. of Computer Science,
		Sitterson Hall,
		University of North Carolina at Chapel Hill,
		NC-27514.

or electronically to 		rts@unc.cs.unc.edu
					or
				 bj@unc.cs.unc.edu
----------------------------------------------------------------
Phone: (919) 962-1827		Gopal Gupta
       (919) 962-1935		UNC-CH Computer Science.
gupta@unc.cs.unc.edu,  ...!decvax!mcnc!unc!gupta
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.istc.sri.com (09/04/87)

The version of yacc with improved error handling that juha mentioned
is still available, to anyone with a UNIX (System V and bsd 4.1 or more)
source license.
A student here at Warwick, Nick Holloway, and I have been working
on a scheme to produce error messages using the compiler source input
rather than grammar names, automatically: i.e. if you use LEX and
YACC (new versions) to build your translator you automatically get
it to produce error messages like this:
ex.c, line 15: syntax error
	while ((ch = getchar()) != EOF {
......................................^
) inserted. 
We've used it to build the portable C compiler and awk.  I'm writing it up
now but if anyone wants to try out these new tools meanwhile, get in touch
with me.
-- 
Julia Dain 			Tel: 0203-523363
Dept of Computer Science 	uucp: ...mcvax!ukc!warwick!julia
University of Warwick 		janet: julia@uk.ac.warwick.uu
Coventry CV4 7AL UK
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.istc.sri.com (09/04/87)

In article <689@ima.ISC.COM>, gupta%cs.unc.edu@RELAY.CS.NET (Gopal Gupta) writes:
> About 1-2 years ago someone at UNC-Chapel Hill did a study
> of error recovery mechanism in YACC. The paper is titled
> "Error Recovery in LR Parsing: A Case Study Using YACC"

For those interrested in coaxing acceptable error handling out
of yacc without being able to hack the source or having to
digest academic papers (not that it would hurt you...) there
is hope.

_Introduction_to_Compiler_Construction_with_Unix_
Schreiner and Friedman, Prentice-Hall, ISBN 0-13-474396-2

This is a usable tutorial introduction to getting results from
lex/yacc.  I bought it to learn the tools, since they had resisted
attack using just the unix manuals.  The book is organized around
the evolution of a C subset compiler (roughly small-C) which I have
since extended and retargeted and I'm using it to generate code
for my architecture simulation.

Anyway, they include simple yacc source construction rules for
error recovery and a simple modification to /usr/lib/yaccpar
that you can make at home.  The result is truly usable error
handling, better than a lot of commercial compilers.
-- 
Steve Nuchia
{soma,academ}!uhnix1!nuchat!steve
(713) 334 6720
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request