johnl@ima.UUCP (08/17/87)
We have been succesfully using for more than a year a modified version of yacc that automatically recovers from errors. Error recovery was added by Julia Dain from Univ of Warwick. Included find an ms formated paper describing the work. Juha Heinanen Tampere Univ of Technology Finland ----------------------------------------------------------------------- .tr #$ .RP .EQ delim $$ .EN .TL Error Recovery for Yacc Parsers .AU Julia Dain .AI Dept. of Computer Science University of Warwick Coventry CV4 7AL UK .AB We aim to improve error recovery in parsers generated by the LALR parser-generator \fIYacc\fP. We describe an error recovery scheme which a new version of \fIYacc\fP automatically builds into its parsers. The scheme uses state information to attempt to repair input which is syntactically incorrect. Repair by alteration of a single token is attempted first, followed by replacement of a phrase of the input. A parser for the C language is generated from existing specifications and tested on a collection of student programs. The quality of error recovery and diagnostic messages is found to be higher than that of the existing portable C compiler. The new version of \fIYacc\fP may be used by any current user of \fIYacc\fP, with minor modifications to their existing specifications, to produce systems with enhanced syntax error recovery. .AE .NH Introduction .PP The portable C compiler \fIpcc\fP [Johnson78b] is widely used in .UX environments but its diagnostic messages are poor. The parser for \fIpcc\fP is built by the LALR parser-generator \fIYacc\fP [Johnson78a] which automatically generates error recovery routines. Many other popular .UX utilities contain syntax analysers built by \fIYacc\fP, such as the pattern matchers \fIlex\fP, \fIawk\fP and \fIgrep\fP and the FORTRAN 77 and C++ compilers, and these utilities would also be easier to use if they had improved diagnostics. The aim of the work presented here is to improve the error recovery scheme which \fIYacc\fP builds into its parsers and thus to improve the error handling in \fIpcc\fP. .PP This paper describes the old method for error recovery in parsers built by \fIYacc\fP, and a new general-purpose method which is independent of source language and which may be used with existing \fIYacc\fP input specifications with minor changes. We present tests on the resulting C compiler which show an improvement in error handling. We assume familiarity with LR parsing as described in [Aho77] for example. .NH The portable C compiler and \fIYacc\fP .PP In some computing environments, for example a university where many students are learning to use a new language, the quality of error diagnostics produced by a compiler is at least as important as the efficiency of generated code. Students using a .UX environment who learn C after Pascal often ask why the portable C compiler is so poor compared with the Berkeley Pascal system [Joy80]. A reason for their dissatisfaction is that \fIpcc\fP is unable to diagnose many simple syntax errors and produces misleading error messages, whereas the authors of the Berkeley Pascal compiler paid particular attention to developing a good error recovery scheme, presented in [Graham79]. .PP \fIYacc\fP produces LALR(1) parsers from a set of grammar rules (productions) and actions. The parsers contain default reductions, that is any state of the parser which has a unique reduction in its actions is given that reduction as entry for all symbols which cannot be shifted. To make use of the existing automatic error recovery scheme, described in [Aho74], the productions of the grammar should contain error productions of the form $A~->~\(*a~\fBerror\fP~\(*b$, where $A$ denotes a non-terminal, $\(*a , \(*b$ denote strings of grammar symbols, and \fBerror\fP denotes the token reserved by \fIYacc\fP for error handling. When the parser is presented with an input token which is not a legal symbol for the current state, it enters error recovery mode and inserts the \fBerror\fP token on the input. The parser pops states from the stack until the top state is one which can shift \fBerror\fP. Parsing then continues as dictated by the parse tables, except that any token for which there is no parsing action is deleted from the input. When three input tokens have been shifted, the parser assumes recovery is complete and leaves error recovery mode. In effect the parser assumes that an error has occurred while looking for a derivation of a non-terminal $A$ and that a series of tokens approximating to a derivation of $A$ has been consumed from the input. .PP \fIYacc\fP allows the user some control over error recovery actions by permitting error productions to have semantic actions associated with them. These can be used to specify actions to be taken in particular cases. \fIYacc\fP also allows the user to force the parser out of error recovery mode before three tokens have been shifted, and to clear the lookahead token. .PP The grammar for \fIpcc\fP contains eight error productions, one for the external definition construct (the highest-level block of which C programs are composed, that is function and data definitions), five for various forms of declarations and two for the statement construct. Only three of these productions have semantic actions, and these only change local variables. The productions for the statement construct are .br $statement~->~\fBerror\fP~';'~|~\fBerror\fP~'"}"'$ .br These productions mean that if the parser detects an error in a statement it will skip all input to the next semi-colon or right curly bracket. All the other error productions have the form .br $declaration~->~\fBerror\fP$ .br These cause the parser to skip input to anything which can follow the declaration. No use is made of the facilities to force the parser out of error recovery mode or clear the lookahead token. .PP In general, the method for error recovery in \fIYacc\fP has some disadvantages. The user has to write error productions which will control error recovery to an extent which the user may not realise. These productions may introduct ambiguities into the grammar. During recovery, input and stack states are deleted silently. No information about the nature of an error is available. The advantages of the method are that it is simple to implement and efficient to run. In the particular case of \fIpcc\fP, the main disadvantage is the poor quality of diagnostic messages, which is a result of the lack of information about errors. .NH The new method .PP The new method for error recovery in parsers generated by \fIYacc\fP uses two techniques, local correction with a forward move, and phrase-level recovery as presented in [Leinius70]. When the parser meets an illegal input token, it first tries to make a local correction to the input string by changes of a single token. If no local correction is successful, where success is judged by the number of moves which can then be made by the parser, a phrase-level recovery is made by replacing a part of the input with a non-terminal. Both already parsed input and input still remaining may be replaced. .NH 2 Local correction .PP The set of tokens which are legal shift symbols for the current configuration is determined by the current state. The parser attempts to repair the input by actions in the following order: inserting a token from this set on the input before the next token, deleting the next token, or replacing it with one from the set of legal tokens. In order to determine whether a repair is "good" the parser runs a forward move on the repaired input. This is achieved by copying some of the parse stack onto an error stack, buffering the input and turning off the semantic actions. The parser then restarts from the error state (the state in which it detected error), with the altered input. If the parser can continue to make moves without detecting a further error before five input tokens are shifted, or before accepting, the alteration is taken to be a good repair. The parser is returned to the error state and the parse stack, the input is backed up to the chosen alteration, and semantic actions are turned on again. If an alteration does not allow the parser to run a forward move which consumes five tokens from the input, a forward move is run with the next altered configuration from the set above. .NH 2 Phrase-level recovery .PP If no local correction succeeds, the parser is restored to the error state and the input is backed up to the illegal token. The parser chooses a goal non-terminal from the set of kernel items for the current state. Its item has the form .br $A~->~v sub 1 ... v sub m ~.~v sub m+1 ... v sub n$ .br where the $v sub i$ are grammar symbols. The phrase to be replaced by the goal non-terminal $A$ is $v sub 1 ... v sub n$. $v sub 1 ... v sub m$ have been parsed, so the parser pops $m$ states from the stack and pushes the goto state for the new top of stack and $A$. Further reductions may now take place. To complete the recovery, input is discarded until the next input token is legal for the current state. In effect, a reduction by the production $A~->~v sub 1 ... v sub m v sub m+1 ... v sub n$ has taken place. .PP A heuristic rule is used to choose the goal phrase from the kernel items of the error state, namely the last item to have been added to the kernel during construction of the item sets, except for the special case of state 0, where the first item is chosen. .PP The scheme is guaranteed to terminate, because it always consumes input tokens during a successful repair. .NH 2 Changes to \fIYacc\fP input specifications .PP Error productions are no longer required for error recovery and so may be deleted from grammar rules. The user must supply a routine \fIyyerrlval\fP as part of the \fIYacc\fP environment. The purpose of this routine is to supply a default semantic value which is required for tokens inserted during local correction and for non-terminals used as goals for phrase-level recovery. This semantic value will typically be a leaf of the syntax tree, suitably tagged. .NH 2 Error messages .PP The parser synthesises an error message from the recovery action taken in each case. It use the terminal and non-terminal names from the input grammar to \fIYacc\fP. Examples of messages are .br .nf .ce SEMICOLON inserted before RIGHTCURLY .br .fi for a successful local repair and .br .nf .ce e ASSIGN IF NAME replaced by e .br .fi for replacement of a phrase. .NH 2 Space requirements .PP Parsers generated by the new \fIYacc\fP require extra space for information for phrase-level recovery and diagnostic messages. No extra space is required for local recovery, as the information required, the valid shift symbols for each state, is present in the existing tables. For phrase-level recovery, two extra words are required for each state, the goal non-terminal and the number of symbols to pop from the stack. Tables of strings are needed for synthesizing diagnostic messages; one string is required for a meaningful name for each grammar symbol, excluding literal tokens. .PP The parser generated by the new \fIYacc\fP will have fewer states than the equivalent parser generated by the old \fIYacc\fP, because there are no error productions in its grammar. The space-saving device of default reductions for all states with a single reduction is still used. .NH The C compiler .PP The existing C compiler \fIpcc\fP contains a syntax analyzer which is generated by \fIYacc\fP. We took the source of this compiler, removed the error productions from the \fIYacc\fP specifications and included a new function \fIyyerrlval\fP which returns a semantic value for inserted tokens and non-terminals. This value is a new leaf of the syntax tree. The only other changes made were to the names of some of the terminals, such as changing SM to SEMICOLON and RC to RIGHTCURLY, to improve the error messages. .PP The relative sizes of the old and new C compilers are shown in Figure 1. .sp .KS .TS box center; l | c | c l | n | n . pcc new version _ Size in bytes of binary (ccom) 86776 98312 _ Parser only: Number of grammar rules 187 179 Number of states 312 303 Size in chars of source (y.tab.c) 42980 54617 .TE .ce Figure 1. Space required by the C compilers .sp .KE .PP It is obvious that the compiler performs identically to \fIpcc\fP on C programs which are syntactically correct. Error recovery for incorrect programs consists of repairs to the input and error messages. For the new compiler, repairs may be simple changes of one token or replacement of a phrase, and error messages describe the repairs. For the old compiler, error messages do not describe the action taken by the parser to recover, but are either uninformative ("Syntax error") or indicate what the parser finds incorrect. .PP In order to test the compiler's performance on incorrect programs, we made a collection of all C programs submitted by undergraduate students in the Department of Computer Science at the University of Warwick to \fIpcc\fP for compilation over three twenty-four hour periods, October 9, 10 and 16, 1984. Duplicate programs were removed and the programs were run through \fIpcc\fP and the new compiler. The code generated for syntactically correct programs was identical. Error recovery was evaluated according to the criteria used by Sippu [Sippu83], rating a correction as excellent if it was the same as a competent programmer might make, good if it introduced no spurious errors and missed no actual errors, fair if it introduced one spurious error or if it missed one error, and poor otherwise, or if the error message generated was meaningless. Missed errors, that is syntax errors that were present in the source code but not reported by the compiler, were counted. Also counted was the number of extra messages, that is messages about errors introduced into the source by incorrect recovery action taken by the compiler. A comparison of the performances of the two compilers, evaluated according to these criteria, is shown in Figure 2. Figure 3 shows a sample C program and its diagnostics. .sp .KS .TS center box; c | c s | c s l | r l | r l . pcc new version _ Quality of recovery action: Excellent 1% (1) 54% (64) Good 3% (3) 11% (13) Fair 54% (64) 13% (15) Poor 27% (32) 19% (22) _ Missed errors 15% (18) 3% (4) _ Total number of errors 118 118 _ Extra messages 127 82 .TE .ce Figure 2. Comparison of the performance of the C compilers .KE .sp .sp .sp .KS .TS n l . 1 /* Kernighan and Ritchie p. 102 - mutilated */ 2 3 strcmp(s, t) 4 char *s, *t 5 { 6 for ( ; *s == *t; s++, t++) 7 if *s == '\0' 8 return(0)! 9 return(*s - *t); 10 ? } .TE .TS l . Diagnostics from pcc: Line 5: Syntax error Diagnostics from new version: Line 6: SEMICOLON inserted before LCURLY Line 7: LPAREN inserted before MUL Line 8: RPAREN inserted before RETURN Line 9: UNOP replaced by SEMICOLON Line 11: QUEST deleted .TE .ce Figure 3. A sample C program .KE .sp .NH Discussion .PP The C compiler generated by the new \fIYacc\fP performed better on the collection of incorrect programs than \fIpcc\fP. The majority of the errors were simple ones which occurred sparsely, and were therefore amenable to repair by the local recovery tactic. This pattern of occurrence of simple errors concurs with Ripley and Druseikis' analysis of syntax errors in Pascal programs [Ripley78], which showed that the majority of these are single-token errors and occur infrequently. Clusters of errors and complicated errors were not handled so well by the phrase-level recovery, and these were responsible for the large number of extra messages generated. .PP The diagnostic messages produced depend on the names for the terminals and non-terminals, which should be carefully chosen by the grammar-writer. Ideally, messages should be at source level rather than lexical token level, as the user will understand a message of the form .DS Line 16: x \(eq y Semi-colon inserted ...... | .DE better than one of the form .DS Line 16: SEMICOLON inserted after ID .DE More communication between the lexical analyzer and the parser may be needed for this sort of message. Line numbers at present are occasionally out by one because of buffering of the lexical tokens. .PP A disadvantage is that the scheme shows bias towards assuming correctness of the left context. Local recovery assumes a single error in the current input token, and secondary recovery makes an arbitrary choice of item from the error state which takes no account of the right context. .PP Several other error recovery schemes for LR(1) parsers have been described. [Sippu81] and [Dain84] contain recent reviews of the literature. The scheme presented here bears resemblance to that devised by Graham, Haley and Joy for a Pascal compiler [Graham79], in that a two-stage recovery is attempted. There are several differences to note. Firstly, our scheme is a general-purpose recovery scheme incorporated in a new version of \fIYacc\fP, and is used by any parser generated by the new \fIYacc\fP. Graham requires special purpose error recovery routines and cost vectors to be supplied for use by their parser generator \fIEyacc\fP which contains no error recovery scheme. Secondly, \fIEyacc\fP produces parsers with certain states calling for reductions having their lookahead tokens enumerated, i.e. some default reductions are not made. Our \fIYacc\fP has the usual default reductions. Thirdly, Graham requires the user to supply error productions in the grammar, to control secondary recovery. These are not required for our scheme. .PP The error recovery scheme for the compiler-writing system HLP [Raiha83] incorporates a local recovery tactic into the phrase-level recovery scheme [Sippu83]. No forward move is made on the input and there is less check on the "correctness" of a local correction; the user must supply costs for deletion and insertion of each terminal in local correction. Different criteria are used for identifying and replacing the error phrase in phrase-level recovery. .PP A two-stage recovery scheme for LL(1) and LALR(1) parsers which uses the concept of \fIscope recovery\fP is implemented by Burke and Fisher [Burke82]. The scheme cannot be used however in LR parsers with default reductions. The user must supply additional language information such as constructs which open and close scope in the language, lists of tokens which cannot be inserted between a given pair of tokens, and lists of tokens which cannot be substituted for a given token. Pai and Kieburtz [Pai80] use \fIfiducial\fP (trustworthy) symbols, typically reserved words, in a scheme for LL(1) parsers which they suggest as suitable for extending to LR parsers. .PP Requiring the user of a parser-generator to supply information to aid error recovery in addition to the grammar may result in recovery which is more tailored to the language, but imposes an extra burden on the user, who may not have a full understanding of the mechanism of the parser and its error handling. The scheme which we have implemented in \fIYacc\fP makes few demands of this nature on its users, yet improves the quality of error recovery in its parsers. .NH References .LP .br [Aho74] Aho, A. V. and Johnson, S. C. LR Parsing. .I Computing Surveys 6, .R 2 (June 1974), 99-124. .br [Aho77] Aho, A. V. and Ullman, J. D. .I Principles of Compiler Design. .R Addison Wesley, 1977. .br [Burke82] Burke, M. and Fisher, G. A. A practical method for syntactic error diagnosis and recovery. .I ACM SIGPLAN Notices 17, 6 .R (June 1982), 67-78. .br [Dain84] Dain, J. A. Error recovery schemes in LR parsers. Theory of Computation Report No. 71, Dept. of Computer Science, University of Warwick, Coventry, December 1984. .br [Graham79] Graham, S. L., Haley, C. B. and Joy, W. N. Practical LR error recovery. .I ACM SIGPLAN Notices 14, .R 8 (August 1979), 168-175. .br [Johnson78a] Johnson, S. C. Yacc - Yet Another Compiler-Compiler. Bell Laboratories, Murray Hill, New Jersey, 1978. .br [Johnson78b] Johnson, S. C. A portable compiler: theory and practice. .I Proc. 5th ACM Symp. on Principles of Programming Languages .R (January 1978), 97-104. .br [Joy80] Joy, W. N., Graham, S. L. and Haley, C. B. Berkeley Pascal User's Manual. Computer Science Division, Dept. of Electrical Engineering and Computer Science, University of California, Berkeley, California, 1980. .br [Leinius70] Leinius, R. P. Error detection and recovery for syntax directed compiler systems. Ph. D. Th., Computer Science Dept., University of Wisconsin, Madison, 1970. .br [Pai80] Pai, A. B. and Kieburtz, R. B. Global context recovery: a new strategy for parser recovery from syntax errors. .I ACM Trans. on Programming Languages and Systems 2, .R 1 (January 1980), 18-41. .br [Raiha83] Raiha, K-J., Saarinen, M., Sarjakoski, M., Sippu, S., Soisalon-Soininen, E. and Tienari, M. Revised Report on the Compiler Writing System HLP78. Report A-1983-1, Dept. of Computer Science, University of Helsinki, Finland, January 1983. .br [Ripley78] Ripley, G. D. and Druseikis, F. A statistical analysis of syntax errors. .I Computer Languages 3, .R 4 (1978), 227-240. .br [Sippu81] Sippu, S. Syntax Error Handling in Compilers. Report A-1981-1, Dept. of Computer Science, University of Helsinki, Finland, March 1981. .br [Sippu83] Sippu, S. and Soisalon-Soininen, E. A syntax-error-handling technique and its experimental analysis. .I ACM Trans. on Programming Languages and Systems 5, .R 4 (October 1983), 656-679. -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
johnl@ima.UUCP (08/18/87)
About 1-2 years ago someone at UNC-Chapel Hill did a study of error recovery mechanism in YACC. The paper is titled "Error Recovery in LR Parsing: A Case Study Using YACC" and is available as 'SoftLab Document No. 21' from UNC Chapel hill, Dept. of Computer Science. Requests for the report should be directed to : Softlab Group, Dept. of Computer Science, Sitterson Hall, University of North Carolina at Chapel Hill, NC-27514. or electronically to rts@unc.cs.unc.edu or bj@unc.cs.unc.edu ---------------------------------------------------------------- Phone: (919) 962-1827 Gopal Gupta (919) 962-1935 UNC-CH Computer Science. gupta@unc.cs.unc.edu, ...!decvax!mcnc!unc!gupta -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
johnl@ima.istc.sri.com (09/04/87)
The version of yacc with improved error handling that juha mentioned is still available, to anyone with a UNIX (System V and bsd 4.1 or more) source license. A student here at Warwick, Nick Holloway, and I have been working on a scheme to produce error messages using the compiler source input rather than grammar names, automatically: i.e. if you use LEX and YACC (new versions) to build your translator you automatically get it to produce error messages like this: ex.c, line 15: syntax error while ((ch = getchar()) != EOF { ......................................^ ) inserted. We've used it to build the portable C compiler and awk. I'm writing it up now but if anyone wants to try out these new tools meanwhile, get in touch with me. -- Julia Dain Tel: 0203-523363 Dept of Computer Science uucp: ...mcvax!ukc!warwick!julia University of Warwick janet: julia@uk.ac.warwick.uu Coventry CV4 7AL UK -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
johnl@ima.istc.sri.com (09/04/87)
In article <689@ima.ISC.COM>, gupta%cs.unc.edu@RELAY.CS.NET (Gopal Gupta) writes: > About 1-2 years ago someone at UNC-Chapel Hill did a study > of error recovery mechanism in YACC. The paper is titled > "Error Recovery in LR Parsing: A Case Study Using YACC" For those interrested in coaxing acceptable error handling out of yacc without being able to hack the source or having to digest academic papers (not that it would hurt you...) there is hope. _Introduction_to_Compiler_Construction_with_Unix_ Schreiner and Friedman, Prentice-Hall, ISBN 0-13-474396-2 This is a usable tutorial introduction to getting results from lex/yacc. I bought it to learn the tools, since they had resisted attack using just the unix manuals. The book is organized around the evolution of a C subset compiler (roughly small-C) which I have since extended and retargeted and I'm using it to generate code for my architecture simulation. Anyway, they include simple yacc source construction rules for error recovery and a simple modification to /usr/lib/yaccpar that you can make at home. The result is truly usable error handling, better than a lot of commercial compilers. -- Steve Nuchia {soma,academ}!uhnix1!nuchat!steve (713) 334 6720 -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request