[comp.compilers] Error reporting in LR parsers

worley@compass.com (Dale Worley) (08/04/89)

   The idea of adding a 'yyexpected' function might sound good at
   first, but there are a number of problems with it.   As was pointed out
   before, the number of expected terminals will probably be quite large,
   as this will include terminals on which a shift is possible and
   terminals on which a reduction is possble.  For a grammar for Pascal,
   for example, this could be 30 or more terminals in some spots.  Also,
   you would probably have to modify the Yacc source code to do this
   because Yacc inserts default reduction actions.  These default
   reduction actions eliminate the lookahead information associated with
   reductions.

I've used similar techniques, and a few simple fixes can solve a lot
of these problems.  The first is to define "token groups", which are
groups of tokens the user thinks of as being similar.  For instance,
'+', '-', '*', etc. can be included in the group 'operator'; '(',
'IDENTIFIER', '-', etc., can be included in the group 'expression'.
Now, when a syntax error is found, the error-reporter assembles a list
of acceptable tokens.  Then it checks each token group, and if the
group is contained in the list of acceptable tokens, all those tokens
are removed from the list, and the name of the token group is added.
(In practice, only about three members of the group need to be checked
for membership, and if one token group overlaps another, you have to
be careful about the order in which the groups are tested.)  The
result is that you can usually reduce the list of allowed tokens to
about three items, and produce error messages like:

	    34	i = j k;
		      ^
    Syntax error.
    Found: IDENTIFIER 'k'
    Expecting: OPERATOR ; END

The trick of only checking a few members of the group can be used to
defeat the problem of losing information because of default
reductions.  The way to do this is to only test members whose validity
will not be eliminated by the default reductions.  In the above
example, 'j' may be reduced to 'expression' before the error is found,
but if the 'operator' group is represented only by '+', then the fact
that '*' cannot follow 'expression' will not stop the 'operator' group
from showing in the list of expected tokens.

Dale
--
Dale Worley, Compass, Inc.                      worley@compass.com
No one hates so bitterly as the lazy and poor contemplating the
energetic and prosperous.
-- 
--
Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU
{ decvax | harvard | yale | bbn }!ima.  Meta-mail to ima!compilers-request.
Please send responses to the author of the message, not the poster.

djones@megatest.uucp (Dave Jones) (08/06/89)

Mr. Worley has generously taken the time to post a long article about getting
yacc to give up a list of token-categories which would be legal when a syntax
error is found. The idea of grouping tokens into categories is a good one.

But the technique in no way solves the "default reduction" problem, as
he claims it does. The technique he describes suffers from the exact same
problem that all the other attempts have done so far. I will demonstrate that
even the small example he gives will not work, by inspecting the actual
output from yacc.

From article <1989Aug4.032053.753@esegue.uucp>, by worley@compass.com (Dale Worley):
> 
... [ concerning how to get yacc to generate lists of legal tokens
      when a syntax error is encounterd ... ]
> 
> I've used similar techniques, and a few simple fixes can solve a lot
> of these problems. The first is to define "token groups", which are
> groups of tokens the user thinks of as being similar.  For instance,
> '+', '-', '*', etc. can be included in the group 'operator'; '(',
> 'IDENTIFIER', '-', etc., can be included in the group 'expression'.
> Now, when a syntax error is found, the error-reporter assembles a list
> of acceptable tokens.  Then it checks each token group, and if the
> group is contained in the list of acceptable tokens, all those tokens
> are removed from the list, and the name of the token group is added.
> (In practice, only about three members of the group need to be checked
> for membership, and if one token group overlaps another, you have to
> be careful about the order in which the groups are tested.)  The
> result is that you can usually reduce the list of allowed tokens to
> about three items, and produce error messages like:
> 
> 	    34	i = j k;
> 		      ^
>     Syntax error.
>     Found: IDENTIFIER 'k'
>     Expecting: OPERATOR ; END
> 
> The trick of only checking a few members of the group can be used to
> defeat the problem of losing information because of default
> reductions.
>  The way to do this is to only test members whose validity
> will not be eliminated by the default reductions.
>  In the above
> example, 'j' may be reduced to 'expression' before the error is found,
> but if the 'operator' group is represented only by '+', then the fact
> that '*' cannot follow 'expression' will not stop the 'operator' group
> from showing in the list of expected tokens.
> 

In the above example, when the yacc-generated compier discovers the
syntax error, the lookahead set under consideration would contain
only ';' and END.  It would not contain either '+' or '*'.

Having done the default reduction, the compiler would
have discarded the state or states which could shift '+' or '*'.
That's the whole point. Nothing in the algorithm described above retains
or recovers the lost information.

To make it more concrete, let's actually try to parse the example
fragment:

	BEGIN
          i = j k
        END

I've written a grammar for us to consider. In the interest of brevity,
and "doing it right", the grammar employs precedence-rules, but the result
would be the same if we used separate "expr", "term", and "factor"
productions. (If you are skeptical, try it.)

%token BEGIN END ID
%left '+'
%left '*'

%%

prog : BEGIN list END

list     : stat
         | list ';' stat
;

stat :  /* empty */
     |  ID '=' expr
;

expr : ID
     | expr '+' expr
     | expr '*' expr
;


%%


Say, "yacc -vd gram.y"

From the resulting y.output file, we observe the following:


   state 10
	stat :  ID = expr_    (5)
	expr :  expr_+ expr 
	expr :  expr_* expr 

	+  shift 12
	*  shift 13
	.  reduce 5



We see that when the parser, in state 10, does not see a '+' or a '*',
it will make the default reduction (5):

        stat : ID = expr

Doing so, it will discard three states from the top of the stack,
including state 10, which is the one which "knows about" the '+' and
the '*'.  Next it will do another default reduction, producing a "list".
Then it will be in state 3, as listed in the y.output file:

   state 3
	prog :  BEGIN list_END 
	list :  list_; stat 

	END  shift 6
	;  shift 7
	.  error


Only now will it discover that the next symbol, another ID, is not
legal.

Note that the only symbols which it now considers to be valid are
';' and END, although '+' and '*' would have been valid before the
default reductions were made.  How are we to infer that a class of
symbols which we choose to call "operators" would also be valid?
The information is gone.

In this example, state 3 can only be entered after having done
the default reductions which we saw. Thus the LALR(1) state 3 only
stands for one LR(1) state, which additionally has the lookahead tokens
'+' and '*'. But in general, a given LALR(1) state can stand in for
several LR(1) states with different LR(1)-lookahead sets. I have
demonstrated in previous postings how the entire LR(1)-lookahead
set can be calculated at runtime, as the file is parsed. But any such
calculation must take into account the input (or the states resulting
from the input). There is not enough information in the LALR(1) state alone.
[From djones@megatest.uucp (Dave Jones)]
-- 
--
Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU
{ decvax | harvard | yale | bbn }!ima.  Meta-mail to ima!compilers-request.
Please send responses to the author of the message, not the poster.

lai@mips.com (David Lai) (08/08/89)

In <1989Aug6.024931.10014@esegue.uucp>, Dave Jones gives a rebuttal stating
how a default reduction in yacc has eliminated acceptable tokens from the
token list when an error occurs.  The problem can be solved by generating
the list of acceptable tokens *before* returning the error token.  This
method requires that you test the tokens before returning them to yacc (in
the scanner).

In the case given 'i = j k', the scanner upon reading 'k' checks the parser
state to see whether an identifier is acceptable in this state.  This will
be the state prior to the default reduction, where + and * are still
acceptable.  The list of acceptable tokens is remembered and the erroneous
token is returned.  The 'yyerror' function then prints out the (perhaps
decoded) list of acceptable tokens.  
-- 
     David Lai (lai@mips.com || {ames,prls,pyramid,decwrl}!mips!lai)
-- 
--
Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU
{ decvax | harvard | yale | bbn }!ima.  Meta-mail to ima!compilers-request.
Please send responses to the author of the message, not the poster.

heirich@cs.ucsd.edu (Alan Heirich) (08/08/89)

This is the first of two postings.  This posting describes a workaround for
the default reduction problem with yacc.  The second posting describes code
modifications to DECUS yacc which provides meaningful diagnostic messages
both for describing a parse (debugging a grammar) and for an end-user at
run time.

In article <1989Aug6.024931.10014@esegue.uucp> djones@megatest.uucp (Dave Jones) writes:
>Mr. Worley has generously taken the time to post a long article about getting
>yacc to give up a list of token-categories which would be legal when a syntax
>error is found. The idea of grouping tokens into categories is a good one.
>But the technique in no way solves the "default reduction" problem, as
>he claims it does. 

Mr. Jones is correct.  However, the default reduction problem is not hard to
work around.

The problem: yacc violates correct shift-reduce parsing rules by assigning
"default reductions".  For the following item:

   A  -->  B C .

the corresponding rule should be reduced only when the lookahead token is in
the follow set of A.  A given automaton state might contain many such items,
and also many items on which a shift is anticipated, e.g.

   Q  -->  R S . T

(in which case a shift would be expected on when the lookahead token is in
the first set of T).

The yacc parser should report an error whenever the lookahead token is neither
in a follow set of a rule like "A" above, nor in a first set for a rule like
"Q" above.  But yacc builds its automaton in such a way that this is not the
case.  Instead, a rule like "A" would get reduced whenever the lookahead token
can't be handled in any other way.  In other words, a reduction happens when
an error should be reported.  These "default reductions" continue until the
parser is in a state with no reductions, at which point it reports an error.
Any diagnostic message based on the state where the error is reported are
thus potentially incorrect, since the lookahead set might be different from
the lookahead set of the state where the error was actually encountered.

Consequences of the problem: there are three types of states: shift only; 
reduce only; and combined shift & reduce states.  It turns out that this
"problem" in irrelevant to two of the kinds of states; and is easily worked
around for the case of combined shift & reduce states.

case 1: a state contains only shift items.
   Then there is no problem, because the error is reported instantly.

case 2: a state contains only reduce items.
   The default reductions will change the state before the error gets reported.
But this state will have the same lookahead set as the state where the error
was encountered.  This is because the reduction will return the parser to a
state where the lookahead set contains the first set of the reduced 
nonterminal.  The reductions don't change the lookahead set.

case 3: combined shift & reduce items.
   This is the potentially problematic state, which Mr. Jones illustrated in
his posting.  As he explained, the default reductions move the parser to a
state in which the lookahead set is not the same as when the error was 
encountered.  Specifically, the new lookahead set will contain the union of
the first sets of the reduced nonterminal(s), but will not contain the tokens
which could have been shifted in the state where the error was encountered.
   
   Case 3 has an easy fix.  Modify the yacc parser to save the current state
*after* it obtains a new token from yylex.  (This occurs near the beginning of
the parse code, right after the debug code that prints out the state number).
Then, when an error is encountered, yyerror traps the current state and
compares it to this saved state.  The set of expected tokens is the union of
the lookahead sets of these two states.  (In cases 1 and 2 above, these will
be the same sets of tokens).

The next posting describes a set of mods I have made to DECUS yacc to
implement automatic diagnostics based on this approach.

-------------------------
Alan Heirich     Comp. Sci. & Eng., Cognitive Science
C-014 University of California, San Diego 92093

heirich@cs.ucsd.edu
aheirich@ucsd.bitnet
-- 
--
Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU
{ decvax | harvard | yale | bbn }!ima.  Meta-mail to ima!compilers-request.
Please send responses to the author of the message, not the poster.

heirich@cs.ucsd.edu (Alan Heirich) (08/08/89)

This is the second of two postings.  The first posting dealt with objections
to automatic error recovery in yacc.  This posting describes modifications to
DECUS yacc to permit automatic diagnostic generation.

In an earlier message I mentioned that I have modified yacc to provide useful
diagnostic messages to describe the parse (for debugging a grammar) and to
describe the expected tokens when a syntax error occurs (to debug a user).
I received many requests for the source code.  I have been advised that the
version of DECUS yacc which I modified is in fact pirate AT&T code, so I don't
think I can redistribute it.  But I will describe the mods, which are easy to
make.  [If you have AT&T source, these changes should be straightforward to
apply. -John]

The general goal is to encode the information from the parser description
file, describing states, goto sets, and lookahead sets, in a compact form
that can be accessed at run time.  Once you have done this you can use any
number of schemes of varying complexity to generate messages describing
the parse process and the expectations of the parser.  I will leave most
of this to the ingenuity of the reader, and simply describe how to get the
essential information out of the yacc data structures.

The changes are nearly all in the routine "output".  This routine writes out
the parser description to the description file.  You will want to modify it
to write out five pieces of information:

  (1) a set of strings containing token names
  (2) a set of strings containing nonterminal names
  (3) a set of states containing items sets
  (4) a set of states containing lookahead sets
  (5) a set of states containing goto sets

(1) and (2) are easy.  The token names are in an array of records.  
tokset[i].name is the string containing the name of token #i.  Use the
TLOOP macro on i to print out all the token names, i.e.
   TLOOP(i) { printf ("\"%s\",\n", tokset [i].name); }
Since TLOOP starts at 1, print a null string before you start this loop to
account for token #0 (which apparently is nonexistent as far as yacc is
concerned).
   The nonterminal names are in another array of records, nontrst.  Use the
NTLOOP macro, which starts at 0, as follows:
   NTLOOP(i) { printf ("\"%s\",\n", nontrst [i].name); }
  The item sets are useful so that the run time parser can show you a
description of the current state, rather than just emit a message "in state x".
You guessed it, the SLOOP macro steps through all of the states:
   SLOOP(i) { writeState (i); }
the writeState function (which you must create) is a bit trickier than simply
printing out a string.  Copy the code from the beginning of output() which
writes out the state description to the description file, then selectively
delete to get what you want.  You should end up with approximately the
following:
SLOOP(i)
   {
   nolook = !(tystate [i] == MUSTLOOKAHEAD);
   closure(i);
   nolook = 1;
   ITMLOOP(i,pp,qq) { writeItem (pp-> pitem); printf ("-99999,\n"); }
   if (tystate [i] == MUSTLOOKAHEAD)
         WSLOOP (wsets + (pstate[i+1]-pstate[i]), u)
            if (*(u->pitem) < 0)
               {
               writeItem (u->pitem); printf ("-99999,\n");
               } /* if */
  What is happening here?  Well, we first recompute closure for the state.
That sets important flags in the tystate array.  We then loop through all
the normal items in the state, thanks to ITMLOOP, and write them out (more
on this below).  Then, if the state requires lookahead we write out another
group of items (exactly what they are escapes me at the moment.  I think
they are items that reduce on the empty string, or something like that).
   What must writeItem(pp) do?  Approximately this (again, adapted from
the code that writes out the state descriptions to the description file):

writeItem (pp) int *pp;
   {
   int *p;
   for (p = pp; *p > 0; ++p)   
      { }  /* (skip ahead until *p > 0) */
   p = prdptr [-*p];
   printf ("%d,", *p);
   while (1)
   {
   p++;
   if (p == pp)   printf ("-88888,");
   if (*p <= 0)   break; else printf ("%d,", *p);
   } /* while */
   if (*pp < 0)   { printf ("%d,", *pp); }
   } /* writeItem */

This writes out the elements of a single item, encoded as integers.
-88888 represents the location of the . (or underscore) in the item, i.e. the
current position of the parse.  A positive number is the number of a token
or nonterminal name; if the number is > NTBASE (a yacc constant) then it
represents a nonterminal (and NTBASE must be subtracted to get the number
of the nonterminal); if the number is < NTBASE then it can be used directly
as an index into the token name table (described above).  A negative number
indicates a rule that is the be reduced; it always appears at the end of
an item, and is negated (to get the actual rule number, just negate the
value printed out).
   Notice that after writeItem is called, the value -99999 is printed out.
This is simply an end-of-item marker.
   For (4), the lookahead sets: use SLOOP again, and for each state i,
compute closure(i).  Well, here is the code, including some important details:

   SLOOP(i)
   {
   nolook = !(tystate [i] == MUSTLOOKAHEAD);
   closure (i);
   nolook = 1;
   aryfil (temp1, ntokens + nnonter + 1, 0);
   WSLOOP (wsets,u)
      {
      c = *(u -> pitem);
      if (c > 1 && c < NTBASE && temp1 [c] == 0)
         {
         printf ("%d,", c);
         temp1 [c] = 1;
         } /* if */
      } /* WSLOOP */
   } /* SLOOP */

   Finally, (5), the goto sets:

   SLOOP(i)
   {
   nolook = !(tystate [i] == NOLOOKAHEAD);
   closure (i);
   nolook = 1;
   aryfil (temp1, ntokens + nnonter + 1, 0);
   WSLOOP (wsets, u)
      {
      c = *(u -> pitem);
      if (c > NTBASE && temp1 [(c -= NTBASE) + ntokens] == 0)
         {
         temp1 [c + ntokens] = 1;
         printf ("%d,", c);
         } /* if */
      } /* WSLOOP */
   } /* SLOOP */

This basically does it.  What you get from all this is a lot of printed
information.  Write this to a text file and use it to initialize a set
of arrays which you compile into your parser.  You can then access the
information while your parser runs.  This allows it to describe states
in the same form as in the description file; describe the expected lookahead
tokens or goto items (i.e. the expected nonterminals, which are often much
more explanatory than the lookahead tokens); and describe the tokens as
they are obtained from yylex (very helpful).

This method certainly isn't perfect -- the biggest problem is that an end
user can be overwhelmed with information, since a programming language like
C or Pascal can generate states with 30 valid lookahead tokens.  (Using
the nonterminal names from the goto sets can help in this situation,
although only partially).  Still, it's a useful start, and does provide
helpful information.  Also, be aware that program errors tend to be static
semantic errors rather than syntax errors, so the gravity of the situation
is somewhat less than one might think...
 
But it is true, what we all *really* want is a son-of-yacc (yayacc?)
which allows us to specify textual error messages from within the grammar...
and which doesn't generate stupid reduce/reduce conflicts when we put
lots of error productions into the grammar!

-------------------------
Alan Heirich     Comp. Sci. & Eng., Cognitive Science
C-014 University of California, San Diego 92093

heirich@cs.ucsd.edu
aheirich@ucsd.bitnet
-- 
--
Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU
{ decvax | harvard | yale | bbn }!ima.  Meta-mail to ima!compilers-request.
Please send responses to the author of the message, not the poster.

rusty@garnet.Berkeley.EDU (Rusty C. Wright) (08/10/89)

So with all this thrashing about yacc's error reporting, does anybody
know if bison does the "right thing"?
--
rusty c. wright, rusty@violet.berkeley.edu ucbvax!violet!rusty
-- 
--
Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU
{ decvax | harvard | yale | bbn }!ima.  Meta-mail to ima!compilers-request.
Please send responses to the author of the message, not the poster.

djones@megatest.com (Dave Jones) (08/11/89)

From article <1989Aug8.130702.957@esegue.uucp), by lai@mips.com (David Lai):
) In <1989Aug6.024931.10014@esegue.uucp), Dave Jones gives a rebuttal stating
) how a default reduction in yacc has eliminated acceptable tokens from the
) token list when an error occurs.  The problem can be solved by generating
) the list of acceptable tokens *before* returning the error token.  This
) method requires that you test the tokens before returning them to yacc (in
) the scanner).
) 
) In the case given 'i = j k', the scanner upon reading 'k' checks the parser
) state to see whether an identifier is acceptable in this state.  This will
) be the state prior to the default reduction, where + and * are still
) acceptable.  The list of acceptable tokens is remembered and the erroneous
) token is returned.  The 'yyerror' function then prints out the (perhaps
) decoded) list of acceptable tokens.  

As I (and one or two other people) pointed out the last time this
subject came up, it is not necessary for the parser to keep a set of
acceptable tokens. It can keep a set of states which are discarded
by default reductions, (which should be much quicker to calculate),
then calculate the tokens only if they are needed when an error is
detected. Didn't I point this out recently also?

		Dave Jones
[From djones@megatest.com (Dave Jones)]
-- 
Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU
{ decvax | harvard | yale | bbn }!ima.  Meta-mail to ima!compilers-request.
Please send responses to the author of the message, not the poster.

djones@megatest.com (Dave Jones) (08/11/89)

From article <1989Aug8.131112.1081@esegue.uucp>, by heirich@cs.ucsd.edu (Alan Heirich):
> This posting describes modifications to
> DECUS yacc to permit automatic diagnostic generation.
...
> 
> The changes are nearly all in the routine "output".  This routine writes out
> the parser description to the description file.  You will want to modify it
> to write out five pieces of information:
> 
>   (1) a set of strings containing token names
>   (2) a set of strings containing nonterminal names
>   (3) a set of states containing items sets
>   (4) a set of states containing lookahead sets
>   (5) a set of states containing goto sets
> 

All this info is in the y.output file from standard yacc. There's no need to
(ahem) hack yacc. If there's a demand, and I can find the time, I'll package
up and post some a nawk (new awk) scripts and so forth that use the y.output
file to generate a procedural, rather than table-based, parser.  They could
be modified easily enough to write the above info in any format you might
want.

But be warned, however you obtain this info, you still have to calculate the
legal-token-sets dynamically. There is not enough info in the LALR(1)
item-sets to calculate them from the state-number alone. You have to keep up
with the default-reduction states as they are popped. I hope I have said
this often enough and loudly enough now.

I did the scripts partly as an exercise in learning nawk, partly to get a
faster parser, but mostly to aid in debugging compilers. It's impossible to
pick through the coded tables in a debugger and make any kind of sense, but
it is easy to single step through code that looks like the following, which
is cut and pasted from a compiler I'm in the process of writing:

switch(state) {
...
  case 3:switch(lookahead){
    /*
     * 	file :  $$2 _ declarations 
     * 	declaration_list : _    (28)
     */
     case EOF: YREDUCE(28,0,NT_declaration_list);
     case error: YSHIFT(7);
     case CONST: YREDUCE(28,0,NT_declaration_list);
     case INSERT: YREDUCE(28,0,NT_declaration_list);
     case FUNCTION: YREDUCE(28,0,NT_declaration_list);
     case LABEL: YREDUCE(28,0,NT_declaration_list);
     case PROCEDURE: YREDUCE(28,0,NT_declaration_list);
     case TYPE: YREDUCE(28,0,NT_declaration_list);
     case VAR: YREDUCE(28,0,NT_declaration_list);
     case WITH: YREDUCE(28,0,NT_declaration_list);
     default: YERROR();
  };

The scripts "wrote" the above code from inspection of the y.output
file.
[From djones@megatest.com (Dave Jones)]
-- 
Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU
{ decvax | harvard | yale | bbn }!ima.  Meta-mail to ima!compilers-request.
Please send responses to the author of the message, not the poster.

djones@megatest.com (Dave Jones) (08/11/89)

The errors in the following article all seem to be due to the mistaken
impression that tokens which can legally follow the production of
a non-terminal symbol are independant of the context of that non-terminal.

From article <1989Aug8.130922.1019@esegue.uucp), by heirich@cs.ucsd.edu (Alan Heirich):
...
) 
) The problem: yacc violates correct shift-reduce parsing rules by assigning
) "default reductions".  For the following item:
) 
)    A  --)  B C .
) 
) the corresponding rule should be reduced only when the lookahead token is in
) the follow set of A.

I've got a little trouble with the terminology. I'm going to assume

    "Follow set of A" means all the terminals which can follow A in any
    sentential form. (This is the definition is most applicable to the
    theory of SLR parsers, but I can't think of any alternative meaning.)

    "... should be reduced" is a little trickier. Except for an obscure
    bug related to the synthesized error token, yacc does what it "should do".
    I am assuming that you are talking about what a full LR(1) parser should
    do. But in that case you should say that it should reduce the rule
    A <= BC if and only if the LR-stack and lookahead togther form a valid
    prefix with this form: x B C t, and x A t is a valid prefix.
    What you actually describe is what an SLR parser "should do".

)  A given automaton state might contain many such items,
) and also many items on which a shift is anticipated, e.g.
) 
)    Q  --)  R S . T
) 
) (in which case a shift would be expected on when the lookahead token is in
) the first set of T).
) 
) The yacc parser should report an error whenever the lookahead token is neither
) in a follow set of a rule like "A" above, nor in a first set for a rule like
) "Q" above.

The above, too, is a little vague. But consider this:

G : '[' A ']'
  |  A ';'
  ;

A : BC
;

[Before puzzling over this too much, read on. A similar grammar comes into
 play later.]

) ...
) 
) Consequences of the problem: there are three types of states: shift only; 
) reduce only; and combined shift & reduce states.  It turns out that this
) "problem" in irrelevant to two of the kinds of states; and is easily worked
) around for the case of combined shift & reduce states.
...
) 
) case 2: a state contains only reduce items.
)    The default reductions will change the state before the error gets reported.
) But this state will have the same lookahead set as the state where the error
) was encountered.

Counter-example:

    G : '(' A ')'
      |  A ';'
      ;

    A : 'a'
      ;

Run yacc on this and you'll see this state:

    State 4:

	A : a _ (3)

	. reduce 3

In state 4, the only one which produces A, the only action is a default
reduction. The state has no follow-set coded for it by yacc, but in the
sense of the above comments, its follow set contains ')' and ';' because
it is used in both of these rules:

    G : '(' A ')'
      |  A ';'
      ;

Using either sense of the term "follow set", we can, by means of the default
reduction, uncover a state with a different set of legal tokens.

Consider the input

     '(' A x

After doing the default reduction in state 4, we reach state 5:

     state 5
        G :  ( A_)

        )  shift 7
	.  error

Its only legal next token is ')'.

) 
) case 3: combined shift & reduce items.
)    This is the potentially problematic state, which Mr. Jones illustrated in
) his posting.  As he explained, the default reductions move the parser to a
) state in which the lookahead set is not the same as when the error was 
) encountered.  Specifically, the new lookahead set will contain the union of
) the first sets of the reduced nonterminal(s), but will not contain the tokens
) which could have been shifted in the state where the error was encountered.

Ahem.

G:  '(' Q ')'
 ;

Q: P
 | P ';'
 ;

P : 'p'
  | 'p' '!'
  ;


Need I say more?

)    
)    Case 3 has an easy fix.  Modify the yacc parser to save the current state
) *after* it obtains a new token from yylex.  (This occurs near the beginning of
) the parse code, right after the debug code that prints out the state number).
) Then, when an error is encountered, yyerror traps the current state and
) compares it to this saved state.  The set of expected tokens is the union of
) the lookahead sets of these two states. 

Almost. Make the parser save the intire set of intermediate states popped
by default reductions, not just the first one, and use the union of all their
shiftable tokens and the shiftable tokens from the error state. This is,
effectively, the only workable algorithm I've seen so far, although I and
others have presented variations on it.
[From djones@megatest.com (Dave Jones)]
-- 
Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU
{ decvax | harvard | yale | bbn }!ima.  Meta-mail to ima!compilers-request.
Please send responses to the author of the message, not the poster.

markg@well.sf.ca.us (Mark Grand) (08/13/89)

Generating a list of acceptable tokens before allowing YACC to perform a
default reduction is expensive.  A cheaper way (assumimg a fast
implementation of memcpy) is to take a snapshot of YACC's state stack every
time it gets a new token.  That way you can generate a list of the expected
tokens from the snapshot and only have to do it when actually needed.

-- 
Mark Grand                     markg@well.sf.ca.usa
GeoMaker Software              {apple,pacbell,hplabs,ucbvax}!well!markg
P.O. Box 273124                415-680-1964
Concord CA   94527-3124
-- 
Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU
{ decvax | harvard | yale | bbn }!ima.  Meta-mail to ima!compilers-request.
Please send responses to the author of the message, not the poster.

eachus@mbunix.mitre.org (Robert Eachus) (08/15/89)

In article <1989Aug12.201931.4857@esegue.uucp> Mark Grand writes:
>Generating a list of acceptable tokens before allowing YACC to perform a
>default reduction is expensive.  A cheaper way (assumimg a fast
>implementation of memcpy) is to take a snapshot of YACC's state stack every
>time it gets a new token.  That way you can generate a list of the expected
>tokens from the snapshot and only have to do it when actually needed.

     There is an even faster way.  Pat Prange and I used it in a
parser generator and driver {LALR} which is available on Multics.  If
the next action to be taken by the compiler is a reduction (other than
into an accepting state) keep a list of states but do nothing. If it
is the first read (shift) following an accepting state hold it to one
side also.  When a succeeding legal token has been read and is ready
to shift onto the stack (or an accepting state is reached), do the
pending read (if any) then do all pending reductions up to the current
read.

     With this scheme, error correction can start from the point just
before the  last legal token was read.  This allows all sorts of
possible error recovery strategies to be tried.  We had twelve
combinations that we tried including deleting the previously read
token and swaping the previous and current tokens.  All this was
"free" in that it required a few bytes of storage in the drivers
stack, and about 3 or 4 extra instructions per successful shift or
reduction.  (Failures are another story, we allowed ourselves about a
tenth of a CPU second per correction, but it was worth it since the
compilers could often continue through several difficult syntax errors
and do a meaningful error scan thorugh an entire program.

     Two of my favorite corrections came from the Ada Compiler
Validation Suite:

     legal-statement;
     IFB THEN ...

     was corrected to:

     legal-statement;
     if IFB then ...

     and

     X: INTEGER ran ge 1..10;

     was corrected to:

     X: INTEGER range 1..10;

     All without special error tables, and without looking at the
spellings of non-terminals! In fact the parser tables for the Ada
grammar above were about 3500 36-bit words, including 36 words for
error recovery (closures for constructs to be deleted if panic mode
was used to throw a sequence of invalid tokens).

     As far as I know this tool is still used to maintain the parsers
for several Honeywell compilers (especially for the DPS6 and DPS6+,
but not the Ada compiler from DDC), and the source (highly Multics
specific PL/1) is on every existing machine running Multics.  (There
just aren't too many of them any more...sigh!)

					Robert I. Eachus
					eachus@mbunix.mitre.org
-- 
Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU
{ decvax | harvard | yale | bbn }!ima.  Meta-mail to ima!compilers-request.
Please send responses to the author of the message, not the poster.