[comp.compilers] Set of allowable next tokens

nigelh@sol.uvic.ca (R. Nigel Horspool) (01/13/91)

jmr1g@ra.cs.Virginia.EDU (Jeremiah M. Ratner) writes:

>Is there any way i can configure yacc or some other tool to tell me,
>at each step in a parse, the set of tokens that may follow the current
>token?  I am currently doing this by hand, and it is, as they say,
>a tedious and error-prone process.

Yacc optimizes its tables to use default reductions, so it is
impossible to tell from the state listing in the y.output file
which symbols are valid in any state that contains a reduce
action.

If you are content with an approximate answer, it can be obtained
directly from LR parse tables that have not been optimized.
E.g., the row of the LR parse table for state n might contain
entries like:
	symbol a: shift to state s1
	symbol b: shift to state s2
	symbols {c,d,e}: reduce by rule r1
	symbols {f,g}: reduce by rule r2
and the set of symbols for which actions are defined
(i.e. {a,b,c,d,e,f}) is a good approximation to what symbols are
allowed when state n is the top state on the LR state stack.
This info is readily available from yacc's y.output file.

The set is correct (not an approximation) if a full LR(1) parser
generator is used.  However, for the SLR(1) and LALR(1) methods
(Yacc uses LALR(1)), the set of symbols that trigger reduce actions
may be too large.  To get an exact answer, it is necessary to take
context into account to eliminate some impossible next symbols from
the set.

If you need to know the exact set of valid next symbols at some
point in a parse, you can run an algorithm that simulates the updates
that occur to the state stack when each reduce action is performed.
An algorithm in pseudo Pascal goes something like this:

    ValidNextSymbols := Valid( SS, VT );
	{ where SS is the current state stack and VT is the set of
	 all terminal symbols in the grammar}

    function Valid( SS, T )
	Result := set of symbols, x, such that the top state
		   on stack SS has a shift action for x or an
		   accept action for x;
	Result := Result * T;

	for each reduce action in the top state on stack SS do
	    Test := set of symbols that trigger the reduce action;
	    Test := Test * T;
	    if Test != empty then
		CSS := copy of SS;
		update the stack CSS as required by the reduce action;
		{ i.e. pop k states where k is the length of the RHS
		 of the rule being reduced by, and push the state
		 reached by the goto action for the LHS symbol of
		 the rule }
		Result := Result + Valid( CSS, Test );
	    end if
	end for;
	return Result;
    end function;

{Note: `+' represents set union and `*' represents set intersection.}

Since the algorithm checks which reduce symbols are valid (i.e. those
that can eventually be shifted), it works equally well on parse tables
that have been optimized by the use of default reductions.  I.e., it
could be implemented to work with the tables used by a yacc-generated
parser.

R. Nigel Horspool
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

djones@decwrl.dec.com (Dave Jones) (01/15/91)

> Is there any way i can configure yacc or some other tool to tell me,
> at each step in a parse, the set of tokens that may follow the current
> token?  I am currently doing this by hand, and it is, as they say,
> a tedious and error-prone process.

> [This same question came up in November, but no answers. -John]


There were plenty of answers John. Some of them were wrong, but not
all. I do agree that the definitive answer has yet to be written.
Anyway, here's the crux of the problem:

The table-driven parser which a typical implementation of yacc generates
keeps state-information on a stack. The things it puts on the stack are
called "states" in the jargon, although the entire stack really comprises the
parser's state. It is easy to determine which tokens can be shifted when a
particular state is on top of the stack. To do that you can have your parser
look in the tables, just as the regular parser does when it is doing its
thing, or you can build something that reads the y.output file you get using
the -v option. (AWK might come in handy.)  But knowing which tokens can be
shifted when a particular state is on top is not enough, because other states
that might have allowed other tokens to be shifted may have been on top of
the stack since the previous token was shifted. That is possible because yacc
typically uses "default reductions". (You can find the technical info on this
stuff in the _Dragon Book_, new or old version. In my copy of the new Dragon
Book, the applicable part begins on page 244.)

So what you have to do is to figure out what tokens *might* have gotten
shifted in states that have been "default reduced" since the last token was
shifted. Of course, you could just keep a token-set, coded as a bit map. That
would be a straight-forward way to do it, and obviously correct, but would
impose some overhead even when the input was error-free. Several people,
including myself, posted algorithms that they claimed could reconstruct the
sets after the fact, some by analysing the stack by running simulations,
others by keeping some bookkeeping info on what states have been on the stack
since the last shift. The only one I have a copy of is the algorithm I
suggested, and it is no longer as obvious to me as it was then.

More later...
[My brain must have curdled, I forgot about the lively discussion in mid-1989
about finding the follow set in the context of error correction and detection.
There was a lot of discussion including several trenchant messages from Mr.
Jones, and several proposed solutions.  -John]
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

djones@decwrl.dec.com (Dave Jones) (01/16/91)

> In the book 'Compiler Construction using Unix' by Schreiner & Friedman 
> (sorry, I don't have the book here at work, and I don't recall the
> publisher) a modification to yacc is described which prints the set of
> allowed tokens when encountering a syntax error. It's actually sufficient
> to modify the yaccpar file.
> 
> There is a tar file on a.cs.uiuc.edu in ~ftp/pub/friedman which contains the
> modified yaccpar file.

It's buggy. In fact, on any non-trivial grammar, it is likely to fail
very badly, omiting most of the legal tokens.

What the hack actually does is to print the tokens which could be
shifted at the time the compiler detected the error.  That is
not sufficient, for the reason I first pointed out in a letter to
comp.compilers, posted Feb. 23, 1989. Sorry, I don't have the article number.
The problem is that the parser is likely to have altered its state since
the last token-shift, (through "default reductions"), at times when
other tokens could have been shifted.

If anyone cares to see them, I still have a couple of counter-examples
on hand.
[Oops, right.  Since yacc may have taken some default reductions before
noticing an error, the set of allowable tokens in the state where it
notices the error need not have much to do with the state where the
error really occurred.  -John]
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.