[comp.compilers] Technical question about yacc and LALR parsers

jar@florida.HQ.Ileaf.COM id AA07671; Sun, 21 Apr 91 00:33:37 -0400 (Jim Roskind x5570) (04/21/91)

I have been spending this weekend hacking byacc to automate the
process that I go through to evaluate conflicts.  One part of this is
walking backwards through a parser table.  When I generated backwards
pointers for each state in the parse, I noticed something rather
surprising (at least to me).  Basically, for each state k in the
parser, there are many states that can transition to k (no surprise),
but all such states transition with for the same reason (I was
surprised) i.e, all shifts/goto's into state k are made on the same
terminal/nonterminal.  To put it another way, when the parser
generator is sitting in any given state k, the token just to the left
of the carrot is a function of k only.

When I first coded my data structures, I assumed that for each state
k, I would need a list of prior states, and I also wanted to save the
token name that was used to transition from the "prior state".  When
all was said and done, I found that it was sufficient to save a "list
of prior states" for each state, as well as a single "prior token" for
each state.

If I haven't still made myself clear, the following is an example of
something that will *never* be found in the verbose output from yacc:

state 727
	parened_type : '(' VOID . ')'
	parened_type : '(' INT . ')'

Basically, if this was ever seen, it would mean that in state 727, it
was possible to have two distinct tokens to the immediate left of the
carrot.  Note that the following is possible (I think):

state 747
	gathered_type : '[' VOID . ']'
	gathered_type : '(' VOID . ')'

as it is certainly *not* the case that the state defines the entire
left context of the carrot.

One corollary (via the pigeon hole principle) of this observation is
that there must me more states in such a parser than there are
terminals plus nonterminals.

Assuming that I have explained myself, the question is: Is this
property true of all LALR parsers?  It was very conceivable that I
could hand code a table driven "LR like" parser that would *not* have
this property.  Needless to say, my fear is that my "simplified data
structures" are not sufficient for all grammars that byacc might
generate. 

Jim Roskind-   Author of a YACCable C++ grammar
Independent Consultant
(407)729-4348 or (617)290-4990 x5570
jar@hq.ileaf.com or ...!uunet!leafusa!jar
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

chris@crackers.clearpoint.com (Chris Clark) (04/21/91)

In Jim Roskind's previous posting, he describes a method for working back
through a grammar to determine the source of a conflict and an interesting
observation about the yacc states.

His interesting observation is the fact that a given state is always
entered on a transition on the same symbol.  This means you'll never see a
state like the one below in a yacc verbose output.

state 727
	parened_type : '(' VOID . ')'
	parened_type : '(' INT . ')'

Several authors have offered proofs that this must be true.  Jim's own
proof brings up the key reason, which involves the meaning of a production
and an item.  Peter Garst properly labels this a tail property.  

Here is yet another explanation of why the property holds and a way of
solving it by directly translating regular expression grammars.

In most yacc variants, each unique right-hand-side is considered a unique
production and is given a production number.  Each symbol within a unique
right-hand-side is given a number within the production.  An item is
simply an ordered pair (symbol number, production number).  A yacc state
is simply a unique collection of items.  However, because two unique
right-hand-size must have unique production numbers and thus unique items,
the two states cannot be merged by the (LALR) algorithm.

Fortunately, the "almost parsers" Jim wants are "easy" to create just by
changing the definition of an item.

For example, in Compiler Resources' Yacc++, we use regular expressions to
solve the same problem.  Because of our direct translation algorithm, you
will often see states like the one below.

state 727
	parened_type : '('  (VOID | INT)  . ')'

This state will be reached on transitions of both symbols VOID and INT.

We cannot quantify the number of states saved, except that it's a lot,
perhaps as much as 50% for some grammars.  The exact figures depend on how
many productions have suffixes in common.  A few productions with long
common tails can truly increase the percentage.  Other factors also come
into play, so your mileage may vary.

By the way, Jim's backtracking algorithm mirrors David Spector's algorithm
for splitting LALR states into LR ones--it was published in SIGPLAN
notices, Dec 1988, v 23 # 12, for those of you who are interested.  By
walking back only to the merged states, one can perform LR disambiguation.
By walking back to the start state, one can determine the set of ambiguous
prefixes, which is what Jim wants.

Regards,
Chris Clark & Barbara Zino
(508) 435-5016
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

jml@wally.altair.fr (Jean-Marie [John] Larcheveque) (04/22/91)

In article <9104210204.AA07587@florida.HQ.Ileaf.COM>
jar@florida.HQ.Ileaf.COM (Jim Roskind x5570) observes that, from the
verbose output of yacc, one can infer that a given parse state is always
reached through the same terminal/nonterminal transition, and asks if this
property is true of all LALR parsers.

The construction algorithm given in the Dragon book for SLR tables will
always yield parsers with this property. Indeed, a new state is
constructed from existing ones by picking a state and a transition symbol
and deriving the kernel items of the new state by moving the dot past this
symbol. So all the kernel items of a given state have the same symbol to
the left of the dot.  Now, since LALR construction does not involve
merging states with different cores (i.e. sets of LR(0) items), the same
property holds for LALR. The problem is, however, that some "LALR" parser
generators might carry out further mergers. But I think that, beyond LALR,
you can still merge parts of tables, but not whole states, because no two
states will have exactly the same actions and transitions. You do find
aggressive optimizations after which it is difficult to recognize the
original LALR states, as in the Syntax compiler compiler, described in
Boullier's thesis (Universite d'Orleans, France), but this is hardly
likely to happen in the various implementations of yacc.

--
Jean-Marie (John) Larcheveque  
<jml@bdblues.altair.fr> or <jml@nuri.inria.fr>
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

mauney@eos.ncsu.edu (Dr. Jon Mauney) (04/22/91)

> ...  Note that the following is possible (I think):
> 
> state 747
> 	gathered_type : '[' VOID . ']'
>	gathered_type : '(' VOID . ')'

This is not possible using the normal LALR construction, for the
reason you mentioned: transition into a state is due to shifting
of a single symbol. Therefore all configurations must have the
same symbol to the right of the dot.  The same must be true of
the state previous to that:  The predecessor of the given state 747
would have had to contain
 	gathered_type : '[' . VOID  ']'
	gathered_type : '(' . VOID  ')'
which has been shown to be impossible.

The only way that items can differ in the left context is
for one to be contained in the other.  For example,
given the grammar
  A --> x B
  A --> x y z
  B --> y w
the CFSM wil l contain the state
  A --> x y . z
  B --> y . w

--
Jon Mauney,  parsing partisan
Computer Science Dept.
N.C. State University
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

bliss@sp64.csrd.uiuc.edu (Brian Bliss) (04/23/91)

jar@florida.HQ.Ileaf.COM (Jim Roskind x5570) write:
>... all shifts/goto's into state k are made on the same
>terminal/nonterminal. 

Not only that, but one may deduce information about tokens to the
left of the previous one also.

If we are in state N with input token t, and the action is to reduce on
lookahead token t, by a rule Y -> X1 X2 X3 ... Xn, then obviously the last
n tokens are the same for all items in state N.  Likewise, for all states
with a transition to state N (whose are would be labelled Xn), the last
n-1 tokens are the same in all items in all such states.  (By tokens, I
mean terminals or nonterminals, grammar symbols in general.)

This means one never needs to keep a stack of grammar symbols as described
in the Dragon Book (well, their algorithm interleaves the grammar symbols
with the states, i.e. every other stack element is a grammar symbol), it
is sufficient to keep only a stack of states, the grammar symbol stack may
be uniquely derived from there.

bb 
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

boyland@aspen.berkeley.edu (John Boyland) (04/24/91)

In article <9104222013.AA09131@florida.HQ.Ileaf.COM>, jar@florida.HQ.Ileaf.COM (Jim Roskind x5570) writes:
|> [...] A second
|> example would would be (again verifiable by the parser generator) when a
|> "grammar preprocessor" back-substituted some rules and in doing so,
|> replicated the code actions.
|> [...]
|> 
|> Jim Roskind-   Author of a YACCable C++ grammar
|> jar@hq.ileaf.com or ...!uunet!leafusa!jar

Unfortunately, grammar preprocessors (such as one I've developed), often
have subtly different actions:

For example:

stmtlist : stmtlist ';' ? ';;' stmtlist
	{ $$ = sequence($stmtlist'1,$stmtlist'2); }

gets expanded into:

stmtlist : stmtlist ';' DOUBLESEMI stmtlist
	{ $$ = sequence($1,$4); }
         | stmtlist DOUBLESEMI stmtlist
	{ $$ = sequence($1,$3); }

(Duplicating rather than having a special optional-semicolon rule more often results in an LALR(1) grammar).

Notice how the actions differ in where arguments are found.

John Boyland
boyland@sequoia.Berkeley.EDU
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.