jar@florida.HQ.Ileaf.COM (Jim Roskind x5570) (04/21/91)
I have been spending this weekend hacking byacc to automate the process that I go through to evaluate conflicts. One part of this is walking backwards through a parser table. When I generated backward pointers for each state in the parser, I noticed something rather surprising (at least to me). Basically, for each state k in the parser, there are many states that can transition to k (no surprise), but all such states transition for the same reason (I was surprised), i.e., all shifts/gotos into state k are made on the same terminal/nonterminal. To put it another way, when the parser is sitting in any given state k, the token just to the left of the caret is a function of k only.

When I first coded my data structures, I assumed that for each state k, I would need a list of prior states, and I also wanted to save the token name that was used to transition from each "prior state". When all was said and done, I found that it was sufficient to save a "list of prior states" for each state, as well as a single "prior token" for each state.

If I still haven't made myself clear, the following is an example of something that will *never* be found in the verbose output from yacc:

	state 727
		parened_type : '(' VOID . ')'
		parened_type : '(' INT . ')'

Basically, if this were ever seen, it would mean that in state 727, it was possible to have two distinct tokens to the immediate left of the caret. Note that the following is possible (I think):

	state 747
		gathered_type : '[' VOID . ']'
		gathered_type : '(' VOID . ')'

as it is certainly *not* the case that the state defines the entire left context of the caret. One corollary (via the pigeonhole principle) of this observation is that there must be more states in such a parser than there are terminals plus nonterminals.

Assuming that I have explained myself, the question is: Is this property true of all LALR parsers?
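[Moderator's aside: the "simplified data structures" described above can be sketched directly. This is a minimal illustration in Python; the function name and the toy transition table are the editor's, not byacc's.]

```python
# Sketch of per-state backward links for walking a parser table in
# reverse. Because every LR state has a unique "accessing symbol"
# (the same symbol labels every transition into it), a single
# "prior token" per state suffices alongside the prior-state list.

def build_backlinks(transitions):
    """transitions: dict mapping (state, symbol) -> next_state.
    Returns (predecessors, accessing_symbol), where predecessors[k]
    is the list of states with an edge into k, and accessing_symbol[k]
    is the single symbol labeling all of those edges."""
    predecessors = {}
    accessing_symbol = {}
    for (state, symbol), nxt in transitions.items():
        predecessors.setdefault(nxt, []).append(state)
        # The property under discussion: any symbol previously
        # recorded for nxt must agree with this one.
        assert accessing_symbol.setdefault(nxt, symbol) == symbol
    return predecessors, accessing_symbol

# Toy table: state 0 shifts '(' to state 1, which shifts VOID to 2.
trans = {(0, "'('"): 1, (1, "VOID"): 2}
preds, sym = build_backlinks(trans)
```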
It was very conceivable that I could hand code a table-driven "LR like" parser that would *not* have this property. Needless to say, my fear is that my "simplified data structures" are not sufficient for all grammars that byacc might generate.

Jim Roskind - Author of a YACCable C++ grammar
Independent Consultant
(407)729-4348 or (617)290-4990 x5570
jar@hq.ileaf.com or ...!uunet!leafusa!jar
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.
chris@crackers.clearpoint.com (Chris Clark) (04/21/91)
In Jim Roskind's previous posting, he describes a method for working back through a grammar to determine the source of a conflict, and an interesting observation about the yacc states. His observation is that a given state is always entered on a transition on the same symbol. This means you'll never see a state like the one below in yacc's verbose output:

	state 727
		parened_type : '(' VOID . ')'
		parened_type : '(' INT . ')'

Several authors have offered proofs that this must be true. Jim's own proof brings up the key reason, which involves the meaning of a production and an item. Peter Garst properly labels this a tail property. Here is yet another explanation of why the property holds, and a way around it by directly translating regular expression grammars.

In most yacc variants, each unique right-hand side is considered a unique production and is given a production number. Each symbol within a unique right-hand side is given a number within the production. An item is simply an ordered pair (symbol number, production number). A yacc state is simply a unique collection of items. However, because two unique right-hand sides must have unique production numbers, and thus unique items, the two states cannot be merged by the (LALR) algorithm.

Fortunately, the "almost parsers" Jim wants are "easy" to create just by changing the definition of an item. For example, in Compiler Resources' Yacc++, we use regular expressions to solve the same problem. Because of our direct translation algorithm, you will often see states like the one below:

	state 727
		parened_type : '(' (VOID | INT) . ')'

This state will be reached on transitions on both symbols VOID and INT. We cannot quantify the number of states saved, except that it's a lot, perhaps as much as 50% for some grammars. The exact figures depend on how many productions have suffixes in common. A few productions with long common tails can greatly increase the percentage.
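[Moderator's aside: the definition of an item above can be made concrete. A minimal sketch in Python; the representation and names are the editor's, with the item pair written as (production number, dot position).]

```python
# An item as a (production_number, dot_position) pair, a state as a
# frozenset of items. Two productions with identical tails still
# yield distinct items, so the classic LALR algorithm cannot merge
# the corresponding states.

productions = {
    0: ("parened_type", ["'('", "VOID", "')'"]),
    1: ("parened_type", ["'('", "INT", "')'"]),
}

# Items with the dot after the second symbol:
item_void = (0, 2)   # parened_type : '(' VOID . ')'
item_int  = (1, 2)   # parened_type : '(' INT . ')'

state_a = frozenset([item_void])
state_b = frozenset([item_int])
# Distinct production numbers => distinct items => distinct states,
# even though both productions end with the same ')' tail.
```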
Other factors also come into play, so your mileage may vary.

By the way, Jim's backtracking algorithm mirrors David Spector's algorithm for splitting LALR states into LR ones; it was published in SIGPLAN Notices, Dec 1988, v 23 #12, for those of you who are interested. By walking back only to the merged states, one can perform LR disambiguation. By walking back to the start state, one can determine the set of ambiguous prefixes, which is what Jim wants.

Regards,
Chris Clark & Barbara Zino
(508) 435-5016
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.
jml@wally.altair.fr (Jean-Marie [John] Larcheveque) (04/22/91)
In article <9104210204.AA07587@florida.HQ.Ileaf.COM>
jar@florida.HQ.Ileaf.COM (Jim Roskind x5570) observes that, from the
verbose output of yacc, one can infer that a given parse state is always
reached through the same terminal/nonterminal transition, and asks if this
property is true of all LALR parsers.
The construction algorithm given in the Dragon book for SLR tables will
always yield parsers with this property. Indeed, a new state is
constructed from existing ones by picking a state and a transition symbol
and deriving the kernel items of the new state by moving the dot past this
symbol. So all the kernel items of a given state have the same symbol to
the left of the dot. Now, since LALR construction does not involve
merging states with different cores (i.e. sets of LR(0) items), the same
property holds for LALR. The problem is, however, that some "LALR" parser
generators might carry out further mergers. But I think that, beyond LALR,
you can still merge parts of tables, but not whole states, because no two
states will have exactly the same actions and transitions. You do find
aggressive optimizations after which it is difficult to recognize the
original LALR states, as in the Syntax compiler compiler, described in
Boullier's thesis (Universite d'Orleans, France), but this is hardly
likely to happen in the various implementations of yacc.
--
Jean-Marie (John) Larcheveque
<jml@bdblues.altair.fr> or <jml@nuri.inria.fr>
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
mauney@eos.ncsu.edu (Dr. Jon Mauney) (04/22/91)
> ... Note that the following is possible (I think):
>
>	state 747
>		gathered_type : '[' VOID . ']'
>		gathered_type : '(' VOID . ')'

This is not possible using the normal LALR construction, for the reason you mentioned: transition into a state is due to shifting of a single symbol, therefore all configurations must have the same symbol to the left of the dot. The same must be true of the previous state: the predecessor of the given state 747 would have had to contain

	gathered_type : '[' . VOID ']'
	gathered_type : '(' . VOID ')'

which has been shown to be impossible. The only way that items can differ in the left context is for one to be contained in the other. For example, given the grammar

	A --> x B
	A --> x y z
	B --> y w

the CFSM will contain the state

	A --> x y . z
	B --> y . w

--
Jon Mauney, parsing partisan
Computer Science Dept.
N.C. State University
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.
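[Moderator's aside: the construction above can be illustrated with a toy LR(0) goto/closure over Jon's example grammar. This is a sketch of the standard Dragon-book item-set construction, not any particular generator's code.]

```python
# An item is (production_number, dot_position); goto advances the
# dot past one symbol, so every kernel item of the resulting state
# has that symbol immediately to the left of the dot.

productions = {
    0: ("A", ["x", "B"]),
    1: ("A", ["x", "y", "z"]),
    2: ("B", ["y", "w"]),
}

def goto(items, symbol):
    return frozenset(
        (p, dot + 1)
        for (p, dot) in items
        if dot < len(productions[p][1]) and productions[p][1][dot] == symbol
    )

def closure(items):
    # Add (q, 0) for every production q of a nonterminal at a dot.
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (p, dot) in list(result):
            rhs = productions[p][1]
            if dot < len(rhs):
                for q, (lhs, _) in productions.items():
                    if lhs == rhs[dot] and (q, 0) not in result:
                        result.add((q, 0))
                        changed = True
    return frozenset(result)

# Shift 'x' from the start kernel, then 'y':
after_x = closure(goto(closure({(0, 0), (1, 0)}), "x"))
after_xy = goto(after_x, "y")
# after_xy is Jon's state:  A --> x y . z   and   B --> y . w
```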
bliss@sp64.csrd.uiuc.edu (Brian Bliss) (04/23/91)
jar@florida.HQ.Ileaf.COM (Jim Roskind x5570) writes:

>... all shifts/goto's into state k are made on the same
>terminal/nonterminal.

Not only that, but one may deduce information about the symbols farther to the left as well. If we are in state N, and the action on lookahead token t is to reduce by a rule Y -> X1 X2 X3 ... Xn, then obviously the last n symbols are the same for all items in state N. Likewise, for all states with a transition to state N (whose arc would be labelled Xn), the last n-1 symbols are the same in all items in all such states. (By tokens, I mean terminals or nonterminals, grammar symbols in general.)

This means one never needs to keep a stack of grammar symbols as described in the Dragon Book (well, their algorithm interleaves the grammar symbols with the states, i.e. every other stack element is a grammar symbol); it is sufficient to keep only a stack of states, since the grammar symbol stack can be uniquely derived from it.

bb
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.
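[Moderator's aside: the point that the grammar-symbol stack is redundant given the state stack can be sketched as follows. A toy illustration with a hypothetical accessing-symbol table, not the Dragon Book's actual algorithm.]

```python
# Each state's unique "accessing symbol" (the symbol labeling every
# transition into it) lets us recover the grammar-symbol stack from
# the state stack alone.

def symbol_stack(state_stack, accessing_symbol):
    # Skip the start state: it was not entered on any symbol.
    return [accessing_symbol[s] for s in state_stack[1:]]

# Toy run: states 1, 2, 3 were entered on '(', VOID, ')' in turn.
acc = {1: "'('", 2: "VOID", 3: "')'"}
stack = [0, 1, 2, 3]
symbols = symbol_stack(stack, acc)   # ["'('", "VOID", "')'"]
```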
boyland@aspen.berkeley.edu (John Boyland) (04/24/91)
In article <9104222013.AA09131@florida.HQ.Ileaf.COM>, jar@florida.HQ.Ileaf.COM (Jim Roskind x5570) writes:

|> [...] A second
|> example would be (again verifiable by the parser generator) when a
|> "grammar preprocessor" back-substituted some rules and in doing so,
|> replicated the code actions.
|> [...]
|>
|> Jim Roskind- Author of a YACCable C++ grammar
|> jar@hq.ileaf.com or ...!uunet!leafusa!jar

Unfortunately, grammar preprocessors (such as one I've developed) often have subtly different actions. For example:

	stmtlist : stmtlist ';' ? ';;' stmtlist
		{ $$ = sequence($stmtlist'1,$stmtlist'2); }

gets expanded into:

	stmtlist : stmtlist ';' DOUBLESEMI stmtlist
		{ $$ = sequence($1,$4); }
	| stmtlist DOUBLESEMI stmtlist
		{ $$ = sequence($1,$3); }

(Duplicating the rule, rather than having a special optional-semicolon rule, more often results in an LALR(1) grammar.) Notice how the actions differ in where their arguments are found.

John Boyland
boyland@sequoia.Berkeley.EDU
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.