[comp.compilers] Generating LR parsers from EBNF?

S.R.Adams@ecs.southampton.ac.uk (Stephen Adams) (05/08/91)

A recent posting (sorry, I can't remember exactly which) mentioned that
someone's parser generator used an extended BNF notation directly for
generating the LR automaton.  I had always been led to believe that an
EBNF grammar is first translated into a context free grammar (CFG) and
then that is used for parse table generation.

I thought about the idea and tried a few tiny examples by hand.  The main
trick is to reason about the placing of the `dot' in the EBNF items.
Shifting the dot in an item may produce more than one derived item.  For
example, consider the simple grammar for nested parenthesized lists of
numbers:

	s -> ( {s} )
	s -> number

`{s}' means 0 or more s's.  This grammar might generate `5' or
`(1 2 (3 4) 5)'.

The item `s -> . ( {s} )' shifts over `(' to two items, `s -> ( { . s} )' and
`s -> ( {s} . )'.  The state containing these two items would usually be two
states in a CFG-generated automaton.
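
To make the derived items concrete, here is a small sketch of the
dot-shifting (the representation is my own, and `rep' covers only the
simple {x} case needed here):

    # An RHS is a list of elements: ('sym', x) is a single symbol and
    # ('rep', x) is a repetition {x}.  Dot position i means "dot before
    # element i"; position len(rhs) means "dot at the end" (reduce).

    def closure(rhs, positions):
        # A dot before {x} also stands for the position after {x},
        # since the repetition may match zero occurrences.
        out, work = set(positions), list(positions)
        while work:
            i = work.pop()
            if i < len(rhs) and rhs[i][0] == 'rep' and i + 1 not in out:
                out.add(i + 1)
                work.append(i + 1)
        return out

    def shift(rhs, positions, symbol):
        # Move the dot over `symbol' from every position in the set; a
        # 'rep' element loops back to itself.
        out = set()
        for i in positions:
            if i < len(rhs) and rhs[i][1] == symbol:
                out |= closure(rhs, {i} if rhs[i][0] == 'rep' else {i + 1})
        return out

    rhs = [('sym', '('), ('rep', 's'), ('sym', ')')]    # s -> ( {s} )
    start = closure(rhs, {0})
    print(shift(rhs, start, '('))   # {1, 2}: the dot both inside the
                                    # braces and after them, as above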

The parse tables came out smaller.  The EBNF-generated table was smaller
because some slots in the CFG-generated table were replaced with
sequences of operations; for example, a reduction of a null production
might be replaced by a compound `reduce and goto' operation.  After these
changes some states and some columns of the goto table are never used and
may be removed.
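
As a hypothetical before/after of one table slot (the state numbers and
the production are invented for illustration):

    # CFG-generated tables: reduce the null production, then consult a
    # separate goto column for the auxiliary nonterminal L.
    ACTION_CFG = {(3, ')'): ('reduce', 'L ->')}
    GOTO_CFG   = {(3, 'L'): 7}

    # EBNF-generated table: one fused slot.  The goto column for L, and
    # any state reachable only through it, becomes dead and can be
    # dropped.
    ACTION_EBNF = {(3, ')'): ('reduce_goto', 'L ->', 7)}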

I have looked in several books and a couple of conference proceedings but
I have failed to find any references on generating LR parse tables
directly from an EBNF grammar.  What I would like to know is:

  1.  Are there any references?

  2.  Is the EBNF approach equivalent to the CFG one?  Is
      the difference in the tables always due to a small
      set of `optimizations'?

  3.  Which is faster?  The EBNF item sets are smaller but
      more complex to handle.  Using the EBNF approach seems
      to reduce the need for `optimizing' the generated
      table.

  4.  I only generated the LR(0) automaton by hand.  I have
      not thought about `higher' grammar categories like
      LALR(1), LR(2), regular-LR etc.  Are these kinds
      easily generated using the EBNF method?


Stephen Adams                        S.R.Adams@ecs.soton.ac.uk
Computer Science                     S.R.Adams@sot-ecs.uucp
Southampton University
Southampton SO9 5NH, UK

grass@ulysses.att.com (Judith Grass) (05/09/91)

This is a longish response containing a BibTeX list of references...

>> I have looked in several books and a couple of conference proceedings but
>> I have failed to find any references on generating LR parse tables
>> directly from an EBNF grammar.  What I would like to know is:

Research on this goes back to the mid-70's, but you'll find it mainly in
Canadian and European sources.  It seems right now that there isn't a lot
being published about parser generators and parser theory, as supposedly
there isn't anything interesting to say anymore.

In general, a lot of theoretical work has been done on the direct
generation of ELALR parsers from regular right part grammars (EBNF), but
there are very few finished, working examples.  Two I know of: my
own "Ryacc" parser generator (a prototype) and Compiler Resources, Inc.'s
Yacc++ (a product).

>>	  1.  Are there any references?

Tons.  Try these:

@article {LA77,
    author = "W. R. {LaLonde}",
    title = "Regular Right Part Grammars and Their Parsers",
    journal = "Communications of the ACM",
    volume = 20,
    number = 10,
    pages = "731--741",
    month = oct,
    year = 1977}

@article {LA79,
    author = "W. R. {LaLonde}",
    title = "Constructing {LR} Parsers for Regular Right Part Grammars",
    journal = "Acta Informatica",
    volume = 11,
    pages = "177--193",
    year = 1979}

@article {PB,
    author = "P. W. {Purdom, Jr.} and C. A. Brown",
    title = "Parsing Extended {LR}(k) Grammars",
    journal = "Acta Informatica",
    volume = 15,
    pages = "115--127",
    year = 1981}

@article {CH,
    author = "N. P. Chapman",
    title = "{LALR}(1,1) Parser Generation for Regular Right Part Grammars",
    journal = "Acta Informatica",
    volume = 21,
    pages = "29--45",
    year = 1984}

@article {HEI,
    author = "S. Heilbrunner",
    title = "On the Definition of {ELR}(k) and {ELL}(k) Grammars",
    journal = "Acta Informatica",
    volume = 11,
    pages = "169--176",
    year = 1979}

@article {HEC,
    author = "R. Heckmann",
    title = "An Efficient {ELL}(1) Parser Generator",
    journal = "Acta Informatica",
    volume = 23,
    pages = "127--148",
    year = 1986}

@article {NS86,
    author = "I. Nakata and M. Sassa",
    title = "Generation of Efficient {LALR} Parsers for Regular Right Part Grammars",
    journal = "Acta Informatica",
    volume = 23,
    pages = "149--162",
    year = 1986}

@article {NS87,
    author = "I. Nakata and M. Sassa",
    title = "A Simple Realization of {LR}-Parsers for Regular Right Part Grammars",
    journal = "Information Processing Letters",
    volume = 24,
    pages = "113--120",
    year = 1987}

@inproceedings {GrKiSe,
    author = "J. E. Grass and Chandra Kintala and Ravi Sethi",
    title = "Object-oriented Redesign using {C++}",
    booktitle = "Usenix {C++} Conference Proceedings",
    pages = "75--86",
    address = "San Francisco, CA",
    month = apr,
    year = 1990}

@unpublished {GR90.pub,
    author = "J. E. Grass",
    title = "A User's Guide to {Ryacc} (version 0.0): A Parser Generator for Regular Right Part Grammars",
    institution = BLTM,
    note = "Available from author",
    year = 1990}


>>	  2.  Is the EBNF approach equivalent to the CFG one?  Is
>>	      the difference in the tables always due to a small
>>	      set of `optimizations'?

Since you can always rewrite an EBNF grammar as a BNF grammar, the
languages described are the same set.  Not all EBNF grammars will
directly yield an ELALR parser (or ELR or ELL), because of certain kinds
of non-determinism that the notation allows.  I am not sure that this is
the answer to the question you are asking, but...
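
For concreteness, the standard rewrite of Stephen's example introduces
an auxiliary nonterminal for the repetition (the name `slist' is mine):

	s     -> ( slist )
	s     -> number
	slist ->                  (empty)
	slist -> slist s

The CFG-based generator works from these four productions, which is
where the extra states and the null and unit reductions come from.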

>>	  3.  Which is faster?  The EBNF item sets are smaller but
>>	      more complex to handle.  Using the EBNF approach seems
>>	      to reduce the need for `optimizing' the generated
>>	      table.

In my experience, for the same language, the ELALR parser tends to be
about 1/3 faster, simply because there are fewer states in the parser
and fewer "artificial" productions.  The tables are also much smaller.
They can still be packed and optimized, but the techniques for doing so
have to be a bit different.

>>	  4.  I only generated the LR(0) automaton by hand.  I have
>>	      not thought about `higher' grammar categories like
>>	      LALR(1), LR(2), regular-LR etc.  Are these kinds
>>	      easily generated using the EBNF method?

ELALR(1), ELR(1), and ELL(1) are all known techniques.  My guess is that
longer lookaheads would be no problem, although I am not aware of anyone
who has actually tried them.  The techniques should be about the same as
those used for LR(2) and the like.

		-- Judy Grass, AT&T Bell Labs, Murray Hill
			grass@ulysses.att.com

KARSTEN@tfl.dk (Karsten Nyblad, TFL, Denmark) (05/10/91)

In article <29305.9105072033@pessoa.ecs.soton.ac.uk>, S.R.Adams@ecs.southampton.ac.uk (Stephen Adams) writes:
> [re generating parsing tables directly from EBNF]
> I thought about the idea and tried a few tiny examples by hand.  The main
> trick is to reason about the placing of the `dot' in the EBNF items.
> Shifting the dot in an item may produce more than one derived item.  For
> example, consider the simple grammar for nested parenthesized lists of
> numbers:
> 
> 	s -> ( {s} )
> 	s -> number
> 
> `{s}' means 0 or more s's.  This grammar might generate `5' or
> `(1 2 (3 4) 5)'.
> 
> The item `s -> . ( {s} )' shifts over `(' to two items, `s -> ( { . s} )' and
> `s -> ( {s} . )'.  The state containing these two items would usually be two
> states in a CFG-generated automaton.
> 
> The parse tables came out smaller.  The EBNF-generated table was smaller
> because some slots in the CFG-generated table were replaced with
> sequences of operations; for example, a reduction of a null production
> might be replaced by a compound `reduce and goto' operation.  After these
> changes some states and some columns of the goto table are never used and
> may be removed.
> 
> I have looked in several books and a couple of conference proceedings but
> I have failed to find any references on generating LR parse tables
> directly from an EBNF grammar.  What I would like to know is:
> 
>   1.  Are there any references?
 
Acta Informatica 7, 61-73 (1976)
O. L. Madsen and B. B. Kristensen:  LR-Parsing of Extended Context Free
Grammars.

Acta Informatica 11, 177-193 (1979)
Wilf R. LaLonde:  Constructing LR Parsers for Regular Right Part Grammars.

Communications of the ACM 20, Number 10 (1977)
Wilf R. LaLonde:  Regular Right Part Grammars and Their Parsers.

Acta Informatica 21, 29-45 (1984)
N. P. Chapman:  LALR(1,1) Parser Generation for Regular Right Part Grammars.

Acta Informatica 23, 149-162 (1986)
Ikuo Nakata and Masataka Sassa:  Generation of Efficient LALR Parsers for
Regular Right Part Grammars.
(Take care:  Their algorithm can't handle the following
grammar:
     A -> a a
     A -> a A
There is a bug in their optimizing algorithm too.  It can't handle the
grammar of their example.  Otherwise it is the most advanced algorithm.)

It is possible to fix both bugs.  I am currently writing a parser
generator that does so.

>   2.  Is the EBNF approach equivalent to the CFG one?  Is the difference in
>	the tables always due to a small set of `optimizations'?
> 
>   3.  Which is faster?  The EBNF item sets are smaller but
>       more complex to handle.  Using the EBNF approach seems
>       to reduce the need for `optimizing' the generated table.

It is hard to tell.  I can confirm that the number of states is
smaller.  My parser generator generates 592 states for an Ada grammar
that was typed in directly from the standard and that does not handle
pragmas.  I prefer generators that let users write their grammars as
they like, without loss of performance.  To obtain that, you will still
need to remove unit reductions and to fold stackings that are always
followed by a reduction into a single operation.  Removing unit
reductions is harder than in normal LR parsing, because you will need
to split states that recognize repetition of a symbol in order to
expose the unit productions.  For example, consider the production:

   expression -> term { or term }

When this production is reduced, the length of the right-hand side is
frequently 1, and depending on the algorithm you use for generating
the LR(0) automaton you may need to split states.  I do.
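
To illustrate (the trace is mine): the same reduction pops a different
number of stack entries depending on how many times the repetition
matched, and the pop-1 case is exactly a unit reduction:

	stack: ... term            reduce expression -> term { or term }   pops 1
	stack: ... term or term    the same reduction                      pops 3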

I don't know whether anybody has ever completed writing such a
generator.  My generator itself is rather fast.  When generating the
Ada parser it takes 20 seconds from start until the lookahead sets are
calculated, running on a VAX with 2.8 times the power of a VAX 780 and
with optimization disabled when compiling the parser generator.  At the
moment it can do nothing beyond generating the lookahead sets, and that
takes 5700 lines of C code.  I think I will need another 2000-3000
lines of code before having a parser generator without error recovery
and without any facilities for handling attributes.

>   4.  I only generated the LR(0) automaton by hand.  I have not thought
>	about `higher' grammar categories like LALR(1), LR(2), regular-LR etc.
>	Are these kinds easily generated using the EBNF method?
 
Well, generating the basic LR(0) automaton is easy.  You generate a DFA
for each right-hand side of the grammar.  You use the states of the
DFAs in place of items, i.e., each DFA state represents one or more
positions of the dot in the original regular expression.  Now it is
easy to generate the LR(0) automaton.  Calculating the lookahead sets
is as easy as with the normal LALR algorithm.
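
A compressed sketch of that construction, applied to Stephen's grammar
(the encoding of the DFAs and all names are mine):

    # Each right-hand side becomes a DFA over grammar symbols; one DFA
    # state may stand for several dot positions (state 1 of p1 covers
    # both `( { . s} )' and `( {s} . )').  LR(0) states are then sets
    # of (production, dfa_state) pairs built by the usual closure/goto.

    GRAMMAR = {
        # name: (lhs, transitions, accepting DFA states)
        'p1': ('s', {(0, '('): 1, (1, 's'): 1, (1, ')'): 2}, {2}),  # s -> ( {s} )
        'p2': ('s', {(0, 'number'): 1}, {1}),                       # s -> number
    }
    NONTERMS = {lhs for (lhs, _, _) in GRAMMAR.values()}
    SYMBOLS = {sym for (_, d, _) in GRAMMAR.values() for (_, sym) in d}

    def closure(items):
        items, work = set(items), list(items)
        while work:
            prod, st = work.pop()
            for (s, sym) in GRAMMAR[prod][1]:
                if s == st and sym in NONTERMS:
                    for p, (lhs, _, _) in GRAMMAR.items():
                        if lhs == sym and (p, 0) not in items:
                            items.add((p, 0))
                            work.append((p, 0))
        return frozenset(items)

    def goto(items, sym):
        moved = {(p, GRAMMAR[p][1][(s, sym)])
                 for (p, s) in items if (s, sym) in GRAMMAR[p][1]}
        return closure(moved) if moved else None

    states = [closure({('p1', 0), ('p2', 0)})]
    for state in states:                       # the list grows as we go
        for sym in SYMBOLS:
            nxt = goto(state, sym)
            if nxt is not None and nxt not in states:
                states.append(nxt)
    print(len(states), "LR(0) states")         # 4 for this grammar
    # An item whose dfa_state is accepting calls for a reduce by that
    # production; lookahead computation then proceeds as in plain LALR.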

You can also generate the LR(0) automaton as you describe, Stephen.
Madsen's and Kristensen's paper is about that.  They put some
restrictions on the regular expressions.  I do not remember why, but it
has the advantage that there are no ambiguities in how to build the
parse trees recognized by their parsers.

The real problem is deciding how much to pop from the stack when
reducing a production.  Chapman uses DFAs that recognize the reversed
sequence of symbols of the right-hand side, or rather recognize not the
symbols but the states representing them on the parser stack.  When the
parser reduces, it starts by recognizing the right-hand part from the
top of the parser stack downward.  Then the recognized part of the
parser stack is replaced by the state representing the nonterminal of
the reduced production.
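
A rough sketch of the read-back step (my rendering; for readability the
DFA here reads grammar symbols rather than parser states, and stopping
at the first accepting state glosses over exactly the part that
Chapman's construction makes precise):

    # Read the parser stack from the top down with a DFA built from the
    # reversed right-hand side; stop in an accepting state at the left
    # end of the handle and report how many entries to pop.

    def handle_length(stack, delta, accepting):
        dfa_state, popped = 0, 0
        for entry in reversed(stack):
            dfa_state = delta.get((dfa_state, entry))
            if dfa_state is None:
                break
            popped += 1
            if dfa_state in accepting:
                return popped          # simplification: first accept wins
        raise ValueError("no handle on top of the stack")

    # For A -> b {c} the reversed right-hand side is c* b:
    delta = {(0, 'c'): 0, (0, 'b'): 1}
    print(handle_length(['x', 'b', 'c', 'c'], delta, {1}))   # 3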

Nakata and Sassa prove a theorem saying that when reducing, the state
in which to restart the parser will always be the topmost of a set of
possible states on the parser stack.  As I wrote previously, there is a
bug in the proof, but after correcting it you get a parser that is
faster than Chapman's, because you save recognizing the right-hand side
a second time.
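
In other words (a minimal sketch of my reading of their result; the
candidate set would be precomputed by the generator):

    def restart_state(stack, candidates):
        # The topmost stack entry that lies in the precomputed set of
        # possible restart states; no second scan over the handle.
        for st in reversed(stack):
            if st in candidates:
                return st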

Karsten Nyblad
TFL, A Danish Telecommunication Research Laboratory
E-mail: karsten@tfl.dk