S.R.Adams@ecs.southampton.ac.uk (Stephen Adams) (05/08/91)
A recent posting (sorry, I can't remember exactly which) mentioned that someone's parser generator used an extended BNF notation directly for generating the LR automaton. I had always been lead to believe that an EBNF grammar is first translated into a context free grammar (CFG) and then that is used for parse table generation. I thought about the idea and tried a few tiny examples by hand. The main trick is to reason about the placing of the `dot' in the EBNF items. Shifting the dot in an item may produce more than one derived item. For example, consider the simple grammar for nested parenthesized lists of numbers: s -> ( {s} ) s -> number `{s}' menas 0 or more s's. This grammar might generate `5' or `(1 2 (3 4) 5)'. The item `s -> . ( {s} )' shifted to two items, `s -> ( { . s} )' and `s -> ( {s} . )' The state containing these two items would usually be two states in a CFG generated automaton. The parse tables came out smaller. The EBNF generated table was smaller because some table slots in the CFG generated table were being replaced with sequences of operations, for example a reduction of a null production might be replaced by a compound operation `reduce and goto'. After these changes some states and some columns of the goto table are never used and may be removed. I have looked in several books and a couple of conference proceedings but I have failed to find any references on generating LR parse tables directly from an EBNF grammar. What I would like to know is: 1. Are there any references? 2. Is the EBNF approach equivalent to the CFG one? Is the difference in the tables always a due to small set of `optimizations'? 3. Which is faster? The EBNF item sets are smaller but more complex to handle. Using the EBNF approach seems to reduce the need for `optimizing' the generated table. 4. I only generated the LR(0) automaton by hand. I have not thought about `higher' grammar categories like LALR(1), LR(2), regular-LR etc. Are these kinds easily generated using the EBNF method? Stephen Adams S.R.Adams@ecs.soton.ac.uk Computer Science S.R.Adams@sot-ecs.uucp Southampton University Southampton SO9 5NH, UK -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
grass@ulysses.att.com (Judith Grass) (05/09/91)
This is a longish response containing a bibtex list of references... >> I have looked in several books and a couple of conference proceedings but >> I have failed to find any references on generating LR parse tables >> directly from an EBNF grammar. What I would like to know is: Research on this goes back to the mid-70's, but you'll find it mainly in Canadian and European sources. It seems right now that there isn't a lot being published about parser generators and parser theory, as supposedly there isn't anything interesting to say anymore. In general, a lot of theoretical work has been done on the direct generation of elalr parsers from regular-right part grammars (EBNF), but there are very few finished, working examples of them. Two I know of: my own "Ryacc" parser generator (a prototype) and Compiler Resource, Inc.'s Yacc++ (a product). >> 1. Are there any references? Tons. Try these: @article {LA77, author = "W. R. {LaLonde}", title = "Regular Right part Grammars and Their Parsers", journal = "CACM", volume = 20, number = 10, pages = "731--741", month = oct, year = 1977} @article {LA79, author = "W. R. {LaLonde}", title = "Constructing {LR} Parsers for Regular Right part Grammars", journal = AI, volume = 11, pages = "177--193", year = 1979} @article {PB, author = "P. W. {Purdom, Jr.} and C. A. Brown", title = "Parsing Extended {LR}(k) Grammars", journal = AI, volume = 15, pages = "115--127", year = 1981} @article {CH, author = "N. P. Chapman", title = "{LALR}(1,1) Parser Generation for Regular Right Part Grammars", journal = AI, volume = 21, pages = "29--45", year = 1984} @article {HEI, author = "S. Heilbrunner", title = "On the Definition of {ELR}(k) and {ELL}(k) Grammars", journal = AI, volume = 11, pages = "169--176", year = 1979} @article {HEC, author = "R. Heckmann", title = "An Efficient {ELL}(1) Parser Generator", journal = AI, volume = 23, pages = "127--148", year = 1986} @article {NS86, author = "I. Nakata and M. Sassa", title = "Generation of Efficient {LALR} Parsers for Regular Right Part Grammars", journal = AI, volume = 23, pages = "149--162", year = 1986} @article {NS87, author = "I. Nakata and M. Sassa", title = "A Simple Realization of {LR}-Parsers for Regular Right Part Grammars", journal = IPL, volume = 24, pages = "113--120", year = 1987} @inproceedings {GrKiSe, author = "J. E. Grass, Chandra Kintala and Ravi Sethi", title = "Object-oriented Redesign using {C++}", booktitle = "Usenix {C++} Conference Proceedings", pages = "75--86", address= "San Francisco, CA", month = "April", year = 1990} @unpublished {GR90.pub, author = "J. E. Grass", title = "A User's Guide to {Ryacc} (version 0.0): A Parser Generator for Regular Right Part Grammars", institution = BLTM, note = "Available from author", year = 1990} >> 2. Is the EBNF approach equivalent to the CFG one? Is >> the difference in the tables always a due to small set >> of `optimizations'? Since you can always rewrite an EBNF grammar as a BNF grammar, the languages described are the same set. Not all EBNF grammars will directly yield an ELALR parser (or ELR or ELL) because of some kinds of non-determinism possible in the description. I am not sure that this is the answer to the question you are asking, but... >> 3. Which is faster? The EBNF item sets are smaller but >> more complex to handle. Using the EBNF approach seems >> to reduce the need for `optimizing' the generated >> table. From my experience, for the same language, the ELALR parser tends to be about 1/3 faster, simply because there are fewer states in the parser and fewer "artificial" productions. The tables also are much smaller. The tables still can be packed and optimized, but the techniques for doing so have to be a bit different. >> 4. I only generated the LR(0) automaton by hand. I have >> not thought about `higher' grammar categories like >> LALR(1), LR(2), regular-LR etc. Are these kinds >> easily generated using the EBNF method? ELALR(1), ELR(1), LL(1) are all known techniques. My guess is that doing longer lookaheads is no problem, although I am not aware of anyone that has actually tried it. The techniques should be about the same as is done for LR(2) and such. -- Judy Grass, AT&T Bell Labs, Murray Hill grass@ulysses.att.com -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
KARSTEN@tfl.dk (Karsten Nyblad, TFL, Denmark) (05/10/91)
In article <29305.9105072033@pessoa.ecs.soton.ac.uk>, S.R.Adams@ecs.southampton.ac.uk (Stephen Adams) writes: > [re generating parsing tables directly from EBNF] > I thought about the idea and tried a few tiny examples by hand. The main > trick is to reason about the placing of the `dot' in the EBNF items. > Shifting the dot in an item may produce more than one derived item. For > example, consider the simple grammar for nested parenthesized lists of > numbers: > > s -> ( {s} ) > s -> number > > `{s}' menas 0 or more s's. This grammar might generate `5' or > `(1 2 (3 4) 5)'. > > The item `s -> . ( {s} )' shifted to two items, `s -> ( { . s} )' and > `s -> ( {s} . )' The state containing these two items would usually be two > states in a CFG generated automaton. > > The parse tables came out smaller. The EBNF generated table was smaller > because some table slots in the CFG generated table were being replaced > with sequences of operations, for example a reduction of a null production > might be replaced by a compound operation `reduce and goto'. After these > changes some states and some columns of the goto table are never used and > may be removed. > > I have looked in several books and a couple of conference proceedings but > I have failed to find any references on generating LR parse tables > directly from an EBNF grammar. What I would like to know is: > > 1. Are there any references? Acta Informatica 7, 61-73 (1976) O.L. Madsen and B.B. Kristensen: LR-Parsing of Extended Context Free Grammars. Acta Informatica 11 177-193 (1979) Wilf R. LaLonde: Constructing LR Parsers for Regular Right Part Grammars Communications of the ACM: Volume 20 number 10 Wilf R. LaLonde: Regular Right Part Grammars and Their Parsers Acta Informatica Vol 21 Fasc 1 1984. N.P. Chapman: LALR(1,1) Parser Generation for Regular Right Part Grammars. Acta Informatica Vol 23 149-162 (1986) Ikuo Nakata and Masataka Sassa: Generation of Efficeint LALR Parsers for Regular Right Part Grammars (Take care: Their algorithm can't handle the following grammar: A -> a a A -> a A There is a bug in their optimizing algorithm too. It can't handle the grammar of their example. Otherwise it is the most advanced algorithm.) It is possible to fix both the bugs. I am currently writing a parser generator, which does it. > 2. Is the EBNF approach equivalent to the CFG one? Is the difference in > the tables always a due to small set of `optimizations'? > > 3. Which is faster? The EBNF item sets are smaller but > more complex to handle. Using the EBNF approach seems > to reduce the need for `optimizing' the generated table. It is hard to tell. I can confirm that the number of states is smaller. My parser generator generates 592 states for an ADA grammar, which typed in directly from the standard, and which does not handle pragmas. I prefer generators which let the user write his grammar like he likes, without loss of performance. In order to obtain that, you will still need to remove unit reductions, and make stackings that are always followed by a reduction into one operation. Removing unit reductions is harder than in normal LR parsing because you will need to split states that recognize repeation of a symbol to get the unit productions, i.e., assume the production: expression -> term { or term } When this production is reduced, the length of the right hand side is frequently 1, and depending of the algorithm you use for generating the LR0 automaton you might need splitting states. I do. I don't know whether anybody has ever completed writing a generator. My generator in itself is rather fast. When generating the ADA parser, it uses 20 seconds form start to the lookahead sets are calculated when executed on a VAX with 2.8 the power of the VAX780 and with optimization disabled when compiling the parser generator. I am currently not capable of doing anything but generating lookahead set, and that takes 5700 lines of C code. I think I will need 2-3000 lines of code more before having a parser generator without error recorvery and without any facilities for handling attributes. > 4. I only generated the LR(0) automaton by hand. I have not thought > about `higher' grammar categories like LALR(1), LR(2), regular-LR etc. > Are these kinds easily generated using the EBNF method? Well, generating the basic LR(0) atomaton is easy. You generate a DFA for each right hand side of the grammar. You use the states in the DFAs to substitute items, i.e., each DFA state repesent one or more positions of dots in the original regular expression. Now it is easy to generate the LR(0). Calculating the lookahead sets is as easy as with normal LALR algorithm. You can also generate the LR0 atomaton as you describe, Stephen. Madsen's and Kristensen's paper is about that. They put some restrictions on the regular expressions. I do not remember why, but it has the advantage that there is no ambiguities in how to build the parse trees recognized by their parsers. The real problem is deciding how much to pop from the stack when reducing a production. Chapman uses DFAs that recognize the reverse sequence of symbols of the right hand side, or rather recognize not the symbols but the states representing them on the parser stack. When the parser reduces, it starts by recognizing the right hand part starting from the top of the parser stack. Then the recognized part of the parser stack is substituted by the state representing the nonterminal of the reduced production. Nakata and Sassa prove a theorem, that say that when reducing, the state to restart the parser in, will always be the topmost of a set of possible states (topmost on the parser stack.) As I wrote previously, the there is a bug the proof, but after correcting that you will get a parser that is faster than Chapmans, because you save recognizing the right hand side once more. Karsten Nyblad TFL, A Danish Telecommunication Research Laboratory E-mail: karsten@tfl.dk -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.