amb@apple.com (A. Michael Burbidge) (10/09/90)
After struggling for some time to write a yacc description for the Pascal language and after reading the description of the modifier yacc contained in the UCB Pascal source directory I am beginning to wonder if an LR(1) parsing algorithm can parse Pascal. The version of yacc used with the UCB Pascal claims to have additional lookahead sets. Can anyone shead any light on this subject. The optional semi-colons seem to be very difficult to deal with. Mike Burbidge amb@apple.com [In theory, any language that can be parsed by LR(k) can be parsed by LR(1), though the convolutions to do so can be unpleasant. -John] -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
mauney@eos.ncsu.edu (Jon Mauney) (10/10/90)
In article <9010091533.AA02386@apple.com>, amb@apple.com (A. Michael Burbidge) writes: > After struggling for some time to write a yacc description for the > Pascal language and after reading the description of the modifier yacc > contained in the UCB Pascal source directory I am beginning to wonder > if an LR(1) parsing algorithm can parse Pascal. ... Pascal is not terribly hard to do with LL(1), except of course the dangling else. Therefore it is also LALR(1). An LALR(1) grammar for Pascal is included with the parser generator distributed in conjunction with the Fischer&LeBlanc book "Crafting a Compiler". Handling the optional semicolon is not particularly difficult. Contortions are often added to grammars for specific purposes unrelated to the general problem of parsing the language, such as achieving a particular error recovery, or working around a deficiency in the parser generator. Berkeley Pascal was, for some time, unable to accept a null statement in the "then" clause: if i<0 then else foo(i); Rumor has it that this was due to a failing in yacc. The Berkeley Pascal/yacc grammar is probably a poor role model. -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
KARSTEN@tfl.dk (Karsten Nyblad, TFL, Denmark) (10/10/90)
In article <9010091533.AA02386@apple.com>, amb@apple.com (A. Michael Burbidge) writes: > After struggling for some time to write a yacc description for the > Pascal language and after reading the description of the modifier yacc > contained in the UCB Pascal source directory I am beginning to wonder > if an LR(1) parsing algorithm can parse Pascal. ... I once wrote an LALR(1) parser for a superset of Pascal called Pascal+. I used the ISO Pascal standard except from for the extensions. Yes you can write an LR(1) parser, but you parser should be capable of handling disambiguating rules or you will have problems with the ELSE keyword. To make Pascal true LR(1) you need to make a little transformation of the grammar. Let's make an example: This is a little grammar like the four statements assignment, while loop and ifthen and ifthenelse. Stmt -> Assigment | While_header Stmt | IfThen_header Stmt ELSE Stmt | IfThen_header Stmt . This grammar is ambiguous because the parser will not know how to parse IfThen_header Ifthen_header Assignment ELSE Assignment Now we transform the grammar in the following way: We divide Stmt in two: 1) all Stmt ending on IfThen_header Stmt and 2) all Stmt not ending on IfThen_header: Stmt -> NoDangling | Dangling . NoDangling -> Assignment | While_header NoDangling | IfThen_header NoDangling ELSE NoDangling . Dangling -> While_header Dangling | IfThen_header NoDangling ELSE Dangling | IfThen_header Stmt . This grammar is LALR(1) and thus LR(1). Then about making semicolon optional. I also tried deleting all semicolons from my grammar. You do not need semicolons except in one case: the case statement, e.g.: CASE expr OF 2 : a:= a ; - 3 : END ^ Try deleting the semicolon. Then the parser will not know that the - is the start of a label until it has seen the colon. You can let your scanner divide - and + into prefix and infix operators by doing some lookahead in the scanner, but take care of writeln(5-3:3); Karsten Nyblad TFL, A Danish Telecommunication Research Laboratory E-mail: karsten@tfl.dk -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
joel@decwrl.dec.com (10/10/90)
Yacc is really a crude tool for parser construction. After much experimentation, I got yacc to not only accept Modula-2's statements, but to report when you forgot to put in a missing semicolon as well. I didn't do the same for Pascal, as it was too much work. You can find complete grammars in the Modula-2/Pascal distribution available for anonymous ftp from gatekeeper.dec.com, file /pub/DEC/Modula-2/m2.vax.tar.Z. Here's the relevant code for Modula-2, just to let you know the horrors of yacc grammar c onstruction. /* blame yacc for gross sequence syntax */ StatementSequence: StatementSequence1 /**/ | semis StatementSequence1 semis { $$ = $2; } | StatementSequence1 semis /**/ | semis StatementSequence1 { $$ = $2; } | semis { $$ = AddToStmtList(0,0); } | /* empty */ { $$ = AddToStmtList(0,0); } ; StatementSequence1: StatementSequence1 semis statement { $$ = AddToStmtList($1,$3); } | StatementSequence1 { yyerror("Missing semicolon"); } statement { $$ = AddToStmtList($1,$3); } | statement { $$ = AddToStmtList(0,$1); } ; semis: SEMICOLON | semis SEMICOLON ; -- - Joel McCormack (decwrl!joel, joel@decwrl.dec.com) -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
meissner@osf.org (10/11/90)
| After struggling for some time to write a yacc description for the | Pascal language and after reading the description of the modifier yacc | contained in the UCB Pascal source directory I am beginning to wonder | if an LR(1) parsing algorithm can parse Pascal. ... I seem to recall the original CDC 6600 PASCAL was parsed by a recursive descent LL(1) parser, and that LL(1) grammers a subset of LALR(1) grammers which are a subset of LR(1) grammers. Has Pascal changed that much over the years that it needs more powerful parsing techniques, or is there some errors in your yacc file? -- Michael Meissner email: meissner@osf.org phone: 617-621-8861 Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142 -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
hankd@dynamo.ecn.purdue.edu (Hank Dietz) (10/11/90)
In article <9010091533.AA02386@apple.com> A. Michael Burbidge <amb@apple.com> writes: >After struggling for some time to write a yacc description for the >Pascal language and after reading the description of the modifier yacc >contained in the UCB Pascal source directory I am beginning to wonder >if an LR(1) parsing algorithm can parse Pascal. ... Pascal is generally defined using a set of syntax diagrams which, by definition, means the language can be recognized using LL(1). This does not mean, however, that a particular Pascal dialect is LR(1); there may be extensions or error handling states, etc., which cause the language to not be LR(1). Also, YACC builds LALR(1) parsers, not LR(1). I vaguely recall one of Johnson's own papers saying something about a YACC-generated parser not being able to parse YACC input because YACC input is LALR(2)... so I'm not so sure that LALR(1) is equivalent to LALR(k). Or perhaps the "convolutions" are VERY "unpleasant"? -hankd@ecn.purdue.edu -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
bliss@sp64.csrd.uiuc.edu (Brian Bliss) (10/11/90)
> I'm not so sure that LALR(1) is equivalent to LALR(k)
Any deterministic context free language can be recognized with a DPDA, and
for any DPDA, we can construct a LR(1) grammar. so the class of languages
recognizable with LR(k) grammars is the same as the class of languages
recognizable with LR(1) grammars. There are grammars (not languages!),
however that can be parsed with k tokens of lookahead and not with just 1.
The following grammar is LR(2) but not LR(1):
S-> A T
A-> a | a b
T-> b | c
Consider the string "abc". after the "a" is absorbed, we must decide
whether to reduce by "A-> a" or "A-> a b", with only the "b" as the
lookahead token. at this point, we have the conflict:
"abc"
a shift
a b reduce
A s
A c r
A T r
S
vs.
"ab"
a r
A s
A b r
A T r
S
The grammar is LR(2) (and LALR(2))
Languages which are recognized with LR(1) grammars can also be recognized
with LALR(1) grammars - from the LR(1) grammar, construct the canonical
set of items (states) needed for the DPDA if we merge items with the same
set of productions (but possibly different lookaheads) and get no
conflicts, we have a LALR(1) parser. If merging items results in a
conflict, we must have two items: @
L-> nonterm_string, a and L-> nonterm_string, b
<other prods> <other prods>
where merging to
L-> nonterm_string, a b
produces a conflict
so introduce 2 new nonterminals L' to your grammar.
-
for every production
A-> ... L ...
add L'-> nonterm_string if a and not b could legally follow L
in a sentential form. and add A-> ... L' ... delete A-> ... L ...
add L''-> nonterm_string if b and not a could legally follow L
in a sentential form. and add A-> ... L'' ... delete A-> ... L ...
otherwise, leave the production unchanged
-
only one of
A-> ... L ...
A-> ... L' ...
A-> ... L'' ...
is in the new grammar - we haven't made it ambiguous.
@ If there exists more than two items producing the conflict, either
3 or more being merged to the same LALR item, or 2 or more different sets
being merged to diffent LALR items, repeat this process, merging only two
conflicting items each time.
So if L (g) means the class languages recognized by a set of grammars n,
and G(n) means the set of grammars parsed in some manner n
L(G(LALR(1))) = L(G(LR(1))) = L(G(LALR(k))) = L(G(LR(k)))
but
G(LALR(1)) != G(LR(1)) != G(LALR(k)) != G(LR(k))
bb
--
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
lindsay@comp.vuw.ac.nz (Lindsay Groves) (10/16/90)
In article <9010101445.AA06181@dynamo.ecn.purdue.edu>, hankd@dynamo.ecn.purdue.edu (Hank Dietz) writes: |> Pascal is generally defined using a set of syntax diagrams which, by |> definition, means the language can be recognized using LL(1). You must be using a strange definition a syntax diagrams if they can only describe languages that have LL(1) grammars. As I understand them, syntax diagrams can generate ALL context free languages, not just those that are also LL(1); i.e. I can turn any context free grammar into a set of syntax diagrams and vice versa. For example, the following syntax diagram describes the language consisting of all non-empty palindromes of even length over the alphabet {a, b}. This language is inherently ambiguous (ie cannot be recognised by a deterministic PDA), so it is definitely not LL(1) (nor LL(k), LR(k), ...): +-------------------------------+ | ^ v | S ----------> a ---> a --------------+----->| | ^ |---> b ---> b ---------->| | | |---> a ---> S ---> a --->| | | +---> b ---> S ---> b --->+ Lindsay -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
rekers@cwi.nl (Jan Rekers) (10/16/90)
In article <1990Oct16.015524.25858@comp.vuw.ac.nz>, lindsay@comp.vuw.ac.nz (Lindsay Groves) writes: |> For example, the following syntax diagram describes the language |> consisting of all non-empty palindromes of even length over the alphabet |> {a, b}. This language is inherently ambiguous (ie cannot be recognised |> by a deterministic PDA), so it is definitely not LL(1) (nor LL(k), |> LR(k), ...): The language of non-empty palindromes is not inherently ambiguous, the language of _sequences_of_ non-empty palindromes is. The syntax diagram you provide does describe _sequences_ of palindromes. Jan Rekers (rekers@cwi.nl) Centrum voor Wiskunde en Informatica (CWI) P.O. Box 4079, 1009 AB Amsterdam, The Netherlands -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
firth@sei.cmu.edu (Robert Firth) (10/18/90)
In article <9010091533.AA02386@apple.com> A. Michael Burbidge <amb@apple.com> writes: >After struggling for some time to write a yacc description for the >Pascal language and after reading the description of the modifier yacc >contained in the UCB Pascal source directory I am beginning to wonder >if an LR(1) parsing algorithm can parse Pascal. The version of yacc >used with the UCB Pascal claims to have additional lookahead sets. Can >anyone shead any light on this subject. The optional semi-colons seem >to be very difficult to deal with. Many moons ago, I wrote a front end for Pascal, and don't recall having any trouble except the dear old stupid "1.5" versus "1..5", which needs a two-character lookahead in the lexer. Moreover, I've just checked both Wirth's original paper and a couple of books on Pascal, and can't find ANY optional semicolons in the grammar. If you could post a little more detail, I'll try to be more helpful! -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
firth@sei.cmu.edu (Robert Firth) (10/18/90)
In article <1990Oct10.133752.14930@ncsuvx.ncsu.edu> mauney@eos.ncsu.edu (Jon Mauney) writes: >Berkeley Pascal was, for some time, unable to accept a null >statement in the "then" clause: > > if i<0 then else foo(i); But that is not legal Pascal. The relevant syntax (Wirth, section 9.2.2.1) reads <if-statement> ::= IF <expression> THEN <statement> | IF <expression> THEN <statement> ELSE <statement> and there is no production from <statement> that yields <empty>. -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
firth@sei.cmu.edu (Robert Firth) (10/19/90)
In article <9094@fy.sei.cmu.edu> firth@sei.cmu.edu I wrote: [ The Pascal form 'if c then else s' ] >But that is not legal Pascal. The relevant syntax (Wirth, section 9.2.2.1) >reads > <if-statement> ::= IF <expression> THEN <statement> | > IF <expression> THEN <statement> ELSE <statement> > >and there is no production from <statement> that yields <empty>. I was wrong. The above is indeed what Wirth's original paper says, and what I used when implementing Pascal. But in the later book by Jensen and Wirth, an extra production is added <simple-statement> ::= ... | <empty-statement> by means of which one can indeed produce <empty> from <statement>. Since this book is by the original designer and of later date, we should take it as superseding the older paper. Many thanks to Bob Corbett for setting me straight on this point. So, given that we do have empty statements, how nasty are they? There is a simple mechanical test: list all the tokens that can start a statement, then all the tokens that can follow a statement, and if the sets overlap we are screwed, since any token in the intersection might EITHER start a statement OR start a different nonterminal after an empty statement. The tokens that can start a statement are GOTO BEGIN IF CASE WHILE REPEAT FOR WITH and in addition, note that a label or case label can start with The tokens that can follow a statement are semicolon END ELSE UNTIL The intersection is null, so we can always identify an empty statement by a one-token lookahead. On another topic: while scrabbling through the listings I came across a genuine and forgotten nasty. Consider this fragment CASE fruit OF apple: pear This can continue as, for instance apple: pear: x := 1; or as apple: pear := 1; In other words, you cannot tell whether 'pear' starts another case label or the statement labelled. That may discombobulate a parser generator, especially since one or more comments may appear between the identifier and the next token. Again, hope that helps, and my apologies for regurgitating obsolete information. [Norm Diamond <diamond@tkov50.enet.dec.com>, Charles E Eaker <eaker@sungod.crd.ge.com>, and Jon Mauney <mauney@csc.ncsu.edu> also noted that Pascal does indeed allow null statements. -John] -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
djones@megatest.uucp (Dave Jones) (10/21/90)
Yes it can. Frequently is, for that matter.
>From article <9112@fy.sei.cmu.edu), by firth@sei.cmu.edu (Robert Firth):
...
)
) On another topic: while scrabbling through the listings I came across
) a genuine and forgotten nasty. Consider this fragment
)
) CASE fruit OF
) apple: pear
)
) This can continue as, for instance
)
) apple: pear: x := 1;
)
) or as
)
) apple: pear := 1;
)
) In other words, you cannot tell whether 'pear' starts another case
) label or the statement labelled.
BRRRRRRRRRZTTTTTT!! I'm sorry. That answer is incorrect.
Wirth wanted labels to be ughly and inconvenient, because the
GOTO is, (ahem), considered harmful. Therefore he used literal numbers as
labels, not identifiers. (Gee thanks, Nick! I'm sure we're all much
better programmers for it.) So the example you give,
case fruit of
apple: pear: x := 1;
is not legal. Just for the fun of it, here's the error barrage that
the Berkeley Pascal compiler produces:
apple: pear: x := 1;
E ------------------------^--- Expected '='
E --------------------------^--- Malformed statement in case
E 2 - constant pear found where variable required
In proper context, the following would be legal:
case fruit of
apple: 999: x := 1;
Pascal is not only LR(1), it is LALR(1). But the discussion, based as
it is on whether or not an algorithm can predict the next production
entirely on the basis of a one-token lookahead, would be
more appropriate to a discussion of whether Pascal is LL(1).
It is that also, if memory serves.
P.s. "Labeled" has only two l's.
--
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
crocker@Alliant.COM (Ben Crocker) (10/24/90)
Having written a Pascal compiler with an LL(1) parser generator, I can vouch for the proposition that Pascal is LL(1). -- Ben Crocker (crocker@alliant.com) Alliant Computer Systems Corp. One Monarch Drive Littleton, MA 01460 (508) 486-4950x1356 [I think we've established that Pascal can be parsed LL(1). Unless something new shows up, I'll consider the topic closed. -John] -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
piet@cs.ruu.nl (Piet van Oostrum) (10/26/90)
> In article <9010232339.AA20860@Alliant.COM>, crocker@Alliant.COM (Ben > Crocker) (BC) writes: > Having written a Pascal compiler with an LL(1) parser generator, I can > vouch for the proposition that Pascal is LL(1). Pascal isn't LL(1). The IF THEN ELSE construction is impossible to do in LL(1). However, most LL(1) parser generators let you pass this by choosing the required branch (ELSE belonging to innermost THEN). -- Piet* van Oostrum, Dept of Computer Science, Utrecht University, Padualaan 14, P.O. Box 80.089, 3508 TB Utrecht, The Netherlands. Telephone: +31 30 531806 Uucp: uunet!mcsun!ruuinf!piet Telefax: +31 30 513791 Internet: piet@cs.ruu.nl (*`Pete') -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
andy@Theory.Stanford.EDU (Andy Freeman) (10/27/90)
In article <9010232339.AA20860@Alliant.COM> crocker@Alliant.COM (Ben Crocker) writes: >Having written a Pascal compiler with an LL(1) parser generator, I can >vouch for the proposition that Pascal is LL(1). Such compilers are built on tokenizers with 2 character look-ahead. Remember that "1..5" has the same tokens as "1 .. 5", but requires 2 character look-ahead to distinguish from streams containing "1.<digit>". Look-ahead 2 tokenising feeding a Lx(1) parser does not demonstrate that Pascal is Lx(1); it demonstrates that a tokenized version of a language may have different look-ahead requirements than the stream-of-characters version. -andy -- UUCP: {arpa gateways, sun, decwrl, uunet, rutgers}!neon.stanford.edu!andy ARPA: andy@neon.stanford.edu BELLNET: (415) 723-3088 -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
jas@Ingres.COM (Jim Shankland) (10/29/90)
In article <1990Oct26.220803.1411@Neon.Stanford.EDU> andy@Theory.Stanford.EDU (Andy Freeman) writes: >In article <9010232339.AA20860@Alliant.COM> crocker@Alliant.COM (Ben Crocker) writes: >>Having written a Pascal compiler with an LL(1) parser generator, I can >>vouch for the proposition that Pascal is LL(1). > >Such compilers are built on tokenizers with 2 character look-ahead. Remember >that "1..5" has the same tokens as "1 .. 5", but requires 2 character >look-ahead to distinguish from streams containing "1.<digit>". > >Look-ahead 2 tokenising feeding a Lx(1) parser does not demonstrate that >Pascal is Lx(1); it demonstrates that a tokenized version of a language may >have different look-ahead requirements than the stream-of-characters version. Such postings should best be made from a machine with a name other than "theory.stanford.edu" :->. For any language with an LR(n) grammar, n > 0, there is an LR(n -1) grammar accepting the same language. In other words, a language is either LR or not LR; it makes no sense to say that a language is LR(k), for any particular k. There are practical considerations, in that the LR(n) grammar for a language may be considerably smaller and more comprehensible than the LR(n - 1) grammar. But that's all. The original proof of this is due to Knuth; it can no doubt be found in many sources, certainly in Hopcroft and Ullman. (I'm away from my references, so I can't be more precise; sorry.) As for LL languages, in <1990Oct26.213044.28243@Neon.Stanford.EDU>, Sriram Sankar (sankar@neon.stanford.edu) asserts: >LL(n+1) languages are a strict superset of LL(n) languages; As I said, I'm away from my references, and I'm not really a theoretician, but I believe this, too, to be incorrect. I think that if there is an LL(n + 1) grammar for a language, then there will also be some LL(n) grammar for the language (n >= 0). Consider the following proof sketch showing that an LL(2) grammar can be rewritten into an LL(1) grammar (the generalization should be straightforward): If a grammar G is LL(2), but not LL(1), then there must be productions of the form A --> a alpha B --> a beta such that both productions could occur from the same parser state. (A and B might be the same non-terminal.) Intuitively, the parser cannot "know" which production is the correct one with only the "a" as lookahead. Now rewrite the grammar as follows: remove those two productions and replace them with A' --> alpha B' --> beta where A' and B' are new non-terminals not used elsewhere; also, for any production P having an A or a B in its right-hand side, add a new production P' that is just like P, except that every occurrence of A on the rhs has been replaced by "a A'"; similarly with B, substituting "a B'". Intuitively, this defers the decision of which parse is correct by pushing the a and encoding the fact that an a was read in the parser state. Finally, as to whether Pascal is or is not LL(anything): I will ship a case of Anchor Steam beer to anyone who can show that Pascal is even context-free by writing a context-free grammar that will accept only correct Pascal programs and reject all other strings, including strings that would be correct Pascal programs except that an undeclared variable is used in the program. No fair cheating by using a Turing machine (e.g., C code) to maintain a symbol table. jas -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
firth@sei.cmu.edu (Robert Firth) (11/05/90)
In article <14251@goofy.megatest.UUCP> djones@megatest.uucp (Dave Jones) writes: > case fruit of > apple: pear: x := 1; > >is not legal. Please check the syntax in Wirth's paper, section 9.2.2.2. It reads: case-list-element ::= {case-label :}+ statement | {case-label :}+ In other words, each case arm can be introduced by several case labels, all of which of course can be identifiers. > Just for the fun of it, here's the error barrage that >the Berkeley Pascal compiler produces: You mean you believe what a Berkeley compiler tells you? You wouldn't happen to be needing a bridge, would you? >P.s. "Labeled" has only two l's. Sigh. -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.