[mod.compilers] ...combining r.e.'s with LL

johnl@ima.UUCP (Compilers mailing list) (07/03/86)

Guido van Rossum relates:
> I just read an article in Acta Informatica about parser generation for
> "ELL(1) grammars."...
> An ELL(1) grammar (Extended LL(1) grammar) is a context free grammar
> where the right hand sides of rules are not strings of grammar symbols
> but regular expressions over the set of grammar symbols...

The idea of using REs for the right-hand sides is useful but hardly novel;
it's been used in many EBNFs.  In fact, I hope there's nothing novel
here, since I've been using it in an LL(1) parser generator that's almost
ten years old!

> ...ELL(1) languages are languages for which an ELL(1) grammar exists; these
> are the same as the LL(1) languages, so the ELL(1) paradigm does not add
> essential power to the mechanism [I think],...

Correct.  I object to the name "ELL(1)" for just this reason--ELL(1) is
just LL(1) in a convenient notation.  The analysis required to produce a
parser from an ELL(1)-notation grammar follows trivially from standard
LL(1) techniques.  The Acta Informatica article is surprising in how much
it presents as ELL(1) technique what is really standard LL(1) technique--such
as the use of strongly-connected regions in computing the director sets.
Regular expressions are a notational tool here, but the LL(1) constraint
means the full expressive power of REs is not available with the technique.

Next, skipping forward to some questions from the parent article (to return
later to the proposal):

> 	- has anyone actually used ELL(1) grammars?  experiences?

Yes; as I mentioned above, I've done a parser generator (which is in the
public domain) which uses REs on the RHS of rules.  It has (), |, and * as
you would expect, also + (one or more repetitions), [] (option), and a
couple of other "operators" for notational convenience.

Experience?  Parsers are quite fast, parse tables are small, grammars can
be readily understood by ordinary mortals.  The notation is natural enough
that I can generate either a parser or a good set of syntax diagrams from
the same grammar without doing any transformations.  Incidentally, this is
a nice tool for checking what you've written in the grammar, since subtle
errors in the grammar tend to be quite obvious when you see the syntax
diagrams.
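To make the RHS-operator idea concrete, here's a small sketch (not taken
from my generator--the grammar and names are invented for illustration) of
how the EBNF operators map directly onto control flow in a hand-written or
generated recursive-descent parser: * becomes a while loop, [] becomes an
if on the FIRST set, and | becomes a dispatch on one token of lookahead.
Hypothetical grammar:  args -> "(" [ expr ( "," expr )* ] ")"  and
expr -> NUMBER | NAME.

```python
# Tokens are (kind, value) pairs; this is a sketch, not any real generator's output.
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, kind):
        tok = self.peek()
        if tok is None or tok[0] != kind:
            raise SyntaxError("expected %s, got %r" % (kind, tok))
        self.pos += 1
        return tok

    def args(self):
        # args -> "(" [ expr ( "," expr )* ] ")"
        self.expect("(")
        result = []
        tok = self.peek()
        if tok and tok[0] in ("NUMBER", "NAME"):   # []  -> if on FIRST(expr)
            result.append(self.expr())
            while self.peek() and self.peek()[0] == ",":   # *  -> while loop
                self.expect(",")
                result.append(self.expr())
        self.expect(")")
        return result

    def expr(self):
        # expr -> NUMBER | NAME
        tok = self.peek()
        if tok and tok[0] == "NUMBER":             # |  -> one-token dispatch
            return self.expect("NUMBER")[1]
        return self.expect("NAME")[1]
```

Note that no intermediate nonterminals appear anywhere: the repetition and
option are just loops and tests inside one procedure.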

> 	- any ideas on how to implement proper error recovery?

It's no worse (nor better!) than error recovery for a standard LL parser--
except that the parse stack (or whatever state information you keep) will
be cleaner because you don't have lots of intermediate nonterminals which
are artifacts of the grammar notation.  Looking at it differently, you
could say that you don't have to do any work (either at parser-generation
time or at compiler execution time) to get and keep a parser state useful
for error recovery.  But error recovery is still sticky; simplistic
approaches will work as badly with ELL(1) as with LL(1).
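For reference, the "simplistic approach" in question is usually panic-mode
recovery: on a mismatch, skip input until a token in some synchronizing set
(typically drawn from FOLLOW sets), then resume.  A minimal sketch, with
invented names--this works the same way whether the grammar is written in
LL(1) or ELL(1) notation:

```python
def recover(tokens, pos, sync_set):
    """Panic-mode recovery: skip tokens until one in sync_set (or end of input)."""
    while pos < len(tokens) and tokens[pos] not in sync_set:
        pos += 1
    return pos

# After an error at position 2, skip the junk and resync on "then".
toks = ["if", "x", "@", "#", "then", "y"]
print(recover(toks, 2, {"then", "else", "end"}))  # -> 4
```

The badness comes from choosing the sync set, not from the mechanism; a
poor set silently discards large chunks of the program.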

And finally, the parent article (Guido van Rossum) proposes:
> All this triggered a different idea in my mind.  The ELL(1) property
> requires (amongst others) that the regular expressions correspond to
> Deterministic Finite Automata.  In general r.e.'s correspond to
> Nondeterministic Finite Automata.

You can construct an NDFA for an RE, and it has some useful properties.
However, a grammar is regular iff you can construct an FA for it, and...

> ...*every* NFA can be transformed to a DFA (with possibly
> exponentially many more states) that recognizes the same language...

True; however, the exponential explosion of states isn't relevant here.
The REs that you allow don't have the problem because of the LL(1)
constraint.  Aho/Hopcroft/Ullman give a constructive algorithm for NDFA->
DFA, and it does indeed generate a DFA with potentially as many as 2^n
states for an n-state NDFA, at one point in the construction.  (The NDFA
has at most n = 2k states, where k is the number of symbols in the RE.
Obviously if you're writing down k things, 2k is a nice number of objects
to work with but 2^(2k) is not so nice.)
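A sketch of that subset (powerset) construction may make the point clearer.
Each DFA state is a set of NDFA states, so 2^n states are possible in
principle, but the worklist only ever generates subsets actually reachable
from the start state.  The toy NDFA below (hand-built, for the RE (a|b)*ab)
has 3 states; of the 8 possible subsets, only 3 turn up:

```python
def subset_construction(nfa, start, alphabet):
    """nfa maps (state, symbol) -> set of states; symbol None means epsilon."""
    def eps_closure(states):
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in nfa.get((s, None), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return frozenset(seen)

    start_set = eps_closure({start})
    states, work, trans = {start_set}, [start_set], {}
    while work:                      # only reachable subsets ever enter the worklist
        cur = work.pop()
        for sym in alphabet:
            nxt = eps_closure({t for s in cur for t in nfa.get((s, sym), ())})
            if not nxt:
                continue
            trans[(cur, sym)] = nxt
            if nxt not in states:
                states.add(nxt)
                work.append(nxt)
    return start_set, states, trans

# NDFA for (a|b)*ab: states 0,1,2; accept in 2.
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
_, dfa_states, _ = subset_construction(nfa, 0, "ab")
print(len(dfa_states))  # 3 DFA states, not 2^3 = 8
```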

However, A/H/U indicate that there's a cleanup afterward to eliminate
unreachable states.  If you think about the LL(1) property in relation to
the RE, you can see that most of the new states go away since their purpose
is (essentially) to encode state information.  There isn't much state
information to encode--you always know where you are and you know that it's
only going to take one symbol to decide where to go next.  This means that
the 2^n count, while technically an upper bound, is far too loose to be a
useful estimate of the number of states you actually end up with.
Another way to look at it is that since you're doing an LL(1) parse with a
PDA instead of a DFA, you can carry the interesting state dynamically in
the PDA's stack instead of trying to enumerate all the possible states and
encode them statically in the DFA transition function.

>...This
> makes me feel that a larger group of grammars could be called ELL(1) by
> proper redefinition of that term,...

But (as you pointed out later) you haven't changed the class of grammars,
so you haven't gained anything in a theoretical sense.

The LL(1) property, coupled with the fact that you're going to want to
decorate the grammar with semantic actions that have to be performed in a
predictable fashion at predictable times in the parse, will keep you from
making much use of the nondeterminism of an NDFA.

Finally, in terms of parser quality, I don't see any reason to choose an
NDFA.  While it's true that an NDFA has an advantage (compactness) over a
DFA, I doubt that it would be more compact or faster than the PDA that
drives an LL parse.
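For comparison, here is the shape of that PDA-driven parse: a predict
table indexed by (nonterminal, lookahead) and an explicit stack that
carries the state dynamically, exactly as described above.  The grammar
is a toy I've made up (S -> "a" S "b" | "c"), not anything from a real
generator:

```python
# LL(1) predict table: (nonterminal, lookahead token) -> right-hand side.
TABLE = {
    ("S", "a"): ["a", "S", "b"],
    ("S", "c"): ["c"],
}

def ll1_parse(tokens):
    """Table-driven LL(1) parse; returns True iff tokens is in the language."""
    stack = ["S"]
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos] if pos < len(tokens) else None
        if top in ("a", "b", "c"):         # terminal: must match the input
            if look != top:
                return False
            pos += 1
        else:                              # nonterminal: expand via the table
            rhs = TABLE.get((top, look))
            if rhs is None:
                return False
            stack.extend(reversed(rhs))    # push RHS so leftmost symbol is on top
    return pos == len(tokens)

print(ll1_parse(list("aacbb")))  # True
print(ll1_parse(list("aab")))    # False
```

The whole "state" at any moment is the stack contents plus one token of
lookahead; there is nothing for an NDFA's nondeterminism to buy you.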
---
Dick Dunn	{hao,ucbvax,allegra}!nbires!rcd		(303)444-5710 x3086
   ...Simpler is better.
-- 
Send compilers mail to ima!compilers or, in a pinch to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbncca}!ima

Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request