[mod.compilers] advice needed re: parsing C decl syntax

johnl@ima.UUCP (Compilers mailing list) (12/06/86)

I've always been under the impression that YACC and friends produce
code/tables which are somewhat worse than could be done with a hand-built
parser (ie. usually recursive-descent).

Recently, I've tried to parse the C declaration syntax using recursive-descent
and have found that it's a fairly difficult task.  I find a need to delay
parsing decisions until much later in the parse.  For example, you don't know
whether you're parsing a function definition until you've seen a left
brace - up to that point, it could be anything from a declaration of a
variable (up until the left paren) to an external function definition.

I'm still reluctant to use a straight LALR table parser for reasons of
(ahem) efficiency.  Things like expressions & statements are wonderful
candidates for recursive-descent.  Lists of thingies also fall out into
simple code:

	thingie-list ::= thingie | thingie ',' thingie-list

turns into:

	thingie ();
	while (inputChoice (COMMA)) {
		thingie ();
	}

without the need for interpreting table entries.

Has anybody successfully married LR parsing with recursive-descent
techniques?  Are there any other clean tricks/techniques I could use?

thanx,
Paul Tarvydas
Tarvydas-Sanford Controls Inc.
...{utcs|utzoo|...}!utcsri!tarvydas
[Well, actually, I've never seen a compiler where the parser was the
particularly slow part.  More typically you get slowness either in character
processing in the lexer or else in slow table searches in the code gen.
If you insist, I see no reason why you couldn't define EXPR as a token in
the yacc parser and, in contexts where an expression is legal, sneak off to
an operator precedence or recursive descent parser.  But I really don't see
the point.  Yacc certainly has its problems, but excessively slow parsing has
never seemed to be one of them.  -John]
-- 
Send compilers mail to ima!compilers or, in a pinch to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbncca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.UUCP (12/09/86)

LR parsing implies bottom up parsing.

The standard trick in LL (recursive descent) parsing is to delay doing
something about a token if it could be one of several things.
Basically you are rewriting your grammar so that it is LL(1) instead of
LL(k), k > 1, by left factoring.  For example:

	statement	-> identifier := expression
	 		-> identifier ( paramlist )

is not LL(1). But when you factor it out thus:

	statement	-> identifier stmttail
	stmttail	-> := expression
			-> ( paramlist )

Now one token of lookahead suffices.

	Ken
[This is true, but in real recursive descent compilers that I have seen, it's
as likely that you parse these things by kludgery as by adding productions.
Most recursive descent compilers are written by hand, so you'd parse the first
syntax by eating the identifier, remembering it, eating the next token, and then
going ahead with the appropriate construction.  But I'd rather let yacc do the
work for me.  Yacc can parse anything with enough help -- I have written a yacc
parser with a moderate number of kludges that correctly parses Fortran 77, a
language in which a statement can't be tokenized until you know what kind of
statement it is.  -John]
-- 
Send compilers mail to ima!compilers or, in a pinch to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbncca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.UUCP (12/09/86)

Paul Tarvydas writes:
>I'm still reluctant to use a straight LALR table parser for reasons of
>(ahem) efficiency.  Things like expressions & statements are wonderful
>candidates for recursive-descent.
> ...
>Has anybody successfully married LR parsing with recursive-descent
>techniques?  Are there any other clean tricks/techniques I could use?

At the most recent Compiler Construction Conference (SigPlan notices,
July 1986), Thomas Pennello presented a technique for "compiling" the
parse-tables directly into (assembly) code, which apparently speeds up
the parsing process by a factor of 6-10.

>[But I really don't see the point.  Yacc certainly has its problems, but
>excessively slow parsing has never seemed to be one of them.  -John]

In his conclusions, Pennello also echoes this opinion:
    "Trying to speed up an LR parser may seem like beating as dead
    horse, since LR parsers are already reasonably fast."

Still, I think the paper is worth reading.

		Steve Vegdahl
		Computer Research Lab
		Tektronix Labs
		Beaverton, Oregon
-- 
Send compilers mail to ima!compilers or, in a pinch to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbncca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.UUCP (12/10/86)

In article <282@ima.UUCP> Paul Tarvydas <toronto.edu!tarvydas@csri> writes:

>Recently, I've tried to parse the C declaration syntax using recursive-descent
>and have found that it's a fairly difficult task.

It is difficult regardless of the parsing technique used. Using LR
(or LALR), rather than LL, parsing helps slightly. C's declaration syntax
is a mess.

>I'm still reluctant to use a straight LALR table parser for reasons of
>(ahem) efficiency. [...] Lists of thingies also fall out into
>simple code:
>
>       thingie ();
>       while (inputChoice (COMMA)) {
>               thingie ();
>       }

I doubt this is more efficient than a table driven parser. The
cost of the call to "inputChoice" is at least 3 instructions (say
10 bytes or more). A parsing table could encode this in 2 bytes (2
bits for the action (shift/reduce/accept), leaving 14 bits for
the symbol).  The execution time of the table driven parser
should be at least equal to, if not faster than the above code.
The parser's code to check that the current symbol equals the
value in the table would have similar speed to the code in the
"inputChoice" routine, but without the overhead of the procedure
call (register saving/restoring, allocating local vars, etc).

My experience is that table driven parsers (whether LL or LR) are
smaller and faster than recursive descent. They also offer the
significant advantage of allowing automatic syntax error
correction/recovery. Doing this in a procedural recursive-descent
parser requires passing recovery sets to each parsing procedure,
as described in Hartmann's book on the Concurrent Pascal
compiler, or Pemberton's paper in "Software - Practice &
Experience" (around 1981). This slows the parser (even when
parsing a syntactically correct program), and increases the
stack space requirements.

Recursive descent parsers have the advantages of being easy to
write by hand, automatic allocation of temporary variables while
analysing recursively defined structures (such as statements and
expressions), and the ability to pass values down to lower level
parsing routines. YACC's method to do the last two is much less
clean, although a pre-processor such as Prep (van Katwijk,
ACM SIGPLAN Notices, Oct 1983) helps.
-- 
Send compilers mail to ima!compilers or, in a pinch to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbncca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.UUCP (Compilers mailing list) (12/11/86)

I recently wrote a C interpreter (which of course must parse also).  I
found the particular problem mentioned to not be that bad.  The
declaration of a function (or a variable) and the actual definition of
a function was handled by the same code.  Roughly I had a routine
which I called declaration.  This called something which picked up the
declarative part (like the int i or the char *p(arg)).  I put the list
of arguments in a place to be found later by `declaration'.  Then I
would look to see if I found a function and if I was not looking at a
semi-colon.  If that was the case, I assumed I had a function
definition and parsed it.

The part I found to be tricky was the code which picked up the `int i'
or the `char *p[4](arg)'.  It turns out that this is not very
recursive.  I did this in a loop which continued until I did not see
anything I liked.  In that look I picked up the *'s, followed by a
name, followed by the []'s or ()'s for functions.  The ()'s used for
grouping was the reason for the loop.  If you want more details, I
could send you part of the code.

However, I would like to end with a statement about yacc.  As was
pointed out, the problem with yacc is not the speed nor the size.  It
is rather difficult to write a good grammar for C which yacc will
accept and still provides useful screening of illegal input.  I have
written one which correctly knows about lvalue's.  It knows the
difference between functions and pointers to functions so that the
items inside a struct (for example) are always pointers to functions
and not functions.  It also has 0 shift/reduce conflicts (as reported
out of yacc).  All the other grammars I have looked at fail miserably
in these reguards.  The code produced (tables and driver) is 8K I
think.  (The reason I did not use it was because of legal problems
with ATT which I still have not heard a definate answer on.)
-- 
Perry Smith
{convex!ctvax,{ti-csl,infotel}!pollux}!bobkat!pedz
-- 
Send compilers mail to ima!compilers or, in a pinch to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbncca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

johnl@ima.UUCP (Compilers mailing list) (12/17/86)

> ... Doing [automatic error recovery] in a procedural recursive-descent
> parser requires passing recovery sets to each parsing procedure...
> This slows the parser (even when parsing a syntactically correct program),
> and increases the stack space requirements...

On the contrary, doing automatic error recovery in recursive descent does
not require this, although one particular error-recovery algorithm may.
(I would add that if you are implementing the usually-sparse recovery sets
as bitmaps and passing them by value, as one might all-too-easily do in
Pascal, it's not surprising that the result is slow and bulky.)  Building
automatic error recovery into a recursive-descent parser is not difficult
or inefficient if you think hard and do it right; I speak from experience
with a C parser, not exactly a favorable case.  While I tend to agree that
table-driven parsers are likely to be a lot smaller and possibly somewhat
faster, the *practical* differences in error recovery are not significant.

(Why was I doing recursive-descent parsing if I think table-driven is
better, you ask?  Partly because of some of the nice practical conveniences
of recursive descent, partly because the particular project was supposed to
be highly portable and hence I didn't want to be dependent on a big parser
generator that I couldn't export freely.  I tinkered with hand-generated
tables for a bit, and I think that approach could have been made to work
acceptably, but it was taking too long to get all the details right.)

				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry
-- 
Send compilers mail to ima!compilers or, in a pinch to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbncca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request