[comp.compilers] Yacc Poll

johnl@ima.UUCP (07/31/87)

I am presently involved in a discussion about compiler implementation,
specifically the pros and cons of Yacc,
and I would like opinions from other experienced compiler-writers.

There seem to be some standard objections to Yacc in particular, e.g.

	a. the form of Yacc's input
	b. the limitations of its syntax/semantic interface
	c. the incomprehensibility of the C output
	d. the size of the generated parser
	e. the speed of the generated parser
	f. debugging difficulties

and some more general objections, e.g.

	g. generic objections to LR parsing
	h. generic preference for some other technique
	i. etc.

I have some sympathy with objections (a), (b), and (d),
although they appear minor compared with the advantages gained,
and don't warrant rejecting the tool.
I reject the remainder. Some of these are simply uninformed.

We need to choose between 3 existing syntax analysers for the language in
question (Cobol): ad-hoc parsing, recursive-descent, and a Yacc grammar.
As the author of both the recursive-descent and Yacc implementations,
I strongly favour Yacc, or at least LR parsing.

Would any readers care to comment, or indeed vote?
E-mail replies please; if enough interest, I'll summarize to the moderator.
-- 
Esmond Pitt, Austec International Ltd
...!seismo!munnari!ausmelb!ejp,ejp@ausmelb.oz.au

[Cobol was designed so long ago that it seems uniquely well-suited to ad-hoc
parsing. On the other hand, so was Fortran, and I was never sorry that I wrote
the INfort parser in heavily kludged yacc. The two huge wins for me were that
I knew that it would complain about any invalid input (unlike recursive
descent where practically every line seems to be followed by "else
syntax_error();") and that as people asked for their favorite featurettes, I
could throw in the new syntax and know I wouldn't break existing stuff and not
know about it. Other opinions solicited, particularly from actual experience.
-John]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.ARPA
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | cca}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

pase@ogcvax.UUCP (Douglas M. Pase) (08/04/87)

Sometime back I was given the assignment to write a compiler for an in-house
functional language which was to be used for numerical processing.  Once I had
a rough description of the language, I trotted off to my trusty vax and whipped
out a recursive descent parser in `C'.  It took me approximately 5 weeks
(partly because details of the language were still fluid and I was unfamiliar
with the VMS environment).

Just about the time I was putting the finishing touches on the parser a mandate
from Higher Up arrived which prohibited the use of `C' (or any tool which
smelled of Unix).  The Officially Blessed Language was Modula-2, it being a
"more advanced" and "higher level" language.  Back I went to the drawing board,
scrapping most of what I had done.  The parser this time took over seven
months, of which two were spent learning Modula-2.  It ran much slower, and
used 5-10 times as much memory.

About two months before I finished this I bought a small Unix box (a Fortune
32:16 with a 60Mb disk) with Yacc, Lex, and the gang all on board.  I thought
that for fun I would see what effort was required to build a parser using the
available tools.  The parser was completed in four evenings (about 3-4 hours
each).  (All three parsers were for the same language, but the language was
at a different stage of development for each parser.)

Part of the problem was the fluidity of the language design.  Every new token
meant 3 days worth of messing around with the transition tables.  Then people
had to dink around with the grammar (not as painful to change, but still not
pleasant).  Another problem was the low quality of Modula-2 compiler we were
forced (at gunpoint) to use.  However, all things considered the LR parser was
by far the easiest, with the `C' recursive descent parser a not too distant
second, and the Modula-2 R.D. parser being a very distant third.
--
Doug Pase   --   ...ucbvax!tektronix!ogcvax!pase  or  pase@Oregon-Grad.csnet

johnl@ima.UUCP (08/06/87)

> ... The two huge wins [of yacc] for me were that
> I knew that it would complain about any invalid input (unlike recursive
> descent where practically every line seems to be followed by "else
> syntax_error();") and that as people asked for their favorite featurettes, I
> could throw in the new syntax and know I wouldn't break existing stuff...

I half agree with this.  The "else syntax_error();" objection is, I'm afraid,
completely bogus:  it is quite possible to write a recursive-descent parser
with error recovery handled systematically and automatically in the scanner.
I've done it.  The parser needs to supply a bit more information to the
scanner than usual, so the scanner can compare actual input with expected
input, but it's not hard to do and it works very nicely... much better than
error recovery in yacc, in fact.  The parser always sees a syntactically
correct program, and hence never has to try to unwind the semantics after
discovering an error.

The part about ease of change, though, I agree with.  Personally I tend
towards using recursive descent for "production quality" parsers for well-
specified languages, and yacc for experimental work (where the language
changes underfoot all the time) and doing quick jobs (where minimizing
development effort is the primary concern).  The type of language is also
a consideration:  programming languages tend to be easy to parse with
recursive descent -- C and Pascal in particular were *designed* that way,
and parsing C with an LR parser is a little tricky (not that it's trivial
with recursive descent, groan...) -- while arbitrarily-invented notations
are much easier to handle with the greater power of yacc.

				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry
[I'd be interested in hearing what the recursive descent lexer feeds to the
parser when the input is something like a = b + ( ;    -John]

steve@hubcap.clemson.edu ("Steve" Stevenson) (08/11/87)

Hand-writing a recursive descent parser makes as much sense as hand-writing
a LALR(1) parser.  Why don't you use yacc-lex to write a simple input
mimic of yacc, then use a reasonable plagiarism (like from Backhouse's
book)?

We did it here - and extended this LL beast as a master's project. (Dave
Rivers at Data General.  Hi, Dave :-).
-- 
D. E. (call me Steve) Stevenson            steve@hubcap.clemson.edu
Department of Computer Science,            (803)656-5880.mabell
Clemson University, Clemson, SC 29634-1906

johnl@ima.UUCP (08/14/87)

In article <634@ima.ISC.COM> harvard!ames!ausmelb!ejp writes:
>I am presently involved in a discussion about compiler implementation,
>specifically the pros and cons of Yacc,
>and I would like opinions from other experienced compiler-writers.

Well, the Toolpack/1 parser was generated using Yacc (the output was munged
into Fortran though).  Recursive-descent parsing was considered (1) probably
insufficiently powerful to parse Fortran's somewhat grubby syntax, and (2 -
more importantly) impossible since ANSI Fortran does not provide recursion.
"ad hoc" was *NEVER EVER* considered suitable for parsing Fortran - it was hard
enough to get the bugs out of the Yacc input code (~1500 lines).

My own experience with writing recursive-descent parsers is that it is easy
to provide error recovery far superior to the normal Yacc stuff (and which one
often sees in C compilers).  Unfortunately this also depends on the syntax of
the language in question - if there is not much variety of input 'terminator'
symbols (e.g. ';', 'end' in Pascal) error recovery is hobbled anyway.

The only place where I would use ad-hoc methods by choice is in lexical
analysis, as this is probably simple enough to hand-craft decently.  In lexing
Fortran one has either to hand-craft an ad-hoc lexer or use something more
powerful than regular expressions (viz. lex) - or sadistically leave it to the
parser-writer to tokenise 'DO' statements etc...  Toolpack/1 used a
table-driven pushdown automaton with backtracking - ugh.

Thus Toolpack/1 took a similar approach to that of Stu Feldman with his f77
compiler - FORTLEX for the lexer, and YACC for the parser.  I recommend that
anyone considering whether or not to use Yacc read his paper describing
his implementation experience (in the proceedings of some ancient SIGPLAN
conference - I think 1978).  He recommends its use.

>There seem to be some standard objections to Yacc in particular, e.g.
>	c. the incomprehensibility of the C output
Perfectly comprehensible - even when translated into Fortran!
>	f. debugging difficulties
Certainly easier than with an ad hoc parser (except for very small languages).

>and some more general objections, e.g.
>	g. generic objections to LR parsing
Well, if recursive descent (LL) is powerful enough without kludging the
grammar, I find it can provide a more convenient framework for the semantics
and error recovery - but only for a suitable target language (not COBOL).
In this case LR has no advantages over LL, and may even lose because of its
excessive power.
-- 
Malcolm Cohen			       mcvax!keilir!malcolm
Utgardar, Computer Centre, University of Iceland, Reykjavik

johnl@ima.UUCP (08/14/87)

In article <378@hubcap.UUCP>, steve@hubcap.clemson.edu ("Steve" Stevenson) writes:
> Hand-writing a recursive descent parser makes as much sense as hand-writing
> a LALR(1) parser.  Why don't you use yacc-lex to write a simple input
> mimic of yacc, then use a reasonable plagiarism (like from Backhouse's
> book.)
> 

Actually, it makes a lot more sense.  The natural control structure for
implementing a recursive-descent parser is a set of mutually recursive
procedures, one per syntactic construct.  This is a much simpler structure,
bearing a much more obvious resemblance to the grammar, than the structure
that results from hand-coding a bottom-up parser.  In fact, writing
the control structure of a recursive-descent parser from the grammar can be
done as fast as you can type it in, and will result in a program that is
no larger than a yacc grammar.  The only difficult part (and it is not
difficult for a well-designed language) is to figure out the set of tokens
that can begin each type of construct, so that you can decide which of
several recursive procedures to call (or whether to return).  I've been
able to do this correctly in my head as I typed in the parser for several
programming languages, including C and Pascal.

The advantages of a hand-written parser include:

(1) Speed.  By a few carefully chosen modifications of the program produced by
    a literal transcription of the grammar, an experienced programmer can easily
    obtain a parser that is substantially faster than yacc.  If pressed, I have
    examples supporting this claim.  The obvious optimizations are to collapse
    the code of separate, non-recursive constructs and to use some sort of
    precedence-based technique for handling the expression portion of the
    language.  Of course it is also easy to obtain recursive-descent parsers
    that are slower than yacc, but at least you can control the efficiency.

(2) Flexibility in error-reporting and recovery.  You control the code -- you
    can do anything you want.  This should be obvious.  And as someone pointed
    out, you can use abstractions like: EXPECT(token, msg, skip_set) to mean

	"Expect to find the specified token as the current token.  If so, read
	past it; otherwise print the specified error message and read tokens
	until you find the expected one or one in the skip_set.  If the
	expected one was found, read past it."

    This produces error recovery that is at least the same quality as in yacc
    and at least as obvious from looking at the code.

(3) Ease of data allocation in the semantic code.  You automatically get a
    fresh stack frame of variables for each nested occurrence of a particular
    construct.  Does anyone know how to implement nested constructs such as FOR
    loops in a yacc parser without having to explicitly implement a stack to
    keep track of information about each currently active for loop?  This is a
    real pain in the butt!

(4) Flexibility in use of the parser.  Unlike a yacc parser, your parser can
    (by definition) be called recursively.  For instance, suppose you decide
    that in order to process the semantics of the language construct you are
    looking at right now, you need to first process some declaration it depends
    on, which is in a different file.  What you want to do is suspend the
    current parse, reinvoke the parser on the other file, then come back to
    what you were doing.  This is trivial in a hand-written parser, and
    infeasible (as far as I know) in a yacc parser.
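
The "precedence-based technique" mentioned under point (1) is usually
precedence climbing.  A minimal sketch over single-digit operands -- invented
for illustration, not anyone's production parser:

```c
#include <assert.h>

static const char *src;

static int prec(char op)   /* binding power; 0 means "not an operator" */
{
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default:            return 0;
    }
}

static int expr(int min_prec)
{
    int lhs = *src++ - '0';              /* primary: a single digit */
    while (prec(*src) >= min_prec) {     /* min_prec >= 1, so non-ops stop us */
        char op = *src++;
        int rhs = expr(prec(op) + 1);    /* +1 gives left associativity */
        switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs /= rhs; break;
        }
    }
    return lhs;
}

static int eval(const char *s) { src = s; return expr(1); }
```

This collapses the whole expression part of the grammar into two small
functions, which is where much of the speed claim comes from.
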

The only disadvantage of a hand-written parser that I am aware of is that, for
an experimental language whose syntax is changing rapidly, the effort spent
modifying the parser may be considerable.  This is not a serious impediment
for a programmer who is experienced with recursive-descent parsers, but is a
consideration if the parser is to be maintained by programmers who have always
done their parsers with canned generators like yacc.

To summarize with a controversial claim: parser generators are a classic
example of a solution looking for a problem.  They exist primarily because
they are fun to write and because automated parsing techniques are fun for
computer scientists to study (or at least easy to study).  After using yacc in
three different commercial-quality projects, I do not find that it has saved
me any significant amount of labor, certainly not enough to justify the
restrictions it has put upon my parsers.  Parsing is the most straightforward
part of language processing, the part that is easiest to do well without any
table-driven or program-generator techniques.  It is the semantic processing
that is least-well understood and which is usually implemented quite poorly.


Michael Condict		{ihnp4|vax135|cuae2}!m10ux!mnc
AT&T Bell Labs		(201)582-5911    MH 3B-416
Murray Hill, NJ

------------------

P.S.

To support my claim about the resemblance of a recursive-descent parser to
the grammar, here is a short example.  Note that the parser does more
than the grammar, namely error reporting and recovery, using the above-
described EXPECT macro:

	Grammar				    R.D. Parser
	------				    -----------
				TOK_TYPE tok, gettok();

prog:	  stmt			void prog() { tok=gettok();
	| prog stmt		    while (tok != EOF_TOK) stmt();
	;			}

stmt:	  			void stmt() {
				    switch (tok) {
	  IF				case IF: tok=gettok();
	      expr			    expr();
	  THEN				    EXPECT(THEN, "THEN expected", END);
	  				    if (tok == END) break;
	      stmt			    stmt();
					    EXPECT(END, "END IF expected", END);
	  END IF			    EXPECT(IF, "END IF expected", END);
					    break;
	|				case ID: tok=gettok();
	  ID ':='			    EXPECT(ASSN, "':=' expected", ';');
		expr ';'		    expr();
					    EXPECT(';', "';' expected", ';');
					    break;
					default:
					    error("Bad token beginning stmt");
					    tok=gettok();
				    }
	;			}

expr:	  			void expr() {
	primary    		    primary();
				    while (1) {
					if (tok == '+') {
	| expr '+' primary		    tok=gettok();  primary();
					} else if (tok == '-') {
	| expr '-' primary		    tok=gettok();  primary();
					} else
					    break;
				    }
	;			}


primary:  ID			void primary() {
				    EXPECT(ID, "ID expected in expr", ';');
	;			}
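
A guess at an implementation of the EXPECT(token, msg, skip_set) abstraction
used above -- the signature and described behaviour are from the article, the
body is mine.  Tokens are plain chars here, and skip_set is a string:

```c
#include <assert.h>
#include <string.h>

static const char *input;
static char tok;
static int nerrors;

static void gettok(void) { tok = *input ? *input++ : '\0'; }
static void start(const char *s) { input = s; nerrors = 0; gettok(); }

static void EXPECT(char want, const char *msg, const char *skip_set)
{
    if (tok == want) { gettok(); return; }
    nerrors++;               /* a real parser would print msg here */
    (void)msg;
    while (tok != '\0' && tok != want && !strchr(skip_set, tok))
        gettok();            /* resynchronize on want or the skip_set */
    if (tok == want)
        gettok();            /* "if the expected one was found, read past it" */
}
```
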
[I've written my share of commercial compilers, and found that yacc saved me an
enormous amount of time.  To each his own.  It occurs to me that it shouldn't
be all that hard to change the yacc parser so that it implements the kind of
error recovery that people have been suggesting for RD, to fake up a stream
of tokens that lets the parse continue.  Has anyone done that, or are there
pitfalls I hadn't noticed?  -John]

jbn@glacier.STANFORD.EDU (John B. Nagle) (08/16/87)

     An improved YACC, "eyacc", was distributed with 4.1BSD and was used
in building the Pascal compiler.  "eyacc" allowed the inclusion of error
rules in the grammar which told the parser how to recover from errors.
Due to lack of interest, "eyacc" seems to have been discontinued.

					John Nagle

johnl@ima.UUCP (08/17/87)

> ... It occurs to me that it shouldn't
> be all that hard to change the yacc parser so that it implements the kind of
> error recovery that people have been suggesting for RD, to fake up a stream
> of tokens that lets the parse continue.  Has anyone done that, or are there
> pitfalls I hadn't noticed?  -John]

This might be practical, although you would probably have to extend yacc
a bit to make it work well:  it is essential that error repair prefer
terminating over extending potentially-infinite syntax like parameter
lists.  My gut feeling is that the bottom-up nature of LALR parsing gives
you rather less information about what's going on, perhaps enough to make
the idea impractical.  My LALR is rusty enough that I'm not sure of this,
though.  Certainly it's harder to do than the top-down version.

[That had occurred to me, but I suspect that heuristics like preferring a
token that reduces to one that shifts would help.  Besides, you have the
same problem in the top down case:  Assume the input string is
"foo = bar + ;" and after it's parsed the +, the lexer were unwise enough
to pass back an open paren as its invented token, because, say, that was
the first thing in the legal list.  Due to an nroff snafu, I haven't
finished reading the paper in the previous article -- they may have
addressed this.  -John]

johnl@ima.UUCP (08/17/87)

Michael Condict	writes:
>The advantages of a hand-written parser include:
>(1) Speed.
>(2) Flexibility in error-reporting and recovery.
>(3) Ease of data allocation in the semantic code.
>(4) Flexibility in use of the parser.

>The only disadvantage of a hand-written parser that I am aware of is that, for
>an experimental language whose syntax is changing rapidly, the effort spent
>modifying the parser may be considerable.  This is not a serious impediment
>for a programmer who is experienced with recursive-descent parsers, but is a
>consideration if the parser is to be maintained by programmers who have always
>done their parsers with canned generators like yacc.

I have used both recursive descent and parser generators; my experience is
that parser generators are far better.  The following are advantages that
I see to using a LR parser generator.
 1. Debugging of the grammar.  If your grammar is ambiguous, a parser
    generator will tell you about it.  For a hand-written parser, such
    an ambiguity would have to be discovered manually.
 2. Time savings.  Writing/modifying a grammar, rather than parsing code,
    is much faster and less error prone.
 3. Automatic error recovery.  Contrary to Mr. Condict's claim above, LR
    parsing is very well suited to nice error recovery--and it can happen
    "for free"--i.e., with no work other than writing the grammar.
    Although YACC has a poor error-recovery mechanism, there are other
    parser generators that handle error recovery very nicely.  I have yet
    to see an error-reporting/recovery mechanism for a hand-coded parser
    that is as nice as I've seen generated by a LR parser generator.  And
    the hand-coded ones have to be coded up!
 4. Multi-language use.  If your compiler-implementation language changes,
    one can get new parsers by simply writing a new "back-end" for the
    parser-generator and a new parser driver.  With recursive-descent, you
    have to rewrite all of your parsers.
 5. Wider range of grammars.  With recursive descent, one is limited to
    LL grammars, which are less powerful than LR.

>To summarize with a controversial claim: parser generators are a classic
>example of a solution looking for a problem.  
>... I do not find that it has saved
>me any significant amount of labor, certainly not enough to justify the
>restrictions it has put upon my parsers.

I have used YACC, as well as several other parser generators.  YACC was
probably the least easy to use of any, but I'd still pick it in a second
over having to write a parser by hand.  I liken the LR/RD parser
comparison to the HLL/assembly-language comparison.
LR parsing lets you use an appropriate tool and get the job done in
a timely manner, while RD parsing forces the programmer to specify
everything at a low level.

		Steve Vegdahl
		Computer Research Lab
		Tektronix Labs
		Beaverton, Oregon
harvard!seismo!sun!tekchips!stevev

johnl@ima.UUCP (08/18/87)

> ...My gut feeling is that the bottom-up nature of LALR parsing gives
> you rather less information about what's going on, perhaps enough to make
> the idea impractical.  My LALR is rusty enough that I'm not sure of this,
> though.  Certainly it's harder to do than the top-down version.

The LALR parser contains all the information about what's going on,
but since it is cleverly encoded in the states, one must go to some
trouble to extract it.  Bernard Dion showed how it can be done

 %A Bernard A. Dion
 %T Locally least-cost error correctors for context-free and
    context-sensitive parsers
 %D 1982
 %I UMI Research Press
 %C Ann Arbor
 %O Ph.D. Thesis, University of Wisconsin-Madison, 1978

With enough invariant-preserving transformation, Dion's algorithm can be
brought to a point that some may consider practical. Dion extracts a lot of
information, though. If you are willing to settle for less, more efficient
extractors are possible. If you just want to know what the possibilities are
for the next symbol, that's easy; it is just the symbols that can be shifted
by the current state, or by some state lower on the stack in the case that
this state can reduce.
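
The easy case in that last sentence can be shown concretely.  The tiny ACTION
table below is invented for illustration; handling a state that can reduce
would require simulating the reduction against the parse stack, which this
sketch deliberately omits:

```c
#include <assert.h>

enum { NSTATES = 3, NTOKS = 4 };
enum act { ERR, SHIFT, REDUCE, ACCEPT };

static const enum act action[NSTATES][NTOKS] = {
    /*           id      '+'     '('    EOF   */
    /* s0 */   { SHIFT,  ERR,    SHIFT, ERR    },
    /* s1 */   { ERR,    SHIFT,  ERR,   ACCEPT },
    /* s2 */   { REDUCE, REDUCE, ERR,   REDUCE },
};

/* Set out[t] = 1 for each token legal (shiftable or accepting) in 'state';
 * return how many there are. */
static int expected_tokens(int state, int out[NTOKS])
{
    int t, n = 0;
    for (t = 0; t < NTOKS; t++) {
        out[t] = (action[state][t] == SHIFT || action[state][t] == ACCEPT);
        n += out[t];
    }
    return n;
}
```
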

            Jon Mauney    ...!mcnc!ece-csc!mauney or mauney@ece-csc.ncsu.edu