[comp.compilers] Hand-rolled parsers

firth@sei.cmu.edu (Robert Firth) (11/10/89)

This addresses the question 'Why do people write recursive descent parsers by
hand, rather than using parser-generator tools such as lex and yacc?'.  It
answers a slightly different question, namely, 'Why do I write RD parsers by
hand...', and as such represents a personal and biased view.

1. Ease of construction.

Given a decent grammar, I can write the basic RD parser in a couple of days,
and do some solid component testing along the way.  Moreover, I can write it
in a reasonable language.  The tools I've used, on the other hand, have made
the job much harder, for these reasons

. truly bizarre, unfriendly, and unreadable input format or syntax

. very unhelpful and unforgiving error messages.  such as the message
  "12 shift/reduce conflicts" as the first level error message; the next
  level is a 50 page trace of the parser generator internal states.

. serious bugs in the lexer generator or parser generator, leading to any of:
  infinite loops, core dumps, silent generation of code skeletons that
  won't compile.

2. Flexibility

The hand lexer and parser are not committed to a single inadequate technique.
For example, an RD parser has no trouble distinguishing between the character
"'" as the start of an Ada character literal, and the character "'" as the
Ada attribute symbol.  The average lexer generator cannot disambiguate these
cases.  Another one I had trouble with was the difference between Pascal
"1.10" and "1..10".

As another example, when parsing an expression I can use good old operator
precedence techniques, and gain a lot of efficiency for a small loss of
robustness.

Finally, if I do tree building by hand I can straighten out on the fly a lot
of stuff that might cause trouble during later traversals, for example by
flattening nested lists, massaging slightly different productions into a
common format, and so on.

3. Error Handling

I've found most mechanically generated parsers to be truly dreadful at error
handling.  A message like "syntax error on line 33" is not just amateur, it's
pathetic.  No user should have to put up with such rubbish.  As another
example, try giving a certain C compiler an input where the ";" has been
omitted before the "}".  This is arguably the simplest repair possible, but
you'll get a stream junk consequential error messages.

With an RD parser, the error handling

. can be tailored closely to the error situation

. can give a very informative message

. can be tuned (in respect of both message and recovery
  action) based on usage statistics

4. Efficiency

RD parsers written by hand are much smaller and faster than the
machine-generated ones I've seen.  One example from my own experience is a
parser for Modula-2, where the hand version is about 600 lines of code (24
kbytes object) and the machine version was over TWO MEGABYTES of object code
and static data.

For the future - one day I'll give up these antique ways, and move to modern
technology.  But that technology will at have to be based on something at
least as powerful as attribute grammars or von-Wyngaarden grammars; as far as
I can see, mechanical context-free parsing just doesn't hack it.

As I said, just a personal view.
[From firth@sei.cmu.edu (Robert Firth)]
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus}!esegue.  Meta-mail to compilers-request@esegue.
Please send responses to the author of the message, not the poster.

bart@videovax.tv.tek.com (Bart Massey) (11/12/89)

In article <1989Nov10.025412.4014@esegue.segue.boston.ma.us> firth@sei.cmu.edu (Robert Firth) writes:
> This addresses the question 'Why do people write recursive descent parsers by
> hand, rather than using parser-generator tools such as lex and yacc?'.  It
> answers a slightly different question, namely, 'Why do I write RD parsers by
> hand...', and as such represents a personal and biased view.

I just thought it was important to correct what I believe are some
fundamental misconceptions of the author regarding existing tools.  These
are, of course, *my* opinions and beliefs, so take them for what they're
worth.

> 1. Ease of construction.
> 
> Given a decent grammar, I can write the basic RD parser in a couple of days,
> and do some solid component testing along the way.  Moreover, I can write it
> in a reasonable language.  The tools I've used, on the other hand, have made
> the job much harder, for these reasons
...
>
> . very unhelpful and unforgiving error messages.  such as the message
>   "12 shift/reduce conflicts" as the first level error message; the next
>   level is a 50 page trace of the parser generator internal states.

Yeah, but given a grammar suitable for RD parsing, you won't *see* any of
these error messages!  I agree -- once I have a grammar suitable for RD
parsing, it wouldn't take me any longer to write an RD parser than a LR
parser using a parser generator.  What this argument ignores is that many
artificial languages are either LR(1) or nearly so *before any grammar
transformations*, and that the parser-generator may do some simple
transformations automatically and *virtually instantaneously*.  I've tried
to transform a non-trivial grammar for RD by hand -- I didn't enjoy it much.

> . serious bugs in the lexer generator or parser generator, leading to any of:
>   infinite loops, core dumps, silent generation of code skeletons that
>   won't compile.

Could happen.  Using FLEX+BISON (how about PD-YACC, anyone?) I've never seen
anything I would class as a serious bug.  And in fact I wrote a 500-line
lexer and 2000-line parser recently, without running across any bugs in
either LEX or YACC (misfeatures, but no bugs :-)!

> 2. Flexibility
> 
> The hand lexer and parser are not committed to a single inadequate technique.
> For example, an RD parser has no trouble distinguishing between the character
> "'" as the start of an Ada character literal, and the character "'" as the
> Ada attribute symbol.  The average lexer generator cannot disambiguate these
> cases.  Another one I had trouble with was the difference between Pascal
> "1.10" and "1..10".

Note that one can always pass information from the parser back to the lexer
using global variables to guide the lex.  While this isn't encouraged, IMHO
it is just as powerful and clean as embedding the pass-back in the RD parser.

As for your second example, I'm confused -- isn't ".." a seperate token in
your Pascal compiler?  If so, you can give LEX and clones rules for "."
(structure member), ".." (subrange), and "[0-9]*.[0-9]+" or some such
(floating constant), and it will disambiguate in what I believe to be
exactly the right way -- choose the longest matching prefix, or the earlier
one on identical lengths!

> As another example, when parsing an expression I can use good old operator
> precedence techniques, and gain a lot of efficiency for a small loss of
> robustness.

YACC and clones, at least, support operator-precedence parsing as a natural
part of the input -- it is traditional to use this for any expression-like
construct in the input!  Typing in a C expression lexer and parser is
literally a 1-hour exercise -- you should try it sometime!

...

> For the future - one day I'll give up these antique ways, and move to modern
> technology.  But that technology will at have to be based on something at
> least as powerful as attribute grammars or von-Wyngaarden grammars; as far as
> I can see, mechanical context-free parsing just doesn't hack it.

If you're writing a commercial compiler for a large compiled HLL, I can
understand where the extra investment of effort in a "custom" scanner and/or
parser might conceivably be worth it.  But given a finite investment of
effort, I personally prefer to spend as little as possible of it on the
best-understood part of formal language analysis -- scanning and parsing.
LEX and YACC have so far helped me to do so, without incurring costs which
were significant to me in my application.

					Bart Massey
					..tektronix!videovax.tv.tek.com!bart
					..tektronix!reed.bitnet!bart
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus}!esegue.  Meta-mail to compilers-request@esegue.
Please send responses to the author of the message, not the poster.

chris@mimsy.umd.edu (Chris Torek) (11/13/89)

Speaking of error recovery and error messages: this brings up something
which is a problem in some newer C compilers.  Fancy error reporting is
useful with, say, Pascal or Modula 2, where you might see something
like this:

	"foo.p", line 12: missing `]' inserted
	var a : array[1..10 of integer;
	-------------------^

Unfortunately, when this is naively applied to C code, the result
is often something like this (lines broken at 72 characters):

	"foo.c", line 12: missing `;' inserted
		(--((&__stdioF[1]))->_w < 0 ? ((&__stdioF[1]))->_w >= ((
&__stdioF[1]))->_lbfsize ? (*((&__stdioF[1]))->_p = ((--((&__stdioF[0]))
->_r < 0 ? __stdio_rget((&__stdioF[0])) : (int)(*((&__stdioF[0]))->_p++)
))), *((&__stdioF[1]))->_p != '\n' ? (int)*((&__stdioF[1]))->_p++ : __st
dio_wbuf('\n', (&__stdioF[1])) : __stdio_wbuf((int)((--((&__stdioF[0]))-
>_r < 0 ? __stdio_rget((&__stdioF[0])) : (int)(*((&__stdioF[0]))->_p++))
), (&__stdioF[1])) : (*((&__stdioF[1]))->_p = ((--((&__stdioF[0]))->_r <
 0 ? __stdio_rget((&__stdioF[0])) : (int)(*((&__stdioF[0]))->_p++))), (i
nt)*((&__stdioF[1]))->_p++)) chars++;
		--------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
----------------------------^

Of course, the mess the compiler spat out above is nothing like the
original line 12, which actually read

		putchar(getchar()) chars++;

In practise, the errors I have seen have been even uglier than the
one above, since there is actually more text, most of it blanks and
tabs, which I compressed out of the above expansion.

(Not to pick on them, but this particular bit of naive `friendly' error
reporting is done by the MIPSco C compiler as shipped by DEC for Ultrix
3.x on the DECstation 3100.  I have not seen anyone else actually
commit this bit of stupidity.  Friendly error reporting is all well and
good, but at least this should be optional, so that it does not get in
the way of integrated environments such as that provided by Emacs.
Moreover, it should refer to the original source code, not any
intermediate output.  [It is possible that the error is only in the
intermediate output, of course, as, e.g., with a bad `#define'.])
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus}!esegue.  Meta-mail to compilers-request@esegue.
Please send responses to the author of the message, not the poster.

diamond@ws.sony.junet (Norman Diamond) (11/14/89)

In message <1989Nov12.222136.14085@esegue.segue.boston.ma.us>,
chris@mimsy.umd.edu (Chris Torek) complains about "fancy" error
reporting in the MIPS C compiler, a technique which is less helpful
than in non-preprocessed languages (lines broken at 72 characters):

>[begin quote]

"foo.c", line 12: missing `;' inserted
		(--((&__stdioF[1]))->_w < 0 ? ((&__stdioF[1]))->_w >= ((
&__stdioF[1]))->_lbfsize ? (*((&__stdioF[1]))->_p = ((--((&__stdioF[0]))
->_r < 0 ? __stdio_rget((&__stdioF[0])) : (int)(*((&__stdioF[0]))->_p++)
))), *((&__stdioF[1]))->_p != '\n' ? (int)*((&__stdioF[1]))->_p++ : __st
dio_wbuf('\n', (&__stdioF[1])) : __stdio_wbuf((int)((--((&__stdioF[0]))-
>_r < 0 ? __stdio_rget((&__stdioF[0])) : (int)(*((&__stdioF[0]))->_p++))
), (&__stdioF[1])) : (*((&__stdioF[1]))->_p = ((--((&__stdioF[0]))->_r <
 0 ? __stdio_rget((&__stdioF[0])) : (int)(*((&__stdioF[0]))->_p++))), (i
nt)*((&__stdioF[1]))->_p++)) chars++;
		--------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
----------------------------^

Of course, the mess the compiler spat out above is nothing like the
original line 12, which actually read

		putchar(getchar()) chars++;

[...]

Friendly error reporting is all well and good, but at least this should be
optional, so that it does not get in the way of integrated environments such
as that provided by Emacs.  Moreover, it should refer to the original source
code, not any intermediate output.

>[end quote]

In fact I found this very helpful when errors were introduced by macros,
e.g. when there is a syntax error, or a call to what should be a nested
macro but which forgot to exist.  (Dr. Torek mentioned this in an aside,
as a possibility.  Please be assured that it is very very possible.)

Furthermore Dr. Torek's request is difficult to meet in systems that
separate preprocessing into a separate pass (I believe Dr. Torek might
know someone who worked on such a system :-)).

I think the best option is, if you don't like the echo of the processed
line and the mess of uncountable (well, not reliably countable) lines of
hyphens, then ignore this mess and only read the line number at the
beginning of the error message.  Then you can still enjoy hunting on
your own through that original source line and the macro calls, as you
get to do in older systems.
--
Norman Diamond, Sony Corp. (diamond%ws.sony.junet@uunet.uu.net seems to work)
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus}!esegue.  Meta-mail to compilers-request@esegue.
Please send responses to the author of the message, not the poster.