firth@sei.cmu.edu (Robert Firth) (11/10/89)
This addresses the question 'Why do people write recursive descent
parsers by hand, rather than using parser-generator tools such as lex
and yacc?'. It answers a slightly different question, namely, 'Why do I
write RD parsers by hand...', and as such represents a personal and
biased view.

1. Ease of construction.

Given a decent grammar, I can write the basic RD parser in a couple of
days, and do some solid component testing along the way. Moreover, I
can write it in a reasonable language. The tools I've used, on the
other hand, have made the job much harder, for these reasons:

. truly bizarre, unfriendly, and unreadable input format or syntax

. very unhelpful and unforgiving error messages, such as the message
  "12 shift/reduce conflicts" as the first-level error message; the
  next level is a 50-page trace of the parser generator's internal
  states.

. serious bugs in the lexer generator or parser generator, leading to
  any of: infinite loops, core dumps, silent generation of code
  skeletons that won't compile.

2. Flexibility

The hand-written lexer and parser are not committed to a single
inadequate technique. For example, an RD parser has no trouble
distinguishing between the character "'" as the start of an Ada
character literal and the character "'" as the Ada attribute symbol.
The average lexer generator cannot disambiguate these cases. Another
one I had trouble with was the difference between Pascal "1.10" and
"1..10".

As another example, when parsing an expression I can use good old
operator precedence techniques, and gain a lot of efficiency for a
small loss of robustness.

Finally, if I do tree building by hand I can straighten out on the fly
a lot of stuff that might cause trouble during later traversals, for
example by flattening nested lists, massaging slightly different
productions into a common format, and so on.

3. Error Handling

I've found most mechanically generated parsers to be truly dreadful at
error handling.
A message like "syntax error on line 33" is not just amateur, it's
pathetic. No user should have to put up with such rubbish. As another
example, try giving a certain C compiler an input where the ";" has
been omitted before the "}". This is arguably the simplest repair
possible, but you'll get a stream of junk consequential error
messages. With an RD parser, the error handling

. can be tailored closely to the error situation
. can give a very informative message
. can be tuned (in respect of both message and recovery action) based
  on usage statistics

4. Efficiency

RD parsers written by hand are much smaller and faster than the
machine-generated ones I've seen. One example from my own experience
is a parser for Modula-2, where the hand version is about 600 lines of
code (24 kbytes object) and the machine version was over TWO MEGABYTES
of object code and static data.

For the future - one day I'll give up these antique ways and move to
modern technology. But that technology will have to be based on
something at least as powerful as attribute grammars or van
Wijngaarden grammars; as far as I can see, mechanical context-free
parsing just doesn't hack it.

As I said, just a personal view.

[From firth@sei.cmu.edu (Robert Firth)]
--
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus}!esegue. Meta-mail to compilers-request@esegue.
Please send responses to the author of the message, not the poster.
bart@videovax.tv.tek.com (Bart Massey) (11/12/89)
In article <1989Nov10.025412.4014@esegue.segue.boston.ma.us>
firth@sei.cmu.edu (Robert Firth) writes:
> This addresses the question 'Why do people write recursive descent
> parsers by hand, rather than using parser-generator tools such as
> lex and yacc?'. It answers a slightly different question, namely,
> 'Why do I write RD parsers by hand...', and as such represents a
> personal and biased view.

I just thought it was important to correct what I believe are some
fundamental misconceptions of the author regarding existing tools.
These are, of course, *my* opinions and beliefs, so take them for what
they're worth.

> 1. Ease of construction.
>
> Given a decent grammar, I can write the basic RD parser in a couple
> of days, and do some solid component testing along the way.
> Moreover, I can write it in a reasonable language. The tools I've
> used, on the other hand, have made the job much harder, for these
> reasons ...
>
> . very unhelpful and unforgiving error messages, such as the message
>   "12 shift/reduce conflicts" as the first-level error message; the
>   next level is a 50-page trace of the parser generator's internal
>   states.

Yeah, but given a grammar suitable for RD parsing, you won't *see* any
of these error messages! I agree -- once I have a grammar suitable for
RD parsing, it wouldn't take me any longer to write an RD parser than
an LR parser using a parser generator. What this argument ignores is
that many artificial languages are either LR(1) or nearly so *before
any grammar transformations*, and that the parser generator may do
some simple transformations automatically and *virtually
instantaneously*. I've tried to transform a non-trivial grammar for RD
by hand -- I didn't enjoy it much.

> . serious bugs in the lexer generator or parser generator, leading
>   to any of: infinite loops, core dumps, silent generation of code
>   skeletons that won't compile.

Could happen. Using FLEX+BISON (how about PD-YACC, anyone?)
I've never seen anything I would class as a serious bug. And in fact I
wrote a 500-line lexer and 2000-line parser recently, without running
across any bugs in either LEX or YACC (misfeatures, but no bugs :-)!

> 2. Flexibility
>
> The hand lexer and parser are not committed to a single inadequate
> technique. For example, an RD parser has no trouble distinguishing
> between the character "'" as the start of an Ada character literal,
> and the character "'" as the Ada attribute symbol. The average lexer
> generator cannot disambiguate these cases. Another one I had trouble
> with was the difference between Pascal "1.10" and "1..10".

Note that one can always pass information from the parser back to the
lexer using global variables to guide the lex. While this isn't
encouraged, IMHO it is just as powerful and clean as embedding the
pass-back in the RD parser. As for your second example, I'm confused --
isn't ".." a separate token in your Pascal compiler? If so, you can
give LEX and clones rules for "." (structure member), ".." (subrange),
and "[0-9]*\.[0-9]+" or some such (floating constant), and it will
disambiguate in what I believe to be exactly the right way -- choose
the longest matching prefix, or the earlier rule when the matches are
of identical length!

> As another example, when parsing an expression I can use good old
> operator precedence techniques, and gain a lot of efficiency for a
> small loss of robustness.

YACC and clones, at least, support operator-precedence parsing as a
natural part of the input -- it is traditional to use this for any
expression-like construct in the input! Typing in a C expression lexer
and parser is literally a 1-hour exercise -- you should try it
sometime!

...

> For the future - one day I'll give up these antique ways, and move
> to modern technology.
> But that technology will have to be based on something at least as
> powerful as attribute grammars or van Wijngaarden grammars; as far
> as I can see, mechanical context-free parsing just doesn't hack it.

If you're writing a commercial compiler for a large compiled HLL, I
can understand where the extra investment of effort in a "custom"
scanner and/or parser might conceivably be worth it. But given a
finite investment of effort, I personally prefer to spend as little as
possible of it on the best-understood part of formal language analysis
-- scanning and parsing. LEX and YACC have so far helped me to do so,
without incurring costs which were significant to me in my
application.

Bart Massey
..tektronix!videovax.tv.tek.com!bart
..tektronix!reed.bitnet!bart
--
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus}!esegue. Meta-mail to compilers-request@esegue.
Please send responses to the author of the message, not the poster.
chris@mimsy.umd.edu (Chris Torek) (11/13/89)
Speaking of error recovery and error messages: this brings up something
which is a problem in some newer C compilers. Fancy error reporting is
useful with, say, Pascal or Modula 2, where you might see something
like this:
"foo.p", line 12: missing `]' inserted
var a : array[1..10 of integer;
-------------------^
Unfortunately, when this is naively applied to C code, the result
is often something like this (lines broken at 72 characters):
"foo.c", line 12: missing `;' inserted
(--((&__stdioF[1]))->_w < 0 ? ((&__stdioF[1]))->_w >= ((
&__stdioF[1]))->_lbfsize ? (*((&__stdioF[1]))->_p = ((--((&__stdioF[0]))
->_r < 0 ? __stdio_rget((&__stdioF[0])) : (int)(*((&__stdioF[0]))->_p++)
))), *((&__stdioF[1]))->_p != '\n' ? (int)*((&__stdioF[1]))->_p++ : __st
dio_wbuf('\n', (&__stdioF[1])) : __stdio_wbuf((int)((--((&__stdioF[0]))-
>_r < 0 ? __stdio_rget((&__stdioF[0])) : (int)(*((&__stdioF[0]))->_p++))
), (&__stdioF[1])) : (*((&__stdioF[1]))->_p = ((--((&__stdioF[0]))->_r <
0 ? __stdio_rget((&__stdioF[0])) : (int)(*((&__stdioF[0]))->_p++))), (i
nt)*((&__stdioF[1]))->_p++)) chars++;
--------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
----------------------------^
Of course, the mess the compiler spat out above is nothing like the
original line 12, which actually read
putchar(getchar()) chars++;
In practice, the errors I have seen have been even uglier than the
one above, since there is actually more text, most of it blanks and
tabs, which I compressed out of the above expansion.
(Not to pick on them, but this particular bit of naive `friendly' error
reporting is done by the MIPSco C compiler as shipped by DEC for Ultrix
3.x on the DECstation 3100. I have not seen anyone else actually
commit this bit of stupidity. Friendly error reporting is all well and
good, but at least this should be optional, so that it does not get in
the way of integrated environments such as that provided by Emacs.
Moreover, it should refer to the original source code, not any
intermediate output. [It is possible that the error is only in the
intermediate output, of course, as, e.g., with a bad `#define'.])
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@cs.umd.edu Path: uunet!mimsy!chris
--
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus}!esegue. Meta-mail to compilers-request@esegue.
Please send responses to the author of the message, not the poster.
diamond@ws.sony.junet (Norman Diamond) (11/14/89)
In message <1989Nov12.222136.14085@esegue.segue.boston.ma.us>,
chris@mimsy.umd.edu (Chris Torek) complains about "fancy" error
reporting in the MIPS C compiler, a technique which is less helpful
than in non-preprocessed languages:

> "foo.c", line 12: missing `;' inserted
> (--((&__stdioF[1]))->_w < 0 ? ((&__stdioF[1]))->_w >= ((
> &__stdioF[1]))->_lbfsize ? (*((&__stdioF[1]))->_p = ((--((&__stdioF[0]))
> [...]
> ----------------------------^
> Of course, the mess the compiler spat out above is nothing like the
> original line 12, which actually read
> putchar(getchar()) chars++;
> [...] Friendly error reporting is all well and good, but at least
> this should be optional, so that it does not get in the way of
> integrated environments such as that provided by Emacs. Moreover, it
> should refer to the original source code, not any intermediate
> output.
In fact I found this very helpful when errors were introduced by
macros, e.g. when there is a syntax error, or a call to what should be
a nested macro but which forgot to exist. (Dr. Torek mentioned this in
an aside, as a possibility. Please be assured that it is very, very
possible.) Furthermore, Dr. Torek's request is difficult to meet in
systems that separate preprocessing into a separate pass (I believe
Dr. Torek might know someone who worked on such a system :-)).

I think the best option is this: if you don't like the echo of the
processed line and the mess of uncountable (well, not reliably
countable) lines of hyphens, then ignore the mess and read only the
line number at the beginning of the error message. Then you can still
enjoy hunting on your own through that original source line and the
macro calls, as you get to do in older systems.
--
Norman Diamond, Sony Corp.
(diamond%ws.sony.junet@uunet.uu.net seems to work)
--
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus}!esegue. Meta-mail to compilers-request@esegue.
Please send responses to the author of the message, not the poster.