esink@turia.dit.upm.es (Eric Wayne Sink) (10/16/90)
Greetings compiler gurus: I have a few questions which are probably a bit basic for this group, but I'd like ideas anyway. I believe that few people write parsers without YACC (or am I wrong ?). Suppose you WERE writing a recursive descent parser without YACC, for the purpose of understanding how they work. Take the example language to be C; with the following simple grammar: statement : typename identifier ';' | typename identifier '(' ')' ';' | typename identifier '[' constant ']' ';' ; Obviously this grammar is not useful or complete, I have made it only complex enough to illustrate my problem. Now, try parsing the following statements: int x; int x(); int x[5]; The recursive descent routines for 'statement' might look something like this, as I am envisioning it: int do_statement() { int match = 0; match = try_variable(); if (! match) match = try_function; if (! match) match = try_array(); if (! match) error(); } int try_variable() { token typename, varname; semi; typename = get_token(); varname = get_token(); semi = get_token(); if (typename == TYPE && varname == IDENTIFIER && semi == SEMICOLON) { /* process it, put it into the symbol table, whatever */ return 1; } else { return 0; /* fail */ } } The other routines are similar : the idea is that they read tokens and see if they fit the correct pattern, returning a fail code if they do not. try_variable is something I am calling an AND routine; for lack of correct terminology. do_statement is something I call an OR routine, because any of the alternatives will satisfy it. The question is: suppose we are parsing int wolf(); The parser above will try to parse it as a variable declaration, reading the first 3 tokens; and finding that the third one failed. Therefore, it will return failing, and do_statement will try the next alternative. However, three tokens will have been read, and try_function needs ALL three of them to determine if the pattern matches. They have been lost, unless some method is found to retain them. How is this problem usually solved ? I wrote a SIMPLE recursive descent parser in a class, but only needed a single character of looking back - here we need 3 tokens, and this is a simple example. The situation could be much worse in some grammars, couldn't it ? Finally, am I going about this all the wrong way ? Any comments on the ideas I have given here would be appreciated... Thanks to all ! Eric W. Sink Universidad Politecnica de Madrid Departamento de Telematica esink@turia.dit.upm.es [Either you need to permit arbitrary pushback, or else merge your states, since you can't tell what kind of declaration it is until after you've read the identifier. Most grammars are not LL(1), although many can be rewritten to be by adding lots of sub-productions. -John] -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.
firth@sei.cmu.edu (Robert Firth) (10/18/90)
In article <9010160921.AA00952@turia.dit.upm.es> esink@turia.dit.upm.es (Eric Wayne Sink) writes: >I believe that few people write parsers without YACC (or am I wrong ?). (Well, I wouldn't go near yacc with a ten-meter pole, but who cares what I think!) >Suppose you WERE writing a recursive descent parser without YACC, for >the purpose of understanding how they work. Take the example language >to be C; with the following simple grammar: > >statement > : typename identifier ';' > | typename identifier '(' ')' ';' > | typename identifier '[' constant ']' ';' > ; > >Obviously this grammar is not useful or complete, I have made it only >complex enough to illustrate my problem. Now, try parsing the following >statements: > >int x; >int x(); >int x[5]; Yes, I believe you are going about it the wrong way. The basic idea of recursive descent is that you don't call a recogniser until you are sure that (in the absence of errors) it will indeed find what it expects. For example, suppose you have recognisers for the IF statement, the WHILE statement, and the GOTO statement. You enter them with the initial token already consumed, so the code that gets you there looks like this IF nextis(iftoken) THEN readifstatement() ELSE IF nextis(whiletoken) THEN readwhilestatement() ELSE IF nextis(gototoken) THEN readgotostatement() ELSE error... The auxiliary routine nextis(T) is about the most helpful thing to have in such a parser: if the current token is T then it CONSUMES that token and returns TRUE, otherwise it LEAVES THE TOKEN ALONE and returns FALSE. Now, when we get to readifstatement() we have consumed the IF, so the body reads something like readcondition() nexthadbetterbe(thentoken) readstatement() IF nextis(elsetoken) THEN readstatement() Let's turn now to your example. At the top level, we read first whatever is common in the left parts of the productions, since they have to be there in all cases: readtype() readidentifier() We now switch on the next token: IF nextis(semicolontoken) THEN -- we're done ELSE IF nextis(leftparentoken) THEN read_rest_of_case_2() ELSE IF nextis(leftbrackettoken) THEN read_rest_of_case_3() ELSE error... There's more to it than that, of course: passing information down to the recognisers, building parse trees, and all the error handling. But the above should give you the basic structure of the code. As a final useful trick: I always have the recognisers return the tree fragment for the production they finish recognising, so in reality the code tends to look like this IF nextis(iftoken) RETURN readifstatement() ... readifstatement: c := readcondition() nexthadbetterbe(thentoken) s1 := readstatement() s2 := IF nextis(elsetoken) THEN readstatement() ELSE NIL RETURN buildcell(iftoken,c,s1,s2) Good luck, and hope it all helps! Robert Firth -- Send compilers articles to compilers@esegue.segue.boston.ma.us {ima | spdcc | world}!esegue. Meta-mail to compilers-request@esegue.