jar@florida.eng.ileaf.com (Jim Roskind x5570) (09/10/90)
I thought that this posting led to an interesting question that I have
considered several times in the past, and that my comments might be of
general interest. Although I disagree with some of the poster's
motivations, the problem he ended up with *is* interesting, and I have a
few comments on it.

>I'm interested in parsing a language that is not LALR(1) using YACC. The
>general approach taken in the past (by others) is to handle the non-LALR(1)
>portions (which are fairly simple) by writing a hard-coded "intermediate
>stage" between the lexer and the parser. This intermediate stage filters the
>stream of lexer tokens and uses recursive descent to detect the problem

It has not been my experience with YACC that a recursive descent parser
really helps things. IF life is *so* bad that you *really* need a
recursive descent parser, then a) I am surprised, b) the language is not
going to be very readable, and c) what is the big benefit of using YACC
if you are writing a recursive descent parser anyway? I believe cfront
(*the* fundamental C++ implementation) went this way (combined YACC
*and* a recursive descent parser), and the negative impact on the
language syntax (re: ambiguities) is only recently being explored fully.

>combinations and pack them up into single tokens which are then passed to the
>parser.
>
>It seems to me that it wouldn't be difficult to implement this intermediate
>stage using YACC as well. In this version, the lexer would pass its tokens
             ^^^^^^^^^^^^

I think that if you can get away with cascading LALR(1) parsers to
achieve your ends, then it is likely that a single LALR(1) parser will
do the trick. Again, my experience is that an LR(1) grammar can do much
more than folks give it credit for.

>to the first-stage YACC grammar, which would pass most of them unchanged.
>Sequences that needed it would be reduced as above. The resulting stream
>would then go to the second-stage YACC grammar, which would do most of the
>parsing work.
There are reasons for using distinct grammars in parsing, such as when
the actions or bases of parsing in the two stages are very orthogonal.
In such cases, combining the grammars would add unnecessary complexity.
This situation is quite distinct from trying to overcome the
restrictions of an LALR(1) parser by cascading several such parsers. My
easy example is C, wherein in phase 1 the tokens need to be parsed based
on the preprocessor syntax (and expanded via macro expansion rules,
which are well beyond a simple LALR grammar), and then in phase 2 the
*resulting* tokens must be parsed according to standard C syntax.

>What I need to know is, how does one write a set of YACC actions that cause
>the resulting grammar to emulate yylex? I understand the preprocessor tricks
>that allow two different YACC grammars to co-exist without name collision.
>What I don't understand is how to make the generated parser of the first
>stage "return" tokens to its caller. It seems to require generalized
>co-routines, which of course C does not provide. The operating system is
>Unix System V.

This is a real problem. Most simply put, yyparse() is called exactly
once during a parse, and it then calls many other functions during its
actions. *IF* two yacc-generated parsers were active at the same time,
then one must have been called first, and it would remain dormant during
the entire run of the second. In contrast, yylex (from lex/flex) is
called *many* times during a run, and simply "picks up where it left
off". I believe what you meant by "emulate yylex" is that you wanted a
different calling convention (a literal interpretation of such a
statement would be more difficult to answer ;-) ). IMHO, John suggests
the most reasonable *external* approach:

> [I suppose you could run it as two processes connected by a pipe, the Unix
> version of coroutines. -John]

which amounts to creating two separate threads of control, so that the
two parsers may proceed independently.
In such a scenario, one would assume that one version of yacc was (as a
consequence of some actions) feeding some queue, while the other version
was extracting from the queue. In truth, a simpler solution is to allow
the version that parses the tokens directly supplied by lex/flex to run
to completion, and then run the second version. This approach is
insufficient when there is significant feedback from the second parse
that assists the lexical analysis (or primary parse), and must be
provided in a timely manner. A good example of such feedback is present
in a C language parser. A C parser must update the symbol table, and the
lexical analyzer must interrogate the symbol table to determine the
typedef status of arbitrary identifiers at the *current* scope. (Try to
figure out what:

	F(*b)[4];

means *without* knowing whether "F" is a typedefed name or a function
name at the current scope, and you will see the problem :-) ). When this
feedback requirement gets very tight (as it is in the above example),
the interprocess communication needed to synchronize the processes will
grow, and the whole system will get very complex. (Fortunately, in the C
example we don't need two parsers.)

The real solution, IMHO, is to rewrite the parser engine of a yacc
(Berkeley YACC would be a good candidate) so that all the state is
preserved in static variables, rather than in auto variables on the
stack. With such a rewrite, it should be possible to call yyparse()
repeatedly and have it "pick up where it left off". For efficiency
reasons, a special call could be made, like "goto YY_TEMPORARY_EXIT",
which would save all the state variables (held in auto variables) into
their static counterparts, rather than relying on the use of static
variables continuously. A typical entry into yyparse() would check for
a saved state, and restore it if it existed.
The final parser could continue to use the "old calling conventions",
and the intermediate parser(s) could use the "goto YY_TEMPORARY_EXIT"
call whenever some tokens were made available for processing by the
"calling" parser.

Historically, I believe that the interface to YACC proved sufficient for
many applications, and hence it was never questioned. Your question is
good, and the answer awaits an interested party (I believe that Corbett,
the author of Berkeley YACC, has more pressing tasks) to recode the
parser engine. The task is actually very straightforward. The real
question is, how much will it help you, and are you just asking for
power that you don't need? If you need the power, then I believe you are
the "interested party."

Jim Roskind
Independent Consultant
(407)729-4348 or (617)577-9813 x5570
jar@hq.ileaf.com or ...!uunet!leafusa!jar
--
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.