jar@florida.eng.ileaf.com (Jim Roskind x5570) (09/10/90)
I thought that this posting led to an interesting question that I have
considered several times in the past, and that my comments might be of
general interest. Although I disagree with some of the poster's
motivations, the problem he ended up with *is* interesting, and I have a
few comments on it.

>I'm interested in parsing a language that is not LALR(1) using YACC. The
>general approach taken in the past (by others) is to handle the non-LALR(1)
>portions (which are fairly simple) by writing a hard-coded "intermediate
>stage" between the lexer and the parser. This intermediate stage filters the
>stream of lexer tokens and uses recursive descent to detect the problem

It has not been my experience with YACC that a recursive descent parser
really helps things. IF life is *so* bad that you *really* need a
recursive descent parser, then a) I am surprised, b) the language is not
going to be very readable, and c) what is the big benefit of using YACC
if you are writing a recursive descent parser anyway? I believe cfront
(*the* fundamental C++ implementation) went this way (combined YACC
*and* a recursive descent parser), and the negative impact on the
language syntax (re: ambiguities) is only recently being explored fully.

>combinations and pack them up into single tokens which are then passed to the
>parser.
>
>It seems to me that it wouldn't be difficult to implement this intermediate
>stage using YACC as well. In this version, the lexer would pass its tokens
             ^^^^^^^^^^^^

I think that if you can get away with cascading LALR(1) parsers to
achieve your ends, then it is likely that a single LALR(1) parser will
do the trick. Again, my experience is that an LR(1) grammar can do much
more than folks give it credit for.

>to the first-stage YACC grammar, which would pass most of them unchanged.
>Sequences that needed it would be reduced as above. The resulting stream
>would then go to the second-stage YACC grammar, which would do most of the
>parsing work.
There are reasons for using distinct grammars in parsing, such as when
the actions or bases of parsing in the two stages are very orthogonal.
In such cases, combining the grammars would add unnecessary complexity.
This situation is quite distinct from trying to overcome the
restrictions of an LALR(1) parser by cascading several such parsers. My
easy example is C, wherein in phase 1 the tokens need to be parsed based
on the preprocessor syntax (and expanded via macro expansion rules,
which are well beyond a simple LALR grammar), and then in phase 2 the
*resulting* tokens must be parsed according to standard C syntax.

>What I need to know is, how does one write a set of YACC actions that cause
>the resulting grammar to emulate yylex? I understand the preprocessor tricks
>that allow two different YACC grammars to co-exist without name collision.
>What I don't understand is how to make the generated parser of the first
>stage "return" tokens to its caller. It seems to require generalized
>co-routines, which of course C does not provide. The operating system is
>Unix System V.

This is a real problem. Most simply put, yyparse() is called exactly
once during a parse, and it then calls many other functions during its
actions. *IF* two yacc-generated parsers were active at the same time,
then one must have been called first, and it would remain dormant during
the entire run of the second. In contrast, yylex (from lex/flex) is
called *many* times during a run, and simply "picks up where it left
off". I believe what you meant by "emulate yylex" is that you wanted a
different calling convention (a literal interpretation of such a
statement would be more difficult to answer ;-) ). IMHO, John suggests
the most reasonable *external* approach:

> [I suppose you could run it as two processes connected by a pipe, the Unix
> version of coroutines. -John]

which amounts to creating two separate threads of control, so that the
two parsers may proceed independently.
In such a scenario, one would assume that one version of yacc was (as a
consequence of some actions) feeding some queue, while the other version
was extracting from the queue. In truth, a simpler solution is to allow
the version that parses the tokens directly supplied by lex/flex to run
to completion, and then run the second version. This approach is
insufficient when there is significant feedback from the second parse
that assists the lexical analysis (or primary parse), and must be
provided in a timely manner. A good example of such feedback is present
in a C language parser. A C parser must update the symbol table, and the
lexical analyzer must interrogate the symbol table to determine the
typedef status of arbitrary identifiers at the *current* scope. (Try to
figure out what:

	F(*b)[4];

means *without* knowing whether "F" is a typedefed name or a function
name at the current scope, and you will see the problem :-) ). When this
feedback requirement gets very tight (as it is in the above example),
the interprocess communication needed to synchronize the processes will
grow, and the whole system will get very complex. (Fortunately, in the C
example we don't need two parsers.)

The real solution, IMHO, is to rewrite the parser engine of a yacc
(Berkeley YACC would be a good candidate) so that all the state is
preserved in static variables, rather than in auto variables on the
stack. With such a rewrite, it should be possible to call yyparse()
repeatedly and have it "pick up where it left off". For efficiency
reasons, a special call could be made, like "goto YY_TEMPORARY_EXIT",
which would save all the state variables (held in auto variables) into
their static counterparts, rather than relying on the use of static
variables continuously. A typical entry into yyparse() would check for
a saved state, and restore it if it existed.
The final parser could continue to use the "old calling conventions",
and the intermediate parser(s) could use the "goto YY_TEMPORARY_EXIT"
call whenever some tokens were made available for processing by the
"calling" parser.

Historically, I believe that the interface to YACC proved sufficient for
many applications, and hence it was never questioned. Your question is
good, and the answer awaits an interested party (I believe that Corbett,
the author of Berkeley YACC, has more pressing tasks) to recode the
parser engine. The task is actually very straightforward. The real
question is, how much will it help you, and are you just asking for
power that you don't need? If you need the power, then I believe you are
the "interested party."

Jim Roskind
Independent Consultant
(407)729-4348 or (617)577-9813 x5570
jar@hq.ileaf.com or ...!uunet!leafusa!jar
--
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.