[comp.compilers] Lexers in YACC?

cowan@marob.masa.com (John Cowan) (09/05/90)

I'm interested in parsing a language that is not LALR(1) using YACC.  The
general approach taken in the past (by others) is to handle the non-LALR(1)
portions (which are fairly simple) by writing a hard-coded "intermediate
stage" between the lexer and the parser.  This intermediate stage filters the
stream of lexer tokens and uses recursive descent to detect the problem
combinations and pack them up into single tokens which are then passed to the
parser.

It seems to me that it wouldn't be difficult to implement this intermediate
stage using YACC as well.  In this version, the lexer would pass its tokens
to the first-stage YACC grammar, which would pass most of them unchanged.
Sequences that needed it would be reduced as above.  The resulting stream
would then go to the second-stage YACC grammar, which would do most of the
parsing work.

What I need to know is, how does one write a set of YACC actions that cause
the resulting grammar to emulate yylex?  I understand the preprocessor tricks
that allow two different YACC grammars to co-exist without name collision.
What I don't understand is how to make the generated parser of the first
stage "return" tokens to its caller.  It seems to require generalized
co-routines, which of course C does not provide.  The operating system is
Unix System V.

Has anyone tried to solve this sort of problem?  Please send comments
directly to me, and I will summarize them to the net.  I have set followups
to "poster".
-- 
cowan@marob.masa.com			(aka ...!hombre!marob!cowan)
[I suppose you could run it as two processes connected by a pipe, the Unix
version of coroutines. -John]
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

mike@relay.EU.net (Michael K. Gschwind) (09/11/90)

In article <26E50E36.34BE@marob.masa.com> you write:
>What I don't understand is how to make the generated parser of the first
>stage "return" tokens to its caller.  It seems to require generalized
>co-routines, which of course C does not provide.  The operating system is
>Unix System V.

No, you don't want to do this. Rather, I would queue the tokens, insert 
magic tokens (which make the grammar LALR(1)) and after the first stage 
call the second. Sequence works fine as long as there is no interaction.
In the above scenario, there is no need for interaction.

BTW, in many cases, lex is powerful enough to make a grammar lalr(1)
with right context rules. You want to look into these. If these aren't
sufficient, you could use start states as well.  
E.g., if you have to differentiate between function definitions,
function declarations and variable dfinitions in C, you could use:

%%
{identifier}/\([^)]\)			return FUNCTION_DEFINITION;
{identifier}/\([^)]\);			return FUNCTION_DECLARATION;
{identifier}				return VARIABLE; 

%%

			bye,
				mike

Michael K. Gschwind             mike@vlsivie.at
Institute for VLSI-Design       mike@vlsivie.uucp
Technical University, Vienna    e182202@awituw01.bitnet
Voice: (++43).1.58801 8144      Fax:   (++43).1.569697
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.