[comp.compilers] YACC grammar for C language

djones@decwrl.dec.com (Dave Jones) (01/13/89)
> In article <175@calmasd.GE.COM>, jhh@calmasd.GE.COM (Jung Hamel) writes:
> 
>    Does anybody have or know of YACC grammar for C that does
> not require the lexical analyser to differentiate typedef
> names from other identifier names?

Nope, nobody does. There ain't no such thing. If type-names
are treated the same as identifiers, C is inherently ambiguous.

> We have tried to "fix" a
> grammar with this but always get an illegal grammar for YACC.
> Our lexical analyser does not have access to the full set of
> typedef names.

The lexical analyzer is going to have to know something about type-names.
It is very tricky to do it well. I have a suspicion that the
C grammar that has been circulating on the net for the last year or
two, with the comment that all it needs is a lexical analyzer, is, in
reality, a rather arcane practical joke.  But then again maybe not.

I'll show you how I did it.  I'm not too happy with the solution.
If you can think of a better way, please show me. I peeked into pcc
to see how they did it. Believe me, you don't even want to know. (Gack.)

Anyhow, what I did was this:

The lexical scanner has access to the hash-table of type-names.
The trick is to tell the scanner when it needs to look an identifier
up in that table and when it does not. So I added two empty productions,
"td" and "ntd", to toggle typedefs on and off:

   td	: /* recognize typedef names as such. */
          { $$=0; typedefs_ok=1; }
        ;

   ntd	: /* recognize typedef names only as identifiers. */
	  { $$=0; typedefs_ok = 0; }
	;

These productions are used lots of places in the grammar, like so:

  struct_specifier
	: ntd STRUCT td IDENTIFIER '{' struct_declaration_list '}'
					{ $$=tree(N_StructIDDef); }

In this example, the parser first looks ahead at the keyword "struct",
and seeing it, decides to do the production ntd. Doing so turns
typename recognition off. Then the parser looks ahead to see an
identifier, which it cannot recognize as a typedef name.  Seeing the
identifier, it decides to do the production td, which turns typename
recognition back on.

Or so it would seem...

The catch is that yaccpar, the standard yacc parser-template, employs
what might be called "lazy LALR(1)" lookahead.  In some situations,
where the parser is destined to do a certain production willy-nilly,
it does not do a lookahead at all. (Recall that yacc uses "default"
productions in some error states.) It is difficult to predict where
yacc will be lazy and where it will not. You wouldn't want to depend
on that anyway. It could change on a yacc upgrade, or when you tweak
the C grammar.

Oh what to do? What to do? If the parser defers lexing IDENTIFIER
until after it does the production td, the identifier may incorrectly
be tagged as a type-name.

It does not help to move all the "td" productions to the other side
of IDENTIFIER. That breaks when yacc generates an "eager" lookahead.

I scratched my head about this for quite a while.  Finally it occured
to me that there is no reason whatsoever why yaccpar should use this
"lazy" lookahead scheme.  So I hacked yaccpar. (Now you see why I'm
not ecstaticly happy about the solution.) I changed it so that it always
fetches the lookahead token before changing state, even if the token does
not need to be inspected.  The modified algorithm is actually faster
than the old one, albeit not significantly so.
[From megatest!djones@decwrl.dec.com (Dave Jones)]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request