djones@decwrl.dec.com (Dave Jones) (01/13/89)
> In article <175@calmasd.GE.COM>, jhh@calmasd.GE.COM (Jung Hamel) writes: > > Does anybody have or know of YACC grammar for C that does > not require the lexical analyser to differentiate typedef > names from other identifier names? Nope, nobody does. There ain't no such thing. If type-names are treated the same as identifiers, C is inherently ambiguous. > We have tried to "fix" a > grammar with this but always get an illegal grammar for YACC. > Our lexical analyser does not have access to the full set of > typedef names. The lexical analyzer is going to have to know something about type-names. It is very tricky to do it well. I have a suspicion that the C grammar that has been circulating on the net for the last year or two, with the comment that all it needs is a lexical analyzer, is, in reality, a rather arcane practical joke. But then again maybe not. I'll show you how I did it. I'm not too happy with the solution. If you can think of a better way, please show me. I peeked into pcc to see how they did it. Believe me, you don't even want to know. (Gack.) Anyhow, what I did was this: The lexical scanner has access to the hash-table of type-names. The trick is to tell the scanner when it needs to look an identifier up in that table and when it does not. So I added two empty productions, "td" and "ntd", to toggle typedefs on and off: td : /* recognize typedef names as such. */ { $$=0; typedefs_ok=1; } ; ntd : /* recognize typedef names only as identifiers. */ { $$=0; typedefs_ok = 0; } ; These productions are used lots of places in the grammar, like so: struct_specifier : ntd STRUCT td IDENTIFIER '{' struct_declaration_list '}' { $$=tree(N_StructIDDef); } In this example, the parser first looks ahead at the keyword "struct", and seeing it, decides to do the production ntd. Doing so turns typename recognition off. Then the parser looks ahead to see an identifier, which it cannot recognize as a typedef name. Seeing the identifier, it decides to do the production td, which turns typename recognition back on. Or so it would seem... The catch is that yaccpar, the standard yacc parser-template, employs what might be called "lazy LALR(1)" lookahead. In some situations, where the parser is destined to do a certain production willy-nilly, it does not do a lookahead at all. (Recall that yacc uses "default" productions in some error states.) It is difficult to predict where yacc will be lazy and where it will not. You wouldn't want to depend on that anyway. It could change on a yacc upgrade, or when you tweak the C grammar. Oh what to do? What to do? If the parser defers lexing IDENTIFIER until after it does the production td, the identifier may incorrectly be tagged as a type-name. It does not help to move all the "td" productions to the other side of IDENTIFIER. That breaks when yacc generates an "eager" lookahead. I scratched my head about this for quite a while. Finally it occured to me that there is no reason whatsoever why yaccpar should use this "lazy" lookahead scheme. So I hacked yaccpar. (Now you see why I'm not ecstaticly happy about the solution.) I changed it so that it always fetches the lookahead token before changing state, even if the token does not need to be inspected. The modified algorithm is actually faster than the old one, albeit not significantly so. [From megatest!djones@decwrl.dec.com (Dave Jones)] -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { decvax | harvard | yale | bbn}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request