jhh@calmasd.GE.COM (Jung Hamel) (01/10/89)
Does anybody have or know of YACC grammar for C that does not require the lexical analyser to differentiate typedef names from other identifier names. We have tried to "fix" a grammar with this but always get an illegal grammar for YACC. Our lexical analyser does not have access to the full set of typedef names. Thanks, Jung Hammel. ...ucsd!calmasd!jhh or jhh@calmasd.GE.COM
stevev@tekchips.CRL.TEK.COM (Steve Vegdahl) (01/12/89)
In article <175@calmasd.GE.COM>, jhh@calmasd.GE.COM (Jung Hamel) writes: > > Does anybody have or know of YACC grammar for C that does > not require the lexical analyser to differentiate typedef > names from other identifier names. We have tried to "fix" a > grammar with this but always get an illegal grammar for YACC. > Our lexical analyser does not have access to the full set of > typedef names. This is not as simple as is might seem. Consider the following programs: ---------------- begin program 1 ---------------- typedef int foo; int a, int b(); myFunc() { a*b(); } ---------------- end program 1 ---------------- ---------------- begin program 2 ---------------- typedef int a; int foo, int baz(); myFunc() { a*b(); } ---------------- end program 2 ---------------- Assuming that typedef names are treated as identifiers, these programs are syntacticaly identical. (The definitions of "myFunc" are character-for-character identical.) Yet, we really want to parse the string "a*b();" differently in the two cases. In program 1 it is a multiplication of "a" by the result of the function call to "b", giving a parse something like: STATEMENT EXPR EXPR ID: a OP STAR: * EXPR ID: b SEMI: ; In program 2, we are declaring "b" to be a function returning a pointer to something of type "a", giving a parse something like DECLLIST ID: a DECLNAMES DECLNAME STAR: * ID: b SEMI: ; If we don't distinguish typedef names, say, in the scanner, we have no idea which way to parse the string "a*b();", so we have to do something like picking one of them, and possibly transforming the AST into other one once we know whether "a" is a typedef---typically during semantic analysis. (One can probably construct examples that are even more complicated than this one.) Thus, while it is possible to write a grammar C that treats identifiers as typedefs, this can only be done by pushing more work onto the semantic analysis phase. (Of course in the degenerate we could write a parser that accepted *any* token sequence, forcing the semantic analysis phase to do all the parsing!) Steve Vegdahl Computer Research Lab Tektronix Labs Beaverton, Oregon
meissner@xyzzy.UUCP (Michael Meissner) (01/12/89)
In article <175@calmasd.GE.COM> jhh@calmasd.GE.COM (Jung Hamel) writes: | | Does anybody have or know of YACC grammar for C that does | not require the lexical analyser to differentiate typedef | names from other identifier names. We have tried to "fix" a | grammar with this but always get an illegal grammar for YACC. | Our lexical analyser does not have access to the full set of | typedef names. In short, it can't be done without a context sensitive parser (ie, without things like the lexical analyser knowing about the symbol table). Among other things, the following fragment: (identifer) * expression can be parsed either as a cast of dereferncing a pointer expression (if identifier is a typedef name) or multiplication (if identifier is a variable). This "feature" has probably been cursed by all C implementators ever afterward (I know I sure did). IMHO it is one of four things I would change if I could have my way and have all the extant source code change overnight. In case you are wondering, the other three are: 1) Change operator priorities, so that: x & mask == result, would be evaluated as (x & mask) == value; 2) Remove macro procedures and add inline capability as a mandated feature (including deletion of the inline function if there are no callers); 3) Fix the overloading of "extern" and "static", and mandate one global memory model (ie, the fact that UNIX supports a relaxed REF/DEF scheme, wheras K&R and ANSI mandate the more strict single DEF, multiple references model). Sigh, these won't get changed because the time for such changes has long since past. -- Michael Meissner, Data General. Uucp: ...!mcnc!rti!xyzzy!meissner Arpa: meissner@dg-rtp.DG.COM (or) meissner%dg-rtp.DG.COM@relay.cs.net
djones@decwrl.dec.com (Dave Jones) (01/13/89)
> In article <175@calmasd.GE.COM>, jhh@calmasd.GE.COM (Jung Hamel) writes: > > Does anybody have or know of YACC grammar for C that does > not require the lexical analyser to differentiate typedef > names from other identifier names? Nope, nobody does. There ain't no such thing. If type-names are treated the same as identifiers, C is inherently ambiguous. > We have tried to "fix" a > grammar with this but always get an illegal grammar for YACC. > Our lexical analyser does not have access to the full set of > typedef names. The lexical analyzer is going to have to know something about type-names. It is very tricky to do it well. I have a suspicion that the C grammar that has been circulating on the net for the last year or two, with the comment that all it needs is a lexical analyzer, is, in reality, a rather arcane practical joke. But then again maybe not. I'll show you how I did it. I'm not too happy with the solution. If you can think of a better way, please show me. I peeked into pcc to see how they did it. Believe me, you don't even want to know. (Gack.) Anyhow, what I did was this: The lexical scanner has access to the hash-table of type-names. The trick is to tell the scanner when it needs to look an identifier up in that table and when it does not. So I added two empty productions, "td" and "ntd", to toggle typedefs on and off: td : /* recognize typedef names as such. */ { $$=0; typedefs_ok=1; } ; ntd : /* recognize typedef names only as identifiers. */ { $$=0; typedefs_ok = 0; } ; These productions are used lots of places in the grammar, like so: struct_specifier : ntd STRUCT td IDENTIFIER '{' struct_declaration_list '}' { $$=tree(N_StructIDDef); } In this example, the parser first looks ahead at the keyword "struct", and seeing it, decides to do the production ntd. Doing so turns typename recognition off. Then the parser looks ahead to see an identifier, which it cannot recognize as a typedef name. Seeing the identifier, it decides to do the production td, which turns typename recognition back on. Or so it would seem... The catch is that yaccpar, the standard yacc parser-template, employs what might be called "lazy LALR(1)" lookahead. In some situations, where the parser is destined to do a certain production willy-nilly, it does not do a lookahead at all. (Recall that yacc uses "default" productions in some error states.) It is difficult to predict where yacc will be lazy and where it will not. You wouldn't want to depend on that anyway. It could change on a yacc upgrade, or when you tweak the C grammar. Oh what to do? What to do? If the parser defers lexing IDENTIFIER until after it does the production td, the identifier may incorrectly be tagged as a type-name. It does not help to move all the "td" productions to the other side of IDENTIFIER. That breaks when yacc generates an "eager" lookahead. I scratched my head about this for quite a while. Finally it occured to me that there is no reason whatsoever why yaccpar should use this "lazy" lookahead scheme. So I hacked yaccpar. (Now you see why I'm not ecstaticly happy about the solution.) I changed it so that it always fetches the lookahead token before changing state, even if the token does not need to be inspected. The modified algorithm is actually faster than the old one, albeit not significantly so. [From megatest!djones@decwrl.dec.com (Dave Jones)] -- Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU Plausible paths are { decvax | harvard | yale | bbn}!ima Please send responses to the originator of the message -- I cannot forward mail accidentally sent back to compilers. Meta-mail to ima!compilers-request
djones@megatest.UUCP (Dave Jones) (01/13/89)
In article <175@calmasd.GE.COM>, jhh@calmasd.GE.COM (Jung Hamel) writes: > > Does anybody have or know of YACC grammar for C that does > not require the lexical analyser to differentiate typedef > names from other identifier names? Nope, nobody does. There ain't know such thing. If type-names are treated the same as identifiers, C is inherently ambiguous. > We have tried to "fix" a > grammar with this but always get an illegal grammar for YACC. > Our lexical analyser does not have access to the full set of > typedef names. > The lexical analyzer is going to have to know something about type-names. It is very tricky to do it well. I have a suspicion that the C grammar that has been circulating on the net for the last year or two, with the comment that all it needs is a lexical analyzer, is, in reality, a rather arcane practical joke. But then again maybe not. I'll show you how I did it. I'm not too happy with the solution. If you can think of a better way, please show me. I peeked into pcc to see how they did it. Believe me, you don't even want to know. (Gack.) Anyhow, what I did was this: The lexical scanner has access to the hash-table of type-names. The trick is to tell the scanner when it needs to look an identifier up in that table and when it does not. So I added two empty productions, "td" and "ntd", to toggle typedefs on and off: td : /* recognize typedef names as such. */ { $$=0; typedefs_ok=1; } ; ntd : /* recognize typedef names only as identifiers. */ { $$=0; typedefs_ok = 0; } ; These productions are used lots of places in the grammar, like so: struct_specifier : ntd STRUCT td IDENTIFIER '{' struct_declaration_list '}' { $$=tree(N_StructIDDef); } In this example, the parser first looks ahead at the keyword "struct", and seeing it, decides to do the production ntd. Doing so turns typename recognition off. Then the parser looks ahead to see an identifier, which it cannot recognize as a typedef name. Seeing the identifier, it decides to do the production td, which turns typename recognition back on. Or so it would seem... The catch is that yaccpar, the standard yacc parser-template, employs what might be called "lazy LALR(1)" lookahead. In some situations, where the parser is destined to do a certain production willy-nilly, it does not do a lookahead at all. (Recall that yacc uses "default" productions in some error states.) It is difficult to predict where yacc will be lazy and where it will not. You wouldn't want to depend on that anyway. It could change on a yacc upgrade, or when you tweek the C grammar. Oh what to do? What to do? If the parser defers lexing IDENTIFIER until after it does the production td, the identifier may incorrectly be tagged as a type-name. It does not help to move all the "td" productions to the other side of IDENTIFIER. That breaks when yacc generates an "eager" lookahead. I scratched my head about this for quite a while. Finally it ocurred to me that there is no reason whatsoever why yaccpar should use this "lazy" lookahead scheme. So I hacked yaccpar. (Now you see why I'm not ecstaticly happy about the solution.) I changed it so that it always fetches the lookahead token before changing state, even if the token does not need to be inspected. The modified algorithm is actually faster than the old one, albeit not significantly so.
piet@ruuinf (Piet van Oostrum) (01/14/89)
In article <1183@goofy.megatest.UUCP>, djones@megatest (Dave Jones) writes:
`
`I'll show you how I did it. I'm not too happy with the solution.
`If you can think of a better way, please show me. I peeked into pcc
`to see how they did it. Believe me, you don't even want to know. (Gack.)
`
`.......
`
` struct_specifier
` : ntd STRUCT td IDENTIFIER '{' struct_declaration_list '}'
` { $$=tree(N_StructIDDef); }
`
The best way to do this is:
1. Have different tokens for IDENTIFIER and TYPEDEF names.
2. in every context where a typedef name may be used as a normal identifier
(e.g. in field names, declarations, parameters, enum lists) use the
nonterminal identifier.
3. add the production rule:
identifier: IDENTIFIER | TYPEDEF { $$=detypedef($1); };
You use IDENTIFIER or TYPEDEF in those rules where identifier would be
ambiguous.
This is the way I find most logical, and not surprisingly this is the way
gcc does it. (If you want to see a good compiler look at gcc :=}.)
--
Piet van Oostrum, Dept of Computer Science, University of Utrecht
Padualaan 14, P.O. Box 80.089, 3508 TB Utrecht, The Netherlands
Telephone: +31-30-531806. piet@cs.ruu.nl (mcvax!hp4nl!ruuinf!piet)
gwyn@smoke.BRL.MIL (Doug Gwyn ) (01/14/89)
In article <1183@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes: -In article <175@calmasd.GE.COM>, jhh@calmasd.GE.COM (Jung Hamel) writes: -> Does anybody have or know of YACC grammar for C that does -> not require the lexical analyser to differentiate typedef -> names from other identifier names? -Nope, nobody does. There ain't know such thing. If type-names -are treated the same as identifiers, C is inherently ambiguous. More precisely, type names must be taken into account when parsing C. There are cases where the wrong parse will result if this isn't done. -I have a suspicion that the C grammar that has been circulating on the -net for the last year or two, with the comment that all it needs is a -lexical analyzer, is, in reality, a rather arcane practical joke. No, I think the people who keep requesting a copy of it really do think that all they need to do is feed it to YACC and that will take care of the syntactic phase of a C compiler. Of course, they're mistaken..