[comp.lang.c] YACC grammar for C language

jhh@calmasd.GE.COM (Jung Hamel) (01/10/89)

   Does anybody have or know of YACC grammar for C that does
not require the lexical analyser to differentiate typedef
names from other identifier names. We have tried to "fix" a
grammar with this but always get an illegal grammar for YACC.
Our lexical analyser does not have access to the full set of
typedef names.

Thanks, Jung  Hammel.   ...ucsd!calmasd!jhh

                        or

                         jhh@calmasd.GE.COM

stevev@tekchips.CRL.TEK.COM (Steve Vegdahl) (01/12/89)

In article <175@calmasd.GE.COM>, jhh@calmasd.GE.COM (Jung Hamel) writes:
> 
>    Does anybody have or know of YACC grammar for C that does
> not require the lexical analyser to differentiate typedef
> names from other identifier names. We have tried to "fix" a
> grammar with this but always get an illegal grammar for YACC.
> Our lexical analyser does not have access to the full set of
> typedef names.

This is not as simple as is might seem.  Consider the following programs:

---------------- begin program 1 ----------------
typedef int foo;
int a, int b();

myFunc()
{
a*b();
}
---------------- end program 1 ----------------

---------------- begin program 2 ----------------
typedef int a;
int foo, int baz();

myFunc()
{
a*b();
}
---------------- end program 2 ----------------

Assuming that typedef names are treated as identifiers, these programs
are syntacticaly identical.  (The definitions of "myFunc" are
character-for-character identical.)  Yet, we really want to parse the
string "a*b();" differently in the two cases.  In program 1 it is a
multiplication of "a" by the result of the function call to "b", giving a
parse something like:
  STATEMENT
   EXPR
    EXPR
     ID: a
    OP
     STAR: *
    EXPR
     ID: b
   SEMI: ;
In program 2, we are declaring "b" to be a function returning a pointer to
something of type "a", giving a parse something like
  DECLLIST
   ID: a
   DECLNAMES
    DECLNAME
    STAR: *
    ID: b
   SEMI: ;

If we don't distinguish typedef names, say, in the scanner, we have no idea
which way to parse the string "a*b();", so we have to do something like
picking one of them, and possibly transforming the AST into other one once
we know whether "a" is a typedef---typically during semantic analysis.  (One
can probably construct examples that are even more complicated than this one.)

Thus, while it is possible to write a grammar C that treats identifiers as
typedefs, this can only be done by pushing more work onto the semantic
analysis phase.  (Of course in the degenerate we could write a parser that
accepted *any* token sequence, forcing the semantic analysis phase to do all
the parsing!)

		Steve Vegdahl
		Computer Research Lab
		Tektronix Labs
		Beaverton, Oregon

meissner@xyzzy.UUCP (Michael Meissner) (01/12/89)

In article <175@calmasd.GE.COM> jhh@calmasd.GE.COM (Jung Hamel) writes:
| 
|    Does anybody have or know of YACC grammar for C that does
| not require the lexical analyser to differentiate typedef
| names from other identifier names. We have tried to "fix" a
| grammar with this but always get an illegal grammar for YACC.
| Our lexical analyser does not have access to the full set of
| typedef names.

In short, it can't be done without a context sensitive parser (ie,
without things like the lexical analyser knowing about the symbol
table).  Among other things, the following fragment:

	(identifer) * expression

can be parsed either as a cast of dereferncing a pointer expression
(if identifier is a typedef name) or multiplication (if identifier is
a variable).  This "feature" has probably been cursed by all C
implementators ever afterward (I know I sure did).

IMHO it is one of four things I would change if I could have my way
and have all the extant source code change overnight.  In case you are
wondering, the other three are:

    1)	Change operator priorities, so that:  x & mask == result, would
	be evaluated as (x & mask) == value;

    2)	Remove macro procedures and add inline capability as a mandated
	feature (including deletion of the inline function if there are
	no callers);

    3)	Fix the overloading of "extern" and "static", and mandate one global
	memory model (ie, the fact that UNIX supports a relaxed REF/DEF
	scheme, wheras K&R and ANSI mandate the more strict single DEF,
	multiple references model).

Sigh, these won't get changed because the time for such changes has
long since past.

-- 
Michael Meissner, Data General.

Uucp:	...!mcnc!rti!xyzzy!meissner
Arpa:	meissner@dg-rtp.DG.COM   (or) meissner%dg-rtp.DG.COM@relay.cs.net

djones@decwrl.dec.com (Dave Jones) (01/13/89)

> In article <175@calmasd.GE.COM>, jhh@calmasd.GE.COM (Jung Hamel) writes:
> 
>    Does anybody have or know of YACC grammar for C that does
> not require the lexical analyser to differentiate typedef
> names from other identifier names?

Nope, nobody does. There ain't no such thing. If type-names
are treated the same as identifiers, C is inherently ambiguous.

> We have tried to "fix" a
> grammar with this but always get an illegal grammar for YACC.
> Our lexical analyser does not have access to the full set of
> typedef names.

The lexical analyzer is going to have to know something about type-names.
It is very tricky to do it well. I have a suspicion that the
C grammar that has been circulating on the net for the last year or
two, with the comment that all it needs is a lexical analyzer, is, in
reality, a rather arcane practical joke.  But then again maybe not.

I'll show you how I did it.  I'm not too happy with the solution.
If you can think of a better way, please show me. I peeked into pcc
to see how they did it. Believe me, you don't even want to know. (Gack.)

Anyhow, what I did was this:

The lexical scanner has access to the hash-table of type-names.
The trick is to tell the scanner when it needs to look an identifier
up in that table and when it does not. So I added two empty productions,
"td" and "ntd", to toggle typedefs on and off:

   td	: /* recognize typedef names as such. */
          { $$=0; typedefs_ok=1; }
        ;

   ntd	: /* recognize typedef names only as identifiers. */
	  { $$=0; typedefs_ok = 0; }
	;

These productions are used lots of places in the grammar, like so:

  struct_specifier
	: ntd STRUCT td IDENTIFIER '{' struct_declaration_list '}'
					{ $$=tree(N_StructIDDef); }

In this example, the parser first looks ahead at the keyword "struct",
and seeing it, decides to do the production ntd. Doing so turns
typename recognition off. Then the parser looks ahead to see an
identifier, which it cannot recognize as a typedef name.  Seeing the
identifier, it decides to do the production td, which turns typename
recognition back on.

Or so it would seem...

The catch is that yaccpar, the standard yacc parser-template, employs
what might be called "lazy LALR(1)" lookahead.  In some situations,
where the parser is destined to do a certain production willy-nilly,
it does not do a lookahead at all. (Recall that yacc uses "default"
productions in some error states.) It is difficult to predict where
yacc will be lazy and where it will not. You wouldn't want to depend
on that anyway. It could change on a yacc upgrade, or when you tweak
the C grammar.

Oh what to do? What to do? If the parser defers lexing IDENTIFIER
until after it does the production td, the identifier may incorrectly
be tagged as a type-name.

It does not help to move all the "td" productions to the other side
of IDENTIFIER. That breaks when yacc generates an "eager" lookahead.

I scratched my head about this for quite a while.  Finally it occured
to me that there is no reason whatsoever why yaccpar should use this
"lazy" lookahead scheme.  So I hacked yaccpar. (Now you see why I'm
not ecstaticly happy about the solution.) I changed it so that it always
fetches the lookahead token before changing state, even if the token does
not need to be inspected.  The modified algorithm is actually faster
than the old one, albeit not significantly so.
[From megatest!djones@decwrl.dec.com (Dave Jones)]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { decvax | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

djones@megatest.UUCP (Dave Jones) (01/13/89)

In article <175@calmasd.GE.COM>, jhh@calmasd.GE.COM (Jung Hamel) writes:
> 
>    Does anybody have or know of YACC grammar for C that does
> not require the lexical analyser to differentiate typedef
> names from other identifier names?

Nope, nobody does. There ain't know such thing. If type-names
are treated the same as identifiers, C is inherently ambiguous.

> We have tried to "fix" a
> grammar with this but always get an illegal grammar for YACC.
> Our lexical analyser does not have access to the full set of
> typedef names.
> 

The lexical analyzer is going to have to know something about type-names.
It is very tricky to do it well. I have a suspicion that the
C grammar that has been circulating on the net for the last year or
two, with the comment that all it needs is a lexical analyzer, is, in
reality, a rather arcane practical joke.  But then again maybe not.

I'll show you how I did it.  I'm not too happy with the solution.
If you can think of a better way, please show me. I peeked into pcc
to see how they did it. Believe me, you don't even want to know. (Gack.)

Anyhow, what I did was this:

The lexical scanner has access to the hash-table of type-names.
The trick is to tell the scanner when it needs to look an identifier
up in that table and when it does not. So I added two empty productions,
"td" and "ntd", to toggle typedefs on and off:

   td	: /* recognize typedef names as such. */
          { $$=0; typedefs_ok=1; }
        ;

   ntd	: /* recognize typedef names only as identifiers. */
	  { $$=0; typedefs_ok = 0; }
	;

These productions are used lots of places in the grammar, like so:

  struct_specifier
	: ntd STRUCT td IDENTIFIER '{' struct_declaration_list '}'
					{ $$=tree(N_StructIDDef); }

In this example, the parser first looks ahead at the keyword "struct",
and seeing it, decides to do the production ntd. Doing so turns
typename recognition off. Then the parser looks ahead to see an
identifier, which it cannot recognize as a typedef name.  Seeing the
identifier, it decides to do the production td, which turns typename
recognition back on.

Or so it would seem...

The catch is that yaccpar, the standard yacc parser-template, employs
what might be called "lazy LALR(1)" lookahead.  In some situations,
where the parser is destined to do a certain production willy-nilly,
it does not do a lookahead at all. (Recall that yacc uses "default"
productions in some error states.) It is difficult to predict where
yacc will be lazy and where it will not. You wouldn't want to depend
on that anyway. It could change on a yacc upgrade, or when you tweek
the C grammar.

Oh what to do? What to do? If the parser defers lexing IDENTIFIER
until after it does the production td, the identifier may incorrectly
be tagged as a type-name.

It does not help to move all the "td" productions to the other side
of IDENTIFIER. That breaks when yacc generates an "eager" lookahead.

I scratched my head about this for quite a while.  Finally it ocurred
to me that there is no reason whatsoever why yaccpar should use this
"lazy" lookahead scheme.  So I hacked yaccpar. (Now you see why I'm
not ecstaticly happy about the solution.) I changed it so that it always
fetches the lookahead token before changing state, even if the token does
not need to be inspected.  The modified algorithm is actually faster
than the old one, albeit not significantly so.

piet@ruuinf (Piet van Oostrum) (01/14/89)

In article <1183@goofy.megatest.UUCP>, djones@megatest (Dave Jones) writes:
`
`I'll show you how I did it.  I'm not too happy with the solution.
`If you can think of a better way, please show me. I peeked into pcc
`to see how they did it. Believe me, you don't even want to know. (Gack.)
`
`.......
`
`  struct_specifier
`	: ntd STRUCT td IDENTIFIER '{' struct_declaration_list '}'
`					{ $$=tree(N_StructIDDef); }
`
The best way to do this is:
1. Have different tokens for IDENTIFIER and TYPEDEF names.
2. in every context where a typedef name may be used as a normal identifier
(e.g. in field names, declarations, parameters, enum lists) use the
nonterminal identifier.
3. add the production rule:
		identifier: IDENTIFIER | TYPEDEF { $$=detypedef($1); };
You use IDENTIFIER or TYPEDEF in those rules where identifier would be
ambiguous.

This is the way I find most logical, and not surprisingly this is the way
gcc does it. (If you want to see a good compiler look at gcc :=}.)
-- 
Piet van Oostrum, Dept of Computer Science, University of Utrecht
Padualaan 14, P.O. Box 80.089, 3508 TB Utrecht, The Netherlands
Telephone: +31-30-531806. piet@cs.ruu.nl (mcvax!hp4nl!ruuinf!piet)

gwyn@smoke.BRL.MIL (Doug Gwyn ) (01/14/89)

In article <1183@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
-In article <175@calmasd.GE.COM>, jhh@calmasd.GE.COM (Jung Hamel) writes:
->    Does anybody have or know of YACC grammar for C that does
-> not require the lexical analyser to differentiate typedef
-> names from other identifier names?
-Nope, nobody does. There ain't know such thing. If type-names
-are treated the same as identifiers, C is inherently ambiguous.

More precisely, type names must be taken into account when parsing C.
There are cases where the wrong parse will result if this isn't done.

-I have a suspicion that the C grammar that has been circulating on the
-net for the last year or two, with the comment that all it needs is a
-lexical analyzer, is, in reality, a rather arcane practical joke.

No, I think the people who keep requesting a copy of it really do think
that all they need to do is feed it to YACC and that will take care of
the syntactic phase of a C compiler.  Of course, they're mistaken..