[comp.lang.c] Grammar for ANSI C

mnc@m10ux.UUCP (04/01/87)

Well, I finally got around to trying out the parser for ANSI C that was kindly
provided to USENET by Jeff Lee (gatech!jeff) several months back, and have
discovered that it is not as useful as I had hoped.  The two problems I've
found so far are:

	(1) Programs that use typedef'd names produce syntax errors.  This
	    is undoubtedly because typedefs cannot be handled properly by ANY
	    C parser without the assistance of a symbol table.  This, in
	    turn, is due to an ambiguity in the context-free language, which I
	    pointed out in a previous posting (see excerpt reprinted below).

	(2) While I wouldn't expect a context-free parser to accept only those
	    programs that are legal in the ANSI standard (impossible without a
	    symbol table), this one allows so many constructs that are not
	    legal even at the context-free level as to render it useless for
	    checking syntactic compliance with the standard.  For instance, it
	    accepts  "char typedef int extern typedef float x;" without a
	    complaint.

In fact, a quick look at the grammar shows that it allows an arbitrary
sequence of words from the set {char,int,float,typedef,extern,static,...} as a
"declaration", and does not even require the declarator ("x", in the above
case).  It is hard to understand why the grammar is written this way, since
it would not have been much more complicated to say that a declaration must
have either a storage_class_specifier or a type_specifier or both, followed by
at least one declarator, as shown here (with abbreviated rule names):

	declaration :
		  storage_class declarators ';'
		| type_specifier declarators ';'
		| storage_class type_specifier declarators ';'

(And type_specifier would have to be changed to allow sequences of "long" and
"short", of course.)  Is there some new construct in ANSI C that I am not
aware of which requires this more general syntax?  Why does the grammar
explicitly allow (with a production solely for that purpose) a declaration
without a declarator?  Or in other words, what the hell does "int ;" mean?
Or "extern ;", for that matter?  I realize that some C compilers accept
nonsense like this (e.g. Sys V cc), but should we really be legitimizing this
in a standard?
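
To make the stricter rule concrete, here is a minimal sketch in C of a checker
for it.  Everything in it is my own assumption for illustration, not part of
Jeff Lee's grammar or of the draft standard: the names in_set, is_identifier
and declaration_ok are hypothetical, the word lists are abbreviated, a
type_specifier is taken to be any run of "long", "short" and "unsigned"
followed by at most one base type, and a declarator is simplified to a plain
identifier.  Input is assumed to be already split into tokens.

```c
#include <ctype.h>
#include <string.h>

/* Abbreviated, NULL-terminated word lists (an assumption, not the full set). */
static const char *const storage_words[] =
    { "typedef", "extern", "static", "auto", "register", NULL };
static const char *const modifier_words[] = { "long", "short", "unsigned", NULL };
static const char *const base_words[] = { "char", "int", "float", "double", NULL };

/* Return 1 if word w appears in the NULL-terminated list set. */
static int in_set(const char *w, const char *const *set)
{
    for (; *set != NULL; set++)
        if (strcmp(w, *set) == 0)
            return 1;
    return 0;
}

/* A declarator is simplified here to a plain, non-reserved identifier. */
static int is_identifier(const char *w)
{
    if (in_set(w, storage_words) || in_set(w, modifier_words) ||
        in_set(w, base_words))
        return 0;
    if (!(isalpha((unsigned char)w[0]) || w[0] == '_'))
        return 0;
    for (w++; *w != '\0'; w++)
        if (!(isalnum((unsigned char)*w) || *w == '_'))
            return 0;
    return 1;
}

/*
 * Accept:  [storage_class] [type_specifier] declarator {',' declarator} ';'
 * where at least one of storage_class and type_specifier must be present.
 * tok is an array of n token strings; returns 1 if accepted, 0 otherwise.
 */
int declaration_ok(const char *const *tok, int n)
{
    int i = 0;
    int specifiers = 0;

    if (i < n && in_set(tok[i], storage_words)) {       /* at most one */
        i++;
        specifiers++;
    }
    while (i < n && in_set(tok[i], modifier_words)) {   /* long, short, ... */
        i++;
        specifiers++;
    }
    if (i < n && in_set(tok[i], base_words)) {          /* at most one base */
        i++;
        specifiers++;
    }
    if (specifiers == 0)
        return 0;               /* rejects "x;" */

    if (i >= n || !is_identifier(tok[i]))
        return 0;               /* rejects "int ;" and "extern ;" */
    i++;
    while (i + 1 < n && strcmp(tok[i], ",") == 0 && is_identifier(tok[i + 1]))
        i += 2;

    return i == n - 1 && strcmp(tok[i], ";") == 0;
}
```

With these simplifications the checker rejects "int ;", "extern ;" and
"char typedef int extern typedef float x;", while still accepting something
like "static unsigned long int x, y;".  A real grammar would of course need
full declarators (pointers, arrays, functions) and typedef names as well.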

And what kind of funny business is going on with the definition of "pointer"
in the grammar?  It allows declarations like: "int *char*char float** x;".
Again, this one looks like it was put in on purpose, so I imagine it must
have some sensible meaning, but I couldn't guess one if you held a gun to my
head.

On a positive note, I am pleased to see that the grammar no longer allows
declarations that contain neither storage_class nor type_specifier (e.g. "x;").
This was a source of ambiguity in the context: "main(){x;...}" where it
could either be declaring a local int x, or evaluating a global variable x.
More seriously, the fragment "main(){f();...}", which formerly could mean
"call f with no args" or "declare f to be a function returning int", is no
longer ambiguous.

It is my understanding that this grammar is part of the draft ANSI standard.
I hope that it will be improved before the standard is finalized.  I am
willing to do the work if no one else already has, but I would need a copy
of the full standard document to ensure that I know what the syntax is
supposed to be, since it is not possible to determine this from the grammar.

I don't mean to be looking a gift horse in the mouth, and this parser is
certainly better than no grammar at all, but I would like to hear from anyone
who has a more discriminating parser for ANSI C (failure to handle typedefs
is no concern, as this is easily fixed).

---------------- excerpt from previous article on C ambiguity: ---------------
The context-free ambiguity in C is:

		{t (x);  ...  }

It means "call function t with argument x", if t is not a declared identifier
or is declared to be a function.  If t is a typedef'd name, it means "create a
new variable x of type t".  Other examples are: "t (*x);" and "t (x());".
Note that this would be a semantic ambiguity as well, were typedef names in
a separate name space from function names.  Because they share one name
space, t has only one valid meaning at any given point, which eliminates any
technical ambiguity.  It is still poor language design, since the two vastly
different meanings cannot be distinguished by the reader (human or
electronic) without referring back to the definition of t, which may be in
an include file or may not even exist.  Furthermore, it forces any parser for
C to contain a symbol table.
------------------------------ end of excerpt -------------------------------- 
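
The ambiguity in the excerpt can be seen in a short C fragment.  This is only
a sketch; the function names as_call and as_declaration and the counter calls
are made up for the illustration.  The statement "t (x);" appears twice with
identical tokens, once where t is a function and once where an inner typedef
hides that function.

```c
#include <stddef.h>

static int calls = 0;

/* At file scope, t names a function. */
static void t(int x)
{
    calls += x;
}

/* Here "t (x);" is an expression statement: call t with argument x. */
int as_call(void)
{
    int x = 5;

    t (x);
    return calls;
}

/* Same tokens, but an inner typedef now hides the function t. */
size_t as_declaration(void)
{
    typedef double t;

    t (x);          /* declares x, a new variable of type t (double) */
    x = 2.5;
    return sizeof x;
}
```

Nothing in the text of "t (x);" itself tells the two cases apart; the parser
has to know, at that point in the block, whether t is currently a typedef
name.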

Michael Condict		{ihnp4|vax135|cuae2}!m10ux!mnc
AT&T Bell Labs		(201)582-5911    MH 3B-416
Murray Hill, NJ