mnc@m10ux.UUCP (04/01/87)
Well I finally got around to trying out the parser for ANSI C that was kindly provided to USENET by Jeff Lee (gatech!jeff) several months back, and have discovered that it is not as useful as I had hoped. The two problems I've found so far are:

(1) Programs that use typedef'd names produce syntax errors. This is undoubtedly because typedefs cannot be handled properly by ANY C parser without the assistance of a symbol table. This, in turn, is due to an ambiguity in the context-free language, which I pointed out in a previous posting (see excerpt reprinted below).

(2) While I wouldn't expect a context-free parser to accept only those programs that are legal in the ANSI standard (impossible without a symbol table), this one allows so many constructs that are not legal even at the context-free level as to render it useless for checking syntactic compliance with the standard. For instance it accepts "char typedef int extern typedef float x;" without a complaint. In fact, a quick look at the grammar shows that it allows an arbitrary sequence of words from the set {char,int,float,typedef,extern,static,...} as a "declaration", and does not even require the declarator ("x", in the above case).

It is hard to understand why the grammar is written this way, since it would not have been much more complicated to say that a declaration must have either a storage_class_specifier or a type_specifier or both, followed by at least one declarator, as shown here (with abbreviated rule names):

    declaration
        : storage_class declarators ';'
        | type_specifier declarators ';'
        | storage_class type_specifier declarators ';'

(And type_specifier would have to be changed to allow sequences of "long" and "short", of course).

Is there some new construct in ANSI C that I am not aware of which requires this more general syntax? Why does the grammar explicitly allow (with a production solely for that purpose) a declaration without a declarator? Or in other words, what the hell does "int ;" mean?
Or "extern ;", for that matter? I realize that some C compilers accept nonsense like this (e.g. Sys V cc), but should we really be legitimizing this in a standard?

And what kind of funny business is going on with the definition of "pointer" in the grammar? It allows declarations like: "int *char*char float** x;". Again, this one looks like it was put in on purpose, so I imagine it must have some sensible meaning, but I couldn't guess one if you held a gun to my head.

On a positive note, I am pleased to see that the grammar no longer allows declarations that contain neither storage_class nor type_specifier (e.g. "x;"). This was a source of ambiguity in the context: "main(){x;...}" where it could either be declaring a local int x, or evaluating a global variable x. More seriously, the fragment "main(){f();...}" which formerly could mean "call f with no args", or "declare f to be a function returning int" is no longer ambiguous.

It is my understanding that this grammar is part of the draft ANSI standard. I hope that it will be improved before the standard is finalized. I am willing to do the work if no one else already has, but I would need a copy of the full standard document to ensure that I know what the syntax is supposed to be, since it is not possible to determine this from the grammar.

I don't mean to be looking a gift horse in the mouth, and this parser is certainly better than no grammar at all, but I would like to hear from anyone who has a more discriminating parser for ANSI C (failure to handle typedefs is no concern, as this is easily fixed).

---------------- excerpt from previous article on C ambiguity: ---------------

The context-free ambiguity in C is:

    {t (x); ... }

It means "call function t with argument x", if t is not a declared identifier or is declared to be a function. If t is a typedef'd name, it means "create a new variable x of type t". Other examples are: "t (*x);" and "t (x());".
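
To make the two readings concrete, here is a small sketch that was not part of the original article. The names calls, parse_as_call and parse_as_declaration are invented purely for illustration, and it assumes a compiler that lets a block-scope typedef hide a file-scope function of the same name. The token sequence "t (x);" appears in both functions, once as a call and once as a declaration:

```c
#include <assert.h>

static int calls = 0;           /* running total of arguments passed to t() */

/* At file scope, t is an ordinary function. */
void t(int x) { calls += x; }

/* t is a function here, so "t (x);" is a call. */
int parse_as_call(void)
{
    int x = 2;
    int before = calls;
    t (x);                      /* call: adds 2 to calls */
    return calls - before;
}

/* A block-scope typedef hides the function t, so the very same
   tokens now declare a new variable x of type t (that is, int). */
int parse_as_declaration(void)
{
    typedef int t;              /* hypothetical typedef, for illustration only */
    int before = calls;
    t (x);                      /* declaration: int x; no call happens */
    x = 5;
    assert(calls == before);    /* t() was never called in this function */
    return x;
}
```

Under those assumptions, parse_as_call() returns 2 and parse_as_declaration() returns 5. A parser can only tell the two statements apart by knowing what t names at that point in the program.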

Note that this would be a semantic ambiguity as well, were typedef names in a separate name space from function names. Although the single valid meaning eliminates any technical ambiguity, it is poor language design, since the two vastly different meanings cannot be distinguished by the reader (human or electronic) without referring back to the definition of t, which may be in an include file or may not even exist. Furthermore, it forces any parser for C to contain a symbol table.

------------------------------ end of excerpt --------------------------------

Michael Condict         {ihnp4|vax135|cuae2}!m10ux!mnc
AT&T Bell Labs          (201)582-5911    MH 3B-416
Murray Hill, NJ