[comp.lang.c++] yaccable grammars for C and C++: 1 of 4

jar@io.UUCP (Jim Roskind x5570) (03/08/90)

FILENAME: FREEGRAM.TXT
AUTHOR: Jim Roskind
	Independent Consultant
	516 Latania Palm Drive
	Indialantic FL 32903
	(407)729-4348
	jar@ileaf


							3/7/90


I have created a set of grammars for C and C++ in order to use them 
within my own personal project development.  I have made my work in 
this area available to other developers at no charge with the hope 
that they would use my work.  I believe the entire C++ community can 
benefit from such standardization.  If any of the copyright notices 
on the grammars (which are VERY liberal) prevent using my work, 
please notify me of the problem.

Note that the grammars can each be processed by YACC, but they are 
very clean, and make NO USE of the precedence setting (i.e.: %prec) 
or associativity setting (i.e.:%assoc) constructs of YACC.  This 
feature should make them easily portable to other parser generator 
input format.  This "cleanliness" fact also provides brutal exposure 
of all the complex constructs in C++, and the complexity of the 
grammar as a whole (the C++ grammar is 2 to 3 times as large as the C 
grammar) reflects this exposure.

The files included in this set are:

    FREEGRAM.TXT    This introductory file
    CPLUSPLS.TXT    Parsing ambiguities in C++, and in my grammar
    CPLUSPLS.Y      My YACC compatible C++ grammar
    C.Y             My YACC compatible, ANSI C conformant grammar

I have posted all 4 files to comp.lang.c++, to make this information 
as available as possible to users and developers.  I will also post 
this introductory note to comp.compilers, and comp.lang.c.  I am
arranging for archival support via several ftp sites, and updates 
will only be posted to those sites.  

In light of the recent debate on comp.lang.c++ about the 
"impossibility" of generating a YACC grammar for C++ (that doesn't 
also accept every possible string of tokens), providing these 
grammars at this point is especially nice.  Rather than addressing 
each of the ad hoc arguments that such an endeavor is impossible (the 
least convincing of which said effectively: "... heard it was 
impossible, ... something about LALR ...", and the most correct of 
which said: "...it is not possible to create a simple YACC grammar 
for C++.."), I am offering my grammars as my rather complete 
commentary on the topic. (I'll probably get some flaming due to this 
comment, but I had to read all the stuff that was posted, that told 
me what I had done was impossible. ;-) )  My general feeling is that 
it is a shame for opinions to be confused with facts, and for 
impossibility to be confused with difficulty or complexity.

Developing the C grammar (that is intended to be compatible with the 
ANSI C standard) was relatively straight forward (compared to the C++ 
grammar).  The one difficulty in this process was the desire to avoid 
use of %prec and %assoc constructs in YACC, which would tend to 
obscure ambiguities.  Since I didn't know what ambiguities were lying 
in wait in C++, obscuring ambiguities was unacceptable.  It took 
several weeks to remove the conflicts that typically appear, and the 
tedious process exposed several ambiguities that are not currently 
disambiguated by the ANSI standard.  The quality of the C grammar is 
(IMHO) dramatically higher than what has been made available within 
the public domain.  Specifically, a C grammar's support of 
redefinition of typedef names within inner scopes (the most difficult 
area of the grammar) is typically excluded from public domain 
grammar, and even excluded from most grammars that are supplied 
commercially with parser generators!  I expect that this grammar will 
be very useful in the development of C related tools.

The development of the C++ grammar (initially compatible with version 
1.2, but enhanced to support version 2.0 specifications as they were 
made available) was anything but straight forward.  The requirement 
that I set to NOT USE %prec and %assoc proved both a blessing and a 
curse.  The blessing was that I could see what the problems were in 
the language, the curse was that there were A LOT of conflicts (I can 
recall times during the development effort when the number of 
conflicts was well in excess of 200).  Initially I was unaware that 
many other attempts had been made and failed, and I went ahead 
"blindly" trying to resolve the conflicts in my grammar.  After 
raising the issues that I noticed with Bjarne Stroustrup, I became 
aware that there were some very significant syntactical ambiguities 
within the current C++ language.  Fortunately, by the time I first 
spoke to Dr. Stroustrup, I had already derived some results that 
other attempts had not uncovered.  Encouraged by my results, I 
continued on despite hearing ever louder claims that my goal was 
"impossible".

Towards the end of the development of the C++ grammar, which took 
roughly 3 months of my time, I began to see the folly in part of my 
quest.  I came to the conclusion that further attempts to modify my 
grammar, so as to defer resolutions of ambiguities, would lead to an 
unreadable language. Specifically, my feeling was that I was entering 
into a battle of wits with the compiler, and the compiler was 
starting to win.  It was beginning to be the case that the parser 
COULD figure out what I said, but I couldn't.  Indeed, even examples 
in a version of the C++ 2.0 reference manual demonstrated this 
problem (my parser could parse some examples that neither I nor the 
authors parsed correctly!).  At this point I decided to stop my quest 
to FURTHER defer resolutions of ambiguities, and let the grammar 
commit in one direction (always in favor of declarations), at the 
late point that is provided by my grammar.  If this direction proved 
"incorrect in light of the context that followed", then I generated a 
syntax error.  I believe this strategy provides ample room for 
expressiveness. In support of this expressiveness, I have (based on 
my discussions with language experts) deferred disambiguation far 
longer than other attempts at producing an LR(1) grammar.  I would 
strongly argue that any code that my grammar identifies as having a 
"syntax error" (based on "premature" disambiguation), but cfront 
allows, should ABSOLUTELY be rewritten in a less ambiguous (and hence 
more portable) form.  As a final comment, by virtue of the consistent 
methodology used to build my grammar, there are a number of clearly 
unambiguous constructs that my grammar CAN parse, and cfront 2.0 
cannot (not to mention recent versions of GNU's G++ and Zortech's C++ 
compiler).

One major motivation for using the C++ grammar that I have provided 
is that it is capable of supporting old style function definitions 
(e.g.: main(argc, argv) int argc; char*argv[]; {...} ).  To my 
knowledge, no other C++ parser is currently capable of this feat. I 
believe this capability was removed from the C++ specification in 
order to reduce ambiguities in a specific implementation (cfront). As 
my grammar demonstrates, this action was not necessary. Supporting 
old style function definition GREATLY eases the transition to the use 
of C++ from traditional C.  I expect that as some parsers begin to 
support this option, that other parsing engines will be forced in 
this direction by a competitive marketplace.  Using my grammar, and 
the standards it implies, appears to be a very straightforward 
approach to this support.

A second motivation for using my grammar is that it can be processed 
by YACC.  The advantage in this fact lies with YACC's capability to 
identify ambiguities.  For software manufacturers that are heavily 
concerned with correctness, this is an INCREDIBLE advantage.  My 
experience with hand written parsers (which usually represent a 
translation by a fallible human from a grammar to parsing code) is 
that they evolve and become more correct with time.  Ambiguous cases 
are often misparsed, without the author ever realizing there was a 
conflict! In contrast, either a YACC grammar supports a given 
construct, or it doesn't.  In contrast, if a YACC grammar supports a 
construct, the semantic interpretation is usually rather straight 
forward.  The likelihood of internal errors in the parser is 
therefore SIGNIFICANTLY reduced. The fact the the grammars I supplied 
are free of %assoc and %prec operators, implies the grammar are 
fairly portable, and the conflicts are open to peer code review (and 
not obscured).

If you find any errors in my grammars, I would be DELIGHTED to hear 
mention of them!!!!  These should fall into one of the following 
categories:

	1) The grammar left out the following features of C++...
	or
	2) The grammar mis-parses the following sequences...
	or
	3) The discussion of the following ambiguity is
	incorrect...
        4) The grammar could be simplified as follows...

Please send correspondence of this form to jar@ileaf.com. My response 
to 1's will be to add the feature (if possible!); feel sad that I 
made a mistake; and feel glad that YOU found it.  I will have a 
similar response to 2's.  Responses of type 3 are GREAT, but I 
haven't found many folks that really get into YACC ambiguities, so I 
have low expectations... feel free to surprise me!!! :-) :-).  Items 
of type 3 are interesting, but since simplicity is in the eye of the 
beholder, such suggestions are subject to debate.  I would be 
interested in seeing suggestions in this area with the constraint 
that they do not increase the number of conflicts in the grammar!  
Please use YACC to check your suggestions before submitting them. (It 
is often AMAZING how the slightest change in the grammar can change 
the number of conflicts, and it took a great deal of work to reach 
the current state). Distribution site(s) will be set up to distribute 
updates and or corrections.  Postings about the presence of 
corrections will be made on the net.

Since the two grammars (C and C++) were generated in parallel, you 
should be able to compare non-terminals directly.  This will 
hopefully make it easier to identify the complexities involved with 
the C++ grammar, and "ignore" those that result from standard ANSI C. 
In both cases I have left the standard if-if-else ambiguity intact. 
In the case of ANSI C grammar, this is the only shift-reduce conflict 
in the grammar.  Although there are a number of conflicts in the C++ 
grammar, there are actually very few classes of problems. In order to 
disambiguate the C++ grammar enough that YACC can figure out what to 
do, I was commonly forced to "inline expand" non-terminals found in 
the C grammar.  This expansion allowed YACC to defer disambiguation 
until it was possible for an LR(1) parser to understand the context. 
The unfortunate consequence of this inline expansion is a large 
growth in the number of rules, and the presence of an effective 
"multiplier" in most cases where conflicts do occur. As a result, any 
conflicts that arise are multiplied by a factor corresponding to the 
number of rules I had to list separately.  I have grouped the C++ 
grammar conflicts in the "Status" section of the GRAMMAR.TXT paper, 
but you are welcome to explore my grammars using YACC directly (be 
warned that you will need a robust version of YACC to handle the C++ 
sized grammar).  PLEASE do not be put off by the number of conflicts 
in the C++ grammar.  There are VERY FEW CONFLICTS, but my elaborated 
grammar confuses the count.

The CPLUSPLS.TXT paper is FAR from a publishable quality paper, but 
it discusses many of the issues involved in ambiguities in my 
grammar, and more generally in the C++ language. I hope CPLUSPLS.TXT 
it is a vast improvement over "nothing at all", but apologize in 
advance for lack of polished structure, and the presence of many 
typos (which must surely be present). I hope you find this 
almost-paper interesting.  My attempts to document conflicts have 
certainly clarified the problems in my mind. Based on my experience 
with the conflicts I have identified, most current compilers and 
translator fall prey to the situations that I have uncovered.  I hope 
that other compilers, that do not make use of the grammar I have made 
available, will at least seek to standardize the resolution of the 
problems identified.

As a small commercial message... I am a freelance consultant, and I 
travel far and wide to perform contracts.  If you like the work that 
I am presenting in this group of documents, and would like to see a 
resume or at least talk, please feel free to contact me.

I hope that the grammars that I have provided, will lead to many 
successful C++ processing projects.

Jim Roskind
Independent Consultant
516 Latania Palm Drive
Indialantic FL 32903
(407)729-4348
jar@ileaf

rfg@ics.uci.edu (Ronald Guilmette) (03/21/90)

In article <1432@io.UUCP> jar@io.UUCP (Jim Roskind x5570) writes:
>
>In light of the recent debate on comp.lang.c++ about the
>"impossibility" of generating a YACC grammar for C++ (that doesn't
>also accept every possible string of tokens), providing these
>grammars at this point is especially nice.  Rather than addressing
>each of the ad hoc arguments that such an endeavor is impossible (the
>least convincing of which said effectively: "... heard it was
>impossible, ... something about LALR ...", and the most correct of
>which said: "...it is not possible to create a simple YACC grammar
>for C++.."), I am offering my grammars as my rather complete
>commentary on the topic. (I'll probably get some flaming due to this
>comment, but I had to read all the stuff that was posted, that told
>me what I had done was impossible. ;-) )  My general feeling is that
>it is a shame for opinions to be confused with facts, and for
>impossibility to be confused with difficulty or complexity.

I wish I had said that!

Thanks Jim.


// Ron Guilmette (rfg@ics.uci.edu)
// C++ Entomologist
// Motto:  If it sticks, force it.  If it breaks, it needed replacing anyway.