[comp.std.c++] parsing C++ woes

kearns@softrue.UUCP (Steven Kearns) (03/11/91)

A Parsing Pitfall related to "C++ with nested types".
=============================================================

I think that everyone wants C++ to be parsable without having to write the
complete internals of a compiler.  However, as things now stand, this may
not be possible.  

On pages 93 and 96 of E&S, it is made clear that the definition of
C++ syntax assumes that a lexical analyzer returns identifiers, and
that the parser can tell if an identifier is a CLASSname or
TYPEDEFname or neither.  So all CLASSnames are also identifiers, and
all TYPEDEFnames are also identifiers, and an identifier may also be
a CLASSname, a TYPEDEFname, or neither, but not both.

This distinction is neccessary in order to disambiguate the grammar,
which is otherwise quite ambiguous. 

Unfortunately, yacc doesn't work like that, because in yacc a token
cannot be classified in two ways simultaneously.  Instead people have
developed a hack in which the lexical analyzer classifies names as a
CLASSname, TYPEDEFname, or NEITHERname.  Then every use of
"identifier" in the grammar has to be expanded to possibly use any
one of "CLASSname", "TYPEDEFname", or "NEITHERname".  When yacc
realizes that a new name has been declared, it must inform the
lexical analyzer as quickly as possible so that future occurrences of
that name can be classified correctly.  Gross!

Another possibility is to modify yacc so it understands "inherited
tokens" such as the relationship between CLASSname and identifier.
Yacc still has to update the name interpretation process when a new
variable is declared.

Now that C++ 2.1 insists on allowing type names to be nested just
like variable names, the lexical analyzer's task is much harder.  The
classification of an identifier depends on the active scope, and
there are many places that the active scope changes when parsing a
C++ program: after a class qualifier, inside a class declaration,
inside a compound statement, etc..

The question of the day is, Does the active scope change after the
"." (DOT) or "->" (ARROW) operators?

The "name" that appears after a "." or "->" must be a member of the
class referred to on the left side of the "." or "->".  From this it
might be tempting to define the C++ SYNTAX so that the names on the
right hand side are interpreted in the scope of the object referred to
on the left hand side.  After all, in "foo.a", we expect name "a" to
be a member of the class referred to by foo.  

BUT THIS DEFINITION WOULD BE DISASTEROUS!  The left side of "." and
"->" might be an arbitrary expression, potentially involving
overloaded functions and the complete range of C++ complexity, and
changing to its scope would mean calculating its type DURING PARSING.
This would completely eliminate the possibility of making a simple
parser for C++.

Instead, names on the right hand side of a "." or "->" should be
interpreted in the same scope as the left hand side DURING PARSING.
Later, these names should be reinterpreted in scope of the class
referred to by the left hand side.  

The bottom line is that the syntax given for a "name" in E&S must
be changed.  Here is the old definition:

/* grammar 1 */

name:  
	identifier
	operator-function-name
	conversion-function-name
	~ CLASSname
	qualified-name

The problem with this definition is that the definitions for
"conversion function", "~ CLASSname", and "qualified-name" all make
use of the CLASSname token, thereby assuming that the lexical
analyzer can make this distinction in this context.  However, I just
pointed out that it is infeasible to classify a name as a CLASSname
or not on the right side of a "." or "->", during parsing.  Therefore
the syntax for "name" must be modified.  Here is one way, which
basically involves rewriting the relevant grammar rules so that all
references to CLASSname tokens are replaced by references to the less
specific "identifier" token.  

/* grammar 2 */

name:  
	identifier
	operator-function-name
	conversion-function-name-2
	~ identifier
	qualified-name-2

qualified-name-2:
	qualified-class-name-2 :: name

qualified-class-name-2:
	identifier
	identifier :: qualified-class-name-2

conversion-function-name-2:
	OPERATOR conversion-type-name-2

conversion-type-name-2:
	type-specifier-list-2 ptr-operator-2

...... it goes on and on and on and on ........


I would not be surprised if this change introduces more ambiguities
into the language.  

Here is an illustrative example:

class outer {
	class inner {
		int i;
	};
	int j;
};

main () {
	outer o;
	o.j;   // this must be legal
	int a;

	o.operator inner();        // are both of these legal?
	o.operator outer::inner(); // probably should be!

	o.a;   // should be syntactically correct, semantically wrong
	o.operator bobble();  // syntactically correct, semantically wrong
};

********************************************************
* Steven Kearns            ....uunet!softrue!kearns    *
* Software Truth           softrue!kearns@uunet.uu.net *
********************************************************

peter@prefect.Berkeley.EDU (Peter Moore) (03/13/91)

> I think that everyone wants C++ to be parsable without having to write the
> complete internals of a compiler.  However, as things now stand, this may
> not be possible.  

I think you are making this sound harder than it really is.  The
typename/id distinction is not new to C++.  C has the same problem,
albeit to a lesser extent.  The solution is simple: you have a symbol
table that the lexer uses to classify an identifier as a typename or a
simple id.  The parser updates entries in this symbol table as
typenames go in and out of scope, and the lexer sees these changes
next time it looks up an id.  (I haven't thought very hard about it,
but it might be technically possible to avoid the problem in C by
clever grammars, but I think almost everyone makes life easier for
themselves and uses symbol table information instead).

C++ scoping rules are more complicated than C, but a symbol table is
still straight forward.  Each class has a symbol table associated with
it, with its table linked to the table of its super class.  At any
time, the parser has a stack of the current active scopes.  When
entering a member function, or interpreting a member name, the symbol
table of the class is pushed on to the stack.

> Unfortunately, yacc doesn't work like that, because in yacc a token
> cannot be classified in two ways simultaneously.  Instead people have
> developed a hack in which the lexical analyzer classifies names as a
> CLASSname, TYPEDEFname, or NEITHERname.  Then every use of
> "identifier" in the grammar has to be expanded to possibly use any
> one of "CLASSname", "TYPEDEFname", or "NEITHERname".

For cases where an id or typedef name are both acceptable, rules like:
id
	:   ID
	|   TYPEDEF
	    { $$ = id_from_typedef($1); } 
	;

typedef
	:   TYPEDEF
	|   ID
	    { $$ = typedef_from_id($1); }
	;

isolate most of the trouble.  (Yes, there are either some semantic
checks in the conversion routines, or checks further up in the grammar).

The above are easy parts of C++ parsing.  The worst part is the
declaration/expression ambiguity that can require unbounded look-ahead
(see ARM pg 93).  A lesser problem (but one closer to my heart) is the
strain that

	a) collapsing the the union/struct/enum tag name space into
	   the typedef name space.
	b) having typedef names drift into the object/function name
	   space in the special case of constructors/function-like casts
	c) allowing typedef names as member and variable names

puts on someone trying to parse member names.  In particular, making
sure that you don't interpret a constructor declaration as a field
declaration with an extra set of parenthesis is painful.

	Peter Moore