kearns@softrue.UUCP (Steven Kearns) (03/11/91)
A Parsing Pitfall related to "C++ with nested types". ============================================================= I think that everyone wants C++ to be parsable without having to write the complete internals of a compiler. However, as things now stand, this may not be possible. On pages 93 and 96 of E&S, it is made clear that the definition of C++ syntax assumes that a lexical analyzer returns identifiers, and that the parser can tell if an identifier is a CLASSname or TYPEDEFname or neither. So all CLASSnames are also identifiers, and all TYPEDEFnames are also identifiers, and an identifier may also be a CLASSname, a TYPEDEFname, or neither, but not both. This distinction is neccessary in order to disambiguate the grammar, which is otherwise quite ambiguous. Unfortunately, yacc doesn't work like that, because in yacc a token cannot be classified in two ways simultaneously. Instead people have developed a hack in which the lexical analyzer classifies names as a CLASSname, TYPEDEFname, or NEITHERname. Then every use of "identifier" in the grammar has to be expanded to possibly use any one of "CLASSname", "TYPEDEFname", or "NEITHERname". When yacc realizes that a new name has been declared, it must inform the lexical analyzer as quickly as possible so that future occurrences of that name can be classified correctly. Gross! Another possibility is to modify yacc so it understands "inherited tokens" such as the relationship between CLASSname and identifier. Yacc still has to update the name interpretation process when a new variable is declared. Now that C++ 2.1 insists on allowing type names to be nested just like variable names, the lexical analyzer's task is much harder. The classification of an identifier depends on the active scope, and there are many places that the active scope changes when parsing a C++ program: after a class qualifier, inside a class declaration, inside a compound statement, etc.. The question of the day is, Does the active scope change after the "." (DOT) or "->" (ARROW) operators? The "name" that appears after a "." or "->" must be a member of the class referred to on the left side of the "." or "->". From this it might be tempting to define the C++ SYNTAX so that the names on the right hand side are interpreted in the scope of the object referred to on the left hand side. After all, in "foo.a", we expect name "a" to be a member of the class referred to by foo. BUT THIS DEFINITION WOULD BE DISASTEROUS! The left side of "." and "->" might be an arbitrary expression, potentially involving overloaded functions and the complete range of C++ complexity, and changing to its scope would mean calculating its type DURING PARSING. This would completely eliminate the possibility of making a simple parser for C++. Instead, names on the right hand side of a "." or "->" should be interpreted in the same scope as the left hand side DURING PARSING. Later, these names should be reinterpreted in scope of the class referred to by the left hand side. The bottom line is that the syntax given for a "name" in E&S must be changed. Here is the old definition: /* grammar 1 */ name: identifier operator-function-name conversion-function-name ~ CLASSname qualified-name The problem with this definition is that the definitions for "conversion function", "~ CLASSname", and "qualified-name" all make use of the CLASSname token, thereby assuming that the lexical analyzer can make this distinction in this context. However, I just pointed out that it is infeasible to classify a name as a CLASSname or not on the right side of a "." or "->", during parsing. Therefore the syntax for "name" must be modified. Here is one way, which basically involves rewriting the relevant grammar rules so that all references to CLASSname tokens are replaced by references to the less specific "identifier" token. /* grammar 2 */ name: identifier operator-function-name conversion-function-name-2 ~ identifier qualified-name-2 qualified-name-2: qualified-class-name-2 :: name qualified-class-name-2: identifier identifier :: qualified-class-name-2 conversion-function-name-2: OPERATOR conversion-type-name-2 conversion-type-name-2: type-specifier-list-2 ptr-operator-2 ...... it goes on and on and on and on ........ I would not be surprised if this change introduces more ambiguities into the language. Here is an illustrative example: class outer { class inner { int i; }; int j; }; main () { outer o; o.j; // this must be legal int a; o.operator inner(); // are both of these legal? o.operator outer::inner(); // probably should be! o.a; // should be syntactically correct, semantically wrong o.operator bobble(); // syntactically correct, semantically wrong }; ******************************************************** * Steven Kearns ....uunet!softrue!kearns * * Software Truth softrue!kearns@uunet.uu.net * ********************************************************
peter@prefect.Berkeley.EDU (Peter Moore) (03/13/91)
> I think that everyone wants C++ to be parsable without having to write the > complete internals of a compiler. However, as things now stand, this may > not be possible. I think you are making this sound harder than it really is. The typename/id distinction is not new to C++. C has the same problem, albeit to a lesser extent. The solution is simple: you have a symbol table that the lexer uses to classify an identifier as a typename or a simple id. The parser updates entries in this symbol table as typenames go in and out of scope, and the lexer sees these changes next time it looks up an id. (I haven't thought very hard about it, but it might be technically possible to avoid the problem in C by clever grammars, but I think almost everyone makes life easier for themselves and uses symbol table information instead). C++ scoping rules are more complicated than C, but a symbol table is still straight forward. Each class has a symbol table associated with it, with its table linked to the table of its super class. At any time, the parser has a stack of the current active scopes. When entering a member function, or interpreting a member name, the symbol table of the class is pushed on to the stack. > Unfortunately, yacc doesn't work like that, because in yacc a token > cannot be classified in two ways simultaneously. Instead people have > developed a hack in which the lexical analyzer classifies names as a > CLASSname, TYPEDEFname, or NEITHERname. Then every use of > "identifier" in the grammar has to be expanded to possibly use any > one of "CLASSname", "TYPEDEFname", or "NEITHERname". For cases where an id or typedef name are both acceptable, rules like: id : ID | TYPEDEF { $$ = id_from_typedef($1); } ; typedef : TYPEDEF | ID { $$ = typedef_from_id($1); } ; isolate most of the trouble. (Yes, there are either some semantic checks in the conversion routines, or checks further up in the grammar). The above are easy parts of C++ parsing. The worst part is the declaration/expression ambiguity that can require unbounded look-ahead (see ARM pg 93). A lesser problem (but one closer to my heart) is the strain that a) collapsing the the union/struct/enum tag name space into the typedef name space. b) having typedef names drift into the object/function name space in the special case of constructors/function-like casts c) allowing typedef names as member and variable names puts on someone trying to parse member names. In particular, making sure that you don't interpret a constructor declaration as a field declaration with an extra set of parenthesis is painful. Peter Moore