corbett%BEATNIX.UUCP%CUNYVM.CUNY.EDU@cunyvm.cuny.ed (Bob Corbett) (12/17/88)
In the lastest issue of SIGPLAN Notices (December 1988), R. T. House of the Darling Downs Institute of Advanced Education argues against one of the extensions included in Extended Pascal. The extension to which he objects is that the bounds of subrange types are now allowed to be constant expressions instead of just constants. His objection is that this generalization makes the language much harder to parse. The problem is caused by the similarity between the start of an enumerated type and the start of a parenthesized expression. The code fragments type t = (red); and type t = (red)..(green); illustrate the problem. The intuitive LL(k) grammar for Extended Pascal types requires the parser to determine if the type to the right of the equals sign is an enumerated type or a subrange type before it advances over the first parenthesis. Thus, k must be >= 4 for the intuitive LL(k) grammar. Similarly, the intuitive LR(k) grammar requires the parser to decide if red is an enumeration literal or a factor before it shifts over the second parenthesis. Thus, k must be >= 2 for the intuitive LR(k) grammar. He points out how difficult it is to produce an LL(1) or LR(1) grammar for the syntax of Extended Pascal. He goes on to claim that This trivial problem seems likely to cost millions of dollars in extra compiler development time and reduce machine efficiency when compilers are in use. Mr. House's fears are without basis. This problem will not cost millions of dollars because we compiler writers are not paid enough to drive the cost up that high. As it happens, there is a simple way of handling this problem, which should keep the cost in extra compiler development time low. As part of my study of Extended Pascal for the second public review, I wrote an LALR(1) grammar for the language. I found the same problem as did Mr. House, and I reported it to X3J9. However, as I mentioned in my comments, I thought the consistency from the users' viewpoint outweighed the problems presented to compiler writers. Now, however, I have gone past the point of just parsing the language and have discovered that the problem is easier to handle while compiling than while simply parsing. I currently handle the problem as follows. My grammar for types includes the following rules: type: LPAR ID COMMA idlist LPAR | expr DOTDOT expr | expr where LPAR is the token '(', RPAR is the token ')', COMMA is the token ',', DOTDOT is the token '..', ID is an identifier, idlist is the nonterminal symbol for a list of identifiers, and expr is the nonterminal symbol for expressions. This part of the grammar is LALR(1). The compiler builds an abstract syntax tree (AST) for each type as it parses. When either of the first two forms of types is recognized, it is handled in the natural way. When the last form of type is recognized, the compiler checks to see if the AST has the form type / | \ / | \ ( ID ) If it does, the compiler treats the type as an enumeration type and declares the identifier to be an enumeration literal. If it does not, the compiler reports a syntax error and prints a message stating that it expected the following token to be '..'. Note that this scheme works because I am building an AST. When I was simply parsing, this scheme did not occur to me. Robert Paul Corbett uunet!elxsi!corbett ucbvax!sun!elxsi!corbett