[comp.lang.pascal] Type syntax in Extended Pascal

corbett%BEATNIX.UUCP%CUNYVM.CUNY.EDU@cunyvm.cuny.ed (Bob Corbett) (12/17/88)
In the lastest issue of SIGPLAN Notices (December 1988), R. T. House of the
Darling Downs Institute of Advanced Education argues against one of the
extensions included in Extended Pascal.  The extension to which he objects is
that the bounds of subrange types are now allowed to be constant expressions
instead of just constants.  His objection is that this generalization makes the
language much harder to parse.  The problem is caused by the similarity between
the start of an enumerated type and the start of a parenthesized expression.
The code fragments

      type t = (red);

and

      type t = (red)..(green);

illustrate the problem.  The intuitive LL(k) grammar for Extended Pascal types
requires the parser to determine if the type to the right of the equals sign is
an enumerated type or a subrange type before it advances over the first
parenthesis.  Thus, k must be >= 4 for the intuitive LL(k) grammar.  Similarly,
the intuitive LR(k) grammar requires the parser to decide if red is an
enumeration literal or a factor before it shifts over the second parenthesis.
Thus, k must be >= 2 for the intuitive LR(k) grammar.  He points out how
difficult it is to produce an LL(1) or LR(1) grammar for the syntax of Extended
Pascal.  He goes on to claim that

     This trivial problem seems likely to cost millions of dollars in
     extra compiler development time and reduce machine efficiency
     when compilers are in use.

Mr. House's fears are without basis.  This problem will not cost millions of
dollars because we compiler writers are not paid enough to drive the cost up
that high.  As it happens, there is a simple way of handling this problem,
which should keep the cost in extra compiler development time low.

As part of my study of Extended Pascal for the second public review, I wrote
an LALR(1) grammar for the language.  I found the same problem as did Mr.
House, and I reported it to X3J9.  However, as I mentioned in my comments, I
thought the consistency from the users' viewpoint outweighed the problems
presented to compiler writers.  Now, however, I have gone past the point of
just parsing the language and have discovered that the problem is easier to
handle while compiling than while simply parsing.

I currently handle the problem as follows.  My grammar for types includes the
following rules:

        type:  LPAR ID COMMA idlist LPAR
                |  expr DOTDOT expr
        |  expr

where LPAR is the token '(', RPAR is the token ')', COMMA is the token ',',
DOTDOT is the token '..', ID is an identifier, idlist is the nonterminal symbol
for a list of identifiers, and expr is the nonterminal symbol for expressions.
This part of the grammar is LALR(1).  The compiler builds an abstract syntax
tree (AST) for each type as it parses.  When either of the first two forms of
types is recognized, it is handled in the natural way.  When the last form of
type is recognized, the compiler checks to see if the AST has the form

                   type
                   / | \
                  /  |  \
                 (   ID  )

If it does, the compiler treats the type as an enumeration type and declares the
identifier to be an enumeration literal.  If it does not, the compiler reports
a syntax error and prints a message stating that it expected the following token
to be '..'.  Note that this scheme works because I am building an AST.  When I
was simply parsing, this scheme did not occur to me.

                        Robert Paul Corbett
                        uunet!elxsi!corbett
                        ucbvax!sun!elxsi!corbett