wsmith@m.cs.uiuc.edu (09/19/88)
What are the advantages and disadvantages of designing a language that can be
lexically analyzed without resort to semantic information?  Pascal, Prolog, and
Smalltalk are such languages, because each can be unambiguously parsed when the
lexical class of every token is determined by a finite-state machine (i.e., a
regular expression).  C is not such a language, because of the typedef
construct: a typedef changes the lexical class of the new type's identifier to
avoid horrendous ambiguity in the language.  C++ has the same problem, only
worse.

What I would like to do in a new language is give the lexical analyzer more
lexical categories.  For example, TypeIdentifier and variableidentifier would
be lexically different because the first has an initial uppercase letter while
the second doesn't.  If other categories of identifiers are needed, they could
be defined as identifiers-with-a-minus-sign, or under_scored, or Under_Scored;
each could belong to a different category if there were a valid semantic reason
to distinguish between them.  (Prolog and Smalltalk already do this.)  There
are many possibilities, and I don't mean to limit myself to splitting only the
lexical classes for identifiers.  Other tokens could also have several uses,
but identifiers were the easiest to explain.

Advantages I can see:

1.  A person familiar with the lexical rules of the language can more easily
    understand a routine without consulting all of the declarations involved.

2.  The lexical analyzer and parser could be separated into two processes,
    possibly improving the performance of the compiler on a parallel
    architecture.

3.  Post-processors of the language, such as a pretty printer or
    cross-reference utility, do not need the symbol-table part of the compiler
    in order to be written.

4.  It is easier to make a language-oriented editor wholly table driven.

Disadvantages I can see:

1.  An explosion in the number of lexical classes may be too difficult to
    remember.

2.  Users may disagree with the language designer's set of lexical classes
    (or be just plain stubborn).

Bill Smith
uiucdcs!wsmith
wsmith@cs.uiuc.edu
liberte@m.cs.uiuc.edu (09/20/88)
My interpretation of Bill Smith's argument is that languages should distinguish
tokens lexically if they are used in different ways syntactically.  As part of
that argument, he says:

> C is not such a language because of the typedef construct.  A typedef
> changes the lexical class of the new type's identifier to avoid
> horrendous ambiguity in the language.

But a typedef would not have to change the lexical class of identifiers if
typedef identifiers were used in the syntax only in unambiguous ways.  However,
they are used ambiguously, and the only way to disambiguate is to use the
(semantic) fact that an identifier was declared as a typedef.  For example,
when you see "foo * bar;" in C, you can't tell whether it is a use of the
typedef identifier "foo" or a multiplication expression; not without looking
back to find out whether the next and previous lines are declarations and/or
whether "foo" is a typedef id.  But a better syntax could make that locally
unambiguous.

Pascal also supports user-defined type identifiers, but they may only be used,
like all type identifiers, in unambiguous ways.  Actually, some Pascals support
type casting that looks identical to a function call, so this is ambiguous in
some sense, but it doesn't matter at the syntactic level.  (A similar example
is the ambiguity between a parameterless function call and a variable
reference.)

So, while you see this as a problem to be solved at the lexical level, I would
prefer to solve it at the syntax level.

> Advantages I can see:
>
> 1.  A person familiar with the lexical rules of the language can
>     more easily understand a routine without consulting all of the
>     declarations involved.

Granted.  But if the syntax doesn't permit ambiguous use of tokens, then a
reader of a program should be able to tell pretty quickly what is meant from
the local context.

The other advantages you list relate to splitting up lexical and syntax
processing without requiring semantic information.  But the same advantages
obtain with a syntactic solution.

Dan LaLiberte
uiucdcs!liberte
liberte@cs.uiuc.edu
liberte%a.cs.uiuc.edu@uiucvmd.bitnet
nick@ccicpg.UUCP (Nick Crossley) (09/21/88)
In article <5200026@m.cs.uiuc.edu> wsmith@m.cs.uiuc.edu writes:
>
> What I would like to do in a new language is make the lexical analyzer have
> more lexical categories.  For example, TypeIdentifier and variableidentifier
> would be lexically different because the first has an initial uppercase letter
> while the second doesn't.  If other categories of identifiers are needed, they
> may be defined to be identifiers-with-a-minus-sign or under_scored or
> Under_Scored; each could belong to a different category if there was a valid
> semantic reason to distinguish between them.  (Prolog and Smalltalk already do
> this.)
>
> Bill Smith uiucdcs!wsmith wsmith@cs.uiuc.edu

I agree it is a good idea in language design to ensure that lexing is
independent of parsing.  Algol68 does this to some extent.  'Keywords' (such
as BEGIN/END, etc.) are individual symbols, which could be represented with a
single character if an implementation had a large enough character set and
appropriate keyboards.  In practice, and in the reference language, these
symbols are normally represented using the appropriate sequence of letters,
but in a different alphabet from variable names.  Type and operator
identifiers are also spelled using this different alphabet.  The language does
not restrict how these two alphabets are implemented; conventionally upper and
lower case are used, but bold, underlining, quoting, etc., are all possible.

This does leave one (nasty) ambiguity, between type and operator names.
Consider the fragment:

    WORD a;

Is this a declaration of a REF WORD a, or is it a monadic formula, with the
monadic operator WORD?  This is very similar to the C typedef problem, and is
usually solved in a similar way: the lexer initially thinks all uppercase
words are type names, but the parser will change that when it sees an operator
or priority definition.
This often implies an implementation restriction that a single compilation
unit cannot use an uppercase word as both a type and an operator (in different
scopes).  This ambiguity could be solved by using a third alphabet (say
italic) for operators.

Problems with lexing operator tokens, when the user is allowed to define
additional operators, were avoided by splitting all possible operator
characters into two classes, monad and nomad.  Dyadic operators can be any of
the combinations:

    monad
    nomad
    monad nomad
    nomad nomad

Monadic operators can be either of:

    monad
    monad nomad

Nomad characters are < > / = * x (a times-symbol, not the letter x).  All
others are monad.  Note that this avoids all possible lexical ambiguities in
building up tokens from characters; it does not distinguish between a dyadic
and a monadic operator, as in Algol68 there is no need to do this at the
lexical level.  It does ensure that the lexing of a sequence of operator
characters is fixed and does not depend on context.

Note that this scheme makes the C operator -- impossible in Algol68: the lexer
would return two separate tokens, since - is a monad and there is no operator
formed by 'monad monad'.

--
<<< standard disclaimers >>>
Nick Crossley, CCI, 9801 Muirlands, Irvine, CA 92718-2521, USA
Tel. (714) 458-7282, uucp: ...!uunet!ccicpg!nick
haahr@phoenix.Princeton.EDU (Paul Gluckauf Haahr) (09/22/88)
in article <5200026@m.cs.uiuc.edu> wsmith@m.cs.uiuc.edu writes:

> What I would like to do in a new language is make the lexical analyzer
> have more lexical categories.  For example, TypeIdentifier and
> variableidentifier would be lexically different because the first has
> an initial uppercase letter while the second doesn't.  If other
> categories of identifiers are needed, they may be defined to be
> identifiers-with-a-minus-sign or under_scored or Under_Scored each
> could belong to a different category if there was a valid semantic
> reason to distinguish between them.

a suggestion: how about reserving all identifiers that start with i through n
(the first two letters of "integer") for integers, and assuming all other
identifiers are real.  that way, a compiler would not have to put information
about the type in the symbol table, it could just look at the first character,
and we can do away with declarations.

i believe there was once a programming language that did this :-)

paul (princeton!haahr)