[comp.lang.misc] Dumb Lexical Analyzers are Smart

wsmith@m.cs.uiuc.edu (09/19/88)

What are the advantages and disadvantages of designing a language that may
be lexically analyzed with no resort to semantic information?  

Pascal, Prolog, and Smalltalk are such languages because each may be
unambiguously parsed when the lexical class of every token is determined
by a finite-state machine (i.e., by regular expressions).  C is not such
a language because of the typedef construct.  A typedef changes
the lexical class of the new type's identifier to avoid horrendous ambiguity in
the language.  C++ has the same problem, only worse.
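
To see why, consider the well-known workaround a C compiler needs (a
simplified sketch in C; symtab_is_typedef() is a hypothetical helper,
not any particular compiler's routine):

	enum token { TOKEN_IDENTIFIER, TOKEN_TYPE_NAME };

	/* Hypothetical symbol-table query; a real compiler would search
	   the declarations currently in scope. */
	extern int symtab_is_typedef(const char *name);

	/* The feedback loop C forces on a lexer: the token class of an
	   identifier depends on semantic information gathered by the
	   parser. */
	enum token classify_identifier(const char *name)
	{
		if (symtab_is_typedef(name))
			return TOKEN_TYPE_NAME;	/* foo after "typedef int foo;" */
		return TOKEN_IDENTIFIER;
	}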

What I would like to do in a new language is make the lexical analyzer have
more lexical categories.  For example, TypeIdentifier and variableidentifier
would be lexically different because the first has an initial uppercase letter 
while the second doesn't.  If other categories of identifiers are needed, they
may be defined to be identifiers-with-a-minus-sign, or under_scored, or
Under_Scored; each could belong to a different category if there were a valid
semantic reason to distinguish between them.  (Prolog and Smalltalk already do 
this.)
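
Such a lexer could classify identifiers from spelling alone.  A sketch
in C (the category names and the exact rules here are only
illustrative, not a worked-out design):

	#include <ctype.h>
	#include <string.h>

	enum ident_class { VARIABLE_ID, TYPE_ID, UNDERSCORED_ID };

	/* Purely lexical classification: no symbol table is consulted,
	   only the characters of the (non-empty) identifier itself. */
	enum ident_class classify(const char *s)
	{
		if (strchr(s, '_') != NULL)
			return UNDERSCORED_ID;		/* "under_scored" */
		if (isupper((unsigned char)s[0]))
			return TYPE_ID;			/* "TypeIdentifier" */
		return VARIABLE_ID;			/* "variableidentifier" */
	}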

There are a lot of possibilities, and I don't mean to limit myself to only
splitting the lexical classes for identifiers.  Other tokens could also 
have several uses, but identifiers were the easiest to explain.  

Advantages I can see:

	1.  A person familiar with the lexical rules of the language can 
		more easily understand a routine without consulting all of the
		declarations involved.

	2.  The lexical analyzer and parser could be separated into two 
		processes, possibly improving the performance of the compiler 
		on a parallel architecture.

	3.  Post-processors of the language, such as a pretty printer or
		cross-reference utility, can be written without the
		symbol-table part of the compiler.
	
	4.  It is easier to make a language-oriented editor wholly
		table-driven.

Disadvantages I can see:

	1.   An explosion in the number of lexical classes may make them
		too difficult to remember.

	2.   Users may disagree with the language designer's set of lexical
		classes (or be just plain stubborn).

Bill Smith		uiucdcs!wsmith		wsmith@cs.uiuc.edu

liberte@m.cs.uiuc.edu (09/20/88)

My interpretation of Bill Smith's argument is that languages should
distinguish tokens lexically if they are used in different ways
syntactically.  As part of that argument, he says:

> C is not such
> a language because of the typedef construct.  A typedef changes
> the lexical class of the new type's identifier to avoid
> horrendous ambiguity in the language.

But a typedef would not have to change the lexical class of
identifiers if typedef identifiers were only used in the syntax in
unambiguous ways.  However, they are used ambiguously and the only
way to disambiguate is to use the (semantic) fact that an
identifier was declared as a typedef.  

For example, when you see "foo * bar;" in C, you can't tell whether
it is a use of the typedef identifier "foo" or a multiplication
expression; not without looking back to find out if the next and
previous lines are declarations and/or if "foo" is a typedef id. 
But a better syntax could make that locally unambiguous.
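
To spell out the two readings (a sketch; the two fragments must live in
separate files, since foo cannot be both a typedef and a variable in
one scope):

	/* file 1 */
	typedef int foo;
	void f(void) { foo * bar; }	/* declares bar, a pointer to foo */

	/* file 2 */
	int foo, bar;
	void g(void) { foo * bar; }	/* multiplies foo by bar, discards it */

The token sequence is identical in both files; only the earlier typedef
tells the compiler which parse to choose.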

Pascal also supports user defined type identifiers, but they may
only be used, like all type identifiers, in unambiguous ways. 
Actually, some Pascals support type casting that looks identical to
function calls, so this is ambiguous in some sense, but it doesn't
matter at the syntactic level.  (A similar example is the ambiguity
between a parameterless function call and a variable reference.)

So, while you see it as a problem to be solved at the lexical
level, I would prefer to solve it at the syntax level.

> Advantages I can see:
> 
> 	1.  A person familiar with the lexical rules of the language can 
> 		more easily understand a routine without consulting all of the
> 		declarations involved.

Granted.  But if the syntax doesn't permit ambiguous use of tokens,
then a reader of a program should be able to tell pretty quickly
what is meant from the local context.

The other advantages you list relate to splitting up lexical and
syntax processing and not requiring semantic info.  But the same
advantages obtain with a syntactic solution.

Dan LaLiberte
uiucdcs!liberte
liberte@cs.uiuc.edu
liberte%a.cs.uiuc.edu@uiucvmd.bitnet

nick@ccicpg.UUCP (Nick Crossley) (09/21/88)

In article <5200026@m.cs.uiuc.edu> wsmith@m.cs.uiuc.edu writes:
>
>What I would like to do in a new language is make the lexical analyzer have
>more lexical categories.  For example, TypeIdentifier and variableidentifier
>would be lexically different because the first has an initial uppercase letter 
>while the second doesn't.  If other categories of identifiers are needed, they
>may be defined to be identifiers-with-a-minus-sign, or under_scored, or
>Under_Scored; each could belong to a different category if there were a valid
>semantic reason to distinguish between them.  (Prolog and Smalltalk already do 
>this.)
>
>Bill Smith		uiucdcs!wsmith		wsmith@cs.uiuc.edu

I agree it is a good idea in language design to ensure that lexing is
independent of parsing.

Algol68 does this to some extent.  'Keywords' (such as BEGIN/END, etc.) are
individual symbols, which could be represented with a single character if
an implementation had a large enough character set and appropriate keyboards.
In practice, and in the reference language, these symbols are normally
represented using the appropriate sequence of letters, but in a different
alphabet from variable names.  Type and operator identifiers are also spelled
using this different alphabet.

The language does not restrict how these two alphabets are implemented;
conventionally upper and lower case are used, but bold, underlining, quoting,
etc., are all possible.
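
With the conventional upper/lower case representation the distinction
is again purely lexical; a classifier needs no symbol table (a C sketch
of my own, not actual Algol68 compiler code):

	#include <ctype.h>

	/* Under uppercase stropping, an all-uppercase word belongs to
	   the "bold" alphabet (keywords, types, operator names);
	   anything else is an ordinary identifier. */
	int is_bold_word(const char *s)
	{
		if (*s == '\0')
			return 0;
		for (; *s != '\0'; s++)
			if (!isupper((unsigned char)*s))
				return 0;
		return 1;
	}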

This does leave one (nasty) ambiguity, between type and operator names.
Consider the fragment :-
	WORD a;
Is this a declaration of a REF WORD a, or is it a monadic formula, with
the monadic operator WORD?  This is very similar to the C typedef problem,
and is usually solved in a similar way.  The lexer initially thinks all
uppercase words are type names, but the parser will change that when it
sees an operator or priority definition.  This often implies an implementation
restriction that a single compilation unit cannot use an uppercase word
as both a type and an operator (in different scopes).  This ambiguity
could be solved by using a third alphabet (say italic) for operators.

Problems with lexing operator tokens, when the user is allowed to define
additional operators, were avoided by splitting all possible operator
characters into two classes, monad and nomad.  Dyadic operators can be
any of the combinations :-
	monad
	nomad
	monad nomad
	nomad nomad
Monadic operators can be either of :-
	monad
	monad nomad
The nomad characters are < > / = * x (a times-symbol, not the letter x);
all other operator characters are monads.

Note that this avoids all possible lexical ambiguities in building up tokens
from characters; it does not distinguish between dyadic and monadic operators,
as in Algol68 there is no need to do this at the lexical level.  It does ensure
that the lexing of a sequence of operator characters is fixed and does not
depend on context.  Note that this scheme makes the C operator -- impossible
in Algol68: the lexer would return two separate tokens, since - is a monad
and there is no operator formed by 'monad monad'.
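
The whole rule fits in a few lines (a C sketch of my own, following the
classes described above; the times-symbol is omitted because it is not
ASCII):

	#include <string.h>

	/* The nomads; every other operator character is a monad. */
	static int is_nomad(char c)
	{
		return c != '\0' && strchr("<>/=*", c) != NULL;
	}

	/* Length (1 or 2) of the operator token starting at s, where s
	   points at an operator character.  A second character is taken
	   exactly when it is a nomad, giving the shapes monad, nomad,
	   monad-nomad, and nomad-nomad.  "--" therefore always lexes as
	   two tokens, independent of context. */
	int operator_token_length(const char *s)
	{
		return is_nomad(s[1]) ? 2 : 1;
	}
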
-- 

<<< standard disclaimers >>>
Nick Crossley, CCI, 9801 Muirlands, Irvine, CA 92718-2521, USA
Tel. (714) 458-7282,  uucp: ...!uunet!ccicpg!nick

haahr@phoenix.Princeton.EDU (Paul Gluckauf Haahr) (09/22/88)

in article <5200026@m.cs.uiuc.edu> wsmith@m.cs.uiuc.edu writes:
> What I would like to do in a new language is make the lexical analyzer
> have more lexical categories.  For example, TypeIdentifier and
> variableidentifier would be lexically different because the first has
> an initial uppercase letter while the second doesn't.  If other
> categories of identifiers are needed, they may be defined to be
> identifiers-with-a-minus-sign, or under_scored, or Under_Scored; each
> could belong to a different category if there were a valid semantic
> reason to distinguish between them.

a suggestion: how about reserving all identifiers that start with i
through n (the first two letters of "integer") for integers, and
assuming all other identifiers are real.  that way, a compiler would
not have to put information about the type in the symbol table; it
could just look at the first character, and we can do away with
declarations.

i believe there was once a programming language that did this :-)

paul (princeton!haahr)