[net.lang.c] C not LALR

richw@ada-uts.UUCP (01/18/86)

C's grammar is CONTEXT SENSITIVE !?    Can it be ?!

The following is quoted from page 121 of "C: A Reference Manual" by
Harbison & Steele (which, by the way, beats the pants off of
Kernighan & Ritchie as a reference manual).  After the quote,
I've included a small program which just may reveal a minor bug
in your C compiler (it did for mine).


    Allowing ordinary identifiers, as opposed to reserved words only,
    as type specifiers makes the C grammar context sensitive, and
    hence not LALR(1).  To see this, consider this program line

                       A ( *B );

    If A has been defined as a typedef name, then the line is a
    declaration of a variable B to be of type "pointer to A."
    (The parentheses surrounding "*B" are ignored.)  If A is not
    a type name, then this line is a call of the function A with
    the single parameter *B.  This ambiguity cannot be resolved
    grammatically.
        C compilers based on UNIX' YACC parser-generator -- such
    as the Portable C Compiler -- handle this problem by feeding
    information acquired during semantic analysis back to the
    lexer.  In fact, most C compilers do some typedef analysis
    during lexical analysis.


All I have to say, concerning the design of C's syntax, is "Oops".

I also realized that this, combined with that real spiffy feature
of C that identifiers are the same if their first 8 characters are
the same, could really confuse C compilers.  I tried the following
program on the compiler I use:

    typedef int long_type_name;
    
    f(a)
    int *a;
    {
        long_type_of_function_name (*a);

        printf("Bye");
    }

According to H&S, a correct C compiler should say that this is a
redeclaration of "a" (since "long_type_of_function_name" and
"long_type_name" are, uh, the same identifier).  However, the
compiler I use simply eats it up, thinking that the line in
question is a call to some external function (which, since it
wasn't explicitly declared, C graciously assumes returns an
int -- isn't C just so helpful!).  My guess is that when the
lexer checks to see if the function name is really a typedef'd
name, it checks ALL of the characters in both names (i.e. strcmp)
instead of checking just the first 8 (i.e. strncmp).

Of course, since the identifiers really ARE different, it SEEMS
as if the compiler's thinking it's a function call IS correct.
Technically, it's a buggy compiler, though.

Isn't it strange that it seems better for the compiler to be wrong?

Doesn't that make you wonder if something is SERIOUSLY wrong with C?

Personally, I think that the real fault for my "buggy" compiler
lies not with the compiler writer, but in the shoddy language design
that haunts the deep-dark corners of C.  I mean, is there any excuse
for the grammar being context sensitive?  Or, for that matter, for
identifiers having only 8 significant characters?

-- Rich  "Picky-Picky-Picky"  Wagner


P.S.  Forgive me if this piece of C trivia has been already discussed
      (or flamed, as in this case) in net.lang.c -- I just found out
      about it and was amazed.

gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (01/23/86)

>                        A ( *B );
> 
>     If A has been defined as a typedef name, then the line is a
>     declaration of a variable B to be of type "pointer to A."
>     (The parentheses surrounding "*B" are ignored.)  If A is not
>     a type name, then this line is a call of the function A with
>     the single parameter *B.  This ambiguity cannot be resolved
>     grammatically.
>         C compilers based on UNIX' YACC parser-generator -- such
>     as the Portable C Compiler -- handle this problem by feeding
>     information acquired during semantic analysis back to the
>     lexer.  In fact, most C compilers do some typedef analysis
>     during lexical analysis.

Which is just fine.  There are many other cases where the symbol
table must be consulted to determine how to handle a construct
(scope rules, etc.).  Nobody promised that C grammar would be
strictly LALR(1)-parsable.  What is your problem?

> All I have to say, concerning the design of C's syntax, is "Oops".

Unlike the toy languages they teach you about in school, C evolved
as a living language, in direct response to the perceived needs of
its users.  Typedefs were a fairly recent innovation; it is a minor
miracle that they fit into the language as well as they do.

> I also realized that this, combined with that real spiffy feature
> of C that identifiers are the same if their first 8 characters are
> the same, could really confuse C compilers.

That is not a current C language rule.  It was enforced in the
original PDP-11 implementation because on that architecture it
was important to keep data space requirements small.

> Isn't it strange that it seems better for the compiler to be wrong?
> 
> Doesn't that make you wonder if something is SERIOUSLY wrong with C?

It makes me think there is something seriously wrong with the
programmer who relies on automatic 8-character truncation of symbolic
names to make his code work.  Also, for the particular feature you're
griping about to even come into play, your code has to violate
several rules of good style.  In practice, competent C programmers
NEVER write code where the so-called "ambiguity" is an issue.

"If you want Ada, you know where to find it."

ekrell@mit-bugs-bunny.arpa (01/25/86)

> C's grammar is CONTEXT SENSITIVE !?    Can it be ?!

First of all, the term "context sensitive" is incorrect.
C's grammar is context free, since every production in the grammar has
exactly one non-terminal on its left-hand side.  The problem is that the
grammar is ambiguous, i.e., some programs have more than one parse tree.

This is nothing out of the ordinary. Lots of languages have ambiguous
"official" grammars.  Even Ada does.  For instance, is A(1) a reference
to element 1 of an array A, or is it a call to function A with 1 as
an argument?  You can't distinguish between the two syntactically.

Please note that these two examples are not an indication of the language
being ambiguous, only that a particular grammar is ambiguous.
You can easily solve these two ambiguities by changing the grammar.
In the Ada example, you simply don't distinguish in the grammar
between array references and function calls and let the semantic
phase do the required checking.
-- 
    Eduardo Krell               UCLA Computer Science Department
    ekrell@ucla-locus.arpa      ..!{sdcrdcf,ihnp4,trwspp,ucbvax}!ucla-cs!ekrell

jack@boring.uucp (Jack Jansen) (01/25/86)

This 'not LALR' business isn't that interesting, since the
C syntax has a few ambiguities. How about this one:

foo(a,b) int a,b; {
    int *p;

    p = &b;
    a = a/*p;
    return(a);
}

Simple, eh? Just returns a/b in a difficult way.
But, did you notice that the routine isn't finished yet?
END OF COMMENT */+1;
    return(a);
}
So, now it is.

Anyway, all C compilers I've ever tested this on will see the
first /* as comment-open (probably somewhere far in the beginning,
in the lexical analyzer or thereabouts), but still, it's not
really what you'd like to have in a language.....
-- 
	Jack Jansen, jack@mcvax.UUCP
	The shell is my oyster.

farren@well.UUCP (Mike Farren) (01/27/86)

In article <10200035@ada-uts.UUCP>, richw@ada-uts.UUCP writes:
With a little creative deletion: 
> C's grammar is CONTEXT SENSITIVE !?    Can it be ?!
>       To see this, consider this program line
> 
>                        A ( *B );
> 
>     If A has been defined as a typedef name, then the line is a
>     declaration of a variable B to be of type "pointer to A."
>     (The parentheses surrounding "*B" are ignored.)  If A is not
>     a type name, then this line is a call of the function A with
>     the single parameter *B.  This ambiguity cannot be resolved
>     grammatically.
> Doesn't that make you wonder if something is SERIOUSLY wrong with C?
   
      No, it supports my long-held theory that even Kernighan and Ritchie
are only human.  Oops, perhaps, but hardly a fatal flaw.

> Personally, I think that the real fault for my "buggy" compiler
> lies not with the compiler writer, but in the shoddy language design
> that haunts the deep-dark corners of C.  I mean, is there any excuse
> for the grammar being context sensitive?  Or, for that matter, for
> identifiers having only 8 significant characters?
> 
> -- Rich  "Picky-Picky-Picky"  Wagner

	Well, I think you've lived up to your nickname.  One example of
context sensitivity, especially in an area which is designed solely to
make things a little easier on us poor programmers, hardly makes for 
"shoddy language design".  There's probably no excuse, but then, I don't
think that one is needed - too many good programs have been written for
too many years using C for me to think it shoddy.  As to the 8 significant
characters, that's a problem for the compiler designer, not a built-in
limitation, and undoubtedly stems from associated assembler and linker
limitations, not C's own.

-- 
           Mike Farren
           uucp: {your favorite backbone site}!hplabs!well!farren
           Fido: Sci-Fido, Fidonode 125/84, (415)655-0667

stephen@datacube.UUCP (01/27/86)

> C's grammar is CONTEXT SENSITIVE !?    Can it be ?! ...

C is context sensitive in many ways, but so is just about every other
programming language in existence. PASCAL or MODULA-2, for example,
have an ambiguity for:

Statement ::= assign_stmt | procedure_invocation | ...

assign_stmt ::= IDENTIFIER ':=' expression
procedure_invocation ::= IDENTIFIER ...

This also must be resolved by feedback to the lexer.

In a general sense, any programming language where variables are
declared is context-sensitive, since the semantics of later uses
of a variable depend on the type with which it was declared.

Really, you are flaming the lack of power of an LALR(n) grammar. However,
due to considerations of efficiency and convenience, an LALR grammar is
probably one of the best choices for parsing a language (especially if
you have yacc available).

abh6509@ritcv.UUCP (A. Hudson) (01/27/86)

Did you bother running it through lint?????????????

mts@cosivax.UUCP (Michael Stolarchuk) (01/29/86)

> C's grammar is CONTEXT SENSITIVE !?    Can it be ?!
>       To see this, consider this program line
> 
>                        A ( *B );
> 
>     If A has been defined as a typedef name, then the line is a
>     declaration of a variable B to be of type "pointer to A."
>     (The parentheses surrounding "*B" are ignored.)  If A is not
>     a type name, then this line is a call of the function A with
>     the single parameter *B.  This ambiguity cannot be resolved
>     grammatically.
> Doesn't that make you wonder if something is SERIOUSLY wrong with C?
   
First, I think it's great you found this.  It never occurred to me.
If someone had asked me (before your message) whether C were context
sensitive, I would have said no.

Even so, context sensitivity is just an attribute.  The "goodness" of
something has very little to do with "context sensitive grammar".  It's
not too tough to make a language context sensitive; many of the languages
I use have some level of sensitivity, if not in the grammar then
in the semantics.  In many cases the lines between syntax and semantics
are not as clear as you might think.  It is possible to migrate some of
the type checking in some languages from semantics into syntax (sometimes
for convenience), and thereby embed context sensitivity into the grammar.
Alternatively, one could write the C grammar to not be context sensitive
by having a production called "typedecl_paramcall: ...", then perform the
distinction between the two at the semantic level.  The grammar may be
inconvenient (difficult to construct, maintain, and understand), and that
is usually the impetus.

> Personally, I think that the real fault for my "buggy" compiler
> lies not with the compiler writer, but in the shoddy language design
> that haunts the deep-dark corners of C.  ...

Bugs are bugs.  One definition of production quality code in use today
is "one bug per 1000 lines of code".  The compiler I have here has about 
20 thousand lines, so I would expect about 20 bugs to always exist, no
matter what release.  So if a new release appears, the bugs may have shifted.

Secondly, I don't think it is wise to associate the quality of a
compiler, or of any tool, with the function it's supposed to perform,
just as no one complains about how correct an instruction set is.  The
quality is associated with the product, not the function.

> ... I mean, is there any excuse
> for the grammar being context sensitive? ...


> ... Or, for that matter, for
> identifiers having only 8 significant characters? ...

These are both implementation details, left to the project team creating
the product.  The constraints on the project are the motivating force.
I expect that (in the context sensitive case) it was impractical to choose
a non-context-sensitive grammar, for some of the reasons already pointed
out above, or even better ones: the availability of a debugged grammar,
the availability of a base development compiler, etc.  As for the
symbol name length, recall that Bell 5.2 and above,
along with Berkeley 4.0 and above (if I remember right), were the releases
to use variable-length symbol names.  If the release of the system you are
using doesn't have variable-length symbols, then that is one of the
constraints of the projects you are working on.  As an example, it may not
even be wise to move from a truncating compiler to a variable-length one
if lots of code already exists and someone wasn't extremely careful about
name lengths.

Your first observation was a good one.  Many of the other statements
seem to imply some dissatisfaction with the product you are using.
Perhaps it's a good time to think about the work you are doing, and
whether or not you are willing to put up with the situation you are in.

roy@gitpyr.UUCP (Roy Mongiovi) (01/31/86)

Now wait just a minute here.  I'm not sure you really mean "context sensitive".
Any language that has goto's or forces you to declare your variables is context
sensitive.  (A goto isn't legal if the corresponding label isn't defined, and
that assignment statement isn't valid if the variable isn't declared.)  In fact,
all that yucky type checking etc. that's done by the semantic routines is there
because useful languages really are context sensitive.  It seems to me that you
want a word that means "the grammar is ambiguous unless you consider context,"
but that's a bit more specific than context sensitivity....
-- 
Roy J. Mongiovi.	Office of Computing Services.		User Services.
Georgia Institute of Technology.	Atlanta GA  30332.	(404) 894-6163
 ...!{akgua, allegra, amd, hplabs, ihnp4, masscomp, ut-ngp}!gatech!gitpyr!roy

richw@ada-uts.UUCP (01/31/86)

Well, I've calmed down a bit.  That was quite a flame...

Whether you consider these ambiguities "all that bad" seems to
be a matter of personal taste.  I guess I should've kept in mind
that the goals in the design of C were speed and ease of compilation
(it seems), along with access to low-level "stuff".  Being hyper-
critical probably stems from dealing with languages where program
correctness and readability are major concerns...

In any case, the 8 significant character stuff...  (yes, more flames)

>> ...  As to the 8 significant
>> characters, that's a problem for the compiler designer, not a built-in
>> limitation, and undoubtedly stems from associated assembler and linker
>> limitations, not C's own.
>>
>>         Mike Farren

I'm not sure what "standard C" says about this, if there is such a
standard yet, but I simply don't like being told in K&R and H&S that
if I want my programs to be portable, I should ensure that the first
six characters of all of my external id's are different, ignoring case,
so that I can port it to a Honeywell 6000.  Is it really crucial for
the compiler generated code to conform to the limitations of ALL
linkers and loaders?  I've got a linker that only deals with 2 significant
characters -- should I expect all C programmers to conform to it?
Of course not; I should either get a modern linker or worry about
mapping identifiers in the object code MYSELF -- it really isn't
hard.

If anybody can think of ANY reason for limiting the number of significant
characters of non-external identifiers to 8, I'd be honestly interested
in hearing it -- these identifiers don't even exist after compilation!
Programs (in this case, compilers) which must
deal with strings of indefinite length are a little harder to write;
you might have to even use the heap (GASP!).  But, come on, should
you, in your language definition, make such silly concessions simply
to ease the construction of compilers for your language?  An individual
compiler is written ONCE; once your compiler can handle arbitrarily
long identifiers, the people that program in your language NEVER have
to worry about them again in the multitude of programs they will write
and compile using your compiler.

jack@boring.uucp (Jack Jansen) (02/01/86)

In article <7800010@datacube.UUCP> stephen@datacube.UUCP writes:
>
>C is context sensitive in many ways, but so is just about every other
>programming language in existence. PASCAL or MODULA-2, for example,
>have an ambiguity for:
>
>Statement ::= assign_stmt | procedure_invocation | ...
>
>assign_stmt ::= IDENTIFIER ':=' expression
>procedure_invocation ::= IDENTIFIER ...
>
>This also must be resolved by feedback to the lexer.
This is not true. The lexical analyzer doesn't have anything to
do with it, since both things are IDENTIFIERs. The problem in
C is that type-names are reserved words, so, in effect, a 'typedef'
introduces a new reserved word. This makes it necessary to
either tell lex about it (the quick-and-dirty approach, used
by most compilers I know of), or mess up your yacc grammar in
a truly horrible way.

Also, note that, at least in Pascal, where there are *no*
procedure-type variables, you only have to look ahead exactly
*one* token to find out whether it is an assignment or a call.
(Next token is '(' or any statement terminator => call, else assignment.)

In C, you could conceivably have to look ahead an unlimited number
of tokens:

typedef int geheel_getal;

geheel_getal ((((((((((((((((((((a))))))))))))))))))));

This makes the C typedef problem an order of magnitude more
difficult.
-- 
	Jack Jansen, jack@mcvax.UUCP
	The shell is my oyster.

richw@ada-uts.UUCP (02/03/86)

Some clarifications:

Whether the ambiguity pointed out in H&S causes C's grammar to
be context sensitive deserves to be questioned.  My faint recollection
of what it technically means to be context sensitive doesn't seem
to apply (i.e. more than one terminal and/or non-terminal on the
left-hand side of a production?).

However, whether or not H&S is correct in using the term "context-
sensitive", there IS a point to be made!  By simply looking at

    A (*B) ;

the parser CAN generate A (sub-) tree for that statement/declaration.
That is, the parser WILL still only recognize strings in the language
for C (where language is defined in the more formal sense of being
a possibly-infinite set of strings).  HOWEVER, the tree that is built
for this statement/declaration contains almost no semantic information.
Contrast it against the sub-tree built for

    if (<conditional-expression>)
        <statement>
    else
        <statement>

After parsing that, the language processor KNOWS that that is an IF
statement.  However, after parsing, the language processor needs to
look at PREVIOUS information (context, if you like) to figure out if

    A (*B) ;

is a statement (i.e. procedure call) or a variable declaration.
There's a big difference between the two interpretations...

My apologies for probably using the wrong terminology, but remember
that it was quoted from H&S (i.e. how to pawn off blame for a mistake).

-- Rich

henry@utzoo.UUCP (Henry Spencer) (02/05/86)

> ... I simply don't like being told in K&R and H&S that
> if I want my programs to be portable, I should ensure that the first
> six characters of all of my external id's are different, ignoring case,
> so that I can port it to a Honeywell 6000.  Is it really crucial for
> the compiler generated code to conform to the limitations of ALL
> linkers and loaders?

Do you want your programs to be portable or not?  The fewer assumptions
you make about the environment they will run in, the more portable they
will be.  Six-characters-monocase is a rather extreme restriction (although
there are worse linkers on some real systems!), but code that conforms to
it will run on almost anything.  Once you go beyond 8 characters or so of
significance, the number of machines it will run on drops *sharply*.  Do
you care about portability or not?

> ... I should either get a modern linker or worry about
> mapping identifiers in the object code MYSELF -- it really isn't
> hard.

Tell that to an IBM user, who doesn't *want* to write his own linker.
Many people have to live with support software they do not control...
and those who do control such software often are reluctant to invest the
enormous efforts that would be needed to support two incompatible versions.
Whether this is *right* or not, it is a *fact*.  Nor are such problems
trivial to solve, especially when maintenance and updates are considered;
many people who are faced with software containing unportable constructs
like long identifiers eventually give up on it.  Again, whether this is
*right* or not, it is a *fact*:  if your code relies on long identifiers,
there are many potential users who will never be able to run it.

> If anybody can think of ANY reason for limiting the number of significant
> characters of non-external identifiers to 8, I'd be honestly interested
> in hearing it -- these identifiers don't even exist after compilation!

Here the arguments are weaker; object-file formats and linkers don't enter
the picture.  But for the present, the problem remains:  many people have
compilers which observe such limits, and are not free to change.  *This*
problem should eventually go away, since the ANSI drafts all mandate rather
longer minima for internal identifiers.  Meanwhile, again, many potential
users will be unable to run your software if you make it gratuitously
unportable by using identifiers not unique in the first 8.

> ... come on, should
> you, in your language definition, make such silly concessions simply
> to ease the construction of compilers for your language?  ...

A sound point for new languages.  C is not a new language.  Life would be
a lot simpler if C had mandated arbitrary-length identifiers from the start,
but it didn't happen that way.  IT DID NOT HAPPEN THAT WAY.  That is the
situation; whether you like it or not doesn't change it.  You can either
live with it, and make your software portable, or refuse to accept it, and
make your software unportable.  A disagreeable choice, but that's the way
it is.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

rb@ccivax.UUCP (rex ballard) (02/07/86)

In article <10200037@ada-uts.UUCP> richw@ada-uts.UUCP writes:
>
>Well, I've calmed down a bit.  That was quite a flame...
>
>Whether you consider these ambiguities "all that bad" seems to
>be a matter of personal taste.
>
>In any case, the 8 significant character stuff...  (yes, more flames)
>
>If anybody can think of ANY reason for limiting the number of significant
>characters of non-external identifiers to 8, I'd be honestly interested
>in hearing it.

As has been mentioned before, the big problem with >6|8 significant
characters is with the assemblers.  The old RT-11 assemblers only
provided 6 characters.  Some of the Whitesmith flavors are the same,
along with most old micro assemblers like for the 8080, where the
memory available for the symbol table was less than 64K.

The same problem exists with the linkers, librarians, and debuggers.
Debuggers which reference source line labels, such as DBX and SDB,
are free of this limitation, but source is not always available.

Static and automatic labels are often treated as local labels with
L[1-256] or a similar approach.  This makes debugging more difficult,
but eases the symbol table crunch.  These can be made variable length
so long as the programmer understands the difference.  Structure
member names are another candidate for variable-length names.

One popular approach is to use #defines to define full names and use a
good pre-processor to resolve them into their cryptic short names.  Doing
this automatically seems attractive until you get into the problem of
global resolution.  Perhaps incorporating the file name into the label
would help.

As long as there are assemblers that run in 'PDP-11 emulation mode'
and machines with memory restrictions, the restriction will hold for
'portable code'.

One alternative (though less practical) is to compile directly from 'C'
source to link module.  This bypasses the assembler but not the linker.
Another is to go from source to executable; this can lead to a very
large compiler, like SmallTalk, but would make debugging easier.

I have noticed that lint (4.2) will complain when a very long
name is declared and a truncated version is referenced (particularly
with #defines).  Ideally, lint should check for both 'unique prefix'
and 'unique full name' on all labels, issuing '<label> not used',
'<label> undefined', and '<label> multiply defined' if there is a clash
either way.

Unfortunately, it seems that there is little incentive to rewrite
assemblers, linkers, librarians...  Fourth-generation languages,
incremental compilers, and interactive development systems are
following the FORTH tradition of keeping symbol table information
around even after the source is compiled.  Could something like this
be done for C?

I am glad there is interest in making these types of improvements in the
language.  Perhaps by investigating some of the good features of languages
like FORTH or SmallTalk, rather than trying to poke at their weaknesses
(they have many), a better, more powerful 'C' development system
will evolve.  I'd love to see a fully interactive environment that
allows unit testing and gradual integration of actual compiled code.
DBX comes close, but I'm almost positive it could be even better.