richw@ada-uts.UUCP (01/18/86)
C's grammar is CONTEXT SENSITIVE !? Can it be ?!
The following is quoted from page 121 of "C: A Reference Manual" by
Harbison & Steele (which, by the way, beats the pants off of
Kernighan & Ritchie as a reference manual). After the quote,
I've included a small program which just may reveal a minor bug
in your C compiler (it did for mine).
Allowing ordinary identifiers, as opposed to reserved words only,
as type specifiers makes the C grammar context sensitive, and
hence not LALR(1). To see this, consider this program line
A ( *B );
If A has been defined as a typedef name, then the line is a
declaration of a variable B to be of type "pointer to A."
(The parentheses surrounding "*B" are ignored.) If A is not
a type name, then this line is a call of the function A with
the single parameter *B. This ambiguity cannot be resolved
grammatically.
C compilers based on UNIX' YACC parser-generator -- such
as the Portable C Compiler -- handle this problem by feeding
information acquired during semantic analysis back to the
lexer. In fact, most C compilers do some typedef analysis
during lexical analysis.
All I have to say, concerning the design of C's syntax, is "Oops".
I also realized that this, combined with that real spiffy feature
of C whereby identifiers are the same if their first 8 characters are
the same, could really confuse C compilers. I tried
the following program on the compiler I use:
typedef int long_type_name;

f(a)
int *a;
{
    long_type_of_function_name (*a);
    printf("Bye");
}
According to H&S, a correct C compiler should say that this is a
redeclaration of "a" (since "long_type_of_function_name" and
"long_type_name" are, uh, the same identifier). However, the
compiler I use simply eats it up, thinking that the line in
question is a call to some external function (which, since it
wasn't explicitly declared, C graciously assumes it returns an
int -- isn't C just so helpful!). My guess is that when the
lexer checks to see if the function name is really a typedef'd
name, it checks ALL of the characters in both names (i.e. strcmp)
instead of checking just the first 8 (i.e. strncmp).
Of course, since the identifiers really ARE different, it SEEMS
as if the compiler's thinking it's a function call IS correct.
Technically, it's a buggy compiler, though.
Isn't it strange that it seems better for the compiler to be wrong?
Doesn't that make you wonder if something is SERIOUSLY wrong with C?
Personally, I think that the real fault for my "buggy" compiler
lies not with the compiler writer, but in the shoddy language design
that haunts the deep-dark corners of C. I mean, is there any excuse
for the grammar being context sensitive? Or, for that matter, for
identifiers having only 8 significant characters?
-- Rich "Picky-Picky-Picky" Wagner
P.S. Forgive me if this piece of C trivia has been already discussed
(or flamed, as in this case) in net.lang.c -- I just found out
about it and was amazed.
gwyn@brl-tgr.ARPA (Doug Gwyn <gwyn>) (01/23/86)
> A ( *B );
>
> If A has been defined as a typedef name, then the line is a
> declaration of a variable B to be of type "pointer to A."
> (The parentheses surrounding "*B" are ignored.) If A is not
> a type name, then this line is a call of the function A with
> the single parameter *B. This ambiguity cannot be resolved
> grammatically.
> C compilers based on UNIX' YACC parser-generator -- such
> as the Portable C Compiler -- handle this problem by feeding
> information acquired during semantic analysis back to the
> lexer. In fact, most C compilers do some typedef analysis
> during lexical analysis.

Which is just fine. There are many other cases where the symbol
table must be consulted to determine how to handle a construct
(scope rules, etc.). Nobody promised that C grammar would be
strictly LALR(1)-parsable. What is your problem?

> All I have to say, concerning the design of C's syntax, is "Oops".

Unlike the toy languages they teach you about in school, C evolved
as a living language, in direct response to the perceived needs of
its users. Typedefs were a fairly recent innovation; it is a minor
miracle that they fit into the language as well as they do.

> I also realized that this, combined with that real spiffy feature
> of C that identifiers are the same if the first 8 characters are
> the same, could be combined to really confuse C compilers.

That is not a current C language rule. It was enforced in the
original PDP-11 implementation because on that architecture it was
important to keep data space requirements small.

> Isn't it strange that it seems better for the compiler to be wrong?
>
> Doesn't that make you wonder if something is SERIOUSLY wrong with C?

It makes me think there is something seriously wrong with the
programmer who relies on automatic 8-character truncation of symbolic
names to make his code work. Also, for the particular feature you're
griping about to even come into play, your code has to violate
several rules of good style.
In practice, competent C programmers NEVER write code where the so-called "ambiguity" is an issue. "If you want Ada, you know where to find it."
ekrell@mit-bugs-bunny.arpa (01/25/86)
> C's grammar is CONTEXT SENSITIVE !? Can it be ?!
First of all, the term "context sensitive" is incorrect.
C's grammar is context free since every production in the grammar has
exactly one non-terminal on its left-hand side. The problem is that the
grammar is ambiguous, i.e., some programs have more than one parse tree.
This is nothing out of the ordinary. Lots of languages have ambiguous
"official" grammars. Even Ada does. For instance, is A(1) a reference
to element 1 of an array A, or is it a call to function A with 1 as
an argument? You can't distinguish between the two syntactically.
Please note that these two examples are not an indication of the language
being ambiguous, only that a particular grammar is ambiguous.
You can easily solve these two ambiguities by changing the grammar.
In the Ada example, you simply don't distinguish in the grammar
between array references and function calls and let the semantic
phase do the required checking.
--
Eduardo Krell UCLA Computer Science Department
ekrell@ucla-locus.arpa ..!{sdcrdcf,ihnp4,trwspp,ucbvax}!ucla-cs!ekrell
jack@boring.uucp (Jack Jansen) (01/25/86)
This 'not LALR' business isn't that interesting, since the C syntax
has a few ambiguities. How about this one:

foo(a,b)
int a,b;
{
    int *p;
    p = &b;
    a = a/*p;
    return(a);
}

Simple, eh? Just returns a/b in a difficult way. But, did you notice
that the routine isn't finished yet?

END OF COMMENT */+1;
    return(a);
}

So, now it is.

Anyway, all C compilers I've ever tested this on will see the first
/* as comment-open (probably somewhere far in the beginning, in the
lexical analyzer or thereabouts), but still, it's not really what
you'd like to have in a language.....
--
	Jack Jansen, jack@mcvax.UUCP
	The shell is my oyster.
farren@well.UUCP (Mike Farren) (01/27/86)
In article <10200035@ada-uts.UUCP>, richw@ada-uts.UUCP writes,
with a little creative deletion:

> C's grammar is CONTEXT SENSITIVE !? Can it be ?!
> To see this, consider this program line
>
> A ( *B );
>
> If A has been defined as a typedef name, then the line is a
> declaration of a variable B to be of type "pointer to A."
> (The parentheses surrounding "*B" are ignored.) If A is not
> a type name, then this line is a call of the function A with
> the single parameter *B. This ambiguity cannot be resolved
> grammatically.
> Doesn't that make you wonder if something is SERIOUSLY wrong with C?

No, it supports my long-held theory that even Kernighan and Ritchie
are only human. Oops, perhaps, but hardly a fatal flaw.

> Personally, I think that the real fault for my "buggy" compiler
> lies not with the compiler writer, but in the shoddy language design
> that haunts the deep-dark corners of C. I mean, is there any excuse
> for the grammar being context sensitive? Or, for that matter, for
> identifiers having only 8 significant characters?
>
> -- Rich "Picky-Picky-Picky" Wagner

Well, I think you've lived up to your nickname. One example of
context sensitivity, especially in an area which is designed solely
to make things a little easier on us poor programmers, hardly makes
for "shoddy language design". There's probably no excuse, but then, I
don't think that one is needed -- too many good programs have been
written for too many years using C for me to think it shoddy. As to
the 8 significant characters, that's a problem for the compiler
designer, not a built-in limitation, and undoubtedly stems from
associated assembler and linker limitations, not C's own.
--
Mike Farren
uucp: {your favorite backbone site}!hplabs!well!farren
Fido: Sci-Fido, Fidonode 125/84, (415)655-0667
stephen@datacube.UUCP (01/27/86)
> C's grammar is CONTEXT SENSITIVE !? Can it be ?! ...
C is context sensitive in many ways, but so is just about every other
programming language in existence. PASCAL or MODULA-II, for example,
have an ambiguity for:
Statement ::= assign_stmt | procedure_invocation | ...
assign_stmt ::= IDENTIFIER ':=' expression
procedure_invocation ::= IDENTIFIER ...
This also must be resolved by feedback to the lexer.
In a general sense, any programming language where variables are
declared is context-sensitive, since the semantics of later uses
of a variable depend on the type with which it was declared.
Really, you are flaming the lack of power of an LALR(n) grammar. However,
due to considerations of efficiency and convenience, an LALR grammar is
probably one of the best choices for parsing a language (especially if
you have yacc available).
abh6509@ritcv.UUCP (A. Hudson) (01/27/86)
Did you bother running it through lint?????????????
mts@cosivax.UUCP (Michael Stolarchuk) (01/29/86)
> C's grammar is CONTEXT SENSITIVE !? Can it be ?!
> To see this, consider this program line
>
> A ( *B );
>
> If A has been defined as a typedef name, then the line is a
> declaration of a variable B to be of type "pointer to A."
> (The parentheses surrounding "*B" are ignored.) If A is not
> a type name, then this line is a call of the function A with
> the single parameter *B. This ambiguity cannot be resolved
> grammatically.
> Doesn't that make you wonder if something is SERIOUSLY wrong with C?

First, I think it's great you found this. It never occurred to me. If
someone had asked me (before the message) whether C were context
sensitive, I would have said no.

Even so, context sensitivity is just an attribute. The "goodness" of
something has very little to do with a "context sensitive
grammar". It's not too tough to make a language context sensitive;
many of the languages I use have some level of sensitivity, if not in
the grammar then in the semantics. In many cases the lines between
syntax and semantics are not as clear as you might think. It is
possible to migrate some of the type checking in some languages from
semantics into syntax (sometimes for convenience), and embed context
sensitivity into the grammar. Alternatively, one could write the C
grammar to not be context sensitive by having a production called
"typedecl_paramcall: ...", then perform the distinction between the
two at the semantic level. The grammar may be inconvenient (difficult
to construct, maintain, and understand), and that is usually the
impetus.

> Personally, I think that the real fault for my "buggy" compiler
> lies not with the compiler writer, but in the shoddy language design
> that haunts the deep-dark corners of C. ...

Bugs are bugs. One definition of production quality code in use today
is "one bug per 1000 lines of code". The compiler I have here has
about 20 thousand lines, so I would expect about 20 bugs to always
exist, no matter what release. So if a new release appears, the bugs
may have shifted.

Secondly, I don't think it is wise to associate the quality of a
compiler, or even the quality of any tool, with the function it is
supposed to perform, just as no one complains about how correct an
instruction set is. The quality is associated with the product, not
the function.

> ... I mean, is there any excuse
> for the grammar being context sensitive? ...
> ... Or, for that matter, for
> identifiers having only 8 significant characters? ...

These are both implementation details, left to the project team
creating the product. The constraints on the project are the
motivating force. I would expect (in the context sensitive case) that
it was impractical to choose a non context sensitive grammar, for
some of the reasons already pointed out above, or even better ones:
the availability of a debugged grammar, the availability of a base
development compiler, etc.

As for the character symbol length, you probably don't need to be
reminded that Bell 5.2 and above, along with Berkeley 4.0 and above
(if I remember right), were the releases to use variable-length
symbol names. If the release of the system you are using doesn't have
variable-length symbols, then it is one of the constraints of the
projects you are working on. As an example, it may not even be wise
to move from a truncating compiler to a variable-length one if lots
of code already exists and someone wasn't extremely careful about
name lengths.

Your first observation was a good one. Many of the other statements
seem to imply some dissatisfaction with the product you are using.
Perhaps it's a good time to think about the work you are doing, and
whether or not you are willing to put up with the situation you are
in.
roy@gitpyr.UUCP (Roy Mongiovi) (01/31/86)
Now wait just a minute here. I'm not sure you really mean "context
sensitive". Any language that has goto's or forces you to declare
your variables is context sensitive. (A goto isn't legal if the
corresponding label isn't defined, and that assignment statement
isn't valid if the variable isn't declared.) In fact, all that yucky
type checking etc. that's done by the semantic routines is there
because useful languages really are context sensitive.

It seems to me that you want a word that means "the grammar is
ambiguous unless you consider context," but that's a bit more
specific than context sensitivity....
--
Roy J. Mongiovi.  Office of Computing Services.  User Services.
Georgia Institute of Technology.  Atlanta GA 30332.  (404) 894-6163
...!{akgua, allegra, amd, hplabs, ihnp4, masscomp, ut-ngp}!gatech!gitpyr!roy
richw@ada-uts.UUCP (01/31/86)
Well, I've calmed down a bit. That was quite a flame...

Whether you consider these ambiguities "all that bad" seems to be a
matter of personal taste. I guess I should've kept in mind that the
goals in the design of C were speed and ease of compilation (it
seems), along with access to low-level "stuff". Being hypercritical
probably stems from dealing with languages where program correctness
and readability are major concerns...

In any case, the 8 significant character stuff... (yes, more flames)

>> ... As to the 8 significant
>> characters, that's a problem for the compiler designer, not a built-in
>> limitation, and undoubtedly stems from associated assembler and linker
>> limitations, not C's own.
>>
>> Mike Farren

I'm not sure what "standard C" says about this, if there is such a
standard yet, but I simply don't like being told in K&R and H&S that
if I want my programs to be portable, I should ensure that the first
six characters of all of my external id's are different, ignoring
case, so that I can port them to a Honeywell 6000. Is it really
crucial for the compiler generated code to conform to the limitations
of ALL linkers and loaders? I've got a linker that only deals with 2
significant characters -- should I expect all C programmers to
conform to it? Of course not; I should either get a modern linker or
worry about mapping identifiers in the object code MYSELF -- it
really isn't hard.

If anybody can think of ANY reason for limiting the number of
significant characters of non-external identifiers to 8, I'd be
honestly interested in hearing it -- these identifiers don't even
exist after compilation! Programs (in this case, compilers) which
must deal with strings of indefinite length are a little harder to
write; you might even have to use the heap (GASP!). But, come on,
should you, in your language definition, make such silly concessions
simply to ease the construction of compilers for your language? An
individual compiler is written ONCE; once your compiler can handle
arbitrarily long identifiers, the people that program in your
language NEVER have to worry about them again in the multitude of
programs they will write and compile using your compiler.
jack@boring.uucp (Jack Jansen) (02/01/86)
In article <7800010@datacube.UUCP> stephen@datacube.UUCP writes:
>
>C is context sensitive in many ways, but so is just about every other
>programming language in existence. PASCAL or MODULA-II, for example,
>have an ambiguity for:
>
>Statement ::= assign_stmt | procedure_invocation | ...
>
>assign_stmt ::= IDENTIFIER ':=' expression
>procedure_invocation ::= IDENTIFIER ...
>
>This also must be resolved by feedback to the lexer.

This is not true. The lexical analyzer doesn't have anything to do
with it, since both things are IDENTIFIERs. The problem in C is that
type-names are reserved words, so, in effect, a 'typedef' introduces
a new reserved word. This makes it necessary to either tell lex about
it (the quick-and-dirty approach, used by most compilers I know of),
or mess up your yacc grammar in a truly horrible way.

Also, note that, at least in Pascal, where there are *no*
procedure-type variables, you only have to look ahead exactly *one*
token to find out whether it is an assignment or a call. (Next token
is ( or any statement terminator => call, else assignment.) In C, you
could conceivably have to look ahead an unlimited number of tokens:

typedef int geheel_getal;

geheel_getal ((((((((((((((((((((a))))))))))))))))))));

This makes the C typedef problem an order of magnitude more
difficult.
--
	Jack Jansen, jack@mcvax.UUCP
	The shell is my oyster.
richw@ada-uts.UUCP (02/03/86)
Some clarifications:

Whether the ambiguity pointed out in H&S causes C's grammar to be
context sensitive deserves to be questioned. My faint recollection of
what it technically means to be context sensitive doesn't seem to
apply (i.e. more than one terminal and/or non-terminal on the
left-hand side of a production?).

However, whether or not H&S is correct in using the term
"context-sensitive", there IS a point to be made! By simply looking
at

    A (*B) ;

the parser CAN generate a (sub-)tree for that
statement/declaration. That is, the parser WILL still only recognize
strings in the language for C (where language is defined in the more
formal sense of being a possibly-infinite set of strings). HOWEVER,
the tree that is built for this statement/declaration contains almost
no semantic information. Contrast it against the sub-tree built for

    if (<conditional-expression>) <statement> else <statement>

After parsing that, the language processor KNOWS that it is an IF
statement. However, after parsing, the language processor needs to
look at PREVIOUS information (context, if you like) to figure out
whether

    A (*B) ;

is a statement (i.e. procedure call) or a variable
declaration. There's a big difference between the two
interpretations...

My apologies for probably using the wrong terminology, but remember
that it was quoted from H&S (i.e. how to pawn off blame for a
mistake).
-- Rich
henry@utzoo.UUCP (Henry Spencer) (02/05/86)
> ... I simply don't like being told in K&R and H&S that
> if I want my programs to be portable, I should ensure that the first
> six characters of all of my external id's are different, ignoring case,
> so that I can port it to a Honeywell 6000. Is it really crucial for
> the compiler generated code to conform to the limitations of ALL
> linkers and loaders?

Do you want your programs to be portable or not? The fewer
assumptions you make about the environment they will run in, the more
portable they will be. Six-characters-monocase is a rather extreme
restriction (although there are worse linkers on some real systems!),
but code that conforms to it will run on almost anything. Once you go
beyond 8 characters or so of significance, the number of machines it
will run on drops *sharply*. Do you care about portability or not?

> ... I should either get a modern linker or worry about
> mapping identifiers in the object code MYSELF -- it really isn't
> hard.

Tell that to an IBM user, who doesn't *want* to write his own
linker. Many people have to live with support software they do not
control... and those who do control such software often are reluctant
to invest the enormous efforts that would be needed to support two
incompatible versions. Whether this is *right* or not, it is a
*fact*. Nor are such problems trivial to solve, especially when
maintenance and updates are considered; many people who are faced
with software containing unportable constructs like long identifiers
eventually give up on it. Again, whether this is *right* or not, it
is a *fact*: if your code relies on long identifiers, there are many
potential users who will never be able to run it.

> If anybody can think of ANY reason for limiting the number of significant
> characters of non-external identifiers to 8, I'd be honestly interested
> in hearing it -- these identifiers don't even exist after compilation!

Here the arguments are weaker; object-file formats and linkers don't
enter the picture. But for the present, the problem remains: many
people have compilers which observe such limits, and are not free to
change. *This* problem should eventually go away, since the ANSI
drafts all mandate rather longer minima for internal
identifiers. Meanwhile, again, many potential users will be unable to
run your software if you make it gratuitously unportable by using
identifiers not unique in the first 8.

> ... come on, should
> you, in your language definition, make such silly concessions simply
> to ease the construction of compilers for your language? ...

A sound point for new languages. C is not a new language. Life would
be a lot simpler if C had mandated arbitrary-length identifiers from
the start, but it didn't happen that way. IT DID NOT HAPPEN THAT
WAY. That is the situation; whether you like it or not doesn't change
it. You can either live with it, and make your software portable, or
refuse to accept it, and make your software unportable. A
disagreeable choice, but that's the way it is.
--
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry
rb@ccivax.UUCP (rex ballard) (02/07/86)
In article <10200037@ada-uts.UUCP> richw@ada-uts.UUCP writes:
>
>Well, I've calmed down a bit. That was quite a flame...
>
>Whether you consider these ambiguities "all that bad" seems to
>be a matter of personal taste.
>
>In any case, the 8 significant character stuff... (yes, more flames)
>
>If anybody can think of ANY reason for limiting the number of significant
>characters of non-external identifiers to 8, I'd be honestly interested
>in hearing it.

As has been mentioned before, the big problem with 6|8 significant
characters is with the assemblers. The old RT-11 assemblers only
provided 6 characters. Some of the Whitesmiths flavors are the same,
along with most old micro assemblers, like those for the 8080, where
the memory available for the symbol table was less than 64K. The same
problem exists with the linkers, librarians, and debuggers. Debuggers
which reference source line labels, such as DBX and SDB, are free of
this limitation, but source is not always available.

Static and automatic labels are often treated as local labels, with
L[1-256] or a similar approach. This makes debugging more difficult,
but eases the symbol table crunch. These can be made variable length
so long as the programmer understands the difference. Structure
member names are another candidate for variable-length names.

One popular approach is to use #defines to define full names and use
a good pre-processor to resolve them into their cryptic names. Doing
this automatically seems attractive until you get into the problem of
global resolution. Perhaps incorporating the file name into the label
would help.

As long as there are assemblers that run in 'PDP-11 emulation mode'
and machines with memory restrictions, the restriction will hold for
'portable code'.

One alternative (though less practical): compile directly from 'C'
source to link module. This bypasses the assembler but not the
linker. Another is to go from source to executable; this can lead to
a very large compiler, like SmallTalk, but would make debugging
easier.

I have noticed that lint (4.2) will complain when a very long name is
declared and a truncated version is referenced (particularly with
#defines). Ideally, lint should check for both 'unique prefix' and
'unique full-name' on all labels, issuing '<label> not used',
'<label> undefined', and '<label> multiply defined' if there is a
clash either way.

Unfortunately, it seems that there is little incentive to rewrite
assemblers, linkers, librarians... Fourth generation languages,
incremental compilers, and interactive development systems are
following the FORTH tradition of keeping symbol table information
available even after the source is compiled. Could something like
this be done for C?

I am glad there is interest in making these types of improvements in
the language. Perhaps by investigating some of the good features of
languages like FORTH or SmallTalk, rather than trying to poke at
their weaknesses (they have many), a better, more powerful 'C'
development system will evolve. I'd love to see a fully interactive
environment that allows unit testing and gradual integration of
actual compiled code. DBX comes close, but I'm almost positive it
could be even better.