ramsey@NCoast.ORG (Cedric Ramsey) (08/31/90)
/* ** I am posting this to comp.lang.c again, I don't think the mailer did the ** first time */ >> Hello again ! This question is directed towards the 'C' and >>compiler gurus out there. I was studying the grammar for the 'C' >>language and I couldn't help but notice that for declarations the >>grammar is context sensitive. >[...] >>Since the 'typedef-name' is an identifier is impossible to determine that >>it is a type defintion without looking at the context. I guess that one could >>do a pre-scan of the source code and build typedef trees but I thought that >>'C' was context free grammar. > >You've hit on a common problem in parsing C -- the typedef vs. >identifier issue. > >Indeed, C is `context sensitive' in that a typedef name has a very >different syntactic value from that of an ordinary identifier. >Nevertheless, parser generators that handle only simpler languages do >very well with C. How do they resolve this contradiction? They >discard a certain sort of purity and introduce an informal feedback. > >In processing the program, a compiler will be maintaining a symbol >table, and keeping typedef names in it. It is a simple matter to make >the lexer inspect the symbol table when processing an identifier, and >to return a different token type for an identifier that has appeared >in a typedef. In this way, the grammar has different lexeme types for >identifiers and typedefs, and the context sensitivity goes away. > >Every C compiler that I've studied has this feedback mechanism between >the semantic phase and the lexer. It's a familiar solution to the >problem. > >Kevin, KE9TV >until 8/31: kenny@cs.uiuc.edu >after 9/17: ke9tv@nrtc.northrop.com > > When I wrote the last message I was had the notion that typedef names could be used before they are declared, in ansi 'C'. I guess that was a false assumption. Because the following, I think, is illegal: typedef struct vehical { make_t make; style_t style; owner_t owner; } vehical_t; typedef ... owner_t; typedef ... make_t; typedef ... style_t; The compiler must know ahead of time that make_t, style_t and owner_t are type names. That way, the scanner, it could lookup the name in the symbol table and see that it is a typedef name and return TYPEDEF_NAME. I don't have the ansi draft; only K&R2. K&R2 doesn't mention, at least I don't recall reading it, that typedef names must occur before they are used so these points a purely speculative. Also K&R2 doesn't specify if the following is legal: typedef unsigned char uchar_t; uchar_t uchar_t; I would say that this is illegal, even though uchar_t is not a keyword. Why, because ... I don't know, maybe because it would be harder to parse. In lue of the above speculations, the compiler wouldn't have to make multiple passes to collect typedef names. I hope this is true or the grammar at back of K&R2 would not be acceptable to yacc, as claimed, without a rewrite, I could be wrong though. If thus far I am correct, I would go on further to say that, identifiers must be delared before they are used wether that be as declarators or as typedef names. What is the verdit can I safely assume this stuff or should send off for the ansi 'C' standard, at my first financial opportunity. If you guys agree that 'C' is context sensitive then what languages truely are context-free, if any.
volpe@underdog.crd.ge.com (Christopher R Volpe) (08/31/90)
In article <1990Aug30.223440.7377@NCoast.ORG>, ramsey@NCoast.ORG (Cedric Ramsey) writes: |>When I wrote the last message I was had the notion that typedef names |>could be used before they are declared, in ansi 'C'. I guess that was |>a false assumption. Because the following, I think, is illegal: |> |>typedef struct vehical { |> make_t make; |> style_t style; |> owner_t owner; |>} vehical_t; |> |>typedef ... owner_t; |>typedef ... make_t; |>typedef ... style_t; Yeah, it's illegal. The only thing that can be used without an explicit declaration is a function, which would then be implicitly declared as returning int. |> |>The compiler must know ahead of time that make_t, style_t and owner_t |>are type names. That way, the scanner, it could lookup the name in the |>symbol table and see that it is a typedef name and return TYPEDEF_NAME. |>I don't have the ansi draft; only K&R2. K&R2 doesn't mention, at least |>I don't recall reading it, that typedef names must occur before they are |>used so these points a purely speculative. Also K&R2 doesn't specify if |>the following is legal: |> |>typedef unsigned char uchar_t; |>uchar_t uchar_t; |> |>I would say that this is illegal, even though uchar_t is not a keyword. |>Why, because ... I don't know, maybe because it would be harder to parse. It's illegal because typedef names are in the same name space as objects according to section A11.1 of K&R2. Therefore, an identifier cannot be declared as both a typedef name and an object in the same scope. The typedef name can be redeclared in an inner block, however. |> |>In lue of the above speculations, the compiler wouldn't have to make |>multiple passes to collect typedef names. I hope this is true or |>the grammar at back of K&R2 would not be acceptable to yacc, as claimed, |>without a rewrite, I could be wrong though. The grammar is not claimed to be acceptable to yacc as-is. Section A13 says that the grammar is acceptable to yacc if the production "typedef_name: identifier" is removed and typedef_name is made a terminal symbol. |>If you guys agree that 'C' is context sensitive then what languages |>truely are context-free, if any. Let's clarify some terminology here: First, all context free languages are context sensitive and all context free grammars are context sensitive. There is a hierarchy involved here. The concepts are not mutually exclusive, but rather the former is a superset of the latter. When you say "context sensitive", you really mean "non context free". Second, there is a distinction between a grammar being context free and a language being context free. A language is context free if it can be completely described by a context free grammar. The C grammar is context free but it does not fully describe the language. No programming language that requires that any identifiers be declared before they are used is a context free language. Such requirements are almost always enforced semantically rather than syntactically in order to allow use of a context free grammar. (This is done using a symbol table.) Pascal is not a context free language either, for the same reason. ================== Chris Volpe G.E. Corporate R&D volpecr@crd.ge.com
alex@vmars.tuwien.ac.at (Alexander Vrchoticky) (08/31/90)
ramsey@NCoast.ORG (Cedric Ramsey) writes: >If you guys agree that 'C' is context sensitive then what languages >truely are context-free, if any. Uh, almost none, or almost every. It depends on your view. I won't go into the details of formal language theory here, but clarify a few points: - Efficient syntax analysis of programming languages dictates the use of context free grammars for syntax specification. - From a formal language point of view no practical programming language is context-free. (It can't be context free if it allows for an unlimited number of identifiers that can appear three or more times in a program.) The apparent contradiction is usually resolved as follows: - Build a parser based on a context free grammar that generates a `reasonably small' superset of the language and use other techniques (symbol tables) to `filter out' the cases that are illegal in the language but which slip through the parser based on the superset. The phrase `reasonably small' is not really very well-defined. However I think few people would call the following a reasonable grammar for C, even though it generates a superset of C: S --> A S | \epsilon A --> 'a' | 'b' | ... | 'z' (and the uppercase letters, and the digits, and ...) On the other hand most people are very happy with mapping all possible identifiers into one token with the use of a lexical analyzer and using table lookup to check for things like proper declarations. This is done routinely in every compiler. Fewer people are happy with the fact that two different tokens in the C language (identifier and typedef-name) have the same lexical representation and can therefore not be told apart without resorting to table lookup. But that's the way C is defined. And to answer the original question: Most (all?) newer languages have avoided the problem by a somewhat more generous use of keywords. -- Alexander Vrchoticky Technical University Vienna, Dept. for Real-Time Systems Voice: +43/222/58801-8168 Fax: +43/222/569149 e-mail: alex@vmars.tuwien.ac.at or vmars!alex@relay.eu.net (Don't use 'r'!)
meissner@osf.org (Michael Meissner) (08/31/90)
In article <11508@crdgw1.crd.ge.com> volpe@underdog.crd.ge.com (Christopher R Volpe) writes: ... | It's illegal because typedef names are in the same name space as | objects according to section A11.1 of K&R2. Therefore, an identifier | cannot be declared as both a typedef name and an object in the | same scope. The typedef name can be redeclared in an inner block, however. Providing a type is specified in the declaration (this is where the classic quote 'the ice is thin here' in K&R-I is used). | |> | |>In lue of the above speculations, the compiler wouldn't have to make | |>multiple passes to collect typedef names. I hope this is true or | |>the grammar at back of K&R2 would not be acceptable to yacc, as claimed, | |>without a rewrite, I could be wrong though. | | The grammar is not claimed to be acceptable to yacc as-is. Section | A13 says that the grammar is acceptable to yacc if the production | "typedef_name: identifier" is removed and typedef_name is made a | terminal symbol. The classic case that cannot be handled without knowing whether you have an identifier or typedef name is: func( (ident1) * ident2 ); If ident1 is a typedef name, then the argument to func casts the result of dereferencing pointer ident2 to ident1. If ident1 is an identifer, than the argument is the multiplication between ident1 and ident2. -- Michael Meissner email: meissner@osf.org phone: 617-621-8861 Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142 Do apple growers tell their kids money doesn't grow on bushes?
henry@zoo.toronto.edu (Henry Spencer) (08/31/90)
In article <1990Aug30.223440.7377@NCoast.ORG> ramsey@NCoast.ORG (Cedric Ramsey) writes: >I don't have the ansi draft; only K&R2. K&R2 doesn't mention, at least >I don't recall reading it, that typedef names must occur before they are >used so these points a purely speculative... A careful reading of section A11.1 of K&R2 will answer these questions. >...doesn't specify if the following is legal: > >typedef unsigned char uchar_t; >uchar_t uchar_t; > >I would say that this is illegal, even though uchar_t is not a keyword. >Why, because ... I don't know, maybe because it would be harder to parse. Typedefs *are* hard to parse. Unfortunately, this particular abomination is legal... *if* you put a "{" between the two declarations. It is not legal to redeclare uchar_t in the same scope as its original declaration, but redeclaring it inside a block is legal. Parsing this is no fun at all. -- TCP/IP: handling tomorrow's loads today| Henry Spencer at U of Toronto Zoology OSI: handling yesterday's loads someday| henry@zoo.toronto.edu utzoo!henry
john@basho.uucp (John Lacey) (09/02/90)
In article <1990Aug30.223440.7377@NCoast.ORG> of comp.lang.c ramsey@NCoast.ORG (Cedric Ramsey) writes: } If you guys agree that 'C' is context sensitive then what languages } truely are context-free, if any. In article <11508@crdgw1.crd.ge.com> of comp.lang.c volpe@underdog.crd.ge.com (Christopher R Volpe) writes: } Let's clarify some terminology here: First, all context free languages } are context sensitive and all context free grammars are context sensitive. } There is a hierarchy involved here. The concepts are not mutually } exclusive, but rather the former is a superset of the latter. When you } say "context sensitive", you really mean "non context free". This seems hardly a clarification. You say all A are B and all B are A, then claim there is a hierarchy in the next sentence. There *is* a hierarchy. But "context sensitive" is not the same as "non context free". The set of context-sensitive grammars is a superset of the set of context-free grammars. This implies that all context-free grammars are context-sensitive, but not all context-sensitive grammars are context-free. The correct statement, then, is that C (as a complete language) is context-sensitive but not context-free. In brief, Chris said S <= F and F <= S and that C is ~F, when in fact, F < S, and C is in (S - F). (All where F = context-free and S = context-sensitive.) Any further discussion on this topic belongs in comp.theory. Toodles and cheers, John -- John Lacey, E-mail: ...!osu-cis!n8emr!uncle!basho!john (coming soon: john@basho.uucp) Voice: (614) 436--3773, or 487--8570 "What was the name of the dog on Rin-tin-tin?" --Mickey Rivers, ex-Yankee CF
volpe@underdog.crd.ge.com (Christopher R Volpe) (09/04/90)
In article <1990Sep2.012002.7004@basho.uucp>, john@basho.uucp (John Lacey) writes: |>In article <11508@crdgw1.crd.ge.com> of comp.lang.c |> volpe@underdog.crd.ge.com (Christopher R Volpe) writes: |>} Let's clarify some terminology here: First, all context free languages |>} are context sensitive and all context free grammars are context sensitive. |>} There is a hierarchy involved here. The concepts are not mutually |>} exclusive, but rather the former is a superset of the latter. When you |>} say "context sensitive", you really mean "non context free". |> |>This seems hardly a clarification. You say all A are B and all B are A, |>then claim there is a hierarchy in the next sentence. Ok, I made an error in saying "superset" when I should have said "subset". That was a mistake. But, I believe that the rest of what I said is correct. In addition, I never said anything of the form "all A are B and all B are A". Where did I ever say the relationship was commutative? What I *did* say was that the relationship applied to both *grammars* and *languages*. |> |>There *is* a hierarchy. But "context sensitive" is not the same as |>"non context free". The set of context-sensitive grammars is a superset |>of the set of context-free grammars. This implies that all context-free |>grammars are context-sensitive, but not all context-sensitive grammars |>are context-free. That was (almost) my whole point! |>The correct statement, then, is that C (as a complete |>language) is context-sensitive but not context-free. The rest of my point was that even though the *language* is context-sensitive-but-not-context-free, the *grammar* given in K&RII is truly context-free (it doesn't completely define C, though). ================== Chris Volpe G.E. Corporate R&D volpecr@crd.ge.com