[comp.lang.c] Summary: 'C', is it's grammar context sensitive ?

ramsey@NCoast.ORG (Cedric Ramsey) (08/31/90)

/*
 ** I am posting this to comp.lang.c again, I don't think the mailer did the
 ** first time
 */

>>	Hello again ! This question is directed towards the 'C' and
>>compiler gurus out there. I was studying the grammar for the 'C'
>>language and I couldn't help but notice that for declarations the
>>grammar is context sensitive.
>[...]
>>Since the 'typedef-name' is an identifier is impossible to determine that
>>it is a type defintion without looking at the context. I guess that one could
>>do a pre-scan of the source code and build typedef trees but I thought that
>>'C' was context free grammar.
>
>You've hit on a common problem in parsing C -- the typedef vs.
>identifier issue.
>
>Indeed, C is `context sensitive' in that a typedef name has a very
>different syntactic value from that of an ordinary identifier.
>Nevertheless, parser generators that handle only simpler languages do
>very well with C.  How do they resolve this contradiction?  They
>discard a certain sort of purity and introduce an informal feedback.
>
>In processing the program, a compiler will be maintaining a symbol
>table, and keeping typedef names in it.  It is a simple matter to make
>the lexer inspect the symbol table when processing an identifier, and
>to return a different token type for an identifier that has appeared
>in a typedef.  In this way, the grammar has different lexeme types for
>identifiers and typedefs, and the context sensitivity goes away.
>
>Every C compiler that I've studied has this feedback mechanism between
>the semantic phase and the lexer.  It's a familiar solution to the
>problem.
>
>Kevin, KE9TV
>until 8/31: kenny@cs.uiuc.edu
>after 9/17: ke9tv@nrtc.northrop.com
>
>
When I wrote the last message I was had the notion that typedef names
could be used before they are declared, in ansi 'C'. I guess that was 
a false assumption. Because the following, I think, is illegal:

typedef struct vehical {
  make_t make;
  style_t style;
  owner_t owner;
} vehical_t;

typedef ... owner_t;
typedef ... make_t;
typedef ... style_t;

The compiler must know ahead of time that make_t, style_t and owner_t 
are type names. That way, the scanner, it could lookup the name in the 
symbol table and see that it is a typedef name and return TYPEDEF_NAME.
I don't have the ansi draft; only K&R2. K&R2 doesn't mention, at least
I don't recall reading it, that typedef names must occur before they are
used so these points a purely speculative. Also K&R2 doesn't specify if
the following is legal:

typedef unsigned char uchar_t;
uchar_t uchar_t;

I would say that this is illegal, even though uchar_t is not a keyword.
Why, because ... I don't know, maybe because it would be harder to parse.

In lue of the above speculations, the compiler wouldn't have to make 
multiple passes to collect typedef names. I hope this is true or 
the grammar at back of K&R2 would not be acceptable to yacc, as claimed,
without a rewrite, I could be wrong though.  

If thus far I am correct, I would go on further to say that, identifiers
must be delared before they are used wether that be as declarators
or as typedef names.

What is the verdit can I safely assume this stuff or should send off
for the ansi 'C' standard, at my first financial opportunity.

If you guys agree that 'C' is context sensitive then what languages
truely are context-free, if any.

volpe@underdog.crd.ge.com (Christopher R Volpe) (08/31/90)

In article <1990Aug30.223440.7377@NCoast.ORG>, ramsey@NCoast.ORG (Cedric
Ramsey) writes:
|>When I wrote the last message I was had the notion that typedef names
|>could be used before they are declared, in ansi 'C'. I guess that was 
|>a false assumption. Because the following, I think, is illegal:
|>
|>typedef struct vehical {
|>  make_t make;
|>  style_t style;
|>  owner_t owner;
|>} vehical_t;
|>
|>typedef ... owner_t;
|>typedef ... make_t;
|>typedef ... style_t;

Yeah, it's illegal. The only thing that can be used without an explicit
declaration is a function, which would then be implicitly declared
as returning int. 
|>
|>The compiler must know ahead of time that make_t, style_t and owner_t 
|>are type names. That way, the scanner, it could lookup the name in the 
|>symbol table and see that it is a typedef name and return TYPEDEF_NAME.
|>I don't have the ansi draft; only K&R2. K&R2 doesn't mention, at least
|>I don't recall reading it, that typedef names must occur before they are
|>used so these points a purely speculative. Also K&R2 doesn't specify if
|>the following is legal:
|>
|>typedef unsigned char uchar_t;
|>uchar_t uchar_t;
|>
|>I would say that this is illegal, even though uchar_t is not a keyword.
|>Why, because ... I don't know, maybe because it would be harder to parse.

It's illegal because typedef names are in the same name space as
objects according to section A11.1 of K&R2. Therefore, an identifier
cannot be declared as both a typedef name and an object in the
same scope. The typedef name can be redeclared in an inner block, however.

|>
|>In lue of the above speculations, the compiler wouldn't have to make 
|>multiple passes to collect typedef names. I hope this is true or 
|>the grammar at back of K&R2 would not be acceptable to yacc, as claimed,
|>without a rewrite, I could be wrong though.  

The grammar is not claimed to be acceptable to yacc as-is. Section
A13 says that the grammar is acceptable to yacc if the production
"typedef_name: identifier" is removed and typedef_name is made a 
terminal symbol.

|>If you guys agree that 'C' is context sensitive then what languages
|>truely are context-free, if any. 

Let's clarify some terminology here: First, all context free languages
are context sensitive and all context free grammars are context sensitive.
There is a hierarchy involved here. The concepts are not mutually
exclusive, but rather the former is a superset of the latter. When you
say "context sensitive", you really mean "non context free". Second, there
is a distinction between a grammar being context free and a language
being context free. A language is context free if it can be completely
described by a context free grammar. The C grammar is context free but
it does not fully describe the language. No programming language that
requires that any identifiers be declared before they are used is a 
context free language. Such requirements are almost always enforced
semantically rather than syntactically in order to allow use of a 
context free grammar. (This is done using a symbol table.) Pascal
is not a context free language either, for the same reason.

==================
Chris Volpe
G.E. Corporate R&D
volpecr@crd.ge.com

alex@vmars.tuwien.ac.at (Alexander Vrchoticky) (08/31/90)

ramsey@NCoast.ORG (Cedric Ramsey) writes:

>If you guys agree that 'C' is context sensitive then what languages
>truely are context-free, if any. 

Uh, almost none, or almost every. It depends on your view.

I won't go into the details of formal language theory here,
but clarify a few points:

- Efficient syntax analysis of programming languages dictates the use
  of context free grammars for syntax specification.

- From a formal language point of view no practical programming language
  is context-free. (It can't be context free if it allows for an 
  unlimited number of identifiers that can appear three or more 
  times in a program.)

The apparent contradiction is usually resolved as follows:

- Build a parser based on a context free grammar that generates a 
  `reasonably small' superset of the language and use other techniques 
  (symbol tables) to `filter out' the cases that are illegal in the 
  language but which slip through the parser based on the superset.

The phrase `reasonably small' is not really very well-defined.
However I think few people would call the following a reasonable grammar
for C, even though it generates a superset of C:

S --> A S | \epsilon
A --> 'a' | 'b' | ... | 'z' (and the uppercase letters, and the digits, and ...)

On the other hand most people are very happy with mapping all possible 
identifiers into one token with the use of a lexical analyzer and
using table lookup to check for things like proper declarations.
This is done routinely in every compiler.

Fewer people are happy with the fact that two different tokens in the 
C language (identifier and typedef-name) have the same lexical representation
and can therefore not be told apart without resorting to table lookup.
But that's the way C is defined. 

And to answer the original question: Most (all?) newer languages have 
avoided the problem by a somewhat more generous use of keywords. 

--
Alexander Vrchoticky  Technical University Vienna, Dept. for Real-Time Systems
Voice:  +43/222/58801-8168   Fax: +43/222/569149
e-mail: alex@vmars.tuwien.ac.at or vmars!alex@relay.eu.net  (Don't use 'r'!)

meissner@osf.org (Michael Meissner) (08/31/90)

In article <11508@crdgw1.crd.ge.com> volpe@underdog.crd.ge.com
(Christopher R Volpe) writes:

	...

| It's illegal because typedef names are in the same name space as
| objects according to section A11.1 of K&R2. Therefore, an identifier
| cannot be declared as both a typedef name and an object in the
| same scope. The typedef name can be redeclared in an inner block, however.

Providing a type is specified in the declaration (this is where the
classic quote 'the ice is thin here' in K&R-I is used).

| |>
| |>In lue of the above speculations, the compiler wouldn't have to make 
| |>multiple passes to collect typedef names. I hope this is true or 
| |>the grammar at back of K&R2 would not be acceptable to yacc, as claimed,
| |>without a rewrite, I could be wrong though.  
| 
| The grammar is not claimed to be acceptable to yacc as-is. Section
| A13 says that the grammar is acceptable to yacc if the production
| "typedef_name: identifier" is removed and typedef_name is made a 
| terminal symbol.

The classic case that cannot be handled without knowing whether you
have an identifier or typedef name is:

	func( (ident1) * ident2 );

If ident1 is a typedef name, then the argument to func casts the result of
dereferencing pointer ident2 to ident1.  If ident1 is an identifer,
than the argument is the multiplication between ident1 and ident2.
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Do apple growers tell their kids money doesn't grow on bushes?

henry@zoo.toronto.edu (Henry Spencer) (08/31/90)

In article <1990Aug30.223440.7377@NCoast.ORG> ramsey@NCoast.ORG (Cedric Ramsey) writes:
>I don't have the ansi draft; only K&R2. K&R2 doesn't mention, at least
>I don't recall reading it, that typedef names must occur before they are
>used so these points a purely speculative...

A careful reading of section A11.1 of K&R2 will answer these questions.

>...doesn't specify if the following is legal:
>
>typedef unsigned char uchar_t;
>uchar_t uchar_t;
>
>I would say that this is illegal, even though uchar_t is not a keyword.
>Why, because ... I don't know, maybe because it would be harder to parse.

Typedefs *are* hard to parse.  Unfortunately, this particular abomination
is legal... *if* you put a "{" between the two declarations.  It is not
legal to redeclare uchar_t in the same scope as its original declaration,
but redeclaring it inside a block is legal.  Parsing this is no fun at all.
-- 
TCP/IP: handling tomorrow's loads today| Henry Spencer at U of Toronto Zoology
OSI: handling yesterday's loads someday|  henry@zoo.toronto.edu   utzoo!henry

john@basho.uucp (John Lacey) (09/02/90)

In article <1990Aug30.223440.7377@NCoast.ORG> of comp.lang.c
    ramsey@NCoast.ORG (Cedric Ramsey) writes:
} If you guys agree that 'C' is context sensitive then what languages
} truely are context-free, if any. 

In article <11508@crdgw1.crd.ge.com> of comp.lang.c
    volpe@underdog.crd.ge.com (Christopher R Volpe) writes:
} Let's clarify some terminology here: First, all context free languages
} are context sensitive and all context free grammars are context sensitive.
} There is a hierarchy involved here. The concepts are not mutually
} exclusive, but rather the former is a superset of the latter. When you
} say "context sensitive", you really mean "non context free". 

This seems hardly a clarification.  You say all A are B and all B are A,
then claim there is a hierarchy in the next sentence.

There *is* a hierarchy.  But "context sensitive" is not the same as
"non context free".  The set of context-sensitive grammars is a superset
of the set of context-free grammars.  This implies that all context-free
grammars are context-sensitive, but not all context-sensitive grammars
are context-free.  The correct statement, then, is that C (as a complete
language) is context-sensitive but not context-free.

In brief, Chris said S <= F and F <= S and that C is ~F, when in fact,
F < S, and C is in (S - F). (All where F = context-free and 
S = context-sensitive.)

Any further discussion on this topic belongs in comp.theory.

Toodles and cheers,

John
-- 
John Lacey, 
   E-mail:  ...!osu-cis!n8emr!uncle!basho!john  (coming soon: john@basho.uucp)
   Voice:   (614) 436--3773, or 487--8570
"What was the name of the dog on Rin-tin-tin?"  --Mickey Rivers, ex-Yankee CF

volpe@underdog.crd.ge.com (Christopher R Volpe) (09/04/90)

In article <1990Sep2.012002.7004@basho.uucp>, john@basho.uucp (John
Lacey) writes:
|>In article <11508@crdgw1.crd.ge.com> of comp.lang.c
|>    volpe@underdog.crd.ge.com (Christopher R Volpe) writes:
|>} Let's clarify some terminology here: First, all context free languages
|>} are context sensitive and all context free grammars are context sensitive.
|>} There is a hierarchy involved here. The concepts are not mutually
|>} exclusive, but rather the former is a superset of the latter. When you
|>} say "context sensitive", you really mean "non context free". 
|>
|>This seems hardly a clarification.  You say all A are B and all B are A,
|>then claim there is a hierarchy in the next sentence.

Ok, I made an error in saying "superset" when I should have said 
"subset". That was a mistake. But, I believe that the rest of what
I said is correct. In addition, I never said anything of the form
"all A are B and all B are A". Where did I ever say the relationship
was commutative? What I *did* say was that the relationship applied
to both *grammars* and *languages*. 

|>
|>There *is* a hierarchy.  But "context sensitive" is not the same as
|>"non context free".  The set of context-sensitive grammars is a superset
|>of the set of context-free grammars.  This implies that all context-free
|>grammars are context-sensitive, but not all context-sensitive grammars
|>are context-free. 

That was (almost) my whole point!

|>The correct statement, then, is that C (as a complete
|>language) is context-sensitive but not context-free.

The rest of my point was that even though the *language* is 
context-sensitive-but-not-context-free, the *grammar* given in K&RII
is truly context-free (it doesn't completely define C, though).

==================
Chris Volpe
G.E. Corporate R&D
volpecr@crd.ge.com