[comp.lang.misc] Is C context-free?

gvcormack@watdragon.waterloo.edu (Gordon V. Cormack) (03/25/89)

Is C context-free?


The question is too ill-defined to support the discussion
that has been going on in this medium.  I think the ground rules need
to be established before anything useful can come of it.

Since I have been misquoted as saying "C is not context-free", I will
repeat my original statement that seems to have started this whole
thing.

> Don't blame C++.  Blame C.  
> 
> C is not context-free parsable [actually I believe it is, but not 
> with any sensible grammar].
> -- 
> Gordon V. Cormack     CS Dept, University of Waterloo, Canada N2L 3G1
> gvcormack@waterloo.EDU  gvcormack@uwaterloo.CA  gvcormac@water.BITNET


So what does it mean for something to be context-free?  

A context-free language is a set of strings (a formal language) that
is generated by a context-free grammar.  It has been shown that the
context-free languages are exactly those languages that can be 
recognised by a single-stack nondeterministic automaton.  There is
a well known theorem, known as the "pumping lemma" that can be used
to show whether a given language is context-free, without having to
write a context-free grammar (or prove non-existence of same).

When we say "is C context-free", I suppose we can consider the
set of all ascii strings that are correct C programs to be the
formal language in question.  Under this assumption, C is not 
context free, and neither is any useful programming language.

We qualify what we say to "is the syntax of C context-free?".
Then what is syntax?  If we consider syntax to mean some
context-free language that contains all valid C programs, then C
is context-free, and so are all programming languages.  Here is 
a context-free grammar for all programming languages:

      S -> any_character | S any_character
      
      any_character -> a | b | c | ... | 0 | 1 | ... | , | & | ...

I don't really believe this fits our working definition of "syntax".

We therefore have to agree on what constitutes a syntactically valid
C program, and what constitutes an invalid one.  Then we can start
arguing about whether the set of syntactically valid C programs is
a context-free language or not.  (The original posting to which 
i replied was complaining that a missing type definition should not
be reported as a syntax error.)

In any event, in order to show that such a language is context-free, 
it is necessary to give a CF grammar that generates all strings in the 
language.  The language is a set of ascii strings, not a set of
meta-symbols like Identifier, Type_identifier, etc.  With Pascal
and other langauges that I say have context-free syntax, this 
distinction is moot, because one can trivially write a context-free
grammar to describe each of the meta-symbols, and by substitution,
write a context-free grammar to describe the language.  It happens
that some things, like white space, are a bit tedious to describe,
but it is possible [cf. Salomon and Cormack, "Scannerless parsing
of programming languages, Proc. Sigplan 89].  With C, the "language"
represented by Type_identifier is not only not context-free, it is
not a language because it does not represent a fixed set of strings.
(The set of Type_identifiers changes throughout the program).
And it is my opinion that there are C programs for which we cannot
answer "is this program syntactically valid?" without knowing whether
a particular substring represents a Type_identifier or not.

If you allow meta-symbols with arbitrary definitions, then there is
an absurd context-free grammar that "describes" any programming
language:

   S -> first_valid_character  |  S next_valid_character

where next_valid_character is a meta-symbol that at changes its
definition every time it is used, to match only valid upcoming
input characters.

Furthermore, I said "context-free parsable", not just "context-free".
I won't claim this has a formal meaning, but I certainly meant to
imply it to mean a unique parse in which the parse tree reflects
the precedence, associativity, and nesting of language constructs.
My qualificiation "any sensible grammar" was meant to disqualify
grammars that disregard this.


Conclusions:  
  
Detailed technical arguments are useless without a testable hypothesis. 

We need to agree on a definition of "C syntax" before we can proceed.

Once we have agreed on that we can try to answer the question

   Is C's syntax context-free?
-- 
Gordon V. Cormack     CS Dept, University of Waterloo, Canada N2L 3G1
gvcormack@waterloo.EDU  gvcormack@uwaterloo.CA  gvcormac@water.BITNET