[comp.lang.misc] Language Design, or: is C's Grammar Context Free

chris@mimsy.UUCP (Chris Torek) (03/28/89)

Maybe this will work best a bit at a time.

First, our definition:

In article <12606@watdragon.waterloo.edu> gvcormack@watdragon.waterloo.edu
(Gordon V. Cormack) writes:
>A context-free grammar G is a 4-tuple  G = <N,T,P,S> where N is a finite
>set of nonterminal symbols, T (the alphabet) is a finite set of terminal 
>symbols, P is a finite set of production rules of the form  A -> X1 X2 X3 ...
>where A is in N, and X1, X2, X3 are in Union(N,T).  S in N is the Start symbol.

In article <9165@claris.com> hearn@claris.com (Bob Hearn) writes:
>... I think this whole argument is founded on misunderstandings.

It is.

>It all depends on what you mean by "context-free grammar."

Most people mean the definition above.

>If you follow the technical definition, then there is no
>question that C's grammar (as specified in K&R) is context-free.

Aha!  But the grammar in K&R 1st ed. is incorrect.

In article <9861@megaron.arizona.edu> gudeman@arizona.edu (David Gudeman)
writes:
>One inaccuracy is in the production
>	typedef-name : identifier
>I don't know if there are others.

(I recall a correction or two elsewhere, but my K&R is in my office and
I am not.)

>So both claims made in this argument are correct:
>(1) the grammar of C is not context free
>and
>(2) the grammar given in K&R is context free

Returning to <9165@claris.com> (Bob Hearn):
>... As for Cormack's claim that C does not have a finite set of terminals,

(this refers to the line `The K&R grammar does not have a finite alphabet.' 
in <12606@watdragon.waterloo.edu>)

>I can't say that I really understand what he means.  Perhaps he is 
>referring to the fact that there is an infinite number of identifiers?

Rather, that there are an unbounded (not infinite, although it comes
out the same) number of type declarators, via the `typedef' mechanism.

>Well, if we restrict ourselves to K&R, then the terminal in question is
>simply (identifier), and it does not decompose further.

(He then suggests splitting it into letters, etc., a simple mechanical
task.)

Unfortunately, the terminal in question is the `identifier' in
`typedef-name : identifier'.  This is, as has been noted, inaccurate.
The key is that only those identifiers which (a) have been previously
defined via `typedef' and (b) have not been hidden by new local
declarations are in fact typedef names.  This information can only be
found by context (oops, there is that word again), and if the size of
the input string (C code) is unbounded, there is no way to capture this
in a CFG (using our definition above).  If we limit the length of the
language to be recognised---to any fixed number of symbols, as long as
it is fixed---we can construct a CFG for this subset of C.

Fortunately, none of this really matters.

As someone else pointed out (in an article I forgot to save), we
generally do not write grammars that accept only strictly correct
inputs.  Instead, we take some shortcuts and patch things up via
semantics.  This approach works quite well for C, although a number of
compilers do it wrong (including PCC) and patch the grammar by having
the lexer return `type' instead of `identifier' when the identifier has
been defined via typedef.  In particular, this prevents

	typedef int temperature;
	f() { auto double temperature; ... }

from compiling correctly.  (PCC does handle

	f(temperature) double temp; { ... }

correctly, by turning off typedef name recognition in the lexer during
argument parsing.  It could do the same after type keywords in local
variable declarations, and I am not sure why it does not---it may have
done a lookahead already, perhaps.  But that is not the only place
PCC has been seen to fall short....)

>someone, a long time ago, made some comment on how screwed up C's grammar
>was, and that it wasn't context-free.  Well, as I don't think C's grammar
>is particularly screwed up relative to other languages' grammars, I
>responded. ...

Strictly speaking (as long as you stick with `typedef produces a new
type specifier'), C's grammar is not context free; but it is indeed
not seriously `screwed up'.  Perfect generators are not necessary, and
a language `close enough to C' exists that can be parsed simply.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

chris@mimsy.UUCP (Chris Torek) (03/28/89)

Incidentally, now that I have said all that (some might say `too much'
:-) ), the only real difference between a C grammar that accepts
undeclared typedef names and a Pascal grammar that accepts undeclared
variable names is that, for some reason, people like to think of type
declarators as keywords and variables as identifiers.  At least, this
is the only explanation I can see that we never argue whether Pascal's
grammar is context free.  (If you require the parser to catch undeclared
variables, it is not, just as C's is not if you require the parser to
catch undeclared typedef names.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

chris@mimsy.UUCP (Chris Torek) (03/28/89)

Oops, in amongst all that verbiage, I forgot to answer the other
obvious question:  Why, if there is a reasonable CFG that accepts
all correct C programs (and some incorrect ones as well, that can
be rejected by the semantics instead), does PCC use something else?

Our `impure' CFG includes (among others) the following productions
(rearranged and simplified to make the problem obvious):

	statement : '{' optional-declarators statement-list '}'
	statement : expression
	declarator : typedef-name '*' identifier	// i.e., a pointer
	typedef-name : identifier
	expression : identifier '*' identifier		// i.e., multiply

Thus, there are two correct parses for the input string

	{ t * x; }

One declares the variable `x' as a pointer to the typedef'd type `t';
the other multiplies the value of `t' by the value of `x'.

It seems our impure C CFG is ambiguous.

Well, we never said otherwise....

As a practical matter, we want the compiler to choose the declarator
parse if and only if `t' is in fact an active typedef name.  Fortunately,
there are a number of ways to gimmick a parser to do this.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

akwright@watdragon.waterloo.edu (Andrew K. Wright) (03/28/89)

In article <16559@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>Incidentally, now that I have said all that (some might say `too much'
>:-) ), the only real difference between a C grammar that accepts
>undeclared typedef names and a Pascal grammar that accepts undeclared
>variable names is that, for some reason, people like to think of type
>declarators as keywords and variables as identifiers.  At least, this
>is the only explanation I can see that we never argue whether Pascal's
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
>grammar is context free.  (If you require the parser to catch undeclared
>variables, it is not, just as C's is not if you require the parser to
>catch undeclared typedef names.)

The difference I see is that a C grammar that accepts undeclared typedef
names destroys more of the structure of the grammar than a Pascal
grammar that accepts undeclared variable names.  Consider the following
C expression:

	x = (a)(b);

If "a" is a typedef name, then the right side is a cast.  If "a" is
not a typedef name, then the right side is a function call.
I dont believe similar situations exist for Pascal (corrections welcome).
As G.V. Cormack notes, any language can be parsed by some context free
grammar; we are interested in grammars which represent the semantic
structure of the language as closely as possible.

About "C's grammar [not] being screwed up":  with a few simple kludges
such as recognizing typedef names in the lex, it is fairly
easy to build a grammar which has sensible structure for C.
Or is it?  Some points were raised about where PCC does/doesnt turn off
typedef recognition, and this seems ill-defined.
(Anybody know what the ANSI draft says?)
If having parts of your language ill-defined does not bother you,
consider trying to build a suffix or reverse parser for C.  Such
a parser might be used in a syntax directed editor, or for syntax
error recovery.  Such parsers cannot preserve the structure of
a C grammar because they do not have available the information required
to tell if an identifier is a typedef or not.

Andrew K. Wright      akwright@watmath.waterloo.edu
CS Dept., University of Waterloo, Ont., Canada.

chris@mimsy.UUCP (Chris Torek) (03/29/89)

>In article <16559@mimsy.UUCP> I included the throwaway% line:
>>... At least, this is the only explanation I can see that we never
>			    ^^^^^^^^^^^^^^^^^^^^^^^^^^
>>argue whether Pascal's grammar is context free.
-----
% cf. Webster: `throw.away \'thro--*-.wa-\ n : a handbill or circular
  distributed free'; applied to writing (especially comic writing), it
  refers to a line with similar worth.
-----

In article <12778@watdragon.waterloo.edu> akwright@watdragon.waterloo.edu
(Andrew K. Wright) writes:
>The difference I see is that a C grammar that accepts undeclared typedef
>names destroys more of the structure of the grammar than a Pascal
>grammar that accepts undeclared variable names.  Consider the following
>C expression:
>
>	x = (a)(b);
>
>If "a" is a typedef name, then the right side is a cast.  If "a" is
>not a typedef name, then the right side is a function call.
>I dont believe similar situations exist for Pascal (corrections welcome).

I am not sure what you mean by `destroys more of the structure of the
grammar'.  It is ambiguous, and in particular it is an ambiguity of an
uncomfortable sort, quite unlike the shift/reduce ambiguity for if/then/else
which everyone resolves by taking the shift.  Yacc permits this reduce/
reduce conflict, but resolves it by taking the first grammar rule; we
want the parser to take the *appropriate* grammar rule, which is sometimes
the first and sometimes not.  So yacc unaided will not produce a
deterministic parser that does what we want.

BUT all this applies to constructing a deterministic LALR(1) parser,
which is not at all the same thing as having a context free grammar!
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

kwalker@arizona.edu (Kenneth Walker) (03/29/89)

In article <12778@watdragon.waterloo.edu>, akwright@watdragon.waterloo.edu (Andrew K. Wright) writes:
> About "C's grammar [not] being screwed up":  with a few simple kludges
> such as recognizing typedef names in the lex, it is fairly
> easy to build a grammar which has sensible structure for C.
> Or is it?  Some points were raised about where PCC does/doesnt turn off
> typedef recognition, and this seems ill-defined.
> (Anybody know what the ANSI draft says?)

(These comments are based on the July 1988 draft.) The standard allows
typedef names to be redeclared within an inner scope:

   typedef int x;
   {
   char *x;
   }

The only restriction is that the type must be explicily given. That is
you cannot redeclare it like:

   static x;

One solution is to define the identifier in the declarator to
be any identifier:

any_ident: Identifier | Typedef_name;

rather than turn off typedef recognition in the lexical analyzer. This is
also needed for labels.  However, this simple change to the grammar
presented in the standard does not result in an LALR1 grammar. I found
the transformations required to get a grammar that yacc would accept
quite difficult to find and the result was not very elegant. So my
answer to the question "Or is it?" is NO!

   Ken Walker / Computer Science Dept / Univ of Arizona / Tucson, AZ 85721
   +1 602 621 2858  kwalker@Arizona.EDU   {uunet|allegra|noao}!arizona!kwalker

mnc@m10ux.ATT.COM (Michael Condict) (04/04/89)

In article <16559@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> Incidentally, now that I have said all that (some might say `too much'
> :-) ), the only real difference between a C grammar that accepts
> undeclared typedef names and a Pascal grammar that accepts undeclared
> variable names is that, for some reason, people like to think of type
> declarators as keywords and variables as identifiers.  At least, this
> is the only explanation I can see that we never argue whether Pascal's
> grammar is context free.  (If you require the parser to catch undeclared
> variables, it is not, just as C's is not if you require the parser to
> catch undeclared typedef names.)
> -- 
> In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
> Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

No, I think there is a real difference in "context-freeness" between
Pascal and C.  In C compilers that are widely considered to conform to
K&R, the syntax accepted by the parser is actually modified when a typedef
is encountered and, furthermore, there is no better way to handle the
typedefs.  To elaborate:

In Pascal, there is only one way to interpret a given statement or declaration,
regardless of any previous declarations.

Consider the following in C, however:

	. . . { T (x); . . . }

How should the construct be parsed?  Is it the declaration of
a variable x of type T (where T is a previously typedef'd identifier)?
Or is it a call to the function T with argument x?

We can't even distinguish between an executable statement and a declaration
in C without resorting to context.  This makes C fundamentally
less context-free than Pascal (or FORTRAN or Ada, for that matter).  The
difference can be put in practical terms by noting that it is impossible to
write a processor for C that accepts exactly those programs that would not
be considered a "syntax error", unless you put a symbol table into the
program, to remember the context of typedefs.  No such requirement holds for
Pascal.  For example, a program that counts executable statements in C
programs must have a symbol table for typedefs, with all the extra complexity
that implies.  The corresponding program for counting statements in Pascal
is much simpler.  The same would hold for a cross-reference program and any
number of other program analyzers.

To summarize, the only way you can parse C with a context-free parser is if
you don't care whether a declaration is confused with an executable statement.
This utterly destroys the usefullness of such a parse, as far as I'm concerned.
-- 
Michael Condict		{att|allegra}!m10ux!mnc
AT&T Bell Labs		(201)582-5911    MH 3B-416
Murray Hill, NJ

mmengel@cuuxb.ATT.COM (Marc W. Mengel) (04/04/89)

In article <913@m10ux.ATT.COM> mnc@m10ux.ATT.COM (Michael Condict) writes:
>In Pascal, there is only one way to interpret a given statement or declaration,
>regardless of any previous declarations.

>Consider the following in C, however:

>	. . . { T (x); . . . }

>How should the construct be parsed?  Is it the declaration of
>a variable x of type T (where T is a previously typedef'd identifier)?
>Or is it a call to the function T with argument x?
...
>To summarize, the only way you can parse C with a context-free parser is if
>you don't care whether a declaration is confused with an executable statement.
>This utterly destroys the usefullness of such a parse, as far as I'm concerned.

That may destroy the usefulness in your mind, but that does not make it
non-context-free.  To make a language recognizer (whose sole purpose
in life is to say "yes" this is a valid program, or "no" it is not)
you only need say yes or no for the whole program, not say what piece
is which.  C may very well be not context free, or not easily context
free, but this is based on things like making sure variables are declared,
the right type, etc. (i.e. only allow p->q->w if q is a pointer member
of a structure type p is a pointer to, and the type q points to has a
member w...)

The suggestion you make, of a rule like:

	decl_or_fcall	: identifier '(' identifier ')' ';' ;

to spot the case you mention works well. In fact it is a useful tool
for doing things like allowing forward-referenced typedefs.

>Michael Condict		{att|allegra}!m10ux!mnc
-- 
 Marc Mengel					mmengel@cuuxb.att.com
 						attmail!mmengel
 						...!{lll-crg|att}!cuuxb!mmengel

dhesi@bsu-cs.UUCP (Rahul Dhesi) (04/05/89)

In article <913@m10ux.ATT.COM> mnc@m10ux.ATT.COM (Michael Condict) writes:
>Consider the following in C, however:
>
>	. . . { T (x); . . . }
>
>How should the construct be parsed?

The parsing is always the same, from the point of view of a
context-free grammer.  It's the semantics that are different.  That you
assign a different meaning to the language construct based on what has
come before has nothing to do with its context-free grammar.

Context-free grammars don't care about the declaration of type names.
If they could, they wouldn't be context-free.

The confusion here arises because real languages like Pascal and C
simply *cannot* be completely defined using context-free grammars.  But
they are--because the definition we use is incomplete, and we fill in
the missing parts with a set of additional rules that are not part of
the context-free grammar.  The size of this set of rules may or may not
be greater for C than for Pascal.
-- 
Rahul Dhesi         UUCP:  <backbones>!{iuvax,pur-ee}!bsu-cs!dhesi
                    ARPA:  dhesi@bsu-cs.bsu.edu

peter@ficc.uu.net (Peter da Silva) (04/06/89)

If 'c's grammar is context free, and the context sensitivity is in the
semantics, tell me if the following is a syntax error:

	x y z;
-- 
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.

Business: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180.
Personal: ...!texbell!sugar!peter, peter@sugar.hackercorp.com.

dhesi@bsu-cs.bsu.edu (Rahul Dhesi) (04/07/89)

In article <3718@ficc.uu.net> peter@ficc.uu.net (Peter da Silva) writes:
>If 'c's grammar is context free, and the context sensitivity is in the
>semantics, tell me if the following is a syntax error...

It depends on which context-free grammar you are using.  Here's a
simple one:

program ::= char charlist
charlist ::= .null. | charlist char
char ::= 'a' | 'b' | ... '0' | '1' | ... '#' | '(' | .... blank | tab ... etc.

(.null. means empty string.)

My lexical analyzer always returns a single character or EOF.  My
parser is pretty simple.  There is only one possible error during
parsing:  "premature EOF".

Semantic analysis is a nightmare :-).
-- 
Rahul Dhesi         UUCP:  <backbones>!{iuvax,pur-ee}!bsu-cs!dhesi
                    ARPA:  dhesi@bsu-cs.bsu.edu

mnc@m10ux.ATT.COM (Michael Condict) (04/07/89)

In article <6541@bsu-cs.UUCP>, dhesi@bsu-cs.UUCP (Rahul Dhesi) writes:
> In article <913@m10ux.ATT.COM> mnc@m10ux.ATT.COM (Michael Condict) writes:
> >Consider the following in C, however:
> >
> >	. . . { T (x); . . . }
> >
> >How should the construct be parsed?
> 
> The parsing is always the same, from the point of view of a
> context-free grammer.  It's the semantics that are different.  That you
> assign a different meaning to the language construct based on what has
> come before has nothing to do with its context-free grammar.
>   . . .
> Rahul Dhesi         UUCP:  <backbones>!{iuvax,pur-ee}!bsu-cs!dhesi
>                     ARPA:  dhesi@bsu-cs.bsu.edu

Everyone seems to have missed my point on this, because I used a bad example.
Consider instead the one that someone else posted:

	{ T * x; . . . }

In this case (which could be the declaration of a var x of type (T *) or
could be the expression "T multiplied by x"), the parsing most certainly should
not be the same for either case, at least not if you intend to do anything
useful with the parse.  One of the following two different parse trees
should be produced:

		  *                   decl
		/  \                 /    \
	       T    x		    T      *
					   |
					   x

What you are ignoring is that very few parsers stand alone by themselves.  They
are used in a program that wants to do something with the language.  The
fundamental difference between C and other languages is that because of the
above bogosity, there is much less you can do with the language without
including a symbol table for typedefs.  I can't believe this is even open
for argument.  Try writing programs for C and Pascal that delete all executable
statements, printing only the declarations.  The C version will have to have
a symbol table -- the Pascal version will not.  Isn't that perfectly clear?

Of course we all know that no useful programming language, including C,
Pascal, FORTRAN, Ada or LISP, is context-free when we consider the exact
set of programs that are accepted by the compiler, but that was not my point.
-- 
Michael Condict		{att|allegra}!m10ux!mnc
AT&T Bell Labs		(201)582-5911    MH 3B-416
Murray Hill, NJ

chris@mimsy.UUCP (Chris Torek) (04/09/89)

In article <920@m10ux.ATT.COM> mnc@m10ux.ATT.COM (Michael Condict) writes:
>What you are ignoring is that very few parsers stand alone by themselves.

We are not ignoring this.  The topic was `context free grammars', not
`compilers and parsers'.  If you want to argue that one cannot build
a useful compiler using a CFG approximation for C, that is fine (since
we have seen that this is true).  Just do not do it by saying that
there is no context free grammar that recognises all valid C programs.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

gudeman@arizona.edu (David Gudeman) (04/09/89)

In article  <16807@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
]In article <920@m10ux.ATT.COM> mnc@m10ux.ATT.COM (Michael Condict) writes:
]>What you are ignoring is that very few parsers stand alone by themselves.
]
]We are not ignoring this.  The topic was `context free grammars', not
]`compilers and parsers'.  If you want to argue that one cannot build
]a useful compiler using a CFG approximation for C, that is fine (since
]we have seen that this is true).  Just do not do it by saying that
]there is no context free grammar that recognises all valid C programs.

There is no context free grammar that recognises all valid C programs.
Several people have pointed out that your artificial definition of
syntax (anything that can be described with a context free grammar)
begs the issue.  By my definition of syntax, and aparently that of
most other posters, C's syntax is not context free.

kwalker@arizona.edu (Kenneth Walker) (04/10/89)

In article <10148@megaron.arizona.edu>, gudeman@arizona.edu (David Gudeman) writes:
> There is no context free grammar that recognises all valid C programs.
> Several people have pointed out that your artificial definition of
> syntax (anything that can be described with a context free grammar)
> begs the issue.  By my definition of syntax, and aparently that of
> most other posters, C's syntax is not context free.

"Context free grammar" and "recognize" have precise technical definitions;
look them up. By those definitions, there are context free grammars which
will recongnize any valid C program. The fact that the grammar may
not produce a parse which matches your definition of the syntax of the
program does not invalidate the truth of that statement. As you point
out, one should not confuse grammar and syntax.

The statement you want is: there is no unambiguous context free grammar
which will generate a parse which matches your syntax for all C programs.
That may be rather long winded, but at least it is accurate.

As far as I know, "context free syntax" does not have a widely recognized
meaning (can anyone come up with a reference to disprove this?). Clearly
you can define it in such a way that it would be correct to say that
C does not have a context free syntax. However, people will confuse the term
with "context free grammar", so I wouldn't suggest using it. It is better
to make a long winded explaination of the problem with C's syntax.

   Ken Walker / Computer Science Dept / Univ of Arizona / Tucson, AZ 85721
   +1 602 621 2858  kwalker@Arizona.EDU   {uunet|allegra|noao}!arizona!kwalker

dhesi@bsu-cs.bsu.edu (Rahul Dhesi) (04/10/89)

In article <10148@megaron.arizona.edu> gudeman@arizona.edu (David Gudeman)
writes:
>There is no context free grammar that recognises all valid C programs.

Well...there is no context free grammar that recognizes all valid C
programs and *no others*.

Not surprising, since there is no context-free grammar that recognizes
exactly all valid programs in *any* real programming language in common
use.

It's tempting to wonder if BASIC and FORTRAN might be exceptions, since
they do not require variables to be declared before use.
Unfortunately, if you say "GOTO 10000" in either, there had better be a
label or line number called "10000" somewhere, and no context-free
grammar can guarantee that.  TECO (the text editor, whose programs are
indistinguishable from line noise by common mortals) may be another
candidate, but it too wants jump labels to exist.

Try one of the functional programming languages, in their purest form
(no jump labels, no variables).  They may come close.
-- 
Rahul Dhesi <dhesi@bsu-cs.bsu.edu>
UUCP:    ...!{iuvax,pur-ee}!bsu-cs!dhesi

diamond@diamond.csl.sony.junet (Norman Diamond) (04/10/89)

In article <920@m10ux.ATT.COM> mnc@m10ux.ATT.COM (Michael Condict) writes:
>Consider instead the one that someone else posted:
>
>	{ T * x; . . . }
>
>...  One of the following two different parse trees
>should be produced:
>
>		  *                   decl
>		/  \                 /    \
>	       T    x		    T      *
>					   |
>					   x

Those are not parse trees.  Parse trees would resemble the following
ones.  Actually K&R's grammar would produce more complicated trees
than these, but a more complicated grammar could produce these trees.

		expr                 decl
               / | \                / | \
	      T  *  x              T  *  x

And it really isn't difficult to unify these cases to allow an
unambiguous grammar, leaving the resolution to the semantic stage.

		expr_or_decl
		/    |    \
	       T     *     x

Of course, C is ugly because of the difficulty human readers have
in reading such things, but compilers (C compilers anyway) have no
sense of aesthetics.
Norman Diamond, Sony Computer Science Lab (diamond%csl.sony.jp@relay.cs.net)
  The above opinions are my own.   |  Why are programmers criticized for
  If they're also your opinions,   |  re-inventing the wheel, when car
  you're infringing my copyright.  |  manufacturers are praised for it?

mnc@m10ux.ATT.COM (Michael Condict) (04/11/89)

In article <16807@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> In article <920@m10ux.ATT.COM> mnc@m10ux.ATT.COM (Michael Condict) writes:
> >What you are ignoring is that very few parsers stand alone by themselves.
> 
> We are not ignoring this.  The topic was `context free grammars', not
> `compilers and parsers'.  If you want to argue that one cannot build
> a useful compiler using a CFG approximation for C, that is fine (since
> we have seen that this is true).  Just do not do it by saying that
> there is no context free grammar that recognises all valid C programs.
> -- 
> In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
> Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

I never said "there is no context free grammar that recognises all valid C
programs", because this is a vacuous statement, without further explanation.
What do you mean by "all"?  Do you mean "exactly all" or "all plus more"?
What do you mean by valid?  Do you mean the programs that have no "syntax
errors"?  What's a syntax error?  Avoid being circular when you reply.  Or
do you mean the ones that pass through compilers that conform to the ANSI
(proposed) standard without any error messages?  Or the ones that not only
make it through the compiler, but execute in a valid fashion (e.g., do not
attempt to divide by 0)?

If you mean the language consisting of all and only those C programs that meet
the last criterion, clearly such a language is not only not context-free,
but is not even computable.  If you go to the other extreme (as you seem to be
doing) and allow the grammar to accept more than the set of syntactically
valid C programs, then of course such a language is context-free, as is 
the language consisting of an arbitrary sequence of tokens from the
vocabulary of C tokens.

But so what?  I attempted to inject parsers and language-analysis programs
into the discussion because otherwise the discussion has no point.  If you
don't allow me to talk about what sort of use I might want to make of the
parse information, then I can't say whether I think you have a grammar for
a useful superset of the C language.  If the grammar is context-free and
accepts all correctly executing, legal C programs, then it must also accept a
superset of this set, so the crucial issue is: what superset?  E.g., what
errors are you failing to reject and which distinct constructs are you failing
to distinguish between (such as declarations vs. executable statements, in
"T (x);").
-- 
Michael Condict		{att|allegra}!m10ux!mnc
AT&T Bell Labs		(201)582-5911    MH 3B-416
Murray Hill, NJ

gudeman@arizona.edu (David Gudeman) (04/12/89)

In article  <10152@megaron.arizona.edu> kwalker@arizona.edu (Kenneth Walker) writes:
>In article <10148@megaron.arizona.edu>, gudeman@arizona.edu (David Gudeman) writes:
>> There is no context free grammar that recognises all valid C programs.
>> ...  By my definition of syntax, ... C's syntax is not context free.
>
>"Context free grammar" and "recognize" have precise technical definitions;
>look them up. By those definitions, there are context free grammars which
>will recongnize any valid C program.

Ken, did you notice who you were responding to?  I think you know that
I could give the technical definitions of "context free grammar" and
"recognize" without looking them up.

>...The statement you want is: there is no unambiguous context free grammar
>which will generate a parse which matches your syntax for all C programs.
>That may be rather long winded, but at least it is accurate.

No, that's not the statement I want.  I think the argument revolves
around one's definition of "C's syntax".  Let CF-C be the context free
language recognized by the C grammar in K&R.  Here is the argument you
seem to be making:

(1)   the syntax of C is CF-C
(2)   CF-C is context free
(3)   therefore, the syntax of C is context free.

Here is the argument _I_ am making:

(1)  the syntax of C is a proper subset of CF-C, called CS-C.
(2)  there is no context free grammar which recognizes exactly the
     language CS-C.
(3)  therefore, the syntax of C is not context free.

Both arguments are sound, and the point of disagreement is statement
(1).  Of course I can't claim that your definition of C's syntax is
_wrong_, but I claim that it is not useful.

I think the most useful definition of the syntax of a programming
language is:

   Definition: the set of strings that do not produce syntax errors
   when compiled by a correct compiler for the language.

This is not unambigous since different correct compilers may produce
slightly different sets of syntax errors.  But it hard to imagine a C
compiler that does not give a syntax error for the declaration:

  newtype x;

where "newtype" has not been declared with a typedef.

aj-mberg@dasys1.UUCP (Micha Berger) (04/12/89)

> Consider the following in C, however:

>	. . . { T (x); . . . }
 
> How should the construct be parsed?

It would have to be an expression, since anything within brackets ain't a
decleration. This isn't Pascal kiddees, all functions are global.

> Consider instead the one that someone else posted:
> 
> 	{ T * x; . . . }

This just shows that your grammar is poor. Obviously it has to be the
interpreter's job to distinguish between decleration and expression.

I don't see any qualitative difference between typedef's and variable
decleration. An identifier is an identifier. Are you going to tell me
that Pascal isn't CF because A*B could have diffent meanings, because
multiplying integers creates different code than multiplynig reals?

Another thing. No language is a CFG. A language just isn't a grammar.
In order to prove that C isn't A CFL, you'ld have to prove that there
doesn't exist a CFG that only accepts valid C programs. A grammar
doesn't care what you do with each statement, so who cares how it would
be parsed? You'll just get a very complicated interpreter, since very
little could be said about statement types in advance.
-- 
					Micha Berger				     
Disclaimer: All opinions expressed here are my own. The spelling, noone's.
email: ...!cmcl2!phri!dasys1!aj-mberg	       Aspaklaria Publications
  vox: (718) 380-7572			       73-32 173 St, Hillcrest, NY 11366

kwalker@arizona.edu (Kenneth Walker) (04/13/89)

In article <9313@dasys1.UUCP>, aj-mberg@dasys1.UUCP (Micha Berger) writes:
> 
> I don't see any qualitative difference between typedef's and variable
> decleration. An identifier is an identifier. Are you going to tell me
> that Pascal isn't CF because A*B could have diffent meanings, because
> multiplying integers creates different code than multiplynig reals?

The difference is, at least in part, one of practicality. The parse of
A*B is useful for generating code for either integer or real multiplication.
However, I really want my parser to tell the difference between a
declaration and a statement. If you think you would be happy waiting
until the parser is done to decide, be my guest and try it (you will have
to decide what the grammar needs to look like first). The purpose of
a grammar based parser is to take a big chunk out of the complexity 
of program analysis. There is still plenty of work left when it is done,
so the more it can do the better!

Some people have claimed that parsing C is easy. How many of you have
actually tried to produce a _useful_ parser for full ANSI Standard C?
I did it using YACC and the trick of feeding a symbol table back into
the lexical analyser, but it wasn't fun. A corollary of Murphy's Law:
there is a direct relationship between ignorance and optimism. (Which
seems to explain why I usually underestimate how long projects will
take.)

   Ken Walker / Computer Science Dept / Univ of Arizona / Tucson, AZ 85721
   +1 602 621 2858  kwalker@Arizona.EDU   {uunet|allegra|noao}!arizona!kwalker

diamond@diamond.csl.sony.junet (Norman Diamond) (04/13/89)

In article <9313@dasys1.UUCP> aj-mberg@dasys1.UUCP (Micha Berger) writes:
>> Consider the following in C, however:
>
>>	. . . { T (x); . . . }
> 
>> How should the construct be parsed?
>
>It would have to be an expression, since anything within brackets ain't a
>decleration. This isn't Pascal kiddees, all functions are global.

Since Micha Berger is so convinced of C's superiority, maybe he should
learn it.  Anything within brackets is a declaration until the end of
the declarations, and then statements (expressions) until the closing
bracket.  This isn't Pascal kiddee; intuition and aesthetics are not
welcome here.

Norman Diamond, Sony Computer Science Lab (diamond%csl.sony.jp@relay.cs.net)
  The above opinions are my own.   |  Why are programmers criticized for
  If they're also your opinions,   |  re-inventing the wheel, when car
  you're infringing my copyright.  |  manufacturers are praised for it?

aj-mberg@dasys1.UUCP (Micha Berger) (04/18/89)

In article <10162@socslgw.csl.sony.JUNET>, diamond@diamond.csl.sony.junet (Norman Diamond) writes:
> In article <9313@dasys1.UUCP> aj-mberg@dasys1.UUCP (Micha Berger) writes:
> >> Consider the following in C, however:
> >
> >>	. . . { T (x); . . . }
> > 
> >> How should the construct be parsed?
> >
> >It would have to be an expression, since anything within brackets ain't a
> >decleration. This isn't Pascal kiddees, all functions are global.
> 
> Anything within brackets is a declaration until the end of
> the declarations, and then statements (expressions) until the closing
> bracket.  This isn't Pascal kiddee; intuition and aesthetics are not
> welcome here.

Well the sataement cannot be a decleration for x, since x is in
parenthasis. It cannot be a decleration for T (using ANSI's new format),
since it has no type.
	My point was about something else. I felt people were trying to force
the statement to read along the same lines as:

	main() {
		int T(x)
		int x;
		{
		};
		.
		.
		.
	}

and that's what I meant about being allowed in Pascal and not in C.

(BTW I've been using C as my primary language since 1978 not because
C is superior as a language, just that it gives me more power to optimize
my code. If I got a Pascal with a good enough optimizer, I'd switch
back. I like C's speed, not the language itself.)
-- 
					Micha Berger				     
Disclaimer: All opinions expressed here are my own. The spelling, noone's.
email: ...!cmcl2!hombre!dasys1!aj-mberg	       Aspaklaria Publications
  vox: (718) 380-7572			       73-32 173 St, Hillcrest, NY 11366

diamond@diamond.csl.sony.junet (Norman Diamond) (04/20/89)

Some unknown poster once asked:

>>>>	. . . { T (x); . . . }
>>>> How should the construct be parsed?

In article <9382@dasys1.UUCP> aj-mberg@dasys1.UUCP (Micha Berger) writes:

>Well the sataement cannot be a decleration for x, since x is in
>parenthasis. It cannot be a decleration for T (using ANSI's new format),
>since it has no type.

1.  It can be a declaration for x.  If your compiler rejects this:

  typedef int T;
  main ()
  {
      printf ("hello ");
      {
          T (x);
          x = 3;
          printf ("world.\n");
      }
  }

then your compiler is broken.

2.  Before ANSI standardized existing practice, it could have been a
declaration for T, as you hint (though not when T was typedef'ed).

>	My point was about something else. I felt people were trying to force
>the statement to read along the same lines as:
>
>	main() {
>		int T(x)
>		int x;
>		{
>		};
>		...
>	}

Then maybe you should have read the example before answering.
Your first answer was still out to lunch, and you insulted the
wrong language.  Your second answer is still out to lunch too.

Norman Diamond, Sony Computer Science Lab (diamond%csl.sony.jp@relay.cs.net)
  The above opinions are my own.   |  Why are programmers criticized for
  If they're also your opinions,   |  re-inventing the wheel, when car
  you're infringing my copyright.  |  manufacturers are praised for it?