[net.unix-wizards] C compiler implements wrong semantics

NEVILLE%umass-cs.csnet@csnet-relay.arpa (Neville D. Newman) (02/02/86)

This is posted to unix-wizards instead of to net.lang because i believe
that it shows faulty semantics in the Unix C compiler.  i don't know
the proper way to get to arbitrary newsgroups, being an Internet person, so
if the moderator would kindly forward it there i would appreciate it.

While digging through the guts of the portable C compiler, i noticed
that it produced exactly the same code for two statements that i think
have different semantics.  According to my C references, the unary
operators have precedence over binary operators (and are evaluated right
to left).  The rule for pre- and post-increment (and -decrement) operators
is, of course, that a++ changes the value of  a  after the value of the
term is used, so that the change is a side-effect.  ++a, on the other
hand changes the value of  a  before the term is used and is therefore
entirely equivalent to  (a += 1) or (a = a+1).

The C compiler on our 4.2bsd system, however, seems to consider the "before"
and "after" to be relative to the higher level *expression* being evaluated
rather than just the individual term.  In the assembly output for the simple
program that follows, the increment or decrement is always placed before or
after the block of statements that implement the C assignment, never in the
midst of that block where it should sometimes appear.

According to my books, if a is 5, then (a++ + a) ought to evaluate to 11.
On 4.2bsd (or any system using the pcc, i imagine) it evaluates to 10.
On VMS with VAX-C v2.1, it evaluates to 11.
On CP/M-68K with the Alcyon compiler, it is 10.  The Alcyon compiler strives
for compatibility with version 7 Unix.

So the questions for the day are:  Is pcc "right" because it is sort of the
de facto standard?  (i have a friend who claims that BNF and such are useless,
the compiler is the only definition of a language that counts)  Is this
discrepancy between Unix C's behaviour and description already widely known
and carefully worked around?  Should i attempt to fix it and possibly break
some code or leave it alone for old time's sake?

This code should check several facets of the pre-/post- increment/decrement
problem.  The pre-increments should give results of 12 on all systems,
or else there's a *bad* problem.  The post-increments give 10 on Unix, 11
on VMS.  i think 11 is correct, based on the language description.


#include <stdio.h>

main() {
int a;
int b;

/* check post-increments */
a = 5;
b = a + a++;
printf("b = a + a++   yields  %d\n",b);
a = 5;
b = a++ + a;
printf("b = a++ + a   yields  %d\n",b);
a = 5;
b = a + (a++);
printf("b = a + (a++) yields  %d\n",b);
a = 5;
b = (a++) + a;
printf("b = (a++) + a yields  %d\n",b);

/* check pre-increments */
a = 5;
b = a + ++a;
printf("b = a + ++a   yields  %d\n",b);
a = 5;
b = ++a + a;
printf("b = ++a + a   yields  %d\n",b);
a = 5;
b = a + (++a);
printf("b = a + (++a) yields  %d\n",b);
a = 5;
b = (++a) + a;
printf("b = (++a) + a yields  %d\n",b);

}

chris@umcp-cs.UUCP (Chris Torek) (02/02/86)

PCC is neither `right' nor `wrong'; the behaviour of that kind of
code (`a++ + a') is specifically left undefined.  (The ANSI draft
has the notion of `sequence points' after which all side effects
should have taken place.  An addition within a single expression
is not a sequence point.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1415)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu

MRC%PANDA@sumex-aim.arpa (02/03/86)

     On a DEC-20 running Stanford's KCC compiler, all the
post-increments yield 11 and all the pre-increments yield 12, as
follows: 
	b = a + a++   yields  11
	b = a++ + a   yields  11
	b = a + (a++) yields  11
	b = (a++) + a yields  11
	b = a + ++a   yields  12
	b = ++a + a   yields  12
	b = a + (++a) yields  12
	b = (++a) + a yields  12

     This would seem to correspond to the VMS C compiler and the
formal definition.  I think the discrepancy is that VMS C and KCC
were written with a formal definition in mind, while Unix C was
written as a kind of RatFor for PDP-11 assembly code.

     The basic form of the generated code was
	MOVEI 5,6		; load constant 6 into register 5
	MOVEM 5,-3(17)		; store constant in a
	ADD 5,-3(17)		; add a to a
	SUBI 5,1		; subtract 1 (post-increment only)
	MOVEM 5,-2(17)		; store resulting value into b
 in all cases.  The -n(17) stuff simply refers to variables
allocated on the stack (PDP-10 stacks grow upwards).  The only
difference between the pre-increment and post-increment cases was
that the pre-increment case didn't have the SUBI.

     This leads me to another question.  This generated code does
the job, but certainly isn't up to what an optimizing compiler
can do, much less hand-coded assembly code.  On the PDP-10,
hand-coded assembly code could do the computations in 2
instructions if the value of a is unimportant afterwards (and if
printf can take its argument in a register).  We're talking a 50%
slowdown in generated code, or more if we're in an inner loop and
the compiler can recognize the pattern as a load-once constant.

     Has much been done in the technology of optimizing C
compilations?
-------

john@basser.oz (John Mackin) (02/03/86)

I originally wrote the following as a piece of mail to the
poster of the article, but then I thought someone might
be led to believe some of the hideous misstatements
he made, so I am following-up instead...

In article <2147@brl-tgr.ARPA>
	NEVILLE%umass-cs.csnet@csnet-relay.arpa (Neville D. Newman) writes:

> While digging through the guts of the portable C compiler, i noticed
> that it produced exactly the same code for two statements that i think
> have different semantics.

> According to my C references, the unary
> operators have precedence over binary operators (and are evaluated right
> to left).

First problem.  ``my C references'', ``my books'' (below): what are
these meaningless terms?  WHAT are you using as a reference?  There
is only ONE book that should be referred to in a case like this:
``The C Programming Language,'' by Brian W. Kernighan and Dennis
M. Ritchie, Prentice-Hall, 1978, commonly referred to as ``K&R''.
That is the document that defines the language, at least until the
ANSI C standard is produced; and even then there will be ANSI C and K&R
C, unless I miss my guess.  If you are digging around in the internals
of a C compiler and you haven't read K&R until you more or less know
it by heart, go do so, and you don't need to read any more of this
news item.  If I seem to be belaboring the obvious, the reason will become
clear very soon.

> The rule for pre- and post-increment (and -decrement) operators
> is, of course, that a++ changes the value of  a  after the value of the
> term is used, so that the change is a side-effect.

Correct.  Read what you wrote ... ``IS A SIDE-EFFECT.''  Remember
those words, we'll have cause to refer to them shortly.

> ++a, on the other
> hand changes the value of  a  before the term is used and is therefore
> entirely equivalent to  (a += 1) or (a = a+1).

If by this you are trying to claim that the change to a in this case is
NOT a side-effect, you're wrong.  A side-effect is a change to a
variable ``as a by-product of the evaluation of an expression'':
K&R, page 50.

> According to my books, if a is 5, then (a++ + a) ought to evaluate to 11.

WHAT BOOKS?  Any book which implies or states any such thing is just
plain WRONG!  Read K&R, page 50 (Section 2.12):

	In any expression involving SIDE EFFECTS, there
	can be subtle dependencies on the order in which
	variables taking part in the expression are stored.
	[Emphasis mine.]

I won't quote it at greater length, it'd be too much to type, but they
make it perfectly clear.  What that expression ``ought to evaluate to''
is NOT DEFINED.  The implementer of a given C compiler is free to evaluate
it as they wish.  Even lint knows that; applying it to your test program gives:

xx.c(9): warning: a evaluation order undefined
xx.c(12): warning: a evaluation order undefined
xx.c(15): warning: a evaluation order undefined
xx.c(18): warning: a evaluation order undefined
xx.c(23): warning: a evaluation order undefined
xx.c(26): warning: a evaluation order undefined
xx.c(29): warning: a evaluation order undefined
xx.c(32): warning: a evaluation order undefined

> So the questions for the day are:  Is pcc "right" because it is sort of the
> de facto standard?  (i have a friend who claims that BNF and such are useless,
> the compiler is the only definition of a language that counts)

There are cases, particularly with reference to the cpp, where this
argument is valid.  However, in this case it doesn't enter into
the discussion, because the result is NOT DEFINED.

> Is this
> discrepancy between Unix C's behaviour and description already widely known
> and carefully worked around?

There IS no discrepancy.

> Should i attempt to fix it and possibly break
> some code or leave it alone for old time's sake?

Like the old adage says: ``If it's not broken, DON'T fix it.''

> The pre-increments should give results of 12 on all systems,
> or else there's a *bad* problem.

INCORRECT!  The order of evaluation of such things IS NOT DEFINED!

I'm sorry if I've been a bit over-vehement about this, but it
does upset me when people don't read the documents in the case...

John Mackin, Basser Department of Computer Science,
	     University of Sydney, Sydney, Australia

{seismo,ukc,mcvax,ubc-vision,prlb2}!munnari!basser.oz!john
john%basser.oz@SEISMO.CSS.GOV
CSNET: john@basser.oz

hans@erisun.UUCP (02/04/86)

This has no doubt been multiply reiterated over the years, but here goes
again:

	The result of the statement 
	b = ( a++ + a );
	is not defined by the semantics of C.
	Evaluation of subexpressions may be performed in any
	order and, specifically, code which depends on
	subexpression evaluation order is erroneous.
	This is one trait C shares with most sequential
	assignment based procedural languages.

As an aside, there is an operator, ',' , which defines
evaluation order and not much else, and there are
the && and || operators, of course, but these destroy arithmetic
values in their course of duty.
The above statement could be written as
	b = ( ( b = a++ ), b += a  )
to produce one particular evaluation order, 
or
	b = ( ( b = a ), b += a++ )
for the other order, but neither appears to have any advantages
compared to their sequential statement forms,

	b  = a;				b  = a;
	b += ++a;			b += a++;

which are semantically indisputable.


-- 
 Two's complement, but three's an int.
Hans Albertsson EIS, USENET/uucp: {decvax,philabs}!mcvax!enea!erix!erisun!hans

rjk@mrstve.UUCP (Richard Kuhns) (02/04/86)

In article <2147@brl-tgr.ARPA> NEVILLE%umass-cs.csnet@csnet-relay.arpa writes:
(I shortened the note...)
>This is posted to unix-wizards instead of to net.lang because i believe
>that it shows faulty semantics in the Unix C compiler.  i don't know
>the proper way to get to arbitrary newsgroups, being an Internet person, so
>if the moderator would kindly forward it there i would appreciate it.
>
>According to my books, if a is 5, then (a++ + a) ought to evaluate to 11.
>On 4.2bsd (or any system using the pcc, i imagine) it evaluates to 10.
>On VMS with VAX-C v2.1, it evaluates to 11.
>
>So the questions for the day are:  Is pcc "right" because it is sort of the
>de facto standard?  (i have a friend who claims that BNF and such are useless,
>the compiler is the only definition of a language that counts)  Is this
>discrepancy between Unix C's behaviour and description already widely known
>and carefully worked around?  Should i attempt to fix it and possibly break
>some code or leave it alone for old time's sake?
>

According to everything quote official unquote I've read on the subject,
there is no discrepancy between Unix C's behaviour and description.
Quoting from "A C PROGRAM CHECKER - lint" (dist. with AT&T SYSV.2),

"In order that the efficiency of C language on a particular machine not be
unduly compromised, the C language leaves the order of evaluation of
complicated expressions up to the local compiler. ...
*In particular, if any variable is changed by a side effect and also used
elsewhere in the same expression, the result is explicitly undefined.* "
-- 
Rich Kuhns		{ihnp4, decvax, etc...}!pur-ee!pur-phy!mrstve!rjk