[net.lang.c] C "optimization"

dvk@mi-cec.UUCP (Dan Klein) (02/14/84)

This is a continuation of my diatribe on "C doesn't optimize, it neatens".
In this and other articles, I compare a true optimizing compiler (Bliss-32
running under VMS) to a code neatener (C running under BSD 4.1c).  Any and
all counterexamples are welcome.  However, this is NOT a comparison of the
languages.  Both C and Bliss have their good and bad points.  This is simply
a comparison of the code they generate.  As in all examples, the source
code and uncensored assembly code are presented.  In all examples, the C source
and Bliss source are as nearly identical as language differences permit.  I
have not taken advantage of any "tricks" to get either language to perform
better or worse than the other.  The optimizer was enabled for both languages.

		-Dan Klein, Mellon Institute, Pittsburgh	(412)578-3382
=============================================================================

In this comparison, I present a test which selects between two routines,
each of which has the identical arguments passed to it.  Now, since either
one or the other routine is called (i.e. flow control analysis can readily
determine that there is no condition under which neither or both routines
are called), a great optimizer would first push all the arguments onto the
stack, and then test the condition "t", and call the appropriate routine.
Regrettably, even Bliss is not smart enough to do this.  However, it is
smarter than C in that:

	1) It uses two MOVQ instructions to load the procedure arguments
instead of 4 PUSHL instructions (this is a space optimization, but gives
a small speed optimization since there are two fewer instructions to execute).
If Bliss were assured of the target processor, it could use a MOVO (move
octaword).  However, the 11/730 doesn't support that instruction, so Bliss
avoids it.
	2) C uses a common exit point from the routine.  Since there is no
cleanup code, the "jbr L19" is wasteful.  Bliss instead simply generates
two RET instructions where required.  This is a drawback of the peephole
optimizer that C uses.  It can eliminate jumps to jumps, but it does not
eliminate jumps to returns.  It should, since that optimization is both a
speed and space improvement.
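
At the source level, the "push once, test, call" idea can be written directly
by selecting the routine before the call (a sketch only; this is not the form
either compiler was given):

	extern int alpha(), beta();

	test(p, q, r, s, t)
	{
	    int (*f)();

	    /* pick the routine first, then push the arguments only once */
	    f = (t != 0) ? alpha : beta;
	    (*f)(p, q, r, s);
	}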

----------------------------------------+-------------------------------------
external routine alpha;			|	extern  int	alpha();
external routine beta;			|	extern  int	beta();
					|
routine test(p,q,r,s,t) : novalue =	|	test(p,q,r,s,t)
begin					|	{
    if .t neq 0				|	    if (t!=0)
	then alpha(.p,.q,.r,.s)		|		alpha(p,q,r,s);
	else beta(.p,.q,.r,.s);		|	    else
end;					|		beta(p,q,r,s);
					|	}
					|
	.TITLE  FOO			|		.data
					|		.text
	.PSECT  $CODE$,NOWRT,2		|	LL0:	.align  1
					|		.globl  _test
TEST:	.WORD	^M<>			|		.set	L14,0x0
	TSTL	20(AP)			|		.data
	BEQL	1$			|		.text
	MOVQ	12(AP), -(SP)		|	_test:  .word	L14
	MOVQ	4(AP), -(SP)		|		tstl	20(ap)
	CALLS	#4, W^ALPHA		|		jeql	L18
	RET				|		pushl	16(ap)
1$:	MOVQ	12(AP), -(SP)		|		pushl	12(ap)
	MOVQ	4(AP), -(SP)		|		pushl	8(ap)
	CALLS	#4, W^BETA		|		pushl	4(ap)
	RET				|		calls	$4,_alpha
					|		jbr	L19
					|	L18:	pushl	16(ap)
					|		pushl	12(ap)
					|		pushl	8(ap)
					|		pushl	4(ap)
					|		calls	$4,_beta
					|	L19:	ret

dvk@mi-cec.UUCP (Dan Klein) (02/14/84)

This is a continuation of my diatribe on "C doesn't optimize, it neatens".
In this and other articles, I compare a true optimizing compiler (Bliss-32
running under VMS) to a code neatener (C running under BSD 4.1c).  Any and
all counterexamples are welcome.  However, this is NOT a comparison of the
languages.  Both C and Bliss have their good and bad points.  This is simply
a comparison of the code they generate.  As in all examples, the source
code and uncensored assembly code are presented.  In all examples, the C source
and Bliss source are as nearly identical as language differences permit.  I
have not taken advantage of any "tricks" to get either language to perform
better or worse than the other.  The optimizer was enabled for both languages.

		-Dan Klein, Mellon Institute, Pittsburgh	(412)578-3382
=============================================================================

In this example I show what the compilers do when given two blocks of code
that are nearly identical (i.e. they are "tail identical").  Only the head
is different (namely, whether the first argument pushed to "alpha" is a 0
or a 1).  Both compilers behave well in that they share the tail-identical code
between the two conditions.  The only complaint I have against C is that
it uses 4 PUSHL instructions to pass the remaining arguments, rather than a
more speed/space efficient pair of MOVQ instructions.  If you were sure you were
not going to run on an 11/730, you could further reduce this to a single MOVO
instruction, but that is an unreasonable request.
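
For what it's worth, the sharing can be carried all the way back to the source
by folding the branch into the one argument that differs (a sketch; neither
compiler was given this form):

	extern int alpha();

	test(p, q, r, s, t)
	{
	    /* only the last argument depends on the test */
	    alpha(p, q, r, s, (t != 0) ? 0 : 1);
	}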
----------------------------------------+-------------------------------------
external routine alpha;			|	extern  int	alpha();
					|
routine test(p,q,r,s,t) : NoValue =	|	test(p,q,r,s,t)
	begin				|	{
					|	    if (t!=0)
	if .t neq 0 then		|		alpha(p,q,r,s,0);
		alpha(.p,.q,.r,.s,0)	|	    else
	else				|		alpha(p,q,r,s,1);
		alpha(.p,.q,.r,.s,1);	|	}
	end;				|
					|
					|
	.TITLE  FOO			|		.data
					|		.text
	.EXTRN  ALPHA			|	LL0:	.align  1
					|		.globl  _test
	.PSECT  $CODE$,NOWRT,2		|		.set	L13,0x0
					|		.data
TEST:	.WORD	^M<>			|		.text
	TSTL	20(AP)			|	_test:  .word	L13
	BEQL	1$			|		tstl	20(ap)
	CLRL	-(SP)			|		jeql	L17
	BRB	2$			|		pushl	$0
1$:	PUSHL	#1			|		jbr	L200004
2$:	MOVQ	12(AP), -(SP)		|	L17:	pushl	$1
	MOVQ	4(AP), -(SP)		|	L200004:pushl	16(ap)
	CALLS	#5, W^ALPHA		|		pushl	12(ap)
	RET				|		pushl	8(ap)
					|		pushl	4(ap)
					|		calls	$5,_alpha
					|		ret

dvk@mi-cec.UUCP (Dan Klein) (02/14/84)

This is a continuation of my diatribe on "C doesn't optimize, it neatens".
In this and other articles, I compare a true optimizing compiler (Bliss-32
running under VMS) to a code neatener (C running under BSD 4.1c).  Any and
all counterexamples are welcome.  However, this is NOT a comparison of the
languages.  Both C and Bliss have their good and bad points.  This is simply
a comparison of the code they generate.  As in all examples, the source
code and uncensored assembly code are presented.  In all examples, the C source
and Bliss source are as nearly identical as language differences permit.  I
have not taken advantage of any "tricks" to get either language to perform
better or worse than the other.  The optimizer was enabled for both languages.

		-Dan Klein, Mellon Institute, Pittsburgh	(412)578-3382
=============================================================================

In this example, I demonstrate the ability (and lack thereof) of the compilers
to extract loop invariant code from the body of a loop.  This is a technique
we were all taught in Programming-1.  The Bliss compiler knows about this
technique, and does it for you when you forget (or when it is less elegant to
create a temporary variable to hold the invariant value).  What I do here is
loop on "i" from 0 to "(5+a)/2", and do *nothing* in the body of the loop.
The invariant expression is "(5+a)/2".  It doesn't take a genius to see that
that value will never change in the loop, especially since the loop does
nothing at all (let alone reference "a").  This is a very simple example of
loop invariant code.  Bliss can recognize more complex examples than this.
Neither compiler eliminates the loop altogether.  This is a religious issue,
namely "is the loop needed at all if you know you aren't going to do anything
in it?"  There is no "right" answer to that question, since it is really very
application dependent (e.g. do you really want to ignore software timing
delays?).  So, on to the comparison:

	1) Bliss recognizes the loop invariant section of the loop, and
evaluates it once (before the loop is executed).  Thereafter, it does not
need to reevaluate the expression.  The C compiler, on the other hand,
evaluates the limit expression before each pass of the loop.  Not only is
this computationally redundant, it is also slower.  (A hand-hoisted version
of the C source appears after this list.)
	2) Bliss uses the AOBLEQ (Add One and Branch if Less than or Equal)
instruction to effect the loop.  The C compiler, after recalculating the limit
expression, uses an "incl" / "cmpl" / "jleq" combination.  This is less
efficient in both speed and space.
	3) In the calculation of the limit expression, both compilers need
a temporary variable to place the result.  The Bliss compiler chooses R1,
while the C compiler allocates a stack location.  This is a poor choice on
the part of C, since stack accesses take far longer than register accesses
and require more bytes of machine code.  The register "r1" is available
(and does not need to be preserved on routine entry), so C should use it.
	4) The C compiler allocates a single variable on the stack in the
wrong way.  It emits "subl2 $4,sp" / "clrl -4(fp)" when it could much more
efficiently do "clrl -(sp)".  Thereafter it refers to the variable as "-4(fp)"
when it should use "(sp)".  The latter takes one byte versus two.  However, as
mentioned in 3) above, using "r1" is better all around.
	5) The Bliss compiler sets the loop index variable to be 1 less than
the starting value it needs, and immediately increments it (i.e. the loop
increment is at the top of the loop).  The loop increment in C is also at the
top of the loop, but C sets the variable to be what it wants to start at,
skips the loop increment the first time, and hits it each time afterward.
For loop increments that are complex (i.e. involve pointer dereferencing),
this is a reasonable approach.  However, for simple increments (like "i++"),
the code is wasteful.
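
For reference, this is what hoisting the invariant by hand looks like in the
C source (a sketch of the transformation Bliss performs internally, not code
that was fed to either compiler):

	test(a)
	int	a;
	{
	    int i;
	    int limit = (5 + a) / 2;	/* the invariant, evaluated once */

	    for (i = 0; i <= limit; i++)
		;
	}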
----------------------------------------+-------------------------------------
routine test(a) : novalue =		|	test(a)
begin					|	int	a;
					|	{
    incr i from 0 to (5+.a)/2 do ;	|	    int i;
					|
end;					|	    for (i=0; i<=(5+a)/2; i++) ;
					|	}
					|
	.TITLE  FOO			|		.data
					|		.text
	.PSECT  $CODE$,NOWRT,2		|	LL0:	.align  1
					|		.globl  _test
TEST:	.WORD	^M<>			|		.set	L12,0x0
	ADDL3	#5, 4(AP), R1		|		.data
	DIVL2	#2, R1			|		.text
	MNEGL	#1, R0			|	_test:  .word	L12
1$:	AOBLEQ  R1, R0, 1$		|		subl2	$4,sp
	RET				|		clrl	-4(fp)
					|		jbr	L18
					|	L200001:incl	-4(fp)
					|	L18:	addl3	$5,4(ap),r0
					|		divl2	$2,r0
					|		cmpl	-4(fp),r0
					|		jleq	L200001
					|		ret

dvk@mi-cec.UUCP (Dan Klein) (02/14/84)

This is a continuation of my diatribe on "C doesn't optimize, it neatens".
In this and other articles, I compare a true optimizing compiler (Bliss-32
running under VMS) to a code neatener (C running under BSD 4.1c).  Any and
all counterexamples are welcome.  However, this is NOT a comparison of the
languages.  Both C and Bliss have their good and bad points.  This is simply
a comparison of the code they generate.  As in all examples, the source
code and uncensored assembly code are presented.  In all examples, the C source
and Bliss source are as nearly identical as language differences permit.  I
have not taken advantage of any "tricks" to get either language to perform
better or worse than the other.  The optimizer was enabled for both languages.

		-Dan Klein, Mellon Institute, Pittsburgh	(412)578-3382
=============================================================================

In this example I show how the compilers allocate local variables.  The
test code declares two local variables, "a", and "b".  By using flow analysis
it can be seen that the use of "a" and "b" does not overlap.  Now a really
"great" optimizer would not bother moving the constant values into the local
variables, and then pushing them on the stack (since it can also be seen that
the only use of the local variable is a place holder).  However, both compilers
can be forgiven in that it is possible for a debugger to tell you the values
of "a" and "b" if they occupy a location, while it is very difficult to make
a debugger smart enough to look on the stack during a partially executed
routine call.  The comparison:
	1) C is not smart enough to recognize that the uses of the variables
do not overlap, and so allocates two locations (on the stack) for them.  Bliss,
on the other hand, recognizes this fact, and shares one location for both
variables.  DON'T tell me C does it so the debugger can distinguish between
the two variables.  If the VMS debugger is smart enough to tell the difference,
then so can the Unix debuggers.
	2) C allocates the variables on the stack, while Bliss chooses to use
register 0.  Even if Bliss had used two registers, it still could have used
R1 and R0.  C should also, but instead it plods around playing with the stack.
The C code is larger, and also substantially slower.
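
The sharing Bliss performs, written out by hand in the C source, amounts to
this (a sketch only; the generated code above came from the two-variable
version):

	extern  alpha();

	test()
	{
	    int t;	/* one location serves both non-overlapping lifetimes */

	    t = 2;
	    alpha(t);
	    t = 5;
	    alpha(t);
	}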
----------------------------------------+-------------------------------------
external routine alpha;			|	extern  alpha();
					|
routine test(parm) : NoValue =		|	test()
	begin				|	{
	local a,b;			|	    int a, b;
					|
	a = 2;				|	    a = 2;
	alpha(.a);			|	    alpha(a);
	b = 5;				|	    b = 5;
	alpha(.b);			|	    alpha(b);
	end;				|	}
					|
					|
	.TITLE  FOO			|		.data
					|		.text
	.EXTRN  ALPHA			|	LL0:	.align	    1
					|		.globl  _test
	.PSECT  $CODE$,NOWRT,2		|		.set	L13,0x0
					|		.data
TEST:	.WORD	^M<>			|		.text
	MOVL	#2, R0			|	_test:  .word	  L13
	PUSHL	R0			|		subl2	$8,sp
	CALLS	#1, W^ALPHA		|		movl	$2,-4(fp)
	MOVL	#5, R0			|		pushl	-4(fp)
	PUSHL	R0			|		calls	$1,_alpha
	CALLS	#1, W^ALPHA		|		movl	$5,-8(fp)
	RET				|		pushl	-8(fp)
					|		calls	$1,_alpha
					|		ret

dvk@mi-cec.UUCP (Dan Klein) (02/14/84)

This is a continuation of my diatribe on "C doesn't optimize, it neatens".
In this and other articles, I compare a true optimizing compiler (Bliss-32
running under VMS) to a code neatener (C running under BSD 4.1c).  Any and
all counterexamples are welcome.  However, this is NOT a comparison of the
languages.  Both C and Bliss have their good and bad points.  This is simply
a comparison of the code they generate.  As in all examples, the source
code and uncensored assembly code are presented.  In all examples, the C source
and Bliss source are as nearly identical as language differences permit.  I
have not taken advantage of any "tricks" to get either language to perform
better or worse than the other.  The optimizer was enabled for both languages.

		-Dan Klein, Mellon Institute, Pittsburgh	(412)578-3382
=============================================================================

In this example I show what the compilers do when they find unused local
variables.  Simply put, if a local variable is unused, there is no reason
to allocate space for it.  The time spent allocating space is a waste.  The
Bliss compiler knows that both "a" and "b" are not used, so no space is
reserved for them.  The C compiler on the other hand subtracts 8 from the
stack pointer, and then does not touch the space.  Why bother?  (The debugger
can always be told the variable is not allocated).  The removal of unused
locals should be a transparent operation.  Instead, the C compiler needs the
help of "lint" to warn you about it.  Bogus!
----------------------------------------+-------------------------------------
external routine alpha;			|	extern  alpha();
					|
routine test(parm) : NoValue =		|	test(parm)
	begin				|	{
	local a,b;			|	    int a, b;
					|
	alpha(.parm);			|	    alpha(parm);
	end;				|	}
					|
					|
	.TITLE  FOO			|		.data
					|		.text
	.EXTRN  ALPHA			|	LL0:	.align	    1
					|		.globl  _test
	.PSECT  $CODE$,NOWRT,2		|		.set	L13,0x0
					|		.data
TEST:	.WORD	^M<>			|		.text
	PUSHL	4(AP)			|	_test:  .word	  L13
	CALLS	#1, W^ALPHA		|		subl2	$8,sp
	RET				|		pushl	4(ap)
					|		calls	$1,_alpha
					|		ret

dvk@mi-cec.UUCP (Dan Klein) (02/14/84)

This is a continuation of my diatribe on "C doesn't optimize, it neatens".
In this and other articles, I compare a true optimizing compiler (Bliss-32
running under VMS) to a code neatener (C running under BSD 4.1c).  Any and
all counterexamples are welcome.  However, this is NOT a comparison of the
languages.  Both C and Bliss have their good and bad points.  This is simply
a comparison of the code they generate.  As in all examples, the source
code and uncensored assembly code are presented.  In all examples, the C source
and Bliss source are as nearly identical as language differences permit.  I
have not taken advantage of any "tricks" to get either language to perform
better or worse than the other.  The optimizer was enabled for both languages.

		-Dan Klein, Mellon Institute, Pittsburgh	(412)578-3382
=============================================================================

In this example, I show how both compilers are smart enough to do compile
time testing of constant expressions.  This is included to be fair to C,
which does a generally "good" job of generating code.  It just ain't "great".

As can be seen, the generated code is effectively equivalent.  However, here
I raise a gripe about the template nature of the C compiler.  Since it assumes
there to be a data and a text portion to every routine/program, it emits a
".data" pseudo op, outputs the data, emits a ".text" pseudo op, and outputs
the text.  What results is two (count 'em) superfluous ".data" / ".text"
pseudo ops.  Now, this doesn't slow down the assembler a whole lot, but on
a loaded system, every little bit helps.  Likewise, where in the hell is
the label "LL0" ever used?  If you don't use it (and certainly, with a name
like "LL0", it is local), don't emit it.  The same thing holds true for
the routine entry mask being stored in "L13".  Whatever for?  You only
use it once, so why clutter the symbol table with it?  C takes 10 lines
of assembly code to do what Bliss does in 5 lines.
----------------------------------------+-------------------------------------
literal P1 = 3,				|	#define P1	3
	P2 = 5;				|	#define P2	5
					|
routine test =				|	extern int alpha();
begin					|
    if P1 lss P2			|	test()
	then return 0			|	{
	else return 1			|	    if (P1 < P2)
end;					|		return 0;
					|	    else
					|		return 1;
					|	}
					|
	.TITLE  FOO			|		.data
					|		.text
	.PSECT  $CODE$,NOWRT,2		|	LL0:	.align  1
					|		.globl  _test
TEST:	.WORD	^M<>			|		.set	L13,0x0
	CLRL	R0			|		.data
	RET				|		.text
					|	_test:  .word	L13
					|		clrl	r0
					|		ret

dvk@mi-cec.UUCP (Dan Klein) (02/14/84)

This is a continuation of my diatribe on "C doesn't optimize, it neatens".
In this and other articles, I compare a true optimizing compiler (Bliss-32
running under VMS) to a code neatener (C running under BSD 4.1c).  Any and
all counterexamples are welcome.  However, this is NOT a comparison of the
languages.  Both C and Bliss have their good and bad points.  This is simply
a comparison of the code they generate.  As in all examples, the source
code and uncensored assembly code are presented.  In all examples, the C source
and Bliss source are as nearly identical as language differences permit.  I
have not taken advantage of any "tricks" to get either language to perform
better or worse than the other.  The optimizer was enabled for both languages.

		-Dan Klein, Mellon Institute, Pittsburgh	(412)578-3382
=============================================================================

In this example I show what happens when local variables are initialized
and compared.  I have to admit I was disappointed with Bliss here.  A "great"
optimizer should recognize that the local variables are initialized, never
modified, and compared.  This should reduce to a compile time test, and no
intermediate code should be generated.  Alas, both C and Bliss do the tests
at run time.  The same old complaint against C shows up here, though.  It
uses stack locals when it should use "r0" and "r1".  To be fair, other than
that, the code is the same for both compilers.
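
For reference, the residue a "great" optimizer would be entitled to leave
behind is just this (a sketch of the reduction, not output from either
compiler):

	test()
	{
	    /* loc1 and loc2 never change, and 3 < 5 is always true */
	    return 0;
	}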
----------------------------------------+-------------------------------------
literal P1 = 3,				|	#define P1	3
	P2 = 5;				|	#define P2	5
					|
routine test =				|	extern int alpha();
begin					|
    local loc1 : initial(P1),		|	test()
	  loc2 : initial(P2);		|	{
					|	    int loc1 = P1,
    if .loc1 lss .loc2			|		loc2 = P2;
	then return 0			|
	else return 1			|	    if (loc1 < loc2)
end;					|		return 0;
					|	    else
					|		return 1;
					|	}
					|
	.TITLE  FOO			|		.data
					|		.text
	.PSECT  $CODE$,NOWRT,2		|	LL0:	.align  1
					|		.globl  _test
TEST:	.WORD	^M<>			|		.set	L13,0x0
	MOVL	#3, R1			|		.data
	MOVL	#5, R0			|		.text
	CMPL	R1, R0			|	_test:  .word	L13
	BGEQ	1$			|		subl2	$8,sp
	CLRL	R0			|		movl	$3,-4(fp)
	RET				|		movl	$5,-8(fp)
1$:	MOVL	#1, R0			|		cmpl	-4(fp),-8(fp)
	RET				|		jgeq	L17
					|		clrl	r0
					|		ret
					|	L17:	movl	$1,r0
					|		ret

dvk@mi-cec.UUCP (Dan Klein) (02/14/84)

This is a continuation of my diatribe on "C doesn't optimize, it neatens".
In this and other articles, I compare a true optimizing compiler (Bliss-32
running under VMS) to a code neatener (C running under BSD 4.1c).  Any and
all counterexamples are welcome.  However, this is NOT a comparison of the
languages.  Both C and Bliss have their good and bad points.  This is simply
a comparison of the code they generate.  As in all examples, the source
code and uncensored assembly code are presented.  In all examples, the C source
and Bliss source are as nearly identical as language differences permit.  I
have not taken advantage of any "tricks" to get either language to perform
better or worse than the other.  The optimizer was enabled for both languages.

		-Dan Klein, Mellon Institute, Pittsburgh	(412)578-3382
=============================================================================

In this example, I construct a structure that consists of a pointer and
a data part.  The operation that is being performed is that of moving the
data part of one component to the data part of the component immediately
preceding it in the linked list.  The mechanism is more naturally expressed
in C, but Bliss outperforms it even in light of the "irregular" constraints
placed on the code generator.

Since the Bliss optimizer has some flow control analysis built into it, it
is able to eliminate common sub-expressions from complex statements or groups
of statements.  In this case, the common subexpression (using C syntax) is
the pointer chain "a->p->p->p->".  Bliss is able to recognize this commonality,
and evaluate that chain only once.  Since C, on the other hand, has only a
peephole optimizer to work with, it fails to see this common subexpression,
and evaluates both "a->p->p->p->d" and "a->p->p->p->p->d" completely.  Without
a doubt, the Bliss-generated code is more efficient in both speed and space
than the C-generated code.
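
Written out by hand in the C source, the subexpression sharing Bliss finds
amounts to this (a sketch only; neither compiler was given this form):

	test(a)
	DP a;
	{
	    register DP t;

	    t = a->p->p->p;	/* the common chain, evaluated once */
	    t->d = t->p->d;
	}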
----------------------------------------+-------------------------------------
switches structure(ref block[2]);	|	typedef struct dp *DP;
field CP =				|	struct dp {
    set					|	    int d;
    d = [0,0,%BPVAL,0],			|	    DP  p;
    p = [1,0,%BPADDR,0]			|	    }
    tes;				|
					|	test(a)
routine test(a)  : NoValue =		|	DP a;
    begin				|	{
    map a : ref block[2] field(CP);	|	    a->p->p->p->d =
    a[p][p][p][d] = .a[p][p][p][p][d];  |	 	a->p->p->p->p->d;
    end;				|	}
					|
					|
	.TITLE  FOO			|		.data
					|		.text
	.PSECT  $CODE$,NOWRT,2		|	LL0:	.align  1
					|		.globl  _test
TEST:	.WORD	^M<>			|		.lcomm  L16,8
	MOVL	4(AP), R0		|		.set	L12,0x0
	MOVL	4(R0), R0		|		.data
	MOVL	4(R0), R0		|		.text
	MOVL	4(R0), R0		|	_test:  .word	L12
	MOVL	@4(R0), (R0)		|		movl	4(ap),r0
	RET				|		movl	4(r0),r0
					|		movl	4(r0),r0
					|		movl	4(r0),r0
					|		movl	4(ap),r1
					|		movl	4(r1),r1
					|		movl	4(r1),r1
					|		movl	*4(r0),*4(r1)
					|		movab	L16,r1
					|		movq	(r0),(r1)
					|		movab	L16,r0
					|		ret

keesan@bbncca.ARPA (Morris Keesan) (02/17/84)

-----------------------------
    This particular example is fatuous.  You complain about the  C compiler
emitting unnecessary assembler pseudo-ops and labels, because they "slow the
assembler down", and "on a loaded system, every little bit helps."  This is
totally silly.  The extra code and processing in the C compiler to eliminate
extra lines of assembler code are likely to put as much of a load on a system
as the extra processing in the assembler.
-- 
					Morris M. Keesan
					{decvax,linus,wjh12,ima}!bbncca!keesan
					keesan @ BBN-UNIX.ARPA

dvk@mi-cec.UUCP (Dan Klein) (02/19/84)

A response to Morris Keesan:

	Having a compiler discard instructions that are superfluous is not a
"fatuous" desire.  If you go ahead and count the number of instructions that
are required to suppress a ".data" / ".text" pair (probably on the order of
10 instructions), and the amount of code required to write the things out,
go through system buffers, get written out to the disk, read back again by the
next pass, moved back through the buffers, scanned, checked in lookup tables,
evaluated, (ignored), and then add in the (minimum) 5 or 6 context swaps, you
begin to see that removing the instructions is **FAR** cheaper than writing
them out.  Not only that, but if I/O is blocked, the process is eligible to
be swapped out, which slows you down even more.

	No, the example was not fatuous.  Try running on a system with a load
of 20 (it happens a *lot* on cmucsg), and you tell me which approach wins.

		-Dan Klein, Mellon Institute, Pittsburgh

ka@hou3c.UUCP (Kenneth Almquist) (02/22/84)

This example also shows what is really a bug in the C compiler.  After
executing the assignment, it (unnecessarily) takes the value of *(a->p->p).
The System V compilers for the VAX and the 3B20 do the same thing.
					Kenneth Almquist


P. S.  For those of you who don't have the original article, the C code was:
typedef struct dp *DP;
struct dp {
    int d;
    DP  p;
    }

test(a)
DP a;
{
    a->p->p->p->d =
 	a->p->p->p->p->d;
}

scs@foxvax1.UUCP (S.C. Schwarm ) (02/22/84)

To add my two cents on unnecessary labels and directives.  The problem is
with the speed of DEC Macro Assemblers.  They are and always have been VERY
SLOW, and therefore a compiler writer never uses the standard one.  She/he
normally must generate object code directly and build an object-code displayer
to print it out.  I really don't know if BLISS-32 does it this way, but
FORTRAN and Pascal on VMS do.

Otherwise, I think this work is very interesting.

	Steve Schwarm
	Foxboro Co.
	genrad!wjh12!foxvax1!scs

grahamr@azure.UUCP (Graham Ross) (02/22/84)

Of course it's fatuous.  The C compiler I use is one-pass.  (Try /lib/ccom
as an interactive filter on the VAX.)  The work required to decide whether
or not to emit .text and .data directives is considerable.  The general fix
implies some sort of two-pass operation or backstashing.

Similarly, putting L13 in the symbol table for a register mask is done
simply because the register mask word must be emitted at the function's
entry point and the compiler doesn't know how many registers it destroys
until the function's end.  The L13 gets moved to a place before the .word
directive by the optimizer.  (Neatener if you want -- "optimum" is Latin
for "best" and Klein's examples show many Bliss failings too.)

I'd like to comment that I don't often pass adjacent (in memory) data
objects to subroutines -- that is, the pushl v. movq argument strikes me
as specious.

chris@umcp-cs.UUCP (02/22/84)

	From: ka@hou3c.UUCP (Kenneth Almquist)

	This example also shows what is really a bug in the C compiler.
	After executing the assignment, it (unnecessarily) takes the
	value of *(a->p->p)....  the C code was: ...

	typedef struct dp *DP;
	struct dp { int d; DP  p; }
	test(a) DP a; { ...

Nope.  It's a bug in the C code.  Look closely at the "struct" declaration:
no trailing semicolon.  That makes "test" a structure-valued function.

I missed that the first time too.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci
UUCP:	{seismo,allegra,brl-bmd}!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris.umcp-cs@CSNet-Relay

guy@rlgvax.UUCP (Guy Harris) (02/23/84)

A few comments:

1) MOVO is not even guaranteed to be available on an 11/750 or 11/780.
It's "real" name is MOVH (moves an H-floating value), and is only provided
with the G/H-floating option.  (I agree, though, catching the two "pushl"s
and converting them to a "movq" would be good.)

2) Dan is right in that the extra code to eliminate redundant ".data" and
".text" operations would probably be extremely minor (checking "pcc/code.c",
there are only a few places that emit them, and wrapping a test of a state
variable and a set/clear of that state variable around it would be all
it would seem to take), but I didn't check it in detail to see if there
are any hidden "gotcha"s.  Morris is right in that it probably doesn't make
enough of a difference to worry about; yes, penny candy is far cheaper than
10 cent candy, but the difference only matters if your allowance is 10
cents/week.  The extra
I/O overhead is equivalent to

	(7*number of extraneous pseudo-ops)/(buffer size)

which means that it takes about 73 extraneous pseudo-ops to require
one extra "write" or "read" system call; percentagewise, this is probably
a drop in the bucket.  The only other overhead is that of the calls
to "printf" which write the extra pseudo-ops (negligible, considering
all the other things compilers do) and the code in the assembler to
process them (probably the most significant overhead of them all, and
I suspect it is still small, percentagewise).  Trimming the extra
.data/.text isn't the place to put your effort if you're trying to
improve the performance of your machine or your compiler.
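
(The arithmetic behind the 73, for the curious: at roughly 7 bytes per
extraneous pseudo-op line and presumably a 512-byte I/O buffer, 512/7 is
about 73 lines per extra buffer-full, i.e. per extra "read" or "write".)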

3) Even though this is a comparison of implementations, not languages,
the point is well taken that perhaps better language implementations
under UNIX should be considered.  There is resistance to UNIX among
people doing scientific computation, because its FORTRAN 77 implementation
produces correct answers, but doesn't produce them
in anything near the most CPU-efficient manner (and, believe me, CPU
efficiency *counts* in that environment!).  I suspect the relative lack
of development of the code quality of UNIX compilers is due to the
developers' greater interest in portability than in code quality, and to
the code quality of those compilers being judged adequate by their users.

	Guy Harris
	{seismo,ihnp4,allegra}!rlgvax!guy

mp@whuxle.UUCP (Mark Plotnick) (02/23/84)

Dan Klein's comparison between C and Bliss really compared pcc
with BLISS-32.  Using such results to point out the inferior aspects of
C is like saying UNIX's user interface is horrible based on one's
experiences with version 6 and 3bsd.

I compiled Dan's test programs on the DEC VMS C compiler, version 1.0-09.
This compiler shares its back end with other highly optimizing
VMS compilers such as PL/1.  The generated code is in some cases better
than BLISS-32's, and in some cases it's worse than pcc's.


Program 1:
extern  int	alpha();	    test:   .entry  test,^m<r2>
extern  int	beta();			    moval   (ap),r2
					    tstl    14(r2)
test(p,q,r,s,t)				    beql    vcg.1
{					    pushl   10(r2)
    if (t!=0)				    pushl   0C(r2)
	alpha(p,q,r,s);			    pushl   08(r2)
    else				    pushl   04(r2)
	beta(p,q,r,s);			    calls   #4,alpha
}					    ret
				    vcg.1:  pushl   10(r2)
					    pushl   0C(r2)
					    pushl   08(r2)
					    pushl   04(r2)
					    calls   #4,beta
					    ret

The code here is about the same as pcc's.  Note that VMS C
likes to keep the address of the base of the argument list in a register,
even though it doesn't really buy anything.

Program 2:

extern  int	alpha();	    test:   .entry  test,^m<r2>
					    moval   (ap),r2
test(p,q,r,s,t)				    tstl    14(r2)
{					    beql    vcg.1
    if (t!=0)				    pushl   #0
	alpha(p,q,r,s,0);		    pushl   10(r2)
    else				    pushl   0C(r2)
	alpha(p,q,r,s,1);		    pushl   08(r2)
}					    pushl   04(r2)
					    calls   #5,alpha
					    ret
				    vcg.1:  pushl   #1
					    pushl   10(r2)
					    pushl   0C(r2)
					    pushl   08(r2)
					    pushl   04(r2)
					    calls   #5,alpha
					    ret


Well, VMS C loses here.  It doesn't share identical code sections.

Program 3:
test(a)				    test:   .entry  test,^m<r2>
int	a;				    moval   (ap),r2
{					    clrl    r1
    int i;				    addl3   #5,04(r2),r0
					    divl2   #2,r0
    for (i=0; i<=(5+a)/2; i++) ;	    cmpl    r1,r0
}					    bgtr    vcg.2
				    vcg.1:  aobleq  r0,r1,vcg.1
				    vcg.2:  ret


Almost as good as BLISS-32's.  It doesn't use BLISS-32's neat trick of setting
the value to be one less than the real starting value and flowing
into the aobleq.


Program 4:
extern  alpha();		    test:   .entry  test,^m<>
					    pushl   #2
test()					    calls   #1,alpha
{					    pushl   #5
    int a, b;				    calls   #1,alpha
					    ret
    a = 2;
    alpha(a);
    b = 5;
    alpha(b);
}


This beats BLISS-32, by pushing the constant values immediately rather than
moving each to a register first.

Program 5:
Identical to BLISS-32's code.


Program 6:
Identical to BLISS-32's code.

Program 7:

#define P1	3		    test:   .entry  test,^m<>
#define P2	5			    cmpl    #3,#5
					    bgeq    vcg.1
extern int alpha();			    clrl    r0
					    ret
test()				    vcg.1:
{					    movl    #1,r0
    int loc1 = P1,			    ret
	loc2 = P2;

    if (loc1 < loc2)
	return 0;
    else
	return 1;
}


This beats BLISS-32 in the same way as it did with Program 4 - it uses constant
values in immediate mode rather than moving them to registers first.
I wonder why it didn't just go further and collapse the program
down to "clrl r0 ; ret".

Program 8:
typedef struct dp *DP;		    test:   .entry  test,^m<>
struct dp {				    movl    08(ap),r0
    int d;				    movl    04(r0),r0
    DP  p;				    movl    04(r0),r0
    }					    movl    04(r0),r0
					    movl    @04(r0),(r0)
test(a)					    ret
DP a;
{
    a->p->p->p->d =
	a->p->p->p->p->d;
}

Identical to BLISS-32's code, except that the arg is at 8(ap) rather
than 4(ap).  This is because the function is structure-valued, and
4(ap) contains the address of some space set aside in the caller
for the return value.

Program 9:

extern  alpha();		    test:   .entry  test,^m<>
					    clrl    ap
test()					    movl    ap,r0
{					    pushl   #0
    register int a, b, c;		    pushl   ap
					    pushl   r0
    a = b = c = 0;			    calls   #3,alpha
    alpha(a, b, c);			    ret
}

This code looks truly bizarre, and has to do with VMS C's methods
of register allocation.  Variable "c" has been optimized out of
the code, "b" resides in ap (since we don't have any parameters,
we can use ap as just another register), and "a" resides in r0.


Before you all rush out and switch to VMS because of its C compiler,
note the following (all based on version 1.0-09; I don't know if
another version has come out that fixes these problems):
	- it uses registers very heavily, allocating them based on a static
	  analysis of the program.  It also IGNORES "register" declarations in
	  your C code.  This may lead to nonoptimal code, as well as slower
	  procedure startup (because of all the registers that may need to
	  be saved).
	- it tries to minimize code space, but not data space.  The
	  prime example of this is that it always uses a branch table
	  for implementing switches, no matter how sparsely the cases
	  are distributed.  A switch with a "case 0" and a "case 32000"
	  generates an awfully big branch table.
	- it sometimes dies with a fatal parser error when presented with
	  some legal C constructs.
But it's reasonably fast, comes with a very good manual, and can
produce compilation listings with storage maps, cross references, etc.


	Mark Plotnick
	WH 1C-244 x6955

jss@sjuvax.UUCP (J. Shapiro) (11/16/85)

Some of you may find this one interesting:

Given the following (fairly trivial) C program:

main()
{
        int a=5, b=4, c=3, d=2, e=1;

        a = b + c;
        e = c + d + b;
}

Unoptimized ULTRIX cc and 4.2 BSD cc give assembly code which amounts to

	push the constants onto the stack
	add b + c, move result to a
	add c + d, put it in R0
	add b to R0, move result to e
	set up return value
	exit

Optimizing causes it to use registers a bit more.

Unoptimized VMS cc generates essentially the same code as optimized ULTRIX cc,
though it pushes the constants onto the stack in the opposite order.

Here is the kicker:

Optimized VMS cc generates code which amounts to

	set up return value
	exit

having concluded that all variables in the system are dead at all points.
Inserting a printf of the values a and e after the assignments results in
(Hang on to your seats):

	push 9
	push 7
	call printf
	set up return value
	exit

I am afraid I got off on a tangent relating to finding common
subexpressions, and decided to try the above compilation hoping that
some version of the compilation would spot the obvious optimization of
changing the second line to read, in effect:

	e = a + d

In the process of finding out that no one seems to do this, I have lost
some respect for the UNIX "optimizer".  Now, can anyone mail me a
coherent reason why UNIX cc is so inexcusably stupid (at the
moment our postings are going out, but our incoming feed is dead)?
Can someone redeem my respect for the optimizer?
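
Spelled out, the rewrite I was hoping some compiler would spot is the
hand-applied common subexpression elimination below (a sketch; none of the
compilers above produced it):

main()
{
        int a=5, b=4, c=3, d=2, e=1;

        a = b + c;
        e = a + d;      /* b + c is already sitting in a */
}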

Jon Shapiro
Haverford College
-- 
Jonathan S. Shapiro
Haverford College

	"It doesn't compile pseudo code... What do you expect for fifty
		dollars?" - M. Tiemann

pdg@ihdev.UUCP (P. D. Guthrie) (11/18/85)

In article <2539@sjuvax.UUCP> jss@sjuvax.UUCP (J. Shapiro) writes:
>
>In the process of finding out that no one seems to do this, I have lost
>some respect for the UNIX "optimizer".  Now, can anyone mail me a
>coherent reason why UNIX cc is so inexcusably stupid (at the
>moment our postings are going out, but our incoming feed is dead)?
>Can someone redeem my respect for the optimizer?
>

I enjoyed the comparison between the UNIX and VMS optimizers (left out
for brevity), and my only reaction is that it is typical of DEC
compilers to produce super-optimized code.  The Tops-20 Fortran Compiler
is another example, and seems to store "synonyms" for the contents of
every register, to leave out unnecessary calculations.  When I was
writing a Fortran compiler on a DEC-20,  it seemed at first that the
best way to see how to generate code was to look at the output of DEC's
Fortran compiler,  but that proved to be fruitless, as it would use very
strange DEC-20 instructions to optimize the code down to nothing.
I would be interested to know if there have been any articles published
on DEC's optimization methods.
					Paul Guthrie
					ihnp4!ihdev!pdg
PS, sorry to talk about Fortrash in this enlightened newsgroup :-}

jtp@lasspvax.UUCP (Jim Pantaleone) (11/19/85)

In article <403@ihdev.UUCP> pdg@ihdev.UUCP (55224-P. D. Guthrie) writes:
>I would be interested to know if there have been any articles published
>on DEC's optimization methods.

I remember reading a volume from Digital Press a few years ago on the
subject of the creation of DEC's first PL/1 compiler. It wasn't bad at
all - a bit of "Soul of a New Machine" mixed in with blood and guts only
a compiler phreak could love. Interesting discussion of how they chose/
developed/cooked up the optimizations that made it into the final
product. [Your local DEC people should have a pub. catalogue somewhere.]

Point is, rev 1 of the C compiler produced code that had well-optimized
sequences mixed in with plug-ugly ones.  And it looked eerily familiar.  I
think they hacked the PL/1!

The C code still isn't as beauteous as the PL/1 (DO-loops where the 
index variable can get redefined if that works out better) or Fortran 
(machine NOP's interspersed to optimize memory cycling), but the latest 
version (2.1?) does actually produce correct, good code.  As far
as I know.  And the screen debugger is (most of the time) at least as 
much a boon to mankind as the compiler... now, why aren't ALL the system 
symbols available in ".H" file form?

I wish there were a mechanism whereby those who were willing to pay
for it could get really good software for Unix too.  Distributed programming
in the middle of the night is not really conducive to projects
that require major manpower...  Lord! my pcc was dumb when I first
got it!  Ugly too.

But it's true DEC would never have let me at the source... but it's
also true I wouldn't have needed to...


Garry Wiegand     (Cornell Engineering // Flying Moose Graphics)
garry%geology@cu-arpa.cs.cornell.edu  (arpa)

(Tho' it may not sound like it, I have no official connection to DEC.)

jsdy@hadron.UUCP (Joseph S. D. Yao) (11/20/85)

One of the original purposes of the C compiler was to compile
a then-obscure operating system on the PDP-7 and PDP-11.  In
device drivers for this OS, some routines read like:
	send(c)
	{
		LPADDR->word = WRITE | GO;
		LPADDR->word = c;
		c = LPADDR->word;
		return(c);
	}
or words to that effect.  I would get quite upset if optimising
this led to:
	send(c)
	{
		return(LPADDR->word = c);
	}
Remember, this was before we had implemented compilers with the
"const" and "var" keywords.  (Oh, you mean it's still before?
Re-calibrate that wayback, would you, Sherman?)

I think my memory on that other keyword is wrong, but I'll post
this anyway for the general content.
-- 

	Joe Yao		hadron!jsdy@seismo.{CSS.GOV,ARPA,UUCP}

jsdy@hadron.UUCP (Joseph S. D. Yao) (11/20/85)

In article <85@hadron.UUCP> jsdy@hadron.UUCP (me) writes:
>"const" and "var" keywords.  (Oh, you mean it's still before?
>I think my memory on that other keyword is wrong, ...

All flamers and memory potion salesmen, please stop in your tracks.
It comes to me that "volatile" is the C keyword I sought.  ("Var" is from
PASCAL and several other languages, and means something quite different.)
			(*sigh*)
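
For the record, with a compiler that honors the keyword, the driver example
would declare the device register volatile, which forbids exactly the
collapsing I was worried about (a sketch, with the device address and the
command bits made up for illustration):

	struct lpregs {
		volatile int word;	/* device register: every access matters */
	};
	#define	LPADDR	((struct lpregs *) 0177514)	/* hypothetical address */
	#define	WRITE	01				/* hypothetical bits */
	#define	GO	02

	send(c)
	int c;
	{
		LPADDR->word = WRITE | GO;	/* this store must be performed */
		LPADDR->word = c;		/* ...and so must this one */
		c = LPADDR->word;		/* ...and this load may not be folded */
		return(c);
	}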
-- 

	Joe Yao		hadron!jsdy@seismo.{CSS.GOV,ARPA,UUCP}