dvk@mi-cec.UUCP (Dan Klein) (02/14/84)
This is a continuation of my diatribe on "C doesn't optimize, it neatens".  In
this and other articles, I compare a true optimizing compiler (Bliss-32 running
under VMS) to a code neatener (C running under BSD 4.1c).  Any and all
counterexamples are welcome.  However, this is NOT a comparison of the
languages.  Both C and Bliss have their good and bad points.  This is simply a
comparison of the code they generate.  In all examples, the source code and
uncensored assembly code are presented, and the C source and Bliss source are
as nearly identical as language differences permit.  I have not taken advantage
of any "tricks" to get either language to perform better or worse than the
other.  The optimizer was enabled for both languages.

	-Dan Klein, Mellon Institute, Pittsburgh  (412)578-3382

=============================================================================

In this comparison, I present a test which selects between two routines, each
of which is passed identical arguments.  Since exactly one of the two routines
is called (i.e. flow analysis can readily determine that there is no condition
under which neither or both routines are called), a great optimizer would first
push all the arguments onto the stack, then test the condition "t", and call
the appropriate routine.  Regrettably, even Bliss is not smart enough to do
this.  However, it is smarter than C in that:

1)  It uses two MOVQ instructions to load the procedure arguments instead of
    four PUSHL instructions.  This is a space optimization, but it also gives
    a small speed optimization, since there are two fewer instructions to
    execute.  If Bliss were assured of the target processor, it could use a
    MOVO (move octaword).  However, the 11/730 doesn't support that
    instruction, so Bliss avoids it.

2)  C uses a common exit point from the routine.  Since no cleanup code is
    needed, the "jbr L19" is wasteful.  Bliss instead simply generates two RET
    instructions where required.  This is a drawback of the peephole optimizer
    that C uses: it can eliminate jumps to jumps, but it does not eliminate
    jumps to returns.  It should, since that optimization is both a speed and
    a space improvement.

----------------------------------------+-------------------------------------
external routine alpha;                 | extern int alpha();
external routine beta;                  | extern int beta();
                                        |
routine test(p,q,r,s,t) : novalue =     | test(p,q,r,s,t)
begin                                   | {
    if .t neq 0                         |     if (t!=0)
    then alpha(.p,.q,.r,.s)             |         alpha(p,q,r,s);
    else beta(.p,.q,.r,.s);             |     else
end;                                    |         beta(p,q,r,s);
                                        | }
                                        |
        .TITLE  FOO                     |         .data
                                        |         .text
        .PSECT  $CODE$,NOWRT,2          | LL0:    .align  1
                                        |         .globl  _test
TEST:   .WORD   ^M<>                    |         .set    L14,0x0
        TSTL    20(AP)                  |         .data
        BEQL    1$                      |         .text
        MOVQ    12(AP), -(SP)           | _test:  .word   L14
        MOVQ    4(AP), -(SP)            |         tstl    20(ap)
        CALLS   #4, W^ALPHA             |         jeql    L18
        RET                             |         pushl   16(ap)
1$:     MOVQ    12(AP), -(SP)           |         pushl   12(ap)
        MOVQ    4(AP), -(SP)            |         pushl   8(ap)
        CALLS   #4, W^BETA              |         pushl   4(ap)
        RET                             |         calls   $4,_alpha
                                        |         jbr     L19
                                        | L18:    pushl   16(ap)
                                        |         pushl   12(ap)
                                        |         pushl   8(ap)
                                        |         pushl   4(ap)
                                        |         calls   $4,_beta
                                        | L19:    ret
dvk@mi-cec.UUCP (Dan Klein) (02/14/84)
In this example I show what the compilers do when given two blocks of code
that are nearly identical (i.e. they are "tail identical").  Only the head is
different (namely whether the first argument pushed to "alpha" is a 0 or a 1).
Both compilers behave well in that they share the tail-identical code between
the two conditions.  The only complaint I have against C is that it uses four
PUSHL instructions to pass the remaining arguments, rather than the more
speed- and space-efficient two MOVQ instructions.  If you were sure you were
not going to run on an 11/730, you could further reduce this to a single MOVO
instruction, but that is an unreasonable request.

----------------------------------------+-------------------------------------
external routine alpha;                 | extern int alpha();
                                        |
routine test(p,q,r,s,t) : NoValue =     | test(p,q,r,s,t)
begin                                   | {
                                        |     if (t!=0)
    if .t neq 0 then                    |         alpha(p,q,r,s,0);
        alpha(.p,.q,.r,.s,0)            |     else
    else                                |         alpha(p,q,r,s,1);
        alpha(.p,.q,.r,.s,1);           | }
end;                                    |
                                        |
        .TITLE  FOO                     |         .data
                                        |         .text
        .EXTRN  ALPHA                   | LL0:    .align  1
                                        |         .globl  _test
        .PSECT  $CODE$,NOWRT,2          |         .set    L13,0x0
                                        |         .data
TEST:   .WORD   ^M<>                    |         .text
        TSTL    20(AP)                  | _test:  .word   L13
        BEQL    1$                      |         tstl    20(ap)
        CLRL    -(SP)                   |         jeql    L17
        BRB     2$                      |         pushl   $0
1$:     PUSHL   #1                      |         jbr     L200004
2$:     MOVQ    12(AP), -(SP)           | L17:    pushl   $1
        MOVQ    4(AP), -(SP)            | L200004:pushl   16(ap)
        CALLS   #5, W^ALPHA             |         pushl   12(ap)
        RET                             |         pushl   8(ap)
                                        |         pushl   4(ap)
                                        |         calls   $5,_alpha
                                        |         ret
dvk@mi-cec.UUCP (Dan Klein) (02/14/84)
In this example, I demonstrate the ability (and lack thereof) of the compilers
to extract loop-invariant code from the body of a loop.  This is a technique
we were all taught in Programming 1.  The Bliss compiler knows about this
technique, and does it for you when you forget (or when it is less elegant to
create a temporary variable to hold the invariant value).  What I do here is
loop on "i" from 0 to "(5+a)/2", and do *nothing* in the body of the loop.
The invariant expression is "(5+a)/2".  It doesn't take a genius to see that
that value will never change in the loop, especially since the loop does
nothing at all (let alone reference "a").  This is a very simple example of
loop-invariant code; Bliss can recognize more complex examples than this.

Neither compiler eliminates the loop altogether.  This is a religious issue:
is the loop needed at all if you know you aren't going to do anything in it?
There is no "right" answer to that question, since it is really very
application dependent (i.e. do you really want to ignore software timing
delays?).  So, on to the comparison:

1)  Bliss recognizes the loop-invariant section of the loop and evaluates it
    once, before the loop is executed.  Thereafter, it does not need to
    reevaluate the expression.  The C compiler, on the other hand, evaluates
    the limit expression before each pass of the loop.  This is both
    computationally redundant and speed inefficient.

2)  Bliss uses AOBLEQ (Add One and Branch if Less or Equal) to effect the
    loop.  The C compiler (after having recalculated the limit expression)
    uses an "incl" / "cmpl" / "jleq" combination.  This is less efficient in
    both speed and space.

3)  Both compilers need a temporary in which to compute the limit expression,
    and both use a register for it (R1 for Bliss, r0 for C).  The loop index
    is another matter: Bliss keeps it in R0, while the C compiler allocates a
    stack location for it.  This is a poor choice on the part of C, since
    stack accesses take far longer than register accesses and require more
    bytes of assembly code.  The register "r1" is available (and does not
    need to be preserved on routine entry), so C should use it.

4)  Moreover, the C compiler allocates that single stack variable in the
    wrong way.  It emits "subl2 $4,sp" / "clrl -4(fp)" when it could much
    more efficiently do "clrl -(sp)".  Thereafter it refers to the variable
    as "-4(fp)" when it could use "(sp)"; the latter takes one byte instead
    of two.  However, as mentioned in 3) above, using "r1" is better all
    around.

5)  The Bliss compiler sets the loop index to one less than its starting
    value and immediately increments it (i.e. the loop increment is at the
    top of the loop).  The loop increment in C is also at the top of the
    loop, but C sets the variable to the value it wants to start at, skips
    the increment the first time through, and executes it on every pass
    thereafter.  For loop increments that are complex (i.e. involve pointer
    dereferencing), this is a reasonable approach.  However, for simple
    increments (like "i++"), the code is wasteful.

----------------------------------------+-------------------------------------
routine test(a) : novalue =             | test(a)
begin                                   | int a;
                                        | {
    incr i from 0 to (5+.a)/2 do ;      |     int i;
                                        |
end;                                    |     for (i=0; i<=(5+a)/2; i++) ;
                                        | }
                                        |
        .TITLE  FOO                     |         .data
                                        |         .text
        .PSECT  $CODE$,NOWRT,2          | LL0:    .align  1
                                        |         .globl  _test
TEST:   .WORD   ^M<>                    |         .set    L12,0x0
        ADDL3   #5, 4(AP), R1           |         .data
        DIVL2   #2, R1                  |         .text
        MNEGL   #1, R0                  | _test:  .word   L12
1$:     AOBLEQ  R1, R0, 1$              |         subl2   $4,sp
        RET                             |         clrl    -4(fp)
                                        |         jbr     L18
                                        | L200001:incl    -4(fp)
                                        | L18:    addl3   $5,4(ap),r0
                                        |         divl2   $2,r0
                                        |         cmpl    -4(fp),r0
                                        |         jleq    L200001
                                        |         ret
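For reference, this is the hand-hoisted C equivalent of what the Bliss
compiler does automatically; the temporary "limit" is of course my own
invention, not anything the compiler emits:

	/* Loop-invariant code motion done by hand: the limit expression
	 * (5+a)/2 is evaluated once, before the loop, instead of on every
	 * iteration of the loop.
	 */
	test(a)
	int a;
	{
		int i;
		int limit = (5+a)/2;	/* hand-made temporary */

		for (i = 0; i <= limit; i++) ;
	}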
dvk@mi-cec.UUCP (Dan Klein) (02/14/84)
In this example I show how the compilers allocate local variables.  The test
code declares two local variables, "a" and "b".  By using flow analysis it can
be seen that the uses of "a" and "b" do not overlap.  Now a really "great"
optimizer would not bother moving the constant values into the local variables
and then pushing them on the stack (since it can also be seen that the only
use of each local variable is as a place holder).  However, both compilers can
be forgiven, in that it is possible for a debugger to tell you the values of
"a" and "b" if they occupy a location, while it is very difficult to make a
debugger smart enough to look on the stack during a partially executed routine
call.  The comparison:

1)  C is not smart enough to recognize that the uses of the variables do not
    overlap, and instead allocates two locations (on the stack) for them.
    Bliss, on the other hand, recognizes this fact and shares one location
    between the two variables.  DON'T tell me C does it so the debugger can
    distinguish between the two variables.  If the VMS debugger is smart
    enough to tell the difference, then so can the Unix debuggers.

2)  C allocates the variables on the stack, while Bliss chooses to use
    register 0.  Even if Bliss had used two registers, it still could have
    used R1 and R0.  C should too, but instead it plods around playing with
    the stack.  The C code is larger, and also substantially slower.

----------------------------------------+-------------------------------------
external routine alpha;                 | extern alpha();
                                        |
routine test(parm) : NoValue =          | test()
begin                                   | {
    local a,b;                          |     int a, b;
                                        |
    a = 2;                              |     a = 2;
    alpha(.a);                          |     alpha(a);
    b = 5;                              |     b = 5;
    alpha(.b);                          |     alpha(b);
end;                                    | }
                                        |
        .TITLE  FOO                     |         .data
                                        |         .text
        .EXTRN  ALPHA                   | LL0:    .align  1
                                        |         .globl  _test
        .PSECT  $CODE$,NOWRT,2          |         .set    L13,0x0
                                        |         .data
TEST:   .WORD   ^M<>                    |         .text
        MOVL    #2, R0                  | _test:  .word   L13
        PUSHL   R0                      |         subl2   $8,sp
        CALLS   #1, W^ALPHA             |         movl    $2,-4(fp)
        MOVL    #5, R0                  |         pushl   -4(fp)
        PUSHL   R0                      |         calls   $1,_alpha
        CALLS   #1, W^ALPHA             |         movl    $5,-8(fp)
        RET                             |         pushl   -8(fp)
                                        |         calls   $1,_alpha
                                        |         ret
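For what it's worth, pcc does honor "register" declarations, so the stack
traffic can be avoided at the source level.  This is only a workaround (the
whole point is that the compiler should do it by itself), and it assumes the
routine has registers to spare:

	extern alpha();

	test()
	{
		register int a, b;	/* pcc puts register variables in
					   r11 downward, not in the frame */

		a = 2;
		alpha(a);
		b = 5;
		alpha(b);
	}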
dvk@mi-cec.UUCP (Dan Klein) (02/14/84)
In this example I show what the compilers do when they find unused local
variables.  Simply put, if a local variable is unused, there is no reason to
allocate space for it; the time spent allocating the space is a waste.  The
Bliss compiler knows that both "a" and "b" are not used, so no space is
reserved for them.  The C compiler, on the other hand, subtracts 8 from the
stack pointer and then never touches the space.  Why bother?  (The debugger
can always be told the variable is not allocated.)  The removal of unused
locals should be a transparent operation.  Instead, the C compiler needs the
help of "lint" to warn you about it.  Bogus!

----------------------------------------+-------------------------------------
external routine alpha;                 | extern alpha();
                                        |
routine test(parm) : NoValue =          | test(parm)
begin                                   | {
    local a,b;                          |     int a, b;
                                        |
    alpha(.parm);                       |     alpha(parm);
end;                                    | }
                                        |
        .TITLE  FOO                     |         .data
                                        |         .text
        .EXTRN  ALPHA                   | LL0:    .align  1
                                        |         .globl  _test
        .PSECT  $CODE$,NOWRT,2          |         .set    L13,0x0
                                        |         .data
TEST:   .WORD   ^M<>                    |         .text
        PUSHL   4(AP)                   | _test:  .word   L13
        CALLS   #1, W^ALPHA             |         subl2   $8,sp
        RET                             |         pushl   4(ap)
                                        |         calls   $1,_alpha
                                        |         ret
dvk@mi-cec.UUCP (Dan Klein) (02/14/84)
In this example, I show how both compilers are smart enough to do compile-time
testing of constant expressions.  This is included to be fair to C, which does
a generally "good" job of generating code.  It just ain't "great".  As can be
seen, the generated code is effectively equivalent.  However, here I raise a
gripe about the template nature of the C compiler.  Since it assumes there to
be a data and a text portion to every routine/program, it emits a ".data"
pseudo-op, outputs the data, emits a ".text" pseudo-op, and outputs the text.
What results is two (count 'em) superfluous ".data" / ".text" pseudo-ops.
Now, this doesn't slow down the assembler a whole lot, but on a loaded system,
every little bit helps.  Likewise, where in the hell is the label "LL0" ever
used?  If you don't use it (and certainly, with a name like "LL0", it is
local), don't emit it.  The same thing holds true for the routine entry mask
being stored in "L13".  What ever for?  You only use it once, so why clutter
the symbol table with it?  C takes 10 lines of assembly code to do what Bliss
does in 5.

----------------------------------------+-------------------------------------
literal P1 = 3,                         | #define P1 3
        P2 = 5;                         | #define P2 5
                                        |
routine test =                          | extern int alpha();
begin                                   |
    if P1 lss P2                        | test()
    then return 0                       | {
    else return 1                       |     if (P1 < P2)
end;                                    |         return 0;
                                        |     else
                                        |         return 1;
                                        | }
                                        |
        .TITLE  FOO                     |         .data
                                        |         .text
        .PSECT  $CODE$,NOWRT,2          | LL0:    .align  1
                                        |         .globl  _test
TEST:   .WORD   ^M<>                    |         .set    L13,0x0
        CLRL    R0                      |         .data
        RET                             |         .text
                                        | _test:  .word   L13
                                        |         clrl    r0
                                        |         ret
dvk@mi-cec.UUCP (Dan Klein) (02/14/84)
In this example I show what happens when local variables are initialized and
compared.  I have to admit I was disappointed with Bliss here.  A "great"
optimizer should recognize that the local variables are initialized, never
modified, and then compared; this should reduce to a compile-time test, and no
intermediate code should be generated.  Alas, both C and Bliss do the test at
run time.  The same old complaint against C shows up here, though: it uses
stack locals when it should use "r0" and "r1".  To be fair, other than that,
the code is the same for both compilers.

----------------------------------------+-------------------------------------
literal P1 = 3,                         | #define P1 3
        P2 = 5;                         | #define P2 5
                                        |
routine test =                          | extern int alpha();
begin                                   |
    local loc1 : initial(P1),           | test()
          loc2 : initial(P2);           | {
                                        |     int loc1 = P1,
    if .loc1 lss .loc2                  |         loc2 = P2;
    then return 0                       |
    else return 1                       |     if (loc1 < loc2)
end;                                    |         return 0;
                                        |     else
                                        |         return 1;
                                        | }
                                        |
        .TITLE  FOO                     |         .data
                                        |         .text
        .PSECT  $CODE$,NOWRT,2          | LL0:    .align  1
                                        |         .globl  _test
TEST:   .WORD   ^M<>                    |         .set    L13,0x0
        MOVL    #3, R1                  |         .data
        MOVL    #5, R0                  |         .text
        CMPL    R1, R0                  | _test:  .word   L13
        BGEQ    1$                      |         subl2   $8,sp
        CLRL    R0                      |         movl    $3,-4(fp)
        RET                             |         movl    $5,-8(fp)
1$:     MOVL    #1, R0                  |         cmpl    -4(fp),-8(fp)
        RET                             |         jgeq    L17
                                        |         clrl    r0
                                        |         ret
                                        | L17:    movl    $1,r0
                                        |         ret
dvk@mi-cec.UUCP (Dan Klein) (02/14/84)
In this example, I construct a structure that consists of a pointer and a data
part.  The operation being performed is that of moving the data part of one
component to the data part of the component immediately preceding it in the
linked list.  The mechanism is more naturally expressed in C, but Bliss
outperforms it even in light of the "irregular" constraints placed on the code
generator.  Since the Bliss optimizer has some flow analysis built into it, it
is able to eliminate common subexpressions from complex statements or groups
of statements.  In this case, the common subexpression (using C syntax) is the
pointer chain "a->p->p->p->".  Bliss recognizes this commonality and evaluates
the chain only once.  Since C, on the other hand, has only a peephole
optimizer to work with, it fails to see this common subexpression and
evaluates both "a->p->p->p->d" and "a->p->p->p->p->d" completely.  Without a
doubt, the Bliss-generated code is both more speed and space efficient than
the C-generated code.

----------------------------------------+-------------------------------------
switches structure(ref block[2]);       | typedef struct dp *DP;
field CP =                              | struct dp {
    set                                 |     int d;
    d = [0,0,%BPVAL,0],                 |     DP p;
    p = [1,0,%BPADDR,0]                 | }
    tes;                                |
                                        | test(a)
routine test(a) : NoValue =             | DP a;
begin                                   | {
    map a : ref block[2] field(CP);     |     a->p->p->p->d =
    a[p][p][p][d] = .a[p][p][p][p][d];  |         a->p->p->p->p->d;
end;                                    | }
                                        |
        .TITLE  FOO                     |         .data
                                        |         .text
        .PSECT  $CODE$,NOWRT,2          | LL0:    .align  1
                                        |         .globl  _test
TEST:   .WORD   ^M<>                    |         .lcomm  L16,8
        MOVL    4(AP), R0               |         .set    L12,0x0
        MOVL    4(R0), R0               |         .data
        MOVL    4(R0), R0               |         .text
        MOVL    4(R0), R0               | _test:  .word   L12
        MOVL    @4(R0), (R0)            |         movl    4(ap),r0
        RET                             |         movl    4(r0),r0
                                        |         movl    4(r0),r0
                                        |         movl    4(r0),r0
                                        |         movl    4(ap),r1
                                        |         movl    4(r1),r1
                                        |         movl    4(r1),r1
                                        |         movl    *4(r0),*4(r1)
                                        |         movab   L16,r1
                                        |         movq    (r0),(r1)
                                        |         movab   L16,r0
                                        |         ret
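To show what the common-subexpression elimination buys, here is the same
routine with the shared pointer chain factored out by hand.  The temporary "q"
is my own; it is essentially what the Bliss-generated code keeps in R0:

	typedef struct dp *DP;
	struct dp {
		int d;
		DP p;
	};

	test(a)
	DP a;
	{
		register DP q = a->p->p->p;	/* the chain, evaluated once */

		q->d = q->p->d;
	}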
keesan@bbncca.ARPA (Morris Keesan) (02/17/84)
This particular example is fatuous.  You complain about the C compiler
emitting unnecessary assembler pseudo-ops and labels, because they "slow the
assembler down", and "on a loaded system, every little bit helps."  This is
totally silly.  The extra code and processing in the C compiler to eliminate
extra lines of assembler code are likely to put as much of a load on a system
as the extra processing in the assembler.
-- 
Morris M. Keesan
{decvax,linus,wjh12,ima}!bbncca!keesan
keesan @ BBN-UNIX.ARPA
dvk@mi-cec.UUCP (Dan Klein) (02/19/84)
A response to Morris Keesan:  Having a compiler discard instructions that are
superfluous is not a "fatuous" desire.  If you go ahead and count the number
of instructions that are required to suppress a ".data" / ".text" pair
(probably on the order of 10 instructions), and then add up the amount of code
required to write the things out, go through system buffers, get written out
to the disk, read back again by the next pass, moved back through the buffers,
scanned, checked in lookup tables, evaluated, (ignored), and then add in the
(minimum) 5 or 6 context swaps, you begin to see that removing the
instructions is **FAR** cheaper than writing them out.  Not only that, but if
I/O is blocked, the process is eligible to be swapped out, which slows you
down even more.  No, the example was not fatuous.  Try running on a system
with a load of 20 (it happens a *lot* on cmucsg), and you tell me which
approach wins.

	-Dan Klein, Mellon Institute, Pittsburgh
ka@hou3c.UUCP (Kenneth Almquist) (02/22/84)
This example also shows what is really a bug in the C compiler.  After
executing the assignment, it (unnecessarily) takes the value of *(a->p->p).
The System V compilers for the VAX and the 3B20 do the same thing.

				Kenneth Almquist

P.S.  For those of you who don't have the original article, the C code was:

	typedef struct dp *DP;
	struct dp {
		int d;
		DP p;
	}

	test(a)
	DP a;
	{
		a->p->p->p->d = a->p->p->p->p->d;
	}
scs@foxvax1.UUCP (S.C. Schwarm ) (02/22/84)
To add my two cents on unnecessary labels and directives: the problem is with
the speed of the DEC Macro assemblers.  They are, and always have been, VERY
SLOW; therefore a compiler writer never uses the standard one.  She/he
normally must generate object code directly and build an object-code displayer
to print that code out.  I really don't know if BLISS-32 does it this way, but
FORTRAN and Pascal on VMS do.

Otherwise, I think this work is very interesting.

	Steve Schwarm
	Foxboro Co.
	genrad!wjh12!foxvax1!scs
grahamr@azure.UUCP (Graham Ross) (02/22/84)
Of course it's fatuous.  The C compiler I use is one-pass.  (Try /lib/ccom as
an interactive filter on the VAX.)  Deciding whether or not to emit .text and
.data directives would take a great deal of work; the general fix implies some
sort of two-pass operation or backstashing.

Similarly, L13 is put in the symbol table for the register mask simply because
the register mask word must be emitted at the function's entry point, and the
compiler doesn't know how many registers the function destroys until it
reaches the function's end.  The .set for L13 gets moved to a place before the
.word directive by the optimizer.  (Neatener if you want -- *optimum* is Latin
for "best", and Klein's examples show many Bliss failings too.)

I'd like to comment that I don't often pass adjacent (in memory) data objects
to subroutines -- that is, the pushl v. movq argument strikes me as specious.
chris@umcp-cs.UUCP (02/22/84)
>From: ka@hou3c.UUCP (Kenneth Almquist)
>This example also shows what is really a bug in the C compiler.  After
>executing the assignment, it (unnecessarily) takes the value of
>*(a->p->p)....  the C code was: ...
>	typedef struct dp *DP;
>	struct dp {
>		int d;
>		DP p;
>	}
>
>	test(a)
>	DP a;
>	{ ...

Nope.  It's a bug in the C code.  Look closely at the "struct" declaration:
no trailing semicolon.  That makes "test" a structure-valued function.  I
missed that the first time too.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci
UUCP:	{seismo,allegra,brl-bmd}!umcp-cs!chris
CSNet:	chris@umcp-cs
ARPA:	chris.umcp-cs@CSNet-Relay
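With the trailing semicolon restored, "test" is an ordinary function again;
for reference, the repaired declaration reads:

	typedef struct dp *DP;
	struct dp {
		int d;
		DP p;
	};			/* <-- the missing semicolon */

	test(a)
	DP a;
	{
		a->p->p->p->d = a->p->p->p->p->d;
	}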
guy@rlgvax.UUCP (Guy Harris) (02/23/84)
A few comments:

1)  MOVO is not even guaranteed to be available on an 11/750 or 11/780.  Its
    "real" name is MOVH (it moves an H-floating value), and it is only
    provided with the G/H-floating option.  (I agree, though, that catching
    the two "pushl"s and converting them to a "movq" would be good.)

2)  Dan is right in that the extra code to eliminate redundant ".data" and
    ".text" operations would probably be extremely minor (checking
    "pcc/code.c", there are only a few places that emit them, and wrapping a
    test of a state variable and a set/clear of that state variable around
    them would seem to be all it would take), but I didn't check it in detail
    to see if there are any hidden "gotcha"s.  Morris is right in that it
    probably doesn't make enough of a difference to worry about; yes, penny
    candy is far cheaper than 10-cent candy, but it hardly matters unless
    your allowance is 10 cents/week.  The extra I/O overhead is
    (7 * number of extraneous pseudo-ops)/(buffer size) -- about 7 characters
    per pseudo-op -- which means that it takes about 73 extraneous pseudo-ops
    to require one extra "write" or "read" system call; percentagewise, this
    is probably a drop in the bucket.  The only other overhead is that of the
    calls to "printf" which write the extra pseudo-ops (negligible,
    considering all the other things compilers do) and the code in the
    assembler to process them (probably the most significant overhead of them
    all, and I suspect it is still small, percentagewise).  Trimming the
    extra .data/.text isn't the place to put your effort if you're trying to
    improve the performance of your machine or your compiler.

3)  Even though this is a comparison of implementations, not languages, the
    point is well taken that perhaps better language implementations under
    UNIX should be considered.  There is resistance to UNIX among people
    doing scientific computation, because its FORTRAN 77 implementation
    generates code that produces correct answers, but doesn't produce them in
    anything near the most CPU-efficient manner (and, believe me, CPU
    efficiency *counts* in that environment!).  I suspect the relative lack
    of development of the code quality of UNIX compilers is due to the
    developers' greater interest in portability than in code quality, and to
    the code quality of those compilers being judged adequate by their users.

	Guy Harris
	{seismo,ihnp4,allegra}!rlgvax!guy
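To spell out the buffer arithmetic (the 512-byte buffer size is my assumption;
it's simply the figure that makes the 73 come out):

	#include <stdio.h>

	/* Each extraneous pseudo-op is about 7 characters including the
	 * newline (".data\n" or ".text\n"), so this computes how many of
	 * them it takes to fill one assumed 512-byte I/O buffer, i.e. to
	 * cost one extra read/write pair.
	 */
	main()
	{
		int chars_per_pseudo_op = 7;
		int buffer_size = 512;		/* assumed buffer size */

		printf("%d\n", buffer_size / chars_per_pseudo_op);  /* 73 */
		return 0;
	}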
mp@whuxle.UUCP (Mark Plotnick) (02/23/84)
Dan Klein's comparison between C and Bliss really compared pcc with BLISS-32.
Using such results to point out the inferior aspects of C is like saying
UNIX's user interface is horrible based on one's experiences with version 6
and 3bsd.

I compiled Dan's test programs on the DEC VMS C compiler, version 1.0-09.
This compiler shares its back end with other highly optimizing VMS compilers
such as PL/1.  The generated code is in some cases better than BLISS-32's,
and in some cases it's worse than pcc's.

Program 1:

	extern int alpha();
	extern int beta();

	test(p,q,r,s,t)
	{
		if (t!=0)
			alpha(p,q,r,s);
		else
			beta(p,q,r,s);
	}

	test:	.entry	test,^m<r2>
		moval	(ap),r2
		tstl	14(r2)
		beql	vcg.1
		pushl	10(r2)
		pushl	0C(r2)
		pushl	08(r2)
		pushl	04(r2)
		calls	#4,alpha
		ret
	vcg.1:	pushl	10(r2)
		pushl	0C(r2)
		pushl	08(r2)
		pushl	04(r2)
		calls	#4,beta
		ret

The code here is about the same as pcc's.  Note that VMS C likes to keep the
address of the base of the argument list in a register, even though it
doesn't really buy anything.

Program 2:

	extern int alpha();

	test(p,q,r,s,t)
	{
		if (t!=0)
			alpha(p,q,r,s,0);
		else
			alpha(p,q,r,s,1);
	}

	test:	.entry	test,^m<r2>
		moval	(ap),r2
		tstl	14(r2)
		beql	vcg.1
		pushl	#0
		pushl	10(r2)
		pushl	0C(r2)
		pushl	08(r2)
		pushl	04(r2)
		calls	#5,alpha
		ret
	vcg.1:	pushl	#1
		pushl	10(r2)
		pushl	0C(r2)
		pushl	08(r2)
		pushl	04(r2)
		calls	#5,alpha
		ret

Well, VMS C loses here.  It doesn't share identical code sections.

Program 3:

	test(a)
	int a;
	{
		int i;

		for (i=0; i<=(5+a)/2; i++) ;
	}

	test:	.entry	test,^m<r2>
		moval	(ap),r2
		clrl	r1
		addl3	#5,04(r2),r0
		divl2	#2,r0
		cmpl	r1,r0
		bgtr	vcg.2
	vcg.1:	aobleq	r0,r1,vcg.1
	vcg.2:	ret

Almost as good as BLISS-32's.  It doesn't use BLISS-32's neat trick of setting
the value to be one less than the real starting value and flowing into the
aobleq.

Program 4:

	extern alpha();

	test()
	{
		int a, b;

		a = 2;
		alpha(a);
		b = 5;
		alpha(b);
	}

	test:	.entry	test,^m<>
		pushl	#2
		calls	#1,alpha
		pushl	#5
		calls	#1,alpha
		ret

This beats BLISS-32, by pushing the constant values immediately rather than
moving each to a register first.

Program 5:  Identical to BLISS-32's code.

Program 6:  Identical to BLISS-32's code.

Program 7:

	#define P1 3
	#define P2 5

	extern int alpha();

	test()
	{
		int loc1 = P1,
		    loc2 = P2;

		if (loc1 < loc2)
			return 0;
		else
			return 1;
	}

	test:	.entry	test,^m<>
		cmpl	#3,#5
		bgeq	vcg.1
		clrl	r0
		ret
	vcg.1:
		movl	#1,r0
		ret

This beats BLISS-32 in the same way as it did with Program 4 - it uses
constant values in immediate mode rather than moving them to registers first.
I wonder why it didn't just go further and collapse the program down to
"clrl r0 ; ret".

Program 8:

	typedef struct dp *DP;
	struct dp {
		int d;
		DP p;
	}

	test(a)
	DP a;
	{
		a->p->p->p->d = a->p->p->p->p->d;
	}

	test:	.entry	test,^m<>
		movl	08(ap),r0
		movl	04(r0),r0
		movl	04(r0),r0
		movl	04(r0),r0
		movl	@04(r0),(r0)
		ret

Identical to BLISS-32's code, except that the arg is at 8(ap) rather than
4(ap).  This is because the function is structure-valued, and 4(ap) contains
the address of some space set aside in the caller for the return value.

Program 9:

	extern alpha();

	test()
	{
		register int a, b, c;

		a = b = c = 0;
		alpha(a, b, c);
	}

	test:	.entry	test,^m<>
		clrl	ap
		movl	ap,r0
		pushl	#0
		pushl	ap
		pushl	r0
		calls	#3,alpha
		ret

This code looks truly bizarre, and has to do with VMS C's methods of register
allocation.  Variable "c" has been optimized out of the code, "b" resides in
ap (since we don't have any parameters, we can use ap as just another
register), and "a" resides in r0.
Before you all rush out and switch to VMS because of its C compiler, note the
following (all based on version 1.0-09; I don't know if another version has
come out that fixes these problems):

	- It uses registers very heavily, allocating them based on a static
	  analysis of the program.  It also IGNORES "register" declarations
	  in your C code.  This may lead to nonoptimal code, as well as
	  slower procedure startup (because of all the registers that may
	  need to be saved).

	- It tries to minimize code space, but not data space.  The prime
	  example of this is that it always uses a branch table for
	  implementing switches, no matter how sparsely the cases are
	  distributed.  A switch with a "case 0" and a "case 32000" generates
	  an awfully big branch table.

	- It sometimes dies with a fatal parser error when presented with
	  some legal C constructs.

But, it's reasonably fast, comes with a very good manual, and can produce
compilation listings with storage maps, cross references, etc.

	Mark Plotnick
	WH 1C-244 x6955
jss@sjuvax.UUCP (J. Shapiro) (11/16/85)
Some of you may find this one interesting:
Given the following (fairly trivial) C program:
	main()
	{
		int a=5, b=4, c=3, d=2, e=1;

		a = b + c;
		e = c + d + b;
	}
Unoptimized ULTRIX cc and 4.2BSD cc give assembly code which amounts to:
push the constants onto the stack
add b + c, move result to a
add c + d, put it in R0
add b to R0, move result to e
set up return value
exit
Optimizing causes it to use registers a bit more.

Unoptimized VMS cc generates essentially the same code as optimized ULTRIX cc,
though it pushes the constants onto the stack in the opposite order.
Here is the kicker:
Optimized VMS cc generates code which amounts to
set up return value
exit
having concluded that all variables in the system are dead at all points.
Inserting a printf of the values a and e after the assignments results in
(Hang on to your seats):
push 9
push 7
call printf
set up return value
exit
I am afraid I got off into a tangent relating to finding common
subexpressions, and decided to try the above compilation hoping that
some version of the compilation would spot the obvious optimization of
changing the second line to read, in effect:
e = a + d
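Spelled out, this is the version I was hoping some optimizer would effectively
produce (my hand transformation, not any compiler's output):

	main()
	{
		int a=5, b=4, c=3, d=2, e=1;

		a = b + c;
		e = a + d;	/* was: e = c + d + b; reuses the common
				   subexpression b + c already held in a */
	}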
In the process of finding out that no one seems to do this, I have lost
some respect for the UNIX "optimizer".  Now, can anyone mail me a
coherent explanation of why UNIX cc is so inexcusably stupid (at the
moment our postings are going out, but our incoming feed is dead)?
Can someone redeem my respect for the optimizer?
Jon Shapiro
Haverford College
--
Jonathan S. Shapiro
Haverford College
"It doesn't compile pseudo code... What do you expect for fifty
dollars?" - M. Tiemann
pdg@ihdev.UUCP (P. D. Guthrie) (11/18/85)
In article <2539@sjuvax.UUCP> jss@sjuvax.UUCP (J. Shapiro) writes:
>
>In the process of finding out that no one seems to do this, I have lost
>some respect for the UNIX "optimizer".  Now, can anyone mail me a
>coherent explanation of why UNIX cc is so inexcusably stupid (at the
>moment our postings are going out, but our incoming feed is dead)?
>Can someone redeem my respect for the optimizer?

I enjoyed the comparison between the UNIX and VMS optimizers (left out for
brevity), and my only reaction is that it is typical of DEC compilers to
produce super-optimized code.  The Tops-20 Fortran compiler is another
example; it seems to store "synonyms" for the contents of every register, to
leave out unnecessary calculations.  When I was writing a Fortran compiler on
a DEC-20, it seemed at first that the best way to see how to generate code was
to look at the output of DEC's Fortran compiler, but that proved to be
fruitless, as it would use very strange DEC-20 instructions to optimize the
code down to nothing.  I would be interested to know if there have been any
articles published on DEC's optimization methods.

	Paul Guthrie
	ihnp4!ihdev!pdg

PS, sorry to talk about Fortrash in this enlightened newsgroup :-}
jtp@lasspvax.UUCP (Jim Pantaleone) (11/19/85)
In article <403@ihdev.UUCP> pdg@ihdev.UUCP (55224-P. D. Guthrie) writes:
>I would be interested to know if there have been any articles published
>on DEC's optimization methods.

I remember reading a volume from Digital Press a few years ago on the subject
of the creation of DEC's first PL/1 compiler.  It wasn't bad at all - a bit of
"Soul of a New Machine" mixed in with blood and guts only a compiler phreak
could love.  Interesting discussion of how they chose/developed/cooked up the
optimizations that made it into the final product.  [Your local DEC people
should have a pub. catalogue somewhere.]

Point is, rev 1 of the C compiler produced code that had well-optimized code
mixed in with plug-ugly.  And it looked eerily familiar.  I think they hacked
the PL/1!  The C code still isn't as beauteous as the PL/1 (DO-loops where the
index variable can get redefined if that works out better) or Fortran (machine
NOP's interspersed to optimize memory cycling), but the latest version (2.1?)
does actually produce correct, good code.  As far as I know.  And the screen
debugger is (most of the time) at least as much a boon to mankind as the
compiler... now, why aren't ALL the system symbols available in ".H" file
form?

I wish there were a mechanism whereby those that were willing to pay for it
could get really good software for Unix too.  Distributed programming in the
middle of the night is not really conducive to projects that require major
manpower...  Lord! my pcc was dumb when I first got it!  Ugly too.  But it's
true DEC would never have let me at the source... but it's also true I
wouldn't have needed to...

	Garry Wiegand (Cornell Engineering // Flying Moose Graphics)
	garry%geology@cu-arpa.cs.cornell.edu (arpa)

(Tho' it may not sound like it, I have no official connection to DEC.)
jsdy@hadron.UUCP (Joseph S. D. Yao) (11/20/85)
One of the original purposes of the C compiler was to compile a then-obscure
operating system on the PDP-7 and PDP-11.  In device drivers for this OS, some
routines read like:

	send(c)
	{
		LPADDR->word = WRITE | GO;
		LPADDR->word = c;
		c = LPADDR->word;
		return(c);
	}

or words to that effect.  I would get quite upset if optimising this led to:

	send(c)
	{
		return(LPADDR->word = c);
	}

Remember, this was before we had implemented compilers with the "const" and
"var" keywords.  (Oh, you mean it's still before?  Re-calibrate that wayback,
would you, Sherman?)  I think my memory on that other keyword is wrong, but
I'll post this anyway for the general content.
-- 
	Joe Yao		hadron!jsdy@seismo.{CSS.GOV,ARPA,UUCP}
jsdy@hadron.UUCP (Joseph S. D. Yao) (11/20/85)
In article <85@hadron.UUCP> jsdy@hadron.UUCP (me) writes:
>"const" and "var" keywords.  (Oh, you mean it's still before?
>I think my memory on that other keyword is wrong, ...

All flamers and memory potion salesmen, please stop in your tracks.  It comes
to me that "volatile" is the C keyword I sought.  ("Var" is PASCAL and several
other languages for something quite different.)  (*sigh*)
-- 
	Joe Yao		hadron!jsdy@seismo.{CSS.GOV,ARPA,UUCP}
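For the record, here is roughly what the line-printer example looks like once
"volatile" exists.  This is only a sketch: WRITE, GO, and LPADDR are the names
from my earlier posting, and the register layout and device address here are
made up for illustration.

	#define	WRITE	01		/* made-up command bits */
	#define	GO	02

	struct lpregs {
		volatile int word;	/* memory-mapped device register */
	};

	#define	LPADDR	((struct lpregs *) 0177514)  /* made-up address */

	/*
	 * Because "word" is volatile, the compiler must perform all three
	 * accesses below; it may no longer collapse the routine into a
	 * single "return(LPADDR->word = c)".
	 */
	send(c)
	int c;
	{
		LPADDR->word = WRITE | GO;
		LPADDR->word = c;
		c = LPADDR->word;
		return(c);
	}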