[comp.arch] Compiler complexity

jlg@beta.lanl.gov (Jim Giles) (06/18/88)

In article <8717@lll-winken.llnl.gov>, brooks@maddog.llnl.gov (Eugene D. Brooks III) writes:
> [...]
> If main memory, and in particular shared memory in a multiprocessor, is
> 20 to 40 clocks away having eighty or so registers with fully pipelined
> memory access is really a whole lot faster.

Any machine with fully pipelined memory access is faster (registers or not).

This discussion brings up a question about the actual use of CISC
instructions.  It seems that RISC vs. CISC is probably about a draw
for the raw compute speed of specially coded sequences.  The question is:
how many programs actually take full advantage of the large instruction
sets offered by CISC?

A few years back, I did site maintenance on a CFT compiler (Fortran on
the Cray).  I used the experience as an opportunity to learn about the
present state of compiler technology (CFT is very archaic, but I wanted
to learn the whole business).  At that time, the technology simply wasn't
able to automatically produce a good code generator (compiler 'backend').
This meant that code generators always had to be hand-written routines
with lots of ad hoc ideas involved.  The bottom line was: it was simply
EASIER to write a code generator for a RISC-like machine than for a CISC.
(To be sure, some source-level constructs could be turned directly into
CISC instructions.  But this simplistic approach produces very inefficient
code.  The famous case is the VAX polynomial instruction - a tight loop
of simple instructions was ALWAYS faster than the single complex
instruction.  A simplistic code generator might be easier to write, but
it produces unacceptably bad code.)
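
For concreteness, here is a minimal sketch of the kind of "tight loop of
simple instructions" that beat POLY, written as C using Horner's rule.
The coefficient ordering (highest degree first) and the function name are
assumptions made purely for illustration, not a description of POLY's
actual operand format:

        #include <stdio.h>

        /* Horner's rule: one multiply and one add per coefficient.  A
         * compiler turns this into the short loop of simple instructions
         * that outran the single complex instruction. */
        static double poly_eval(double x, const double *coef, int degree)
        {
            double result = coef[0];
            int i;

            for (i = 1; i <= degree; i++)
                result = result * x + coef[i];
            return result;
        }

        int main(void)
        {
            double c[] = { 2.0, -3.0, 1.0 };        /* 2x^2 - 3x + 1 */
            printf("%g\n", poly_eval(4.0, c, 2));   /* prints 21 */
            return 0;
        }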

In more recent years, has the state of the compiler art improved so that
good code generators for CISC machines can be built easily?  Or is it
still a hit or miss operation?  Do modern compilers for CISC make good
use of the variety in the instruction set?

J. Giles
Los Alamos

brooks@maddog.llnl.gov (Eugene D. Brooks III) (06/18/88)

In article <20338@beta.lanl.gov> jlg@beta.lanl.gov (Jim Giles) writes:
>Any machine with fully pipelined memory access is faster (registers or not).

Methinks the point has been missed here!

It's very simple.  Consider the code:
	a = b + c;
	e = f + g;
	h = i + j;
which is admittedly a contrivance, but is useful for illustration.
With a three-register RISC:
	load r0,b;
	load r1,c;
	add  r2,r0,r1;
	load r0,f;
	load r1,g;
	store r2,a
	add  r2,r0,r1;
	load r0,i;
	load r1,j;
	store r2,e
	add  r2,r0,r1;
	store r2,h
Yes, I could do a little better with the code by allowing the add to destroy
an argument, but the basic point is that the low number of registers introduces
"false dependencies" in the register file which prevent re-ordering to take
advantage of pipelining.  With a lot of registers the code becomes:
	load r0,b;
	load r1,c;
	load r3,f;
	load r4,g;
	load r6,i;
	load r7,j;
	add  r2,r0,r1;
	add  r5,r3,r4;
	add  r8,r6,r7;
	store r2,a
	store r5,e
	store r8,h
which is just what the Cerberus compiler does with it.  This code rips along
with an issue rate of one instruction per clock and can mask quite a bit of
memory and functional unit latency.  One could say that we will just specify
that our CISC instruction set has fully pipelined memory access and emit
the code
	add a,b,c
	add e,f,g
	add h,i,j
and it would go "just as fast", but just what must be hidden away in the
hardware?  Enough registers to hold all those temporaries in a way that
does not induce resource dependencies, not to mention an execution unit
(in hardware) which looks ahead many instructions to see if it can start
some of the operations in them.  We call this bug-riddled complexity.
One might as well just expose those registers; they have to be in there.
One might also argue that we ought to take a WM approach and
FIFO everything without registers; but how then do we handle arbitrary
arrival order for memory accesses, which occurs frequently in a multiprocessor?
ANSWER: lots of REGISTERS to implement the FIFOs, so that stalling is
prevented as much as possible.
	
Some careful thought about this will reveal why the ETA 10, which uses
CISC-like memory-to-memory instructions for its vector operations, can't
support these operations directly to the SSD (they call it a shared memory).
It can only support (load/store-like) stride-1 vector accesses when copying
in and out between the local memory and the SSD.

Flames?  Sure, I eat fire for lunch.


							Eugene

peterd@june.cs.washington.edu (Peter Damron) (06/18/88)

In article <20338@beta.lanl.gov>, jlg@beta.lanl.gov (Jim Giles) writes:
> This discussion brings up a question about the actual use of CISC
> instructions.  It seems that RISC vs. CISC is probably about a draw
> for raw compute speed of special coded sequences.  The question is:
> how many programs actually get full advantage from the large instruction
> sets offered by CISC?
> 
> In more recent years, has the state of the compiler art improved so that
> good code generators for CISC machines can be built easily?  Or is it
> still a hit or miss operation?  Do modern compilers for CISC make good
> use of the variety in the instruction set?
> 
> J. Giles
> Los Alamos

The current state of the art in retargetable code generators is such
that the addressing modes of CISC machines can be easily handled
(with the possible exception of addressing modes with side effects).
Complex instructions like the VAX polynomial instruction and string
move instructions are still problematic.

These code generators are based on machine descriptions written
as regular tree grammars and on tree parsing techniques.
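
A minimal sketch of the idea, with invented node kinds and invented
"instructions" (real systems derive the matcher automatically from a
tree grammar rather than hand-coding it like this): expression trees are
tiled bottom-up, and a pattern such as ADD(reg, MEM) lets a CISC-style
code generator fold a memory operand into the add instead of loading it
first.

        #include <stdio.h>

        /* Tiny tree-tiling code generator: an ADD whose right operand
         * is a variable is covered by a CISC-style add-with-memory-
         * operand pattern; everything else falls back to load + add. */
        enum kind { VAR, CONST, ADD };

        struct node {
            enum kind kind;
            const char *name;          /* VAR   */
            int value;                 /* CONST */
            struct node *left, *right; /* ADD   */
        };

        static int next_reg = 0;

        static int gen(struct node *n)   /* returns result register */
        {
            int l, r;

            switch (n->kind) {
            case CONST:
                r = next_reg++;
                printf("\tmovi  r%d,%d\n", r, n->value);
                return r;
            case VAR:
                r = next_reg++;
                printf("\tload  r%d,%s\n", r, n->name);
                return r;
            case ADD:
                l = gen(n->left);
                if (n->right->kind == VAR) {
                    /* pattern ADD(reg, MEM): fold the memory operand */
                    printf("\tadd   r%d,%s\n", l, n->right->name);
                    return l;
                }
                r = gen(n->right);
                printf("\tadd   r%d,r%d\n", l, r);
                return l;
            }
            return -1;
        }

        int main(void)
        {
            struct node b = { VAR, "b" }, c = { VAR, "c" };
            struct node sum = { ADD, 0, 0, &b, &c };
            gen(&sum);        /* emits: load r0,b / add r0,c */
            return 0;
        }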

I have been working on techniques for handling complex instructions
in a retargetable code generator, but it is a hard problem.

Further problems include the integration of register allocation,
instruction selection, and instruction scheduling.

Any further discussion along these lines should probably be moved
to comp.compilers.

Peter Damron
Department of Computer Science
University of Washington
peterd@june.cs.washington.edu
[My impression is that taking advantage of complicated instructions often
adds ugly irregularities to an otherwise clean register model.  For example,
the IBM 370's BXLE and BXH instructions do just the right thing for a
count-by-N loop, but only if the operands are in adjacent registers.  Ditto
the Vax string move instructions.  A separate problem is that in many cases
(again Vax MOVC5 is an example) the instructions do something so complicated
that in a low-level language like C or Fortran you'd have to write a fairly
large chunk of code that the code generator would have to recognize as
sufficiently close to the instruction.  As the RISC crowd points out, these
complicated instructions often end up being slower than the equivalent
operation synthesized out of simpler ops, particularly if some compile-time
optimization is possible, so the only gain is in instruction density.  -John]
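
To make the recognition problem concrete, here is a rough C rendition of
a MOVC5-style "move with fill": copy min(srclen, dstlen) bytes, then pad
the rest of the destination with a fill character.  This is only a sketch
of the source pattern a compiler would have to spot (the function name is
invented), not a precise statement of the instruction's semantics:

        #include <stdio.h>

        /* Rough C rendition of a "move with fill": copy the overlapping
         * bytes, then pad any leftover destination bytes with fill. */
        static void move_with_fill(char *dst, unsigned dstlen,
                                   const char *src, unsigned srclen,
                                   char fill)
        {
            unsigned n = (srclen < dstlen) ? srclen : dstlen;
            unsigned i;

            for (i = 0; i < n; i++)
                dst[i] = src[i];
            for (; i < dstlen; i++)
                dst[i] = fill;
        }

        int main(void)
        {
            char buf[9];
            move_with_fill(buf, 8, "abc", 3, '.');
            buf[8] = '\0';
            printf("%s\n", buf);     /* prints "abc....." */
            return 0;
        }

A code generator would have to prove that a chunk of user code like this
matches the instruction before it could safely emit it.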
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request

hwe@beta.lanl.gov (Skip Egdorf) (06/18/88)

In article <20338@beta.lanl.gov>, jlg@beta.lanl.gov (Jim Giles) writes:
> 
> This discussion brings up a question about the actual use of CISC
> instructions.  It seems that RISC vs. CISC is probably about a draw
> for raw compute speed of special coded sequences.  The question is:
> how many programs actually get full advantage from the large instruction
> sets offered by CISC?
> ...
> In more recent years, has the state of the compiler art improved so that
> good code generators for CISC machines can be built easily?  Or is it
> still a hit or miss operation?  Do modern compilers for CISC make good
> use of the variety in the instruction set?
> 
> J. Giles
> Los Alamos

As I seem to like to dredge up ancient history in this forum, these
thoughts sent me to my dusty closet of 'olde stuff' to fetch volume
1 (I also have V2) of a collection of papers produced by DEC called
"PDP-10 Applications in Science".
I think that I got this sometime in 1970, though most of the papers
seem to be from around 1967-1968.
The third paper in the book is

"Selecting and Evaluating a Medium Scale Computer System"
"A Study Undertaken for Louisiana State University at New Orleans"
by Jon A. Stewart, Director, LSUNO Computer Research Center

The paper is a summary of an in-depth comparison of three major
systems of the time; the SDS Sigma-5, the IBM 360/44, and the
DEC PDP-10. The paper is very fine at bringing back the memories
of the good old days, but of most interest to this discussion is
an appendix containing compiler-generated code for a simple
fortran benchmark.

The upshot was that the PDP-10 generated much shorter code,
that also ran faster. (I can publish a sample if there is
enough interest.)

While this paper drew no architectural conclusions (such as
are discussed in this forum), the main reason for the result (if I
remember my DEC Propaganda of the day) was the PDP-10's very orthogonal
instruction set, as opposed to a smaller number of instructions.

This clean and complete instruction set allowed the compiler
to use a large percentage of it.  I believe the clean instruction
set came first and the compiler that exploited it came later,
rather than following the current RISC idea of profiling code to
determine what will be in the instruction set, but even so it was
a haunting hint of where we would be today.

I recall seeing published statements about how many of the KA-10's
366 instructions were actually used by the FORTRAN compiler.
Does anyone still have these numbers??

I also have some DEC Propaganda from about 1973 bragging about
the optimization done by DEC FORTRAN. This includes examples
of generated code. Again, if anyone is interested...

					Skip Egdorf
					hwe@lanl.gov

samples@dougfir.Berkeley.EDU (A. Dain Samples) (06/21/88)

My $0.02 worth:

In article <1117@ima.ISC.COM> peterd@june.cs.washington.edu (Peter Damron) writes:
>In article <20338@beta.lanl.gov>, jlg@beta.lanl.gov (Jim Giles) writes:
>> This discussion brings up a question about the actual use of CISC
>> instructions.  It seems that RISC vs. CISC is probably about a draw
>> for raw compute speed of special coded sequences.  

This isn't clear at all, although there have been several articles
attempting to clarify the relative speeds of the two architecture styles.

>> [more questions]
>> J. Giles
>> Los Alamos
>
> [Peter's response is correct]
>
>Peter Damron
>Department of Computer Science
>University of Washington
>peterd@june.cs.washington.edu
>[My impression is that taking advantage of complicated instructions often
>adds ugly irregularities to an otherwise clean register model.  
> [essentially correct examples]
>As the RISC crowd points out, these
>complicated instructions often end up being slower than the equivalent
>operation synthesized out of simpler ops, 
           
A slight correction needs to be made here: there is only one instance
that I know of where a complex instruction on a CISC architecture turned
out to run more slowly than the same operation coded with simpler
instructions ON THE SAME MACHINE.  I believe the instruction was one
of the CALL instructions in an early version of the microcode for one
of the smaller VAXen, but I'm not real sure.  The net result was that
DEC fixed the microcode so the performance of the instruction was improved.
[As dmr noted in his recent comp.arch note, it was the 780's calling
instructions.]

A more correct statement of what the RISC crowd is pointing out is captured
in the following question:

	What is the cost of the additional hardware/microcode necessary
	to support complex instructions?

It is almost never the case (although it does sometimes happen) that an
instruction can be added to an existing CPU instruction set without
impacting the design of the hardware.  In general, such an enhancement
requires more control lines, more microcode, and, more rarely, a new
function in the ALU.  The question is then: how much more slowly does
the hardware run as a whole due to the addition of this additional
hardware?  Stated another way: If you had spent as much time on making a 
small instruction set run very fast as you spent making a large instruction 
set run acceptably fast, how much faster would the majority of the programs 
run?

>particularly if some compile-time
>optimization is possible, so the only gain is in instruction density.  -John]

In my compiler-oriented opinion, this comment captures the true winnings
from RISCs.  Compile-time optimization means throwing away as much
generality as you can while preserving the semantics of the operation 
IN CONTEXT.  Since simple operations run very fast on RISCs, and since the
majority of instructions emitted by a compiler (any compiler) are simple
instructions anyway, and since streams of simple instructions are easier to
analyze and improve (although at the cost of having more of them to look
at), the net result is faster programs.  
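
To put a contrived but concrete face on "throwing away generality while
preserving the semantics in context", consider a general clearing routine
and the specialized form a compiler can reduce it to once the length is
known at compile time (the example and names are my own, purely for
illustration):

        #include <stdio.h>

        /* The general routine needs a loop (or a block-clear
         * instruction); once the compiler knows n == 3 it can discard
         * the loop and emit three simple stores. */
        static void clear_general(int *a, int n)
        {
            int i;
            for (i = 0; i < n; i++)
                a[i] = 0;
        }

        static void clear_three(int *a)
        {
            /* what clear_general(a, 3) can be reduced to */
            a[0] = 0;
            a[1] = 0;
            a[2] = 0;
        }

        int main(void)
        {
            int x[3] = { 1, 2, 3 }, y[3] = { 4, 5, 6 };
            clear_general(x, 3);
            clear_three(y);
            printf("%d %d\n", x[2], y[2]);   /* prints 0 0 */
            return 0;
        }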

A. Dain Samples, UC Berkeley, samples@renoir.berkeley.edu, 642-9542, 785-5644


eric@snark.UUCP (Eric S. Raymond) (06/23/88)

In article <1127@ima.isc.com>, samples@dougfir.Berkeley.EDU (A. Dain Samples) writes:
>A slight correction needs to be made here: there is only one instance
>that I know of where a complex instruction on a CISC architecture turned
>out to run more slowly than the same operation coded with simpler
>instructions ON THE SAME MACHINE. 

And the moderator clarifies:
>[As dmr noted in his recent comp.arch note, it was the 780's calling
>instructions.]

Eh? You both forgot the RISC crowd's favorite horror story -- the VAX POLY
instruction for evaluating a polynomial in x in one swell foop, given
x and the coefficients. Turns out this is always slower than a loop using
simpler instructions.
-- 
      Eric S. Raymond                     (the mad mastermind of TMN-Netnews)
      UUCP: {{uunet,rutgers,ihnp4}!cbmvax,rutgers!vu-vlsi,att}!snark!eric
      Post: 22 South Warren Avenue, Malvern, PA 19355   Phone: (215)-296-5718

rwa@auvax.UUCP (Ross Alexander) (06/26/88)

In article <20345@beta.lanl.gov>, hwe@beta.lanl.gov (Skip Egdorf) writes:
> In article <20338@beta.lanl.gov>, jlg@beta.lanl.gov (Jim Giles) writes:
> As I seem to like to dredge up ancient history in this forum, these
> thoughts sent me to my dusty closet of 'olde stuff' to fetch volume
> 1 (I also have V2) of a collection of papers produced by DEC called
> "PDP-10 Applications in Science".
> I think that I got this sometime in 1970, though most of the papers
> seem to be from around 1967-1968. The third paper in the book is
> "Selecting and Evaluating a Medium Scale Computer System"
> "A Study Undertaken for Louisiana State University at New Orleans"
> by Jon A. Stewart, Director, LSUNO Computer Research Center
> The paper is a summary of an in-depth comparison of three major
> systems of the time; the SDS Sigma-5, the IBM 360/44, and the
> DEC PDP-10. The paper is very fine at bringing back the memories
> of the good old days [...]
> 					Skip Egdorf
> 					hwe@lanl.gov

Without prejudice to either Skip E.  or Jon Stewart, I remember that
particular paper well enough; I read it in 1974 but the tone and
conclusions stick with me.  I believe the main weakness of the
evaluation is that it rested on such a small (published) sample.  A
small and somewhat artificial fragment of FORTRAN IV (written in a
FORTRAN II style) is pushed through the compilers for the different
machines; then the generated code is examined in a fashion which
nowadays would be considered naive.

I might add that the compilers were as naive as the analysis;
compiler technology was quite immature in 196[789], and the generated
code has a very student-compiler-project look about it :-).  For
instance, two 16-bit IBM rr-class instructions (quite RISCy in type)
are scored as poorer than a single 36-bit PDP instruction
accomplishing the same result by way of a
register-indirect-with-autoincrement construction.  As well, the IBM
compiler makes no attempt to avoid repeated evaluation of common
subexpressions nor does it do any strength reduction.  One tellingly
correct and important point was that the PDP was considerably less
expensive than the IBM for the same power :-) 
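
For readers who haven't met them, the two optimizations in question look
roughly like this, rendered in C rather than FORTRAN IV; the fragment is
contrived purely for illustration:

        #include <stdio.h>

        #define N 5

        /* "Before": the subexpression i*4 is computed twice per trip. */
        static void unoptimized(int *a, int *b)
        {
            int i;
            for (i = 0; i < N; i++) {
                a[i] = i * 4 + 1;
                b[i] = i * 4 + 2;
            }
        }

        /* "After": common subexpression elimination shares one
         * temporary, and strength reduction turns the multiply
         * into a running increment. */
        static void optimized(int *a, int *b)
        {
            int i, t = 0;                 /* t tracks i*4 */
            for (i = 0; i < N; i++) {
                a[i] = t + 1;
                b[i] = t + 2;
                t += 4;
            }
        }

        int main(void)
        {
            int a1[N], b1[N], a2[N], b2[N];
            unoptimized(a1, b1);
            optimized(a2, b2);
            printf("%d %d\n", a1[N-1], a2[N-1]);   /* both print 17 */
            return 0;
        }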

I myself am neutral; I've programmed both 360's and PDP-11's (similar
in spirit if not implementation details to 10's) in assembler, doing
moderately-sized pieces of code (~5,000 lines).  The 360
architecture, though ugly, can be used in the style of a RISC machine
if one just ignores all the ss-class instructions (such gewgaws as
the decimal or editing instructions); I try to avoid using IBM's ugly
linkage conventions too :-).  The 11 code is much more esthetic to
look at, but I don't think it's much denser or faster for a given
hardware approach than the equivalent 360 binary.  

To sum up, that paper wasn't very different in tone from a lot of
the flaming we see here on the USENET :-) and it's sort of odd to
see it quoted as part of `the literature'.  But it's also sort of fun
to see it come up again :-)

--
Ross Alexander, who plays around with computers at Athabasca
University, Athabasca, Alberta   -  alberta!auvax!rwa

"Assembly of Japanese bicycle require great peace of mind."