jlg@beta.lanl.gov (Jim Giles) (06/18/88)
In article <8717@lll-winken.llnl.gov>, brooks@maddog.llnl.gov (Eugene D. Brooks III) writes:
> [...]
> If main memory, and in particular shared memory in a multiprocessor, is
> 20 to 40 clocks away having eighty or so registers with fully pipelined
> memory access is really a whole lot faster.

Any machine with fully pipelined memory access is faster (registers or not).

This discussion brings up a question about the actual use of CISC
instructions.  It seems that RISC vs. CISC is probably about a draw
for raw compute speed of specially coded sequences.  The question is:
how many programs actually take full advantage of the large instruction
sets offered by CISC?

A few years back, I did site maintenance on a CFT compiler (Fortran on
the Cray).  I used the experience as an opportunity to learn about the
present state of compiler technology (CFT is very archaic, but I wanted
to learn the whole business).  At that time, the technology simply wasn't
able to automatically produce a good code generator (compiler 'backend').
This meant that code generators always had to be hand-made routines with
lots of ad hoc ideas involved.  The bottom line was: it was simply EASIER
to write a code generator for a RISC-like machine than for a CISC.

(To be sure, some source-level constructs could be turned directly into
CISC instructions.  But this simplistic approach produces very inefficient
code.  The famous case is the VAX polynomial instruction - a tight loop
of simple instructions was ALWAYS faster than the single instruction.
A simplistic code generator might be easier to write, but it produces
unacceptably bad code.)

In more recent years, has the state of the compiler art improved so that
good code generators for CISC machines can be built easily?  Or is it
still a hit-or-miss operation?  Do modern compilers for CISC make good
use of the variety in the instruction set?

J. Giles
Los Alamos
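[To make the polynomial comparison concrete, here is a minimal sketch in C,
for illustration only - it is not the code any particular compiler emitted -
of the kind of tight loop being compared with the single POLY instruction:
Horner's rule, one multiply and one add per coefficient.]

    #include <stdio.h>

    /* Evaluate c[0] + c[1]*x + ... + c[n-1]*x^(n-1) by Horner's rule. */
    double poly_eval(const double *c, int n, double x)
    {
        double acc = 0.0;
        int i;
        for (i = n - 1; i >= 0; i--)
            acc = acc * x + c[i];      /* one multiply-add per step */
        return acc;
    }

    int main(void)
    {
        double c[4] = { 1.0, 2.0, 3.0, 4.0 };    /* 1 + 2x + 3x^2 + 4x^3 */
        printf("%g\n", poly_eval(c, 4, 2.0));    /* prints 49 */
        return 0;
    }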
brooks@maddog.llnl.gov (Eugene D. Brooks III) (06/18/88)
In article <20338@beta.lanl.gov> jlg@beta.lanl.gov (Jim Giles) writes:
>Any machine with fully pipelined memory access is faster (registers or not).

Methinks the point has been missed here!  It's very simple.  Consider the code:

	a = b + c;
	e = f + g;
	h = i + j;

which is admittedly a contrivance, but is useful for illustration.  With a
three register RISC:

	load r0,b; load r1,c; add r2,r0,r1; load r0,f; load r1,g; store r2,a
	add r2,r0,r1; load r0,i; load r1,j; store r2,e
	add r2,r0,r1; store r2,h

Yes, I could do a little better with the code by allowing the add to destroy
an argument, but the basic point is that the low number of registers
introduces "false dependencies" in the register file which prevent
re-ordering to take advantage of pipelining.  With a lot of registers the
code becomes:

	load r0,b; load r1,c; load r3,f; load r4,g; load r6,i; load r7,j
	add r2,r0,r1; add r5,r3,r4; add r8,r6,r7
	store r2,a; store r5,e; store r8,h

which is just what the Cerberus compiler does with it.  This code rips along
with an issue rate of one instruction per clock and can mask quite a bit of
memory and functional unit latency.

One could say that we will just specify that our CISC instruction set has
fully pipelined memory access, and emit the code

	add a,b,c
	add e,f,g
	add h,i,j

and it would go "just as fast", but just what must be hidden away in the
hardware?  Enough registers to hold all those temporaries in a way that does
not induce resource dependencies, not to mention an execution unit (in
hardware) which looks ahead many instructions to see if it can start some of
the operations in them.  We call this bug-riddled complexity.  One might as
well just expose those registers; they have to be in there.

One might also argue that we ought to take a WM approach and fifo everything
without registers; how then do we handle arbitrary arrival order for the
memory accesses, which frequently occurs in a multiprocessor?  ANSWER: lots
of REGISTERS to implement the fifos so stalling is prevented as much as
possible.

Some careful thought about this will reveal why the ETA 10, which uses
CISC-like memory-to-memory instructions for its vector operations, can't
support these operations directly to the SSD (they call it a shared memory).
They can only support (load/store like) stride 1 vector accesses when
copying in and out between the local memory and the SSD.

Flames?  Sure, I eat fire for lunch.

				Eugene
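[A toy sketch of the "false dependencies" point - this is an illustration
only, not the Cerberus compiler's actual scheduler.  The rule a scheduler
applies before re-ordering two instructions fits in a few lines of C: a
later instruction may be hoisted above an earlier one only if neither
writes a register the other reads or writes.  Reusing r0 for the next
load trips that rule; a fresh register does not.]

    #include <stdio.h>

    struct insn {
        const char *text;
        int dst;                /* register written, or -1 for a store */
        int src1, src2;         /* registers read, or -1 if unused     */
    };

    static int reads(const struct insn *i, int r)
    {
        return r >= 0 && (i->src1 == r || i->src2 == r);
    }

    /* May 'later' be hoisted above 'earlier'?  Forbid RAW, WAR, WAW hazards. */
    static int can_hoist(const struct insn *earlier, const struct insn *later)
    {
        if (earlier->dst >= 0 && reads(later, earlier->dst))
            return 0;           /* RAW: later reads what earlier writes   */
        if (later->dst >= 0 && reads(earlier, later->dst))
            return 0;           /* WAR: earlier still needs the old value */
        if (later->dst >= 0 && later->dst == earlier->dst)
            return 0;           /* WAW: both write the same register      */
        return 1;
    }

    int main(void)
    {
        struct insn add_bc  = { "add r2,r0,r1", 2,  0,  1 };
        struct insn load_f3 = { "load r0,f",    0, -1, -1 };  /* 3 registers  */
        struct insn load_f9 = { "load r3,f",    3, -1, -1 };  /* many registers */

        printf("hoist '%s' above '%s'? %d\n", load_f3.text, add_bc.text,
               can_hoist(&add_bc, &load_f3));   /* 0: blocked by reuse of r0 */
        printf("hoist '%s' above '%s'? %d\n", load_f9.text, add_bc.text,
               can_hoist(&add_bc, &load_f9));   /* 1: load can issue early   */
        return 0;
    }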
peterd@june.cs.washington.edu (Peter Damron) (06/18/88)
In article <20338@beta.lanl.gov>, jlg@beta.lanl.gov (Jim Giles) writes:
> This discussion brings up a question about the actual use of CISC
> instructions.  It seems that RISC vs. CISC is probably about a draw
> for raw compute speed of specially coded sequences.  The question is:
> how many programs actually take full advantage of the large instruction
> sets offered by CISC?
>
> In more recent years, has the state of the compiler art improved so that
> good code generators for CISC machines can be built easily?  Or is it
> still a hit-or-miss operation?  Do modern compilers for CISC make good
> use of the variety in the instruction set?
>
> J. Giles
> Los Alamos

The current state of the art in retargetable code generators is such that
the addressing modes of CISC machines can be easily handled (with the
possible exception of addressing modes with side effects).  Complex
instructions like the VAX polynomial instruction and string-move
instructions are still problematic.

These code generators are based on machine descriptions written as regular
tree grammars and on tree-parsing techniques.  I have been working on
techniques for handling complex instructions in a retargetable code
generator, but it is a hard problem.  Further problems include the
integration of register allocation, instruction selection, and instruction
scheduling.

Any further discussion along these lines should probably be moved to
comp.compilers.

Peter Damron
Department of Computer Science
University of Washington
peterd@june.cs.washington.edu

[My impression is that taking advantage of complicated instructions often
adds ugly irregularities to an otherwise clean register model.  For example,
the IBM 370's BXLE and BXH instructions do just the right thing for a
count-by-N loop, but only if the operands are in adjacent registers.  Ditto
the Vax string move instructions.  A separate problem is that in many cases
(again Vax MOVC5 is an example) the instructions do something so complicated
that in a low-level language like C or Fortran you'd have to write a fairly
large chunk of code that the code generator would have to recognize as
sufficiently close to the instruction.  As the RISC crowd points out, these
complicated instructions often end up being slower than the equivalent
operation synthesized out of simpler ops, particularly if some compile-time
optimization is possible, so the only gain is in instruction density. -John]
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request
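[For readers who have not seen a tree-grammar code generator, here is a
minimal sketch of the idea in C.  The names and patterns are hypothetical,
not Damron's system: the selector walks an expression tree and covers
MEM(PLUS(reg,const)) with a single base+displacement operand, which is how
a CISC addressing mode gets folded in instead of being computed with a
separate add.]

    #include <stdio.h>

    enum op { CONST, REG, PLUS, MEM };

    struct node {
        enum op op;
        int value;                   /* constant value or register number */
        struct node *left, *right;
    };

    /* Emit one operand; fold the addressing mode in when the pattern fits. */
    static void select_operand(const struct node *n)
    {
        if (n->op == MEM && n->left->op == PLUS &&
            n->left->right->op == CONST && n->left->left->op == REG) {
            printf("\t%d(r%d)\n",                 /* disp(base) operand */
                   n->left->right->value, n->left->left->value);
        } else if (n->op == REG) {
            printf("\tr%d\n", n->value);
        } else {
            printf("\t<emit explicit address computation>\n");
        }
    }

    int main(void)
    {
        /* a[2], with 4-byte elements: MEM(PLUS(REG r1, CONST 8)) */
        struct node c = { CONST, 8, NULL, NULL };
        struct node r = { REG,   1, NULL, NULL };
        struct node p = { PLUS,  0, &r,   &c   };
        struct node m = { MEM,   0, &p,   NULL };
        select_operand(&m);                       /* prints "8(r1)" */
        return 0;
    }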
hwe@beta.lanl.gov (Skip Egdorf) (06/18/88)
In article <20338@beta.lanl.gov>, jlg@beta.lanl.gov (Jim Giles) writes:
>
> This discussion brings up a question about the actual use of CISC
> instructions.  It seems that RISC vs. CISC is probably about a draw
> for raw compute speed of specially coded sequences.  The question is:
> how many programs actually take full advantage of the large instruction
> sets offered by CISC?
> ...
> In more recent years, has the state of the compiler art improved so that
> good code generators for CISC machines can be built easily?  Or is it
> still a hit-or-miss operation?  Do modern compilers for CISC make good
> use of the variety in the instruction set?
>
> J. Giles
> Los Alamos

As I seem to like to dredge up ancient history in this forum, these thoughts
sent me to my dusty closet of 'olde stuff' to fetch volume 1 (I also have V2)
of a collection of papers produced by DEC called "PDP-10 Applications in
Science".  I think that I got this sometime in 1970, though most of the
papers seem to be from around 1967-1968.  The third paper in the book is

	"Selecting and Evaluating a Medium Scale Computer System"
	"A Study Undertaken for Louisiana State University at New Orleans"
	by Jon A. Stewart, Director, LSUNO Computer Research Center

The paper is a summary of an in-depth comparison of three major systems of
the time: the SDS Sigma-5, the IBM 360/44, and the DEC PDP-10.  The paper is
very fine at bringing back the memories of the good old days, but of most
interest to this discussion is an appendix containing compiler-generated
code for a simple Fortran benchmark.  The upshot was that the PDP-10
generated much shorter code, which also ran faster.  (I can publish a sample
if there is enough interest.)

While this paper drew no architectural conclusions (such as are discussed in
this forum), the main reason (if I remember my DEC propaganda of the day)
was the very orthogonal instruction set, as opposed to a smaller number of
instructions.  This clean and complete instruction set allowed the compiler
to use a large percentage of the instruction set.  While I believe that the
clean instruction set came first, and the compiler that used a large
percentage of that instruction set came later - rather than the current RISC
idea of profiling code to determine what will be in the instruction set - it
was a haunting hint of where we would be today.

I recall seeing published statements about how many of the KA-10's 366
instructions were actually used by the FORTRAN compiler.  Does anyone still
have these numbers??

I also have some DEC propaganda from about 1973 bragging about the
optimization done by DEC FORTRAN.  This includes examples of generated code.
Again, if anyone is interested...

					Skip Egdorf
					hwe@lanl.gov
samples@dougfir.Berkeley.EDU (A. Dain Samples) (06/21/88)
My $0.02 worth:

In article <1117@ima.ISC.COM> peterd@june.cs.washington.edu (Peter Damron) writes:
>In article <20338@beta.lanl.gov>, jlg@beta.lanl.gov (Jim Giles) writes:
>> This discussion brings up a question about the actual use of CISC
>> instructions.  It seems that RISC vs. CISC is probably about a draw
>> for raw compute speed of specially coded sequences.

This isn't clear at all, although there have been several articles
attempting to clarify the relative speeds of the two architecture styles.

>> [more questions]
>> J. Giles
>> Los Alamos
>
> [Peter's response is correct]
>
>Peter Damron
>Department of Computer Science
>University of Washington
>peterd@june.cs.washington.edu

>[My impression is that taking advantage of complicated instructions often
>adds ugly irregularities to an otherwise clean register model.
> [essentially correct examples]
>As the RISC crowd points out, these
>complicated instructions often end up being slower than the equivalent
>operation synthesized out of simpler ops,

A slight correction needs to be made here: there is only one instance that I
know of where a complex instruction on a CISC architecture turned out to run
more slowly than the same operation coded with simpler instructions ON THE
SAME MACHINE.  I believe the instruction was one of the CALL instructions in
an early version of the microcode for one of the smaller VAXen, but I'm not
really sure.  The net result was that DEC fixed the microcode so the
performance of the instruction was improved.  [As dmr noted in his recent
comp.arch note, it was the 780's calling instructions.]

A more correct statement of what the RISC crowd is pointing out is captured
in the following question: what is the cost of the additional
hardware/microcode necessary to support complex instructions?  It is almost
never the case (although it does sometimes happen) that an instruction can
be added to an existing CPU instruction set without impacting the design of
the hardware.  In general, such an enhancement requires more control lines,
more microcode, and occasionally a new function in the ALU.  The question is
then: how much more slowly does the hardware run as a whole due to this
additional hardware?  Stated another way: if you had spent as much time
making a small instruction set run very fast as you spent making a large
instruction set run acceptably fast, how much faster would the majority of
programs run?

>particularly if some compile-time
>optimization is possible, so the only gain is in instruction density. -John]

In my compiler-oriented opinion, this comment captures the true winnings
from RISCs.  Compile-time optimization means throwing away as much
generality as you can while preserving the semantics of the operation IN
CONTEXT.  Since simple operations run very fast on RISCs, and since the
majority of instructions emitted by a compiler (any compiler) are simple
instructions anyway, and since streams of simple instructions are easier to
analyze and improve (although at the cost of having more of them to look
at), the net result is faster programs.

A. Dain Samples, UC Berkeley, samples@renoir.berkeley.edu, 642-9542, 785-5644
--
Send compilers articles to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbn}!ima
Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request
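[A small illustration in C of "throwing away generality in context" - the
function names are made up for this sketch.  copy_general is what a
MOVC5-style instruction or library routine must be prepared to do for any
length; copy_eight is the sort of thing a compiler can emit when the length
is a compile-time constant.]

    #include <stddef.h>

    /* General case: any length, loop test and count on every iteration. */
    void copy_general(char *dst, const char *src, size_t n)
    {
        while (n--)
            *dst++ = *src++;
    }

    /* Specialized case: n fixed at 8, so the loop, the count, and the end
       test all disappear; on a word machine this is a couple of moves. */
    void copy_eight(char *dst, const char *src)
    {
        dst[0] = src[0]; dst[1] = src[1]; dst[2] = src[2]; dst[3] = src[3];
        dst[4] = src[4]; dst[5] = src[5]; dst[6] = src[6]; dst[7] = src[7];
    }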
eric@snark.UUCP (Eric S. Raymond) (06/23/88)
In article <1127@ima.isc.com>, samples@dougfir.Berkeley.EDU (A. Dain Samples) writes:
>A slight correction needs to be made here: there is only one instance
>that I know of where a complex instruction on a CISC architecture turned
>out to run more slowly than the same operation coded with simpler
>instructions ON THE SAME MACHINE.

And the moderator clarifies:
>[As dmr noted in his recent comp.arch note, it was the 780's calling
>instructions.]

Eh?  You both forgot the RISC crowd's favorite horror story -- the VAX POLY
instruction for evaluating a polynomial in x in one swell foop, given x and
the coefficients.  Turns out this is always slower than a loop using simpler
instructions.
--
      Eric S. Raymond                     (the mad mastermind of TMN-Netnews)
      UUCP: {{uunet,rutgers,ihnp4}!cbmvax,rutgers!vu-vlsi,att}!snark!eric
      Post: 22 South Warren Avenue, Malvern, PA 19355    Phone: (215)-296-5718
rwa@auvax.UUCP (Ross Alexander) (06/26/88)
In article <20345@beta.lanl.gov>, hwe@beta.lanl.gov (Skip Egdorf) writes:
> In article <20338@beta.lanl.gov>, jlg@beta.lanl.gov (Jim Giles) writes:
> As I seem to like to dredge up ancient history in this forum, these
> thoughts sent me to my dusty closet of 'olde stuff' to fetch volume
> 1 (I also have V2) of a collection of papers produced by DEC called
> "PDP-10 Applications in Science".
> I think that I got this sometime in 1970, though most of the papers
> seem to be from around 1967-1968.  The third paper in the book is
>	"Selecting and Evaluating a Medium Scale Computer System"
>	"A Study Undertaken for Louisiana State University at New Orleans"
>	by Jon A. Stewart, Director, LSUNO Computer Research Center
> The paper is a summary of an in-depth comparison of three major
> systems of the time: the SDS Sigma-5, the IBM 360/44, and the
> DEC PDP-10.  The paper is very fine at bringing back the memories
> of the good old days [...]
> Skip Egdorf
> hwe@lanl.gov

Without prejudice to either Skip E. or Jon Stewart, I remember that
particular paper well enough; I read it in 1974, but the tone and
conclusions stick with me.  I believe the main weakness of the evaluation is
that it rested on such a small (published) sample.  A small and somewhat
artificial fragment of FORTRAN IV (written in a FORTRAN II style) is pushed
through the compilers for the different machines; then the generated code is
examined in a fashion which nowadays would be considered naive.

I might add that the compilers were as naive as the analysis; compiler
technology was quite immature in 196[789], and the generated code has a very
student-compiler-project look about it :-).  For instance, two 16-bit IBM
rr-class instructions (quite RISCy in type) are scored as poorer than a
single 36-bit PDP instruction accomplishing the same result by way of a
register-indirect-with-autoincrement construction.  As well, the IBM
compiler makes no attempt to avoid repeated evaluation of common
subexpressions, nor does it do any strength reduction.  One tellingly
correct and important point was that the PDP was considerably less expensive
than the IBM for the same power :-)

I myself am neutral; I've programmed both 360's and PDP-11's (similar in
spirit, if not implementation details, to 10's) in assembler, doing
moderately-sized pieces of code (~5,000 lines).  The 360 architecture,
though ugly, can be used in the style of a RISC machine if one just ignores
all the ss-class instructions (such gewgaws as the decimal or editing
instructions); I try to avoid using IBM's ugly linkage conventions too :-).
The 11 code is much more esthetic to look at, but I don't think it's much
denser or faster for a given hardware approach than the equivalent 360
binary.

To sum up, that paper wasn't very different in tone from a lot of the
flaming we see here on USENET :-) and it's sort of odd to see it quoted as
part of `the literature'.  But it's also sort of fun to see it come up again
:-)
--
Ross Alexander, who plays around with computers at
Athabasca University, Athabasca, Alberta - alberta!auvax!rwa
"Assembly of Japanese bicycle require great peace of mind."