steve@grinch.umiacs.umd.edu (02/14/89)
There's an interesting paper on the RISC versus CISC controversy in the January 1989 Usenix proceedings. The paper is by Dan Klein of SEI. The basic claim is that, while CISC architectures have all these fancy instructions that do neat things, compilers (or, at least, C compilers) use those fancy instructions less than 1% of the time. It's the much more fundamental instructions (e.g., 'move what's in this spot in memory into this register') that get used most of the time.

The problem is that it's harder to implement CISC than RISC, and that CISC implementations execute the same instructions more slowly than their RISC counterparts. (It takes more silicon and more time to deal with lots of different addressing modes and instructions than it does to deal with a small number of each.) The fancy instructions are the ones compilers don't use, and the simple ones are much faster on RISC, so the RISC chips are faster overall. (This came as a surprise to Dan Klein, who had set out to prove exactly the opposite.)

-Steve

Spoken: Steve Miller     Domain: steve@mimsy.umd.edu     UUCP: uunet!mimsy!steve
Phone: +1-301-454-1808   USPS: UMIACS, Univ. of Maryland, College Park, MD 20742
brent%sprite.Berkeley.EDU@ginger.berkeley.edu (Brent Welch) (02/14/89)
The standard RISC argument is that you spend less chip area on complicated instructions (i.e., microcode), so you can make the cycle time shorter, and you can spend chip area on things like register windows so that procedure calls go faster.

Register windows are a set of overlapping registers, say 64 registers total divided into 7 or 8 overlapping sets of 16. At any one time only 16 registers are visible. At procedure call you bump the "window pointer" by 8 to make another, partially overlapping set of registers visible. This means you can use the overlapping part to pass procedure arguments very fast; what were the caller's local registers are the callee's input registers after the window is shifted. If you call very deep, the system has to trap and spill windows to memory, or do the converse when you unwind. Ordinarily, however, you can save many main-memory references with this technique.

Also, you can make instruction decoding go faster because there are fewer different instructions and fewer addressing modes. Typically all ALU operations are register-to-register, and memory references go through explicit load/store instructions. You also do pipelining and enforce some constraints so that you get nearly one instruction completed per cycle, even though it takes, say, 4 cycles in the pipeline for a single instruction. (Compare this with the cycle counts given in a 680x0 manual: 2-10 cycles per simple instruction.)

The constraints on instruction sequences include "delayed branches", where the instruction after a branch always gets executed. This delay slot can be filled by clever compilers. Sometimes there are also restrictions on using the result of one instruction as the input to the next instruction, because of the pipeline. I think SPARC uses internal forwarding, so this restriction doesn't apply there.

Perhaps a final argument is that because the chip is less complicated, you can move it onto newer, faster technology more easily.
The first SPARC chips were made on gate arrays, for example, and a gallium arsenide chip has been promised. Ultimately, however, I'm not sure there has been a truly fair head-to-head competition between a RISC and a CISC. You'd have to use the same technology, the same processor cache, the same memory bandwidth, the same applications, and the best compilers for both.

Brent Welch
University of California, Berkeley
brent%sprite@ginger.Berkeley.EDU
tim@amd.com (Tim Olson) (02/14/89)
In article <15686@mimsy.UUCP> folta@tove.umd.edu (Wayne Folta) writes:

> ...I had assumed that a RISC machine had a much smaller and simpler
> instruction set. That is, fewer instructions, each of which did simpler
> things than a CISC instruction set. But how can this make a machine
> that much faster? Is it because most CISC machines are microcoded?

Partially.

> This additional level of instruction execution could add overhead. Is
> it because a smaller instruction set requires fewer bits to encode each
> instruction? This would make fetches somewhat faster.

No -- RISC instructions are typically 32 bits long, and have a more sparse
encoding than CISC instructions.

> It seems to me that to accomplish the same work, the RISC machine would
> just have to execute more instructions than the CISC machine.

Yes, that is true (most of the time).

> So where have I gone wrong? How is it that--if indeed it is--RISC beats
> CISC by large margins?

Remember, instructions != cycles. For a RISC machine to be faster than a
CISC machine, it simply must take fewer cycles to complete the overall
program, even if this means executing more instructions:

                                              1
        Performance = cycles/sec * ---------------------------
                                   cycles/inst * [total inst]

Thus, we can improve performance by raising the cycles/sec (increasing the
clock frequency; basically a process problem), decreasing the total number
of instructions executed (by making them complex: CISC), or decreasing the
number of cycles that an instruction requires (by making them simple:
RISC). Note that these variables are not independent; it is hard to make
very complex instructions run fast, etc.

That is the view from the hardware side. However, software (specifically
optimizing compilers) plays just as important a role in the RISC
performance picture.
One can make the argument that RISC and CISC look very similar at the "micromachine" level, and that the fetching of a microinstruction from the microcode ROM on a CISC machine is somewhat like a RISC machine fetching an instruction. But where the CISC machine has fixed, hard-wired microcode to execute from, the RISC machine's instructions are "custom-tailored" by the compiler for the problem at hand.

For example, let's look at a typical loop:

	for (i = 0; i < MAX; ++i)
		a[i] = 0;

A CISC machine may have a single instruction that performs the inner statement, using an indexed base+offset addressing mode. However, each time through the loop it must fetch the 32-bit base address of the array "a", multiply the index variable i by the size of the elements of a, add the two values together to form an address, then store 0 to that location.

A highly-optimizing compiler can recognize that the base of the array never changes (so it can be computed into a register before the loop begins [loop-invariant code motion]), and that we can increment this address by the size of each element, rather than incrementing i by 1 and then multiplying (or shifting) [strength reduction]. Now the loop consists of a few simple instructions (store, add, compare, branch), which match nicely with what the RISC machine provides (and they are performed quickly, because they are executed directly instead of being interpreted by another level of microcode).

-- 
Tim Olson
Advanced Micro Devices
(tim@crackle.amd.com)