[comp.sys.sun] RISC versus CISC

steve@grinch.umiacs.umd.edu (02/14/89)

There's an interesting paper on the RISC versus CISC controversy in the
January 1989 Usenix proceedings.  The paper is by Dan Klein of SEI.  The
basic claim here is that, while CISC architectures have all these fancy
instructions that do neat things, compilers (or, at least, C compilers)
use these fancy instructions less than 1% of the time.  It's the much more
fundamental instructions (i.e., 'move what's in this spot in memory into
this register') that get used most of the time.

The problem here is that it's harder to implement CISC than RISC, and that
CISC implementations do the same instructions more slowly than their RISC
counterparts.  (It takes more silicon and more time to deal with lots of
different addressing modes and instructions than it does to deal with a
small number of addressing modes and instructions.)  The fancy
instructions are the ones compilers don't use, and the simple ones are
much faster, so the RISC chips are faster overall.  (This came as a
surprise to Dan Klein, who had set out to prove exactly the opposite.)

	-Steve

Spoken: Steve Miller    Domain: steve@mimsy.umd.edu    UUCP: uunet!mimsy!steve
Phone: +1-301-454-1808  USPS: UMIACS, Univ. of Maryland, College Park, MD 20742

brent%sprite.Berkeley.EDU@ginger.berkeley.edu (Brent Welch) (02/14/89)

The standard RISC argument is that you spend less chip area on complicated
instructions (i.e. microcode) so you can make the cycle time shorter, and
you can spend chip area on things like register windows so that procedure
calls go faster.

Register windows are a set of overlapping registers, say 64 registers
total divided into 7 or 8 overlapping sets of 16.  At any one time only 16
registers are visible.  At procedure call you bump the "window pointer" by
8 to make another, partially overlapping set of registers visible.  This
means you can use the overlapping part to pass procedure arguments very
fast:  what were the caller's local registers are the callee's input
registers after the window is shifted.  If calls nest very deep the system
has to trap and spill windows to memory, and refill them when the call
chain unwinds.  Ordinarily, however, you can save many main-memory
references with this technique.
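
Here is a toy C model of the overlap (my own sketch, using the register
counts Brent mentions; real SPARC windows differ in detail):

        #include <stdio.h>

        #define NREGS  64          /* total physical registers              */
        #define SHIFT   8          /* window-pointer bump on call = overlap */

        static int regs[NREGS];    /* the physical register file            */
        static int wp;             /* window pointer                        */

        /* visible register r (0..15) maps onto the physical file */
        static int *reg(int r) { return &regs[(wp + r) % NREGS]; }

        int main(void)
        {
                *reg(8) = 42;      /* caller writes an argument into an
                                      "output" register                     */
                wp += SHIFT;       /* procedure call: slide the window      */
                printf("callee reads %d from input register 0\n", *reg(0));
                wp -= SHIFT;       /* return: slide the window back         */
                return 0;
        }

The caller's register 8 and the callee's register 0 name the same physical
cell, so the argument is passed without touching main memory; only when
the call chain gets deeper than the number of windows does anything spill.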

Also, you can make instruction decoding go faster because there are fewer
different instructions, and fewer addressing modes.  Typically all ALU
operations are register-to-register, and memory references are through
explicit load/store instructions.  You also do pipelining and enforce some
constraints so that you get nearly one instruction completed per cycle,
even though it takes say 4 cycles in the pipeline for a single
instruction.  (Compare this with the cycle counts given in a 680x0 manual.
2-10 cycles/simple instruction.) The constraints on instruction sequences
include "delayed branches" where the instruction after a branch always
gets executed.  This delay slot can be filled by clever compilers.
Sometimes there are also restrictions on using the result of one
instruction as the input to the next instruction because of the pipeline.
I think SPARC uses internal forwarding so this restriction doesn't apply.

Perhaps a final argument is that because the chip is less complicated you
can put it onto newer, faster technology more easily.  The first SPARC
chips were made on gate arrays, for example, and a gallium arsenide chip
has been promised.

Ultimately, however, I'm not sure there has been a truly fair
head-to-head competition between a RISC and a CISC.  You'd have to use the
same technology, the same processor cache, the same memory bandwidth, the
same applications, and the best compilers for both.

	Brent Welch
	University of California, Berkeley
	brent%sprite@ginger.Berkeley.EDU

tim@amd.com (Tim Olson) (02/14/89)

In article <15686@mimsy.UUCP> folta@tove.umd.edu (Wayne Folta) writes:
> ...I had assumed that a RISC machine had a much smaller
> and simpler instruction set.  That is, fewer instructions, each of which
> did simpler things than a CISC instruction set.  But how can this make a
> machine that much faster?  Is it because most CISC machines are
> microcoded?

Partially.

> This additional level of instruction execution could add
> overhead.  Is it because a smaller instruction set requires fewer bits to
> encode each instruction?  This would make fetches somewhat faster.

No -- RISC instructions are typically 32 bits long, and have a sparser
encoding than CISC instructions.

> It seems to me that to accomplish the same work, the RISC machine would
> just have to execute more instructions than the CISC machine.

Yes, that is true (most of the time).

> So where have I gone wrong?  How is it that--if indeed it
> is--RISC beats CISC by large margins?

Remember, instructions != cycles.  For a RISC machine to be faster than a
CISC machine, it simply must take fewer cycles to complete the overall
program, even if this means executing more instructions:

                                                        1
        Performance = 1/sec = cycles/sec * ----------------------------
                                           cycles/inst  *  [total inst]


Thus, we can improve performance by raising the cycles/sec (increasing the
clock frequency, which is basically a fabrication-process problem),
decreasing the total number of instructions executed (by making them
complex: CISC), or decreasing the number of cycles that an instruction
requires (by making them simple: RISC).  Note that these variables are not
independent; it is hard to make very complex instructions run fast, etc.
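
To make the arithmetic concrete, here is a small C sketch of the formula
with made-up numbers (illustrative only, not measurements of any real
machine):

        #include <stdio.h>

        /* runs completed per second, from the formula above */
        double perf(double cyc_per_sec, double cyc_per_inst, double total_inst)
        {
                return cyc_per_sec / (cyc_per_inst * total_inst);
        }

        int main(void)
        {
                /* hypothetical CISC: 16 MHz, 6 cycles/inst, 1.0M instructions */
                printf("CISC: %.1f runs/sec\n", perf(16e6, 6.0, 1.0e6));
                /* hypothetical RISC: 16 MHz, 1.3 cycles/inst, 1.5M instructions */
                printf("RISC: %.1f runs/sec\n", perf(16e6, 1.3, 1.5e6));
                return 0;
        }

The RISC here executes 50% more instructions, yet finishes each run about
three times sooner, because each instruction takes far fewer cycles.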

That is the view from the hardware side.  However, software (specifically,
optimizing compilers) plays just as important a role in the RISC
performance picture.  One can argue that RISC and CISC look very similar
at the "micromachine" level, and that fetching a microinstruction from the
microcode on a CISC machine is somewhat like a RISC machine fetching an
instruction.  But the CISC machine's microcode is fixed, general-purpose
code, while the RISC machine's instructions are "custom-tailored" by the
compiler for the problem at hand.

For example, let's look at a typical loop:

	for (i=0; i<MAX; ++i)
		a[i] = 0;

A CISC machine may have a single instruction that performs the inner
statement, by using an indexed base+offset addressing mode.  However, each
time through the loop it must fetch the 32-bit base address of the array
"a", multiply the index variable i by the size of the elements of a, add
the two values together to form an address, then store 0 out to that
location.

A highly optimizing compiler can recognize that the base of the array
never changes (so it can be computed in a register before the loop begins
[loop-invariant code motion]), and that we can increment this address by
the size of each element rather than incrementing the index by 1 and then
multiplying (or shifting) [strength reduction].  Now the loop consists of
a few simple instructions (store, add, compare, branch), which matches
nicely with what is provided by the RISC machine (and they are performed
quickly, because they are executed directly instead of being interpreted
by another level of microcode).
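
For illustration (my sketch, not from the original article; assume "a" is
an array of MAX ints), the transformed loop is roughly equivalent to this C:

        int *p = a;                 /* base address computed once, before the loop */
        int *end = a + MAX;
        while (p < end) {           /* compare, branch                              */
                *p++ = 0;           /* store, then add the element size             */
        }

Each iteration is just a store, an add, a compare, and a branch -- exactly
the simple register-to-register and load/store operations a RISC provides
directly.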

	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)