[comp.arch] Why The Move To RISC Architectures?

Will@cup.portal.com (Will E Estes) (03/18/90)

What is the MIPS rating of these microprocessors:

386SX-15
386-20
386-25
386-33

Also, since the 80386 has a more complex instruction set and does
more work in a given instruction than does a typical RISC chip,
does comparing MIPS figures between RISC and non-RISC
architectures really tell you anything of worth?

Finally, why is everyone so excited about RISC?  Why the move to
simplicity in microprocessor instruction sets?  You would think
that the trend would be just the opposite - toward more and more
complex instruction sets - in order to increase the execution
speed of very high-level instructions by putting them in silicon
and in order to make implementation of high-level language
constructs easier.

Thanks,
Will              (sun!portal!cup.portal.com!Will)
 

henry@utzoo.uucp (Henry Spencer) (03/21/90)

In article <28012@cup.portal.com> Will@cup.portal.com (Will E Estes) writes:
>What is the MIPS rating of these microprocessors:
>
>386SX-15
>386-20
>386-25
>386-33

With what memory systems, and running what workload?  And which flavor of
"MIPS" are you talking about?

>Also, since the 80386 has a more complex instruction set and does
>more work in a given instruction than does a typical RISC chip,
>does comparing MIPS figures between RISC and non-RISC
>architectures really tell you anything of worth?

Comparing MIPS figures tells you nothing of worth even when those
complications aren't present.  MIPS numbers are marketing nonsense, not
useful performance measures.

>Finally, why is everyone so excited about RISC?  Why the move to
>simplicity in microprocessor instruction sets?  You would think
>that the trend would be just the opposite - toward more and more
>complex instruction sets - in order to increase the execution
>speed of very high-level instructions by putting them in silicon
>and in order to make implementation of high-level language
>constructs easier.

Oh my, a newcomer to the group, I'd say...  RISC is exciting because it
generally leads to computers that run real workloads faster.  That is
the meaningful measure of performance.  The fact is, trying to bundle
zillions of instructions onto the chip usually makes them slower, and
compilers find it very difficult to effectively exploit all the bizarre
silliness that CISC designers throw in.  About a decade ago, it started
to become clear that executing simple instructions very quickly works
much better.
-- 
MSDOS, abbrev:  Maybe SomeDay |     Henry Spencer at U of Toronto Zoology
an Operating System.          | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (03/21/90)

In article <28012@cup.portal.com> Will@cup.portal.com (Will E Estes) writes:
>...since the 80386 has a more complex instruction set and does
>more work in a given instruction than does a typical RISC chip,
>does comparing MIPS figures between RISC and non-RISC
>architectures really tell you anything of worth?

Yes, but only if you understand the basic situation.  A MIPS rating
should be treated as give-or-take about a factor of two.  So, if one
machine has twice the MIPS of another (on a given compute-bound
task), the machines could still be about equal (on that task).  This
isn't true within a family, of course: different 386 boxes really can
be compared by their MIPS ratings.  Note, however, that I said "box".
Different boxes containing the same chip, at the same clock rate, can
still have different MIPS ratings.  (This is because of caches and
buses and other significant non-CPU items.) So, to ask for the MIPS
of a CPU chip is mostly to ask for an upper bound.

As for "complex" instructions, it is worth noting that the complexity
may be potential rather than actual.  Sometimes, a given machine runs
faster when the compilers avoid generating the more complex cases.
It was this observation that led us to explore the RISC idea.

>Finally, why is everyone so excited about RISC?  Why the move to
>simplicity in microprocessor instruction sets?  

The excitement is because the better RISC machines have genuinely
high throughput.  The simplicity is only relative - they're actually
quite complex machines.  The important point is that the designs are
carefully tuned, so that complexity is only used where it pays its
way.  So, rather than ask "Why simplicity?", it would be better to
ask about specific aspects, such as subroutine calling.
-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science

shj@ultra.com (Steve Jay) (03/21/90)

henry@utzoo.uucp (Henry Spencer) writes:

>About a decade ago, it started
>to become clear that executing simple instructions very quickly works
>much better.

Apparently this was clear to Seymour Cray a lot earlier than that:
the 6600 in 1964.

Steve Jay
shj@ultra.com  ...ames!ultra!shj
Ultra Network Technologies / 101 Dagget Drive / San Jose, CA 95134 / USA
(408) 922-0100 x130	"Home of the 1 Gigabit/Second network"

seanf@sco.COM (Sean Fagan) (03/21/90)

In article <28012@cup.portal.com> Will@cup.portal.com (Will E Estes) writes:
>Finally, why is everyone so excited about RISC? You would think
>that the trend would be just the opposite - toward more and more
>complex instruction sets - in order to increase the execution
>speed of very high-level instructions by putting them in silicon
>and in order to make implementation of high-level language
>constructs easier.

Well, first of all, that should be "I would think," as, obviously, not
everybody thinks like you do.

Second of all, my immediate reaction on reading this was, "And thus the VAX
is born."  I think that says it all 8-).  Doubtless dozens of people will
post and flood your mailbox about this, but, if they don't, I'll be glad to
8-).

-- 

-----------------+
Sean Eric Fagan  | "Time has little to do with infinity and jelly donuts."
seanf@sco.COM    |    -- Thomas Magnum (Tom Selleck), _Magnum, P.I._
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

seanf@sco.COM (Sean Fagan) (03/21/90)

In article <1990Mar21.004840.6473@ultra.com> shj@ultra.com (Steve Jay) writes:
>Apparently this was clear to Seymour Cray a lot earlier than that:
>the 6600 in 1964.

Yes, but Seymour is God, and it took a while for IBM to acknowledge that 8-).

-- 

-----------------+
Sean Eric Fagan  | "Time has little to do with infinity and jelly donuts."
seanf@sco.COM    |    -- Thomas Magnum (Tom Selleck), _Magnum, P.I._
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

seanf@sco.COM (Sean Fagan) (03/21/90)

In article <1990Mar20.175843.2612@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>The fact is, trying to bundle
>zillions of instructions onto the chip usually makes them slower, and
>compilers find it very difficult to effectively exploit all the bizarre
>silliness that CISC designers throw in.  

Time to throw myself into the fray (and mention Seymour and CDC later, too
8-)).  A simple, non-hardware-engineer reason why what Henry says is true
(and, for the most part, it *is* true):  you have only a finite amount of
silicon space on a chip.  Given that, you have a couple of options:  you can
make a small number of instructions *really* fast (through the brute-force
method of just throwing silicon at it), or you can implement a large number
of instructions (which might be fast, or might not; since each now gets less
silicon, they will probably be slower).  You can, within limits, make any
instruction faster by throwing more silicon at it.  For example, you can do
a 32x32->64 (bit) multiply in 2 cycles if you use enough silicon, maybe even
one cycle.  This will, however, take up *lots* of chip space, so you might
just keep it down to somewhere between 2 and 5 cycles, or get rid of it
entirely (since you can do any multiply with shifts and adds, and a large
fraction of the multiplies in certain test sets are by constants).  If you've
made all of your instructions execute as fast as possible, and have more
space available, you can have, oh, an on-board MMU, on-board FPU, on-board
cache, second processor, etc.  With larger instruction sets, you don't have
that option as much.
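
To put the shifts-and-adds point in concrete terms, here's a small C
sketch (the constant 10 and the function names are mine, purely for
illustration):

    /* Multiply by the constant 10 using only shifts and adds, the way a
     * compiler can strength-reduce a constant multiply on a machine with
     * no (or a slow) multiply unit:  10*x = 8*x + 2*x.
     */
    unsigned long mul10(unsigned long x)
    {
        return (x << 3) + (x << 1);
    }

    /* The general case:  the classic shift-and-add multiply loop, where
     * each set bit of b contributes one shifted copy of a (overflow is
     * ignored for clarity).
     */
    unsigned long mul_shift_add(unsigned long a, unsigned long b)
    {
        unsigned long acc = 0;

        while (b != 0) {
            if (b & 1)
                acc += a;
            a <<= 1;
            b >>= 1;
        }
        return acc;
    }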

After you've done all that, btw, you can throw in pipelining if you don't
already have it, multiple functional units, scoreboarding (either full or
the simpler one most people use), etc.  Meanwhile, the CISC chip is still
trying to make the POLY instruction execute in something less than 100
cycles...
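
(For those who haven't met it:  POLY evaluates a polynomial by Horner's
rule.  Here is roughly the same work as a C sketch -- my own
illustration, not the VAX's exact semantics -- written as the loop of
simple multiplies and adds a RISC would execute instead:)

    /* Horner's-rule evaluation of c[0]*x^n + ... + c[n].  One POLY
     * instruction does roughly this much work; on a RISC it becomes a
     * loop of simple multiplies and adds that the hardware is free to
     * overlap with other useful operations.
     */
    double poly_eval(double x, const double *c, int n)
    {
        double result = c[0];       /* highest-order coefficient first */
        int i;

        for (i = 1; i <= n; i++)
            result = result * x + c[i];
        return result;
    }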

>About a decade ago, it started
>to become clear that executing simple instructions very quickly works
>much better.

Well, I'd say more than that, about 25 years.  Seymour Cray and the CDC
6600, a truly wonderful machine with fewer than 74 instructions, a
load-store architecture, and three-operand instructions.  Just beautiful.

-- 

-----------------+
Sean Eric Fagan  | "Time has little to do with infinity and jelly donuts."
seanf@sco.COM    |    -- Thomas Magnum (Tom Selleck), _Magnum, P.I._
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

hrich@emdeng.Dayton.NCR.COM (George.H.Harry.Rich) (03/21/90)

In article <28012@cup.portal.com> Will@cup.portal.com (Will E Estes) writes:
>architectures really tell you anything of worth?
> [...]
>Finally, why is everyone so excited about RISC?  Why the move to
>simplicity in microprocessor instruction sets?  You would think
>that the trend would be just the opposite - toward more and more
>complex instruction sets - in order to increase the execution
>speed of very high-level instructions by putting them in silicon
>and in order to make implementation of high-level language
>constructs easier.
>
>Thanks,
>Will              (sun!portal!cup.portal.com!Will)

I want to state first that I'm not an expert on RISC architecture, but the
experts seem not to have replied, or to have given oversimplified
explanations, so I'll make an attempt.  My area is software, not hardware,
so I hope those who are knowledgeable will be quick to correct me if I'm
wrong.

First of all, what you save with a complex instruction versus several simple
ones is the fetch and decode time.  If the processor has good prefetch and
caching, what you are generally talking about is decode time.  And since
a really simple instruction set takes less time to decode, it is conceivable
that the simple instructions yield a net saving before other factors are
even taken into account.

However, the real point is that you pay for the complex decode even on
the very simple instructions, where the extra decoding time buys you
nothing.  While compilers may take advantage of complex
instructions for such things as stack frame management,  most complex
instructions seem to be designed for the convenience of assembler programmers
rather than compiler code generation, and the bulk of the code generated by
compilers involves relatively simple instructions.  Even where instructions
are designed for compiler code generation, the designers miss fairly often,
and the instruction sits there in silicon, unused by the compiler writers.

It might even be that a complex instruction set designed ideally for
compiler code generation would beat RISC.  However, ideal designers are
very rare, and there are always some design flaws in a complex system.  A
RISC design has the benefit of aiming at a simple thing done well,
and it removes from the designer the burden of knowing some of the more
arcane details of compiler writing.  It is not really surprising that it
is a good approach in practice.

Regards,

	Harry Rich

colwell@mfci.UUCP (Robert Colwell) (03/22/90)

In article <5303@scolex.sco.COM> seanf@sco.COM (Sean Fagan) writes:
>
>In article <1990Mar20.175843.2612@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>>The fact is, trying to bundle
>>zillions of instructions onto the chip usually makes them slower, and
>>compilers find it very difficult to effectively exploit all the bizarre
>>silliness that CISC designers throw in.  
>
>...you have a couple of options:  you can
>make a small number of instructions *really* fast (through the brute-force
>method of just throwing silicon at it), or you can implement a large number
>of instructions (which might be fast, or might not; since each now gets less
>silicon, they will probably be slower).

[a moment please while I find my soapbox...oh here it is...]  And once
again I claim that the NUMBER OF INSTRUCTIONS is very nearly meaningless
as a measure of RISCyness.  It isn't the number of instructions that makes
a VAX hard to speed up, it's their architecturally-required semantic content.
Ok, so indirectly you have a point, in that an architecture as complicated
as that probably needs microcode for its implementation.  But that's not
what you said.  Do you want to hear my arguments for why our VLIW with
2**1024 possible instructions is nevertheless a RISC?

>After you've done all that, btw, you can throw in pipelining if you don't
>already have it, multiple functional units, scoreboarding (either full or
>the simpler one most people use), etc.  Meanwhile, the CISC chip is still
>trying to make the POLY instruction execute in something less than 100
>cycles...

And those 100 cycles might well be considerably faster than the equivalent
RISC code sequence (including the icache misses the RISC will incur in
executing it).  But RISC probably still wins.  Why?  Because those RISC
operations can be overlapped with other useful ops, and while the CISC is
running its POLY, nothing else can usefully proceed.  The pipelining,
multiple FUs, and scoreboarding can be done on RISCs or CISCs (and has
been) so it doesn't seem especially relevant here.

>>About a decade ago, it started
>>to become clear that executing simple instructions very quickly works
>>much better.
>
>Well, I'd say more than that, about 25 years.  Seymour Cray and the CDC
>6600, a truly wonderful machine with fewer than 74 instructions, a
>load-store architecture, and three-operand instructions.  Just beautiful.

An amazing machine.  But unless we know more about how it shared 
responsibility for performance with its compiler, I refuse to call
it a RISC.  From what I've read, it was designed so that the hardware
would extract whatever parallelism it could use, and all the compiler
did was convert the high level source into sequential machine ops.
Close, but not much different in principle from what drives the CISC
design philosophy.  (Oh yes, there was one.  The CISC design philosophy
was to "make the compiler writer's job easier"; yes, it probably failed
at that, too, but that's one of the reasons for all those complicated
instruction sets in the first place.)

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
31 Business Park Dr.
Branford, CT 06405     203-488-6090

shj@ultra.com (Steve Jay) (03/23/90)

colwell@mfci.UUCP (Robert Colwell) writes:

>>Well, I'd say more than that, about 25 years.  Seymour Cray and the CDC
>>6600, a truly wonderful machine with fewer than 74 instructions, a
>>load-store architecture, and three-operand instructions.  Just beautiful.

>An amazing machine.  But unless we know more about how it shared 
>responsibility for performance with its compiler, I refuse to call
>it a RISC.  From what I've read, it was designed so that the hardware
>would extract whatever parallelism it could use, and all the compiler
>did was convert the high level source into sequential machine ops.

The original compiler for the 6600, called "RUN", made no attempt
to optimize instruction sequences.  By 1970, however, CDC had a new
compiler, FTN, which did rearrange instructions to optimize usage
of the multiple functional units.  The technology of both local and
global optimization in the FTN compiler was continuously improved,
and by the mid-to-late '70s it was difficult to beat the compiler even
with hand-tuned assembly language.

I don't think the unavailability of an optimizing compiler when the
6600 first came out in any way detracts from the RISCness of the
machine.  You can read articles written around 1965 which justify
the design decisions for the 6600 in terms almost identical to those
used today to justify RISC over CISC.  I suspect that experience with
the 6600/7600 taught later architects just how much compiler technology
matters.

Steve Jay
shj@ultra.com  ...ames!ultra!shj
Ultra Network Technologies / 101 Dagget Drive / San Jose, CA 95134 / USA
(408) 922-0100 x130	"Home of the 1 Gigabit/Second network"

johnl@esegue.segue.boston.ma.us (John R. Levine) (03/23/90)

In article <289@emdeng.Dayton.NCR.COM> hrich@emdeng.UUCP (George.H.Harry.Rich) writes:
>It might even be that a complex instruction set designed ideally for
>compiler code generation would beat RISC. ...

I doubt it.  The IBM 801 project included some of the best compiler people
around and they came up with the original RISC machine which was quite
stripped down, and an extremely fancy compiler named PL.8 which generates
fantastic code for it.  There are some slightly exotic instructions, e.g.
shift register N and put the result in register N+1, but nothing very
complicated.
-- 
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650
johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl
"Now, we are all jelly doughnuts."

baum@Apple.COM (Allen J. Baum) (03/23/90)

[]
In article <289@emdeng.Dayton.NCR.COM> hrich@emdeng.UUCP (George.H.Harry.Rich) writes:
>In article <28012@cup.portal.com> Will@cup.portal.com (Will E Estes) writes:
>>architectures really tell you anything of worth?
>> [...]
>>Finally, why is everyone so excited about RISC?  Why the move to
>>simplicity in microprocessor instruction sets?  You would think
>>that the trend would be just the opposite - in order to increase
>>the speed of very high-level instructions by putting them in silicon

Actually, the problem with complex stuff is that it isn't used, so why put
it in?  The higher the semantic content of an instruction, the less often
it is used.  RISC attempts to put in the highest semantic content that
still gets used a lot - which isn't very high, it turns out.

>First of all, what you save with a complex instruction versus several simple
>ones is the fetch and decode time.  If the processor has good prefetch and
>caching, what you are generally talking about is decode time.  And since
>a really simple instruction set takes less time to decode,

Yes, but if your critical paths are not decode related, then it just doesn't
matter.  What counts is reducing critical paths, both in hardware, where
they are generally load/store or branch related, and in software, where the
path is the number of instructions it takes to perform some function.
CISCs attempt to reduce the second (software) factor.  Unfortunately, they
often do so by lengthening the first, and they can't do it often enough
to make up for this.
You can make one instruction perform the same actions as a series of simpler
instructions.  But I can make n^i variations of the series, and only a few
variations of the complex form.  Experience has shown that lots of the
variations get used, especially after optimization, so it is impossible to
pick a small set of complex insts. that get used often enough to make them
worthwhile.  Besides, these complex insts. often get executed as a series
of microsteps, and so often go no faster than the series of simple
instructions.  Finally, it is possible to re-arrange the order of the
simpler ones to avoid interlocks, which can't happen inside a complex
instruction.
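
To illustrate that interlock point at the source level, here is a
hypothetical C sketch (the loop and the names are mine):  each load is
separated from its use by independent work, the kind of re-arrangement a
scheduler can do with simple instructions but not inside one complex
instruction:

    /* Sum a[i]*k, starting the next element's load while multiplying
     * the current one, so the load latency is hidden behind useful
     * work instead of stalling on a load-use interlock.
     */
    long sum_scaled(const long *a, int n, long k)
    {
        long sum = 0;
        long cur, next;
        int i;

        if (n <= 0)
            return 0;
        cur = a[0];                 /* issue the first load early */
        for (i = 0; i < n - 1; i++) {
            next = a[i + 1];        /* start the next load ...          */
            sum += cur * k;         /* ... while using the previous one */
            cur = next;
        }
        return sum + cur * k;
    }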

On the flip side, complex instructions can run a deeper pipeline. If the
instructions can truly be piped (a very big if, when interlocks are taken
into account), then this is equivalent to a cheap 'superscalar' implementation.

For example, a series of "Add Mem to Reg" instructions, which can be piped at
one per cycle, will run twice as fast as the simpler "Load Mem to Reg", "Add
Reg to Reg" series. The pipeline is more complex, but is simpler than the
full superscalar implementation.  The question is:  with good register
allocation, does this case come up often enough to make it worthwhile?
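
As a concrete example of the two styles (the mnemonics in the comment
are made up for illustration, not any real machine's), the body of this
trivial C loop could compile either way:

    /* s += a[i] can become either
     *     add   r1, (r2)+       ; one memory-operand add per element
     * or
     *     load  r3, (r2)+       ; two simple instructions, each of
     *     add   r1, r1, r3      ; which is easy to pipeline
     */
    long sum(const long *a, int n)
    {
        long s = 0;
        int i;

        for (i = 0; i < n; i++)
            s += a[i];
        return s;
    }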
--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

baum@Apple.COM (Allen J. Baum) (03/23/90)

[]
>In article <1990Mar22.190941.1184@esegue.segue.boston.ma.us> johnl@esegue.segue.boston.ma.us (John R. Levine) writes:
>In article <289@emdeng.Dayton.NCR.COM> hrich@emdeng.UUCP (George.H.Harry.Rich) writes:
>>It might even be that a complex instruction set designed ideally for
>>compiler code generation would beat RISC. ...
>
>I doubt it.  The IBM 801 project included some of the best compiler people
>around and they came up with the original RISC machine which was quite
>stripped down, and an extremely fancy compiler named PL.8 which generates
>fantastic code for it.

I'm afraid I no longer buy arguments of the form "x didn't do it, and x
is omnipotent, so it can't/shouldn't be done."  That work was done 10+ years
ago; the state of the art has improved, and will continue to improve.  In
fact, the IBM patent portfolio includes one patent that describes how to
optimally
choose instruction forms including mem->reg insts. For example, if something
was to be added to a register, and nothing else was done with it, then the
"Add mem to Reg" form would be selected, not "Load Mem to Reg",
"Add reg to reg". The latter might be used if the value was going to be used
again shortly, or be modified, etc.
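
In C source terms, the selection rule looks something like this (a
hypothetical sketch; the function names are mine):

    /* Value used exactly once:  the "add mem to reg" form is the
     * natural pick for the code generator.
     */
    void bump(long *x, const long *m)
    {
        *x += *m;
    }

    /* Value used again shortly:  load it into a register once, then
     * use register-to-register adds, rather than touching memory twice.
     */
    void bump_two(long *x, long *y, const long *m)
    {
        long t = *m;

        *x += t;
        *y += t;
    }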

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

dik@cwi.nl (Dik T. Winter) (03/23/90)

In article <1990Mar22.184122.7917@ultra.com> shj@ultra.com (Steve Jay) writes:
 >                                     By 1970, however, CDC had a new
 > compiler, FTN, which did rearrange instructions to optimize usage
 > of the multiple functional units.  The technology of both local and
 > global optimization in the FTN compiler was continuously improved,
 > and by the mid-to-late '70s it was difficult to beat the compiler even
 > with hand-tuned assembly language.
And then came the problem.  CDC came out with newer versions of their
machines, and newer versions of their compiler.  The problem was that different
machines had different requirements with respect to scheduling.  So a
program fully optimized for a 7600 was not optimal for a 170/750.  There
were switches in the compiler to tune for the different models, but at
least on the 170/750 it was possible to take the compiler generated
assembler code, hand-tune it by simple peephole optimization, and
gain a factor of 2 (though of course not for all programs).

This is in general a problem if the compiler has too much to do.
Newer models of the machine require a different compiler.  And not
only newer models, but if you have a range of models differing only in
price and performance, you may have introduced different scheduling
requirements for the different models.  Although your architecture can
be such that object code compiled for one model is valid for another
model, it may be sub-optimal.  And then think about the hassle of
maintaining different versions of the compiler!

 > I don't think the unavailability of an optimizing compiler when the
 > 6600 first came out in any way detracts from the RISCness of the
 > machine.  You can read articles written around 1965 which justify
 > the design decisions for the 6600 in terms almost identical to those
 > used today to justify RISC over CISC.

I agree here.  And do not get me wrong; I like the (60 bit) Cybers and
the Crays.

Although this belongs more in comp.compilers, it is also of significance in
this group, because there is a strong interaction between compiler and
machine.
-- 
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

mash@mips.COM (John Mashey) (03/23/90)

In article <8912@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <1990Mar22.184122.7917@ultra.com> shj@ultra.com (Steve Jay) writes:
> >                                     By 1970, however, CDC had a new
> > compiler, FTN, which did rearrange instructions to optimize usage
> > of the multiple functional units.  The technology of both local and
> > global optimization in the FTN compiler was continuously improved,
> > and by the mid-to-late '70s it was difficult to beat the compiler even
> > with hand-tuned assembly language.
>And then came the problem.  CDC came out with newer versions of their
>machines, and newer versions of their compiler.  The problem was that different
>machines had different requirements with respect to scheduling.  So a
>program fully optimized for a 7600 was not optimal for a 170/750.  There
>were switches in the compiler to tune for the different models, but at...

>This is in general a problem if the compiler has too much to do.
>Newer models of the machine require a different compiler.  And not
>only newer models, but if you have a range of models differing only in
>price and performance, you may have introduced different scheduling
>requirements for the different models.  Although your architecture can
>be such that object code compiled for one model is valid for another
>model, it may be sub-optimal.  And then think about the hassle of
>maintaining different versions of the compiler!

This issue, of course, almost certainly arises for every line of
computers that
	a) Has multiple distinct implementations at the same time.
	b) Evolves over time by anything but clock-rate changes to the
	same implementation.

Product families for which optimal code differs among models include
at least:
	a) IBM S/360 and derivatives.  Even amongst the first round of
	S/360s, optimal code differed.  (Note that IBM compiler folks
	observed that pipeline scheduling was useful on some machines...)
	b) DEC VAXen
	c) Intel 80x86
	d) Motorola 680x0
	e) SPARC (different FPU timings already, for example, and if the
	next generation has multiple different styles of pipelines...)
	f) MIPS Rx000  (R2000s always had 1-cycle writes; R3000s with
	approp. mode bit use 2-cycle write-partial-words; R6000s have
	different FP timings, etc).

Fortunately for the simpler architectures:
	a) Integer instructions are fairly simple, understandable, and maybe
	even the same with regard to timing amongst different implementations.
	b) Floating point operations are much more likely to vary, but they're
	probably less likely to be interchangeable, so you do what you can.
	c) If you're lucky, the pipeline constraints may be such that
	you:
		1) Want to work harder for things with deeper pipelines,
		in terms of spreading operations apart to lessen stalls.
		2) Want to work harder for more aggressive machines that
		have more concurrency.
	Fortunately, at least in some cases, there are optimizations
	for the more aggressive machines that help them, but certainly
	don't hurt the less aggressive machines much, if at all.
	For instance, if machine (n+1) has longer-latency loads than (n),
	trying harder to move references to the data later probably won't
	hurt (n).

At least you don't have to fight with issues like:
	-Model A has a (multi-cycle) serial shifter, and every shift position
	costs a cycle, but B has a barrel shifter, where the cost is constant,
	regardless of shift count, and both have multipliers of differing
	speeds, so the optimal sequences to do multiplies by constants
	are completely different, and the cutover from shifts+add/subtract
	to actual multiply is completely different.
	-On Model A, to copy 8 bytes from here to there, use a move-character,
	because it has narrow data paths anyway and microcode, but on
	model B, use load/store, because THOSE are hardwired, and go faster
	than doing move-character, because the startup time dominates....
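
(In C source terms, that last tradeoff looks like this -- a hypothetical
sketch, with both functions copying the same 8 bytes; which one compiles
to faster code depends on the model:)

    #include <string.h>

    /* May compile to a single block-move (move-character) instruction,
     * which wins on a narrow-datapath, microcoded model.
     */
    void copy8_block(void *dst, const void *src)
    {
        memcpy(dst, src, 8);
    }

    /* Explicit word loads and stores (assuming 32-bit longs):  four
     * simple hardwired instructions, which win where the block move's
     * startup time dominates.
     */
    void copy8_words(long *dst, const long *src)
    {
        dst[0] = src[0];
        dst[1] = src[1];
    }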

Anyway, CDC was hardly alone in this...it's a fact of life for everybody
that does multiple implementations.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253, or 408-524-7015
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086