[comp.arch] CISC Silent Spring

greg@sce.carleton.ca (Greg Franks) (02/01/90)

In article <3300098@m.cs.uiuc.edu> gillies@m.cs.uiuc.edu writes:
>
>Here's a rhetorical question:
>
>When was the last time someone introduced a new CISC architecture?
>How many years has it been?  New versions of old chips ('486, '040,
>etc) do not count as "new architectures".

The big players in the microprocessor wars are busy souping up their
existing CISC processors all of the time, so why would they bother
concocting new ones?  People sure like having the latest CPU on their
desk to run Lotus 1-2-3 or MacDraw.  Furthermore, with the latest CISC
processors reaching into the domain of the RISC processors in terms of
performance (eg, 68040 @ 25 MHz being faster than SPARC @ 25 MHz
according to Byte), who needs most of the RISC processors floating
around these days?  Just imagine, 100 MIPS and the ability to run an
ancient version of WordPerfect all in one box!

Introducing *any* new architecture, be it RISC or CISC, is likely
going to be exceptionally difficult these days unless that processor
can demonstrate clear superiority over all others in some way or
another (perhaps a cray-on-a-chip??)

Sign me - I want an upgrade :-)
-- 
Greg Franks   (613) 788-5726     Carleton University,             
uunet!mitel!sce!greg (uucp)  	 Ottawa, Ontario, Canada  K1S 5B6.
greg@sce.carleton.ca (bitnet)	 (we're on the internet too. (finally))
Overwhelm them with the small bugs so that they don't see the big ones.

mash@mips.COM (John Mashey) (02/04/90)

In article <771@sce.carleton.ca> greg@sce.UUCP (Greg Franks) writes:
>In article <3300098@m.cs.uiuc.edu> gillies@m.cs.uiuc.edu writes:
>>
>>Here's a rhetorical question:
>>
>>When was the last time someone introduced a new CISC architecture?
>>How many years has it been?  New versions of old chips ('486, '040,
>>etc) do not count as "new architectures".
>
>The big players in the microprocessor wars are busy souping up their
>existing CISC processors all of the time, so why would they bother
>concocting new ones?  People sure like having the latest CPU on their
>desk to run Lotus 1-2-3 or MacDraw.  Furthermore, with the latest CISC
>processors reaching into the domain of the RISC processors in terms of
>performance (eg, 68040 @ 25 MHz being faster than SPARC @ 25 MHz
>according to Byte), 
My Bytes are in the middle of the input stack.  Could someone please
post the DATA that shows a 68040 @ 25 MHz to be faster than a SPARC @ 25 MHz?
(yes, I've seen the Motorola ads that show the 68040 to be 20 mips versus
a SPARC's 18.... :-)  

On the good side of reality, plaudits to UNIX/World, which has started
publishing SPEC numbers (full charts) in some of its workstation comparisons.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

jdudeck@polyslo.CalPoly.EDU (John R. Dudeck) (02/04/90)

In article <35456@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>a 68040 @ 25 MHz to be faster than a SPARC @ 25 MHz?
>(yes, I've seen the Motorola ads that show the 68040 to be 20 mips versus
>a SPARC's 18.... :-) 

In my understanding of RISC vs CISC, you can't directly compare RISC MIPS
against CISC MIPS, because the risc instructions are simple, whereas the
cisc instructions are complex.  It may take several risc instructions to
perform one cisc instruction.  Originally the trick was to get the several
risc instructions to execute in less time than the one complex instruction.
Now the tables are turned, because the cpu designers have figured out
how to get the cisc chips to perform the complex instruction in the same clock
cycle that the risc chip takes to perform the simple instruction...  In
a DEC Professional editorial I saw the expression CRISCO (complex risc)
architecture to refer to this.
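
To see why the ratings don't compare, here is a toy calculation (the instruction counts are invented for illustration; C is used only to make the arithmetic concrete):

```c
/* Hypothetical instruction counts for one fixed task -- these
 * numbers are made up for illustration, not measured data. */
#define CISC_INSNS 100.0   /* CISC: fewer, richer instructions */
#define RISC_INSNS 130.0   /* RISC: more, simpler instructions */

/* Task throughput (tasks/second) implied by a native-MIPS rating. */
double tasks_per_sec(double mips, double insns_per_task)
{
    return (mips * 1e6) / insns_per_task;
}

/* With these made-up counts, a "20 mips" CISC does
 * tasks_per_sec(20.0, CISC_INSNS) = 200000 tasks/s, while an
 * "18 mips" RISC does about 138462 -- or the reverse, with a
 * different instruction-count ratio.  The MIPS rating alone
 * decides nothing; that is the argument for task-level
 * benchmarks rather than instruction rates. */
```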

-- 
John Dudeck                           "You want to read the code closely..." 
jdudeck@Polyslo.CalPoly.Edu             -- C. Staley, in OS course, teaching 
ESL: 62013975 Tel: 805-545-9549          Tanenbaum's MINIX operating system.

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (02/04/90)

In article <25cb6b65.702c@polyslo.CalPoly.EDU> jdudeck@polyslo.CalPoly.EDU
	 (John R. Dudeck) writes:
>In my understanding of RISC vs CISC, you can't directly compare RISC MIPS
>against CISC MIPS, because the risc instructions are simple, whereas the
>cisc instructions are complex.  It may take several risc instructions to
>perform one cisc instruction.  Originally the trick was to get the several
>risc instructions to execute in less time than the one complex instruction.
>Now the tables are turned, because the cpu designers have figured out
>how to get the cisc chips to perform the complex instruction in the same clock
>cycle that the risc chip takes to perform the simple instruction...  

I think that this overstates the advantage of CISC. The recent CISC
chips aren't getting a complex instruction to run in one clock. More
correctly, they are getting the most commonly used simple
instructions to run in one clock. They are also getting the most
commonly used complex instructions to run in fewer clocks. 

The whole RISC thing came about because several compiler people found
that they could get better performance out of CISCs by ignoring many
of the complex instructions, thus treating the machines as RISC.  The
hardware people responded by building machines that did only the
simple things. To my surprise, the payoff was fairly big.

RISC reduced the design time - an advantage that a fast CISC doesn't
have. It also reduced the silicon area, but as all the players add
onchip caches and whatnot, that matters little.  Finally, RISC
increased the clock rate, but advanced CISC should come close.

So, is it a wash? More-or-less, yes - if RISC designs stand still.
But they aren't. RISC is moving to ECL and GaAs, where transistors
are scarce. They are also moving to superscalar designs, where the
RISC/CISC difference is between incredible complexity and stupefying
complexity.
-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science

pkr@maddog.sgi.com (Phil Ronzone) (02/04/90)

In article <7826@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>RISC reduced the design time - an advantage that a fast CISC doesn't
>have. It also reduced the silicon area, but as all the players add
>onchip caches and whatnot, that matters little.  Finally, RISC
>increased the clock rate, but advanced CISC should come close.
>
>So, is it a wash? More-or-less, yes - if RISC designs stand still.
>But they aren't. RISC is moving to ECL and GaAs, where transistors
>are scarce. They are also moving to superscalar designs, where the
>RISC/CISC difference is between incredible complexity and stupefying
>complexity.

I see that as the ONLY large advantage that RISC has. It simply has
been able to reduce the design time.

The second argument (gate scarcity) is interesting, but does it not
also have a limit?  If gates are "typical" in the 10,000-100,000 range,
yes, but how about when gates are "typical" in the 1,000,000-10,000,000 range?


------Me and my dyslexic keyboard----------------------------------------------
Phil Ronzone   Manager Secure UNIX           pkr@sgi.COM   {decwrl,sun}!sgi!pkr
Silicon Graphics, Inc.               "I never vote, it only encourages 'em ..."
-----In honor of Minas, no spell checker was run on this posting---------------

preston@titan.rice.edu (Preston Briggs) (02/05/90)

In article <3562@odin.SGI.COM> pkr@maddog.sgi.com (Phil Ronzone) writes:
>In article <7826@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:

>>So, is it a wash? More-or-less, yes - if RISC designs stand still.
>>But they aren't. RISC is moving to ECL and GaAs, where transistors
>>are scarce. They are also moving to superscalar designs, where the
>>RISC/CISC difference is between incredible complexity and stupefying
>>complexity.

>I see that as the ONLY large advantage that RISC has. It simply has
>been able to reduce the design time.

"Simply" is the correct word, but not just applied to design time.
The "incredible complexity" vs. "stupefying complexity" also
applies to the problem of generating code for the super-scalar
design.  A chip like the 860 is hard enough; if somebody builds
a similar machine with complex addressing modes, etc... it'll
be really difficult to build a good compiler for.

Lindsay pointed out that RISC machines were a response to compilers that
only used the simple instructions.  I expect (don't know for sure) that the
compilers for 80x86's and 680x0's are still mostly using the simple
instructions.  Speed isn't the only reason for avoiding complex instructions;
you avoid them because they're difficult to generate, because they don't
do what you want in the first place, and because the intermediate
results aren't available for reuse.

For (an old, perhaps overworked) example:
Suppose I want to load a value from
memory and add it to a register.  On most CISC's I can do it in one
instruction.  On most RISC's, I have to use 2 instructions.
On the RISC machine, the value I loaded will still be in a register
where I can reuse it later.  Of course, we could also use 2 instructions
on the CISC.  How often does this case arise?  That depends on
your code and the strength of your optimizer.  The RISC bet (supported
by dynamic code measurements) is that it happens a lot.

So, no matter how fast the CISC people make that "add from memory"
instruction run, it won't matter much because it isn't used much.
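
Preston's example at the source level -- a sketch, with the 68000-style instruction in the comment purely illustrative:

```c
/* On a CISC, `sum += *p` can be a single add-from-memory
 * instruction (roughly, add.l (a0),d0), but the loaded value is
 * consumed by the add.  A RISC compiles it as a separate load
 * plus a register add -- and the loaded value stays in a
 * register, free for reuse, as in this loop where each element
 * is used twice but loaded only once: */
int sum_and_product(const int *p, int n, int *prod)
{
    int sum = 0;
    *prod = 1;
    for (int i = 0; i < n; i++) {
        int t = p[i];   /* one load ...                  */
        sum += t;       /* ... the value is then reused  */
        *prod *= t;     /* with no second memory access. */
    }
    return sum;
}
```

Whether such reuse happens often enough to matter is exactly the question the dynamic code measurements are meant to answer.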

Preston Briggs
preston@titan.rice.edu

pkr@maddog.sgi.com (Phil Ronzone) (02/05/90)

In article <4537@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>So, no matter how fast the CISC people make that "add from memory"
>instruction run, it won't matter much because it isn't used much.


I was too brief. Perhaps in a sense, I am saying that RISC is a non-concept,
or a transient "necessary evil".

Clearly we have (at least) two technological thrusts for RISC:

1). Compiler technology
2). Highly automated design tools that cross the threshold of
    the minimal number of gates needed to do something useful.

I am of the belief that as our design tools progress, they will
go far beyond what is needed for RISC. At some stage, we'll get
into the range of handling large-word-width microcoded machines.

Such microcoded machines can certainly execute many simple (i.e.,
RISC type) instructions in one cycle -- I know, I've done one.

BUT - they also can implement many useful instructions that are
not in some (or all) of the current RISC machines. From multiply to
divide to handling the TLB context to entire context switches to
(of course) FP instructions.

Since what counts most is the HUMAN design time, if they are
equivalent, then why not a CRISC? RISC instructions for typical
code generation, and the messy many for the rest of the real world.

Of course, there could be an order of magnitude difference between
RISC and CRISP/CISC; however, how much the die area and hence cost
will matter is, IMHO, not a big factor.

-------------

This came up for me because I microcoded a mainly RISC machine
(stack oriented, non-virtual) that had to support unaligned data
transfers.  The microcode word kept getting wider and
wider, but I had barrel shifters in front of the memory and
in front/back of the ALU, and hardware assist to keep the 16
top words of the stack in microcode registers.

It was a near joy to do unaligned word transfers -- if the
next sequential instruction was a push/pop from/to memory and
the word was unaligned (4 bytes on a 2-byte boundary), I never
lost a cycle (pre-interrupts went off in the previous instruction
so I could start the shift-fetch-shift in memory).


------Me and my dyslexic keyboard----------------------------------------------
Phil Ronzone   Manager Secure UNIX           pkr@sgi.COM   {decwrl,sun}!sgi!pkr
Silicon Graphics, Inc.               "I never vote, it only encourages 'em ..."
-----In honor of Minas, no spell checker was run on this posting---------------

henry@utzoo.uucp (Henry Spencer) (02/06/90)

In article <25cb6b65.702c@polyslo.CalPoly.EDU> jdudeck@polyslo.CalPoly.EDU (John R. Dudeck) writes:
>Now the tables are turned, because the cpu designers have figured out
>how to get the cisc chips to perform the complex instruction in the same clock
>cycle that the risc chip takes to perform the simple instruction...

No, they've figured out how to make tomorrow's CISC chips perform the
simpler instructions in the same clock cycle that yesterday's RISC chips
took to perform similarly simple operations.  The complicated instructions
are still slow (and still rarely used), the RISCs still have a built-in
lead due to shorter design times, and the CISCs still have a built-in
handicap due to the mass of instruction/decoding/exception baggage
dragging along behind their RISC-like cores.
-- 
SVR4:  every feature you ever |     Henry Spencer at U of Toronto Zoology
wanted, and plenty you didn't.| uunet!attcan!utzoo!henry henry@zoo.toronto.edu

pkr@maddog.sgi.com (Phil Ronzone) (02/07/90)

In article <1990Feb5.211208.15741@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>No, they've figured out how to make tomorrow's CISC chips perform the
>simpler instructions in the same clock cycle that yesterday's RISC chips
>took to perform similarly simple operations.  The complicated instructions
>are still slow (and still rarely used), the RISCs still have a built-in
>lead due to shorter design times, and the CISCs still have a built-in
>handicap due to the mass of instruction/decoding/exception baggage
>dragging along behind their RISC-like cores.


Hmm, like automatic TLB loading, or that even more rarely used set known
as MUL and DIV??? :-)

------Me and my dyslexic keyboard----------------------------------------------
Phil Ronzone   Manager Secure UNIX           pkr@sgi.COM   {decwrl,sun}!sgi!pkr
Silicon Graphics, Inc.               "I never vote, it only encourages 'em ..."
-----In honor of Minas, no spell checker was run on this posting---------------

chasm@attctc.Dallas.TX.US (Charles Marslett) (02/07/90)

In article <4537@brazos.Rice.edu>, preston@titan.rice.edu (Preston Briggs) writes:
> For (an old, perhaps overworked) example:
> Suppose I want to load a value from
> memory and add it to a register.  On most CISC's I can do it in one
> instruction.  On most RISC's, I have to use 2 instructions.
> On the RISC machine, the value I loaded will still be in a register
> where I can reuse it later.  Of course, we could also use 2 instructions
> on the CISC.  How often does this case arise?  That depends on
> your code and the strength of your optimizer.  The RISC bet (supported
> by dynamic code measurements) is that it happens a lot.

Are these dynamic code measurements of sloppy assembly code, or of good
compiler-generated code, or . . .

If the addend is used all that much, some form of code strength reduction
would be warranted.  I suspect the measurements are due to the "cheapness"
of the CISC address arithmetic [using PTR+6, PTR+9, PTR+12, etc., because
it costs nothing in execution time].

Add immediate is used a lot because it is cheap, too.

> So, no matter how fast the CISC people make that "add from memory"
> instruction run, it won't matter much because it isn't used much.

If my memory is still holding up, the probability that you will reuse the
addend is less than 50% with vast numbers of registers (64 or so), and then
only if you spend an immense amount of computational resources doing rather
good data flow analysis.  That says if you have add-from-memory, you're better
off most of the time using it.

The more accurate statement might be that if you design a RISC box, it is
better to design a register-to-register add than a memory-to-register add
or a register-to-memory add because the penalty for the other 50% where it
is not optimal is much less.

> Preston Briggs
> preston@titan.rice.edu

Charles Marslett
chasm@attctc.dallas.tx.us
[I needed some hate mail anyway, ;^)]

ccc_ldo@waikato.ac.nz (02/07/90)

How about the 65816? Wasn't that around 1986?

sgolson@pyrite.East.Sun.COM (Steve Golson) (02/07/90)

In article <3300098@m.cs.uiuc.edu> gillies@m.cs.uiuc.edu writes:
> When was the last time someone introduced a new CISC architecture?
> How many years has it been?  New versions of old chips ('486, '040,
> etc) do not count as "new architectures".

Since no one else has mentioned it... what about TRON?

Steve Golson         sgolson@East.sun.com          golson@cup.portal.com
Trilobyte Systems -- 33 Sunset Road -- Carlisle MA 01741 -- 508/369-9669
       (consultant for, but not employed by, Sun Microsystems)
"As the people here grow colder, I turn to my computer..." -- Kate Bush

henry@utzoo.uucp (Henry Spencer) (02/08/90)

In article <3674@odin.SGI.COM> pkr@maddog.sgi.com (Phil Ronzone) writes:
>>... and the CISCs still have a built-in
>>handicap due to the mass of instruction/decoding/exception baggage
>>dragging along behind their RISC-like cores.
>
>Hmm, like automatic TLB loading, or that even more rarely used set known
>as MUL and DIV??? :-)

Automatic TLB loading is not worth the hardware needed to do it, as
Mips (among others) has clearly demonstrated.  And most RISCs do something
about multiplication and division, although sometimes the "something" is
a carefully-considered decision, based on extensive simulations, to leave
it to software.  (Of course, sometimes the same decision is made without
the careful consideration and extensive simulation... :-( )  I haven't
heard many complaints about having to live without TranslateAndTest or 
EvaluatePolynomial instructions. :-)
-- 
SVR4:  every feature you ever |     Henry Spencer at U of Toronto Zoology
wanted, and plenty you didn't.| uunet!attcan!utzoo!henry henry@zoo.toronto.edu

mash@mips.COM (John Mashey) (02/08/90)

In article <3562@odin.SGI.COM> pkr@maddog.sgi.com (Phil Ronzone) writes:
>In article <7826@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>>RISC reduced the design time - an advantage that a fast CISC doesn't
>>have. It also reduced the silicon area, but as all the players add
>>onchip caches and whatnot, that matters little.  Finally, RISC
>>increased the clock rate, but advanced CISC should come close.
>>
>>So, is it a wash? More-or-less, yes - if RISC designs stand still.
>>But they aren't. RISC is moving to ECL and GaAs, where transistors
>>are scarce. They are also moving to superscalar designs, where the
>>RISC/CISC difference is between incredible complexity and stupefying
>>complexity.
>
>I see that as the ONLY large advantage that RISC has. It simply has
>been able to reduce the design time.
>
>The second argument (gate scarcity) is interesting, but does it not
>also have a limit?  If gates are "typical" in the 10,000-100,000 range,
>yes, but how about when gates are "typical" in the 1,000,000-10,000,000 range?

Gate/transistor count can be misleading.
On anything current, most of the transistors will be in the caches & MMU
and register files.
Certainly, with million-transistor chips, there are not enough to do
everything you'd like, and even at 3-4M, although that gets you bigger caches,
we'll still have compromises and arguments in the hallways.

Some of the issues are:
1) In CMOS, the smaller size is less of an advantage for RISC than it used to be.
Nevertheless, at a given technology level, for a few years, it often means you
get a bigger cache, a more parallel FPU, a bigger MMU, or something else on the
same die, or, that the die can be smaller and hence cheaper, or that the RISC
gets an even more aggressive pipeline in the same space.
2) As everybody gets more aggressive, the pipelines and other critical paths
get more complex, as more aggressive = more things in parallel.
CISCs may well take longer to design (or not), but the key issue is what
happens in the critical paths on the chip.  From past history (i.e., things
like 360/91), you can make any architecture go faster, but if not designed
for smooth pipelining, the complexity can get very high.  In addition,
VLSI has design constraints that forbid some of the solutions used in the
less integrated designs, i.e., lots of big busses and interconnects.
(You certainly can have big busses, but you still only get a few layers
of them, and as soon as your design exceeds what you can get on the
chip, performance drops, whereas the performance cost of incremental
complexity in a less integrated design is not necessarily so much.)
3) Exception-handling is always one of the most trouble-prone areas of
a design, and anything that makes it more complex slows down the design
process.
4) Finally, nothing will make the current CISC micro architectures
have more registers available at once [they might get register sets,
but the instruction encoding makes it pretty hard to increase the number
available to the compiler at once.]

Note: this was not intended to be a RISC commercial, merely to point out that
the transistor-count issue gets over-emphasized.  I give a talk that ends with:

IF RISC IS SO GOOD, WILL CISC DISAPPEAR?
	No.
		68Ks and X86s will be with us forever.
WILL CISCS GET FASTER?
	Yes, using RISC-like techniques, or in fact, the same techniques that
	mainframe and supermini people have been using for 20 years to
	speed existing CISC architectures.
WILL THEY CATCH UP?
	No:
	Intellectual complexity.
	Longer design cycles.
	Fewer registers than current global optimizers can exploit.
WHERE ARE THEY NOW?
	It is hard to tell, as I've seen no real benchmarks for an
	040 yet [they may exist, I haven't seen them], and I'm hoping to
	see SPEC ratios for the 486 fairly soon, which will really help get over
	apples-and-oranges comparisons.

I have been collecting some numbers in advance of that, although unfortunately
limited to *stones & such, and will post some soon.  What I see so far says
that 25 MHz 486s, in 32-bit mode (I think), have floating point that
looks mostly like a MIPS M/500, and integer somewhat less than an M/800,
even with external caches as well as internal.
This is faster than an 8MHz R2000, and slower than a 12.5MHz one....
To be fair, they're somewhat faster under MS/DOS, and I'm not sure if that's
compiler differences, or (more likely) 16-vs-32 bit model differences.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

gillies@p.cs.uiuc.edu (02/09/90)

> Since no one else has mentioned it... what about TRON?

TRON was the only response I received via email that fit the
requirements (a genuinely new CISC instruction set, not an enhancement
to some old product).  How long ago was TRON conceived?  I'm  guessing
it may have been 5 years ago, but I'm not sure.  It's very interesting
that TRON comes from Japan, and the U.S. is still considered to be the
leader in CPU design.  Maybe I should have asked for the most recent
U.S. CISC cpu design.

pkr@maddog.sgi.com (Phil Ronzone) (02/09/90)

In article <35647@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>WILL THEY [CISC] CATCH UP?
>	No:
>	Intellectual complexity.
>	Longer design cycles.
>	Fewer registers than current global optimizers can exploit.


Hmmm -- maybe we should break RISC into RISC-the-instruction-set and
RISC-the-el-cheapo-hardware-realization-of-an-instruction-set.

What would we call a chip with say, an R3000 instruction set AND
a microcoded 68XXX  instruction set, with a mode bit to flip
between the two?

RISC? CISC? CRISP? :-) RIDICULOUS?

We're going to be able to do it some day.


------Me and my dyslexic keyboard----------------------------------------------
Phil Ronzone   Manager Secure UNIX           pkr@sgi.COM   {decwrl,sun}!sgi!pkr
Silicon Graphics, Inc.               "I never vote, it only encourages 'em ..."
-----In honor of Minas, no spell checker was run on this posting---------------

slackey@bbn.com (Stan Lackey) (02/09/90)

In article <35647@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>3) Exception-handling is always one of the most trouble-prone areas of
>a design, and anything that makes it more complex slows down the design
>process.

Microcode is the way this problem is commonly dealt with.  Microcode
turns an intractable hardware control mechanism into a part of the
design that many a computer hardware or software person can
understand, design, debug, etc.  Because exception handling must be
dealt with in hardware in a RISC, one could make the claim that this
makes RISCs more complex (from a certain point of view) than CISCs.

>4) Finally, nothing will make the current CISC micro architectures
>have more registers available at once [they might get register sets,
>but the instruction encoding makes it pretty hard to increase the number
>available to the compiler at once.]

This issue keeps coming up.  You know, I have seen so many compiled
routines that contain inner loops that only use 5 or 6 registers,
inner loops that determine the performance of the program, that I
really wonder how many registers are needed (Oh no not that again!)
Global register allocation and all, who cares if you save and restore
a half dozen registers outside of a loop that is going to execute 100
times.  

Having more registers really does help some of the time, and when
the industry starts making new CISC architectures I'll bet you will
see more, now that program size is not so constraining.
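
A sketch of the kind of inner loop in question (a hypothetical example, not from any measured program): a dot product keeps roughly half a dozen values live, so five or six registers cover it no matter how many the architecture offers.

```c
/* Dot-product inner loop: the live values are two pointers, the
 * trip count, two loaded operands, and one accumulator -- about
 * six registers, however many the machine provides. */
double dot(const double *a, const double *b, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}
```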

>IF RISC IS SO GOOD, WILL CISC DISAPPEAR?
>	No.
>		68Ks and X86s will be with us forever.
As will lots of other architectures, as long as RISCs are going to neglect
certain functionality.

>WILL THEY CATCH UP?
>	No:
>	Intellectual complexity.
>	Longer design cycles.
>	Fewer registers than current global optimizers can exploit.
Vector machines always run faster on vector problems than non vector
machines.  Even if the cycle time is a little slower.

The shoe is moving to the other foot, so to speak; in order to match
vector machines, RISCs will need to go to super scalar execution
(assuming they don't add the large register sets or the instructions
to do vectors).  To do this they need to deal with variable length
instructions (variability determined by register dependencies and
stuff in the pipe, not to mention the surrounding instructions),
register and opcode fields in variable places in the instruction word,
complexity handling exceptions, and all the other CISC characteristics 
RISCers love to bash.
-Stan

baum@Apple.COM (Allen J. Baum) (02/09/90)

[]
>In article <35647@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>Gate/transistor count can be misleading.
>On anything current, most of the transistors will be in the caches & MMU
>and register files.

Actually, I thought that I'd heard that 3/4 of the area on the 88100 is
the FPU (of course, that doesn't include caches)

>Some of the issues are:
>2) AS everybody gets more aggressive, the pipelines and other critical paths
>get more complex, as more aggressive = more things in parallel.
>CISCs may well take longer to design (or not), but the key issue is what
>happens in the critical paths on the chip.  From past history (i.e., things
>like 360/91), you can make any architecture go faster, but if not designed
>for smooth pipelining, the complexity can get very high.

Bingo! You've said something I believe strongly, and the
crux is the "designed for smooth pipelining" phrase.  I feel that this
is really the major distinguishing feature between "RISC" & "CISC".
This is far more important than the silly RISC/CISC
#of-regs/addressing-modes arguments.  A "CISC" which is designed for
pipelining should keep up with a "RISC".  The tricks used to make
"RISC"s go faster then work for "CISC"s as well.
Most of the otherwise fundamental problems of "CISC"s (exception handling)
go away (to the same extent they go away in "RISC"s, anyway).

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

baum@Apple.COM (Allen J. Baum) (02/09/90)

[]
>In article <51951@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>In article <35647@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>3) Exception-handling is always one of the most trouble-prone areas of
>>a design, and anything that makes it more complex slows down the design
>>process.
>
>Microcode is the way this problem is commonly dealt with.  Microcode
>turns an intractable hardware control mechanism into a part of the
>design that many a computer hardware or software person can
>understand, design, debug, etc. 

Well, you still need to be able to save the necessary info to unwind state,
which may mean keeping around a lot more info than on a RISC, and you may have
to be very careful to do it in the right order, etc.  Note that exception
handling on some RISCs (the i860) is a complete bitch as well.

>>4) Finally, nothing will make the current CISC micro architectures
>>have more registers available at once [they might get register sets,
>>but the instruction encoding makes it pretty hard to increase the number
>>available to the compiler at once.]
>
>This issue keeps coming up.  You know, I have seen so many compiled
>routines that contain inner loops that only use 5 or 6 registers,
>inner loops that determine the performance of the program, that I
>really wonder how many registers are needed (Oh no not that again!)
>Global register allocation and all, who cares if you save and restore
>a half dozen registers outside of a loop that is going to execute 100
>times.  
>
>Having more registers really does help some of the time, and when
>the industry starts making new CISC architectures I'll bet you will
>see more, now that program size is not so constraining.

Yes and no.  Wall's paper about link-time register allocation
with 52 regs showed a 10-29% speedup.  The high end was with
the Stanford benchmark, not a real workload.  These are still relatively
small numbers compared to what waiting one year will bring.  On the other
hand, lots of registers let you perform optimizations (besides avoiding
register spills) which can't be done otherwise, notably loop unrolling.
(But, to counter the counter-argument, superscalar techniques might severely
lessen the advantages of unrolling, since the overhead which is being
saved might be done in parallel.)
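
A minimal sketch of the unrolling trade-off being described, assuming a simple array sum: the four partial sums need four live registers, but the loop overhead (increment, test, branch) is paid once per four elements, and the four independent adds are exactly the kind of work a superscalar machine could overlap anyway.

```c
/* Sum an array with the loop unrolled by four.  The partial sums
 * are independent: more registers live at once, less loop
 * overhead per element, and parallel issue opportunities. */
long sum4(const int *a, int n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {   /* overhead paid once per 4 */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    long s = s0 + s1 + s2 + s3;
    for (; i < n; i++)                  /* leftover elements */
        s += a[i];
    return s;
}
```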

>The shoe is moving to the other foot, so to speak; in order to match
>vector machines, RISCs will need to go to super scalar execution
>(assuming they don't add the large register sets or the instructions
>to do vectors).  To do this they need to deal with variable length
>instructions (variability determined by register dependencies and
>stuff in the pipe, not to mention the surrounding instructions),
>register and opcode fields in variable places in the instruction word,
>complexity handling exceptions, and all the other CISC characteristics 
>RISCers love to bash.

You betcha. RISCs probably don't need to be quite as aggressive as CISCs
to take advantage of these techniques, but the complexity is going to be
worse than current day CISCs.


--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

colwell@mfci.UUCP (Robert Colwell) (02/09/90)

In article <51951@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>In article <35647@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>3) Exception-handling is always one of the most trouble-prone areas of
>>a design, and anything that makes it more complex slows down the design
>>process.
>
>Microcode is the way this problem is commonly dealt with.  Microcode
>turns an intractable hardware control mechanism into a part of the
>design that many a computer hardware or software person can
>understand, design, debug, etc.  Because exception handling must be
>dealt with in hardware in a RISC, one could make the claim that this
>makes RISCs more complex (from a certain point of view) than CISCs.

One could also dispute that claim.  Exception handling in normal 
machines (meaning those lacking the hard real-time limit that incoming
missiles pose) doesn't deserve special hardware attention.  Give the
software as much as it needs to clean up the mess.  Anything more
increases the hardware design time and the likelihood that something
will have to be respun to fix bugs.  Sure, now I've moved that
complexity into software, and there it will still have to be dealt
with.  But I don't know of any machines that were late because the
software implementing their exception handlers wasn't ready, and I
can think of lots of examples of complexity-related bugs delaying
hardware.

>Having more registers really does help some of the time, and when
>the industry starts making new CISC architectures I'll bet you will
>see more, now that program size is not so constraining.

You need as many registers as it takes for spilling and restoring 
them to stay off your list of bottlenecks.  This is a fairly 
complicated function of the number of functional units, their
respective latencies, the bandwidth available (and needed) to
and from memory, and the cleverness of the compiler.  Not a
RISC/CISC issue at all (which we first pointed out in 1983).

>>WILL THEY <CISC> CATCH UP?
>>	No:
>>	Intellectual complexity.
>>	Longer design cycles.
>>	Less registers than match current global optimizers.
>Vector machines always run faster on vector problems than non vector
>machines.  Even if the cycle time is a little slower.

"Always" is a tad strong.  If you're talking about 100% vectorizable
code I suppose you're right, but there isn't much of that around.
It certainly doesn't constitute the workloads of the customers
and benchmarks that we routinely run across.  For anything less
I believe vector machines are yesterday's answer to the problem.

>The shoe is moving to the other foot, so to speak; in order to match
>vector machines, RISCs will need to go to super scalar execution
>(assuming they don't add the large register sets or the instructions
>to do vectors).  To do this they need to deal with variable length
>instructions (variability determined by register dependencies and
>stuff in the pipe, not to mention the surrounding instructions),
>register and opcode fields in variable places in the instruction word,
>complexity handling exceptions, and all the other CISC characteristics 
>RISCers love to bash.

We solved this in Multiflow's machines without resorting to any of that.
Number of registers and memory bandwidth scale with the number of
functional units.  Instruction variability is at the packet (32-bit 
instruction word) level; a packet is present or it is not, and the 
cache miss hardware looks at a "mask" word to decide.  This allows us 
to do cache refill at full memory bandwidth without the refill engine 
having to even see any of the packets -- they just get blasted into 
icache directly.  And since they're fully decoded already, we get the 
RISC benefit of simple, fast instruction decode.

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
31 Business Park Dr.
Branford, CT 06405     203-488-6090

colwell@mfci.UUCP (Robert Colwell) (02/09/90)

In article <38462@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>>In article <35647@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>Some of the issues are:
>>2) AS everybody gets more aggressive, the pipelines and other critical paths
>>get more complex, as more aggressive = more things in parallel.
>>CISCs may well take longer to design (or not), but the key issue is what
>>happens in the critical paths on the chip.  From past history (i.e., things
>>like 360/91), you can make any architecture go faster, but if not designed
>>for smooth pipelining, the complexity can get very high.
>
>Bingo! I believe you've said something I believe strongly, and the
>crux is the "designed for smooth piplelining" phrase. I feel that this
>is really the major distinguishing feature between "RISC" & "CISC".
>This is far more important than the silly RISC/CISC
>#of-regs/addressing modes arguments. A "CISC" which is designed for
>pipelining should keep up with a "RISC". The tricks used to make
>"RISC"s go faster then work for "CISC"s as well.
>Most of the otherwise fundamental problems of "CISC"s (exception handling)
>go away (to the same extent they go away in "RISC"s anyway)

But this is only true if you are setting out to design a CISC starting
from a clean slate.  It's not clear to me why anyone would do that, 
unless they had goals other than performance uppermost in their minds
(and I believe there are some.)  If your task is to implement the VAX,
say, with "smooth pipelining", and you want to keep up with a machine
designed to be a RISC-like compiler target, then I believe you're doomed 
to failure.  (See Doug Clark's description of his travails in implementing
the VAX-8600 in ASPLOS-II.  I actually felt sorry for him, it reads like
an Edgar Allan Poe horror story.)  There are just too many 
architecturally-required strands of spaghetti for you to end up with
a clean design.

And if you're starting from scratch to design a CISC, and you try
to implement this concept of "smooth pipelining" (which could stand
a more rigorous definition, by the way), and you try to yank exception
handling into software, and you make the frequent ops go fast, and
you minimize the side-effects of ops, then what have you got left
that qualifies your design as a CISC?  Maybe you've left yourself
some really complicated instrs for some special purposes.  If so,
good luck working those into your exceptions-handling and pipelining
schemes.  I don't believe you can do that in anything like the same
amount of time a similar RISC design would need.

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
31 Business Park Dr.
Branford, CT 06405     203-488-6090

oconnordm@CRD.GE.COM (Dennis M. O'Connor) (02/09/90)

baum@Apple (Allen J. Baum) writes:
] >CISCs may well take longer to design (or not), but the key issue is what
] >happens in the critical paths on the chip.  From past history (i.e., things
] >like 360/91), you can make any architecture go faster, but if not designed
] >for smooth pipelining, the complexity can get very high.
] 
] Bingo! I believe you've said something I believe strongly, and the
] crux is the "designed for smooth piplelining" phrase. I feel that this
] is really the major distinguishing feature between "RISC" & "CISC".

A major illustrative example of this was the MCF architecture, developed by
the military when DEC refused to license the VAX architecture to MIL-SPEC
computer manufacturers. ( MCF was known as Nebula, also )

MCF was very similar to a VAX, but more so.  It had recursive addressing
modes, for instance: you could, in a single address specification,
specify something like ( M[x] = contents of memory location x )

[offset + M[ offset + M[ offset + M[ offset + register ] ] ] ]

I kid you not. And with no limit on the level of nesting. Just
think how easy (!?) this made compilation of high-level code
constructs like 
   rec_array( index_array( frame(2).index ).in_ptr ).rec_field( 2 )
;-)
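For the curious, that nested address specification can be rendered as a little interpreter in C.  This is a hypothetical sketch: the word-addressed memory M[], the offset array, and the level count are illustrative assumptions, not the actual Nebula encoding.

```c
#include <stdint.h>

/* Evaluate an MCF/Nebula-style recursive address specification:
 *   EA = off[0] + M[ off[1] + M[ ... off[levels-1] + reg ... ] ]
 * This toy uses word-addressed memory.  Note the data-dependent
 * chain of loads: each level's address depends on the previous
 * load, which is why a pipeline for this needs a loop in the
 * middle of it. */
uint32_t effective_address(const uint32_t *M, const uint32_t *off,
                           int levels, uint32_t reg) {
    uint32_t ea = off[levels - 1] + reg;   /* innermost level */
    for (int i = levels - 2; i >= 0; i--)
        ea = off[i] + M[ea];               /* one memory load per level */
    return ea;
}
```

With four levels this costs three dependent loads before the operand's address is even known.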

Worse than this, the instruction set was byte-quantized and
variable length, and you couldn't tell how to decode a byte until all
the previous bytes had been decoded.  ( One method of solving this was
to decode each byte all five possible ways and then select the correct
decoding. )  The (dynamic) average instruction length was five bytes, so to
achieve, say, 10 million instructions per second execution you had to decode
50 million bytes per second, one at a time.  Yeesh.

Designing a pipelined architecture for this beast was tough ( for
example, the pipeline had a loop in the middle of it to handle
the recursive addressing modes. )  A few changes to the architecture
would have allowed it to run much more quickly.

Apparently, this is what happens when a machine architecture is
designed by ONLY the compiler people ( I guess ) with no input
from the hardware people. The two must work together, IMHO :-)
--
  Dennis O'Connor      OCONNORDM@CRD.GE.COM      UUNET!CRD.GE.COM!OCONNOR
  Science and Religion have this in common : you must take care to
  distinguish both from the people who claim to represent each of them.

bcase@cup.portal.com (Brian bcase Case) (02/10/90)

>Vector machines always run faster on vector problems than non vector
>machines.  Even if the cycle time is a little slower.


I don't believe this.  See the WM-machine architecture proposed by
W. Wulf.  This is a general-purpose architecture that can achieve
vector rates without actual vector hardware (well, the memory system
has to be done right, but there are no vector instructions).

andrew@frip.WV.TEK.COM (Andrew Klossner) (02/10/90)

> When was the last time someone introduced a new CISC architecture?

How about the i960?  Object-oriented instructions with tags, and
"silicon operating system" features, though you'd never know it from
the externally released documentation.

  -=- Andrew Klossner   (uunet!tektronix!frip.WV.TEK!andrew)    [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]

aglew@oberon.csg.uiuc.edu (Andy Glew) (02/13/90)

>>Vector machines always run faster on vector problems than non vector
>>machines.  Even if the cycle time is a little slower.
>
>I don't believe this. 

How about a bit of hand-waving?:

If instruction dispatch is your bottleneck, vector machines are faster
because they dispatch multiple operations with one instruction.  Usually
the operations are simple, with trivial dependencies.  Really RISCy
vector machines do not require hardware to resolve the possible dependencies.

CISCs may dispatch multiple operations per instruction, but the
dependencies are typically more complicated and the operations dispatched
are less regular.

Superscalar RISCs (or CISCs) dispatch multiple operations per
"instruction decode cycle", but the operations dispatched are less
regular and the dependencies are more general.


Of course, if instruction dispatch is not your bottleneck,
and you are limited by things like data memory access and dependency
depth, then you use the most powerful multiple operation dispatch
technique you can get away with.
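The dispatch-bottleneck hand-wave can be caricatured in a few lines of C (a toy accounting model, counting dispatches rather than simulating any real hardware): one vector instruction carries n operations, while the scalar loop pays one dispatch per operation.

```c
#include <stddef.h>

/* Toy "vector add" instruction: one dispatch drives all n element
 * operations, so the per-element work carries no per-element decode
 * cost.  Returns the number of instruction dispatches consumed. */
typedef struct { double *dst; const double *a, *b; size_t n; } vadd;

size_t exec_vadd(vadd in) {
    for (size_t i = 0; i < in.n; i++)
        in.dst[i] = in.a[i] + in.b[i];  /* n operations... */
    return 1;                            /* ...one dispatch */
}

/* Scalar equivalent: one dispatch per add (ignoring the loads,
 * which only make the scalar count worse). */
size_t exec_scalar(double *dst, const double *a, const double *b,
                   size_t n) {
    size_t dispatches = 0;
    for (size_t i = 0; i < n; i++, dispatches++)
        dst[i] = a[i] + b[i];
    return dispatches;
}
```

If decode is the bottleneck, the 1-versus-n ratio is the whole argument; if it isn't, neither count matters.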
--
Andy Glew, aglew@uiuc.edu

danh@halley.UUCP (Dan Hendrickson) (02/13/90)

In article <26765@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
]]Vector machines always run faster on vector problems than non vector
]]machines.  Even if the cycle time is a little slower.
]
]
]I don't believe this.  See the WM-machine architecture proposed by
]W. Wulf.  This is a general-purpose architecture that can achieve
]vector rates without actual vector hardware (well, the memory system
]has to be done right, but there are no vector instructions).

Has anyone ever designed the hardware for this "proposed" architecture? Or
are you comparing real iron to paper?  There are a lot of very interesting
architectures which have been proposed but never saw silicon because they
could not effectively be implemented, or if they were the penalties on the
cycle time for the hardware made them slower than the "inferior" architectures.

slackey@bbn.com (Stan Lackey) (02/13/90)

In article <26765@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>>Vector machines always run faster on vector problems than non vector
>>machines.  Even if the cycle time is a little slower.

>I don't believe this.  See the WM-machine architecture proposed by
>W. Wulf.  This is a general-purpose architecture that can achieve
>vector rates without actual vector hardware (well, the memory system
>has to be done right, but there are no vector instructions).

The proposed WM machine assumes a memory access unit that is
programmed to access a linear data structure with a base address, a
stride, and a length.  That's a "vector move" instruction.  The WM
combines this with a Multiflow-style RISC-VLIW instruction set and
huge register file.  I have no doubts this machine can match standard
vector machine performance; in my opinion, it also qualifies as a
vector machine, as it has all the key features.
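The base/stride/length programming described above is easy to picture.  A hypothetical sketch (the function name and the reduction are invented here for illustration; the WM proposal streams operands rather than summing them):

```c
#include <stddef.h>

/* A WM-style memory access unit is programmed with a base address,
 * a stride, and a length, then streams the elements -- operationally
 * a "vector move".  Here the stream feeds a running sum so the
 * sketch returns a checkable value. */
double stream_reduce(const double *base, size_t stride, size_t length) {
    double sum = 0.0;
    for (size_t i = 0; i < length; i++)
        sum += base[i * stride];    /* every stride-th element */
    return sum;
}
```

Whether you call the programmed stream an "access unit" or a "vector move instruction" is exactly the terminological point at issue.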
:-) Stan

baum@Apple.COM (Allen J. Baum) (02/14/90)

[]
>In article <1229@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>>In article <38462@apple.Apple.COM> baum@apple.UUCP ( me! ) writes:
>> ....flames about smooth pipelining & CISC....

>But this is only true if you are setting out to design a CISC starting
>from a clean slate.  It's not clear to me why anyone would do that, 
>unless they had goals other than performance uppermost in their minds

Well, I think my point was that you would start with a clean slate; after
all, that's what RISC did. My further point was that some of the CISCy
features that people think of as awful (e.g. reg.-mem instructions) 
a). can be implemented without adverse effects, and
b). use of these features can improve performance (cut # of cycles) of programs

Another way to look at a reg-mem architecture is as a limited
superscalar architecture. If you look from a pipelining point of view,
you'll notice that the address calculation and execution (of different
instructions) occur in parallel; parallelism that does not occur in
load/store architectures. Note that you still have to use instruction 
scheduling in order to avoid interlocks, but IBM talked about doing that
to improve performance at the first ASPLOS conference.
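The "limited superscalar" view can be sketched in C (a toy model, not any real ISA): the reg-mem form bundles the address calculation/load and the ALU op that a load/store machine must issue as two separate instructions.

```c
#include <stdint.h>

/* Reg-mem form: ADD rd, disp(rbase).  The address calculation, the
 * load, and the add all hang off one instruction, so the address
 * adder and the ALU can work in parallel within a single decode. */
uint32_t add_reg_mem(uint32_t rd, const uint32_t *mem,
                     uint32_t rbase, uint32_t disp) {
    return rd + mem[rbase + disp];
}

/* Load/store equivalent: two instructions, two decode slots, and a
 * load-use interlock to schedule around. */
uint32_t op_load(const uint32_t *mem, uint32_t rbase, uint32_t disp) {
    return mem[rbase + disp];
}
uint32_t op_add(uint32_t a, uint32_t b) {
    return a + b;
}
```

Both forms compute the same value; the difference is how many trips through the decoder it takes.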

Current RISC design is based on some level of compiler technology. I believe
that current compiler technology can go further. In fact, one of the series
of patents that IBM got on the 801 was a method for choosing which address
mode to use, i.e. when to load into a register and re-use the value, and
when to use a reg-memory instruction because it won't get re-used.

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

sbw@naucse.UUCP (Steve Wampler) (02/22/90)

> This new "bat" machine is quite CISCy, and sounds pretty interesting,
> at least based on the tidbits from EE Times.  A 64 bit MPU with 
> instructions to support C-like things such as telling if one byte in
> a machine word is '\0' or '\n', or another which basically implements
> an 8-case switch statement.  Of course, like most new things, there
> are incredible MIPS claims for it but nothing like SpecMark or even
> Dhrystone out yet.
> 
> Dave Haynie Commodore-Amiga (Systems Engineering) "The Crew That Never Rests"

Well, I haven't seen the EE Times, but I know a little about the BAT.
The first model, the 6420, is, in my opinion, not a great implementation.
The 6430, however, is well implemented.  The BAT series does many of the
things that RISCs have been doing (large caches, lots of registers, etc.)
in a CISC environment (18 different addressing modes!).  Most of the
'common' instructions (i.e. RISC) are single cycle and typically one or
two bytes (so you can load 8 or 4 in one 64-bit bus access).  C-style
string functions and memory functions are encoded as single instructions
that operate on 8 bytes at a time.  Function calls are 3(?) cycles, and
interrupts are equally fast.  Also nice are the 256(?) data channels
for nice fast I/O.
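The '\0'-in-a-word test mentioned in the quoted EE Times summary has a well-known software analogue, the word-parallel (SWAR) zero-byte trick.  A sketch in C, offered as the classic technique rather than anything the BAT is known to implement:

```c
#include <stdint.h>

/* Nonzero iff some byte of w is 0x00.  Subtracting 1 from every
 * byte borrows into a byte's top bit only where that byte
 * underflowed (i.e. was zero); masking with ~w rejects bytes whose
 * top bit was already set.  One test covers 8 bytes at once --
 * the same economy a hardware "find '\0' in word" instruction buys. */
int has_zero_byte(uint64_t w) {
    return ((w - 0x0101010101010101ULL)
            & ~w
            & 0x8080808080808080ULL) != 0;
}
```

A strlen or strcpy built on this touches memory a word at a time instead of a byte at a time, which is where the claimed character-processing speed would come from.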

The MIPS claim is misleading - that is for character processing
(where they should be *fast*, given the 8-byte parallelism), PEAK
performance.  The biggest win (aside from character processing)
is the small impact heavy I/O has on overall performance - it just
doesn't slow down much under heavy I/O.

Given the slow clocks on the 6420 and 6430, they do a pretty impressive
job.  I'm anxious to see what the faster clock versions do.

I'm supposed to be getting one next week - though the OS may follow
after that by another week or so - to play with.  If there's anything
really interesting in the real machine (I don't expect that much from
the 6420) I'll be happy to post it.

Oh, yes, it's a 48-bit address space.  One last interesting feature
is the ability to play with part of a register without affecting the
other parts.  So you can put a 16-bit tag on a 48-bit address and
not have to unpack them into separate registers.  The machine *should*
be able to run Icon and Lisp like, well, a bat out of ....
-- 
	Steve Wampler
	{....!arizona!naucse!sbw}