[comp.arch] More RISC vs. CISC wars

slackey@bbn.com (Stan Lackey) (07/11/89)

In article <13980@lanl.gov> jlg@lanl.gov (Jim Giles) writes:

>But, RISC machines are easier to pipeline

I don't see how this can be true, other than possibly in the scoreboarding
logic.  If it has it.  

>easier to speed up the clock for

Unless your cycle time is limited by memory chips (for the cache), in
which case it doesn't matter.

>easier to provide staged functional units for, etc..  

I don't get this one.

>I don't
>know of any CISC machines with 'hardwired' instruction sets.  Micro-
>coding slows the machine down.

This is an interesting statement.  As I recall hearing, Cray started 
this perception back in the 70's.  I thought it had been proven wrong.
For example, the Alliant executes the instruction:

	add.d  (an)+, fp0

in one cycle (yes, that's double precision memory-to-register add,
auto increment), and it's microcoded.  Are you saying that it would be
done in zero cycles if we got rid of the microcode?  Gee, and after
spending so much real estate on those microcode RAM's...
:-) Stan's opinions only.

petolino@joe.Sun.COM (Joe Petolino) (07/12/89)

>>>But, RISC machines are easier to pipeline

>I don't see how this can be true, other than possibly in the scoreboarding
>logic.  If it has it.  

One of the most fundamental principles of the RISC philosophy (if there is one)
is to *not* include anything in the architecture specification if it will
make a pipelined implementation difficult.  Examples of this are numerous:
delayed branches, lack of interlocks in various cases, etc.  What these have
in common is that they relax the requirement for the processor to follow a
strictly sequential model of execution in cases where it would slow down
and/or complicate a pipelined implementation.

>>>I don't
>>>know of any CISC machines with 'hardwired' instruction sets.  Micro-
>>>coding slows the machine down.

>>I can think of a couple of old ones.  The first pdp11, the 11/20, was
>>hardwired.  (This accounts for some of the little irregularities in the
>>11 instruction set, in fact, like the way INC isn't quite the same as
>>ADD #1...)  And I seem to recall that the 360/75 was mostly hardwired,
>>for speed.

No need to go back that far.  None of the Amdahl 470-series machines (V6, V7,
V8) had microcode per se.  They couldn't find RAMs fast enough (yes,
mainframes often use RAM, not ROM, for microcode, since it's faster). 
Instead, something that looked very much like microcode was transformed into
PLA equations and implemented in gate arrays.

-Joe

jlg@lanl.gov (Jim Giles) (07/12/89)

From article <42550@bbn.COM>, by slackey@bbn.com (Stan Lackey):
< [...]
<>I don't
<>know of any CISC machines with 'hardwired' instruction sets.  Micro-
<>coding slows the machine down.
<
< This is an interesting statement.  As I recall hearing, Cray started 
< this perception back in the 70's.  I thought it had been proven wrong.
< For example, the Alliant executes the instruction:
<
< 	add.d  (an)+, fp0
< 
< in one cycle (yes, that's double precision memory-to-register add,
< auto increment), and it's microcoded.  Are you saying that it would be
< done in zero cycles if we got rid of the microcode?  Gee, and after
< spending so much real estate on those microcode RAM's...

And, how many microcycles does 'one cycle' on the Alliant correspond
to?  You don't suppose that a smaller instruction set would allow
instructions to run closer to the gate delay times rather than be
multiple microcycles long?  Seems to me that a RISC machine might
have _cycle_ times equal to the _microcycle_ of your CISC machine.
The real estate for your microcode ROM could better be used as
a high-speed instruction buffer.  With the instruction set hardwired,
the individual instructions would operate at gate-delay speeds.  This
could all be done for a machine with _fewer_ instructions.  And, as
everyone seems to agree, compilers for CISCs don't use all those extra
instructions anyway.  Seems like a good idea to get rid of them and
speed up the machine!

Alliant is obviously fairly slow, since it can do something to an
arbitrary memory location in one cycle.  The cycle time is apparently
longer than the memory delay time.

dik@cwi.nl (Dik T. Winter) (07/12/89)

In article <13982@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
 > From article <42550@bbn.COM>, by slackey@bbn.com (Stan Lackey):
 > < For example, the Alliant executes the instruction:
 > <
 > < 	add.d  (an)+, fp0
Nonsense.  It is either
	addd	(an)+,d0
or
	faddd	(an)+,fp0
 > < 
 > < in one cycle (yes, that's double precision memory-to-register add,
 > < auto increment), and it's microcoded.
 > 
 > And, how many microcycles does 'one cycle' on the Alliant correspond
 > to?

I do not know about microcycles, but seeing that the cycle time on the
Alliant is 170 nsec., it is clear that one-cycle execution is required
to get any performance.  And, yup, the Alliant will outperform the
SPARCstation 1 by a factor of up to 20 on some benchmarks, but if you
try to compile something the Alliant is clearly inferior.
(I have just been doing some benchmarking, a SPARC: 1.5 Megaflop single
precision; Alliant: up to 30 (4 processor FX4).  Compilation: SPARC at
least 2.5 times as fast.)

Moral: use the tool you have at hand to do the task you have at hand.
-- 
dik t. winter, cwi, amsterdam, nederland
INTERNET   : dik@cwi.nl
BITNET/EARN: dik@mcvax

hascall@atanasoff.cs.iastate.edu (John Hascall) (07/12/89)

In article <8263@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <13982@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
> > From article <42550@bbn.COM>, by slackey@bbn.com (Stan Lackey):
> > < For example, the Alliant executes the instruction:
  
>(I have just been doing some benchmarking, a SPARC: 1.5 Megaflop single
>precision; Alliant: up to 30 (4 processor FX4).  Compilation: SPARC at
>least 2.5 times as fast.)
  
>Moral: use the tool you have at hand to do the task you have at hand.
Moral2: if all you have is a hammer, everything looks like a nail.
 
	(use the *correct* tool for the job!!)
 
 
John Hascall
ISU Comp Center
Ames IA
 

slackey@bbn.com (Stan Lackey) (07/12/89)

In article <13982@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>From article <42550@bbn.COM>, by slackey@bbn.com (Stan Lackey):
>< [...]
><>I don't
><>know of any CISC machines with 'hardwired' instruction sets.  Micro-
><>coding slows the machine down.
><
>< This is an interesting statement.  As I recall hearing, Cray started 
>< this perception back in the 70's.  I thought it had been proven wrong.
>
>And, how many microcycles does 'one cycle' on the Alliant correspond
>to?  

One.  The reason many, even memory-to-register, operations take one
microcycle is because it has a scalar pipeline.  Even though pipelines
"can't-be-done" on CISC's.

The cycle time is fairly long, 170ns, but that was typical for when
the machine was designed, 1983.  Cycle time was set by
cache/memory/bus tradeoffs, and by the register read-modify-write time
you could get with CMOS gate arrays of that era.  Had nothing to do
with instruction decode, which is done in parallel with other
operations in the first and second pipeline stages.  Microcode access
time is done in parallel with normal address calculation time.

Note that CMOS has gotten like 3 times faster since then.

>compilers for CISCs don't use all those extra
>instructions anyway.  Seems like a good idea to get rid of them and
>speed up the machine!

The Alliant compiler really really does use the memory-to-register
operations, auto-inc/dec addressing modes, vector instructions, and
concurrency instructions.  All to advantage.

>Alliant is obviously fairly slow, since it can do something to an
>arbitrary memory location in one cycle.  The cycle time is apparently
>longer than the memory delay time.

As I hope I clarified above, the pipeline allows a very long sequence
of operations, including a memory access, to consume effectively one
cycle of execution time.  Specifically, memory-to-register floating
point takes six cycles from front to back, but with the pipeline
really consumes only one cycle.
:-) Stan

jlg@lanl.gov (Jim Giles) (07/13/89)

From article <42621@bbn.COM>, by slackey@bbn.com (Stan Lackey):
> In article <13982@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>>And, how many microcycles does 'one cycle' on the Alliant correspond
>>to?  
> 
> One.  The reason many, even memory-to-register, operations take one
> microcycle is because it has a scalar pipeline.  Even though pipelines
> "can't-be-done" on CISC's.

You are either using pipelines (in which case the instruction _issues_
in one clock, but the result is not delivered for several more), or
you aren't (in which case, I don't believe your claim that the instruction
has no microcycles).  RISCs can also be pipelined (easier than CISCs),
and the several simple instructions may execute as fast or faster than
the one big one.  And (back to the original subject), it is easier to 
_compile_ for a RISC machine.

Now that you've said that the Alliant is pipelined, you have to tell
me what the _real_ instruction timing for the given example is.  What
is the minimum number of clocks between issuing the given instruction
and issuing the next instruction which uses one of the results of the
one given?  Bet it ain't 1.

jlg@lanl.gov (Jim Giles) (07/13/89)

From article <42621@bbn.COM>, by slackey@bbn.com (Stan Lackey):
> [...]
> As I hope I clarified above, the pipeline allows a very long sequence
> of operations, including a memory access, to consume effectively one
> cycle of execution time.  Specifically, memory-to-register floating
> point takes six cycles from front to back, but with the pipeline
> really consumes only one cycle.

Or it really consumes six!!  Depends upon whether there is anything
independent to do while this instruction runs.  If the next instruction
depends on the result of this one, the next gets delayed six clocks. Period.

With a RISC instruction set, you can move the individual components of
this complex "instruction" around to get maximum overlap from your pipeline.
Splitting the functionality of the instruction requires more instruction
issues, but it also allows better flexibility in instruction scheduling
optimizations.  It would require a _very_ smart compiler to tell which
way to go.  This is exactly one of the points I made originally about
CISCs being harder to compile for.
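The scheduling argument above can be made concrete with a small in-order pipeline model (a Python sketch of my own; the latencies and register names are invented for illustration): the same four operations cost more or fewer cycles depending on how the compiler orders them.

```python
def total_cycles(prog):
    """prog: list of (dest, srcs, latency). Single-issue, in-order:
    an instruction issues only when all its sources are ready, and its
    result becomes available `latency` cycles after issue."""
    ready = {}               # register -> cycle its value is available
    t = 0                    # next free issue cycle
    for dest, srcs, lat in prog:
        t = max([t] + [ready[s] for s in srcs])   # stall on dependencies
        ready[dest] = t + lat
        t += 1
    return max([t] + list(ready.values()))

# Naive order: each load is consumed immediately, so each use stalls.
naive = [("x", [], 2), ("s1", ["x"], 1),
         ("y", [], 2), ("s2", ["y"], 1)]

# Scheduled order: both loads issued first, hiding their latency.
scheduled = [("x", [], 2), ("y", [], 2),
             ("s1", ["x"], 1), ("s2", ["y"], 1)]

print(total_cycles(naive))      # 6
print(total_cycles(scheduled))  # 4
```

Same work, two fewer cycles, purely from instruction ordering.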

hascall@atanasoff.cs.iastate.edu (John Hascall) (07/13/89)

In article <13985@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
 
>From article <42621@bbn.COM>, by slackey@bbn.com (Stan Lackey):
>> [...]
>>                        Specifically, memory-to-register floating
>> point takes six cycles from front to back, but with the pipeline
>> really consumes only one cycle.
 
>Or it really consumes six!!  Depends upon whether there is anything
>independent to do while this instruction runs.  If the next instruction
>depends on the result of this one, the next gets delayed six clocks. Period.
 
   I doubt it is delayed all six, surely the first part of the next
   instruction can be done (at least the fetch and decode).


   John Hascall
   ISU Comp Center

sbf10@uts.amdahl.com (Samuel Fuller) (07/13/89)

In article <13985@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>From article <42621@bbn.COM>, by slackey@bbn.com (Stan Lackey):
>> [...]
>> As I hope I clarified above, the pipeline allows a very long sequence
>> of operations, including a memory access, to consume effectively one
>> cycle of execution time.  Specifically, memory-to-register floating
>> point takes six cycles from front to back, but with the pipeline
>> really consumes only one cycle.
>
>Or it really consumes six!!  Depends upon whether there is anything
>independent to do while this instruction runs.  If the next instruction
>depends on the result of this one, the next gets delayed six clocks. Period.

If a RISC has data dependencies then it's stuck too, right?

>
>With a RISC instruction set, you can move the individual components of
>this complex "instruction" around to get maximum overlap from your pipeline.

I hardly consider a memory-to-register multiply a complex instruction.

For an example of a complex instruction see the TRT instruction in the
IBM 370 POO.  These are the instructions that RISC rightfully throws out.

>Splitting the functionality of the instruction requires more instruction
>issues, but it also allows better flexibility in instruction scheduling
>optimizations.  It would require a _very_ smart compiler to tell which
>way to go.  This is exactly one of the points I made originally about
>CISCs being harder to compile for.


Look at it this way.  To perform a floating point multiply on two
operands which exist in memory, this machine will take two slots down
the pipe to perform the operation.

Prev Inst             DATBXW
LOAD OP1 to REG1       DATBXW    load can be bypassed back into X for Mult
Mult REG1 by OP(mem)    DATBXW   Multiply is finished after the X
Next Inst                DATBXW

All RISC machines that I know about are Load/Store machines. So given
the same pipeline they would take at least three slots to perform the
operation.

Prev Inst           DATBXW
LOAD OP1 to REG1     DATBXW
LOAD OP2 to REG2      DATBXW
Mult REG1 by REG2      DATBXW    Multiply is finished after the X
Next Inst               DATBXW
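The issue-slot arithmetic in the two diagrams can be checked with a few lines (a Python sketch of mine; the one-instruction-per-cycle assumption comes from the diagrams themselves):

```python
STAGES = 6   # D A T B X W, as in the pipeline diagrams above

def completion_cycle(n_slots):
    """One instruction enters the pipe per cycle; the n-th one leaves
    (n - 1) + STAGES cycles after the first entered."""
    return (n_slots - 1) + STAGES

# Memory-to-register form: load + mem-operand multiply = 2 issue slots.
# Load/store form: load + load + register multiply   = 3 issue slots.
print(completion_cycle(2))   # 7
print(completion_cycle(3))   # 8
```

One extra issue slot per such memory operand, exactly as the diagrams show.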

A pipeline is a pipeline.  The pipelines on our 370 machines have a shorter
cycle time than any RISC processor on the market.  370 is definitely
not RISC.  RISC is wonderful stuff.  But it is not necessary to make a
fast computer.  RISC just allows you to make a fast computer quickly (read
design time) and cheaply (read single chip CPU).  Our machines are fast
but they take forever to design and cost a fortune. But people buy them :).

Sam Fuller / Amdahl System Performance Architecture

slackey@bbn.com (Stan Lackey) (07/13/89)

The discussion continues between jlg@lanl.gov (Jim Giles) and me.  If
you are bored with it, "Type 'n' now"


In article <13984@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>From article <42621@bbn.COM>, by slackey@bbn.com (Stan Lackey):
>> In article <13982@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>>>And, how many microcycles does 'one cycle' on the Alliant correspond
>>>to?  
>> 
>> One.  The reason many, even memory-to-register, operations take one
>> microcycle is because it has a scalar pipeline.  Even though pipelines
>> "can't-be-done" on CISC's.

I apologize for the sarcasm.  I have seen too many "can't be done in a
CISC" or "is too hard to do in a CISC" statements, referring to things
I have done in a CISC.

>You are either using pipelines (in which case the instruction _issues_
>in one clock, but the result is not delivered for several more), or
>you aren't (in which case, I don't believe your claim that the instruction
>has no microcycles).

The basic clock to the Alliant CE is 170ns.  One new microword is
accessed every 170ns cycle.  Many instructions consume one 170ns
cycle.  Some consume more.  FADD.D (ay)+, fp0 consumes one.  FDIV.D
<ea>, fp0 consumes more than one, like 3 or 4.

>Now that you've said that the Alliant is pipelined, you have to tell
>be what the _real_ instruction timing for the given example is.  What
>is the minimum number of clocks between issuing the given instruction
>and issuing the next instruction which uses one of the results of the
>one given?  Bet it ain't 1.

Bet it is, for lots of cases.

The CE has a fixed six-stage pipeline.  The stages are:

1. Instruction cache access and instruction decode

2. Address calculation and microcode access

3. Address translation and passing the address through the crossbar

4. Cache access and returning the data through the crossbar (on
   a read, send data on a write)

5. Integer execution or pass operands to floating point unit

6. Floating point execution and writing of results

So, the full execution time of a FP instruction is 6*170.  A new 
instruction can be started every 170.  
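Those two numbers (6*170 ns front to back, one new instruction every 170 ns) imply that the effective per-instruction cost approaches one cycle for long runs of independent instructions.  A quick sketch of my own, using only figures from this post:

```python
STAGES = 6       # Alliant CE pipeline depth, as described above
CYCLE_NS = 170   # Alliant CE clock period

def pipelined_time_ns(n):
    """Time for n back-to-back independent instructions: the first
    needs the full pipeline depth; each later one finishes one cycle
    after its predecessor."""
    return (STAGES + n - 1) * CYCLE_NS

print(pipelined_time_ns(1))          # 1020 ns: 6 cycles front to back
print(pipelined_time_ns(100) / 100)  # 178.5 ns: close to one cycle each
```

So latency is six cycles, but throughput, which is what dominates for long instruction streams, is one instruction per cycle.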

Dependencies cause dead cycles to be inserted.  These dependencies
include an integer operation being used as an address in the next
instruction, but do not include integer or floating point
dependencies; we used the 50ns BIT (Bipolar Integrated Technology)
functional units, and wired the data paths efficiently so that
dependent operands could be routed fast enough.

In the implementation, only one microword is accessed for the entire
instruction.  It is a very wide microinstruction, and fields of it
that are destined to control operations later in the instruction are
delayed by "pipeline registers".  The technique was called "data
stationary control" in the textbook we got it out of.  Lore has it
that IBM has used this style in their mainframes, and calls it
"delayed microinstructions" or something similar.

Note: Because condition codes are not available to a branch instruction
following a compare, branch prediction is employed.

Also note: the above strategy seems to work real well.  Compare its
Whets with the 1989 50ns RISC machines.

My opinions, and not necessarily those of Alliant Computer Systems,
International Business Machines, BBN, the publisher of the textbook we
got "data stationary" out of, and anybody else, living or dead, whom
I may have mentioned.
 :-) Stan

seanf@sco.COM (Sean Fagan) (07/14/89)

In article <42550@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>In article <13980@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>>But, RISC machines are easier to pipeline
>I don't see how this can be true, other than possibly in the scoreboarding
>logic.  If it has it.  

"Most" "CISC" machines have those wonderful complex addressing modes we all
know and love, which oftimes means a variable length instruction (best
example is, of course, the VAX).  This is rather difficult to pipeline,
although not impossible.  One of the features most "RISC" chips have in
common is a limited amount of instruction formats.  For example, the Cyber
170 machines (knew I'd put this in somewhere, didn't you? 8-)) had two
formats, 15 and 30 bits wide.  If I remember correctly, the i860 has 5 or so
(but, sizewise, it only works out to 2 or 3), and the 88k looks like it's
similar.

>>Micro-
>>coding slows the machine down.

>This is an interesting statement.  As I recall hearing, Cray started 
>this perception back in the 70's.  I thought it had been proven wrong.

Bad thing to say.  As far as I knew, it had been proven that microcoding
*did* slow the system down, since you could always, worst case, make your
microcode be your instruction set, and use a cache to execute the "real"
code normally, and execute the ucode directly when you wanted more speed.

>For example, the Alliant executes the instruction:
>	add.d  (an)+, fp0
>in one cycle (yes, that's double precision memory-to-register add,
>auto increment), and it's microcoded.  Are you saying that it would be
>done in zero cycles if we got rid of the microcode?  Gee, and after
>spending so much real estate on those microcode RAM's...

As somebody else said, in regards to Elxsi:  if your most complex
instructions take as much time as your simplest ones, then your cycle time
is too long (or something like that).  Usually, the reason cycle times are
chosen to be longer than they need to be is a) the hardware isn't up to
snuff (e.g., ETA), or b) you need a longer cycle to get more work done.
RISC advocates (and I find myself in that group this time) claim that, if b)
is chosen, then you should simply make your cycle time shorter and go with a
different instruction set (or algorithm in the hardware).

-- 
Sean Eric Fagan  |    "Uhm, excuse me..."
seanf@sco.UUCP   |      -- James T. Kirk (William Shatner), ST V: TFF
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

jlg@lanl.gov (Jim Giles) (07/14/89)

> The discussion continues between slackey@bbn.com (Stan Lackey) and me.
> If you are bored with it, "Type 'n' now"
> 
>>[...]                                                            What
>>is the minimum number of clocks between issuing the given instruction
>>and issuing the next instruction which uses one of the results of the
>>one given?  Bet it ain't 1.
> 
> Bet it is, for lots of cases.

Obviously, it _never_ is.  I want the time from instruction issue
to writing of results.  Even by _your_ calculation, this is _always_
six clocks (for the instruction at issue).

> The CE has a fixed six-stage pipeline.  The stages are:
> [...]
> 6. Floating point execution and writing of results
> 
> So, the full execution time of a FP instruction is 6*170.  A new 
> instruction can be started every 170.  

This is just like any other pipelined machine (CRAY, for example).
I would _never_ claim that the divide approximate on the Cray was
one clock (even though that's its issue time).  The instruction
time is number of clocks from issue to completion - nothing else.
When someone says a machine is pipelined, I _assume_ that issue
time is shorter than the whole instruction time (for most instructions).

> Dependencies cause dead cycles to be inserted.  [...]

Finally, this discussion gets back to _my_ point about compiler
construction.  CISC machines usually have a superset of the instructions
found in a RISC machine.  The compiler must determine whether to
use the simpler instructions (in which case, you pay more for instruction
issue - but might find an improved scheduling) or whether to use the
complex instruction (which pays less for instruction issue, but might
cause more delays to be generated).  RISC doesn't have the choice,
so it's _OBVIOUSLY_ easier to compile for!

This still leaves the question of whether RISC or CISC is faster.
This question is independent of the compiler discussion I am talking
about.  My bet would be that RISC could be made faster if the compiler
for the CISC is assumed _not_ to do the optimizations I am talking about.
For this reason I claim that CISC compilers _are_ more complex than
RISC compilers (at least if they generate competitive code).