[comp.arch] really CISC

hansen@mips.UUCP (Craig Hansen) (05/14/87)

In article <16658@amdcad.AMD.COM>, phil@amdcad.AMD.COM (Phil Ngai) writes:
> Let's not forget about a branch target cache. With memories capable of
> supplying high bandwidth (nibble mode, page mode, video memories, etc)
> all you really need is to handle the start up time. By caching only
> about four instructions at the branch target, you can run loops at
> full speed no matter how long they are. By caching several branch
> targets you can handle nested loops, etc. 

Don't forget that you must interleave load and store references into that
memory system too.  A "real man's" memory system with ECC, of a serious
size, and bandwidth for I/O transactions doesn't respond to a memory request
in 80 nanoseconds.  (This assumes two cycles in on/off chip delays and
branch target selection).  This means that you'd better cache more than four
instructions in your branch target cache, and the average basic block size
is in the range of six to eight instructions (this is for MIPS instructions;
AMD instructions would be more because of their weaker load/store/branch
instructions).  The branches that terminate these basic blocks are taken
more than 50% of the time, so I don't see how a branch target cache performs
comparably to an instruction cache, in a real system environment.
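
The arithmetic behind that objection can be sketched as a rough model.  This
is only an illustration under stated assumptions: the 40 ns cycle follows
from the 25 MHz clock, and the two overhead cycles and the four-instruction
target entry come from the text above, but the function names, the
7-instruction basic block, the 60% taken rate, and the one-instruction-per-
cycle streaming are hypothetical choices, not AMD or MIPS figures.
Contention from loads and stores is ignored here.

```python
import math

def btc_stall_cycles(mem_latency_ns, cycle_ns=40.0,
                     overhead_cycles=2, btc_entries=4):
    """Stall cycles after a taken branch whose target hits the BTC.

    Assumption: the BTC issues one cached instruction per cycle while the
    external memory starts up; the stream from memory arrives after
    overhead_cycles plus the memory's own latency, rounded up to cycles.
    """
    startup = overhead_cycles + math.ceil(mem_latency_ns / cycle_ns)
    return max(0, startup - btc_entries)

def cycles_per_instruction(mem_latency_ns, block_size=7, taken_rate=0.6):
    """Average CPI if every basic block ends in a branch (assumed model)."""
    stall = btc_stall_cycles(mem_latency_ns)
    return 1.0 + taken_rate * stall / block_size

if __name__ == "__main__":
    for latency in (80, 120, 200):  # hypothetical memory latencies in ns
        print(latency, btc_stall_cycles(latency),
              round(cycles_per_instruction(latency), 3))
```

Under these assumptions an 80 ns memory is just covered by four cached
instructions, while a 200 ns memory costs three stall cycles on every taken
branch.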

I guess the AMD folks are excited about their new child, but when reality
sets in, and you try to build a real UNIX-based computer system out of this
fine controller part, you'll be sorely disappointed if you expected 17 MIPS
at 25 MHz with a nibble-mode DRAM as your primary memory system.  Based on
their benchmarks (puzzle, dhrystone, and a tiny piece of linpack; sorry, but
that's all they've got to compare with...) with an external cache, external
cache control hardware, and a fast main memory system, a 29000 at 25 MHz that
you can build next year doesn't perform any faster than a MIPS R2000 system
at 16.7 MHz that you can build now.

-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen

bcase@apple.UUCP (Brian Case) (05/15/87)

In article <388@dumbo.UUCP> hansen@mips.UUCP (Craig Hansen) writes:
>instructions).  The branches that terminate these basic blocks are taken
>more than 50% of the time, so I don't see how a branch target cache performs
>comparably to an instruction cache, in a real system environment.

it can perform comparably because, logically, part of the instruction cache
is in the external rams.  the branch target cache is only there to cover the
latency of the initial access in that external ram (how many times do we
have to say this?).  so, in essence, the Am29000 branch target cache can
cache 32 loops of *any* size.  this is most certainly not true of traditional
instruction caches.  on the other hand, this is not necessarily a very
important attribute; the point is that a branch target cache can perform,
and in fact does perform, as well as an instruction cache.  i feel that it
should be i (or someone defending the am29000) asking *you*
to defend a traditional instruction cache relative to the branch target
cache (always assuming an external burst mode memory for the branch target
cache, of course.  the burst mode memory is a critical part of the concept.).
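
A toy comparison of the two schemes, as a sketch only.  The direct-mapped
cache with one-instruction lines, the 128-word size, and the function names
are assumptions made for illustration; they do not describe the Am29000 or
any MIPS part.  The burst-mode memory is assumed to stream one instruction
per cycle once started.

```python
def icache_misses_per_iteration(loop_len, cache_words, iterations=10):
    """Misses in the last pass through a straight-line loop.

    Assumed organization: direct-mapped, one instruction per line,
    loop body occupying addresses 0 .. loop_len - 1.
    """
    tags = [None] * cache_words
    misses = 0
    for _ in range(iterations):
        misses = 0  # keep only the count for the most recent pass
        for addr in range(loop_len):
            idx = addr % cache_words
            if tags[idx] != addr:
                misses += 1
                tags[idx] = addr
    return misses

def btc_stalls_per_iteration(startup_cycles, btc_entries):
    """Stalls at the loop-closing branch when its target is in the BTC.

    The loop length does not appear: the rest of the body streams from
    the burst-mode memory, which is the point being argued above.
    """
    return max(0, startup_cycles - btc_entries)

if __name__ == "__main__":
    for loop_len in (100, 200, 300):
        print(loop_len,
              icache_misses_per_iteration(loop_len, cache_words=128),
              btc_stalls_per_iteration(startup_cycles=4, btc_entries=4))
```

In this model the 128-word cache is perfect for the 100-instruction loop and
misses on every fetch for the 300-instruction loop, while the branch target
cache figure stays the same for all three; whether it stays at zero depends
entirely on the start-up latency of the external memory.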

>I guess the AMD folks are excited about their new child, but when reality
>sets in, and you try to build a real UNIX-based computer system out of this
>fine controller part, you'll be sorely disappointed if you expected 17 MIPS
>at 25 MHz with a nibble-mode DRAM as your primary memory system.  Based on

*not* nibble-mode, video dram; there is a big difference.

>their benchmarks (puzzle, dhrystone, and a tiny piece of linpack; sorry, but
>that's all they've got to compare with...) with an external cache, external
>cache control hardware, and a fast main memory system, a 29000 at 25 MHz that
>you can build next year doesn't perform any faster than a MIPS R2000 system
>at 16.7 MHz that you can build now.

i will be one of the first to say that mipsco has a good part.  you guys have
done a great job in the sense that you have gotten great performance at lower
clock rates.  but what happens to your bus at 25, 30, 35 mhz?  i'm not saying
that you guys are going to fail to fix things, but let's not be casting
stones!  how do you know that amd will be sorely disappointed in its
expectations of 17 mips at 25 mhz?  there are accurate simulations to support
just this data!  i am not trying to say you are wrong, but don't just make
the statement, back it up with solid data or at least some observations you
have made about problems with the architecture/implementation!  please,
follow the lead of your co-worker john mashey.

sorry, i know the note i am responding to is old, so if this issue has already
been put to bed, forgive me.  i have been off the net for a while because of
a job change.

    bcase

howard@cpocd2.UUCP (Howard A. Landman) (05/20/87)

In article <3460001@hpsrla.HP.COM> brucek@hpsrla.HP.COM (Bruce Kleinman) writes:
>CISC CPUs tend to be, uh,
>rather complex.  And, therefore, the CPU usually dominates the chip.
>
>RISC CPUs tend to be rather simple. And,
>therefore, the CPU is usually a small portion of the chip.

This is approximately true, depending on what you mean by "CPU".  However,
I find that it is usually far more enlightening to consider the fraction
of the chip dedicated to computation/storage versus the fraction dedicated
to control.  Last time I looked, every CISC microprocessor ever made spent
50% to 75% of the chip area on control, with an average of about 2/3.
Usually a big chunk of this is the ucode ROMs.  This leaves only 25% to 50%
of the chip to do the useful work.  It's easy to see this in most chip photos.
Look for the datapath and register file, squeezed into one end of the chip.
Everything else is control.

I don't have numbers for all the RISC processors, but the RISC I spent 6%
of its area on control; roughly one-tenth the CISC percentage.  And I'm
pretty sure MIPS is well under half.  Numbers, Craig?  Anyone?
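
The "one-tenth" figure can be checked with a few lines of arithmetic.  The
percentages are the ones quoted above; the variable and function names are
just illustrative.

```python
CISC_CONTROL = {"low": 0.50, "average": 2 / 3, "high": 0.75}
RISC_I_CONTROL = 0.06

def useful_fraction(control_fraction):
    """Share of the die left for datapath and storage."""
    return 1.0 - control_fraction

def control_ratio(risc_control, cisc_control):
    """RISC control area as a fraction of the CISC control share."""
    return risc_control / cisc_control

if __name__ == "__main__":
    print(round(control_ratio(RISC_I_CONTROL, CISC_CONTROL["average"]), 3))
    print(round(useful_fraction(RISC_I_CONTROL), 2))
    print(round(useful_fraction(CISC_CONTROL["average"]), 2))
```

Against the 2/3 average the ratio comes out at about 0.09, which leaves
roughly 94% of the RISC I die for useful work versus about a third of a
typical CISC die.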
-- 
	Howard A. Landman
	...!intelca!mipos3!cpocd2!howard
	howard%cpocd2%sc.intel.com@RELAY.CS.NET
	"You just ask them?"

brucek@hpsrla.HP.COM (Bruce Kleinman) (05/22/87)

+---------
| Last time I looked, every CISC microprocessor ever made spent
| 50% to 75% of the chip area on control, with an average of about 2/3.
| [....]
| This leaves only 25% to 50% of the chip to do useful work.
+---------

My point exactly.  Consider the newest Transputer, the T8 I believe.
A RISCy chip which happens to be microcoded, curiously enough.  Very
tight instruction set, and a stack-like programming model.  Fixed-width
instructions, a scant 8 bits wide.  Honest.  Sound simple?  It is.

The instruction decode and microcode occupy about 10% of the die.  Hmmm,
what to do with the rest of that real estate?  The Inmos folks decided on
4K bytes of RAM, a floating point unit, a DRAM controller, and four
high-speed serial links.  They have experimented with a version that
replaces the floating point unit with a Winchester controller.

My point here is not to glorify the Transputer (although I am exceedingly
impressed by the chip).  I use it to illustrate the possibilities opened
up by "simple instruction set computers" as a whole.  MIPS II has a
4K byte cache.  The AMD 29000 has 192 registers, a branch target cache,
and an MMU.  The Fairchild Clipper has a floating point unit (the 8K bytes
of cache and the pair of MMUs are on two separate dies).  The possibilities
seem nearly endless.


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                              Bruce Kleinman
              Hewlett Packard -- Network Measurements Division
                          Santa Rosa, California

                         ....hplabs!hpsrla!brucek
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~