[comp.arch] RISC vs CISC

alfred@dutesta.UUCP (Herman Adriani & Alfred Kayser) (10/27/88)

I only see discussions about established firms and their hardware. Am I
the only one who ever heard about the Acorn RISC chip and their computer
'Archimedes', or do all of you ignore this machine just because it is a
small (British) company which manufactures strange computers (the BBC)?
I do not own an Archimedes, but I like the machine -> it is cheap, fast,
hi-res, lots of memory.
What I'd like to know: what does this newsgroup think about this machine (chip)?

					Herman Adriani.
P.S.
I don't want to start a new war about RISC or CISC, I like both architectures
and they should both exist!!! They just have their own territory in which
they outperform the other one.

Don't flame me about starting something, I just wanna know (everything).
-- 
 _____________________________________________________________________________
/                                                                             \
| Herman Adriani & Alfred Kayser: Computer fans especially from 24 pm to 7 am |
\_____________________________________________________________________________/

bcase@cup.portal.com (Brian bcase Case) (10/29/88)

>Am I the only one who ever heard about the Acorn RISC chip and their computer
>'Archimedes', or do all of you ignore this machine just because it is a
>small (British) company which manufactures strange computers (the BBC)?
>I do not own an Archimedes, but I like the machine -> it is cheap, fast,
>hi-res, lots of memory.
>What I'd like to know: what does this newsgroup think about this machine (chip)?

I can only give my opinion; my views do not necessarily represent those of
this group.  No warranty, either expressed or implied, ....

The ARM is a very interesting, and in some ways clever, architecture.  The
ability to do an ALU and SHIFT op in one cycle is surprisingly useful in
some kinds of code, especially the embedded-control, bit-twiddling, graphics-
handling kinds of applications.  The conditional execution (each instruction
has a 4-bit condition field that must be satisfied for the instruction to
execute) facility is also quite nice, sorta like a skip.  It does have a
three-address architecture.
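
To make that concrete: here is the sort of bit-twiddling C statement in
question (variable names mine, purely for illustration).  On the ARM the
guarded statement can compile down to one compare plus one conditional
add whose second operand is shifted as part of the same instruction, so
the test, the shift and the add need not be separate operations:

    /* Illustrative sketch only.  On ARM, the body of the "if" can
       become a single predicated ADD with the operand shifted left
       by 2 -- shift, ALU op and condition folded into one instruction. */
    unsigned accumulate(unsigned sum, unsigned x, unsigned limit)
    {
        if (x < limit)
            sum = sum + (x << 2);    /* i.e. sum += x * 4 */
        return sum;
    }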

Its biggest problems are implementation and support.  The address bus is
only 26 bits, fault handling and memory management are totally
weird, very fast implementations do not exist, no one is providing
comprehensive support (maybe VLSI tech is doing a better job these days,
but where are the compilers?  UNIX implementations?  etc.), etc.  If I
remember correctly, it also has only 25, or some weird number of, registers,
and only some are available to user code.  One of them is the PC, which
is not good for really high-performance implementations.  The ALU and SHIFT
instructions take 2 cycles.  It has only an address bus and a combined
data/instruction bus.  For best performance, you need more bandwidth.  At
high clock rates, the bus protocols will not work well.

I have seen small graphics kernels on which the ARM does better than any
vanilla RISC in terms of cycle count.  But vanilla RISCs win in total
time (shorter cycle).  Most of the implementation problems could be solved,
but who is solving them?

aglew@urbsdc.Urbana.Gould.COM (10/30/88)

>/* ---------- "RISC vs CISC" ---------- */
>I only see discussions about established firms and their hardware. Am I
>the only one who ever heard about the Acorn RISC chip and their computer
>'Archimedes', or do all of you ignore this machine just because it is a
>small (British) company which manufactures strange computers (the BBC)?
>I do not own an Archimedes, but I like the machine -> it is cheap, fast,
>hi-res, lots of memory.
>What I'd like to know: what does this newsgroup think about this machine (chip)?
>
>					Herman Adriani.
>P.S.
>I don't want to start a new war about RISC or CISC, I like both architectures
>and they should both exist!!! They just have their own territory in which
>they outperform the other one.
>
>Don't flame me about starting something, I just wanna know (everything).
>-- 
> _____________________________________________________________________________
>/                                                                             \
>| Herman Adriani & Alfred Kayser: Computer fans especially from 24 pm to 7 am |
>\_____________________________________________________________________________/

Most of us simply don't know enough about the ARM to say much
useful. I've looked at its instruction set, and it appears to
be fairly clean, although with the typical British penchant for
gimmicks (don't flame me, I'm (almost) a Brit too -- dual citizen).
I am interested in the ARM because of its commodity market nature
- I think it's VLSI Technology that's selling it as an embedded
controller(?) - which means that it has significantly different
tradeoffs than most of the workstation chip wars we hear about
most often.  But apart from quoting the marketing literature
(7 MIPS, listing the IS) I don't have any meaningful data.

If anyone connected with the ARM wishes to start a meaningful 
discussion, I would probably join in, with questions like:
did you intend to sell it into the low-end market, or did you design
the chip and have it just happen that way? did you make the right
tradeoffs? etc.?  Is there anyone out there that can lead this?

In general, I start discussions on things that I am interested in.
I provide information on things that I know about, when that information
is already in the public domain.  And I'll comment about things
other folks say.

So, Herman, what can you tell us about the ARM and Archimedes?


Andy "Krazy" Glew. 
at: Motorola Microcomputer Division, Champaign-Urbana Development Center
    (formerly Gould CSD Urbana Software Development Center).
mail: 1101 E. University, Urbana, Illinois 61801, USA.
email: (Gould addresses will persist for a while)
    aglew@gould.com     	    - preferred, if you have MX records
    aglew@fang.gould.com     	    - if you don't
    ...!uunet!uiucuxc!mcdurb!aglew  - paths may still be the only way
   
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

PS. I promise to shorten this .signature soon.

kers@otter.hple.hp.com (Christopher Dollin) (10/31/88)

Some remarks about the ARM, following Brian Case's response to the basenote. I
speak solely as an owner of an Acorn Archimedes with some experience in
compiler writing; my employers won't buy me one for my desk [are you
surprised?].

| Its biggest problems are implementation and support.  The address bus is
| only 26 bits, fault handling and memory management are totally

Memory management is not part of the chip, but has to be provided by an
external memory manager. The current version of this (MEMC) does seem to be
a little odd, but then, I'm not really familiar with MMU devices.

| weird, very fast implementations do not exist, no one is providing
| comprehensive support (maybe VLSI tech is doing a better job these days,
| but where are the compilers?  UNIX implementations?  etc.), etc.  If I

Acorn is *supposed* to be working on a Unix implementation. Last I heard it was
due out toward the end of this year. Silence is golden ....

| remember correctly, it also has only 25, or some weird number of, registers,
| and only some are available to user code.

25 on-chip registers. Each operating mode sees 16; in the three non-user
modes, some of the user registers are shadowed out; in SVC and MI (maskable
interupt) mode, the R15 (PC) and R14 (return link) are shadowed, and in NMI
(non-MI) mode, R11-R15 are shadowed. This is to allow fast context switching,
especially in NMI code (where the NMI owner can set up the NMI registers and
return to user code; NMIs can then operate with *no* register save-restore).
Acorn operating systems are *heavily* interrupt-driven.

| One of them is the PC, which is not good for really high-performance
| implementations.  The ALU and SHIFT instructions take 2 cycles.

No, one cycle. An ALU instruction with one operand shifted *by an amount held
in a register* takes an additional cycle. Incidentally, one should be careful
to distinguish *sequential* cycles from *non-sequential* cycles, as the
instruction fetch is in burst mode (I think that's the right term).

| It has only an address bus and a combined data/instruction bus.  For best
| performance, you need more bandwidth. At high clock rates, the bus protocols
| will not work. I have seen small graphics kernels on which the ARM does better
| than any vanilla RISC in terms of cycle count.  But vanilla RISCs win in total
| time (shorter cycle).  Most of the implementation problems could be solved,
| but who is solving them?

Well, presumably Acorn. If not, I can imagine I'll be very upset in a few
years' time ....

Regards, Kers.
"If anything anyone lacks, they'll find it all ready in stacks."

hjb@otter.hple.hp.com (Harry Barman) (10/31/88)

A friend of mine saw 4.3 running on an Archimedes w/ X windows.  If Acorn's
famed marketing/shipping depts. can get their act together it may be possible
to buy it!

The ARM was started as a project in mid-83.  Initially, the main reason for
the project was (don't laugh) to provide a successor to the 6502 processor that
was used in Acorn's previous machines.  I believe this background influenced the
design of the MMU, which led to huge page sizes and so wasn't very suitable for
Unix implementations.

Acorn is in the business of building small, cheap, single-board personal
computers, and in that context I think the ARM fits in reasonably well.

Harry

cik@l.cc.purdue.edu (Herman Rubin) (06/25/89)

In article <1989Jun24.230056.27774@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
> In article <57125@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
> >This, in my opinion is one of the major faults of RISC processors. They
> >do not provide basic arithmetic instructions. 
> 
> When the list of "basic" arithmetic instructions is pages long, one starts
> to wonder how many of them are really "basic".  The instruction you ask
> for -- divide double length by single length yielding single-length result --
> is not exactly frequently needed.  Just how much silicon is it worth to make
> it run faster than an implementation as a subroutine?

The list of arithmetic instructions is pages long only if the documentation
is stupid enough not to put lots of instructions on one page.  What is used
heavily in his work, reported in sci.math, is division of a double by a
single, getting quotient and remainder simultaneously.  This takes about
6 or 7 instructions inline if some provisions are made.  If they are not,
it may very well take 20 instructions.
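
A minimal C sketch of the operation in question, assuming 32-bit "single"
words and a 64-bit type for the double-length dividend (a hardware
instruction would deliver both results together in one step):

    #include <stdint.h>

    /* Divide a double-length value by a single-length divisor,
       returning quotient and remainder together.  Behavior when the
       quotient overflows 32 bits is left undefined, as most hardware
       divide instructions do.  Without a combined instruction, each
       of the two lines below may expand into a long inline sequence
       or a subroutine call. */
    void divrem(uint64_t num, uint32_t den, uint32_t *q, uint32_t *r)
    {
        *q = (uint32_t)(num / den);
        *r = (uint32_t)(num % den);
    }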

One might reduce the number of division instructions by only having
double/single yielding quotient and remainder for integer division,
signed and unsigned, and floating point division similarly, signed only.
This would give three division instructions with different argument
sizes, and the only "waste" would be in the moving of unwanted results
to an unused register.  Even this could be avoided if there were read-only
registers, which many machines have.  Instead, many machines have separate
quotient and remainder operations.

Now Bob Silverman is one of the mathematicians who knows how to use the
machine instructions efficiently, and absent the instruction, will modify
the algorithm considerably to get around it.  This is one of the things
which does not show up on benchmarks.  The presence or absence of a few
"minor" instructions can make an algorithm efficient or horribly inefficient.

Also, the cost of the entire ALU is typically dwarfed by the costs of 
memory, IO, etc.  Silicon is cheap.

Mathematicians have been rightly accused of not making sufficient use of computers.
It is not the case that fixed-length floating-point operations are what is
needed.  Multiple precision arithmetic requires good integer arithmetic.

There are operation counts and there are operation counts.  Is there essentially
one integer sum, or are there several?  I have advocated an integer quotient
and remainder with the choice of truncation depending on the signs of the
arguments.  Is this one instruction or 2^n, where n is between 8 and 16?
I consider it one; the tag field can be decoded by the arithmetic unit, not
by the control unit.
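
For instance (my own notation, not his): a sketch of one divide entry
point whose rounding is selected by a small mode tag rather than by a
separate opcode for every case.

    /* mode 0: quotient truncated toward zero (the C99 rule);
       mode 1: quotient rounded toward minus infinity, so the
               remainder takes the sign of the divisor. */
    void divqr(int mode, int a, int b, int *q, int *r)
    {
        *q = a / b;
        *r = a % b;
        if (mode == 1 && *r != 0 && ((*r < 0) != (b < 0))) {
            *q -= 1;                 /* adjust to floored division */
            *r += b;
        }
    }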

We can get much more at little extra cost, and flexibility is cheap.  The
same holds for languages.  But once the instruction is omitted, it can be
expensive to do anything about it.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

mash@mips.COM (John Mashey) (11/10/89)

In article <28942@shemp.CS.UCLA.EDU> frazier@oahu.UUCP (Greg Frazier) writes:
...
>continuing speed advantages).  It seems to me that the
>real issue is what would the extra die space be used for.
>With a deeper pipeline, one could use additional gates
>without slowing the clock down.  With a CISCy enough
>CISC, one might be able to keep the pipeline full.  So,
>if we were to double the die size tomorrow, what would
>go on the chip?  Just to throw sand in our eyes, why not
>put 2 RISCS on the chip?  Big research area - should the
I don't think we're anywhere near this yet, and this can be seen by
analyzing the layout and nature of million-transistor chips [like i860s].
If you look at the i860 die, you find that:
	a) Most of the transistors are in the caches.
	b) Most of the space is the FPU, registers, integer datapath, etc.
	Some of this stuff is wires, and it doesn't shrink as well as
	transistors do.
	c) At the top speed claimed for it, eventually [50Mhz], 12KB
	of cache is NOWHERE near big enough for efficiency, by itself.
		1) As the CPU gets faster, the cache miss cost goes up,
		and the cache miss ratio must go down enough to maintain
		a constant amount of memory-system degradation (see the
		quick arithmetic sketch below).
		2) Although 8-16K of cache on a million-transistor chip
		is certainly useful, serious cache simulations say that
		it just isn't enough for a well-balanced machine at the higher
		clock rates [50Mhz or so] that one would naturally use with
		the kind of technology that gives you a million transistors.
		3) Thus, you still end up with secondary cache being needed
		in many configurations and application environments.
So that says that when you get up to 4M transistors, maybe you get close
to having big enough caches on the chip to balance the CPU+FPU that are
there....except that now you'll want to boost the clock rate some more,
which means the caches are not as improved as you'd think [although getting
close].  Well, maybe if somebody wants to build 100MHz parts, with about
8M transistors, 128K caches, that's a sort-of balanced thing.
Maybe at the 16M-transistor point, if you still can't think of anything else to
do with more silicon [and note that the current million transistor chips
on the market or coming soon have not run out of interesting things to do
with more silicon], you put 2 CPUs on one chip, if you can figure out
a sensible cache hierarchy, and a package with few enough pins that
people can use, because, as usual, the issue is not so much in making
the CPUs run fast, it's getting the data in and out, and packaging technology
will be "interesting".

Of course, some of the numbers change if you built chips with different
mixtures.  Specifically, if you didn't care about FP, you could omit the
FPU, which is inherently a big space hog.  If you didn't need an MMU, 
that would save space also.  However, I think this only moves the potential
switch point from 1 CPU to 2 around a little.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

frazier@oahu.cs.ucla.edu (Greg Frazier) (11/10/89)

In article <31097@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <28942@shemp.CS.UCLA.EDU> frazier@oahu.UUCP (Greg Frazier) writes:
>...
[me proposing multiple CPU chips]
>I don't think we're anywhere near this yet, and this can be seen by
>analyzing the layout and nature of million-transistor chips [like i860s].
>If you look at the i860 die, you find that:
>	a) Most of the transistors are in the caches.
>	b) Most of the space is the FPU, registers, integer datapath, etc.
>	Some of this stuff is wires, and it doesn't shrink as well as
>	transistors do.
>	c) At the top speed claimed for it, eventually [50Mhz], 12KB
>	of cache is NOWHERE near big enough for efficiency, by itself.
[ further discussion of how the $ is still too small ]

Yeah, I've been wondering why people are bothering with on-chip
caches.  Admittedly, a 2-chip CPU would probably be significantly
more expensive than a single-chip CPU, but I think the product
would be more flexible, and achieve better performance, if, instead
of putting the $ on the CPU chip, the $ was provided on a companion
chip.  This should only increase the latency to the cache by a
single clock cycle, while significantly boosting the hit ratio.
Particularly if this were a data $ (leave the instruction $ on
chip - it belongs there).  With a separate cache chip, one could
a) put some intelligence on it and/or b) make it expandable, such
that a user could choose to provide two or three $ chips.  Of
course, this is not a terribly new idea - but I don't know why it
isn't being used in the newer chips.  Realtime people would love
it - they would put NO $ chips on, and simply populate the board
with fast local memory (almost the same thing, but more predictable
behavior).  This scheme would provide more space on chip for... two cpu's,
or multiple FPU's, or single-cycle FPU's (that's an attractive one!),
or any other of a host of performance boosting schemes.  A disadvantage
is that the CPU chip would have to have multiple ports - a chip
with inst. $ and data $ on chip can have both $'s share a single
port to memory, since (presumably) neither is using it very often.
However, unless one wants 128 bit paths to memory (which one might),
I don't think having 2 ports is a big deal.

Quick performance analysis hack:
on-chip $, assume 85% hit ratio, 1 cycle delay on hit, 14 cycle delay
	on miss
off-chip $, assume 95% hit ratio, 2 cycle delay on hit, 14 cycle delay
	on miss (a smart $ will not incur extra miss delay)

on-chip memory speed: .85*1 + .15*14 = 2.95 cycles/reference
off-chip memory speed: .95*2 + .05*14 = 2.60 cycle/reference - a win!

With the hit ratios I have assumed, the break-even point is a memory
delay of 10 cycles - below that, the on-chip cache becomes a win.
Of course, change the hit ratios, and one changes the break-even
point, so your mileage will vary.  As a final note, if the $ chip
is closely married to the CPU chip, there is no reason why the
2 cycle delay can't be achieved, I think.
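
The same numbers in a few lines of C, which also pin the break-even point
down exactly (10.5 cycles with the hit ratios assumed above):

    #include <stdio.h>

    /* Average cycles per reference for a cache with the given hit ratio,
       hit cost and miss cost. */
    static double eff(double hit_ratio, double hit_cost, double miss_cost)
    {
        return hit_ratio * hit_cost + (1.0 - hit_ratio) * miss_cost;
    }

    int main(void)
    {
        double m = 14.0;                                 /* miss delay */
        printf("on-chip : %.2f\n", eff(0.85, 1.0, m));   /* 2.95       */
        printf("off-chip: %.2f\n", eff(0.95, 2.0, m));   /* 2.60       */
        /* Break-even: .85 + .15m = 1.90 + .05m  =>  m = 10.5 cycles.  */
        printf("break-even miss delay: %.1f cycles\n", (1.90 - 0.85) / 0.10);
        return 0;
    }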

Greg Frazier

"They thought to use and shame me but I win out by nature, because a true
freak cannot be made.  A true freak must be born." - Geek Love

Greg Frazier	frazier@CS.UCLA.EDU	!{ucbvax,rutgers}!ucla-cs!frazier

gerry@zds-ux.UUCP (Gerry Gleason) (11/10/89)

In article <31027@obiwan.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>In article <1579@crdos1.crd.ge.COM> davidsen@crdos1.UUCP (bill davidsen) writes:
>>  I don't think any major chips are being designed in 6 months, or 18.
>>At least not CPUs. I believe Intel said that the 486 design cycle was
>>started about five years ago. Feel free to correct this if you have a
>>better source than _Info World_.

>Fujitsu SPARC . . .
>Or how about the Cypress full-custom SPARC?  Both of these were designed
>and taped-out in well under 18 months.
> . . . R2000 . . .  in 9 months.

I assume you are only stating the design cycles for these RISC processors,
and not disputing his guess about the 486.  Even if it is more like three
years, that's still a big win in design cycle time for RISCs, and what about
the man-years invested in these processors?  These wins in design cycle are
probably more significant than the performance issue (not insignificant itself).

Gerry Gleason

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (11/10/89)

In article <31027@obiwan.mips.COM>, mark@mips.COM (Mark G. Johnson) writes:

|  Just to name a couple from our esteemed colleagues in the SPARC camp,
|  how about the original Fujitsu SPARC (the one in the 4/260)?  Its
|  design schedule & history was published in _High_Performance_Systems.
|  Or how about the Cypress full-custom SPARC?  Both of these were designed
|  and taped-out in well under 18 months.

  I think we're talking different things as design time here, I was
talking about the time from "let's build a CPU" to a working part.
Taking a part with known word size, register layout and instruction set
is a subset of that. A SPARC port looks to me like "how do we do it"
without the "what do we do" phase.

  Since you mentioned the R2000, can you determine the elapsed time for
the whole process?
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon

baum@Apple.COM (Allen J. Baum) (11/10/89)

[]
>In article <31031@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>ARGUMENT 1: RISC is better because it's smaller, for new technologies
>When die size is a limit, RISC is better, because you can do it,
   Agreed. However, current trends appear to indicate that new technologies
   mature reasonably quickly. This argument works for just a few (<10?) years.
>ARGUMENT 2: RISC is better for cost reasons, because it's smaller
   When you get to 1M transistors on a chip, the space for some extra decoding
   logic is negligible. This argument works only during the initial (e.g.
   'new technology') phase.
>ARGUMENT 3: RISC is better, because even if there is enough space on a die
>	to put a whole CPU plus other things, the RISC can afford more space
>	for caches and other good things, and so it will be faster.
    See argument above. What is the performance difference between a 32k cache
    and a 31k cache?
>ARGUMENT 4: RISC is better, because it's simpler, and hence there is faster
>	time to design and test the chips.
>	COMMENT: maybe, maybe not.
   Agreed. Notice as we get more and more transistors, we start to do more
   complex (superscalar, hi-perf FP, graphics) things, not just add more
   cache.  Superscalar may double your performance. It would be literally
   impossible to add enough cache to do that, so simple, regular hardware is
   probably not where the transistors will go. (for disbelievers, if your
   cache miss penalty * miss ratio <1, then even if every access hits, you'll
   save less than a cycle, so CPI approaches 1. Superscalar does better.)
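
A quick numeric check of that parenthetical, with made-up numbers:

    #include <stdio.h>

    int main(void)
    {
        double base_cpi = 1.0, miss_ratio = 0.02, penalty = 40.0;
        double cpi = base_cpi + miss_ratio * penalty;      /* 1.8            */
        double perfect = base_cpi;                         /* 1.0: all hits  */
        printf("with misses: %.2f, perfect cache: %.2f\n", cpi, perfect);
        /* A perfect cache saves 0.8 cycles per instruction here, and once
           miss_ratio*penalty < 1 the most it can ever save is under one
           cycle -- CPI only approaches 1.  A 2-issue superscalar that
           sustains even 1.5 instructions per cycle beats that. */
        return 0;
    }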

>ARGUMENT 4: But when you can get zillions** of transistors on a die, it
>	doesn't matter.
>	... More transistors will help everything, however, it may be
>		that the "limiting factor" will be not die space,
>		but COMPLEXITY in critical paths and exception-handling.
>
>This last point is illustrated by the recent i486 bugs, and also, by the
>errata list carried for a long time by the i386

Ah, yes. I think it's safe to say that our simple RISCs are going to start to
get fairly complex, as we start to play all the little hardware tricks we've
known from the supercomputer world, and some that we'll invent. Note that a
lot of errors come from exception handling kinds of problems, and while CISC
machines have them, superscalar, out-of-order execution RISC machines will
have them in spades.

I'm beginning to believe Nick Tredennick when he says that RISCs aren't better;
just newer. In a few years, I think we may find that CISCs will be back, to
some extent. The difference is that these 'CISC's will be a bit more carefully
tuned to allow the hardware techniques now being pioneered by RISCs to work
on them. Certainly, the compiler technology will permit them to be used
efficiently; as noted in an earlier posting this week, RISC vs. CISC compilers
seem to just trade off the problems of instruction scheduling for instruction
selection. Both are do-able.

To give a flavor of what I mean, I'll summarize the RT/2 papers from IBM.
Note that one paper referenced the other as having the title "An IBM post-RISC
Processor Architecture", although that wasn't the title it was given.

Superscalar w/ 1 fix, 1 float, 1 branch, & 1 cond. code op simultaneously

Branch instructions:
   branch on any bit of CC reg
   each has bit that enables storing of PC+1 into -->dedicated<-- link reg.
   -->no<-- delayed branching
   taken conditional branch: 0 to 3 cycles (depends when CC is set)
   -->dedicated<-- counter reg, for decr&branch if 0 ops, which can be
    combined with test of any bit in CC reg.

Cond Code. ops:
   Any boolean operation on any two of 32 bits in CC reg. Useful for generating
   compound Boolean expressions. Frequently used booleans can be kept in
   CC reg.

Fixed ops:
   Multiply & divide included, with dedicated MQ reg.
   Support for min, max, & abs
   Support for arbitrarily aligned byte string compare & move, both length
     specified & null terminated. Hardware dedicated byte count & comparison
     register included in state. String instructions defined to permit max. 
     theoretical bus bandwidth to be used, w/ very low overhead for short
     strings.
   Auto-incr & decr address modes
   Hardware handling for load/store of misaligned data (as long as it's within
     a cache line). Optional fault if it crosses cache lines.

Floating Point ops:
   Multiply & Add w/ only one round, takes same time as either add or mult.
   Reg. renaming

Overall:
   all interrupts/traps are precise
   Icache: 8K byte, 64 byte line,
           32 entry 2 way set assoc. TLB
   Dcache: 64K byte, 4 way set assoc, 128 byte line,
          128 entry 2 way set assoc. TLB
   -->hardware<-- table walking
   Dcache has load & store buffers (store buffers so load can be performed
      before cache writeback, load buffer so loads can proceed during filling).
   Mem system has ECC & bit steering (allows spare bit to be substituted for
      failing bit). 4-bit DRAMs scattered across ECC groups so a chip failure 
      is detectable.
   4 deep 'pending store queue' permits address translation & checking even if
   data is still being calculated.

Memory addressing:
   52 bit virtual, 32 bit physical
   upper 4 bits of 32 bit address selects one of 16 24-bit seg. regs.(24+28=52)
   Seg. regs. have an I/O bit & lock enable bits
   Lock enable turns on the hardware lock & transaction ID hardware (801 & RT
     style.)
   Hardware can use low 20 bits of virtual address for translation lookup. 
     Software must ensure that aliasing is avoided
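
A sketch of how that address split works, in C -- my reconstruction from
the numbers above, not IBM's definition:

    #include <stdint.h>

    /* 32-bit effective address -> 52-bit virtual address: the top 4 bits
       select one of 16 segment registers, whose 24-bit contents replace
       them; the low 28 bits pass through unchanged (24 + 28 = 52). */
    uint64_t virt52(uint32_t ea, const uint32_t segreg[16])
    {
        uint32_t seg = segreg[ea >> 28] & 0x00FFFFFFu;   /* 24-bit segment id */
        return ((uint64_t)seg << 28) | (ea & 0x0FFFFFFFu);
    }
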
--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

marc@oahu.cs.ucla.edu (Marc Tremblay) (11/10/89)

In article <28985@shemp.CS.UCLA.EDU> frazier@oahu.UUCP (Greg Frazier) writes:
> (Some stuff on why an on-chip cache may not be a good idea)
>Quick performance analysis hack:
>on-chip $, assume 85% hit ratio, 1 cycle delay on hit, 14 cycle delay
>	on miss
>off-chip $, assume 95% hit ratio, 2 cycle delay on hit, 14 cycle delay
>	on miss (a smart $ will not incur extra miss delay)
>
>on-chip memory speed: .85*1 + .15*14 = 2.95 cycles/reference
>off-chip memory speed: .95*2 + .05*14 = 2.60 cycle/reference - a win!
>
>With the hit ratios I have assumed, the break-even point is a memory
>delay of 10 cycles - below that, the on-chip cache becomes a win.

Regarding data caches, most implementations with a 2-cycle load delay
(on a hit) allow overlapping of instructions, so that a decent compiler
can schedule an instruction in the delay slot.
If we assume that the load delay can be filled, let's say, 50% of the time,
we obtain a break-even point of around 6 cycles for the miss delay,
which makes the on-chip cache even more questionable if *only*
this factor is considered.
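
Redoing the arithmetic above under that assumption (hit ratios as in
Greg's posting):

    #include <stdio.h>

    int main(void)
    {
        /* Off-chip hit cost when the extra load-delay cycle is filled
           half the time: 0.5*2 + 0.5*1 = 1.5 cycles. */
        double hit_off = 1.5, hit_on = 1.0;
        /* Solve .85*hit_on + .15*m = .95*hit_off + .05*m for m. */
        double m = (0.95 * hit_off - 0.85 * hit_on) / (0.15 - 0.05);
        printf("break-even miss delay: %.2f cycles\n", m);   /* 5.75 */
        return 0;
    }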

To really evaluate the impact of an on-chip cache though we have 
to look at other factors such as:

	1) With an on-chip cache it is a lot easier to implement 
	   a wide datapath (64 or 128 bits) between the cache and 
	   the register file than it is with an off-chip cache,
	   which requires lots of pins and lots of routing.
	   A wide datapath allows the use of instructions that
	   take advantage of the extra bandwidth to 
	   i) save/restore the register file quicker (for example 
	   on calls/returns) and
	   ii) load and store double precision operands in one cycle
	   (two double precision operands can be loaded with 128 bits).

	2) What applications is the processor-cache targeted for?
	   For example if the chip is used mostly for applications 
	   showing lots of relatively small loops with heavy
	   floating-point computations then an on-chip instruction
	   cache makes a lot of sense since the hit ratio will be high.

	3) Cost of flushing the cache on a context switch.
	   Cost of maintaining cache coherency in a multiprocessor
	   environment, etc...

					Marc Tremblay
					marc@CS.UCLA.EDU

mash@mips.COM (John Mashey) (11/10/89)

In article <36340@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>>In article <31031@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>ARGUMENT 1: RISC is better because it's smaller, for new technologies
>>When die size is a limit, RISC is better, because you can do it,
>   Agreed. However, current trends appear to indicate that new technologies
>   mature reasonably quickly. This argument works for just a few (<10?) years.
Agreed, although 10 years is almost an eternity, especially given the
structure and trends in the computer business, i.e., it's getting harder and
harder to introduce new architectures successfully....

>>ARGUMENT 2: RISC is better for cost reasons, because it's smaller
>   When you get to 1M transistors on a chip, the space for some extra decoding
>   logic is negligible. This argument works only during the initial (e.g.
>   'new technology') phase.
Sorry, I should have been more specific: I was assuming architectures
upward-compatible with any of the current prevalent CISCs, i.e., that
would typically use large microcode ROMS for some instructions, even
if the most frequent ones were hardwired as in 486s, etc.

>>ARGUMENT 3: RISC is better, because even if there is enough space on a die
>>	to put a whole CPU plus other things, the RISC can afford more space
>>	for caches and other good things, and so it will be faster.
>    See argument above. What is the performance difference between a 32k cache
>    and a 31k cache?
Of course, if there is that little difference, it doesn't make very
much difference, EXCEPT if, to fit the size of chip you can actually make,
it makes the difference between 16K and 32K, which can happen quite easily.

>>ARGUMENT 4: RISC is better, because it's simpler, and hence there is faster
>>	time to design and test the chips.
>>	COMMENT: maybe, maybe not.
>   Agreed. Notice as we get more and more transistors, we start to do more
>   complex (superscalar, hi-perf FP, graphics) things, not just add more
>   cache.  Superscalar may double your performance. It would be literally
>   impossible to add enough cache to do that, so simple, regular hardware is
>   probably not where the transistors will go. (for disbelievers, if your
>   cache miss penalty * miss ratio <1, then even if every access hits, you'll
>   save less than a cycle, so CPI approaches 1. Superscalar does better.)
>
>>ARGUMENT 4: But when you can get zillions** of transistors on a die, it
>>	doesn't matter.
>>	... More transistors will help everything, however, it may be
>>		that the "limiting factor" will be not die space,
>>		but COMPLEXITY in critical paths and exception-handling.
>>
>>This last point is illustrated by the recent i486 bugs, and also, by the
>>errata list carried for a long time by the i386
>
>Ah, yes. I think it's safe to say that our simple RISCs are going to start to
>get fairly complex, as we start to play all the little hardware tricks we've
>known from the supercomputer world, and some that we'll invent. Note that a
>lot of errors come from exception handling kinds of problems, and while CISC
>machines have them, superscalar, out-of-order execution RISC machines will
>have them in spades.
Yes, but superscalar, out-of-order CISCs would have them in spades & hearts :-)

The new IBM stuff looks interesting....
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

kolding@cs.washington.edu (Eric Koldinger) (11/10/89)

In article <28985@shemp.CS.UCLA.EDU> frazier@oahu.UUCP (Greg Frazier) writes:
>Yeah, I've been wondering why people are bothering with on-chip
>caches.  Admittedly, a 2-chip CPU would probably be significantly
>more expensive than a single-chip CPU, but I think the product
>would be more flexible, and achieve better performance, if, instead
>of putting the $ on the CPU chip, the $ was provided on a companion
>chip.  This should only increase the latency to the cache by a
>single clock cycle, while significantly boosting the hit ratio.
>Particularly if this were a data $ (leave the instruction $ on
>chip - it belongs there).  With a separate cache chip, one could

    [Greg goes on to explain his logic.]

    Why not provide both on-chip and off-chip cache ($)?  In the chip of the
    future, a 50-100MHz part, keeping the processor fed from an off-chip cache
    will be quite a bear.  The off-chip communication speeds are probably going
    to be substantially slower than what you can achieve on-chip, so feeding an
    instruction per cycle into the chip might prove almost impossible,
    especially if you put two CPUs on the chip (as Greg proposes somewhere in
    the article).  Instead, why not use a multi-level caching scheme?

    Put a "small" cache on-chip, keep it simple, possibly I-cache only, and
    have a larger backing cache off-chip.  This would allow you to build the
    off-chip cache as large as you like, and still keep the single cycle cache
    access most of the time, with a low (2-4 cycle) delay on most cache misses,
    and a larger delay (multiple bus cycles) when the second level cache
    misses.  You can also place cache-coherency mechanisms in the second level
    cache so that snooping will not cause the CPU to ever be locked out of the
    cache, except in the exceptional event of a cache-coherency action, in
    which case the second level cache can "interrupt" the first level cache and
    invalidate any necessary entries.  The second level cache could easily
    service two processors, allowing more versatility than the 2 CPU chip that
    Greg proposes.
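
Rough arithmetic for such a two-level arrangement, with made-up but
plausible numbers:

    #include <stdio.h>

    int main(void)
    {
        double l1_hit = 0.85, l2_hit = 0.95;    /* hit ratios (assumed)  */
        double l2_cost = 3.0, mem_cost = 14.0;  /* miss costs, in cycles */
        /* First-level hits cost one cycle; first-level misses pay the
           second-level cost, and second-level misses pay memory on top. */
        double avg = l1_hit * 1.0
                   + (1.0 - l1_hit) * (l2_cost + (1.0 - l2_hit) * mem_cost);
        printf("avg cycles/reference: %.2f\n", avg);   /* about 1.4 */
        return 0;
    }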

-- 
        _   /|                          Eric Koldinger
        \`o_O'                          University of Washington
          ( )     "Gag Ack Barf"        Department of Computer Science
           U                            kolding@cs.washington.edu

baum@Apple.COM (Allen J. Baum) (11/11/89)

[]
>In article <31149@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
> In article <blah blah> Allen Baum says:
>>Ah, yes. I think it's safe to say that our simple RISCs are going to start to
>>get fairly complex, as we start to play all the little hardware tricks we've
>>known from the supercomputer world, and some that we'll invent. Note that a
>>lot of errors come from exception handling kinds of problems, and while CISC
>>machines have them, superscalar, out-of-order execution RISC machines will
>>have them in spades.

>Yes, but superscalar, out-of-order CISCs would have them in spades&hearts :-)
>
>The new IBM stuff looks interesting....

I guess what I was trying to say is that I think it's possible to have an
architecture that we might today consider "CISC" (for any number of reasons:
Reg-Mem instructions, 2-word instructions, fancier address modes, etc.) that
would be architected (I hate verbing nouns) to permit superscalar and out-of-
order execution, unlike current CISCs that gave no thought to the issues at
all (but I love run-on sentences).

This being the case, I believe that compiler technology can take advantage
of CISCy features, and that we'll see a resurgence of CISCs. But, they won't
be compatible with the ones we are familiar with.

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

schow@bcarh61.bnr.ca (Stanley T.H. Chow) (11/11/89)

In article <31031@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>
>The die area issue has been widely misinterpreted.  Let me summarize some
>of the various arguments for RISC vs CISC and die size:
> [...]
>ARGUMENT 4: But when you can get zillions** of transistors on a die, it
>	doesn't matter.
>	COMMENT 1: We haven't gotten enough transistors to make anybody
>		happy yet ("happy" means VLSI designers wandering around
>		saying "I have so much space I don't know what to do with it")
>		and we're not likely to get enough any time real soon.
>	COMMENT 2: Of course, this all remains to be seen, but:
>	COMMENT 3: More transistors will help everything, however, it may be
>		that the "limiting factor" will be not die space,
>		but COMPLEXITY in critical paths and exception-handling.
		    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Please note that "complexity" is essentially independent of the RISC/CISC
debate. I defy anyone to call the M88k simple :-)

It is certainly true that the i86 (or any other micro) has had its share of
problems. However, I would attribute the problems to the "Venerable Architecture"
instead. It is by no means clear that CISC architecture must always be
nasty in this respect.

The "Zillion transistor" question is essentially another design trade-off:
is it better to use the transistors for bigger cache? more pipeline stages?
fatter pipelines? more execution units? ....

Of course, once the architecture is chosen, many of the trade-offs become
obvious. IMHO, CISC leaves many of the choices open, allowing more paths
to be determined later. [Structured Software people think delaying choices
is a good strategy. You may or may not agree. :-)].

>
>Do weird and occasional effects matter?  Maybe, maybe not.  Observe that
>a 6-month difference is a long time these days, and if it takes 6 months
>more to get a design really right enough that reputable companies will
>ship it, that's a noticable difference.
>

Certainly 6 months can make a difference. It is also true that even 2 years
can make little difference.

It all depends on the market and the perspective. How many RISC chips
beat the 486 to market? What difference did it make? These days, software
is KING. In the commercial market place, Compatibility is also very
important.

Even in the new designs that don't care about old software, I would think
designers (at least the design managers) should look at the long term
evolution of the architecture before settling on a particular CPU chip.

The only occasions when 6 months *really* matter is a race to open up a
new area. It's been a long time since we've seen that kind of excitement in the
CPU & memory market. Examples are the *first* one-chip CPU (the 4004/8008), 
the first DRAM, the first EEPROM, etc. When I can buy the same
functionality elsewhere, 6 months is only cost-optimization.



Stanley Chow        BitNet:  schow@BNR.CA
BNR		    UUCP:    ..!psuvax1!BNR.CA.bitnet!schow
(613) 763-2831		     ..!utgpu!bnr-vpa!bnr-rsc!schow%bcarh61
Me? Represent other people? Don't make them laugh so hard.

gerry@zds-ux.UUCP (Gerry Gleason) (11/11/89)

In article <36340@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>[]
>>In article <31031@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>ARGUMENT 4: RISC is better, because it's simpler, and hence there is faster
>>	time to design and test the chips.
>>	COMMENT: maybe, maybe not.
>   Agreed. Notice as we get more and more transistors, we start to do more
>   complex (superscalar, hi-perf FP, graphics) things, not just add more
>   cache.  Superscalar may double your performance. It would be literally
>   impossible to add enough cache to do that, so simple, regular hardware is
>   probably not where the transistors will go. (for disbelievers, if your
>   cache miss penalty * miss ratio <1, then even if every access hits, you'll
>   save less than a cycle, so CPI approaches 1. Superscalar does better.)

I think part of the problem is that not everyone means the same thing by
RISC.  As I said in an earlier posting, it's not RISC vs CISC, but simply
applying the KISS principle to every aspect of design, or maybe more to the
point, before you put in a feature you had better be sure it is a big win.
Small wins are not worth the headaches that the complexity brings.  In fact,
if you can throw out a large chunk of complexity (from the instruction set
or addressing modes for example), you will have capacity for more hot-rod
features (i.e. pipelines, and some of the things you mentioned).

Another aspect of RISC is that it is a move to more modular, less ad-hoc
architectures.  The architects that are pressing the RISC debate are
applying the same ideas that help software engineers deal with million
line programs to help hardware engineers deal with million transistor
chips.  The issues are the same.

Gerry Gleason

My cohorts here at Zenith think I need a disclaimer to say this is my
opinion and not Zenith's.  IMHO, anyone who assumes a posting to the
net represents the organization they work for is nuts.  I know it happens,
but how often is the poster's opinion even representative?  So I will just
waste net bandwidth this once with a disclaimer:

	"No complany will let me speak for them anyway, and
	 I reserve the right to change my opinion without notice"

mash@mips.COM (John Mashey) (11/11/89)

In article <9769@june.cs.washington.edu> kolding@june.cs.washington.edu.cs.washington.edu (Eric Koldinger) writes:

>    Why not provide both on-chip and off-chip cache ($).  In the chip of the
>    future, a 50-100MHz part, keeping the processor fed from an off-chip cache
>    will be quite a bear.  The off-chip communication speeds are probably going
>    to be substantially slower than what you can achieve on-chip, so feeding an
>    instruction per cycle into the chip might prove almost impossible,
....
Some relevant recent papers include:

Wen-Hann Wang, Jean-Loup Baer, Henry M. Levy, "Organization and Performance
of a Two-Level Virtual-Real Cache Hierarchy", 16th Ann. Int. Symposium
on Computer Architecture, May-June 1989, Jerusalem, Israel.
ACM SIGARCH 17, 3 (June 1989), 140-148.  [Univ. of Washington people,
using virtual first-level and real second-level caches: get fast cycle
from first level, high hit-rate from second.  Grossly similar scheme to
MIPS MC6280, for same reasons.]

Steven Przybylski, Mark Horowitz, John Hennessy, "Characteristics of
Performance-Optimal Multi-Level Cache Hierarchies", (same as above), 114-121.
A good quote from the abstract:
	"The increasing speed of new generation processors will exacerbate
the already large difference between CPU cycle times and main memory
access times.  As this difference grows, it will be increasingly difficult
to build single-level caches that are both fast enough to match these fast
cycle times and large enough to effectively hide the main memory
access times....  This change in relative importance of cycle time
and miss rate makes associativity more attractive and increases the
optimal cache size for second-level caches over what they would be
for an equivalent single-level cache system."

Note, of course, that many 68020 systems used external caches along
with the internal ones, and various superminis and mainframes have
used such things for some time.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (11/12/89)

	1) With an on-chip cache it is a lot easier to implement 
	   a wide datapath (64 or 128 bits) between the cache and 
	   the register file than it is with an off-chip cache,
	   which requires lots of pins and lots of routing.
	   A wide datapath allows the use of instructions that
	   take advantage of the extra bandwidth to 
	   i) save/restore the register file quicker (for example 
	   on calls/returns) and
	   ii) load and store double precision operands in one cycle
	   (two double precision operands can be loaded with 128 bits).

I'm a wide-datapath proponent myself, but some of my contacts have
responded that so many signals simultaneously changing state
=> huge instantaneous power demand.
    
How much of a problem is this *really*?

Can techniques such as slightly delaying groups of lines help?
Eg. provide the LSBs earlier so that carries can propagate while
waiting for the MSBs to arrive on different lines?

--
Andy "Krazy" Glew,  Motorola MCD,    	    	    aglew@urbana.mcd.mot.com
1101 E. University, Urbana, IL 61801, USA.          {uunet!,}uiucuxc!udc!aglew
   
My opinions are my own; I indicate my company only so that the reader
may account for any possible bias I may have towards our products.

cs4g6ag@maccs.dcss.mcmaster.ca (Stephen M. Dunn) (11/13/89)

In article <36340@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
$>In article <31031@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
$>ARGUMENT 2: RISC is better for cost reasons, because it's smaller
$   When you get to 1M transistors on a chip, the space for some extra decoding
$   logic is negligible. This argument works only during the initial (e.g.
$   'new technology') phase.

$>ARGUMENT 3: RISC is better, because even if there is enough space on a die
$>	to put a whole CPU plus other things, the RISC can afford more space
$>	for caches and other good things, and so it will be faster.
$    See argument above. What is the performance difference between a 32k cache
$    and a 31k cache?

   I have beside me an article stating that on the Berkeley RISC I and II chips,
between 6 and 10 percent of the chip area was used for decoding and control
logic.  Compare this with a figure of 50-60% on a 68000 or Z8000 ... now, I
must admit that I've never designed an on-chip cache for a RISC chip, but
somehow I think it would be quite easy to get more than a 1k cache out of
40-54% of the chip size.  After all, if each 1k of cache size takes even 40%
of the chip's area, how would you get a 32k (or, for that matter, 31k) cache
on a chip in the first place?
-- 
Stephen M. Dunn                               cs4g6ag@maccs.dcss.mcmaster.ca
          <std_disclaimer.h> = "\nI'm only an undergraduate!!!\n";
****************************************************************************
They say the best in life is free // but if you don't pay then you don't eat

baum@Apple.COM (Allen J. Baum) (11/14/89)

[]
>In article <255E15BB.12770@maccs.dcss.mcmaster.ca> cs4g6ag@maccs.dcss.mcmaster.ca (Stephen M. Dunn) writes:
>In article <36340@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>$ What is the performance difference between a 32k cache  and a 31k cache?
>
>I have beside me an article stating that on the Berkeley RISC I and II chips,
>between 6 and 10 percent of the chip area was used for decoding and control
>logic.  Compare this with a figure of 50-60% on a 68000 or Z8000 ... now, I
>must admit that I've never designed an on-chip cache for a RISC chip, but
>somehow I think it would be quite easy to get more than a 1k cache out of
>40-54% of the chip size.  After all, if each 1k of cache size takes even 40%
>of the chip's area, how would you get a 32k (or, for that matter, 31k) cache
>on a chip in the first place?

It is not clear that the percentage of transistors used by control logic
is constant as the total number of transistors increases. That is, 50-60% of
the 68,000 transistors in a 68000 may have been control, but it's a lot less
(percentage-wise) of the 250k (?) transistors of a 68030. By the time you 
start throwing large caches on board, the control percentage may be negligible,
even for a CISC. It was my contention that when you reach this point, the
cost of CISC control hardware is negligible, and the supposed superiority of
RISC (in this "area" {heh, heh} only!! :-)  is in serious question.


--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

toms@omews44.intel.com (Tom Shott) (11/16/89)

To throw some more gas on this fire.

As we top 50 MHz for chip speed, the biggest problem becomes getting data on
and off chip. You start needing delay cycles between reads and writes on an
I/O pin to turn the bus around, or a chip w/ lots of pins to get data in and
out faster. Putting a cache on the die has been beat into the ground.

One solution that has not been discussed is flip chip technology. In flip
chip technology many die are mounted directly on a ceramic carrier. (This
is used in IBM mainframes). The result is lower interconnect capacitance
(smaller feature size for lines, no pins) and the ability to match an exact
SRAM to an exact CPU. IOs (not pins) are cheap w/ flip chip technology. You
get higher bandwidth to the cache (wider and faster lines) and the ability
to optimize your process for digital on one die and RAM cells on the other.

Other ways to deal w/ the interconnect speed problem are architectural.
Delayed loads have been spoken about. It just takes more smarts in the
compiler to use those slots. A longer pipeline w/ the load at the start
will also hide off chip delays.

A novel architecture from the Computer Systems Group at UIUC published by
Dave Archer, et al. used multiple tasks running on one CPU to hide delays.
For example w/ a 4 stage pipeline, the CPU chip would run four tasks at
once. I don't remember the details but it worked out that each task
executed at 1/4 of full speed. (I think dummy pipeline stages were used
between the stages). But during that delay time memory fetch latency was
hidden. (Also data dependencies). Realistically I might expect this technique
only to be used for large systems aimed at multiuser applications. You need
four tasks always ready to run.
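
A toy model of the idea (my own sketch, not from the Archer paper; the
numbers are arbitrary).  With T tasks issuing round-robin and every
instruction pretending to be a load with a T-cycle latency, the pipeline
never stalls even though each task runs at 1/T of full speed:

    #include <stdio.h>

    int main(void)
    {
        enum { T = 4, LAT = 4, N = 1000 };    /* tasks, latency, insts/task */
        int ready_at[T] = {0}, done[T] = {0};
        int cycle = 0, issued = 0, stalls = 0;

        while (issued < T * N) {
            int t = cycle % T;                /* barrel: fixed round-robin slot */
            if (done[t] < N && ready_at[t] <= cycle) {
                done[t]++;
                issued++;
                ready_at[t] = cycle + LAT;    /* result not needed until then */
            } else {
                stalls++;                     /* slot wasted */
            }
            cycle++;
        }
        printf("cycles=%d issued=%d stalls=%d\n", cycle, issued, stalls);
        return 0;
    }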


--
-----------------------------------------------------------------------------
Tom Shott    INTeL, 2111 NE 25th Ave., Hillsboro, OR 97123, (503) 696-4520
	     toms@omews44.intel.com OR toms%omews44.intel.com@csnet.relay.com
	INTeL.. Designers of the 960 Superscalar uP and other uP's

lee@iris.davis.EDU (Peng Lee) (11/17/89)

In article <TOMS.89Nov15114003@omews44.intel.com>,
toms@omews44.intel.com (Tom Shott) writes:
> 
> A novel architecture from the Computer Systems Group at UIUC published by
> Dave Archer, et al. used multiple tasks running on one CPU to hide delays.
> For example w/ a 4 stage pipeline, the CPU chip would run four tasks at
> once. I don't remember the details but it worked out that each task
> executed at 1/4 of full speed. (I think dummy pipeline stages were used
> between the stages). But during that delay time memory fetch latency was
> hidden. (Also data dependencies). Realistically I might expect this technique
> only to be used for large systems aimed at multiuser applications. You need
> four tasks always ready to run.
> 

      I have been looking at the implementation of this kind of design.
By implementing a task-controller unit that swaps to a new register
set when there is a data dependency, a branch, or a load (delay), you
don't need 4 tasks to fully utilize the execution pipe.   At most two tasks
should be enough to fill the pipe.   And if standard RISC optimizing
techniques (delayed branches, etc.) are applied to these tasks, the single-task
execution time shouldn't be slower than on a regular RISC when there is
only one task.   This design has the advantage of guaranteeing full utilization
of the execution pipeline when there are two tasks.   One very interesting
aspect I am currently looking at is the possibility of implementing
semaphores in these register sets.

> 
> --
> -----------------------------------------------------------------------------
> Tom Shott    INTeL, 2111 NE 25th Ave., Hillsboro, OR 97123, (503) 696-4520
> 	     toms@omews44.intel.com OR toms%omews44.intel.com@csnet.relay.com
> 	INTeL.. Designers of the 960 Superscalar uP and other uP's

-Peng (lee@iris.ucdavis.edu)

ingoldsb@ctycal.UUCP (Terry Ingoldsby) (11/17/89)

In article <15126@haddock.ima.isc.com>, news@haddock.ima.isc.com (overhead) writes:
> In article <503@ctycal.UUCP> ingoldsb@ctycal.UUCP (Terry Ingoldsby) writes:
> >It also seems to me that the current practice of throwing away computers
> >every 3 years or so cannot continue forever; it is evidence of an
> >immature technology that experiences quantum performance leaps every 3
> >years  (that's what makes it fun :^).  Eventually, performance improvements
> >may taper off a bit, so the advantage of being able to design a chip in
> >6 months instead of 18 will not be as great as it currently is.
> 
> I used to think these things.  I even worked for a company that
> had a 25 year old CDC 6600 in production.  The problem is this:
> the maintenance on the old machines is higher than the cost of
> new machines.  It is still true, only less so.  Why keep the
> current machine on contract when you could just let the machine
> break in six months (with some risk), and use the contract money
> to buy a newer (and maybe) faster machine?
> 

I *know* that it is cheaper to throw away old technology and bring in
the latest and greatest.  I have a three year old VAX that I would
*love* to get rid of, if I could find anyone foolish enough to buy
it.  And you're right, the maintenance costs are a major portion of the
justification.  So is increased productivity and functionality on the
part of the users.

> We keep hearing how one physical limit or another will soon be
> reached.  I make this prediction: computers will continue getting
> faster by the same exponential growth for the next ten years.  Some
> of the speed will come from raw hardware.  Much of it will come
> from more parallelism.  RAM density will continue to increase.
> Disk speeds and densities will get better.  Backup media will get

My point exactly, exponential growth for another 10 years.  Maybe 15.
At some point, *all* curves flatten.  Besides the economic reasons
for getting rid of old technology, there are economic penalties for
constantly throwing away working systems.  Aside from the obvious
hardware waste (I've heard of working mainframes dismantled with
hammers and thrown in garbage trucks) there are very substantial
software porting costs.  My point is that there will come a time
when economics will dictate that machines not be thrown out every
3 years.  Whether this will be in 5, 10 or 20 years I'm not sure.
The technology curve will then remain flat until a quantum break-
through takes place (analogous to the vacuum tube -> semiconductor
revolution).  I'm placing my bets on optical gates.  When that
breakthrough takes place, the whole thing will take off again and
the added power will allow previously undreamable applications to
be written.

-- 
  Terry Ingoldsby                       ctycal!ingoldsb@calgary.UUCP
  Land Information Systems                           or
  The City of Calgary         ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb

ingoldsb@ctycal.UUCP (Terry Ingoldsby) (11/17/89)

In article <2@zds-ux.UUCP>, gerry@zds-ux.UUCP (Gerry Gleason) writes:

> But really, I'm beginning to think that the simplicity and speed of design
> and testing is the really big win with RISC.  Also, product quality goes up
> because there are fewer ways for things to go wrong, fewer tests to write and
> run, and fewer places to make errors in design, layout, etc.  Note that Intel

Exactly.  As long as getting to market 6 months faster than the other guy is
the critical factor in marketing a design, then RISC will win.  Once the curve
flattens, then timing will be far less important.  Consider the early automobile
industry.  In the beginning, radical changes were made to engines, suspensions,
and drivetrains every few years.  Compare a 1960 car to a 1980 car, and the
improvements have become much more evolutionary than revolutionary.  Does a car
manufacturer really have a significant advantage over his competitor because
he introduces a new feature a year before the competition?  Probably not.  Why?
Because the performance gain is a few percent, not orders of magnitude. 

Anyway, I'm drifting further from comp.arch with each posting.  Also, I'm
not sure I believe my own arguments, if for no other reason than that I can't
picture a stagnated computer industry.


-- 
  Terry Ingoldsby                       ctycal!ingoldsb@calgary.UUCP
  Land Information Systems                           or
  The City of Calgary         ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb

baum@Apple.COM (Allen J. Baum) (11/17/89)

[]
>In article <5952@ucdavis.ucdavis.edu> lee@iris.davis.EDU (Peng Lee) writes:
>In article <TOMS.89Nov15114003@omews44.intel.com>,
>toms@omews44.intel.com (Tom Shott) writes:
>> 
>> A novel architecture from the Computer Systems Group at UIUC published by
>> Dave Archer, et al. used multiple tasks running on one CPU to hide delays.
>> For example, with a 4-stage pipeline, the CPU chip would run four tasks at
>> once.
>One very interesting aspect I am currently looking at is the possibility of
>implementing a semaphore in these register sets.

Not so novel; this scheme was used in the Denelcor HEP, and by Stellar in
their new machine. The Stellar had some kind of semaphore feature so that the
parallel running tasks could communicate with each other. The HEP had a
semaphore bit per memory location!
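
A minimal C sketch of that full/empty idea (mine, not from the HEP or Stellar
manuals): each word carries a tag bit, a reader blocks until the word is full
and then empties it, and a writer blocks until it is empty and then fills it.
The hardware did this per memory location; the busy-wait loops and the
single-producer/single-consumer restriction here are just to keep the
illustration small.

	#include <stdatomic.h>

	typedef struct {
	    atomic_int full;   /* 0 = empty, 1 = full -- the "tag bit"        */
	    int        value;  /* the word the tag bit guards                 */
	} tagged_word;         /* start it out empty: tagged_word w = { 0 };  */

	/* Producer: wait for empty, deposit a value, mark full. */
	void fe_write(tagged_word *w, int v)
	{
	    while (atomic_load(&w->full))   /* spin until consumed */
	        ;
	    w->value = v;
	    atomic_store(&w->full, 1);
	}

	/* Consumer: wait for full, take the value, mark empty. */
	int fe_read(tagged_word *w)
	{
	    while (!atomic_load(&w->full))  /* spin until produced */
	        ;
	    int v = w->value;
	    atomic_store(&w->full, 0);
	    return v;
	}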


--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

mac@ra.cs.Virginia.EDU (Alex Colvin) (11/17/89)

In article <36564@apple.Apple.COM>, baum@Apple.COM (Allen J. Baum) writes:
> Not so novel; this scheme was used in the Denelcor HEP, and by Stellar in
> their new machine. The Stellar had some kind of semaphore feature so that the
> parallel running tasks could communicate with each other. The HEP had a
> semaphore bit per memory location!

wasn't this called a barrel processor?  I believe this was discussed here a
year or two ago.

casey@gauss.llnl.gov (Casey Leedom) (11/21/89)

| From: ingoldsb@ctycal.UUCP (Terry Ingoldsby)
| 
| Aside from the obvious hardware waste (I've heard of working mainframes
| dismantled with hammers and thrown in garbage trucks) there are very
| substantial software porting costs.

  Nothing against the rest of your argument, but I think you
overestimate the cost of porting software in the contemporary world.
Companies are tending to write less and less of their software in a
machine-dependent manner (excepting companies intent on destroying
themselves).  Machine-dependent code is still being written, but it's
being isolated from the mainline code behind well-defined abstractions.
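
The usual shape of that isolation, sketched in C (names are mine, purely
illustrative): the mainline code calls a portable interface, and only one
small, replaceable file knows anything about the machine.

	/* portable.h -- all the mainline code ever sees */
	#include <stdint.h>
	uint32_t load_be32(const unsigned char *p);  /* read a big-endian 32-bit word */

	/* portable_generic.c -- a machine-independent fallback; a port may
	   drop in a faster machine-dependent version behind the same interface */
	#include <stdint.h>
	uint32_t load_be32(const unsigned char *p)
	{
	    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
	         | ((uint32_t)p[2] <<  8) |  (uint32_t)p[3];
	}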

Casey

rcd@ico.isc.com (Dick Dunn) (11/21/89)

ingoldsb@ctycal.UUCP (Terry Ingoldsby) writes:
> gerry@zds-ux.UUCP (Gerry Gleason) writes:
> > But really, I'm beginning to think that the simplicity and speed of design
> > and testing is the really big win with RISC...
[other advantages to simplicity...]
> Exactly.  As long as getting to market 6 months faster than the other guy is
> the critical factor in marketing a design, then RISC will win.  Once the curve
> flattens, then timing will be far less important...

Depends on (a) how and (b) whether the curve flattens.  Even if we get off
the current roughly exponential curve (e.g., the MIPS "double per year"
goal) and drop way back to a linear growth, will that really let CISC come
close enough to say it's "caught up"?  Another way to look at it is to ask
whether, once RISC has held the lead for a while, there's any reason to go
back to CISC?  What does CISC have to offer, if performance somehow manages
to become a secondary consideration (which I have a hard time imagining)?
Will it become, as one might extrapolate from Terry's analogy to cars, a
competition over tailfins and chrome?  (This is less sarcastic and more
serious than it might seem at first.)
-- 
Dick Dunn     rcd@ico.isc.com    uucp: {ncar,nbires}!ico!rcd     (303)449-2870
   ...`Just say no' to mindless dogma.

peter@ficc.uu.net (Peter da Silva) (11/22/89)

In article <508@ctycal.UUCP> ingoldsb@ctycal.UUCP (Terry Ingoldsby) writes:
> Does a car
> manufacturer really have a significant advantage over his competitor because
> he introduces a new feature a year before the competition?

Depends on the feature and how well they can sell it. More efficient engines,
no. More powerful engines, no. Fuel injection, turbochargers, etc.?  No.

But the early anti-lock braking systems did rather well, as did Oldsmobile's
marketing of what's really a fairly ordinary engine: the Quad-4.

And then there's the Ford Taurus effect.

-- 
`-_-' Peter da Silva <peter@ficc.uu.net> <peter@sugar.lonestar.org>.
 'U`  --------------  +1 713 274 5180.
"I agree 0bNNNNN would have been nice, however, and I sure wish X3J11 had taken
 time off from rabbinical hairsplitting to add it." -- Tom Neff <tneff@bfmny0>

lm@slovax.Eng.Sun.COM (Larry McVoy) (04/08/91)

In article <1991Apr7.064855.25469@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>Or for a *really* straight comparison, due to John Mashey I think, compare
>the i860 to the i486:  same tools, same process, same chip size, roughly
>the same release time... and the RISC machine is faster, much faster, in
>every way.

I respect Henry, despite his annoying signatures, but I can't agree
with this comparison.  The i860 is, to the best of my knowledge, a
clean start.  The i486 is carrying baggage from all the way back to
8080's.  (I personally think the i486 is a cool chip, if you look
closely, it is quite RISC like in the most common instruction uses.  
First the 386, then the 486.  Hmm.  If Intel keeps this up, they might
make a decent CPU one day :-)

Anyway, it is not a fair comparison.  Not by a long stretch.  Let's see
how the Nth generation SPARC, MIPS, and 88K's do (assuming they last)
compared to some new design from scratch.
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

mash@mips.com (John Mashey) (04/18/91)

WARNING: you may want to print this one to read it...

In article <537@appserv.Eng.Sun.COM> lm@slovax.Eng.Sun.COM (Larry McVoy) writes:
>In article <1991Apr7.064855.25469@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>>Or for a *really* straight comparison, due to John Mashey I think, compare
>>the i860 to the i486:  same tools, same process, same chip size, roughly
>>the same release time... and the RISC machine is faster, much faster, in
>>every way.
>
>I respect Henry, despite his annoying signatures, but I can't agree
>with this comparison.  The i860 is, to the best of my knowledge, a
>clean start.  The i486 is carrying baggage from all the way back to
>the 8080.  (I personally think the i486 is a cool chip; if you look
>closely, it is quite RISC-like in the most common instruction uses.
>First the 386, then the 486.  Hmm.  If Intel keeps this up, they might
>make a decent CPU one day :-)
>
>Anyway, it is not a fair comparison.  Not by a long stretch.  Let's see
>how the Nth generation SPARC, MIPS, and 88K's do (assuming they last)
>compared to some new design from scratch.

Well, there is baggage and there is BAGGAGE.
One must be careful to distinguish between ARCHITECTURE and IMPLEMENTATION:
	a) Architectures persist longer than implementations, especially
	user-level Instruction-Set Architecture.
	b) The first member of an architecture family is usually designed
	with the current implementation constraints in mind, and if you're
	lucky, software people had some input.
	c) If you're really lucky, you anticipate 5-10 years of technology
	trends, and that modifies your idea of the ISA you commit to.
	d) It's pretty hard to delete anything from an ISA, except where:
		1) You can find that NO ONE uses a feature
			(the 68020->68030 deletions mentioned by someone
			else).
		2) You believe that you can trap and emulate the feature
		"fast enough".
			e.g., MicroVAX support for decimal ops,
			68040 support for transcendentals
			(a small sketch of trap-and-emulate follows below).
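
A rough user-level analogy in C (mine -- not the MicroVAX or 68040 mechanism,
which lives in the OS and microcode): the unimplemented opcode raises an
illegal-instruction trap, a handler decodes it, does the work in software,
and resumes.  The decode-and-advance-PC steps are machine-specific, so they
appear only as comments.

	#include <signal.h>
	#include <string.h>

	/* Handler for the illegal-instruction trap. */
	static void sigill_handler(int sig, siginfo_t *si, void *uctx)
	{
	    (void)sig; (void)uctx;
	    /* si->si_addr points at the faulting instruction.  A real
	     * emulator would: 1) read the opcode bytes at si->si_addr,
	     * 2) perform the operation on the register state saved in the
	     * ucontext, 3) bump the saved PC past the instruction.        */
	    (void)si;
	}

	int main(void)
	{
	    struct sigaction sa;
	    memset(&sa, 0, sizeof sa);
	    sa.sa_sigaction = sigill_handler;
	    sa.sa_flags     = SA_SIGINFO;
	    sigemptyset(&sa.sa_mask);
	    sigaction(SIGILL, &sa, 0);
	    /* ...run code that may contain the emulated instruction... */
	    return 0;
	}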
Now, one might claim that the i486 and 68040 are RISC implementations
of CISC architectures .... and I think there is some truth to this,
but I also think that it can confuse things badly:
Anyone who has studied the history of computer design knows that
high-performance designs have used many of the same techniques for years,
for all of the natural reasons, that is:
	a) They use as much pipelining as they can; in some cases, if this
	means a high gate count, then so be it.
	b) They use caches (separate I & D if convenient).
	c) They use hardware rather than micro-code for the simpler
	operations.
(For instance, look at the evolution of the S/360 products.
Recall that the 360/85 used caches, back around 1969, and within a few
years, so did any mainframe or supermini.)

So, what difference is there among machines if similar implementation
ideas are used?
A: there is a very specific set of characteristics shared by most
machines labeled RISCs, most of which are not shared by most CISCs.
The RISC characteristics:
	a) Are aimed at more performance from current compiler technology
	(i.e., enough registers).
OR
	b) Are aimed at fast pipelining
		in a virtual-memory environment
		with the ability to still survive exceptions
		without inextricably increasing the number of gate delays
			(notice that I say gate delays, NOT just how many
			gates).
Even though various RISCs have made various decisions, most of them have
been very careful to omit those things that CPU designers have found
difficult and/or expensive to implement, and especially, things that
are painful, for relatively little gain.

I would claim, that even as RISCs evolve, they may have certain baggage
that they'd wish weren't there .... but not very much.
In particular, there are a bunch of objective characteristics shared by
RISC ARCHITECTURES that clearly distinguish them from CISC architectures.

I'll give a few examples, followed by the detailed analysis from
an upcoming talk:

MOST RISCs:
	3a) Have 1 size of instruction in an instruction stream
	3b) And that size is 4 bytes
	3c) Have a handful (1-4) of addressing modes (it is VERY
	hard to count these things; will discuss later).
	3d) Have NO indirect addressing in any form (i.e., where you need
	one memory access to get the address of another operand in memory)
	4a) Have NO operations that combine load/store with arithmetic,
	i.e., like add from memory, or add to memory.
	4b) Have no more than 1 memory-addressed operand per instruction
	5a) Do NOT support arbitrary alignment of data for loads/stores
	5b) Use an MMU for a data address no more than once per instruction
	6a) Have >=5 bits per integer register specifier
	6b) Have >= 4 bits per FP register specifier
These rules provide a rather distinct dividing line among architectures,
and I think there are rather strong technical reasons for this, such
that there is one more interesting attribute: almost every architecture
whose first instance appeared on the market from 1986 onward obeys the
rules above .....
	Note that I didn't say anything about counting the number of
	instructions....
So, here's a table:
Age: number of years since first implementation sold in this family
(or first thing of which this is binary compatible with)
3a: # instruction sizes
3b: maximum instruction size in bytes
3c: number of distinct addressing modes for accessing data (not jumps).
I didn't count register or
literal, but only ones that referenced memory, and I counted different
formats with different offset sizes separately.  This was hard work...
Also, even when a machine had different modes for register-relative and
PC-relative addressing, I counted them only once.
3d: indirect addressing: 0: no, 1: yes
4a: load/store combined with arithmetic: 0: no, 1:yes
4b: maximum number of memory operands
5a: unaligned addressing of memory references allowed in load/store,
	without specific instructions
	0: no never (MIPS, SPARC, etc)
	1: sometimes (as in RS/6000)
	2: just about any time
5b: maximum number of MMU uses for data operands in an instruction
6a: number of bits for integer register specifier
6b: number of bits for 64-bit or more FP register specifier,
	distinct from integer registers

Note that all of these are ARCHITECTURE issues, and it is usually quite
difficult to either delete a feature (3a-5b) or increase the number
of real registers (6a-6b) given an initial instruction set design.
(yes, register renaming can help, but...)

Now: items 3a, 3b, and 3c are an indication of the decode complexity
	3d-5b hint at the ease or difficulty of pipelining, especially
	in the presence of virtual-memory requirements, and need to go
	fast while still taking exceptions sanely
	items 6a and 6b are more related to the ability to take good advantage
	of current compilers.
	There are some other attributes that can be useful, but I couldn't
	imagine how to create metrics for them without being very subjective;
	for example "degree of sequential decode", "number of writebacks
	that you might want to do in the middle of an instruction, but can't,
	because you have to wait to make sure you see all of the instruction
	before committing any state, because the last part might cause a
	page fault,"  or "irregularity/assymetricness of register use",
	or "irregularity/complexity of instruction formats".  I'd love to
	use those, but just don't know how to measure them.
	Also, I'd be happy to hear corrections for some of these.

So, here's a table of 12 implementations of various architectures,
one per architecture, with the attributes above.  Just for fun, I'm
going to leave the architectures coded at first, although I'll identify
them later.  I'm going to draw a line between H1 and L4 (obviously,
the RISC-CISC Line), and also, at the head of each column, I'm going
to put a rule, which, in that column, most of the RISCs obey.
Any RISC that does not obey it is marked with a +; any CISC that DOES
obey it is marked with a *.  So...

CPU	Age	3a 3b 3c 3d	4a 4b 5a 5b	6a 6b	# ODD
RULE	<6	=1 =4 <5 =0	=0 =1 <2 =1	>4 >3
-------------------------------------------------------------------------
A1	4	 1  4  1  0	 0  1  0  1	 8  3+	1
B1	5	 1  4  1  0	 0  1  0  1	 5  4	-
C1	2	 1  4  2  0	 0  1  0  1	 5  4	-
D1	2	 1  4  3  0	 0  1  0  1	 5  0+	1
E1	5	 1  4 10+ 0	 0  1  0  1	 5  4	1
F1	5	 2+ 4  1  0	 0  1  0  1	 4+ 3+	3
G1	1	 1  4  4  0	 0  1  1  1	 5  5   -
H1	2	 1  4  4  0	 0  1  0  1	 5  4	-	RISC
---------------------------------------------------------------
L4	26	 4  8  2* 0*	 1  2  2  4	 4  2	2	CISC
M2	12	12 12 15  0*	 1  2  2  4	 3  3	1
N1	10	21 21 23  1	 1  2  2  4	 3  3	-
O3	11	11 22 44  1	 1  2  2  8	 4  3	-
P3	13	56 56 22  1	 1  6  2 24	 4  0	-

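The RULE row above is mechanical enough to check by program.  A toy
re-encoding in C (mine, not part of the posting), with two rows copied
from the table as examples:

	#include <stdio.h>

	struct row {                /* one line of the table above               */
	    const char *name;
	    int a3, b3, c3, d3;     /* 3a-3d: sizes, max size, modes, indirect   */
	    int a4, b4;             /* 4a-4b: load-op combine, memory operands   */
	    int a5, b5;             /* 5a-5b: unaligned access, MMU uses         */
	    int a6, b6;             /* 6a-6b: int and FP register-specifier bits */
	};

	/* RULE:  =1 =4 <5 =0   =0 =1 <2 =1   >4 >3 */
	static int rules_met(const struct row *r)
	{
	    return (r->a3 == 1) + (r->b3 == 4) + (r->c3 < 5) + (r->d3 == 0)
	         + (r->a4 == 0) + (r->b4 == 1) + (r->a5 < 2) + (r->b5 == 1)
	         + (r->a6 > 4)  + (r->b6 > 3);
	}

	int main(void)
	{
	    struct row b1 = { "B1 (R2000)",  1,  4,  1, 0,  0, 1, 0,  1,  5, 4 };
	    struct row p3 = { "P3 (VAX)",   56, 56, 22, 1,  1, 6, 2, 24,  4, 0 };
	    printf("%s satisfies %d of 10 rules\n", b1.name, rules_met(&b1));
	    printf("%s satisfies %d of 10 rules\n", p3.name, rules_met(&p3));
	    return 0;               /* prints 10 for the RISC, 0 for the VAX */
	}
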
An interesting exercise is to analyze the ODD cases.
First, observe that of 12 architectures, in only 2 cases does an
architecture have an attribute that puts it on the wrong side of the line.
Of the RISCs:
-A1 is slightly unusual in having more integer registers, and less FP
than usual.
-D1 is unusual in sharing integer and FP registers (that's what D1's 6b == 0 means).
-E1 seems odd in having a large number of address modes.  I think most of this
is an artifact of the way that I counted, as this architecture really only
has a fundamentally small number of ways to create addresses, but has several
different-sized offsets and combinations, but all within 1 4-byte instruction;
I believe that its addressing mechanisms are fundamentally MUCH simpler
than, for example, M2, or especially N1, O3, or P3, but the specific number
doesn't capture it very well.
-F1 .... is not sold any more.
-H1 one might argue that this processor has 2 sizes of instructions,
but I'd observe that at any point in the instruction stream, the instructions
are either 4-bytes long, or 8-bytes long, with the setting done by a mode bit,
i.e., not dynamically encoded in every instruction.

Of the processors called CISCs:
-L4 happens to be one in which you can tell the length of the instruction
from the first few bits, has a fairly regular instruction decode,
has relatively few addressing modes, no indirect addressing.
In fact, a big subset of its instructions are actually fairly RISC-like,
although another subset is very CISCy.
-M2 has a myriad of instruction formats, but fortunately avoided
indirect addressing, and actually, MOST of its instructions only have 1
address, except for a small set of string operations with 2.
I.e., in this case, the decode complexity may be high, but most instructions
cannot turn into multiple-memory-address-with-side-effects things.
-N1,O3, and P3 are actually fairly clean, orthogonal architectures, in
which most operations can consistently have operands in either memory or
registers, and there are relatively few weirdnesses of special-cased uses
of registers.  Unfortunately, they also have indirect addressing,
instruction formats whose very orthogonality almost guarantees sequential
decoding, where it's hard to even know how long an instruction is until
you parse each piece, and that may have side-effects where you'd like to
do a register write-back early, but either:
	must wait until you see all of the instruction until you commit state
or
	must have "undo" shadow-registers
or
	must use instruction-continuation with fairly tricky exception
	handling to restore the state of the machine
It is also interesting to note that the original member of the family to
which O3 belongs was rather simpler in some of the critical areas,
with only 5 instruction sizes, of maximum size 10 bytes, and no indirect
addressing, and requiring alignment (i.e., it was a much more RISC-like
design, and it would be a fascinating speculation to know if that
extra complexity was useful in practice).
Now, here's the table again, with the labels:

CPU	Age	3a 3b 3c 3d	4a 4b 5a 5b	6a 6b	# ODD
RULE	<6	=1 =4 <5 =0	=0 =1 <2 =1	>4 >3
-------------------------------------------------------------------------
A1	4	 1  4  1  0	 0  1  0  1	 8  3+	1	AMD 29K
B1	5	 1  4  1  0	 0  1  0  1	 5  4	-	R2000
C1	2	 1  4  2  0	 0  1  0  1	 5  4	-	SPARC
D1	2	 1  4  3  0	 0  1  0  1	 5  0+	1	MC88000
E1	5	 1  4 10+ 0	 0  1  0  1	 5  4	1	HP PA
F1	5	 2+ 4  1  0	 0  1  0  1	 4+ 3+	3	IBM RT/PC
G1	1	 1  4  4  0	 0  1  1  1	 5  5   -	IBM RS/6000
H1	2	 1  4  4  0	 0  1  0  1	 5  4	-	Intel i860
---------------------------------------------------------------
L4	26	 4  8  2* 0*	 1  2  2  4	 4  2	2	IBM 3090
M2	12	12 12 15  0*	 1  2  2  4	 3  3	1	Intel i486
N1	10	21 21 23  1	 1  2  2  4	 3  3	-	NSC 32016
O3	11	11 22 44  1	 1  2  2  8	 4  3	-	MC 68040
P3	13	56 56 22  1	 1  6  2 24	 4  0	-	VAX

General comment: this may sound weird, but in the long term, it might
be easier to deal with a really complicated bunch of instruction
formats, than with a complex set of addressing modes, because at least
the former is more amenable to pre-decoding into a cache of
decoded instructions that can be pipelined reasonably, whereas the pipeline
on the latter can get very tricky (examples to follow).  This can lead to
the funny effect that a relatively "clean", orthogonal architecture may actually
be harder to make run fast than one that is less clean.  Obviously, every
weirdness has its penalties....  But consider the fundamental difficulty
of pipelining something like (on a VAX):
	ADDL	@(R1)+,@(R1)+,@(R2)+

(I.e., something that might theoretically arise from:
	int **r1, **r2;
	**r2++ = **r1++ + **r1++;
and which a RISC machine would do (most straightforwardly) as:
	lw	r3,0(r1)	*r1
	add	r1,4		r1++
	lw	r4,0(r1)	*r1 again
	add	r1,4		r1++
	lw	r5,0(r2)	r5 = *r2
	add	r6,r3,r4	sum in r6
	sw	r6,0(r5)	**r2 = sum
	add	r2,4		r2++
(Now, some RISCs might use auto-increment to get rid of, for example,
the last add; in any case, smart compilers are quite likely to generate
something more like:
	lw	r3,0(r1)	*r1
	lw	r4,4(r1)	*r1 again
	add	r1,8		r1 += 2
	lw	r5,0(r2)	r5 = *r2
	add	r6,r3,r4	sum in r6
	sw	r6,0(r5)	**r2 = sum
	add	r2,4		r2++
which has no stalls anywhere on most RISCs.)
Now, consider what the VAX has to do:
1) Decode the opcode (ADD)
2) Fetch first operand specifier from I-stream and work on it.
	a) Compute the memory address from (r1)
		If aligned
			run through MMU
				if MMU miss, fixup
			access cache
				if cache miss, do write-back/refill
		Elseif unaligned
			run through MMU for first part of data
				if MMU miss, fixup
			access cache for that part of data
				if cache miss, do write-back/refill
			run through MMU for second part of data
				if MMU miss, fixup
			access cache for second part of data
				if cache miss, do write-back/refill
		Now, in either case, we now have a longword that has the
		address of the actual data.
	b) Increment r1  [well, this is where you'd LIKE to do it, or
	in parallel with step 2a).]  However, see later why not...
	c) Now, fetch the actual data from memory, using the address just
	obtained, doing everything in step 2a) again, yielding the
actual data, which we need to stick in a temporary buffer, since it
	doesn't actually go in a register.
3) Now, decode the second operand specifier, which goes thru everything
that we did in step 2, only again, and leaves the results in a second
temporary buffer. Note that we'd like to be starting this before we get
done with all of 2 (and I THINK the VAX9000 probably does that??) but
you have to be careful to bypass/interlock on potential side-effects to
registers .... actually, you may well have to keep shadow copies of
every register that might get written in the instruction, since every
operand can use auto-increment/decrement. You'd probably want badly to
try to compute the address of the second argument and do the MMU
access interleaved with the memory access of the first, although the
ability of any operand to need 2-4 MMU accesses probably makes this
tricky.  [Recall that any MMU access may well cause a page fault....]

4) Now, do the add. [could cause exception]

5) Now, do the third specifier .... only, it might be a little different,
depending on the nature of the cache, that is, you cannot modify cache or
memory, unless you know it will complete.  (Why? well, suppose that
the location you are storing into overlaps with one of the indirect-addressing
words pointed to by r1 or 4(r1), and suppose that the store was unaligned,
and suppose that the last byte of the store crossed a page boundary and
caused a page fault, and that you'd already written the first 3 bytes.
If you did this straightforwardly, and then tried to restart the
instruction, it wouldn't do the same thing the second time.)

6) When you're sure all is well, and the store is on its way, then you
can safely update the two registers, but you'd better wait until the end,
or else, keep copies of any modified registers until you're sure it's safe.
(I think both have been done ??)

7) You may say that this code is unlikely, but it is legal, so the CPU must
do it.  This style has the following effects:
	a) You have to worry about unlikely cases.
	b) You'd like to do the work, with predictable uses of functional
	units, but instead, they can make unpredictable demands.
	c) You'd like to minimize the amount of buffering and state,
	but it costs you in both to go fast.
	d) Simple pipelining is very, very tough: for example, it is
	pretty hard to do much about the next instruction following the
	ADDL, (except some early decode, perhaps), without a lot of gates
	for special-casing.
	(I've always been amazed that CVAX chips are as fast as they are,
	and VAX 9000s are REALLY impressive...)
	e) EVERY memory operand can potentially cause 4 MMU uses,
	and hence 4 MMU faults that might actually be page faults...
8) Consider how "lazy" RISC designers can be, with the RISC sequence shown:
	a) Every load/store uses exactly 1 MMU access.
	b) The compilers are often free to re-arrange the order, even across
	what would have been the next instruction on a CISC.
	This gets rid of some stalls that the CISC may be stuck with
	(especially memory accesses).
	c) The alignment requirement avoids especially the problem with
	sending the first part of a store on the way before you're SURE
	that the second part of it is safe to do.

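To put numbers on 7e) and 8a): a quick back-of-the-envelope sketch in C
(mine, not from the article), using the worst-case assumption spelled out
above that both the pointer longword and the data may straddle a page:

	#include <stdio.h>

	/* An unaligned access may cross a page boundary: 2 MMU lookups;
	   an aligned one needs exactly 1.                               */
	static int mmu_uses(int may_be_unaligned)
	{
	    return may_be_unaligned ? 2 : 1;
	}

	int main(void)
	{
	    int vax_deferred = mmu_uses(1)   /* fetch the pointer longword */
	                     + mmu_uses(1);  /* fetch or store the data    */
	    int risc_load    = mmu_uses(0);  /* alignment required (5a)    */

	    printf("worst case per operand: VAX @(Rn)+ = %d MMU uses, "
	           "RISC load/store = %d\n", vax_deferred, risc_load);
	    return 0;                        /* prints 4 and 1 */
	}
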
Finally, to be fair, let me add the two cases that I knew of that were more
on the borderline: i960 and Clipper:
CPU	Age	3a 3b 3c 3d	4a 4b 5a 5b	6a 6b	# ODD
RULE	<6	=1 =4 <5 =0	=0 =1 <2 =1	>4 >3
-------------------------------------------------------------------------
J1	5	 4+ 8+ 9+ 0      0  1  0  2      4+ 3+	5	Clipper
K1	3	 2+ 8+ 9+ 0	 0  1  2+ -      5  3+	5	Intel 960KB

SUMMARY:
	1) RISCs share certain architectural characteristics, although there
	are differences, and some of those differences matter a lot.
	2) However, the RISCs, as a group, are much more alike than the
	CISCs as a group.
	3) At least some of these architectural characteristics have fairly
	serious consequences on the pipelinability of the ISA, especially
	in a virtual-memory, cached environment.
	4) Counting instructions turns out to be fairly irrelevant:
		a) It's HARD to actually count instructions in a meaningful
		way... (if you disagree, I'll claim that the VAX is RISCier
		than any RISC, at least for part of its instruction set :-)
		b) More instructions aren't what REALLY hurts you, anywhere
		near as much as features that are hard to pipeline;
		c) RISCs can perfectly well have string-support, or decimal
		arithmetic support, or graphics transforms ... or lots of
		strange register-register transforms, and it won't cause
		problems .....  but compare that with the consequence of
		adding a single instruction that has 2-3 memory operands,
		each of which can go indirect, with auto-increments,
		and unaligned data...
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086

chased@rbbb.Eng.Sun.COM (David Chase) (04/19/91)

Hey, you left out the Acorn Risc Machine.  Not a big name in the
workstation market, but I understand that they sell a good number of
them, and the instruction set is "interesting" (a bit-twiddler/
superoptimizer's wet dream, if you ask me, but probably "risc" by many
definitions of the term).  VTI sells this chip in the US; they should
be able to give you a spec if you want it.

David Chase
Sun

henry@zoo.toronto.edu (Henry Spencer) (04/19/91)

In article <2419@spim.mips.COM> mash@mips.com (John Mashey) writes:
>-A1 [29k] is slightly unusual in having more integer registers, and less FP
>than usual. ...
>-D1 [88k] is unusual in sharing integer and FP registers

This is slightly out of date.  AMD appears to have (wisely) decided to ditch
the peculiar 29027 FPU architecture.  The 29050's on-chip floating point uses
the integer register bank (disregarding one or two odd instructions), subject
to a constraint that double-precision arithmetic use even-odd pairs.
-- 
And the bean-counter replied,           | Henry Spencer @ U of Toronto Zoology
"beans are more important".             |  henry@zoo.toronto.edu  utzoo!henry

mash@mips.com (John Mashey) (04/19/91)

In article <11810@exodus.Eng.Sun.COM> chased@rbbb.Eng.Sun.COM (David Chase) writes:
>Hey, you left out the Acorn Risc Machine.  Not a big name in the
>workstation market, but I understand that they sell a good number of
>them, and the instruction set is "interesting" (a bit-twiddler/
>superoptimizer's wet dream, if you ask me, but probably "risc" by many
>definitions of the term).  VTI sells this chip in the US; they should
>be able to give you a spec if you want it.

Sorry, it could have been included, but I just ran out of time and
space, and I thought I had enough data to make the point, which was that
there were noticeable differences between RISC and CISC architectures,
regardless of the implementation.
The ARM would certainly get classified as a RISC, with
	32-bit instructions, 1 size
	a handful of memory address modes
	no indirect addressing
	only loads/stores access memory
	no more than 1 memory address/instruction
	alignment (I think)
	1 use of memory control/TLB per instruction for data
	4 bits available for integer register specifiers
The manual I have shows it with no FP at all, but then it's 2 years old.

Although I didn't post them, the more complete tables that I was working from
contained multiple implementations of some of the architecture families,
from which one may find several trends:
	a) Only occasionally does anyone (RISC or CISC) subtract instructions.
	b) Both RISC and CISC often add instructions as time goes on.
	c) Sometimes CISCs got CISCier in their ISAs, e.g., by adding
	addressing modes the way the 68020 did, or by relaxing alignment
	requirements the way 360->370 did.
	d) Sometimes CISC implementations were done in more RISC-like
	fashion (i.e., trap certain opcodes and emulate).
	e) I could find no architecture that clearly started as a RISC,
	and then seriously evolved into a CISC, or took on any of the
	attributes that I used to distinguish CISCs and RISCs.
	(So much for this idea that CRISP = Complex Reduced Instruction Set Processor
	(not AT&T CRISP) is a merger of RISC and CISC, and that current
	RISC and CISC architectures are evolving towards each other.
	Nonsense.)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086

torbenm@diku.dk (Torben Ægidius Mogensen) (04/19/91)

mash@mips.com (John Mashey) writes:

>Age: number of years since first implementation sold in this family
>(or first thing of which this is binary compatible with)
>3a: # instruction sizes
>3b: maximum instruction size in bytes
>3c: number of distinct addressing modes for accessing data (not jumps).
>3d: indirect addressing: 0: no, 1: yes
>4a: load/store combined with arithmetic: 0: no, 1:yes
>4b: maximum number of memory operands
>5a: unaligned addressing of memory references allowed in load/store,
>	without specific instructions
>	0: no never (MIPS, SPARC, etc)
>	1: sometimes (as in RS/6000)
>	2: just about any time
>5b: maximum number of MMU uses for data operands in an instruction
>6a: number of bits for integer register specifier
>6b: number of bits for 64-bit or more FP register specifier,
>	distinct from integer registers

You might add ARM to the list

CPU	Age	3a 3b 3c 3d	4a 4b 5a 5b	6a 6b	# ODD
RULE	<6	=1 =4 <5 =0	=0 =1 <2 =1	>4 >3
-------------------------------------------------------------------------
	6+ (7?)	 1  4 13+ 0      0  1? 0  1? 	 4+ 0+  4	ARM

Notes:

There are actually 4 bits that specify addressing mode, but
four of the 16 modes have the same effect. This is due to a very
orthogonal specification.

The (?) in 4b and 5b are due to the load/store multiple register
instructions, which use one memory access per register.

There is no FP unit on the chip, thus no separate FP registers. There
has been an announcement of an FPU to appear this year, but I don't
know anything about it.

I'm not certain about the age, but it was, I think, the first
commercially available RISC processor.

	Torben Mogensen (torbenm@diku.dk)

torbenm@diku.dk (Torben Ægidius Mogensen) (04/19/91)

I made an error in the number of addressing modes. The stated 13 modes
(16 with 4 identical) should be 10, since in 6 (8) of the modes, one
of the specifying bits has no effect in user mode. You could also say
that there are 20 modes, as each mode can transfer either 8 or 32 bits. 

	Torben Mogensen (torbenm@diku.dk)

richard@aiai.ed.ac.uk (Richard Tobin) (04/22/91)

In article <1991Apr19.133634.15241@odin.diku.dk> torbenm@diku.dk (Torben Ægidius Mogensen) writes:

>You might add ARM to the list

>I'm not certain about the age, but it was, I think, the first
>commercially available RISC processor.

I'm not certain either (surely there's someone from Acorn out there?) but
it was certainly being discussed at Acorn in the summer of 1983.

-- Richard
-- 
Richard Tobin,                       JANET: R.Tobin@uk.ac.ed             
AI Applications Institute,           ARPA:  R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk
Edinburgh University.                UUCP:  ...!ukc!ed.ac.uk!R.Tobin

johng@OCE.ORST.EDU (John A. Gregor) (06/11/91)

In article <1991Jun9.214548.23661@syacus.acus.oz.au> william@syacus.acus.oz.au (William Mason) writes:
>
>	If RISC is *it*  ... How come the guy with short legs
>	can't win the Olympic foot races ???

I don't know, but he'll still beat the guy trying to run on all fours,
wearing an army boot on one foot, a running shoe on the other, a boxing
glove on his left hand, and a football helmet. :-)

Sorry, world, for contributing noise to the RISC vs CISC noisefest, but I
couldn't resist.

JohnG
-- 
John A. Gregor
College of Oceanography			 E-mail: johng@oce.orst.edu
Oregon State University			Voice #: +1 503 737-3022
Oceanography Admin Bldg. #104		  Fax #: +1 503 737-2064
Corvallis, OR  97331-5503

wkk@cbnewsl.att.com (wesley.k.kaplow) (06/18/91)

Well, this was not just another code size measurement.  I thought the
original poster requested static code information.  Well, I wanted to stay
away from any performance data but here goes anyway:

Note: See the previous posting by me to see what the benchmarks are.

DYNAMIC Instruction Count:

		Benchmark	MIPS Instructions/CRISP Instructions
		----------------------------------------------------
		   BSC				1.56
		Dhrystone			1.24
		   RP				1.50

Static MIPS Instruction Distribution ('cat' + library)

		Instr/Operation			% of Distribution
		-------------------------------------------------
		 load/store			26%
		branch+Funcall			19%
		    nop				13%
		   arith			12%
		 move reg/load immed		18%
		     misc			12%


I hope that gives you some more information.   It was clear to us, and to 
MIPS, that you can sacrifice some characteristics and gain in cycles/instr
efficiency.


with disclaimer;
use disclaimer;

Wesley Kaplow
AT&T Bell Labs
Whippany, NJ
201-386-4634