[net.arch] RISC processors

haapanen@watdcsu.UUCP (Tom Haapanen [DCS]) (11/11/84)

< Nami nami nami nami ... >

The latest BYTE (should I be ashamed that I still subscribe?  :-) has
an article by John Markoff on RISC (Reduced Instruction Set Computer)
chips.  In particular, the article concentrated on the Berkeley RISC I
and RISC II designs.

Even though the instruction set looks horribly insufficient, I suppose
it could be lived with, especially with the 138 registers RISC I has.
According to the limited benchmarks in the article, the 1.5 MHz RISC I
was able to beat up on an 8 MHz MC68000, and the 5 MHz RISC II runs
'integer C programs' faster than a 10 MHz NS32016 and a 12 MHz MC68000.
The article does not mention how many registers the RISC II has, but
it does say that a 12 MHz RISC II has been fabricated.

What I'm wondering about, though, is whether it is feasible to build a
RISC chip in the VAX 11/780 class, i.e. on par with the MC68020 (note:
this is only to imply that the 68020 is in the same *class* as a 780,
not necessarily equal performance).  Apparently the RISC II contains
44,500 transistors, as opposed to the 68020's 200,000, so at least
there is a lot of room to cram more stuff in.  However, will this
improve performance significantly?

The article also vaguely refers to the Pyramid 90x having register
windows.  Does this mean the Pyramid is a RISC design, or does it just
have large register banks?  Are there any other RISC designs and/or
chips commercially available, or will there be in the near future?

Comments would be appreciated, especially from people involved in the
Berkeley RISC or SOAR projects, or the Stanford MIPS project.


Tom Haapanen		University of Waterloo		(519) 744-2468

allegra \
clyde \  \
decvax ---- watmath --- watdcsu --- haapanen
ihnp4 /  /
linus  /		The opinions herein are not those of my employers,
			of the University of Waterloo, and probably not of
			anybody else either.

henry@utzoo.UUCP (Henry Spencer) (11/13/84)

> Even though the instruction set looks horribly insufficient, I suppose
> it could be lived with...

The whole point of a RISC machine is that you don't live with it; the
compiler does that for you.  So long as it runs his programs quickly,
the nature of the instruction set is not really the customer's concern.
The simplicity makes life lots easier for the compiler and the chip
designer.

> The article does not mention how many registers the RISC II has ...

I think RISC II has roughly twice the register count of RISC I.  Note
that your programs don't get access to all of them simultaneously, so
this is an implementation/performance issue mostly.

> What I'm wondering about, though, is whether it is feasible to build a
> RISC chip in the VAX 11/780 class...

Remember that the current RISC chips are using mediocre MOS processing
and easy-and-simple design rules.  Last I heard, running at the original
target clock speed (which probably hasn't been reached yet), the RISC
design was tentatively benchmarked (by simulation) as substantially faster
than a 780.  This was with, I think, 400-ns memory access times, and an
effective instruction cache was assumed.

> ...  Apparently the RISC II contains
> 44,500 transistors, as opposed to the 68020's 200,000, so at least
> there is a lot of room to cram more stuff in.  However, will this
> improve performance significantly?

If the RISC people got 200k transistors to play with, probably tops on
the agenda would be an instruction cache.  Since the RISC designs execute
a lot of instructions relative to a conventional design, they benefit a
lot from faster instruction fetches.
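Henry's point can be sketched, in modern terms, with a toy direct-mapped
instruction cache (a sketch only; the cache size and line size below are
arbitrary assumptions, not RISC I/II parameters):

```python
# Toy direct-mapped instruction cache: loops re-execute the same
# addresses, so even a small cache turns most fetches into hits.
CACHE_LINES = 64       # assumed size, not a real RISC I/II parameter
WORDS_PER_LINE = 4     # assumed line size

def simulate(addresses):
    """Return the hit rate for a stream of instruction addresses."""
    tags = [None] * CACHE_LINES
    hits = 0
    for addr in addresses:
        line = addr // WORDS_PER_LINE
        index = line % CACHE_LINES
        if tags[index] == line:
            hits += 1
        else:
            tags[index] = line   # miss: fetch the line from memory
    return hits / len(addresses)

# A 20-instruction loop body executed 100 times: after the first pass,
# every fetch hits, so a high instruction-issue rate costs almost no
# memory traffic.
stream = [pc for _ in range(100) for pc in range(20)]
print(simulate(stream))  # hit rate very close to 1.0
```

The first pass through the loop misses once per line; every later pass
hits, which is why a design that executes more (but simpler)
instructions benefits so much from an instruction cache.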

> The article also vaguely refers to the Pyramid 90x having register
> windows.  Does this mean the Pyramid is a RISC design, or does it just
> have large register banks?

The Pyramid is *claimed* to be more-or-less a RISC design, although from
the sounds of the glossies, they've succumbed to the temptation/need (hard
to tell which) to add tailfins and "features".  Note that it's not a VLSI
RISC, it's a RISC design implemented in ordinary logic.

> Are there any other RISC designs and/or
> chips commercially available, or will there be in the near future?

Don't think there are any other commercial RISC designs just yet, although
there may be half a dozen startup companies about to prove me wrong.  As
far as I know, there are *no* commercial RISC chips at the moment.  The
idea has generated enough enthusiasm that all kinds of people are likely
to jump on the bandwagon in the near future.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

kiessig@idi.UUCP (Rick Kiessig) (11/14/84)

RISC machines may not be as good as they look at first
glance.  The Berkeley implementations, in particular, gain
nearly all of their performance by using register windows.
The simplicity of the instruction set may in fact slow these
chips down, although it certainly makes them easier to
implement.
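For concreteness, the register-window idea can be modeled like this (a
sketch only; the window size, overlap, and register-file size below are
illustrative assumptions, not the actual RISC I/II parameters): each
call slides a window pointer across a large physical register file
instead of saving registers to memory, and adjacent windows overlap so
arguments pass in registers.

```python
class RegisterWindows:
    """Toy overlapping-register-window file.

    Each procedure sees WINDOW registers; consecutive windows overlap
    by OVERLAP registers, so the caller's outgoing-argument registers
    are the callee's incoming-argument registers -- no memory traffic
    on call or return (until the physical file overflows, which this
    sketch ignores).
    """
    WINDOW = 16    # registers visible to one procedure (assumed)
    OVERLAP = 4    # registers shared between caller and callee (assumed)

    def __init__(self, n_physical=128):
        self.regs = [0] * n_physical
        self.base = 0          # start of the current window

    def read(self, r):
        return self.regs[self.base + r]

    def write(self, r, value):
        self.regs[self.base + r] = value

    def call(self):
        # Slide the window; the top OVERLAP registers of the caller
        # become the bottom OVERLAP registers of the callee.
        self.base += self.WINDOW - self.OVERLAP

    def ret(self):
        self.base -= self.WINDOW - self.OVERLAP

rf = RegisterWindows()
rf.write(12, 42)   # caller puts an argument in an overlap register
rf.call()
print(rf.read(0))  # callee sees it as its register 0 -> 42
```

This is where the speedup Rick mentions comes from: procedure call and
return become pointer arithmetic rather than loads and stores.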

-- 
Rick Kiessig
{decvax, ucbvax}!sun!idi!kiessig
{akgua, allegra, amd, burl, cbosgd, decwrl, dual, ihnp4}!idi!kiessig
Phone: 408-996-2399

henry@utzoo.UUCP (Henry Spencer) (11/16/84)

> RISC machines may not be as good as they look at first
> glance.  The Berkeley implementations, in particular, gain
> nearly all of their performance by using register windows.
> The simplicity of the instruction set may in fact slow these
> chips down, although it certainly makes them easier to
> implement.

Don't knock implementation ease.  No way could the Berkeley people have
built a chip with all those registers unless the control section was
very simple.  There is also the telling observation that many existing
CISC machines are notorious for being faster if you "hand code" the
complex sequences rather than relying on the all-singing-all-dancing
fancy instructions.  For example, C function calls are faster on an
11/70 than on a VAX, even though it's one instruction on the VAX and
about a dozen on the 70.

Another way to look at it is that the RISC machine is a machine with
unusually "clean" microcode that is executed directly out of main
memory, and generated directly by compilers.  Assuming that the
cleanliness and the memory fetches aren't a major problem, clearly
it is faster to write your own microcode for a specific job than to
rely on the CPU designer's ROM microcode.

"If the big performance win is register windows, then the rest of the
CPU should be made as simple as possible, i.e. a RISC."
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

jlg@lanl.ARPA (11/20/84)

It is obviously possible to build a RISC machine that is in the same class
as a VAX.  But why would you want to when some RISC-like machines have been
running for years MUCH FASTER than a VAX?  These are the CDC machines, the 
CRAY machines, and the more recent vector processor machines 'from the east.'

For example, the CRAY machine is VERY RISC-like.  There are two data addressing
modes corresponding to the VAX 'literal mode' and the VAX 'displacement mode'.
There are two branch addressing modes corresponding to the VAX 'literal mode'
and the VAX 'register mode'.  No instructions other than loads, stores, and
branches address the memory.  All the other instructions use 'register mode'
for their operands, mostly three address code.  Contrary to the remarks of
previous submitters, there is no difficulty achieving very high speed
floating point arithmetic on a RISC-like machine.  In fact the floating
point units on the CRAY-1S are just one clock slower than their
integer counterparts.
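The load/store discipline described above can be illustrated with a tiny
interpreter (a sketch; the instruction names are made up for
illustration, not CRAY mnemonics): only LOAD and STORE reference memory,
and everything else is three-address, register-to-register.

```python
# Minimal load/store machine in the style jlg describes: only
# LOAD/STORE touch memory; ADD and MUL are three-address register ops.
def run(program, memory, n_regs=8):
    regs = [0] * n_regs
    for op, a, b, c in program:
        if op == "LOAD":        # reg[a] <- memory[b]
            regs[a] = memory[b]
        elif op == "STORE":     # memory[b] <- reg[a]
            memory[b] = regs[a]
        elif op == "ADD":       # reg[a] <- reg[b] + reg[c]
            regs[a] = regs[b] + regs[c]
        elif op == "MUL":       # reg[a] <- reg[b] * reg[c]
            regs[a] = regs[b] * regs[c]
    return memory

# Compute mem[2] = mem[0] + mem[1]: two loads, one register-to-register
# add, one store.  A memory-to-memory CISC would do this in one
# instruction, but with three memory operands hidden inside it.
program = [
    ("LOAD",  0, 0, 0),
    ("LOAD",  1, 1, 0),
    ("ADD",   2, 0, 1),
    ("STORE", 2, 2, 0),
]
print(run(program, [3, 4, 0]))  # -> [3, 4, 7]
```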

There are several differences between the CRAY machines and the RISC machines
proposed by Patterson and others.  The most important are the lack of
orthogonality in the instruction set (although the CRAY-2 promises to fix
this deficiency to some extent) and the lack of a high-speed context-switching
mechanism.  This last point is offset somewhat by the ability to 'block load'
or 'block store' certain register sets (unfortunately, the present compilers
don't make particularly good use of this feature).  Another major difference
between the two types of machines is the presence in the CRAY of several 
different functional units each with different timing characteristics.  
This requires extra logic to reserve registers until the operation is 
completed.  

So far I have described only the scalar part of the CRAY machine, and 
for good reason.  Even without vector operations, the CRAY is MUCH faster
than a VAX.  I suspect that a VLSI version of the CRAY scalar instruction
set would be able to outperform a VAX built with the same technology.
The advantages of the reduced instruction set combined with the simpler
memory interface (only two addressing modes with NO virtual memory support)
would allow the 'micro CRAY' to be clocked at much higher rates.  Of 
course, I doubt that the CRAY architecture could be put on a single chip
with today's technology, but it could probably be done with a small set
of chips for each functional unit.

Programming a RISC machine is simple compared to a CISC machine - far
from being 'woefully inadequate', the RISC type of machine seems just right.
In a CISC machine there are usually about half a dozen different ways of 
performing any given function, the most obvious is usually NOT the fastest,
or even close.  On a RISC machine, the most obvious code sequence is almost
always the fastest - it may be the ONLY obvious code sequence.  After 17 
years of assembly coding I came to the conclusion that the CRAY instruction
set was the easiest to use of any machine I have seen.  And after two years
of compiler maintenance on the CRAY I concluded that the instruction set
was the easiest to write a compiler for as well (the CRAY compiler is such
a poorly written thing that it would probably never have even worked on 
another machine).  The only really difficult part is scheduling vector
operations, which became much easier on the new X/MP machines.

A word needs to be said about the lack of addressing modes and virtual memory.
At the speeds at which RISC machines will run (not the demo units made from
MOS but the real production chips that (I hope) will come out) memory will
be the slowest component of the system.  On the CRAY, only the reciprocal
approximation is slower than a memory fetch; all other operations are at least
twice as fast (integer add is 7 times as fast, logical operations are 14 
times as fast).  Staged memory is a help (several fetches or stores going
simultaneously), but all the other functional units are staged as well. 
It makes sense to limit memory traffic to just loads and stores so that
other functional units don't end up waiting for memory references.  It also
makes sense to limit the number of addressing modes so that memory traffic
doesn't get even slower due to the extra checking and circuitry in the
memory interface.  If memory traffic is slow, then traffic to the secondary
storage (disk or whatever) is REALLY SLOW.  The data transfer rate for 
the standard CRAY drive (CDC DD-29) is 38.7x10^6 bits/sec,  and the sector
size is 512 words (64 bits/word); less than a millisecond per word - or
about 68,000 cpu cycles!!  This doesn't even count seek time, latency, or
scheduling the traffic with the channel.  Obviously, the operating system
would have to suspend your task until the page had been loaded, and it 
is also clear that no amount of 'lookahead' in the paging scheme could
significantly improve its performance.  The solution
is not to page, but to provide a very large amount of central memory.
With large central memory, there is always enough room for code (it's small)
but data may still need to be kept on secondary storage.  Fortunately, it's
usually possible to write code which anticipates its data needs and issues
reads and writes (asynchronous of course) long in advance of the use of that
data.  Short of that, reads and writes don't do that much worse than paging
would have done anyway.
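The read-ahead style recommended above can be sketched, in modern terms,
as double buffering (a sketch; Python threads stand in for asynchronous
channel I/O, and `slow_read` is a made-up stand-in for a device read):
issue the read of block n+1 before processing block n, so computation
and I/O overlap.

```python
import threading

def slow_read(block_number, blocks):
    """Stand-in for an asynchronous device read (hypothetical)."""
    return blocks[block_number]

def process_all(blocks):
    """Process every block, overlapping each read with the previous
    block's processing (double buffering)."""
    results = []
    current = slow_read(0, blocks)
    for n in range(len(blocks)):
        pending = [None]
        reader = None
        if n + 1 < len(blocks):
            # Issue the next read *before* processing the current block.
            reader = threading.Thread(
                target=lambda: pending.__setitem__(0, slow_read(n + 1, blocks)))
            reader.start()
        results.append(sum(current))       # "processing" = summing the block
        if reader:
            reader.join()                  # wait only if the read is slower
            current = pending[0]
    return results

print(process_all([[1, 2], [3, 4], [5, 6]]))  # -> [3, 7, 11]
```

When the program can anticipate its data needs this far ahead, each
block's transfer time hides behind the previous block's computation.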

I'm looking forward to the first commercial RISC chips (or chip sets).  I
expect that to be competitive they will have several functional units (each
staged), only one or two addressing modes, a large central memory requirement,
and no virtual addressing capability.  With this combination, I think RISC
could outrun any other small computer available.

jlg@lanl.ARPA (11/20/84)

In the preceding note I claimed that the CRAY disk transfer rate was less than
a millisecond per word.  Obviously it is - it's less than a millisecond per
sector, which is 512 words.  This 512 word block still corresponds to about
68,000 cpu cycles though, a lot of time any way you slice it!
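The corrected numbers check out (a sketch of the arithmetic; the
12.5 ns clock period is the published CRAY-1 cycle time, assumed here
since the posts don't state it):

```python
# Transfer time for one DD-29 sector and its cost in CRAY-1 clocks.
rate_bits_per_sec = 38.7e6   # DD-29 transfer rate, from the post
sector_bits = 512 * 64       # 512 words of 64 bits each
clock_sec = 12.5e-9          # CRAY-1 clock period, 12.5 ns (assumed)

sector_time = sector_bits / rate_bits_per_sec
cycles = sector_time / clock_sec
print(round(sector_time * 1e3, 3))  # ~0.847 ms per sector
print(round(cycles))                # ~67,700 cycles -- "about 68,000"
```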

bcase@uiucdcs.UUCP (11/22/84)

[bug lunch]

The Ridge-32 is a RISCy machine.  It does not have lots of registers,
but rather has a very simple instruction set.  It can do 8 MIPS if
everything is in registers.

The RISC I has 78 registers, the RISC II has 138.

The Pyramid machine is not a RISC machine even though it does have
the register windows.  It is microcoded and does pay a penalty,
but in the interest of getting the machine out fast, it was a big
win for Pyramid to use microcode.

There are some RISC designs in the pipelines of some companies:  HP
has a project called SPECTRUM which is to be the basis of "all"
future big HP computers (I may have that phrased a little incorrectly).
It "should" already be out, but you know big companies.  Inmos has
been spouting off about the Transputer, a single-chip RISC machine,
but we haven't seen too many of those either.  DEC is experimenting
with RISC.  SUN is experimenting with RISC.  Weitek
(the makers of floating point chip sets) is experimenting with RISC.
In short, anyone who has any sense and does not fear the idea of
being incompatible with past designs (at the instruction set level),
is toying with RISC designs.  There are some start-ups whose sole
goal in life is to exploit the benefits of the RISC concepts (not
all of which are to be found in papers and especially not in BYTE),
and there are some companies trying to form with the same goal in
mind.

What are the main RISC concepts and why are they so important?
Main concepts:  efficient storage hierarchy and minimal instruction
interpretation overhead.  The importance of an efficient storage
hierarchy should be obvious, but not so obvious is the fact that the
register set of the machine is really just the fastest part of the
hierarchy.  The storage hierarchy is MUCH more important than the
number of instructions executed to implement a given function (within
reason of course).  The importance of minimal instruction interpretation
overhead has far-reaching implications: not wasting time or silicon,
making it easier for a compiler to decide which instruction, or sequence
of instructions, to use, making the machine easier to pipeline, and so on.

Another important property of RISC machines (here machine means the
hardware together with the compiler) is that they tend (given good
compilers) to REUSE results and data rather than RECOMPUTE or REFETCH
results and data.  As an example, take some VAX instruction which
computes the address of some stack-local variable:

    OPCODE    4(sp),x,y

If this instruction is in a loop, then the computation of 4+sp will
be done on EACH iteration of the loop!  This is clearly wasted effort.
A good RISC compiler will factor out this computation and place the
address in a register (thus, we begin to see the need for lots of
registers).  Now you say, "but the VAX compiler could do this also!"
Sure, but then what becomes of the sp+offset addressing mode?  It is
surely not nearly as important now, and all the control needed to
implement it is less utilized, or perhaps little utilized (there will
still be some other places where register+offset will be useful),
and if it were eliminated, the whole machine might become faster by
virtue of a shorter cycle time.  But now, we need to get instructions
to the CPU at shorter intervals, thus again reinforcing the need for
an efficient storage hierarchy.  If enough iterations of this
measuring/simplifying algorithm are carried out, a RISC design is the
result.
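The loop example above can be sketched at the source level (a sketch;
a real compiler does this on addresses in registers, but Python values
illustrate the same transformation): the invariant computation sp+4 is
done once, outside the loop.

```python
# Toy memory and stack pointer; index arithmetic stands in for 4(sp).
memory = [0] * 64
sp = 16
memory[sp + 4] = 7    # the stack-local variable

# Naive code: recompute sp+4 on every iteration -- what the CISC
# addressing mode does implicitly, each time the instruction executes.
total = 0
for _ in range(10):
    total += memory[sp + 4]      # address computed 10 times

# Hoisted code: a RISC compiler factors the invariant address into a
# register once, before the loop (hence the need for many registers).
addr = sp + 4                    # computed once
hoisted_total = 0
for _ in range(10):
    hoisted_total += memory[addr]

print(total, hoisted_total)  # 70 70 -- same result, less work per iteration
```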

The issue is certainly more complicated than my discussion would lead
one to believe (one problem is handling interrupts and other pipeline
inconsistencies), but I think I have shed some light on the subject.
There has been considerable work done in this area, and even though
much more needs to be done, the ideas are mature enough that some are
venturing into the commercial market.  RISC machines promise to give
us high performance designs in less time and space (and maybe money)
than conventional machines.  Plus, by using the RISC philosophies,
machines can be designed for special purposes in cases where an attempt
to do so wouldn't have been made before.