[net.arch] RISC cache vs CISC u-code

dan@pyramid.UUCP (Danial Carl Sobotta) (03/04/86)

> This doesn't seem right.  Does 'practical' in this sentence mean less
> bus contention?
> 
> Since a RISC machine doesn't have the fancy microcoded instructions of
> a CISC machine, it takes more instructions to do the same job.  Even
> though a RISC instruction typically requires fewer bits than a CISC
> instruction, a program for a RISC machine is generally said to be
> larger than the equivalent program for a CISC machine.  With today's
> low memory prices, this is not a terrible thing.
> 
> I was always taught that 80%-95% of the bus usage of a processor was
> for instruction fetches.  Therefore if a RISC machine takes more bytes
> of instructions to run a program than a CISC machine would, the RISC
> processor will eat up MORE bus cycles, leaving fewer for displays, DMA
> , and co-processors.

knudsen@ihwpt.UUCP (mike knudsen) replies:

>I agree with you.  Modern CISC processors are microcoded
>(nanocoded?) and fetch one CISC instruction from system RAM,
>then proceed to fetch many nano-instrs from internal ROM
>to perform it.  Meanwhile, the bus is free.
>RISC machines essentially run "nano code" out of YOUR main
>RAM over YOUR bus.  So yes, you seem right to me.
>Or are we both missing something?

Yup, you probably are.  The space that is freed up on a RISC chip by having
little if any u-code ROM can be used for more cache.  The (*rare*) cases
where a multi-instruction RISC routine is needed to 'emulate' a CISC
instruction can be handled by keeping that routine in the cache.  This is
done automatically by the caching algorithm in hardware (simple!).
So having the cache effectively reduces bus traffic not just to a CISC
level (for the reason above) but probably to LESS than that, because
the cache can be used for ALL instructions (and data).
Or am I missing something?
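
[For the curious, here is roughly what that "simple" hardware caching
 algorithm amounts to, sketched in C as a direct-mapped instruction
 cache.  The size and names are invented for illustration, not taken
 from any particular chip; the point is that a short 'emulation'
 routine, once fetched, keeps hitting in the cache and generates no
 further bus traffic.]

    #define NLINES 256                  /* cache size in words (assumed) */

    struct cline { int valid; unsigned long tag, word; };
    static struct cline icache[NLINES];

    /* stand-in for a slow read over the system bus */
    static unsigned long memory_fetch(unsigned long addr) { return addr; }

    unsigned long ifetch(unsigned long addr)   /* addr is a word address */
    {
        struct cline *l = &icache[addr % NLINES];

        if (!l->valid || l->tag != addr / NLINES) {  /* miss: one bus cycle */
            l->word  = memory_fetch(addr);
            l->tag   = addr / NLINES;
            l->valid = 1;
        }
        return l->word;                              /* hit: no bus traffic */
    }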



-- 


  'Out of the inkwell comes Bozo the Clown ...'
 
DISCLAIMER:  These opinions are neither mine nor my C-compiler's
       sun!pyramid!dan

billw@Navajo.ARPA (William E. Westfield) (03/06/86)

Hey, look everybody - no one is claiming that computers with simple
instruction sets are a new idea - nearly all processors invented a
sufficiently long time ago have nice simple instruction schemes, so as
to make hardware implementation easier (remember, there wasn't always
microcode!).  The original PDP8 (a 12 bit computer), PDP11s (that
everyone knows and loves), and the PDP6 (DEC's original 36 bit machine,
with essentially the same instruction set as a DEC10/20, was built of
approximately 3000 gates) all were RISCy in their own ways.  So are
8008's, 8080's, 1802's, 6800's, 6502's, and the rest of your "first
generation" microprocessors.

  What the RISC people did is essentially say "look guys, it's all very
nice that transistors are smaller now, and you have microcode and
nanocode and all that and think that you can come close to implementing
a high level language right on the chip.  Unfortunately, almost isn't
good enough, and trying to write a compiler that takes advantage of a
chip that "almost" implements something is much worse than writing a
compiler for something that doesn't do so at all.  In fact, we aren't
sure we know enough about code generation to take advantage of even the
simple addressing modes on something like a PDP11.  We'd really be
happy if the machine could just do this, that, and the other thing,
only do it real fast.  Then our compiler could be simple, and easier to
move to faster processors, and this study here shows that it is likely
to come out working faster anyway..." 

In short, "RISC" is a reaction to the fact that hardware is advancing
faster than software technology, and hardware designers who thought
they understood software really didn't.

With respect to there being fewer available DMA cycles in a RISC machine,
this is true.  On the other hand, they say, allocate a bunch of memory
dedicated to device communications, or just use faster memory.
Mass memory is about 10 times faster than it was 10 years ago,
and 10^n times cheaper.  Software has improved in that time too,
but not anywhere near as much.

glacier!navajo!billw

[DEC should put the 20 on a chip, call it a risc machine, and sell
 systems for 15K.  They'd be very successful.]

berger@imag.UUCP (Gilles BERGER SABBATEL) (03/07/86)

In article <136@pyramid.UUCP> dan@pyramid.UUCP (Dan Sobottka) writes:
>
>...  The space that is freed up on a RISC chip by having
>little if any u-code ROM can be used for more cache.  ...
>... So having the cache effectively reduces bus traffic not just to a CISC
>level (for the reason above) but probably to LESS than that, because the
>cache can be used for ALL instructions (and data).
>Or am I missing something?
>
OK, but what happens when the system is multiprogrammed?  Aren't frequent
swaps between processes likely to break the cache's efficiency?  This could
cause a significant degradation of RISC performance in a multiuser
environment (cf. previous discussions about the Ridge).

... Or am I missing something?....

-- 
Gilles BERGER SABBATEL - IMAG-TIM3/INPG, GRENOBLE - FRANCE
berger@archi@imag.UUCP

rose@think.ARPA (John Rose) (03/07/86)

In article <136@pyramid.UUCP> dan@pyramid.UUCP (Dan Sobottka) writes:
>
>...  The space that is freed up on a RISC chip by having
>little if any u-code ROM can be used for more cache.  ...
>
In article <570@imag.UUCP> berger@imag.UUCP (Gilles BERGER SABBATEL) replies:
>OK, but what happens when the system is multiprogrammed?  Aren't frequent
>swaps between processes likely to break the cache's efficiency?

Interesting:  To maintain the functional correspondence between
CISC ucode ROM and RISC fast memory, we'd need something like shared
libraries for low-level routines.  The Unix practice of copying
library code into every executable image would cause gratuitous
flushing of cached "milli-code", only to bring in nearly-identical
code for the next process.  That hard-to-change ROM or WCS has
its good points :-).  What's needed for a RISC is a slowly-changing
set of libraries at well-known addresses, and--more broadly--
software engineering practices which discourage re-invention!

CISCs have context-switch problems too--look at how the
stack-frame formats have grown in the 68k family (although they
seem also to be paying for early design bugs).  (Co-)processor
registers are a cached data memory which must be flushed (saved)
and reloaded (restored).
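
[Purely illustrative: the register state being talked about, viewed as
 a small software-managed cache.  The register counts and field names
 below are assumptions, not any real machine's frame format.]

    struct context {
        unsigned long gpr[16];      /* general registers                  */
        unsigned long pc, sp, psw;  /* program counter, stack ptr, status */
        double        fpr[8];       /* coprocessor (floating point) regs  */
    };

    extern void save_state(struct context *cp);   /* the "flush"  */
    extern void load_state(struct context *cp);   /* the "reload" */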

-- 
----------------------------------------------------------
John R. Rose, Thinking Machines Corporation, Cambridge, MA
245 First St., Cambridge, MA  02142  (617) 876-1111 X270
rose@think.arpa				ihnp4!think!rose

reiter@harvard.UUCP (Ehud Reiter) (03/08/86)

The numerous articles on RISC machines have all assumed that such machines have
caches.  However, the only commercial RISC machine that I'm familiar with, the
IBM PC/RT, does NOT have a cache, and seems to suffer a factor of 3 performance
degradation because of this (2 MIPS instead of 6 MIPS).  To quote from IBM RT
PERSONAL COMPUTER TECHNOLOGY  (probably available from your friendly
neighborhood IBM salesman), pg 48 - "The 801 minicomputer ... had exceptionally
high performance.  However, much of its performance depended on its two caches,
which can deliver an instruction word and a data word on each CPU cycle.  SINCE
SUCH CACHES WERE PROHIBITIVELY COSTLY FOR SMALL SYSTEMS ..." (emphasis mine).

The point is that you can't assume that RISC machines have caches, because some
don't.  And, as near as I can tell, an RT has much less performance than a
SUN 3 (lousy floating point and I/O as well as no cache), but costs twice as
much ($15k vs $8k for diskless systems (??) ).  So, if RISC machines need
caches to perform well, then CISC machines win out, at least at the bottom end
of the market.

Incidentally, the RT has an "overlapped load" feature, where instructions that
don't reference the loaded data can be executed concurrently with a LOAD
instruction.  The only problem is, this feature is disabled in virtual memory
mode (presumably because of the difficulty of saving the state of the machine
when a page fault occurs).  A case of the "cruel real world" destroying a
cute RISC idea?

One last point - the RT does NOT have a fancy subroutine call mechanism (like
Berkeley's RISC).  So, even if it were true that an RT could execute a MULTIPLY
routine out of memory as fast as a CISC machine could execute it out of
microcode, the RISC multiply is much more expensive because of the subroutine
call overhead.

I highly recommend reading IBM RT PERSONAL COMPUTER TECHNOLOGY - it's very well
written, and it shows you the pros and cons of a real machine.

						Ehud Reiter
						harvard!reiter.UUCP
						reiter@harvard.ARPA

ark@alice.UUCP (Andrew Koenig) (03/09/86)

> OK, but what happens when the system is multiprogrammed?  Aren't frequent
> swaps between processes likely to break the cache's efficiency?  This could
> cause a significant degradation of RISC performance in a multiuser
> environment (cf. previous discussions about the Ridge).
>
> ... Or am I missing something?....

The IBM 360/91 was one of the fastest machines of its time.
Although it took 360 nanoseconds to read from memory and
720 nanoseconds to write, the machine was nevertheless capable
of executing one instruction every 60-nanosecond cycle (when
running a well-tuned program) because of heavy memory interleaving
and pipelining.

This raised the problem of how to synchronize the CPU with the
clock, which was apparently stored in memory and was updated
every 60th of a second.  They did it by quiescing the entire
machine for each clock tick.  The loss of performance was trivial
compared with the expense of handling the clock some other way.

The same may be true of context switching on RISC machines with
caches.  If no more than 100 or so context switches occur per
second, and the machine executes tens of thousands of instructions
between context switches, it doesn't really matter that the cache
is flushed each time.
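
[A back-of-the-envelope version of that argument, with every number an
 assumption picked for illustration rather than a measurement; whether
 the loss is negligible clearly depends on how big the cache is and how
 slow the refills are.]

    #include <stdio.h>

    int main(void)
    {
        double switches_per_second = 100.0;   /* "100 or so" per second     */
        double lines_refilled      = 1000.0;  /* cache lines refetched after
                                                 each switch (assumed)      */
        double refill_time         = 5.0e-7;  /* 500 ns per line (assumed)  */

        double lost = switches_per_second * lines_refilled * refill_time;

        printf("CPU time lost to cache refills: %.1f%%\n", lost * 100.0);
        return 0;
    }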

aglew@ccvaxa.UUCP (03/09/86)

Responding to billw at navajo.ARPA, who was responding to...

I agree with your basic point, but there's another aspect to RISCs:
there is a big difference at the moment between hardware, where it is 
easy to do things in parallel, and software, where it isn't. Microcode
is just software used to implement sequential operations. One of the 
things we can do to increase speed is to make sequential operations 
parallel, which usually comes down to implementing serial operations
combinatorically. Whenever you have a serial operation that cannot be
made parallel, there are usually enough special cases that can be detected
at compile time to make a standard library function suboptimal - and this
is just as true for microcode as it is for a matrix mathematical library.
(Just how many different forms of matrix multiplication are there?
Block, upper triangular, band, sparse...).

Somebody else was talking about caches. Here're some random musings:
registers are just caches explicitly controlled by software. Register windows
are specially structured stack caches. We should have a special cache for
each frequently used data type, with a fetch/replacement strategy optimized 
for that data type.
	Instructions and data are different, so they need different caches.
We have both transparent and explicitly controlled (registers) data caches;
instruction caches are usually transparent, not explicitly controlled. Could
explicitly controlled instruction caches be useful? (Ask MU5). The likely bit 
on branches is a start. Overlays are an explicitly controlled instruction 
cache mechanism. An instruction cache should have automatic linear prefetch,
and should probably try to prefetch the heads of procedures. It should try to
keep return points in the cache. Heads of loops should be left in the cache
once fetched; backward branches can be used as a clue to finding heads of
loops, but are no good if the loop is long - which is exactly when you want to
keep the loop head in the cache. What we need is a special mark for heads of
loops - perhaps an explicit instruction, perhaps just a bit in an instruction,
perhaps branch tables as in MU5.  Perhaps this could be used to minimize loop
overhead for loops that test at the top (while) rather than at the bottom
(until): the branch back to the test at the top could automatically fire off
the head-of-loop instruction, so it might be possible to execute them both in
one cycle.
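
[One way to picture the "mark the head of a loop" idea above: an
 instruction cache where software can pin a line so that it won't be
 replaced.  A C sketch, with all sizes and names invented for
 illustration.]

    #define NLINES 128

    struct iline { int valid, pinned; unsigned long tag, word; };
    static struct iline ic[NLINES];

    /* stand-in for a (slow) fetch from main memory */
    static unsigned long memory_fetch(unsigned long addr) { return addr; }

    /* explicit "keep this around" operation, issued at a loop head */
    void ic_pin(unsigned long addr)
    {
        struct iline *l = &ic[addr % NLINES];
        if (l->valid && l->tag == addr / NLINES)
            l->pinned = 1;
    }

    unsigned long ic_fetch(unsigned long addr)
    {
        struct iline *l = &ic[addr % NLINES];

        if (l->valid && l->tag == addr / NLINES)
            return l->word;                     /* hit                      */
        if (l->pinned)
            return memory_fetch(addr);          /* don't evict a loop head  */
        l->word  = memory_fetch(addr);          /* ordinary miss: refill    */
        l->tag   = addr / NLINES;
        l->valid = 1;
        return l->word;
    }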

johnl@ima.UUCP (John R. Levine) (03/10/86)

In article <765@harvard.UUCP> reiter@harvard.UUCP (Ehud reiter) writes:
>One last point - the RT does NOT have a fancy subroutine call mechanism (like
>Berkeley's RISC).  So, even if it were true that an RT could execute a MULTIPLY
>routine out of memory as fast as a CISC machine could execute it out of
>microcode, the RISC multiply is much more expensive because of the subroutine
>call overhead.

Huh?  Not true at all (and I should know, I wrote the multiply routine for the
AIX C compiler.)  There are two points.  One is that the multiply routine is
what has been called "millicode" which does not go through a full function
linkage.  In the AIX compiler, you just put the two numbers to be multiplied
in two registers and jump; the routine doesn't save anything so there is
negligible linkage overhead.

The other point doesn't apply directly to the RT but is generally important,
and it is that with decent compilers you don't need the sliding register
window hack.  Neither the MIPS chip nor the 801 had multiple register sets,
in both cases because the compiler technology made them unnecessary.  Even in
the AIX compiler, which is a straightforward version of PCC, careful choice of
argument and scratch registers keeps the number of registers saved per call
to a minimum.  I saw some of the code generated by the PL.8 compiler, and it
was pretty spectacular.  Register windows would have bought practically
nothing.
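
[Not the AIX routine, just a sketch of the sort of shift-and-add loop
 such a "millicode" multiply might perform.  The real thing is a
 handful of machine instructions reached by a plain branch, with the
 operands already sitting in two agreed-upon registers and nothing
 saved or restored around the call.]

    unsigned long milli_mul(unsigned long a, unsigned long b)
    {
        unsigned long product = 0;

        while (b != 0) {
            if (b & 1)          /* low bit set: add in shifted multiplicand */
                product += a;
            a <<= 1;
            b >>= 1;
        }
        return product;
    }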
-- 

John Levine, Javelin Software, Cambridge MA 617-494-1400
{ decvax | harvard | think | ihnp4 | cbosgd }!ima!johnl, Levine@YALE.ARPA

The opinions above are solely those of a 12 year old hacker who has broken
into my account, and not those of my employer or any other organization.

johnl@ima.UUCP (John R. Levine) (03/10/86)

In article <765@harvard.UUCP> reiter@harvard.UUCP (Ehud reiter) writes:
>The numerous articles on RISC machines have all assumed that such machines have
>caches.  However, the only commercial RISC machine that I'm familiar with, the
>IBM PC/RT, does NOT have a cache, and seems to suffer a factor of 3 performance
>degradation because of this (2 MIPS instead of 6 MIPS).  To quote from IBM RT
>PERSONAL COMPUTER TECHNOLOGY  (probably available from your friendly
>neighborhood IBM salesman), pg 48 - "The 801 minicomputer ... had exceptionally
>high performance.  However, much of its performance depended on its two caches,
>which can deliver an instruction word and a data word on each CPU cycle.  SINCE
>SUCH CACHES WERE PROHIBITIVELY COSTLY FOR SMALL SYSTEMS ..." (emphasis mine).

Hmmn.  If you continued reading a few pages past that quote, you'd find that 
the ROMP has other architectural aspects that mitigate the effects of having 
no cache.  For one thing, the ROMP does have four words of instruction 
prefetch buffer which gives the chip some latitude in when it fetches its 
instructions.  It also has an extremely fast bus, the ROMP Storage Channel, 
which can handle a transfer every cycle and allows several transfers to be 
outstanding, since each request has a five-bit tag which the slave device 
passes back for matching up by the master.  Memory can be interleaved many 
ways to allow lots of cycles to be going at once.  The technology book on p.  
58 says that the chip only uses 60% - 70% of the bus bandwidth, which 
suggests that adding a cache wouldn't help as much as you'd think.  
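
[A sketch of how such tagged, overlapped transfers can be matched up:
 the master hands out a small tag with each request and the slave
 returns it with the reply, so replies may come back out of order.
 The 5-bit tag width is from the description above; everything else
 here is invented for illustration.]

    #define NTAGS 32                     /* 5-bit tag => 32 outstanding     */

    struct pending { int busy; unsigned long addr; };
    static struct pending outstanding[NTAGS];

    /* master side: issue a read without waiting for the data to come back */
    int issue_read(unsigned long addr)
    {
        int tag;

        for (tag = 0; tag < NTAGS; tag++)
            if (!outstanding[tag].busy) {
                outstanding[tag].busy = 1;
                outstanding[tag].addr = addr;
                /* put (tag, addr) on the storage channel here */
                return tag;
            }
        return -1;                       /* all tags in use: must stall     */
    }

    /* master side: a reply arrives carrying the same tag, so it can be
       matched with the request that caused it and the tag freed again.    */
    void take_reply(int tag, unsigned long data)
    {
        unsigned long addr = outstanding[tag].addr;

        /* hand (addr, data) to whoever asked for it -- omitted here */
        (void)addr; (void)data;
        outstanding[tag].busy = 0;
    }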

Software can also be of some help here -- for example there are instructions
for unpacking the bytes in a register and I gather that the PL.8 compiler
tries to fetch fullwords and unpack them rather than fetching several adjacent
bytes separately.
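
[The idea in C terms, assuming the high-order byte comes first: one
 fullword reference plus register shifts and masks, instead of four
 separate byte references over the bus.]

    void unpack_word(const unsigned long *src, unsigned char out[4])
    {
        unsigned long w = *src;             /* one memory reference        */

        out[0] = (unsigned char)(w >> 24);  /* the rest is register work,  */
        out[1] = (unsigned char)(w >> 16);  /* which is roughly what the   */
        out[2] = (unsigned char)(w >>  8);  /* unpack instructions do      */
        out[3] = (unsigned char)(w);
    }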
-- 
John Levine, Javelin Software, Cambridge MA +1 617 494 1400
{ decvax | harvard | think | ihnp4 | cbosgd }!ima!johnl, Levine@YALE.ARPA

The opinions above are solely those of a 12 year old hacker who has broken
into my account, and not those of my employer or any other organization.

jlg@lanl.UUCP (03/10/86)

In article <765@harvard.UUCP> reiter@harvard.UUCP (Ehud reiter) writes:
>One last point - the RT does NOT have a fancy subroutine call mechanism (like
>Berkeley's RISC).  So, even if it were true that an RT could execute a MULTIPLY
>routine out of memory as fast as a CISC machine could execute it out of
>microcode, the RISC multiply is much more expensive because of the subroutine
>call overhead.

The 'RISC' machine I use doesn't call any subroutine to do multiplies.  The
CRAY implements ADD, SUBTRACT, and MULTIPLY for both integers and floats
and DIVIDE (for floats only) in hardware, with hardwired logic.  It's
expensive in hardware (lots of chips), but not nearly as costly as the
slowdown that CISC-style microcoding would impose.  The whole idea of
RISC is to pick those instructions which are important for the application
for which the machine is used, and make them FAST!  So, don't assume that
certain instructions will not be found in RISC machines - depends on the
target market for the device.

J. Giles
Los Alamos

reiter@harvard.UUCP (Ehud Reiter) (03/11/86)

It's true that the IBM PC/RT, through a combination of instruction prefetch
and interleaved memory, does not wait for instruction fetches when executing
a sequential instruction stream.  However, other memory references are quite
expensive.  The "bottom line" is (quoting from page 49 of the RT book)

"Although most ROMP [RT] instructions execute in only one cycle, additional
cycles are taken when it is necessary to wait for data to be returned from
memory for Loads and Branches.  As a result, the ROMP takes about three cycles
on the average for each instruction"

In short, a cacheless RISC machine does not come anywhere close to the one
instruction per cycle goal.

						Ehud Reiter
						harvard!reiter.UUCP
						reiter@harvard.ARPA

henry@utzoo.UUCP (Henry Spencer) (03/12/86)

> The point is that you can't assume that RISC machines have caches...

Only the well-designed ones do.

> ...as near as I can tell, an RT has much less performance than a
> SUN 3 (lousy floating point and I/O as well as no cache), but costs twice as
> much...

Nobody said the RT was particularly well designed.

> ...So, if RISC machines need caches to perform well, then CISC machines
> win out, at least at the bottom end of the market.

At the very bottom, maybe.  Most CISC machines have caches nowadays.  The
68020 inside the Sun 3 does, for example.

> I highly recommend reading IBM RT PERSONAL COMPUTER TECHNOLOGY - it's very well
> written, and it shows you the pros and cons of a real machine.

So long as you don't take it as saying much about the general merits of
RISCs...
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

jer@peora.UUCP (J. Eric Roskos) (03/12/86)

> I agree with your basic point, but there's another aspect to RISCs:
> there is a big difference at the moment between hardware, where it is
> easy to do things in parallel, and software, where it isn't. Microcode
> is just software used to implement sequential operations. One of the
> things we can do to increase speed is to make sequential operations
> parallel, which usually comes down to implementing serial operations
> combinatorically. Whenever you have a serial operation that cannot be
> made parallel, there are usually enough special cases that can be detected
> at compile time to make a standard library function suboptimal - and this
> is just as true for microcode as it is for a matrix mathematical library.

While I must admit that I have some difficulty understanding how some of
the statements above follow from one another, there's one basic idea here
that sort of bothers me; the idea that "microcode" is just like RISC
instructions.

Someone in another recent posting stated it very well -- the RISC
instructions are similar to *vertical* microcode.  In the case of vertical
microcode, it's true that you have a limited amount of parallelism.  But,
in machines with a more "horizontal" microcode, you can achieve a great
deal of parallelism.

Furthermore, not all microcode looks anything at all like a conventional
program; as Mead & Conway point out, possibly the only reason so many
microprogrammed machines have microprograms that look like a conventional
"assembly language" program is that people are more used to writing that
kind of program than they are to writing microprograms as state machines.

A while back I said that I think the RISC and CISC goals, beneath all
the politics, are really fairly similar.  Let me explain one reason
why.

One of the ongoing areas of research in microprogramming involves "vertical
migration" -- analyzing sequences of code to determine things that can
be migrated into the microcode, essentially to produce new instructions.

From the RISC end you'd just go the other way; it's been argued that the
cache does that "automatically," but it's hard to believe that in the long
run, once the RISC approach has come to be seen as mundane, someone won't
start doing statistical analyses of RISC instruction sequences, discover
that some sequences occur frequently, and make new instructions out of
them.  But that's essentially identical to the vertical migration
strategy.  You probably come up with cleaner sequences than you would from
some arbitrary CISC instruction set [although I suspect in the long run
they would be self-refining anyhow], but I think the underlying approach
is more or less the same.
-- 
      e:                  (             )                     jer@j
 omp                    (      (   )      )                         ter Co
     d,                   (             )                     o, FL

"It's a long way from Brooklyn to LA..."

aglew@ccvaxa.UUCP (03/13/86)

RISCs => caches => cache flushes => inefficiencies in multiprogramming.

In addition to being able to neglect cache flushes if the interval between 
context switches is long enough (in terms of instructions processed), context
switches will hopefully become less important on multi-microprocessor
machines, where you can schedule more processes per second for the entire
machine (an important number for real-time systems) but each processor will
be handling fewer context switches. 

reiter@harvard.UUCP (Ehud Reiter) (03/15/86)

I thought it would be nice to try to get some numbers to quantify
RISC and CISC performance.  The numbers below are hopefully not too full
of bugs - I would be glad to send interested people references.

	1) Cache, RISC, and MIPS - some figures for a VAX-11/780

		    cycles/inst	 MIPS

cache disabled		25	.2		VAX's were not designed for this
normal cache		10	.5
perfect cache		 7	.7
MIPS rating			1		What DEC marketing claims
"RISC" mode, perf. cache 2	2.5		All reg-to-reg inst

Note the effect a cache has, that the typical VAX instruction is a complex
7-cycle one and NOT a simple 2-3 cycle one (as some have claimed), and that
a VAX pretending it's a RISC machine clocks in at 2 MIPS or so.
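
[For anyone who wants to check the arithmetic: the MIPS column follows
 directly from the cycle counts and a 200 ns cycle time, which is the
 figure the numbers above imply for the 780.]

    #include <stdio.h>

    int main(void)
    {
        double cycle_ns = 200.0;                    /* VAX-11/780 cycle time */
        double cpi[]    = { 25.0, 10.0, 7.0, 2.0 }; /* cycles per instruction */
        char  *label[]  = { "cache disabled", "normal cache",
                            "perfect cache", "\"RISC\" mode, perfect cache" };
        int i;

        for (i = 0; i < 4; i++)
            printf("%-28s %4.0f cycles/inst  %.1f MIPS\n",
                   label[i], cpi[i], 1000.0 / (cpi[i] * cycle_ns));
        return 0;
    }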

	2) Complex instruction execution - the following figures compare a
"VLSI VAX" (presumable a microVAX), an M68020 (16 Mhz, no wait states), and
an IBM PC/RT, all executing the operation  R3=4(R2)+(R1).  Cache-RT is a guess
for what an RT with cache would do (assumes 2 cycles for LOAD/STORE).  All
caches are assumed to be perfect.

			uVAX	68020	real-RT	cache-RT
time (us)		1.2	.94	1.83	.83
cycles			6	15	11	5
bytes			5	6	6	6
cycle time (ns)		200	60	170	170
instructions		1	2	3	3
scratch registers	0	0	1	1


	3) Simple instruction execution - for R3=R1+R2

			uVAX	68020	RT

time(us)		.4	.25	.17
cycles			2	4	1
bytes			3	2	2
instructions		1	1	1

MicroVAXes seem much more efficient (compared to an RT) at executing complex
instructions than at executing simple instructions - but that's OK, since
the data in (1) indicates that most VAX instructions are indeed complex.
     This seems due to pipelining, incidentally - a microVAX does little
inter-instruction overlap (only instruction fetch/decode), which hurts the
small instructions but doesn't affect the complex ones as much (so much for
the claim that complex VAX instructions are less efficient than simple ones).
An RT, on the other hand, pipelines its instructions.

	4) Note on MIPS.  IBM has been very cautious to only say that an RT
is a "2 RISC MIPS" machine, but one can imagine a more "enthusiastic"
company (I am NOT accusing anyone, but merely pointing out the possibility)
claiming that an RT-type machine with cache was a "6-MIPS" machine, although
the above data indicates that such a machine would only be 50% faster than
a "1 MIPS" microVAX, and a third the speed of a "4.5 MIPS" VAX 8600.

					Ehud Reiter
					harvard!reiter.UUCP
					reiter@harvard.ARPA

chris@umcp-cs.UUCP (Chris Torek) (03/23/86)

In article <1128@unc.unc.UUCP> rentsch@unc.UUCP (Tim Rentsch) writes:

>>If no more than 100 or so context switches occur per
>>second, and the machine executes tens of thousands of instructions
>>between context switches, it doesn't really matter that the cache
>>is flushed each time.  [Andrew Koenig]

>It *can* matter, depending on how big the cache is and on how full it
>must be to achieve a good hit ratio.

[followed by some numbers to demonstrate this]

True enough, or at least from my software perspective (I know little
about cache design).  However, one argument on the RISC side is
that if the processors are simple and cheap enough, you need never
context switch.  Just assign one processor-plus-cache per process.

This sounds like a parallel computation engine, but it need not
be.  If it is easier to design the rest of the system as a
single-CPU-at-a-time machine, and each CPU costs, say, $5, you
can easily stick 100 CPUs into a $40K machine.  This cuts the
context switch rate by a factor of 100.  Of course, there are
cache contention problems if you have shared data.  Just a crazy
idea, no doubt...?
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1415)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu