[net.arch] RISC, CISC, and microprogramming

dwc@hou2b.UUCP (D.CHEN) (06/16/85)

> A reasonable approach to producing CISC machines, then, would be to have the
> compiler writers design the instruction set; and to have the instruction set
> be changeable for different languages.

funny how things come full circle.  one way to implement the above is to
have dynamically loaded microprogrammed instruction sets.  but as dave
patterson mentions in his paper (i think), the difficulty of writing
and supporting microprograms along with the time penalty in using
microprogrammed logic (versus hardwired logic) led to the RISC idea.

deel@sdcsvax.UUCP (Don Deel) (06/17/85)

>> A reasonable approach to producing CISC machines, then, would be to have the
>> compiler writers design the instruction set; and to have the instruction set
>> be changeable for different languages.
> 
> funny how things come full circle.  one way to implement the above is to
> have dynamically loaded microprogrammed instruction sets.  but as dave
> patterson mentions in his paper (i think), the difficulty of writing
> and supporting microprograms along with the time penalty in using
> microprogrammed logic (versus hardwired logic) led to the RISC idea.

Actually, there's a well-known large company that's already been around
this dynamically-changing-instruction-set bush....  Anyone from Burroughs
care to comment on how the B1700 worked out after all these years?  For
the uninitiated, the B1700 made (makes?) extensive use of its ability to
dynamically change its mind about which instruction set it was executing.
The operating system used one instruction set, the COBOL machine emulation
used another, etc.  The idea was that different languages have different
needs as far as instructions sets are concerned.

jer@peora.UUCP (J. Eric Roskos) (06/17/85)

> funny how things come full circle.  one way to implement the above is to
> have dynamically loaded microprogrammed instruction sets.  but as dave
> patterson mentions in his paper (i think), the difficulty of writing
> and supporting microprograms along with the time penalty in using
> microprogrammed logic (versus hardwired logic) led to the RISC idea.

At this point I think we have more or less exhausted the basic tenets of
the two viewpoints here.  You are certainly correct that this is the RISC
argument; yet I had this in mind when I made the statement you are commenting
on.

[Now, I am naturally a little biased in favor of dynamically loaded micro-
programmed instruction sets; my Master's thesis research was on the
foundations of the tools necessary to produce exactly that on a PE 3220.
(See Micro-14). (It was subsequently implemented by Len B. Reed for his
Master's thesis, providing a microprogram image that was swappable on a
per-user basis; and this was then built upon by E. Mike Carter for part of
his PhD dissertation to implement dynamic type-oriented vertical migration.
All this was done under the direction of R. I. Winner.)]

A weak point of the RISC argument is that a lot of it rests on notions of
what is "easy" and what is "difficult": it's "difficult" to write and
maintain microcode; it's "easy" to build RISC machines.  Furthermore, some
CISC machines are poorly designed, which lends some vague support to the
claimed benefits of RISC machines.  And of course, it has been shown that
the instruction set required for one program may be quite different from
that required for another.

Thus, in some sense we come down to the very old argument over the merits
and limitations of expanding subroutines inline.  I think the two issues
have fundamental similarities.  If an instruction sequence recurs throughout
a program, is it better to generate it again each time, or to generate a
subroutine call to one copy of it?  Is it possible to produce a
general-purpose subroutine usable by many programs, or will such a routine
be unusable for one reason or another?  And what about the possibility of
discovering a new, better way to do the code sequence?  In one case you can
simply change the subroutine; in the other you must regenerate the code
everywhere it appears.  On the other hand, with compilers it's not that hard
to regenerate it.
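
Purely as a toy illustration of that tradeoff (hypothetical C, not tied to
any machine or compiler under discussion here): the same short sequence can
be expanded at every use or factored into one routine, and improving the
sequence later means either regenerating every copy or editing one body.

/* Toy illustration only: a recurrent sequence, expanded inline at each
 * use versus factored into one subroutine. */
#include <stdio.h>

/* One shared copy: a better version of the sequence means editing one body. */
static int scale_and_clamp(int x)
{
    int y = x * 4 + 1;
    return (y > 100) ? 100 : y;
}

int main(void)
{
    int a = 30, b = 7;
    int a1, b1;

    /* The same sequence expanded "inline" at each use: no call overhead,
     * but a new, better way of doing it means regenerating every copy. */
    a1 = a * 4 + 1;
    if (a1 > 100) a1 = 100;
    b1 = b * 4 + 1;
    if (b1 > 100) b1 = 100;

    /* The factored version: one copy, one place to change. */
    printf("%d %d %d %d\n", a1, b1, scale_and_clamp(a), scale_and_clamp(b));
    return 0;
}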

I think the real merits of the RISC architecture are not in the "RISC vs.
CISC" debate.  It appears that people have a tendency to become conceptually
entrenched in the design of architectures of various types.  Every once
in a while, someone comes along and does it all in a simple way, and people
marvel at how well it works and how simple it is.  Operating systems were
like this; Unix was a demonstration of what you can and can't do with a
much simpler approach than what had been the focus of research.
-- 
Full-Name:  J. Eric Roskos
UUCP:       ..!{decvax,ucbvax,ihnp4}!vax135!petsd!peora!jer
US Mail:    MS 795; Perkin-Elmer SDC;
	    2486 Sand Lake Road, Orlando, FL 32809-7642

	   "Gnyx gb gur fhayvtug, pnyyre..."

mat@amdahl.UUCP (Mike Taylor) (06/18/85)

The implementation for Amdahl's large S/370-compatible processors
might be of interest to this discussion.  While Amdahl has no choice
about the instruction-set architecture (defined by IBM), there is
considerable latitude in implementation.  The primary emphasis in the
machine is fast execution of the common instructions (load, store,
branch, etc.).  These can be executed at a sustained rate of one per
cycle (no interlocks or misses) and are effectively implemented in
hardware.  More complex instructions are implemented by microcode
sequences, using a so-called nanostore which maps short microwords
into long control words.  The most complex instructions are implemented
in software, called macrocode.  Any operation can be defined (including
one already supported in the machine) as being implemented in macrocode.
Loadable tables in the CPU control the effective architecture.

Hardware support for macrocode includes a second register file and
private memory.  When a macrocode-supported operation is detected,
the second register file is set up with the decoded instruction and
operands, and a vectored branch to the software routine is stuffed into
the pipe.  This costs very few cycles.  The software routine uses the
second register file, avoiding save/restore, and has special support to
set up the return conditions.  This allows new instructions to be readily
added with good performance.  The software routines to support new
instructions can be written using special tools to generate essentially
interlock-free code, which will keep the functional units as busy as
microcode could.  The only real additional cost is cache misses caused
by the software routine, which would not happen from microstore.

This ability to redefine the instruction-set architecture dynamically
is used to support the simultaneous operation of SCPs requiring different
architectures, for example MVS/370 and MVS/XA (or UTS!).  Other facilities
(addressing, I/O message and interrupt routing, timers, etc.) combined
with the above provide a kind of hardware virtual machine support
called the "Multiple Domain Facility (MDF)."  Switches between domains
simply cause the architectural controls to be reloaded - the cost of this
is only slightly higher than any other state switch.
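
To make the flow concrete, here is a minimal sketch of that kind of
table-driven dispatch.  The names, the table layout, and the routine are
all invented for illustration; this is not Amdahl's actual interface, and
on the real machine all of this happens below the level of ordinary
software.

/* Illustrative sketch only (invented names, not Amdahl's interface).
 * A loadable table says how each opcode is implemented; reloading the
 * table is what redefines the effective architecture for a domain. */
#include <stdio.h>

enum impl { DISABLED, HARDWIRED, MICROCODE, MACROCODE };

typedef void (*macro_routine)(int r1, int r2, long ea);

struct op_entry {
    enum impl     how;      /* loadable per domain                    */
    macro_routine routine;  /* software routine when how == MACROCODE */
};

/* "Second register file": the macrocode routine finds the decoded
 * instruction here and so avoids saving/restoring user registers. */
static struct { int r1, r2; long ea; } shadow;

static struct op_entry op_table[256];   /* reloaded on a domain switch */

static void macro_example(int r1, int r2, long ea)
{
    printf("macrocode routine: regs %d,%d ea=%ld\n", r1, r2, ea);
}

static void dispatch(unsigned op, int r1, int r2, long ea)
{
    switch (op_table[op].how) {
    case HARDWIRED:
    case MICROCODE:
        break;                          /* handled below this level    */
    case MACROCODE:
        /* set up the second register file, then vector to the
         * software routine that implements the operation             */
        shadow.r1 = r1; shadow.r2 = r2; shadow.ea = ea;
        op_table[op].routine(shadow.r1, shadow.r2, shadow.ea);
        break;
    case DISABLED:
        printf("operation exception: 0x%02x not in this architecture\n", op);
        break;
    }
}

int main(void)
{
    op_table[0x0E].how = MACROCODE;      /* define a "new" instruction */
    op_table[0x0E].routine = macro_example;
    dispatch(0x0E, 2, 4, 0x1000L);       /* runs the software routine  */
    dispatch(0x0F, 0, 0, 0L);            /* not defined in this domain */
    return 0;
}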
-- 
Mike Taylor                        ...!{ihnp4,hplabs,amd,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

freeman@spar.UUCP (Jay Freeman) (06/20/85)

[There's no such thing as the line-ea<*GULP*>]

In article <1691@amdahl.UUCP> mat@amdahl.UUCP (Mike Taylor) writes:

>> The implementation for Amdahl's large S/370 compatible processors
>> might be of interest to this discussion.  [concise description follows
>> -- JF]

>>               More complex instructions are implemented by microcode
>> sequences, using a so-called nanostore which maps short microwords
>> into long control words. The most complex instructions are implemented
>> in software, called macrocode.

It sounds like the macrocode is user-definable, how 'bout the microcode?

>> Hardware support for macrocode includes a second register file

Interesting.  Some of these concepts crop up even in the smallest of
microprocessors:  The Z-80 has its "alternate" registers, and the
recently notorious 8<censored>86 has 256 different software interrupts,
which many people use as hooks to commonly-used library routines.

>>                                                 Switches between domains
>> simply cause the architectural controls to be reloaded - the cost of this
>> is only slightly higher than any other state switch.

Does microcode reload (or switch) as well?

The cited architecture sounds neat:  It looks to me as if, in principle,
users have access to the machine at four levels (assuming the microcode is
rewritable):

        (ordinary) code        eg, call a library routine
        macrocode              eg, execute "machine instruction" FOO,
                                 where FOO traps to a macrocode routine
        microcode/hardware     eg, execute "machine instruction" BAR,
                                 which actually IS a machine instruction
        hardware               eg, change some microcode

Have I got that right?

Clever programming software ought to be able to look at an algorithm to
be implemented and perform an optimization that includes moving code
(should we call that "metacode"?) across all of these level boundaries.
-- 
Jay Reynolds Freeman (Schlumberger Palo Alto Research)(canonical disclaimer)

larry@mips.UUCP (Larry Weber) (06/21/85)

> 
> It is also "difficult" to write system software for that subclass of RISC
> machine that simplifies hardware to the point of requiring the software to
> allow for the resolution of pipeline data-dependencies.  It would be an
> Interesting Task to do a code-generator for such a beast.  
> ... 
> Jay Reynolds Freeman (Schlumberger Palo Alto Research)(canonical disclaimer)

This is simply not true.  The simplest way to resolve pipeline dependencies
is to put into the instruction stream exactly as many NOP instructions as
there are stall cycles that the hardware would insert in a machine with
interlocks.  This does not make the program run any slower, because the
hardware would have delayed the program anyway.

The real gain comes when you modify the order of instructions to eliminate
delay cycles.  The complexity of the machine determines the complexity of
the algorithm; on simpler machines you can actually resolve delays in about
200 lines of code.  A number of years ago I modified IBM's compiler for
their systems language for the 370 to resolve an interlock between
computation and address calculation.  It resulted in a 6% improvement in
performance; this modest improvement is respectable when you consider that
the compiler was already highly tuned to produce very good code.  On RISC
machines it is possible to achieve a much larger improvement because you
can schedule a wider class of delays - branch delays, load delays,
coprocessor delays, etc.

I believe that pipeline scheduling is an excellent "peephole" optimization
whether the hardware has interlocks or not.

The problem of code generation needn't be any harder than for any other
machine.  In theory you would like to generate instructions that are good
candidates for resolving delays just produced or about to be produced; in
practice, all you have to do is avoid "over-working" a single register,
which would make all the instructions interdependent.
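
As a much-simplified sketch of the simplest resolution described above
(a made-up three-address instruction format and a single one-cycle load
delay, nothing like a production scheduler), a pass of roughly this shape
inserts a NOP only where the next instruction actually uses a just-loaded
value; a reordering pass would instead try to pull an independent
instruction (like the sub below) into that slot.

/* Much-simplified sketch: resolve a one-cycle load-use delay by inserting
 * a NOP exactly where hardware with interlocks would have stalled.
 * The instruction format is invented for illustration only. */
#include <stdio.h>
#include <string.h>

struct insn { const char *op; int dst, src1, src2; };  /* -1 means "none" */

static int uses(const struct insn *i, int reg)
{
    return reg >= 0 && (i->src1 == reg || i->src2 == reg);
}

/* Copy code[] into out[], adding a NOP after any load whose result the
 * very next instruction needs.  Returns the new length. */
static int resolve_load_delays(const struct insn *code, int n, struct insn *out)
{
    static const struct insn nop = { "nop", -1, -1, -1 };
    int m = 0;
    for (int i = 0; i < n; i++) {
        out[m++] = code[i];
        if (strcmp(code[i].op, "load") == 0 && i + 1 < n &&
            uses(&code[i + 1], code[i].dst)) {
            /* a smarter pass would move an independent instruction
             * (e.g. the sub below) into this slot instead of a NOP */
            out[m++] = nop;
        }
    }
    return m;
}

int main(void)
{
    struct insn code[] = {
        { "load", 1, 9, -1 },   /* r1 <- mem[r9]                    */
        { "add",  2, 1,  3 },   /* needs r1 immediately: delay here */
        { "sub",  4, 5,  6 },   /* independent: could fill the slot */
    };
    struct insn out[8];
    int m = resolve_load_delays(code, 3, out);
    for (int i = 0; i < m; i++)
        printf("%-4s %2d %2d %2d\n", out[i].op, out[i].dst,
               out[i].src1, out[i].src2);
    return 0;
}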
-- 
-Larry B Weber
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!larry
DDD:  	415-960-1200
USPS: 	MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043

freeman@spar.UUCP (Jay Freeman) (06/21/85)

[beware of the line-eat<*CHOMP*>]

In article <145@mips.UUCP> larry@mips.UUCP (Larry Weber) writes:
>> 
>> It is also "difficult" to write system software for that subclass of RISC
>> machine that simplifies hardware to the point of requiring the software to
>> allow for the resolution of pipeline data-dependencies.  It would be an
>> Interesting Task to do a code-generator for such a beast.  
>> ... 
>> [from me -- JF]
>
>This is simply not true.  [...]  The problem of code generation needn't be
>any harder than in other machine [...]

You are certainly correct that putting in NOOPs provides a quick and dirty
resolution of these dependencies.  I erred in not stating precisely that I
meant "doing it right".  I believe it is important to do it right, because
adding substantial numbers of NOOPs to code -- possibly as many as or more
than the code has "real" instructions -- would appear likely to slow down
execution enough that a RISC no longer has a speed advantage over a CISC.
(I have no data -- is that correct?)

I will also accept that a peephole optimizer for removing pipeline
dependencies need not be longer than a few hundred lines.  However, at least
some global data-flow analysis appears desirable, to ensure that cycles
are not wasted in unnecessary resolution after jumps and calls:

(It is well known that many programs spend most of their time in
 relatively short loops -- adding NOOPs needlessly at the start of
 each loop could hurt.  (Again, no data -- key issues would appear to be
 how short the "typical" loop is, how many NOOPs the particular RISC
 requires to flush its pipelines, and what actual advantage the RISC with
 fully-optimized code has over its competitors, anyway.))
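
(Purely as an illustrative calculation, with made-up numbers: a
 five-instruction loop body that needs two NOPs at its head on every
 iteration wastes 2 of every 7 issued instructions, roughly 29%, which is
 the sort of overhead that would eat into whatever advantage the RISC had
 on that loop.)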

(A similar issue may exist with respect to call overhead.)

Notwithstanding, I think I still see a problem with respect to getting good
system software for these beasties.  I reason as follows:  Much of system
software requires well-optimized code.  Therefore we either write it in a
high-level language for which we have a compiler that generates good code,
or else we go in and hand-code the major bottlenecks in assembler.

I submit that optimizing pipeline dependencies by hand, in assembler, is
likely to provide a rich source of subtle, persistent, and fascinating bugs.
Great fun [Insert :-) here, I think], but not necessarily conducive to lots
of good code.

I also suggest that decent optimizing compilers are not yet really
across-the-board state of the art in the microcomputer industry.  I
have a sense that although they exist, they are often slow to come out, slow
to be updated and made bug-free, and relatively scarce.  If the Acme RISC
Machine Co. introduces a new wonder chip, how long will it be before such
tools are available in sufficient variety to suit all the eager developers?

To the extent that these arguments are convincing, I am induced to speculate
that the success or failure of computer systems based on RISC-machine
hardware may be embarrassingly tightly linked to the state of the
system-software industry.

I am not sure that my arguments are correct, and I am not confident of my
perception of the software industry, either.  What do other people think?
-- 
Jay Reynolds Freeman (Schlumberger Palo Alto Research)(canonical disclaimer)

cdshaw@watmum.UUCP (Chris Shaw) (06/23/85)

Does not the average RISC type machine do a purge-pipe action upon
receipt of a branch instruction, or is there something so incredibly obvious
going on here that I am missing?  The reason I ask is that in any but the most
nontrivial loops, throwing in enough NOPs to fill the pipe so that the next 
instruction is the right one strikes me as grossly inefficient.

The alternative with a simple "purge-pipe" is to repeat enough of the beginning
of the loop such that you end up doing the right instructions if you branched
to continue the loop. Otherwise you purge, and continue after the repeated 
instructions without actually executing any of them.

Thus we have a bogus program with a 2-element pipe:

start:	load  a
	store b
start+2:add   c
	store d
	decr  e
	brnz  start+2:
	load  a		\  do these 2 if branch taken, else throw them away
	store b		/   ... (purge) and do the xor
	xor   a     ---> continue here because "load" & "store" were purged
			 ... when the branch was not taken

(Forgive me if this was utterly stupid)

Chris Shaw    watmath!watmum!cdshaw  or  cdshaw@watmath
University of Waterloo
In doubt?  Eat hot high-speed death -- the experts' choice in gastric vileness !

mat@amdahl.UUCP (Mike Taylor) (06/24/85)

> 
> >>               More complex instructions are implemented by microcode
> >> sequences, using a so-called nanostore which maps short microwords
> >> into long control words. The most complex instructions are implemented
> >> in software, called macrocode.
> 
> It sounds like the macrocode is user-definable, how 'bout the microcode?
> 
In principle, yes.  As a matter of policy, customers are discouraged
from changing either simply because of the maintenance and diagnostic
implications.  However, both macrocode and microcode are stored on
a hard disk which is part of the console (service processor) subsystem,
and loaded during the system initialization sequence.

> Does microcode reload (or switch) as well?
No. All microcode is loaded all the time. If a particular instruction
is not part of the architecture for a given domain, it is logically
disabled, but the microcode is still there.

> 
> The cited architecture sounds neat:  It looks to me as if, in principle,
> users have access to the machine at four levels (assuming the microcode is
> rewritable):
> 
>         (ordinary) code        eg, call a library routine
>         macrocode              eg, execute "machine instruction" FOO,
>                                  where FOO traps to a macrocode routine
>         microcode/hardware     eg, execute "machine instruction" BAR,
>                                  which actually IS a machine instruction
>         hardware               eg, change some microcode
> 
> Have I got that right?
Yes.
> 
> Clever programming software ought to be able to look at an algorithm to
> be implemented, and perform an optimization that included moving code
> (should we call that "metacode") across all of these level boundaries.
> -- 
> Jay Reynolds Freeman (Schlumberger Palo Alto Research)(canonical disclaimer)
Probably.  However, in practice, the biggest hits come from optimizing
pipeline performance by removing interlocks (i.e. reordering instructions
so that they don't interfere) and optimizing cache performance.

The 580 has a 5-stage pipeline, so that execute/generate interlocks such
as:
               l    5,4(3,4)
               a    6,0(5)
where the result of the first instruction is needed to calculate the EA
of the second, cost three cycles; fixing this common sequence by doing
something else (for instance, scheduling unrelated instructions between
the load and the add) therefore helps a lot.  The 580 also has dual
caches (Instruction and Operand).  The instruction cache is read-only,
so stores to lines in the I-cache are very expensive, and good code will
know how to use the background prefetching properly to reduce cache
misses dramatically.
achieve the theoretical maximum rate of execution (41 MIPS on RX-type
instructions) whereas something like 12 MIPS would be more normal when
the MVS operating system is running.
-- 
Mike Taylor                        ...!{ihnp4,hplabs,amd,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

mash@mips.UUCP (John Mashey) (06/30/85)

> Chris Shaw    watmath!watmum!cdshaw  writes:
> Does not the average RISC type machine do a purge-pipe action upon
> receipt of a branch instruction, or is there something so incredibly obvious
> going on here that I am missing? ....
> Thus we have a bogus program with a 2-element pipe:
> start:	load  a
> 	store b
> start+2:add   c
> 	store d
> 	decr  e
> 	brnz  start+2:
> 	load  a		\  do these 2 if branch taken, else throw them away
> 	store b		/   ... (purge) and do the xor
> 	xor   a     ---> continue here because "load" & "store" were purged
> 			 ... when the branch was not taken

The idea in RISCs with branch-delay slots is to NEVER do purge-pipes,
except in interrupt-processing.  A typical RISC design like the above
would ALWAYS execute the 2 instructions following the branch,
so the pipe would never be purged.  A more likely code sequence would be:
  start:load  a
  start+1:store b
  start+2:add   c
  	decr  e
  	brnz  start+1:
  	store d
  	load  a	
  	xor   a 
[Note: the program is bogus, so it's hard to tell exactly what is going on,
but the 2nd example is much more in the spirit of the way RISCs work.
Worth reading are 2 excellent articles: Dave Patterson's Jan 85 CACM
article and John Hennessy's Dec 84 IEEE Trans. on Computers one.]

-- 
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	415-960-1200
USPS: 	MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043

alan@drivax.UUCP (Alan Fargusson) (06/30/85)

> Does not the average RISC type machine do a purge-pipe action upon
> receipt of a branch instruction, or is there something so incredibly obvious
> going on here that I am missing?  The reason I ask is that in any but the most
> nontrivial loops, throwing in enough NOPs to fill the pipe so that the next 
> instruction is the right one strikes me as grossly inefficient.

> start:	load  a
> 	store b
> start+2:add   c
> 	store d
> 	decr  e
> 	brnz  start+2:
> 	load  a		\  do these 2 if branch taken, else throw them away
> 	store b		/   ... (purge) and do the xor
> 	xor   a     ---> continue here because "load" & "store" were purged
> 			 ... when the branch was not taken
> 

The Berkeley-type RISC (and most others, I think) delays the action of
the branch for one cycle.  Thus, for your example program, a NOP would
have to be placed after the brnz to start+2.  In many cases the NOP
can be optimized out by a smart compiler.
-- 

Alan Fargusson.

{ ihnp4, amdahl, mot }!drivax!alan