[net.arch] RISCs, caches, and vertical migration

crp@stcvax.UUCP (03/15/86)

> J. Eric Roskos @ Concurrent Computer Corp in Orlando says:

> One of the ongoing areas of research in microprogramming involves "vertical
> migration" -- analyzing sequences of code to determine things that can
> be migrated into the microcode, essentially to produce new instructions.
> From the RISC end you'd just go the other way; it's been argued that the
> cache does that "automatically," but it's hard to believe that
> in the long run, when the RISC approach has come to be seen as mundane,
> that someone doesn't start doing statistical analyses on RISC instruction
> sequences, and discovers that some sequences commonly occur,
> and makes new instructions out of those.
> But that's essentially identical to the vertical migration strategy.
> ... [a bit deleted] ...
> but I think the underlying approach is more or less the same.

One reason that vertical migration of frequent instruction sequences
"down" into a lower "interpretation layer" differs between microcoded machines
and RISC machines (as they are spoken about -- oops, I mean thought about)
today is that the RISC interpretation layer is hardware (today).
If implementing the instruction really requires hardware, then
even though an instruction sequence is common, the price/performance
tradeoff may still favor executing the instruction sequence rather than
"interpreting" a single instruction in hardware.

For example, HP's new "Spectrum" machines don't have a multiply
instruction.  The reason, they say, is that leaving it out made a WHOLE
lot of difference in the amount of ALU hardware needed, and that the
most common case, a multiplier that is a small constant, can be effected
cleverly and quickly with the rest of their hardware and instruction set.
Apparently a general multiply subroutine is "fast enough" for the
remaining cases.
An "Electronics" (March 3, 1986) article about Spectrum says that
"multiplications by both variables and constants are done
in an average of three or four cycles - faster than the 20
or so cycles many multiplications consume in other machines."
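
To make the small-constant case concrete, here is a minimal sketch in C
(my own illustration, not HP's actual code) of the kind of shift-and-add
expansion a compiler can emit when the hardware has no multiply
instruction:

	/* Illustration only: multiplying by the constant 10 using
	 * shifts and adds, the sort of expansion a compiler can emit
	 * when the hardware lacks a multiply instruction:
	 *     x * 10  =  x*8 + x*2  =  (x << 3) + (x << 1)
	 */
	int times10(int x)
	{
	    return (x << 3) + (x << 1);
	}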

The following is motherhood, but it seems to bear repeated articulation.

A system design, ANY system design, is done for some particular
- problem (program) domain and
- customer (pocketbook) domain.
The "RISC ballet" (or is it a demolition derby?)
is an optimization problem with variables that include
performance range required, cost range, hardware technologies
(CPU stuff and memories in particular), compiler technology,
programs from the application domain(s) supported, and other
things I either don't know or have forgotten at the moment.

I've noticed that people seem to want to talk about the instruction
set without considering the programs to be run.
You can't.
A decision to leave out integer multiply, for instance, can't be made
only on the basis of what it does to the chip area or complexity.
HP measured a "whole lot" of "real" programs from people who use
their existing machines (the mag says "hundreds of customer jobs
for over 100 customers").  They designed a machine to support
THIS APPLICATION MIX with some particular cost/performance
requirements and not *necessarily* any other mix.
They simulated the effect of their design decisions on programs from people
who gave them money once and might be induced to do so sometime again.
Apparently they were able to decide that integer multiply wasn't
important enough to enough people to hurt their business.
For them, for now, leaving out multiply is a Good Thing.

HP's Spectrum series has support for coprocessors -- like an FPU.
My own speculation is that they found that a reasonable multiplier
for the core instruction set didn't satisfy the high-performance
applications so they had to use a coprocessor (like the FPU)
for that part of the problem domain anyway.

So are RISC designs going to be "general purpose enough"?
I think most designers have taken a broad cross-section of problem
domains as their design base, and the fundamental operations are similar
enough across most programs that this will hold true.
(Don't all programs spend their time branching and looping?)

John Mashey (mips!mash) comments that though MIPS hadn't targeted the
business program domain specifically, early experience is telling
him that their machine does a reasonable job with this kind of program.
I'm not surprised, since business programs don't do
anything remarkably different from text editors or compilers.
On the other hand, I suspect that the MIPS machine isn't
as good (i.e. effective and cost-effective) as a CRAY XMP
(at the same clock speed) at predicting tomorrow's weather for India.
Every computer is not the right solution for every problem.

As technology changes, useful hardware/software solutions will change.
Architectures aren't forever -- especially today when [vu]lsi is
moving so fast.
One trick to survival for computer makers will be to very carefully
choose an architecture so that it makes sense with today's technology
and still makes sense with the technology available in the medium future.

charlie


-- 
Charlie Price   {hao ihnp4 decvax}!stcvax!crp   (303) 673-5698
USnail:	Storage Technology Corp  -  MD 3T / Louisville, CO / 80028

jer@peora.UUCP (J. Eric Roskos) (03/18/86)

Charlie Price at Storage Technology* writes, in response to my suggesting
that there are analogies between vertical migration and the "optimizing
CISC" description of RISC cache:

> One reason that vertical migration of frequent instruction sequences
> "down" into a lower "interpretation layer" differs between microcoded
> machines and RISC machines (as they are spoken about -- oops, I mean
> thought about) today is that the RISC interpretation layer is hardware
> (today).

While I do agree with the remainder of the posting, the above suggests I
wasn't clear on what I meant in making the analogy.

The goal of vertical migration is to identify frequently-occurring
sequences of operations, and move them into the microcode.  The reason for
moving them into the microcode is the assumption that the microcode is in
a small, fast, but costly memory, and thus can be executed faster.  In
general, there's also the assumption that you can do more in microcode in
parallel, since horizontal microprograms are really commands to control
points in the machine, rather than conventional, familiar, sequential
instructions; but how applicable this is depends on how "horizontal" your
microinstruction is.
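
As a rough sketch of the analysis step (a toy example of my own; real
studies are considerably more elaborate), one could count adjacent
opcode pairs in an execution trace and treat the most frequent pairs as
candidates for migration:

	/* Toy sketch of the "find frequent sequences" step of vertical
	 * migration: count adjacent opcode pairs in an execution trace,
	 * read here as one decimal opcode per line on stdin.  The most
	 * frequent pairs are candidates for a new, microcoded
	 * instruction.  The trace format and threshold are invented
	 * for illustration.
	 */
	#include <stdio.h>

	#define NOPS      256       /* size of the opcode space */
	#define THRESHOLD 1000L     /* "frequent enough" cutoff */

	static long pair_count[NOPS][NOPS];

	int main(void)
	{
	    int op, prev = -1;
	    int a, b;

	    while (scanf("%d", &op) == 1 && op >= 0 && op < NOPS) {
	        if (prev >= 0)
	            pair_count[prev][op]++;    /* op followed prev */
	        prev = op;
	    }

	    for (a = 0; a < NOPS; a++)
	        for (b = 0; b < NOPS; b++)
	            if (pair_count[a][b] > THRESHOLD)
	                printf("opcodes (%d,%d): %ld times\n",
	                       a, b, pair_count[a][b]);
	    return 0;
	}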

The former assumption is the one used recently in here to claim that a
RISC is really an "optimizing CISC".  There is some moderate truth to that,
too, I think, although I have certain doubts at present.

A CISC machine that has used vertical migration techniques to produce a
good CISC instruction set for a particular program (and note that indeed
there may be different instruction sets for different *programs*; several
years ago, when I was involved briefly in research in this area under R.
I. Winner, we were working on implementing dynamic switching of control
store under Unix in order to support this approach) consists essentially
of subroutine calls to RISC-like subroutines -- actually subroutines in
the microprogram control store.  The subroutine calls are made out of a
slower memory external to the CPU; generally the CPU treats this memory
the way a user-level machine would treat a disk, explicitly addressing
and fetching data out of the memory and possibly waiting for it to
arrive.  These calls are the CISC instructions.
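
A toy model of that structure (everything invented for illustration) is
a dispatch loop: each "CISC" opcode, fetched explicitly from slow main
memory, indexes a table of RISC-like routines held in fast control
store:

	/* Toy model of the interpretation layer described above: each
	 * CISC opcode, fetched explicitly from (slow) main memory, is
	 * in effect a call to a RISC-like subroutine held in fast
	 * control store.  Here the micro-routines are C functions and
	 * main memory is an array.
	 */
	#include <stdio.h>

	static int acc;                     /* an accumulator */

	static void micro_clear(void) { acc = 0; }
	static void micro_inc(void)   { acc++; }
	static void micro_print(void) { printf("%d\n", acc); }

	/* "Control store": one RISC-like routine per CISC opcode. */
	static void (*control_store[])(void) =
	    { micro_clear, micro_inc, micro_print };

	int main(void)
	{
	    /* "Main memory": a tiny CISC program; opcode 3 halts. */
	    static int memory[] = { 0, 1, 1, 2, 3 };
	    int pc = 0, op;

	    while ((op = memory[pc++]) != 3)  /* fetch from slow memory */
	        control_store[op]();          /* "interpret" via the
	                                       * control store routine */
	    return 0;
	}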

Now, in order for a RISC machine to work this way, i.e. in order for the
instructions in the cache to be somehow analogous to CISC instructions in
control store, there has to be both spatial locality and temporal locality
of reference for the frequently-used instruction sequence.  At the time I
made the original posting, I debated somewhat over whether to raise this
point, and decided not to, because it would be somewhat difficult to
determine exactly *how* similar this makes the two kinds of machine.

But the basic problem is that, if you say that the RISC machine is an
"optimizing CISC" because it keeps frequently executed instruction
sequences in cache, you have to have a compiler that recognizes these
frequently executed sequences, and puts them in one place (say, a
subroutine that is repeatedly invoked).  For example, having an identical
sequence randomly distributed through memory wouldn't work, because the
cache works spatially; it doesn't know anything about the instruction
sequences per se, so for example you might be fetching the same sequence
into the cache multiple times just because it was spread through memory
and thus not in the cache when you next needed it.**  On the other hand,
if you put the sequence in a subroutine, and aligned the subroutine so
that it all fit into the cache and you could keep it there while fetching
instructions from somewhere else that invoked that subroutine, then each
time the subroutine was called it would already be in the cache, and the
operation would be analogous to a CISC with vertical migration of the
sequence that's in the subroutine into microcode.  I'm not sure it's safe
to say that makes the RISC with cache an "optimizing CISC", though, since
it requires as much information about the original program as would be
required to do the vertical migration in the first place.
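
In C terms, the transformation would look something like this (a toy
example of my own, with invented names):

	/* Toy illustration: instead of duplicating a frequent operation
	 * sequence at many call sites (each copy competing for cache
	 * lines), factor it into one small subroutine whose single copy
	 * can stay resident in the cache across calls.  The sequence
	 * itself is invented for illustration.
	 */

	/* The recurring sequence, now in exactly one place. */
	static int scale_and_clip(int x, int limit)
	{
	    int y = (x << 2) + 3;           /* the "migrated" sequence */
	    return (y > limit) ? limit : y;
	}

	int f(int a, int b)
	{
	    /* Call sites scattered through a large program all share
	     * the one cached copy rather than inlining the sequence. */
	    return scale_and_clip(a, 100) + scale_and_clip(b, 200);
	}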

That is why I said originally that I thought the two areas of research
probably had a lot in common and probably would tend to converge.  However,
the above is a fairly simplified description, since there are a lot of
questions in my own mind about how comparable the two approaches actually
are that would be hard to address productively in this sort of discussion.

--------
* A company that makes amazing tape drives!  We have one of them here.

** Note that I am thinking here of the more general case in which you just
have a repeatedly occurring sequence you want to put in a faster place;
this might happen, for example, in a large program which has a large loop
that is repeated many times, such that the loop is too big to fit into
the cache at once, but contains sequences of operations (maybe interspersed
with various other differing operations of varied lengths) that do recur
repeatedly.  I didn't mention the obvious case in which you have a loop
that can fit entirely in cache; and also I wasn't considering the
benefits of cache in a machine with a very wide bus, where you fetch 4 or
8 words into the cache all at the same time because it costs the same as
fetching 1 word.  Those are all special cases, and have been covered a lot
in here already, and it's been shown that they do provide significant
benefits themselves.
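
The shape of program I have in mind is something like this toy fragment
(invented for illustration):

	/* Toy shape of the footnoted case: one large loop whose body is
	 * too big to fit in the cache at once, but which contains a
	 * short recurring operation sequence interleaved with varied
	 * work.  On each iteration the recurring sequence may have to
	 * be fetched again because the rest of the body pushed it out
	 * of the cache.
	 */
	void big_loop(int *data, int n)
	{
	    int i, j, k;

	    for (i = 0; i < n; i++) {
	        j = (i << 2) + 1;           /* recurring sequence */
	        data[j % n] += i;           /* ...varied work... */
	        /* (imagine many more, differing operations here) */
	        k = (i << 2) + 1;           /* the same sequence recurs */
	        data[k % n] -= i;           /* ...more varied work... */
	    }
	}
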
-- 
E. Roskos