[comp.arch] Superscalar ISAs

sporer@stellar.COM (Michael Sporer @stellar) (08/23/89)

Hi, this is my first posting so please excuse any awkwardness or incorrect
usage. I am interested in Superscalar architectures.

We at Stellar designed our own ISA which has lots of RISC characteristics:
32 integer registers, 8 DP FP registers, single-cycle execution, delayed
branch with conditional squash (nullification). We did extend the RISC ideas
by incorporating memory operates, complex addressing (effective addressing
in the operates as well as load/store), and vector and concurrency
operations.

The vector unit operates in a load/store mode with the architecture defining
6 vector registers each with 32 DP elements.

After all this was done, we found
that not all our functional units were busy all the time so we added a
feature called packetization - which the industry now calls superscalar.
However, after much work, we found that we only get a 10 percent increase
in overall performance. We can packetize 2 integer, or 2 fp (a + and a *),
or 1 fp and 1 integer. The biggest restriction is that only one instruction
can be a memory reference.
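
Roughly, the pairing rule comes down to something like the following sketch
(just illustrative C, not our actual dispatch hardware; the instruction
record and classification are invented for the example):

/* Illustrative sketch of the pairing rules described above; the
 * instruction record is invented and this is not the real dispatch logic. */
struct insn {
    enum { INT_OP, FP_ADD, FP_MUL } cls;  /* functional-unit class        */
    int mem;                              /* nonzero if it touches memory */
};

/* Return 1 if the two instructions may be issued as one packet. */
static int can_packetize(struct insn a, struct insn b)
{
    if (a.mem && b.mem)
        return 0;                               /* at most one memory reference  */
    if (a.cls == INT_OP && b.cls == INT_OP)
        return 1;                               /* two integer operations        */
    if ((a.cls == FP_ADD && b.cls == FP_MUL) ||
        (a.cls == FP_MUL && b.cls == FP_ADD))
        return 1;                               /* an FP add with an FP multiply */
    if ((a.cls == INT_OP) != (b.cls == INT_OP))
        return 1;                               /* one integer and one FP        */
    return 0;
}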

Now, I am beginning to hear about RISC chips (or chip sets) that will have
multiple functional units and an ability to examine 2, 3, or even 4 instructions
for dispatch. However, there seems to be some limit (like a 1.3x increase) to
the performance increase over a single functional unit.

First off, is this true? Are there really limits to our current instruction
set descriptions of programs that limit this type of parallelism?

Secondly, if this is true, is there really any work being done on true
superscalar ISAs? I am aware of the work being done by Bill Wulf at University
of Virginia (The WM Computer Architecture, Definition and Rationale - Computer
Science Report No. TR-88-22, Oct 21, 1988). It seems that he has defined an
elegant method of specifying multiple memory references at once.

Anyway, I've gone on long enough. Thanks for your interest.

--
Michael Sporer        uunet!stellar!sporer
Stellar Computer Inc  sporer@stellar.com

slackey@bbn.com (Stan Lackey) (08/24/89)

In article <1989Aug22.212330.8119@stellar.COM> sporer@stellar.com writes:
>We at Stellar designed our own ISA which has lots of RISC characteristics:
>We did extend the RISC ideas
>by incorporating memory operates, complex addressing (effective addressing
>in the operates as well as load/store), and vector and concurrency
>operations.

>that not all our functional units were busy all the time so we added a
>feature called packetization - which the industry now calls superscalar.
>However, after much work, we found that we only get a 10 percent increase
>in overall performance. We can packetize 2 integer, or 2 fp (a + and a *),
>or 1 fp and 1 integer.

You already have 90%? of what 'superscalar' gets in your instruction
set: mem-to-reg operations, complex addressing, and vectors.  Now, a
true RISC doesn't have those features, and so it will see very
significant improvements.  Go ahead and flame me, but that's why I say
that superscalar and CISC get the same thing, but in slightly
different ways.  Sure, superscalar is more general and, according to
what I get from the posting, 10% better than CISC.  But it can be more
expensive, depending upon issues like backwards compatibility.

-Stan

mcg@mipon2.intel.com (08/25/89)

In article <1989Aug22.212330.8119@stellar.COM> sporer@stellar.com writes:
>
>We at Stellar designed our own ISA which has lots of RISC characteristics:
>32 integer registers, 8 DP FP registers, single-cycle execution, delayed
>branch with conditional squash (nullification). We did extend the RISC ideas
>by incorporating memory operates, complex addressing (effective addressing
>in the operates as well as load/store), and vector and concurrency
>operations.

Characterizing this ISA as having "lots of RISC characteristics" begs the
perennial question of what RISC means, but let's put that aside for the moment.

>The vector unit ...
>
>After all this was done, we found that not all our functional units were
>busy all the time so we added a feature called packetization - which the
>industry now calls superscalar ...

Here is a stab at a definition of Superscalar.  (Note that [Jouppi]
is an excellent paper defining Superscalar and Superpipelined
machines, and exploring the isomorphism between them and vector machines.
See bibliography at the end of this article.)

	Superscalar: an implementation technique for an instruction
		set architecture that decodes, dispatches, executes,
		and returns results from more than one independent
		instruction from an otherwise linear instruction stream
		during each of the processor's basic clock cycles.

It is equally important to say what superscalar is NOT.  Most importantly
it is not a VLIW (Very Long Instruction Word) processor [Nicolau].  I say
"most importantly" because this is superscalar's most immediate heritage.
The difference between VLIW and superscalar is that VLIW processors
fail the test of executing "from an otherwise linear instruction stream".
Processors which resort to a "dual-instruction mode" or other artifice also
fail this test.

Another important point is that superscalar is an *implementation technique*,
not a fundamental ISA design principle like RISC.  With sufficient
resources, one could probably build a superscalar VAX implementation.
However, the effectiveness (degree of speed-up vs. implementation cost)
of a superscalar implementation does depend on many aspects of the ISA,
notably how easy or hard it is to detect inter-instruction dependencies
and potential parallelism between functional units.
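
For instance, with a fixed three-register format, a pairwise independence
check is little more than a few register-number comparisons.  The sketch
below is purely illustrative (a made-up instruction record, ignoring memory
and condition-code dependences):

/* Illustrative only: a made-up three-register instruction record.  A real
 * decoder works on encoded bits and must also consider memory references,
 * condition codes, etc. */
struct op {
    int dst, src1, src2;                /* register numbers */
};

/* The second instruction may issue alongside the first only if it neither
 * reads the first one's result (RAW) nor overwrites it (WAW). */
static int independent(struct op first, struct op second)
{
    return second.src1 != first.dst &&
           second.src2 != first.dst &&
           second.dst  != first.dst;
}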

> ... [and] found that we only get a 10 percent increase in overall
>performance.

In general, there are two recent papers and many older ones (see attached
partial bibliography) that characterize the potential speedups one might
hope to see from fine-grained parallel architectures.  The results are
mixed.  In all but one case, the research has focused on adapting another
architecture (e.g. Cray, IBM 360, MIPS, DECWRL Titan) to superscalar
implementation.  Most of the papers agree that the speedup begins to level
off at somewhere between 2x and 4x.  There is disagreement on what the
limiting factor is.  While everyone agrees that a major problem is the
distance between conditional branches, others go on to argue that
the fundamental parallelism is limited.
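
As a contrived illustration of the branch-distance point, consider a loop
like the one below: there is a conditional branch every few instructions,
and the little work between branches is serially dependent, so extra issue
slots mostly go idle unless the machine speculates past the branches.

/* Contrived example: branches every few instructions, with the work
 * between them serially dependent on the counter and the running count. */
int count_negatives(const int *a, int n)
{
    int i, neg = 0;

    for (i = 0; i < n; i++) {       /* loop-closing conditional branch    */
        if (a[i] < 0)               /* data-dependent conditional branch  */
            neg++;                  /* depends on every earlier iteration */
    }
    return neg;
}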

All, however, seem to point out that 2x improvement is achievable.  This
can also be seen in our implementation of a superscalar microprocessor,
mentioned in [Hinton].  Your 10% improvement might be due to a hiding
of potential parallelism by the architecture, or poor instruction
scheduling by the compiler - there is no way to tell without a much more
thorough examination of the architecture.

> We can packetize 2 integer, or 2 fp (a + and a *),
>or 1 fp and 1 integer. The biggest restriction is that only one instruction
>can be a memory reference.

Executing an FP add and multiply in parallel (if you can truly do that
and complete them in one cycle) will give you excellent speedups on many
scientific algorithms, provided that they are hand-coded or that you
have a good compiler.  In particular, matrix multiply can be improved
by up to 2x by this.  So if you are doing this and still only getting
10%, your memory system or some other aspect of your implementation is
out of balance.
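
To make the matrix-multiply point concrete, here is the textbook triple loop
in plain C (nothing vendor-specific about it): the multiply for the next
value of k does not depend on the running-sum add for the current one, so a
machine that can pair an FP add with an FP multiply, given a compiler that
schedules for it, can approach 2x on this inner loop.

/* Textbook triple loop; assumes row-major n-by-n matrices stored in
 * one-dimensional arrays. */
void matmul(int n, const double *a, const double *b, double *c)
{
    int i, j, k;
    double sum;

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sum = 0.0;
            for (k = 0; k < n; k++)             /* the next multiply can issue */
                sum += a[i*n + k] * b[k*n + j]; /* alongside the current add   */
            c[i*n + j] = sum;
        }
    }
}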

>Now, I am beginning to hear about RISC chips (or chip sets) that will have
>multiple functional units and an ability to examine 2,3 or even 4 instructions
>for dispatch. However, there seems to be some limit (like 1.3x increase) to
>the performance increase over a single functional unit.

AMD, MIPS, National, Motorola, and Intel are all known to be working on
multi-function-unit processors.  I have commented in the past on my
beliefs about the difficulties some of these architectures will place
on superscalar implementations.

>First off, is this true? Are there really limits to our current instruction
>set descriptions of programs that limit this type of parallelism?

As described above, yes and no.  Current research and practice show that
2-3x is achievable, given an appropriate balance of memory bandwidth
(especially instruction fetch) and functional units.  VLIW people would
have you believe that 1000x speedup is possible with VLIW, and anything
VLIW can do, Superscalar can do also, possibly with a little more difficulty.
The difference is: a) what benchmarks are important to you: are they
loop-dominated or straight-line-code dominated?  what is their fundamental
parallelism?; and b) how much effort do you put into the compiler to
do loop unrolling, software pipelining, and inter-basic-block code scheduling?
Does your architecture support branch prediction, out-of-order instruction
execution, etc?  Can you hope to effectively implement them?
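
As one concrete example of the compiler's role, unrolling a reduction and
keeping several independent accumulators exposes more operations per branch
for the issue logic to choose from.  This is a generic sketch (not any
particular compiler's output, and it assumes the trip count is a multiple
of four):

/* Generic unrolling sketch with independent accumulators; assumes n is a
 * multiple of 4 to keep the example short. */
double dot(const double *x, const double *y, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;

    for (i = 0; i < n; i += 4) {
        s0 += x[i]     * y[i];          /* the four multiply-adds are     */
        s1 += x[i + 1] * y[i + 1];      /* mutually independent, so a     */
        s2 += x[i + 2] * y[i + 2];      /* multiple-issue machine has     */
        s3 += x[i + 3] * y[i + 3];      /* more to choose from per branch */
    }
    return (s0 + s1) + (s2 + s3);
}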

>Secondly, if this is true, is there really any work being done on true
>superscalar ISAs? I am aware of the work being done by Bill Wulf at University
>of Virginia (The WM Computer Architecture, Definition and Rationale - Computer
>Science Report No. TR-88-22, Oct 21, 1988). It seems that he has defined an
>elegant method of specifying multiple memory references at once.

I don't have a copy of the WM stuff handy, but I remember that it struck
me as closer to the VLIW end of the spectrum than superscalar.

As I mentioned, Intel has published a paper, and I have spoken several times
about the research we are doing.  I'm speaking on this subject at Michael
Slater's Microprocessor Forum, the Embedded Systems Conference, and
Wescon, so maybe you can attend those conferences.
We have announced that a product will be available during 1989.  Other
than that, you'll have to wait.

In article <44677@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>You already have 90%? of what 'superscalar' gets in your instruction
>set: mem-to-reg operations, complex addressing, and vectors.  Now, a
>true RISC doesn't have those features, and so they will see very
>significant improvements.  Go ahead and flame me, but that's why I say
>that superscalar and CISC get the same thing, but in slightly
>different ways.  Sure, superscalar is more general and according to
>what I get from the posting 10% better than CISC.  But it can be more
>expensive, depending upon issues like backwards compatibility.

Here is the flame you asked for: I do not believe that superscalar and CISC
have anything to do with one another.  It is true (and I have said many
times) that certain instruction set features expand the amount of easy-to-see
parallelism in the architecture.  Among these are slightly more complex
addressing modes, including scaled-index and scaled-index-plus-base-register,
but we believe that it is most effective to base a superscalar implementation
on a quite simple load/store architecture.  Keeping loads and stores
distinct from register-register operations is, in my opinion, quite
important to the ease of superscalar implementation.

And, as mentioned before, superscalar techniques now in existence
coupled with known techniques for compilation and code scheduling can get
2x to 3x speedups *over RISC processors*.  Also, I don't know the
meaning of the statement "depending on issues like backwards compatibility".
Superscalar implementations, done correctly, are more likely to be
binary backwards compatible than RISC architectures, because RISC people
often exposed the pipeline - a pipeline that they will have to change for
future implementations.

The next two years are going to see a lot of talk about superscalar; you
may want to read the attached papers.

S. McGeady
Intel Corp.

Bibliography
------------

R. Acosta, J. Kjelstrup, H. Torng, "An Instruction Issuing Approach to
Enhancing Performance in Multiple Functional Unit Processors", IEEE Transactions on
Computers, Vol c-35, #9, Sept. 1986

W. Hwu & Y. Patt, "Checkpoint Repair for Out-of-Order Execution Machines",
1987, ACM 0084-7495

N. Jouppi & D. Wall, "Available Instruction-Level Parallelism for Superscalar
and Superpipelined Machines", 1989, ASPLOS-III Proceedings,
ACM Sigplan Notices Vol 24, May 1989

G. Hinton, "80960 - Next Generation", Proceedings of 34th COMPCON,
IEEE, March 1989

D. Kuck, Y. Muraoka & S-C Chen, "On the Number of Operations Simultaneously
Executable in Fortran-Like Programs and Their Resulting Speedup",
IEEE Transactions on Computers, Vol C-21, #12, Dec 1972

A. Nicolau & J. Fisher, "Measuring the Parallelism Available for Very Long
Instruction Word Architectures", IEEE Transactions on Computers,
Vol c-33, #11, Nov 1984

E. Riseman & C. Foster, "The Inhibition of Potential Parallelism by
Conditional Jumps", IEEE Transactions on Computers, Dec 1972

M. Smith, M. Johnson, & M. Horowitz, "Limits on Multiple Instruction Issue",
1989, ASPLOS-III Proceedings, ACM Sigplan Notices Vol 24, May 1989

J. Thornton, "Parallel Operation in the Control Data 6600",
AFIPS Proceedings FJCC, vol 26, 1964

R. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic
Units", IBM Journal, vol 11, Jan 1967

G. Tjaden & M. Flynn, "Detection and Parallel Execution of Independent
Instructions", IEEE Transactions on Computers, Vol C-19, #10, Oct 1970

S. Weiss & J. Smith, "Instruction Issue Logic in Pipelined Supercomputers",
IEEE Transactions on Computers, Vol c-33, #11, Nov. 1984