[comp.arch] EXACTLY what is Superscalar?

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (03/23/89)

In article <16080@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>>   Note that if you have a superscalar architecture, and can do two inst.
>>in parallel (see the Intel 80960CA paper in newest Compcon proceedings), you
>
>Exactly my point about superscalar.  But note that for the expense of the

For quite a while, I have heard superscalar used, and I think the term was
defined in a paper in IEEE Computer a while back, but I am still a little
fuzzy on it.  Is "superscalar" an exact concept, or is it a buzzword like 
"RISC"?  Is a Multiflw machine a superscalar machine, or the i860, or
the Weitek XL-8064?

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

cliff@ficc.uu.net (cliff click) (03/23/89)

[This is my first posting, so please go easy on the flames! :-)]

> For quite a while, I have heard superscalar used, and I think the term was
> defined in a paper in IEEE Computer a while back, but I am still a little
> fuzzy on it.  Is "superscalar" an exact concept, or is it a buzzword like 
> "RISC"?  Is a Multiflw machine a superscalar machine, or the i860, or
> the Weitek XL-8064?

The NOVIX 4000 chip had 2 stacks in external memory; all operations 
implicitly used the top-of-stack (or next-on-stack) with no real 
registers.  Calls/branches took 1 cycle, subroutine returns took ZERO 
cycles, memory loads and stores took 2 cycles (1 for the instruction, 1 
for the memory reference), and arithmetic took 1 cycle.  The arithmetic 
instructions were explicitly bit-coded - bits in the instruction ran 
directly into the ALU/stacks - so a single instruction could choose one 
of (add/sub/neg/xor/or/and/...) and (shift left/shift right/rotate/...) 
and (push results/pop results/...).  The chip had no pipeline and no 
cache.  Interrupts had a 2-cycle latency (recognize interrupt, push 
PC).  The chip ran at 10 MHz with nearly 10 MIPS (no flames 
please) of throughput.  Oh yeah, both stacks and main memory had 
SEPARATE address and data lines, and it was a 16-bit chip (2 x 8 stack 
address, 16 main address, 3 x 16 data lines = 80 address+data lines).
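
Roughly, the bit-coded idea looks like this (a sketch in C, with a field
layout I've invented for illustration -- it is certainly NOT the real NOVIX
encoding, which I don't remember exactly):

/* Rough sketch of a bit-coded stack-machine step.  The field layout
 * and op values are invented for illustration, not the NOVIX 4000's. */
#include <stdio.h>

typedef unsigned short u16;

/* Hypothetical 16-bit format:
 *   bits 0-2  ALU op (0=add, 1=sub, 2=xor, 3=or, 4=and)
 *   bits 3-4  shift   (0=none, 1=left, 2=right)
 *   bit  5    pop second operand
 *   bit  6    push result
 * Every field takes effect in the same cycle -- no microcode sequencing. */
u16 step(u16 insn, u16 *stack, int *sp)
{
    u16 top = stack[(*sp)--];
    u16 nxt = (insn & 0x20) ? stack[(*sp)--] : stack[*sp];
    u16 res;

    switch (insn & 0x7) {                 /* ALU field */
    case 0:  res = nxt + top; break;
    case 1:  res = nxt - top; break;
    case 2:  res = nxt ^ top; break;
    case 3:  res = nxt | top; break;
    default: res = nxt & top; break;
    }
    switch ((insn >> 3) & 0x3) {          /* shift field */
    case 1: res <<= 1; break;
    case 2: res >>= 1; break;
    }
    if (insn & 0x40)                      /* push-result bit */
        stack[++(*sp)] = res;
    return res;
}

int main(void)
{
    u16 stack[16] = {0, 3, 4};            /* next-on-stack = 3, top = 4 */
    int sp = 2;
    /* add, shift left, pop both, push result: (3+4)<<1 = 14 */
    printf("%d\n", step(0x40 | 0x20 | 0x08 | 0x00, stack, &sp));
    return 0;
}

The point is that the "opcode" isn't decoded into anything; the bits ARE
the control lines.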

Is this (basically) register-less chip RISC?

Is this (basically) instructions-as-horizontal-microcode "Superscalar"?

Why isn't this approach more popular?  With no pipeline and no cache,
context switches should be cheap (stacks swapped by using the MMU).  The
builder didn't use any fancy technology for the part - the MUCH smaller
processes used by the "big boys" (Motorola, Intel, HP...) should be
able to double or triple the clock rate of the part.

(All of the NOVIX stuff is from my head and is a couple of years old, so I
may have forgotten some of it!  I know that a 32-bit part is in the works.)


Cliff Click, Xenix Support, Ferranti International Controls Corporation.
uunet.uu.net!ficc!cliff, cliff@ficc.uu.net, +1 713 274 5368.
Disclaimer:  What's a disclaimer?

bcase@cup.portal.com (Brian bcase Case) (03/24/89)

>For quite a while, I have heard superscalar used, and I think the term was
>defined in a paper in IEEE Computer a while back, but I am still a little
>fuzzy on it.  Is "superscalar" an exact concept, or is it a buzzword like 
>"RISC"?  Is a Multiflw machine a superscalar machine, or the i860, or
>the Weitek XL-8064?

Well, this is a good question.  Since I have been using the "buzz word"
superscalar, maybe I should give the definition I use.  To me, superscalar
is simply an implementation that executes multiple instructions per
cycle (at least for a RISC architecture) when dependencies permit.  It
accomplishes this multiple-instruction-per-cycle rate *without any help
from the instruction stream itself.*  That is, take an instruction stream
that executes just fine on the 29000; if the same instruction stream were
presented to the S-29000, the superscalar 29000, more than one instruction
would be executed per cycle when dependencies permit.  To accomplish this
to any reasonable degree, two or more (nearly?) identical pipelines must
be present (I think).
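
To make that concrete, here is a toy model of the issue decision, in C.
The instruction format and the pairing test are invented for illustration;
this is not the actual S-29000 issue logic:

/* Toy model of dual issue: instructions are just (dest, src1, src2)
 * register numbers.  Issue two per cycle whenever the second one
 * neither reads nor writes what the first one writes. */
#include <stdio.h>

struct insn { int dest, src1, src2; };

static int independent(struct insn a, struct insn b)
{
    return b.src1 != a.dest && b.src2 != a.dest && b.dest != a.dest;
}

static int run(struct insn *s, int n)
{
    int pc = 0, cycles = 0;
    while (pc < n) {
        if (pc + 1 < n && independent(s[pc], s[pc + 1]))
            pc += 2;                 /* both issue this cycle      */
        else
            pc += 1;                 /* dependency: one this cycle */
        cycles++;
    }
    return cycles;
}

int main(void)
{
    struct insn dep[]   = { {3, 1, 2}, {4, 3, 1} };   /* r4 needs r3 */
    struct insn indep[] = { {3, 1, 2}, {4, 5, 6} };   /* no conflict */
    printf("%d %d\n", run(dep, 2), run(indep, 2));    /* prints 2 1  */
    return 0;
}

The same stream, unchanged, still runs one instruction per cycle on a
plain single-issue implementation; the superscalar machine just gets
through it faster when dependencies allow.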

Note that this is significantly different from VLIW or the i860.  For
these implementations, multiple operations can execute in one cycle, but
that is because the instruction says, in an explicit way, to do so.  Said
another way, these machines will not execute multiple operations per cycle
*unless* the instruction stream says to do so.  A superscalar machine needs
no such help.  *However*, to squeeze the most from a superscalar design,
one would like to have the compiler arrange things so that dependencies are
minimized.  *But note*, even the compiler-arranged instruction stream will
still execute just fine on a non-superscalar implementation of the
architecture.
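
For contrast, the VLIW-style version of the same idea does the pairing
once, statically, and bakes it into the instruction words (again a made-up
bundle format, not the real i860 dual-instruction mode):

/* VLIW-flavored contrast: pairing is decided ahead of time, e.g. by
 * the compiler, and encoded explicitly.  Made-up format for illustration. */
#include <stdio.h>

struct insn   { int dest, src1, src2; };
struct bundle { struct insn slot0, slot1; int slot1_valid; };

static int independent(struct insn a, struct insn b)  /* same test as above */
{
    return b.src1 != a.dest && b.src2 != a.dest && b.dest != a.dest;
}

/* Greedily pack adjacent independent instructions into 2-wide bundles;
 * anything that won't pair gets an explicit empty slot (a no-op). */
static int pack(struct insn *s, int n, struct bundle *out)
{
    int pc = 0, nb = 0;
    while (pc < n) {
        out[nb].slot0 = s[pc];
        if (pc + 1 < n && independent(s[pc], s[pc + 1])) {
            out[nb].slot1 = s[pc + 1];
            out[nb].slot1_valid = 1;
            pc += 2;
        } else {
            out[nb].slot1_valid = 0;      /* wasted issue slot */
            pc += 1;
        }
        nb++;
    }
    return nb;       /* hardware then runs one bundle per cycle, blindly */
}

int main(void)
{
    struct insn s[] = { {3, 1, 2}, {4, 5, 6}, {5, 4, 1} };
    struct bundle b[3];
    printf("%d bundles\n", pack(s, 3, b));    /* prints "2 bundles" */
    return 0;
}

Those bundles mean nothing to a machine that expects the original
one-at-a-time format, whereas the superscalar machine above takes the
unmodified stream.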

mbutts@mntgfx.mentor.com (Mike Butts) (03/28/89)

From article <22975@ames.arc.nasa.gov>, by lamaster@ames.arc.nasa.gov (Hugh LaMaster):
> For quite a while, I have heard superscalar used, and I think the term was
> defined in a paper in IEEE Computer a while back, but I am still a little
> fuzzy on it.  Is "superscalar" an exact concept, or is it a buzzword like 
> "RISC"?  Is a Multiflw machine a superscalar machine, or the i860, or
> the Weitek XL-8064?

In "Superscalar vs. Superpipelined Machines" (Comp. Arch. News, ACM SIGARCH, 
v.16, #3, June 1988, p. 71-80), Norman P. Jouppi of DEC West in Palo Alto 
offers this definition: "A superscalar machine of degree n can issue n 
instructions per cycle."  VLIW machines are similar.

It's a very interesting paper, which I recommend to anyone interested in 
this subject.  He discusses superpipelined machines, which "can issue only one 
instruction per cycle, but have cycle times shorter than the time required 
for any operation", compares the alternatives, and discusses limits to 
instruction-level parallelism.
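
As a back-of-the-envelope illustration of those two definitions (my own
simplification, not something from the paper -- it ignores dependencies,
operation latencies, and latch overhead, which are what the paper actually
analyzes):

/* Idealized peak rates only:
 *   superscalar of degree n:    n instructions per base cycle
 *   superpipelined of degree m: 1 instruction per minor cycle,
 *                               minor cycle = base cycle / m       */
#include <stdio.h>

static double superscalar_ns(double insns, double degree, double cycle_ns)
{
    return (insns / degree) * cycle_ns;
}

static double superpipelined_ns(double insns, double degree, double cycle_ns)
{
    return insns * (cycle_ns / degree);
}

int main(void)
{
    /* 1000 instructions, 100 ns base cycle, degree 4 either way */
    printf("%.0f ns  %.0f ns\n",
           superscalar_ns(1000, 4, 100),
           superpipelined_ns(1000, 4, 100));   /* 25000 ns  25000 ns */
    return 0;
}

In this idealized form the two come out the same; the interesting
differences only show up once real instruction-level parallelism limits
are taken into account.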

In particular, let me quote from his concluding comments: "The most important 
point to emphasize is that significant improvements in uniprocessor 
performance via internally parallel processors will only occur for 
applications with large amounts of instruction-level parallelism.  These are 
applications that are often of more importance to physical scientists than to 
computer scientists.  Many applications of importance to computer scientists 
and computer engineers, such as compilers, operating systems, and programs
involving manipulation of linked data structures, will not benefit from
highly parallel uniprocessors."

-- 
Mike Butts, Research Engineer         KC7IT           503-626-1302
Mentor Graphics Corp., 8500 SW Creekside Place, Beaverton OR 97005
...!{sequent,tessi,apollo}!mntgfx!mbutts OR  mbutts@pdx.MENTOR.COM
These are my opinions, & not necessarily those of Mentor Graphics.