[comp.arch] RISC vs. CISC -- SPECmarks

sef@kithrup.COM (Sean Eric Fagan) (04/26/91)

(Here we go again.  *sigh*  Are people really this ignorant?)

In article <1991Apr25.025800.4377@mp.cs.niu.edu> bennett@mp.cs.niu.edu (Scott Bennett) writes:
[in response to my assertion that making a superscalar CISC machine {e.g.,
68050} is much harder than with most of the current RISC machines.]
>     If you disallow pipelining in the CISC machine, then it is most
>likely to be impossible to have so-called superscalar operation.  

Who said I was disallowing pipelining?  Pipeline your bloody CISC to death
for all I care.  For the most part, it won't make that much difference:
most CISC chips, such as the 68k and iAPX*86 series, tend to do too many
memory references in each instruction to make superscalar feasible.  Or
don't you realize you can only access one memory location at a time?  (Well,
not completely true, but true enough.)

>However,
>most CISC machines now are not only pipelined, they are *multiply* pipe-
>lined.  

Oooh.  They have more than one stage of pipelining.  Like all of the current
RISC chips, which had them before the CISC chips.

>Since a superscalar RISC can only be that way by pipelining,
>let's at least compare only pipelined architectures.  FWIW, the MC68040
>supposedly averages about 1.3 clock cycles per instruction because of
>the pipelining used.  That obviously doesn't reach "superscalar", but
>it isn't terribly far off, either.

Bullshit.  It is *very* far off.  Note the word "supposedly" in your
statement.  Please look at John Mashey's figures; I think they indicate
slightly higher (1.4 or 1.5 CPI?) for the '40; on the other hand, the R3000
got what, 1.2 or 1.3.

>     In any case, what really matters is how much work gets done per
>clock cycle, not how many instructions get done per cycle.  

No, it doesn't.  What matters is *how quickly you can get your job done*.  I
don't care if you can do a POLY instruction in 3 cycles; if you still take 2
cycles to do an add, most current RISC chips will blow you away (unless your
application consists of POLY instructions).

|One example
|is the case of moving blocks of data from one memory location to another.
|A typical RISC must 1) initialize a loop (one or more instruction fetch/
|decodes) and in the body of the loop must 2) load a word into a register
|(one fetch/decode), 3) store from the register into the new location (one
|fetch/decode), 4) increment both addresses (probably two fetches/decodes),
|5) loop back to repeat until finished (at least one fetch/decode).  Some
|CISCs have something like a "repeat" instruction that will execute 
|another instruction (e.g. a storage-to-storage move) a given number of
|times while incrementing addresses in that instruction, so the whole
|operation may require as few as two fetches/decodes.  Other CISCs have
|single instructions capable of doing block moves, so they only need one
|fetch/decode.  That means more of the cycles required get spent doing the
|actual work that needs to be done than would be the case with a RISC.  A
|CISC operating in such a way would be at the *opposite* end of the spectrum
|from "superscalar", but would get its work done more quickly anyway.

This was so precious I decided to keep all of it intact.

Note how, because "some" CISCs have a "repeat" instruction, which doesn't
necessarily buy you anything (talk to Henry Spencer), all CISCs are better.

Never mind the fact that for most RISCs and CISCs the code is almost
identical, with only optimizations for the specific processor.

Listen *very* carefully:  as of right now, the most popular chips that have a
repeat instruction are the '386 and '486.  For the '486, a MOVS instruction, no
prefix, takes 7 clock cycles.  A "REP MOVSB" takes 5, *if you are moving 0
bytes*, 13, *if you are moving 1 byte*, and 12 + 3*(number of bytes)
otherwise.  The
overhead is essentially the same for setting up the rep instruction as it is
otherwise (unless you have other uses for the registers MOVS and REP want,
in which case you have to spill them, and reload, which is going to add even
more time).
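
If you want to see exactly where the break-even point falls, here it is as a
throwaway C program (a sketch built only from the cycle counts above; it
ignores cache, prefetch, and setup effects):

	#include <stdio.h>

	/* Cycle counts straight from the i486 figures above. */
	static long rep_movsb(long n)          /* one REP MOVSB */
	{
	    if (n == 0) return 5;
	    if (n == 1) return 13;
	    return 12 + 3 * n;
	}

	static long movs_repeated(long n)      /* n unprefixed MOVS, 7 cycles each */
	{
	    return 7 * n;
	}

	int main(void)
	{
	    long n;
	    for (n = 0; n <= 6; n++)
	        printf("%ld bytes: REP MOVSB %ld cycles, %ld plain MOVS %ld cycles\n",
	               n, rep_movsb(n), n, movs_repeated(n));
	    return 0;   /* REP only wins past 3 bytes: 12 + 3n < 7n  <=>  n > 3 */
	}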

The '40 doesn't have a repeat instruction; its block-memory move loop looks
very much like the RISC version, except that the RISC versions can generally
take advantage of overlapping memory loads/stores.  I.e.

	lb	$temp, 0($src)		# load a byte from the source
	addu	$src, $src, 1		# bump source; fills the lb's load delay slot
	sb	$temp, 0($dst)		# store it to the destination
	addu	$dst, $dst, 1		# bump destination

can take a total of four cycles per byte, with no stalls (the addu after the
lb even fills the load delay slot).  A
68k is likely to use something like

	move.b	(a0,d1.l),(a1,d1.l)
	addq.l	#1,d1

or somesuch (sorry, I'm not completely up to date on my 68k assembly; it's
been a while).  Note that the move instruction has two memory references;
this is BAD.  (Even if I'm wrong, and there's only one memory reference per
instruction, making it look like the RISC version, the '40 still doesn't
have overlapping loads, I believe.)
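
Either way, the loop being compiled is the same trivial C; the whole argument
is over how many memory references each generated instruction carries:

	/* The loop both fragments above implement.  A RISC compiler emits
	 * one load, one store, and two pointer bumps per byte -- at most
	 * one memory reference per instruction. */
	void byte_copy(char *dst, const char *src, unsigned long n)
	{
	    while (n-- != 0)
	        *dst++ = *src++;
	}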

Go *learn* before you start stating that people who know a lot more than you
(not me; I mostly just nod my head and agree with people like Mashey and
Patterson) are complete fools for not doing things properly.

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

sef@kithrup.COM (Sean Eric Fagan) (04/26/91)

In article <1991Apr24.181932.17810@cs.cornell.edu> wayner@CS.Cornell.EDU (Peter Wayner) writes:
>In another sense, going "superscalar" is much easier with CISC
>machines.  I think the Intel 486 does a PUSH instruction in one cycle.

It executes only one instruction each cycle.  How, pray tell, is that
superscalar?  Yes, the PUSH instruction can execute in one cycle (provided
you are pushing either a "general purpose" register or an immediate; most
code I've seen for the 80186 and later likes to push lots of memory
locations, in which case it takes 4 cycles, not counting memory latency).
However, I seem to recall that there are lots of "gotchas" in that 1 cycle.
As I do not remember them, I shall defer trying to discuss them.

>In RISC land, this is a decrement and a load. The CISC designer just
>needs to use enough silicon to pipeline the important instructions.

In RISC land, you do a store followed by a decrement.  This will execute in
two cycles on a MIPS R3000, I believe (since the decrement executes while
the store is waiting).

On the other hand, the equivalent POP takes 4 cycles on the '486; on the
R3000, the equivalent load/increment takes (tada) 2 cycles.  Oooh.
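
Written out in C, for anyone keeping score (just a sketch of the
decomposition; sp is a made-up stack-pointer variable):

	/* What PUSH and POP decompose into on a RISC: one stack-pointer
	 * adjustment plus one store or load, which the pipeline overlaps
	 * into about two cycles each. */
	#define PUSH(sp, x)  (*--(sp) = (x))   /* decrement, then store */
	#define POP(sp)      (*(sp)++)         /* load, then increment  */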

>There is no need for complex logic to handle all the possible cases of
>two instructions coming down the pipe. The RISC designer needs to
>worry about generality.

And the CISC designer needs to try to think which sequences of
"instructions" are going to be commonly executed, and make them work fast as
a single instruction (such as PUSH).

Tell me: would it be preferable to have a limited PUSH instruction execute
in one cycle (limited because it can only store into a hardwired location),
or to have a more general store/arithmetic sequence execute in two cycles?

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

peter@ficc.ferranti.com (peter da silva) (04/27/91)

In article <1991Apr26.073829.4625@kithrup.COM>, sef@kithrup.COM (Sean Eric Fagan) writes:
> Or don't you realize you can only access one memory location at a time?
> (Well, not completely true, but true enough.)

Ah, the old Von Neumann bottleneck. Time to apply RISC design techniques to
memory subsystems. How about multiported memory, or even banked memory?
Multiple data and address busses? Who knows, but memory subsystems are the
current bottleneck and something's gotta give.
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

preston@ariel.rice.edu (Preston Briggs) (04/27/91)

sef@kithrup.COM (Sean Eric Fagan) writes:
>> Or don't you realize you can only access one memory location at a time?
>> (Well, not completely true, but true enough.)

and
peter@ficc.ferranti.com (peter da silva) writes:

>Ah, the old Von Neumann bottleneck. Time to apply RISC design techniques to
>memory subsystems. How about multiported memory, or even banked memory?
>Multiple data and address busses? Who knows, but memory subsystems are the
>current bottleneck and something's gotta give.

Well, you _could_ rediscover the Connection Machine.

Preston Briggs

hrubin@pop.stat.purdue.edu (Herman Rubin) (04/27/91)

In article <TH_A6-F@xds13.ferranti.com>, peter@ficc.ferranti.com (peter da silva) writes:
> In article <1991Apr26.073829.4625@kithrup.COM>, sef@kithrup.COM (Sean Eric Fagan) writes:
> > Or don't you realize you can only access one memory location at a time?
> > (Well, not completely true, but true enough.)
> 
> Ah, the old Von Neumann bottleneck. Time to apply RISC design techniques to
> memory subsystems. How about multiported memory, or even banked memory?
> Multiple data and address busses? Who knows, but memory subsystems are the
> current bottleneck and something's gotta give.

The CYBER 205/ETA 10 is a vector pipeline machine, with the most versatile
vector architecture I know of.  With slight modification, it would be able
to handle in a single instruction vectors of arbitrary length, and one does
not have to worry about alignment of vectors in vector registers.  At any
rate, each pipe manages 2 inputs and one output per cycle.  (Pipes are 
splitting the vector, so there is no memory conflict.)  There are also other
instructions in which 2 words at a time are used, and this is normal.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

peter@ficc.ferranti.com (peter da silva) (04/27/91)

In article <11412@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> In article <TH_A6-F@xds13.ferranti.com>, peter@ficc.ferranti.com (peter da silva) writes:
> > Ah, the old Von Neumann bottleneck. Time to apply RISC design techniques to
> > memory subsystems. How about multiported memory, or even banked memory?
> > Multiple data and address busses? Who knows, but memory subsystems are the
> > current bottleneck and something's gotta give.

> The CYBER 205/ETA 10 is a vector pipeline machine, with the most versatile
> vector architecture I know of.

Fine, but how does it pull those vectors into the CPU? The CPU can have a
zillion registers a zillion words long, but it still has to read them from
memory. So it does a lot of work in its vector units per instruction. But
how quickly can it fill them? I'm envisioning a system where main memory
is multi-way interleaved, where the CPU can read arbitrary words from each
memory bank in a given cycle.
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

shair@ux1.cso.uiuc.edu (Bob Shair) (04/27/91)

peter@ficc.ferranti.com (peter da silva) writes:

>In article <1991Apr26.073829.4625@kithrup.COM>, sef@kithrup.COM (Sean Eric Fagan) writes:
>> Or don't you realize you can only access one memory location at a time?
>> (Well, not completely true, but true enough.)

>Ah, the old Von Neumann bottleneck. Time to apply RISC design techniques to
>memory subsystems. How about multiported memory, or even banked memory?
>Multiple data and address busses? Who knows, but memory subsystems are the
>current bottleneck and something's gotta give.
>-- 
>Peter da Silva.  `-_-'  peter@ferranti.com

The IBM RISC 6000 (models 530 and above) has two-way memory 
interleaving, allowing two 64-bit words to be loaded or stored
concurrently (but only to adjacent locations, I believe).

Is this a first step in this direction?                        

-- 

Bob Shair                          shair@chgvmic1.iinus1.ibm.com
Scientific Computing Specialist    SHAIR@UIUCVMD (bitnet)
IBM Champaign

guy@auspex.auspex.com (Guy Harris) (04/28/91)

>Yes, the PUSH instruction can execute in one cycle

Which also raises the question "how important are PUSHes"?  They may be
important on machines with few registers, but on RISCs with lots of
registers, they're not used much for passing arguments to procedures, as
arguments are generally passed in registers.

(In addition, multiple PUSHes in a row, if you end up doing that, can be
done with a bunch of stores and *one* register modification on a RISC
machine.)

mash@mips.com (John Mashey) (04/28/91)

In article <1991Apr26.073829.4625@kithrup.COM> sef@kithrup.COM (Sean Eric Fagan) writes:
>In article <1991Apr25.025800.4377@mp.cs.niu.edu> bennett@mp.cs.niu.edu (Scott Bennett) writes:
(I either missed this, or it hasn't gotten here yet.)

>>Since a superscalar RISC can only be that way by pipelining,
>>let's at least compare only pipelined architectures.  FWIW, the MC68040
>>supposedly averages about 1.3 clock cycles per instruction because of
>>the pipelining used.  That obviously doesn't reach "superscalar", but
>>it isn't terribly far off, either.
>
>Bullshit.  It is *very* far off.  Note the word "supposedly" in your
>statement.  Please look at John Mashey's figures; I think they indicate
>slightly higher (1.4 or 1.5 CPI?) for the '40; on the other hand, the R3000
>got what, 1.2 or 1.3.
>
>>     In any case, what really matters is how much work gets done per
>>clock cycle, not how many instructions get done per cycle.  
>
>No, it doesn't.  What matters is *how quickly you can get your job done*.  I
>don't care if you can do a POLY instruction in 3 cycles; if you still take 2
>cycles to do an add, most current RISC chips will blow you away (unless your
>application consists of POLY instructions).
I'm sure the net doesn't want to again hear of the reasons why CPI is
really only known by CPU architects.  I'll repeat the observation, that
at 25MHz, the best CISC micros get about 12-13 on SPECinteger, and either
somewhat less on FP (68040), or a bunch less (i486), with 128KB external caches.
At the same clock rate, the more efficient RISCs get 19-21 on integer,
and either close to that, or higher on FP. (R3000, RS/6000, HP PA, etc.)
(Many caveats here that I don't have time to type,
about different configurations,
who's running clocks at which rate, costs, etc, etc, etc.)

Put another way, since the numbers in the above post must refer to integer,
CISCs need about 2 MHz/SPECint, efficient RISCs about 1.25.
If you take MHz/SPECint as a measurable approximation to cycles per
EQUIVALENT instruction (as opposed to cycles/native instruction, which is 
fairly useless as a between-architecture metric by itself) you at least
get a comparison that makes sense to argue about.
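
(Worked through with the numbers above: 25 MHz / 12.5 SPECint = 2.0 for the
best CISC micros, and 25 MHz / 20 SPECint = 1.25 for the efficient RISCs.)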
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94088-3650

oasis@gary.watson.ibm.com (GA.Hoffman) (04/29/91)

Currently marketed machines have the memory data directly attached to
the processor's cache chips... interleaving was necessary for us to
get acceptable transfer rates for cache lines.  If we could not get
a transfer per cycle we would have to use unacceptably expensive DRAM
for memory.  Our caches are multi-ported... since we build our own
custom VLSI it is cost-effective for us to obtain multi-ported memories
in this fashion. 
-- 
g

gary a hoffman
RISC Systems, Watson Research

peter@ficc.ferranti.com (peter da silva) (04/29/91)

In article <1991Apr27.162202.18043@ux1.cso.uiuc.edu>, shair@ux1.cso.uiuc.edu (Bob Shair) writes:
> The IBM RISC 6000 (models 530 and above) has two-way memory 
> interleaving, allowing two 64-bit words to be loaded or stored
> concurrently (but only to adjacent locations, I believe).

> Is this a first step in this direction?                        

Yes, but a very small one. This sounds like a special case of burst mode
writes. How do they do it, put data on the address bus, or do they have 128
bits of data bus coming into the chip?

To make this really effective you need to have multiple address and
data busses.

(I've been informed that the ETA-10 in fact does this very thing, but
it's pretty much unknown in microprocessors)
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

dik@cwi.nl (Dik T. Winter) (04/30/91)

In article <11412@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
 > The CYBER 205/ETA 10 is a vector pipeline machine, with the most versatile
 > vector architecture I know of.  With slight modification, it would be able
 > to handle in a single instruction vectors of arbitrary length, and one does
 > not have to worry about alignment of vectors in vector registers.
And now consider what occurs on a page fault in the middle of an instruction!
A single vector instruction can create over 45 page faults.  Still worse if
you only run out of the (16) associative registers used to cache page table
entries (half page faults?).
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

rmbult01@ulkyvx.bitnet (Robert M. Bultman) (04/30/91)

> From: mccalpin@perelandra.cms.udel.edu (John D. McCalpin)
> 
> 	1. multiple functional units	(separate FP add and multiply)
> 	2. pipelined functional units	(independent stages for FP ops)
> 	3. multi-word xfr from cache/(vector registers) to FPU
> 	4. multi-word xfr from main memory to cache/(vector registers)/(fpu)
> 

Sounds like a Multiflow Trace.  4 simultaneous 64-bit FP reads
from memory over 4 data busses to 4 modules, each containing 2
FPUs (separate add and multiply), 2 IUs, 64 single-precision/32
double-precision FP registers, 64 32-bit integer registers, and an
instruction cache only (no data cache).
Memory consisted of up to 8 memory controllers with up to 8 banks
of memory per controller, with a 7 beat latency in returning
data.

Rob Bultman
Speed Scientific School
University of Louisville

hrubin@pop.stat.purdue.edu (Herman Rubin) (04/30/91)

In article <3423@charon.cwi.nl>, dik@cwi.nl (Dik T. Winter) writes:
> In article <11412@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>  > The CYBER 205/ETA 10 is a vector pipeline machine, with the most versatile
>  > vector architecture I know of.  With slight modification, it would be able
>  > to handle in a single instruction vectors of arbitrary length, and one does
>  > not have to worry about alignment of vectors in vector registers.
> And now consider what occurs on a page fault in the middle of an instruction!
> A single vector instruction can create over 45 page faults.  Still worse if
> you only run out of the (16) associative registers used to cache page table
> entries (half page faults?).

Do you think that the hardware designers were not aware of these problems?
Unless the user can manage page boundaries, page faults are likely to occur
with any kind of vector instruction.  It may be possible for the hardware to
anticipate page faults and decrease the losses associated with them, and doing
the same vector instructions in any other way will produce as many page faults.

I agree that on vector register machines, one can often rearrange the code to
reduce page faults.  But the alignment problems are quite massive, and the
bookkeeping is not inconsiderable, especially if there is a shortage of scalar
registers.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (04/30/91)

> Comments on Cyber 205, page faults, etc.

I suffered with a Cyber 205 for 6 or 7 years, and I know the machine
from a scientific user's viewpoint.  For just about all fluids 
codes, we found virtual memory to be essentially worthless.  For
out of core codes, we had to manage I/O by hand, and about all
virtual memory did for us was to enable us to create blocks of
virtual storage that we could move things in and out of "by hand".

As for its architecture, the Cyber is a good case of why architectural
classifications aren't a good way to classify computers as "super"
vs. otherwise.  What made the Cyber a pain to program, and killed it
vs. the Cray, was that (1) it had a long pipeline, hence a big hit
for short vectors, which meant that codes had to be rewritten to
use very long vectors, (2) it couldn't execute vector instructions
unless data was contiguous in memory, so for non-unit strides, you
had to do scatter-gather.

The bottom line is that lots of codes never ran very efficiently
on our machine, because it was too hard to recode.  The IMSL
general eigenvalue finder ran at about half the speed on the Cyber
205 that it does on my IBM RS/6000 model 730 (using tuned code).
The cache machines, like IBM, also have problems with non-unit
strides, but I'm finding that using strip-mining, it is relatively
easy to get around them, provided your algorithm has re-use of
data.

While we're on the subject of vector architectures, anybody
remember the Texas Instruments ASC?  It had its hardware problems,
but it had the nice feature that it had vector instructions for
multiply-dimensioned arrays, and the compiler (when it worked at
all, which wasn't often) used them intelligently.  This 
machine was ahead of its time, and went beyond what the real
technology could support, but I think its designers deserve to
get some credit.  Our old TI ASC at Princeton was junked and
went to Texas to be recycled, sharing a truck with some crates of
N.J. lettuce.  It's too bad the architecture was junked as well,
as it had a lot going for it.

prener@watson.ibm.com (Dan Prener) (05/01/91)

In article <8W+AYTD@xds13.ferranti.com>, peter@ficc.ferranti.com (peter da silva) writes:
|> In article <1991Apr27.162202.18043@ux1.cso.uiuc.edu>, shair@ux1.cso.uiuc.edu (Bob Shair) writes:
|> > The IBM RISC 6000 (models 530 and above) has two-way memory 
|> > interleaving, allowing two 64-bit words to be loaded or stored
|> > concurrently (but only to adjacent locations, I believe).
|> 
|> > Is this a first step in this direction?                        
|> 
|> Yes, but a very small one. This sounds like a special case of burst mode
|> writes. How do they do it, put data on the address bus, or do they have 128
|> bits of data bus coming into the chip?

There are 128 bits of data bus coming from memory to the cache.
-- 
                                   Dan Prener (prener @ watson.ibm.com)

lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) (05/02/91)

In article <3423@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <11412@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> > The CYBER 205/ETA 10 is a vector pipeline machine, with the most versatile

>And now consider what occurs on a page fault in the middle of an instruction!
>A single vector instruction can create over 45 page faults.  Still worse if
>you only run out of the (16) associative registers used to cache page table
>entries (half page faults?).

>dik t. winter, cwi, amsterdam, nederland
>dik@cwi.nl

While the machine had its faults :-) the above is not one of them.
Vector instructions were continuable with no problems (all the necessary
state was saved, and in a reasonable length of time.)  The fact that one
instruction could riffle through a lot of pages is in no way
different from the fact that a scalar machine can riffle through
a lot of pages in a loop.

The fact that 16 ARs are not enough is an amusing criticism.  True, but
what superscalar machine of today can map 16*64KW*8Bytes/Word  = 8 MegaBytes
in its TLB?  The ETA-10 had even bigger large pages...

PROPHECY:  One of these days, a single-chip microprocessor will have vector
instructions, and then the advantages and disadvantages of various
architectural decisions will be discovered all over again.

  Hugh LaMaster, M/S 233-9,  UUCP:                ames!lamaster
  NASA Ames Research Center  Internet:            lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035    With Good Mailer:    lamaster@george.arc.nasa.gov 
  Phone:  415/604-6117                            #include <std.disclaimer> 

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (05/02/91)

>On 2 May 91 01:54:10 GMT, lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) said:

Hugh> PROPHECY: One of these days, a single-chip microprocessor will
Hugh> have vector instructions, and then the advantages and
Hugh> disadvantages of various architectural decisions will be
Hugh> discovered all over again.

I don't see much benefit to explicit vector instructions compared to
tight loops with zero cycle branches (like the RS/6000).  They sure
can eat up a lot of silicon space, though....

The big problem is that the memory bandwidth required for vector FP is
expensive and is not likely to contribute substantially to the non-FP
performance.  Without adequate memory bandwidth, there is not really
any need for vector instructions, since the cpu is idle (waiting for
cache refills) for plenty of time to do loop control....
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) (05/02/91)

In article <1991Apr30.163153.18568@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
>> Comments on Cyber 205, page faults, etc.
>
>I suffered with a Cyber 205 for 6 or 7 years, and I know the machine
>from a scientific user's viewpoint.  For just about all fluids 
>codes, we found virtual memory to be essentially worthless.  For

Cray was quoted once as saying something like (paraphrase):

"If you don't have it (real memory) you can't fake it."

If you look at "Virtual Memory" as an attempt to fake real memory, then
it won't work.  On the other hand, from the Operating System point
of view, "Memory Mapping" (another name for Virtual Memory) has 
tremendous benefits, particularly if you want even a modest interactive
load.  Otherwise, you either buy twice as much memory as you really need, or,
throw away half your CPU swapping big jobs against each other.
To some extent, the benefit of virtual memory is invisible to the user.  
If the system has virtual memory, you can run interactively and still 
get very high utilization.  On the other hand, your particular job 
hasn't sped up.  But the user benefits due to more throughput/faster
turnaround/etc.

>out of core codes, we had to manage I/O by hand, and about all
>virtual memory did for us was to enable us to create blocks of
>virtual storage that we could move things in and out of "by hand"

Well, I saw 205 jobs which used the memory mapping I/O functions to achieve
full channel utilization on 8 high speed channels.  I would say 
that I saw virtual memory create great benefits to individual jobs
through this technique (memory mapped files, combined with the "advise"
functions to tell the system what pages you expected to use).  You could
program really fast I/O on that system, and also get an immediate benefit
as soon as more memory was available to the job.

>vs. otherwise.  What made the Cyber a pain to program, and killed it
>vs. the Cray, was that (1) it had a long pipeline, hence a big hit
>for short vectors, which meant that codes had to be rewritten to
>use very long vectors, 

True, this was a big problem.  The benchmarks that ran on the ETA-10 showed
that the long vector startup was not necessary. The ETA-10 had better
instruction overlap with the same basic architecture.

>(2) it couldn't execute vector instructions
>unless data was contiguous in memory, so for non-unit strides, you
>had to do scatter-gather.

Yes, this really was the Achilles' heel of this particular architecture.

>The cache machines, like IBM, also have problems with non-unit
>strides, but I'm finding that using strip-mining, it is relatively
>easy to get around them, provided your algorithm has re-use of
>data.

I have only limited experience with the new, fast-only-in-cache, machines,
but I have to say that the code you need to get optimum performance is
even more non-intuitive than that for the older vector architecture machines.
Even worse, code which was previously optimal for vector machines, and which
was OK on a wide variety of other machines, is now pessimal for these machines.
Sigh...


  Hugh LaMaster, M/S 233-9,  UUCP:                ames!lamaster
  NASA Ames Research Center  Internet:            lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035    With Good Mailer:    lamaster@george.arc.nasa.gov 
  Phone:  415/604-6117                            #include <std.disclaimer> 

lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) (05/03/91)

In article <MCCALPIN.91May2095930@pereland.cms.udel.edu>, mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
|> >On 2 May 91 01:54:10 GMT, lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) said:
|> 
|> Hugh> PROPHECY: One of these days, a single-chip microprocessor will
|> Hugh> have vector instructions, and then the advantages and
:
:
|> I don't see much benefit to explicit vector instructions compared to
|> tight loops with zero cycle branches (like the RS/6000).  They sure
|> can eat up a lot of silicon space, though....

In principle, a superscalar implementation with a really smart vectorizing
compiler can do as well as a vector machine.  In order to do so, however,
it would need to be able to issue two uncached loads and an uncached
store every cycle, as well as a multiply and an add (the latter has
now appeared on several machines).  Given the latency of non-cached memory,
which is usually greater than 4 cycles, and may even be up to 32 cycles,
this would require the system to keep a large number of pending loads
and stores, as well as a large number of registers (how many depends on
what the latency of main memory is).  I believe that vector instructions
would actually prove to be *much* easier to implement than a CPU with 20+
pending loads and stores, issuing five new instructions per CPU cycle...

 
|> The big problem is that the memory bandwidth required for vector FP is
|> expensive and is not likely to contribute substantially to the non-FP
|> performance.  Without adequate memory bandwidth, there is not really
|> any need for vector instructions, since the cpu is idle (waiting for
|> cache refills) for plenty of time to do loop control....

I agree that the memory subsystem is a major problem, but I am not sure
that it is as bad as assumed above.  I agree that a vector ISA is useless
without the memory bandwidth to back it up.  

But, what I envision is this:
vector loads/stores can very conveniently avoid modifying cache (unless
the location is already cached), and the latencies on some cached systems
are already fairly long, with fairly high bandwidths during cache refills.
Bandwidth by itself is not all that expensive; what IS expensive is low
latency high bandwidth memory systems.  Given some of the cache refill
strategies on current machines, feeding a vector load/store unit would
not be that big a deal.  The difference here is that ideally you would want
three or four load store units, and need to maintain cache coherence at
the same time.


-- 
  Hugh LaMaster, M/S 233-9,  UUCP:                ames!lamaster
  NASA Ames Research Center  Internet:            lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035    With Good Mailer:    lamaster@george.arc.nasa.gov 
  Phone:  415/604-6117                            #include <std.disclaimer> 

sef@kithrup.COM (Sean Eric Fagan) (05/03/91)

In article <1991May2.162909.9165@news.arc.nasa.gov> lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:
>Cray was quoted once as saying something like (paraphrase):
>"If you don't have it (real memory) you can't fake it."

I believe it was something like, "Memory is like sex:  it's better when it's
real."

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

johnl@iecc.cambridge.ma.us (John R. Levine) (05/03/91)

In article <1991May2.015410.1470@news.arc.nasa.gov> lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:
>PROPHECY:  One of these days, a single-chip microprocessor will have vector
>instructions, and then the advantages and disadvantages of various
>architectural decisions will be discovered all over again.

Perhaps, but I expect we'll be seeing more of the kind of stuff that the
Intel 860 has -- instructions that expose the pipeline so that either a
sufficiently smart compiler or more likely some suitable intrinsics can
synthesize vector ops.  So long as it keeps all the execution units busy,
it hardly matters whether it's one instruction or many.  Then again, the
860 certainly has provoked a lot of discussion about its architectural
decisions.

-- 
John R. Levine, IECC, POB 349, Cambridge MA 02238, +1 617 492 3869
johnl@iecc.cambridge.ma.us, {ima|spdcc|world}!iecc!johnl
Cheap oil is an oxymoron.

dik@cwi.nl (Dik T. Winter) (05/03/91)

In article <1991Apr30.163153.18568@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
 > > Comments on Cyber 205, page faults, etc.
 > 
 >                                      For just about all fluids 
 > codes, we found virtual memory to be essentially worthless.  For
 > out of core codes, we had to manage I/O by hand, and about all
 > virtual memory did for us was to enable us to create blocks of
 > virtual storage that we could move things in and out of "by hand"

True enough.  VM is great for the instruction stream and for small pieces of
data.  It is a bother when you are using large datasets.  (The same holds for
cache.)  But this is not so much an architecture problem as an OS/
compiler problem.  You ought to be able to specify what parts of the data
should not be managed through VM (with probably some refinement with respect
to cache management).  Whatever the replacement algorithm is, you will always
see some codes (and they are not so uncommon) that trample all over memory in
a way that was not predicted.  As an example: solving a large linear system
that does not fit in memory.  The standard algorithms only lead to page
thrashing.  A colleague of mine once tried a 1000x1000 system on a 1 Mword
205 using standard algorithms.  Result: turn around time approx. 9 hours.
CPU time 3 minutes? (probably less.)  There are algorithms that can do
better, but they need to know the page size and the replacement algorithm.
When applying such an algorithm we got it down to a turn around time of
3 minutes.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

colwell@pdx023.pdx023 (Robert Colwell) (05/03/91)

In article <1991May2.171755.18612@riacs.edu> lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:

   In principle, a superscalar implementation with a really smart vectorizing
   compiler can do as well as a vector machine.  In order to do so, however,
   it would need to be able to issue two uncached loads and an uncached
   store every cycle, as well as a multiply and an add (this latter has 
   been now on several machines).  Given the latency of non-cached memory,
   which is usually greater than 4 cycles, and may even be up to 32 cycles,
   this would require the system to keep a large number of pending loads
   and stores, as well as a large number of registers (how many depends on
   what the latency of main memory is).  I believe that vector instructions
   would actually prove to be *much* easier to implement than a CPU with 20+
   pending loads and stores, issuing five new instructions per CPU cycle...

You're talking micros here, right?  The machines we designed and built at
Multiflow (and occasionally SOLD) and Cydrome's machines did all this routinely.
Designing the CPU per se has a whole lot of different tradeoffs than more
conventional machines, but then so do vector machines.  I'd much rather design
another VLIW than a vector machine; much of the hard stuff gets shoved off to
the compiler (in the RISC style, don't you know).

Maybe you're assuming one couldn't or wouldn't want to build such a machine onto
a single chip (or small set of chips; comp.arch doesn't seem to uniformly
distinguish between these two possibilities).  For object code compatibility
reasons alone you might not want to make the SW/HW tradeoffs in exactly the same
way as we did at MFCI.  But on the other hand, in a world where folks have DOS
emulators running on RISCs, the rules are not always what they seem.

   |> The big problem is that the memory bandwidth required for vector FP is
   |> expensive and is not likely to contribute substantially to the non-FP
   |> performance.

   I agree that the memory subsystem is a major problem, but I am not sure
   that it is as bad as assumed above.

We all agree this is a major problem, then.  Supercomputers look sick on SPEC
benchmarks mostly because of this, as far as I can tell.  Those expensive
boatloads of fast RAMs don't help much once you have enough to feed the
benchmarks, but they still burn power and make the machine super-pricey.

   The difference here is that ideally you would want
   three or four load store units, and need to maintain cache coherence at
   the same time.

This is a big part of the problem.  You could consider not maintaining cache
coherence from all of the ports, but that is untried SW/HW territory, and the SW
folks will have to be resuscitated once you suggest it to them.  (It would sure
help the HW guys though.)  Hey, while we're at it, why don't we snoop a few
buses too, and build a multiprocessor out of this.  Somebody will want to,
that's a given.  

Good luck.  This is all possible, but the bad news is the obvious news: it'll
cost you design time, product cost, and performance.  Quite possibly more than
it's worth.

Bob Colwell  colwell@ichips.intel.com  503-696-4550
Intel Corp.  JF1-19
5200 NE Elam Young Parkway
Hillsboro, Oregon 97124

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (05/03/91)

>On 2 May 91 16:29:09 GMT, lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) said:

Hugh> I have only limited experience with the new, fast-only-in-cache,
Hugh> machines, but I have to say that the code you need to get
Hugh> optimum performance is even more non-intuitive than that for the
Hugh> older vector architecture machines.

That is absolutely true, and a serious problem for those of us
concerned about performance.  I found it very easy to optimize code
for the Cyber 205 and ETA-10 (if you use long vectors, then just write
simple DO loops, otherwise migrate to another platform!).  It is a
little bit harder on the Cray X&Y (mostly because there are so many
more codes that it is *possible* to optimize), and it is pretty
difficult on Cray 1, Cray 2, and cached killer micros.  Techniques
like "unroll outer loop and jam resulting inner loops" are great in
theory, but all too often I find that the various machines'
optimizers require *slightly* different code --- there is no one piece
of code (even a nice block-mode version) that optimizes well on a
broad range of scalar platforms....  Matrix multiply is a good example
of this: the code that is required to run at peak speeds on an RS/6000
does not run particularly well on a MIPS.  A slightly different style
of source optimization is required there....
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

hrubin@pop.stat.purdue.edu (Herman Rubin) (05/03/91)

In article <MCCALPIN.91May2095930@pereland.cms.udel.edu>, mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
> >On 2 May 91 01:54:10 GMT, lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) said:
> 
> Hugh> PROPHECY: One of these days, a single-chip microprocessor will
> Hugh> have vector instructions, and then the advantages and
> Hugh> disadvantages of various architectural decisions will be
> Hugh> discovered all over again.
> 
> I don't see much benefit to explicit vector instructions compared to
> tight loops with zero cycle branches (like the RS/6000).  They sure
> can eat up a lot of silicon space, though....
> 
> The big problem is that the memory bandwidth required for vector FP is
> expensive and is not likely to contribute substantially to the non-FP
> performance.  Without adequate memory bandwidth, there is not really
> any need for vector instructions, since the cpu is idle (waiting for
> cache refills) for plenty of time to do loop control....

It seems there are more operations than in your philosophy, John.

Several years ago, I did a large editing job on a file of physical random
numbers (the source file had some undesirable fixed zeros) on the CYBER 205.
Now I doubt that the manufacturers had this type of operation in mind.  The
process itself was mainly done in 7 sets of vector instructions, segmented
only because of length, each doing up to 2 reads and one write per pipe per
half cycle.  Thus the time, almost all vector time, was roughly 3.5 cycles,
divided by the number of pipes, per word output.

Now not all of the loads/stores would have been needed on a machine like
the RS/6000.  Assuming that there were at least 6, and preferably 7, large
pages of cache available, the process could be done with 5 reads, 2 writes,
6 operations, and a nasty storage problem, which would have added about 2
operations per item.  Of course there would have to be added the loop control
operations.  Even though the vector machine did roughly 12 loads and 7 stores
per item, the time was certainly much less than for comparable speed hardware
on a scalar machine.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (05/03/91)

In article <1991May2.171755.18612@riacs.edu> lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:

|                                       I believe that vector instructions
| would actually prove to be *much* easier to implement than a CPU with 20+
| pending loads and stores, issuing five new instructions per CPU cycle...

  Consider for a moment a smart cache which does prefetch under
condition {X}. Perhaps as simple as prefetching the next row whenever
the last word (defined as size of the current datafetch) is fetched from
a row and the previous word has also been accessed. This requires an
accessed flag as well as the usual dirty flag, but is not inherently
something hard to do in the cache control. This works nicely for
instruction fetches, too. This could be the next to last word if latency
requires. Obviously there's a tradeoff between slowing the CPU and using
memory bandwidth to fetch data which are not used.
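
  As a sketch in C (field names, line size, and the hook for starting the
fetch are all made up; real cache control is logic, not code):

	/* The heuristic above: when the last word of a line is fetched
	 * and the previous word was also accessed, the line is being
	 * walked sequentially, so start fetching the next line now. */
	#define WORDS_PER_LINE 8

	struct line {
	    unsigned accessed;          /* one "accessed" bit per word */
	    int dirty;                  /* the usual dirty flag */
	};

	static void note_fetch(struct line *l, int word,
	                       void (*prefetch_next)(void))
	{
	    l->accessed |= 1u << word;
	    if (word == WORDS_PER_LINE - 1 &&
	        (l->accessed & (1u << (WORDS_PER_LINE - 2))))
	        prefetch_next();        /* fetch the next line early */
	}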

  A "preload cache" instruction or bit to change the cache state to the
above behavior are other possibilities.

  A vector unit is an interesting coprocessor for a system, and could be
implemented to allow multiple units to be detected and used by the CPU.
That would make parallel vector processing flexibly extensible.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

mac@kpc.com (Mike McNamara) (05/05/91)

In article <1991May2.015410.1470@news.arc.nasa.gov> lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:

>The fact that 16 ARs are not enough is an amusing criticism.  True, but
>what superscalar machine of today can map 16*64KW*8Bytes/Word  = 8 MegaBytes
>in its TLB?  The ETA-10 had even bigger large pages...

	The Ardent (now Stardent) Titan P3 maps 256 MBytes in its vector TLB.
>
>PROPHECY:  One of these days, a single-chip microprocessor will have vector
>instructions, and then the advantages and disadvantages of various
>architectural decisions will be discovered all over again.

	The tradeoff to be decided is when the 1 million gates of a vector
unit aren't better spent on more D-cache.  I think that once you get 32K
or perhaps 64K of D-cache on chip, it's worth it to spend your next million
gates on a vector unit.  I think you will see such a chip in 1994. (But then, 
what do I know?)

>
>  Hugh LaMaster, M/S 233-9,  UUCP:                ames!lamaster
>  NASA Ames Research Center  Internet:            lamaster@ames.arc.nasa.gov
>  Moffett Field, CA 94035    With Good Mailer:    lamaster@george.arc.nasa.gov 
>  Phone:  415/604-6117                            #include <std.disclaimer> 


-- 
+-----------+-----------------------------------------------------------------+
|mac@kpc.com| Increasing Software complexity lets us sell Mainframes as       |
|           | personal computers. Carry on, X windows/Postscript/emacs/CASE!! |
+-----------+-----------------------------------------------------------------+

martelli@cadlab.sublink.ORG (Alex Martelli) (05/05/91)

lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:
	...
:I have only limited experience with the new, fast-only-in-cache, machines,
:but I have to say that the code you need to get optimum performance is
:even more non-intuitive than that for the older vector architecture machines.
:Even worse, code which was previously optimal for vector machines, and which
:was OK on a wide variety of other machines, is now pessimal for these machines.

Not really so new - I was optimizing codes for the cache in '87 for an IBM
3090 with VF... ok, there ARE problems (the curve of leading dimension of
array versus megaflops bounces up and down wildly and unpredictably for many,
many 'normal' patterns of memory access - FAR from intuitive!), but I
believe there is by now enough experience in the high-performance-Fortran
crowd to have accounted for these effects - a guy from NAG, for example, was
working on matrix-block-oriented implementations of much BLAS stuff, and
already in '87 he was getting essentially the same performance as hand-coded
assembler ESSL routines... so I'm pretty sure the optimization techniques are
by now under numerical-programming-specialists' belts!  Whether the average
Fortran-using scientist will ever digest them is another matter, but then,
let's count this as just more fuel to the fire for the argument that anybody
who is not using specialist-coded subroutine libraries for as large a part of
his/her code as possible is probably going about it the wrong way!

martelli@cadlab.sublink.ORG (Alex Martelli) (05/05/91)

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
	...
:theory, but all too often I find that the various machines'
:optimizers require *slightly* different code --- there is no one piece
:of code (even a nice block-mode version) that optimizes well on a
:broad range of scalar platforms....  Matrix multiply is a good example

Yes, I do agree with that - which speaks well for Dan Bernstein's idea
of having a language construct to say to the compiler: here are 2/3/N
different implementations of the SAME programming semantics, now please 
choose the one that's fastest on THIS machine!  This way we would still have
to do the hand-tweaking initially, but once our code performs well on, say,
half a dozen platforms, we stand a far better chance of being able to just
compile and run fast on any new platform... and this holds not only for
numerical codes, but for much bread and butter stuff as well, e.g. an
explicit 'strcpy(a,b);' versus 'while(*a++=*b++);' where some machines and
compilers might be able to inline the call, and others might not, just to
give a trivial example.
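
To make the trivial example concrete, the two bodies would be something like
(a sketch):

	#include <string.h>

	/* Two implementations of the same semantics; under the proposed
	 * construct the compiler would keep whichever is faster here. */
	void copy_a(char *a, const char *b)
	{
	    strcpy(a, b);               /* library call, maybe inlined */
	}

	void copy_b(char *a, const char *b)
	{
	    while ((*a++ = *b++) != '\0')
	        ;                       /* the explicit loop */
	}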

dhinds@elaine18.Stanford.EDU (David Hinds) (05/07/91)

In article <819@cadlab.sublink.ORG> martelli@cadlab.sublink.ORG (Alex Martelli) writes:
>lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:
>	...
>:I have only limited experience with the new, fast-only-in-cache, machines,
>:but I have to say that the code you need to get optimum performance is
>:even more non-intuitive than that for the older vector architecture machines.
>:Even worse, code which was previously optimal for vector machines, and which
>:was OK on a wide variety of other machines, is now pessimal for these machines.
>
>Not really so new - I was optimizing codes for the cache in '87 for an IBM
>3090 with VF... ok, there ARE problems (the curve of leading dimension of
>array versus megaflops bounces up and down wildly and unpredictably for many,
>many 'normal' patterns of memory access - FAR from intuitive!)...

You're still a long way off.  My *father* was optimizing Fortran matrix codes
to exploit the cache on the IBM 370/195, in the (guess?) mid-70's.  On that
machine, correct loop ordering and blocking to fit the cache produced
something like a 20-fold speed improvement on matrix multiply, without any
need for assembly
language.  This was quite a fast machine for its time, as I understand.  But
what seemed obvious at the time seems to have been rediscovered with great
fanfare several times since then.  The IBM RS-6000 also has a lot in common
with the 195 architecture, apparently.
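
The transformation itself is easy to state; in C it is just loop ordering
plus tiling (a sketch -- NB is a made-up tile size, picked so a few tiles
fit in the cache):

	#define N  512
	#define NB 32    /* must divide N; tune so ~3 NBxNB tiles fit in cache */

	/* Blocked matrix multiply: same arithmetic as the naive triple
	 * loop, but each tile is reused while it is still cache-resident.
	 * Assumes c[][] starts zeroed. */
	void matmul_blocked(double c[N][N], double a[N][N], double b[N][N])
	{
	    int i0, k0, j0, i, k, j;
	    for (i0 = 0; i0 < N; i0 += NB)
	        for (k0 = 0; k0 < N; k0 += NB)
	            for (j0 = 0; j0 < N; j0 += NB)
	                for (i = i0; i < i0 + NB; i++)
	                    for (k = k0; k < k0 + NB; k++)
	                        for (j = j0; j < j0 + NB; j++)
	                            c[i][j] += a[i][k] * b[k][j];
	}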

 -David Hinds
  dhinds@cb-iris.stanford.edu

csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/07/91)

lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:
>	...
>:I have only limited experience with the new, fast-only-in-cache, machines,
>:but I have to say that the code you need to get optimum performance is
>:even more non-intuitive than that for the older vector architecture machines.
>:Even worse, code which was previously optimal for vector machines, and which
>:was OK on a wide variety of other machines, is now pessimal for these machines

I'm a little puzzled by the discussions involving vector vs. risc s-scalar.
Given similar hardware, and an appropriate (vectorizable) algorithm
the vector method should always be much faster. Risc s-scalar machines
are still essentially SISD. Also, is a true vector machine using RISC
hardware likely to emerge soon?
Hope my ignorance isn't too obvious.

-- 
Rowan Hughes                                James Cook University
Marine Modelling Unit                       Townsville, Australia.
Dept. Civil and Systems Engineering         csrdh@marlin.jcu.edu.au

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (05/07/91)

>> On 7 May 91 06:15:00 GMT, csrdh@marlin.jcu.edu.au (Rowan Hughes) said:

Rowan> I'm a little puzzled by the discussions involving vector vs.
Rowan> risc s-scalar.  Given similar hardware, and an appropriate
Rowan> (vectorizable) algorithm the vector method should always be
Rowan> much faster. 

I am not sure exactly what question you are implying in this
statement.  If you are just saying that a vectorizable algorithm will
run faster if you vectorize it, then I agree.  However, it is often
the case that non-vectorizable algorithms on fast scalar machines can
outperform a vector algorithm for the same problem on a vector machine
of similar technology.  

The primary difference is usually computational complexity --- for
example, Gaussian elimination for tridiagonal matrices requires O(N)
work and is not vectorizable, while Cyclic reduction requires O(NlogN)
work and is vectorizable.  The relative performance of the algorithms
is thus a balance between the extra work required by the vector
algorithm and the extra performance of the vector hardware.

A secondary difference concerns memory bandwidth.  Most of the
machines that we have been discussing have insufficient memory
bandwidth to run long vector operations at full speed.  Thus,
algorithms that avoid excess memory accesses (like the inner product
algorithm for matrix multiplies) will run faster than an algorithm of
the same computational complexity that uses a standard "vector"
approach (like the SAXPY in the inner loop of the outer product
algorithm for matrix multiplies).
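
In loop form the contrast is just this (a sketch; both versions do the same
O(N^3) arithmetic):

	#define N 200

	/* Inner-product form: the accumulator lives in a register, so the
	 * inner loop costs 2 loads per multiply-add, with N^2 stores total. */
	void mm_inner(double c[N][N], double a[N][N], double b[N][N])
	{
	    int i, j, k;
	    for (i = 0; i < N; i++)
	        for (j = 0; j < N; j++) {
	            double s = 0.0;
	            for (k = 0; k < N; k++)
	                s += a[i][k] * b[k][j];
	            c[i][j] = s;
	        }
	}

	/* Outer-product form, SAXPY in the inner loop: c[i][] streams
	 * through memory on every trip, so each multiply-add costs an
	 * extra load AND a store.  Assumes c[][] starts zeroed. */
	void mm_saxpy(double c[N][N], double a[N][N], double b[N][N])
	{
	    int i, j, k;
	    for (i = 0; i < N; i++)
	        for (k = 0; k < N; k++)
	            for (j = 0; j < N; j++)
	                c[i][j] += a[i][k] * b[k][j];
	}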

Rowan> Risc s-scalar machines are still essentially SISD.
Rowan> Also, is a true vector machine using RISC hardware likely to
Rowan> emerge soon?  Hope my ignorance isn't too obvious.

Vector instructions are also essentially SISD at the hardware level.
When you execute a vector instruction on a vector machine, it is doing
(almost) exactly the same thing as a Killer Micro running a tight loop
feeding data into a pipelined FPU.  

To extend a Killer Micro to the functionality of a Cray Y/MP will
still require quite a bit of work.  The ability of the Cray to handle
2 vector loads, one vector store, one vector add, and one vector
multiply simultaneously does not easily fit into the RISC paradigm,
since it depends on the existence of multi-cycle instructions.
Superscalar does not seem exactly the way to go, unless the load-store
units are made independent of the integer and float units.  To
reproduce the functionality of the Cray Y/MP seems closer to VLIW than
most of the RISC approaches in use now....
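
The canonical case is the linked triad; each trip of this loop needs exactly
the five operations listed above, which the Y/MP overlaps per clock:

	/* Linked triad: per element, two loads (y[i] and x[i]), one
	 * multiply, one add, and one store. */
	void triad(double y[], const double x[], double a, long n)
	{
	    long i;
	    for (i = 0; i < n; i++)
	        y[i] = y[i] + a * x[i];
	}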
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

hrubin@pop.stat.purdue.edu (Herman Rubin) (05/07/91)

In article <820@cadlab.sublink.ORG>, martelli@cadlab.sublink.ORG (Alex Martelli) writes:
> mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
 	...
> :theory, but all too often I find that the various machines'
> :optimizers require *slightly* different code --- there is no one piece
> :of code (even a nice block-mode version) that optimizes well on a
> :broad range of scalar platforms....  Matrix multiply is a good example
 
> Yes, I do agree with that - which speaks well for Dan Bernstein's idea
> of having a language construct to say to the compiler: here are 2/3/N
> different implementations of the SAME programming semantics, now please 
> choose the one that's fastest on THIS machine!  This way we would still have
> to do the hand-tweaking initially, but once our code performs well on, say,
> half a dozen platforms, we stand a far better chance of being able to just
> compile and run fast on any new platform... and this holds not only for
> numerical codes, but for much bread and butter stuff as well, e.g. an
> explicit 'strcpy(a,b);' versus 'while(*a++=*b++);' where some machines and
> compilers might be able to inline the call, and others might not, just to
> give a trivial example.

Now how do we get the idea across to the computer people?  One does not
need something as complicated as matrix multiply; vector add is enough.

The idea that there is a good language, and all the information the user
needs to supply to the compiler is made available by machine-independent
programming in that language, is what is debilitating.

The user who understands the capabilities of the machine instructions is
also likely to think of these things.  This includes the addition of 
"primitive" operations to the language.  It is true that any operation
on any computer can be emulated in any of the present languages, but not
necessarily well.

Even better than inlining is making operations intrinsic.  This is effectively doing
a macro expansion of the instruction into machine primitives, with the
full capabilities of register assignment, etc., so that full optimization
can be carried out.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (05/07/91)

Rowan Hughes writes:
>I'm a little puzzled by the discussions involving vector vs. risc s-scalar.

Vector machines are not really SIMD;  they have parallelization through
the pipeline, but the bottom line is that they do not process a whole
vector "at once".  The pipeline only means that each pipe spits out
a float result each cycle (or two, in the case of a mul-add).  Superscalar
machines can also produce a result per cycle.

What I am confused about is why superscalar machines aren't seen as
clearly superseding vector architectures.  Like vector architectures,
they use instruction overlap to produce a result (or two) per cycle.
The difference is that the compiler or the programmer must
arrange things so that the proper overlap is possible, whereas with
the vector machines you just issue a single vector instruction, i.e.
the particular kind of instruction overlap is hard-wired into the
silicon (or GaAs).  That would seem to make vector architectures
clearly less versatile than superscalar.  Perhaps the restrictions
make the hardware easier to build; I imagine they certainly make
the compilers easier to write, but progress seems to have been
pretty good on RISC compilers.  A notable exception: none of 
them I have ever seen will automatically do things like strip
mining and unroll-and-jam for you.
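
For readers unfamiliar with the transformation, here is unroll-and-jam
done by hand in C on a matrix-vector multiply; N is a placeholder size
and is assumed even for brevity:

#define N 300

/* The outer (i) loop is unrolled by two and the copies jammed into
 * one inner loop, so each load of x[j] now feeds two multiply-adds. */
void matvec(double y[N], double a[N][N], double x[N])
{
    int i, j;
    for (i = 0; i < N; i += 2) {
        double s0 = 0.0, s1 = 0.0;
        for (j = 0; j < N; j++) {
            s0 += a[i][j]   * x[j];    /* x[j] loaded once ...  */
            s1 += a[i+1][j] * x[j];    /* ... and reused here   */
        }
        y[i]   = s0;
        y[i+1] = s1;
    }
}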

As far as I can see, insofar as current vector computers have
some advantages over superscalar, the performance differences have
more to do with memory bandwidth than processor architecture.  I'd
be happy to hear other comments on this, though.

preston@ariel.rice.edu (Preston Briggs) (05/07/91)

csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:

>I'm a little puzzled by the discussions involving vector vs. risc s-scalar.
>Given similar hardware, and an appropriate (vectorizable) algorithm
>the vector method should always be much faster.

I don't see why.  Currently it's true, but current vector machines
involve a lot more hardware $$ than current superscalars.
Superscalars can run vector code fast, plus they can run
lots of non-vector code fast.  I'd say they supersede vector
machines (or will soon).

Marty Hopkins made similar comments at ASPLOS recently.
So I guess I don't expect to see new vector machines with RISC techniques.
Instead, I expect to see superscalars with better compilers.

Preston Briggs

gary@chpc.utexas.edu (Gary Smith) (05/08/91)

In article <1991May7.150724.18806@midway.uchicago.edu>, rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
|> Rowan Hughes writes:
|> >I'm a little puzzled by the discussions involving vector vs. risc s-scalar.
|> 
|> Vector machines are not really SIMD;  they have parallelization through
|> the pipeline, but the bottom line is that they do not process a whole
|> vector "at once".  The pipeline only means that each pipe spits out
|> a float result each cycle (or two, in the case of a mul-add).  Super
|> scalar machines can also produce a result per cycle.
|> 
|> What I am confused about is why superscalar machines aren't seen as
|> clearly superseding vector architectures.  Like vector architectures,
|> they use instruction overlap to produce a result (or two) per cycle.
|> The difference is that the compiler or the programmer must
|> arrange things so that the proper overlap is possible, whereas with
|> the vector machines you just issue a single vector instruction, i.e.
|> the particular kind of instruction overlap is hard-wired into the
|> silicon (or GaAs).  That would seem to make vector architectures
|> clearly less versatile than superscalar.  Perhaps the restrictions
|> make the hardware easier to build; I imagine they certainly make
|> the compilers easier to write, but progress seems to have been
|> pretty good on RISC compilers.  A notable exception: none of 
|> them I have ever seen will automatically do things like strip
|> mining and unroll-and-jam for you.
|> 
|> As far as I can see, insofar as current vector computers have
|> some advantages over superscalar, the performance differences have
|> more to do with memory bandwidth than processor architecture.  I'd
|> be happy to hear other comments on this, though.
I believe Raymond answers his own question (What I am confused
about...) posed in paragraph 2 in the following paragraph (...the
performance differences have to do with memory bandwidth...).

What is the most fundamental reason that organizations are willing
to spend $20+ million for a Y-MP?  I propose the answer is simply
sustainable memory bandwidth of 42.7 gigabytes per second.  Where
are the RISC machines (other than the CRAY) that can provide this?
What would it cost for their designers to provide it?
-- 
Randolph Gary Smith                       Internet: gary@chpc.utexas.edu
Systems Group                             Phonenet: (512) 471-2411
Center for High Performance Computing     Snailnet: 10100 Burnet Road
The University of Texas System                      Austin, Texas 78758-4497

maverick@cork.Berkeley.EDU (Vance Maverick) (05/08/91)

In article <11996@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
|> In article <820@cadlab.sublink.ORG>, martelli@cadlab.sublink.ORG (Alex Martelli) writes:
|> > Yes, I do agree with that - which speaks well for Dan Bernstein's idea
|> > of having a language construct to say to the compiler: here are 2/3/N
|> > different implementations of the SAME programming semantics, now please 
|> > choose the one that's fastest on THIS machine!
|> 
|> Now how do we get the idea across to the computer people?  One does not
|> need something as complicated as matrix multiply; vector add is enough.

Dain Samples, of the CS Division here, soon of the University of Cincinnati, did just this work for his thesis.  I can hunt up a reference for you if you like, or you might try asking Dain (samples@cs.berkeley.edu).

lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) (05/08/91)

In article <1991May7.052417.10606@leland.Stanford.EDU>, dhinds@elaine18.Stanford.EDU (David Hinds) writes:
|> In article <819@cadlab.sublink.ORG> martelli@cadlab.sublink.ORG (Alex Martelli) writes:
|> >lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:
|> >	...
|> >:Even worse, code which was previously optimal for vector machines, and which
|> >:was OK on a wide variety of other machines, is now pessimal for these machines.
|> 
|> >Not really so new - I was optimizing codes for the cache in '87 for an IBM
|> >3090 with VF... ok, there ARE problems (the curve of leading dimension of
|> >array versus megaflops)
|> 
|> You're still a long way off.  My *father* was optimizing Fortran matrix codes
|> to exploit the cache on the IBM 370/195, in the (guess?) mid-70's.  On that

Both posters have essentially the same point, and this point is well taken.
Machines with cache (and other locality-friendly) devices have been around a *long*
time.  Even the 360/67 got a boost from code rearrangement, due to the DAT box
(Dynamic Address Translation == "TLB", sort of) overhead if you accessed arrays
the wrong way.  On the new RISCs, the effect is extremely strong.  Combined with
some of the vector-ish features of these machines, optimal codes can look like a hybrid
of the cache and vector techniques, which makes them rather non-intuitive.
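
As a sketch of what such a cache/vector hybrid looks like, here is a
cache-blocked matrix multiply in C; the tile size B is a tunable
placeholder, and the result array is assumed zeroed by the caller:

#define N 300
#define B 50    /* tile size; tune to the cache */

/* The jj/kk loops strip-mine the problem so each tile stays cache-
 * resident, while the innermost j loop remains a stride-1 sweep a
 * vector unit (or a superscalar pipeline) can stream through.
 * c[][] must be zeroed by the caller. */
void matmul(double c[N][N], double a[N][N], double b[N][N])
{
    int i, j, k, jj, kk;
    for (jj = 0; jj < N; jj += B)
        for (kk = 0; kk < N; kk += B)
            for (i = 0; i < N; i++)
                for (k = kk; k < kk + B && k < N; k++) {
                    double r = a[i][k];
                    for (j = jj; j < jj + B && j < N; j++)
                        c[i][j] += r * b[k][j];
                }
}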

I agree that this is nothing new.  The major problem of all computer architects from
the beginning is where to put the bandwidth.  The new RISC-with-fast-cache machines
have properties somewhat like a minicomputer with an attached array processor.  If
your problem is well suited to this, you can get phenomenal speedups very cheaply.  If
your problem does not have such locality, but is still vectorizable, a 
"vector supercomputer" architecture may be a better approach.

What I am really arguing in favor of is a machine which combines both.  There is no
reason why you can't have a machine with both a superscalar CPU driven mainly off
cache, and a vector load-store architecture that can access secondary memory 
directly.  Then, you get the best of both worlds.  The question is when will this
be done on a microprocessor?

As for the sometimes-heard statement that "superscalar makes vector
obsolete": it *could*, just as a very fast Turing machine
could also.  In order to actually *do it*, however, the load/store architecture
will have to be expanded considerably.  No one has yet succeeded in getting
that much concurrency going in a superscalar machine.  But, I wouldn't argue
that it couldn't be done.  In fact, I would like to see it.

In answer to the other criticism, that VLIW machines make vector obsolete,
I agree.  The Multiflow architecture could have potentially made "vector"
machines obsolete.  In fact, it is really too bad that they went out of
business.  Someone ought to be working on a single chip VLIW, if they aren't
already.  But, I haven't heard of anyone.  In many ways, VLIW seems to be 
a simpler and more general form of "vectorization".



|>  -David Hinds
|>   dhinds@cb-iris.stanford.edu

-- 
  Hugh LaMaster, M/S 233-9,  UUCP:                ames!lamaster
  NASA Ames Research Center  Internet:            lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035    With Good Mailer:    lamaster@george.arc.nasa.gov 
  Phone:  415/604-6117                            #include <std.disclaimer> 

jbuck@janus.Berkeley.EDU (Joe Buck) (05/08/91)

In article <1991May7.061500.7485@marlin.jcu.edu.au> csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:
>I'm a little puzzled by the discussions involving vector vs. risc s-scalar.
>Given similar hardware, and an appropriate (vectorizable) algorithm
>the vector method should always be much faster. Risc s-scalar machines
>are still essentially SISD. Also, is a true vector machine using risc
>hardware likely to emerge soon?
>Hope my ignorance isn't too obvious.

The reason your analysis fails is pipelining.  In a
pipelined processor (read: any modern CPU whether RISC or CISC), instruction
fetch of the next instruction overlaps execution of the previous instruction.
As a result, for many RISC machines, if the data are entirely in cache,
the time to do, say, an FP dot product, consists only of the time required
by the multiplier and adder (plus a startup cost).  The time to fetch the
instructions and to fetch the data can be completely overlapped with the
floating point computation.  In addition, there is usually a separate
instruction cache and data cache, so that the fetching of instructions
happens in parallel with the fetching of data.

On a machine like the RS-6000 there is a separate branch processor, so
even the cost of the branches goes away.  The result of all this is
that there are very few problems where vector machines present any advantages
any more; once your inner loop is in the instruction cache, the cost of
fetching the instruction stream disappears.
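
A C version of the dot product makes the point; writing it with two
independent accumulators (a common hand transformation, not anything
specific to the RS-6000) breaks the serial dependence on a single sum
so the adder pipeline can stay full:

double ddot(const double *x, const double *y, int n)
{
    double s0 = 0.0, s1 = 0.0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += x[i]   * y[i];      /* the two sums are independent, */
        s1 += x[i+1] * y[i+1];    /* so their adds can overlap     */
    }
    if (i < n)                    /* odd element, if any */
        s0 += x[i] * y[i];
    return s0 + s1;
}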

--
Joe Buck
jbuck@janus.berkeley.edu	 {uunet,ucbvax}!janus.berkeley.edu!jbuck	

dik@cwi.nl (Dik T. Winter) (05/08/91)

In article <1991May7.150724.18806@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
 >                    The pipeline only means that each pipe spits out
 > a float result each cycle (or two, in the case of a mul-add).  Super
 > scalar machines can also produce a result per cycle.
See below.
 > 
 > What I am confused about is why superscalar machines aren't seen as
 > clearly superseding vector architectures.  Like vector architectures,
 > they use instruction overlap to produce a result (or two) per cycle.
As noted, vector hardware is simpler to design than super-scalar. That is
why we see super-vector.  The 205 could have up to 4 pipes, producing
4 results (8 floating ops) each cycle.  The same for the NEC SX-3.  It will
take some time before super-scalar can do that.  (And consider that the
clock of the SX-3 runs at 2.6 nsec = 385 MHz, and that it can be configured
as a multi-processor in the future.)

 > The difference is that the compiler or the programmer must
 > arrange things so that the proper overlap is possible, whereas with
 > the vector machines you just issue a single vector instruction, i.e.
 > the particular kind of instruction overlap is hard-wired into the
 > silicon (or GaAs).
Still silicon, and NEC thinks they are able to push silicon still further;
at least, as far as I know, they are not yet thinking about GaAs.

 >                     That would seem to make vector architectures
 > clearly less versatile than superscalar.
True enough.  But there is no reason that a super-vector machine could not
be super-scalar too, which would give it a good boost in performance for
the (still important) scalar part.  (I have seen only one program where,
after full vectorization, more than 80% of the time was still spent in
vector operations.)

 >                                A notable exception: none of 
 > them I have ever seen will automatically do things like strip
 > mining and unroll-and-jam for you.
The Alliant compiler for the i860 does.
 > 
 > As far as I can see, insofar as current vector computers have
 > some advantages over superscalar, the performance differences have
 > more to do with memory bandwidth than processor architecture.  I'd
 > be happy to hear other comments on this, though.
As noted, memory bandwidth is not the only factor.  But it is of course
an important factor.  I do not know the bus width from memory to CPU
on the SX-3, but on the 205 it is lots of bits (the basic piece of information
going from memory to the CPU was a 'super-word' of 1024 bits).  Other
important factors are memory size (256 Mwords of 64 bits, or more) and disk I/O.
And of course: no cache, thank you.

What bothers me is that the super-scalar machines I know (i860 and RS6000)
go away from some basic (RISCy) principles to get their f-p performance.
The i860 operations are sufficiently strange that you need to know
everything about memory access times etc. to get good performance.  If
you do not know that, your performance will be mediocre at best.  I tried
it, I got reasonable performance, but now that I have more specific
knowledge about memory on the machine in question, I know that I ought to
have coded completely differently.  The RS6000 is simpler in some ways (no
visible pipelines as on the i860), but on the other hand more difficult:
you have to know exact timing information for the instructions to get the
pipeline going.  And I do not think this information will remain the same
with future models.

But the biggest problem with getting vector performance from super-scalar
machines is the limited number of registers: 32 fp registers on both the
i860 and the RS6000.
You need to issue your loads in advance and you need to issue your stores
delayed to get performance.  On the i860 it is extremely difficult to
allocate your registers such that you would have no interference.  On the
RS6000 register renaming helps a bit (39 rather than 32 registers), but
also there full speed loops require extremely careful allocation (and I have
a suspicion that register renaming only makes it less visible).  Compare
that to the Cray (8 vector registers of 64 elements) and the SX-3 (for the
SX-2 it was 32 vector registers of 256 elements, I expect about the same on
the SX-3).  I feel already a bit cramped on the Cray.
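
In miniature, the load-ahead/store-behind discipline looks like this
in C (a sketch only; a real i860 or RS6000 loop would be unrolled much
further, which is exactly where the 32 registers run out):

/* Scale a vector: the load for iteration i+1 is issued while
 * iteration i's multiply is still in flight. */
void dscale(double *y, const double *x, double a, int n)
{
    double cur, nxt;
    int i;
    if (n <= 0)
        return;
    nxt = x[0];                   /* prime the pipeline */
    for (i = 0; i < n - 1; i++) {
        cur = nxt;
        nxt = x[i + 1];           /* load ahead */
        y[i] = a * cur;
    }
    y[n - 1] = a * nxt;           /* drain */
}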

But rest assured.  The results will be more correct on your garden variety
micro.  F-p precision on those supers is nothing to write home about.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/08/91)

In <1991May7.150724.18806@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
...
>What I am confused about is why superscalar machines aren't seen as
>clearly superseding vector architectures.

I was confused as to why risc is pitted against vector. Surely each
is suited to very different algorithms. I don't see why there
needs to be some sort of comparison.  It would be a better idea
to draw on each other's strengths, hence the question about a risc-
vector machine. For current vector machines the bottlenecks are the
scalar unit and the load/store pipes. Risc machines clearly don't
do so well at vectors (they're "cache busters"). A machine with a
risc-style scalar unit and vector arithmetic/memory pipes would have
the best of both worlds. This would give the programmer much greater
flexibility; i.e., good optimization could be had for both recursive
and vector code. It would also take the "mother of all compilers"
to bring it all together.

-- 
Rowan Hughes                                James Cook University
Marine Modelling Unit                       Townsville, Australia.
Dept. Civil and Systems Engineering         csrdh@marlin.jcu.edu.au

sef@kithrup.COM (Sean Eric Fagan) (05/08/91)

In article <1991May7.061500.7485@marlin.jcu.edu.au> csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:
>I'm a little puzzled by the discussions involving vector vs. risc s-scalar.
>Given similar hardware, and an appropriate (vectorizable) algorithm
>the vector method should always be much faster. Risc s-scalar machines
>are still essentially SISD.

So are the majority of vector machines.  They operate memory-to-memory, and
don't have a separate functional unit for each potential element in the
vector.

Also, note that even the Cray is still SISD:  the multiply, recip., and
square root all have to go through the limited number of FU's one at a time
for up to 64 elements in the vector.

As to why the RISC's might be faster:  more registers (does help), better
memory systems (those caches can help, you know), and a quicker design time.

The vector machines, and especially Cray, have an advantage *now*.  Will
they keep it?  Pretty unlikely, I'm afraid...

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

peter@ficc.ferranti.com (peter da silva) (05/08/91)

In article <11996@mentor.cc.purdue.edu>, hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> The idea that there is a good language, and all the information the user
> needs to supply to the compiler is made available by machine-independent
> programming in that language, is what is debilitating.

The idea that such an idea exists is an illusion.

|> Even more useful than inlining is making operations intrinsic.  This is effectively doing
> a macro expansion of the instruction into machine primitives, with the
> full capabilities of register assignment, etc., so that full optimization
> can be carried out.

You mean like in G++? I admit that C++ isn't the best of all possible
languages, but the combination of inlining, operator definitions, and
GCC's inline assembler (which appears to allow register reassignment)
is pretty close to what you want.

Just don't copy Stroustrup's coding style.
-- 
Peter da Silva.  `-_-'  peter@ferranti.com
+1 713 274 5180.  'U`  "Have you hugged your wolf today?"

krolnik@convex.com (Adam Krolnik) (05/09/91)

There is a problem that no one seems to be discussing about implementing
super-scalar techniques: they all cost a lot of hardware.  Unless you intend
from the beginning for the architecture to have a VLIW view, the processor has
to perform dynamic instruction selection and issue.  This front end of
a superscalar processor is very costly in gates and time.  True, there are
superscalar boxes out there (HP-PA and RS6000) that are fast.  But the
superscalar techniques made them large; they are not single-chip
implementations.  And if they were, the gates they consume could be used to
increase performance by allocating them to other functions.  Superscalar gains
are a factor of 2-3, maybe a little larger, but not close to an order of
magnitude.  Order-of-magnitude gains are achieved by circuit technology
(cycle time) and memory bandwidth.


      Adam Krolnik
      Design Verification  (214)-497-4578
      Convex Computer Corp. Richardson Tx, 75080
	Disclaimer:   What?! Lawyers don't read this stuff do they?

eachus@largo.mitre.org (Robert I. Eachus) (05/09/91)

In article <11996@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:

   The idea that there is a good language, and all the information the user
   needs to supply to the compiler is made available by machine-independent
   programming in that language, is what is debilitating.

-- No, the problem is with programmers who say: "We have always had to
-- twist the bits, and we will always need to be able to twist the
-- bits." A decade ago, we made it possible to write portable Ada code
-- which runs efficiently on many different classes of high-
-- performance architectures.  This requires that programmers be
-- willing to write X, Y, Z: Matrix(1..300,1..300);...X := Y*Z; and
-- allow the compiler (or library author) to deal with the correct
-- implementation for this architecture.  But we still get programmers
-- writing all over the place for I in 1..300 loop... which can't be
-- optimized in most cases.  (In Ada you have to be worried about
-- exception semantics. In X := Y*Z; you get the "right" answer or you
-- get an exception with X untouched...Easy to do fast on different
-- architectures, although you may have to switch pointers or do an
-- "extra" copy of the final result.  Most compilers can and do
-- recognize this case easily (and do the pointer move) if the type
-- Matrix is declared correctly.  But for the nested loops, when an
-- exception occurs, X must have the "right" junk in it. Yeech!)

   The user who understands the capabilities of the machine instructions is
   also likely to think of these things.  This includes the addition of 
   "primitive" operations to the language.  It is true that any operation
   on any computer can be emulated in any of the present languages, but not
   necessarily well.

-- I've always wanted to have a next set bit operation as a primitive.
-- Where the hardware has the necessary instructions, and given an
-- extensible language, this is not a problem.  (Although finding the
-- correct instruction is often the hardest part of the job.  On
-- Multics, and some other architectures you do this with a translate
-- and test instruction intended for fast character string operations.)
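
-- For concreteness, here is a portable C version of that primitive,
-- scanning a bit at a time; where the compiler can treat it as an
-- intrinsic (or where something like the Unix ffs() routine maps to a
-- hardware scan), the loop collapses to an instruction or two:

unsigned long next_set_bit(const unsigned long *bits,
                           unsigned long nbits, unsigned long start)
{
    unsigned long i, w = 8 * sizeof *bits;   /* bits per word */
    for (i = start; i < nbits; i++)
        if (bits[i / w] & (1UL << (i % w)))
            return i;
    return nbits;                            /* not found */
}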

   Even more useful than inlining is making operations intrinsic.  This is effectively doing
   a macro expansion of the instruction into machine primitives, with the
   full capabilities of register assignment, etc., so that full optimization
   can be carried out.

-- This is elevating a slight difference of implementation to an
-- architectural level.  Good optimizers will optimize inlined code,
-- and good languages will allow writing low level code inserts in a
-- register independent manner.  For example, the DDC-I Ada compiler
-- used to (and may still) allow code inserts in a stack-oriented
-- intermediate language which was the input to the optimizer.  These,
-- when inlined, were no different from intrinsics, except they could
-- be written by users.

--

					Robert I. Eachus

with STANDARD_DISCLAIMER;
use  STANDARD_DISCLAIMER;
function MESSAGE (TEXT: in CLEVER_IDEAS) return BETTER_IDEAS is...

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (05/12/91)

In article <1991May7.195913.27363@riacs.edu> 
	lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) writes:
>Someone ought to be working on a single chip VLIW, if they aren't
>already.  But, I haven't heard of anyone.  In many ways, VLIW seems to be 
>a simpler and more general form of "vectorization".

As of a year ago, Philips/Signetics had a VLIW chip. 
-- 
Don		D.C.Lindsay 	Carnegie Mellon Robotics Institute

peter@ficc.ferranti.com (peter da silva) (05/13/91)

In article <1991Apr27.012655.6508@rice.edu>, preston@ariel.rice.edu (Preston Briggs) writes:
> peter@ficc.ferranti.com (peter da silva) writes:
> >Ah, the old Von Neumann bottleneck. Time to apply RISC design techniques to
> >memory subsystems.

> Well, you _could_ rediscover the Connection Machine.

Is it reasonable to think about scaling the CM down to a desktop workstation?
-- 
Peter da Silva; Ferranti International Controls Corporation; +1 713 274 5180;
Sugar Land, TX  77487-5012;         `-_-' "Have you hugged your wolf, today?"

preston@ariel.rice.edu (Preston Briggs) (05/13/91)

peter@ficc.ferranti.com (peter da silva) writes:
>>>Ah, the old Von Neumann bottleneck. Time to apply RISC design techniques to
>>>memory subsystems.

and I (obscurely) wrote
>> Well, you _could_ rediscover the Connection Machine.

and he replies
>Is it reasonable to think about scaling the CM down to a desktop workstation?

Probably not.  However, ...
Very early in his thesis, Hillis makes some interesting points.
He notices that the number of transistors in large computers is becoming
huge.  However, only those in the CPU are heavily employed.
The bulk of the memory sits idle most of the time.  He also notes
that most of our efforts have been devoted to keeping the
CPU "wonderfully busy."  This leads him to consider a machine with a
better balance between CPU and memory: the CM.

I usually cuss the bottleneck because I'm having trouble keeping
the CPU 100% busy.  Hillis' point makes more economic sense.

Preston Briggs

neray@Alliant.COM (Phil Neray) (05/14/91)

1. Yes, superscalar RISC is essentially SISD (even when it has pipelining
like the i860), but you're missing the point - the real leap forward is
putting lots of these RISC/superscalar/pipelined microprocessors in
a large parallel system with either shared memory (e.g., Alliant FX/800 and
FX/2800) or distributed memory (e.g., Intel iPSC/860), and then you're
talking about MIMD. Since the i860 has 1 million transistors on a single
chip, we can put four processors on a single 12"x13" processor module.

MIMD architectures like these handle all of the vector-style algorithms PLUS 
the ones that are not suited to hardware vector architectures (such as loops 
with dependencies, or loops with subroutines).

And, you can still implement vector-style algorithms at the individual
processor level by software pipelining to keep the functional units'
pipelines full, using the cache as "vector registers". In addition,
superscalar architectures promote optimal pipelined performance by allowing
load instructions to be issued simultaneously with floating-point instructions.
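
A strip-mined C sketch of the "cache as vector registers" style; VLEN
is an invented stand-in for a cache-sized chunk:

#define VLEN 128    /* chunk small enough to stay cache-resident */

/* Two passes over each chunk: the second pass finds its operands
 * still in cache, mimicking vector-register reuse. */
void scale_then_square(double *y, const double *x, double a, int n)
{
    int i, base, len;
    for (base = 0; base < n; base += VLEN) {
        len = (n - base < VLEN) ? (n - base) : VLEN;
        for (i = 0; i < len; i++)       /* pass 1: y = a*x             */
            y[base + i] = a * x[base + i];
        for (i = 0; i < len; i++)       /* pass 2: y = y*y, from cache */
            y[base + i] *= y[base + i];
    }
}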

For example, an FX/2800 with 28 i860-based processors achieves 1018 MFLOPS
on a complex double-precision BLAS3 matrix multiply, and over 2 GFLOPS on 
single-precision convolutions.

2. "RISC + vectors"?

Ardent tried this in their first-generation system with limited success.

The new Convex systems are claimed to be RISC, but this is clearly a
spin-doctoring PR strategy to deal with a creeping realization by Convex 
that parallel RISC is becoming inevitable. The new systems are binary
compatible with the current C2 Series and the first-generation C1 that
was designed in the early 80s. This 128-element vector CISC architecture is 
clearly not RISC (it doesn't have a simple instruction set, and the 
instructions do not typically execute in a single clock cycle - a 128-
element vector instruction will require over 128 clock cycles). 

Convex is claiming that the new systems are RISC because they use fewer 
VLSI chips than their predecessors ... and the WSJ cheerfully quoted
"RISC" as meaning "Reduced Instruction Set Chips" ... oh well, maybe
the "Post" got it right.
-- 
Phil Neray			Domain:	neray@alliant.com
Alliant Computer Systems	UUCP:	{mit-eddie|linus}!alliant!neray
Littleton, MA 01460		Phone:	(508) 486-1429

carroll@ssc-vax (Jeff Carroll) (05/18/91)

In article <_R9B1S2@xds13.ferranti.com> peter@ficc.ferranti.com (peter da silva) writes:
>> Well, you _could_ rediscover the Connection Machine.
>
>Is it reasonable to think about scaling the CM down to a desktop workstation?


	Well, I don't think you'll see TMC do this (they really conceive of
themselves as being in the *super*computer business), but there are people
around who might build "mini-SIMDs".

	In particular, AMT has retargeted the DAP SIMD for the embedded
systems market (and, as TMC did, they glommed on some floating-point chips
in the process). Rumor has it that the large defense contractor E-Systems
has chosen the DAP for implementation in GaAs.


-- 
Jeff Carroll
carroll@ssc-vax.boeing.com

"Like sands through the hourglass, so are the days of our lives." - Socrates

carroll@ssc-vax (Jeff Carroll) (05/18/91)

In article <3997@ssc-bee.ssc-vax.UUCP> carroll@ssc-vax.boeing.com (Jeff Carroll) writes:
>In article <_R9B1S2@xds13.ferranti.com> peter@ficc.ferranti.com (peter da silva) writes:
>>> Well, you _could_ rediscover the Connection Machine.
>>
>>Is it reasonable to think about scaling the CM down to a desktop workstation?


(Sorry about the double post - Emacs is arguing with the Novell gateway again,
and I barely managed to escape the last edit session without a reboot.)

Actually TMC thinks that they *have* scaled down the CM - down to *one*
of those big black cubes.

From the literature I've seen, MasPars are not as large, and there's at least
one SIMD (WaveTracer) that's about the size of a Firefox workstation.

In short, I think there'll be desktop SIMDs some day soon. Whether they'll be
built by TMC, well... where's barmar?



-- 
Jeff Carroll
carroll@ssc-vax.boeing.com

"Like sands through the hourglass, so are the days of our lives." - Socrates