[comp.arch] massive parallelism, was CDC 6600 and TI ASC

lamson@el1.crd.ge.com (scott h lamson) (03/08/91)

In article <1991Mar7.215545.430@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:

>   From: henry@zoo.toronto.edu (Henry Spencer)
>   In article <45252@ut-emx.uucp> guru@ut-emx.uucp (chen  liehgong) writes:
>   >Why were the CDC 6600 and the TI ASC (Advanced Scientific Computer)
>   >failures? I would like to have your opinions on this.

>   Assuming that you have confused the 6600 (successful) with the Star 100
>   (failure), the most basic reason why the Star and the ASC failed was that
>   they focussed on long-vector performance and ignored scalar performance,
>   on the theory that everything would vectorize extremely well.  The Cray 1
>   succeeded where they failed because it was a blazing-fast *scalar* machine
>   first and an even-faster vector machine second.
>   -- 

Given this line of reasoning, how do you look at massive parallelism a la
the Connection Machine?  Should you think of the CM as a slow scalar
machine with a super-fast, very-long-vector processor?  If so, will it
fail for the same reasons you give?  One difference is that vector units
take only one clock cycle for each additional operation, whereas a CM
takes no extra clock cycles for each additional operation (until you run
out of processors).  But this has nothing to do with the scalar-speed
argument anyway.
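
To put rough numbers on that difference, here is a back-of-the-envelope
model in C.  The startup, per-step, and processor-count figures below
are invented for illustration; they are not measurements of any real
Cray or CM.

#include <stdio.h>

/* Idealized cycle counts for applying one operation to an n-element
 * vector.
 * vector_cycles: a pipelined vector unit delivers one result per clock
 *                after some startup latency.
 * simd_cycles:   a SIMD array handles a whole batch of nproc elements
 *                in one (slower) step, and only needs extra steps once
 *                n > nproc.
 * All constants used below are placeholders, not real hardware figures. */
static long vector_cycles(long n, long startup)
{
    return startup + n;
}

static long simd_cycles(long n, long nproc, long step)
{
    long batches = (n + nproc - 1) / nproc;
    return batches * step;
}

int main(void)
{
    for (long n = 1000; n <= 128000; n *= 2)
        printf("n = %6ld   vector = %7ld cycles   simd = %7ld cycles\n",
               n, vector_cycles(n, 50), simd_cycles(n, 65536, 30));
    return 0;
}

The crossover point obviously depends entirely on the invented
constants; the point of the sketch is only the shape of the two cost
curves.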

Or is this maybe the wrong way to look at the machine to start with?

--
        Scott|  ARPA:      lamson@crd.ge.com
       Lamson|  UUCP:      uunet!crd.ge.com!lamson
(518)387-5795|  UUCP:      uunet!sierra.crd.ge.com!lamson
General Electric Corporate Research and Development

hrubin@pop.stat.purdue.edu (Herman Rubin) (03/08/91)

In article <LAMSON.91Mar7181743@el1.crd.ge.com>, lamson@el1.crd.ge.com (scott h lamson) writes:

			......................

> Given this line of reasoning, how do you look at massive parallelism a la
> the Connection Machine?  Should you think of the CM as a slow scalar
> machine with a super-fast, very-long-vector processor?  If so, will it
> fail for the same reasons you give?  One difference is that vector units
> take only one clock cycle for each additional operation, whereas a CM
> takes no extra clock cycles for each additional operation (until you run
> out of processors).  But this has nothing to do with the scalar-speed
> argument anyway.
> 
> Or is this maybe the wrong way to look at the machine to start with?

This is the wrong way to look at it, and there are huge problems with
massively parallel processors handling long vectors.  Here is a simple
example which will do relatively poorly on SIMD machines: a function is
to be computed on every element of a vector.  Different algorithms are
efficient in different parts of the domain, but there is no single
algorithm that is even moderately efficient over the whole domain.

Vector machines perform differently on this problem.  The CRAY 1 has a
considerable amount of initial overhead, which the X-MP can reduce;
each algorithm can then be applied only to its relevant arguments and a
merge carried out afterwards, at only slightly worse overhead.  The IBM
3090 reduces the overhead somewhat, and on the CYBER 205 this overhead
is quite small.  If extremely fast moves of data between units could be
achieved by additional instructions, SIMD machines could achieve this
also, but these would not be SIMD-type instructions; the data blocks
would have to be individualized.
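
As a concrete sketch of that compress/compute/merge approach, here it
is in scalar C.  The two "algorithms" (small_arg and large_arg) and the
0.1 cutoff are made-up stand-ins, not anything from a real library:

#include <stddef.h>
#include <math.h>

/* Evaluate a piecewise function by splitting the argument vector into
 * per-algorithm index lists, running each algorithm only on its own
 * arguments, and merging the results back with indexed stores.
 * scratch must hold 2*n indices.                                      */
static double small_arg(double x) { return x - x*x*x/6.0; } /* short series */
static double large_arg(double x) { return sin(x);       } /* library call */

void piecewise_eval(const double *x, double *y, size_t n, size_t *scratch)
{
    size_t *lo = scratch, *hi = scratch + n;
    size_t nlo = 0, nhi = 0;

    for (size_t i = 0; i < n; i++) {          /* compress: build index lists */
        if (fabs(x[i]) < 0.1)
            lo[nlo++] = i;
        else
            hi[nhi++] = i;
    }
    for (size_t i = 0; i < nlo; i++)          /* algorithm 1 on its arguments */
        y[lo[i]] = small_arg(x[lo[i]]);
    for (size_t i = 0; i < nhi; i++)          /* algorithm 2 on its arguments */
        y[hi[i]] = large_arg(x[hi[i]]);       /* indexed stores do the merge  */
}

On a masked SIMD machine the two compute loops instead become two full
passes over all the processors, with the inactive ones idle each time,
which is the cost being described above.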
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (03/08/91)

>> On 8 Mar 91 13:47:05 GMT, hrubin@pop.stat.purdue.edu (Herman Rubin) said:

Herman> In article <LAMSON.91Mar7181743@el1.crd.ge.com>, lamson@el1.crd.ge.com (scott h lamson) writes:

> Given this line of reasoning, how do you look at massive parallelism a la
> the Connection Machine?  Should you think of the CM as a slow scalar
> machine with a super-fast, very-long-vector processor?
> Or is this maybe the wrong way to look at the machine to start with?

Herman> This is the wrong way to look at it, and there are huge
Herman> problems with massively parallel processors handling long
Herman> vectors.  Here is a simple example which will do relatively
Herman> poorly on SIMD machines: a function is to be computed on every
Herman> element of a vector.  Different algorithms are efficient in
Herman> different parts of the domain, but there is no single algorithm
Herman> that is even moderately efficient over the whole domain.

That is why Danny Hillis originally wanted to have more than one
instruction stream on the Connection Machine.  I believe that he told
me that the original idea was for 4 instruction streams.  This would
require 2 bits for "instruction stream select" rather than the 1 bit
that is currently used for "masking".  This part of the overhead is
negligible.  The part that caused them to drop the idea was that the
front-end VAXen had enough trouble generating *one* instruction stream
fast enough to keep the machine busy --- it would have failed badly
trying to generate 4 independent instruction streams.

Of course, the other part of the story is that they saw no
overwhelming reason to include this added complexity in the first
generation of the machine.  Now that the CM-2 SIMD architecture has
proven itself successful, and now that *much* faster front-end machines
are available, Thinking Machines Corp. might be more receptive to user
requests for this sort of functionality.

The "multiple-instruction-stream" feature may or may not speed up a
particular application.  The best case is when the different
algorithms all require the same amount of time, then one obtains a
speedup equal to the lesser of the number of instruction streams or
the number of algorithms used.  
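
A toy model of that best case, in C.  It assumes every algorithm takes
the same time on its own subset of the data, and the example values are
invented:

#include <stdio.h>

/* k algorithms, each taking time t on its own part of the data.
 * A single SIMD instruction stream must serialize all k passes under
 * mask; s instruction streams can run them s at a time.              */
static double speedup(int k, int s, double t)
{
    double one_stream = k * t;
    double s_streams  = ((k + s - 1) / s) * t;   /* ceil(k/s) passes */
    return one_stream / s_streams;
}

int main(void)
{
    printf("4 algorithms, 4 streams: %.1fx\n", speedup(4, 4, 1.0));  /* 4.0 */
    printf("7 algorithms, 4 streams: %.1fx\n", speedup(7, 4, 1.0));  /* 3.5 */
    return 0;
}

The min(streams, algorithms) figure is reached when the streams are
kept busy on every pass, i.e. when the number of algorithms fits
within, or is a multiple of, the number of streams.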

One may quibble about the choice of 4 instruction streams.  I think
that this would handle most of the applications, though I might want 8
for some stuff I do (one "interior" instruction stream and 6 "boundary
condition" instruction streams for the faces of a three-dimensional
rectangular box).

Once a problem requires many more than a "few" instruction streams, it
would probably map better onto a MIMD architecture.
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

lethin@raisin-scone.ai.mit.edu (Richard A. Lethin) (03/09/91)

In article <LAMSON.91Mar7181743@el1.crd.ge.com> lamson@el1.crd.ge.com (scott h lamson) writes:
>Given this line of reasoning, how do you look at massive parallelism a la
>the Connection Machine?  Should you think of the CM as a slow scalar
>machine with a super-fast, very-long-vector processor?  If so, will it
>fail for the same reasons you give?  One difference is that vector units
>take only one clock cycle for each additional operation, whereas a CM
>takes no extra clock cycles for each additional operation (until you run
>out of processors).  But this has nothing to do with the scalar-speed
>argument anyway.

Another important advantage of a CM over a long-vector machine is that
the CM has more wires: more communication bandwidth.  While
communication steps can be very costly, they do not need to pass
through the bottleneck of a central processing unit as they might when
doing a scatter/gather on a vector machine.

henry@zoo.toronto.edu (Henry Spencer) (03/09/91)

In article <LAMSON.91Mar7181743@el1.crd.ge.com> lamson@el1.crd.ge.com (scott h lamson) writes:
>Given this line of reasoning, how do you look at massive parallelism a la
>the Connection Machine? ...

My impression -- beware that it's an area I'm not really up on --
is that CM buyers are more or less aware that they're going to have to
revise their software to exploit the thing.  Buyers of "main line" supers
want their dusty decks to run well, if not optimally, right away.  The
question is how many CM buyers there are, given that constraint.
-- 
"But this *is* the simplified version   | Henry Spencer @ U of Toronto Zoology
for the general public."     -S. Harris |  henry@zoo.toronto.edu  utzoo!henry

brooks@physics.llnl.gov (Eugene D. Brooks III) (03/11/91)

In article <LAMSON.91Mar7181743@el1.crd.ge.com> lamson@el1.crd.ge.com (scott h lamson) writes:
>Given this line of reasoning, how do you look at massive parallelism a la
>the Connection Machine?  Should you think of the CM as a slow scalar
>machine with a super-fast, very-long-vector processor?
Yes, this is the correct view of the CM-2.
>If so, will it fail for the same reasons you give?
Yes, it will fail for the same historical reasons.
NASA Ames has some impressive computational efficiencies for this
machine on 3-D FFTs; the number I saw on a viewgraph was "~0".
>One difference is that vector units
>take only one clock cycle for each additional operation, whereas a CM
>takes no extra clock cycles for each additional operation (until you
>run out of processors).
This does not matter.
>But this has nothing to do with the scalar-speed
>argument anyway.
Right.  What gets you is vector operations under mask.
The fundamental capability that is needed is MIMD operation of the
processors, so that each processor can go its own way and get effective
work done instead of sitting disabled for part of the computation.
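
A toy illustration of what running under mask costs, written as serial
C standing in for the parallel machine; the two arithmetic expressions
are arbitrary stand-ins for two different algorithms:

#include <stddef.h>

/* On a SIMD machine every processing element sits through every pass;
 * a PE whose mask bit is off is merely disabled, it does not get to
 * run something else.  With k distinct algorithms the time grows with
 * k.  On a MIMD machine each processor would execute only its own
 * branch.                                                             */
void simd_under_mask(const int *mask, double *data, size_t nproc)
{
    /* Pass 1: algorithm A; PEs with mask[p] == 0 are idle but still
     * take the full pass time on the real machine.                   */
    for (size_t p = 0; p < nproc; p++)
        if (mask[p])
            data[p] = data[p] * 2.0 + 1.0;   /* stand-in for algorithm A */

    /* Pass 2: algorithm B; now the other PEs are the idle ones.      */
    for (size_t p = 0; p < nproc; p++)
        if (!mask[p])
            data[p] = data[p] * 0.5 - 1.0;   /* stand-in for algorithm B */
}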

brooks@physics.llnl.gov (Eugene D. Brooks III) (03/11/91)

In article <13834@life.ai.mit.edu> lethin@raisin-scone.ai.mit.edu (Richard A. Lethin) writes:
>the CM has more wires: more communication bandwidth.  While
>communication steps can be very costly, they do not need to pass
>through the bottleneck of a central processing unit as they might when
>doing a scatter/gather on a vector machine.
Scatter/gather is the equivalent of the communication on the CM.  On
a Cray X-MP or Y-MP the scatter/gather bandwidth matches the CPU speed
just fine and does not represent a bottleneck.  It is certainly not
as hampered as the "router" functionality on the CM-2, which is a severe
mismatch for the PE computational bandwidth.
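
For readers who have not met the term: gather and scatter are just
indexed vector loads and stores, i.e. the C loops below, which machines
of that class execute with hardware support.  The index vector plays
the same role as the destination addresses handed to the CM router:

#include <stddef.h>

/* Gather: pull elements selected by an index vector into a dense
 * vector.  Scatter: push a dense vector back out through the index
 * vector.                                                            */
void gather(double *dst, const double *src, const long *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}

void scatter(double *dst, const double *src, const long *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[idx[i]] = src[i];
}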

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (03/11/91)

In article <781@llnl.LLNL.GOV> 
	brooks@physics.llnl.gov (Eugene D. Brooks III) writes:
>Scatter/gather is the equivalent of the communication on the CM.  On
>a Cray X-MP or Y-MP the scatter/gather bandwidth matches the CPU speed
>just fine and does not represent a bottleneck.  It is certainly not
>as hampered as the "router" functionality on the CM-2, which is a severe
>mismatch for the PE computational bandwidth.

It was mismatched on the CM-1, too. Would anyone care to comment on
why the CM-2 didn't improve in this area? Is there something
intrinsic to their design, or their implementation strategy?

-- 
Don		D.C.Lindsay .. temporarily at Carnegie Mellon Robotics

rlk@think.com (Robert Krawitz) (03/12/91)

In article <780@llnl.LLNL.GOV>, brooks@physics (Eugene D. Brooks III) writes:
]In article <LAMSON.91Mar7181743@el1.crd.ge.com> lamson@el1.crd.ge.com (scott h lamson) writes:

]Yes, it will fail for the same historical reasons.
]NASA Ames has some impressive computational efficiencies for this
]machine on 3-D FFTs; the number I saw on a viewgraph was "~0".

This timing was performed in 64-bit precision on a machine with 32-bit
floating-point hardware (effectively, without floating-point hardware).
We do offer 64-bit floating-point hardware.  The timing was also for
considerably more than just a 3-D FFT.  The performance for the FFTs in
question, using currently released software and 64-bit FPU hardware, is
between 45 and 140 Mflops, depending upon the options chosen, on an
8K-processor machine (1/8 of a full machine).
-- 
ames >>>>>>>>>  |	Robert Krawitz <rlk@think.com>	245 First St.
bloom-beacon >  |think!rlk	(postmaster)		Cambridge, MA  02142
harvard >>>>>>  .	Thinking Machines Corp.		(617)234-2116

serafini@nas.nasa.gov (David B. Serafini) (03/12/91)

In article <12315@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>In article <781@llnl.LLNL.GOV> 
>	brooks@physics.llnl.gov (Eugene D. Brooks III) writes:
>>Scatter/gather is the equivalent of the communication on the CM.  On
>>a Cray X-MP or Y-MP the scatter/gather bandwidth matches the CPU speed
>>just fine and does not represent a bottleneck.  It is certainly not
>>as hampered as the "router" functionality on the CM-2, which is a severe
>>mismatch for the PE computational bandwidth.
>
>It was mismatched on the CM-1, too. Would anyone care to comment on
>why the CM-2 didn't improve in this area? Is there something
>intrinsic to their design, or their implementation strategy?
>
The CM-2 was basically a CM-1 with floating-point hardware and some
extra glue circuitry stuck onto the boards in what little spare board
space there was.  Very little, if any, of the CM-1 hardware was
significantly redesigned (to my knowledge; I may be wrong).  There were
bug fixes, of course.  The router hardware and software took a lot of
time and effort to design and could not be redone for the CM-2.

I'm sure the next CM will show the significant changes that Don is implying 
should have been in the CM-2.

--
David B. Serafini			serafini@ralph.arc.nasa.gov
Rose Engineering and Research		or ...!ames!amelia!serafini
NASA/Ames Research Center  MS227-6	(415)604-6233	
Moffett Field, CA 94035			#include <std-disclaimer.h>

brooks@physics.llnl.gov (Eugene D. Brooks III) (03/15/91)

In article <1991Mar11.220703.13544@Think.COM>, rlk@think.com (Robert Krawitz)
attempts to discredit the performance data I posted for the FFT on the CM-2.

The data I saw specified 32-bit precision on an 8K machine with
32-bit floating-point hardware.  It was for the NAS FFT benchmark.

For further reference, I of course refer interested parties to the
appropriate NAS publications.