lamson@el1.crd.ge.com (scott h lamson) (03/08/91)
In article <1991Mar7.215545.430@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
> In article <45252@ut-emx.uucp> guru@ut-emx.uucp (chen liehgong) writes:
> >Why were the CDC 6600 and the TI ASC (Advanced Scientific Computer)
> >failures?  I would like to have your opinions on this.
> Assuming that you have confused the 6600 (successful) with the Star 100
> (failure), the most basic reason why the Star and the ASC failed was that
> they focussed on long-vector performance and ignored scalar performance,
> on the theory that everything would vectorize extremely well.  The Cray 1
> succeeded where they failed because it was a blazing-fast *scalar* machine
> first and an even-faster vector machine second.

Given this line of reasoning, how do you look at massive parallelism a la the Connection Machine?  Should you think of the CM as a slow scalar machine with a super-fast, very long vector processor?  If so, will it fail for the same reasons you give?  One difference is that vector units take one clock cycle for each additional operation, whereas the CM takes no extra clock cycles for each additional operation (until you run out of processors).  But this has nothing to do with the scalar-speed argument anyway.

Or is this the wrong way to look at the machine to start with?
--
Scott        | ARPA: lamson@crd.ge.com
Lamson       | UUCP: uunet!crd.ge.com!lamson
(518)387-5795| UUCP: uunet!sierra.crd.ge.com!lamson
General Electric Corporate Research and Development
hrubin@pop.stat.purdue.edu (Herman Rubin) (03/08/91)
In article <LAMSON.91Mar7181743@el1.crd.ge.com>, lamson@el1.crd.ge.com (scott h lamson) writes:
> Given this line of reasoning, how do you look at massive parallel ala
> the connection machine?  Should you think of the CM as a slow scalar
> machine with a super fast very long vector processor?  If so, will it
> fail for the same reasons you give?  One difference is vector units
> take only 1 clock cycle for each additional operation, whereas CM's
> take no extra clock cycles for each additional operation (until you
> run out of processors).  but this has nothing to do with the scalar
> speed argument anyway.
>
> or is this maybe the wrong way to look at the machine to start with.

This is the wrong way to look at it, and there are huge problems with massively parallel processors handling long vectors.  Here is a simple example which will do relatively poorly on SIMD machines: a function is to be computed on all elements of a vector; there are different efficient algorithms for different parts of the domain, but no single algorithm is even moderately efficient over the whole domain.

Vector machines perform differently on this problem.  The CRAY-1 has a considerable amount of startup overhead, which the X-MP reduces; each algorithm can then be run only on the relevant arguments and the results merged, with overhead only slightly worse.  The IBM 3090 reduces the overhead somewhat, and the CYBER 205 keeps it quite small.  If extremely fast moves of data between units could be achieved by additional instructions, SIMD machines could achieve this also, but these would not be SIMD-type instructions; the data blocks would have to be individualized.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin (UUCP)
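Rubin's example can be sketched in array terms.  The NumPy fragment below (the two "algorithms" are hypothetical stand-ins, not anything from the thread) contrasts the SIMD cost model, where every processor evaluates every branch and a mask discards unwanted results, with the gather/compute/merge approach he describes for the Cray and CYBER:

```python
import numpy as np

# Hypothetical function with two domain-specific algorithms:
# a cheap polynomial for small |x|, a different formula elsewhere.
def f_small(x):
    return 1.0 + x + x**2 / 2.0      # stand-in "algorithm A"

def f_large(x):
    return np.exp(x)                  # stand-in "algorithm B"

x = np.linspace(-3.0, 3.0, 1000)
mask = np.abs(x) < 0.5

# SIMD-under-mask style: both algorithms are evaluated over the FULL
# vector; the mask only selects which result is kept.
# Total work = work(A) + work(B) on all n elements.
simd_result = np.where(mask, f_small(x), f_large(x))

# Gather/compute/merge style: each algorithm runs only on its own
# sub-vector, then the results are merged back into place.
# Total work = work(A) on the small part + work(B) on the rest,
# plus the compress/merge (gather/scatter) overhead Rubin discusses.
gs_result = np.empty_like(x)
gs_result[mask] = f_small(x[mask])
gs_result[~mask] = f_large(x[~mask])

assert np.allclose(simd_result, gs_result)
```

Both paths compute the same answer; the difference is only in how much redundant arithmetic and data movement each machine organization pays for.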
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (03/08/91)
>> On 8 Mar 91 13:47:05 GMT, hrubin@pop.stat.purdue.edu (Herman Rubin) said:

Herman> In article <LAMSON.91Mar7181743@el1.crd.ge.com>, lamson@el1.crd.ge.com (scott h lamson) writes:
> Given this line of reasoning, how do you look at massive parallel ala
> the connection machine?  Should you think of the CM as a slow scalar
> machine with a super fast very long vector processor?
> or is this maybe the wrong way to look at the machine to start with.

Herman> This is the wrong way to look at it, and there are huge
Herman> problems with massively parallel processors handling long
Herman> vectors.  Here is a simple example which will be relatively
Herman> poor on SIMD machines: there is a function to be computed on
Herman> all arguments of a vector.  There are different efficient
Herman> algorithms to be used in different parts of the domain, but
Herman> there is no common even moderately efficient algorithm.

That is why Danny Hillis originally wanted more than one instruction stream on the Connection Machine.  I believe he told me that the original idea was for 4 instruction streams.  This would require 2 bits for "instruction-stream select" rather than the 1 bit currently used for masking, so this part of the overhead is negligible.  What caused them to drop the idea was that the front-end VAXen had enough trouble generating *one* instruction stream fast enough to keep the machine busy -- they would have failed badly trying to generate 4 independent instruction streams.

Of course, the other part of the story is that they saw no overwhelming reason to include the added complexity in the first line of machines.  Now that the CM-2 SIMD architecture has proven successful, and now that *much* faster front-end machines are available, Thinking Machines might be more receptive to user requests for this sort of functionality.

The "multiple-instruction-stream" feature may or may not speed up a particular application.  The best case is when the different algorithms all require the same amount of time; one then obtains a speedup equal to the lesser of the number of instruction streams and the number of algorithms used.

One may quibble about the choice of 4 instruction streams.  I think that would handle most applications, though I might want 8 for some stuff I do (one "interior" instruction stream and 6 "boundary condition" instruction streams for the faces of a three-dimensional rectangular box).  Once one gets much more complicated than a "few" instruction streams, the problem would probably map better onto a MIMD architecture.
--
John D. McCalpin                    mccalpin@perelandra.cms.udel.edu
Assistant Professor                 mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.  J.MCCALPIN/OMNET
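McCalpin's best-case claim can be checked with a back-of-the-envelope cost model.  This sketch (my own restatement, not from the thread) assumes n branch algorithms of equal cost and enough processors in each domain partition; with one SIMD stream the branches serialize under mask, while k streams run them k at a time:

```python
import math

def simd_time(n_algs, t=1.0):
    # One instruction stream: the n branch algorithms run one after
    # another under mask, so every processor waits through all of them.
    return n_algs * t

def multi_stream_time(n_algs, k_streams, t=1.0):
    # k streams: algorithms execute k at a time (equal cost t assumed).
    return math.ceil(n_algs / k_streams) * t

def speedup(n_algs, k_streams):
    return simd_time(n_algs) / multi_stream_time(n_algs, k_streams)

# When k divides n (or k >= n), the speedup is min(k, n), as claimed.
assert speedup(4, 4) == 4.0   # four algorithms, four streams
assert speedup(8, 4) == 4.0   # streams are the limit
assert speedup(2, 4) == 2.0   # only two algorithms limit the gain
```

When the branch costs are unequal the gain is smaller, since each group of k concurrent algorithms finishes only when its slowest member does.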
lethin@raisin-scone.ai.mit.edu (Richard A. Lethin) (03/09/91)
In article <LAMSON.91Mar7181743@el1.crd.ge.com> lamson@el1.crd.ge.com (scott h lamson) writes:
>Given this line of reasoning, how do you look at massive parallel ala
>the connection machine?  Should you think of the CM as a slow scalar
>machine with a super fast very long vector processor?  If so, will it
>fail for the same reasons you give?  One difference is vector units
>take only 1 clock cycle for each additional operation, whereas CM's
>take no extra clock cycles for each additional operation (until you
>run out of processors).  but this has nothing to do with the scalar
>speed argument anyway.

One important advantage of the CM over a long-vector machine is that the CM has more wires: more communication.  While communication steps can be very costly, they need not pass through the bottleneck of a central processing unit, as they might when doing a scatter/gather on a vector machine.
henry@zoo.toronto.edu (Henry Spencer) (03/09/91)
In article <LAMSON.91Mar7181743@el1.crd.ge.com> lamson@el1.crd.ge.com (scott h lamson) writes:
>Given this line of reasoning, how do you look at massive parallel ala
>the connection machine? ...

My impression -- beware that it's an area I'm not really up on -- is that CM buyers are more or less aware that they're going to have to revise their software to exploit the thing.  Buyers of "main line" supers want their dusty decks to run well, if not optimally, right away.  The question is how many CM buyers there are, given that constraint.
--
"But this *is* the simplified version  | Henry Spencer @ U of Toronto Zoology
for the general public."  -S. Harris   | henry@zoo.toronto.edu  utzoo!henry
brooks@physics.llnl.gov (Eugene D. Brooks III) (03/11/91)
In article <LAMSON.91Mar7181743@el1.crd.ge.com> lamson@el1.crd.ge.com (scott h lamson) writes:
>Given this line of reasoning, how do you look at massive parallel ala
>the connection machine?  Should you think of the CM as a slow scalar
>machine with a super fast very long vector processor?

Yes, this is the correct view of the CM-2.

>If so, will it fail for the same reasons you give?

Yes, it will fail for the same historical reasons.  NASA Ames has some impressive computational efficiencies for this machine on 3-D FFTs; the number I saw on a viewgraph was "~0".

>One difference is vector units
>take only 1 clock cycle for each additional operation, whereas CM's
>take no extra clock cycles for each additional operation (until you
>run out of processors).

This does not matter.

>but this has nothing to do with the scalar
>speed argument anyway.

Right.  What gets you is vector operations under mask.  The fundamental capability that is needed is MIMD functioning of the processors, so that each processor can go its own way and get effective work done instead of being disabled for the duration of the computation.
brooks@physics.llnl.gov (Eugene D. Brooks III) (03/11/91)
In article <13834@life.ai.mit.edu> lethin@raisin-scone.ai.mit.edu (Richard A. Lethin) writes:
>the CM has more wires: more communication.  While communication steps
>can be very costly, they don't need to pass through the bottleneck of
>a Central processing unit as they might when doing a scatter/gather on
>a vector machine.

Scatter/gather is the equivalent of the communication on the CM.  On a Cray X-MP or Y-MP the scatter/gather bandwidth matches the CPU speed just fine and does not represent a bottleneck.  It is certainly not as hampered as the "router" functionality on the CM-2, which is a severe mismatch for the PE computational bandwidth.
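For readers unfamiliar with the terminology, scatter/gather on a vector machine is simply indexed vector load/store: the "communication pattern" is encoded as an index vector.  A minimal NumPy sketch (illustrative only; array contents are arbitrary):

```python
import numpy as np

data = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
index = np.array([3, 0, 4, 1, 2])   # an arbitrary permutation

# Gather: indexed vector load, gathered[i] = data[index[i]].
gathered = data[index]

# Scatter: indexed vector store, scattered[index[i]] = gathered[i].
# Scattering the gathered values back reproduces the original vector.
scattered = np.empty_like(data)
scattered[index] = gathered

assert np.array_equal(gathered, np.array([40.0, 10.0, 50.0, 20.0, 30.0]))
assert np.array_equal(scattered, data)
```

On a register-based vector machine these index operations are issued by the CPU at memory bandwidth, which is Brooks's point: the pattern is limited by memory ports, not by a separate routing network.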
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (03/11/91)
In article <781@llnl.LLNL.GOV> brooks@physics.llnl.gov (Eugene D. Brooks III) writes:
>Scatter/gather is the equivalent of the communication on the CM.  On
>a Cray XMP or YMP the scatter/gather bandwidth matches the cpu speed
>just fine and does not represent a bottleneck.  It is certainly not
>as hampered as the "router" functionality on the CM-2 which is a severe
>mismatch for the PE computational bandwidth.

It was mismatched on the CM-1, too.  Would anyone care to comment on why the CM-2 didn't improve in this area?  Is there something intrinsic to the design, or to the implementation strategy?
--
Don		D.C.Lindsay 	.. temporarily at Carnegie Mellon Robotics
rlk@think.com (Robert Krawitz) (03/12/91)
In article <780@llnl.LLNL.GOV>, brooks@physics (Eugene D. Brooks III) writes:
]Yes, it will fail for the same historical reasons.
]NASA AMES has some impressive computational efficiencies for this
]machine on 3-D FFTs, the number I saw on a viewgraph was "~0"

This timing was performed in 64-bit precision on a machine with 32-bit floating-point hardware (effectively, without floating-point hardware).  We do offer 64-bit floating-point hardware.  The timing was also for considerably more than just a 3-D FFT.  The performance for the FFTs in question, using currently released software and 64-bit FPU hardware, is between 45 and 140 Mflops on an 8K-processor machine (1/8 of a full machine), depending upon the options chosen.
--
ames >>>>>>>>> |	Robert Krawitz <rlk@think.com>	245 First St.
bloom-beacon > |	think!rlk	(postmaster)	Cambridge, MA  02142
harvard >>>>>> .	Thinking Machines Corp.		(617)234-2116
serafini@nas.nasa.gov (David B. Serafini) (03/12/91)
In article <12315@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>In article <781@llnl.LLNL.GOV>
> brooks@physics.llnl.gov (Eugene D. Brooks III) writes:
>>Scatter/gather is the equivalent of the communication on the CM.  On
>>a Cray XMP or YMP the scatter/gather bandwidth matches the cpu speed
>>just fine and does not represent a bottleneck.  It is certainly not
>>as hampered as the "router" functionality on the CM-2 which is a severe
>>mismatch for the PE computational bandwidth.
>
>It was mismatched on the CM-1, too.  Would anyone care to comment on
>why the CM-2 didn't improve in this area?  Is there something
>intrinsic to their design, or their implementation strategy?

The CM-2 was basically a CM-1 with floating-point hardware and some extra glue circuitry stuck onto the boards in what little spare board space there was.  Very little, if any, of the CM-1 hardware was substantially redesigned (to my knowledge; I may be wrong).  There were bug fixes, of course.  The router hardware and software took a lot of time and effort to design, and could not be redone for the CM-2.  I'm sure the next CM will show the significant changes that Don is implying should have been in the CM-2.
--
David B. Serafini			serafini@ralph.arc.nasa.gov
Rose Engineering and Research		or ...!ames!amelia!serafini
NASA/Ames Research Center MS227-6	(415)604-6233
Moffett Field, CA 94035			#include <std-disclaimer.h>
brooks@physics.llnl.gov (Eugene D. Brooks III) (03/15/91)
In article <1991Mar11.220703.13544@Think.COM> rlk@think.com (Robert Krawitz) writes:
[an attempt to discredit the performance data I posted for the FFT on the CM-2]
The data I saw specified 32-bit precision on an 8K machine with
32-bit floating-point hardware.  It was for the NAS FFT benchmark.
For further reference, I of course refer interested parties to the
appropriate NAS publications.