fouts@bozeman.ingr.com (Martin Fouts) (05/31/90)
In article <1990May23.041119.4359@ux1.cso.uiuc.edu> klietz@ux1.cso.uiuc.edu (Alan Klietz) writes:
From: klietz@ux1.cso.uiuc.edu (Alan Klietz)
Summary: Be not hasty to judge
> Aw, Marty don't write off CRI too fast. While I admit that my sentiments
> about the YMP 2E are probably similar to yours, viz. divergence
> from CRI's strengths into areas of weakness, etc., they will survive
> based on outstanding technical merit (as usual), but in other areas.
CRI's outstanding technical merit walked out the door, packed his
bags, and went to Boulder. (;-)
> What areas? Data parallelism, I think. I am seeing deja-vu in the
> acceptance of the scientific computing community to DP as I saw in the
> acceptance to UNIX back in the mid-80's (cf. ETA discussion). Those
> entities that embrace and push DP fastest will be the winners, while
> those that continue with big iron vector boxes will be crushed
> by the Killer Micros and go the way of NOS Cybers and AOS Novas.
Data parallelism is an easy way to solve easy-to-parallelize problems.
It is a poor way to solve hard-to-parallelize problems. As an example, I
cite that the CM compilers do not run on the CM... [Of course, most
of the problems in the "scientific computing" community are 'easy', so
that may not be an important point. Linear algebra exhibits good
locality of reference and relative independence of calculations.]
> CRI has embraced DP in a big way in only the last few months, I think.
> I have a pretty good idea, based totally on supposition, on what the
> teraflop YMP will look like (hint: think CM.)
I hope it isn't in the same big way they embraced "network
supercomputing" --- claiming to have invented something which on
analysis only means what other people call 'interconnectivity.' If
that's true, they'll introduce a PARDO construct to Fortran and call
it data parallelism. (;-)
> I'm not selling my CRI stock anytime soon.
Me either. I can't get the money I put into it back, and I don't
need the loss on my taxes. (:-( [CRI is now trading ~47, or about
half the average cost of CRI stock...]
But seriously. For data parallelism to be fully effective one needs a
very high bandwidth, low latency interconnect mechanism and problems
which exhibit high locality of reference. The power of the processing
element isn't very important by comparison. A large part of Thinking
Machines' success with the SIMD approach in the Connection Machine, as
opposed to the price-competitive MIMD approach of "hypercube" systems,
is the very clever and rather quick routing of the machine, coupled
with the very low synchronization cost built into the lock-step
approach of SIMD. Hillis, in his PhD thesis, argues that the
qualitative value of data parallelism only comes when a very large
number of processing elements can be effectively utilized in parallel.
CRI can build a data parallel machine in one of two ways. It can
build a MIMD machine with a medium number of processors and near zero
synchronization cost, and use that medium number of processors to model
the large number of processors Hillis postulates. Or it can build a
system with a large number of processors. In either case, doing this
with a MIMD system is a very difficult technical problem because of the
cost of synchronization.
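A rough way to see why the synchronization cost dominates is to model one
lock-step data parallel step on a MIMD machine: each physical CPU loops
over its share of the "virtual" processing elements, then all CPUs wait at
a barrier. The sketch below is only an illustration under assumed numbers;
the function names, the cost model, and the sample values are hypothetical
and do not describe any real CRI or Thinking Machines system.

```python
import math


def lockstep_step_time(virtual_pes, physical_cpus, t_op, t_sync):
    """Time for one data parallel step (hypothetical cost model).

    Each physical CPU serially emulates ceil(virtual_pes / physical_cpus)
    virtual processing elements, then every CPU pays one barrier cost.
    """
    per_cpu = math.ceil(virtual_pes / physical_cpus)
    return per_cpu * t_op + t_sync


def sync_fraction(virtual_pes, physical_cpus, t_op, t_sync):
    """Fraction of each step that is spent synchronizing."""
    total = lockstep_step_time(virtual_pes, physical_cpus, t_op, t_sync)
    return t_sync / total


# Assumed example: 65536 virtual PEs emulated on 16 CPUs, 1 time unit per
# virtual PE operation. With a free barrier the step takes 4096 units;
# with a barrier costing 4096 units, half of every step is overhead.
free_barrier = lockstep_step_time(65536, 16, 1, 0)
costly_barrier = sync_fraction(65536, 16, 1, 4096)
```

Because a barrier is paid on every single step, even a modest barrier cost
is multiplied by the number of steps in the program, which is why the
"near zero synchronization cost" condition matters so much in this model.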
CRI could possibly build a SIMD implementation of the Y/MP; that is, a
data parallel processor driven by the Y/MP instruction set. There are
only three things needed to do this that they don't currently have the
expertise for:
1) hardware
2) software
3) marketing (;-)
In fact, I can't even imagine such a machine running. However, I've
been wrong about enough things that I'll try instead to imagine how
long it will take Crayless-Cray to produce the machine.
Let's see, the C90 project has to be done first and then people freed
up.... (think, think, gaze at crystal ball:)
It would take CRI at least 10 years from today to introduce a usable
SIMD system a la the Connection Machine with a processor count in the
low to medium thousands.
Can they survive that long without a follow-on to the Y/MP, which will
be obsolete in 1994 if current industry trends of five-year lifetimes
are followed?
Marty
--
Martin Fouts
UUCP: ...!pyramid!garth!fouts ARPA: apd!fouts@ingr.com
PHONE: (415) 852-2310 FAX: (415) 856-9224
MAIL: 2400 Geng Road, Palo Alto, CA, 94303
If you can find an opinion in my posting, please let me know.
I don't have opinions, only misconceptions.
fouts@bozeman.ingr.com (Martin Fouts) (06/01/90)
In article <6308@amelia.nas.nasa.gov> serafini@amelia.nas.nasa.gov (David B. Serafini) writes:
In article <1990May23.041119.4359@ux1.cso.uiuc.edu> klietz@ux1.cso.uiuc.edu (Alan Klietz) writes:
>In article <354@garth.UUCP> fouts@bozeman.ingr.com (Martin Fouts) writes:
><
><1) Y-MP 2E was announced (1 - 2 cpu air cooled YMP) as CRI's first
><   "minisupercomputer"

Was availability date announced? Is this the next generation Supertek
machine or really a CRI-developed air-cooled Y (have to be some pretty
big fans in that puppy, the Y circuit boards make a lot of heat. :-)
<dbs>

The availability date was announced, but I don't have it. (You should
ask John Barton, others should check with their Cray salesman.) It is
not a Supertek system, but an air-cooled Y. You don't need the big
fans if you run the clocks slowly enough (;-)
Marty
art@cs.bu.edu (Al Thompson) (06/04/90)
In article <390@garth.UUCP> fouts@bozeman.ingr.com (Martin Fouts) writes:
[...]
|
| What areas? Data parallelism, I think. I am seeing deja-vu in the
| acceptance of the scientific computing community to DP as I saw in the
| acceptance to UNIX back in the mid-80's (cf. ETA discussion). Those
| entities that embrace and push DP fastest will be the winners, while
| those that continue with big iron vector boxes will be crushed
| by the Killer Micros and go the way of NOS Cybers and AOS Novas.
|
|Data parallelism is an easy way to solve easy to parallelize problems.
|It is a poor way to solve hard to parallelize problems. As example, I
|cite that the CM compilers do not run on the CM... [Of course, most
|of the problems in the "scientific computing" community are 'easy', so
|that may not be an important point. Linear algebra exhibits good
|locality of reference and relative independence of calculations.]

That's the point, the "scientific problems" are indeed "easy". I am
surprised at your compiler comment since compilers really don't fit the
model. I realize that's your point, but it really raises the old general
purpose arguments. Clearly the data parallel model is spectacular for
some applications. I have been working with a CM for a while now, and
it's really quite a rush to ponder a problem for a while and then
suddenly discover it can be solved in one statement. After a bit of
experience it is clear that there are new problem solutions that can be
implemented. Finding the implementations, or just searching for them, is
quite illuminating.

|
| CRI has embraced DP in a big way in only the last few months, I think.
| I have a pretty good idea, based totally on supposition, on what the
| teraflop YMP will look like (hint: think CM.)
|
|I hope it isn't in the same big way they embraced "network
|supercomputing" --- claiming to have invented something which on
|analysis only means what other people call 'interconnectivity.' If
|that's true, they'll introduce a PARDO construct to Fortran and call
|it data parallelism. (;-)

Isn't it interesting how this happens so often.

|
| I'm not selling my CRI stock anytime soon.
|
|Me either. I can't get the money I put into it back, and I don't
|need the loss on my taxes. (:-( [CRI is now trading ~47, or about
|half the average cost of CRI stock...]
|
|But seriously. For data parallelism to be fully effective one needs a
|very high bandwidth low latency interconnect mechanism and problems
|which exhibit high locality of reference. The power of the processing
|element isn't very important by comparison. A large part of Thinking
|Machine's success with the SIMD approach in the connection machine as
|opposed to the price competitive MIMD approach of "hypercube" systems
|is the very clever and rather quick routing of the machine, coupled
|with the very low synchronization cost built into the lock step
|approach of SIMD. Hillis, in his PHD thesis, argues that the
|qualitative value of data parallelism only comes when a very large
|number of processing elements can be effectively utilized in parallel.

That is the point. You really need a huge number of processors to get
the advantages. Cursory cost analyses look like they kill the CM, and
they do too if your problem is small. The reason for this is that the
number of processors is a factor in the cost equation. In the case of
the CM this number is both fixed and large (usually). So, if you have a
problem that doesn't really need a lot of processors then its cost seems
prohibitive.

|
|CRI can build a data parallel machine in one of two ways. It can
|build a medium number of processor MIMD machine with near zero
|synchronization cost and use the medium number of processors to model
|the large number of processors Hillis postulates. Or, it can build a
|large number of processor systems. In either case, doing this with a
|MIMD system is a very difficult technical problem because of the cost
|of synchronization.

You can say that again. If you are not getting a bunch of processors you
are better off staying with super fast von Neumann. See Stone's article
on the search results reported by Thinking Machines.

|
|CRI could possibly build a SIMD implementation of the Y/MP; that is a
|Y/MP instruction set driven data parallel processor. There are only
|three things needed to do this that they don't currently have the
|expertise for:
|
|1) hardware
|2) software
|3) marketing (;-)
|
|In fact, I can't even imagine such a machine running. However, I've
|been wrong about enough things that I'll try instead to imagine how
|long it will take Crayless-Cray to produce the machine.

I can't imagine such a machine being SIMD. The Cray instruction set, if
implemented on each processor, contains data dependent jumps. The first
time one of these is executed, poof you're off in MIMD land.
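The data dependent jump problem can be made concrete with a small sketch.
In the first function each element takes its own branch, which is what
pushes separate instruction streams apart. The second shows masking, one
widely used way for a lock-step machine to avoid the jump: every element
evaluates both arms and a per-element mask picks the result. This is only
an illustration in plain Python lists; the function names and the toy
operation are made up for the example, and nothing here is claimed to be
how a Cray or a Connection Machine actually implements it.

```python
def abs_or_double_branching(values):
    """Per-element branch: each element may follow a different path."""
    out = []
    for x in values:
        if x < 0:          # data dependent jump
            out.append(-x)
        else:
            out.append(2 * x)
    return out


def abs_or_double_masked(values):
    """Same result with one instruction stream and no jump.

    Every step below is applied to all elements, so the sequence of
    operations is identical no matter what the data looks like.
    """
    mask = [x < 0 for x in values]       # compare, all elements
    negated = [-x for x in values]       # "then" arm, all elements
    doubled = [2 * x for x in values]    # "else" arm, all elements
    # select: keep the "then" result where the mask is set
    return [n if m else d for m, n, d in zip(mask, negated, doubled)]


sample = [-3, 4, 0, -1]
same = abs_or_double_branching(sample) == abs_or_double_masked(sample)
```

The price of masking is that both arms are always executed, so deeply
nested or lopsided branches waste most of the machine.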
fouts@bozeman.ingr.com (Martin Fouts) (06/28/90)
In article <58992@bu.edu.bu.edu> art@cs.bu.edu (Al Thompson) writes:
In article <470@garth.UUCP> fouts@bozeman.ingr.com (Martin Fouts) writes:
|In article <58230@bu.edu.bu.edu> art@cs.bu.edu (Al Thompson) writes:
| In article <390@garth.UUCP> fouts@bozeman.ingr.com (Martin Fouts) writes:
[...]
|I'm not quite as optimistic. I will not quite equate scientific
|problem with easy, although I will certainly agree that they embody a
|wide range of easy problems. But I still agree. In fact, let me now
|introduce a term which I didn't invent: (though wish I had)

I didn't mean "easy" in the sense of a snap, I meant it in the sense of
"easy parallelism". That's not necessarily an easy problem. In fact many
scientific problems consist of calculations that are inherently simple
but must be replicated over a large number of cases. Geoffrey Fox has
done a lot of work on this.

Actually, I also meant easy in the sense of "easy parallelism." I'm
familiar with Fox's work. I'm also familiar with several classes of
scientific problems which are not easy to parallelize. To me, QCD is
easy to parallelize but hard to understand, (I'm no quantum physicist)
while estimation of population dynamics is easy to understand but hard
to parallelize. (I'm no biologist either, but the arguments are
"intuitive", and the problem is partial order dependent, rendering it
intractable as a parallelizable problem.)

[...]
|
|"Functional Multiprocessing (tm?)" == Decomposing the processors to
|match the problem and then optimizing each processing element to its
|task. I predict that this will be a more important contribution than
|vectorization.

This is probably correct. It is a compelling idea and one we are working
on here, although not quite in the form as stated.

|
| [...]
|
| |
| |CRI could possibly build a SIMD implementation of the Y/MP; that is a
| |Y/MP instruction set driven data parallel processor. [...]
| |
| I can't imagine such a machine being SIMD. The Cray instruction set, if
| implemented on each processor, contains data dependent jumps. The first
| time one of these is executed, poof you're off in MIMD land.
|
|Actually, there are ways to make those jumps which would work in a
|SIMD architecture and would support the iteration to a fixed point
|method of parallelization described by Chandy and Misra, but I
|wouldn't want to program one of them (;-)

My point was turning MIMD to SIMD. Not the other way around. In fact I
have done the fixed point Chandy Misra problem and it's not quite.
Perhaps the reason I say this is because I had it solved before I knew
it (I wasn't really looking for a solution, but a light went on and sure
enough, there it was). Turning SIMD to MIMD is not so terribly
difficult. To do so with any reasonable degree of efficiency, on the
other hand, is not so easy.

Agreed. Now think about my statement as an attempt to cast an MIMD
machine to a SIMD machine using an N to 1 jump processor: When a data
dependent jump occurs, all processors go nondeterministically to the
same next instruction. You still have lock-step instruction execution,
but now you have nondeterminism. The only way I can think to write
correct programs for this beast is through fixed point iteration.
Marty
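As a toy illustration of programming by fixed point iteration under
nondeterminism, the sketch below keeps a small set of "statements", each
applied to the whole array in lock step, and picks the next statement at
random. The program is correct because of what is true at the fixed
point, not because of the order in which statements happen to run. The
choice of odd-even compare-exchange sorting as the example, the function
names, and the use of a random generator to stand in for the
nondeterministic jump are all assumptions made for this sketch; it is not
taken from Chandy and Misra's text or from any proposed hardware.

```python
import random


def even_phase(a):
    """Compare-exchange pairs (0,1), (2,3), ... on a copy of the array."""
    b = list(a)
    for i in range(0, len(b) - 1, 2):
        if b[i] > b[i + 1]:
            b[i], b[i + 1] = b[i + 1], b[i]
    return b


def odd_phase(a):
    """Compare-exchange pairs (1,2), (3,4), ... on a copy of the array."""
    b = list(a)
    for i in range(1, len(b) - 1, 2):
        if b[i] > b[i + 1]:
            b[i], b[i + 1] = b[i + 1], b[i]
    return b


def run_to_fixed_point(state, statements, rng, max_steps=100000):
    """Apply randomly chosen statements until none of them changes state."""
    state = list(state)
    for _ in range(max_steps):
        if all(stmt(state) == state for stmt in statements):
            return state  # fixed point: every statement is a no-op
        state = rng.choice(statements)(state)
    raise RuntimeError("no fixed point reached within max_steps")


data = [5, 2, 9, 1, 7, 3]
result = run_to_fixed_point(data, [even_phase, odd_phase], random.Random(0))
```

At the fixed point no adjacent pair is out of order, so the array is
sorted whichever sequence of phases the random choice produced; only the
number of steps depends on the seed.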
fouts@bozeman.ingr.com (Martin Fouts) (06/29/90)
In article <27736@metropolis.super.ORG> lerici@super.ORG (Peter W. Brewer) writes:
> Hmmm. well yes, editors should run on supercomputers dedicated to doing fast
> floating point or large integer computations. Emacs belongs on the Cray of
> course.. :-) I think a nice RISC based front end to compile/edit etc. (after
> all aren't compilers/link EDITORS etc just editors? ) may be enough .. let
> Floating Point do Floating Point. Parallel Compilers/Parsers are nice but
> how complicated/how easy to debug? how about extensibility etc? So compilers
> don't run on the CM.. do they run in the Cray or Convex Vector Units? That's
> where a lot of the performance comes from... I think Unicos etc. should run
> on the ForeGround Processors on the Crays... it creates too many problems
> running on the backends...
This is an old religious argument. The simple refutation is that a
Cray isn't just a Floating Point (sic) unit. On a Cray system, as
delivered to a customer, typically 1/4 of the manufacturing cost is
for the CPU and the rest for the non-CPU parts. Of that 1/4, maybe half
is floating point. There is no system on the market for which the CPU is
more than half the manufacturing cost.
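Taking the rough fractions above at face value, the arithmetic works out
as follows. This is only a worked example of the stated estimates (1/4 of
manufacturing cost for the CPU, maybe half of that for floating point);
the variable names are made up and the numbers are the poster's
ballpark figures, not measured data.

```python
from fractions import Fraction

cpu_share = Fraction(1, 4)          # CPU share of manufacturing cost
fp_share_of_cpu = Fraction(1, 2)    # "maybe half" of the CPU is floating point

# Floating point hardware as a share of the whole system's cost.
fp_share = cpu_share * fp_share_of_cpu

# Everything that is not the CPU: memory, I/O, packaging, and so on.
non_cpu_share = 1 - cpu_share

print(f"floating point: {fp_share} of system cost")
print(f"non-CPU:        {non_cpu_share} of system cost")
```

On these estimates floating point is about one eighth of what the
customer is paying to have built, and three quarters of the cost is
outside the CPU entirely.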
Besides, the users are more valuable than the tools they use. Let's
try to make the machines usable, shall we?
Marty
ddt@convex.COM (David Taylor) (07/04/90)
>In order to identify a useful point in the space, I proposed the gcc
>SPECmark. It is not a scalar benchmark run through a scalar compiler;

I disagree. I've been examining the gcc benchmark of the SPEC suite for
perhaps a month, now, and it is a /very/ scalar benchmark. Although I
may not at this time release a SPECratio (lest it be set in stone), I
can say that the majority of the 10 SPEC benchmarks smile favorably on
the Convex C-series computers, especially when you consider the price of
the computer. A couple of them might blow your formidable mind
(regardless of price), but you'll have to wait until we're ready to
unveil them.

As of yet, I have only run the gcc SPEC benchmark using a scalar
compiler built for portability, not optimization for speed, so I can't
give you a good feel for the speed. However, the profile does indicate
that it will be a very scalar benchmark. I can't give you a date when
we'll be releasing the numbers, but we seem to be very close. I've made
trial runs of the SPECthruput as well. The C-series once again performed
very well. I'll keep you posted of the release date for the actual
numbers.

I am also a fan of the SPEC benchmarks. I think they're a really good
indication of performance. However, inevitably, the best measure of a
machine's speed is to see how fast a customer's application will run on
it with a customer's typical load. Convexes do what they were designed
to do /extremely/ well. Vectorization and parallelization are very
important factors for most of our customers. Therefore, our customers
are usually very happy with the product. We have the /best/ machines in
our price bracket for semi-vectorizable/parallelizable jobs.

=-ddt->
--
David L. Taylor, Esq, Performance Measurement Intern, Convex. (whew!)
(214) 497-4860, ddt@convex.com or ddt@vondrake.cc.utexas.edu
Remember, flatulation is only natural.