[comp.sys.super] Cray tidbits

fouts@bozeman.ingr.com (Martin Fouts) (05/31/90)

In article <1990May23.041119.4359@ux1.cso.uiuc.edu> klietz@ux1.cso.uiuc.edu (Alan Klietz) writes:

   From: klietz@ux1.cso.uiuc.edu (Alan Klietz)
   Summary: Be not hasty to judge

   Aw, Marty, don't write off CRI too fast.  While I admit that my
   sentiments about the YMP 2E are probably similar to yours, viz.
   divergence from CRI's strengths into areas of weakness, etc., they
   will survive based on outstanding technical merit (as usual), but in
   other areas.

CRI's outstanding technical merit walked out the door, packed his
bags, and went to Boulder. (;-)

   What areas?  Data parallelism, I think.  I am seeing deja vu in the
   scientific computing community's acceptance of DP, much as I saw in
   its acceptance of UNIX back in the mid-80's (cf. ETA discussion).
   Those entities that embrace and push DP fastest will be the winners,
   while those that continue with big-iron vector boxes will be crushed
   by the Killer Micros and go the way of NOS Cybers and AOS Novas.

Data parallelism is an easy way to solve easy-to-parallelize problems.
It is a poor way to solve hard-to-parallelize problems.  As an
example, I cite that the CM compilers do not run on the CM...  [Of
course, most of the problems in the "scientific computing" community
are 'easy', so that may not be an important point.  Linear algebra
exhibits good locality of reference and relative independence of
calculations.]
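
To make the linear algebra point concrete, here is a minimal sketch in
C (my own illustration, not CM code): each element of a matrix-vector
product depends only on its own row of the matrix, so all N results
could go to N processing elements with no synchronization at all.

    #include <stdio.h>

    #define N 4

    /* Each y[i] depends only on row i of a and all of x; no y[j]
       feeds any other y[i], so the N dot products could run on N
       processing elements at once with no synchronization. */
    void matvec(double a[N][N], double x[N], double y[N])
    {
        int i, j;
        for (i = 0; i < N; i++) {   /* conceptually "for all i" */
            y[i] = 0.0;
            for (j = 0; j < N; j++)
                y[i] += a[i][j] * x[j];
        }
    }

    int main(void)
    {
        double a[N][N] = {{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}};
        double x[N] = {1, 2, 3, 4}, y[N];
        matvec(a, x, y);
        printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);
        return 0;
    }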

   CRI has embraced DP in a big way only in the last few months, I think.
   I have a pretty good idea, based totally on supposition, of what the
   teraflop YMP will look like (hint: think CM.)

I hope it isn't in the same big way they embraced "network
supercomputing" --- claiming to have invented something which on
analysis only means what other people call 'interconnectivity.'  If
that's true, they'll introduce a PARDO construct to Fortran and call
it data parallelism. (;-)
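
To make the jab concrete: all a PARDO would add to an ordinary DO loop
is the assertion that its iterations are independent and may run in
parallel.  A sketch in C (the construct itself is hypothetical):

    #include <stdio.h>

    /* Hypothetical illustration: a "PARDO" is just a loop whose
       iterations are asserted independent.  Nothing here requires
       new hardware; iteration i touches only v[i]. */
    void scale(double *v, double s, int n)
    {
        int i;
        /* in Fortran this would read: PARDO I = 1, N */
        for (i = 0; i < n; i++)
            v[i] = s * v[i];
    }

    int main(void)
    {
        double v[4] = {1, 2, 3, 4};
        scale(v, 2.0, 4);
        printf("%g %g %g %g\n", v[0], v[1], v[2], v[3]);
        return 0;
    }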

   I'm not selling my CRI stock anytime soon.

Me neither.  I can't get the money I put into it back, and I don't
need the loss on my taxes. (:-(  [CRI is now trading at ~47, or about
half the average cost of CRI stock...]

But seriously.  For data parallelism to be fully effective one needs a
very high-bandwidth, low-latency interconnect mechanism and problems
which exhibit high locality of reference.  The power of the processing
element isn't very important by comparison.  A large part of Thinking
Machines' success with the SIMD approach in the Connection Machine, as
opposed to the price-competitive MIMD approach of "hypercube" systems,
is the machine's very clever and rather quick routing, coupled
with the very low synchronization cost built into the lockstep
approach of SIMD.  Hillis, in his Ph.D. thesis, argues that the
qualitative value of data parallelism only comes when a very large
number of processing elements can be effectively utilized in parallel.
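
A back-of-the-envelope way to see the synchronization point: if each
parallel step does w cycles of useful work and then pays b cycles to
synchronize, efficiency is w/(w+b).  Lockstep SIMD makes b essentially
zero; a MIMD barrier does not.  A toy model in C, with invented
numbers:

    #include <stdio.h>

    /* Toy model: each parallel step does w cycles of work, then
       pays b cycles of synchronization overhead. */
    double efficiency(double w, double b)
    {
        return w / (w + b);
    }

    int main(void)
    {
        double w = 100.0;  /* cycles of useful work per step */
        printf("SIMD lockstep (b ~ 0):  %.2f\n", efficiency(w, 0.0));
        printf("MIMD barrier (b=1000):  %.2f\n", efficiency(w, 1000.0));
        return 0;
    }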

CRI can build a data parallel machine in one of two ways.  It can
build a MIMD machine with a medium number of processors and near-zero
synchronization cost, and use that medium number of processors to
model the large number of processors Hillis postulates.  Or, it can
build a system with a large number of processors.  In either case,
doing this with a MIMD system is a very difficult technical problem
because of the cost of synchronization.
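
A sketch of the first option, in C with invented sizes: each physical
processor sweeps a slice of many virtual processing elements between
synchronization points, and the barrier at the end of every step is
exactly where the MIMD cost bites.

    #include <stdio.h>

    #define VIRTUAL  65536   /* PEs the programming model presents */
    #define PHYSICAL 64      /* processors the machine really has  */

    double state[VIRTUAL];

    /* Physical processor p sweeps its slice of virtual PEs; the
       sweep amortizes the work, but a barrier would be needed
       after every step -- the synchronization problem above. */
    void step(int p)
    {
        int chunk = VIRTUAL / PHYSICAL;
        int i;
        for (i = p * chunk; i < (p + 1) * chunk; i++)
            state[i] += 1.0;   /* stand-in for the per-PE update */
        /* barrier(p) would go here -- the expensive part */
    }

    int main(void)
    {
        int p;
        for (p = 0; p < PHYSICAL; p++)   /* serial stand-in */
            step(p);
        printf("state[0] = %g\n", state[0]);
        return 0;
    }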

CRI could possibly build a SIMD implementation of the Y/MP; that is,
a data parallel processor driven by the Y/MP instruction set.  There
are only three things needed to do this that they don't currently
have the expertise for:

1) hardware
2) software
3) marketing (;-)

In fact, I can't even imagine such a machine running.  However, I've
been wrong about enough things that I'll try instead to imagine how
long it will take Crayless-Cray to produce the machine.

Let's see: the C90 project has to be done first, and then people freed
up.... (think, think, gaze at crystal ball:)

It would take CRI at least 10 years from today to introduce a usable
SIMD system a la the Connection Machine with a processor count in the
low to medium thousands.

Can they survive that long without a follow-on to the Y/MP, which will
be obsolete in 1994 if the current industry trend of five-year
lifetimes holds?

Marty
--
Martin Fouts

 UUCP:  ...!pyramid!garth!fouts  ARPA:  apd!fouts@ingr.com
PHONE:  (415) 852-2310            FAX:  (415) 856-9224
 MAIL:  2400 Geng Road, Palo Alto, CA, 94303

If you can find an opinion in my posting, please let me know.
I don't have opinions, only misconceptions.

fouts@bozeman.ingr.com (Martin Fouts) (06/01/90)

In article <6308@amelia.nas.nasa.gov> serafini@amelia.nas.nasa.gov (David B. Serafini) writes:

   In article <1990May23.041119.4359@ux1.cso.uiuc.edu> klietz@ux1.cso.uiuc.edu (Alan Klietz) writes:
   >In article <354@garth.UUCP> fouts@bozeman.ingr.com (Martin Fouts) writes:
   ><
   ><1) Y-MP 2E was announced (1 - 2 cpu air cooled YMP) as CRI's first
   ><   "minisupercomputer"

   Was an availability date announced?  Is this the next-generation
   Supertek machine, or really a CRI-developed air-cooled Y?  (There'd
   have to be some pretty big fans in that puppy; the Y circuit boards
   make a lot of heat. :-)

   <dbs>


The availability date was announced, but I don't have it.  (You should
ask John Barton; others should check with their Cray salesman.)

It is not a Supertek system, but an air-cooled Y.  You don't need the
big fans if you run the clocks slowly enough. (;-)


Marty
--
Martin Fouts

 UUCP:  ...!pyramid!garth!fouts  ARPA:  apd!fouts@ingr.com
PHONE:  (415) 852-2310            FAX:  (415) 856-9224
 MAIL:  2400 Geng Road, Palo Alto, CA, 94303

If you can find an opinion in my posting, please let me know.
I don't have opinions, only misconceptions.

art@cs.bu.edu (Al Thompson) (06/04/90)

In article <390@garth.UUCP> fouts@bozeman.ingr.com (Martin Fouts) writes:
[...]
|
|   What areas?  Data parallelism, I think.  I am seeing deja vu in the
|   scientific computing community's acceptance of DP, much as I saw in
|   its acceptance of UNIX back in the mid-80's (cf. ETA discussion).
|   Those entities that embrace and push DP fastest will be the winners,
|   while those that continue with big-iron vector boxes will be crushed
|   by the Killer Micros and go the way of NOS Cybers and AOS Novas.
|
|Data parallelism is an easy way to solve easy-to-parallelize problems.
|It is a poor way to solve hard-to-parallelize problems.  As an
|example, I cite that the CM compilers do not run on the CM...  [Of
|course, most of the problems in the "scientific computing" community
|are 'easy', so that may not be an important point.  Linear algebra
|exhibits good locality of reference and relative independence of
|calculations.]

That's the point: the "scientific problems" are indeed "easy".  I am
surprised at your compiler comment, since compilers really don't fit
the model.  I realize that's your point, but it really raises the old
general-purpose arguments.  Clearly the data parallel model is
spectacular for some applications.  I have been working with a CM for
a while now, and it's really quite a rush to ponder a problem for a
while and then suddenly discover it can be solved in one statement.
After a bit of experience it is clear that there are new problem
solutions that can be implemented.  Finding the implementations, or
just searching for them, is quite illuminating.

|
|   CRI has embraced DP in a big way only in the last few months, I think.
|   I have a pretty good idea, based totally on supposition, of what the
|   teraflop YMP will look like (hint: think CM.)
|
|I hope it isn't in the same big way they embraced "network
|supercomputing" --- claiming to have invented something which on
|analysis only means what other people call 'interconnectivity.'  If
|that's true, they'll introduce a PARDO construct to Fortran and call
|it data parallelism. (;-)

Isn't it interesting how this happens so often?

|
|   I'm not selling my CRI stock anytime soon.
|
|Me neither.  I can't get the money I put into it back, and I don't
|need the loss on my taxes. (:-(  [CRI is now trading at ~47, or about
|half the average cost of CRI stock...]
|
|But seriously.  For data parallelism to be fully effective one needs a
|very high-bandwidth, low-latency interconnect mechanism and problems
|which exhibit high locality of reference.  The power of the processing
|element isn't very important by comparison.  A large part of Thinking
|Machines' success with the SIMD approach in the Connection Machine, as
|opposed to the price-competitive MIMD approach of "hypercube" systems,
|is the machine's very clever and rather quick routing, coupled
|with the very low synchronization cost built into the lockstep
|approach of SIMD.  Hillis, in his Ph.D. thesis, argues that the
|qualitative value of data parallelism only comes when a very large
|number of processing elements can be effectively utilized in parallel.

That is the point.  You really need a huge number of processors to get
the advantages.  Cursory cost analyses look like they kill the CM, and
they do, too, if your problem is small.  The reason for this is that
the number of processors is a factor in the cost equation.  In the
case of the CM this number is both fixed and large (usually).  So, if
you have a problem that doesn't really need a lot of processors, then
its cost seems prohibitive.
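
A rough way to write the cost equation down (my formulation, with
invented numbers): the machine's price is fixed, so the cost per
delivered flop/s scales inversely with the processors you actually
keep busy.

    #include <stdio.h>

    /* Cost per delivered flop/s when the machine ships with a fixed
       price but the problem keeps only n_busy processors working. */
    double cost_per_flops(double machine_cost, long n_busy,
                          double flops_per_pe)
    {
        return machine_cost / ((double)n_busy * flops_per_pe);
    }

    int main(void)
    {
        double price  = 5.0e6;  /* whole machine, fixed (invented) */
        double per_pe = 1.0e5;  /* flop/s per busy PE (invented)   */
        printf("65536 PEs busy: $%g per flop/s\n",
               cost_per_flops(price, 65536L, per_pe));
        printf(" 1024 PEs busy: $%g per flop/s\n",
               cost_per_flops(price, 1024L, per_pe));
        return 0;
    }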

|
|CRI can build a data parallel machine in one of two ways.  It can
|build a MIMD machine with a medium number of processors and near-zero
|synchronization cost, and use that medium number of processors to
|model the large number of processors Hillis postulates.  Or, it can
|build a system with a large number of processors.  In either case,
|doing this with a MIMD system is a very difficult technical problem
|because of the cost of synchronization.

You can say that again.  If you are not getting a bunch of processors,
you are better off staying with a super-fast von Neumann machine.  See
Stone's article on the search results reported by Thinking Machines.

|
|CRI could possibly build a SIMD implementation of the Y/MP; that is,
|a data parallel processor driven by the Y/MP instruction set.  There
|are only three things needed to do this that they don't currently
|have the expertise for:
|
|1) hardware
|2) software
|3) marketing (;-)
|
|In fact, I can't even imagine such a machine running.  However, I've
|been wrong about enough things that I'll try instead to imagine how
|long it will take Crayless-Cray to produce the machine.

I can't imagine such a machine being SIMD.  The Cray instruction set,
if implemented on each processor, contains data-dependent jumps.  The
first time one of these is executed, poof, you're off in MIMD land.
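
For what it's worth, the standard SIMD dodge is to predicate rather
than jump: every PE walks both arms of a conditional in lockstep, and
a per-PE mask decides whose stores count.  A sketch in C of the
general technique (not of any Cray or CM hardware):

    #include <stdio.h>

    #define N 8

    /* SIMD-style "if": all PEs execute both arms in lockstep; a
       per-PE mask decides whose results are kept.  No PE ever
       takes a different instruction stream, so the machine stays
       SIMD -- at the price of always executing both arms. */
    void simd_abs(double x[N])
    {
        int mask[N], i;
        for (i = 0; i < N; i++)          /* test phase, all PEs */
            mask[i] = (x[i] < 0.0);
        for (i = 0; i < N; i++)          /* "then" arm, masked  */
            if (mask[i]) x[i] = -x[i];
        /* an "else" arm would run under the complement of mask */
    }

    int main(void)
    {
        double x[N] = {-1, 2, -3, 4, -5, 6, -7, 8};
        int i;
        simd_abs(x);
        for (i = 0; i < N; i++) printf("%g ", x[i]);
        printf("\n");
        return 0;
    }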

fouts@bozeman.ingr.com (Martin Fouts) (06/28/90)

In article <58992@bu.edu.bu.edu> art@cs.bu.edu (Al Thompson) writes:

   In article <470@garth.UUCP> fouts@bozeman.ingr.com (Martin Fouts) writes:
   |In article <58230@bu.edu.bu.edu> art@cs.bu.edu (Al Thompson) writes:
   |   In article <390@garth.UUCP> fouts@bozeman.ingr.com (Martin Fouts) writes:
   [...]

   |I'm not quite as optimistic.  I will not quite equate "scientific
   |problem" with "easy," although I will certainly agree that they
   |embody a wide range of easy problems.  But I still agree.  In fact,
   |let me now introduce a term which I didn't invent (though I wish I
   |had):

   I didn't mean "easy" in the sense of a snap, I meant it in the sense of
   "easy parallelism".  That's not necessarily an easy problem.  In fact many
   scientific problems consist of calculations that are inherently simple but
   must be replicated over a large number of cases.  Geofry Fox has done a
   lot of work on this.

Actually, I also meant easy in the sense of "easy parallelism."  I'm
familiar with Fox's work.  I'm also familiar with several classes of
scientific problems which are not easy to parallelize.  To me, QCD is
easy to parallelize but hard to understand (I'm no quantum physicist),
while estimation of population dynamics is easy to understand but hard
to parallelize.  (I'm no biologist either, but the arguments are
"intuitive", and the problem is partial-order dependent, rendering it
intractable as a parallelizable problem.)
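
A caricature of the population problem in C: each generation needs the
previous one, so the time steps form a dependence chain and extra
processors buy you nothing.  (A one-species model like this is even
totally ordered; the real models are partial orders, but the chains
are still there.)

    #include <stdio.h>

    /* Logistic map as a stand-in for a population model: p at time
       t+1 is a function of p at time t, so step t+1 cannot start
       until step t is done, no matter how many processors you own. */
    int main(void)
    {
        double p = 0.5, r = 3.7;
        int t;
        for (t = 0; t < 100; t++)
            p = r * p * (1.0 - p);   /* loop-carried dependence */
        printf("p after 100 generations: %g\n", p);
        return 0;
    }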

   [...]

   |
   |"Functional Multiprocessing (tm?)"  == Decomposing the processors to
   |match the problem and then optimizing each processing element to its
   |task.  I predict that this will be a more important contribution then
   |vectorization.

   This is probably correct.  It is a compelling idea and one we are working
   on here, although not quite in the form as stated.

   |
   |   [...]
   |   |
   |   |CRI could possibly build a SIMD implementation of the Y/MP; that
   |   |is, a data parallel processor driven by the Y/MP instruction set.

   [...]

   |
   |   I can't imagine such a machine being SIMD.  The Cray instruction set,
   |   if implemented on each processor, contains data-dependent jumps.  The
   |   first time one of these is executed, poof, you're off in MIMD land.
   |
   |Actually, there are ways to make those jumps work in a SIMD
   |architecture that would support the iteration-to-a-fixed-point
   |method of parallelization described by Chandy and Misra, but I
   |wouldn't want to program one of them (;-)

   My point was turning MIMD to SIMD, not the other way around.  In
   fact I have done the fixed-point Chandy-Misra problem, and it's not
   quite that hard.  Perhaps the reason I say this is that I had it
   solved before I knew it (I wasn't really looking for a solution, but
   a light went on and sure enough, there it was).

   Turning SIMD to MIMD is not so terribly difficult.  To do so with
   any reasonable degree of efficiency in the other direction is not so
   easy.

Agreed.  Now think about my statement as an attempt to cast a MIMD
machine as a SIMD machine using an N-to-1 jump processor:  When a
data-dependent jump occurs, all processors go nondeterministically to
the same next instruction.  You still have lockstep instruction
execution, but now you have nondeterminism.  The only way I can think
of to write correct programs for this beast is through fixed-point
iteration.
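
Concretely, the style I mean looks like this sketch in C: keep
applying the same whole-array update until nothing changes.
Correctness doesn't depend on when any element settles, which is why
the style tolerates the nondeterministic jumps.

    #include <stdio.h>
    #include <math.h>

    #define N 5

    /* Fixed-point iteration: Jacobi-style relaxation toward the
       boundary values.  The program is correct because repeated
       application converges to the same fixed point no matter how
       the steps are scheduled. */
    int main(void)
    {
        double x[N] = {0, 0, 0, 0, 10};   /* ends are boundaries */
        double y[N];
        int i, changed = 1;

        while (changed) {
            changed = 0;
            for (i = 1; i < N - 1; i++)   /* same update, all PEs */
                y[i] = 0.5 * (x[i - 1] + x[i + 1]);
            for (i = 1; i < N - 1; i++) {
                if (fabs(y[i] - x[i]) > 1e-9) changed = 1;
                x[i] = y[i];
            }
        }
        printf("fixed point: %g %g %g %g %g\n",
               x[0], x[1], x[2], x[3], x[4]);
        return 0;
    }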

Marty
--
Martin Fouts

 UUCP:  ...!pyramid!garth!fouts  ARPA:  apd!fouts@ingr.com
PHONE:  (415) 852-2310            FAX:  (415) 856-9224
 MAIL:  2400 Geng Road, Palo Alto, CA, 94303

If you can find an opinion in my posting, please let me know.
I don't have opinions, only misconceptions.

fouts@bozeman.ingr.com (Martin Fouts) (06/29/90)

In article <27736@metropolis.super.ORG> lerici@super.ORG (Peter W. Brewer) writes:

   Hmmm, well yes, editors should run on supercomputers dedicated to doing
   fast floating point or large integer computations.  Emacs belongs on the
   Cray, of course.. :-)  I think a nice RISC-based front end to compile/edit
   etc. (after all, aren't compilers/link EDITORS etc. just editors?) may be
   enough.. let Floating Point do Floating Point.  Parallel compilers/parsers
   are nice, but how complicated are they, and how easy to debug?  How about
   extensibility, etc.?  So compilers don't run on the CM.. do they run in
   the Cray or Convex vector units?  That's where a lot of the performance
   comes from...  I think Unicos etc. should run on the foreground processors
   on the Crays... it creates too many problems running on the backends...

This is an old religious argument.  The simple refutation is that a
Cray isn't just a Floating Point (sic) unit.  On a Cray system, as
delivered to a customer, typically 1/4 of the manufacturing cost is
for the CPU and the rest for the non-CPU parts.  Of that 1/4, maybe
half is floating point, which puts floating point at roughly 1/8 of
the delivered system.  There is no system on the market for which the
CPU is more than half the manufacturing cost.

Besides, the users are more valuable than the tools they use.  Let's
try to make the machines usable, shall we?

Marty
--
Martin Fouts

 UUCP:  ...!pyramid!garth!fouts  ARPA:  apd!fouts@ingr.com
PHONE:  (415) 852-2310            FAX:  (415) 856-9224
 MAIL:  2400 Geng Road, Palo Alto, CA, 94303

If you can find an opinion in my posting, please let me know.
I don't have opinions, only misconceptions.

ddt@convex.COM (David Taylor) (07/04/90)

>In order to identify a useful point in the space, I proposed the gcc
>SPECmark.  It is not a scalar benchmark run through a scalar compiler;

I disagree.  I've been examining the gcc benchmark of the SPEC suite
for perhaps a month now, and it is a /very/ scalar benchmark.
Although I may not at this time release a SPECratio (lest it be set in
stone), I can say that the majority of the 10 SPEC benchmarks smile
favorably on the Convex C-series computers, especially when you
consider the price of the computer.  A couple of them might blow your
formidable mind (regardless of price), but you'll have to wait until
we're ready to unveil them.  As yet, I have only run the gcc SPEC
benchmark using a scalar compiler built for portability, not optimized
for speed, so I can't give you a good feel for the speed.  However,
the profile does indicate that it will be a very scalar benchmark.
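
For anyone not fluent in SPECspeak: a SPECratio is the reference
machine's time divided by the measured time, and the composite figure
is the geometric mean of the individual ratios.  A minimal sketch in
C, with made-up times:

    #include <stdio.h>
    #include <math.h>

    /* SPECratio = reference time / measured time; the composite is
       the geometric mean of the ratios.  All times here invented. */
    int main(void)
    {
        double ref[3]  = {1400.0, 22000.0, 7000.0};  /* reference s */
        double meas[3] = {100.0, 1500.0, 500.0};     /* measured s  */
        double logsum = 0.0;
        int i, n = 3;

        for (i = 0; i < n; i++) {
            double ratio = ref[i] / meas[i];
            printf("SPECratio %d: %.1f\n", i + 1, ratio);
            logsum += log(ratio);
        }
        printf("geometric mean: %.1f\n", exp(logsum / n));
        return 0;
    }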

I can't give you a date when we'll be releasing the numbers, but we
seem to be very close.  I've made trial runs of the SPECthruput as
well.  The C-series once again performed very well.  I'll keep you
posted on the release date for the actual numbers.

I am also a fan of the SPEC benchmarks.  I think they're a really good
indication of performance.  However, inevitably, the best measure of a
machine's speed is to see how fast a customer's application will run
on it under a customer's typical load.  Convexes do what they were
designed to do /extremely/ well.  Vectorization and parallelization
are very important factors for most of our customers.  Therefore, our
customers are usually very happy with the product.  We have the /best/
machines in our price bracket for semi-vectorizable/parallelizable
jobs.

	=-ddt->
--

     David L. Taylor, Esq, Performance Measurement Intern, Convex. (whew!)
         (214) 497-4860, ddt@convex.com or ddt@vondrake.cc.utexas.edu
                    Remember, flatulation is only natural.