[comp.sys.super] Massively Parallel LINPACK on the Intel Touchstone Delta machine

baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) (06/01/91)

The Intel Touchstone Delta machine was unveiled to the public today at the
California Institute of Technology.  The Touchstone Delta machine consists of
528 i860-based computational nodes, connected by a high-speed, two-dimensional
mesh.  This computer is yet another milestone in the cooperatively funded
DARPA/Intel Touchstone project.

The LINPACK benchmark has often been used as one measure of comparison between
machines.  Most recently, a new section of the report, entitled 'Massively
Parallel Computing', defines the same test (solving a dense system of linear
equations) but allows the problem size to scale with the size of the machine.
With the unveiling of the Touchstone Delta machine, Intel can now publish the
following double precision performance numbers for massively parallel LINPACK:

	 np        n    time   GFLOPS   MFLOPS/node
	---    -----   -----   ------   -----------
	192     2000     5.5     .971             5
	192     4000    20.8    2.053            11
	192     6000    51.3    2.808            15
	192     8000   102.5    3.331            17
	192    10000   178.2    3.742            19
	192    12000   284.8    4.046            21

	512     2000     4.5    1.187             2
	512     4000    14.2    3.007             6
	512     6000    31.5    4.574             9
	512     8000    58.3    5.867            11
	512    10000    96.8    6.889            13
	512    12000   146.9    7.844            15
	512    14000   213.7    8.561            17

Where np   -> number of processors
      n    -> order of the matrix (full 64-bit precision)
      time -> elapsed time in seconds

The previous leader in massively parallel LINPACK was Thinking Machines' CM-2,
at 5.2 GFLOPS on a matrix of order 26624 (64K processors).

The current production iPSC/860 has a published performance of 1.92 GFLOPS
on a matrix of order 8600, using 128 i860 processors, each with 8 Mbytes of
local memory.  Conventional 1000x1000 LINPACK on one i860 node is published
at 25 MFLOPS.


Jerry Baugh
Intel Supercomputer Systems Division

stevo@elroy.jpl.nasa.gov (Steve Groom) (06/04/91)

In article <1991Jun3.130104.15667@hubcap.clemson.edu> baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) writes:
>
>The LINPACK benchmark has often been used as one measure of comparison between
>machines, and most recently, a new section of the report, entitled 'Massively
>Parallel Computing' defines the same test, solve a dense set of linear
>equations, but allows for the problem sizes to scale with the size of the
>machine.  With the unveiling of the Touchstone Delta machine, Intel can now
>publish the following double precision performance numbers for massively
>parallel LINPACK:
[numbers deleted]

At first, I started to read this thinking "what the heck does LINPACK
have to do with the performance of a parallel computer other than
measuring the power of individual nodes?"  Then I started reading more
closely, and it appears that there's more to it than that.

Can someone explain how "massively parallel LINPACK" is different from
regular LINPACK?  What considerations for communication are made
in this benchmark?  Since LINPACK is normally used as a measure of number
crunching, I'm curious how this benchmark translates to parallel
computers.  As we all know (or we all should know), the performance
of a parallel computer is usually NOT the same as the performance of an
individual node multiplied by the number of nodes in the computer
(although we'd just love that to always be the case).
The obvious misapplication of this kind of benchmark would
be to multiply a single node's LINPACK performance by the number of nodes in
the machine.  I notice that the numbers posted in the above-referenced
article do not scale linearly with the number of nodes used, so there
is some efficiency loss from the single node case.  I'm itching to
find out what the source of this loss is.  This is of particular interest
as I am currently porting some existing parallel code to the Delta, and I'd
like to be able to handle the inevitable queries about "well, they say
the Delta does such-and-such LINPACK GFLOPS...".

Any explanation or references would be welcome.
-- 
Steve Groom, Jet Propulsion Laboratory, Pasadena, CA
stevo@elroy.jpl.nasa.gov  {ames,usc}!elroy!stevo
"... and the babe, she needs a chaaaa--nging..." (apologies to Bob Dylan)

-- 
=========================== MODERATOR ==============================
Steve Stevenson                            {steve,fpst}@hubcap.clemson.edu
Department of Computer Science,            comp.parallel
Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell

baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) (06/05/91)

In article <1991Jun3.233741.8570@elroy.jpl.nasa.gov> stevo@elroy.jpl.nasa.gov (Steve Groom) writes:
>[original post deleted ... Jerry Baugh]
>
>At first, I started to read this thinking "what the heck does LINPACK
>have to do with the performance of a parallel computer other than
>measuring the power of individual nodes?"  Then I started reading more
>closely, and it appears that there's more to it than that.
>
>Can someone explain how "massively parallel LINPACK" is different from
>regular LINPACK?  

Since I started this, I guess I ought to take a stab at explaining this
as well.  The standard LINPACK benchmark is constrained to 100x100 and
1000x1000 problems.  These problem sizes are relatively small when compared
with the capabilities of today's supercomputers, and they are small compared
to the sizes of problems people are solving today (a production code at
one of our customer sites does a 20k x 20k matrix multiply).  Recently,
Jack Dongarra extended his standard LINPACK benchmark suite to include
a new category, "Massively Parallel Computing".  This new category allows
the problem size to increase up to and including the limit of physical
memory.  Since the problem size can vary, it is important to note the
size (presented as the order of the matrix) as well as the FLOPS numbers.
One of the reasons we are proud of our numbers here at Intel is that
we got a high GFLOP number with a relatively small problem size (we can
still increase the problem size as we have not yet run out of physical
memory).

>What considerations for communication are made in this benchmark?  

Both data decomposition and data communication are critical.  The chosen
method of parallelizing a program must take both into account, and selecting
one method of data decomposition may dictate the data communication strategy.
The strategy used in this case involved a 2D data decomposition that does
both column and row mapping.  This takes advantage of the Touchstone Delta's
mesh architecture by allowing messages to flow to all nearest neighbors, and
it leads to higher performance than a 1D decomposition scheme.
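
As a rough illustration of the idea (this is a sketch of mine, not the
benchmark code itself), a torus-wrapped 2D decomposition assigns block
(i,j) of the matrix to node (i mod pr, j mod pc) of a pr x pc mesh; for
the 512-node runs a natural choice would be a 16 x 32 mesh:

      program torusmap
c     Rough illustration only, not the actual benchmark code.
c     Block (ib,jb) of the matrix goes to node
c     (mod(ib,pr), mod(jb,pc)) of a pr x pc mesh, so both rows and
c     columns stay balanced as the factorization proceeds and the
c     active part of the matrix shrinks.
      integer pr, pc, nblk, ib, jb
      parameter (pr = 16, pc = 32, nblk = 6)
      do ib = 0, nblk - 1
         do jb = 0, nblk - 1
            print *, ib, jb, ' -> node ', mod(ib, pr), mod(jb, pc)
         enddo
      enddo
      end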

>Since LINPACK is normally used as a measure of number crunching,
>I'm curious how this benchmark translates to parallel computers.  

This particular version was designed for parallel computers.

>As we all know (or we all should know), the performance
>of a parallel computer is usually NOT the same as the performance of an
>individual node multiplied by the number of nodes in the computer
>(although we'd just love that to always be the case).
>The obvious misapplication of this kind of benchmark would
>be to multiply a single node's LINPACK performance by the number of nodes in
>the machine.  I notice that the numbers posted in the above-referenced
>article do not scale linearly with the number of nodes used, so there
>is some efficiency loss from the single node case.  I'm itching to
>find out what the source of this loss is.  
 
Since the data values are distributed across multiple nodes, data must be
passed from the node where it resides to the node where it is needed for
the calculation.  This communication cost is a major part of the difference
you see between the single-node numbers and the 512-node numbers.
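
A crude scaling model shows the shape of the effect (the constants below
are made up for illustration, not measured Delta parameters): the
arithmetic grows like n**3 and is shared among the nodes, while the data
that must move between nodes grows only like n**2, so efficiency improves
as the problem gets larger.

      program effmodel
c     Crude scaling model with made-up constants, not measured
c     Delta parameters: computation is (2/3)*n**3 flops shared by
c     p nodes running at rate flops/sec each; communication is
c     charged at cc seconds per matrix element moved.
      double precision rate, cc, p, n, tcomp, tcomm, eff
      parameter (rate = 2.0d7, cc = 5.0d-7, p = 512.0d0)
      integer i
      n = 2000.0d0
      do i = 1, 4
         tcomp = (2.0d0 / 3.0d0) * n**3 / (p * rate)
         tcomm = cc * n**2
         eff = tcomp / (tcomp + tcomm)
         print *, 'n =', n, '  efficiency =', eff
         n = 2.0d0 * n
      enddo
      end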
 
>This is of particular interest
>as I am currently porting some existing parallel code to the Delta, and I'd
>like to be able to handle the inevitable queries about "well, they say
>the Delta does such-and-such LINPACK GFLOPS...".

Personally, I have always said that performance on real codes is important.
The "Massively Parallel LINPACK" benchmark provides one measure that is
meaningful for some classes of codes.  For other classes of programs
(database for example), it is meaningless.
 
>Any explanation or references would be welcome.

Hope what I've written helps.  For more information on LINPACK and
"Massively Parallel LINPACK", I suggest you get Jack Dongarra's paper,
'Performance of Various Computers Using Standard Linear Equations Software'.
For an explanation of a parallel 1D decomposition, try 'LAPACK Block
Factorization Algorithms on the Intel iPSC/860', LAPACK Working Note 24,
University of Tennessee, Oct. 1990, by Dongarra and Ostrouchov.

>Steve Groom, Jet Propulsion Laboratory, Pasadena, CA
>stevo@elroy.jpl.nasa.gov  {ames,usc}!elroy!stevo

Jerry Baugh
Intel SSD - baugh@SSD.intel.com

patrick@convex.COM (Patrick F. McGehearty) (06/06/91)

In article <1991Jun5.120653.7852@hubcap.clemson.edu> baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) writes:
>
>One of the reasons we are proud of our numbers here at Intel is that
>we got a high GFLOP number with a relatively small problem size (we can
>still increase the problem size as we have not yet run out of physical
>memory).
>
And rightfully so.  It's relatively easy to get a high GFLOP number
(assuming the raw machine power exists) for a sufficiently large problem,
assuming you can hook enough machines together and wait the necessary days
for the final solution.  There are effective linear equation solver
techniques with communication times proportional to the square of the
problem size, while the computation time is proportional to the cube of
the problem size.  So for sufficiently large problems, a gaggle of
workstations on a cluster of Ethernets could give GFLOP numbers if someone
wanted to take the trouble to run the test (say at one of the universities
with lots of workstations, like Carnegie Mellon, over a holiday weekend).
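
A back-of-the-envelope figure (the numbers here are round guesses, not a
real proposal): a dense solve of order n costs roughly (2/3)*n**3
operations, so at an aggregate 1 GFLOPS a matrix of order 100,000 would
keep the gaggle busy for about a week.

      program gaggle
c     Back-of-the-envelope estimate with round, made-up numbers:
c     a dense solve of order n costs about (2/3)*n**3 flops, so an
c     aggregate rate of 1 GFLOPS takes roughly a week for n = 100000.
      double precision n, rate, flops, days
      n = 100000.0d0
      rate = 1.0d9
      flops = (2.0d0 / 3.0d0) * n**3
      days = flops / rate / 86400.0d0
      print *, 'run time in days:', days
      end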

Another challenge I would like to see would be one which uses an efficient,
portable algorithm for linear equation solving written in a high level
language such as Fortran.  Then we could get some idea of what is achievable
on a machine without resorting to assembly language.

I know that what's most efficient on each machine varies with the
architecture, so maybe the portability requirement could be met by requiring
that the program be "standard", which could be determined by running it on
some commonly available workstations (with different instruction sets from
the target machine), while the supercomputer vendor could select and
implement the algorithm of their choice.  This approach would provide some
information about the ability of a compiler to get
the most out of a machine under ideal circumstances.  The requirement
that it also run on some other machine is intended to avoid machine-specific
system calls.  Does this benchmark idea sound interesting to
anyone else?

john@spectre.unm.edu (John Prentice) (06/06/91)

In article <1991Jun05.185818.1071@convex.com> patrick@convex.COM (Patrick F. McGehearty) writes:
>the most out of a machine under ideal circumstances.  The requirement
>that it also run on some other machine is intended to avoid machine
>specific system calls.  Does this benchmark idea sound interesting to
>anyone else?

Yes and no.  Yes, because your idea would level the playing field and
give people a feeling for what one might expect if you weren't able or
willing to exploit the special aspects of a system.  No, because one also
wants to know just how much can be milked from a supercomputer.  Perhaps
what we need is both sets of benchmarks.

I must admit, however, that I would like to see LINPACK benchmarks which
tackle matrices of a size more on the order of 10,000 to 100,000.  I can
trivially invert a 1000 x 1000 matrix in a few seconds on my Sparc.  Telling
me how fast I can do it on a supercomputer doesn't really tell me much about
how a system performs on the kinds of problems I need it for.  Allowing the
benchmarker to scale the problem to his or her physical memory would seem to
me to be a prescription for getting a lot of results that can't be compared
to each other.  What I would prefer is to just fix the dimension, but fix it
big.  That would give a standard benchmark, but more importantly, from the
perspective of a prospective user, it tells me how well the system can
handle real problems, not ones chosen to scale nicely to a particular
architecture.

John


-- 
John K. Prentice    john@spectre.unm.edu (Internet)
Computational Physics Group
Amparo Corporation, Albuquerque, NM

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (06/06/91)

In article <1991Jun3.233741.8570@elroy.jpl.nasa.gov> 
	stevo@elroy.jpl.nasa.gov (Steve Groom) writes:
>Can someone explain how "massively parallel LINPACK" is different from
>regular LINPACK?

Dongarra's latest Linpack report states that a fixed-size problem is
inappropriate for a scalable ensemble machine. So, instead, the
matrix size is allowed to vary. The solution method is considered
irrelevant: FLOPS is defined as  (2/3 N^3 + 2 N^2) / elapsed-time.

The Linpack report wants the following numbers:
 - Accuracy residual.
 - GFLOPS and N for the largest problem run.
 - N-HALF, the problem size that achieved half the GFLOPS.
 - Theoretical peak GFLOPS.
 - Number of processors.

In the following, it's interesting to see the ratio of N^2 to N-HALF^2:

     N   N-HALF   ratio
------   ------   -----
26,624   11,000       6    CM-2        64K PE (or 2K FPU)
21,376    3,193      45    nCUBE 2      1K PE
 5,504    1,180      22    MasPar      16K PE
14,000    5,500?      6    Touchstone  512 PE (as just reported)
-- 
Don		D.C.Lindsay 	Carnegie Mellon Robotics Institute

elias@wonton.TC.Cornell.EDU (Doug Elias) (06/06/91)

i'd appreciate some rationale on
   "FLOPS is defined as  (2/3 N^3 + 2 N^2) / elapsed-time",
where, i assume, N == number of processors used in the computation.

Thanks,
doug
#  ____    |Internet:   elias@theory.tn.cornell.edu
#dr _|_)oug|USmail:     Mgr., Adv. Comp. Res. Inst./Cornell Theory Center
#  (_|     |            704A TheoryCtrBldg/C.U./Ithaca/N.Y./14853-3801
#  (_|__   |MaBelle:    607-254-8826   Fax: 607-254-8888

gary@chpc.utexas.edu (Gary Smith) (06/06/91)

In article <1991Jun5.120653.7852@hubcap.clemson.edu> baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) writes:

>
>One of the reasons we are proud of our numbers here at Intel is that
>we got a high GFLOP number with a relatively small problem size (we can
>still increase the problem size as we have not yet run out of physical
>memory).
>
Fine.  Scale it to the physical memory limit and tell us how long it takes.
Worley has written a marvelous paper, "The Effect of Time Constraints on
Scaled Speed-Up" (ORNL/TM-11031), wherein he shows that a meaningful
interpretation of speed-up depends on constraints on execution time.  Why
don't advocates of the promise of massive parallelism acknowledge such
issues?

john@spectre.unm.edu (John Prentice) replies to the above article:

>>Allowing the
>>benchmarker to scale the problem to his or her physical memory would
>>seem to me to be a prescription for getting alot of results that can't be
>>compared to each other.  What I would prefer is to just fix the 
>>dimension, but fix it big.

Another important paper, "Type Architectures, Shared Memory, and the
Corollary of Modest Potential", should be read by john@spectre.unm.edu and
most other advocates of the promise of massive parallelism.  To paraphrase
Lawrence Snyder (the author of the paper, in the Annual Review of Computer
Science, 1986): it is certainly still premature to be pessimistic regarding
the ultimate success of achieving the benefits of massive parallelism, yet
THERE IS REASON TO BE EXTREMELY CAUTIOUS.

Continuing to paraphrase Snyder: The frequent challenge with physical
phenomena models is not to solve a fixed-size problem faster, but to
solve larger instances within a fixed time budget.  Simple rationale:
Solutions are so computationally intensive that problem instances of an
"interesting" size do not complete execution in a tolerable amount of
time.  What is the effect of parallelism on problem size?  Keeping time
fixed and UNREALISTICALLY assuming the best possible speedup, it follows
that superlinear problems (the interesting, realistic ones) can only be
improved sublinearly by parallel computation.  Specifically, if a problem
takes t = c*n^x sequential steps, and if the application of p processors
permits an increase in the problem size by a factor of m, so that
t = c*(m*n)^x/p, then the problem size can increase by at most a factor
of m = p^(1/x).  For example, to increase by two orders of magnitude the
size of a problem whose sequential performance is given by t = c*n^4
(three spatial dimensions and one time dimension) requires, WITH
UNREALISTIC OPTIMISM, 100,000,000 processors!
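
The arithmetic behind that figure is easy to check, and the same formula
shows what a 64K-processor machine buys for such a problem:

      program snyder
c     Check of the figure above: growing a t = c*n**4 problem by a
c     factor of m = 100 at fixed time needs p = m**x processors.
c     Also shown: the factor m that p = 65536 processors buys.
      double precision m, x
      m = 100.0d0
      x = 4.0d0
      print *, 'processors for m = 100:', m**x
      print *, 'm for p = 65536:       ', 65536.0d0**(1.0d0 / x)
      end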

Again, it's time for the advocates of the promise of massive parallelism
to acknowledge Snyder's "corollary of modest potential."

Randolph Gary Smith                       Internet: gary@chpc.utexas.edu
Systems Group                             Phonenet: (512) 471-2411
Center for High Performance Computing     Snailnet: 10100 Burnet Road
The University of Texas System                      Austin, Texas 78758-4497

lethin@raisin-scone.ai.mit.edu (Richard A. Lethin) (06/06/91)

Some questions:

At what size N for massive linpack does the accumulated numerical fuzz
swamp the precision available in a double precision number?

Aren't really large problems of interest sparse matrices anyway?  Who
holds the record for massive sparse matrix solves?

Perhaps some of the numerical experts out there could tell us
something about the characteristics of these massive numerical
problems. 

jet@karazm.math.uh.edu (J Eric Townsend) (06/06/91)

In article <16333@life.ai.mit.edu> lethin@raisin-scone.ai.mit.edu (Richard A. Lethin) writes:
>At what size N for massive linpack does the accumulated numerical fuzz
>swamp the precision available in a double precision number?

I dunno.  i860s will work with denormalized numbers (which kills
performance, IMHO).

>Aren't really large problems of interest sparse matrices anyway?  Who

Not really; talk to Boeing, for example.  I wish they'd publish on what
they're doing, but they're so damn secure that some of their employees
almost had a fit during an Intel class when it was mentioned that they
had an iPSC/860 and were running LOOCS.

--
J. Eric Townsend - jet@uh.edu - bitnet: jet@UHOU - vox: (713) 749-2126
Skate UNIX! (curb fault: skater dumped)

   --  If you're hacking PowerGloves and Amigas, drop me a line. --

dodson@convex.COM (Dave Dodson) (06/06/91)

In article <ELIAS.91Jun6090922@wonton.TC.Cornell.EDU> elias@wonton.TC.Cornell.EDU (Doug Elias) writes:
>i'd appreciate some rationale on
>   "FLOPS is defined as  (2/3 N^3 + 2 N^2) / elapsed-time",
>where, i assume, N == number of processors used in the computation.

No.  N is the order of the system of equations.  The expression in
parentheses approximately represents the operation count for Gaussian
elimination using ordinary vector and matrix operations.
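
As a quick check, the formula reproduces the figures posted for the Delta;
for instance, the 512-node run of order 12000:

      program chkflops
c     Check one of the posted Delta figures against the formula
c     flops = (2/3*n**3 + 2*n**2) / time: the 512-node run with
c     n = 12000 and t = 146.9 seconds should give 7.844 GFLOPS.
      double precision n, t, g
      n = 12000.0d0
      t = 146.9d0
      g = ((2.0d0 / 3.0d0) * n**3 + 2.0d0 * n**2) / t / 1.0d9
      print *, 'GFLOPS =', g
      end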

What is interesting about this is that there are algorithms based on
"fast" matrix multiplication, where the product of two K by K matrices
can be formed with fewer than O(K^3) floating point operations.  If you
use one of these fast algorithms, you may do significantly fewer than
(2/3 N^3 + 2 N^2) floating point operations, but you get credit for
(2/3 N^3 + 2 N^2) operations anyway.  The numerical stability of
these fast methods is questionable, especially when the equations
and unknowns are poorly scaled.  Therefore, a benchmark report on a
solver should state if a fast matrix multiplication algorithm is used
so the reader can draw his own conclusions regarding the solver's
applicability to his problem.

Another factor that must be considered is the type of pivoting used.
As pivoting can involve much data motion, omitting it is desirable
from the performance point of view.  However, the resulting code may
divide by zero or, even worse, produce severely inaccurate answers
with no warning.  Except for matrices that are known to have special
properties such as positive definiteness or strict diagonal dominance,
at least partial pivoting is required to ensure numerical stability.
I haven't read the Dongarra report to know whether it specifies that
pivoting is required, but if it doesn't specify, then a complete
report on a solver would describe the type of pivoting used.  I don't
recall that information in the Touchstone Delta report.
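
A tiny example of why pivoting matters (my own illustration, not taken
from any of the benchmark codes): the 2 x 2 system below has a zero in the
(1,1) position, so elimination without row interchange divides by zero,
while a single row swap (partial pivoting) solves it exactly.

      program pivot
c     Illustration only: solve  0*x + 1*y = 1,  1*x + 1*y = 2.
c     Without interchange the first pivot is zero and elimination
c     divides by zero; swapping the rows first (partial pivoting)
c     gives x = 1, y = 1.
      double precision a11, a12, a21, a22, b1, b2, m, x, y
      a11 = 0.0d0
      a12 = 1.0d0
      b1  = 1.0d0
      a21 = 1.0d0
      a22 = 1.0d0
      b2  = 2.0d0
c     swap rows so the pivot is the larger entry in column 1
      if (abs(a21) .gt. abs(a11)) then
         call swap(a11, a21)
         call swap(a12, a22)
         call swap(b1, b2)
      endif
      m = a21 / a11
      a22 = a22 - m * a12
      b2 = b2 - m * b1
      y = b2 / a22
      x = (b1 - a12 * y) / a11
      print *, 'x =', x, '  y =', y
      end

      subroutine swap(u, v)
      double precision u, v, t
      t = u
      u = v
      v = t
      return
      end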

----------------------------------------------------------------------

Dave Dodson		                             dodson@convex.COM
Convex Computer Corporation      Richardson, Texas      (214) 497-4234

dodson@convex.COM (Dave Dodson) (06/06/91)

In article <16333@life.ai.mit.edu> lethin@raisin-scone.ai.mit.edu (Richard A. Lethin) writes:
>Aren't really large problems of interest sparse matrices anyway?

In radar-cross-section work, huge dense non-Hermitian complex matrices
abound.  I suppose that larger matrices yield better models, making it
easier for you to determine if you can either find or hide the stealth
bomber, whichever you are trying to do.

----------------------------------------------------------------------

Dave Dodson		                             dodson@convex.COM
Convex Computer Corporation      Richardson, Texas      (214) 497-4234

john@spectre.unm.edu (John Prentice) (06/07/91)

In article <1991Jun6.144903.20456@chpc.utexas.edu> gary@chpc.utexas.edu (Gary Smith) writes:
>
>
>Continuing to paraphrase Snyder: The frequent challenge with physical
>phenomena models is not to solve a fixed-size problem faster, but to
>solve larger instances within a fixed time budget.  Simple rationale:
>Solutions are so computationally intensive that problem instances of an
>"interesting" size do not complete execution in a tolerable amount of
>time.  What is the effect of parallelism on problem size?  Keeping time
>fixed and UNREALISTICALLY assuming the best possible speedup, it follows
>that superlinear problems (the interesting, realistic ones) can only be
>improved sublinearly by parallel computation.  Specifically, if a problem
>takes t = c*n^x sequential steps, and if the application of p processors
>permits an increase in the problem size by a factor of m, t = c*(m*n)^x/p,
>then the problem size can increase by at most a factor of m = p^(1/x).
>For example, to increase by two orders of magnitude the size of a problem
>whose sequential performance is given by t = c*n^4 (three spatial, one time
>dimension) requires, WITH UNREALISTIC OPTIMISM, 100,000,000 processors!
>
>Again, it's time for the advocates of the promise of massive parallelism
>to acknowledge Snyder's "corollary of modest potential."
>

So what would you suggest as an alternative?  By this analysis, any way
you cut it, a serial processor will still take p times longer to do the
same problem (of course, you have ignored overhead, but that works in
favor of your argument).  If I can do it 64,000 times faster on a CM-2
and I don't have any choice but to do the problem, then I am going to
use the CM-2.  The alternative is to just not do the problem.

You are both right and wrong about what the goal is in scientific
computing.  For many applications, the goal isn't to run bigger problems,
it is to make current ones less expensive.  Clearly parallelism buys
you something there.  For ones where bigger is better, I will grant
you that parallelism has many drawbacks, but I fail to see an alternative.
The main thing that bothers me in your analysis, however, is that you are
being far too simplistic.  You are assuming that numerical schemes are
not going to change or be developed which exploit the unique capabilities
of a parallel architecture.  But things like multigrid, composite grid
domain decomposition, and so forth are being developed, and these schemes
can be made to really fly on a parallel system.  Parallelism is
largely a new way of thinking, and as more people become involved in it,
I think you will see numerical methods undergo some major advances as
people learn to think in terms of this new paradigm.

By the way, I know an awful lot of physicists who would not agree with
you that the only problems of interest are superlinear ones  :-) .

John
-- 
John K. Prentice    john@spectre.unm.edu (Internet)
Computational Physics Group
Amparo Corporation, Albuquerque, NM

alan@uh.msc.umn.edu (Alan Klietz) (06/07/91)

In article <1991Jun6.144903.20456@chpc.utexas.edu> gary@chpc.utexas.edu (Gary Smith) writes:
<
<To paraphrase Snyder: The frequent challenge with physical
<phenomena models is not to solve a fixed-size problem faster, but to
<solve larger instances within a fixed time budget.  Simple rationale:
<Solutions are so computationally intensive that problem instances of an
<"interesting" size do not complete execution in a tolerable amount of
<time.  What is the effect of parallelism on problem size?  Keeping time
<fixed and UNREALISTICALLY assuming the best possible speedup, it follows
<that superlinear problems (the interesting, realistic ones) can only be
<improved sublinearly by parallel computation.  Specifically, if a problem
<takes t = c*n^x sequential steps, and if the application of p processors
<permits an increase in the problem size by a factor of m, t = c*(m*n)^x/p,
<then the problem size can increase by at most a factor of m = p^(1/x).
<For example, to increase by two orders of magnitude the size of a problem
<whose sequential performance is given by t = c*n^4 (three spatial, one time
<dimension) requires, WITH UNREALISTIC OPTIMISM, 100,000,000 processors!

Well, if your problem really is quartic in n, yes, you are going to have
difficulty.  But I suspect most "real" problems tend more towards linear
or quadratic behavior.  Physical simulations that at first appear to demand
higher-order complexity often yield to a more efficient approach after a
re-examination of the problem.

--
Alan E. Klietz
Minnesota Supercomputer Center, Inc.
1200 Washington Avenue South
Minneapolis, MN  55415
Ph: +1 612 626 1737	       Internet: alan@msc.edu

prins@cs.unc.edu (Jan Prins) (06/08/91)

In article <1991Jun6.174129.25202@hubcap.clemson.edu>, dodson@convex.COM (Dave Dodson) writes:
> >   "FLOPS is defined as  (2/3 N^3 + 2 N^2) / elapsed-time",
> 
> What is interesting about this is that there are algorithms based on
> "fast" matrix multiplication, where the product of two K by K matrices
> can be formed with fewer than O(K^3) floating point operations.  If you
> use one of these fast algorithms, you may do significantly fewer than
> (2/3 N^3 + 2 N^2) floating point operations, but you get credit for
> (2/3 N^3 + 2 N^2) operations anyway.  [...]

In particular, using this month's asymptotically fastest solver, your
lowly workstation can beat the LINPACK performance of *any* machine
whose performance was obtained through use of the standard algorithm,
given enough time and space.

Of course, it's not clear whether you are allowed to report a record
performance 10,000 years before it completes.

--\--  Jan Prins  (prins@cs.unc.edu)  
  /    Computer Science Dept. 
--\--  UNC Chapel Hill  

gary@chpc.utexas.edu (Gary Smith) (06/10/91)

In article <1991Jun06.205144.22611@ariel.unm.edu>, john@spectre.unm.edu (John Prentice) writes:
|> In article <1991Jun6.144903.20456@chpc.utexas.edu> gary@chpc.utexas.edu (Gary Smith) writes:
|>
|> >Again, it's time for the advocates of the promise of massive parallelism
|> >to acknowledge Snyder's "corollary of modest potential."
|> >
|> 
|> So what would you suggest as an alternative.  By this analysis, anyway
|> you cut it, a serial processor will still take p times longer to do the
|> same problem (of course, you have ignored overhead, but that works in 
|> favor of your argument).  If I can do it 64,000 times faster on a CM-2
|> and I don't have any choice but to do the problem, then I am going to
|> use the CM-2.  The alternative is to just not do the problem. 
|> 
|> You are both right and wrong about what the goal is in scientific
|> computing.  For many applications, the goal isn't to run bigger problems,
|> it is to make current ones less expensive.

Using Speedup(f,p) = 1/[(1-f)+(f/p)], with f being the fraction of code
parallelized and p being the number of processors, and unrealistically
assuming no overhead, a speedup of 64000 using 65536 processors requires
that the problem be 99.9999634% parallel.  How many problems do you know
of that are that parallel?
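
(For anyone who wants to check that figure: inverting the speedup formula
for the required parallel fraction gives f = p*(s-1) / (s*(p-1)).)

      program amdahl
c     Check of the figure above: the parallel fraction f needed for
c     Speedup(f,p) = 1/((1-f) + f/p) to reach a speedup of s on p
c     processors is f = p*(s-1) / (s*(p-1)).
      double precision s, p, f
      s = 64000.0d0
      p = 65536.0d0
      f = p * (s - 1.0d0) / (s * (p - 1.0d0))
      print *, 'required parallel fraction =', f
      end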

---Gary

Randolph Gary Smith                       Internet: gary@chpc.utexas.edu
Systems Group                             Phonenet: (512) 471-2411
Center for High Performance Computing     Snailnet: 10100 Burnet Road
The University of Texas System                      Austin, Texas 78758-4497

john@spectre.unm.edu (John Prentice) (06/11/91)

In article <1991Jun10.144354.695@chpc.utexas.edu> gary@chpc.utexas.edu (Gary Smith) writes:
>
>Using Speedup(f,p) = 1/[(1-f)+(f/p)], with f being the fraction of code
>parallelized and p being the number of processors, and unrealistically
>assuming no overhead, a speedup of 64000 using 65536 processors requi-
>res that the problem be 99.9999634% parallel.  How many problems do you
>know of that are that parallel?
>

Very few.  But this misses the central question.  If we are going to get
even a factor of 100 speedup in the next few years, how else are we going
to do it other than by use of massive parallelism?  If there is an
alternative, I would be overjoyed to know about it.  I don't particularly
find parallel computing fun, and it is not particularly easy, but I just
don't see any alternative.  That it has limitations, so what?  The question
still remains: what is the alternative?

John
-- 
John K. Prentice    john@spectre.unm.edu (Internet)
Computational Physics Group
Amparo Corporation, Albuquerque, NM

furnish@zig.inria.fr (Geoffrey Furnish) (06/13/91)

In article <1991Jun10.144354.695@chpc.utexas.edu>, gary@chpc.utexas.edu (Gary Smith) writes:
> In article <1991Jun06.205144.22611@ariel.unm.edu>, john@spectre.unm.edu (John Prentice) writes:
> |> In article <1991Jun6.144903.20456@chpc.utexas.edu> gary@chpc.utexas.edu (Gary Smith) writes:
> |>
> |> >Again, it's time for the advocates of the promise of massive parallelism
> |> >to acknowledge Snyder's "corollary of modest potential."
> |> >
> |> 
> |> So what would you suggest as an alternative.  By this analysis, anyway
> |> you cut it, a serial processor will still take p times longer to do the
> |> same problem (of course, you have ignored overhead, but that works in 
> |> favor of your argument).  If I can do it 64,000 times faster on a CM-2
> |> and I don't have any choice but to do the problem, then I am going to
> |> use the CM-2.  The alternative is to just not do the problem. 
> |> 
> |> You are both right and wrong about what the goal is in scientific
> |> computing.  For many applications, the goal isn't to run bigger problems,
> |> it is to make current ones less expensive.
> 
> Using Speedup(f,p) = 1/[(1-f)+(f/p)], with f being the fraction of code
> parallelized and p being the number of processors, and unrealistically
> assuming no overhead, a speedup of 64000 using 65536 processors requi-
> res that the problem be 99.9999634% parallel.  How many problems do you
> know of that are that parallel?
> 
> ---Gary
> 
> Randolph Gary Smith                       Internet: gary@chpc.utexas.edu
> Systems Group                             Phonenet: (512) 471-2411
> Center for High Performance Computing     Snailnet: 10100 Burnet Road
> The University of Texas System                      Austin, Texas 78758-4497

I have been reading this thread for a while and am finding it very
interesting, especially since I am a graduate student in physics at
UT Austin and am quite familiar with the center you are affiliated with.
I am tempted to infer that the pessimism you are expressing probably
reflects the attitudes behind the UT System's procurement plans.  If so,
that is extremely unfortunate for researchers in the UT system.

Although I am not familiar with the theoretical results you are citing, 
and indeed am quite interested in what they imply, there are a few things
which are clear to me.

First of all, I have been spending this spring doing serious scientific
computing on a Connection Machine 2 (located in France).  By this point I
have worked with it extensively for several months, and on two completely
different sorts of problems.

One of these is the time advancement of a field subject to a nasty nonlinear
PDE.  The CM with its massively parallel architecture is _EXTREMELY_ well
suited for this problem, and the results I have obtained are absolutely
mind boggling.  I have not computed MIPS or MFLOPS for this problem, but I
do know that the throughput is many many times better than I have ever
seen on any computer before in my life.

Secondly, I am not a greenhorn.  I spend lots of computing time on
high-dollar Crays out at the National Energy Research Supercomputing
Center's facilities.  I also occasionally use the facilities at the UT
Center for High Performance Computing.  And I can tell you now from
personal experience that for some types of problems, comparing a Cray to
a CM-2 is like comparing a tortoise to a cheetah.  That is not a
theoretical analysis.  It is an "experimental" fact/observation.

Additionally, your last line above asks us to consider a speedup of 64000,
and then you presumably show how ridiculous this is.  Of course it is
ridiculous.  No one needs to see a speedup of 64000 to justify the cost
of an MP-architecture machine.  One only needs to see a speedup which
offsets the cost differential in the prices of the computers in question.
It's not clear to me that a CM-2 is much pricier than a Cray in the first
place, yet it is _MUCH_ faster for at least some kinds of problems.  The
speedup I see is more like two orders of magnitude, not 4.5.

You further assert/imply that no problem could be 99.999...% parallel.
I think this shows that you really don't understand what you're talking
about.  In my field advancement problem, it is _ALL_ parallel.  Period.
End of discussion.  Even the real-time display of results is accomplished
via a parallel projection function.  The only part of the code which is not
parallel computation is control flow, and that takes essentially zero time,
since there are no decision branches in the solution algorithm.  There is
also history keeping, but that relates to I/O, not computer architecture.

The point is, Danny Hillis (founder of Thinking Machines) is right:  many
problems are _inherently_ parallel, and trying to solve them with
serial-architecture computers introduces artificial and unnecessary
complexity.

Which brings me to my last point.  You seem to be completely oblivious to
the whole issue of ease of programming.  The CM comes with a slew of
languages which implement parallel programming operations with simple,
easy-to-understand syntax.  They have a compiler they call C*, which
consists of very reasonable extensions to ANSI C to provide facilities for
expressing parallel constructs and communication operations.  Substantial
efforts are underway to standardize this language with other vendors of MP
hardware.  They also have an implementation of FORTRAN 90 which provides
direct support for array processing.

The importance of this is that these products make programming in parallel
_MUCH EASIER_ than programming for serial-architecture machines like Crays.
For example, elemental array arithmetic requires no subscripts.  On the CM
you can say:
	a = b * c
On a Cray you would have to say:
	do i = 1, n
		do j = 1, m
			a(i,j) = b(i,j) * c(i,j)
		enddo
	enddo

Big deal?  Well, it is when you try to go and enhance your program in some
way.  In my case I was able to take my parallel program on the CM and
convert it from a code which solved a field in two dimensions to one which
solved it in three dimensions, by changing 4 lines of code which related
to the specification of the arrays, taking a total of about 30 minutes of
my time.

In a language written for a serial machine like a CRAY, etc., I'd have had
to change every single set of double loops into triple loops, and add a
subscript to every single array reference in the program.  Something like
three or four hundred places.  In my book, the opportunity for coding errors
goes as the exponential of the number of changes you make.  On the CM, using
a parallel version of C, I only had to change the data declarations.
The actual solution algorithm was not even touched.  On a CRAY I'd have
spent a couple of weeks making the necessary mods to my sources, and then
I'd have had to worry for months about possible coding errors.

Furthermore, using the parallel C on the CM, my solution  algorithm is 
actually  _COMPLETELY INDEPENDENT OF SYSTEM DIMENSIONALITY_.  To make
the code solve this field in 4 or 5 or n dimensions requires precisely
zero modifications to the source code algorithm.  I would have to change
the data decls, but again, that's four lines of code.  

The advantages of massive parallelism are many and great.  You should
endeavor to try it before hollering about what a farce it is.  In my book
there are two ways to benchmark computers.  One is theoretical computing
speed, like MFLOPS, etc.  In this department the CM completely overwhelms
the Cray for certain classes of programs.  The other is the pragmatic issue
of how long it takes to get a trustworthy program up and running, wall time
to completion, ease of obtaining graphics, and the like.  Again, using
parallel expression, on problems well suited for it, the CM beats the Cray
hands down in the "reliable programming" department.  From the standpoint
of scientific visualization, the best tools I've ever seen run on the CM.
One of the reasons I don't use UT CHPC more is the infuriating difficulty
of getting graphical output.  In contrast, Thinking Machines provides
direct and easy-to-use support for X; so much so that you can render images
on their high-speed graphics device or in an X window on a networked
workstation _WITHOUT MODIFYING A SINGLE LINE OF CODE_.  Ask Cray to do that
for you!

Is massive parallelism the solution to all mankind's computing needs?
Probably not.  I am sure the theoretical concerns you have cited have their
domain of applicability.  But, I am also certain that when I want to get
work done, I can get it done faster, more reliably and with less effort on
a CM-2 than I can on a Cray.  That's just the way it is.

For the benefit of the UT scientific research community and beyond, I urge
you and others who hold your skepticism to reach a little beyond the
confines of Cray's marketing grasp and take a more informed look at
what MP has to offer.  I for one have been extremely impressed by what I
found under the hood of the CM-2.

Geoffrey Furnish
furnish@solar.ph.utexas.edu

gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) (06/14/91)

In article <11771@mirsa.inria.fr> furnish@zig.inria.fr (Geoffrey Furnish) writes:

>In a language written for a serial machine like a CRAY, etc, I'd have had
>to change every single set of double loops into triple loops, and add a
>subscript to every single array reference in the program.

Er, there's nothing about FORTRAN 90 that makes it impossible to write
a Cray compiler for it. On the other hand, there are array expressions
in F90 which cannot be optimized as well as the equivalent serial
expression.  Ask on comp.lang.fortran for examples.

Array notation can be nice in some situations. It isn't for everyone.

For example, I have a program which has a 3d data array, and a
one-dimensional operator which must be applied in the x, y, and z
directions. This "operator" is about 3,000 lines of code. Tell me how
I can write all those expressions into 3 sets of 3d-array code without
writing my own preprocessor? Hm?

Nice, yes. Solves everything, no. BTW, my code is typical for
high-accuracy astrophysical fluid dynamics.

-- greg

rvdg@cs.utexas.edu (Robert van de Geijn) (06/14/91)

This article is in response to several articles posted recently to
comp.parallel concerning the massively parallel LINPACK benchmark on
the Intel Touchstone DELTA machine.  Since I was the anonymous
"benchmarker", I believe a few words from me may shed some light on
the situation.

I was contacted by Intel around the second week of May and asked to
implement the LINPACK Benchmark on the DELTA.  The initial goal was to
beat the "world record" 5.2 GFLOPS (double precision), reported by
Thinking Machines for the CM-2, by the unveiling of the DELTA, May 31.
While a version of the LU factorization had been developed at the
University of Tennessee by Jack Dongarra and Susan Ostrouchov, this
version assumes that the matrix is mapped to nodes by wrapping panels
onto an embedded ring.  Moreover, no satisfactory triangular solve had
been implemented, nor had all pieces been put together.  Finally, a
quick "back-of-the-envelope" calculation indicated that this version
did not scale well to a large machine like the DELTA.

Starting from scratch, I implemented a block-torus-wrapped version of
the right-looking LAPACK LU factorization variant, as well as
torus-wrapped variants of the Li-Coleman implementation of the
triangular solve (also known as cyclic algorithms).  The development
was done on the ORNL 128-node Intel iPSC/860 GAMMA machine, achieving
an impressive 1.92 GFLOPS.  Two weeks later, on May 23, during the first
attempt on the DELTA, the code attained 7.844 GFLOPS on 512 nodes for a
12000 x 12000 problem.

After adjusting storage buffer sizes, the same code produced the
numbers reported in article 2643 of comp.parallel (8.561 GFLOPS).
Next, the author went on a much-needed vacation at Disneyland and Sea
World.  On June 5, Thinking Machines announced the CM-200 and a new
record, 9.03 GFLOPS for a 28672 x 28672 problem.  On June 6, my code
attained 10.2 GFLOPS for a 20000 x 20000 problem on the DELTA.

It should be noted that the code being used was put together quickly, in
a matter of two weeks.  There are many parameters that can be optimized,
and the memory has not yet been exhausted.  A 20K x 20K problem is
completed in only 522.7 seconds.  Moreover, there are still many standard
techniques that can be used to reduce the communication overhead.  10.2
GFLOPS is only a start.

I am working on a short report on this topic and I am in the process
of completing the code.  Intel has promised to release the code to the
public domain, once it has been completed.

Robert van de Geijn
Assistant Professor
The University of Texas at Austin

furnish@zig.inria.fr (Geoffrey Furnish) (06/17/91)

This is a followup to my prior posting in which I related my experiences
in using the CM-2.  Since that posting I have received a deluge of private
mail containing comments of all sorts.  In particular it has come to my
attention that some of the statements I made in that posting were not correct.
In the following I present corrected information supplied to me by an
employee of Cray, and some comments of my own.  Assuming that I get all the
mistakes worked out with this posting, I don't intend to continue my part
of this discussion further on the net, but would still be happy to receive
comments from other interested parties.  My apologies for the confusion I
introduced.
----------
>> you can say:
>> 	a = b * c
>> On a Cray you would have to say:
>> 	do i = 1, n
>> 		do j = 1, m
>> 			a(i,j) = b(i,j) * c(i,j)
>> 		enddo
>> 	enddo
>
>A subset of the Fortran-90 array syntax has been in CFT77 since
>version 1.0 (about 1986).  If you would like to express your problems
>this way, there is nothing stopping you!

I was unaware that CFT77 provides partial Fortran 90 support, and
thanks for setting me straight.  I am sure we will all benefit by the
widespread availability of this enhanced language definition on all
platforms from workstations to supercomputers.  May that day come quickly!

>Many believe that Unicos has excellent development tools for both Fortran and
>C.  Both are far beyond what a 'normal' Unix system would provide.  Many are
>X window based (cdbx, and the *view tools for example).  

Having not used Unicos, I was unaware of this also.  So far all Cray systems
I've used have run either COS or CTSS.  I am aware that there is a growing
movement among supercomputing facilities to move to Unicos, and it sounds
like we will benefit from this trend.

>Also comparing a 'stand-alone' connection machine use (they aren't *really*
>timeshared...) with a highly utilized Cray system is an Apples to Oranges
>comparison.  

I'm not sure what the author means by *really* in this case.  I do know
that the CM-2 I use every day comes in two "halves."  One half is
single-tasking during business hours, and the other provides what I
consider to be genuine timesharing capability during business hours.  By
this I mean that several users can simultaneously use this half of the
machine, simultaneously running programs and obtaining results.  Obviously
it isn't as fast as when you operate in single-task mode, but that is only
natural.  My understanding is that this multi-user capability is
accomplished by swapping users' sessions in and out, rather than making
them share resources.  I believe this is by definition timesharing, as
opposed to multitasking.  Both halves run batch queues at night, but you
can slip in interactive sessions between batch jobs if you're sly.

>Stand-alone Cray systems can provide excellent turnaround to
>a single user as well.  If you have DARPA to provide you with an expensive
>play toy, a CM can make sense.  Most of our customers are not so lucky.

No doubt.  On the other hand, neither am I.  The CM-2 I have been using
is shared by a large number of researchers.  The user load is similar
to what I have experienced when using Crays at US-based supercomputer
centers.  My claim is that the CM has provided the most productive research
environment that I have used to date.  I obviously can't speak for anyone
else, but it works for me.

>> Thinking Machines provides direct and easy to use support for X; so much so
>> that you can render images on their high speed graphics device or in an
>> X window on a networked workstation _WITHOUT MODIFYING A SINGLE LINE OF CODE_.
>> Ask Cray to do that for you!
>
>This is also an interesting comment.  X has been available on Unicos since
>version 3.0 (1987) - first X10, and currently X11 (R4).  Motif and OpenLook
>are also readily available.  Again what is the problem? 

Again, I have not used Unicos, so I was not aware of this capability.  It
sounds like all Cray users would be a lot better off if all Crays ran
Unicos.

To wrap up my contribution to this thread, let me say that it was certainly
not my intention to under-represent Cray products.  I can only comment on
my own experiences, and all of these things were outside of my experience.

Furthermore, parallelism has many facets, and runs a wide gamut from
coarse-grained (multiprocessors like the Y-MP and others) to fine-grained
(like the CM and others).  Each system has applicability to a class of
problems, and those of us doing research with parallel computing are
continually finding new ways to use systems of both descriptions.  I think
the point that I and several others have been trying to make is that MP in
particular has a whole lot more capability to offer than most people
realize.

When I first sat down and read the technical description of the CM and its
SIMD programming model I thought "well, that's interesting, but it would
be better if ..."  Then I sat down and started programming it.  And I realized
much to my own surprise that a very large percentage of the things I am
interested in are very naturally and efficiently expressed and solved in the
SIMD model.  In fact, there is nothing that I am interested in which is
more easily/naturally expressed using any other paradigm.  

MIMD machines in particular (I have been reminded of the origins of this
thread) may provide very significant per-processor performance, that is
true.  But then you have to introduce all kinds of synchronization code to
manage the dispersal of intermediate results.  This introduces time delay
(the overhead referred to in numerous prior postings) and, what I consider
to be more important, added complexity.  Additionally, in order to squeeze
good performance out of such systems throughout the course of an
application's execution, it is often necessary to dynamically repartition
the problem to balance the load on each processor so that they each perform
their share in a comparable amount of time.  This adds even more artificial
complexity to the problem.

It seems to me that a lot of the discussion about supercomputer performance
is made in reference to performance on certain canned applications like
LINPACK or NASTRAN or the like.  While these things do reflect the needs of
a large body of users, there is another group of us who NEVER run canned
programs, but rather spend the vast majority of our time writing our own
code to solve our own problems.  For those of us in that category, M/GFLOPS
may very legitimately take a back seat to more important and less-discussed
issues like code complexity and reliability.  NASTRAN may have been
debugged in 1965, but what about the nuclear reactor simulator you are
writing yourself?  For these people, the expressive capability and
simplicity of SIMD programming is not a matter of convenience, but of
necessity.  That such machines can yield superior performance, too, for a
large class of problems is icing on the cake.

To each his own.

Geoff Furnish
furnish@solar.ph.utexas.edu

spsl@cs.bham.ac.uk (Steven Lam) (06/20/91)

Dear Sir,

I understand that you've been using both the Cray and the CM-2.  Can you
give me some figures, such as the price and the performance of each
machine?  I personally believe that fine-grain computers are much cheaper
to build and hence less dependent on integration technology.  Does the
compiler for the CM-2 allow you to grab some number of processing cells for
your application while other users do the same?  Have you ever experienced
a communication bottleneck when working with the Cray?  There have been a
few papers on the CM-2 machine recently; they found that the Connection
Machine is not very good at computing the singular value decomposition or
similar problems, as the cell utilization may fall to 20-40%.  What is your
comment?

Stephen Lam

+++++++++++++++++++++++
School of Electronic Engineering
University of Birmingham
Birmingham B15 2TT
e-mail: spsl@uk.ac.bham.cs
+++++++++++++++++++++

Thank you.