[comp.parallel] Massively Parallel LINPACK on the Intel Touchstone Delta machine

baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) (06/01/91)

The Intel Touchstone Delta machine was unveiled to the public today at the
California Institute of Technology.  The Touchstone Delta machine consists of
528 i860-based computational nodes, connected by a high-speed, two-dimensional
mesh.  This computer is yet another milestone in the cooperatively funded
DARPA / Intel Touchstone project.

The LINPACK benchmark has often been used as one measure of comparison between
machines.  Most recently, a new section of the report, entitled 'Massively
Parallel Computing', defines the same test (solving a dense system of linear
equations) but allows the problem size to scale with the size of the
machine.  With the unveiling of the Touchstone Delta machine, Intel can now
publish the following double precision performance numbers for massively
parallel LINPACK:

         np       n    time   GFLOPS   MFLOPS/node
        ---   -----   -----   ------   -----------
        192    2000     5.5    0.971             5
        192    4000    20.8    2.053            11
        192    6000    51.3    2.808            15
        192    8000   102.5    3.331            17
        192   10000   178.2    3.742            19
        192   12000   284.8    4.046            21

        512    2000     4.5    1.187             2
        512    4000    14.2    3.007             6
        512    6000    31.5    4.574             9
        512    8000    58.3    5.867            11
        512   10000    96.8    6.889            13
        512   12000   146.9    7.844            15
        512   14000   213.7    8.561            17

Where np   -> number of processors
      n    -> order of the matrix (full 64-bit precision)
      time -> elapsed time in seconds

The previous leader in massively parallel LINPACK was Thinking Machines,
at 5.2 GFLOPS on a matrix of order 26624 (64K processors).

The current production iPSC/860 has a published performance of 1.92 GFLOPS
on a matrix of order 8600, using 128 i860 processors, each with 8 Mbytes of
local memory.  Conventional 1000x1000 LINPACK on one i860 node is published
at 25 MFLOPS.


Jerry Baugh
Intel Supercomputer Systems Division

stevo@elroy.jpl.nasa.gov (Steve Groom) (06/04/91)

In article <1991Jun3.130104.15667@hubcap.clemson.edu> baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) writes:
>
>The LINPACK benchmark has often been used as one measure of comparison between
>machines, and most recently, a new section of the report, entitled 'Massively
>Parallel Computing' defines the same test, solve a dense set of linear
>equations, but allows for the problem sizes to scale with the size of the
>machine.  With the unveiling of the Touchstone Delta machine, Intel can now
>publish the following double precision performance numbers for massively
>parallel LINPACK:
[numbers deleted]

At first, I started to read this thinking "what the heck does LINPACK
have to do with the performance of a parallel computer other than
measuring the power of individual nodes?"  Then I started reading more
closely, and it appears that there's more to it than that.

Can someone explain how "massively parallel LINPACK" is different from
regular LINPACK?  What considerations for communication are made
in this benchmark?  Since LINPACK is normally used as a measure of number
crunching, I'm curious how this benchmark translates to parallel
computers.  As we all know (or we all should know), the performance
of a parallel computer is usually NOT the same as the performance of an
individual node multiplied by the number of nodes in the computer
(although we'd just love that to always be the case).
The obvious misapplication of this kind of benchmark would
be to multiply a single node's LINPACK performance by the number of nodes in
the machine.  I notice that the numbers posted in the above-referenced
article do not scale linearly with the number of nodes used, so there
is some efficiency loss from the single node case.  I'm itching to
find out what the source of this loss is.  This is of particular interest
as I am currently porting some existing parallel code to the Delta, and I'd
like to be able to handle the inevitable queries about "well, they say
the Delta does such-and-such LINPACK GFLOPS...".

Any explanation or references would be welcome.
-- 
Steve Groom, Jet Propulsion Laboratory, Pasadena, CA
stevo@elroy.jpl.nasa.gov  {ames,usc}!elroy!stevo
"... and the babe, she needs a chaaaa--nging..." (apologies to Bob Dylan)

-- 
=========================== MODERATOR ==============================
Steve Stevenson                            {steve,fpst}@hubcap.clemson.edu
Department of Computer Science,            comp.parallel
Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell

baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) (06/05/91)

In article <1991Jun3.233741.8570@elroy.jpl.nasa.gov> stevo@elroy.jpl.nasa.gov (Steve Groom) writes:
>[original post deleted ... Jerry Baugh]
>
>At first, I started to read this thinking "what the heck does LINPACK
>have to do with the performance of a parallel computer other than
>measuring the power of individual nodes?"  Then I started reading more
>closely, and it appears that there's more to it than that.
>
>Can someone explain how "massively parallel LINPACK" is different from
>regular LINPACK?  

Since I started this, I guess I ought to take a stab at explaining this
as well.  The standard LINPACK benchmark is constrained to 100x100 and
1000x1000 problems.  These problem sizes are relatively small when compared
with the capabilities of today's supercomputers, and they are small compared
to the size of the problems people are solving today (a production code at
one of our customer sites does a 20k x 20k matrix multiply).  Recently,
Jack Dongarra extended his standard LINPACK benchmark suite to include
a new category, "Massively Parallel Computing".  This new category allows
the problem size to increase up to and including the limit of physical
memory.  Since the problem size can vary, it is important to note the
size (presented as the order of the matrix) as well as the FLOPS numbers
presented.  One of the reasons we are proud of our numbers here at Intel
is that we got a high GFLOPS number with a relatively small problem size
(we can still increase the problem size, as we have not yet run out of
physical memory).
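
(As a rough feel for the storage involved: a dense matrix of order N in
64-bit precision occupies 8*N^2 bytes, so the N = 14000 run above holds
about 1.57 Gbytes of matrix data, or a bit over 3 Mbytes per node when
spread across 512 nodes.  That is back-of-the-envelope arithmetic only;
it ignores workspace and communication buffers.)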

>What considerations for communication are made in this benchmark?  

Both data decomposition and data communication are critical.  The
chosen method of parallelizing a program must take both into
account; selecting one method of data decomposition may dictate the
data communication strategy.  The strategy used in this case involved
a 2D data decomposition that does both column and row mapping.  This
takes advantage of the Touchstone Delta's mesh architecture by allowing
messages to flow to all nearest neighbors, and it leads to
higher performance than a 1D decomposition scheme.
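
For anyone not familiar with the terminology, here is a minimal sketch of
a 2D torus-wrapped (block-cyclic) decomposition: matrix blocks are dealt
out across a logical grid of nodes in both dimensions, wrapping around in
each direction.  The grid shape and block size below are illustrative
values picked for the example, not the parameters used for the Delta
runs, and this is not the benchmark code itself.

/* Sketch of a 2D block-cyclic ("torus-wrapped") data decomposition.
 * Element (i,j) of an N x N matrix, partitioned into NB x NB blocks,
 * is owned by the node in row (i/NB) mod P, column (j/NB) mod Q of a
 * P x Q logical node grid.  Illustration only. */
#include <stdio.h>

#define P   16    /* node grid rows    (16 * 32 = 512 nodes, for example) */
#define Q   32    /* node grid columns                                    */
#define NB   8    /* block size in matrix elements (illustrative)         */

/* Return the linear id of the node that owns matrix element (i, j). */
static int owner(int i, int j)
{
    int prow = (i / NB) % P;    /* grid row holding block row i/NB    */
    int pcol = (j / NB) % Q;    /* grid column holding block col j/NB */
    return prow * Q + pcol;     /* row-major numbering of the grid    */
}

int main(void)
{
    /* Successive block rows and columns wrap around the grid, so every
     * node holds pieces from all over the matrix and the load stays
     * balanced as the factorization shrinks the active submatrix. */
    printf("element (0,0)       -> node %d\n", owner(0, 0));
    printf("element (0,256)     -> node %d\n", owner(0, 256));
    printf("element (128,0)     -> node %d\n", owner(128, 0));
    printf("element (9000,9000) -> node %d\n", owner(9000, 9000));
    return 0;
}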

>Since LINPACK is normally used as a measure of number crunching,
>I'm curious how this benchmark translates to parallel computers.  

This particular version was designed for parallel computers.

>As we all know (or we all should know), the performance
>of a parallel computer is usually NOT the same as the performance of an
>individual node multiplied by the number of nodes in the computer
>(although we'd just love that to always be the case).
>The obvious misapplication of this kind of benchmark would
>be to multiply a single node's LINPACK performance by the number of nodes in
>the machine.  I notice that the numbers posted in the above-referenced
>article do not scale linearly with the number of nodes used, so there
>is some efficiency loss from the single node case.  I'm itching to
>find out what the source of this loss is.  
 
Since the data values are distributed across multiple nodes, data must be
passed from one node (where it resides) to another node (where it is needed
for the calculation).  This communication cost is a major part of the
difference you see between the single node numbers and the 512 node
numbers.
 
>This is of particular interest
>as I am currently porting some existing parallel code to the Delta, and I'd
>like to be able to handle the inevitable queries about "well, they say
>the Delta does such-and-such LINPACK GFLOPS...".

Personally, I have always said that performance on real codes is important.
The "Massively Parallel LINPACK" benchmark provides one measure that is
meaningful for some classes of codes.  For other classes of programs
(database for example), it is meaningless.
 
>Any explanation or references would be welcome.

Hope what I've written helps.  For more information on LINPACK and the
"Massively Parallel LINPACK", I suggest you get Jack Dongarra's paper
'Performance of Various Computers Using Standard Linear Equations Software'.
For an explanation of a parallel 1D decomposition, try 'LAPACK Block
Factorization Algorithms on the Intel iPSC/860', LAPACK Working Note 24,
University of Tennessee, Oct. 1990, by Dongarra and Ostrouchov.

>Steve Groom, Jet Propulsion Laboratory, Pasadena, CA
>stevo@elroy.jpl.nasa.gov  {ames,usc}!elroy!stevo

Jerry Baugh
Intel SSD - baugh@SSD.intel.com

elias@wonton.TC.Cornell.EDU (Doug Elias) (06/06/91)

i'd appreciate some rationale on
   "FLOPS is defined as  (2/3 N^3 + 2 N^2) / elapsed-time",
where, i assume, N == number of processors used in the computation.

Thanks,
doug
#  ____    |Internet:   elias@theory.tn.cornell.edu
#dr _|_)oug|USmail:     Mgr., Adv. Comp. Res. Inst./Cornell Theory Center
#  (_|     |            704A TheoryCtrBldg/C.U./Ithaca/N.Y./14853-3801
#  (_|__   |MaBelle:    607-254-8826   Fax: 607-254-8888

dodson@convex.COM (Dave Dodson) (06/06/91)

In article <ELIAS.91Jun6090922@wonton.TC.Cornell.EDU> elias@wonton.TC.Cornell.EDU (Doug Elias) writes:
>i'd appreciate some rationale on
>   "FLOPS is defined as  (2/3 N^3 + 2 N^2) / elapsed-time",
>where, i assume, N == number of processors used in the computation.

No.  N is the order of the system of equations.  The expression in
parentheses approximately represents the operation count for Gauss
Elimination using ordinary vector and matrix operations.
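
As a quick sanity check against the Delta numbers posted earlier (my own
arithmetic, not anything from the report): plugging the 512-node, order
12000 run into the formula reproduces the reported rate.

/* Check the LINPACK rate formula against one posted Delta entry:
 * np = 512, N = 12000, elapsed time = 146.9 s.  Prints about 7.844,
 * matching the table.  Illustration only. */
#include <stdio.h>

int main(void)
{
    double n    = 12000.0;   /* order of the matrix   */
    double secs = 146.9;     /* reported elapsed time */
    double ops  = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
    printf("%.3f GFLOPS\n", ops / secs / 1.0e9);
    return 0;
}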

What is interesting about this is that there are algorithms based on
"fast" matrix multiplication, where the product of two K by K matrices
can be formed with fewer than O(K^3) floating point operations.  If you
use one of these fast algorithms, you may do significantly fewer than
(2/3 N^3 + 2 N^2) floating point operations, but you get credit for
(2/3 N^3 + 2 N^2) operations anyway.  The numerical stability of
these fast methods is questionable, especially when the equations
and unknowns are poorly scaled.  Therefore, a benchmark report on a
solver should state if a fast matrix multiplication algorithm is used
so the reader can draw his own conclusions regarding the solver's
applicability to his problem.
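
(To put a rough number on it: a Strassen-style multiply uses on the order
of K^(log2 7), roughly K^2.81, multiplications, so at K = 10000 that is
asymptotically about a factor of six fewer than K^3.  This ignores
constant factors and the extra additions, both of which matter in
practice.)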

Another factor that must be considered is the type of pivoting used.
As pivoting can involve much data motion, omitting it is desirable
from the performance point of view.  However, the resulting code may
divide by zero or, even worse, produce severely inaccurate answers
with no warning.  Except for matrices that are known to have special
properties such as positive definiteness or strict diagonal dominance,
at least partial pivoting is required to ensure numerical stability.
I haven't read the Dongarra report to know whether it specifies that
pivoting is required, but if it doesn't specify, then a complete
report on a solver would describe the type of pivoting used.  I don't
recall that information in the Touchstone Delta report.
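
A tiny example makes the pivoting hazard concrete.  The program below
(my own toy illustration, unrelated to any benchmark code) eliminates a
2 by 2 system whose exact solution is approximately x1 = x2 = 1; without
row interchanges the tiny pivot destroys x1, while partial pivoting
recovers the right answer.

/* Why partial pivoting matters: solve
 *     1e-20*x1 + x2 = 1
 *          x1 + x2 = 2
 * by Gaussian elimination, first with the tiny pivot, then with the
 * rows swapped (partial pivoting).  Toy illustration only. */
#include <stdio.h>

int main(void)
{
    /* No pivoting: eliminate with pivot 1e-20, multiplier 1e20. */
    double m  = 1.0 / 1e-20;
    double x2 = (2.0 - m * 1.0) / (1.0 - m * 1.0);    /* rounds to 1 */
    double x1 = (1.0 - x2) / 1e-20;                   /* cancels to 0 */
    printf("no pivoting:      x1 = %g  x2 = %g\n", x1, x2);

    /* Partial pivoting: swap rows so the pivot is 1.0. */
    double mp  = 1e-20 / 1.0;
    double x2p = (1.0 - mp * 2.0) / (1.0 - mp * 1.0); /* about 1 */
    double x1p = 2.0 - x2p;                           /* about 1 */
    printf("partial pivoting: x1 = %g  x2 = %g\n", x1p, x2p);
    return 0;
}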

----------------------------------------------------------------------

Dave Dodson		                             dodson@convex.COM
Convex Computer Corporation      Richardson, Texas      (214) 497-4234

-- 
=========================== MODERATOR ==============================
Steve Stevenson                            {steve,fpst}@hubcap.clemson.edu
Department of Computer Science,            comp.parallel
Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell

prins@cs.unc.edu (Jan Prins) (06/08/91)

In article <1991Jun6.174129.25202@hubcap.clemson.edu>, dodson@convex.COM (Dave Dodson) writes:
> >   "FLOPS is defined as  (2/3 N^3 + 2 N^2) / elapsed-time",
> 
> What is interesting about this is that there are algorithms based on
> "fast" matrix multiplication, where the product of two K by K matrices
> can be formed with fewer than O(K^3) floating point operations.  If you
> use one of these fast algorithms, you may do significantly fewer than
> (2/3 N^3 + 2 N^2) floating point operations, but you get credit for
> (2/3 N^3 + 2 N^2) operations anyway.  [...]

In particular, using this month's asymptotically fastest solver, your
lowly workstation can beat the LINPACK performance of *any* machine
whose performance was obtained through use of the standard algorithm,
given enough time and space.

Of course it's not clear whether you are allowed to report a record
performance 10,000 years before it completes.

--\--  Jan Prins  (prins@cs.unc.edu)  
  /    Computer Science Dept. 
--\--  UNC Chapel Hill  

rvdg@cs.utexas.edu (Robert van de Geijn) (06/14/91)

This article is in response to several articles posted recently to
comp.parallel concerning the massively parallel LINPACK benchmark on
the Intel Touchstone DELTA machine.  Since I was the anonymous
"benchmarker", I believe a few words from me may shed some light on
the situation.

I was contacted by Intel around the second week of May and asked to
implement the LINPACK Benchmark on the DELTA.  The initial goal was to
beat the "world record" 5.2 GFLOPS (double precision), reported by
Thinking Machines for the CM-2, by the unveiling of the DELTA, May 31.
While a version of the LU factorization had been developed at the
University of Tennessee by Jack Dongarra and Susan Ostrouchov, this
version assumes that the matrix is mapped to nodes by wrapping panels
onto an embedded ring.  Moreover, no satisfactory triangular solve had
been implemented, nor had all pieces been put together.  Finally, a
quick "back-of-the-envelope" calculation indicated that this version
did not scale well to a large machine like the DELTA.

Starting from scratch, I implemented a block-torus wrapped version of
the right-looking LAPACK LU factorization variant, as well as
torus-wrapped variants of the Li-Coleman implementation of the
triangular solve (also known as cyclic algorithms).  The development
was done on ORNL's 128-node Intel iPSC/860 GAMMA machine, achieving
an impressive 1.92 GFLOPS.  Two weeks later, on May 23, the code
attained 7.844 GFLOPS on 512 nodes, for a 12000 X 12000 problem
(during the first attempt on the DELTA).
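
For readers unfamiliar with the jargon: "right-looking" simply means that
after each panel (block of columns) is factored, the entire trailing
submatrix to its right is updated immediately.  The following is a
bare-bones sequential, unblocked sketch of that variant, written only to
illustrate the access pattern; the actual Delta code is blocked and
distributed over the node torus as described above.

/* Unblocked, sequential right-looking LU factorization with partial
 * pivoting, in place on an n x n row-major matrix.  At step k the
 * pivot column is scaled and the whole trailing submatrix receives a
 * rank-1 update -- the "right-looking" pattern.  Illustration only. */
#include <stdio.h>
#include <math.h>

static void lu_right_looking(int n, double a[][4], int piv[])
{
    for (int k = 0; k < n; k++) {
        int p = k;                              /* partial pivoting:   */
        for (int i = k + 1; i < n; i++)         /* largest entry in    */
            if (fabs(a[i][k]) > fabs(a[p][k]))  /* column k            */
                p = i;
        piv[k] = p;
        for (int j = 0; j < n; j++) {           /* swap rows k and p   */
            double t = a[k][j]; a[k][j] = a[p][j]; a[p][j] = t;
        }
        for (int i = k + 1; i < n; i++) {
            a[i][k] /= a[k][k];                 /* column of L         */
            for (int j = k + 1; j < n; j++)     /* rank-1 update of    */
                a[i][j] -= a[i][k] * a[k][j];   /* trailing submatrix  */
        }
    }
}

int main(void)
{
    double a[4][4] = { { 2, 1, 1, 0 },
                       { 4, 3, 3, 1 },
                       { 8, 7, 9, 5 },
                       { 6, 7, 9, 8 } };
    int piv[4];
    lu_right_looking(4, a, piv);
    for (int i = 0; i < 4; i++)
        printf("%8.4f %8.4f %8.4f %8.4f\n",
               a[i][0], a[i][1], a[i][2], a[i][3]);
    return 0;
}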

After adjusting storage buffer sizes, the same code produced the
numbers reported in article 2643 of comp.parallel (8.561 GFLOPS).
Next, the author went on a much needed vacation at Disneyland and Sea
World.  On June 5, Thinking Machines announced the CM-200, and a new
record, 9.03 GFLOPS for a 28672 X 28672 problem.  On June 6, my code
attained 10.2 GFLOPS for a 20000 X 20000 problem on the DELTA.

It should be noted that the code being used was put together quickly,
in a matter of two weeks.  There are many parameters that can
be optimized, and the memory has not yet been exhausted.  A 20K X 20K
problem is completed in only 522.7 seconds.  Moreover, there are still
many standard techniques that can be used to reduce the communication
overhead.  10.2 GFLOPS is only a start.

I am working on a short report on this topic and I am in the process
of completing the code.  Intel has promised to release the code to the
public domain, once it has been completed.

Robert van de Geijn
Assistant Professor
The University of Texas at Austin





-- 
=========================== MODERATOR ==============================
Steve Stevenson                            {steve,fpst}@hubcap.clemson.edu
Department of Computer Science,            comp.parallel
Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell