baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) (06/01/91)
The Intel Touchstone Delta machine was unveiled to the public today at the
California Institute of Technology. The Touchstone Delta machine consists of
528 i860-based computational nodes connected by a high-speed, two-dimensional
mesh. This computer is yet another milestone in the cooperatively funded
DARPA/Intel Touchstone project.
The LINPACK benchmark has often been used as one measure of comparison
between machines. Most recently, a new section of Jack Dongarra's LINPACK
report, entitled 'Massively Parallel Computing', defines the same test,
solving a dense system of linear equations, but allows the problem size to
scale with the size of the machine. With the unveiling of the Touchstone
Delta machine, Intel can now publish the following double precision
performance numbers for massively parallel LINPACK:
  np       n     time   GFLOPS   MFLOPS/node
 ---   -----   ------   ------   -----------
 192    2000      5.5     .971         5
 192    4000     20.8    2.053        11
 192    6000     51.3    2.808        15
 192    8000    102.5    3.331        17
 192   10000    178.2    3.742        19
 192   12000    284.8    4.046        21
 512    2000      4.5    1.187         2
 512    4000     14.2    3.007         6
 512    6000     31.5    4.574         9
 512    8000     58.3    5.867        11
 512   10000     96.8    6.889        13
 512   12000    146.9    7.844        15
 512   14000    213.7    8.561        17
where   np   -> number of processors
        n    -> order of the matrix (full 64-bit precision)
        time -> elapsed time in seconds
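
For reference, the GFLOPS column follows directly from the nominal LINPACK
operation count for a dense system of order n, (2/3)n^3 + 2n^2, divided by
the elapsed time (this is the formula discussed further down in the thread).
The short C check below was written for this summary, not taken from any
benchmark code; it reproduces two of the 512-node rows, with last-digit
differences due only to rounding in the quoted times.

#include <stdio.h>

/* Nominal LINPACK operation count for solving a dense system of order n
 * by Gaussian elimination: 2/3 n^3 + 2 n^2 flops. */
static double linpack_ops(double n)
{
    return (2.0 / 3.0) * n * n * n + 2.0 * n * n;
}

int main(void)
{
    /* Two of the 512-node entries from the table above. */
    struct { int np; double n, secs; } run[] = {
        { 512, 12000.0, 146.9 },
        { 512, 14000.0, 213.7 },
    };

    for (int i = 0; i < 2; i++) {
        double gflops = linpack_ops(run[i].n) / run[i].secs / 1e9;
        printf("np=%d  n=%.0f  t=%.1fs  ->  %.3f GFLOPS  (%.0f MFLOPS/node)\n",
               run[i].np, run[i].n, run[i].secs,
               gflops, gflops * 1e3 / run[i].np);
    }
    return 0;
}

The resulting MFLOPS/node figures can in turn be compared with the 25 MFLOPS
published for conventional 1000 x 1000 LINPACK on a single i860 node, quoted
below.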
The previous leader in massively parallel LINPACK was Thinking Machines,
at 5.2 GFLOPS on a matrix of order 26624 (64K processors).
The current production iPSC/860 has a published performance of 1.92 GFLOPS
on a matrix of order 8600, using 128 i860 processors, each with 8 Mbytes of
local memory. Conventional 1000 x 1000 LINPACK on a single i860 node is
published at 25 MFLOPS.
Jerry Baugh
Intel Supercomputer Systems Division

stevo@elroy.jpl.nasa.gov (Steve Groom) (06/04/91)
In article <1991Jun3.130104.15667@hubcap.clemson.edu> baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) writes:
>
>The LINPACK benchmark has often been used as one measure of comparison
>between machines. Most recently, a new section of Jack Dongarra's LINPACK
>report, entitled 'Massively Parallel Computing', defines the same test,
>solving a dense system of linear equations, but allows the problem size to
>scale with the size of the machine. With the unveiling of the Touchstone
>Delta machine, Intel can now publish the following double precision
>performance numbers for massively parallel LINPACK:
 [numbers deleted]

At first, I started to read this thinking "what the heck does LINPACK have
to do with the performance of a parallel computer other than measuring the
power of individual nodes?" Then I started reading more closely, and it
appears that there's more to it than that.

Can someone explain how "massively parallel LINPACK" is different from
regular LINPACK? What considerations for communication are made in this
benchmark? Since LINPACK is normally used as a measure of number crunching,
I'm curious how this benchmark translates to parallel computers.

As we all know (or we all should know), the performance of a parallel
computer is usually NOT the same as the performance of an individual node
multiplied by the number of nodes in the computer (although we'd just love
that to always be the case). The obvious misapplication of this kind of
benchmark would be to multiply a single node's LINPACK performance by the
number of nodes in the machine. I notice that the numbers posted in the
above-referenced article do not scale linearly with the number of nodes
used, so there is some efficiency loss from the single node case. I'm
itching to find out what the source of this loss is.

This is of particular interest as I am currently porting some existing
parallel code to the Delta, and I'd like to be able to handle the inevitable
queries about "well, they say the Delta does such-and-such LINPACK
GFLOPS...". Any explanation or references would be welcome.
-- 
Steve Groom, Jet Propulsion Laboratory, Pasadena, CA
stevo@elroy.jpl.nasa.gov  {ames,usc}!elroy!stevo
"... and the babe, she needs a chaaaa--nging..." (apologies to Bob Dylan)
baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) (06/05/91)
In article <1991Jun3.233741.8570@elroy.jpl.nasa.gov> stevo@elroy.jpl.nasa.gov (Steve Groom) writes:
>[original post deleted ... Jerry Baugh]
>
>At first, I started to read this thinking "what the heck does LINPACK
>have to do with the performance of a parallel computer other than
>measuring the power of individual nodes?" Then I started reading more
>closely, and it appears that there's more to it than that.
>
>Can someone explain how "massively parallel LINPACK" is different from
>regular LINPACK?

Since I started this, I guess I ought to take a stab at explaining it as
well. The standard LINPACK benchmark is constrained to 100x100 and
1000x1000 problems. These problem sizes are relatively small when compared
with the capabilities of today's supercomputers, and they are small compared
to the size of the problems people are solving today (a production code at
one of our customer sites does a 20k x 20k matrix multiply).

Recently, Jack Dongarra extended his standard LINPACK benchmark suite to
include a new category, "Massively Parallel Computing". This new category
allows the problem size to increase up to and including the limit of
physical memory. Since the problem size can vary, it is important to note
the size (presented as the order of the matrix) as well as the FLOP numbers
presented. One of the reasons we are proud of our numbers here at Intel is
that we got a high GFLOP number with a relatively small problem size (we can
still increase the problem size, as we have not yet run out of physical
memory).

>What considerations for communication are made in this benchmark?

Both data decomposition and data communication are critical. The method
chosen to parallelize the program must take both into account; selecting one
method of data decomposition may dictate the data communication strategy.
The strategy used in this case involved a 2D data decomposition that does
both column and row mapping. This takes advantage of the Touchstone Delta's
mesh architecture by allowing messages to flow to all nearest neighbors, and
it leads to higher performance than a 1D decomposition scheme.

>Since LINPACK is normally used as a measure of number crunching,
>I'm curious how this benchmark translates to parallel computers.

This particular version was designed for parallel computers.

>As we all know (or we all should know), the performance
>of a parallel computer is usually NOT the same as the performance of an
>individual node multiplied by the number of nodes in the computer
>(although we'd just love that to always be the case).
>The obvious misapplication of this kind of benchmark would
>be to multiply a single node's LINPACK performance by the number of nodes in
>the machine. I notice that the numbers posted in the above-referenced
>article do not scale linearly with the number of nodes used, so there
>is some efficiency loss from the single node case. I'm itching to
>find out what the source of this loss is.

Since the data values are distributed across multiple nodes, data must be
passed from the node where it resides to the node where it is needed for the
calculation. This communication cost is a major part of the difference you
see between the single node numbers and the 512 node numbers.

>This is of particular interest
>as I am currently porting some existing parallel code to the Delta, and I'd
>like to be able to handle the inevitable queries about "well, they say
>the Delta does such-and-such LINPACK GFLOPS...".

Personally, I have always said that performance on real codes is important.
The "Massively Parallel LINPACK" benchmark provides one measure that is
meaningful for some classes of codes. For other classes of programs
(databases, for example), it is meaningless.

>Any explanation or references would be welcome.

Hope what I've written helps. For more information on LINPACK and
"Massively Parallel LINPACK", I suggest you get Jack Dongarra's paper
'Performance of Various Computers Using Standard Linear Equations Software'.
For an explanation of a parallel 1D decomposition, try 'LAPACK Block
Factorization Algorithms on the Intel iPSC/860', LAPACK Working Note 24,
University of Tennessee, Oct. 1990, by Dongarra and Ostrouchov.

>Steve Groom, Jet Propulsion Laboratory, Pasadena, CA
>stevo@elroy.jpl.nasa.gov {ames,usc}!elroy!stevo

Jerry Baugh
Intel SSD - baugh@SSD.intel.com
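
To make the "2D data decomposition that does both column and row mapping" a
little more concrete, here is a minimal sketch of a torus-wrapped (cyclic)
2D mapping. It is not the code used on the Delta, and the 16 x 32 logical
mesh is only an assumed shape for a 512-node run; a block-wrapped variant
would deal out blocks of rows and columns instead of single ones. The point
is simply that every node owns a scattered piece of the matrix, so pivot and
update traffic can travel along mesh rows and columns to nearest neighbors.

#include <stdio.h>

/* Torus-wrapped ("cyclic") 2D decomposition: matrix rows are dealt out
 * cyclically over the PR mesh rows and matrix columns over the PC mesh
 * columns.  PR = 16, PC = 32 is only an assumed logical mesh shape for a
 * 512-node run, not necessarily the configuration used on the Delta. */

#define PR 16
#define PC 32

/* Mesh coordinates of the node owning matrix element (i, j). */
static void owner(int i, int j, int *pr, int *pc)
{
    *pr = i % PR;
    *pc = j % PC;
}

int main(void)
{
    /* A few elements of a 12000 x 12000 matrix and the nodes they land on. */
    int probe[][2] = { {0, 0}, {1, 0}, {0, 1}, {17, 33}, {11999, 11999} };
    int pr, pc;

    for (int k = 0; k < 5; k++) {
        owner(probe[k][0], probe[k][1], &pr, &pc);
        printf("A(%5d,%5d) -> mesh node (%2d,%2d)\n",
               probe[k][0], probe[k][1], pr, pc);
    }
    return 0;
}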
elias@wonton.TC.Cornell.EDU (Doug Elias) (06/06/91)
i'd appreciate some rationale on

   "FLOPS is defined as (2/3 N^3 + 2 N^2) / elapsed-time",

where, i assume, N == number of processors used in the computation.

Thanks,
doug
#  ____    |Internet: elias@theory.tn.cornell.edu
#dr _|_)oug|USmail:   Mgr., Adv. Comp. Res. Inst./Cornell Theory Center
#  (_|     |          704A TheoryCtrBldg/C.U./Ithaca/N.Y./14853-3801
#  (_|__   |MaBelle:  607-254-8826   Fax: 607-254-8888
dodson@convex.COM (Dave Dodson) (06/06/91)
In article <ELIAS.91Jun6090922@wonton.TC.Cornell.EDU> elias@wonton.TC.Cornell.EDU (Doug Elias) writes:
>i'd appreciate some rationale on
>   "FLOPS is defined as (2/3 N^3 + 2 N^2) / elapsed-time",
>where, i assume, N == number of processors used in the computation.

No. N is the order of the system of equations. The expression in parentheses
approximately represents the operation count for Gauss elimination using
ordinary vector and matrix operations.

What is interesting about this is that there are algorithms based on "fast"
matrix multiplication, where the product of two K by K matrices can be
formed with fewer than O(K^3) floating point operations. If you use one of
these fast algorithms, you may do significantly fewer than (2/3 N^3 + 2 N^2)
floating point operations, but you get credit for (2/3 N^3 + 2 N^2)
operations anyway. The numerical stability of these fast methods is
questionable, especially when the equations and unknowns are poorly scaled.
Therefore, a benchmark report on a solver should state whether a fast matrix
multiplication algorithm is used, so the reader can draw his own conclusions
regarding the solver's applicability to his problem.

Another factor that must be considered is the type of pivoting used. As
pivoting can involve much data motion, omitting it is desirable from the
performance point of view. However, the resulting code may divide by zero
or, even worse, produce severely inaccurate answers with no warning. Except
for matrices that are known to have special properties such as positive
definiteness or strict diagonal dominance, at least partial pivoting is
required to ensure numerical stability. I haven't read the Dongarra report
to know whether it specifies that pivoting is required, but if it doesn't
specify, then a complete report on a solver would describe the type of
pivoting used. I don't recall that information in the Touchstone Delta
report.
----------------------------------------------------------------------
Dave Dodson                                      dodson@convex.COM
Convex Computer Corporation      Richardson, Texas      (214) 497-4234
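
To make the pivoting cost concrete, here is a plain serial Gauss elimination
kernel with partial pivoting. It is a textbook sketch, not taken from any of
the benchmark codes under discussion; the explicit row swap is the extra
data motion (on a distributed-memory machine, extra communication) that an
unpivoted solver avoids, and the innermost update loop is where the
(2/3)N^3 operations counted by the benchmark formula are spent.

#include <math.h>
#include <stdio.h>

/* Serial Gauss elimination with partial pivoting on an n x n matrix stored
 * row-major in a[n*n].  A textbook kernel, shown only to illustrate the
 * pivot search and row swap; piv[k] records the row chosen at step k.
 * Returns 0 on success, -1 on an exactly zero pivot. */
static int lu_partial_pivot(double *a, int *piv, int n)
{
    for (int k = 0; k < n; k++) {
        /* Pivot search: largest |a(i,k)| in column k, rows k..n-1. */
        int p = k;
        for (int i = k + 1; i < n; i++)
            if (fabs(a[i*n + k]) > fabs(a[p*n + k]))
                p = i;
        piv[k] = p;
        if (a[p*n + k] == 0.0)
            return -1;

        /* Row swap: the extra data motion pivoting costs.  In a distributed
         * code the two rows generally live on different nodes, so this
         * becomes communication. */
        if (p != k)
            for (int j = 0; j < n; j++) {
                double t = a[k*n + j];
                a[k*n + j] = a[p*n + j];
                a[p*n + j] = t;
            }

        /* Eliminate below the pivot: roughly 2/3 n^3 flops live here. */
        for (int i = k + 1; i < n; i++) {
            double m = a[i*n + k] / a[k*n + k];
            a[i*n + k] = m;
            for (int j = k + 1; j < n; j++)
                a[i*n + j] -= m * a[k*n + j];
        }
    }
    return 0;
}

int main(void)
{
    /* A zero leading element forces at least one row interchange. */
    double a[9] = { 0.0, 2.0, 1.0,
                    1.0, 1.0, 1.0,
                    4.0, 1.0, 0.0 };
    int piv[3];

    if (lu_partial_pivot(a, piv, 3) == 0)
        printf("pivot rows chosen: %d %d %d\n", piv[0], piv[1], piv[2]);
    return 0;
}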
prins@cs.unc.edu (Jan Prins) (06/08/91)
In article <1991Jun6.174129.25202@hubcap.clemson.edu>, dodson@convex.COM (Dave Dodson) writes:
> >   "FLOPS is defined as (2/3 N^3 + 2 N^2) / elapsed-time",
>
> What is interesting about this is that there are algorithms based on
> "fast" matrix multiplication, where the product of two K by K matrices
> can be formed with fewer than O(K^3) floating point operations. If you
> use one of these fast algorithms, you may do significantly fewer than
> (2/3 N^3 + 2 N^2) floating point operations, but you get credit for
> (2/3 N^3 + 2 N^2) operations anyway. [...]

In particular, using this month's asymptotically fastest solver, your lowly
workstation can beat the LINPACK performance of *any* machine whose
performance was obtained through use of the standard algorithm, provided
enough time and space. Of course, it's not clear whether you are allowed to
report a record performance 10,000 years before it completes.

--\--  Jan Prins  (prins@cs.unc.edu)
  /    Computer Science Dept.
--\--  UNC Chapel Hill
rvdg@cs.utexas.edu (Robert van de Geijn) (06/14/91)
This article is in response to several articles posted recently to
comp.parallel concerning the massively parallel LINPACK benchmark on
the Intel Touchstone DELTA machine. Since I was the anonymous
"benchmarker", I believe a few words from me may shed some light on
the situation.
I was contacted by Intel around the second week of May and asked to
implement the LINPACK Benchmark on the DELTA. The initial goal was to
beat the "world record" 5.2 GFLOPS (double precision), reported by
Thinking Machines for the CM-2, by the unveiling of the DELTA, May 31.
While a version of the LU factorization had been developed at the
University of Tennessee by Jack Dongarra and Susan Ostrouchov, this
version assumes that the matrix is mapped to nodes by wrapping panels
onto an embedded ring. Moreover, no satisfactory triangular solve had
been implemented, nor had all pieces been put together. Finally, a
quick "back-of-the-envelope" calculation indicated that this version
did not scale well to a large machine like the DELTA.
Starting from scratch, I implemented a block-torus wrapped version of
the right-looking LAPACK LU factorization variant, as well as
torus-wrapped variants of the Li-Coleman implementation of the
triangular solve (also known as cyclic algorithms). The development
was done on the ORNL 128-node Intel iPSC/860 GAMMA machine, achieving
an impressive 1.92 GFLOPS. Two weeks later, on May 23, the machine
attained 7.844 GFLOPS on 512 nodes, for a 12000 X 12000 problem
(during the first attempt on the DELTA).
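
For readers who have not seen it, the sketch below shows the serial loop
structure of a right-looking blocked LU factorization, with pivoting omitted
for brevity. It is not Robert's Delta code, just the shape of the
computation that the block-torus wrapping distributes: in the parallel
version each step becomes local work plus broadcasts of the current column
panel and row block along the mesh.

#include <stdio.h>

/* Minimal serial sketch of a right-looking blocked LU factorization (the
 * LAPACK variant Robert refers to), with pivoting omitted to keep it short.
 * This is not his Delta code, just the loop structure that the torus-wrapped
 * distribution parallelizes.  A is n x n, row-major, and is overwritten by
 * L (unit lower triangular) and U.  nb is the block size. */
static void blocked_lu(double *a, int n, int nb)
{
    for (int k = 0; k < n; k += nb) {
        int b = (n - k < nb) ? (n - k) : nb;
        int e = k + b;

        /* 1. Factor the column panel A(k:n-1, k:e-1) by unblocked
         *    elimination (restricted to the panel's columns). */
        for (int j = k; j < e; j++)
            for (int i = j + 1; i < n; i++) {
                a[i*n + j] /= a[j*n + j];
                for (int c = j + 1; c < e; c++)
                    a[i*n + c] -= a[i*n + j] * a[j*n + c];
            }

        /* 2. Triangular solve for the row block U(k:e-1, e:n-1):
         *    forward substitution with the unit lower triangle of the panel. */
        for (int j = k + 1; j < e; j++)
            for (int i = k; i < j; i++)
                for (int c = e; c < n; c++)
                    a[j*n + c] -= a[j*n + i] * a[i*n + c];

        /* 3. Rank-b update of the trailing submatrix:
         *    A(e:, e:) -= L(e:, k:e-1) * U(k:e-1, e:).
         *    This matrix-multiply-like step holds nearly all of the flops,
         *    which is why the right-looking variant can run close to
         *    matrix-multiply speed on each node. */
        for (int i = e; i < n; i++)
            for (int j = e; j < n; j++)
                for (int c = k; c < e; c++)
                    a[i*n + j] -= a[i*n + c] * a[c*n + j];
    }
}

int main(void)
{
    /* Small diagonally dominant test matrix so no pivoting is needed. */
    double a[16] = { 4, 1, 0, 0,
                     1, 4, 1, 0,
                     0, 1, 4, 1,
                     0, 0, 1, 4 };

    blocked_lu(a, 4, 2);

    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++)
            printf("%9.5f ", a[i*4 + j]);
        printf("\n");
    }
    return 0;
}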
After adjusting storage buffer sizes, the same code produced the
numbers reported in article 2643 of comp.parallel (8.561 GFLOPS).
Next, the author went on a much-needed vacation at Disneyland and Sea
World. On June 5, Thinking Machines announced the CM-200 and a new
record, 9.03 GFLOPS for a 28672 X 28672 problem. On June 6, my code
attained 10.2 GFLOPS for a 20000 X 20000 problem on the DELTA.
It should be noted that the code being used was put together quickly, in
a matter of two weeks. There are many parameters that can
be optimized, and the memory has not yet been exhausted. A 20K X 20K
problem is completed in only 522.7 seconds. Moreover, there are still
many standard techniques that can be used to reduce the communication
overhead. 10.2 GFLOPS is only a start.
I am working on a short report on this topic and I am in the process
of completing the code. Intel has promised to release the code to the
public domain, once it has been completed.
Robert van de Geijn
Assistant Professor
The University of Texas at Austin
--
=========================== MODERATOR ==============================
Steve Stevenson {steve,fpst}@hubcap.clemson.edu
Department of Computer Science, comp.parallel
Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell