baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) (06/01/91)
The Intel Touchstone Delta machine was unveiled for the public today at the
California Institute of Technology. The Touchstone Delta machine consists of
528 i860-based computational nodes, connected by a high-speed, two-dimensional
mesh. This computer is yet another milestone in the cooperatively funded
DARPA / Intel Touchstone project.

The LINPACK benchmark has often been used as one measure of comparison between
machines, and most recently a new section of the report, entitled 'Massively
Parallel Computing', defines the same test (solve a dense set of linear
equations) but allows the problem size to scale with the size of the machine.
With the unveiling of the Touchstone Delta machine, Intel can now publish the
following double precision performance numbers for massively parallel LINPACK:

 np      n     time   GFLOPS   MFLOPS/node
---   -----   -----   ------   -----------
192    2000     5.5     .971        5
192    4000    20.8    2.053       11
192    6000    51.3    2.808       15
192    8000   102.5    3.331       17
192   10000   178.2    3.742       19
192   12000   284.8    4.046       21
512    2000     4.5    1.187        2
512    4000    14.2    3.007        6
512    6000    31.5    4.574        9
512    8000    58.3    5.867       11
512   10000    96.8    6.889       13
512   12000   146.9    7.844       15
512   14000   213.7    8.561       17

where
  np   -> number of processors
  n    -> order of the matrix (full 64-bit precision)
  time -> time in secs

The previous leader in massively parallel LINPACK was Thinking Machines at
5.2 GFLOPS on a matrix of order 26624 (64K processors). The current
production iPSC/860 has a published performance of 1.92 GFLOPS on a matrix of
order 8600, using 128 i860 processors, each with 8 Mbytes of local memory.
Conventional 1000x1000 LINPACK on one i860 node is published at 25 MFLOPS.

Jerry Baugh
Intel Supercomputer Systems Division
stevo@elroy.jpl.nasa.gov (Steve Groom) (06/04/91)
In article <1991Jun3.130104.15667@hubcap.clemson.edu> baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) writes:
>
>The LINPACK benchmark has often been used as one measure of comparison between
>machines, and most recently, a new section of the report, entitled 'Massively
>Parallel Computing' defines the same test, solve a dense set of linear
>equations, but allows for the problem sizes to scale with the size of the
>machine. With the unveiling of the Touchstone Delta machine, Intel can now
>publish the following double precision performance numbers for massively
>parallel LINPACK:
[numbers deleted]

At first, I started to read this thinking "what the heck does LINPACK have to
do with the performance of a parallel computer other than measuring the power
of individual nodes?" Then I started reading more closely, and it appears
that there's more to it than that.

Can someone explain how "massively parallel LINPACK" is different from
regular LINPACK? What considerations for communication are made in this
benchmark? Since LINPACK is normally used as a measure of number crunching,
I'm curious how this benchmark translates to parallel computers.

As we all know (or we all should know), the performance of a parallel
computer is usually NOT the same as the performance of an individual node
multiplied by the number of nodes in the computer (although we'd just love
that to always be the case). The obvious misapplication of this kind of
benchmark would be to multiply a single node's LINPACK performance by the
number of nodes in the machine. I notice that the numbers posted in the
above-referenced article do not scale linearly with the number of nodes used,
so there is some efficiency loss from the single node case. I'm itching to
find out what the source of this loss is.
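The efficiency loss Steve notices can be put in rough numbers from the posted
table itself. Taking the 25 MFLOPS single-node figure quoted in the original
post as a (generous) baseline, a back-of-the-envelope sketch:

```python
# Rough parallel-efficiency estimate for the posted Delta numbers, using
# the 25 MFLOPS single-node LINPACK figure from the original post as the
# baseline. Purely illustrative; the 25 MFLOPS number is for a 1000x1000
# problem, so this is an apples-to-oranges comparison at best.

def efficiency(gflops, nodes, baseline_mflops=25.0):
    per_node = gflops * 1000.0 / nodes   # MFLOPS actually delivered per node
    return per_node / baseline_mflops

# Largest problem from each configuration in the table:
e192 = efficiency(4.046, 192)   # n = 12000 on 192 nodes -> ~84%
e512 = efficiency(8.561, 512)   # n = 14000 on 512 nodes -> ~67%

# Efficiency drops as the node count grows, which is exactly the
# loss being asked about (communication cost, as explained below).
assert 0.80 < e192 < 0.90
assert 0.60 < e512 < 0.70
```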
This is of particular interest as I am currently porting some existing
parallel code to the Delta, and I'd like to be able to handle the inevitable
queries about "well, they say the Delta does such-and-such LINPACK
GFLOPS...". Any explanation or references would be welcome.

--
Steve Groom, Jet Propulsion Laboratory, Pasadena, CA
stevo@elroy.jpl.nasa.gov  {ames,usc}!elroy!stevo
"... and the babe, she needs a chaaaa--nging..." (apologies to Bob Dylan)
--
=========================== MODERATOR ==============================
Steve Stevenson                        {steve,fpst}@hubcap.clemson.edu
Department of Computer Science, comp.parallel
Clemson University, Clemson, SC 29634-1906   (803)656-5880.mabell
baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) (06/05/91)
In article <1991Jun3.233741.8570@elroy.jpl.nasa.gov> stevo@elroy.jpl.nasa.gov (Steve Groom) writes:
>[original post deleted ... Jerry Baugh]
>
>At first, I started to read this thinking "what the heck does LINPACK
>have to do with the performance of a parallel computer other than
>measuring the power of individual nodes?" Then I started reading more
>closely, and it appears that there's more to it than that.
>
>Can someone explain how "massively parallel LINPACK" is different from
>regular LINPACK?

Since I started this, I guess I ought to take a stab at explaining this as
well. The standard LINPACK benchmark is constrained to 100x100 and 1000x1000
problems. These problem sizes are relatively small when compared with the
capabilities of today's supercomputers, and they are small compared to the
size of problems people are solving today (a production code at one of our
customer sites does a 20k x 20k matrix multiply).

Recently, Jack Dongarra extended his standard LINPACK benchmark suite to
include a new category, "Massively Parallel Computing". This new category
allows the problem size to increase up to and including the limit of physical
memory. Since the problem size can vary, it is important to note the size
(presented as order of the matrix) as well as the FLOP numbers presented. One
of the reasons we are proud of our numbers here at Intel is that we got a
high GFLOP number with a relatively small problem size (and we can still
increase the problem size, as we have not yet run out of physical memory).

>What considerations for communication are made in this benchmark?

Both data decomposition and data communication are critical. The method
chosen for parallelizing a program must take both into account; selecting one
method of data decomposition may dictate the data communication strategy. The
strategy used in this case involved a 2D data decomposition that does both
column and row mapping.
This takes advantage of the Touchstone Delta's mesh architecture by allowing
messages to flow to all nearest neighbors, and this leads to higher
performance than a 1D decomposition scheme.

>Since LINPACK is normally used as a measure of number crunching,
>I'm curious how this benchmark translates to parallel computers.

This particular version was designed for parallel computers.

>As we all know (or we all should know), the performance
>of a parallel computer is usually NOT the same as the performance of an
>individual node multiplied by the number of nodes in the computer
>(although we'd just love that to always be the case).
>The obvious misapplication of this kind of benchmark would
>be to multiply a single node's LINPACK performance by the number of nodes in
>the machine. I notice that the numbers posted in the above-referenced
>article do not scale linearly with the number of nodes used, so there
>is some efficiency loss from the single node case. I'm itching to
>find out what the source of this loss is.

Since the data values are distributed across multiple nodes, data must be
passed from one node (where it resides) to another node (where it is needed
for the calculation). This communication cost is a major part of the
difference you see between the single node numbers and the 512 node numbers.

>This is of particular interest
>as I am currently porting some existing parallel code to the Delta, and I'd
>like to be able to handle the inevitable queries about "well, they say
>the Delta does such-and-such LINPACK GFLOPS...".

Personally, I have always said that performance on real codes is important.
The "Massively Parallel LINPACK" benchmark provides one measure that is
meaningful for some classes of codes. For other classes of programs
(databases, for example), it is meaningless.

>Any explanation or references would be welcome.

Hope what I've written helps.
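The 2D row-and-column mapping Jerry describes can be sketched with the usual
torus (cyclic) wrapping onto a process grid. This is an illustration of the
general idea only, not Intel's actual code; the P x Q grid shape chosen below
is a hypothetical example:

```python
# Sketch of a 2D "torus-wrapped" (cyclic) decomposition: global matrix
# entry (i, j) is owned by process (i mod P, j mod Q) on a P x Q process
# grid. Illustrative only; not the Delta implementation.

def owner(i, j, P, Q):
    """Process-grid coordinates owning global matrix entry (i, j)."""
    return (i % P, j % Q)

def local_index(i, j, P, Q):
    """Where entry (i, j) lives inside its owner's local array."""
    return (i // P, j // Q)

# With a 1D column wrap (P = 1), eliminating column k keeps only one
# process column busy broadcasting pivot data; the 2D wrap spreads both
# row and column traffic over the whole grid, so the mesh's
# nearest-neighbor links carry it in parallel.
P, Q = 16, 32                          # e.g. 512 nodes as a 16 x 32 grid
assert owner(5, 7, P, Q) == (5, 7)
assert owner(21, 39, P, Q) == (5, 7)   # wraps around the grid
assert local_index(21, 39, P, Q) == (1, 1)
```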
For more information on LINPACK and the "Massively Parallel LINPACK", I
suggest you get Jack Dongarra's paper 'Performance of Various Computers Using
Standard Linear Equations Software'. For an explanation of a parallel 1D
decomposition, try 'LAPACK Block Factorization Algorithms on the Intel
iPSC/860', LAPACK Working Note 24, University of Tennessee, Oct. 1990, by
Dongarra and Ostrouchov.

>Steve Groom, Jet Propulsion Laboratory, Pasadena, CA
>stevo@elroy.jpl.nasa.gov {ames,usc}!elroy!stevo

Jerry Baugh
Intel SSD - baugh@SSD.intel.com
elias@wonton.TC.Cornell.EDU (Doug Elias) (06/06/91)
I'd appreciate some rationale on

  "FLOPS is defined as (2/3 N^3 + 2 N^2) / elapsed-time",

where, I assume, N == number of processors used in the computation.

Thanks,
doug
--
Doug Elias
Internet: elias@theory.tn.cornell.edu
USmail:   Mgr., Adv. Comp. Res. Inst./Cornell Theory Center
          704A Theory Ctr Bldg/C.U./Ithaca/N.Y./14853-3801
MaBelle:  607-254-8826   Fax: 607-254-8888
dodson@convex.COM (Dave Dodson) (06/06/91)
In article <ELIAS.91Jun6090922@wonton.TC.Cornell.EDU> elias@wonton.TC.Cornell.EDU (Doug Elias) writes:
>i'd appreciate some rationale on
>  "FLOPS is defined as (2/3 N^3 + 2 N^2) / elapsed-time",
>where, i assume, N == number of processors used in the computation.

No. N is the order of the system of equations. The expression in parentheses
approximately represents the operation count for Gaussian elimination using
ordinary vector and matrix operations.

What is interesting about this is that there are algorithms based on "fast"
matrix multiplication, where the product of two K by K matrices can be formed
with fewer than O(K^3) floating point operations. If you use one of these
fast algorithms, you may do significantly fewer than (2/3 N^3 + 2 N^2)
floating point operations, but you get credit for (2/3 N^3 + 2 N^2)
operations anyway. The numerical stability of these fast methods is
questionable, especially when the equations and unknowns are poorly scaled.
Therefore, a benchmark report on a solver should state if a fast matrix
multiplication algorithm is used, so the reader can draw his own conclusions
regarding the solver's applicability to his problem.

Another factor that must be considered is the type of pivoting used. As
pivoting can involve much data motion, omitting it is desirable from the
performance point of view. However, the resulting code may divide by zero or,
even worse, produce severely inaccurate answers with no warning. Except for
matrices that are known to have special properties, such as positive
definiteness or strict diagonal dominance, at least partial pivoting is
required to insure numerical stability. I haven't read the Dongarra report
closely enough to know whether it specifies that pivoting is required, but if
it doesn't, then a complete report on a solver would describe the type of
pivoting used. I don't recall that information in the Touchstone Delta
report.
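With N read as the matrix order, the formula reproduces the GFLOPS figures
Intel posted directly from the n and time columns of the Delta table. A quick
check (the figures below are taken from the table in the original post):

```python
# Verify the posted GFLOPS figures from the nominal Gaussian-elimination
# operation count: FLOPS = (2/3 N^3 + 2 N^2) / elapsed-time, where N is
# the order of the matrix (not the processor count).

def linpack_gflops(n, seconds):
    """Nominal GFLOPS for solving an order-n dense linear system."""
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / seconds / 1e9

# (nodes, n, time in secs, reported GFLOPS) rows from the Delta table.
rows = [
    (192, 12000, 284.8, 4.046),
    (512, 12000, 146.9, 7.844),
    (512, 14000, 213.7, 8.561),
]

for nodes, n, t, reported in rows:
    computed = linpack_gflops(n, t)
    # Agreement to the rounding of the published table.
    assert abs(computed - reported) < 0.005, (n, computed, reported)
```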
----------------------------------------------------------------------
Dave Dodson                                     dodson@convex.COM
Convex Computer Corporation                     Richardson, Texas  (214) 497-4234
prins@cs.unc.edu (Jan Prins) (06/08/91)
In article <1991Jun6.174129.25202@hubcap.clemson.edu>, dodson@convex.COM (Dave Dodson) writes:
> >  "FLOPS is defined as (2/3 N^3 + 2 N^2) / elapsed-time",
>
> What is interesting about this is that there are algorithms based on
> "fast" matrix multiplication, where the product of two K by K matrices
> can be formed with fewer than O(K^3) floating point operations. If you
> use one of these fast algorithms, you may do significantly fewer than
> (2/3 N^3 + 2 N^2) floating point operations, but you get credit for
> (2/3 N^3 + 2 N^2) operations anyway. [...]

In particular, using this month's asymptotically fastest solver, your lowly
workstation can beat the LINPACK performance of *any* machine whose
performance was obtained through use of the standard algorithm, provided
enough time and space. Of course, it's not clear whether you are allowed to
report a record performance 10,000 years before it completes.

--\--  Jan Prins  (prins@cs.unc.edu)
  /    Computer Science Dept.
--\--  UNC Chapel Hill
rvdg@cs.utexas.edu (Robert van de Geijn) (06/14/91)
This article is in response to several articles posted recently to
comp.parallel concerning the massively parallel LINPACK benchmark on the
Intel Touchstone DELTA machine. Since I was the anonymous "benchmarker", I
believe a few words from me may shed some light on the situation.

I was contacted by Intel around the second week of May and asked to implement
the LINPACK benchmark on the DELTA. The initial goal was to beat the "world
record" of 5.2 GFLOPS (double precision), reported by Thinking Machines for
the CM-2, by the unveiling of the DELTA on May 31. While a version of the LU
factorization had been developed at the University of Tennessee by Jack
Dongarra and Susan Ostrouchov, that version assumes the matrix is mapped to
nodes by wrapping panels onto an embedded ring. Moreover, no satisfactory
triangular solve had been implemented, nor had all the pieces been put
together. Finally, a quick back-of-the-envelope calculation indicated that
this version did not scale well to a large machine like the DELTA.

Starting from scratch, I implemented a block-torus-wrapped version of the
right-looking LAPACK LU factorization variant, as well as torus-wrapped
variants of the Li-Coleman implementation of the triangular solve (also known
as cyclic algorithms). The development was done on the ORNL 128-node Intel
iPSC/860 GAMMA machine, achieving an impressive 1.92 GFLOPS. Two weeks later,
on May 23, the machine attained 7.844 GFLOPS on 512 nodes for a 12000 x 12000
problem (during the first attempt on the DELTA). After adjusting storage
buffer sizes, the same code produced the numbers reported in article 2643 of
comp.parallel (8.561 GFLOPS). Next, the author went on a much needed vacation
at Disneyland and Sea World.

On June 5, Thinking Machines announced the CM-200 and a new record, 9.03
GFLOPS, for a 28672 x 28672 problem. On June 6, my code attained 10.2 GFLOPS
for a 20000 x 20000 problem on the DELTA.
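For readers unfamiliar with the "right-looking" variant mentioned above, here
is a plain, unblocked, single-node sketch of right-looking LU with partial
pivoting. It is purely illustrative of the update pattern that gets
distributed over the torus-wrapped grid; it is not the Delta code:

```python
# Unblocked, single-node sketch of right-looking LU with partial
# pivoting. Illustrative only; the real Delta code is blocked and
# distributed over a torus-wrapped process grid.

def lu_right_looking(A):
    """Factor A in place (list of lists); returns the pivot row indices."""
    n = len(A)
    piv = []
    for k in range(n):
        # Partial pivoting: pick the largest entry in column k at or
        # below the diagonal, and swap that row up.
        p = max(range(k, n), key=lambda r: abs(A[r][k]))
        piv.append(p)
        A[k], A[p] = A[p], A[k]
        for i in range(k + 1, n):
            # Scale the column below the pivot (the "L" multipliers) ...
            A[i][k] /= A[k][k]
            # ... then apply the rank-1 update to the trailing submatrix:
            # the "right-looking" step that dominates the 2/3 n^3 FLOPs.
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return piv

# Small worked example.
M = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
piv = lu_right_looking(M)
assert piv == [2, 2, 2]                 # row 2 has the largest pivot each time
assert abs(M[0][0] - 8.0) < 1e-12       # U diagonal: 8, -3/4, -2/3
assert abs(M[1][1] + 0.75) < 1e-12
assert abs(M[2][2] + 2.0 / 3.0) < 1e-12
```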
It should be noted that the code being used was put together quickly, in a
matter of two weeks. There are many parameters that can be optimized, and the
memory has not yet been exhausted. A 20K x 20K problem is completed in only
522.7 seconds. Moreover, there are still many standard techniques that can be
used to reduce the communication overhead. 10.2 GFLOPS is only a start.

I am working on a short report on this topic and I am in the process of
completing the code. Intel has promised to release the code to the public
domain once it has been completed.

Robert van de Geijn
Assistant Professor
The University of Texas at Austin