baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) (06/01/91)
The Intel Touchstone Delta machine was unveiled for the public today at the
California Institute of Technology.  The Touchstone Delta machine consists of
528 i860-based computational nodes, connected by a high-speed, two-dimensional
mesh.  This computer is yet another milestone in the cooperatively funded
DARPA/Intel Touchstone project.

The LINPACK benchmark has often been used as one measure of comparison between
machines, and most recently a new section of Dongarra's report, entitled
'Massively Parallel Computing', defines the same test (solving a dense set of
linear equations) but allows the problem size to scale with the size of the
machine.  With the unveiling of the Touchstone Delta machine, Intel can now
publish the following double precision performance numbers for massively
parallel LINPACK:

 np       n     time    GFLOPS   MFLOPS/node
---    -----   ------   ------   -----------
192     2000      5.5     .971        5
192     4000     20.8    2.053       11
192     6000     51.3    2.808       15
192     8000    102.5    3.331       17
192    10000    178.2    3.742       19
192    12000    284.8    4.046       21
512     2000      4.5    1.187        2
512     4000     14.2    3.007        6
512     6000     31.5    4.574        9
512     8000     58.3    5.867       11
512    10000     96.8    6.889       13
512    12000    146.9    7.844       15
512    14000    213.7    8.561       17

where

  np   -> number of processors
  n    -> order of the matrix (full 64-bit precision)
  time -> time in seconds

The previous leader in massively parallel LINPACK was Thinking Machines at
5.2 GFLOPS on a matrix of order 26624 (64K processors).  The current
production iPSC/860 has a published performance of 1.92 GFLOPS on a matrix of
order 8600, using 128 i860 processors, each with 8 Mbytes of local memory.
Conventional 1K LINPACK on one i860 node is published at 25 MFLOPS.

Jerry Baugh
Intel Supercomputer Systems Division
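Each GFLOPS entry in the table follows directly from n and the time column
once an operation count is fixed.  The sketch below is not Intel's benchmark
code; it simply assumes the 2/3*n^3 + 2*n^2 count used by Dongarra's scaled
LINPACK report (discussed later in this thread) and recomputes one entry.

      program lincheck
c     Minimal sketch, not Intel's benchmark code: recompute a GFLOPS
c     entry from the table above, assuming the operation count used by
c     Dongarra's scaled LINPACK report, 2/3*n**3 + 2*n**2.
      double precision n, t, flop
      n    = 12000.0d0
      t    = 146.9d0
      flop = (2.0d0/3.0d0)*n**3 + 2.0d0*n**2
      write(*,*) 'gflops = ', flop/(t*1.0d9)
c     prints roughly 7.84, matching the 512-node, n = 12000 entry
      end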
stevo@elroy.jpl.nasa.gov (Steve Groom) (06/04/91)
In article <1991Jun3.130104.15667@hubcap.clemson.edu> baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) writes:
>
>The LINPACK benchmark has often been used as one measure of comparison between
>machines, and most recently a new section of Dongarra's report, entitled
>'Massively Parallel Computing', defines the same test (solving a dense set of
>linear equations) but allows the problem size to scale with the size of the
>machine. With the unveiling of the Touchstone Delta machine, Intel can now
>publish the following double precision performance numbers for massively
>parallel LINPACK:
[numbers deleted]

At first, I started to read this thinking "what the heck does LINPACK have to
do with the performance of a parallel computer other than measuring the power
of individual nodes?"  Then I started reading more closely, and it appears
that there's more to it than that.

Can someone explain how "massively parallel LINPACK" is different from regular
LINPACK?  What considerations for communication are made in this benchmark?
Since LINPACK is normally used as a measure of number crunching, I'm curious
how this benchmark translates to parallel computers.

As we all know (or we all should know), the performance of a parallel computer
is usually NOT the same as the performance of an individual node multiplied by
the number of nodes in the computer (although we'd just love that to always be
the case).  The obvious misapplication of this kind of benchmark would be to
multiply a single node's LINPACK performance by the number of nodes in the
machine.  I notice that the numbers posted in the above-referenced article do
not scale linearly with the number of nodes used, so there is some efficiency
loss from the single node case.  I'm itching to find out what the source of
this loss is.

This is of particular interest as I am currently porting some existing
parallel code to the Delta, and I'd like to be able to handle the inevitable
queries about "well, they say the Delta does such-and-such LINPACK GFLOPS...".

Any explanation or references would be welcome.
-- 
Steve Groom, Jet Propulsion Laboratory, Pasadena, CA
stevo@elroy.jpl.nasa.gov  {ames,usc}!elroy!stevo
"... and the babe, she needs a chaaaa--nging..." (apologies to Bob Dylan)
baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) (06/05/91)
In article <1991Jun3.233741.8570@elroy.jpl.nasa.gov> stevo@elroy.jpl.nasa.gov (Steve Groom) writes:
>[original post deleted ... Jerry Baugh]
>
>At first, I started to read this thinking "what the heck does LINPACK
>have to do with the performance of a parallel computer other than
>measuring the power of individual nodes?" Then I started reading more
>closely, and it appears that there's more to it than that.
>
>Can someone explain how "massively parallel LINPACK" is different from
>regular LINPACK?

Since I started this, I guess I ought to take a stab at explaining it as well.
The standard LINPACK benchmark is constrained to 100x100 and 1000x1000
problems.  These problem sizes are relatively small when compared with the
capabilities of today's supercomputers, and they are small compared to the
problems people are solving today (a production code at one of our customer
sites does a 20k x 20k matrix multiply).  Recently, Jack Dongarra extended his
standard LINPACK benchmark suite to include a new category, "Massively
Parallel Computing".  This new category allows the problem size to increase up
to and including the limit of physical memory.  Since the problem size can
vary, it is important to note the size (presented as the order of the matrix)
as well as the FLOPS numbers presented.  One of the reasons we are proud of
our numbers here at Intel is that we got a high GFLOP number with a relatively
small problem size (we can still increase the problem size, as we have not yet
run out of physical memory).

>What considerations for communication are made in this benchmark?

Both data decomposition and data communication are critical, and the method
chosen for parallelizing a program must take both into account; selecting one
method of data decomposition may dictate the data communication strategy.  The
strategy used in this case involved a 2D data decomposition that does both
column and row mapping.  This takes advantage of the Touchstone Delta's mesh
architecture by allowing messages to flow to all nearest neighbors, and it
leads to higher performance than a 1D decomposition scheme.

>Since LINPACK is normally used as a measure of number crunching,
>I'm curious how this benchmark translates to parallel computers.

This particular version was designed for parallel computers.

>As we all know (or we all should know), the performance
>of a parallel computer is usually NOT the same as the performance of an
>individual node multiplied by the number of nodes in the computer
>(although we'd just love that to always be the case).
>The obvious misapplication of this kind of benchmark would
>be to multiply a single node's LINPACK performance by the number of nodes in
>the machine. I notice that the numbers posted in the above-referenced
>article do not scale linearly with the number of nodes used, so there
>is some efficiency loss from the single node case. I'm itching to
>find out what the source of this loss is.

Since the data values are distributed across multiple nodes, data must be
passed from the node where it resides to the node where it is needed for the
calculation.  This communication cost is a major part of the difference you
see between the single-node numbers and the 512-node numbers.

>This is of particular interest
>as I am currently porting some existing parallel code to the Delta, and I'd
>like to be able to handle the inevitable queries about "well, they say
>the Delta does such-and-such LINPACK GFLOPS...".

Personally, I have always said that performance on real codes is important.
The "Massively Parallel LINPACK" benchmark provides one measure that is
meaningful for some classes of codes.  For other classes of programs
(databases, for example), it is meaningless.

>Any explanation or references would be welcome.

I hope what I've written helps.  For more information on LINPACK and the
"Massively Parallel LINPACK", I suggest you get Jack Dongarra's paper
'Performance of Various Computers Using Standard Linear Equations Software'.
For an explanation of a parallel 1D decomposition, try 'LAPACK Block
Factorization Algorithms on the Intel iPSC/860', LAPACK Working Note 24,
University of Tennessee, Oct. 1990, by Dongarra and Ostrouchov.

>Steve Groom, Jet Propulsion Laboratory, Pasadena, CA
>stevo@elroy.jpl.nasa.gov {ames,usc}!elroy!stevo

Jerry Baugh
Intel SSD - baugh@SSD.intel.com
patrick@convex.COM (Patrick F. McGehearty) (06/06/91)
In article <1991Jun5.120653.7852@hubcap.clemson.edu> baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) writes:
>
>One of the reasons we are proud of our numbers here at Intel is that
>we got a high GFLOP number with a relatively small problem size (we can
>still increase the problem size, as we have not yet run out of physical
>memory).
>

And rightfully so.  It's relatively easy to get a high GFLOP number (assuming
the raw machine power exists) for a sufficiently large problem, assuming you
can hook enough machines together and wait the necessary days for the final
solution.  There are effective linear equation solver techniques whose
communication time is proportional to the square of the problem size, while
the computation time is proportional to the cube of the problem size.  So for
sufficiently large problems, a gaggle of workstations on a cluster of
Ethernets could give GFLOP numbers if someone wanted to take the trouble to
run the test (say, at one of the universities with lots of workstations, like
Carnegie Mellon, over a holiday weekend).

Another challenge I would like to see would be one which uses an efficient,
portable algorithm for linear equation solving written in a high-level
language such as Fortran.  Then we could get some idea of what is achievable
on a machine without resorting to assembly language.  I know that what's most
efficient on each machine varies with the architecture, so maybe the rules
could be that the portability requirement is met by requiring the program to
be "standard", which could be determined by running it on some commonly
available workstations (with different instruction sets from the target
machine), but the supercomputer vendor could select and implement the
algorithm of their choice.  This approach would provide some information about
the ability of a compiler to get the most out of a machine under ideal
circumstances.  The requirement that it also run on some other machine is
intended to avoid machine-specific system calls.  Does this benchmark idea
sound interesting to anyone else?
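To make the square-versus-cube argument concrete, here is a back-of-the-
envelope sketch.  All of the constants in it (workstation count and speed,
network bandwidth, how often data crosses the wire) are assumptions chosen
only to illustrate the scaling, not measurements of any real cluster.

      program gaggle
c     Back-of-the-envelope sketch of the point above: elimination costs
c     about 2/3*n**3 flops, but only order n**2 data needs to move, so
c     a pile of workstations on Ethernet can post numbers on the order
c     of a GFLOPS for a big enough matrix, if you are willing to wait.
c     Every constant here (100 workstations at 10 MFLOPS each, 1 Mbyte/s
c     of usable network bandwidth, each of the 8*n**2 bytes crossing
c     the wire twice) is an assumption for illustration only.
      double precision n, flop, bytes, tcomp, tcomm, t
      n     = 100000.0d0
      flop  = (2.0d0/3.0d0)*n**3
      bytes = 2.0d0*8.0d0*n**2
      tcomp = flop/(100.0d0*10.0d6)
      tcomm = bytes/1.0d6
      t     = tcomp + tcomm
      write(*,*) 'days   = ', t/86400.0d0
      write(*,*) 'gflops = ', flop/(t*1.0d9)
      end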
john@spectre.unm.edu (John Prentice) (06/06/91)
In article <1991Jun05.185818.1071@convex.com> patrick@convex.COM (Patrick F. McGehearty) writes:
>the most out of a machine under ideal circumstances. The requirement
>that it also run on some other machine is intended to avoid machine-specific
>system calls. Does this benchmark idea sound interesting to anyone else?

Yes and no.  Yes, because your idea would level the playing field and give
people a feeling for what one might expect if you weren't able or willing to
exploit the special aspects of a system.  No, because one also wants to know
just how much can be milked from a supercomputer.  Perhaps what we need is
both sets of benchmarks.

I must admit, however, that I would like to see LINPACK benchmarks which
tackle matrices of a size more on the order of 10,000 to 100,000.  I can
trivially invert a 1000 x 1000 matrix in a few seconds on my Sparc.  Telling
me how fast I can do it on a supercomputer doesn't really tell me much about
how a system performs on the kinds of problems I need it for.  Allowing the
benchmarker to scale the problem to his or her physical memory would seem to
me to be a prescription for getting a lot of results that can't be compared to
each other.  What I would prefer is to just fix the dimension, but fix it big.
That would give a standard benchmark, but more importantly, from the
perspective of a prospective user, it tells me how well the system can handle
real problems, not ones chosen to scale nicely to a particular architecture.

John
--
John K. Prentice
john@spectre.unm.edu (Internet)
Computational Physics Group, Amparo Corporation, Albuquerque, NM
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (06/06/91)
In article <1991Jun3.233741.8570@elroy.jpl.nasa.gov> stevo@elroy.jpl.nasa.gov (Steve Groom) writes:
>Can someone explain how "massively parallel LINPACK" is different from
>regular LINPACK?

Dongarra's latest LINPACK report states that a fixed-size problem is
inappropriate for a scalable ensemble machine.  So, instead, the matrix size
is allowed to vary.  The solution method is considered irrelevant: FLOPS is
defined as (2/3 N^3 + 2 N^2) / elapsed-time.

The LINPACK report wants the following numbers:

 - Accuracy residual.
 - GFLOPS and N for the largest problem run.
 - N-HALF, the problem size that achieved half the GFLOPS.
 - Theoretical peak GFLOPS.
 - Number of processors.

In the following, it's interesting to see the ratio of N^2 to N-HALF^2:

     N     N-HALF   ratio
  ------   ------   -----
  26,624   11,000     6     CM-2, 64K PE (or 2K FPU)
  21,376    3,193    45     nCUBE 2, 1K PE
   5,504    1,180    22     MasPar, 16K PE
  14,000    5,500?    6     Touchstone Delta, 512 PE, as just reported
-- 
Don    D.C.Lindsay    Carnegie Mellon Robotics Institute
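The ratio column above is just (N / N-HALF)^2; a minimal check of the CM-2
entry, for anyone who wants to reproduce it:

      program nhalf
c     Minimal sketch: the "ratio" column above is (N / N-HALF)**2,
c     the ratio of the full-size to the half-speed problem areas.
      double precision n, nh
      n  = 26624.0d0
      nh = 11000.0d0
      write(*,*) 'CM-2 ratio = ', (n/nh)**2
c     prints roughly 5.9, which the table rounds to 6
      end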
elias@wonton.TC.Cornell.EDU (Doug Elias) (06/06/91)
I'd appreciate some rationale for

   "FLOPS is defined as (2/3 N^3 + 2 N^2) / elapsed-time",

where, I assume, N == number of processors used in the computation.

Thanks,
doug
#  ____  |Internet: elias@theory.tn.cornell.edu
#dr _|_)oug|USmail: Mgr., Adv. Comp. Res. Inst./Cornell Theory Center
#  (_|   |         704A TheoryCtrBldg/C.U./Ithaca/N.Y./14853-3801
#  (_|__ |MaBelle: 607-254-8826   Fax: 607-254-8888
gary@chpc.utexas.edu (Gary Smith) (06/06/91)
In article <1991Jun5.120653.7852@hubcap.clemson.edu> baugh%ssd.intel.com@RELAY.CS.NET (Jerry Baugh) writes:
>
>One of the reasons we are proud of our numbers here at Intel is that
>we got a high GFLOP number with a relatively small problem size (we can
>still increase the problem size, as we have not yet run out of physical
>memory).
>

Fine.  Scale it to the physical memory limit and tell us how long it takes.
Worley has written a marvelous paper, "The Effect of Time Constraints on
Scaled Speed-Up" (ORNL/TM-11031), wherein he shows that a meaningful
interpretation of speed-up depends on constraints on execution time.  Why
don't advocates of the promise of massive parallelism acknowledge such issues?

john@spectre.unm.edu (John Prentice) replies to the above article:
>>Allowing the
>>benchmarker to scale the problem to his or her physical memory would
>>seem to me to be a prescription for getting a lot of results that can't be
>>compared to each other. What I would prefer is to just fix the
>>dimension, but fix it big.

Another important paper, "Type Architectures, Shared Memory, and the Corollary
of Modest Potential", should be read by john@spectre.unm.edu and most other
advocates of the promise of massive parallelism.  To paraphrase Lawrence
Snyder (author of the paper, Annual Reviews of Computer Science, 1986), it's
certainly still premature to be pessimistic regarding ultimate success in
achieving the benefits of massive parallelism, yet THERE IS REASON TO BE
EXTREMELY CAUTIOUS.

Continuing to paraphrase Snyder: The frequent challenge with physical
phenomena models is not to solve a fixed-size problem faster, but to solve
larger instances within a fixed time budget.  Simple rationale: Solutions are
so computationally intensive that problem instances of an "interesting" size
do not complete execution in a tolerable amount of time.  What is the effect
of parallelism on problem size?  Keeping time fixed and UNREALISTICALLY
assuming the best possible speedup, it follows that superlinear problems (the
interesting, realistic ones) can only be improved sublinearly by parallel
computation.  Specifically, if a problem takes t = c*n^x sequential steps, and
if the application of p processors permits an increase in the problem size by
a factor of m, t = c*(m*n)^x/p, then the problem size can increase by at most
a factor of m = p^(1/x).  For example, to increase by two orders of magnitude
the size of a problem whose sequential performance is given by t = c*n^4
(three spatial dimensions, one time dimension) requires, WITH UNREALISTIC
OPTIMISM, 100,000,000 processors!

Again, it's time for the advocates of the promise of massive parallelism to
acknowledge Snyder's "corollary of modest potential."

Randolph Gary Smith                      Internet: gary@chpc.utexas.edu
Systems Group                            Phonenet: (512) 471-2411
Center for High Performance Computing    Snailnet: 10100 Burnet Road
The University of Texas System                     Austin, Texas 78758-4497
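The arithmetic in the last example follows directly from m = p^(1/x).  A
minimal sketch, using nothing beyond the formula already quoted:

      program modest
c     Minimal sketch of the "corollary of modest potential" arithmetic
c     above: with t = c*n**x and p processors in a fixed time budget,
c     the problem size grows by at most m = p**(1/x); equivalently,
c     growing the problem by a factor of m takes p = m**x processors.
      double precision m, x
      m = 100.0d0
      x = 4.0d0
      write(*,*) 'processors needed = ', m**x
c     prints 1.0e8, the 100,000,000 processors quoted above
      end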
lethin@raisin-scone.ai.mit.edu (Richard A. Lethin) (06/06/91)
Some questions:

At what size N for massive LINPACK does the accumulated numerical fuzz swamp
the precision available in a double precision number?

Aren't really large problems of interest sparse matrices anyway?  Who holds
the record for massive sparse matrix solves?

Perhaps some of the numerical experts out there could tell us something about
the characteristics of these massive numerical problems.
jet@karazm.math.uh.edu (J Eric Townsend) (06/06/91)
In article <16333@life.ai.mit.edu> lethin@raisin-scone.ai.mit.edu (Richard A. Lethin) writes:
>At what size N for massive LINPACK does the accumulated numerical fuzz
>swamp the precision available in a double precision number?

I dunno.  i860s will work with denormalized numbers (which kills performance,
IMHO).

>Aren't really large problems of interest sparse matrices anyway? Who

Not really; talk to Boeing, for example.  I wish they'd publish on what
they're doing, but they're so damn security-conscious that some of their
employees almost had a fit during an Intel class when it was mentioned that
they had an iPSC/860 and were running LOOCS.
-- 
J. Eric Townsend - jet@uh.edu - bitnet: jet@UHOU - vox: (713) 749-2126
Skate UNIX! (curb fault: skater dumped)
-- If you're hacking PowerGloves and Amigas, drop me a line. --
dodson@convex.COM (Dave Dodson) (06/06/91)
In article <ELIAS.91Jun6090922@wonton.TC.Cornell.EDU> elias@wonton.TC.Cornell.EDU (Doug Elias) writes:
>I'd appreciate some rationale for
>   "FLOPS is defined as (2/3 N^3 + 2 N^2) / elapsed-time",
>where, I assume, N == number of processors used in the computation.

No.  N is the order of the system of equations.  The expression in parentheses
approximately represents the operation count for Gaussian elimination using
ordinary vector and matrix operations.

What is interesting about this is that there are algorithms based on "fast"
matrix multiplication, where the product of two K by K matrices can be formed
with fewer than O(K^3) floating point operations.  If you use one of these
fast algorithms, you may do significantly fewer than (2/3 N^3 + 2 N^2)
floating point operations, but you get credit for (2/3 N^3 + 2 N^2) operations
anyway.  The numerical stability of these fast methods is questionable,
especially when the equations and unknowns are poorly scaled.  Therefore, a
benchmark report on a solver should state whether a fast matrix multiplication
algorithm was used, so the reader can draw his own conclusions regarding the
solver's applicability to his problem.

Another factor that must be considered is the type of pivoting used.  Since
pivoting can involve much data motion, omitting it is desirable from the
performance point of view.  However, the resulting code may divide by zero or,
even worse, produce severely inaccurate answers with no warning.  Except for
matrices that are known to have special properties such as positive
definiteness or strict diagonal dominance, at least partial pivoting is
required to ensure numerical stability.  I haven't read the Dongarra report
closely enough to know whether it specifies that pivoting is required, but if
it doesn't, then a complete report on a solver would describe the type of
pivoting used.  I don't recall that information in the Touchstone Delta
report.

----------------------------------------------------------------------
Dave Dodson                                   dodson@convex.COM
Convex Computer Corporation    Richardson, Texas    (214) 497-4234
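A rough way to see how much the fixed formula can over-credit a fast solver:
ignoring constants entirely, the credited count grows like n^3 while a
Strassen-style count grows like n^2.807, so the gap grows like n^0.193.  The
snippet below shows only that growth rate; it is not a claim about any
particular implementation's actual operation count.

      program fastmm
c     Rough illustration of the "credit" issue above.  The benchmark
c     charges 2/3*n**3 + 2*n**2 operations regardless of algorithm,
c     while Strassen-style multiplication grows only like n**2.807,
c     so the credited count outgrows the work actually done by roughly
c     a factor of n**0.193.  Constants are ignored here on purpose;
c     this shows only the growth rate, not a real operation count.
      double precision n
      n = 20000.0d0
      write(*,*) 'over-credit factor ~ n**0.193 = ', n**0.193d0
      end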
dodson@convex.COM (Dave Dodson) (06/06/91)
In article <16333@life.ai.mit.edu> lethin@raisin-scone.ai.mit.edu (Richard A. Lethin) writes:
>Aren't really large problems of interest sparse matrices anyway?

In radar-cross-section work, huge dense non-Hermitian complex matrices abound.
I suppose that larger matrices yield better models, making it easier for you
to determine if you can either find or hide the stealth bomber, whichever you
are trying to do.

----------------------------------------------------------------------
Dave Dodson                                   dodson@convex.COM
Convex Computer Corporation    Richardson, Texas    (214) 497-4234
john@spectre.unm.edu (John Prentice) (06/07/91)
In article <1991Jun6.144903.20456@chpc.utexas.edu> gary@chpc.utexas.edu (Gary Smith) writes:
>
>Continuing to paraphrase Snyder: The frequent challenge with physical
>phenomena models is not to solve a fixed-size problem faster, but to
>solve larger instances within a fixed time budget. Simple rationale:
>Solutions are so computationally intensive that problem instances of an
>"interesting" size do not complete execution in a tolerable amount of
>time. What is the effect of parallelism on problem size? Keeping time
>fixed and UNREALISTICALLY assuming the best possible speedup, it follows
>that superlinear problems (the interesting, realistic ones) can only be
>improved sublinearly by parallel computation. Specifically, if a problem
>takes t = c*n^x sequential steps, and if the application of p processors
>permits an increase in the problem size by a factor of m, t = c*(m*n)^x/p,
>then the problem size can increase by at most a factor of m = p^(1/x).
>For example, to increase by two orders of magnitude the size of a problem
>whose sequential performance is given by t = c*n^4 (three spatial dimensions,
>one time dimension) requires, WITH UNREALISTIC OPTIMISM, 100,000,000
>processors!
>
>Again, it's time for the advocates of the promise of massive parallelism
>to acknowledge Snyder's "corollary of modest potential."
>

So what would you suggest as an alternative?  By this analysis, any way you
cut it, a serial processor will still take p times longer to do the same
problem (of course, you have ignored overhead, but that works in favor of your
argument).  If I can do it 64,000 times faster on a CM-2 and I don't have any
choice but to do the problem, then I am going to use the CM-2.  The
alternative is to just not do the problem.

You are both right and wrong about what the goal is in scientific computing.
For many applications, the goal isn't to run bigger problems, it is to make
current ones less expensive.  Clearly parallelism buys you something there.
For applications where bigger is better, I will grant you that parallelism has
many drawbacks, but I fail to see an alternative.

The main thing that bothers me in your analysis, however, is that you are
being far too simplistic.  You are assuming that numerical schemes are not
going to change or be developed to exploit the unique capabilities of a
parallel architecture.  But things like multigrid, composite grid domain
decomposition, and so forth are being developed, and these schemes can be made
to really fly on a parallel system.  Parallelism is largely a new way of
thinking, and as more people become involved in it, I think you will see
numerical methods undergo some major advances as people learn to think in
terms of this new paradigm.

By the way, I know an awful lot of physicists who would not agree with you
that the only problems of interest are superlinear ones :-) .

John
--
John K. Prentice
john@spectre.unm.edu (Internet)
Computational Physics Group, Amparo Corporation, Albuquerque, NM
alan@uh.msc.umn.edu (Alan Klietz) (06/07/91)
In article <1991Jun6.144903.20456@chpc.utexas.edu> gary@chpc.utexas.edu (Gary Smith) writes:
<
<To paraphrase Snyder: The frequent challenge with physical
<phenomena models is not to solve a fixed-size problem faster, but to
<solve larger instances within a fixed time budget. Simple rationale:
<Solutions are so computationally intensive that problem instances of an
<"interesting" size do not complete execution in a tolerable amount of
<time. What is the effect of parallelism on problem size? Keeping time
<fixed and UNREALISTICALLY assuming the best possible speedup, it follows
<that superlinear problems (the interesting, realistic ones) can only be
<improved sublinearly by parallel computation. Specifically, if a problem
<takes t = c*n^x sequential steps, and if the application of p processors
<permits an increase in the problem size by a factor of m, t = c*(m*n)^x/p,
<then the problem size can increase by at most a factor of m = p^(1/x).
<For example, to increase by two orders of magnitude the size of a problem
<whose sequential performance is given by t = c*n^4 (three spatial dimensions,
<one time dimension) requires, WITH UNREALISTIC OPTIMISM, 100,000,000 processors!
Well, if your problem really is quartic in n, yes, you are going to have
difficulty. But I suspect most "real" problems tend more towards linear
or quadratic behavior. Physical simulations that at first appear to demand
higher order complexity often yield themselves to a more efficient approach
given a re-examination of the problem.
--
Alan E. Klietz
Minnesota Supercomputer Center, Inc.
1200 Washington Avenue South
Minneapolis, MN 55415
Ph: +1 612 626 1737 Internet: alan@msc.edu
prins@cs.unc.edu (Jan Prins) (06/08/91)
In article <1991Jun6.174129.25202@hubcap.clemson.edu>, dodson@convex.COM (Dave Dodson) writes:
> >   "FLOPS is defined as (2/3 N^3 + 2 N^2) / elapsed-time",
>
> What is interesting about this is that there are algorithms based on
> "fast" matrix multiplication, where the product of two K by K matrices
> can be formed with fewer than O(K^3) floating point operations. If you
> use one of these fast algorithms, you may do significantly fewer than
> (2/3 N^3 + 2 N^2) floating point operations, but you get credit for
> (2/3 N^3 + 2 N^2) operations anyway. [...]

In particular, using this month's asymptotically fastest solver, your lowly
workstation can beat the LINPACK performance of *any* machine whose
performance was obtained through use of the standard algorithm, provided
enough time and space.

Of course, it's not clear whether you are allowed to report a record
performance 10,000 years before it completes.

--\--  Jan Prins (prins@cs.unc.edu)
  /    Computer Science Dept.
--\--  UNC Chapel Hill
gary@chpc.utexas.edu (Gary Smith) (06/10/91)
In article <1991Jun06.205144.22611@ariel.unm.edu>, john@spectre.unm.edu (John Prentice) writes:
|> In article <1991Jun6.144903.20456@chpc.utexas.edu> gary@chpc.utexas.edu (Gary Smith) writes:
|>
|> >Again, it's time for the advocates of the promise of massive parallelism
|> >to acknowledge Snyder's "corollary of modest potential."
|> >
|>
|> So what would you suggest as an alternative?  By this analysis, any way
|> you cut it, a serial processor will still take p times longer to do the
|> same problem (of course, you have ignored overhead, but that works in
|> favor of your argument).  If I can do it 64,000 times faster on a CM-2
|> and I don't have any choice but to do the problem, then I am going to
|> use the CM-2.  The alternative is to just not do the problem.
|>
|> You are both right and wrong about what the goal is in scientific
|> computing.  For many applications, the goal isn't to run bigger problems,
|> it is to make current ones less expensive.

Using Speedup(f,p) = 1/[(1-f)+(f/p)], with f being the fraction of code
parallelized and p being the number of processors, and unrealistically
assuming no overhead, a speedup of 64000 using 65536 processors requires that
the problem be 99.9999634% parallel.  How many problems do you know of that
are that parallel?

---Gary

Randolph Gary Smith                      Internet: gary@chpc.utexas.edu
Systems Group                            Phonenet: (512) 471-2411
Center for High Performance Computing    Snailnet: 10100 Burnet Road
The University of Texas System                     Austin, Texas 78758-4497
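For anyone who wants to reproduce the figure, solving the speedup formula for
f gives f = (1 - 1/S) / (1 - 1/p).  A minimal sketch:

      program amdahl
c     Minimal sketch reproducing the figure above: solve
c     Speedup(f,p) = 1/((1-f) + f/p) for f, with p = 65536 and a
c     target speedup of 64000.
      double precision s, p, f
      s = 64000.0d0
      p = 65536.0d0
      f = (1.0d0 - 1.0d0/s)/(1.0d0 - 1.0d0/p)
      write(*,*) 'required parallel fraction = ', f
c     prints 0.999999634, i.e. the 99.9999634% quoted above
      end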
john@spectre.unm.edu (John Prentice) (06/11/91)
In article <1991Jun10.144354.695@chpc.utexas.edu> gary@chpc.utexas.edu (Gary Smith) writes:
>
>Using Speedup(f,p) = 1/[(1-f)+(f/p)], with f being the fraction of code
>parallelized and p being the number of processors, and unrealistically
>assuming no overhead, a speedup of 64000 using 65536 processors requires
>that the problem be 99.9999634% parallel.  How many problems do you know
>of that are that parallel?
>

Very few.  But this misses the central question.  If we are going to get even
a factor of 100 speedup in the next few years, how else are we going to do it
other than by use of massive parallelism?  If there is an alternative, I would
be overjoyed to know about it.  I don't particularly find parallel computing
fun, and it is not particularly easy, but I just don't see any alternative.
That it has limitations, so what?  The question still remains: what is the
alternative?

John
--
John K. Prentice
john@spectre.unm.edu (Internet)
Computational Physics Group, Amparo Corporation, Albuquerque, NM
furnish@zig.inria.fr (Geoffrey Furnish) (06/13/91)
In article <1991Jun10.144354.695@chpc.utexas.edu>, gary@chpc.utexas.edu (Gary Smith) writes:
> In article <1991Jun06.205144.22611@ariel.unm.edu>, john@spectre.unm.edu (John Prentice) writes:
> |> In article <1991Jun6.144903.20456@chpc.utexas.edu> gary@chpc.utexas.edu (Gary Smith) writes:
> |>
> |> >Again, it's time for the advocates of the promise of massive parallelism
> |> >to acknowledge Snyder's "corollary of modest potential."
> |> >
> |>
> |> So what would you suggest as an alternative?  By this analysis, any way
> |> you cut it, a serial processor will still take p times longer to do the
> |> same problem (of course, you have ignored overhead, but that works in
> |> favor of your argument).  If I can do it 64,000 times faster on a CM-2
> |> and I don't have any choice but to do the problem, then I am going to
> |> use the CM-2.  The alternative is to just not do the problem.
> |>
> |> You are both right and wrong about what the goal is in scientific
> |> computing.  For many applications, the goal isn't to run bigger problems,
> |> it is to make current ones less expensive.
>
> Using Speedup(f,p) = 1/[(1-f)+(f/p)], with f being the fraction of code
> parallelized and p being the number of processors, and unrealistically
> assuming no overhead, a speedup of 64000 using 65536 processors requires
> that the problem be 99.9999634% parallel.  How many problems do you know
> of that are that parallel?
>
> ---Gary
>
> Randolph Gary Smith                      Internet: gary@chpc.utexas.edu
> Systems Group                            Phonenet: (512) 471-2411
> Center for High Performance Computing    Snailnet: 10100 Burnet Road
> The University of Texas System                     Austin, Texas 78758-4497

I have been reading this thread for a while and am finding it very
interesting, especially since I am a graduate student in physics at UT Austin
and am quite familiar with the center you are affiliated with.  I am tempted
to infer that the pessimism you are expressing probably reflects the attitudes
behind the UT System procurement plans.  If so, that is extremely unfortunate
for researchers in the UT system.  Although I am not familiar with the
theoretical results you are citing, and indeed am quite interested in what
they imply, there are a few things which are clear to me.

First of all, I have been spending this spring doing serious scientific
computing on a Connection Machine 2 (located in France).  By this point I have
worked with it extensively for several months, and on two completely different
sorts of problems.  One of these is the time advancement of a field subject to
a nasty nonlinear PDE.  The CM, with its massively parallel architecture, is
_EXTREMELY_ well suited for this problem, and the results I have obtained are
absolutely mind-boggling.  I have not computed MIPS or MFLOPS for this
problem, but I do know that the throughput is many times better than I have
ever seen on any computer before in my life.

Secondly, I am not a greenhorn.  I spend lots of computing time on high-dollar
Crays out at the National Energy Research Supercomputing Center's facilities.
I also occasionally use the facilities at the UT Center for High Performance
Computing.  And I can tell you now from personal experience that for some
types of problems, comparing a Cray to a CM-2 is like comparing a tortoise to
a cheetah.  That is not a theoretical analysis.  It is an "experimental"
fact/observation.

Additionally, your last line above asks us to consider a speedup of 64000, and
then you presumably show how ridiculous this is.  Of course it is ridiculous.
No one needs to see a speedup of 64000 to justify the cost of an MP
architecture machine.  One only needs to see a speedup which offsets the cost
differential between the computers in question.  It's not clear to me that a
CM-2 is much pricier than a Cray in the first place, yet it is _MUCH_ faster
for at least some kinds of problems.  The speedup I see is more like two
orders of magnitude, not 4.5.

You further assert/imply that no problem could be 99.999...% parallel.  I
think this shows that you really don't understand what you're talking about.
In my field advancement problem, it is _ALL_ parallel.  Period.  End of
discussion.  Even the real-time display of results is accomplished via a
parallel projection function.  The only part of the code which is not parallel
computation is control flow, and that takes essentially zero time, since there
are no decision branchings in the solution algorithm.  There is also history
keeping, but that relates to I/O, not computer architecture.  The point is,
Danny Hillis (founder of Thinking Machines) is right: many problems are
_inherently_ parallel, and trying to solve them with serial architecture
computers introduces artificial and unnecessary complexity.

Which brings me to my last point.  You seem to be completely oblivious to the
whole issue of ease of programming.  The CM comes with a slew of languages
which implement parallel programming operations in simple and easy to
understand ways.  They have a compiler called C* which consists of very
reasonable extensions to ANSI C to provide facilities for expressing parallel
constructs and communication operations; substantial efforts are underway to
standardize this language with other vendors of MP hardware.  They also have
an implementation of Fortran 90 which provides direct support for array
processing.  The importance of this is that these products make programming in
parallel _MUCH EASIER_ than programming for serial architecture machines like
Crays.  For example, elemental array arithmetic requires no subscripts.  On
the CM you can say:

	a = b * c

On a Cray you would have to say:

	do i = 1, n
	  do j = 1, m
	    a(i,j) = b(i,j) * c(i,j)
	  enddo
	enddo

Big deal?  Well, it is when you try to go and enhance your program in some
way.  In my case I was able to take my parallel program on the CM and convert
it from a code which solved a field in two dimensions to one which solved it
in three dimensions, by changing 4 lines of code which related to the
specification of the arrays, taking a total of about 30 minutes of my time.
In a language written for a serial machine like a Cray, I'd have had to change
every single set of double loops into triple loops and add a subscript to
every single array reference in the program, something like 3 or 4 hundred
places.  In my book, the opportunity for coding errors goes as the exponential
of the number of changes you make.  On the CM, using a parallel version of C,
I only had to change the data declarations.  The actual solution algorithm was
not even touched.  On a Cray I'd have spent a couple of weeks making the
necessary mods to my sources, and then I'd have had to worry for months about
possible coding errors.

Furthermore, using the parallel C on the CM, my solution algorithm is actually
_COMPLETELY INDEPENDENT OF SYSTEM DIMENSIONALITY_.  To make the code solve
this field in 4 or 5 or n dimensions requires precisely zero modifications to
the source code algorithm.  I would have to change the data declarations, but
again, that's four lines of code.
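A generic Fortran 90 fragment makes the dimensionality point concrete.  This
is not the CM code being described (and the array shapes are invented for
illustration); it only shows that the update line itself is rank-independent:

	! Generic Fortran 90 sketch of the point above; not the CM code
	! being discussed, and the array shapes are illustrative only.
	! The update line is rank-independent: going from 2-D to 3-D
	! touches nothing but the declarations.
	program rankfree
	  implicit none
	  real, dimension(64,64)    :: a2, b2, c2
	  real, dimension(64,64,64) :: a3, b3, c3   ! only these lines change
	  b2 = 1.0;  c2 = 2.0
	  b3 = 1.0;  c3 = 2.0
	  a2 = b2 * c2     ! identical "algorithm" line in 2-D ...
	  a3 = b3 * c3     ! ... and in 3-D
	  print *, a2(1,1), a3(1,1,1)
	end program rankfree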
The advantages of massive parallelism are many and great.  You should endeavor
to try it before hollering about what a farce it is.  In my book there are two
ways to benchmark computers.  One is theoretical computing speed, MFLOPS and
the like; in this department the CM completely overwhelms the Cray for certain
classes of programs.  The other is the pragmatic issue of how long it takes to
get a trustworthy program up and running, wall time to completion, ease of
obtaining graphics, and the like.  Again, using parallel expression, on
problems well suited for it, the CM beats the Cray hands down in the "reliable
programming" department.

From the standpoint of scientific visualization, the best tools I've ever seen
run on the CM.  One of the reasons I don't use UT CHPC more is the infuriating
difficulty of getting graphical output.  In contrast, Thinking Machines
provides direct and easy to use support for X; so much so that you can render
images on their high speed graphics device or in an X window on a networked
workstation _WITHOUT MODIFYING A SINGLE LINE OF CODE_.  Ask Cray to do that
for you!

Is massive parallelism the solution to all mankind's computing needs?
Probably not.  I am sure the theoretical concerns you have cited have their
domain of applicability.  But I am also certain that when I want to get work
done, I can get it done faster, more reliably, and with less effort on a CM-2
than I can on a Cray.  That's just the way it is.  For the benefit of the UT
scientific research community and beyond, I urge you and others who share your
skepticism to reach a little beyond the confines of Cray's marketing grasp and
take a more informed look at what MP has to offer.  I for one have been
extremely impressed by what I found under the hood of the CM-2.

Geoffrey Furnish
furnish@solar.ph.utexas.edu
gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) (06/14/91)
In article <11771@mirsa.inria.fr> furnish@zig.inria.fr (Geoffrey Furnish) writes:
>In a language written for a serial machine like a Cray, I'd have had to change
>every single set of double loops into triple loops and add a subscript to
>every single array reference in the program.

Er, there's nothing about Fortran 90 that makes it impossible to write a Cray
compiler for it.  On the other hand, there are array expressions in F90 which
cannot be optimized as well as the equivalent serial expressions.  Ask on
comp.lang.fortran for examples.

Array notation can be nice in some situations.  It isn't for everyone.  For
example, I have a program which has a 3-D data array and a one-dimensional
operator which must be applied in the x, y, and z directions.  This "operator"
is about 3,000 lines of code.  Tell me how I can write all those expressions
into 3 sets of 3-D array code without writing my own preprocessor?  Hm?

Nice, yes.  Solves everything, no.

BTW, my code is typical for high-accuracy astrophysical fluid dynamics.

-- greg
rvdg@cs.utexas.edu (Robert van de Geijn) (06/14/91)
This article is in response to several articles posted recently to
comp.parallel concerning the massively parallel LINPACK benchmark on the Intel
Touchstone DELTA machine.  Since I was the anonymous "benchmarker", I believe
a few words from me may shed some light on the situation.

I was contacted by Intel around the second week of May and asked to implement
the LINPACK benchmark on the DELTA.  The initial goal was to beat the "world
record" of 5.2 GFLOPS (double precision), reported by Thinking Machines for
the CM-2, by the unveiling of the DELTA on May 31.  While a version of the LU
factorization had been developed at the University of Tennessee by Jack
Dongarra and Susan Ostrouchov, this version assumes that the matrix is mapped
to nodes by wrapping panels onto an embedded ring.  Moreover, no satisfactory
triangular solve had been implemented, nor had all the pieces been put
together.  Finally, a quick back-of-the-envelope calculation indicated that
this version did not scale well to a large machine like the DELTA.

Starting from scratch, I implemented a block-torus-wrapped version of the
right-looking LAPACK LU factorization variant, as well as torus-wrapped
variants of the Li-Coleman implementation of the triangular solve (also known
as cyclic algorithms).  The development was done on the ORNL 128-node Intel
iPSC/860 GAMMA machine, achieving an impressive 1.92 GFLOPS.  Two weeks later,
on May 23, the machine attained 7.844 GFLOPS on 512 nodes for a 12000 x 12000
problem (during the first attempt on the DELTA).  After adjusting storage
buffer sizes, the same code produced the numbers reported in article 2643 of
comp.parallel (8.561 GFLOPS).  Next, the author went on a much needed vacation
at Disneyland and Sea World.

On June 5, Thinking Machines announced the CM-200 and a new record, 9.03
GFLOPS for a 28672 x 28672 problem.  On June 6, my code attained 10.2 GFLOPS
for a 20000 x 20000 problem on the DELTA.

It should be noted that the code being used was put together quickly, in a
matter of 2 weeks.  There are many parameters that can be optimized, and the
memory has not yet been exhausted.  A 20K x 20K problem is completed in only
522.7 seconds.  Moreover, there are still many standard techniques that can be
used to reduce the communication overhead.  10.2 GFLOPS is only a start.

I am working on a short report on this topic and am in the process of
completing the code.  Intel has promised to release the code to the public
domain once it has been completed.

Robert van de Geijn
Assistant Professor
The University of Texas at Austin
furnish@zig.inria.fr (Geoffrey Furnish) (06/17/91)
This is a followup to my prior posting, in which I related my experiences
using the CM-2.  Since that posting I have received a deluge of private mail
containing comments of all sorts.  In particular, it has come to my attention
that some of the statements I made in that posting were not correct.  In the
following I present corrected information supplied to me by an employee of
Cray, and some comments of my own.  Assuming that I get all the mistakes
worked out with this posting, I don't intend to continue my part of this
discussion further on the net, but would still be happy to receive comments
from other interested parties.  My apologies for the confusion I introduced.

----------

>> you can say:
>>
>>	a = b * c
>>
>> On a Cray you would have to say:
>>
>>	do i = 1, n
>>	  do j = 1, m
>>	    a(i,j) = b(i,j) * c(i,j)
>>	  enddo
>>	enddo
>
>A subset of the Fortran-90 array syntax has been in CFT77 since
>version 1.0 (about 1986). If you would like to express your problems
>this way, there is nothing stopping you!

I was unaware that CFT77 provides partial Fortran 90 support; thanks for
setting me straight.  I am sure we will all benefit from the widespread
availability of this enhanced language definition on all platforms, from
workstations to supercomputers.  May that day come quickly!

>Many believe that Unicos has excellent development tools for both Fortran and
>C. Both are far beyond what a 'normal' Unix system would provide. Many are
>X window based (cdbx, and the *view tools for example).

Having not used Unicos, I was unaware of this also.  So far, all the Cray
systems I've used have run either COS or CTSS.  I am aware that there is a
growing movement among supercomputing facilities to move to Unicos, and it
sounds like we will benefit from this trend.

>Also comparing a 'stand-alone' connection machine use (they aren't *really*
>timeshared...) with a highly utilized Cray system is an Apples to Oranges
>comparison.

I'm not sure what the author means by *really* in this case.  I do know that
the CM-2 I use every day comes in two "halves".  One half is single-tasking
during business hours, and the other provides what I consider to be genuine
timesharing capability during business hours.  By this I mean that several
users can simultaneously use this half of the machine, simultaneously running
programs and obtaining results.  Obviously it isn't as fast as when you
operate in single-task mode, but that is only natural.  My understanding is
that this multi-user capability is accomplished by swapping users' sessions in
and out, rather than making them share resources.  I believe this is by
definition timesharing, as opposed to multitasking.  Both halves run batch
queues at night, but you can slip in interactive sessions between batch jobs
if you're sly.

>Stand-alone Cray systems can provide excellent turnaround to
>a single user as well. If you have DARPA to provide you with an expensive
>play toy, a CM can make sense. Most of our customers are not so lucky.

No doubt.  On the other hand, neither am I.  The CM-2 I have been using is
shared by a large number of researchers.  The user load is similar to what I
have experienced when using Crays at US-based supercomputer centers.  My claim
is that the CM has provided the most productive research environment that I
have used to date.  I obviously can't speak for anyone else, but it works for
me.
>> Thinking Machines provides direct and easy to use support for X; so much so
>> that you can render images on their high speed graphics device or in an
>> X window on a networked workstation _WITHOUT MODIFYING A SINGLE LINE OF CODE_.
>> Ask Cray to do that for you!
>
>This is also an interesting comment. X has been available on Unicos since
>version 3.0 (1987) - first X10, and currently X11 (R4). Motif and OpenLook
>are also readily available. Again what is the problem?

Again, I have not used Unicos, so I was not aware of this capability.  It
sounds like all Cray users would be a lot better off if all Crays ran Unicos.

To wrap up my contribution to this thread, let me say that it was certainly
not my intention to under-represent Cray products.  I can only comment on my
own experiences, and all of these things were outside of my experience.
Furthermore, parallelism has many facets and runs a wide gamut from
coarse-grained (multiprocessors like the Y-MP and others) to fine-grained
(like the CM and others).  Each system has applicability to a class of
problems, and those of us doing research with parallel computing are
continually finding new ways to use systems of both descriptions.

I think the point that I and several others have been trying to make is that
MP in particular has a whole lot more capability to offer than most people
realize.  When I first sat down and read the technical description of the CM
and its SIMD programming model, I thought "well, that's interesting, but it
would be better if ...".  Then I sat down and started programming it, and I
realized, much to my own surprise, that a very large percentage of the things
I am interested in are very naturally and efficiently expressed and solved in
the SIMD model.  In fact, there is nothing that I am interested in which is
more easily or naturally expressed using any other paradigm.

MIMD machines in particular (I have been reminded of the origins of this
thread) may provide very significant per-processor performance, that is true.
But then you have to introduce all kinds of synchronization code to manage the
dispersal of intermediate results.  This introduces time delay (the overhead
referred to in numerous prior postings) and, what I consider to be more
important, added complexity.  Additionally, in order to squeeze good
performance out of such systems throughout the course of an application's
execution, it is often necessary to dynamically repartition the problem so
that each processor performs its share of the load in a comparable amount of
time.  This adds even more artificial complexity to the problem.

It seems to me that a lot of the discussion about supercomputer performance is
made in reference to performance on certain canned applications like LINPACK
or NASTRAN or the like.  While these things do reflect the needs of a large
body of users, there is another group who NEVER run canned programs, but
rather spend the vast majority of their time writing their own code to solve
their own problems.  For those of us in that category, M/GFLOPS may very
legitimately take a back seat to more important and less discussed issues like
code complexity and reliability.  NASTRAN may have been debugged in 1965, but
what about the nuclear reactor simulator you are writing yourself?  For these
people, the expressive capability and simplicity of SIMD programming is not a
matter of convenience but of necessity.  That such machines can also yield
superior performance for a large class of problems is icing on the cake.  To
each his own.
Geoff Furnish furnish@solar.ph.utexas.edu
spsl@cs.bham.ac.uk (Steven Lam) (06/20/91)
Dear Sir,

I understand that you've been using both a Cray and a CM-2.  Can you give me
some figures, such as the price and the performance of each machine?  I
personally believe that fine-grain computers are much cheaper to build and
hence less dependent on integration technology.

Does the compiler for the CM-2 allow you to grab some number of processing
cells for your application while other users do the same?  Have you ever
experienced a communication bottleneck when working with the Cray?

There have been a few recent papers on the CM-2 which find that the Connection
Machine is not very good at computing the singular value decomposition or
similar problems, as the cell utilization may fall to 20-40%.  What is your
comment?

Stephen Lam
+++++++++++++++++++++++
School of Electronic Engineering
University of Birmingham
Birmingham B15 2TT
e-mail: spsl@uk.ac.bham.cs
+++++++++++++++++++++

Thank you.