carter@IASTATE.EDU (Carter Michael Brannon) (03/21/91)
Archive-name: benchmark/supercomputer/slalom/0
Archive-directory: tantalus.al.iastate.edu:/pub/Slalom/ [129.186.200.15]
Original-posting-by: carter@IASTATE.EDU (Carter Michael Brannon)
Original-subject: Re: Price/Performance figures for Number-Crunching
Reposted-by: emv@msen.com (Edward Vielmetti, MSEN)

Dr. McCalpin,

We found your LINPACK/peak MFLOPS table quite interesting, and have some
data to add to it.  We applaud the use of "stream MFLOPS," which we
generally refer to as "Level 1 BLAS MFLOPS" around here.  Your "MFLOPS
Max" is what we would call "Level 3 BLAS MFLOPS": operands are re-used,
so processors that are bandwidth-starved can still show high numbers.
(We sketch the distinction in C further down, just before naming the
guilty parties.)

However, we find this kind of performance evaluation limited and
misleading.  MFLOPS measures correlate poorly with actual performance on
complete applications, and the LINPACK measure is particularly
inaccurate in that it ignores all but one particular, and not very
common, linear algebra operation.  (Most matrices that arise in
applications are sparse, diagonally dominant, and symmetric; Dongarra's
is dense, random, and nonsymmetric.)  LINPACK cannot assess parallel
computers because of its ethnocentric uniprocessor FORTRAN rules, and it
does not scale to the amount of computing power available.  Do people
buy computers to perform MFLOPS, or to solve problems?

Here is some data for the nCUBE 2 and Intel iPSC/860 hypercubes, and the
MasPar MP-1, with caveats:

                    MFLOPS  MFLOPS  MFLOPS  MFLOPS   Price   MFLOPS/Million$
System               Peak   "Max"    Lnpk   Stream   $10**6   "Max"  Stream
----------------------------------------------------------------------------
nCUBE 2,   1024 PE   2409     247    n.a.    2120    3.0        82     700
                           (1908)                             (640)
             64 PE    151      73    n.a.     133    0.31      235     430
                            (120)                             (387)
              1 PE      2.35    2.02  0.78      2.09 0.0031    647     670
                             (2.04)                           (652)
----------------------------------------------------------------------------
iPSC/860,    32 PE   1920     126    n.a.     213    0.85      148     250
              8 PE    480      63    n.a.      53    0.25      252     210
              1 PE     60      10    4.5        6.7  0.03      333     220
----------------------------------------------------------------------------
MasPar,    8192 PE    252    (220)   n.a.     252    0.32     (690)    790
----------------------------------------------------------------------------

All measurements are for 64-bit, IEEE floating-point arithmetic.  (You
need to state this in your table.  Some vendors, such as Convex, are
notorious for citing 32-bit MFLOPS and hoping you won't notice.)

The "MFLOPS Max" here is the 1000 by 1000 LINPACK measure, but that
hardly constitutes a maximum for these machines.  The nCUBE 2 with 1024
processors, for example, is designed to run problems about 400 times
bigger than that.  The numbers in parentheses give the "scaled LINPACK"
figures that Dongarra is moving toward: let the problem size grow to
whatever fits in main memory.  So the nCUBE gets 1908 MFLOPS solving a
problem of size 20000 by 20000.  (A 20000 by 20000 matrix of 64-bit
words is 3.2 gigabytes, roughly 3 megabytes per processor on the
1024-processor machine; a back-of-the-envelope sizing sketch appears in
a postscript below.)

The nCUBE and MasPar are "bandwidth-rich" architectures, easily able to
move several operands to and from memory for every floating-point
operation.  LOCAL memory, that is.  The Intel i860 can only fetch one
64-bit operand every other cycle, limiting it to about 6.7 MFLOPS for
things like a(i) = b(i) * c(i).

Another problem with MFLOPS and LINPACK is "benchmark rot."  The LINPACK
benchmark has become so important that compilers now have "LINPACK
recognizers" that drop in super-optimized code whenever they see a
structure that looks like the LINPACK kernel.  We have found that
LINPACK overpredicts actual application performance by an order of
magnitude for some computers...
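Since we keep leaning on the Level 1 / Level 3 distinction, here is the
promised sketch.  It is a toy of our own (made-up names and sizes), not
LINPACK and not anyone's library code:

    /* blas_levels.c -- a toy illustration, NOT any vendor's code.
     * Level 1 BLAS style: every flop drags three operands through
     * memory, so the loop runs at memory speed, not multiplier speed.
     * Level 3 BLAS style: each fetched operand is re-used N times, so
     * a fast multiplier behind a slow memory path still looks good.
     */
    #include <stdio.h>
    #define N 64

    double a[N], b[N], c[N];
    double A[N][N], B[N][N], C[N][N];

    /* Level 1: a(i) = b(i) * c(i) -- 2 loads + 1 store per multiply. */
    void stream_mult(void)
    {
        int i;
        for (i = 0; i < N; i++)
            a[i] = b[i] * c[i];
    }

    /* Level 3: C = C + A*B -- A[i][k] is fetched once, used N times, */
    /* so flops per memory reference grow with N.                     */
    void matmul(void)
    {
        int i, j, k;
        for (i = 0; i < N; i++)
            for (k = 0; k < N; k++) {
                double aik = A[i][k];
                for (j = 0; j < N; j++)
                    C[i][j] += aik * B[k][j];
            }
    }

    int main(void)
    {
        int i, j;
        for (i = 0; i < N; i++) {
            b[i] = c[i] = 1.0;
            A[i][i] = 1.0;             /* A = identity, B = ones */
            for (j = 0; j < N; j++)
                B[i][j] = 1.0;
        }
        stream_mult();
        matmul();
        printf("a[0] = %g, C[0][0] = %g\n", a[0], C[0][0]);
        return 0;
    }

The first loop moves three 64-bit words per multiply and will run at
memory speed on any machine; the second fetches each element once and
uses it N times, which is why bandwidth-starved processors can still
look heroic on Level 3 work.  Now, back to the rot.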
FPS, Stardent, Convex, Alliant, and, to some extent, CRAY are all
guilty.  So you might find that traditional vector computers aren't
holding up that well when asked to do something other than DAXPY with
unit stride!

Finally, we suggest you take a look at the SLALOM benchmark, described
in Supercomputing Review (November 1990 and March 1991).  It does an
entire application, it scales to the amount of computing power
available, and it removes the issue of MFLOPS completely.  SLALOM fixes
time rather than problem size, so the size of problem solved in one
minute becomes the figure of merit.  (A toy fixed-time driver is
sketched in the P.S. below.)  It works on all kinds of computers: MIMD,
SIMD, shared memory, distributed memory, vector, scalar... and we have
versions in C, Fortran, and Pascal for various vendors.  The last time
we checked, there were 70 computers in our database, which you can
peruse by doing an "anonymous ftp" to tantalus.al.iastate.edu (IP
address 129.186.200.15).

-Mike Carter
 Steve Elbert
 John Gustafson
 Diane Rover

 Ames Laboratory, U.S. DOE
 Ames, IA 50011
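P.S.  For those who want the flavor of fixed-time benchmarking without
ftp'ing the real thing, here is a minimal C driver of our own devising.
The O(n^3) "work" routine is a stand-in, not the SLALOM radiosity
application, and the doubling-plus-bisection search is just one way to
find the largest problem that finishes inside the minute:

    /* fixtime.c -- the fixed-time idea in miniature.  A sketch only; */
    /* the real SLALOM sources are on tantalus.al.iastate.edu.        */
    #include <stdio.h>
    #include <time.h>

    #define BUDGET 60.0          /* seconds: SLALOM's fixed time      */

    /* Stand-in kernel: O(n^3) work, like a dense solver.             */
    static double work(long n)
    {
        double s = 0.0;
        long i, j, k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                for (k = 0; k < n; k++)
                    s += (double)(i ^ j ^ k);
        return s;
    }

    static double seconds(long n)
    {
        clock_t t0 = clock();
        work(n);
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        long lo = 1, hi = 2;

        /* Grow until a run overshoots the budget...                  */
        while (seconds(hi) < BUDGET) {
            lo = hi;
            hi *= 2;
        }
        /* ...then bisect: largest n finishing in one minute wins.    */
        while (hi - lo > 1) {
            long mid = (lo + hi) / 2;
            if (seconds(mid) < BUDGET)
                lo = mid;
            else
                hi = mid;
        }
        printf("figure of merit: n = %ld in %.0f seconds\n", lo, BUDGET);
        return 0;
    }

The point is that every machine runs for the same wall-clock time; a
faster machine simply reports a bigger n.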
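P.P.S.  The back-of-the-envelope sizing mentioned above, as a few lines
of C.  The 3.2 GB and 1024-node figures are just the arithmetic behind
the 20000 by 20000 nCUBE run, not a quote from anyone's spec sheet;
substitute your own memory size:

    /* fit.c -- how big a 64-bit dense system fits in a given memory? */
    /* An n x n matrix of 8-byte words needs 8*n*n bytes, so          */
    /* n = sqrt(bytes/8).  Compile with -lm.                          */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double bytes = 3.2e9;               /* example memory size    */
        long n = (long) sqrt(bytes / 8.0);  /* largest n that fits    */
        printf("n = %ld  (%.1f MB per node on 1024 nodes)\n",
               n, bytes / 1024 / 1.0e6);
        return 0;
    }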