[comp.archives] [benchmarks] Re: Price/Performance figures for Number-Crunching

carter@IASTATE.EDU (Carter Michael Brannon) (03/21/91)

Archive-name: benchmark/supercomputer/slalom/0--
Archive-directory: tantalus.al.iastate.edu:/pub/Slalom/ [129.186.200.15]
Original-posting-by: carter@IASTATE.EDU (Carter Michael Brannon)
Original-subject: Re: Price/Performance figures for Number-Crunching
Reposted-by: emv@msen.com (Edward Vielmetti, MSEN)

Dr. McCalpin,

We found your LINPACK/peak MFLOPS table quite interesting, and have some
data to add to it; we applaud the use of "stream MFLOPS," which we 
generally refer to as "Level 1 BLAS MFLOPS" around here.  Your "MFLOPS Max"
is what we would call "Level 3 BLAS MFLOPS", meaning that operands are
re-used and processors that are bandwidth-starved can still show high
numbers.
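
To make the distinction concrete, here is a rough sketch in C (ours, not
anyone's library code).  The Level 1 kernel streams two or three memory
operands for every multiply-add, while the Level 3 kernel re-uses each
matrix element roughly n times, so a machine short on memory bandwidth
can still run it near peak:

    /* Level 1 BLAS style ("stream"):  y = y + alpha*x
       3 memory operands per 2 flops -- bandwidth-bound.          */
    void level1(int n, double alpha, const double *x, double *y)
    {
        int i;
        for (i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }

    /* Level 3 BLAS style ("max"):  C = C + A*B, n by n matrices.
       Each element of a and b is read about n times -- re-use.   */
    void level3(int n, const double *a, const double *b, double *c)
    {
        int i, j, k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                for (k = 0; k < n; k++)
                    c[i*n + j] += a[i*n + k] * b[k*n + j];
    }

The triple loop is deliberately naive; the point is the ratio of
arithmetic to memory traffic, not the blocking a real Level 3 BLAS
would do.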

However, we find this kind of performance evaluation to be limited and
misleading.  MFLOPS measures correlate poorly with actual performance on
complete applications, and the LINPACK measure is particularly inaccurate
in that it ignores all but a particular, not very common, form of linear
algebra operation.  (Most matrices that arise in applications are sparse,
diagonally dominant, and symmetric; Dongarra's is dense, random, and
nonsymmetric).  LINPACK cannot assess parallel computers because of its
ethnocentric uniprocessor FORTRAN rules, and it does not scale to the
amount of computing power available.  Do people buy computers to perform
MFLOPS or to solve problems?

Here is some data for the nCUBE 2 and Intel iPSC/860 hypercubes, and the
MasPar MP-1, with caveats:

                  MFLOPS  MFLOPS  MFLOPS  MFLOPS  Price   MFLOPS/Million$
System            Peak    "Max"   Lnpk    Stream  $10**6  "Max"   Stream
-------------------------------------------------------------------------
nCUBE 2, 1024 PE  2409     247     n.a.    2120    3.0      82      700
                         (1908)                           (640)

           64 PE   151      73     n.a.     133    0.31    235      430
                          (120)                           (387)

            1 PE     2.35    2.02  0.78       2.09 0.0031  647      670
                            (2.04)                        (652)
-------------------------------------------------------------------------
iPSC/860,  32 PE  1920     126     n.a.     213    0.85    148      250

            8 PE   480      63     n.a.      53    0.25    252      210

            1 PE    60      10     4.5        6.7  0.03    333      220
-------------------------------------------------------------------------
MasPar,  8192 PE   252    (220)    n.a.     252    0.32   (690)     790
-------------------------------------------------------------------------

All measurements are for 64-bit, IEEE floating-point arithmetic.  (You
need to state this in your table.  Some vendors, such as Convex, are 
notorious for citing 32-bit MFLOPS and hoping you won't notice.)  The
"MFLOPS Max" here is the 1000 by 1000 LINPACK measure, but that hardly
constitutes a maximum for these machines.  The nCUBE 2 with 1024 processors,
for example, is designed to run problems about 400 times bigger than that.
The numbers in parentheses give the "scaled LINPACK" figures that Dongarra
is moving toward... let the problem size grow to whatever fits in main
memory.  So the nCUBE gets 1908 MFLOPS solving a problem of size 20000 by
20000.
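
The arithmetic behind that "400 times bigger" is just the square of the
ratio of matrix dimensions.  Spelled out in a few lines of C (our
back-of-the-envelope figures, assuming 8-byte IEEE words):

    #include <stdio.h>

    int main(void)
    {
        double small = 1000.0 * 1000.0 * 8.0;    /*  1000 x  1000 matrix */
        double large = 20000.0 * 20000.0 * 8.0;  /* 20000 x 20000 matrix */

        printf("1000 x 1000 storage:   %4.1f MB\n", small / 1.0e6);
        printf("20000 x 20000 storage: %4.1f GB\n", large / 1.0e9);
        printf("ratio in elements:     %4.0f x\n",  large / small);
        return 0;
    }

which prints 8.0 MB, 3.2 GB, and a ratio of 400.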

The nCUBE and MasPar are "bandwidth-rich" architectures, easily able to
move several operands to and from memory for every floating-point operation.
LOCAL memory, that is.  The Intel i860 can fetch only one 64-bit operand
every other cycle, which limits it to about 6.7 MFLOPS for things like
a(i) = b(i) * c(i).
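
That 6.7 MFLOPS falls straight out of the memory traffic.  Under our
assumptions (40 MHz clock, one 64-bit load or store every other cycle,
no re-use from cache), the kernel below is held to (40/2)/3 = 6.7
MFLOPS no matter how fast the multiplier runs:

    /* Sketch of the bandwidth ceiling, not a measurement. */
    void stream_mult(int n, double *a, const double *b, const double *c)
    {
        int i;
        for (i = 0; i < n; i++)
            a[i] = b[i] * c[i];    /* 1 multiply, 3 memory operands */
    }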

Another problem with MFLOPS and LINPACK is "benchmark rot".  The LINPACK
benchmark has become so important that compilers now have "LINPACK
recognizers" that drop in super-optimized code whenever they see a
structure that looks like the LINPACK kernel.  We have found that LINPACK
overpredicts actual application performance by an order of magnitude for
some computers... FPS, Stardent, Convex, Alliant, and to some extent,
CRAY, are all guilty.  So you might find that traditional vector computers
aren't holding up that well when asked to do something other than DAXPY
with unit stride!

Finally, we suggest you take a look at the SLALOM benchmark, described in
Supercomputing Review (November 1990, March 1991).  It does an entire 
application, it scales to the amount of computing power available, and
it removes the issue of MFLOPS completely.  SLALOM fixes time rather than
problem size, and so the size of the problem solved in one minute becomes
the figure of merit.  It works on all kinds of computers: MIMD, SIMD,
shared memory, distributed memory, vector, scalar... and we have versions
in C, Fortran, and Pascal for various vendors.  The last time we checked,
there were 70 computers in our database, which you can peruse by doing
an "anonymous ftp" to tantalus.al.iastate.edu (IP address 129.186.200.15).

-Mike Carter
 Steve Elbert
 John Gustafson
 Diane Rover
 Ames Laboratory, U.S. DOE
 Ames, IA 50011