[comp.benchmarks] Price/Performance figures for Number-Crunching

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (03/19/91)

Here is a table with some info I have derived from Jack Dongarra's
latest LINPACK report.  

The important columns are the last two, giving the performance/price
ratio in MFLOPS per Million dollars for two different MFLOPS
estimators: 

(1) The optimized LINPACK 1000x1000 case; and
(2) The memory-bandwidth-limited MFLOPS rate for long dyadic vector
    operations.  

The former number gives the best-case results for each machine, while
the latter number is (in my experience) a good estimator of the
performance of real, un-optimized (but vectorizable) codes.

I was surprised at how well the traditional supercomputers and minisupers
are holding up in price/performance..... 

Given that these numbers are only accurate to within +/- 25% (at best?),
the bottom line is that only the IBM 320 noticeably exceeds the Y/MP in 
"Streaming MFLOPS" per Million dollars.   But since the difference 
in performance is a factor of 200, it is not really appropriate to
make a one-for-one sort of comparison between these two machines.

For "cache-friendly" applications, on the other hand, several machines
are noticeably more cost-effective than the Crays, notably the IBM
and Stardent machines, with a pretty good performance turned in by
the SGI 4D/380.

By the way, the prices used reflect the best available University 
discounts and include the cost of 3rd-party memory and disk drives.

		-----------------------------------
		Performance Summary Table - LINPACK
		-----------------------------------

		MFLOPS	MFLOPS	MFLOPS	MFLOPS	Price	MFLOPS/Million$
System		Peak	Max	Lnpk	Stream	$10**6   Max	Stream
-----------------------------------------------------------------------
IBM 550		 82	 62	  27	 12	 0.13	 477	  92
MIPS RC6280	 24	 16	  10	  8	 0.20	  80	  40
IBM 320		 40	 29	   9	  6	<0.02	1450	 300
-----------------------------------------------------------------------
Convex C-210	 50	 44	  17	  9	~0.5	  88	  18
Convex C-240	200	166	  26	 36	~1.6	 104	  23
-----------------------------------------------------------------------
Cray Y/MP-1	333	324	  25	150	~3.0	 108	  50
Cray Y/MP-8    2664    2144	 275   1200    ~16.0	 134	  75
-----------------------------------------------------------------------
1xIBM 3090E VF	116	 71	  13	 11	~3.0	  24	   4
2xIBM 3090E VF	232	141	  26*	 22	~5.0	  28	   4
3xIBM 3090E VF	348	210	  39*	 33	~7.0	  30	   5
-----------------------------------------------------------------------
SGI 4D/310	 10	  8	   6	  3
SGI 4D/380	 80	 52	  48*	  3	~0.20	 260	  15
-----------------------------------------------------------------------
Stardent 3010	 32	 25	  10	  6
Stardent 3040	128	 77	  12	 11	~0.25	 308	  44
-----------------------------------------------------------------------
IBM 320		 40	 29	   9	  6	<0.02	1450	 300
8x  IBM 320	320	232*	  72*	 48	 0.11	1450*	 300
16x IBM 320	640	464*	 144*	 96	 0.21	1450*	 300
-----------------------------------------------------------------------
(*) indicates extrapolated figures.

Definitions:
------------
"MFLOPS Peak" is the hardware-limited peak performance.  This
	is the performance which the hardware is "guaranteed
	not to exceed".
"MFLOPS Max" is the observed performance for highly optimized
	code in the solution of a 1000x1000 dense system of
	equations.  All numbers are observed, unless marked
	by '*', in which case they are extrapolated.
"MFLOPS Lnpk" is the observed performance on the LINPACK 100x100
	system of equations using standard Fortran.
"MFLOPS Stream" Is the bandwidth-limited speed for 64-bit dyadic 
	vector operations, and is (usually) the best estimate 
	for the speed of *unoptimized* 64-bit floating-point 
	codes on each machine.
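
For concreteness, here is a rough C sketch of the kind of loop the
"MFLOPS Stream" figure refers to.  The vector length, repetition count,
and timer below are arbitrary illustrative choices (they are not part of
Dongarra's report or of the table above); the only requirement is that
the vectors be much larger than any cache.

	/* Minimal sketch of the "MFLOPS Stream" measurement: a long    */
	/* 64-bit dyad a(i) = b(i)*c(i).  Illustration only.            */
	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>

	#define N 1000000               /* 3 x 8 MB of data             */

	int main(void)
	{
	    double *a = malloc(N * sizeof(double));
	    double *b = malloc(N * sizeof(double));
	    double *c = malloc(N * sizeof(double));
	    int i, iter, niter = 10;
	    clock_t t0, t1;
	    double secs, mflops;

	    for (i = 0; i < N; i++) {
	        b[i] = 1.0;
	        c[i] = 2.0;
	    }

	    t0 = clock();
	    for (iter = 0; iter < niter; iter++)
	        for (i = 0; i < N; i++)
	            a[i] = b[i] * c[i];  /* 1 flop, three 8-byte operands */
	    t1 = clock();

	    secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
	    mflops = (double)niter * N / secs / 1.0e6;
	    printf("a[N-1] = %g, Stream MFLOPS = %.1f\n", a[N-1], mflops);
	    return 0;
	}
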
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (03/20/91)

> On 18 Mar 91 21:59:12 GMT, (mccalpin@perelandra.cms.udel.edu) I wrote:

Me> Here is a table with some info I have derived from Jack Dongarra's
Me> latest LINPACK report.  

Me> 		MFLOPS	MFLOPS	MFLOPS	MFLOPS	Price	MFLOPS/Million$
Me> System	Peak	Max	Lnpk	Stream	$10**6   Max	Stream
Me> -----------------------------------------------------------------------
Me> Cray Y/MP-1	 333	 324	  25	150	~3.0	 108	  50
                      		^^^^
				^^^^
Aaarggh!   That should be 90 MFLOPS, not 25 MFLOPS!

I would have caught this if I had used it for one of the
price/performance calculations....

By the way, the "MFLOPS Stream" is not derived from the LINPACK
report, but from lots of other sources.  It is close to the maximum
sustainable memory bandwidth in MBytes/sec divided by 24 MB/sec
(which is the bandwidth required to sustain 1 MFLOPS of long 64-bit
vector dyads).

On some machines I use my own observations of the "maximum sustainable
memory bandwidth" rather than the manufacturers specified "memory
bandwidth".  This results in some slightly lower, but more generally
accurate numbers.  For example on the Cray, the theoretical peak
streaming speed is 166 MFLOPS/cpu, but in real FORTRAN code, it is
difficult to exceed 150 MFLOPS/cpu for long vector dyads.
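
To make the conversion explicit: a long vector dyad such as
a(i) = b(i)*c(i) does one floating-point operation while moving three
64-bit operands (24 bytes) through memory, so

	Stream MFLOPS  ~=  sustainable memory bandwidth (MB/s) / 24

For the Cray figures above, 150 MFLOPS/cpu corresponds to a sustained
rate of about 150 * 24 = 3600 MB/s per cpu, against a theoretical
166 * 24 ~= 4000 MB/s.
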
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

carter@IASTATE.EDU (Carter Michael Brannon) (03/21/91)

Dr. McCalpin,

We found your LINPACK/peak MFLOPS table quite interesting, and have some
data to add to it; we applaud the use of "stream MFLOPS," which we 
generally refer to as "Level 1 BLAS MFLOPS" around here.  Your "MFLOPS Max"
is what we would call "Level 3 BLAS MFLOPS", meaning that operands are
re-used and processors that are bandwidth-starved can still show high
numbers.

However, we find this kind of performance evaluation to be limited and
misleading.  MFLOPS measures correlate poorly with actual performance on
complete applications, and the LINPACK measure is particularly inaccurate
in that it ignores all but a particular, not very common, form of linear
algebra operation.  (Most matrices that arise in applications are sparse,
diagonally dominant, and symmetric; Dongarra's is dense, random, and
nonsymmetric).  LINPACK cannot assess parallel computers because of its
ethnocentric uniprocessor FORTRAN rules, and it does not scale to the
amount of computing power available.  Do people buy computers to perform
MFLOPS or to solve problems?

Here is some data for the nCUBE 2 and Intel iPSC/860 hypercubes, and the
MasPar MP-1, with caveats:

                  MFLOPS  MFLOPS  MFLOPS  MFLOPS  Price   MFLOPS/Million$
System            Peak    "Max"   Lnpk    Stream  $10**6  "Max"   Stream
-------------------------------------------------------------------------
nCUBE 2, 1024 PE  2409     247     n.a.    2120    3.0      82      700
                         (1908)                           (640)

           64 PE   151      73     n.a.     133    0.31    235      430
                          (120)                           (387)

            1 PE     2.35    2.02  0.78       2.09 0.0031  647      670
                            (2.04)                        (652)
-------------------------------------------------------------------------
iPSC/860,  32 PE  1920     126     n.a.     213    0.85    148      250

            8 PE   480      63     n.a.      53    0.25    252      210

            1 PE    60      10     4.5        6.7  0.03    333      220
-------------------------------------------------------------------------
MasPar,  8192 PE   252    (220)    n.a.     252    0.32   (690)     790
-------------------------------------------------------------------------

All measurements are for 64-bit, IEEE floating-point arithmetic.  (You
need to state this in your table.  Some vendors, such as Convex, are 
notorious for citing 32-bit MFLOPS and hoping you won't notice.)  The
"MFLOPS Max" here is the 1000 by 1000 LINPACK measure, but that hardly
constitutes a maximum for these machines.  The nCUBE 2 with 1024 processors,
for example, is designed to run problems about 400 times bigger than that.
The numbers in parentheses give the "scaled LINPACK" figures that Dongarra
is moving toward... let the problem size grow to whatever fits in main
memory.  So the nCUBE gets 1908 MFLOPS solving a problem of size 20000 by
20000.

The nCUBE and MasPar are "bandwidth-rich" architectures, easily able to
move several operands to and from memory for every floating point operation.
LOCAL memory, that is. The Intel i860 can only get one operand every other
cycle (64-bit), limiting it to about 6.7 MFLOPS for things like
a(i) = b(i) * c(i).
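(To spell out the arithmetic, assuming the 40 MHz i860 used in the
iPSC/860: one 64-bit operand every other cycle is 20 million operands
per second, and a dyad moves three operands per flop, so roughly
20/3 ~= 6.7 MFLOPS.)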

Another problem with MFLOPS and LINPACK is "benchmark rot".  The LINPACK
benchmark has become so important that compilers now have "LINPACK
recognizers" that drop in super-optimized code whenever they see a
structure that looks like the LINPACK kernel.  We have found that LINPACK
overpredicts actual application performance by an order of magnitude for
some computers... FPS, Stardent, Convex, Alliant, and to some extent,
CRAY, are all guilty.  So you might find that traditional vector computers
aren't holding up that well when asked to do something other than DAXPY
with unit stride!

Finally, we suggest you take a look at the SLALOM benchmark, described in
Supercomputing Review (November 1990, March 1991).  It does an entire 
application, it scales to the amount of computing power available, and
removes the issue of MFLOPS completely.  SLALOM fixes time rather than
problem size, and so the size of the problem solved in one minute becomes
the figure of merit.  It works on all kinds of computers: MIMD, SIMD,
shared memory, distributed memory, vector, scalar... and we have versions
in C, Fortran, Pascal, for various vendors.  The last time we checked,
there were 70 computers on our database, which you can peruse by doing
an "anonymous ftp" to tantalus.al.iastate.edu (IP address 129.186.200.15).

-Mike Carter
 Steve Elbert
 John Gustafson
 Diane Rover
 Ames Laboratory, U.S. DOE
 Ames, IA 50011

dodson@convex.COM (Dave Dodson) (03/21/91)

In article <1991Mar20.104926@IASTATE.EDU> carter@IASTATE.EDU (Carter Michael Brannon) writes:
>All measurements are for 64-bit, IEEE floating-point arithmetic.  (You
>need to state this in your table.  Some vendors, such as Convex, are 
>notorious for citing 32-bit MFLOPS and hoping you won't notice.)

Sorry, but Convex does not cite 32-bit MFLOPS since 32-bit floating point
runs almost exactly the same speed as 64-bit floating point on C2 Series
machines.  For example, on the C240, the 1000x1000 LINPACK benchmark speed
is 166 MFLOPS in 64-bit precision and 176 MFLOPS in 32-bit precision.

----------------------------------------------------------------------

Dave Dodson		                             dodson@convex.COM
Convex Computer Corporation      Richardson, Texas      (214) 497-4234

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (03/21/91)

> On 20 Mar 91 16:49:26 GMT, carter@IASTATE.EDU (Carter Michael Brannon) said:

Carter> We found your LINPACK/peak MFLOPS table quite interesting, and
Carter> have some data to add to it; we applaud the use of "stream
Carter> MFLOPS," which we generally refer to as "Level 1 BLAS MFLOPS"
Carter> around here.  Your "MFLOPS Max" is what we would call "Level 3
Carter> BLAS MFLOPS", meaning that operands are re-used and processors
Carter> that are bandwidth-starved can still show high numbers.

I am glad you found it interesting....

Carter> However, we find this kind of performance evaluation to be
Carter> limited and misleading.  MFLOPS measures correlate poorly with
Carter> actual performance on complete applications, and the LINPACK
Carter> measure is particularly inaccurate in that it ignores all but
Carter> a particular, not very common, form of linear algebra
Carter> operation.  

What you mean is that you find the LINPACK MFLOPS correlates poorly
with *your* complete applications.  I find pretty good correlation
with *my* complete applications.  As for someone *else's*
applications, the appropriate response is, "Well, it depends...."

But in any event, please notice that I only used the LINPACK 1000x1000
hand-optimized number to calculate a price/performance ratio (Max
MFLOPS/million $).  The other price/performance number (Stream
MFLOPS/million $) comes from other sources.

The LINPACK 1000x1000 number is a particularly *good* estimate of the
maximum speed attainable by cache-friendly applications on vector and
mildly parallel machines (ncpus=2,4,8 but not ncpus=256).  It is not
intended to be a good estimate of the speed of anyone's application.

The "Stream MFLOPS" is a good estimator (in *my* experience) of the
performance of well-structured vectorizable codes with no specific
optimizations.


Carter> [....]  LINPACK cannot assess parallel computers because of
Carter> its ethnocentric uniprocessor FORTRAN rules, and it does not
Carter> scale to the amount of computing power available.  Do people
Carter> buy computers to perform MFLOPS or to solve problems?

You are getting carried away by your own propaganda here.  The LINPACK
100x100 test case has some specific rules.  Following those rules
allows one to make specific statements about the results that could
not be made in the absence of those "ethnocentric uniprocessor FORTRAN
rules".  

It is certainly true that these rules do not allow massively parallel
machines to show off their best potential. So what?  If you don't like
those rules, make up another benchmark.  I would suggest something
like solving Laplace's equation on a 512x512x512 finite-difference
grid.  (That's 128 MW of data, and is probably big enough for just
about any massively parallel computer).
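
A rough C sketch of what such a test might look like is below (plain
Jacobi sweeps are only one possible scheme, and the grid size is a
parameter; N=512 gives the case above):

	/* Rough sketch of a Laplace benchmark: Jacobi sweeps over an      */
	/* N x N x N finite-difference grid.  Illustration only.  With     */
	/* N = 512 each grid array is 2**27 = 128 M 64-bit words (1 GByte),*/
	/* so N is kept small here; a real benchmark would scale N up and  */
	/* iterate to convergence.                                         */
	#include <stdio.h>
	#include <stdlib.h>

	#define N 64                    /* use 512 for the full-size problem */
	#define IDX(i,j,k) (((i)*N + (j))*N + (k))

	int main(void)
	{
	    double *u    = calloc((size_t)N * N * N, sizeof(double));
	    double *unew = calloc((size_t)N * N * N, sizeof(double));
	    double *tmp;
	    int i, j, k, sweep;

	    /* boundary condition: u = 1 on the i = 0 face, 0 elsewhere */
	    for (j = 0; j < N; j++)
	        for (k = 0; k < N; k++)
	            u[IDX(0,j,k)] = unew[IDX(0,j,k)] = 1.0;

	    for (sweep = 0; sweep < 10; sweep++) {   /* fixed sweep count */
	        for (i = 1; i < N-1; i++)
	            for (j = 1; j < N-1; j++)
	                for (k = 1; k < N-1; k++)
	                    unew[IDX(i,j,k)] =
	                        (u[IDX(i-1,j,k)] + u[IDX(i+1,j,k)] +
	                         u[IDX(i,j-1,k)] + u[IDX(i,j+1,k)] +
	                         u[IDX(i,j,k-1)] + u[IDX(i,j,k+1)]) / 6.0;
	        tmp = u; u = unew; unew = tmp;        /* swap grids       */
	    }
	    printf("u(5,N/2,N/2) after 10 sweeps: %g\n", u[IDX(5, N/2, N/2)]);
	    return 0;
	}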

Carter> All measurements are for 64-bit, IEEE floating-point
Carter> arithmetic.  (You need to state this in your table.  Some
Carter> vendors, such as Convex, are notorious for citing 32-bit
Carter> MFLOPS and hoping you won't notice.) 

(1) I thought that my table was clear enough.  Dongarra's report
only contains 64-bit results and my "Stream MFLOPS" are clearly based
on 64-bit arithmetic.
(2) Are you sure you have your libel in order here?  The Convex machines
do not run noticeably faster in 32-bits than 64-bits, so why would
they bother?  CDC is the company that would have had something to
gain, but their results were generally clearly labelled as well....

Carter> Another problem with MFLOPS and LINPACK is "benchmark rot".
Carter> The LINPACK benchmark has become so important that compilers
Carter> now have "LINPACK recognizers" that drop in super-optimized
Carter> code whenever they see a structure that looks like the LINPACK
Carter> kernel.

Documentation?  Or is this another "urban myth"?

Carter>  We have found that LINPACK overpredicts actual
Carter> application performance by an order of magnitude for some
Carter> computers... FPS, Stardent, Convex, Alliant, and to some
Carter> extent, CRAY, are all guilty.  So you might find that
Carter> traditional vector computers aren't holding up that well when
Carter> asked to do something other than DAXPY with unit stride!

Which LINPACK number overpredicts performance by an order of magnitude?

The LINPACK 100x100 test case gives very reasonable numbers for
vectorizable applications.  If your applications are running at an
order of magnitude slower than the LINPACK 100x100 numbers, then you 
are doing something seriously wrong -- either in implementing your
code or in choosing what machine to run it on....

On the other hand, the LINPACK 1000x1000 numbers could be an order of
magnitude faster than your application, especially on parallel
machines.  That test case was never intended to give you an estimate
of your program performance, it was intended to verify that it is
possible to write an application that runs at a substantial fraction
of the machine's peak speed.  Thus machines like the Alliant FX/80
stuck out like a sore thumb, since the best performance was only 1/3
of the peak advertised performance (69 MFLOPS vs 188 MFLOPS).
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

patrick@convex.COM (Patrick F. McGehearty) (03/21/91)

In article <MCCALPIN.91Mar20163118@pereland.cms.udel.edu> mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
>> On 20 Mar 91 16:49:26 GMT, carter@IASTATE.EDU (Carter Michael Brannon) said:
>
lots of interesting discussion of LINPACK vs $$ vs other things.
...
...until we get to here...which leads to what I want to comment on
...
>
>Carter> Another problem with MFLOPS and LINPACK is "benchmark rot".
>Carter> The LINPACK benchmark has become so important that compilers
>Carter> now have "LINPACK recognizers" that drop in super-optimized
>Carter> code whenever they see a structure that looks like the LINPACK
>Carter> kernel.
>
Mccalpin>Documentation?  Or is this another "urban myth"?
>

I would recommend the phrase "benchmark smoothing" over "benchmark rot",
in the sense that the benchmark has been used like sandpaper to smooth
rough edges out of the compiler.  The smaller the benchmark, the coarser
and more specific the smoothing.  Thus, any benchmark which has been
widely used for a number of years will be much less likely to trip over
compiler weaknesses than newer benchmarks.  Many customers use known
benchmarks to select their first round of bidders, and then ask that
their own benchmarks be run for final selection purposes.  This procedure
explains why having "LINPACK only recognizers" is not a winning approach.
Yes, you get invited to bid, but no, you don't win the business.
I'm sure they exist, but are quickly found to be a waste of effort.

On the compiler development side (where I sit), LINPACK certainly has
been a "compiler smoother".  Several of our current optimizations were
driven by LINPACK.  They are not specific to LINPACK, since they will
work for most dense vector*vector or matrix*vector code, but they
certainly are important to getting the LINPACK results we get.  Is this
the sort of "LINPACK recognizer" that Carter was refering too?  If so,
I am concerned about the apparent distain for "pattern recognizers".
Pattern recognition is a critical tool in the optimizing compiler's tool bag.
There are large numbers of such "pattern recognizers" in any good optimizing
compiler, and a good compiler development group is continually trying
to identify new ones that have at least moderate degrees of utility.

So, I would argue that you should not disdain the old benchmarks for
"benchmark rot", but be wary of any single number to characterize an
architecture.  Look at several benchmark results (as many as you can find),
and cross check the different machines on the different tests.  When a large
variety of tests show machine A to be twice as fast as machine B, then you
can have more confidence of your result than when a single benchmark shows
such a result.  If the results vary, then understanding why will lead to
better understanding of how the machines would work in your environment.

I agree with McCalpin that the LINPACK 1000x1000 data provide a good
'truth in advertising' comparison with "PEAK/guaranteed not to exceed"
numbers quoted by marketing glossies.  Most applications will not approach
those rates without some serious optimization effort, but at least you know
what the potentials are.  Data for new systems may be subject to later
improvement due to better understanding of the system by the benchmarkers,
but otherwise, it is a useful guideline.

That problem size should be sufficient for systems of up to about 10 GFLOPS.
systems.  For larger and faster systems, I like the concept put forth by
Carter of "how large of a problem can be solved in fixed time?".  A
debatable point is "what is the ideal fixed time?"  Consider the following
two extremes expressed:

1) Anything less than 1/25 of a second is 'apparently instant' from the
   human perception point of view.
2) For anything longer than a week, you forget why you ran the program. :-)

I know there are exceptions on both ends, but these limits serve as
convenient bounds for the domain of discourse.  The "one minute" suggested
by Carter has several good points.  As a model of real work, it is small
enough so that a researcher will not switch to some other major activity
before getting results.  This allows the researcher to continue an
incremental train of thought.  It also is small enough for the benchmarker
to run repeatedly with different configurations or tuning options.  For
these reasons, it is a useful length of time.  However, it does not cover
the 'overnight' run category.  For this category of test, issues involving
very large data sets and management of same come into play.  It is an
important area of benchmarking that is generally neglected due to the high
cost of developing the benchmark and getting vendors to run and tune such a
benchmark.  Perhaps we (vendors and customers) could encourage the
development of such 'super benchmarks' by supporting two versions of the
same code, the one minute version and the one night version.  Then, most
tuning could be done with the one minute version, and the one night version
could be used to confirm the effectiveness of the system for truly large
problems.

I seem to have wandered over several related topics, so I will stop now
and wait for the net's comments.

metzger@convex.com (Robert Metzger) (03/22/91)

In article <1991Mar21.000302.10103@convex.com> patrick@convex.COM (Patrick F. McGehearty) writes:
>>
>>Carter> Another problem with MFLOPS and LINPACK is "benchmark rot".
>>Carter> The LINPACK benchmark has become so important that compilers
>>Carter> now have "LINPACK recognizers" that drop in super-optimized
>>Carter> code whenever they see a structure that looks like the LINPACK
>>Carter> kernel.
>>
>Mccalpin>Documentation?  Or is this another "urban myth"?
>>
>
>On the compiler development side (where I sit), LINPACK certainly has
>been a "compiler smoother".  Several of our current optimizations were
>driven by LINPACK.  They are not specific to LINPACK, since they will
>work for most dense vector*vector or matrix*vector code, but they
>certainly are important to getting the LINPACK results we get.  Is this
>the sort of "LINPACK recognizer" that Carter was refering too?  If so,
>I am concerned about the apparent distain for "pattern recognizers".
>Pattern recognition is a critical tool in the optimizing compiler's tool bag.
>There are large numbers of such "pattern recognizers" in any good optimizing
>compiler, and a good compiler development group is continually trying
>to identify new ones that have at least moderate degrees of utility.

Let's distinguish between "surface" and "semantic" pattern recognition.

Surface PR does things like:
Finds the lexemes 'REAL*4 FUNCTION SAXPY', junks the lexemes until it
sees 'END', and either directly inserts highly optimized assembly code,
or hacks the link line to include a special library of highly optimized code.
In other words, it doesn't do any compilation at all.  Anyone who cares
about accurate results from something like LINPACK should replace all
the procedure names with JOE, FRED, etc. before compiling.   

Semantic PR analyzes the control flow, data flow, and symbol definitions
of a procedure to determine its purpose, if the procedure was not amenable
to other optimization methods.  It should recognize a pattern regardless of:
1) the spelling of the lexemes
2) whether the loops were coded with DO, DO WHILE, or IF-GOTO
3) whether the results of individual arithmetic operators are assigned
   to temporary variables or used as a part of larger expression
etc., etc.
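
A contrived C illustration of the distinction (mine, not taken from any
particular compiler): a surface recognizer keyed to the routine name and
the canonical loop form would catch only the first version below, while a
semantic recognizer should treat both identically.

	/* Two spellings of the same operation: y = y + a*x.              */
	void daxpy(int n, double a, double *x, double *y)
	{
	    int i;
	    for (i = 0; i < n; i++)
	        y[i] = y[i] + a * x[i];
	}

	void fred(int n, double s, double *p, double *q)
	{
	    int i = 0;
	    while (i < n) {
	        double t = s * p[i];     /* temporary, different spelling */
	        q[i] += t;
	        i++;
	    }
	}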

CONVEX compilers do perform semantic pattern matching on a handful of general
patterns that do not lend themselves to standard vectorization techniques
and occur frequently in user applications. 
--
Robert Metzger		CONVEX Computer Corp.  		Richardson, Texas
Generic Disclaimer:	I write software, not CONVEX Corporate Policies.
"The only legitimate function of government is to protect its citizens from
harm or property loss by violence, stealth, or fraud.  All else is tyranny."

jt@aeras.uucp (J T McDuffie) (03/24/91)

In article <1991Mar21.000302.10103@convex.com> patrick@convex.COM (Patrick F. McGehearty) writes:
>
>So, I would argue that you should not disdain the old benchmarks for
>"benchmark rot", but be wary of any single number to characterize an
>architecture.  Look at several benchmark results (as many as you can find),
>and cross check the different machines on the different tests.  When a large
>variety of tests show machine A to be twice as fast as machine B, then you
>can have more confidence of your result than when a single benchmark shows
>such a result.  If the results vary, then understanding why will lead to
>better understanding of how the machines would work in your environment.
>
        If you receive the SPEC newsletter you'll find that they have
        started to utilize multiple figures of merit to describe any
        given machine, feeling that THERE IS NO SINGLE FIGURE OF MERIT.
        Thus SPEC has the SPECmark, SPECthruput, SPECint (SPECmark
        considering INTEGER only CPU benchmark components), and SPECfp
        (SPECmark considering FLOATING POINT only CPU benchmark
        components).

	So it would seem that at least SPEC would agree with McGehearty
	that you need to evaluate a NUMBER of benchmark results when
	attempting to characterize any given system/architecture.

--
=====================================DVC=====================================
My boss will disavow any statement I make      -      thinks I talk too much!

==>  ==> // James T. McDuffie, III             ==> ==> // aeras!jt@Sun.COM
        // Manager, Benchmark & Performance           // uunet!sun!aeras!jt
       // Arix Corporation                           // (408) 432-0263 (FAX)
      // 821 Fox Lane                               // (408) 922-1879 (Voice)
     // San Jose, CA  95131                        // My opinions are my own
   =========================                     ============================
-- 
--
<=======================================================================>
<  Disclamer: My opinions are my own.  No one else seems to want them!	>
<=======================================================================>

mash@mips.com (John Mashey) (03/30/91)

In article <1991Mar23.192405.7668@aeras.uucp> jt@aeras.UUCP (J T McDuffie) writes:
>>
>        If you receive the SPEC newsletter you'll find that they have
>        started to utilize multiple figures of merit to describe any
>        given machine, feeling that THERE IS NO SINGLE FIGURE OF MERIT.
>        Thus SPEC has the SPECmark, SPECthruput, SPECint (SPECmark
>        considering INTEGER only CPU benchmark components), and SPECfp
>        (SPECmark considering FLOATING POINT only CPU benchmark
>        components).
>
>	So it would seem that at least SPEC would agree with McGehearty
>	that you need to evaluate a NUMBER of benchmark results when
>	attempting to characterize any given system/architecture.
SPEC has ALWAYS believed that there is no one figure of merit.
That's why SPEC has always used a reporting form that included all 10 numbers,
and why we've always said you need to see all of the numbers to even
get a clue what the machine is like.  It turns out that the SPECint
metric (the 4 integer C benchmarks) is relatively stable, i.e.,
it has a fairly narrow confidence interval, small variance, etc,
on every machine ever measured.  SPECfp tends to vary a lot more,
i.e., it is always good to run your own benchmarks, but you'd be especially
tempted to do so when comparing machines on FP performance, because it
varies all over the map, especially on the more vectorizable ones.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086

maf@hpfcso.FC.HP.COM (Mark Forsyth) (04/01/91)

<From: mash@mips.com (John Mashey)
<
<It turns out that the SPECint
<metric (the 4 integer C benchmarks) is relatively stable, i.e.,
<it has a fairly narrow confidence interval, small variance, etc,
<on every machine ever measured.  SPECfp tends to vary a lot more,
<i.e., it is always good to run your own benchmarks, but you'd be especially
<tempted to do so when comparing machines on FP performance, because it
<varies all over the map, especially on the more vectorizable ones.

You seem to be confusing "performance relative to a VAX 11/780" with
"performance".  Since the units chosen for the SPEC numbers are relative
to a VAX 11/780, a flat curve on the individual components means only
that the system faithfully reproduces all of the relative strengths, and
weaknesses, of the VAX, NOT that it is necessarily a "balanced" design. 
A small variance is not necessarily a good goal to design for.  The 780 
was very good at integer intensive workloads, but, one would expect that 
modern RISCs with 64 bit floating point units and an additional 10 years 
of evolution in compiler technology to perform relatively better on FP 
intensive applications. The FP components of the SPEC suite also seem 
to do a better job at measuring memory hierarchy performance - we see the 
highest cache and TLB miss rates on these. There is a lot of variation from 
machine to machine and benchmark to benchmark on the floating point intensive 
components of the SPEC suite, but presumably all of these were considered 
to be representations of important classes of customer applications on 
workstations. I agree that the only good measure of a machine's performance 
is on the actual applications that each user will run, especially for 
users who care as much about graphics or I/O performance as they do about
integer and FP computation intensive applications. 


disclaimer: these are my opinions, not HP's 

Mark Forsyth
Hewlett Packard Company
Fort Collins, Colorado

mash@mips.com (John Mashey) (04/02/91)

In article <21720002@hpfcso.FC.HP.COM> maf@hpfcso.FC.HP.COM (Mark Forsyth) writes:
>
><From: mash@mips.com (John Mashey)
><It turns out that the SPECint
><metric (the 4 integer C benchmarks) is relatively stable, i.e.,
><it has a fairly narrow confidence interval, small variance, etc,
><on every machine ever measured.  SPECfp tends to vary a lot more,
><i.e., it is always good to run your own benchmarks, but you'd be especially
><tempted to do so when comparing machines on FP performance, because it
><varies all over the map, especially on the more vectorizable ones.
>
>You seem to be confusing "performance relative to a VAX 11/780" with
>"performance".  Since the units chosen for the SPEC numbers are relative
No, don't think so.
>to a VAX 11/780, a flat curve on the individual components means only
>that the system faithfully reproduces all of the relative strengths, and
>weaknesses, of the VAX, NOT that it is necessarily a "balanced" design. 
>A small variance is not necessarily a good goal to design for.  The 780 
Of course not.  I didn't say that the point of all this was to
build something that mimicked a VAX.  Let me try again:
	a) On every machine for which SPEC has published benchmarks
	(with the sole exception of early DN10000s, for which some
	compiler bug was fixed, removing the exception), the bottom-to-
	top ratio amongst the integer benchmarks is at worst 1.5X to 1,
	but more usually, 1.2-1.3X to 1.  Many companies have internal
	data that correlates well with the SPEC integer benchmarks,
	across machines (not just on VAXen, i.e., the ratios have a similar
	property regardless of which machine you pick.)
	Thus, every bit of data that I have says that if I knew I had
	a new integer benchmark to be run on two machines, I'd expect that
	the performance ratio would quite often be within +/-15% of
	the ratio of the SPECint metrics.  (Not always, but quite often.)
	b) The point is that the SPECfloat does NOT have this property,
	because the variance is much higher - I've seen at least as much
	as 8X from bottom to top, and as I recall, the new HP's have
	something like 5-6X from bottom to top.  All of this says:
	1) IF I had to predict the performance of a new FP benchmark
	on two machines, and I knew nothing but their SPECfloat numbers,
	and knew nothing about the benchmark, I'd guess the SPECfloat
	ratio, for lack of anything else.
	2) However, I'd expect a MUCH wider range of measured performance
	ratios, i.e., much more often, I'd expect to find inversions in
	the measured performance compared to predicted, or at least
	a wider range of ratios.
	3) All of this behavior is PERFECTLY in line with the observed behavior
	seen in the super-computer/minisupercomputer world forever.
	In particular, it continues to show that there are things to do
	to drastically accelerate vector-ish code, but that it is hard to do much
	about integer and scalar FP.
>was very good at integer intensive workloads, but one would expect that 
Actually, compared to many machines, the VAX was better on FP than integer...
Or, put another way, for many systems, SPECfp numbers are lower than SPECint.
>modern RISCs with 64 bit floating point units and an additional 10 years 
>of evolution in compiler technology would perform relatively better on FP 
>intensive applications.....
Yes; I think we all agree on that.  The issue is not the performance
of FP relative to integer, but of the variance. An amusing thing to do
is to plot the SPEC numbers for IBM RS6000/540, Stardent 3010,
and MIPS RC6280.  The SPECfp numbers are fairly similar, but the
relative ORDER changes from benchmark to benchmark.
Certainly, given the likely changes going on, we'll probably have to
change the SPEC scale from linear to log, to be able to show both
integer and FP numbers on same chart. :-)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086

jclark@sdcc6.ucsd.edu (John Clark) (04/03/91)

In article <1628@spim.mips.COM> mash@mips.com (John Mashey) writes:
+SPEC has ALWAYS believed that there is no one figure of merit.
+That's why SPEC has always used a reporting form that included all 10 numbers,
+and why we've always said you need to see all of the numbers to even
+get a clue what the machine is like.  It turns out that the SPECint

I just got the SPEC info from Franson, Hagerty, and Associates, Inc.
and it appears to me that one must have un*x running on one's CPU to
perform most if not all the 10 benchmarks. How have others done
benchmarks on machines which run say, vxWorks or other 'real time'
OS's?

(Note: yes we do have a version of BSD on our hardware but we
didn't port FORTRAN, so 6 of the 10 are unavailable to us.)

And yes we could get the 'sun3' FORTRAN and then run the result
under vxWorks until I/O or virtual memory requirements caused
problems.
-- 

John Clark
jclark@ucsd.edu

jpk@ingres.com (Jon Krueger) (04/03/91)

From article <21720002@hpfcso.FC.HP.COM>, by maf@hpfcso.FC.HP.COM (Mark Forsyth):
> You seem to be confusing "performance relative to a VAX 11/780" with
> "performance".

It is good to know that HP has developed absolute performance metrics.
Could you tell us more about them?

-- Jon
--

Jon Krueger, jpk@ingres.com 

mash@mips.com (John Mashey) (04/03/91)

In article <17933@sdcc6.ucsd.edu> jclark@sdcc6.ucsd.edu (John Clark) writes:
>In article <1628@spim.mips.COM> mash@mips.com (John Mashey) writes:
>+SPEC has ALWAYS believed that there is no one figure of merit.
>+That's why SPEC has always used a reporting form that included all 10 numbers,
>+and why we've always said you need to see all of the numbers to even
>+get a clue what the machine is like.  It turns out that the SPECint

>I just got the SPEC info from Franson, Hagerty, and Associates, Inc.
>and it appears to me that one must have un*x running on one's CPU to
>perform most if not all the 10 benchmarks. How have others done
>benchmarks on machines which run say, vxWorks or other 'real time'
>OS's?
Actually, the benchmarks have been run under VAX/VMS (mostly),
and on various i860 attached-processor boards (albeit attached to UNIX).
However, I don't think that helps you in any case.

>(Note: yes we do have a version of BSD on our hardware but we
>didn't port FORTRAN, so 6 of the 10 are unavailable to us.)
>
>And yes we could get the 'sun3' FORTRAN and then run the result
>under vxWorks until I/O or virtual memory requirements caused
>problems.

Let's take the 3 problems in order:
	a) Single-task integer compute speed.  For the case you describe,
	it is probably not too bad to convert 3 of the 4 C programs,
	and put that profile versus other things.  (I wouldn't attempt
	the 001.gcc benchmark).  Thus, you would at least get a couple
	data points that you could compare on, and most of these will
	run OK in an 8MB machine.
	b) FP: is FP important to you?
		No.  Done with this part.
		Yes. painful, with no FORTRAN compiler.
			also, we didn't manage to find a C floating-point
			program we liked the first round, so there's nothing
			there for you.
	c) Real-time response: people have been wishing for a while that
	there was a real-time and/or embedded-control benchmark suite that
	lots of people agreed on....  good luck.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086

dougd@uts.amdahl.com (Douglas DeMers) (04/04/91)

mash@mips.com (John Mashey) writes:
>Let's take the 3 problems in order:
>	a) Single-task integer compute speed.  For the case you describe,
>	it is probably not too bad to convert 3 of the 4 C programs,
>	and put that profile versus other things.  (I wouldn't attempt
>	the 001.gcc benchmark).  Thus, you would at least get a couple
>	data points that you could compare on, and most of these will
>	run OK in an 8MB machine.
>	b) FP: is FP important to you?
>		No.  Done with this part.
>		Yes. painful, with no FORTRAN compiler.
>			also, we didn't manage to find a C floating-point
>			program we liked the first round, so there's nothing
>			there for you.

Have you tried f2c, the fortran to c program translator?  I used it once
on linpack (another "classic" fortran benchmark) to get a c version.
Even with a fortran compiler available, it is sometimes entertaining to
compare a translated-to-c benchmark against the fortran compiler.
-- 
Douglas DeMers,      | (408-746-8546) | dougd@uts.amdahl.com
Amdahl Corporation   |                | {sun,uunet}!amdahl!dougd
  [The opinions expressed above are mine, solely, and do not    ]
  [necessarily reflect the opinions or policies of Amdahl Corp. ]

maf@hpfcso.FC.HP.COM (Mark Forsyth) (04/05/91)

>It is good to know that HP has developed absolute performance metrics.
>Could you tells us more about them?
>
>-- Jon

The point was not that there is a better number to compare, only that a low
variance on SPEC means "as balanced as a VAX was" and not "more balanced
performance".

- mark