[comp.benchmarks] One Number

mash@mips.COM (John Mashey) (12/27/90)

In article <15379@ogicse.ogi.edu> borasky@ogicse.ogi.edu (M. Edward Borasky) writes:
>Thank you for at least driving another stake in "bc benchmark"'s heart.
>However, as you and I know, there is a tremendous need out there for
>[sigh] [gasp] A SINGLE NUMBER to characterize JUST EXACTLY HOW FAST
>ANY GIVEN COMPUTER IS.  I have my own personal favorite which I will
>not belabor because everyone has his own personal favorite.  My question
>is this: just as you and I believe that vampires don't exist, do you
>believe that a single-number that measures a computer's speed doesn't
>exist?  I won't state MY belief to avoid bias in the discussion.  My
>use of the word "bias" in the preceding sentence is a HINT on my belief!

There is no One Number that predicts performance.

Let me restate the hypothesis more precisely:
	Let ON(a) be the One Number for machine a.
	We'd then expect that ON(a) / ON(b) would predict the
	relative performance of machines a and b on all benchmarks.
Well, that number doesn't exist, and that is easily disproved,
even by looking just at the published SPEC benchmarks.
	How about saying that ON(a) / ON(b) predicts the relative performance
	of any benchmark within 10%?
Well that doesn't exist either, from the same data.
	How about saying that ON(a) / ON(b) predicts the relative performance
	within a factor of 10, i.e., suppose ON(a) / ON(b) == 1.0,
	then it would be OK as long as "a" was no more than 10X faster
	than "b" on any benchmark, or vice-versa.
This might exist, and might even cover the SPEC data, although one may have
to go even higher, like allowing a factor of 20X off.
(I'm just unpacking, and don't have the numbers handy.)
For example, try comparing a CISC micro (like a 486), which has good
integer performance, but whose VAX-relative floating-point is pretty
low, with a vector machine (like the Stardent), or with the IBM RS/6000,
both of whose floating point performance tends to be much stronger
than their VAX-relative integer performance. 
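The CISC-versus-vector comparison above can be made concrete with a small
sketch. The per-benchmark scores below are invented for illustration (they are
not real SPEC results), and the geometric mean stands in for the One Number,
as SPEC itself uses for its composite figure:

```python
# Hypothetical VAX-relative scores for two machines: "a" is strong on
# integer work, "b" on scalar and vector floating point. Numbers invented.
scores_a = {"integer": 20.0, "fp_scalar": 6.0, "fp_vector": 4.0}
scores_b = {"integer": 8.0, "fp_scalar": 18.0, "fp_vector": 40.0}

def one_number(scores):
    """Geometric mean of the per-benchmark scores -- a typical One Number."""
    prod = 1.0
    for v in scores.values():
        prod *= v
    return prod ** (1.0 / len(scores))

# The One Number hypothesis: this single ratio predicts every benchmark.
ratio = one_number(scores_a) / one_number(scores_b)

# Worst-case factor by which the prediction misses an individual benchmark.
worst = max(
    max(r, 1.0 / r)
    for r in ((scores_a[k] / scores_b[k]) / ratio for k in scores_a)
)
print(f"predicted ratio {ratio:.2f}, worst per-benchmark error {worst:.1f}x")
```

Even for these three benchmarks the single ratio is off by more than 5x on one
of them, which is the point: the more the workloads differ, the worse any one
number does.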

Of course, something that predicts only within a factor of 10X to 20X is pretty
useless..... But even if it were within 20-40%, it's still pretty bad.
(Note, for example, that published Dhrystone results easily mis-predict
SPEC integer benchmarks pretty badly, i.e., it is quite easy for machine
"a" to be 25% faster on Dhrystone than "b", and end up 25% SLOWER on more
realistic integer benchmarks.)
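The 25%-faster / 25%-slower case compounds into a surprisingly large error; a
quick sketch of the arithmetic:

```python
# Mashey's scenario: machine "a" is 25% faster than "b" on Dhrystone,
# yet 25% slower on more realistic integer benchmarks.
dhrystone_ratio = 1.25   # a/b on Dhrystone
realistic_ratio = 0.75   # a/b on realistic integer work

# Total factor by which Dhrystone mis-predicts the realistic result:
misprediction = dhrystone_ratio / realistic_ratio
print(f"Dhrystone over-predicts by a factor of {misprediction:.2f}")  # ~1.67
```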
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

aburto@marlin.NOSC.MIL (Alfred A. Aburto) (01/03/91)

In article <44353@mips.mips.COM> mash@mips.COM (John Mashey) writes:

>(Note, for example, that published Dhrystone results easily mis-predict
>SPEC integer benchmarks pretty badly, i.e., it is quite easy for machine
>"a" to be 25% faster on Dhrystone than "b", and end up 25% SLOWER on more
>realistic integer benchmarks.)

This is an interesting observation (result).

Dhrystone was intended to be REPRESENTATIVE of TYPICAL integer
programs. That is, hundreds (I believe) of programs were 
analyzed to come up with the (ahem) 'typical' high level
language instructions and their frequency of usage. In view of this 
I would, at first sight, expect Dhrystone to be more accurate
than SPEC, since SPEC is based upon only a few integer programs.
What happened? Why does Dhrystone fail? 
Is it due to:

 (a) Instruction Mix is WRONG?

 (b) Optimization Problems? 
     This is not a problem in my view --- we just need people to
     report results using various compiler options then we gain 
     a more proper perspective of the variation in performance. 
     Of course, in general, people tend to publicly report the
     'Max' or 'Best' performance.  The 'Min' or 'Mean' results 
     are more difficult to find. I know Dhrystone (1.0, 1.1, 2.0, 
     2.1) can all be optimized a great deal (up to a factor of 2
     or so because I've done it) but this should not be a problem
     as long as we know what result corresponds to what compiler
     options --- this helps to define the RANGE of expected
     performance (Min, Max and/or Std. Dev.) with a certain compiler
     and system, and also the 'Mean' or 'Median' performance.

 (c) Program Size TOO small?
     I suppose that if it were not for caching (cache size) 
     effects then program size should not be a problem, but I'm
     no expert ...

 (d) Something else?
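Point (b) above can be sketched concretely. The figures and option names
below are invented, but they show how reporting the whole range, rather
than only the 'Max', characterizes a machine's expected performance:

```python
import statistics

# Hypothetical Dhrystones/sec for one machine, one result per set of
# compiler options -- the numbers and flags are invented for illustration.
results = {
    "-O0": 9200.0,
    "-O1": 14100.0,
    "-O2": 16800.0,
    "-O2 inlining": 18500.0,
}

values = list(results.values())
print(f"min  {min(values):8.0f}")
print(f"max  {max(values):8.0f}")
print(f"mean {statistics.mean(values):8.0f}")
print(f"sdev {statistics.stdev(values):8.0f}")
```

Here the 'Max' alone overstates the 'Min' by a factor of two, which is
exactly the spread described above for the various Dhrystone versions.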

Why should one expect the integer SPEC results to be more 'accurate'
than the Dhrystone?  I'm just wondering.  What is a 'typical' program
or 'typical' frequency of instruction usage?  Seems to me there is no 
one real 'typical' anything but a wide variety of 'typical' programs,
instruction mixes, and frequency of usages depending upon application.

Real programs also show a great variation in performance.  I noticed
this recently in a Scientific American article (Jan 1991) which
showed the comparison of 13 different real programs on a wide
variety of supercomputers.  The program 'megflop' variation in 
perfromance was truly tremendous especially for the fastest systems
(Cray and a NEC computer I think).

Al Aburto
aburto@marlin.nosc.mil