[net.arch] Incorrect Benchmark summary.

radford@calgary.UUCP (Radford Neal) (09/22/86)

In article <20954@rochester.ARPA>, crowl@rochester.ARPA (Lawrence Crowl) writes:
> The table below is a reorganization of the following table. [I omitted
> this table in this posting. It contained the times for the benchmarks in
> seconds. RN]
> 				Relative Performance
> 
> processor	80286	80386	68000	68020	68020	32032	32100	32100
> cache (MHz)	(10)	(16)	(8)	N (16)	C (16)	(10)	N (18)	C (18)
> [N = no cache, C = cache. RN]
> 
> string search	1.37	1.00	3.85	2.25	1.08	3.51	4.71	1.89
> bit manipulate	5.52	2.09	5.91	2.25	1.00	5.29	3.58	1.74
> linked list	3.08	1.70	4.11	1.79	1.00	2.90	2.36	1.28
> quicksort	4.07	2.49	4.39	2.05	1.00	3.12	3.12	1.32
> matrix trans	6.42	2.05	5.49	1.58	1.00	4.33	3.04	1.47
> 
> average	4.09	1.87	4.75	1.98	1.02	3.83	3.36	1.54

The averaging in this table is done incorrectly. As noted in a recent CACM
article, normalized benchmark results should be averaged with a geometric
mean, not an arithmetic mean. The geometric mean of N numbers is the Nth 
root of their product. This method gives the correct results:

  RIGHT average	3.60	1.79	4.68	1.97	1.02	3.74	3.28	1.52
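
For concreteness, here is a minimal C sketch of the computation (an
illustration only, not anyone's posted program); the data are the
normalized 80286 times from the table above:

  #include <stdio.h>
  #include <math.h>

  /* Geometric mean: the Nth root of the product of N numbers.
     Computed as exp of the mean of the logs, which avoids
     overflowing the running product. */
  double geometric_mean(double x[], int n)
  {
      double sum_log = 0.0;
      int i;
      for (i = 0; i < n; i++)
          sum_log += log(x[i]);
      return exp(sum_log / n);
  }

  int main()
  {
      /* Normalized 80286 times from the table above. */
      double t[] = { 1.37, 5.52, 3.08, 4.07, 6.42 };
      printf("geometric mean = %.2f\n", geometric_mean(t, 5));
      /* Prints 3.60, the corrected 80286 average. */
      return 0;
  }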

In this case, it doesn't seem to make all that much difference in the
conclusions. It can, though; consider the following example:

		Machine A	Machine B

  Benchmark 1:	10 seconds	5 seconds
  Benchmark 2:	10 seconds	20 seconds

Look at the results of normalizing these figures to Machine A and then
taking the arithmetic mean of the results:

  		Machine A	Machine B
  Benchmark 1:	  1.0		  0.5
  Benchmark 2:	  1.0		  2.0

  arith. mean:	  1.0		  1.25

Machine B is thus 25% slower than machine A, right? Wrong. Look at what
happens when you take the *same* benchmark results, normalize to machine
B, and take the arithmetic mean:

  		Machine A	Machine B
  Benchmark 1:	  2.0		  1.0
  Benchmark 2:	  0.5		  1.0

  arith. mean:	  1.25		  1.0

Now machine B comes out looking faster! 

If you take the geometric mean, however, machine A and machine B look
equally fast regardless of how you normalize the results. 
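
A quick check of this claim, assuming the geometric_mean() sketched
above: because the geometric mean of inverted ratios is the inverse of
the geometric mean, renormalizing can only invert the result, never
reorder the machines.

  /* Sketch: the same two benchmarks, normalized each way.
     Assumes the geometric_mean() function sketched above. */
  int main()
  {
      double norm_to_A[] = { 0.5, 2.0 };  /* B's times / A's times */
      double norm_to_B[] = { 2.0, 0.5 };  /* A's times / B's times */
      printf("normalized to A: %.2f\n", geometric_mean(norm_to_A, 2));
      printf("normalized to B: %.2f\n", geometric_mean(norm_to_B, 2));
      /* Both print 1.00 = sqrt(0.5 * 2.0); the arithmetic means of
         the same columns come out 1.25 both ways. */
      return 0;
  }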

    Radford Neal
    The University of Calgary

eugene@ames.UUCP (Eugene Miya) (09/25/86)

> 
> The averaging in this table is done incorrectly. As noted in a recent CACM
> article, normalized benchmark results should be averaged with a geometric
> mean, not an arithmetic mean.
>     Radford Neal
>     The University of Calgary

There is no clear-cut evidence that the geometric mean is any more correct
than any other [Re: the Fleming and Wallace paper].  Jack Worlton of
Los Alamos published a paper in 1984 (IEEE Compcon, with the title
"Bottleneckology: something" -- my proceedings are lent out at the moment).
This paper touted the HARMONIC mean as the "correct" mean.  Two papers,
two different "correct" means: that makes both claims suspect.  The
proof offered by W&F is not sufficiently rigorous, and I think a poor
proof is worse than no proof.

The science and art of measurement are independent of statistics (an
oversimplification, I realize).  We resort to statistics because our
measurement tools on computers are so poor.  What sense is there in
averaging numbers intended to show some sort of peak performance?
(Don't say "average peak performance.")

I suggest reading two texts:

%A Darrell Huff
%T How to Lie with Statistics
%I Norton
%C NY
%D 1954

%A Edward R. Tufte
%T The Visual Display of Quantitative Information
%I Graphics Press
%C Cheshire, CT
%D 1983

Separate note on performance measurement:
I have been talking to colleagues at Berkeley, LBL, LLNL,
and other locations.  We will be having an informal dinner meeting,
probably to be held at UC Berkeley (to the chagrin of the Stanford people)
sometime during the end of October.  I am trying to think of a name to
characterize this group: New Generation Performance Measurement Group
(naw, make it a Ring), Bay Area Performance Measurement Ring,
Those People Interested in Improving Performance Measurement People.
Anyway, if you are in the Bay Area and are seriously interested,
contact me.  We have several mail correspondents like Jack Dongarra
and Ken Dymond, but we want to have this meeting, too.

Added reference:

%A Philip J. Fleming
%A John J. Wallace
%T How Not to Lie with Statistics: The Correct Way to Summarize
Benchmark Results
%J CACM
%V 29
%N 3
%D March 1986
%P 218-221
%X Waste of a good title.

From the Rock of Ages Home for Retired Hackers:
--eugene miya
  NASA Ames Research Center
  com'on do you trust Reply commands with all these different mailers?
  {hplabs,ihnp4,dual,hao,decwrl,tektronix,allegra}!ames!aurora!eugene
  eugene@ames-aurora.ARPA

peters@cubsvax.UUCP (Peter S. Shenkin) (09/27/86)

In article <ames.1675> eugene@ames.UUCP (Eugene Miya) writes:
>> 
>> The averaging in this table is done incorrectly. As noted in a recent CACM
>> article, normalized benchmark results should be averaged with a geometric
>> mean, not an arithmetic mean.
>>     Radford Neal
>>     The University of Calgary
>
>There is no clear-cut evidence that the geometric mean is any more correct
>than any other [Re: the Fleming and Wallace paper].  Jack Worlton of
>Los Alamos published a paper in 1984...
>This paper touted the HARMONIC mean as the "correct" mean....

Note that if the values being averaged don't have too much spread, all these
means are about the same;  also, even if the distributions of values have 
large spread, I believe the distributions being compared have to be rather 
different in form for the different methods of calculating the mean
to give different rank-orders.  (But I seem to have missed the original
article to which this refers, so I'm not sure what's being compared!)
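
A quick numeric sketch of the first point (values chosen arbitrarily
for illustration):

  /* Arithmetic, geometric, and harmonic means of two values with
     small spread: the three nearly coincide. */
  #include <stdio.h>
  #include <math.h>

  int main()
  {
      double a = 1.0, b = 1.1;
      printf("arithmetic: %.4f\n", (a + b) / 2.0);       /* 1.0500 */
      printf("geometric:  %.4f\n", sqrt(a * b));         /* 1.0488 */
      printf("harmonic:   %.4f\n", 2.0 / (1/a + 1/b));   /* 1.0476 */
      return 0;
  }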

Peter S. Shenkin	 Columbia Univ. Biology Dept., NY, NY  10027
{philabs,rna}!cubsvax!peters		cubsvax!peters@columbia.ARPA

radford@calgary.UUCP (Radford Neal) (09/29/86)

In article <1675@ames.UUCP>, eugene@ames.UUCP (Eugene Miya) writes:

> > The averaging in this table is done incorrectly. As noted in a recent CACM
> > article, normalized benchmark results should be averaged with a geometric
> > mean, not an arithmetic mean.
> 
> There is no clear-cut evidence that the geometric mean is any more correct
> than any other [Re: the Fleming and Wallace paper].  Jack Worlton of
> Los Alamos published a paper in 1984 (IEEE Compcon) [touting] the
> HARMONIC mean as the "correct" mean...
> The proof offered by W&F is not a sufficiently
> rigorous proof.  And I think a poor proof is worse than no proof.

Could you elaborate on what's wrong with the proof offered by W&F? I 
re-read the article last night, and they seem to me to prove what they
set out to prove. The harmonic mean certainly doesn't work for normalized
numbers: in my earlier two-benchmark example, the harmonic mean of the
normalized times is 0.8 whichever machine you normalize to, so each
machine appears 25% faster than the other.

W&F do not claim that the geometric mean is the only good way to average
benchmarks, just the only way to average *normalized* benchmarks. If you
have a good idea of your job mix, a weighted arithmetic mean of the raw data
is the way to go. Then you can normalize this mean to one machine if you feel
like it.
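
A minimal sketch of that alternative (the weights here are invented for
illustration, not taken from any real job mix):

  #include <stdio.h>

  /* Weighted arithmetic mean of raw benchmark times; the weights
     reflect how often each job occurs in the workload. */
  double weighted_mean(double t[], double w[], int n)
  {
      double sum_wt = 0.0, sum_w = 0.0;
      int i;
      for (i = 0; i < n; i++) {
          sum_wt += w[i] * t[i];
          sum_w  += w[i];
      }
      return sum_wt / sum_w;
  }

  int main()
  {
      double timesA[]  = { 10.0, 10.0 };  /* raw seconds, jobs 1 and 2 */
      double timesB[]  = {  5.0, 20.0 };
      double weights[] = {  3.0,  1.0 };  /* job 1 runs 3x as often */
      printf("A: %.2f sec  B: %.2f sec\n",
             weighted_mean(timesA, weights, 2),
             weighted_mean(timesB, weights, 2));
      /* A: 10.00, B: 8.75 -- with this job mix B wins, and the
         ranking survives normalizing the two means to either machine. */
      return 0;
  }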

In any case, the argument *against* arithmetic means of normalized numbers
seems completely incontrovertible, regardless of what one considers the
"best" replacement. It's not acceptable for the results to depend on
which machine one arbitrarily decides to normalize to.

The paper discussed here is:

  Fleming, Philip J. and Wallace, John J. "How not to lie with
    statistics: The correct way to summarize benchmark results",
    Communications of the ACM, Vol. 29, No. 3 (March 1986), pp. 218-221.


Radford Neal
The University of Calgary

hansen@mips.UUCP (10/03/86)

The geometric and harmonic means are MEANINGLESS for benchmarking machines.
They are only useful for fudging the results to try to make an
unbalanced, ill-designed, machine look good.

Look at it this way: if you've got two jobs to get done and
two machines take the following times:

	Machine A	Machine B
Job 1	10 sec		20 sec
Job 2	10 sec		5 sec

Does anyone disagree that Machine A takes 20 seconds,
and Machine B takes 25 seconds? It should be obvious
that machine A is the faster machine FOR THE GIVEN WORKLOAD.

Even if the times were more extreme:

	Machine A	Machine B
Job 1	10 sec		20 sec
Job 2	10 sec		1 sec

Still, Machine B is slower. It doesn't matter a damn
that Job 2 executes ten times faster on Machine B,
because it was so slow on Job 1 that it already lost
the race before even starting on Job 2.
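
A minimal sketch of that criterion -- total time over the fixed
workload, which is just an equal-weight sum of the raw seconds:

  #include <stdio.h>

  int main()
  {
      double jobA[] = { 10.0, 10.0 };  /* Machine A, seconds per job */
      double jobB[] = { 20.0,  1.0 };  /* Machine B, the extreme case */
      printf("A: %.0f sec  B: %.0f sec\n",
             jobA[0] + jobA[1], jobB[0] + jobB[1]);
      /* A: 20 sec, B: 21 sec -- B's 10x win on Job 2 never recovers
         the 10 seconds lost on Job 1. */
      return 0;
  }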

Now, if you run Job 2 more often than Job 1, or Job 2
is more closely representative of the workload you
intend for the machine, then sure, go ahead and
adjust the weighting. However, the geometric
and harmonic means take neither of these factors into account,
and effectively use inconsistent weightings between machines.

Doesn't anyone remember the parable of the Tortoise and the Hare?
I suppose someone will now try to convince us that
the Tortoise should have been the winner via the geometric mean!

-- 

Craig Hansen			|	 "Evahthun' tastes
MIPS Computer Systems		|	 bettah when it
...decwrl!mips!hansen		|	 sits on a RISC"