[net.arch] The correct mean to use when comparing benchmark performance

peters@cubsvax.UUCP (Peter S. Shenkin) (09/30/86)

Eugene Miya kindly sent me a long epistle justifying the geometric mean.
The point was that if one normalizes each benchmark separately, the
choice of machine influences the ultimate rank-order when the arithmetic
mean, but not the geometric mean, is taken.  However, I feel it is
incorrect to normalize each benchmark separately.  When all benchmarks
are normalized to the arithmetic mean of any given machine, the
comparisons are identical, regardless of which machine is used.  Thus
I believe the issue is primarily how to normalize.  Given this, the
arithmetic mean has advantages that other means do not.  Read on...

HOW TO NORMALIZE:

Suppose this is the raw data:
		Machine A	Machine B
Benchmark 1	10.0		 5.0
Benchmark 2	10.0		20.0
-----------------------------------------
arith mean	10.0		12.5

Now, it is DUMB to normalize each benchmark separately.  THAT, and not
arithmetic mean, is what gives rise to artifacts.  Instead
let's normalize to the arithmetic mean, first of Machine A, then B:

Normalized to A:
		Machine A	Machine B
Benchmark 1	1.0		 .5
Benchmark 2	1.0		2.0
-----------------------------------------
arith mean	1.0		1.25


Normalized to B:
		Machine A	Machine B
Benchmark 1	0.8		0.4
Benchmark 2	0.8		1.6
-----------------------------------------
arith mean	0.8		1.0

The relative results are identical no matter which machine is used as the
norm, even though the arithmetic mean is used throughout.  Any other mean
used consistently would, I believe, also give identical results.
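
To see that the choice of normalizing machine can't matter, note that
normalizing to a machine's arithmetic mean just divides every time by one
constant, and that constant cancels when the two machines' means are
compared.  A small Python sketch of the same toy example (purely
illustrative):

    times_a = [10.0, 10.0]          # Machine A's benchmark times
    times_b = [5.0, 20.0]           # Machine B's benchmark times

    def amean(xs):
        return sum(xs) / len(xs)

    # Normalize everything to Machine A's mean, then to Machine B's mean.
    for norm in (amean(times_a), amean(times_b)):
        na = amean([t / norm for t in times_a])
        nb = amean([t / norm for t in times_b])
        print(na, nb, nb / na)      # nb/na comes out 1.25 both times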


SO WHY SHOULD WE PREFER ARITHMETIC MEAN?:

However, the arithmetic mean is directly related to the application
of benchmark timings to real-world use (nebulous though this entire field may
be).  These averages would reflect real-world performance -- modulo this
cloudiness -- if the application set were equally well represented by
the two benchmarks;  and if it is not, a weighted average works just as well.
Other means, such as the geometric, do not have this property.
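
For instance (with made-up weights, just to illustrate): if benchmark 1 is
taken to be 30% of the expected workload and benchmark 2 the other 70%,
the weighted arithmetic mean is directly proportional to the total time
each machine would spend on that mix, so ranking by it is the same as
ranking by predicted run time.  In Python:

    weights = [0.3, 0.7]            # assumed workload fractions (made up)
    times_a = [10.0, 10.0]          # Machine A's benchmark times
    times_b = [5.0, 20.0]           # Machine B's benchmark times

    def wmean(ts, ws):
        # weighted arithmetic mean = predicted time per unit of workload
        return sum(t * w for t, w in zip(ts, ws))

    print(wmean(times_a, weights))  # 10.0 -> A's predicted time for the mix
    print(wmean(times_b, weights))  # 15.5 -> B's predicted time for the mix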

SUMMARY:  arithmetic mean wins, if normalization is properly performed.

Peter S. Shenkin	 Columbia Univ. Biology Dept., NY, NY  10027
{philabs,rna}!cubsvax!peters		cubsvax!peters@columbia.ARPA

P.S.:  I like this writeup so much I'm posting it to the net!  Thanks for
the inspiration...

hammond@petrus.UUCP (Rich A. Hammond) (10/02/86)

Peter S. Shenkin writes:

> ...
> HOW TO NORMALIZE:
> 
> Suppose this is the raw data:
> 		Machine A	Machine B
> Benchmark 1	10.0		 5.0
> Benchmark 2	10.0		20.0
> -----------------------------------------
> arith mean	10.0		12.5
> 
> Now, it is DUMB to normalize each benchmark separately.  THAT, and not
> arithmetic mean, is what gives rise to artifacts.  ...
 ... (read original or do arithmetic yourself)...
> 
> The relative results are identical no matter which machine is used as the
> norm, even though the arithmetic mean is used throughout.  Any other mean
> used consistently would, I believe, also give identical results.
> 
> SO WHY SHOULD WE PREFER ARITHMETIC MEAN?:
> 
> However, the arithmetic mean is directly related to the application
> of benchmark timings to real-world use (nebulous though this entire field may
> be).  These averages would reflect real-world performance -- modulo this
> cloudiness -- if the application set were equally well represented by
> the two benchmarks;  and if it is not, a weighted average works just as well.
> Other means, such as the geometric, do not have this property.
> 
> SUMMARY:  arithmetic mean wins, if normalization is properly performed.

I respectfully disagree.  If you work the arithmetic out, you don't need to
normalize at all using your method; just compare the sums of the benchmark
times.  This is because you assume that:
a) The benchmarks are a representative sample of actual load,
and
b) that comparison of the component times is unimportant.

I claim that for the environment of most network news readers the first
is in fact false: most people have no idea what the load on their system
is composed of, or even where a given application program spends its time.
E.g., I asked programmers on our system for FORTRAN programs that would
"vectorize" well.  Of the 4 programs submitted, half were dominated by
subroutine calls rather than vector/matrix arithmetic.

The second is often false, in that many benchmark suites are composed of
programs which each stress one particular aspect of a system; e.g. Ackermann's
function gives a picture of subroutine call/return costs.  In this case
one would like to compare the individual programs' run times.  With
normalization to the arithmetic mean of a processor, one is still left
with taking the ratio of the times of interest, and the normalization to
the arithmetic mean can be factored out and not done at all.  Normalization to
the individual component time, on the other hand, gives cases where the
ratio is trivial to compute because you're dividing by 1.
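
A two-line check with the benchmark-1 times from the quoted example shows
why: the common normalizing constant cancels out of any cross-machine
ratio, so it might as well be skipped, while per-benchmark normalization
leaves a 1.0 in the denominator.  (Python, purely illustrative:)

    t_a1, t_b1 = 10.0, 5.0                    # benchmark 1 on machines A and B
    mean_a = 10.0                             # Machine A's arithmetic mean

    print((t_b1 / mean_a) / (t_a1 / mean_a))  # 0.5
    print(t_b1 / t_a1)                        # 0.5 -- same ratio, no normalization
    print((t_b1 / t_a1) / (t_a1 / t_a1))      # 0.5 -- per-benchmark norm: divide by 1.0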

In the context of the CACM article, both assumptions are false:
the benchmarks aren't representative of the load (no system calls),
and the comparison of interest was individual program times and not the
sum.  What the CACM article pointed out was that under those conditions,
the geometric mean was the only one to use to get ratios of machine
performance that were independent of the machine normalized to.  What
the CACM article didn't say (and should have) was that the performance
ratio was pretty worthless anyway, so that computing it "correctly" is
a moot point.

Rich Hammond	Bell Communications Research	hammond@bellcore.com

willc@tekchips.UUCP (10/03/86)

In article <549@cubsvax.UUCP> peters@cubsvax.UUCP (Peter S. Shenkin) writes:
>HOW TO NORMALIZE:
>
>Suppose this is the raw data:
>		Machine A	Machine B
>Benchmark 1	10.0		 5.0
>Benchmark 2	10.0		20.0
>-----------------------------------------
>arith mean	10.0		12.5
>
>Now, it is DUMB to normalize each benchmark separately.  THAT, and not
>arithmetic mean, is what gives rise to artifacts.  Instead
>let's normalize to the arithmetic mean, first of Machine A, then B:
>
>Normalized to A:
>		Machine A	Machine B
>Benchmark 1	1.0		 .5
>Benchmark 2	1.0		2.0
>-----------------------------------------
>arith mean	1.0		1.25
>
>
>Normalized to B:
>		Machine A	Machine B
>Benchmark 1	0.8		0.4
>Benchmark 2	0.8		1.6
>-----------------------------------------
>arith mean	0.8		1.0

So the advertising manager for Machine B notices that Benchmark 1 consists
of 1 iteration, while Benchmark 2 consists of 1000 iterations.  That
doesn't seem quite fair, so he/she re-runs benchmark 1 with 1000 iterations
instead of 1 to obtain the raw data:

		Machine A	Machine B
Benchmark 1	10000.0		 5000.0
Benchmark 2	   10.0		   20.0
-----------------------------------------
arith mean	 5005.0		 2510.0

Normalizing according to the procedure recommended in the text quoted
above:

Normalized to A:
		Machine A	Machine B
Benchmark 1	1.998		 .999
Benchmark 2	 .001998	 .003996
-----------------------------------------
arith mean	1.0000		 .5015


Normalized to B:
		Machine A	Machine B
Benchmark 1	3.984		1.992
Benchmark 2	 .003984	 .007968
-----------------------------------------
arith mean	1.9940		1.0000

Advertisements for Machine B then proclaim that it is nearly twice as
fast as Machine A.  The advertising manager for Machine A cries foul.
Can you help him/her adjust these same benchmarks to prove that Machine
A is really twice as fast as Machine B?
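
(Hint: the same trick works in reverse.  A small Python sketch of both
manipulations, purely illustrative; the normalizing constant divides out
of the ratio of means, so it is omitted:)

    def amean(xs):
        return sum(xs) / len(xs)

    def ratio_b_to_a(times_a, times_b):
        return amean(times_b) / amean(times_a)

    # Benchmark 1 scaled up 1000x (Machine B's manager's version):
    print(ratio_b_to_a([10000.0, 10.0], [5000.0, 20.0]))  # ~0.50: "B is twice as fast"
    # Benchmark 2 scaled up 1000x instead (Machine A's manager's version):
    print(ratio_b_to_a([10.0, 10000.0], [5.0, 20000.0]))  # ~2.00: "A is twice as fast"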

Artifacts are neither art nor facts.

By the way, I see no flaw in the proof that appears in Philip J Fleming
and John J Wallace, "How not to lie with statistics: the correct way to
summarize benchmark results", CACM Volume 29 Number 3 (March 1986),
pages 218-221.  I'm not very happy with their presentation, primarily
because they never give a clear statement of their theorem, which I
paraphrase:  The geometric mean is the only function of n positive real
arguments that is reflexive, symmetric, and multiplicative.  It's fair
to take issue with their proof, but if you're going to do so I'd like to
know which step(s) of their proof you find unconvincing, or which of the
three properties you feel is dispensable for an unweighted average of
normalized benchmark results.
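
As a numerical check of the property those conditions buy (this sketch is
mine, not from the paper): the geometric mean of per-benchmark ratios
gives the same A-versus-B comparison no matter which machine's times are
used as the reference.

    from math import prod

    def gmean(xs):
        return prod(xs) ** (1.0 / len(xs))

    times_a = [10000.0, 10.0]
    times_b = [5000.0, 20.0]

    # Normalize each benchmark to Machine A's time, then to Machine B's.
    for ref in (times_a, times_b):
        ga = gmean([a / r for a, r in zip(times_a, ref)])
        gb = gmean([b / r for b, r in zip(times_b, ref)])
        print(gb / ga)              # 1.0 both times: the comparison is unchanged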

William Clinger
Tektronix Computer Research Laboratory
willc%tekchips@tek

peters@cubsvax.UUCP (Peter S. Shenkin) (10/05/86)

In article <tekchips.698> willc@tekchips.UUCP (Will Clinger) writes:
>In article <549@cubsvax.UUCP> peters@cubsvax.UUCP (Peter S. Shenkin) writes:
>>HOW TO NORMALIZE:
>>
>>Suppose this is the raw data:
>>		Machine A	Machine B
>>Benchmark 1	10.0		 5.0
>>Benchmark 2	10.0		20.0
>>-----------------------------------------
>>arith mean	10.0		12.5
>>

[ I'm deleting the rest of my quoted original article;  I showed that if one
normalizes all benchmarks to the arithmetic mean of EITHER machine, the
relative performance of the two machines is identical, no matter which
of the two machines is chosen as the norm. ]

>So the advertising manager for Machine B notices that Benchmark 1 consists
>of 1 iteration, while Benchmark 2 consists of 1000 iterations.  That
>doesn't seem quite fair, so he/she re-runs benchmark 1 with 1000 iterations
>instead of 1 to obtain the raw data:
>
>		Machine A	Machine B
>Benchmark 1	10000.0		 5000.0
>Benchmark 2	   10.0		   20.0
>-----------------------------------------
>arith mean	 5005.0		 2510.0

[ William goes on to point out, correctly, that even though following my 
recommended procedure continues to give the same relative performance of A and 
B, it now appears that B is faster...]

...and I reply, OF COURSE now B is faster... the benchmark has changed!  And
B would also be faster using the geometric mean, or any other mean, with this
altered data.  Therefore this is not an issue of which mean is better, but
one of which benchmarks are the fair or applicable ones to use.  And
advertising managers will always pick the ones to make their machines look better.
If it's not the advertising manager picking the benchmarks, however, but the
end-user, then if the benchmarks in my article represent the proposed machine
usage, then A is faster;  if William's benchmarks represent the proposed usage,
then B is faster.  The arithmetic means support this conclusion.

>Artifacts are neither art nor facts.

I agree; and probably the difficulty of choosing good benchmarks and/or
predicting the use of the machine contributes more to artifacts than the
type of mean one uses;  except that if you use arithmetic mean, you MUST
normalize the way I've shown, and if you don't your results don't mean
anything.

>By the way, I see no flaw in the proof that appears in Philip J Fleming
>and John J Wallace, "How not to lie with statistics: the correct way to
>summarize benchmark results", CACM Volume 29 Number 3 (March 1986),
>pages 218-221.  I'm not very happy with their presentation, primarily
>because they never give a clear statement of their theorem, which I
>paraphrase:  The geometric mean is the only function of n positive real
>arguments that is reflexive, symmetric, and multiplicative.  It's fair
>to take issue with their proof, but if you're going to do so I'd like to
>know which step(s) of their proof you find unconvincing, or which of the
>three properties you feel is dispensable for an unweighted average of
>normalized benchmark results.

Well, here I have to admit that I've been talking through my hat all along;
I've not read the article.  I suppose I will, now.  I probably object to
the relevance of the multiplicative property.  Since the actual time it
will take for a machine to perform a series of tasks is the SUM of the
times it takes for the tasks, one wants a mean which predicts this SUM.
The (weighted, if necessary) arithmetic mean of the types of tasks which
the machine will carry out is directly proportional to this SUM.  Geometric
and other means may require less care in calculation, but give a number, in
the end, which bears no direct relation to the time it will take a machine
to perform the tasks for which it is intended.  And I believe this time is
the desired performance criterion.
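
To make that concrete (using Machine B's times from my original example):
the total time is the number of tasks times the arithmetic mean, while the
geometric mean bears no fixed relation to it.  A Python sketch:

    from math import prod

    times = [5.0, 20.0]               # Machine B's times from the earlier example
    n = len(times)

    amean = sum(times) / n            # 12.5
    gmean = prod(times) ** (1.0 / n)  # 10.0

    print(sum(times))                 # 25.0 -- the time the user actually waits
    print(n * amean)                  # 25.0 -- arithmetic mean recovers the total
    print(n * gmean)                  # 20.0 -- geometric mean does not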

Peter S. Shenkin	 Columbia Univ. Biology Dept., NY, NY  10027
{philabs,rna}!cubsvax!peters		cubsvax!peters@columbia.ARPA

peters@cubsvax.UUCP (Peter S. Shenkin) (10/05/86)

In article <petrus.327> hammond@petrus.UUCP (Rich A. Hammond) writes:
>Peter S. Shenkin writes:

>> However, the arithmetic mean is directly related to the application
>> of benchmark timings to real-world use (nebulous though this entire field may
>> be).  These averages would reflect real-world performance -- modulo this
>> cloudiness -- if the application set were equally well represented by
>> the two benchmarks;  and if it is not, a weighted average works just as well.
>> Other means, such as the geometric, do not have this property.
>> 
>> SUMMARY:  arithmetic mean wins, if normalization is properly performed.

>I respectfully disagree, if you work the arithmetic out, you don't need to
>normalize at all using your method, just compare the sums of the benchmark
>times.  This is because you assume that:
>a) The benchmarks are a representative sample of actual load,
>and
>b) that comparison of the component times is unimportant.
>
>I claim that for the environment of most network news readers the first
>is in fact false, most people have no idea what the load on their system
>is composed of...
>
>The second is often false, [since]
>...one would like to compare the individual programs run times.
>							...normalization to
>the arithmetic mean can be factored out and not done.  Normalization to
>the individual component time, on the other hand, gives cases where the
>ratio is trivial to compute because you're dividing by 1.
>
>In the context of the CACM article, both assumptions are false:
>the benchmarks aren't representative of the load (no system calls),
>and the comparison of interest was individual program times and not the
>sum.  What the CACM article pointed out was that under those conditions,
>the geometric mean was the only one to use to get ratios of machine
>performance that were independent of the machine normalized to.  WHAT
>THE CACM ARTICLE DIDN'T SAY (AND SHOULD HAVE) WAS THAT THE PERFORMANCE
>RATIO WAS PRETTY WORTHLESS ANYWAY, SO THAT COMPUTING IT "CORRECTLY" IS
>A MOOT POINT.   [ my emphasis -- peters ]

I would like to put myself on record as agreeing with every one of Rich's
remarks.  (I jumped into the discussion solely on the question of which of
the means more closely represents system performance subject to assumptions
(a) and (b) ).  It seems to me that the bottom line is that there's no
one-dimensional measure of performance that's any good;  arithmetic mean fails
because (a) and (b) are too difficult to satisfy, and geometric mean fails
'cause it doesn't mean a damn thing, self-consistent though it may be.

Peter S. Shenkin	 Columbia Univ. Biology Dept., NY, NY  10027
{philabs,rna}!cubsvax!peters		cubsvax!peters@columbia.ARPA