peters@cubsvax.UUCP (Peter S. Shenkin) (09/30/86)
Eugene Miya kindly sent me a long epistle justifying the geometric mean.  The
point was that if one normalizes each benchmark separately, the choice of
machine influences the ultimate rank order when the arithmetic mean is taken,
but not when the geometric mean is taken.  However, I feel it is incorrect to
normalize each benchmark separately.  When all benchmarks are normalized to
the arithmetic mean for any given machine, the comparisons are identical,
regardless of what machine is used.  Thus I believe the issue is primarily
how to normalize.  Given this, the arithmetic mean has advantages that other
means do not.  Read on...

HOW TO NORMALIZE:

Suppose this is the raw data:
                  Machine A     Machine B
Benchmark 1          10.0           5.0
Benchmark 2          10.0          20.0
-----------------------------------------
arith mean           10.0          12.5

Now, it is DUMB to normalize each benchmark separately.  THAT, and not the
arithmetic mean, is what gives rise to artifacts.  Instead let's normalize
to the arithmetic mean, first of Machine A, then of B:

Normalized to A:
                  Machine A     Machine B
Benchmark 1           1.0            .5
Benchmark 2           1.0           2.0
-----------------------------------------
arith mean            1.0           1.25

Normalized to B:
                  Machine A     Machine B
Benchmark 1           0.8           0.4
Benchmark 2           0.8           1.6
-----------------------------------------
arith mean            0.8           1.0

The results are identical throughout, despite the use of the arithmetic mean
throughout.  Any other mean used throughout would, I believe, also give
identical results.

SO WHY SHOULD WE PREFER THE ARITHMETIC MEAN?:

However, the arithmetic mean is directly related to the application of
benchmark timings to real-world performance (nebulous though this entire
field may be).  These averages would reflect real-world performance --
modulo this cloudiness -- if the application set were equally well
represented by the two benchmarks; the approach also works for weighted
averages.  Other means, such as the geometric, do not have this property.

SUMMARY: the arithmetic mean wins, if normalization is properly performed.

Peter S. Shenkin   Columbia Univ. Biology Dept., NY, NY  10027
{philabs,rna}!cubsvax!peters           cubsvax!peters@columbia.ARPA

P.S.:  I like this writeup so much I'm posting it to the net!  Thanks for
the inspiration...
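The arithmetic above is easy to check mechanically; here is a small sketch
in C (the data are just the numbers from the tables, and the program is
purely illustrative):

    /* Illustrative check of the normalization above: two benchmarks on two
       machines, numbers taken from the tables.  Normalize every time by the
       arithmetic mean of either machine; the A-versus-B comparison comes out
       the same whichever machine is used as the norm. */
    #include <stdio.h>

    static double amean(const double *x, int n)
    {
        double s = 0.0;
        int i;
        for (i = 0; i < n; i++)
            s += x[i];
        return s / n;
    }

    int main(void)
    {
        double a[2] = { 10.0, 10.0 };   /* Machine A: benchmarks 1 and 2 */
        double b[2] = {  5.0, 20.0 };   /* Machine B: benchmarks 1 and 2 */
        double na[2], nb[2], norm;
        int ref, i;

        for (ref = 0; ref < 2; ref++) {
            norm = amean(ref == 0 ? a : b, 2);  /* arith mean of A, then of B */
            for (i = 0; i < 2; i++) {
                na[i] = a[i] / norm;
                nb[i] = b[i] / norm;
            }
            printf("normalized to %c:  A %.3f  B %.3f  B/A %.3f\n",
                   ref == 0 ? 'A' : 'B',
                   amean(na, 2), amean(nb, 2), amean(nb, 2) / amean(na, 2));
        }
        return 0;
    }

    /* Output:
       normalized to A:  A 1.000  B 1.250  B/A 1.250
       normalized to B:  A 0.800  B 1.000  B/A 1.250 */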
hammond@petrus.UUCP (Rich A. Hammond) (10/02/86)
Peter S. Shenkin writes:
> ...
> HOW TO NORMALIZE:
>
> Suppose this is the raw data:
>                   Machine A     Machine B
> Benchmark 1          10.0           5.0
> Benchmark 2          10.0          20.0
> -----------------------------------------
> arith mean           10.0          12.5
>
> Now, it is DUMB to normalize each benchmark separately.  THAT, and not
> the arithmetic mean, is what gives rise to artifacts. ...
	... (read the original or do the arithmetic yourself) ...
> The results are identical throughout, despite the use of the arithmetic
> mean throughout.  Any other mean used throughout would, I believe, also
> give identical results.
>
> SO WHY SHOULD WE PREFER THE ARITHMETIC MEAN?:
>
> However, the arithmetic mean is directly related to the application
> of benchmark timings to real-world performance (nebulous though this
> entire field may be).  These averages would reflect real-world
> performance -- modulo this cloudiness -- if the application set were
> equally well represented by the two benchmarks; the approach also works
> for weighted averages.  Other means, such as the geometric, do not have
> this property.
>
> SUMMARY: the arithmetic mean wins, if normalization is properly performed.

I respectfully disagree; if you work the arithmetic out, you don't need to
normalize at all using your method, just compare the sums of the benchmark
times.  This is because you assume that:
a) the benchmarks are a representative sample of the actual load, and
b) comparison of the component times is unimportant.

I claim that for the environment of most network news readers the first is
in fact false; most people have no idea what the load on their system is
composed of, or even where a given application program spends its time.
E.g., I asked programmers on our system for FORTRAN programs that would
"vectorize" well.  Of the 4 programs submitted, half were dominated by
subroutine calls and not vector/matrix arithmetic.

The second is often false, in that many benchmark suites are composed of
programs which stress one particular aspect of a system; e.g., Ackermann's
function gives a picture of subroutine call/return costs.  In this case one
would like to compare the individual programs' run times.  With
normalization to the arithmetic mean of a processor, one is still left with
taking the ratio of the times of interest, and the normalization to the
arithmetic mean can be factored out and not done.  Normalization to the
individual component time, on the other hand, gives cases where the ratio
is trivial to compute because you're dividing by 1.

In the context of the CACM article, both assumptions are false: the
benchmarks aren't representative of the load (no system calls), and the
comparison of interest was individual program times and not the sum.  What
the CACM article pointed out was that under those conditions, the geometric
mean was the only one to use to get ratios of machine performance that were
independent of the machine normalized to.  What the CACM article didn't say
(and should have) was that the performance ratio was pretty worthless
anyway, so that computing it "correctly" is a moot point.

Rich Hammond   Bell Communications Research   hammond@bellcore.com
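The point that the normalization factors out of the comparison is a
one-line identity; a small illustrative sketch in C, using the same toy
numbers as above:

    /* Illustrative check: dividing every entry by one common constant (here
       Machine A's arithmetic mean) cancels out of the A-versus-B comparison,
       so the ratio of the raw sums of times already decides it. */
    #include <stdio.h>

    int main(void)
    {
        double a[2] = { 10.0, 10.0 }, b[2] = { 5.0, 20.0 };
        double sum_a = a[0] + a[1], sum_b = b[0] + b[1];
        double norm = sum_a / 2.0;                        /* any common constant */
        double mean_a = (a[0] / norm + a[1] / norm) / 2.0;
        double mean_b = (b[0] / norm + b[1] / norm) / 2.0;

        printf("ratio of raw sums:          %.3f\n", sum_b / sum_a);    /* 1.250 */
        printf("ratio of normalized means:  %.3f\n", mean_b / mean_a);  /* 1.250 */
        return 0;
    }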
willc@tekchips.UUCP (10/03/86)
In article <549@cubsvax.UUCP> peters@cubsvax.UUCP (Peter S. Shenkin) writes:
>HOW TO NORMALIZE:
>
>Suppose this is the raw data:
>                  Machine A     Machine B
>Benchmark 1          10.0           5.0
>Benchmark 2          10.0          20.0
>-----------------------------------------
>arith mean           10.0          12.5
>
>Now, it is DUMB to normalize each benchmark separately.  THAT, and not the
>arithmetic mean, is what gives rise to artifacts.  Instead let's normalize
>to the arithmetic mean, first of Machine A, then of B:
>
>Normalized to A:
>                  Machine A     Machine B
>Benchmark 1           1.0            .5
>Benchmark 2           1.0           2.0
>-----------------------------------------
>arith mean            1.0           1.25
>
>Normalized to B:
>                  Machine A     Machine B
>Benchmark 1           0.8           0.4
>Benchmark 2           0.8           1.6
>-----------------------------------------
>arith mean            0.8           1.0

So the advertising manager for Machine B notices that Benchmark 1 consists
of 1 iteration, while Benchmark 2 consists of 1000 iterations.  That doesn't
seem quite fair, so he/she re-runs Benchmark 1 with 1000 iterations instead
of 1 to obtain the raw data:

                  Machine A     Machine B
Benchmark 1       10000.0        5000.0
Benchmark 2          10.0          20.0
-----------------------------------------
arith mean         5005.0        2510.0

Normalizing according to the procedure recommended in the text quoted above:

Normalized to A:
                  Machine A     Machine B
Benchmark 1         1.998          0.999
Benchmark 2         0.002          0.004
-----------------------------------------
arith mean          1.000          0.501

Normalized to B:
                  Machine A     Machine B
Benchmark 1         3.984          1.992
Benchmark 2         0.004          0.008
-----------------------------------------
arith mean          1.994          1.000

Advertisements for Machine B then proclaim that it is nearly twice as fast
as Machine A.  The advertising manager for Machine A cries foul.  Can you
help him/her adjust these same benchmarks to prove that Machine A is really
twice as fast as Machine B?

Artifacts are neither art nor facts.

By the way, I see no flaw in the proof that appears in Philip J Fleming and
John J Wallace, "How not to lie with statistics: the correct way to
summarize benchmark results", CACM Volume 29 Number 3 (March 1986), pages
218-221.  I'm not very happy with their presentation, primarily because
they never give a clear statement of their theorem, which I paraphrase: the
geometric mean is the only function of n positive real arguments that is
reflexive, symmetric, and multiplicative.  It's fair to take issue with
their proof, but if you're going to do so I'd like to know which step(s) of
their proof you find unconvincing, or which of the three properties you
feel is dispensable for an unweighted average of normalized benchmark
results.

William Clinger
Tektronix Computer Research Laboratory
willc%tekchips@tek
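The iteration-count trick, and the sense in which the geometric mean of
per-benchmark ratios is immune to it, can be checked with a small
illustrative C sketch (the x1000 rescaling is the hypothetical one from the
example above):

    /* Illustrative check of the iteration-count trick: scaling Benchmark 1
       by 1000 on both machines flips the arithmetic-mean comparison, but the
       geometric mean of the per-benchmark ratios B/A is unchanged, because
       the common scale factor cancels within each ratio. */
    #include <stdio.h>
    #include <math.h>

    static void compare(const char *label, const double *a, const double *b)
    {
        double amean_a = (a[0] + a[1]) / 2.0;
        double amean_b = (b[0] + b[1]) / 2.0;
        double gmean_ba = sqrt((b[0] / a[0]) * (b[1] / a[1]));
        printf("%s  arith means: %7.1f %7.1f   geom mean of B/A ratios: %.3f\n",
               label, amean_a, amean_b, gmean_ba);
    }

    int main(void)
    {
        double a1[2] = {    10.0, 10.0 }, b1[2] = {   5.0, 20.0 };
        double a2[2] = { 10000.0, 10.0 }, b2[2] = { 5000.0, 20.0 };

        compare("original:", a1, b1);   /* 10.0    12.5    1.000 */
        compare("re-run:  ", a2, b2);   /* 5005.0  2510.0  1.000 */
        return 0;
    }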
peters@cubsvax.UUCP (Peter S. Shenkin) (10/05/86)
In article <tekchips.698> willc@tekchips.UUCP (Will Clinger) writes:
>In article <549@cubsvax.UUCP> peters@cubsvax.UUCP (Peter S. Shenkin) writes:
>>HOW TO NORMALIZE:
>>
>>Suppose this is the raw data:
>>                  Machine A     Machine B
>>Benchmark 1          10.0           5.0
>>Benchmark 2          10.0          20.0
>>-----------------------------------------
>>arith mean           10.0          12.5
>>
[ I'm deleting the rest of my quoted original article; I showed that if one
normalizes all benchmarks to the arithmetic mean of EITHER machine, the
relative performance of the two machines is identical, no matter which of
the two machines is chosen as the norm. ]

>So the advertising manager for Machine B notices that Benchmark 1 consists
>of 1 iteration, while Benchmark 2 consists of 1000 iterations.  That
>doesn't seem quite fair, so he/she re-runs Benchmark 1 with 1000 iterations
>instead of 1 to obtain the raw data:
>
>                  Machine A     Machine B
>Benchmark 1       10000.0        5000.0
>Benchmark 2          10.0          20.0
>-----------------------------------------
>arith mean         5005.0        2510.0

[ William goes on to point out, correctly, that even though following my
recommended procedure continues to give the same relative performance of A
and B, it now appears that B is faster... ]

...and I reply, OF COURSE now B is faster... the benchmark has changed!  And
B would also be faster using the geometric mean, or any other mean, with
this altered data.  Therefore this is not an issue of which mean is better,
but one of which benchmarks are the fair or applicable ones to use.  And
advertising managers will always pick the ones that make their machines look
better.  If it's not the advertising manager picking the benchmarks but the
end-user, then if the benchmarks in my article represent the proposed
machine usage, A is faster; if William's benchmarks represent the proposed
usage, B is faster.  The arithmetic means support this conclusion.

>Artifacts are neither art nor facts.

I agree; and probably the difficulty of choosing good benchmarks and/or
predicting the use of the machine contributes more to artifacts than the
type of mean one uses -- except that if you use the arithmetic mean, you
MUST normalize the way I've shown, and if you don't, your results don't
mean anything.

>By the way, I see no flaw in the proof that appears in Philip J Fleming
>and John J Wallace, "How not to lie with statistics: the correct way to
>summarize benchmark results", CACM Volume 29 Number 3 (March 1986),
>pages 218-221.  I'm not very happy with their presentation, primarily
>because they never give a clear statement of their theorem, which I
>paraphrase: the geometric mean is the only function of n positive real
>arguments that is reflexive, symmetric, and multiplicative.  It's fair
>to take issue with their proof, but if you're going to do so I'd like to
>know which step(s) of their proof you find unconvincing, or which of the
>three properties you feel is dispensable for an unweighted average of
>normalized benchmark results.

Well, here I have to admit that I've been talking through my hat all along;
I've not read the article.  I suppose I will, now.  I probably object to
the relevance of the multiplicative property.  Since the actual time it
will take for a machine to perform a series of tasks is the SUM of the
times it takes for the tasks, one wants a mean which predicts this SUM.
The (weighted, if necessary) arithmetic mean of the types of tasks which
the machine will carry out is directly proportional to this SUM.
Geometric and other means may require less care in calculation, but give a
number, in the end, which bears no direct relation to the time it will take
a machine to perform the tasks for which it is intended.  And I believe this
time is the desired performance criterion.

Peter S. Shenkin   Columbia Univ. Biology Dept., NY, NY  10027
{philabs,rna}!cubsvax!peters           cubsvax!peters@columbia.ARPA
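The sum argument above can be checked with a small illustrative C sketch
(the task times are made up): n times the arithmetic mean reproduces the
total elapsed time exactly, while n times the geometric mean does not.

    /* Illustrative check of the sum argument: with made-up task times,
       n times the arithmetic mean reproduces the total elapsed time exactly;
       n times the geometric mean does not. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double t[3] = { 10.0, 20.0, 5.0 };          /* time per task */
        double total = t[0] + t[1] + t[2];
        double arith = total / 3.0;
        double geom  = pow(t[0] * t[1] * t[2], 1.0 / 3.0);

        printf("total %.1f   3 x arith %.1f   3 x geom %.1f\n",
               total, 3.0 * arith, 3.0 * geom);     /* 35.0  35.0  30.0 */
        return 0;
    }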
peters@cubsvax.UUCP (Peter S. Shenkin) (10/05/86)
In article <petrus.327> hammond@petrus.UUCP (Rich A. Hammond) writes:
>Peter S. Shenkin writes:
>> However, the arithmetic mean is directly related to the application
>> of benchmark timings to real-world performance (nebulous though this
>> entire field may be).  These averages would reflect real-world
>> performance -- modulo this cloudiness -- if the application set were
>> equally well represented by the two benchmarks; the approach also works
>> for weighted averages.  Other means, such as the geometric, do not have
>> this property.
>>
>> SUMMARY: the arithmetic mean wins, if normalization is properly performed.
>
>I respectfully disagree; if you work the arithmetic out, you don't need to
>normalize at all using your method, just compare the sums of the benchmark
>times.  This is because you assume that:
>a) the benchmarks are a representative sample of the actual load, and
>b) comparison of the component times is unimportant.
>
>I claim that for the environment of most network news readers the first
>is in fact false; most people have no idea what the load on their system
>is composed of...
>
>The second is often false, [since]
>...one would like to compare the individual programs' run times.
>...normalization to
>the arithmetic mean can be factored out and not done.  Normalization to
>the individual component time, on the other hand, gives cases where the
>ratio is trivial to compute because you're dividing by 1.
>
>In the context of the CACM article, both assumptions are false:
>the benchmarks aren't representative of the load (no system calls),
>and the comparison of interest was individual program times and not the
>sum.  What the CACM article pointed out was that under those conditions,
>the geometric mean was the only one to use to get ratios of machine
>performance that were independent of the machine normalized to.  WHAT
>THE CACM ARTICLE DIDN'T SAY (AND SHOULD HAVE) WAS THAT THE PERFORMANCE
>RATIO WAS PRETTY WORTHLESS ANYWAY, SO THAT COMPUTING IT "CORRECTLY" IS
>A MOOT POINT.
[ my emphasis -- peters ]

I would like to put myself on record as agreeing with every one of Rich's
remarks.  (I jumped into the discussion solely on the question of which of
the means more closely represents system performance subject to assumptions
(a) and (b).)  It seems to me that the bottom line is that there's no
one-dimensional measure of performance that's any good; the arithmetic mean
fails because (a) and (b) are too difficult to satisfy, and the geometric
mean fails 'cause it doesn't mean a damn thing, self-consistent though it
may be.

Peter S. Shenkin   Columbia Univ. Biology Dept., NY, NY  10027
{philabs,rna}!cubsvax!peters           cubsvax!peters@columbia.ARPA