guru@ut-emx.uucp (chen liehgong) (11/18/90)
I have a few queries regarding benchmark performance ratios.

1. If the benchmark consists of a set of programs (e.g. the Livermore
   loops), is the overall performance ratio of the architecture under test
   (as compared to a standard one) calculated as the harmonic mean of the
   performance ratios (say, speed-ups) obtained for each program (or
   Livermore loop)?  If so, why is the harmonic mean used instead of the
   arithmetic or geometric mean?

2. If different kinds of benchmarks (e.g. integer performance, floating-
   point performance, or Livermore loops, Whetstones, and Dhrystones) are
   used, how is the overall performance ratio (speed-up) calculated?  That
   is, which mean (AM, GM, or HM) should be used?

3. If the performance ratio is changed (say, from speed-up to percentage
   decrease in execution time, in clock cycles), do the answers to 1 and 2
   above remain the same?

4. What are some of the performance ratios that are used in comparing two
   architectures?  I can think of only two, viz., speed-up and percentage
   decrease in execution time.

-guru@emx.utexas.edu
tim@proton.amd.com (Tim Olson) (11/20/90)
In article <39896@ut-emx.uucp> guru@ut-emx.uucp (chen liehgong) writes:

| I have a few queries regarding benchmark performance ratios.
|
| 1. If the benchmark consists of a set of programs (eg. the
| livermore loops) is the overall performance ratio of the architecture
| under test (as compared to a standard one) calculated as the
| harmonic mean of the performance ratios (say speed-ups) obtained
| for each program (or livermore loop)? If so, Why is the harmonic mean
| used instead of the arithmetic or geometric means?

Fleming and Wallace, in their paper entitled "How Not to Lie With
Statistics: The Correct Way to Summarize Benchmark Results" [CACM, March
1986, Volume 29, #3], say that the arithmetic mean should be used when the
individual benchmarks are reported in absolute time, while the geometric
mean should be used when the individual benchmarks are normalized to some
"known machine."

James Smith, in the paper entitled "Characterizing Computer Performance
With a Single Number" [CACM, October 1988, #10], argues that the harmonic
mean should be used, but only again with absolute quantities such as MFLOPS
(normalization should occur after the mean has been calculated).

The problem with mean calculations based upon absolute quantities (seconds,
MFLOPS, etc.) is that there is an implicit weighting of the benchmarks
based upon how long they run.  This is fine if the benchmarks are designed
such that the relative runtimes of the benchmarks correspond to the actual
runtime ratios expected in the real application(s).  However, this is
rarely the case -- a benchmark suite typically contains a large number of
varied programs that don't have an overall relationship.  Because of this,
I think that the best thing that can be done is to give each benchmark
equal weighting.  If this is done, then the geometric mean of the
normalized performances should be used (e.g. SPEC).

| 2. If different kinds of benchmarks (eg. integer performance, floating-
| point performance or livermore loops, whetstones and dhrystones) are
| used, how is the overall performance ratio (speed-up) calculated? i.e.,
| Which mean (AM, GM or HM) should be used?

The type of benchmark makes no difference, as long as it is measured
consistently among each of the machines to get a normalized performance.

| 3. If the performance ratio is changed (say from speed-up to percentage
| decrease in execution time - in clock cycles) do the answers to 1 and 2
| above, remain the same?

I don't believe you can average using %increase/decrease -- you must
convert this into normalized performance first, average using the geometric
mean, then re-convert into %increase/decrease.

--
	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
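A small numeric sketch (Python, with made-up times for two hypothetical
benchmarks) of the implicit-weighting point above: the arithmetic mean of
absolute times lets the long-running benchmark dominate the comparison,
while the geometric mean of normalized speed-ups gives each benchmark
equal weight.

    # Made-up times, in seconds, for two benchmarks on a hypothetical
    # reference machine and a machine under test.
    time_ref = [10.0, 100.0]   # reference machine
    time_new = [5.0, 100.0]    # machine under test

    n = len(time_ref)

    # Arithmetic mean of absolute times: the long-running benchmark
    # dominates, so the machine under test looks only ~5% faster overall.
    am_ref = sum(time_ref) / n          # 55.0
    am_new = sum(time_new) / n          # 52.5
    print("speedup from AM of times:", am_ref / am_new)    # ~1.05

    # Geometric mean of normalized speed-ups (SPEC-style): each benchmark
    # gets equal weight, regardless of how long it happens to run.
    speedups = [r / t for r, t in zip(time_ref, time_new)]  # [2.0, 1.0]
    gm = 1.0
    for s in speedups:
        gm *= s
    gm **= 1.0 / n
    print("GM of speed-ups:", gm)                           # ~1.41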
aglew@crhc.uiuc.edu (Andy Glew) (11/20/90)
>James Smith, in the paper entitled "Characterizing Computer
>Performance With a Single Number" [CACM, October 1988, #10] argues
>that the harmonic mean should be used, but only again with absolute
>quantities such as MFLOPS (normalization should occur after the mean
>has been calculated).

Urghhh.  That isn't quite what the paper said.

You use the harmonic mean to summarize *rates* like megaflops, because the
harmonic mean of rates corresponds to an arithmetic mean of run-times.  And
run-times, after all, are what we are all interested in in the long run
(well, some of us are interested in throughputs too, as distinct from
latencies, but...).

How you weight the components of the means is another matter.  Averaging
normalized values just means that you have chosen a particular weighting
function.  Hennessy suggests another weighting function.  The best
weighting function, of course, corresponds to the relative frequency of the
application types represented by the benchmarks in your workload - and that
statement, of course, is full of assumptions.

Another reason to use the harmonic mean is that it tends to be
conservative, so it reduces the chance that you will be accused of an
overestimate based on a few distorting values.  Handwaving there, too.

--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]
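A similar sketch (Python, made-up numbers, with the assumption that every
benchmark performs the same amount of work) of the point about rates: the
harmonic mean of MFLOPS figures is exactly the rate implied by the
arithmetic mean of the run-times, provided each benchmark does the same
number of operations.

    # Three benchmarks, each assumed to perform the same amount of work
    # (here, 100 million floating-point operations).
    work  = 100e6                        # FLOPs per benchmark
    times = [1.0, 2.0, 4.0]              # run-time of each benchmark, seconds
    rates = [work / t for t in times]    # FLOP/s achieved on each benchmark

    n = len(times)

    # Harmonic mean of the rates ...
    hm_rate = n / sum(1.0 / r for r in rates)

    # ... equals the rate implied by the arithmetic mean of the run-times.
    rate_from_mean_time = work / (sum(times) / n)

    print(hm_rate)               # ~4.29e7 FLOP/s
    print(rate_from_mean_time)   # same value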