[comp.arch] Benchmark performance ratios

guru@ut-emx.uucp (chen liehgong) (11/18/90)

I have a few queries regarding benchmark performance ratios.

1. If the benchmark consists of a set of programs (e.g. the
Livermore Loops), is the overall performance ratio of the architecture
under test (as compared to a standard one) calculated as the
harmonic mean of the performance ratios (say, speed-ups) obtained
for each program (or Livermore loop)? If so, why is the harmonic mean
used instead of the arithmetic or geometric means?

2. If different kinds of benchmarks (e.g. integer performance, floating-
point performance, or the Livermore Loops, Whetstones, and Dhrystones) are
used, how is the overall performance ratio (speed-up) calculated? I.e.,
which mean (AM, GM or HM) should be used?

3. If the performance ratio is changed (say, from speed-up to percentage
decrease in execution time, measured in clock cycles), do the answers to 1
and 2 above remain the same?

4. What are some of the performance ratios that are used in
comparing two architectures? I can think of only two, viz., speed-up
and percentage decrease in execution time.

-guru@emx.utexas.edu

tim@proton.amd.com (Tim Olson) (11/20/90)

In article <39896@ut-emx.uucp> guru@ut-emx.uucp (chen  liehgong) writes:
| I have a few queries regarding benchmark performance ratios.
| 
| 1. If the benchmark consists of a set of programs (e.g. the
| Livermore Loops), is the overall performance ratio of the architecture
| under test (as compared to a standard one) calculated as the
| harmonic mean of the performance ratios (say, speed-ups) obtained
| for each program (or Livermore loop)? If so, why is the harmonic mean
| used instead of the arithmetic or geometric means?

Fleming and Wallace, in their paper entitled "How Not to Lie With
Statistics: The Correct Way to Summarize Benchmark Results" [CACM
March 1986, Volume 29 #3] say that the arithmetic mean should be used
when the individual benchmarks are reported in absolute time, while
the geometric mean should be used when individual benchmarks are
normalized to some "known machine."  James Smith, in the paper
entitled "Characterizing Computer Performance With a Single Number"
[CACM, October 1988, Volume 31 #10] argues that the harmonic mean should be
used, but only again with absolute quantities such as MFLOPS
(normalization should occur after the mean has been calculated).

The problem with mean calculations based upon absolute quantities
(seconds, MFLOPS, etc.) is that there is an implicit weighting of the
benchmarks based upon how long they run.  This is fine if the
benchmarks are designed such that the relative runtimes of the
benchmarks correspond to the actual runtime ratios expected in the
real application(s).  However, this is rarely the case -- a benchmark
suite typically contains a large number of varied programs that don't
have an overall relationship.  Because of this, I think that the best
thing that can be done is to give each benchmark equal weighting.  If
this is done, then the geometric mean of the normalized performances
should be used (e.g. SPEC).
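
To make that concrete, here is a minimal sketch (in C, with hypothetical
run times, not measurements of any real machines) of why normalized ratios
should be combined with a geometric rather than an arithmetic mean: the
arithmetic mean of the ratios can declare each machine slower than the
other depending on which one is picked as the reference, while the
geometric mean gives the same verdict either way.

/*
 * Sketch with hypothetical run times: the arithmetic mean of
 * normalized run-time ratios can call each machine slower than the
 * other depending on the reference; the geometric mean is consistent.
 */
#include <stdio.h>
#include <math.h>

#define NBENCH 2

int main(void)
{
    double time_a[NBENCH] = {  1.0, 100.0 };  /* seconds on machine A */
    double time_b[NBENCH] = { 10.0,  20.0 };  /* seconds on machine B */
    double am_b_over_a = 0.0, gm_b_over_a = 1.0;
    double am_a_over_b = 0.0, gm_a_over_b = 1.0;
    int i;

    for (i = 0; i < NBENCH; i++) {
        am_b_over_a += time_b[i] / time_a[i];   /* normalized to A */
        gm_b_over_a *= time_b[i] / time_a[i];
        am_a_over_b += time_a[i] / time_b[i];   /* normalized to B */
        gm_a_over_b *= time_a[i] / time_b[i];
    }
    am_b_over_a /= NBENCH;
    am_a_over_b /= NBENCH;
    gm_b_over_a = pow(gm_b_over_a, 1.0 / NBENCH);
    gm_a_over_b = pow(gm_a_over_b, 1.0 / NBENCH);

    /* AM: 5.10 and 2.55 -- each machine looks slower than the other.
     * GM: 1.41 and 0.71 -- reciprocals, same verdict from either side. */
    printf("AM of B/A times %.2f, AM of A/B times %.2f\n",
           am_b_over_a, am_a_over_b);
    printf("GM of B/A times %.2f, GM of A/B times %.2f\n",
           gm_b_over_a, gm_a_over_b);
    return 0;
}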

| 2. If different kinds of benchmarks (e.g. integer performance, floating-
| point performance, or the Livermore Loops, Whetstones, and Dhrystones) are
| used, how is the overall performance ratio (speed-up) calculated? I.e.,
| which mean (AM, GM or HM) should be used?

The type of benchmark makes no difference, as long as it is measured
consistently on each of the machines to get a normalized performance.

| 3. If the performance ratio is changed (say, from speed-up to percentage
| decrease in execution time, measured in clock cycles), do the answers to 1
| and 2 above remain the same?

I don't believe you can average %increase/decrease figures directly -- you
must convert them into normalized performance first, average using the
geometric mean, then convert back into %increase/decrease.
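
For example (a sketch with made-up percentages, not taken from any
published benchmark), if a machine cuts run time by 20%, 50% and 75% on
three benchmarks relative to the reference, the conversion looks like
this; note that averaging the raw percentages would give about 48%
instead of the roughly 54% you get by converting properly:

/*
 * Sketch with made-up percentages: convert each % decrease in run
 * time to a speedup, take the geometric mean of the speedups, then
 * convert the result back to a % decrease at the very end.
 */
#include <stdio.h>
#include <math.h>

#define NBENCH 3

int main(void)
{
    double pct_dec[NBENCH] = { 20.0, 50.0, 75.0 }; /* % decrease in time */
    double gm = 1.0;
    int i;

    for (i = 0; i < NBENCH; i++)
        gm *= 1.0 / (1.0 - pct_dec[i] / 100.0);    /* speedup = t_ref/t_new */
    gm = pow(gm, 1.0 / NBENCH);

    printf("geometric mean speedup:     %.2f\n", gm);        /* ~2.15  */
    printf("overall %% decrease in time: %.1f\n",
           100.0 * (1.0 - 1.0 / gm));                         /* ~53.6% */
    return 0;
}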


--
	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

aglew@crhc.uiuc.edu (Andy Glew) (11/20/90)

>James Smith, in the paper entitled "Characterizing Computer
>Performance With a Single Number" [CACM, October 1988, #10] argues
>that the harmonic mean should be used, but only again with absolute
>quantities such as MFLOPS (normalization should occur after the mean
>has been calculated).

Urghhh.  That isn't quite what the paper said.

You use the harmonic mean to summarize *rates* like megaflops, because
the harmonic mean of rates corresponds to an arithmetic mean of
run-times.  And run-times, after all, are what we are all interested
in in the long run (well, some of us are interested in throughput
too, as distinct from latency, but...)
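
A small sketch of that correspondence (the MFLOPS figures are invented,
and each benchmark is assumed to do the same amount of work): the
harmonic mean of the rates is exactly the rate implied by the arithmetic
mean of the run times, whereas the arithmetic mean of the rates
overstates it.

/*
 * Sketch with invented MFLOPS figures, each benchmark assumed to do
 * the same 100 MFLOP of work: the harmonic mean of the rates equals
 * the work divided by the arithmetic mean of the run times, so it
 * ranks machines by run time; the arithmetic mean of rates does not.
 */
#include <stdio.h>

#define NBENCH 3
#define WORK   100.0                              /* MFLOP per benchmark */

int main(void)
{
    double rate[NBENCH] = { 10.0, 20.0, 40.0 };   /* MFLOPS per benchmark */
    double sum_inv = 0.0, sum_rate = 0.0, sum_time = 0.0;
    double hm, am, from_times;
    int i;

    for (i = 0; i < NBENCH; i++) {
        sum_inv  += 1.0 / rate[i];
        sum_rate += rate[i];
        sum_time += WORK / rate[i];               /* run time in seconds */
    }
    hm         = NBENCH / sum_inv;                /* harmonic mean of rates */
    am         = sum_rate / NBENCH;               /* arithmetic mean of rates */
    from_times = WORK / (sum_time / NBENCH);      /* work / mean run time */

    printf("harmonic mean of rates:   %.2f MFLOPS\n", hm);         /* 17.14 */
    printf("work / mean run time:     %.2f MFLOPS\n", from_times); /* 17.14 */
    printf("arithmetic mean of rates: %.2f MFLOPS\n", am);         /* 23.33 */
    return 0;
}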

How you weight the components of the means is another matter.
Averaging normalized values just means that you have chosen a
particular weighting function.  Hennessy suggests another weighting
function.  The best weighting function, of course, corresponds to the
relative frequency of the application types represented by the
benchmarks in your workload - and that statement, of course, is full
of assumptions.
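
If you do know (or are willing to assume) a workload mix, the weighting
can be folded into the mean directly.  A sketch, with an entirely
invented mix, of a weighted harmonic mean of rates where each weight is
the fraction of the workload's total work done by that application type:

/*
 * Sketch with an invented workload mix: a weighted harmonic mean of
 * the rates, where weight[i] is the fraction of the workload's total
 * work done by application type i, gives the rate the whole workload
 * would see.
 */
#include <stdio.h>

#define NBENCH 3

int main(void)
{
    double rate[NBENCH]   = { 10.0, 20.0, 40.0 }; /* MFLOPS per type */
    double weight[NBENCH] = {  0.6,  0.3,  0.1 }; /* fraction of total work */
    double time_per_unit  = 0.0;
    int i;

    for (i = 0; i < NBENCH; i++)
        time_per_unit += weight[i] / rate[i];     /* time per unit of work */

    printf("weighted harmonic mean: %.2f MFLOPS\n",
           1.0 / time_per_unit);                  /* 12.90 MFLOPS */
    return 0;
}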


Another reason to use the harmonic mean is that it tends to be
conservative, so it reduces the chance that you will be accused of an
overestimate based on a few distorting values.  Handwaving there, too.
--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]