[comp.lang.c] Suggestions for SPEC 3.0 CPU Performance Evaluation Suite

dgh@validgh.com (David G. Hough on validgh) (06/21/91)

     When the idea that became SPEC first started circulating I was among the
many that agreed that it would be good if somebody, somewhere did all the work
necessary to establish an industry standard performance test suite to super-
sede *h*stone and the many linpacks, which had outlived their usefulness in an
era of rapid technological change.

     Fortunately a few somebodies somewhere did get together and do the work,
and the SPEC 1.0 benchmark suite has been a tremendous success in re-orienting
end users toward realistic expectations about computer system performance on
realistic applications, and in re-orienting hardware and software designers
toward optimizing performance for realistic applications.

     In that spirit I'd like to suggest some changes for consideration in SPEC
3.0, the second generation compute-intensive benchmark suite.  Many of the
suggestions come from study of the Perfect Club benchmarks and procedures,
which are more narrowly focused than SPEC, primarily on scientific Fortran
programs.

     Why does SPEC need to publish a new 3.0 suite just as 1.0 is getting well
established?  Because the computer business is an extremely dynamic one, and
performance measurement techniques have lifetimes little better than the pro-
ducts they measure - a year or two!

Reporting Results

     In addition to the mandatory standard SPEC results, in which source code
may be changed solely for portability, SPEC should also allow optional
publication of tuned SPEC results in which applications may be rewritten for
better performance on specific systems.  In the spirit of SPEC, publication of
tuned results must be accompanied by listings of the differences between the
tuned source code and the portable source code.  If these differences are so
massive as to discourage publication, perhaps that's a signal to the system
vendors that they've been unrealistic in tuning.

     SPEC previously allowed publication of results for source codes enhanced
for performance.  This was a mistake because it was not accompanied by all the
specific source code changes!  All confirmed SPEC results must be reproducible
by unassisted independent observers from published source codes and Makefiles
and commercially available hardware and software.

     These two types of results - on portable programs and on specifically
tuned programs - correspond to two important classes of end users.   Most
numerous are those who, for many reasons, can't or won't rewrite programs.
Their needs are best represented by SPEC results on standard portable source
code. More influential in the long run, but far fewer in numbers, are
leading-edge users who will take any steps necessary to get the performance
they require, including rewriting software for specific platforms.  Supercom-
puter users are often in this class, as are former supercomputer users who
have migrated to high-performance workstations.

     Arguing the legitimacy of rewrites by system vendors would be a black
hole for the SPEC organization.  Allowing rewrites under public scrutiny
leaves the decision about appropriateness to the leading-edge end users who
would have to make such a determination anyway.  Requiring tuned SPEC results
to always be accompanied by portable SPEC results and by the corresponding
source code diffs reminds the majority of end users of the cost required to
get maximum performance on specific platforms.

     Just as tuned SPECstats should never be confused with portable SPECstats,
projected SPECstats for unreleased hardware or software products should never
be confused with confirmed SPECstats.  A confirmed SPECstat is one that can be
reproduced by anybody because the benchmark sources and Makefiles are avail-
able from SPEC, and the hardware and software are publicly available.

     Nor should SPECstats computed from SPEC 3.0 be confused with those com-
puted from SPEC 1.0.  All SPECstats should be qualified with an identification
of the SPEC suite used to compute them.  The calendar year of publication is
easiest to remember.  Thus integer performance results derived from SPEC 3.0
benchmarks published in 1992 should be identified:

        SPECint.92
                                confirmed from portable source
        SPECint.92.projected
                                projected from portable source
        SPECint.92.tuned
                                confirmed from tuned source, diffs attached
        SPECint.92.tuned.projected
                                projected from tuned source, diffs attached

Similarly for SPECfp.  I suspect the overall SPECmark has outlived its useful-
ness - there is no reason to expect SPECint and SPECfp to be closely corre-
lated in general.  Otherwise there would be no need to measure both.  If any
circumstance warrants publishing just one SPECmark, let it be the worst
SPECratio of all the programs.

     Defining "floating-point-intensive application" and "integer application"
is an interesting problem.  If floating-point operations constitute less than
1% of the total dynamic instruction count on all platforms that bother to
measure, that's surely an integer application.  If floating-point operations
constitute more than 10% of the total dynamic instruction count on all plat-
forms that bother to measure, that's surely a floating-point application.
Intermediate cases may represent important application areas; these should not
be included in SPEC 3.0, however, unless at least three can be identified.
spice running the greycode input could be the first.  Should these mixed cases
be included in SPECint.92, or SPECfp.92, or form a third category SPEC-
mixed.92?
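
     As a rough illustration only (the thresholds come from the percentages
above, but the operation counts are hypothetical, and a real classification
would have to consider every platform that bothers to measure), the rule
amounts to something like this in C:

        #include <stdio.h>

        /* Hypothetical classifier: fp_ops and total_ops are dynamic
           operation counts measured on one platform. */
        static const char *classify(double fp_ops, double total_ops)
        {
            double fraction = fp_ops / total_ops;

            if (fraction < 0.01)
                return "integer";
            if (fraction > 0.10)
                return "floating-point";
            return "mixed";
        }

        int main(void)
        {
            /* About 6.7% floating-point: an intermediate case. */
            printf("%s\n", classify(2.0e9, 3.0e10));
            return 0;
        }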

     All SPECstats.92 should include an indication of the dispersion of the
underlying set of SPECratios used to compute the SPECstat.92 geometric mean.
It is a feature of modern high-performance computing systems that their rela-
tive performance varies tremendously across different types of applications.
It is therefore inevitable, rather than a defect in SPEC, that a single per-
formance figure has so little predictive power.  This means a single number
should never be cited as a SPECstat.92.

     SPEC requires that SPECmark results be accompanied by SPECratios for each
test.  This is an important requirement, but it is not realistic to expect
every consumer of SPEC results to absorb 30 or more performance numbers for
every system.  Some additional simple means of representing dispersion is war-
ranted.  One very simple method is to quote the range from the worst SPECratio
to the geometric mean:

        SPECint.92 =    15..21

means that the worst SPECratio was 15 and the geometric mean was 21.  The worst
ratio is much more likely to be achieved on realistic problems than the best,
which is why I don't see much value in quoting the latter except as part of
the list of all the SPECratios.
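
     As an illustration, a few lines of C suffice to reduce a list of
SPECratios to this form; the ratios below are invented, not measured:

        #include <stdio.h>
        #include <math.h>

        /* Summarize invented SPECratios as "worst..geometric mean". */
        int main(void)
        {
            double ratio[] = { 15.0, 18.0, 22.0, 24.0, 27.0, 21.0 };
            int n = sizeof ratio / sizeof ratio[0], i;
            double worst = ratio[0], logsum = 0.0;

            for (i = 0; i < n; i++) {
                if (ratio[i] < worst)
                    worst = ratio[i];
                logsum += log(ratio[i]);
            }
            printf("SPECint.92 = %.0f..%.0f\n", worst, exp(logsum / n));
            return 0;
        }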

     A more complicated way to represent dispersion, based on standard devia-
tions of the logs of SPECratios in order to produce a SPECstat of the form 21
+/- 3, is discussed later.

Summary of Reporting Format

     Thus I propose that SPEC 3.0 results be reported in the form

        SPECint.92 = WI..UI             SPECfp.92 = WF..UF

W = worst SPECratio, U = geometric mean of SPECratios.  The complete list of
portable SPECratios follows.

     Optionally after that,

        SPECint.92.tuned = WI..UI       SPECfp.92.tuned = WF..UF

followed by the complete list of tuned SPECratios AND the complete list of
source differences between tuned and portable source.  Add another column for
SPECmixed.92 if desired, and for SPECmark.92 if retained.

Floating-point Precision

     SPECint and SPECfp are currently recognized as subsets of SPECmark.
Should single-precision floating-point results be treated separately from
double-precision?  How should SPECfp be reported on systems whose "single
precision" is 64-bit rather than the 32-bit common on workstations and
PC's?  Although 64-bit computation is most common as a safeguard against
roundoff, many important computations are routinely performed in 32-bit single
precision with satisfactory results.  To bypass these issues, SPEC Fortran
source programs declare floating-point variables as either "real*4" or
"real*8" - never plain "real" or "double precision".

     To be meaningful to end users, SPEC source codes would ideally allow
easily changing the precision of variables, and vendors would be allowed and
encouraged to treat working precision like a compiler option, using the best
performance that yields correct results - of course documenting those choices.
I know from experience, however, the great tedium of adapting source codes to
be so flexible; and such flexibility also requires greater care in testing
correctness of results.
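
     In C the analogous flexibility can be had fairly cheaply by hiding the
working precision behind a typedef chosen at compile time; the following is
only a sketch of the idea, not a SPEC requirement:

        #include <stdio.h>

        /* Select working precision at compile time, e.g.
           cc -DWORKING_PRECISION_SINGLE ... for 32-bit arithmetic. */
        #ifdef WORKING_PRECISION_SINGLE
        typedef float  real;
        #else
        typedef double real;
        #endif

        /* A kernel written once in terms of "real". */
        static real dot(const real *x, const real *y, int n)
        {
            real sum = 0;
            int i;

            for (i = 0; i < n; i++)
                sum += x[i] * y[i];
            return sum;
        }

        int main(void)
        {
            real x[3] = { 1, 2, 3 }, y[3] = { 4, 5, 6 };

            printf("dot = %g, sizeof(real) = %u\n",
                   (double)dot(x, y, 3), (unsigned)sizeof(real));
            return 0;
        }

Fortran sources could presumably get the same effect with an include file or
preprocessor defining the working precision, at the cost of the tedium
mentioned above.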

Verifying Correctness

     Astounding performance while computing erroneous results is easy but not very
interesting.  Correctness verification is ideally an independent step that is
not timed as part of the benchmark.  Somewhat in contradiction, it is highly
desirable that the correctness be in terms meaningful to the application.  For
physical simulations, appropriate tests of correctness include checks that
physically conserved quantities such as momentum and energy are conserved com-
putationally.

     Consider the linpack 1000x1000 benchmark as an example, because it's easy
to analyze rather than because it's appropriate for SPEC 3.0.  The rules
require you to use the data generation and result testing software provided by
Dongarra, but you may code the computation any reasonable way appropriate to
the system.  Correctness is determined by computing and printing a single
number, a normalized residual ||b-Ax||, that depends on all the quantities x
computed in the program - thus foiling optimizers aggressively eliminating
dead code.
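
     The residual computation itself is simple; here is a C sketch of the
usual normalization (details of Dongarra's actual test code may differ), with
a tiny 2x2 system standing in for the 1000x1000 problem:

        #include <stdio.h>
        #include <math.h>
        #include <float.h>

        /* ||b - A*x||inf / (||A||inf * ||x||inf * n * eps) for an n x n
           system with A stored row-major.  Printing the value defeats
           dead-code elimination of the solve that produced x. */
        static double normalized_residual(int n, const double *a,
                                          const double *x, const double *b)
        {
            double rnorm = 0.0, anorm = 0.0, xnorm = 0.0;
            int i, j;

            for (i = 0; i < n; i++) {
                double r = b[i], rowsum = 0.0;

                for (j = 0; j < n; j++) {
                    r -= a[i*n + j] * x[j];
                    rowsum += fabs(a[i*n + j]);
                }
                if (fabs(r) > rnorm)    rnorm = fabs(r);
                if (rowsum > anorm)     anorm = rowsum;
                if (fabs(x[i]) > xnorm) xnorm = fabs(x[i]);
            }
            return rnorm / (anorm * xnorm * n * DBL_EPSILON);
        }

        int main(void)
        {
            /* A = [2 1; 1 3], x = [1 1], b = A*x = [3 4]. */
            double a[4] = { 2, 1, 1, 3 }, x[2] = { 1, 1 }, b[2] = { 3, 4 };

            printf("normalized residual = %g\n",
                   normalized_residual(2, a, x, b));
            return 0;
        }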

     How does one determine whether a residual is acceptable?  Unfortunately
that question can only be answered by the designers or users of the applica-
tion.  In this respect the linpack benchmark is obviously artificial because
there really is no a priori reason to draw the line of acceptable residuals at
10, or 100, or 1000...  it depends on the intended use of the results.

     If the correctness criterion were established in absolute terms (rather
than relative to the underlying machine precision as linpack's normalized
residual does) then there would be no harm in rewriting programs in higher
precision and avoiding pivoting, if that produced acceptable results and
improved performance.

     The difference between absolute correctness criteria and criteria rela-
tive to machine precision reflects the difference between requirements placed on
complete applications and requirements of mathematical software libraries.
The complete application typically needs to compute certain quantities to some
known absolute accuracy.  Mathematical software libraries, like the Linpack
library from which the well-known benchmark was drawn, will be used by many
applications with many differing requirements not known in advance, so their
quality should be the highest reasonably obtainable with a particular arith-
metic precision, and thus is best measured in units of that precision.
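
     The contrast can be made concrete in a few lines of C; both tolerances
below are placeholders that a real application or library designer would have
to choose:

        #include <stdio.h>
        #include <math.h>
        #include <float.h>

        /* Application-style criterion: the result must fall within a known
           absolute tolerance, however it was computed. */
        static int acceptable_absolute(double computed, double expected,
                                       double abstol)
        {
            return fabs(computed - expected) <= abstol;
        }

        /* Library-style criterion: the error must be a modest multiple of
           the working precision, scaled by the size of the expected value. */
        static int acceptable_relative(double computed, double expected)
        {
            return fabs(computed - expected)
                   <= 100.0 * DBL_EPSILON * fabs(expected);
        }

        int main(void)
        {
            double x = 0.1 + 0.2;   /* not exactly 0.3 in binary */

            printf("absolute test: %d   relative test: %d\n",
                   acceptable_absolute(x, 0.3, 1.0e-6),
                   acceptable_relative(x, 0.3));
            return 0;
        }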

General Content

     SPEC 3.0 benchmarks should time important realistic applications, as com-
plete as portability permits, from whose performance users may reasonably pro-
ject performance of their similar applications.

     Benchmarks should be independent in this statistical sense:  it should
not be possible statistically to predict the performance of one SPEC benchmark
with any accuracy across many SPEC member platforms based upon the known per-
formance of some other disjoint subset of SPEC benchmarks on those platforms.
As long as important realistic applications are chosen that can be reasonably
verified for correctness, and this independence criterion is satisfied, I see
no need to arbitrarily limit the number of SPEC computational benchmarks.

     In addition to the current SPEC 3.0 candidates, I recommend to SPEC for
future consideration the Fortran programs collected by the PERFECT Club.
Aside from spice2g6, which has a different input deck, they seem mostly
independent of current SPEC 1.0 programs.

     A number of the gcc and espresso subtests run too fast to be timed accu-
rately.  They should be replaced by more substantial ones.

Specific Comments - matrix300

     The matrix300 benchmark has outlived its usefulness.  Like linpack before
it, it has forced adoption of new technology that has in turn made it
obsolete, for it is now susceptible to optimization improvements seldom
observed in realistic applications.  Amazing performance improvements have
been reported by applying modern compiler technology previously reserved for
vectorizing supercomputers:

                   System    Old SPECratio   New SPECratio

                   IBM 550        100             730
                   HP 730          36             510

Competitive compilers should indeed exploit such technology, but it does end
users no good to suggest that many realistic applications will subsequently
show 7X-14X performance improvements.  Such results are simply an artifact of
this particular artificial benchmark, and demonstrate how misleading it is to
present SPEC 1.0 performance with one number.  Inasmuch as the SPEC 1.0 ver-
sion of matrix300 does not actually report any numerical results, the entire
execution could legitimately be eliminated as dead code, although so far
nobody has exhibited that much temerity.

     While the matrix multiplication portion of some realistic applications
can and should demonstrate significant improvements, the overall application's
improvement will be tempered by the portions that aren't susceptible to such
optimizations.  nasa7 includes a matrix multiplication kernel, but the spirit
of SPEC is much better served by incorporating into SPEC 3.0 certain proposed
realistic applications of which matrix multiplication is one important com-
ponent among others.

Specific Comments - nasa7

     Nasa7 consists of the kernels of seven different important computational
applications.   As such it was much more realistic - because its kernels were
more realistically complicated - than the Livermore loops which it has largely
supplanted.  Each of the specific types of applications - involving matrix
multiplication, 2D complex FFT, linear equations solved by Cholesky, block
tridiagonal, complex Gaussian elimination methods, etc. - should be
represented separately by realistic applications rather than somewhat arbi-
trarily lumped into one benchmark: the repetition factors for the seven ker-
nels are 100, 100, 200, 20, 2, 10, 400, which may represent the relative loads at
NASA but probably not elsewhere.  And they make the run time fairly long.

Specific Comments - doduc

     There is one troubling aspect of the doduc program: it lacks any good
test of correctness other than the number of iterations required to complete
the program, a number that might not be a very reliable guide.  For instance,
if the simulated time is extended from 50 to 100 seconds, the number of itera-
tions appears to vary by 20% (20,000 - 24,000) among systems which appear to
behave similarly in shorter runs, casting doubt on the correctness of the
shorter runs.  doduc is an interesting and valuable benchmark that should be
retained in SPEC 3.0 if a more confidence-inspiring correctness criterion can
be devised.

Specific Comments - spice

     The greycode input deck doesn't seem to correspond to any very common
realistic computations, and takes a long time to run as well.  Several other
input decks have been proposed; SPEC 3.0 should include some of them as
spice2g6 subtests.

     In addition I urge SPEC to consider, when opportunity permits, the spice3
program from UCB.  It is unusual - a publicly available, substantial scien-
tific computation program, written in C.  It accepts most of the input decks
that spice2g6 accepts.

Specific Comments - gcc

     gcc 1.35 represents relatively old compiler technology suitable for CISC
systems based on 80386 or 68020, for instance.  gcc 2.0 is designed to do the
kinds of aggressive local optimizations required for RISC architectures - such
as most of the hardware platforms sold on the basis of their SPECmarks.  I
encourage SPEC to replace gcc 1.35 with 2.0 as soon as the latter is available
for distribution.

     In addition I urge SPEC to consider the f2c Fortran-to-C translator from
AT&T.  It is another publicly available, substantial program written in C, with
many of the same kinds of analyses that a full Fortran compiler performs.

SPECstat Computations

     SPECint and SPECfp are geometric means of ratios of elapsed real times of
realistic applications.  That's the correct approach.

     I would handle the cases of multiple subtests somewhat differently than
SPEC 1.0 does: for gcc and espresso, and perhaps spice2g6 and spice3 in the
future.  Currently the run times of subtests are added up to get an overall
execution time.  For the same reason that the geometric mean of several tests
is appropriate for the overall SPECmark, the geometric mean of the SPECratios
of the subtests is the appropriate SPECratio for that test.  Thus instead of
adding up the times for all 8 espresso inputs and comparing that sum to the
sum of the 8 times on the reference system, I'd compute the SPECratio for each
espresso input, compute the geometric mean of those 8 SPECratios, and use that
as the SPECratio for the espresso benchmark when computing the overall SPEC-
mark.
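
     A sketch of the proposed subtest computation in C, with invented times
for 8 espresso-style inputs:

        #include <stdio.h>
        #include <math.h>

        #define NSUB 8

        /* Invented elapsed times in seconds for 8 subtest inputs. */
        static double ref_time[NSUB] = { 120, 95, 300, 40, 60, 210, 75, 150 };
        static double sys_time[NSUB] = {   6,  5,  14,  2,  3,  10,  4,   8 };

        int main(void)
        {
            double logsum = 0.0;
            int i;

            /* Geometric mean of per-subtest SPECratios, used as the
               benchmark's SPECratio instead of a ratio of summed times. */
            for (i = 0; i < NSUB; i++)
                logsum += log(ref_time[i] / sys_time[i]);

            printf("espresso SPECratio = %.1f\n", exp(logsum / NSUB));
            return 0;
        }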

SPECstat.92 Reference Times

     The VAX 780 is rapidly disappearing but is anything but rapid in running
SPEC programs.  For convenience, the reference system for SPEC 3.0 should be
widely available and as fast as possible.  One could choose candidate refer-
ence platforms on the basis of SPECmass, the performance equivalent of
biomass: SPECmass = SPECstat * installed base.  On that basis one of the
SPARCstations might be selected, but it doesn't matter too much - any recent
widely available RISC Unix workstation would do.  The reference result would
be the best elapsed time achieved on the reference system by the time SPEC 3.0
was announced, using any combination of compiler and operating system produc-
ing correct results.

     The results for the reference system would be remarkably balanced - all
SPECratios equal to 1 - which might appear to be to the advantage of the ven-
dor of the reference system, but any such advantage is illusory.  With product
lifetimes of a year or so, the reference system - which by definition has a
large installed base and therefore is near its end of life - would be out of
production during most of the lifetime of the SPEC 3.0 suite, and any replace-
ment products from that vendor would likely have SPECratios that would be far
from uniform.

     The concept of "system balance" exists mostly in marketing science any-
way; the same computer that has no outstanding bottlenecks in one environment
may be limited by integer, floating-point, memory, i/o, or graphics perfor-
mance in others.  If SPEC needs politically to avoid choosing one particular
reference system, it could compromise by choosing several of roughly compar-
able integer performance, using one for integer benchmark reference results,
one for floating-point reference results, etc.

     To avoid intentional or accidental confusion between SPEC 1.0 SPECstats
and SPEC 3.0 SPECstats.92, it's desirable to recalibrate SPECstats.92.  If the
SPARCstation 2 were chosen as the SPEC 3 reference, for instance, ignoring the
effects of using a different suite of benchmarks, then SPECstat.92 would be
immediately deflated by a factor of about 21 relative to the SPEC 1.0
SPECstat, reducing opportunities for confusion.

Why Geometric Mean is Best for SPEC

     The progress of some end users is limited by the time it takes a fixed
series of computational tasks to complete.  They then think about the results
and decide what to do next.  The appropriate metric for them is the total
elapsed time for the applications to complete, so the arithmetic mean of times
is the appropriate summary statistic.  If rates, the inverse of times, happen
to be available instead, the appropriate statistic is the harmonic mean of
rates.  If application A runs ten times as long as application B, then a 2X
improvement in application A is ten times as important as a 2X improvement in
application B.

     Other computational situations are characterized by a continual backlog
of processes awaiting execution.  If the backlog were ever extinguished, the
users would simply double their grid densities and saturation would again result.
cases the appropriate metric is rates - computations per time - and the
appropriate summary statistic is an arithmetic mean of rates, or if times are
available, a harmonic mean of times.  A 2X improvement in application A is
just as important as a 2X improvement in application B.

     What about the commonest case consisting of workloads of both sorts?
With geometric means of SPECratios, the conclusions are the same whether rates
or times are used, and rate data and time data may readily be combined.
That's why I like to use the geometric mean to combine SPECratios of diverse
types of programs.
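
     A small numerical illustration in C, with invented ratios, of why the
geometric mean is the neutral choice: averaging rate-like ratios and averaging
time-like (inverted) ratios can rank two systems differently, while the
geometric mean ranks them consistently either way:

        #include <stdio.h>
        #include <math.h>

        /* Invented SPECratios (reference time / system time) for two
           programs on two hypothetical systems A and B. */
        static double a_ratio[2] = { 4.0, 1.0 };
        static double b_ratio[2] = { 2.0, 2.0 };

        static double amean(const double *x, int n)
        {
            double s = 0.0;
            int i;

            for (i = 0; i < n; i++)
                s += x[i];
            return s / n;
        }

        static double gmean(const double *x, int n)
        {
            double s = 0.0;
            int i;

            for (i = 0; i < n; i++)
                s += log(x[i]);
            return exp(s / n);
        }

        int main(void)
        {
            double a_inv[2], b_inv[2];
            int i;

            for (i = 0; i < 2; i++) {
                a_inv[i] = 1.0 / a_ratio[i];
                b_inv[i] = 1.0 / b_ratio[i];
            }

            /* Arithmetic mean of the ratios favors A, arithmetic mean of
               the inverted ratios favors B, but the geometric mean gives
               the same ranking (here a tie) in both directions. */
            printf("arith mean of ratios:   A=%.2f  B=%.2f\n",
                   amean(a_ratio, 2), amean(b_ratio, 2));
            printf("arith mean of inverses: A=%.2f  B=%.2f\n",
                   amean(a_inv, 2), amean(b_inv, 2));
            printf("geom  mean of ratios:   A=%.2f  B=%.2f\n",
                   gmean(a_ratio, 2), gmean(b_ratio, 2));
            return 0;
        }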

     As with benchmarks themselves, the most appropriate way to combine bench-
mark results varies among end users according to their situation.  SPEC has
wisely chosen the most neutral way to combine results - while requiring that
individual results be available as well.

Another Way to Represent Dispersion of SPECratios

     Accompany every SPECmean computed by geometric mean of SPECratios by a +/-
tolerance representing the dispersion in the set of SPECratios used to compute
the geometric mean.  Inasmuch as geometric mean of SPECratios is the exponen-
tial of the arithmetic mean of the logs of SPECratios, the tolerance could be
computed from the standard deviation s of the logs of the SPECratios in this
way:

        u = mean(log(SPECratios))
        s = standard deviation(log(SPECratios))
        U = exp(u)
        S = U*(exp(2*s) - 1)
        round U to nearest two significant figures
        round S upward to same number of decimal places as U
        SPECmean = U +/- S

Thus I would summarize the results of some recent experimental compiler tests
as

        SPECmark = 20 +/- 2
        SPECint  = 19 +/- 2
        SPECfp   = 21 +/- 1

Such a +/- presentation emphasizes the futility of buying decisions based on
insignificant SPECmean differences in the third significant figure, and may
therefore help focus system vendor efforts on improving the worst SPECratios
instead of the best.

     Strictly speaking statistically, the SPECmean would be

        exp(u) + exp(u)*(exp(2*s) - 1)
               - exp(u)*(1 - exp(-2*s))

but simplicity recommends the earlier formulation.
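
     In C the simpler formulation works out as in the sketch below; the
SPECratios are invented, and the final rounding to two significant figures is
left as a comment:

        #include <stdio.h>
        #include <math.h>

        /* Invented SPECratios, for illustration only. */
        static double ratio[] = { 15.0, 18.0, 22.0, 24.0, 27.0, 21.0 };

        int main(void)
        {
            int n = sizeof ratio / sizeof ratio[0], i;
            double u = 0.0, s = 0.0, U, S;

            for (i = 0; i < n; i++)
                u += log(ratio[i]);
            u /= n;                               /* mean of the logs     */

            for (i = 0; i < n; i++)
                s += (log(ratio[i]) - u) * (log(ratio[i]) - u);
            s = sqrt(s / (n - 1));                /* std dev of the logs  */

            U = exp(u);                           /* geometric mean       */
            S = U * (exp(2.0 * s) - 1.0);         /* +/- tolerance        */

            /* For publication, round U to two significant figures and
               round S upward to the same number of decimal places. */
            printf("SPECmean = %.3g +/- %.2g\n", U, S);
            return 0;
        }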

-- 

David Hough

dgh@validgh.com		uunet!validgh!dgh	na.hough@na-net.ornl.gov