[comp.arch] Benchmarking performance vs. robustness

dgh@validgh.com (David G. Hough on validgh) (06/10/91)

There's been an ongoing discussion of issues of performance vs. robustness in
comp.arch.  How much is it worth to improve the performance on the 
"common" case at the cost of producing wrong results on "uncommon"
cases?  "uncommon" is not necessarily rare, of course.

IEEE 754 floating-point arithmetic was intended to increase the domain
of problems for which good results could be obtained, and to increase the
likelihood of getting error indications if good results could not be obtained,
all without significantly degrading performance on "common" cases and
without significantly increasing total system cost relative to sloppy
arithmetic.
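
Here's a small sketch in C of what an "error indication" amounts to in
practice, assuming an implementation that exposes the IEEE exception
flags through <fenv.h>; the particular operations and variable names are
only illustrative:

#include <stdio.h>
#include <math.h>
#include <fenv.h>

int main(void)
{
    volatile double huge = 1.0e300, zero = 0.0;

    feclearexcept(FE_ALL_EXCEPT);

    double over = huge * huge;   /* overflows: result is +infinity     */
    double bad  = zero / zero;   /* invalid operation: result is a NaN */

    if (fetestexcept(FE_OVERFLOW))
        printf("overflow flagged, result = %g\n", over);
    if (fetestexcept(FE_INVALID))
        printf("invalid flagged, result is NaN? %d\n", isnan(bad));
    return 0;
}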

Since none of these goals are very quantitative, 
people will argue about how well they've been achieved.

Part of the problem is that the benchmark programs in use measure only
common cases.  Various versions of the Linpack benchmark all have in common
that the data is taken from a uniform random distribution, producing problems
of very good condition.   So the worst possible linear
equation solver algorithm running on the dirtiest possible floating-point
hardware should be able to produce a reasonably small residual, even for
input problems of very large dimension.
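
To see why such data is so forgiving, here is a rough sketch in C, not
the actual Linpack code: it fills a matrix with uniformly distributed
random entries, solves the system by Gaussian elimination carried out
entirely in single precision (standing in for lower-quality arithmetic),
and then evaluates the Linpack-style normalized residual
||b - Ax|| / (n ||A|| ||x|| eps) against single-precision epsilon.  The
dimension, the generator, and the choice of float are arbitrary; the
normalized residual should still come out to a modest number.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <float.h>

#define N 100

static float  A[N][N], b[N];   /* working copies; the solution ends up in b */
static double A0[N][N], b0[N]; /* originals, kept for the residual check    */

int main(void)
{
    int i, j, k;
    double rnorm = 0.0, anorm = 0.0, xnorm = 0.0;

    srand(1);
    for (i = 0; i < N; i++) {
        b0[i] = 0.0;
        for (j = 0; j < N; j++) {
            A[i][j] = (float)((double)rand() / RAND_MAX - 0.5);
            A0[i][j] = A[i][j];
            b0[i] += A0[i][j];          /* so the true solution is all ones */
        }
        b[i] = (float)b0[i];
    }

    /* Gaussian elimination with partial pivoting, all in single precision */
    for (k = 0; k < N; k++) {
        int p = k;
        for (i = k + 1; i < N; i++)
            if (fabsf(A[i][k]) > fabsf(A[p][k])) p = i;
        for (j = k; j < N; j++) { float t = A[k][j]; A[k][j] = A[p][j]; A[p][j] = t; }
        { float t = b[k]; b[k] = b[p]; b[p] = t; }
        for (i = k + 1; i < N; i++) {
            float m = A[i][k] / A[k][k];
            for (j = k; j < N; j++) A[i][j] -= m * A[k][j];
            b[i] -= m * b[k];
        }
    }
    for (i = N - 1; i >= 0; i--) {       /* back substitution; x overwrites b */
        for (j = i + 1; j < N; j++) b[i] -= A[i][j] * b[j];
        b[i] /= A[i][i];
    }

    /* Linpack-style normalized residual, accumulated in double precision */
    for (i = 0; i < N; i++) {
        double r = b0[i], rowsum = 0.0;
        for (j = 0; j < N; j++) { r -= A0[i][j] * b[j]; rowsum += fabs(A0[i][j]); }
        if (fabs(r) > rnorm)            rnorm = fabs(r);
        if (rowsum > anorm)             anorm = rowsum;
        if (fabs((double)b[i]) > xnorm) xnorm = fabs((double)b[i]);
    }
    printf("normalized residual = %.3f\n",
           rnorm / ((double)N * anorm * xnorm * FLT_EPSILON));
    return 0;
}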

Such problems may actually correspond to some realistic applications, but many
other realistic applications
tend at times to get data that is much closer to the boundaries of
reliable performance.   This means that the same algorithms and input data
will fail on some computer systems and not on others.   In consequence,
benchmark suites that are developed by consortia of manufacturers 
proceeding largely by consensus, such as SPEC, will tend to exclude
applications, however realistic, for which results of comparable quality
can't be obtained by all the members' systems.
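
One small, hypothetical illustration in C: the two numbers below are
distinct, but their difference is subnormal, so a system with IEEE 754
gradual underflow keeps them distinct while a system that flushes
underflows to zero reports their difference as exactly zero, so that an
algorithm which later divides by that difference fails on the second
system and not on the first, given the same code and the same data.

#include <stdio.h>
#include <float.h>

int main(void)
{
    volatile double x = 1.50 * DBL_MIN;   /* both are normal numbers    */
    volatile double y = 1.25 * DBL_MIN;
    double d = x - y;                     /* 0.25 * DBL_MIN: subnormal  */

    if (d != 0.0)
        printf("gradual underflow: x - y = %g, y + (x - y) == x is %d\n",
               d, y + d == x);
    else
        printf("flush to zero: x != y, and yet x - y == 0\n");
    return 0;
}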

The user's own applications are always the best benchmark, but because of
practical difficulties it makes more sense for consortia of users in various
industries to specify in a general way the problems to be solved in terms
of input data, sample algorithms, and a measure of correctness of the computed
results, and then allow fairly radical recoding of the algorithms to solve
the problems on particular architectures.  This is more like the way the 
PERFECT club works than like SPEC.   The most important thing is that the
published performance and correctness results should be accompanied by the
source code (and Makefiles etc.) that achieves them so that prospective
purchasers can determine for themselves the relevance of the results to their
particular requirements.  Most users place a great premium on NOT rewriting
any source code, but others are willing to do whatever is necessary to get
their jobs done.

The rules for the 1000x1000 Linpack benchmark, for instance, are that you
have to use the provided program for generating the input data and testing
the results, but you get to recode the linear equation solution itself,
the part that gets timed, in any way that makes sense for a particular
architecture.
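
Here is a structural sketch of that arrangement in C, with made-up sizes
and names rather than the real driver: generate() and check() play the
role of the provided program, only the call to solve() is timed, and
solve() is the piece a vendor is free to recode.  The stand-in solve()
below is just textbook elimination, and the check here compares against
the known solution; the provided tester in the real benchmark checks a
normalized residual of the sort shown earlier.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define N 250                      /* 1000 in the real benchmark         */

static double A[N][N], b[N], x[N];

static void generate(void)         /* fixed: supplied with the benchmark */
{
    srand(1325);
    for (int i = 0; i < N; i++) {
        b[i] = 0.0;
        for (int j = 0; j < N; j++) {
            A[i][j] = (double)rand() / RAND_MAX - 0.5;
            b[i] += A[i][j];       /* so the true solution is all ones   */
        }
    }
}

static void solve(void)            /* replaceable: the only part timed   */
{
    /* textbook Gaussian elimination with partial pivoting; a vendor
       would substitute blocked, vectorized, or assembler code here */
    for (int k = 0; k < N; k++) {
        int p = k;
        for (int i = k + 1; i < N; i++)
            if (fabs(A[i][k]) > fabs(A[p][k])) p = i;
        for (int j = k; j < N; j++) { double t = A[k][j]; A[k][j] = A[p][j]; A[p][j] = t; }
        double t = b[k]; b[k] = b[p]; b[p] = t;
        for (int i = k + 1; i < N; i++) {
            double m = A[i][k] / A[k][k];
            for (int j = k; j < N; j++) A[i][j] -= m * A[k][j];
            b[i] -= m * b[k];
        }
    }
    for (int i = N - 1; i >= 0; i--) {
        x[i] = b[i];
        for (int j = i + 1; j < N; j++) x[i] -= A[i][j] * x[j];
        x[i] /= A[i][i];
    }
}

static int check(void)             /* fixed: supplied with the benchmark */
{
    /* here: compare against the known solution of all ones; the real
       driver evaluates a normalized residual as in the earlier sketch */
    double err = 0.0;
    for (int i = 0; i < N; i++)
        if (fabs(x[i] - 1.0) > err) err = fabs(x[i] - 1.0);
    printf("max error vs. known solution: %g\n", err);
    return err < 1.0e-6;           /* generous tolerance for this sketch */
}

int main(void)
{
    generate();
    clock_t t0 = clock();
    solve();                       /* only this call contributes to the  */
    clock_t t1 = clock();          /* reported timing figure             */
    printf("solve time %.3f s, check %s\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           check() ? "passed" : "FAILED");
    return 0;
}
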
-- 

David Hough

dgh@validgh.com		uunet!validgh!dgh	na.hough@na-net.ornl.gov