[comp.benchmarks] SPEC vs. Dhrystone

borasky@ogicse.ogi.edu (M. Edward Borasky) (01/03/91)

In article <1685@marlin.NOSC.MIL> aburto@marlin.nosc.mil.UUCP (Alfred A. Aburto) writes:
>In article <44353@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>(Note, for example, that published Dhrystone results easily mis-predict
>>SPEC integer benchmarks pretty badly, i.e., it is quite easy for machine
>>"a" to be 25% faster on Dhrystone than "b", and end up 25% SLOWER on more
>>realistic integer benchmarks.)
>This is an interesting observation (result).
>Dhrystone was intended to be REPRESENTATIVE of TYPICAL integer
>programs. That is, hundreds (I believe) of programs were 
>analyzed to come up with the (ahem) 'typical' high level
>language instructions and their frequency of usage. In view of this 
>I would, at first sight, suspect the Dhrystone to be more accurate
>than SPEC as SPEC is based upon only a few integer programs. 
I suspect there are two factors at work here.  First, Dhrystone is a
fairly small benchmark, and would not exercise the memory hierarchy 
as hard as the real programs in SPEC.  The second factor is that it is
easier to tune a compiler to a small benchmark like Dhrystone (or for
that matter Whetstone and the Livermore Loops) than it is to tune for
a variety of different real programs.  By the way, I believe Dhrystone
was originally written in ADA and was translated to "C", the form in
which it is usually run.
>Real programs also show a great variation in performance.  I noticed
>this recently in a Scientific American article (Jan 1991) which
>showed the comparison of 13 different real programs on a wide
>variety of supercomputers.  
This is the PERFECT CLUB benchmark suite.  I just picked up the article;
when I glanced at it in the supermarket checkout line (some people
look at Oprah's diets, some don't) I suspected that there might be a
howler or two lurking within the article.  If I see anything that looks
suspicious I will post something to comp.sys.super or here or both.
>The program 'megflop' variation in 
>performance was truly tremendous, especially for the fastest systems
>(Cray and a NEC computer I think).
This variation is a fact of life in supercomputing.  If you take any
suite of benchmarks that is fairly comprehensive, you will see much
more variation in performance for a supercomputer than you will for a 
scalar machine.  This is true for both absolute Megaflops and for speed
relative to a fixed SCALAR machine.

mash@mips.COM (John Mashey) (01/03/91)

In article <15546@ogicse.ogi.edu> borasky@ogicse.ogi.edu (M. Edward Borasky) writes:
>In article <1685@marlin.NOSC.MIL> aburto@marlin.nosc.mil.UUCP (Alfred A. Aburto) writes:
>>In article <44353@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>>(Note, for example, that published Dhrystone results easily mis-predict
>>>SPEC integer benchmarks pretty badly, i.e., it is quite easy for machine
>>>"a" to be 25% faster on Dhrystone than "b", and end up 25% SLOWER on more
>>>realistic integer benchmarks.)
>>This is an interesting observation (result).
>>Dhrystone was intended to be REPRESENTATIVE of TYPICAL integer
>>programs. That is, hundreds (I believe) of programs were 
>>analyzed to come up with the (ahem) 'typical' high level
>>language instructions and their frequency of usage. In view of this 
>>I would, at first sight, suspect the Dhrystone to be more accurate
>>than SPEC as SPEC is based upon only a few integer programs. 
>I suspect there are two factors at work here.  First, Dhrystone is a
>fairly small benchmark, and would not exercise the memory hierarchy 
>as hard as the real programs in SPEC.  The second factor is that it is
>easier to tune a compiler to a small benchmark like Dhrystone (or for
>that matter Whetstone and the Livermore Loops) than it is to tune for
>a variety of different real programs.  By the way, I believe Dhrystone
>was originally written in ADA and was translated to "C", the form in
>which it is usually run.
>>Real programs also show a great variation in performance.  I noticed

Well, there are multiple issues, of which several have been mentioned
already here.  Let us review the history of Dhrystone, which was originally,
as stated, a reasonable attempt to model ADA usage, and then got
converted into C.

ISSUE 1: building a small synthetic benchmark that is TYPICAL of intended usage is
	1a: extremely difficult to do, even for a current set of
		hardware and software
	1b: REALLY hard to do and expect to remain valid over time,
		even in the absence of 1c
	1c: really hard to do, if small enough to be subject to compiler
		gimmickry

Let us take each of these in turn:
	1a: is hard, because the usual (and reasonable) methodology is:
		1: select attributes that should be measured
		2: gather statistics from a set of programs
		3: build the benchmark to model those attributes
The problem is:
	1: you may not choose the "right" attributes, and in fact,
	there is no small set of right attributes; there are only
	better and better approximations, even for modeling a
	single user program (not even a large mix).
	For example, suppose your first approximation is:
		count the number of +, -, *, and / executed
	Not a very good approximation
		so add, number of if statements
	Still not good
		so add, distribution of sizes of expressions
	Still not good
		so add, number of function calls
	Still not good,
		so add, distribution of number of arguments of function
		calls (makes a bigger difference amongst machines that pass
			some arguments in registers)
	Still not good, some architectures (like SPARC) can be sensitive to
		the depth of function calls, so do something about that
	Still not good, haven't done anything about array indexing,
		and different architectures react differently
	Still not good, haven't done anything about pointers, so add
		some pointer references
	Still not good, you haven't measured the frequency of different-
		sized offsets from pointers (and surprise! in some architectures
		there is no difference between a zero offset and a non-zero
		offset; in others (such as AMD29K), a zero offset is cheaper
		than non-zero; in others, presence of particular addressing
		modes helps some combinations much more than others; a short
		C sketch below illustrates the point).
	Still not good: how often is the same pointer->object referenced
		close enough in the code, and under such conditions that
		the compiler can just leave it in a register?
	Still not good: is the distribution of variable references such
		that the benchmark will model the effects of a good
		register allocator, or not?
	..... and on, and on....
I.e., it is VERY easy to do a competent job of feature extraction and
modeling, and still get surprised, where "surprise" = the synthetic
benchmark doesn't correlate well with realistic code of the class that it
was supposed to model.  (I've looked at many synthetic benchmarks
with our tools; the numbers quite often don't look anything like what you
see when you analyze real programs.)
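
For concreteness, here is a tiny C sketch (purely illustrative, not
taken from any real benchmark) of two of the attributes above:
zero vs. non-zero pointer offsets, and repeated references that a
register allocator may or may not keep in a register:

	#include <stdio.h>

	struct rec { int head; int pad[7]; int tail; };

	/* Zero-offset vs. non-zero-offset references: on some machines
	   the two loads below cost the same; on others the non-zero
	   offset needs extra work, so their relative frequency matters. */
	int sum_ends(struct rec *p)
	{
	    return p->head          /* offset 0  */
	         + p->tail;         /* offset 32 */
	}

	/* The same pointer->object referenced close together: a good
	   register allocator loads p->head once; a weak one reloads it
	   each time.  A synthetic benchmark has to get this right too. */
	int triple_head(struct rec *p)
	{
	    return p->head + p->head + p->head;
	}

	int main(void)
	{
	    struct rec r = { 1, {0}, 2 };
	    printf("%d %d\n", sum_ends(&r), triple_head(&r));
	    return 0;
	}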

1b: hard to do over time:
	If asked to compare machines that basically differ only by the
	clock rate (same CPU, same compilers), a small benchmark is
	adequate.  However, hardware tends to get more complex over time,
	in particular, faster machines use caches, caches get bigger,
	multi-level caches appear, etc.  Programs expand to use these;
	if a benchmark doesn't also expand appropriately, it starts to
	measure only the smallest part of the memory hierarchy.

	In addition, optimizing compilers get better, and they optimize
	away pieces of the code, especially in a small synthetic benchmark.
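
	A minimal sketch of the cache-growth point, assuming nothing
	about any particular machine: the loop below is identical in
	both calls, but only the second working set can outgrow a cache,
	and a Dhrystone-sized benchmark only ever looks like the first
	call:

	#include <stdio.h>
	#include <stdlib.h>

	#define SMALL (4 * 1024)          /* fits in most caches      */
	#define LARGE (4 * 1024 * 1024)   /* spills to the next level */

	long sweep(const int *a, size_t n, int passes)
	{
	    long sum = 0;
	    for (int p = 0; p < passes; p++)
	        for (size_t i = 0; i < n; i++)
	            sum += a[i];          /* same code, bigger data */
	    return sum;
	}

	int main(void)
	{
	    int *small = calloc(SMALL, sizeof(int));
	    int *large = calloc(LARGE, sizeof(int));
	    if (!small || !large) return 1;
	    printf("%ld %ld\n", sweep(small, SMALL, 1024),
	                        sweep(large, LARGE, 1));
	    free(small);
	    free(large);
	    return 0;
	}

	Time the two calls and the per-element cost diverges as soon as
	LARGE exceeds the cache; shrink LARGE and the difference
	disappears, which is exactly what happens to a fixed-size
	benchmark as caches grow.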

1C: Compiler gimmickry
	For any important benchmark that is small, compilers will get tuned
	in ways that are absolutely useless in real life.  This has happened
	at least with Whetstone, Dhrystone, and LINPACK.

ISSUE 2: Dhrystone in particular
	The MIPS Performance Brief, Issue 3.9 (and earlier) has had analyses
	of Dhrystone issues for years.
	Here is a brief summary:
	1. Small, will fit in tiny instruction and data caches
	2. References and re-references data in ways that model the
	  effects of write-back & write-thru caches poorly
	3. Subroutine calls are of shallow depth, hence never underflow/
	  overflow on a register window/stack cache machine.
	4. Makes function calls more frequently than any real program
	  I've ever seen, i.e., on a MIPS, uses <40 cycles per call,
	  whereas 60-100 is much more typical of C programs
	5. Can easily spend 30% of its time in strcpy, unlike any real
	  program I've ever analyzed.  Due to the particular use
	  (copy a 30-byte constant, over and over again), it is especially
	  amenable to gimmickry, such as compiler options which generate
	  incorrect code for real use, but happen to work for Dhrystone.
	  (Note that some of this is an artifact of translation from
	  ADA/Pascal (fixed-length strings) to C.)
	  Most amusing code: i860, where the 30-byte constant is expanded
	  to 32, and is then copied with 2 16-byte loads, followed by
	  2 16-byte stores, not very typical of real C-language string
	  processing, where most pointers are to variables whose sizes
	  are unknown at compile time.... (a sketch of this strcpy
	  pattern follows this list)
	6. There is an unusually high frequency of zero-offset pointers.
	7. In the earlier versions, there was obvious dead code, which started
	  to disappear under the pressure of better optimizers.
	  (not gimmickry, just better compilers).
	8.  Also, the earlier version never worried about compilers
	  that can merge the whole program together and inline EVERYTHING...
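
A sketch of the strcpy pattern point 5 describes (the string below is
the 30-byte constant from the C Dhrystone; the loop is just to make
the point, not the actual harness):

	#include <stdio.h>
	#include <string.h>

	char String_2_Loc[31];

	int main(void)
	{
	    /* The source is a string CONSTANT, so its length is known at
	       compile time; a compiler can legally turn this into a
	       fixed-size block copy (e.g. two 16-byte loads and stores),
	       which says nothing about strcpy on strings whose lengths
	       are only known at run time. */
	    for (int i = 0; i < 100000; i++)
	        strcpy(String_2_Loc, "DHRYSTONE PROGRAM, 2'ND STRING");
	    printf("%s\n", String_2_Loc);
	    return 0;
	}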

So, Reinhold W. started with something that was actually a reasonable
attempt, but it is HARD TO DO, and even HARDER to keep sensible...
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (01/03/91)

>>>>> On 3 Jan 91 06:18:59 GMT, mash@mips.COM (John Mashey) said:

mash> 1C: Compiler gimmickry
mash> 	For any important benchmark that is small, compilers will get tuned
mash> 	in ways that are absolutely useless in real life.  This has happened
mash> 	at least with Whetstone, Dhrystone, and LINPACK.

So what optimizations have been performed on the LINPACK 100x100 code
that are "absolutely useless" in real life?

I am honestly curious, since it seems to me that the vast majority of
the time is spent in a pair of loops which repeatedly call SAXPY.
SAXPY, in turn, is a short, vectorizable function with a number of
parameter checks before the calculation.

The only optimizations that I can think of that would be useful are
inlining and constant propagation, and they are both generally
helpful.  Inlining can be useful in short vector codes, especially if
combined with compile-time constant propagation, since that would
allow most of the checks to be deleted.  The checks are (potentially)
useful in the BLAS library (though they are redundant when LINPACK
calls the BLAS, since the parameters are already checked at least once
before the BLAS calls).  Nonetheless, optimizing them away by hand
removes the check that should be there when the user calls the BLAS
routines directly, as I sometimes do....
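
To make the inlining/constant-propagation point concrete, here is a C
analogue of SAXPY (hypothetical; the real routine is Fortran) with the
kinds of checks in question:

	#include <stdio.h>

	/* General library entry point: the checks guard callers like the
	   direct BLAS user mentioned above. */
	void saxpy(int n, float a, const float *x, int incx,
	           float *y, int incy)
	{
	    if (n <= 0) return;               /* parameter check */
	    if (a == 0.0f) return;            /* quick return    */
	    if (incx == 1 && incy == 1) {
	        for (int i = 0; i < n; i++)   /* unit-stride path */
	            y[i] += a * x[i];
	    } else {
	        for (int i = 0; i < n; i++)
	            y[i * incy] += a * x[i * incx];
	    }
	}

	int main(void)
	{
	    float x[100], y[100];
	    for (int i = 0; i < 100; i++) { x[i] = 1.0f; y[i] = 2.0f; }
	    /* Once this call is inlined, the constants 100, 1, 1 make
	       every check above decidable at compile time, so the
	       compiler can delete them along with the strided path. */
	    saxpy(100, 3.0f, x, 1, y, 1);
	    printf("%f\n", y[0]);
	    return 0;
	}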

--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

borasky@ogicse.ogi.edu (M. Edward Borasky) (01/03/91)

In article <MCCALPIN.91Jan3100122@pereland.cms.udel.edu> mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
>>>>>> On 3 Jan 91 06:18:59 GMT, mash@mips.COM (John Mashey) said:
>
>mash> 1C: Compiler gimmickry
>mash> 	For any important benchmark that is small, compilers will get tuned
>mash> 	in ways that are absolutely useless in real life.  This has happened
>mash> 	at least with Whetstone, Dhrystone, and LINPACK.
>
>So what optimizations have been performed on the LINPACK 100x100 code
>that are "absolutely useless" in real life?

1. There was a compiler once that actually CHEATED on the Linpack bench-
mark.  What they did was, any time they encountered a routine called
"SGEFA" or "DGEFA", they checked the number of parameters and their
type, and if they agreed with the standard usage in the LINPACK bench-
mark, branched off to a machine-coded SGEFA or DGEFA, ILLEGALLY UNDER
THE FORTRAN RULES IGNORING THE SGEFA/DGEFA CODE THAT IS PRESENT IN THE
LINPACK BENCHMARK.  There were also pre-processors that stripped out
SGEFA/DGEFA or SAXPY/DAXPY, so that they would be linked in from libraries
rather than compiled.  This is a fine optimization technique IF CONTROLLED
BY THE USER -- that was the whole REASON LINPACK was written using
two levels of libraries.  But the compiler CANNOT know without looking
at the submitted code whether these cheats will produce correct results.
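
The mechanism, as described, amounts to matching on names and argument
counts rather than proving anything about the submitted code.  A
deliberately crude sketch (all details hypothetical):

	#include <stdio.h>
	#include <string.h>

	struct routine {
	    const char *name;
	    int         nargs;
	    /* parameter types, body, ... */
	};

	/* What the cheating compiler effectively did: match the NAME and
	   arity of SGEFA/DGEFA(A,LDA,N,IPVT,INFO) and branch to canned
	   machine code, ignoring the user's own version of the routine --
	   which the Fortran rules do not permit. */
	int looks_like_linpack(const struct routine *r)
	{
	    return (strcmp(r->name, "SGEFA") == 0 ||
	            strcmp(r->name, "DGEFA") == 0) && r->nargs == 5;
	}

	int main(void)
	{
	    struct routine r = { "DGEFA", 5 };
	    printf("%d\n", looks_like_linpack(&r));
	    return 0;
	}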

2. Another cheat was once pulled on the 1000-equation LINPACK benchmark.
What this vendor did was to use a Gauss-Jordan reduction, where all the
vectors are of length 1000, rather than the decreasing-length algorithm
normally used.  This in itself is legal; you can use any algorithm you
want to.  Unfortunately, the G-J reduction uses twice as many operations
as the standard one, so when they computed their MEGAFLOPS, they divided
how many FLOPS they had done by the time it took.  But the
Dongarra rules allow you to claim only the FLOPS required by the 
STANDARD LINPACK algorithm.  So they claimed twice the speed they were
allowed to.  In fact, they also ran the benchmark in 32 bits instead of
64, to get another 2X speed boost.  It turns out that the matrix in
the LINPACK benchmark was ill-conditioned enough to blow up on the 32-bit
arithmetic and Dongarra got suspicious.
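
The arithmetic of the cheat, in a few lines (the run time below is
hypothetical; the point is the ratio).  The Dongarra rules compute
MFLOPS from the standard operation count, 2n^3/3 + 2n^2, regardless
of the algorithm actually run:

	#include <stdio.h>

	int main(void)
	{
	    double n = 1000.0;
	    double std_flops = 2.0 * n * n * n / 3.0 + 2.0 * n * n;
	    double gj_flops  = 2.0 * std_flops; /* G-J does ~2x the work,
	                                           per the account above */
	    double seconds   = 1.0;             /* hypothetical run time */

	    printf("allowed: %.1f MFLOPS\n", std_flops / seconds / 1.0e6);
	    printf("claimed: %.1f MFLOPS\n", gj_flops  / seconds / 1.0e6);
	    return 0;
	}

Dividing your own (inflated) operation count by the run time doubles
the reported number without the machine getting any faster.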

Even discounting blatant cheats like the ones I describe above, there
is a wider issue here about how much COMPILE time a user is willing 
to expend for "perfect" compiles.  It is possible using techniques of
whole-program compiling to optimize out all the unnecessary checks in
the LINPACK benchmark -- it's simple enough and well-structured enough
so that a compiler can "realize" that DAXPY is always called with inc-
rement 1, that the DAXPY can be in-lined, that N is always greater than
zero, that one-trip DO loops will work, etc.  But these optimizations
take time -- so much time that on real codes larger than LINPACK, one 
of two things will happen:

1. The user will think the compiler has gone into an endless loop and
zap the compile, or
2. The compiler writer, probably interfacing with Marketing, will
build in lots of escapes into the optimizer so that compile times
appear to be finite.

There ARE users who claim they will tolerate long optimization times;
my experience in Marketing was that such were few and far between.  
The main uses for such a compiler are large third-party application
codes that are seldom compiled -- once or twice a year for the ones I
worked with.  But the size of these codes works AGAINST the whole-
program compiler!

mash@mips.COM (John Mashey) (01/04/91)

In article <MCCALPIN.91Jan3100122@pereland.cms.udel.edu> mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
>>>>>> On 3 Jan 91 06:18:59 GMT, mash@mips.COM (John Mashey) said:
>mash> 	at least with Whetstone, Dhrystone, and LINPACK.
>
>So what optimizations have been performed on the LINPACK 100x100 code
>that are "absolutely useless" in real life?

As reported in Patterson & Hennessy, page 47, "An extreme example of such targeted
engineering employed compiler optimizations that were benchmark sensitive.
...a person at one company used a preprocessor that scanned the text for
keywords to try to identify benchmarks by looking at the name of the
author and the name of a key subroutine."

Although it doesn't say so in P&H, this was a search for "DONGARRA".
To be fair, one might call this outright cheating rather than a real
optimization.  Certainly, it is at the extreme useless end of the scale
of:
	a) Hunt for Dongarra
	b) Hunt for the inner loop, and character-match, and if so,
	generate optimal code.
	c) Do all of the analysis like you should, and generate optimal
	code.

For Whetstone, the classic one is recognizing a trigonometric identity,
and thus avoiding a fairly expensive computation; for Dhrystone, it is
the strcpy stuff, or earlier, removal of dead code.
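
As an illustration of the Whetstone case (this particular identity is
an example of the kind of rewrite involved, not necessarily the exact
one any vendor shipped): 2*sin(x)*cos(x) == sin(2*x), so one library
call can replace two:

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
	    double x = 0.75;
	    double as_written = 2.0 * sin(x) * cos(x);  /* two calls */
	    double rewritten  = sin(2.0 * x);           /* one call  */
	    printf("%.17g\n%.17g\n", as_written, rewritten);
	    return 0;
	}

A rewrite like this speeds up a transcendental-heavy synthetic loop
enormously while doing nothing for typical user code.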
-------
Reinhold Weicker has published some good analyses of common benchmarks;
there was a talk for Slater's Microprocessor Forum last Spring,
and one for RISC'90 in Karlsruhe in November.  I think there's one
in the current (or immediate past) issue of COMPUTER: maybe somebody
can cite the exact reference, as I don't have it handy.
-------
To my knowledge, no one has so far found anything quite like these things
in the SPEC benchmarks.  The closest case is matrix300, where most of the
time is spent inside a small loop.  This benchmark was included,
rightly or wrongly, as it at least thrashed the data cache around, and the
inner loop is representative of some real calculations; certainly it was
argued about a lot...  For most of the SPEC benchmarks, even if you knew
that you were looking at that benchmark, it wouldn't do you much good.
I.e., exactly what good does it do to know you're compiling gcc?
In some cases, about the only hack you could do would be to
generate a program that emitted a copy of the required output, and did nothing
else.  I believe this would be fairly easy to discover :-)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

borasky@ogicse.ogi.edu (M. Edward Borasky) (01/04/91)

In article <44466@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>To my knowledge, no one has so far found anything quite like these things
>in the SPEC benchmarks.  The closest case is matrix300, where most of the
>time is spent inside a small loop.  This benchmark was included,
>rightly or wrongly, as it  at least thrashed the data cache around
Correct me if I'm wrong, but the matrix300 is Jack Dongarra's 300-
equation LINPACK benchmark, isn't it?  If so, the inner loop is matrix
multiply, unrolled 1, 2, 4, 8 and 16 ways.  At least one compiler (FPS
Model 500) recognizes the 1-way unrolled case as a matrix-vector 
multiply and calls in a hand-optimized matrix multiply routine.  In 
this case, cache is not relevant because the hand-optimized routine 
knows about the memory hierarchy.
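
For readers who haven't seen it, the pattern in question is essentially
a matrix-vector multiply inner loop; a rough C sketch (the actual
matrix300 source is Fortran and differs in detail):

	#include <stdio.h>

	#define N 300

	static double a[N][N], x[N], y[N];

	/* The generic shape a compiler might recognize and replace with
	   a blocked, cache-aware library routine -- defensible if the
	   match is structural, dubious if it merely fingerprints the
	   benchmark. */
	void matvec(void)
	{
	    for (int i = 0; i < N; i++) {
	        double sum = 0.0;
	        for (int j = 0; j < N; j++)
	            sum += a[i][j] * x[j];
	        y[i] = sum;
	    }
	}

	int main(void)
	{
	    for (int i = 0; i < N; i++) {
	        x[i] = 1.0;
	        for (int j = 0; j < N; j++)
	            a[i][j] = 1.0;
	    }
	    matvec();
	    printf("%f\n", y[0]);
	    return 0;
	}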

On the subject of LINPACK cheating, we really ought to get Jack
Dongarra involved in the discussion.  Unfortunately, he HATES to talk
about his experiences with vendors in this area.  Every vendor had
somebody who was ever vigilant for cases of OTHER vendors cheating; I
filled that function for FPS at one time.  
>In some cases, about the only hack you could do would be to
>generate a program that emitted a copy of the required output, and did nothing
>else.  
Do you believe that vendors are working on this?  If so, it would be
relatively easy to defeat using techniques of public-key cryptography
(which by the way is an interesting application for BENCHMARKING!!!!)

mash@mips.COM (John Mashey) (01/04/91)

In article <15577@ogicse.ogi.edu> borasky@ogicse.ogi.edu (M. Edward Borasky) writes:
>In article <44466@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>To my knowledge, no one has so far found anything quite like these things
>>in the SPEC benchmarks.  The closest case is matrix300, where most of the
>>time is spent inside a small loop.  This benchmark was included,
>>rightly or wrongly, as it  at least thrashed the data cache around
>Correct me if I'm wrong, but the matrix300 is Jack Dongarra's 300-
>equation LINPACK benchmark, isn't it?  If so, the inner loop is matrix
>multiply, unrolled 1, 2, 4, 8 and 16 ways.  At least one compiler (FPS
>Model 500) recognizes the 1-way unrolled case as a matrix-vector 
>multiply and calls in a hand-optimized matrix multiply routine.  In 
>this case, cache is not relevant because the hand-optimized routine 
>knows about the memory hierarchy.

Well, like I said, that's why matrix300 was on the edge.  However,
I don't even mind this too much, if the compiler recognizes the case
generically (as opposed to special-case recognition), given that
certain kinds of code really do spend a lot of time in such loops.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

gideony@microsoft.UUCP (Gideon YUVAL) (01/08/91)

I notice (Myers & Budde's 80960 book, p.5, footnote) "As a historical
note, the Dhrystone benchmark was originally developed as part of the
80960 development program". It may therefore be that Dhrystone's
love for register-window machines (i.e. 960 and SPARC) is not 100%
an accident.
-- 
Gideon Yuval, gideony@microsoft.UUCP, 206-882-8080 (fax:206-883-8101;TWX:160520)