[comp.benchmarks] Which benchmarks are useless?

mark@mips.com (Mark G. Johnson) (04/20/91)

In article <1991Apr20.083301.28886@ux1.cso.uiuc.edu> andreess@mrlaxa.mrl.uiuc.edu (Marc Andreessen) writes:
    >I've been reading this newsgroup since its formation.  It spends
    >about 95% of its time spewing out a plethora of meaningless numbers
    >on meaningless and trivial little pseudo-hacks which don't deserve
    >to be called benchmarks by any stretch of the imagination.
    >
    >These numbers are useless.
    >

Useless?  Meaningless?  I would suggest that the words "useless" and/or
"meaningless" be reserved for benchmarks that produce results that are
uncorrelated (absolute value of correlation coefficient < 0.2) with
"correct benchmark results".

Whatever "correct benchmark results" are.

Since Andreessen is at the University of Illinois, let's pretend,
temporarily, that the Illinois Perfect Club is the definition of a
correct benchmark.  Then a useless benchmark is one which, after being
run on a large subset of the same machines as have run the Illinois
Perfect Club, produces a correlation coefficient r in the range
(-0.2 < r < 0.2)  -- that is, the candidate benchmark's results are
uncorrelated with the Illinois Perfect Club results.
	
We could go further and define a "misleading" benchmark as one which
produces a correlation coefficient that is large and negative, i.e.
ranks machines in the wrong order (compared to the Illinois Perfect
Club, our temporary definition of a "correct benchmark").
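
The proposed test is mechanical enough to write down.  A minimal sketch
(modern Python, purely illustrative; the 0.2 threshold and the labels are
the definitions proposed above):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r between two lists of
    benchmark results measured on the same machines."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def classify(candidate, reference):
    """Apply the definitions proposed above: |r| < 0.2 means the
    candidate benchmark is "useless"; a negative r outside that band
    means it is "misleading"; anything else is at least not ruled out."""
    r = pearson_r(candidate, reference)
    if abs(r) < 0.2:
        return "useless"
    if r < 0:
        return "misleading"
    return "possibly useful"
```

Feed it the candidate's results and the reference ("correct") results for
the same set of machines.  Note the sketch calls any negative r beyond the
0.2 band "misleading", a slightly looser reading than "large and negative".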

Unfortunately, the "Dates Per Second" benchmark is attempting to
investigate OS behavior, something the Illinois Perfect Club doesn't
measure directly.  So before declaring the DatesPerSecond benchmark
to be "useless", we need to define what the correct results are,
(measuring OS behavior), and then we can correlate the DatesPerSecond
results to the correct results.  Lacking such reference data, it is
premature and perhaps a bit unscientific to assert the DatesPerSecond
results are "useless".
-- 
 -- Mark Johnson	
 	MIPS Computer Systems, 930 E. Arques M/S 2-02, Sunnyvale, CA 94088-3650
	(408) 524-8308    mark@mips.com  {or ...!decwrl!mips!mark}

andreess@mrlaxs.mrl.uiuc.edu (Marc Andreessen) (04/21/91)

In article <2502@spim.mips.COM> mark@mips.com (Mark G. Johnson) writes:
>Since Andreessen is at the University of Illinois, let's pretend,
>temporarily, that the Illinois Perfect Club is the definition of a
>correct benchmark.  Then a useless benchmark is one which, after being
>run on a large subset of the same machines as have run the Illinois
>Perfect Club, produces a correlation coefficient r in the range
>(-0.2 < r < 0.2)  -- that is, the candidate benchmark's results are
>uncorrelated with the Illinois Perfect Club results.

You're joking.  I hope.  The Perfect Club (which I have nothing to
do with; my address does not include 'csrd.uiuc.edu') benchmarks only
one thing:  execution time of a given subset of commonly-used
scientific applications.  The hope is that by examining the Perfect Club
numbers, one might obtain an estimation of how fast one's own
similarly-coded application would execute on a given machine.

This says nothing about a broad range of interesting and not-so-interesting
facets of computation (e.g., dates per second).

If you really want to know how many dates per second your machine
can generate, that's fine, but the numbers are still useless. 
I'm a little bit more concerned with how fast I can execute a given 
scientific application than how many dates I can spew out.  Curiously 
enough, those concerns are shared by most of the scientific community; 
thus, Linpack, the Livermore Loops, and Perfect Club.

Having said that, I might further humbly submit that dates/second is also
a useless measure of OS performance.  This should be obvious; if not,
go back and read the original posting, wherein the author lists a wide
range of reasons this hack is useless as a benchmark.

Marc

-- 
Marc Andreessen___________University of Illinois Materials Research Laboratory
Internet: andreessen@uimrl7.mrl.uiuc.edu____________Bitnet: andreessen@uiucmrl

auvsaff@auvsun1.tamu.edu (David Safford) (04/22/91)

In article <2502@spim.mips.COM>, mark@mips.com (Mark G. Johnson) writes:
|>In article <1991Apr20.083301.28886@ux1.cso.uiuc.edu> andreess@mrlaxa.mrl.uiuc.edu (Marc Andreessen) writes:
|>    >I've been reading this newsgroup since its formation.  It spends
|>    >about 95% of its time spewing out a plethora of meaningless numbers
|>    >on meaningless and trivial little pseudo-hacks which don't deserve
|>    >to be called benchmarks by any stretch of the imagination.
|>    >
|>    >These numbers are useless.
|>    >
|>
|>Useless?  Meaningless?  I would suggest that the words "useless" and/or
|>"meaningless" be reserved for benchmarks that produce results that are
|>uncorrelated (absolute value of correlation coefficient < 0.2) with
|>"correct benchmark results".
|>

--- stuff deleted

|> -- Mark Johnson	
|> 	MIPS Computer Systems, 930 E. Arques M/S 2-02, Sunnyvale, CA 94088-3650
|>	(408) 524-8308    mark@mips.com  {or ...!decwrl!mips!mark}
                                         
This is a joke, right?

You are not seriously saying that statistical correlation signifies meaning,
are you???

A statistical correlation is necessary, but certainly not sufficient, to
indicate a meaningful relationship.  Just because the per capita consumption
of M&M's correlates with the rate of bank failures does not mean there is any
meaningful relationship.  That is an apples-and-oranges comparison.

The "date" "benchmark" is a meaningful measure of only 2 things: (1) whether
the given system uses dynamic linking, and (2) how fast date runs, given (1).

(1) is more easily determined other ways
(2) is not important to me

Let's please drop these meaningless tests, and look at more important
problems, such as network, graphics, and multiprocessing benchmark
approaches.

dave safford
auvsaff@auvsun1.tamu.edu
Texas A&M University

rbart@shakespyr.Pyramid.COM (Bob Bart) (04/23/91)

In article <15080@helios.TAMU.EDU>, auvsaff@auvsun1.tamu.edu (David
Safford) writes:
|> Let's please drop these meaningless tests, and look at more important
|> problems, such as network, graphics, and multiprocessing benchmark
|> approaches.

Bravo, I second the motion. I'm tired of all the nonsense benchmark
discussions.

      -#-------  Bob Bart -  Performance Analyst                  415-335-8101
    ---###-----  Pyramid Technology Corporation          pyramid!pyrnova!rbart
  -----#####---  Mountain View, CA  94039            rbart@pyrnova.pyramid.com
-------#######-  U.S.A.

terry@venus.sunquest.com (Terry R. Friedrichsen) (04/23/91)

auvsaff@auvsun1.tamu.edu (David Safford) writes:
>mark@mips.com (Mark G. Johnson) writes:

>|>I would suggest that the words "useless" and/or
>|>"meaningless" be reserved for benchmarks that produce results that are
>|>uncorrelated (absolute value of correlation coefficient < 0.2) with
>|>"correct benchmark results".

>You are not seriously saying that statistical correlation signifies meaning,
>are you???
>
>A statistical correlation is necessary, but certainly not sufficient, to
>indicate a meaningful relationship.  Just because the per capita consumption
>of M&M's correlates with the rate of bank failures does not indicate any
>meaningful relationship.  This is an apple and oranges comparison.

No, and he didn't say that.  Read what he wrote:  "useless == uncorrelated".
He did NOT say "correlated == useful".  You yourself admit that correlation
is a necessary, though insufficient, condition.

One of the real problems I note in reading this group is that there are a
LOT of folks who are weak on logic, in interpretation of benchmarks as well
as other postings.

>The "date" "benchmark" is a meaningful measure of only 2 things: (1) whether
>the given system uses dynamic linking, and (2) how fast date runs, given (1).
>
>(1) is more easily determined other ways
>(2) is not important to me

There are a couple of problems here:  first, the benchmark also indicates
the speed of the operating system implementation's fork()/exec() services.
That SHOULD be important to a lot of folks, with the exception of pure
number-crunchers.  Take John Hascall's suggestion and replace date(1) with
true(1), and you'll do even better at measuring fork()/exec() (modulo the
dynamic linking issue, of course).
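
The true(1) variant is easy to sketch as a timing loop (modern Python,
illustrative only; each subprocess.run() is a fork+exec+wait, plus Python's
own overhead, so the number is only meaningful as a relative measure, and
true(1) is assumed to be on the PATH):

```python
import subprocess
import time

def spawns_per_second(cmd, n=50):
    """Time n fork+exec+wait cycles of cmd and return the rate."""
    t0 = time.perf_counter()
    for _ in range(n):
        subprocess.run(cmd, stdout=subprocess.DEVNULL)
    return n / (time.perf_counter() - t0)
```

Comparing spawns_per_second(["true"]) against spawns_per_second(["date"])
separates process-creation cost from whatever work date(1) itself does.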

Second, since date(1) must do integer calculations, how fast it runs MUST
be important to you, unless you do nothing but floating-point arithmetic.
If I handed you a machine that took 47 seconds to run date, I suspect that
you would find other aspects of its performance equally repulsive.  Weak
logic here again.

On the other hand, the matrix300 benchmark is of absolutely no importance
to me, since my matrices are a different size ;-).

Terry R. Friedrichsen

terry@venus.sunquest.com  (Internet)
uunet!sunquest!terry	  (Usenet)
terry@sds.sdsc.edu        (alternate address; I live in Tucson)

Quote:  "Do, or do not.  There is no 'try'." - Yoda, The Empire Strikes Back

auvsaff@auvsun1.tamu.edu (David Safford) (04/23/91)

In article <18049@sunquest.UUCP>, terry@venus.sunquest.com (Terry R.
Friedrichsen) writes:
|>auvsaff@auvsun1.tamu.edu (David Safford) writes:
|>>mark@mips.com (Mark G. Johnson) writes:
|>>
|>>|>I would suggest that the words "useless" and/or
|>>|>"meaningless" be reserved for benchmarks that produce results that are
|>>|>uncorrelated (absolute value of correlation coefficient < 0.2) with
|>>|>"correct benchmark results".
|>>
|>>You are not seriously saying that statistical correlation signifies meaning,
|>>are you???
|>>
|>>A statistical correlation is necessary, but certainly not sufficient, to
|>>indicate a meaningful relationship.  Just because the per capita consumption
|>>of M&M's correlates with the rate of bank failures does not indicate any
|>>meaningful relationship.  This is an apple and oranges comparison.
|>
|>No, and he didn't say that.  Read what he wrote:  "useless == uncorrelated".
|>He did NOT say "correlated == useful".  You yourself admit that correlation
|>is a necessary, though insufficient, condition.
|>

YOU read it again, slowly and carefully.  He said that "useless" should
be RESERVED for uncorrelated benchmarks.  This means that if a benchmark
does correlate, it cannot be called useless, as this is RESERVED for 
uncorrelated.  Using your nomenclature, he said "correlated != useless".
Using the reasonable assumption that the sets of "useful" and "useless"
benchmarks are disjoint and complementary, we can say that 
"!= useless" == "useful", and thus his statement reduces to
"correlated == useful".  To support the assumption that "useful" and
"useless" are disjoint and complementary, attempt a contradiction.
Suppose that the sets are not disjoint.  This means that there must
exist at least one benchmark that is both "useful" and "useless", clearly
a contradiction.  Suppose that there exists a benchmark that is neither
"useful", nor "useless".  But by definition, since this benchmark is not
"useful", it must be "useless".  Thus the reduction "!= useless" == "useful"
must hold. QED.
So he DID say, using your nomenclature, "correlated == useful".

|>One of the real problems I note in reading this group is that there are a
|>LOT of folks who are weak on logic, in interpretation of benchmarks as well
|>as other postings.
|>

You provided an excellent example!                                      
Beam me up, Scotty, there's no intelligent life in THIS newsgroup.

aburto@marlin.NOSC.MIL (Alfred A. Aburto) (04/26/91)

A question I have (for a test, instead of that 'useless', 'meaningless' 
discussion) is:  

What is the correlation between Dhrystone 2.1 results and Integer 
SPECmarks?  How 'bad' is Dhrystone really compared to Integer SPECmarks?
Don't really need to compute a correlation, but just show a table of
comparable results (Integer SPECmark results vs Dhrystone results relative
to VAX-11/780).

What if I took 4 (or N) integer programs (different from those used by SPEC)
and ran them on various systems and computed performance relative to
the VAX-11/780. Would these integer results agree with the integer 
SPECmark results for the same systems? Would they even be close?

Al Aburto
aburto@marlin.nosc.mil

mash@mips.com (John Mashey) (04/26/91)

In article <1749@marlin.NOSC.MIL> aburto@marlin.NOSC.MIL (Alfred A. Aburto) writes:
>What is the correlation between Dhrystone 2.1 results and Integer 
>SPECmarks?  How 'bad' is Dhrystone really compared to Integer SPECmarks?
>Don't really need to compute a correlation, but just show a table of
>comparable results (Integer SPECmark results vs Dhrystone results relative
>to VAX-11/780).
I don't have the numbers handy, and am about to go out of town again.
However, there are a number of combinations where Dhrystone would predict
that machine A is 25% faster than machine B, but on SPEC integer,
machine B is 25% faster than machine A, or equivalent combinations where
the prediction is 50% off.  Combinations like this include RS/6000 vs
MIPS, or Intel i860 vs MIPS, at appropriate clock rates.  A particular
case is RS/6000 Model 320, which SPECints around 16, but Dhrystone (1.1)
is around 27.5, versus MIPS Magnum (25Mhz, not the newer 33s), which
has SPECint at 19.5, but has a lower Dhrystone than the RS/6000.
If I find time, I'll dig out the numbers, but I've seen enough data over
the years to have stopped collecting it.  What it said was:
	a) Dhrystone ALWAYS gives a higher VAX-mips rating than SPECint.
	(except maybe the VAX-11/780 :-)  1.1 is worse (higher) than 2.1,
	but 2.1 is high also.  The ratio ranges from about 1.1 up to at
	least 1.6, maybe even as high as 2X.
	b) The Dhrystone:SPECint ratios grossly track within a single
	product line, except that small-cache machines of a family look
	better on Dhrystone than on SPECint.
>
>What if I took 4 (or N) integer programs (different than used by SPEC)
>and ran them on various systems and computed performance relative to
>the VAX-11/780. Would these integer results agree with the integer 
>SPECmark results for the same systems? Would they even be close?
Depends on the benchmarks.  If you look at the data, you find that MIPS's
mips-ratings are rather close to SPECint, and the reason is that the
set of benchmarks we used internally for the integer side (which
actually include much worse cache-busters than SPECint, and account for
a few billion cycles of execution) .... correlates with SPECint to within
10% or closer ... and they existed BEFORE SPEC.  Of course, one of the
benchmarks (espresso) was included in both.
Anyway, the answer is: if you run substantive integer benchmarks,
single-user, I think SPECint is a pretty good predictor.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94088-3650

aburto@marlin.NOSC.MIL (Alfred A. Aburto) (04/30/91)

In article <2717@spim.mips.COM> mash@mips.com (John Mashey) writes:

>I don't have the numbers handy, and am about to go out of town again.
>However, there are a number of combinations where Dhrystone would predict
>that machine A is 25% faster than machine B, but on SPEC integer,
>machine B is 25% faster than machine A, or equivalent combinations where
>the prediction is 50% off.  Combinations like this include RS/6000 vs
>MIPS, or Intel i860 vs MIPS, at appropriate clock rates.  A particular
>case is RS/6000 Model 320, which SPECints around 16, but Dhrystone (1.1)
>is around 27.5, versus MIPS Magnum (25Mhz, not the newer 33s), which
>has SPECint at 19.5, but has a lower Dhrystone than the RS/6000.
>If I find time, I'll dig out the numbers, but I've seen enough data over
>the years to have stopped collecting it.  What it said was:
>	a) Dhrystone ALWAYS gives a higher VAX-mips rating than SPECint.
>	(except maybe the VAX-11/780 :-)  1.1 is worse (higher) than 2.1,
>	but 2.1 is high also.  The ratio ranges from about 1.1 up to at
>	least 1.6, maybe even as high as 2X.
>	b) The Dhrystone:SPECint ratios grossly track within a single
>	product line, except that small-cache machines of a family look
>	better on Dhrystone than on SPECint.

Here are some Dhrystone 1.1 and integer SPEC program comparison results
I gathered. 

The Dhrystone 1.1 results came from an article by Walter Price of Motorola
('A Benchmark Tutorial', IEEE MICRO, Oct 1989, page 28).  In the table below
'D/S' are the Dhrystone(1.1)/Sec results. I used the PEAK Dhrystone 1.1
results from Price's article for each system in the table. I used the peak
numbers because I didn't want problems that might happen by posting the
low numbers. Also the peak numbers were more consistent. That is, people
tended to report the peak number for both the low AND high Dhrystone results
in Price's article.  So the low numbers tend to be doo-doo and the high
numbers more reasonably consistent.

As indicated in the table, Dhrystone 1.1 ratio results are greater than the
Integer SPEC ratio results by 14% to 24%, with an average of 21%.
This is pretty much as you indicated, but I didn't find any really abnormal
results (probably due to a lack of enough data). Dhrystone 2.1 results
would be useful too, but I don't have a data-base .....

An interesting result is the set of correlation coefficients across the various
systems. The Dhrystone 1.1 ratios correlate rather well (0.90 to 0.99)
with all 4 SPECratios and the SPECint (Geometric mean of SPECratio results).
What this indicates, relative to the results in the table below, is that
Dhrystone 1.1 predicts RELATIVE PERFORMANCE across the 10 systems examined
just as well as GCC, espresso, li, eqntott, and SPECint.  The correlation
in performance prediction between these various programs is quite strong
despite the fact that they are all really quite different programs with
different instruction mixes. I suppose this makes some sense though
because a CPU's performance (relative to other CPUs) is generally improved 
not for a few instructions but for all instructions. This would tend to
make the correlation of performance ratios somewhat (there are no absolutes)
independent of the instruction mix and thus the type of program.

                            Dhrystone 1.1       SPECratio         SPECint
                            -------------  ---------------------- -------
System                MHz     D/S  Ratio    GCC   ESP    LI   EQN
DEC VAX 11/780        5.00   1870   1.0     1.0   1.0   1.0   1.0     1.0
HP 9000/340          16.67   6536   3.5     3.1   2.3   3.3   2.2     2.7
Sun 4/260            16.67  19900  10.6     9.9   7.8   9.1   8.3     8.7
Sun SPARCstation 1   20.00  22049  11.8    10.7   8.9   9.0   9.7     9.5
HP 9000/834          15.00  23441  12.5    10.2   8.9  11.7  10.1    10.2
MIPS RC2030          16.67  31200  16.7     8.6  11.8  14.2  11.5    11.3
DECstation 3100      16.67  26600  14.2    10.9  12.0  13.1  11.2    11.8
HP Apollo 10000      18.20  27000  14.4    12.8  12.9  11.1  11.1    11.9
SPARCstation 330     25.00  27777  14.9    13.8  11.6  11.2  12.6    12.3
MIPS M/120-5         16.67  31000  16.6    12.5  12.2  15.4  12.0    13.0
MIPS M/2000          25.00  47400  25.3    19.0  18.3  23.8  18.4    19.8
-------------------------------------------------------------------------
Arithmetic Mean                    14.1    11.1  10.7  12.2  10.7    11.1
Standard Deviation                  5.2     3.8   3.9   5.0   3.8     4.0
Correlation Coef WRT Dhry ratio    ----    0.90  0.98  0.98  0.98    0.99
Correlation Coef WRT GCC  ratio            ----  0.92  0.85  0.95    ----
Correlation Coef WRT ESP  ratio                  ----  0.93  0.98    ----
Correlation Coef WRT LI   ratio                        ----  0.94    ----

Percent 'Error' by Dhrystone       ----    21.3  24.1  13.5  24.1    ----
Relative to SPEC Integer Programs.
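
The headline correlation can be re-checked mechanically from the posted
columns (modern Python sketch; dhry and specint are the Dhrystone 'Ratio'
and 'SPECint' columns transcribed from the table above, with the VAX-11/780
baseline row excluded):

```python
import math

# Dhrystone 1.1 ratio and SPECint columns for the 10 non-VAX systems,
# transcribed from the table above.
dhry =    [3.5, 10.6, 11.8, 12.5, 16.7, 14.2, 14.4, 14.9, 16.6, 25.3]
specint = [2.7,  8.7,  9.5, 10.2, 11.3, 11.8, 11.9, 12.3, 13.0, 19.8]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two result lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

# Strong positive correlation, consistent with the 0.99 reported above.
r = pearson_r(dhry, specint)
```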

Al Aburto
aburto@marlin.nosc.mil

patrick@convex.COM (Patrick F. McGehearty) (04/30/91)

In article <1751@marlin.NOSC.MIL> aburto@marlin.NOSC.MIL (Alfred A. Aburto) writes:
>
>As indicated in the table Dhrystone 1.1 ratio results are greater than the
>Integer SPEC ratio results by 14% to 24% with an average of 21% greater.
...
Of the systems listed, I believe all but the DEC VAX 11/780 came after
Dhrystone started to be a widely quoted benchmark.  Because it is a fairly
small code, it is easy to use as a tool for focusing compiler tuning
efforts.  [After all, if you only have limited resources for tuning,
you might as well use them where they can be shown to make a difference]

I suspect that if DEC were to spend a few months tuning their C compiler
(or maybe just rerun their latest release on an 11/780), they also could
get another 20% out of the Dhrystone benchmarks.

I also would not be surprised to see the SPEC numbers improve in the next
few years for existing hardware with better compilers.  What gets measured
gets worked on.  The advantage of benchmark suites like SPEC (and for you
number crunchers, the Perfect Club and Slalom benchmarks) is that there is
such a variety of coding styles and usages that the improvements are likely
to benefit many real codes.  In some cases, special tricks will be found
that only benefit those codes, but for the most part, improvements will be
made that help many programs run faster.

Back to the issue that started this discussion:

When you propose a new benchmark, consider how vendors will respond if
people start using it for serious competitive evaluation.  [If you don't
want people to use it, why are you proposing it??]  Will it encourage
the vendors to improve the things you want improved?  If not, can it
be changed to do so?  Show it to a few people (with DRAFT, DO NOT DUPLICATE
marked all over it), and get their feedback.  Then ask yourself again if it
is useful.

The 'date' benchmark has a number of serious flaws.  A key one is that
if the date operation were added to the command shell, it would go many
times faster.  Since the intent is to measure process spawning time,
... well, you get the point.
A similar thing happened to the getpid system call.  Some people at Berkeley
wanted to know how fast a trivial system call was, so they could tune the
syscall interface.  They wrote a loop to call getpid() many times.  This
test was appropriate for their purposes.  Later, this test (and many others)
was made generally available.  Some vendors chose to speed up this test by
caching the process id in user space on the first getpid(), and avoiding the
system call overhead for subsequent getpid()'s.  There is nothing wrong with
that optimization, just that it does nothing for real user programs.
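
The Berkeley-style loop is easy to reconstruct (modern Python sketch;
whether it times a real kernel entry or a cached value depends entirely on
the C library underneath, which is precisely the distortion described):

```python
import os
import time

def getpid_rate(n=100_000):
    """Call getpid() n times and return calls per second.  On a libc
    that caches the pid in user space, this times a library call and a
    memory read, not the syscall interface it was meant to measure."""
    t0 = time.perf_counter()
    for _ in range(n):
        os.getpid()
    return n / (time.perf_counter() - t0)
```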

The main reason I don't like the date benchmark is that it encourages me (as
vendor) to fix the wrong things.  In addition, as a user, the benchmark does
me little good, because I have little confidence that it will measure the
same things I care about (at least it won't after the vendors start working
on it if they take it seriously).  The same reasoning applies to the 'bc'
benchmark which ran through this news stream a while ago.

mash@mips.com (John Mashey) (04/30/91)

In article <1751@marlin.NOSC.MIL> aburto@marlin.NOSC.MIL (Alfred A. Aburto) writes:
>In article <2717@spim.mips.COM> mash@mips.com (John Mashey) writes:

>>I don't have the numbers handy, and am about to go out of town again.
>>However, there are a number of combinations where Dhrystone would predict
>>that machine A is 25% faster than machine B, but on SPEC integer,
>>machine B is 25% faster than machine A, or equivalent combinations where
>>the prediction is 50% off.  Combinations like this include RS/6000 vs
>>MIPS, or Intel i860 vs MIPS, at appropriate clock rates.  A particular
>>case is RS/6000 Model 320, which SPECints around 16, but Dhrystone (1.1)
>>is around 27.5, versus MIPS Magnum (25Mhz, not the newer 33s), which
>>has SPECint at 19.5, but has a lower Dhrystone than the RS/6000.
>>If I find time, I'll dig out the numbers, but I've seen enough data over
>>the years to have stopped collecting it.  What it said was:
>>	a) Dhrystone ALWAYS gives a higher VAX-mips rating than SPECint.
>>	(except maybe the VAX-11/780 :-)  1.1 is worse (higher) than 2.1,
>>	but 2.1 is high also.  The ratio ranges from about 1.1 up to at
>>	least 1.6, maybe even as high as 2X.
>>	b) The Dhrystone:SPECint ratios grossly track within a single
>>	product line, except that small-cache machines of a family look
>>	better on Dhrystone than on SPECint.

>Here are some Dhrystone 1.1 and integer SPEC program comparison results
>I gathered. 
>
>The Dhrystone 1.1 results came from an article by Walter Price of Motorola

>As indicated in the table Dhrystone 1.1 ratio results are greater than the
>Integer SPEC ratio results by 14% to 24% with an average of 21% greater.
>This is pretty much as you indicated, but I didn't find any really abnormal
>results (probably due to a lack of enough data). Dhrystone 2.1 results
>would be useful too, but I don't have a data-base .....
For the high ones, look at IBM RS/6000, i860, or maybe Motorola 88K.
AMD would probably be high, but doesn't have SPEC result published,
to my knowledge.
Note that IBM's 27.5/15.8 = 1.7+ ...  and I think you'll find the i860 is
probably up there as well ... and the DG workstation labeled 17 mips
gets around 10 on SPECint ...

I don't think we disagree, except in choice of data. Of the data points,
the VAX is = 1 by definition.
4 of them are MIPS machines
1 is an HP [which is fairly similar to MIPS, and shares some roots in
	similar compiler technology]
3 are SPARCs

I.e., as I said, within product lines you expect that the major determiner
of speed is clock rate, and Dhrystone will show you that.


As a minor point, for whatever reason, most of the dhrystone-vax-mips
ratings in the world assume VAX-11/780 = 1,757 1.1 Dhrystones,
which slightly raises the numbers everywhere.
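
The effect of the baseline choice is simple arithmetic (Python sketch; the
47,400 Dhrystones/sec figure is the MIPS M/2000 peak from the table posted
earlier in the thread):

```python
def vax_mips(dhrystones_per_sec, baseline=1757.0):
    """Dhrystone VAX-mips rating: the machine's Dhrystones/sec divided
    by the Dhrystones/sec figure assumed for the VAX-11/780."""
    return dhrystones_per_sec / baseline

m2000 = 47400.0                    # MIPS M/2000 peak, Dhrystone 1.1
common = vax_mips(m2000)           # ~27.0 with the widespread 1757 figure
price = vax_mips(m2000, 1870.0)    # ~25.3 with Price's 1870 figure
```

The same machine rates about 6% lower against the 1870 baseline, which is
the "slightly raises the numbers everywhere" effect in the other direction.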

The major issue is (just to make sure people aren't confused by the posted
table):
IF you pick two machines at random, A and B:
	a) Dhry(A) and Dhry(B) will both give vax-mips ratings that are high.
	b) Dhry(A)/Dhry(B) will give reasonably good correlations with
	SPEC(A)/SPEC(B), especially if A and B are from same family or
	are related.
Unfortunately, there are also plenty of data points, specifically with
machines that included instructions to help strcpy, or have done
certain optimizations, where you can easily pick points where:
	Dhry(A) > Dhry(B) and SPEC(A) < SPEC(B), by a substantial margin.
Thus, one must be careful to distinguish between the 2 statements:
	a) There is a good correlation between Dhrystone and SPEC
	(TRUE, in general, especially if you include the vast numbers of
	X86-based products usually listed).
AND
	b) Dhrystone is a good enough predictor of SPEC that you don't need
	to run SPEC ....
	FALSE, in practice, because you can get terribly surprised, because
	a significant number of recent machines and architectures
	are OUTLIERs when you do a scatter plot of this kind of data.
Now, of course, whether this matters or not depends on whether or not you
think SPECint correlates with anything else :-)
	(As it happens, at least some of us think it does, because it 
	correlates closer than dhrystone does with numerous large internal
	single-user benchmarks.)
Anyway, maybe someone can put together a table with
	1 MIPS
	1 SPARC
	1 RS/6000
	1 i860
	1 HP
	1 88K
	1 68K
	1 486
and using the more common 1757 number...
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94088-3650

jvm@hpfcso.FC.HP.COM (Jack McClurg) (04/30/91)

/ hpfcso:comp.benchmarks / patrick@convex.COM (Patrick F. McGehearty) /  4:51 pm  Apr 29, 1991 /
> A similar thing happened to the getpid system call.  Some people at Berkeley
> wanted to know how fast a trivial system call was, so they could tune the
> syscall interface.  They wrote a loop to call getpid() many times.  This
> test was appropriate for their purposes.  Later, this test (and many others)
> was made generally available.  Some vendors chose to speed up this test by
> caching the process id in user space on the first getpid(), and avoiding the
> system call overhead for subsequent getpid()'s.  There is nothing wrong with
> that optimization, just that it does nothing for real user programs.

While I agree with just about everything else that Mr. McGehearty says in his
response, the statement above is (as far as I know about HP) just speculation.
The getpid system call is not cached on HP-UX, but getpid performance
was so anomalously better than expected that the trick above was
assumed by the author of a paper comparing operating system performance with
performance on common benchmarks.  I cannot remember the name of the paper,
but it pointed out some disconcerting (to me) areas where OS performance does
not keep up with what should be expected from benchmarks.

Jack McClurg

aburto@marlin.NOSC.MIL (Alfred A. Aburto) (05/02/91)

In article <2800@spim.mips.COM> mash@mips.com (John Mashey) writes:
>In article <1751@marlin.NOSC.MIL> aburto@marlin.NOSC.MIL (Alfred A. Aburto) writes:
>>In article <2717@spim.mips.COM> mash@mips.com (John Mashey) writes:
>For the high ones, look at IBM RS/6000, i860, or maybe Motorola 88K.
>AMD would probably be high, but doesn't have SPEC result published,
>to my knowledge.
>Note that IBM's 27.5/15.8 = 1.7+ ...  and I think you'll find the i860 is
>probably up there as well ... and the DG workstation labeled 17 mips
>gets around 10 on SPECint ...
>

Sorry about the delayed response but I was out of town too.
Unfortunately I don't have all that information.  I did see i860
Dhrystone 2.1 and SPECmark numbers posted in this news group but they
were not broken down like the other information I had so I didn't use that
i860 information (I needed Dhrystone 1.1, and the SPECratios for each of the
Integer SPEC programs). Anyone have this information for the i860 and
RS/6000 systems?

>I don't think we disagree, except in choice of data. Of the data points,
>the VAX is = 1 by definition.
>4 of them are MIPS machines
>1 is an HP [which is fairly similar to MIPS, and shares some roots in
>  similar compiler technology]
>3 are SPARCs
>
>I.e., as I said, within product lines you expect that the major determiner
>of speed is clock rate, and Dhrystone will show you that.

There were 3 HP systems and one was a 68030 type.  I thought I went across
product lines reasonably well, but it was not complete of course (things
never are, but that's good I guess).

I did the correlation with respect to clock rate across the 10 systems:

                             Dhrystone V1.1         SPECratio       SPECint
                                 Ratio        ----------------------
                                              GCC   ESP   LI    EQN
Correlation WRT Clock Speed:     0.54         0.69  0.54  0.45  0.60  0.58

The correlation with clock speed appears marginal, which is interesting.
There appear to be other things going on besides just clock speed
increases.

>As a minor point, for whatever reason, most of the dhrystone-vax-mips
>ratings in the world assume VAX-11/780 = 1,757 1.1 Dhrystones,
>which slightly raises the numbers everywhere.

Yes, I was aware of that, but I felt constrained to use the peak numbers
as given in that article, and the article indicated 1870 Dhrys/sec (1.1)
peak for the VAX 11/780.  I've seen the 1757 Dhrys/sec (V1.1)
referenced in IBM advertisements for their POWERstations, but that is all
I know about that number.

>The major issue is (just to make sure people aren't confused by the posted
>table):
>IF you pick two machines at random, A and B:
>
>       a) Dhry(A) and Dhry(B) will both give vax-mips ratings that are high.
>       b) Dhry(A)/Dhry(B) will give reasonably good correlations with
>          SPEC(A)/SPEC(B), especially if A and B are from same family or
>          are related.

Based on the results I'd say Dhry(A) and Dhry(B) yield VAX-MIPS ratings
that are 14% to 24% high WHEN COMPARED to SPECint(A) and SPECint(B) VAX-MIPS
ratings. I'd hesitate to infer anything beyond that as I'm still seeking
more information.

The results indicate that Dhry(A) / Dhry(B) ratios correlate strongly
(I would say 'strongly' rather than 'reasonably good') with SPEC(A) / SPEC(B)
ratios.  The correlations were rather high after all, with a minimum of
0.90 and a max of 0.98.  The average was 0.96, and the correlation with
SPECint was 0.99.  These high correlations may not hold up, though, if we
had a larger data base to examine.  I think we still need to sift through
more data.  The correlations were across several different CPU's, so
I don't agree with the 'especially ...' part of b) above based on these results.


>Unfortunately, there are also plenty of data points, specifically,
>with machines that included instructions to help strcpy, or have done
>certain optimizations, where you  easily pick points where:
>   Dhry(A) > Dhry(B) and SPEC(A) < SPEC(B), by a substantial margin.

Yes, this is very true but please note that there are also cases where
    Dhry(A) < Dhry(B) AND SPEC(A) > SPEC(B).  There is an example of this
in the table of results I posted. It is not a substantial difference but
still a difference.

By the way, I'm not 'down on SPEC', or 'up on Dhrystone'. I think SPEC is
the best thing that has happened to benchmarking recently. SPEC is certainly
developing an excellent data base on system performance. A verifiable,
repeatable, and solid data base.  Something we've needed for a long time.
Dhrystone results however are confusing, mostly because it is a small
program, cache sensitive, and it can be optimized to such a large extent.
So, yes, I agree one can get surprised and confused with the Dhrystone
results.  One needs to be very careful in using those numbers. I picked
the peak numbers only because they appeared to be more consistent. If I had
used the low numbers, or an average of the low and high, then I don't think
the results would have been nearly the same.

I was interested in the correlation of OTHER integer programs too with
SPECint. I used Dhrystone because results were readily available.

>Anyway, maybe someone can put together a table with
>   1 MIPS
>   1 SPARC
>   1 RS/6000
>   1 i860
>   1 HP
>   1 88K
>   1 68K
>   1 486
>and using the more common 1757 number...
>

Yes, the data must be available here and there, and it would be good to get
it all together in one place (here I hope) .....

Al Aburto
aburto@marlin.nosc.mil

wjb@cogsci.cog.jhu.edu (05/03/91)

In article <21720006@hpfcso.FC.HP.COM> jvm@hpfcso.FC.HP.COM (Jack McClurg)
 writes:
>patrick@convex.COM (Patrick F. McGehearty) /  4:51 pm  Apr 29, 1991 / wrote:
>> [benchmarks and caching of process id for getpid() calls]
>
>[disavows any such caching in HPUX]

	I've heard this "rumor" as well.  I always wondered why people
didn't try to test it out by comparing getpid() and getppid() times.  Your
PID may not ever change, but your PPID will change to 1 if your parent exits.
This would appear to require that getppid() enter the kernel.  Actually, the
real way to speed up both of these system calls is to map a page of memory
read-only into the data space of the process and have the program read this
data directly from where the kernel stores it.  I guess this is another
example of the dangers of micro benchmarks and making assumptions about how
different system functions are implemented.
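The getpid()-vs-getppid() comparison suggested above is easy to sketch.
This Python version is illustrative only: interpreter overhead dwarfs the
trap cost, so a real measurement belongs in C, but the shape of the test
is the same.

```python
import os
import time

def ns_per_call(fn, n=200_000):
    """Rough average cost of fn() in nanoseconds (includes loop overhead)."""
    t0 = time.perf_counter_ns()
    for _ in range(n):
        fn()
    return (time.perf_counter_ns() - t0) / n

t_getpid = ns_per_call(os.getpid)
t_getppid = ns_per_call(os.getppid)

# If getpid() were cached in user space while getppid() had to trap into
# the kernel, t_getppid would come out noticeably larger than t_getpid;
# comparable times suggest both calls actually enter the kernel.
```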

					Bill Bogstad

				

mash@mips.com (John Mashey) (05/03/91)

In article <1756@marlin.NOSC.MIL> aburto@marlin.NOSC.MIL (Alfred A. Aburto) writes:
>>As a minor point, for whatever reason, most of the dhrystone-vax-mips
>>ratings in the world assume VAX-11/780 = 1,757 1.1 Dhrystones,
>>which slightly raises the numbers everywhere.

>Yes, I was aware of that, but I felt constrained to use the peak numbers
>as given in that article and the article indicated 1870 Dhrys/sec (1.1)
>peak for the VAX 11/780.  I've seen the 1757 Dhrys/sec (V1.1)
>referenced in IBM advertisements for their POWERstations, but that is all
>I know about that number.
That's the problem, of course, with using base numbers that can change
around on you [which is why the SPEC base numbers are frozen for eternity].
Using 1757 as the base are {Sun, HP, IBM, Motorola, and many others}.
So, I'll list here various combinations, expressed both as SPECint
and as vax-mips-using-dhrystone-1.1, assuming 1757 is the base
(because all of the following assume that, I think):

SPECint	dhry-mips Ratio	System		Notes
16.4	47.8	2.7X	i860 @ 33Mhz	SPECint published for Intel Star860,
					Summer 90; Dhrys from March 1989
					i860 Performance Brief
15.8	27.5	1.7	RS/6000 320	SPEC #s and ads
12.9	20+	1.6	HP 9000/425s	68040@ 25Mhz  (I may be wrong on
					the 20, although that's what Moto
					usually calls a 25Mhz 68040, and the
					HP published Dhyr-mips number might
					actually be higher, but I can't seem
					to find it handy.)
					
26.4	41.5	1.6	MIPS Magnum/33	33MHz R3000A (NOTE: just to be
					perfectly clear, MIPSco has NEVER
					used a Dhrystone-mips rating as
					"the mips-rating" for a system.
					On our internal scale (our own set
					of benchmarks) this rates at 27,
					i.e., very close to SPECint).
38.1	57	1.5	HP Snakes,50Mhz	[57 is from all of the ads...]
17.7	27.0	1.5	i486 @ 33Mhz	Intel 486 Performance Brief, 1Q91.
20.7	28.5	1.4	Sun SS2		40Mhz Cypress SPARC

>Based on the results I'd say Dhry(A) and Dhry(B) yield VAX-MIPS ratings
>that are 14% to 24% high WHEN COMPARED to SPECint(A) and SPECint(B) VAX-MIPS
>ratings. I'd hesitate to infer anything beyond that as I'm still seeking
>more information.

Well, above we have:
	1 each of {IBM, MIPS, HP PA, SPARC}
	1 each of {i486, 68040}
which, I think, covers a fair chunk of current computers.....
And the LOWEST expansion was 40%.....

Now, Dhrystone 2.1 generally would deflate these numbers by approx 10%;
getting rid of strcpy-inlining would deflate by another 20-30%, more
in some cases.  [For example, over the last few years, the best way to
boost your Dhrystone was to put in strcpy/strcmp inlining, even though
that has seldom been found to help realistic programs very much.
The fiercest example of this is for the i860, whose code does the
following:
	starts with strcpy of a 30-byte constant string
	pads the string and the target to 32 bytes, which happens to be
		possible in this case
	aligns the string on 8-byte or better boundaries
	inlines the strcpy
	generates 2 16-byte loads and 2 16-byte stores, which, being to
		and from a writeback cache, with zero cache misses, goes fast.
Of course, this bears no resemblance to any realistic character-pushing,
and that fact rapidly shows up when you see the SPEC integer numbers...]

Fortunately, (except in some ads in the Wall Street Journal :-),
companies are emphasizing SPEC more and more these days.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94088-3650

meissner@osf.org (Michael Meissner) (05/04/91)

In article <02.May.91.132611.54@cogsci.cog.jhu.edu>
wjb@cogsci.cog.jhu.edu writes:

| 	I've heard this "rumor" as well.  I always wondered why people
| didn't try to test it out by comparing getpid() and getppid() times.  Your
| PID may not ever change, but your PPID will change to 1 if your parent exits.
| This would appear to require that getppid() enter the kernel.  Actually, the
| real way to speed up both of these system calls is to map a page of memory
| read-only into the data space of the process and have the program read this
| data directly from where the kernel stores it.  I guess this is another
| example of the dangers of micro benchmarks and making assumptions about how
| different system functions are implemented.

But of course if you fork (), your pid also changes......
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?

wjb@cogsci.cog.jhu.edu (Bill Bogstad) (05/04/91)

In article <MEISSNER.91May3135345@curley.osf.org>
meissner@osf.org (Michael Meissner) writes:
>>[I (Bill Bogstad wrote about timing getppid() vs. getpid()]

>But of course if you fork (), your pid also changes......

	Only sort of.  The parent's PID doesn't change.  The child just gets a
new one.

	The idea behind this micro benchmark was to measure the overhead of
a system call.  The getpid() and getppid() calls don't do any significant
work in the kernel and therefore would appear to be a good choice for
measuring this.  The problem is that getpid() COULD be written to cache the
PID since it only changes when you do a fork() (under the control of the
program itself).  A PID cache would work in the presence of fork() by having
the fork() run time library invalidate the PID cache within the context of
the process.  The parent process ID of a process can change completely
asynchronously with respect to the process itself.  There is no way for a
cache of the PPID to be maintained in the program's data space without
reference to information that only the kernel knows about.  You would have
to use something like a shared page of memory which the kernel sets and the
process reads.  (As mentioned in my last message.)
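The caching scheme described above can be sketched in a few lines. This is
a Python sketch of the idea, not how any actual C library implements
getpid(); os.register_at_fork here stands in for the fork() run-time
library hook that would invalidate the cache in the child.

```python
import os

_pid_cache = None          # user-space copy of the PID; None = invalid

def cached_getpid():
    """getpid() with a user-space cache; traps to the kernel only once."""
    global _pid_cache
    if _pid_cache is None:
        _pid_cache = os.getpid()   # the one real system call
    return _pid_cache

def _invalidate_pid_cache():
    global _pid_cache
    _pid_cache = None

# fork() is the only event, from the process's own point of view, that
# changes its PID, so the fork() wrapper flushes the cache in the child.
os.register_at_fork(after_in_child=_invalidate_pid_cache)
```

No such scheme works for the PPID, as noted above, because the parent can
exit at any moment without the child's involvement.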

				Bill Bogstad

mash@mips.com (John Mashey) (05/04/91)

In article <03.May.91.174910.55@cogsci.cog.jhu.edu> wjb@cogsci.cog.jhu.edu (Bill Bogstad) writes:
...
>	The idea behind this micro benchmark was to measure the overhead of
>a system call.  The getpid() and getppid() calls don't do any significant

As usual, micro-level benchmarks can surprise you, and as architectural
variations arise, you can really get surprised.

As an example, register-window machines can be quite sensitive to
the depth of function calls, unlike most machines that are more sensitive
to the number of function calls.  For example, it is quite possible
for SPARC to look real good on getpid, but incur substantial overhead
on system calls that make many levels of calls into the kernel....

This is one more case where a micro-level benchmark gets zapped by
architectural changes.  Besides the register-windows thing, one
can be surprised by:
	1) MMU subtleties
	2) Differences in cache design
	3) Differences in memory system design.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94088-3650

aburto@marlin.NOSC.MIL (Alfred A. Aburto) (05/10/91)

In article <3001@spim.mips.COM> mash@mips.com (John Mashey) writes:
>In article <1756@marlin.NOSC.MIL> aburto@marlin.NOSC.MIL (Alfred A. Aburto) writes:
Thanks for that additional information. I used it and SPEC data held at
perelandra.cms.udel.edu (spec.sc in directory bench) to revise the table
I posted (thanks to John McCalpin for the pointer to the SPEC data).
I also corrected the percent mean error results (they were wrong ---
incorrectly calculated!).  I also changed to the 1757 Dhrys/sec figure 
for the VAX 11/780 although I believe a more correct 'peak' value is 
required here!  We need to use numbers OF THE SAME TYPE when comparing
performance!  That is, we need highly optimized Dhrystones/sec numbers
in this case for ALL the systems including the VAX 11/780.
 
The sensitivity of Dhrystone to optimization is probably the main reason
the Dhrystone ratio deviates so widely from the SPECratios and SPECint.
As a test of this I ran Dhrystone 1.1 on a Sun 4/260 system. The low
number without optimization ('cc') was 8900 Dhrys/sec while the high
number with optimization ('cc -O4 -DREG=register') was 20000 Dhrys/sec.
These yield VAX-MIPS ratings of 5.1 and 11.4 respectively. To quantify
this type of variability, with respect to optimization, one could for
example take the average of the low and high numbers to give 14450 Dhrys/sec
and a VAX-MIPS rating of 8.2. Compare this to the SPECint rating of 8.7
for the Sun 4/260 in the table below. The comparison is a lot more
reasonable now!  Other measures of performance such as the median over
the results using different compiler options may be more appropriate
of course. Doing all this is a lot of trouble though and almost no one
does it. Most people naturally want to report the 'best' numbers, the peak
numbers, and so we wind up with highly biased and sometimes confusing
Dhrystone results. The easiest solution to all this is to let Dhrystone
rest in peace and just use the more reliable SPEC numbers. But I'm not
happy with this as I want benchmarking to cover a much wider territory
than SPEC now covers.
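The Sun 4/260 arithmetic above, as a sketch:

```python
base = 1757                  # VAX 11/780 Dhrystones/sec (V1.1)
lo, hi = 8900, 20000         # Sun 4/260: plain 'cc' vs 'cc -O4 -DREG=register'

mips_lo = lo / base                  # ~5.1 VAX-MIPS, unoptimized
mips_hi = hi / base                  # ~11.4 VAX-MIPS, peak
mips_avg = (lo + hi) / 2 / base      # ~8.2, much closer to SPECint's 8.7
```

The 2.2x spread between the low and high numbers on a single machine is
the whole problem: the rating depends as much on compiler flags as on
the hardware.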

I don't have a lot more to add relative to the results below except the
correlation between ALL the programs is still very high across ALL the
18 systems EVEN with the IBM POWERstation 320 and Intel i860 (systems
11 and 12) greatly distorting the Dhrystone V1.1 results. Poor Dhrystone
--- the IBM and Intel compilers seem to have chewed it up and spit out
Dhrys/sec and MIPS ratings bearing little relation to other integer
program results and the geometric mean of those results (SPECint). I
wonder what the results (Dhrys/sec) would be if different compilers
were used on systems 11 and 12 and if optimization was disabled
(if at all possible)?  The results, I'm sure, would be quite different.
 
Also I wonder how sensitive the SPECratio results are relative to
different compilers and different compiler options?

Another interesting result in the table below is the consistent and
(now) relatively high correlation with clock speed for all the programs
(Dhrystone1.1, GCC, Espresso, Lisp Interpreter, Eqntott, and SPECint).

Another thing, since I'm being so mouthy anyway :-), what if GCC were
ported to run on non-UNIX systems (to the vast world of microcomputers)?
Maybe then we could arrive at a worthy test program of integer performance
on these 'small systems'?  A test program not so sensitive to optimization
as the Dhrystone.

                                Dhrystone1.1       SPECratio        SPECint
                                ------------ ---------------------- -----
   System                MHz     D/S  Ratio  GCC   ESP    LI   EQN
00 DEC VAX 11/780        5.00   1757   1.0   1.0   1.0   1.0   1.0   1.0
01 HP 9000/340          16.67   6536   3.7   3.1   2.3   3.3   2.2   2.7
02 Sun 4/260            16.67  19900  11.3   9.9   7.8   9.1   8.3   8.7
03 Sun SPARCstation 1   20.00  22049  12.5  10.7   8.9   9.0   9.7   9.5
04 HP 9000/834          15.00  23441  13.3  10.2   8.9  11.7  10.1  10.2
05 MIPS RC2030          16.67  31200  17.8   8.6  11.8  14.2  11.5  11.3
06 DECstation 3100      16.67  26600  15.1  10.9  12.0  13.1  11.2  11.8
07 HP Apollo 10000      18.20  27000  15.4  12.8  12.9  11.1  11.1  11.9
08 Sun SPARCstation 330 25.00  27777  15.8  13.8  11.6  11.2  12.6  12.3
09 HP 9000/425s         25.00  35140? 20.0? 13.8  13.4  15.5   9.7  12.9
10 MIPS M/120-5         16.67  31000  17.6  12.5  12.2  15.4  12.0  13.0
11 IBM POWERstation 320 20.00  51832  29.3  13.7  16.3  15.6  17.7  15.8
12 Intel Star860        33.00  83985  47.8  12.4  20.1  17.7  17.8  16.7
13 AT&T Starserver E    33.00  47439  27.0  16.2  16.6  22.2  14.5  17.2
14 DECstation 5000/200  25.00  42519  24.2  17.3  18.5  21.8  18.4  18.9
15 MIPS M/2000          25.00  47400  27.0  19.0  18.3  23.8  18.4  19.8
16 Sun SPARCstation 2   40.00  50075  27.5  19.6  17.6  22.7  21.4  20.2
17 HP 9000/720          50.00 100149  57.0  35.2  42.5  36.1  40.6  38.5
18 HP 9000/730          66.00 133532  76.0  46.5  55.2  50.3  52.6  51.0
-------------------------------------------------------------------------
Arithmetic Mean                       25.5  15.9  17.1  18.0  16.7  16.8
Standard Deviation                    17.5   9.8  12.2  10.6  11.7  10.9

Correlation Coef WRT Clock Speed      0.92  0.93  0.93  0.92  0.93  0.94

Correlation Coef WRT Dhry ratio       ----  0.90  0.96  0.93  0.95  0.94
Correlation Coef WRT GCC  ratio             ----  0.98  0.97  0.98  0.99
Correlation Coef WRT ESP  ratio                   ----  0.97  0.98  0.99
Correlation Coef WRT LI   ratio                         ----  0.97  0.99
Correlation Coef WRT EQN  ratio                               ----  0.99
                                                                    ----

Percent Mean 'Error' by Dhrystone     ----  60.4  49.1  41.7  52.7  51.8
Relative to SPEC Integer Programs.

Al Aburto
aburto@marlin.nosc.mil