[comp.benchmarks] benchmark evaluations

mark@hubcap.clemson.edu (Mark Smotherman) (12/12/90)

A followup on the old bc thread, and possibly the start of a new thread --


I teach students (for better or worse) that benchmarks should be:

1) Representative
   A) accurate characterization of workload
   B) exploit system structure (including compiler optimization) only
      as much as workload will be able to do so
2) Reproducible
   A) full system configuration (including OS and compiler versions)
      specified (e.g., SPEC reporting)
   B) system unloaded, or load specified and reproducible
   C) operational rules (e.g., compiler options, program inputs and files)
3) Compact
   A) portable, with little or no need of conversion
   B) inexpensive
   C) test files (to avoid size and privacy problems with actual data)


In trying to evaluate the bc test (i.e., echo 2^5000/2^5000 | /bin/time bc)
according to these criteria, I am deeply disturbed by the continuing
promotion of "bc" in the face of evidence that the benchmarks being run
on the different systems are not identical.  That is, the following posts
have seemingly had little impact:

amos@taux01.nsc.com (Amos Shapir) writes:
| Besides, there are several versions of "bc" (some of which do not fork "dc")
| and since the original version of "dc" was rather buggy, several versions
| of it too, some of which are major rewrites.
| The bottom line is: comparing "bc" runs on different systems is necessarily
| comparing apples and oranges (or at least plums & prunes) unless you're
| sure you have the same version of "bc", "dc", and UNIX.

ciemo@bananapc.wpd.sgi.com (Dave Ciemiewicz) writes:
| ... I just diff'ed the sources
| between BSD and SYSV versions of dc which is the compute engine for bc.
| There are changes to the SYSV version for robustness that may sway results
| one way or the other.

IMHO, "bc" only has compactness on its side, with representativeness
questionable and reproducibility totally ruled out.  Why, then, is there
continuing interest?  Are we in fact setting up comp.benchmarks for a
place in Hennessy and Patterson's "benchmarking hall of shame" for their
second edition?
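
To make the reproducibility objection concrete: even a minimal report of
a bc run would have to pin down the items below before two postings could
be compared.  This is only a sketch, and the choice of what to record is
my own suggestion; note that there is still no portable way to say *which*
bc and dc sources were compiled, which is exactly the problem.

	#! /bin/sh
	# Sketch of a minimally documented bc run (criterion 2 above).
	# The fields recorded are a suggestion, nothing more; they make
	# the report reproducible, not the test representative.
	uname -a                                # machine type, OS release
	ls -l /bin/bc /usr/bin/bc /bin/dc /usr/bin/dc 2>/dev/null
	uptime                                  # load at the time of the run
	echo "run at: `date`"
	# run it a few times: report the spread, not one number
	for i in 1 2 3; do
		echo "2^5000/2^5000" | /bin/time bc
	done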


On a second thread, what would you add (or subtract) from the criteria
given above?
-- 
Mark Smotherman, Comp. Sci. Dept., Clemson University, Clemson, SC 29634
INTERNET: mark@hubcap.clemson.edu    UUCP: gatech!hubcap!mark

gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) (12/12/90)

In article <12220@hubcap.clemson.edu> mark@hubcap.clemson.edu (Mark Smotherman) writes:

>IMHO, "bc" only has compactness on its side, with representativeness
>questionable and reproducibility totally ruled out.  Why, then, is there
>continuing interest?

Because the group isn't moderated?

Because people don't have any other benchmarks to run?

Because people think benchmarking should be quick and easy?

I think one of the bad things about SPEC is that it's not easy to
obtain the source. This is good in that people are a little more
careful when running it, since they probably actually read the
instructions, but it's bad because random peons don't have a big list
of results for all systems, and they can't go run it on their favorite
off-brand of computer. And that's partially why there's interest in
running bc.

Of course, if they get most of their income by selling SPEC tapes, I
guess there's no choice but to not bite the hand that feeds SPEC.

de5@ornl.gov (Dave Sill) (12/12/90)

In article <12220@hubcap.clemson.edu>, mark@hubcap.clemson.edu (Mark Smotherman) writes:
>
>I teach students (for better or worse) that benchmarks should be:
>
>1) Representative
>   A) accurate characterization of workload
>   B) exploit system structure (including compiler optimization) only
>      as much as workload will be able to do so

Only important if the results are going to be used to predict the
performance of the system on other code.

>2) Reproducible
>   A) full system configuration (including OS and compiler versions)
>      specified (e.g., SPEC reporting)
>   B) system unloaded, or load specified and reproducible
>   C) operational rules (e.g., compiler options, program inputs and files)

Not necessary in all cases, e.g., informal testing or repeated tests
of the same configuration.

>3) Compact
>   A) portable, with little or no need of conversion
>   B) inexpensive
>   C) test files (to avoid size and privacy problems with actual data)

The bc test is certainly portable.

>In trying to evaluate the bc test (i.e., echo 2^5000/2^5000 | /bin/time bc)
>according to these criteria, I am deeply disturbed by the continuing
>promotion of "bc" in the face of evidence that the benchmarks being run
>on the different systems are not identical.

You're assuming that the bc test is used to evaluate the performance
of unlike systems.  This is clearly not a valid use of the test.

>IMHO, "bc" only has compactness on its side, with representativeness
>questionable and reproducibility totally ruled out.  Why, then, is there
>continuing interest?

Because folks use different tools for different jobs, as I've said
many times before.  The bc test is trivial, and is not intended to
replace full, rigorous suites such as SPEC.

>Are we in fact setting up comp.benchmarks for a
>place in Hennessy and Patterson's "benchmarking hall of shame" for their
>second edition?

Huh?

>On a second thread, what would you add (or subtract) from the criteria
>given above?

I'd try to relate various sets of criteria with the different tasks
benchmarks are used for.  There's no "one size fits all" set of
criteria.

-- 
Dave Sill (de5@ornl.gov)
Martin Marietta Energy Systems
Workstation Support

de5@ornl.gov (Dave Sill) (12/12/90)

In article <1990Dec12.070209.3272@murdoch.acc.Virginia.EDU>, gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) writes:
>In article <12220@hubcap.clemson.edu> mark@hubcap.clemson.edu (Mark Smotherman) writes:
>
>>IMHO, "bc" only has compactness on its side, with representativeness
>>questionable and reproducibility totally ruled out.  Why, then, is there
>>continuing interest?
>
>Because the group isn't moderated?

The fact that you'd moderate this thread out is evidence that leaving
it unmoderated was the right decision.  Just because you don't like
it, or don't think it's good enough, doesn't mean others should be
denied it.

>Because people don't have any other benchmarks to run?

I doubt that.  Anyone with news access surely has access to e-mail
and, hence, netlib.

>Because people think benchmarking should be quick and easy?

It should, whenever possible.

>I think on of the bad things about SPEC is that it's not easy to
>obtain the source. This is good because it means people are a little
>more careful when running it because they probably actually read the
>instructions, but it's bad because random peons don't have a big list
>of results for all systems, and they can't go run it on their favorite
>off-brand of computer. And that's partially why there's interest in
>running bc.

Could be.  Why doesn't someone keep a table of SPEC results?

>Of course, if they get most of their income by selling SPEC tapes, I
>guess there's no choice but to not bite the hand that feeds SPEC.

-- 
Dave Sill (de5@ornl.gov)
Martin Marietta Energy Systems
Workstation Support

gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) (12/13/90)

In article <1990Dec12.140615.27870@cs.utk.edu> Dave Sill <de5@ornl.gov> writes:

>The fact that you'd moderate this thread out is evidence that leaving
>it unmoderated was the right decision.

I have no intention of doing that. The fact that you jump to such
conclusions is evidence that any attempt to moderate this group would
have to include educating users as to what moderation is for --
clarifying points, preventing duplication, and generally raising the
signal to noise ratio. Which would you rather read: 10 postings giving
bc benchmark results with no comments, or 1 posting giving 10 sets of
results?

>Could be.  Why doesn't someone keep a table of SPEC results?

Go for it. The only source I have of SPEC results is copyrighted, and
very incomplete.

de5@ornl.gov (Dave Sill) (12/13/90)

In article <1990Dec12.182926.14306@murdoch.acc.Virginia.EDU>, gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) writes:
>In article <1990Dec12.140615.27870@cs.utk.edu> Dave Sill <de5@ornl.gov> writes:
>
>>The fact that you'd moderate this thread out is evidence that leaving
>>it unmoderated was the right decision.
>
>I have no intention of doing that.

Let's take a step back and look at your comment in context.

In article <1990Dec12.070209.3272@murdoch.acc.Virginia.EDU>, gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) writes:
>In article <12220@hubcap.clemson.edu> mark@hubcap.clemson.edu (Mark Smotherman) writes:
>
>>IMHO, "bc" only has compactness on its side, with representativeness
>>questionable and reproducibility totally ruled out.  Why, then, is there
>>continuing interest?
>
>Because the group isn't moderated?

So you conjectured that the reason there was continuing interest was
that the group was unmoderated.  This implies that you think moderation
would have squelched the bc thread, no?  It's hardly a leap, then, to
assume that you'd filter out bc articles.

>The fact that you jump to such
>conclusions is evidence that any attempt to moderate this group would
>have to include educating users as to what moderation is for --
>clarifying points, preventing duplication, and generally raising the
>signal to noise ratio. Which would you rather read: 10 postings giving
>bc benchmark results with no comments, or 1 posting giving 10 sets of
>results?

What's that got to do with the question of why there is continuing 
interest in the bc test?  I'm fully aware of moderation and its pros
and cons, thank you.  And why do you think I'm maintaining and posting
the bc results?  I've gotten a couple dozen mail messages that would
otherwise have been postings.

>>Could be.  Why doesn't someone keep a table of SPEC results?
>
>Go for it. The only source I have of SPEC results is copyrighted, and
>very incomplete.

Hmmm.  Anyone got any SPEC results?  (E-mail, please, and I'll edit.)

-- 
Dave Sill (de5@ornl.gov)
Martin Marietta Energy Systems
Workstation Support

gl8f@astsun9.astro.Virginia.EDU (Greg Lindahl) (12/13/90)

In article <1990Dec12.202608.3906@cs.utk.edu> de5@ornl.gov (Dave Sill) writes:

>  It's hardly a leap, then, to assume that you'd filter out bc articles.

But it's not true. One of the purposes of a moderator is to take
articles like yours and say, "Hey, ask the guy in email and then
you'll know."

>> Which would you rather read: 10 postings giving
>>bc benchmark results with no comments, or 1 posting giving 10 sets of
>>results?
>
>What's that got to do with the question of why there is continuing 
>interest in the bc test?

Because I had been skipping over most of the bc articles, because
they're repetitive. That's what was running through my head when I
read your message. And I gave an example of how moderation could
produce a better signal-to-noise ratio even with a bad benchmark, which
might be driving away readers.

Your reply would have fit well in alt.flame, but I suspect the readers
don't give a hoot what my intentions are; they might, however, be
interested in the positive things that moderation could do.

>>Go for it. The only source I have of SPEC results is copyrighted, and
>>very incomplete.
>
>Hmmm.  Anyone got any SPEC results?  (E-mail, please, and I'll edit.)

You might want to check out the copyright issues first. Or perhaps it
would be a good idea if someone contacted SPEC and asked them about
such listings.

abe@mace.cc.purdue.edu (Vic Abell) (12/13/90)

In article <1990Dec12.202608.3906@cs.utk.edu> de5@ornl.gov (Dave Sill) writes:
>In article <1990Dec12.182926.14306@murdoch.acc.Virginia.EDU>, gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) writes:
>>In article <1990Dec12.140615.27870@cs.utk.edu> Dave Sill <de5@ornl.gov> writes:
>>>Could be.  Why doesn't someone keep a table of SPEC results?
>>
>>Go for it. The only source I have of SPEC results is copyrighted, and
>>very incomplete.
>
>Hmmm.  Anyone got any SPEC results?  (E-mail, please, and I'll edit.)

There seem to be two obstacles to publishing a table of SPEC results:

	1.  Many of the available numbers are copyrighted by SPEC.

	2.  SPEC license holders are constrained to use a comprehensive
	    reporting protocol that doesn't lend itself to tabulation.

I've got my own, internal table of SPEC results, gleaned from the SPEC
newsletter, vendor reports, and my own running of the SPEC suite, but I
don't think I can publish it.

Vic Abell <abe@mace.cc.purdue.edu>

eugene@eos.arc.nasa.gov (Eugene Miya) (12/14/90)

In article <1990Dec12.140615.27870@cs.utk.edu> Dave Sill <de5@ornl.gov> writes:
>In article <1990Dec12.070209.3272@murdoch.acc.Virginia.EDU>,
>gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) writes:
>The fact that you'd moderate this thread out is evidence that leaving
>it unmoderated was the right decision.  Just because you don't like
>it, or don't think it's good enough, doesn't mean others should be denied it.

Moderation != censorship.
Moderation should differ little from editorship.
There are numerous fine moderated groups.  A moderator can
cull many "me too postings."
A good moderator can contribute knowledgeable information,
simple queries can be answered before posting, etc.

>Could be.  Why doesn't someone keep a table of SPEC results?

Actually, we keep tables of results, and we have one of the original
SPEC programs.  1) it is not sufficient to simply have the results.
2) when we give the code to a vendor we insist (now) on getting the
code back.  We want to see modifications.  We want to see portability
changes.  We want to see optimization enhancements. We want to see
if iterative algorithms converge (low error bounds).  We do not
like it when some people claim to have our codes running when
in fact, they execute incorrectly.
Our code was developed for a 64-bit single-precision machine.  It takes
considerable run time on a non-vector architecture.  But the biggest and
most interesting problem is that certain manufacturers have a policy of
not releasing benchmark results.  We have to sign special non-disclosure
forms covering benchmark results.

--e.n. miya, NASA Ames Research Center, eugene@eos.arc.nasa.gov
  {uunet,mailrus,most gateways}!ames!eugene
  AMERICA: CHANGE IT OR LOSE IT.

eugene@eos.arc.nasa.gov (Eugene Miya) (12/14/90)

In article <12220@hubcap.clemson.edu> mark@hubcap.clemson.edu
(Mark Smotherman) writes:
>1) Representative
>   A) accurate characterization of workload
>   B) exploit system structure (including compiler optimization) only
>      as much as workload will be able to do so
>2) Reproducible
>   A) full system configuration (including OS and compiler versions)
>      specified (e.g., SPEC reporting)
>   B) system unloaded, or load specified and reproducible
>   C) operational rules (e.g., compiler options, program inputs and files)
>3) Compact
>   A) portable, with little or no need of conversion
>   B) inexpensive
>   C) test files (to avoid size and privacy problems with actual data)

Ah! From the home of the fine moderated group: comp.parallel.
Unfortunately, Dave mailed me that comp.benchmarks isn't archived.

This is a good start, but the list is a bit linear.
Comments:
2) There are two kinds of reproducibility:
	You want to be able to replicate the result yourself (not always
	possible on a computer).
	You want colleagues to be able to do it without you.
I too had thought about these issues (you got "representative" as better
than "real" or "realistic").  Did you see my first issues post?

I also want to suggest a short reading: the very last chapter of
Danny Hillis's thesis "The Connection Machine" (published by MIT
Press), entitled "Why Computer Science is No Good."  Despite Danny's (and
my own) flaming, he points out three things CS needs (benchmarking
as well as any other science):
	1) scale
	2) symmetry
	3) locality of effect
We can learn something from these.  Danny is right.  THESE ARE IMPORTANT
THINGS LEFT OUT OF LEARNING ABOUT BENCHMARKS.

But I just got back from vacation, and I have had to jump back into the
benchmarking fire.  More later.


--e.n. miya, NASA Ames Research Center, eugene@eos.arc.nasa.gov
  {uunet,mailrus,most gateways}!ames!eugene
  AMERICA: CHANGE IT OR LOSE IT.

eugene@eos.arc.nasa.gov (Eugene Miya) (12/14/90)

In article <1990Dec12.135910.27667@cs.utk.edu> Dave Sill <de5@ornl.gov> writes:
>In article <12220@hubcap.clemson.edu>, mark@hubcap.clemson.edu
>(Mark Smotherman) writes:
>>1) Representative
>Only important if the results are going to be used to predict the
>performance of the system on other code.

Wrong!  Representativeness is needed for any descriptive or diagnostic
system.  Prediction is icing on the cake.

>>2) Reproducible
>Not necessary in all cases, e.g., informal testing or repeated tests
>of the same configuration.

Reproducibility is a hallmark of all good sciences.
See, The Journal of Irreproducible Results (maybe all benchmarks
deserve to be there?).

>full, rigorous suites such as SPEC.

If one benchmark is not adequate, and 2 aren't enough,
when is enough, enough?  42? 700?  I don't think the answer lies solely
in fixed benchmarks.

>I'd try to relate various sets of criteria with the different tasks
>benchmarks are used for.  There's no "one size fits all" set of
>criteria.

I will agree with this.

--e.n. miya, NASA Ames Research Center, eugene@eos.arc.nasa.gov
  {uunet,mailrus,most gateways}!ames!eugene
  AMERICA: CHANGE IT OR LOSE IT.

jacobsd@usenet@scion.CS.ORST.EDU (Dana Jacobsen) (12/14/90)

In <6391@mace.cc.purdue.edu> abe@mace.cc.purdue.edu (Vic Abell) writes:

>In article <1990Dec12.202608.3906@cs.utk.edu> de5@ornl.gov (Dave Sill) writes:
>>In article <1990Dec12.182926.14306@murdoch.acc.Virginia.EDU>, gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) writes:
>>>In article <1990Dec12.140615.27870@cs.utk.edu> Dave Sill <de5@ornl.gov> writes:
>>>>Could be.  Why doesn't someone keep a table of SPEC results?
>>>Go for it. The only source I have of SPEC results is copyrighted, and
>>>very incomplete.
>There seem to be two obstacles to publishing a table of SPEC results:
>	1.  Many of the available numbers are copyrighted by SPEC.
>	2.  SPEC license holders are constrained to use a comprehensive
>	    reporting protocol that doesn't lend itself to tabulation.

>I've got my own, internal table of SPEC results, gleaned from the SPEC
>newsletter, vendor reports, and my own running of the SPEC suite, but I
>don't think I can publish it.

  It seems that it is very hard to get hold of SPEC results.  What is the
point of the SPEC benchmarks?  I had originally assumed it was to help
consumers decide what machines to buy, show off the numbers, whatever.  This
seems not to be the case.  One of the big pluses of SPEC results was that
one would see 10 numbers, not just one.  Unfortunately, SPEC seems to be
defeating its own purpose by not allowing these results to be published, as
trade rags only print the one number.
  From what I can see, SPEC exists for the sole purpose of giving vendors
something more to give their software people to do.
--
Dana Jacobsen
jacobsd@cs.orst.edu

de5@ornl.gov (Dave Sill) (12/14/90)

In article <7694@eos.arc.nasa.gov>, eugene@eos.arc.nasa.gov (Eugene Miya) writes:
>In article <1990Dec12.135910.27667@cs.utk.edu> Dave Sill <de5@ornl.gov> writes:
>>In article <12220@hubcap.clemson.edu>, mark@hubcap.clemson.edu
>>(Mark Smotherman) writes:
>>>1) Representative
>>Only important if the results are going to be used to predict the
>>performance of the system on other code.
>
>Wrong!  Representativeness is need for any descriptive or diagnostic
>system.  Prediction is icing on a cake.

Okay, I'll backpedal a bit here.  Yes, representativeness is a
requirement to the extent that without it you're not testing what you
think you're testing.  But representativeness can be shown
empirically.

>>>2) Reproducible
>>Not necessary in all cases, e.g., informal testing or repeated tests
>>of the same configuration.
>
>Reproducibility is a hallmark of all good sciences.
>See, The Journal of Irreproducible Results (maybe all benchmarks
>deserve to be there?).

Yes, it's A Good Thing, but how much effort is required to show that
repeated runs of the same test on the same system are reproducible?
Are there not situations where it can be assumed?
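
For what it's worth, the effort is small.  Something like the sketch
below answers the question for a given test on a given system: run it N
times and look at the spread.  The awk part just pulls the elapsed time
off whatever line /bin/time prints containing the word "real", so adjust
it to your system's output format.

	#! /bin/sh
	# Crude reproducibility check: run the test N times, report the
	# spread of the wallclock times.
	N=10
	rm -f times.out
	i=0
	while [ $i -lt $N ]; do
		echo "2^5000/2^5000" | /bin/time bc > /dev/null 2>> times.out
		i=`expr $i + 1`
	done
	awk '/real/ {
		for (f = 1; f <= NF; f++)
			if ($f == "real") {
				if (f > 1) t = $(f-1) + 0
				else       t = $(f+1) + 0
			}
		n++; s += t
		if (n == 1 || t < lo) lo = t
		if (n == 1 || t > hi) hi = t
	     }
	     END {
		if (n > 0)
			printf "runs=%d mean=%.2f min=%.2f max=%.2f\n",
				n, s/n, lo, hi
	     }' times.out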

>>full, rigorous suites such as SPEC.
>
>If one benchmark is not adequate, and 2 aren't enough,
>when is enough, enough?  42? 700?  I don't think the answer lies solely
>in fixed benchmarks.

Agreed.

-- 
Dave Sill (de5@ornl.gov)
Martin Marietta Energy Systems
Workstation Support

ejk@uxh.cso.uiuc.edu (Ed Kubaitis) (12/15/90)

Having spent a year of my life devising a rigorous, reproducible, predictive 
benchmark suite for a multi-million dollar competitive procurement at a DOE 
laboratory, I think I fully grasp the shortcomings of the bc benchmark. But
I find the results interesting and some of the criticism patronizing. *Of 
course* it's rough and marred with anomalies. That doesn't  make it silly or 
useless. Grep isn't useless because it finds things one didn't intend or 
expect. The value of the bc benchmark lies in the growing (because it's easy) 
list of reported results that bear some relation to rigorous (and difficult) 
benchmarks, and in the thought provoked by the anomalies.

The same was true of the xfroot timings -- a rough, simple (not as simple as 
bc!), imperfect benchmark I collated a while back. (Still available in 
pub/xfroot/timings on uxc.cso.uiuc.edu)

So I'd like to thank Dave Sill for taking the time to collate and report the
results. Let many simple, rough, imperfect benchmarks flourish. We learn 
something from each. 
----------------------------------
Ed Kubaitis (ejk@uxh.cso.uiuc.edu)
Computing Services Office - University of Illinois, Urbana

rosenkra@convex.com (William Rosencranz) (12/16/90)

In article <12220@hubcap.clemson.edu> mark@hubcap.clemson.edu (Mark Smotherman) writes:
> [ list of reasonable bm criteria deleted ]
>In trying to evaluate the bc test (i.e., echo 2^5000/2^5000 | /bin/time bc)
>according to these criteria, I am deeply disturbed by the continuing
>promotion of "bc" in the face of evidence that the benchmarks being run
>on the different systems are not identical.
>
>IMHO, "bc" only has compactness on its side, with representativeness
>questionable and reproducibility totally ruled out.  Why, then, is there
>continuing interest?

why, indeed...

i heartily agree with Mark on this. i have been persuaded that bc can be
a useful bm, PROVIDED (and this is a BIG caveat):

	1) we all use IDENTICAL SOURCE CODE as the basis of comparison
	2) we all have IDENTICAL LOADS ON THE SYSTEM (none is easiest)
	3) we report wallclock time (or wallclock and cputime)

however, this rules out compactness since you will STILL have to carry
a tape around at tradeshows, if you are in fact interested in truly
unbiased results.

i have just run this ditty on a convex C210 and have noted the following:

	1) i ran on a lightly loaded system, though not dedicated.
	2) my times were about 2/3 the time reported for a C2xx here,
	   very nearly the time quoted for a YMP (which has a 6x faster
	   clock and should run at least 2-3x faster than a C210; and
	   no, i refuse to post our actual numbers)

it is all too obvious that apples and grapes are being compared here.
this is hardly scientific, and reduces the output of this group to
meaningless drivel, hardly what it was set up to do.

if there is a PD version of bc, i suggest that it and it ALONE be used
as a basis of comparison. and run it on a dedicated machine or distribute
a simulated "load" as part of the bm.

if i were cray, i'd really push for this :-)

-bill
rosenkra@convex.com

--
Bill Rosenkranz            |UUCP: {uunet,texsun}!convex!c1yankee!rosenkra
Convex Computer Corp.      |ARPA: rosenkra%c1yankee@convex.com

mash@mips.COM (John Mashey) (12/18/90)

In article <6391@mace.cc.purdue.edu> abe@mace.cc.purdue.edu (Vic Abell) writes:

>There seem to be two obstacles to publishing a table of SPEC results:
>
>	1.  Many of the available numbers are copyrighted by SPEC.
>
>	2.  SPEC license holders are constrained to use a comprehensive
>	    reporting protocol that doesn't lend itself to tabulation.
>
>I've got my own, internal table of SPEC results, gleaned from the SPEC
>newsletter, vendor reports, and my own running of the SPEC suite, but I
>don't think I can publish it.

There seem to be some misconceptions floating around (this was the last
of a sequence):

1) SPEC numbers are NOT copyrighted.  Go ahead and put together tables
and publish them.  Most vendors already have, and so have numerous
industry analysts.  (Even if we wanted to, we COULDN'T copyright
the numbers!)

2) There are reporting rules, all right, but there are perfectly
reasonable ways to approximate this when compiling tables that
fit the spirit of full disclosure.

The rules are:
	In the SPEC Newsletters, numbers are reported for machines
	by their own vendors, and they have to fill out the full form,
	which includes:
	all 10 numbers
	HW description: clock, cache size, model, memory size, disks
	SW description: releases of compilers & OS, anything else special
	any compile-time options used
	Availability dates

	If you claim a SPEC number, you are supposed to be able to
	provide the information on this sheet. (If you cannot, then
	you probably don't have the configuration pinned down well
	enough anyway).


Now: reasonable ways to approximate this, within the spirit, for
a third party, to at least make tables of the basic data:
1) Always show all 10 numbers.
2) Either show the source (like the SPEC Newsletter issue),
or if your own measurements, say so, or if the vendor's, say so.
3) For the CPU benchmarks, it is probably sufficient to give
the clock rate, cache size, and compiler version, plus anything "special".
(I.e., I'm thinking of the kind of table you might get to fit on a page.)
Alternatively, especially if providing a number for a machine for
which no related machines have had published SPEC results,
it would be nice to post the exact details as shown in the upper right
quadrant of the SPEC pages.
4) Expect that if you're posting numbers for a machine, and they're
lower than what the vendor has posted, you'll probably hear from that
vendor.  In particular, courtesy implies that especially if you
have numbers that look low, you might send them to the vendor
first and ask why.
	REMEMBER: all of the SPEC-published numbers came from vendors,
	none of whom do anything like:
		running them on a machine with too-small memory that
			causes paging
		running them with year-out-of-date compilers
		running them unoptimized, or with bad choices of options
		running them with old makefiles with less-than-optimal
			options
		picking the WORST number of N runs...
and hence, you can expect that if you publish numbers for a vendor,
and they're low, they may have a legitimate claim of "foul".
On the other hand, if you truly duplicate the environment, you ought to
expect to be close, and if you don't get enough information to duplicate
it, then that's SPEC's or the vendor's fault.
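
For concreteness, the kind of one-page table I mean in 3) could be put
together by something as dumb as the sketch below.  The input file and its
format are invented for the example, and the benchmark names are the
current suite as I remember it, so check them against your newsletter:

	# One machine per line in spectab.in, tab-separated fields:
	#   machine, source, MHz, cache, compiler, then the 10 SPEC ratios
	#   (gcc espresso spice2g6 doduc nasa7 li eqntott matrix300
	#    fpppp tomcatv)
	awk 'BEGIN {
		FS = "\t"
		printf "%-16s %-14s %4s %6s %-12s %s\n",
			"machine", "source", "MHz", "cache", "compiler",
			"10 SPEC ratios"
	     }
	     {
		printf "%-16s %-14s %4s %6s %-12s", $1, $2, $3, $4, $5
		for (i = 6; i <= 15; i++)
			printf " %5s", $i
		printf "\n"
	     }' spectab.in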

On the other hand, one of the whole points of going through all of the
agonizing work in this was to get something that people could use to
keep us vendors honest.  So, go ahead and publish.
But: do identify sources of results, so that people don't wake up
with lists of numbers that came from who knows where, and do try to be polite,
like posting:
	"I ran SPEC on machine X, with these options, and configuration,
and the results seem different from the published ones.  Can anybody
explain why this is?"

rather than:
	"I ran SPEC on machine X, and vendor X is clearly lying scum...."

Anyway, the rules aren't intended to stop people from collecting numbers,
they're intended to try to be fair to lots of people.

------------

Some SPEC numbers have been posted already.  I have big tables of them,
unfortunately, only in Mac spreadsheets.  If I weren't still digging out
from a long trip and just catching up, I'd post the lot.
I encourage somebody who has a copy of "Your Mileage May Vary"
to post the numbers as a service, as I don't have the time right now.
Also, having exhaustively analyzed past "bc" benchmarks,
and having found them uniformly silly, I wouldn't spend any more
time in that direction.  (One example I saw spent 99% of its time
in <256 bytes of code on a RISC, and >50% of the time was spent
doing integer multiplies and divides, i.e., a profile that fits in
the tiniest cache, and whose opcode frequencies resemble few other
real programs.....)

I encourage the following experiments:
1) Look for correlations between the SPEC integer subset
and this bc benchmark.
2) Look for SPEC Integer <-> Dhrystone correlations.
3) Look for SPEC Integer <-> mips-ratings correlations.
	(BTW: from about 50 machines I've analyzed, to get a
	SPEC Integer estimate, multiply vendor mips by anything
	from 55% to 97%, i.e., vendors disagree by as much as
	a factor of 2X in what a mips means.
	In comp.arch, people were complaining about "28.5 mips";
	even worse is price/performance numbering like "$463/mips"
	in the light of this.)
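
For 1) through 3), the arithmetic is trivial.  Given one machine per line
with the two numbers being compared (say, SPEC Integer and 1/bc-time) in a
file, the sketch below prints the correlation coefficient; the file name
and the choice of columns are just for the example:

	# Pearson correlation of two whitespace-separated columns.
	awk '{
		x = $1; y = $2
		n++; sx += x; sy += y
		sxx += x*x; syy += y*y; sxy += x*y
	     }
	     END {
		num = n*sxy - sx*sy
		den = sqrt(n*sxx - sx*sx) * sqrt(n*syy - sy*sy)
		if (den != 0)
			printf "n=%d  r=%.3f\n", n, num/den
		else
			print "degenerate data"
	     }' pairs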
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

abe@mace.cc.purdue.edu (Vic Abell) (12/19/90)

In article <44157@mips.mips.COM>, mash@mips.COM (John Mashey) writes:
> 
> The rules are:
> 	In the SPEC Newsletters, numbers are reported for machines
> 	by their own vendors, and they have to fill out the full form,
> 	which includes:

These rules apply to all SPEC license holders, not just vendors.  As I
noted in my previous posting, license holders are more constrained in
reporting SPEC numbers than others, since the SPEC license agreement
specifies a reporting format that takes 8 pages to describe.

The interesting question is whether or not a SPEC license holder must
follow the format when reporting others' SPEC results.  If so, then it
might be hard to avoid violating copyright.

Of course, none of this stops non-licensees (or marketers) from citing
SPEC reports in any fashion.  :-)

mash@mips.COM (John Mashey) (12/19/90)

In article <6434@mace.cc.purdue.edu> abe@mace.cc.purdue.edu (Vic Abell) writes:
>In article <44157@mips.mips.COM>, mash@mips.COM (John Mashey) writes:
.....
>These rules apply to all SPEC license holders, not just vendors.  As I
>noted in my previous posting, license holders are more constrained in
>reporting SPEC numbers than others, since the SPEC license agreement
>specifies a reporting format that takes 8 pages to describe.

>The interesting question is whether or not a SPEC license holder must
>follow the format when reporting others' SPEC results.  If so, then it
>might be hard to avoid violating copyright.

I'm not a lawyer, but I don't think this is a problem.
	a) It would be a problem if you copied the form, filled it in,
	and represented it as being printed by SPEC (but not for copyright
	reasons).
	b) Although it takes a while to describe, it's pretty easy
	to make up a form like the SPEC form (which, after all, is
	basically a common-sense checklist) and fill in the numbers.
	One can make up a spreadsheet that is perfectly adequate in
	a few minutes to do this.
	c) If you copied an existing form, whited out all of the entries,
	and copied again to make a up a form you write into, that would
	probably work too, and I doubt that anyone would care in the slightest.

>Of course, none of this stops non-licensees (or marketers) from citing
>SPEC reports in any fashion.  :-)

Let me try another way: all of these rules were basically put together
to avoid marketing gimmickry, and provide more meaningful disclosures,
in some cases, just by reminding people that things like compiler
release just might be meaningful, and are often lost.
A lot of this follows the sorts of disclosures that many straightforward
vendors have tried to work into their performance documents,
so it's hardly new.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

tom@ssd.csd.harris.com (Tom Horsley) (12/19/90)

>>>>> Regarding Re: benchmark evaluations; mash@mips.COM (John Mashey) adds:

mash> Let me try another way: all of these rules were basically put together
mash> to avoid marketing gimmickry, and provide more meaningful disclosures,
mash> in some cases, just by reminding people that things like compiler
mash> release just might be meaningful, and are often lost.
mash> A lot of this follows the sorts of disclosures that many straightforward
mash> vendors have tried to work into their performance documents,
mash> so it's hardly new.

Speaking of full disclosures, the latest SPEC reports (which are
unfortunately locked in the office next to me, so I can't get to them right
now to pull off the exact numbers) have a lot of different numbers for
MIPS-based machines.  Just using the 25MHz machines as an example (I think
that's an R3000?), there are numbers that range from somewhere around 16-17
SPECmarks to 19-20 SPECmarks for machines that seem nearly identical from
everything described in the SPEC newsletter.  They have the same clock rate,
the same cache size, about the same amount of main memory, and are using the
same compilers.  The question then becomes:

   "What's different?"

I tend to suspect memory bandwidth, but maybe they have different float
units?  Can anyone tell me the real reason the numbers vary so much?
--
======================================================================
domain: tahorsley@csd.harris.com       USMail: Tom Horsley
  uucp: ...!uunet!hcx1!tahorsley               511 Kingbird Circle
                                               Delray Beach, FL  33444
+==== Censorship is the only form of Obscenity ======================+
|     (Wait, I forgot government tobacco subsidies...)               |
+====================================================================+

mash@mips.COM (John Mashey) (12/20/90)

In article <TOM.90Dec19070051@hcx2.ssd.csd.harris.com> tom@ssd.csd.harris.com (Tom Horsley) writes:

>Speaking of full disclosures, the latest SPEC reports (which are
>unfortunately locked in the office next to me so I can't get to them right
>now to pull off the exact numbers) have a lot of different numbers for MIPS
>based machines. Just using the 25MHZ machines as an example (I think thats
>an R3000?) there are numbers that range from somewhere around 16-17
>SPECmarks to 19-20 SPECmarks for machines that seem nearly identical from
>everything described in the SPEC newsletter.  They have the same clock rate,
>the same cache size, about the same amount of main memory, and are using the
>same compilers. The question then becomes:

>   "What's different?"

>I tend to suspect memory bandwidth, but maybe they have different float
>units?  Can anyone tell me the real reason the numbers vary so much?

let me tell you what it isn't: different floating point units.
	Every R3000 ever built, if it uses FP at all, uses the
	same R3010 FPU.

Here are some of the things that do make a difference:
1) Cache size
2) Cache line size: refill size might be 8, 16, or maybe even 4 words.
3) Write buffering
	(different machines have different depths of write-buffers,
	4, or 8; some may do byte-gathering of byte writes, and
	flush the relevant cache word; others may do a 2-cycle read-check
	tag-write partial word if found in cache).
4) Speed of accepting writes to memory.  This is usually 1 every other
cycle, but depends on interleaving and use/nonuse of page-mode DRAMs,
where you get fast access if the reference is in the same page as previous
one, but slower to switch pages.
5) Main memory system: some may have a single path to memory,
and get stalled if I/O is going on; others may have a private-memory-bus
and a separate VME bus interface with big FIFOs so they interfere less.
6) Presence of secondary cache: some systems (multiprocessors) have these,
and they can affect the benchmarks.
7) How the system got there, i.e., maybe it was a MIPS chip stuck into
an existing bus structure, or maybe it was an upgrade required to use
memory boards from previous version, where latency cycles get added because
of some clock-rate matching that has to be done.

SO.....  From PC benchmarks, one can see that different vendors get
different performance from the same chip at the same clock.  It's even
more so with people building more complex and/or higher-performance products.

(For people who've seen the "car" talk, whereby computers are equated to
cars, the memory system is equated to a turbo-charger; like cars,
it can make a big difference.)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086