[comp.benchmarks] benchmarks

eugene@eos.arc.nasa.gov (Eugene Miya) (11/13/90)

In article <9135@ncar.ucar.edu> pack@acd.UCAR.EDU (Dan Packman) writes:
>Excellent point.  The best benchmark is clearly one's own application
>or set of applications including multi-process loads.

It might seem like this is true.  But it is not.

It depends on what you want benchmarks to do.

Buying a machine to print checks is one thing.  The above is fine.
Buying a machine to solve changing fluid dynamics problems is another
(moving target).  Testing (diagnosing performance) is a third.  

We know the answer.  It's 42.  At least those of us who read THAT book.
It's how you ask the question.  Not knowing how to ask the question
results in supercomputers being one generation behind the problems
we would really like to solve.  So the problem is not quite as clear as
one would like.  I agree, there is some truth to using specific existing
applications, but to ignore flaws invites trouble.

This is why SPEC's decision to normalize performance on the DEC VAX-11/780,
or say the IBM PC, simply because they are or were common, is a bad one.
This is akin to taking the wind-up clock I used to wake up to as a youngster
and designating it as a time standard.  If you think this analogy is
bogus, consider taking a common ruler, 12 inches or a meter stick,
designating that as THE Meter or THE Foot, those wooden things, and
then trying to make sub-micron lines for your memories or CPUs.  You won't
produce consistent chips for long.  This is called "gold-plating"
a metric.  Ref: "Foundations of Metrology."

We must start at a few more basic building blocks and proceed in a
progression.  Trying to rush the process will only confuse the issue.

--e.n. miya, NASA Ames Research Center, eugene@eos.arc.nasa.gov
  {uunet,mailrus,most gateways}!ames!eugene
  AMERICA: CHANGE IT OR LOSE IT.

rnovak@mips.COM (Robert E. Novak) (11/14/90)

To clarify my previous comments on SPEC membership:

SPEC membership costs:

	Initiation	$10,000
	Annual Dues	$ 5,000

SPEC Associate:

	Initiation	$2,500
	Annual Dues	$1,000

To qualify as a SPEC associate, you must be an accredited educational
institution or a non-profit organization.  An associate has no voting
privileges.  An associate will receive the newsletter and the benchmark
tapes as they are available.  In addition, an associate will have early
access to benchmarks under development so that an associate may act in an
advisory capacity to SPEC.

The SPEC tape still costs $699 which includes the cost of a 1 year
subscription to the SPEC newsletter.  The tape by itself costs $300.  
There are no discounts for the SPEC tape/newsletter.  

SPEC
c/o Waterside Associates
39510 Paseo Padre Parkway
Suite 350
Fremont, CA  94538
415/792-3334

After January 1, 1991
SPEC
c/o Waterside Associates
39138 Fremont Blvd.
Fremont, CA  94538
Same Phone Number.
-- 
Robert E. Novak                     Mail Stop 5-10, MIPS Computer Systems, Inc.
{ames,decwrl,pyramid}!mips!rnovak      950 DeGuigne Drive, Sunnyvale, CA  94086
rnovak@mips.COM        (rnovak%mips.COM@ames.arc.nasa.gov)      +1 408 524-7183

eugene@eos.arc.nasa.gov (Eugene Miya) (11/14/90)

Bob-- Did I just see you a while ago? 8^)  Must spend all your time posting...

I'll add the SPEC address, Perfect, NISTLIB, and a few other things to the
FAQ.

--e.n. miya, NASA Ames Research Center, eugene@eos.arc.nasa.gov
  {uunet,mailrus,most gateways}!ames!eugene
  AMERICA: CHANGE IT OR LOSE IT.

lewine@cheshirecat.webo.dg.com (Donald Lewine) (11/15/90)

In article <7581@eos.arc.nasa.gov>, eugene@eos.arc.nasa.gov (Eugene Miya) writes:
|> 
|> This is why SPEC's decision to normalize performance on the DEC VAX-11/780,
|> or say the IBM PC, simply because they are or were common, is a bad one.
|> This is akin to taking the wind-up clock I used to wake up to as a youngster
|> and designating it as a time standard.  If you think this analogy is
|> bogus, consider taking a common ruler, 12 inches or a meter stick,
|> designating that as THE Meter or THE Foot, those wooden things, and
|> then trying to make sub-micron lines for your memories or CPUs.

	But SPEC took a particular VAX-11/780.  The 11/780 time for
	gcc is 1482 seconds.  It is not what you get on your particular
	VAX.  This is more like taking a gold bar in Paris and saying
        that is the standard meter.  As long as there is only one gold
	bar, that is not a problem.

	I think that 29 SPECmarks is more understandable than saying
	that the Geometric Mean of the benchmark times is 133.3 
	seconds.
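
To make the arithmetic concrete, here is a small C sketch (not official
SPEC code; it simply assumes SPECratio = reference time / measured time
and SPECmark = geometric mean of the SPECratios).  The geometric mean of
the ratios is exactly the ratio of the geometric means, so "29 SPECmarks"
and "the geometric mean of the benchmark times is 133.3 seconds" are two
ways of stating the same thing:

/*
 * Sketch only.  A few published reference times (seconds); the
 * "measured" times are invented for illustration.
 */
#include <stdio.h>
#include <math.h>

static double gmean(const double *x, int n)
{
    double logsum = 0.0;
    for (int i = 0; i < n; i++)
        logsum += log(x[i]);
    return exp(logsum / n);
}

int main(void)
{
    double ref[]  = { 1482.0, 2266.0, 4525.0, 2649.0 };
    double meas[] = {   50.0,   80.0,  160.0,   95.0 };
    int    n      = sizeof ref / sizeof ref[0];
    double ratio[4];

    for (int i = 0; i < n; i++)
        ratio[i] = ref[i] / meas[i];

    printf("geometric mean of ratios = %.2f\n", gmean(ratio, n));
    printf("GM(ref) / GM(measured)   = %.2f\n",
           gmean(ref, n) / gmean(meas, n));
    return 0;
}

(With all ten reference times the geometric mean works out to roughly
3900 seconds, and 3900 / 133.3 is about 29.)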

--------------------------------------------------------------------
Donald A. Lewine                (508) 870-9008 Voice
Data General Corporation        (508) 366-0750 FAX
4400 Computer Drive. MS D112A
Westboro, MA 01580  U.S.A.

uucp: uunet!dg!lewine   Internet: lewine@cheshirecat.webo.dg.com

eugene@eos.arc.nasa.gov (Eugene Miya) (11/15/90)

In article <1146@dg.dg.com> uunet!dg!lewine writes:
>	But SPEC took a particular VAX-11/780.  The 11/780 time for
>	gcc is 1482 seconds.  It is not what you get on your particular
>	VAX.  This is more like taking a gold bar in Paris and saying
>       that is the standard meter.  As long as there is only one gold
>	bar, that is not a problem.

Particular: that's right.  Two things to add: 1) DEC knew that the
performance of 780 models varied by as much as 10%.  Is 10% acceptable?
In some cases yes, others no.  Using your bar analogy (I recall it's
really platinum-iridium), that's why I gave the Metrology paper as a
reference.  The former NBS director used the term "Gold plating."
John Mash[ey]@mips.com said "VAX under glass" at one time (Cute, I
like it!).  Will you use a ruler which may be as much as 10% off?
I think our society is beyond that.  That is why the US NIST (formerly NBS)
maintains an atomic clock, a highly instrumented, multi-million-dollar
piece of hardware.

Taken to the extreme, 2) why a VAX (or a PC)?  Well, why not an ENIAC?
You don't want an ENIAC for the same reason you won't want a VAX in the
future.  You are cutting your own long-term throat.
That's what a platinum bar is.  That's why NIST uses atomic frequencies
not only to specify time, but also length (distance).
We must go beyond that.  Your H/W engineers use the best oscilloscopes,
right?  Yet we software types are in the dark ages.

>	I think that 29 SPECmarks is more understandable than saying
>	that the Geometric Mean of the benchmark times is 133.3 
>	seconds.

John Hennessy came today and blasted the geometric mean (in favor of a
weighted arithmetic mean).  I will commit to no statistic before its time.

My opinion is that we must understand the sample before applying
any statistic.  I hate to say it: give me raw numbers and then I will think
about sending them to S (or BMDP or whatever).

About 29 (or 42), I don't think it's the number of benchmarks.
I had a talk at one time entitled "The Next 700 Benchmarks."
[If you didn't know there have been a string of papers beginning with
"The Next 700 Programming Languages."]  And in fact Carl Ponder (LLNL)
gave a talk about adding benchmark information can just cloud the issue.
It's not just the number of measurements or observations you take.

--e.n. miya, NASA Ames Research Center, eugene@eos.arc.nasa.gov
  {uunet,mailrus,most gateways}!ames!eugene
  AMERICA: CHANGE IT OR LOSE IT.

I'll copy Hennessy's five viewgraphs; I do not think John will mind since
he brought the Net discussion up.  I generally support most of what he had
to say.  Only 1-3 were presented; 4 and 5 were left over and covered orally:

#1
	Some Comments to SPEC
+ Means for summarizing performance
+ Choosing benchmarks
+ Guidelines for running benchmarks
	[Comment: "You (SPEC members) have a responsibility for what
	your marketing people say."  I agree.]
+ The dangers of SPECthroughput
[Hennessy expressed worry about ideas cast in concrete.  I really fear
this as well, and it may be too late.]

#2	
	Why not geometric mean?
+Example

	Absolute time			Relative performance
	M1  M2  M3			M1  M2  M3
B1	5   10  10			1  0.5 0.5
B2	10   5  10			1   2   1
				GM	1   1  0.7
B means benchmark, M means Machine, GM is geometric mean

+ To reproduce the summary indicated by the geometric mean, for
  M1 and M2: run each 50% of total workload
  M1 and M3: run B1 57% and B2 43% of total workload!
  M2 and M3: run B1 43% and B2 57% of total workload!
  ----------------------------------------------------
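
[To make the viewgraph's point concrete, here is a small C sketch (my
code, not Hennessy's) using the times in the table above.  The geometric
mean reports M1 and M2 as equal and M3 at about 0.7, yet on a workload
that simply runs B1 and B2 once each, M1 and M2 tie at 15 time units
while M3 takes 20, i.e. relative performance 0.75, and the mixes that
would actually reproduce the GM summary differ for each pair of
machines, as listed above.]

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* t[machine][benchmark]: B1 and B2 times on M1..M3, from the table */
    double t[3][2] = { { 5, 10 }, { 10, 5 }, { 10, 10 } };

    for (int m = 0; m < 3; m++) {
        /* geometric mean of performance relative to M1 */
        double gm = sqrt((t[0][0] / t[m][0]) * (t[0][1] / t[m][1]));
        /* total time for a workload of one B1 run plus one B2 run */
        double total = t[m][0] + t[m][1];
        printf("M%d: GM of relative perf = %.2f, one-of-each time = %.0f\n",
               m + 1, gm, total);
    }
    return 0;
}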

#3
	Why weighted arithmetic mean

+ A single weighting yields results proportional to execution time!
+ Suggested weighting: equal time on base machine.
Results: weights for earlier example are 2/3, 1/3.

	Weighted execution times
	M1	M2	M3
B1	10/3	20/3	20/3		can also use a weighted harmonic
B2	10/3	5/3	10/3		mean [I know this ref to be Worlton]
AM	20/3	25/3	30/3
Perf.	1.0	0.8	0.67		--inverse of execution time
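
[Again my sketch, not Hennessy's: the same table computed with the
suggested equal-time-on-the-base-machine weights.  The weights come out
to 2/3 and 1/3, and relative performance is the inverse ratio of the
weighted mean times, reproducing the 1.0 / 0.8 / 0.67 row above.]

#include <stdio.h>

int main(void)
{
    double t[3][2] = { { 5, 10 }, { 10, 5 }, { 10, 10 } };

    /* weights proportional to 1/(time on base machine M1), normalized */
    double inv1 = 1.0 / t[0][0], inv2 = 1.0 / t[0][1];
    double w1   = inv1 / (inv1 + inv2);           /* = 2/3 */
    double w2   = inv2 / (inv1 + inv2);           /* = 1/3 */
    double base = w1 * t[0][0] + w2 * t[0][1];    /* = 20/3 on M1 */

    for (int m = 0; m < 3; m++) {
        double wam = w1 * t[m][0] + w2 * t[m][1];
        printf("M%d: weighted mean time = %.2f, relative perf = %.2f\n",
               m + 1, wam, base / wam);
    }
    return 0;
}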

Not shown but discussed:
#4
	Choosing benchmarks
+ Some evaluation procedures need to be established to choose benchmarks.

+ These need to focus on questions like:
 - is this a real program
 - how many lines constitute the 90% or 95% point
 - is the input appropriate
+ How will you know the potential defects before choosing the benchmark?
[I have thought about some of these questions and some ideas, and will
try to present them in the coming days and weeks.]

#5
	Guidelines for running programs
+ Serious problems can arise because guidelines for running benchmarks
are not precise.
        [No kidding, this was a point in one SPEC discussion;
        I am not a SPEC member but was invited.  Maybe I should
        post a few notes or impressions.  Basically SPEC is kind of
        a good thing; only I wish it had been ANSI instead [some minuses].]
+Some examples
 - what routines can be replaced by libraries?
 - what are the requirements for runtime checks such as bounds checking
   and FP exception checks?

I should note that I am not innocent: one of the SPEC benchmarks came
from me (and we have serious constraints on running that program;
it was renamed).

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (11/15/90)

>>>> On 15 Nov 90 06:54:35 GMT, eugene@eos.arc.nasa.gov (Eugene Miya) said:

Eugene> In article <1146@dg.dg.com> uunet!dg!lewine writes:
>	But SPEC took a particular VAX-11/780.  The 11/780 time for
>	gcc is 1482 seconds.  It is not what you get on your particular
>	VAX.  This is more like taking a gold bar in Paris and saying
>       that is the standard meter.  As long as there is only one gold
>	bar, that is not a problem.

Eugene> Particular: that's right.  Two things to add: 1) DEC knew that
Eugene> the performance of 780 models varied by as much as 10%.  Is
Eugene> 10% acceptable?  In some cases yes, others no.  

I am surprised that such a sensible person as Eugene would imply that
*any* benchmark number had a precision of <10%.  I don't believe that
it is possible to take any combination of "general-purpose" benchmarks
and use that data to predict your application (or workload)
performance to within 10%.  In fact, it is all too easy to have 10%
changes in the performance of your application itself if (as is
inevitable) it is run under conditions that differ from the formal
benchmark test.  Minor changes like operating system or compiler
upgrades, changes in the system background load, or even disk
fragmentation can produce 10% changes in wall-clock time quite
easily....

So how precise do I think the numbers are?  Well, with 6 or so years
of experience in performance evaluation of supercomputers and
high-performance workstations, I can generally [i.e., not always]
estimate the performance of my codes to within 20-25% based on a broad
suite of benchmark results (LINPACK 100x100, LINPACK 1000x1000,
Livermore Loops, hardware description with cycle counts, and maybe a
bit more).  (I deliberately ignore PERFECT since the one code that I
know in some detail [the ocean model from GFDL/Princeton] is a mess,
and I would not blame a compiler at all for having trouble vectorizing
or optimizing it [or even understanding what it is supposed to be
doing!]).

--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@vax1.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

grover@brahmand.Eng.Sun.COM (Vinod Grover) (11/16/90)

In article <7589@eos.arc.nasa.gov> eugene@eos.UUCP (Eugene Miya) writes:
>[If you didn't know there have been a string of papers beginning with
>"The Next 700 Programming Languages."]  

I am aware of Peter Landin's and F.L. Morris' papers; are there others?
Would you mind posting them?

Thanks

Vinod Grover
Sun Microsystems

rafael@ucbarpa.Berkeley.EDU (Rafael H. Saavedra-Barrera) (11/16/90)

In article <7581@eos.arc.nasa.gov> ucbvax!agate!shelby!eos!eugene writes:
> Particular: that's right.  Two things to add: 1) DEC knew that the
> performance of 780 models varied by as much as 10%.  Is 10% acceptable?
> In some cases yes, others no.  ...

Gene, you are completely missing the point. The 10% variation on the 
VAX 780 has nothing to do with defining a unit of performance and using
it to measure. If you look at the SPEC reports, you'll notice that they 
use the SAME execution times for the reference machine in all reports 
and for all machines. There is no 10% variation. What is needed is 
something we can use to measure and that everyone uses, period.  From
the Encyclopaedia Britannica:

        Measuring a quantity means ascertaining its ratio to 
	some other fixed quantity of the same kind, known as 
	the unit of that kind of quantity. A unit is an *abstract 
	conception*, defined either by reference to some arbitrary 
	material or to natural phenomena.
	
The key words are: ratio, fixed, and arbitrary. The SPEC people made 
a reasonable, but arbitrary, definition of what represents their fixed 
quantity. Once you have done that, everything else follows, and there 
is nothing else to discuss. All you require from a unit of 
measurement is: 1) that it is fixed; 2) that it has validity for some 
significant group of people; 3) that it can be verified. The SPECratio 
certainly satisfies all 3 conditions. As long as the SPEC people keep 
the *same* VAX 780, with the same software, and run the programs under 
the same conditions, there is nothing to object to. I don't know if they 
are doing this, but they should, if they want to avoid problems in 
the future.

> Taken to the extreme 2) why a VAX (or PC), well, why not an ENIAC?
> You don't want an ENIAC for the same reason you won't want a VAX in the
> future.  You are cutting your own long-term throat.

Wrong! The SPEC people didn't choose an ENIAC because there are no
ENIACs that can be used to run benchmarks, and very few people alive
ever used an ENIAC. Why not a CRAY? Because it is very expensive to
keep a CRAY under glass (same software, etc.) just to run benchmarks 
on it. But in principle an ENIAC, a PC, or a CRAY is as good as a 
VAX 780 for the purposes of what the SPEC people are doing. 

One of the nice properties of the SPECmark is that it is INVARIANT to
the machine you use as your reference point. The relative performance 
between a MIPS M/2000 and a SPARCstation 1, or between a DEC 3100 and
an IBM RS/6000 530, is the same, independent of whether you use a VAX 
780, an ENIAC, or any other machine. So the issue of using a 780 
disappears.
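
To see why, here is a small sketch in C.  It assumes the SPECmark is
computed as the geometric mean of (reference time / machine time); all
the times below are invented.  The reference times cancel out of the
ratio of any two machines' SPECmarks, so the ranking does not depend on
which machine sits under glass:

#include <stdio.h>
#include <math.h>

#define NBENCH 4

/* geometric mean of ref[i]/t[i] over the benchmarks */
static double specmark(const double ref[], const double t[], int n)
{
    double logsum = 0.0;
    for (int i = 0; i < n; i++)
        logsum += log(ref[i] / t[i]);
    return exp(logsum / n);
}

int main(void)
{
    double ref_a[NBENCH] = { 1482, 2266, 4525, 2649 };  /* one reference   */
    double ref_b[NBENCH] = {  900, 3100, 2000, 5200 };  /* a different one */
    double m1[NBENCH]    = {   60,   90,  150,  110 };  /* invented times  */
    double m2[NBENCH]    = {   45,  120,   95,   80 };  /* invented times  */

    printf("M1/M2 with reference A: %.4f\n",
           specmark(ref_a, m1, NBENCH) / specmark(ref_a, m2, NBENCH));
    printf("M1/M2 with reference B: %.4f\n",
           specmark(ref_b, m1, NBENCH) / specmark(ref_b, m2, NBENCH));
    return 0;
}

Both lines print the same number.  Note that the cancellation is a
property of the geometric mean in particular; a weighted arithmetic
mean of the ratios would not be reference-independent in this way.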

Your arguments sound very similar to the ones one of your
fictitious ancestors, the Marquis Eugene De La Milla, made with respect
to the use of the Earth's quadrant to define the meter in 1790. He asked: 
why use a quadrant of the earth as the reference when there are bigger planets 
that may be more relevant to future generations of human beings?
Why 1/10,000,000th of the quadrant instead of 1/3,141,592th, which looks 
more like pi? Where is the center of Paris? The center of the Ile de la 
Cite, or Notre Dame? [I bet you didn't know you had a French ancestor.]

Do you know how the French measured the particular quadrant of the earth 
that runs from the North Pole to the Equator and passes through Paris, one 
hundred years before the first man reached the North Pole? Did it make a 
difference?

There are a lot of more interesting questions to ask about the SPEC 
benchmarks. For example: What does each program measure? Are the 
programs really exercising different aspects of the machine? How 
representative is matrix300 of typical linear algebra codes? Can a 
clever compiler writer make minimal changes to the compiler that 
significantly improve the SPECmark for a particular machine but 
have a marginal benefit in most users' workloads? How do I estimate 
the performance of my workload by looking at the SPEC results? Why is the 
SPECratio of spice2g6 low on most machines when other double precision 
codes have better performance? Is the geometric mean a good statistic?
These are a few questions I would like to know the answers to (some I 
already know).

> About 29 (or 42), I don't think it's the number of benchmarks.
> I had a talk at one time entitled "The Next 700 Benchmarks."
> [If you didn't know there have been a string of papers beginning with
> "The Next 700 Programming Languages."]  And in fact Carl Ponder (LLNL)
> gave a talk about how adding benchmark information can just cloud the issue.
> It's not just the number of measurements or observations you take.

I don't agree. 29 benchmarks are better than 10 benchmarks, *if* the 
29 benchmarks are well chosen. Every benchmark represents an empirical
observation of the performance of the machine. More observations are
better than a few, especially after seeing the results for the Stardent
3010. There, all benchmarks have SPECratios between 14.7 and 62.9, except 
matrix300, which has a ratio of 108.5! Is this an isolated point, or
are there many more programs that give similar results? However, you 
are right in saying that every time we add a new benchmark we have to 
know what it measures, why we are including it, and what new information 
it provides.

I like the SPEC methodology for measuring SPECmarks, but I agree with
J. Hennessy about SPECthruput: the SPEC guys erred here.

I don't agree with J. Hennessy that the weighted arithmetic mean (WAM) 
is better than the geometric mean for the SPEC benchmarks, but I
agree with him when he says that the WAM is the correct statistic to 
use in the example he presented. Am I contradicting myself? No: the 
two problems are different, and therefore different statistics should 
be used. More on this later.

rafael

lewine@cheshirecat.webo.dg.com (Donald Lewine) (11/16/90)

In article <7589@eos.arc.nasa.gov>, eugene@eos.arc.nasa.gov (Eugene Miya) writes:
|> In article <1146@dg.dg.com> uunet!dg!lewine writes:
|> >	But SPEC took a particular VAX-11/780.  The 11/780 time for
|> >	gcc is 1482 seconds.  It is not what you get on your particular
|> >	VAX.  This is more like taking a gold bar in Paris and saying
|> >    that is the standard meter.  As long as there is only one gold
|> >	bar, that is not a problem.
|> 
|> Particular: that's right.  Two things to add: 1) DEC knew that the
|> performance of 780 models varied by as much as 10%.  Is 10% acceptable?
|> In some cases yes, others no.  

You are still missing the point.  Here are the reference 
execution times for the SPECmarks:
    gcc          1482 Seconds
    espresso     2266 Seconds
    spice2g6    23951 Seconds
    doduc        1863 Seconds
    nasa7       20093 Seconds
    li           6026 Seconds
    eqntott      1101 Seconds
    matrix300    4525 Seconds
    fpppp        3038 Seconds
    tomcatv      2649 Seconds

These numbers will not vary by 10%.  They will not vary by .001%.
They are fixed.  Ignore the fact that they were measured by 
someone on a VAX someplace.  They are now the definition of 1.000
SPECmarks of performance.  Maybe the 11/780 was not a good choice
for a base, but at this point it does not matter.  The reference is
*NOT* the 780 but the numbers listed above.  THERE IS NO VARIATION!
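
For the curious, here is roughly how those fixed numbers get used, as a
C sketch (not SPEC's code; it just assumes SPECratio = reference time /
measured time and SPECmark = geometric mean of the ten SPECratios; the
"measured" times are invented purely for illustration):

#include <stdio.h>
#include <math.h>

#define NBENCH 10

int main(void)
{
    const char  *name[NBENCH] = { "gcc", "espresso", "spice2g6", "doduc",
                                  "nasa7", "li", "eqntott", "matrix300",
                                  "fpppp", "tomcatv" };
    const double ref[NBENCH]  = {  1482,  2266, 23951, 1863, 20093,
                                   6026,  1101,  4525, 3038,  2649 };
    const double meas[NBENCH] = {    75,   110,  1250,   90,   980,
                                    300,    55,   230,  150,   130 };

    double logsum = 0.0;
    for (int i = 0; i < NBENCH; i++) {
        double ratio = ref[i] / meas[i];   /* SPECratio for this benchmark */
        logsum += log(ratio);
        printf("%-10s SPECratio = %6.1f\n", name[i], ratio);
    }
    printf("SPECmark (geometric mean of the ratios) = %.1f\n",
           exp(logsum / NBENCH));
    return 0;
}

Change a "measured" time and only that SPECratio moves; change one of
the reference numbers and every machine's SPECmark moves, which is
exactly why they are frozen.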


-- 
--------------------------------------------------------------------
Donald A. Lewine                (508) 870-9008 Voice
Data General Corporation        (508) 366-0750 FAX
4400 Computer Drive. MS D112A
Westboro, MA 01580  U.S.A.

uucp: uunet!dg!lewine   Internet: lewine@cheshirecat.webo.dg.com

rosenkra@convex.com (William Rosencranz) (11/20/90)

---
i dunno, maybe i am just daft, so ignore this if you beg to differ. it
is not meant to offend, so if you read something into it, pls reread.
it is also my opinion, not that of my employer...

i have been reading this newsgroup for a week or so, and SPECmark is
the current hot topic. i am a bit confused over some of the issues
raised, so maybe i'll raise some of my own.

first off: what are SPEC ratings (or any standard bm ratings for that
matter) meant to do? answer this question in your mind first before
proceeding...

i really see no point whatsoever in relating an execution time on one
machine to that of another "standard" machine, no matter how standard,
(except possibly the old "that's the way we've ALWAYS done it before",
e.g. "MIPS"), just to come up with some single "standard" unit of
performance.

if I were buying (instead of selling :-), i'd want to see wallclock and
cpu times, because i, as a human being, can relate to time far easier
the "SPECs" or whatever. if something runs in 10 seconds, compared to
100 seconds, i know i can sit and wait, call it "interactive". if
something runs in 10 min vs 1 hour, i know i can go out to lunch in the
latter case. a SPEC of 1.345 vs a SPEC of 4.345 means nothing, until
i translate to time anyway. time is easier to "heft", as it were.

further, i'd want to see how the "standard" bm results scale with
problem size, especially on cache-based memory systems. because a
buy decision based on a single number could come back to haunt me.

i'd also want to know what sort of performance enhancements i could
expect if i wanted to put 1 hour, 1 day, and 1 week's effort into
the optimization of any particular code, if possible.

i'd also want to compare a vendor's peak performance with how well
it did on standard bm's or on my own.

finally, i'd want to see what sort of support i can expect from the
vendor. granted, pre-sales and post-sales activities can vary
greatly, but i think i can shake out a vendor during the sales
cycle, as most savvy buyers can.

why the need for complication, other than perhaps marketing fog? and
believe me, if i see 2 or 3 systems with uni-number ratings within
say 5% of each other, i sure as heck would not say "these machines
are identical, so let's buy the cheaper one because it has better
price/SPECperformance". i'd want to look at the raw data anyway, and
probably run my sort of workload on them to really get an idea of
what i can expect. similarly, if i see two machines that differ by
a lot in some particular individual tests, i'd want to know why.

in fact, unless i expect to buy a machine to do just one job (or one
job at a time), i would more than likely ignore these uni-job ratings
altogether, since, from my experience, in "real life", multi-job
thruput is where productivity gains are made, and is where strengths
and weaknesses in architectures (e.g. cache vs widely interleaved memory)
are really determined anyway (in many, if not most cases). probably
without exception, the SPEC'd machines are general purpose systems,
especially workstations, which would get lots of different tasks from
text processing to dbms to finite element analysis to ...

the basic problem i see with these uni-number ratings is that
people can make up their minds, even subconsciously, based on a
first impression. this is human nature. you always have that in
the back of your mind. and it is easy to just say "2 > 1.5"
rather than "based on some real workload, and on problem size,
and on vendor support, and on application availability, and on
whatever, 2 is not necessarily > 1.5".

distilling machine performance down to one number tends to make it
easy to abuse it, to misrepresent it. if in fact these sorts of
performance quotients are (good faith?) attempts to enlighten,
then why not enlighten thru education rather than simplification?
surely we can give more credit to the intellect of people making
buy decisions than that?

why not a "SPECparagrah" that sheds more light? consider this my
entry in the standard bm sweepstakes :-).

please don't argue the merits of standards. i am well aware of the
risks and benefits therein. i also know that shopping for supercomputers
is different than shopping for workstations, though in my mind buying
100 w/s at $20k a pop is still spending $2M and it might be better to
buy 100 w/s at $10k and a central system at $1M with my $2M. the SPEC
numbers in no way help me here, i think. having spent the last 15 years
dealing with supercomputers, and only 5 or 6 with workstations and pc's,
i am somewhat biased, i suppose, though i like to at least think i have
an open mind about these sorts of issues.

personally, i think i'll wait for the SPECthroughput bm...

-bill rosenkranz
rosenkra@convex.com

--
Bill Rosenkranz            |UUCP: {uunet,texsun}!convex!c1yankee!rosenkra
Convex Computer Corp.      |ARPA: rosenkra%c1yankee@convex.com

eugene@eos.arc.nasa.gov (Eugene Miya) (11/21/90)

Please excuse my delay time between postings.  I am not at my office
(trying to benchmark from a remote site).  I saw Rafael's post
and the others about the reference.  I'd like to address that, but this
news system only keeps articles 3 days.

--e.n. miya, NASA Ames Research Center, eugene@eos.arc.nasa.gov
  {uunet,mailrus,most gateways}!ames!eugene
  AMERICA: CHANGE IT OR LOSE IT.

de5@ornl.gov (Dave Sill) (11/21/90)

In article <108988@convex.convex.com>, rosenkra@convex.com (William Rosencranz) writes:
>
>first off: what are SPEC ratings (or any standard bm ratings for that
>matter) meant to do? answer this question in your mind first before
>proceeding...

They're meant to make it possible to get some idea of the performance
one can expect a system to provide, without requiring that one observe
the performance directly.

>i really see no point whatsoever in relating an execution time on one
>machine to that of another "standard" machine, no matter how standard,
>(except possibly the old "that's the way we've ALWAYS done it before",
>e.g. "MIPS"), just to come up with some single "standard" unit of
>performance.

The reason for relating performance on an unknown system to that of a
known one is to give the numbers some relevance.  If you tell me your
system does 15 gigafloogles/second, that tells me nothing unless I
know what a floogle is.  But if you tell me your system scored 11.2
SPECfloogles, I can get a handle on whether 15 GF/s is fast or not, at
least if I have any VAX experience--or another machine whose
SPECfloogle score I know.

>if I were buying (instead of selling :-), i'd want to see wallclock and
>cpu times, because i, as a human being, can relate to time far easier
>the "SPECs" or whatever. 

Sure, but the absolute wall clock isn't going to tell you anything.
It's when you compare the values for different systems that you gain
information from the results.  So what if the floogle benchmark runs
in 1:26?  That means nothing.  Give me a list of floogle times, and
I'll probably normalize them on some machine I'm familiar with (or
maybe the slowest machine in the list).  It's the relative performance
that's important.

>if something runs in 10 seconds, compared to
>100 seconds, i know i can sit and wait, call it "interactive". if
>something runs in 10 min vs 1 hour, i know i can go out to lunch in the
>latter case. a SPEC of 1.345 vs a SPEC of 4.345 means nothing, until
>i translate to time anyway. time is easier to "heft", as it were.

Only if you are intimately familiar with what's being done.  What's
the difference between 10 minutes versus 100 minutes and 10
SPECfloogles versus 100 SPECfloogles?  Both indicate the same relative
performance, and both measure the same absolute performance.  The
difference is that with the former you need two numbers to compare,
but with the latter you have the built-in VAX value: 10 SPECfloogles
is 10 times faster than SPEC's VAX 11/780.  Not perfect, but better
than nothing.

>further, i'd want to see how the "standard" bm results scale with
>problem size, especially on cache-based memory systems. because a
>buy decision based on a single number could come back to haunt me.

This is a valid point, but has nothing to do with whether wall clock
or relative-to-known values are reported.

>i'd also want to know what sort of performance enhancements i could
>expect if i wanted to put 1 hour, 1 day, and 1 week's effort into
>the optimization of any particular code, if possible.

Lotsa luck.  I don't know of any benchmarks that attempt to anticipate
what gains could be made by optimization, by you or anyone else.

>i'd also want to compare a vendor's peak performance with how well
>it did on standard bm's or on my own.

Just ask the vendors, they'll be glad to give you peak performance
figures.  :-)

>finally, i'd want to see what sort of support i can expect from the
>vendor. granted, pre-sales and post-sales activities can vary
>greatly, but i think i can shake out a vendor during the sales
>cycle, as most savvy buyers can.

This isn't a benchmarking issue at all.  Benchmarking can't and
shouldn't attempt to prevent the foolish buyer from buying foolishly.
Raw performance is just one criterion that should be part of a
procurement effort.

>and
>believe me, if i see 2 or 3 systems with uni-number ratings within
>say 5% of each other, i sure as heck would not say "these machines
>are identical, so let's buy the cheaper one because it has better
>price/SPECperformance".

I couldn't agree more.

>i'd want to look at the raw data anyway, and
>probably run my sort of workload on them to really get an idea of
>what i can expect.

SPEC provides the real data.  The SPECmark is just a handy single
figure of merit.  Better that than dhrystone-mips.  As for testing
them yourself: have at it.  Sometimes that's not feasible, and that's
what benchmarks are for.

>similarly, if i see two machines that differ by
>a lot in some particular individual tests, i'd want to know why.

Again, I agree.  But identifying the reason is not a benchmarking issue.
Identifying the difference *is*.

>the basic problem i see with these uni-number ratings is that
>people can make up their minds, even subconsciously, based on a
>first impression.

So what do you propose?  Outlawing single figures of merit?  Better to
have one that is subject to much scrutiny and well understood than to have
something ad hoc, unreliable, informal, etc.

>this is human nature. you always have that in
>the back of your mind. and it is easy to just say "2 > 1.5"
>rather than "based on some real workload, and on problem size,
>and on vendor support, and on application availability, and on
>whatever, 2 is not necessarily > 1.5".

No, 2 *is* greater than 1.5.  Always.  The problem is that there may
be more important issues that aren't so easily quantifiable.

>surely we can give more credit to the intellect of people making
>buy decisions than that?

You're the one who seems to think people are going to base their
decisions solely on an SFM.  The detailed data is available to those
who want it.

I don't mean to come across as some kind of SPEC apologist; I just
think what they're doing is better than what was done before they
existed. 

-- 
Dave Sill (de5@ornl.gov)
Martin Marietta Energy Systems
Workstation Support