[comp.arch] Benchmarks

eugene@pioneer.arpa (Eugene N. Miya) (05/14/88)

In summary from my paper:

What makes good benchmarks:

You require reproducibility, comprehensibility, and you would like
simplicity.  In fact, benchmarks are too simple.  Machines are becoming
more diverse: multiprocessors of different architectures, smart
software, etc.

The simple linear model of measurement (take start time, do work, take
stop time) is susceptible to these influences.  What you want are strict
controls on the pre-test condition, the test condition, and the post-test
condition.  The conditions during the test matter just as much.  One of the
biggest challenges to performance measurement is parallelism: the
mythical MIPS problem, confusing work and effort as Brooks would say.
So linearity is something we have to fight.  Our problem is
oversimplification.
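
To make the model concrete, here is a minimal sketch (assuming only a
BSD-style gettimeofday(); the work loop is a placeholder, not any particular
benchmark) with comments marking where the pre-test, during-test, and
post-test controls would have to go:

    #include <stdio.h>
    #include <sys/time.h>

    /* The naive linear model: take start time, do work, take stop time.
     * Anything the system does before, during, or after the timed region
     * (daemons, paging, other users) leaks into "elapsed" unless it is
     * explicitly controlled for.
     */
    static double seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, (struct timezone *) 0);
        return tv.tv_sec + tv.tv_usec / 1.0e6;
    }

    int main(void)
    {
        double start, stop, sum = 0.0;
        long i;

        /* pre-test condition: quiesce the system, fix the initial state */
        start = seconds();
        for (i = 0; i < 1000000L; i++)          /* "do work" (placeholder) */
            sum += (double) i;
        stop = seconds();
        /* post-test condition: record system state, validate the result */

        printf("elapsed = %f s  (sum = %f)\n", stop - start, sum);
        return 0;
    }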

Test data cases must be carefully selected.  Do you execute a
significant (though not necessarily the largest) portion of the code?  Data
can be as important as the program itself.  Don't even think about asking
about interactive characterization.  It's mostly a joke.  (Pardons to those
readers who did their PhD theses on characterizing interactive
systems.)

We have to know when hardware/software is being beneficial or detrimental.
We must throw off the idea that a program is an experiment [Feigenbaum?];
it is not.  A single program measurement lacks the experimental controls
necessary for good measurements.

We have synthetic as well as real program benchmarks.  The former are
usually derided as unrealistic, but the problem is that our concept of the
execution of a program is too simplistic.  We talk of memory-CPU
benchmarks when most of the time is spent doing I/O.  We have few I/O
benchmarks.  There's no such thing as a standard application.  What we need
to do is run programs over applications, then take the resultant data to
form a performance prediction.  (Notice we are never told expected Linpacks,
only measured Linpacks.)
My survey covers the shortest and longest synthetic programs I know.
Some syntactic analysis, some dynamic, etc.
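
As a toy illustration of forming a prediction from resultant data (the
weights and times below are invented for the example, not measurements of
anything), the arithmetic is just a workload-weighted sum of per-application
run times:

    #include <stdio.h>

    /* Hypothetical example: predicted workload time is the weighted sum of
     * measured per-application times.  All numbers are invented.
     */
    #define NAPPS 3

    int main(void)
    {
        double weight[NAPPS]  = { 0.5, 0.3, 0.2 };       /* fraction of real workload  */
        double runtime[NAPPS] = { 120.0, 45.0, 300.0 };  /* measured seconds, each app */
        double predicted = 0.0;
        int i;

        for (i = 0; i < NAPPS; i++)
            predicted += weight[i] * runtime[i];

        printf("predicted workload time = %.1f seconds\n", predicted);
        return 0;
    }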

The problem is we treat machines and benchmarks like black boxes.  We
expect near-worthless single figures of merit to select machines (I
wonder what types of cars these people buy).  I'm interested in starting
a new "study"; call it computer cardiology.

We must systematize the measurement process.  I'm talking to one special
software house about a benchmark test program generator and am working on
prototypes, now in my spare time.
The ideal measurement tool must have a high degree of portability.  It
must be reasonably simple, and the analysis portion must be separable from
the measurement portion.  Unfortunately, most machines make poor
measurement environments: IBM 370s, VAXen, Macs, PCs.  Quantity does not
make something good; my standalone time on our X-MP has been curtailed
because our users also need the machine.  Cray-2s don't have HPMs.

The ideal tools should allow one to vary parameters carefully, one at a
time.  Linpack, while a nice-looking, simple, single figure of merit, has
diminishing parallelism (since it's a direct solution).  Any 32-bit result
should be viewed with suspicion (it is a 64-bit test).  Its value is
that Jack Dongarra dares to name names and does not fear getting sued for
holding damning numbers, nor does Rick Richardson with Dhrystone, for that
matter.  We want the computer equivalent of the pocket tape measure.
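
To show what "one parameter at a time" can look like, a sketch of my own
(the daxpy-style loop is a stand-in for a real kernel; this is not Linpack)
that varies only the vector length and holds everything else fixed:

    #include <stdio.h>
    #include <sys/time.h>

    /* Illustrative sweep: the vector length n is the only thing varied;
     * kernel, data, and scalar stay fixed from run to run.
     */
    #define NMAX 100000

    static double x[NMAX], y[NMAX];

    static double seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, (struct timezone *) 0);
        return tv.tv_sec + tv.tv_usec / 1.0e6;
    }

    int main(void)
    {
        double a = 3.0, t0, t1;
        int n, i;

        for (i = 0; i < NMAX; i++) { x[i] = 1.0; y[i] = 2.0; }

        for (n = 1000; n <= NMAX; n *= 10) {     /* the single varied parameter */
            t0 = seconds();
            for (i = 0; i < n; i++)
                y[i] = y[i] + a * x[i];          /* daxpy-like kernel */
            t1 = seconds();
            printf("n = %6d   time = %f s\n", n, t1 - t0);
        }
        return 0;
    }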

Computers really don't differ in fundamental construction all that much
currently (well, Multiflow, CM[12], DAP, etc.).  These represent new
challenges for benchmarking.  No, the paper does not read like this; it's
being typed "stream of consciousness" during the "heat of passion."

There is a mailing list devoted to performance measurement (@cs.wisc.edu).
But they are mostly queueing theorists, not benchmarkers.  Largely
quiet; after all, SIGMETRICS '88 is what, next week?

P.S. Don't ask me for a copy yet, I will announce availability, I've
promised far too many people and get side-tracked too often.
I have a shorter paper which is undergoing review on a tiny aspect of
the bigger picture, but I have to send a copy of the bigger one to
John, Chuck, and lots of others.  Don't worry.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,hao,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups.  If enough, I'll summarize."

sah@mips.COM (Steve Hanson) (05/15/88)

In article <8734@ames.arc.nasa.gov> eugene@pioneer.UUCP (Eugene N. Miya) writes:
>In summary from my paper:
>
>What makes good benchmarks:
>
>You require reproducibility, comprehendability, and you would like
>simplicity.
[other stuff]


	Good benchmarks also validate their results with expected
results.  If we don't know whether a benchmark ran successfully, how
can we draw conclusions?
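
A minimal sketch of the point, using a made-up kernel whose correct answer
is known in closed form (it is not from any real benchmark suite); if the
result doesn't match, no timing figure gets reported at all:

    #include <stdio.h>

    /* Hypothetical kernel: the sum of 0..N-1 has a closed-form answer, so a
     * compiler that deletes the loop, or a machine that gets it wrong, cannot
     * silently produce a fast but meaningless result.
     */
    #define N 10000L

    int main(void)
    {
        long i, sum = 0;
        long expected = N * (N - 1) / 2;        /* known-good answer */

        for (i = 0; i < N; i++)
            sum += i;

        if (sum != expected) {
            printf("benchmark FAILED: got %ld, expected %ld\n", sum, expected);
            return 1;               /* don't report a time for a wrong answer */
        }
        printf("benchmark OK: result validated\n");
        return 0;
    }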

-- 
UUCP: {ames,decwrl,prls,pyramid}!mips!sah
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086

colwell@mfci.UUCP (Robert Colwell) (05/15/88)

In article <8734@ames.arc.nasa.gov> eugene@pioneer.UUCP (Eugene N. Miya) writes:
>In summary from my paper:
>
>What makes good benchmarks:
>
>You require reproducibility, comprehendability, and you would like
>simplicity.  In fact, benchmarks are too simple.  Machines are becoming
>more diverse: multiprocessors of different architectures, smart
>software, etc.

I think there's one other side to the benchmarking issue.  It's
currently fashionable to dump on the simple benchmarks (Dhrystone,
puzzle et al.) and I've done my share.  But the really simple
benchmarks are not without value.  I think one should always start
with them, and if they run well, it's ok to smile for a few minutes.
If they don't, your machine/architecture/toaster may require some
adjustments, and the simpler benchmarks are much less ambiguous in
terms of causes and effects than larger ones (fewer cache effects,
clearer compiler/architecture interactions, controllable I/O and disk
accesses...)

If what we're looking for is a set of benchmarks that would be useful
in comparing diverse machines and architectures, then I second John
Mashey's call for better integer codes.  I just hope that we don't
swing the pendulum all the way from "compare machines on trivial
benchmarks and make unwarranted conclusions" (which was all the rage
a while back) to "small benchmarks are a joke and anyone who uses
them is too."

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

aglew@urbsdc.Urbana.Gould.COM (05/16/88)

>[Eugene Miya]:
>There is a mailing list devoted to performance measurement (@cs.wisc.edu).
>But they are mostly queueing theorists, not benchmarkers.  Largely
>quiet, after all SIGMETRICS'88 is what next week?

Thanks, Eugene... there are a few SIGMETRICS members who realize that it
doesn't do you any good to simulate things with queuing theory or Petri
net models until you can make measurements to calibrate the models.
Trouble is, too many people think of measurement as trivial, a solved
problem.

And by the way: benchmarking is *NOT* the be-all and end-all of measurement.
Benchmarks must be calibrated just like models must be. Eugene knows this,
but many people act like they don't (Why do you want to measure my system?
Can't you just run benchmarks?)

Computer system performance evaluation should start off with measurement,
with real customers, on real systems, to determine (1) what is important
to your customers, and (2) what they actually do with the system.  (2) can
influence (1), as in "look, you say that floating point speed is the most
important thing to you, but you spend 90% of your time doing integer work";
but not always: "we do integer work to fill in any slack time.  A real
application has 90% slack time, but those 10% are on time-critical paths."
"Oh."

Once you have measurements, they may be abstracted into benchmarks.
Benchmarks can be used to drive simulations, or to evaluate a new system.
Benchmarks can be used to make far more complicated "pseudo-measurements",
because you can warp the time scale to use time-costly instrumentation.

But, just as simulations without realistic workloads are useless, so are
benchmarks without an underlying rationale based on measurement. (Actually,
change "useless" to "not so useful" - sometimes they are better than nothing).

Trouble is, most texts and papers on performance evaluation assume that you
have got the measurements - given a spectrum of points in your measurement
space, here is how you produce a reasonably good workload sample. That's
computer science.
    Me, I want to be a "computer naturalist" - I want to develop measurement
techniques that can be applied easily, cheaply, to a variety of systems
and workloads, over extended periods of time.

See you in Santa Fe...

davidsen@steinmetz.ge.com (William E. Davidsen Jr) (05/16/88)

  There are (at least) two ways to do benchmarks: by running complex
programs which are similar to your desired use, or by measuring the
hardware performance in a number of areas and trying to match the
profile to the resources needed to run your program.

  I am a believer in the independent testing of facets of the machine's
capabilities, since that allows me to run multiple tests concurrently to
determine interaction, i.e., if a machine has good disk access for a
single process, how does it look under load?  What does the use of
floating point have to do with the disk access (hopefully nothing)?
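
  A rough sketch of that kind of experiment (disk_test and fp_test here are
stand-ins for whatever single-facet tests you already trust, not real
programs): time a facet alone, then again with a competing load, and compare.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void)
    {
        pid_t pid;
        int status;

        /* facet alone: baseline disk figure */
        system("sh disk_test");

        /* same facet again, with a floating point load in the background */
        pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {
            system("sh fp_test");           /* competing load */
            exit(0);
        }
        system("sh disk_test");             /* contended disk figure */
        waitpid(pid, &status, 0);

        /* a large gap between the two disk timings means the facets interact */
        return 0;
    }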

  I have seen machines which had lousy disk performance for a single
process and very little deterioration under load, while machines with very
fast disks sank like a rock when loaded.  I have seen machines which had
poor throughput for CPU-bound jobs when disk-bound processes were running.
I wouldn't have seen this as clearly if I hadn't had an idea of what the
machine did in each area individually.

-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

dav@abvax.icd.ab.com (Dave A. Vasko) (11/15/88)

Does anyone know where I can get source code for the Whetstone, Stanford
integer and Stanford floating point benchmarks?  I assume Stanford
University has the source for the benchmark that bears their name, but I'm
not sure who to contact.  Any help would be appreciated.
-- 
****************************************************************************
* Dave Vasko                Allen-Bradley  Highland Hts. Ohio 44139        *
* (216) 646-4695            ...!{decvax,pyramid,cwjcc,masscomp}!abvax!dav  *
****************************************************************************

jonasn@ttds.UUCP (Jonas Nygren) (10/01/89)

Up till approx 6 months ago there were lots of performance discussions
in comp.arch, lots of CISC versus RISC reporting.  But suddenly all
this stopped, at least on this side of the Atlantic, and I am curious
whether this has something to do with SPEC.  Did all participants in the
SPEC group sign a vow of silence, or why haven't we seen any reports yet?

Anybody know the answer?

/jonas

thomas@uplog.se (Thomas Tornblom) (02/22/90)

I would like to get pointers to various benchmarks for Unix machines.
What I'm particularly interested in is filesystem throughput, context switches
and the like.

Thanks,
Thomas
-- 
Real life:	Thomas Tornblom		Email:	thomas@uplog.se
Snail mail:	TeleLOGIC Uppsala AB		Phone:	+46 18 189406
		Box 1218			Fax:	+46 18 132039
		S - 751 42 Uppsala, Sweden