[comp.benchmarks] Next diatribe on benchmarking

eugene@eos.arc.nasa.gov (Eugene Miya) (11/30/90)

A week ago, I posted a detailed note on a few of the aspects of
benchmarking.

I didn't name all the aspects (John McA. noted one, jumping the gun;
John, other benchmarkers and I have been talking about you.....).
I showed the smallest benchmarks I knew of, but didn't say anything about
their problems (simplicity, formatting, portability, representativeness,
parallelism, optimization, etc.).  I also tried to give a taste of what I
feel are necessary "things to come" for all systems.

Warning: I am remotely logged in on a machine with a not-so-good terminal
emulator, so big typos are likely, and I also need to "draw" some
figures.  Please excuse.

I think benchmarking's problems are derived from the way it has largely
developed: as a hasty activity, a secondary thought after testing.  Frank
McMahon told me that he'd rather develop compilers.  And so on.  Many of the
problems are methodological, which contributes to benchmarking's standing
as an "afterthought."

For many, benchmarking is seen as a very linear activity like this:

	_________
	|	|
---------	------>
	|_______|
	A	B

You have some activity you want to time; you take a clock, grab the start and
stop times, and subtract.  Simple, right?  Okay for a first approximation.
It's the way we are taught: the program is a black box.  The problem comes
when we must open the black box, analyze it, and turn it into a gray
and then a white box.  This model, very linear, is not adequate for
parallel systems or for systems where tricks, optimizations, and special
features are present.  Instead, we must develop benchmarking into a better
experimental endeavor.
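
For concreteness, here is a minimal sketch of that linear model in ANSI C.
The workload() function is a made-up stand-in for the black box, and clock()
only gives you whatever resolution the system timer has (more on clock
quality below); treat this as an illustration, not a recommended harness.

	#include <stdio.h>
	#include <time.h>

	/* stand-in for the black box being timed */
	static double workload(void)
	{
	    double s = 0.0;
	    long i;
	    for (i = 0; i < 1000000L; i++)
	        s += (double)i * 0.5;
	    return s;
	}

	int main(void)
	{
	    clock_t a, b;
	    double r;

	    a = clock();        /* point A */
	    r = workload();     /* the black box */
	    b = clock();        /* point B */

	    printf("result %g, elapsed %.6f s (CPU)\n",
	           r, (double)(b - a) / CLOCKS_PER_SEC);
	    return 0;
	}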

Philosophy-of-computer-science hat on (with a good reference).
I am fortunate to share an office with a 64-year-old man I respect
and can bounce ideas off of.....  Some years ago, the observation occurred
to me that computer science and programming were too mathematical
(this contrasts with what I learned in Software Engineering before I got
to supercomputers, where software people like Dijkstra think it is
not mathematical enough; we are both right).  Many people may have heard
the saying: "Any field which has to use 'science' in its name isn't
a science...."  I can see why, now.  We have a weak empirical basis,
unlike physics and chemistry, whose people constantly berate CS people.
It hit me when certain CS people announced that "programs are experiments."
Bull shit.  They aren't.  (If you need to know the people who said
this, they are noted AI people.)

There exists a fear that CS must strive to be closer to the "hard" sciences
like physics and chemistry rather than to "softer" sciences like
biology or *gasp!* the social "sciences."  This is seen as necessary because
the harder sciences control science funding in this country.
I had this verified a few years back by the then-chairman of the
CMU CS Dept.  He gasped when I said that CS was, like psychology,
a study of behavior.  He did not want his discipline associated with that
softer science, which is too bad, because we could learn a few things
from biology and psychology.

I think we will have to resort to some of these techniques because
the indeterminate "behavior" of increasingly large computer systems
will require it.  Parallelism, optimization, even AI will force us to
re-examine our testing methodologies.  Theory in some ways dominates
hands-on work; hands-on work is an important hacker attribute we must NEVER
forget.  The question is: what do we need to do?  I worked in a psych
dept designing experiments as a work-study student; it's not hard,
a good reference exists, and it shows up our (CS) problems.  I also want to
give the best CS reference I can as well.

Many fields have simple, well-written books: thin, cheap, readable.
For writing, we hand out Strunk and White's The Elements of Style, almost
unconsciously: thin, cheap, and one of these days I shall open my copy (1/2 8^).
Our field of computing is no exception.  I think in computer science our
equivalent book is Kernighan and Plauger's The Elements of Programming Style.
Also thin, cheap, a real gem (a few other books, too).  Such a book
exists for experiment design, known by its authors' names, Campbell and
Stanley: Experimental and Quasi-Experimental Designs for Research,
published by Rand-McNally in the 60s.  My copy cost $3.
I hear it now costs $18, but so does K&P.

It is a book written for education research but it is usable in studies of
behavior like psychology and biology (e.g., new drug testing).  Campbell and
Stanley do two very good things: 1) they introduce a notation which describes
the design of experiments, and 2) they describe in detail how each design is
subject to criticism in the form of "invalidity."  Most studies are
invalidated by two classes of problems (internal and external).
Three tables summarize each design and go over the properties of each
with respect to each form of invalidity.  Again, this book is cheap,
thin, and readable.  If I can do one thing to influence the course of
computing science, it is to get people to use this book.  The problem
is you must stretch your thinking to see how the examples are used
in computer science.  But the stretch does not have to go very far.

If you require a CS-oriented book, no such book exists, but the best
papers on experiment design for computer science, and in particular
performance measurement, come from Philip Heidelberger at the IBM T.J. Watson
Research Center.  And in my opinion, some good work will come out of
Allen Malony and Dan Reed at U of I.  And, I hope, me 8^).  Other good
previous work came out of the people at Los Alamos, Lawrence Livermore,
and a couple of other places.  I think this work is all somewhat biased
(except Philip's) toward high-performance systems.  This colors
the work.  Some good work comes from the folks who performance-tune
systems (now we get into the SIGMETRICS crowd and more queueing network
theory, MVA, etc.), which starts to get a bit more theoretical.  Too esoteric
for some.

NOW:  I am willing to collect interesting references on benchmarking.
Please mail them and I will credit you in the keyword field.  Annotations
are useful.  If you know of other useful references on benchmarking,
please let me know and I will redistribute them.  Multiple citations
of the same paper I will not credit.

I want to make a distinction here, however artificial (as in many fields,
we use the terminology in a sloppy way):
	I will use "performance measurement" to mean the empirical
	testing of a system.  This is the hacker's hands-on imperative (Levy).
	"Performance analysis" is the theoretical complement to measurement.
	A benchmark executed on real hardware is the former; a
	simulated or emulated benchmark is the latter.  I think (or hope)
	we all know which is slightly better 8^).
	Both of these come under the heading of "performance evaluation."

One of the nice things about working at Ames is that it's close to
certain notable people in CS.  One of those people is our dear ex-RIACS
Peter Denning.  I discussed some of this with him, and he gave me two
editorials on experimental computer science.  While I agree with the need,
I did not find them very experimental.  Something was lacking, and it ate
away at the back of my head because of my earlier interests in other sciences.
I then figured it out: nowhere in the CS literature did I see the concept
of experiment control.  I believe we confuse program control flow with
experiment control.  Remember in chemistry, physics, and biology?  You have
an experimental condition and a "control" condition?  (Or maybe a few readers
are too young to remember chemistry sets; they don't sell them any more.)
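
To make "experiment control" concrete in benchmarking terms, here is a sketch
of my own (not taken from any text): pair the experimental condition with a
control condition that runs the identical harness on a null workload, so the
overhead of the clock and the harness itself can be separated from the thing
being measured.  The function names and loop counts here are made up.

	#include <stdio.h>
	#include <time.h>

	static double workload(void)        /* experimental condition */
	{
	    double s = 0.0;
	    long i;
	    for (i = 0; i < 1000000L; i++)
	        s += (double)i;
	    return s;
	}

	static double null_workload(void)   /* control condition: do nothing */
	{
	    return 0.0;
	}

	/* time one run of f() with the same instrumentation either way */
	static double time_it(double (*f)(void))
	{
	    clock_t a, b;

	    a = clock();
	    (void)(*f)();
	    b = clock();
	    return (double)(b - a) / CLOCKS_PER_SEC;
	}

	int main(void)
	{
	    double control = time_it(null_workload);  /* measurement overhead */
	    double treat   = time_it(workload);       /* condition of interest */

	    printf("control  %.6f s\n", control);
	    printf("workload %.6f s\n", treat);
	    printf("net      %.6f s\n", treat - control);
	    return 0;
	}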

Looking in Campbell and Stanley, the designs are classified into 3 types:
	"Pre-experiments": these are not experiments; they suffer from
		major invalidity problems.  Most CS studies fall into these
		categories, such as the "one-shot" (X O) design: "do an
		experimental condition and see what you get."  It suffers
		problems; get the book and see what kinds.
	"'True' (tm) experiments": this is the pre-test, post-test control
		group design.  Several of them.  They handle invalidity well.
	"Quasi-experiments": while not 'true' experiments, these are useful
		methodologies for coping with problems.
The psychology terminology for some of this includes "within-subject"
and "between-subject" testing, and longitudinal studies.  What are some
of the invalidity problems I refer to?  History effects (timing),
instrumentation, 'learning,' interactions, etc.  About 12 things to consider.
Remember how you may have once talked about independent versus dependent
variables?  You also have extraneous variables; this book covers them.

Two points.  First, CS has inflated the meaning of "experiment."  It makes CS
sound like a science when many of the studies aren't really experiments.  It's
not just terminology I am batting around; this is a good way to see where
studies are lacking and how to do better studies (useful if you are a critic,
thus allowing you to shoot down colleagues).  These people largely only make
observations.  We glorify them, since mere observations do not sound
scientific enough.

Second, many CS departments place an emphasis on statistical methods of
experiment design.  I think this is premature.  That is the beauty of C&S:
it keeps the stats to a minimum, and very simple.  It just tries to reason
logically; you can get more statistically oriented books later (C&S gives
references).  This is part of my reason for not liking statistics much
(well, not quite that extreme).

Get this reference book.  It will help.  And think about this in
computer science terms.

Why know this?  Because near term developments will invalidate this model:
	_________
	|	|
---------	------>
	|_______|
	A	B
A dataflow machine has the POTENTIAL of being nearly impossible to collect
timings from at A and B, because those points either force synchronization or
have no consistent meaning.  Instead we must develop a model of measurement
with 5 components:
	_________	_________	_________	_________
	|	|	|	|	|	|	|	|
------->|   1	|------>|   2	|------>|   3	|------>|   5	|------>
	|_______|	|_______|  |	|_______|   |	|_______|
				   |		    |
				   |	_________   |
				   |	|	|   |
				   |--->|  4	|-->|
					|_______|
1- Pre-compile phase; issues: portability, input test case selection,
				test selection, ...
2- Pre-test execution phase; issues: input, data generation, clock sync, etc.
3- Test phase: "the banana"
4- Simultaneous non-test condition: outside the scope of the test, but very
					important
5- Post-test phase; issues: output, clean up, statistical analysis, etc.
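
As one way of seeing the shape of this, here is a skeletal harness of my own
devising (not a standard) which keeps the phases distinct so each can be
instrumented and criticized separately.  Phase 1 happens before the program
exists, and phase 4 (other jobs, daemons, the clock interrupt itself) happens
outside the program's control, so those two appear only as comments; the
phase bodies are placeholders.

	#include <stdio.h>
	#include <time.h>

	/* Phase 1 (pre-compile): portability, test-case and test selection,
	   compiler flags, all decided before this source is compiled. */

	static void pretest_setup(void)   /* phase 2: input, data generation,
	                                     clock sync */
	{
	}

	static double test_phase(void)    /* phase 3: "the banana" */
	{
	    double s = 0.0;
	    long i;
	    for (i = 0; i < 1000000L; i++)
	        s += (double)i;
	    return s;
	}

	/* Phase 4 (simultaneous non-test condition) is whatever else the
	   machine is doing while phase 3 runs; it cannot be coded here,
	   only controlled or recorded. */

	static void posttest(double r, double secs)  /* phase 5: output,
	                                                cleanup, statistics */
	{
	    printf("result %g, time %.6f s\n", r, secs);
	}

	int main(void)
	{
	    clock_t a, b;
	    double r;

	    pretest_setup();                          /* phase 2 */
	    a = clock();
	    r = test_phase();                         /* phase 3 */
	    b = clock();
	    posttest(r, (double)(b - a) / CLOCKS_PER_SEC);  /* phase 5 */
	    return 0;
	}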

We have a penchant for determinism, and this is the only way we can arrive at
an adequate test measurement.  It may seem complex at first, but it really
isn't.  A couple of other things.  We have to break the benchmarking process
down.  My officemate came up with a neat thing to think about: it happened by
chance to resemble Amdahl's "other law."

[Unfortunately my attempts to draw figures in ASCII fail across the net using
this particular terminal emulator.......grrrr....]  George made me think
about balancing triangles.  His original one was:

		^
   Algorithm   / \ Architecture
	       <_>
	    Language		This triangle looks pitiful.

So I sought mnemonic devices and ideas to express what I wanted to do, like
the ABC's of benchmarking:

		^
   Applications/ \ Benchmarks
	       <_>
	    Computers
At heart, a benchmark is a surrogate for the application program.
A benchmarker hopes that a benchmark may lead to greater insight than
simply running an application.

It is useful to consider performance measurement in light of analogies
with other kinds of performance measurement.  I am inspired by the photography
of E. Muybridge and H. Edgerton; you can use cars (your mileage may vary);
Rafael likes audio systems.  I can think of "joke" titles:
Beyond the Livermore Loops, The Next 700 Benchmark Programs,
Benchmarking Considered Harmful, The Mismeasure of Computers 8^)
[S.J. Gould has thought about performance].  Analogies have their limits;
they are just models, and we still have to develop a model of taking
performance measurements.

The last thing I want to write down is that I want to have a system to
note the quality of a measurement environment.  Here is my scale:
A, B, C, D, E.  C should be just that: an average software environment,
an average hardware environment.  The scale should be sliding.  What is an A
measurement environment?  I have presented a current 'A': a Cray X-MP
or a Y-MP.  What makes an 'A': a cycle-time clock, specialized non-intrusive
monitoring hardware (instrumentation being one of the sources of invalidity,
you recall), good profiling software, etc.  B environments: Cray-1 or Cray-2,
Convex C-1, computers with cycle-time clocks and good software.
C environments: VAXen, IBM mainframes with 60 Hz clocks, PCs, etc.
D environments: computers which don't have real clocks, and
E environments: for testing of components like CPUs separated from other
system parts, with incomplete resources.
I see a new book on the DEC M31 is coming out; I think this sounds like
an 'A' machine.
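
One simple calibration experiment that separates the A and B machines from
the C and D machines is to measure the granularity of the clock you are about
to trust: spin until the clock ticks and record the size of the tick.  Here
is a sketch using the portable (and coarse) ANSI clock(); on a machine with a
cycle-time clock you would read that register instead.  The count of five
samples is arbitrary.

	#include <stdio.h>
	#include <time.h>

	int main(void)
	{
	    clock_t t0, t1;
	    int i;

	    printf("CLOCKS_PER_SEC = %ld\n", (long)CLOCKS_PER_SEC);

	    /* observe a few successive ticks to estimate real resolution */
	    for (i = 0; i < 5; i++) {
	        t0 = clock();
	        do {
	            t1 = clock();
	        } while (t1 == t0);        /* spin until the clock advances */
	        printf("tick %d: %.6f s\n", i,
	               (double)(t1 - t0) / CLOCKS_PER_SEC);
	    }
	    return 0;
	}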

If I am beginning to sound punchy now, it is because I have been trying to
run benchmark programs for hours.  See the post time.  Really stressing
the humans as well as the machine.  Time for some vacation; I hope the
discontinuities of programming have not broken the stream of writing too badly.
When I get back, I intend to use the 5-part model to illustrate some of the
sub-issues involved in measuring different computers, hardware and software.
What is clear is that we must also measure the application (there is no other
way to be "representative").  I wish to concentrate
first, however, on just trying to separate differences "within" a given
machine.  I will also try to present some simple experiments which
calibrate measurements on a given machine.  I hope I open a few eyes.
For further reading I would also like to note a paper by Noga and Gunther
about measurement discontinuities, and a paper I wrote (poorly, sorry) for
the 1988 Usenix Supercomputing Workshop.

--e.n. miya, NASA Ames Research Center, eugene@eos.arc.nasa.gov
  {uunet,mailrus,most gateways}!ames!eugene
  AMERICA: CHANGE IT OR LOSE IT.

buck@drax.gsfc.nasa.gov (Loren (Buck) Buchanan) (12/05/90)

Thanks, eugene, for another eye-opening discussion of a current problem.

A difficult-to-implement but useful benchmark would be one which would
show the speed of processing at various levels of memory (cache, main, and
virtual).  Note that for some computers there may be more than one level
of each kind of memory.  The result of such a benchmark would not be a
single number, but a set of numbers which, when plotted, would look something
like this:

t                                        ____.....--------'''''''''''''''
i                                       /
m                                      /
e                                     /
                                     /
                 ____.....------'''''
                /
               /
              /
__...----'''''             problem size

The mostly horizontal portions are where the problem is mostly contained
within the limits of that level of memory, and the mostly vertical portions
are where thrashing starts to occur.  A second benchmark would be one that
would involve multiprocessing and interprocess communication, along with
the change in problem size, to measure the effect of various levels of
memory.
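
A rough sketch of the first kind of benchmark: walk arrays of increasing
size while timing a fixed number of references, so that the time per
reference rises at each knee where the working set spills out of a level of
memory.  The sizes, the stride, and the reference count below are made up
and would need tuning for any particular machine.

	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>

	#define NREF 10000000L   /* references per problem size (arbitrary) */

	int main(void)
	{
	    long size, i, n;
	    clock_t a, b;

	    /* double the working set from 8 KB of doubles up to 64 MB
	       (or until malloc fails), timing NREF strided reads each time */
	    for (size = 1024L; size <= 8L * 1024L * 1024L; size *= 2) {
	        double *x = (double *)malloc(size * sizeof(double));
	        double s = 0.0;

	        if (x == NULL)
	            break;
	        for (i = 0; i < size; i++)
	            x[i] = 1.0;

	        a = clock();
	        for (n = 0, i = 0; n < NREF; n++) {
	            s += x[i];
	            i += 8;                /* stride of 8 doubles */
	            if (i >= size)
	                i -= size;
	        }
	        b = clock();

	        printf("%8ld doubles: %.3f ns/ref (sum %g)\n", size,
	               1e9 * ((double)(b - a) / CLOCKS_PER_SEC) / NREF, s);
	        free(x);
	    }
	    return 0;
	}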

Another independent dimension to the above is the division between data
and instruction caching.  It would of course be easiest to
write a benchmark whose entire program can be cached while the
data is thrashed, but ... well, you get the idea.

B Cing U

Buck

Loren Buchanan     | buck@drax.gsfc.nasa.gov   | #include <std_disclaimer.h> 
CSC, 1100 West St. | ...!ames!dftsrv!drax!buck | typedef int by
Laurel, MD 20707   | (301) 497-2531            | void where_prohibited(by law){}
Phone tag, America's fastest growing business sport.