eugene@eos.arc.nasa.gov (Eugene Miya) (11/30/90)
A week ago, I posted a detailed note on a few of the aspects of benchmarking.
I didn't name all the aspects (John McA. noted one; jumping the gun, John --
other benchmarkers and I have been talking about you.....).  I showed the
smallest benchmarks I knew, but didn't say anything about their problems
(simplicity, formatting, portability, representativeness, parallelism,
optimization, etc.).  I also tried to give a taste of what I feel are
necessary "things to come" for all systems.

Warning: I am remotely logged in on a machine with a not-so-good terminal
emulator, so big typos are likely, and I also need to "draw" some figures.
Please excuse.

I think benchmarking's problems are derived from the fact that it has largely
developed as a hasty activity, a secondary thought after testing.  Frank
McMahon told me that he'd rather develop compilers.  And so on.  Many problems
are methodological, which contributes to benchmarking's standing as an
"afterthought."  For many, benchmarking is seen as a very linear activity,
like this:

               _________
              |         |
      ------->|         |------->
              |_________|
      A                        B

You have some activity you want to time; you take a clock, grab the start and
stop times, and subtract.  Simple, right?  Okay for a first approximation.
It's the way we are taught: the program is a black box.  The problem comes
when we must open the black box, analyze it, and turn it into a gray and then
a white box.  This very linear model is not adequate for parallel systems, or
for systems where tricks, optimizations, and special features are present.
Instead, we must develop benchmarking into a better experimental endeavor.

Philosophy-of-computer-science hat on (with a good reference).  I am fortunate
to share an office with a 64-year-old man I respect and can bounce ideas off
of.....  Some years ago, the observation occurred to me that computer science
and programming were too mathematical (this contrasts with what I learned in
software engineering before I got to supercomputers, where software people
like Dijkstra think it is not mathematical enough; we are both right).  A few
people may have heard the term: "Any field which has to use 'science' in its
name isn't a science...."  I can see why, now.  We have a weak empirical
basis, unlike physics and chemistry, which constantly berate CS people.  It
came to a head when certain CS people announced that "programs are
experiments."  Bull shit.  They aren't.  (If you need to know the people who
said this, they are noted AI people.)

There exists a fear that CS must strive to be closer to the "hard" sciences
like physics and chemistry rather than the "softer" sciences like biology or
*gasp!* the social "sciences."  This is because the harder sciences control
science funding in this country.  I had this verified a few years back by the
then chairman of the CMU CS Dept.  He gasped when I said that CS was, like
psychology, a study of behavior.  He did not want his discipline associated
with that softer science, which is too bad, because we could learn a few
things from biology and psychology.  I think we have to resort to some of
these techniques because the indeterminate "behavior" of increasingly large
computer systems will require it.  Parallelism, optimization, even AI will
force us to re-examine our testing methodologies.  Theory in some ways
dominates hands-on work, and hands-on work is an important hacker attribute
we must NEVER forget.

The question is: what do we need to do?  I worked in a psych dept designing
experiments as a work-study student; it's not hard, a good reference exists,
and it shows our (CS) problems.  I also want to give the best CS reference I
can as well.
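Before getting to the references, a minimal sketch of the naive A-to-B model
above, in C, assuming nothing beyond the standard library clock(): grab the
clock, run the black box, grab the clock again, subtract.  The kernel and its
size are invented stand-ins for whatever is being timed, and everything the
rest of this note argues about (repetition, control conditions, clock
resolution, what else the machine is doing) is deliberately absent.

/*
 * Minimal sketch of the naive "black box" timing model: time at A,
 * run the thing, time at B, subtract.  The kernel and problem size
 * are hypothetical placeholders for the code under test.
 */
#include <stdio.h>
#include <time.h>

#define N 1000000L

static double kernel(void)          /* stand-in for the work being timed */
{
    double s = 0.0;
    long i;
    for (i = 0; i < N; i++)
        s += (double) i * 0.5;
    return s;
}

int main(void)
{
    clock_t a, b;
    double result, seconds;

    a = clock();                    /* point A */
    result = kernel();              /* the black box */
    b = clock();                    /* point B */

    seconds = (double) (b - a) / CLOCKS_PER_SEC;
    printf("result = %g, elapsed CPU time = %g s\n", result, seconds);
    return 0;
}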
Many fields have simple, well-written books: thin, cheap, readable.  For
writing, we hand out Strunk and White's The Elements of Style almost
unconsciously: thin, cheap, and one of these days I shall open my copy
(1/2 8^).  Our field of computing is no exception.  I think in computer
science our equivalent book is Kernighan and Plauger's The Elements of
Programming Style.  Also thin, cheap, a real gem (a few other books, too).

Such a book exists for experiment design, known by the authors' names:
Campbell and Stanley, Experimental and Quasi-Experimental Designs for
Research, published by Rand McNally in the 60s.  My copy cost $3.  I hear new
it costs $18, but so does K&P.  It is a book written for education, but it is
usable in studies of behavior like psychology and biology (like new drug
testing).  Campbell and Stanley do two very good things: 1) they introduce a
notation which describes the design of experiments, and 2) they describe in
detail how each design is subject to criticism in the form of "invalidity."
Most studies are invalidated by two classes of problems (internal and
external).  Three tables summarize each design and go over the properties of
each with respect to each form of invalidity.  Again, this book is cheap,
thin, and readable.  If I can do one thing to influence the course of
computing science, it is to get people to use this book.  The problem is you
must stretch your thinking to see how the examples are used in computer
science.  But it does not have to go very far.

If you require a CS-oriented book, no such book exists, but the best paper on
experiment design for computer science, and in particular performance
measurement, comes from Philip Heidelberg at the IBM T. J. Watson Research
Center.  And, in my opinion, some good work will come out of Allen Malony and
Dan Reed at U of I.  And, I hope, me 8^).  Other good previous work came out
of the people at Los Alamos, Lawrence Livermore, and a couple of other
places.  I think this work is all kind of biased (except Philip's) toward
high-performance systems.  This colors the work.  Some good work comes from
the folk who performance-tune systems (now we get into the SIGMETRICS crowd
and more queueing network theory, MVA, etc.), starting to get a bit more
theoretical.  Too esoteric for some.

NOW: I am willing to collect interesting references on benchmarking.  Please
mail and I will credit you in the keyword field.  Annotations are useful.  If
you know other useful references on benchmarking, please let me know and I
will redistribute them.  Multiple citations of the same paper I will not
credit.

I want to make a distinction here, however artificial (as in many fields, we
use the terminology in a sloppy way): I will use "performance measurement" to
mean the empirical testing of a system.  This is the hacker's hands-on
imperative (Levy).  "Performance analysis" is the theoretical complement to
measurement.  A benchmark executed on real hardware is the former; the
simulated or emulated benchmark is the latter.  I think (or hope) we all know
which is slightly better 8^).  Both of these come under the heading of
"performance evaluation."

One of the nice things about working at Ames is that it's close to certain
notable people in CS.  One of those people is our dear ex-RIACS Peter
Denning.  I discussed some of this with him, and he gave me two editorials on
experimental computer science.  While I agree with the need, I did not find
them very experimental.  Something was lacking, and it ate away at the back
of my head because of my earlier interests in other sciences.
I then figured it out: nowhere in the CS literature did I see the concept of
experiment control.  I believe we confuse program control flow with
experiment control.  Remember, in chemistry, physics, and biology?  You have
an experimental condition and a "control" condition?  (Or maybe a few are too
young to remember chemistry sets; they don't sell them any more.)

Looking in Campbell and Stanley, the designs are classified into three types:

  "Pre-experiments": these are not experiments; they suffer from major
  invalidity problems.  Most CS studies fall into these categories, such as
  the "one-shot" (X O) design: "do an experimental condition and see what you
  get."  It suffers problems; get the book and see what kinds.

  "'True' (tm) experiments": the pretest, post-test control group design.
  There are several of them.  They handle invalidity well.

  "Quasi-experiments": while not 'true' experiments, these are useful
  methodologies for coping with problems.

The psychology terminology for some of this includes "within-subject" and
"between-subject" testing, and longitudinal studies.  What are some of these
invalidity problems I refer to?  History effects (timing), instrumentation,
'learning,' interactions, etc.  About 12 things to consider.  Remember how
you may have once talked about independent versus dependent variables?  You
also have extraneous variables; this book covers them.

Two points.  First, CS has inflated the meaning of "experiment."  It makes CS
sound like a science when many of the studies aren't really experiments.
It's not just terminology I am batting around; this is a good way to see
where studies lack and how to do better studies (useful if you are a critic,
thus allowing you to shoot down colleagues).  These people largely only make
observations.  We glorify, since mere observations do not sound scientific
enough.  Second, many CS departments place an emphasis on statistical methods
of experiment design.  I think this is premature.  That is the beauty of C&S:
it keeps the stats to a minimum, and very simple.  It just tries to logically
deduce; you can get more statistically oriented books later (C&S has
references).  This is part of my reason for not liking statistics (well, not
quite that extreme).  Get this reference book.  It will help.  And think
about this in computer science terms.

Why know this?  Because near-term developments will invalidate this model:

               _________
              |         |
      ------->|         |------->
              |_________|
      A                        B

A dataflow machine has the POTENTIAL of being nearly impossible to collect
timings at A and B, because they either force synchronization or have no
consistent meaning.  Instead we must develop a model of measurement which
takes five components:

        _________     _________     _________     _________
       |         |   |         |   |         |   |         |
  ---->|    1    |-->|    2    |-->|    3    |-->|    5    |---->
       |_________|   |_________|   |_________|   |_________|
                          |                           ^
                          |          _________        |
                          |         |         |       |
                          +-------->|    4    |-------+
                                    |_________|

  1 - Pre-compile phase; issues: portability, input test case selection,
      test selection, ...
  2 - Pre-test execution phase; issues: input, data generation, clock sync,
      etc.
  3 - Test phase: "the banana"
  4 - Simultaneous non-test condition: outside the scope of the test, very
      important
  5 - Post-test phase; issues: output, clean-up, statistical analysis, etc.

We have a penchant for determinism, and this is the only way we can consider
an adequate test measurement.  It may seem complex at first, but it really
isn't.
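To make this less abstract, here is a rough sketch, in C, of what a driver
organized around the five phases might look like.  The kernel, problem size,
and run count are invented for illustration; phases 1 and 4 can appear only
as comments, since they lie outside what a single program can express.

/*
 * Rough sketch of a benchmark driver organized around the five
 * components above.  Phase 1 (pre-compile: portability, flags, test
 * selection) happens before this program exists; phase 4 (the
 * simultaneous non-test condition: daemons, other users, I/O traffic)
 * is outside the program's control.  All names and sizes are
 * illustrative.
 */
#include <stdio.h>
#include <math.h>
#include <sys/time.h>

#define N     (1 << 18)              /* problem size        */
#define RUNS  10                     /* repetitions to keep */

static double x[N];

static double now(void)              /* wall-clock seconds  */
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

static void pretest_setup(void)      /* phase 2: data generation, warm-up */
{
    int i;
    for (i = 0; i < N; i++)
        x[i] = (double) i;
}

static double test_kernel(void)      /* phase 3: "the banana" under test  */
{
    double s = 0.0;
    int i;
    for (i = 0; i < N; i++)
        s += x[i] * 1.000001;
    return s;
}

int main(void)
{
    double t[RUNS], checksum = 0.0, sum = 0.0, sumsq = 0.0, mean, var, sd;
    int r;

    /* phase 1: compilation, flags, test selection -- already done       */
    pretest_setup();                             /* phase 2              */

    for (r = 0; r < RUNS; r++) {
        double a = now();
        checksum += test_kernel();               /* phase 3              */
        t[r] = now() - a;
        /* phase 4: whatever else the machine was doing meanwhile is NOT */
        /* captured here; record system load separately.                 */
    }

    for (r = 0; r < RUNS; r++) {                 /* phase 5: post-test   */
        sum   += t[r];
        sumsq += t[r] * t[r];
    }
    mean = sum / RUNS;
    var  = sumsq / RUNS - mean * mean;
    sd   = var > 0.0 ? sqrt(var) : 0.0;

    printf("mean %.6f s  std dev %.6f s  over %d runs  (checksum %g)\n",
           mean, sd, RUNS, checksum);
    return 0;
}

The point is not the arithmetic at the end, but that setup (2) and analysis
(5) become explicit, separable steps rather than afterthoughts, and that
phase 4 is acknowledged even though it cannot be timed from inside the run.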
A couple of other things.  We have to break the benchmarking process down.
My officemate came up with a neat thing to think about; it happened by chance
to resemble Amdahl's "other law."

[Unfortunately my attempts to draw figures in ASCII fail across the net using
this particular terminal emulator.......grrrr....]

George made me think about balancing triangles.  His original one was:

                Algorithm
                   /\
                  /  \
                 /    \
       Architecture <--> Language

This triangle looks pitiful.  So I sought mnemonic devices and ideas to
express what I wanted to do, like the ABC's of benchmarking:

               Applications
                   /\
                  /  \
                 /    \
        Benchmarks <--> Computers

At heart, a benchmark is a surrogate for the application program.  A
benchmarker hopes that a benchmark may lead to greater insight than simply
running an application.

It is useful to consider performance measurement in light of analogies to
other types of performance.  I am inspired by the photography of E. Muybridge
and H. Edgerton; you can use cars (your mileage may vary); Rafael likes audio
systems.  I can think of "joke" titles: Beyond the Livermore Loops, The Next
700 Benchmark Programs, Benchmarking Considered Harmful, The Mismeasure of
Computers 8^) [S. J. Gould has thought about performance].  Analogies have
their limits; they are just models, and we have to develop a model of taking
performance measurements.

The last thing I want to write down is that I want a system to note the
quality of a measurement environment.  Here is my scale: A, B, C, D, E.  C
should be just that: an average software environment on average hardware.
The scale should be sliding.  What is an A measurement environment?  I have
presented a current 'A': a Cray X-MP or a Y-MP.  What makes an 'A': a
cycle-time clock, specialized non-intrusive monitoring hardware
(instrumentation being one of the sources of invalidity, you recall), good
profiling software, etc.  B environments: Cray-1 or Cray-2, Convex C-1,
computers with cycle-time clocks and good software.  C environments: VAXen,
IBM mainframes with 60 Hz clocks, PCs, etc.  D environments: computers which
don't have real clocks.  And E environments: for testing of components like
CPUs separated from other system parts, with incomplete resources.  I see a
new book on the DEC M31 is coming out; I think this sounds like an 'A'
machine.

If I am beginning to sound punchy now, it is because I have been trying to
run benchmark programs for hours.  See the post time.  Really stressing the
humans as well as the machine.  Time for some vacation; hope the
discontinuities of programming have not broken the stream of writing too
badly.

When I get back, I intend to use the 5-part model to illustrate some of the
sub-issues involved in measuring different computers: hardware and software.
What is clear is that we must also measure the application (there is no other
way to be "representative").  I wish to concentrate first, however, on just
trying to separate differences "within" a given machine.  I will also try to
present some simple experiments which calibrate measurements on a given
machine.  I hope I open a few eyes.

For further reading I would also like to note a paper by Noga and Gunther
about measurement discontinuities, and a paper I wrote (poorly, sorry) for
the 1988 Usenix Supercomputing Workshop.

--e.n. miya, NASA Ames Research Center, eugene@eos.arc.nasa.gov
  {uunet,mailrus,most gateways}!ames!eugene
  AMERICA: CHANGE IT OR LOSE IT.
buck@drax.gsfc.nasa.gov (Loren (Buck) Buchanan) (12/05/90)
Thanks, eugene, for another eye-opening discussion of a current problem.

A difficult to implement, but useful, benchmark would be one which would show
the speed of processing at various levels of memory (cache, main, and
virtual).  Note that for some computers there may be more than one level of
each kind of memory.  The result of such a benchmark would not be a single
number, but a set of numbers which, when plotted, would look something like
this:

  t                          ____.....--------'''''''''''''''
  i                         /
  m                        /
  e                       /
            ____.....----'
           /
          /
      __...----'
                          problem size

The mostly horizontal portions are where the problem is mostly contained
within the limits of that level of memory, and the mostly vertical portions
are where thrashing starts to occur.

A second benchmark would be one that would involve multiprocessing and
interprocess communication, along with the change in problem size, to measure
the effect of the various levels of memory.  Another independent dimension to
the above is the division between data and instruction caching.  It would of
course be easiest to write a benchmark that would be able to cache the entire
program and let the data be thrashed, but ... well, you get the idea.

B Cing U

Buck

Loren Buchanan     | buck@drax.gsfc.nasa.gov   | #include <std_disclaimer.h>
CSC, 1100 West St. | ...!ames!dftsrv!drax!buck | typedef int by
Laurel, MD 20707   | (301) 497-2531            | void where_prohibited(by law){}
Phone tag, America's fastest growing business sport.
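A rough sketch, in C, of the single-processor version of the memory-hierarchy
sweep described above: repeatedly scan arrays of increasing size and report
time per access.  The array sizes, access pattern, and repeat counts are
arbitrary illustrative choices; plotting seconds-per-access against array
size should produce the staircase in the figure, with a step at each level of
memory the machine actually has.

/*
 * Sweep array sizes from a few KB up to many MB of doubles, keeping
 * the total number of accesses roughly constant at each size, and
 * print time per access.  Plateaus in the output correspond to
 * levels of the memory hierarchy (cache, main, virtual).
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static double now(void)              /* wall-clock seconds */
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    long size, i, r;
    double sum = 0.0;

    for (size = 1L << 10; size <= 1L << 24; size <<= 1) {
        double *a = malloc(size * sizeof(double));
        long repeats = (1L << 25) / size;    /* keep total work constant */
        double t;

        if (a == NULL)                       /* out of memory: stop sweep */
            break;

        for (i = 0; i < size; i++)           /* initialize (and fault in) */
            a[i] = (double) i;

        t = now();
        for (r = 0; r < repeats; r++)
            for (i = 0; i < size; i++)
                sum += a[i];                 /* the traffic being measured */
        t = now() - t;

        printf("%10ld doubles  %.3e s/access\n", size, t / (repeats * size));
        free(a);
    }
    printf("(checksum %g)\n", sum);          /* defeat dead-code removal */
    return 0;
}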