eugene@pioneer.arpa (Eugene N. Miya) (05/14/88)
In summary from my paper, on what makes good benchmarks: you require reproducibility and comprehensibility, and you would like simplicity. In fact, benchmarks are too simple. Machines are becoming more diverse: multiprocessors of different architectures, smart software, etc. The simple linear model of measurement (take start time, do work, take stop time) is susceptible to these influences. What you want are strict controls on the pre-test condition, the test condition, and the post-test condition. The during-test condition is just as important.

One of the biggest challenges to performance measurement is parallelism: the mythical-MIPS problem, confusing work and effort as Brooks would say. So linearity is something we have to fight. Our problem is oversimplicity. Test data cases must be carefully selected. Do you execute a significant (not necessarily the largest) portion of the code? Data can be as important as the program itself. Don't even think about asking about interactive characterization; it's mostly a joke. (Pardons to those reading who did their PhD theses on characterizing interactive systems.) We have to know when hardware/software is being beneficial or detrimental.

We must throw off the idea that a program is an experiment [Feigenbaum?]; it is not. A single program measurement lacks the experimental controls necessary for good measurements. We have synthetic as well as real-program benchmarks. The former are usually derided as unrealistic, but the problem is that our concept of the execution of a program is too simplistic. We talk of memory-CPU benchmarks when most of the time is spent doing I/O, and we have few I/O benchmarks. There's no such thing as a standard application. What we need to do is run programs over applications, then take the resultant data to form a performance prediction (notice we are never told expected Linpacks, only measured Linpacks). My survey covers the shortest and longest synthetic programs I know: some syntactic analysis, some dynamic, etc.

The problem is that we treat machines and benchmarks like black boxes. We expect near-worthless single figures of merit to select machines (I wonder what types of cars these people buy). I'm interested in starting a new "study"; call it computer cardiology. We must systematize the measurement process. I'm talking to one software house in particular about a benchmark test program generator and am working on prototypes now, in my spare time.

The ideal measurement tool must have a high degree of portability. It must be reasonably simple, and the analysis portion must be separable from the measurement portion. Unfortunately, most machines make poor measurement environments: IBM 370s, VAXen, Macs, PCs. Quantity does not make something good; my standalone time on our X-MP has been curtailed because our users also need the machine. Cray-2s don't have HPMs. The ideal tool should allow one to vary parameters carefully, one at a time.

Linpack, while a nice-looking, simple single figure of merit, has diminishing parallelism (since it's a direct solution). Any 32-bit result should be viewed with suspicion (it is a 64-bit test). Its value is that Jack Dongarra dares to name names and does not fear getting sued for holding damning numbers; nor does Rick Richardson with Dhrystone, for that matter. We want the computer equivalent of the pocket tape measure. Computers really don't differ in fundamental construction all that much currently (well, Multiflow, CM[12], DAP, etc.). These represent new challenges for benchmarking.
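To make that linear model concrete, here is a minimal sketch in C (not from the paper; the kernel and the iteration count are made up for illustration) of take-start-time, do-work, take-stop-time, with comments marking the conditions the model leaves uncontrolled:

/* A naive linear measurement: start timer, do work, stop timer.
 * Everything that leaks into the timed region (cache and memory
 * state, paging, daemons, other users) is silently charged to
 * "elapsed", which is exactly the problem.
 */
#include <stdio.h>
#include <time.h>

static double work(long n)
{
    double s = 0.0;
    long i;
    for (i = 1; i <= n; i++)        /* stand-in for the real workload */
        s += 1.0 / (double)i;
    return s;
}

int main(void)
{
    clock_t start, stop;
    double result, elapsed;

    /* pre-test condition: uncontrolled */
    start = clock();
    result = work(1000000L);        /* during-test condition: uncontrolled */
    stop = clock();
    /* post-test condition: not even examined */

    elapsed = (double)(stop - start) / CLOCKS_PER_SEC;
    printf("result %g, %f CPU seconds\n", result, elapsed);
    return 0;
}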
No, the paper does not read like this; this is being typed "stream of consciousness" during the "heat of passion." There is a mailing list devoted to performance measurement (@cs.wisc.edu), but they are mostly queueing theorists, not benchmarkers. Largely quiet; after all, SIGMETRICS '88 is, what, next week?

P.S. Don't ask me for a copy yet; I will announce availability. I've promised far too many people and get side-tracked too often. I have a shorter paper, which is undergoing review, on a tiny aspect of the bigger picture, but I have to send a copy of the bigger one to John, Chuck, and lots of others. Don't worry.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,hao,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups. If enough, I'll summarize."
sah@mips.COM (Steve Hanson) (05/15/88)
In article <8734@ames.arc.nasa.gov> eugene@pioneer.UUCP (Eugene N. Miya) writes:
>In summary from my paper, on what makes good benchmarks: you require
>reproducibility and comprehensibility, and you would like simplicity.
>[other stuff]

Good benchmarks also validate their results against expected results (a tiny sketch of what I mean follows at the end of this post). If we don't know whether a benchmark ran successfully, how can we draw conclusions?
--
UUCP: {ames,decwrl,prls,pyramid}!mips!sah
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
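The sketch mentioned above: a made-up kernel with a hand-computed expected value (the constant, names, and sizes are illustrative only, not any particular benchmark).

/* Time the kernel, then check the answer against a known value
 * before believing the time.
 */
#include <stdio.h>
#include <stdlib.h>

#define N        10000L
#define EXPECTED 50005000L      /* sum of 1..N, computed by hand */

int main(void)
{
    long i, sum = 0;

    for (i = 1; i <= N; i++)    /* the "benchmark" */
        sum += i;

    if (sum != EXPECTED) {      /* validation step */
        fprintf(stderr, "benchmark FAILED: got %ld, expected %ld\n",
                sum, EXPECTED);
        exit(1);                /* never report a time for a wrong answer */
    }
    printf("benchmark ok, sum = %ld\n", sum);
    return 0;
}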
colwell@mfci.UUCP (Robert Colwell) (05/15/88)
In article <8734@ames.arc.nasa.gov> eugene@pioneer.UUCP (Eugene N. Miya) writes:
>In summary from my paper, on what makes good benchmarks: you require
>reproducibility and comprehensibility, and you would like simplicity.
>In fact, benchmarks are too simple. Machines are becoming more diverse:
>multiprocessors of different architectures, smart software, etc.

I think there's one other side to the benchmarking issue. It's currently fashionable to dump on the simple benchmarks (Dhrystone, puzzle, et al.), and I've done my share. But the really simple benchmarks are not without value. I think one should always start with them, and if they run well, it's OK to smile for a few minutes. If they don't, your machine/architecture/toaster may require some adjustments, and the simpler benchmarks are much less ambiguous in terms of causes and effects than larger ones (fewer cache effects, clearer compiler/architecture interactions, controllable I/O and disk accesses...).

If what we're looking for is a set of benchmarks that would be useful in comparing diverse machines and architectures, then I second John Mashey's call for better integer codes. I just hope we don't swing the pendulum all the way from "compare machines on trivial benchmarks and make unwarranted conclusions" (which was all the rage a while back) to "small benchmarks are a joke and anyone who uses them is too."

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT  06405    203-488-6090
aglew@urbsdc.Urbana.Gould.COM (05/16/88)
>[Eugene Miya]:
>There is a mailing list devoted to performance measurement (@cs.wisc.edu),
>but they are mostly queueing theorists, not benchmarkers. Largely quiet;
>after all, SIGMETRICS '88 is, what, next week?

Thanks, Eugene... there are a few SIGMETRICS members who realize that it doesn't do you any good to simulate things with queueing theory or Petri net models until you can make measurements to calibrate the models. Trouble is, too many people think of measurement as trivial, a solved problem.

And by the way: benchmarking is *NOT* the be-all and end-all of measurement. Benchmarks must be calibrated, just as models must be. Eugene knows this, but many people act as if they don't ("Why do you want to measure my system? Can't you just run benchmarks?").

Computer system performance evaluation should start off with measurement, with real customers, on real systems, to determine (1) what is important to your customers, and (2) what they actually do with the system. (2) can influence (1), as in "Look, you say that floating-point speed is the most important thing to you, but you spend 90% of your time doing integer work." But not always: "We do integer work to fill in any slack time. A real application has 90% slack time, but the other 10% is on time-critical paths." "Oh."

Once you have measurements, they may be abstracted into benchmarks. Benchmarks can be used to drive simulations, or to evaluate a new system. Benchmarks can be used to make far more complicated "pseudo-measurements", because you can warp the time scale to use time-costly instrumentation. But just as simulations without realistic workloads are useless, so are benchmarks without an underlying rationale based on measurement. (Actually, change "useless" to "not so useful"; sometimes they are better than nothing.)

Trouble is, most texts and papers on performance evaluation assume that you already have the measurements: given a spectrum of points in your measurement space, here is how you produce a reasonably good workload sample. That's computer science. Me, I want to be a "computer naturalist": I want to develop measurement techniques that can be applied easily, cheaply, to a variety of systems and workloads, over extended periods of time.

See you in Santa Fe...
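A minimal sketch of the measurement-first idea, assuming a BSD-style Unix with fork/execvp/wait and getrusage() (the program is illustrative, not an existing tool): run the customer's real command and report where its time actually went, before anyone writes a benchmark to mimic it.

/* runstats: run a real workload and summarize its resource usage. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    struct rusage ru;
    int status;
    pid_t pid;

    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }

    pid = fork();
    if (pid == 0) {                     /* child: run the real workload */
        execvp(argv[1], &argv[1]);
        perror("execvp");
        _exit(127);
    }
    wait(&status);
    getrusage(RUSAGE_CHILDREN, &ru);    /* what did it actually do? */

    printf("user    %ld.%06ld s\n",
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
    printf("system  %ld.%06ld s\n",
           (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    printf("major faults %ld, blocks in %ld, blocks out %ld\n",
           ru.ru_majflt, ru.ru_inblock, ru.ru_oublock);
    return 0;
}

Something like "runstats cc -c bigfile.c" gives a first, crude answer to whether a job is CPU-bound, system-call-bound, or I/O-bound before any benchmark is chosen to stand in for it.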
davidsen@steinmetz.ge.com (William E. Davidsen Jr) (05/16/88)
There are (at least) two ways to do benchmarks: by running complex programs which are similar to your desired use, or by measuring the hardware performance in a number of areas and trying to match the profile to the resources needed to run your program.

I am a believer in the independent testing of facets of the machine's capabilities, since that allows me to run multiple tests concurrently to determine interaction. I.e., if a machine has good disk access for a single process, how does it look under load? What does the use of floating point have to do with disk access (hopefully nothing)? I have seen machines which had lousy disk performance for a single process and very little deterioration under load, while machines with very fast disks sank like a rock when loaded. I have seen machines which had poor throughput for CPU-bound jobs when disk-bound processes were running. I wouldn't have seen this as clearly if I hadn't had an idea of what the machine did in each area individually. (A rough sketch of this kind of interaction test follows at the end of this post.)

-- 
bill davidsen (wedu@ge-crd.arpa)
{uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me
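The sketch referred to above, assuming a Unix-style fork/wait environment; the CPU kernel, the loader file names under /tmp, and the sizes are arbitrary placeholders. It times a CPU-bound loop by itself, then times it again while several forked children write to disk, so the interaction shows up as the difference between the two wall-clock numbers.

/* Facets plus interaction: CPU kernel alone, then under disk load. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

#define NLOADERS 4

static volatile double sink;            /* keeps the compiler honest */

static double cpu_kernel(void)
{
    double s = 0.0;
    long i;
    for (i = 1; i < 20000000L; i++)     /* arbitrary CPU-bound work */
        s += (double)i / (double)(i + 1);
    return s;
}

static void disk_loader(void)
{
    char name[64], buf[8192];
    FILE *f;
    int i;

    sprintf(name, "/tmp/loadfile.%d", (int)getpid());
    f = fopen(name, "w");
    if (f == NULL)
        _exit(1);
    memset(buf, 'x', sizeof buf);
    for (i = 0; i < 2000; i++)          /* roughly 16 MB of writes */
        fwrite(buf, 1, sizeof buf, f);
    fclose(f);
    remove(name);                       /* clean up */
    _exit(0);
}

static double timed_run(void)
{
    time_t t0 = time(NULL);
    sink = cpu_kernel();
    return difftime(time(NULL), t0);    /* coarse, but wall clock is what contention hurts */
}

int main(void)
{
    int i;

    printf("unloaded: %.0f s\n", timed_run());

    for (i = 0; i < NLOADERS; i++)      /* start the competing disk load */
        if (fork() == 0)
            disk_loader();

    printf("loaded:   %.0f s\n", timed_run());

    while (wait(NULL) > 0)              /* reap the loaders */
        ;
    return 0;
}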
dav@abvax.icd.ab.com (Dave A. Vasko) (11/15/88)
Does anyone know where I can get source code for the Whetstone, Stanford integer, and Stanford floating-point benchmarks? I assume Stanford University has the source for the benchmark that bears their name, but I'm not sure who to contact. Any help would be appreciated.
-- 
****************************************************************************
* Dave Vasko      Allen-Bradley      Highland Hts., Ohio 44139             *
* (216) 646-4695  ...!{decvax,pyramid,cwjcc,masscomp}!abvax!dav            *
****************************************************************************
jonasn@ttds.UUCP (Jonas Nygren) (10/01/89)
Up until approximately six months ago there were lots of performance discussions in comp.arch, lots of CISC-versus-RISC reporting. But suddenly all this stopped, at least on this side of the Atlantic, and I am curious whether this has something to do with SPEC. Did all participants in the SPEC group sign a vow of silence, or why haven't we seen any reports yet? Anybody know the answer?

/jonas
thomas@uplog.se (Thomas Tornblom) (02/22/90)
I would like to get pointers to various benchmarks for Unix machines. What I'm particularly interested in is filesystem throughput, context switches, and the like.

Thanks,
Thomas
-- 
Real life: Thomas Tornblom          Email: thomas@uplog.se
Snail mail: TeleLOGIC Uppsala AB    Phone: +46 18 189406
            Box 1218                Fax:   +46 18 132039
            S - 751 42 Uppsala, Sweden