davet@oakhill.UUCP (04/15/87)
In article <16151@amdcad.AMD.COM> tim@amdcad.UUCP (Tim Olson) writes:
>In article <219@homxb.UUCP> gemini@homxb.UUCP (Rick Richardson) writes:
>> ... Meanwhile, any advice on modifying
>>the Dhrystone for version 1.2 such that a global optimizer won't be
>>able to remove anything will be appreciated.
>
>One easy fix to Dhrystone is to package it as two (or more) separate c
>files which must be compiled separately, then linked together.  This
>will prevent global optimizers from looking at the entire program at one
>whack .....

Note that there is at least one company, Philon, Inc., which claims to
have an optimization pass which runs on the final linked object code and
would defeat the above tactics.  I myself may be working on such an
animal in a few months.  However, I still think it's vital to get the
above changes put into a new version of Dhrystone, because current
results are becoming less meaningful.

My info on Philon comes from the Uniforum 1986 Conference Proceedings,
pages 125 to 138.  I have no other knowledge of that company, its
products, or the efficacy of their compiler system.

--
David Trissel
Motorola Semiconductor Inc., Austin, Texas
{siesmo,ihnp4}!ut-sally!im4u!oakhill!davet
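
A minimal sketch of the two-file idea above (the file names and the
step() function are invented for illustration, not taken from any
Dhrystone release):

    /* main.c -- compiled separately from work.c:
     *     cc -O -c main.c
     *     cc -O -c work.c
     *     cc -o bench main.o work.o
     * Because the compiler never sees both files at once, it cannot
     * prove that step() is side-effect free and delete the loop. */
    #include <stdio.h>

    extern int step(int x);        /* defined in work.c; opaque here */

    int main()
    {
        int i, sum = 0;
        for (i = 0; i < 100000; i++)
            sum = step(sum);
        printf("checksum = %d\n", sum);   /* result is used, so the
                                             loop cannot be thrown away */
        return 0;
    }

    /* work.c */
    int step(int x)
    {
        return x * 3 + 1;
    }

As Dave notes, an optimizer that runs on the linked object code can still
see through this; the split only defeats optimizers that stop at the
compilation unit.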
sjc@mips.UUCP (04/16/87)
In article <864@oakhill.UUCP>, davet@oakhill.UUCP (Dave Trissel) writes:
> In article <16151@amdcad.AMD.COM> tim@amdcad.UUCP (Tim Olson) writes:
> >In article <219@homxb.UUCP> gemini@homxb.UUCP (Rick Richardson) writes:
> >> ... Meanwhile, any advice on modifying
> >>the Dhrystone for version 1.2 such that a global optimizer won't be
> >>able to remove anything will be appreciated.
> >
> >One easy fix to Dhrystone is to package it as two (or more) separate c
> >files which must be compiled separately, then linked together...
>
> Note that there is at least one company, Philon, Inc. which claims to have
> an optimization pass which runs on the final linked object code which would
> defeat the above tactics.

In any benchmark, if you compute some function (e.g. a hash) of all of
the variables which are set but not otherwise used, and then print the
result, you not only prevent optimizers from eliminating the code which
sets those variables, but you also provide a check on whether the
emitted program is correct.  It's human nature to concentrate on how
fast a benchmark runs and ignore whether it runs correctly, but the
latter is really a prerequisite, particularly when you're working with a
compiler or optimizer which is in a state of flux.

My plea to benchmark designers: measure whatever time you want, but
afterward compare the results against known correct constant values, and
print a go/no-go message telling whether the results were correct (or,
for floating point, whether they were within an acceptable range).

Dhrystone has additional vulnerable spots: for example, the empty loops
used for subtracting out the loop overhead.  But the loops don't have to
be empty; you could add some nontrivial computation (whose result is
later printed, compared against a constant, or used to compute a hash)
to both the empty loops and the real ones, so that the subtraction
removes both the effect of the looping and the effect of the
computation.

--
...decwrl!mips!sjc
Steve Correll
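
A minimal C sketch of the hash-and-check idea (the toy loop and hash
function are invented; the expected constant was computed from the loop
as written):

    #include <stdio.h>

    static long hash = 0;

    void fold(long v)                /* mix one result into the checksum */
    {
        hash = hash * 31 + v;
    }

    int main()
    {
        long i, sum = 0, parity = 0;
        for (i = 1; i <= 1000; i++) {
            sum += i;                /* stands in for real benchmark work */
            parity ^= i;
        }
        fold(sum);                   /* every set variable feeds the hash */
        fold(parity);
        if (hash == 15516500L)       /* known correct constant value */
            printf("ok: hash = %ld\n", hash);
        else
            printf("BAD RUN: hash = %ld\n", hash);
        return 0;
    }

Because the hash is printed and checked, an optimizer cannot delete the
loop, and a miscompiled loop announces itself with the go/no-go message.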
earl@mips.UUCP (Earl Killian) (04/17/87)
>In article <219@homxb.UUCP> gemini@homxb.UUCP (Rick Richardson) writes:
>Meanwhile, any advice on modifying the Dhrystone for version 1.2 such
>that a global optimizer won't be able to remove anything will be
>appreciated.

I'd like to second Steve Correll's comments and go a bit farther.  I
hate to praise whetstone, but if you're looking for ways to improve
dhrystone, then you ought to start by looking at whetstone.  The only
thing dhrystone appears to have borrowed from whetstone so far is the
name, and this is unfortunate.  Whetstone

(1) does output, so that the program isn't simply thrown away;
(2) does input, so that the entire computation cannot be done at
    compile time;
(3) prints results, so that the correctness of the run can be verified;
(4) uses a technique for subtracting out loop overhead that does not
    involve timing an empty loop that is likely to be deleted by a
    compiler;
(5) makes every computation's value used later, so compilers don't just
    delete the computation (often in conjunction with #3 above);
(6) makes all computations in loops depend on variables that change on
    every loop iteration, thus preventing hoisting.

(A sketch of points (2) and (6) appears after this article.)

The only optimization that I know of that begins to trash whetstone is
inlining subroutines (you can also apply angle-addition formulas to save
a call or two to SIN/COS if you really want to tailor your compiler to
the benchmark).  Dhrystone does none of these things.  It represents a
step backward in the art of benchmarking.  If you want to improve
dhrystone, bring it up to the standards of the twenty-year-old
whetstone.

I started off saying I hated to praise whetstone.  The reason is that I
don't think it is particularly representative.  Neither is dhrystone.
Both claim to be synthetic benchmarks created from looking at statistics
of source-language constructs.  This sounds reasonable on the surface,
but for some reason this approach seems to generate very different
low-level statistics than real programs do.  For example, dhrystone
averages 15 instructions per procedure call on a VAX.  The
call-intensive programs I've studied average 50 instructions per call
(30 is the lowest real-program number I saw).  For example, I know of a
pascal compiler that produces 8200 dhrystones on a VAX/8600 when you use
CALLS for the procedure calls and 14000 when you use a faster
procedure-call protocol.  That's 70% more dhrystones!  And yet that
compiler itself only speeds up by 20% when you use the better
procedure-call protocol.  Another example: this forum has already noted
the importance of strcpy to the C version of dhrystone, and questioned
whether it is quite as important in real programs.  Another example:
whetstone is half a test of your floating point math library, not of the
compiler or hardware.  Some real floating point programs (e.g. spice) do
spend a lot of time in SQRT, LOG, EXP, etc., but not as much as
whetstone.  And other real programs spend almost no time there.  On the
other hand, perhaps it is a feature that these benchmarks make people
produce good libraries. :-)

Another problem with these benchmarks is that they're small.  They fit
trivially into caches, for example.  For real programs, caches serve not
to make your program fit, but to address the impedance mismatch between
main memory and the cpu.  You still need memory bandwidth proportional
to your processing power (although you wouldn't know it from these small
benchmarks), but raw memory bandwidth isn't well matched to the appetite
of a processor.  Thus caches.
Writing a big benchmark is not easy, of course, nor is porting it.  On
the other hand, ask yourself why we're trying to build faster and faster
computers.  The reason is so that we can run bigger and bigger programs.
Small benchmarks are becoming more and more misleading.  And one
benchmark won't characterize all your applications either.  At MIPS, we
use the following programs for benchmarking and architectural study:

    ccom       -- C front-end
    uopt       -- global optimizer
    as1        -- assembler second pass
    nroff      -- text formatter
    TeX        -- another formatter with very different statistics
    compress   -- data compression
    espresso   -- PLA reduction
    hspice/spice2g6/spice3a7 -- circuit simulation
    timberwolf -- routing
    doduc      -- monte carlo nuclear reactor simulation
    linpack    -- matrix reduction

(There are more that we want to study, but don't have yet.)  All of
these have very different statistics.  None fit well in 8Kb caches.
Some fit in 64Kb caches; others still do not.  Some use floating point
heavily; some don't.  Some want large cache blocks; others want small.
Some need very large virtual-to-physical translation caches; others need
only a few entries.  Linpack can use vector ops; the others probably
cannot.  Etc., etc.
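
A minimal C sketch of points (2) and (6) from the list above (the kernel
is invented; this is illustrative C, not actual whetstone code):

    #include <stdio.h>

    int main()
    {
        long n, i;
        double x = 1.0;

        scanf("%ld", &n);            /* (2) loop count read at run time,
                                        so nothing folds at compile time */
        for (i = 1; i <= n; i++)
            x = (x + i) / (x * i + 1.0);  /* (6) depends on i, so the
                                             body cannot be hoisted */
        printf("x = %.6f\n", x);     /* (1), (3): printed, so the result
                                        is used and can be verified */
        return 0;
    }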
reiter@endor.harvard.edu (Ehud Reiter) (04/21/87)
In article <312@gumby.UUCP> earl@mips.UUCP (Earl Killian) writes:
>At MIPS, we use the following programs for benchmarking and
>architectural study:
>	<list of 10 or so programs>
>All of these have very different statistics.

On the subject of benchmarks, I've noticed that many people who do use
multiple programs for benchmarking tend to average the results together.
Even MIPS, whose Performance Brief is the best of its kind that I've
seen, does this to some degree (e.g. their "UNIX Total", in comparing an
M/500 to a VAX-11/780, averages together relative performance figures
ranging from 4.7 to 6.7 into an average figure of 5.6).

Since averaging can be a statistically dubious process (especially if
the programs being averaged are not weighted by some kind of "frequency
of use" parameter), perhaps it would be better to report the full range
instead of an average (so, in the above example, the M/500 would be
rated as 4.7-6.7, not 5.6).  Reporting the range would also have the
advantage of giving people an idea of the variances involved.

So, if the community does shift to using multiple programs for
benchmarking (which I strongly approve of), we should consider reporting
"relative performance" not as a single average number, but rather as a
range.

					Ehud Reiter
					reiter@harvard	(ARPA,BITNET,UUCP)
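
A sketch of range-versus-average reporting (only the 4.7 and 6.7
endpoints come from the Performance Brief figures cited above; the
intermediate ratios are invented):

    #include <stdio.h>

    int main()
    {
        /* relative performance, M/500 vs. VAX-11/780, one per program */
        double ratio[] = { 4.7, 5.2, 5.5, 5.9, 6.1, 6.7 };
        int i, n = sizeof ratio / sizeof ratio[0];
        double lo = ratio[0], hi = ratio[0], sum = 0.0;

        for (i = 0; i < n; i++) {
            if (ratio[i] < lo) lo = ratio[i];
            if (ratio[i] > hi) hi = ratio[i];
            sum += ratio[i];
        }
        printf("average: %.1f\n", sum / n);       /* hides the spread */
        printf("range:   %.1f-%.1f\n", lo, hi);   /* shows it */
        return 0;
    }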
zrm@eddie.MIT.EDU (Zigurd R. Mednieks) (04/21/87)
Just what is optimization?  Is Green Hills' compiler optimizing when it
ignores "register" storage-class keywords and uses its own intelligent
register allocation?  Are RISC compiler postpass optimizers within the
rules of the Dhrystone game?  While it is pretty easy to say that
something that radically alters the translation of C into machine code
is optimizing, what separates clever (or, in the case of RISC, almost
required) hacks in code generation from optimization?

						-Zigurd
eugene@pioneer.arpa (Eugene Miya N.) (04/21/87)
In article <1706@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes:
>In article <312@gumby.UUCP> earl@mips.UUCP (Earl Killian) writes:
>>At MIPS, we use the following programs for benchmarking and
>>architectural study:
>>	<list of 10 or so programs>
>>All of these have very different statistics.
>
> discussion about statistics.
>
>So, if the community does shift to using multiple programs for benchmarking
>(which I strongly approve of), we should consider reporting "relative
>performance" not as a single average number, but rather as a range.
>
>					Ehud Reiter
>					reiter@harvard	(ARPA,BITNET,UUCP)

Rather than prematurely throw statistical tools at the problem, why not
pick up a book on experiment design and measurement?  Rather than use
ranges or single numbers, consider other ways to convey information.
First, you have to verify things: you can use a has/has-not
characterization.  Graphics is real popular now.  Consider the work done
on Faces to display multidimensional data at Los Alamos Natl. Lab (where
I'm going next week).  Consider other analyses of dynamics: Muybridge's
studies on motion.  Consider that hearts are characterized by a rate,
but an EKG is a better characterization of heart rhythm.  Cardiologists
can get away from describing hearts solely in terms of rate.  [Don't
argue that heart rate is a good overall characterizer; it isn't.]  We
have to get away from our infatuation with scalars; it will also stump
the marketing types.

I like Earl's elaboration on whetstones.  It reminds me of a time I was
skiing two weeks ago: there was no snow at the base of North Star, but
up higher on the mountain was a snow gauge registering 5.5 feet of snow
in a flat, unskied, tree-sheltered area.  So what's the coverage at this
ski resort?

No single figures of merit.  (A tee shirt I am making for our team.)  I
will try to have extras at SIGMETRICS 87 and PERFORMANCE 87.  Paper
coming soon [I'm still late].  Stop some of this B.S.

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  Not Ready for Prime Time Performer
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {hplabs,hao,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene
earl@mips.UUCP (Earl Killian) (04/23/87)
Ehud Reiter and Eugene Miya both have good points.  I recently saw a
paper on the VAX 8700 from DEC, and they presented their benchmark
results in a variety of ways, including something like:

TOP CAVEAT: I just typed in imaginary data points for illustration --
THESE ARE NOT REAL RESULTS.

    4.0 ID
    4.5 IICFFDD
    5.0 IIICFFFDDD
    5.5 IIICCCFFFFDDDD
    6.0 IIIIICCCFFFFFDDD
    6.5 IIIIICCFFDD
    7.0 IIIICFFDD
    7.5 IIICFD
    8.0 I

BOTTOM CAVEAT: I just typed in imaginary data points for illustration --
THESE ARE NOT REAL RESULTS.

They had something like 120 benchmarks, and presented these in the above
form, where "I" represented an integer benchmark, "C" Cobol, "F"
single-precision floating point, and "D" double-precision floating
point, etc.  I liked this presentation.  It really gives you a feel for
how wide the performance variation can be, even within members of the
same family.  Perhaps someone else can post a reference to this paper.
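
A sketch of how such a chart might be produced (the data points are
invented, like Earl's):

    #include <stdio.h>

    struct bench { double ratio; char tag; };  /* tag: I, C, F, or D */

    int main()
    {
        static struct bench b[] = {            /* one entry per benchmark */
            {4.1,'I'}, {4.6,'C'}, {4.7,'F'}, {5.0,'I'}, {5.2,'D'},
            {5.4,'I'}, {5.6,'F'}, {6.0,'C'}, {6.2,'D'}, {7.4,'I'},
        };
        const char *tags = "ICFD";             /* print order within a row */
        int n = sizeof b / sizeof b[0];
        double bucket;
        int i, t;

        /* one row per half-unit bucket of the speed ratio */
        for (bucket = 4.0; bucket <= 8.0; bucket += 0.5) {
            printf("%4.1f ", bucket);
            for (t = 0; tags[t] != '\0'; t++)
                for (i = 0; i < n; i++)
                    if (b[i].tag == tags[t]
                        && b[i].ratio >= bucket - 0.25
                        && b[i].ratio <  bucket + 0.25)
                        putchar(b[i].tag);
            putchar('\n');
        }
        return 0;
    }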
earl@mips.UUCP (Earl Killian) (04/23/87)
David Hough pointed out to me that most of the nice benchmarking
features I attributed to whetstone are not generic whetstone.  The
original version was written in Algol and lacked the features I
described.  The one I found lying around here runs the whole program
twice, once with the loop multiplier set to an input parameter, and once
with that parameter plus ten.  The difference of the times for these two
runs is a measure of the time for 10 loops, with various overhead
canceled (for example, the timing calls and result printout outside of
the loops).  I just sent away and received a version from Argonne's
netlib service, and found that it does not do this.  There's probably a
lot of different whetstones out there.

I have heard that NBS is setting up a benchmark service.  Perhaps
they'll straighten out some of this.  Anyone know any more about it?
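
A sketch of the two-run cancellation described above (the kernel is
invented, and the timing uses the draft-ANSI clock() interface rather
than whatever the original used):

    #include <stdio.h>
    #include <time.h>

    double run(long mult)          /* run the kernel mult times; seconds */
    {
        clock_t t0 = clock();
        long i, j;
        volatile double x = 1.0;   /* volatile keeps the work honest */

        for (i = 0; i < mult; i++)
            for (j = 0; j < 100000; j++)
                x = x * 1.000001 + 0.000001;
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main()
    {
        long n = 50;               /* whetstone reads this as input */
        double t1 = run(n), t2 = run(n + 10);

        /* startup, timing calls, and printout cancel in the difference */
        printf("time for 10 multiplier units: %.3f sec\n", t2 - t1);
        return 0;
    }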
eugene@pioneer.arpa (Eugene Miya N.) (04/23/87)
Yes, it's called NBSLIB.  It's being tested.  I won't post the address,
but I'll send it to Earl.  Just like netlib.  The problem is some of the
programs are out of date (like the Livermore loops).  Frank McMahon
(author of LOOPS, aka the MFLOPS program [the original use of MFLOPS,
BTW, in the 1950s]) has to update NBS's 14-loop kernel to the more
modern 24-loop kernel [significantly better, in my opinion].

Faces: Herman Chernoff, "The Use of Faces to Represent Points in
k-Dimensional Space Graphically," Journal of the American Statistical
Association, 68, June 1973.

May I suggest the use of a Cray icon (Cray-1), because it has more
dimensionality than most boxes out there.  The Cray-2 does not have
enough (it lost the seats).  This is why we need a graphical (raster)
data standard to transfer pictures of data better.  The bar graph was
not as good as I would have hoped.  Check out Tufte's book on data.
(Then watch one of those Apple Mac ads and see what Apple does wrong.)
Bye.

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {hplabs,hao,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene