borasky@ogicse.ogi.edu (M. Edward Borasky) (01/03/91)
In article <1685@marlin.NOSC.MIL> aburto@marlin.nosc.mil.UUCP (Alfred A. Aburto) writes:
>In article <44353@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>(Note, for example, that published Dhrystone results easily mis-predict
>>SPEC integer benchmarks pretty badly, i.e., it is quite easy for machine
>>"a" to be 25% faster on Dhrystone than "b", and end up 25% SLOWER on more
>>realistic integer benchmarks.)
>This is an interesting observation (result).
>Dhrystone was intended to be REPRESENTATIVE of TYPICAL integer
>programs. That is, hundreds (I believe) of programs were
>analyzed to come up with the (ahem) 'typical' high level
>language instructions and their frequency of usage. In view of this
>I would, at first sight, suspect the Dhrystone to be more accurate
>than SPEC, as SPEC is based upon only a few integer programs.

I suspect there are two factors at work here. First, Dhrystone is a fairly
small benchmark, and would not exercise the memory hierarchy as hard as the
real programs in SPEC. The second factor is that it is easier to tune a
compiler to a small benchmark like Dhrystone (or, for that matter, Whetstone
and the Livermore Loops) than it is to tune for a variety of different real
programs. By the way, I believe Dhrystone was originally written in ADA and
was translated to "C", the form in which it is usually run.

>Real programs also show a great variation in performance. I noticed
>this recently in a Scientific American article (Jan 1991) which
>showed the comparison of 13 different real programs on a wide
>variety of supercomputers.

This is the PERFECT CLUB benchmark suite. I just picked up the article; when
I glanced at it in the supermarket checkout line (some people look at Oprah's
diets, some don't) I suspected that there might be a howler or two lurking
within the article. If I see anything that looks suspicious I will post
something to comp.sys.super or here or both.

>The program 'megflop' variation in
>performance was truly tremendous, especially for the fastest systems
>(Cray and a NEC computer I think).

This variation is a fact of life in supercomputing. If you take any suite of
benchmarks that is fairly comprehensive, you will see much more variation in
performance for a supercomputer than you will for a scalar machine. This is
true both for absolute Megaflops and for speed relative to a fixed SCALAR
machine.
mash@mips.COM (John Mashey) (01/03/91)
In article <15546@ogicse.ogi.edu> borasky@ogicse.ogi.edu (M. Edward Borasky) writes:
>In article <1685@marlin.NOSC.MIL> aburto@marlin.nosc.mil.UUCP (Alfred A. Aburto) writes:
>>In article <44353@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>>(Note, for example, that published Dhrystone results easily mis-predict
>>>SPEC integer benchmarks pretty badly, i.e., it is quite easy for machine
>>>"a" to be 25% faster on Dhrystone than "b", and end up 25% SLOWER on more
>>>realistic integer benchmarks.)
>>This is an interesting observation (result).
>>Dhrystone was intended to be REPRESENTATIVE of TYPICAL integer
>>programs. That is, hundreds (I believe) of programs were
>>analyzed to come up with the (ahem) 'typical' high level
>>language instructions and their frequency of usage. In view of this
>>I would, at first sight, suspect the Dhrystone to be more accurate
>>than SPEC as SPEC is based upon only a few integer programs.
>I suspect there are two factors at work here. First, Dhrystone is a
>fairly small benchmark, and would not exercise the memory hierarchy
>as hard as the real programs in SPEC. The second factor is that it is
>easier to tune a compiler to a small benchmark like Dhrystone (or for
>that matter Whetstone and the Livermore Loops) than it is to tune for
>a variety of different real programs. By the way, I believe Dhrystone
>was originally written in ADA and was translated to "C", the form in
>which it is usually run.
>>Real programs also show a great variation in performance. I noticed

Well, there are multiple issues, of which several have been mentioned
already here. Let us review the history of Dhrystone, which was originally,
as stated, a reasonable attempt to model ADA usage, and then got converted
into C.
ISSUE 1: building a small synthetic benchmark that is TYPICAL of intended
usage is:
1a: extremely difficult to do, even for a current set of hardware and
    software
1b: REALLY hard to do and expect to remain valid over time, even in the
    absence of 1c
1c: really hard to do, if small enough to be subject to compiler gimmickry

Let us take each of these in turn.

1a: is hard, because the usual (and reasonable) methodology is:
1: select attributes that should be measured
2: gather statistics from a set of programs
3: build the benchmark to model those attributes
The problem is: you may not choose the "right" attributes, and in fact,
there is no small set of right attributes; there are only better and better
approximations, even to modeling a single user program (not even a large
mix). For example, suppose your first approximation is: count the number of
+, -, *, and / executed.
Not a very good approximation, so add: number of if statements.
Still not good, so add: distribution of sizes of expressions.
Still not good, so add: number of function calls.
Still not good, so add: distribution of number of arguments of function
calls (makes a bigger difference amongst machines that pass some arguments
in registers).
Still not good: some architectures (like SPARC) can be sensitive to the
depth of function calls, so do something about that.
Still not good: haven't done anything about array indexing, and different
architectures react differently.
Still not good: haven't done anything about pointers, so add some pointer
references.
Still not good: you haven't measured the frequency of different-sized
offsets from pointers (and surprise! in some architectures there is no
difference between a zero offset and a non-zero offset; in others (such as
AMD29K), a zero offset is cheaper than non-zero; in others, presence of
particular addressing modes helps some combinations much more than others).
Still not good: how often is the same pointer->object referenced close
enough in the code, and under such conditions, that the compiler can just
leave it in a register?
Still not good: is the distribution of variable references such that the
benchmark will model the effects of a good register allocator, or not?
..... and on, and on....
I.e., it is VERY easy to do a competent job of feature extraction and
modeling, and still get surprised, where "surprise" = the synthetic
benchmark doesn't correlate well with realistic code of the class that it
was supposed to model. (I've looked at many synthetic benchmarks with our
tools; the numbers quite often don't look anything like what you see when
you analyze real programs.)

1b: hard to do over time:
If asked to compare machines that basically differ only by the clock rate
(same CPU, same compilers), a small benchmark is adequate. However,
hardware tends to get more complex over time; in particular, faster
machines use caches, caches get bigger, multi-level caches appear, etc.
Programs expand to use these; if a benchmark doesn't also expand
appropriately, it starts to measure only the smallest part of the memory
hierarchy. In addition, optimizing compilers get better, and they optimize
away pieces of the code, especially in a small synthetic benchmark.

1c: Compiler gimmickry
For any important benchmark that is small, compilers will get tuned in ways
that are absolutely useless in real life. This has happened at least with
Whetstone, Dhrystone, and LINPACK.

ISSUE 2: Dhrystone in particular
The MIPS Performance Brief, Issue 3.9 (and earlier) has had analyses of
Dhrystone issues, for years. Here is a brief summary:
1. Small: will fit in tiny instruction and data caches.
2. References and re-references data in ways that model the effects of
   write-back & write-thru caches poorly.
3. Subroutine calls are of shallow depth, hence it never
   underflows/overflows on a register window/stack cache machine.
4. Makes function calls more frequently than any real program I've ever
   seen, i.e., on a MIPS, uses <40 cycles per call, whereas 60-100 is much
   more typical of C programs.
5. Can easily spend 30% of its time in strcpy, unlike any real program
   I've ever analyzed. Due to the particular use (copy a 30-byte constant,
   over and over again), it is especially amenable to gimmickry, such as
   compiler options which generate incorrect code for real use, but happen
   to work for Dhrystone. (Note that some of this is an artifact of
   translation from ADA/Pascal (fixed-length strings) to C.) Most amusing
   code: i860, where the 30-byte constant is expanded to 32, and is then
   copied with 2 16-byte loads, followed by 2 16-byte stores -- not very
   typical of real C-language string processing, where most pointers are
   to variables whose sizes are unknown at compile time....
6. There is an unusually high frequency of zero-offset pointers.
7. In the earlier versions, there was obvious dead code, which started to
   disappear under the pressure of better optimizers (not gimmickry, just
   better compilers).
8. Also, the earlier version never worried about compilers that can merge
   the whole program together and inline EVERYTHING...

So, Reinhold W. started with something that was actually a reasonable
attempt, but it is HARD TO DO, and even HARDER to keep sensible...
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (01/03/91)
>>>>> On 3 Jan 91 06:18:59 GMT, mash@mips.COM (John Mashey) said:
mash> 1C: Compiler gimmickry
mash> For any important benchmark that is small, compilers will get tuned
mash> in ways that are absolutely useless in real life. This has happened
mash> at least with Whetstone, Dhrystone, and LINPACK.
So what optimizations have been performed on the LINPACK 100x100 code
that are "absolutely useless" in real life?
I am honestly curious, since it seems to me that the vast majority of
the time is spent in a pair of loops which repeatedly call SAXPY.
SAXPY, in turn, is a short, vectorizable function with a number of
parameter checks before the calculation.
The only optimizations that I can think of that would be useful are
inlining and constant propagation, and they are both generally
helpful. Inlining can be useful in short vector codes, especially if
combined with compile-time constant propagation, since that would
allow most of the checks to be deleted. The checks are (potentially)
useful in the BLAS library (though they are redundant when LINPACK
calls the BLAS, since the parameters are already checked at least once
before the BLAS calls). Nonetheless, optimizing them away by hand
removes the check that should be there when the user calls the BLAS
routines directly, as I sometimes do....
--
John D. McCalpin mccalpin@perelandra.cms.udel.edu
Assistant Professor mccalpin@brahms.udel.edu
College of Marine Studies, U. Del. J.MCCALPIN/OMNET
borasky@ogicse.ogi.edu (M. Edward Borasky) (01/03/91)
In article <MCCALPIN.91Jan3100122@pereland.cms.udel.edu> mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
>>>>>> On 3 Jan 91 06:18:59 GMT, mash@mips.COM (John Mashey) said:
>
>mash> 1C: Compiler gimmickry
>mash> For any important benchmark that is small, compilers will get tuned
>mash> in ways that are absolutely useless in real life. This has happened
>mash> at least with Whetstone, Dhrystone, and LINPACK.
>
>So what optimizations have been performed on the LINPACK 100x100 code
>that are "absolutely useless" in real life?

1. There was a compiler once that actually CHEATED on the LINPACK
benchmark. What they did was, any time they encountered a routine called
"SGEFA" or "DGEFA", they checked the number of parameters and their type,
and if they agreed with the standard usage in the LINPACK benchmark,
branched off to a machine-coded SGEFA or DGEFA, ILLEGALLY UNDER THE FORTRAN
RULES IGNORING THE SGEFA/DGEFA CODE THAT IS PRESENT IN THE LINPACK
BENCHMARK. There were also pre-processors that stripped out SGEFA/DGEFA or
SAXPY/DAXPY, so that they would be linked in from libraries rather than
compiled. This is a fine optimization technique IF CONTROLLED BY THE USER
-- that was the whole REASON LINPACK was written using two levels of
libraries. But the compiler CANNOT know, without looking at the submitted
code, whether these cheats will produce correct results.

2. Another cheat was once pulled on the 1000-equation LINPACK benchmark.
What this vendor did was to use a Gauss-Jordan reduction, where all the
vectors are of length 1000, rather than the decreasing-length algorithm
normally used. This in itself is legal; you can use any algorithm you want
to. Unfortunately, the G-J reduction uses twice as many operations as the
standard one, so when they computed their MEGAFLOPS, they divided how many
FLOPS they had done by the time it took. Unfortunately for them, the
Dongarra rules allow you to claim only the FLOPS required by the STANDARD
LINPACK algorithm. So they claimed twice the speed they were allowed to.
In fact, they also ran the benchmark in 32 bits instead of 64, to get
another 2X speed boost. It turns out that the matrix in the LINPACK
benchmark was ill-conditioned enough to blow up on the 32-bit arithmetic,
and Dongarra got suspicious.

Even discounting blatant cheats like the ones I describe above, there is a
wider issue here about how much COMPILE time a user is willing to expend
for "perfect" compiles. It is possible, using techniques of whole-program
compiling, to optimize out all the unnecessary checks in the LINPACK
benchmark -- it's simple enough and well-structured enough that a compiler
can "realize" that DAXPY is always called with increment 1, that the DAXPY
can be in-lined, that N is always greater than zero, that one-trip DO
loops will work, etc. But these optimizations take time -- so much time
that on real codes larger than LINPACK, one of two things will happen:
1. The user will think the compiler has gone into an endless loop and zap
the compile, or
2. The compiler writer, probably interfacing with Marketing, will build
lots of escapes into the optimizer so that compile times appear to be
finite.
There ARE users who claim they will tolerate long optimization times; my
experience in Marketing was that such users were few and far between. The
main uses for such a compiler are large third-party application codes that
are seldom compiled -- once or twice a year for the ones I worked with.
But the size of these codes works AGAINST the whole-program compiler!
mash@mips.COM (John Mashey) (01/04/91)
In article <MCCALPIN.91Jan3100122@pereland.cms.udel.edu> mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
>>>>>> On 3 Jan 91 06:18:59 GMT, mash@mips.COM (John Mashey) said:
>mash> at least with Whetstone, Dhrystone, and LINPACK.
>
>So what optimizations have been performed on the LINPACK 100x100 code
>that are "absolutely useless" in real life?

As reported in Patterson & Hennessy, page 47: "An extreme example of such
targeted engineering employed compiler optimizations that were benchmark
sensitive. ...a person at one company used a preprocessor that scanned the
text for keywords to try to identify benchmarks by looking at the name of
the author and the name of a key subroutine." Although it doesn't say so in
P&H, this was a search for "DONGARRA".

To be fair, one might call this outright cheating rather than a real
optimization. Certainly, it is at the extreme useless end of the scale of:
a) Hunt for Dongarra.
b) Hunt for the inner loop, and character-match, and if so, generate
optimal code.
c) Do all of the analysis like you should, and generate optimal code.

For Whetstone, the classic one is recognizing a trigonometric identity, and
thus avoiding a fairly expensive computation; for Dhrystone, it is the
strcpy stuff, or earlier, removal of dead code.
-------
Reinhold Weicker has published some good analyses of common benchmarks;
there was a talk for Slater's Microprocessor Forum last Spring, and one for
RISC '90 in Karlsruhe in November. I think there's one in the current (or
immediate past) issue of COMPUTER; maybe somebody can cite the exact
reference, as I don't have it handy.
-------
To my knowledge, no one has so far found anything quite like these things
in the SPEC benchmarks. The closest case is matrix300, where most of the
time is spent inside a small loop. This benchmark was included, rightly or
wrongly, as it at least thrashed the data cache around, and the inner loop
is representative of some real calculations; certainly it was argued about
a lot... For most of the SPEC benchmarks, even if you knew that you were
looking at that benchmark, it wouldn't do you much good. I.e., exactly what
good does it do to know you're compiling gcc? In some cases, about the only
hack you could do would be to generate a program that emitted a copy of the
required output, and did nothing else. I believe this would be fairly easy
to discover :-)
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
borasky@ogicse.ogi.edu (M. Edward Borasky) (01/04/91)
In article <44466@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>To my knowledge, no one has so far found anything quite like these things
>in the SPEC benchmarks. The closest case is matrix300, where most of the
>time is spent inside a small loop. This benchmark was included,
>rightly or wrongly, as it at least thrashed the data cache around

Correct me if I'm wrong, but the matrix300 is Jack Dongarra's 300-equation
LINPACK benchmark, isn't it? If so, the inner loop is matrix multiply,
unrolled 1, 2, 4, 8 and 16 ways. At least one compiler (FPS Model 500)
recognizes the 1-way unrolled case as a matrix-vector multiply and calls in
a hand-optimized matrix multiply routine. In this case, cache is not
relevant, because the hand-optimized routine knows about the memory
hierarchy.

On the subject of LINPACK cheating, we really ought to get Jack Dongarra
involved in the discussion. Unfortunately, he HATES to talk about his
experiences with vendors in this area. Every vendor had somebody who was
ever vigilant for cases of OTHER vendors cheating; I filled that function
for FPS at one time.

>In some cases, about the only hack you could do would be to
>generate a program that emitted a copy of the required output, and did
>nothing else.

Do you believe that vendors are working on this? If so, it would be
relatively easy to defeat using techniques of public-key cryptography
(which, by the way, is an interesting application for BENCHMARKING!!!!)
mash@mips.COM (John Mashey) (01/04/91)
In article <15577@ogicse.ogi.edu> borasky@ogicse.ogi.edu (M. Edward Borasky) writes:
>In article <44466@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>To my knowledge, no one has so far found anything quite like these things
>>in the SPEC benchmarks. The closest case is matrix300, where most of the
>>time is spent inside a small loop. This benchmark was included,
>>rightly or wrongly, as it at least thrashed the data cache around
>Correct me if I'm wrong, but the matrix300 is Jack Dongarra's 300-
>equation LINPACK benchmark, isn't it? If so, the inner loop is matrix
>multiply, unrolled 1, 2, 4, 8 and 16 ways. At least one compiler (FPS
>Model 500) recognizes the 1-way unrolled case as a matrix-vector
>multiply and calls in a hand-optimized matrix multiply routine. In
>this case, cache is not relevant because the hand-optimized routine
>knows about the memory hierarchy.

Well, like I said, that's why matrix300 was on the edge. However, I don't
even mind this too much, if the compiler recognizes the case generically
(as opposed to special-case recognition), given that certain kinds of code
really do spend a lot of time in such loops.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
gideony@microsoft.UUCP (Gideon YUVAL) (01/08/91)
I notice (Myers & Budde's 80960 book, p. 5, footnote): "As a historical
note, the Dhrystone benchmark was originally developed as part of the 80960
development program". It may therefore be that Dhrystone's love for
register-window machines (i.e., 960 and SPARC) is not 100% an accident.
--
Gideon Yuval, gideony@microsoft.UUCP, 206-882-8080 (fax: 206-883-8101; TWX: 160520)