rrr@u02.svl.cdc.com (Rich Ragan) (04/09/91)
There has been discussion in the past about trying to separate out the
effects of compilers from the underlying hardware performance.  Control
Data has just submitted the following SPEC numbers for publication.

[A Control Data 4680 is the same as a Mips 6280, although the
multi-processor version will be different.  This data is from a vanilla
scalar 4680 and so should be completely equivalent to a 6280.]

The only thing we changed was to use a new FORTRAN compiler jointly
developed with Kuck and Associates for the multi-processor CDC 4680.
As it turns out, the compiler does optimizations which help
single-processor machines as well.  This data provides insight into
what changing the compiler while holding the machine constant can do
to the performance of the machine/compiler combination.

-----------------------------------------------------------------------------
           gcc   espr.  li    eqntott spice doduc nasa7 matrix fpppp tomcatv
-----------------------------------------------------------------------------
Mips 6280  46.0  42.4   54.6  41.2    38.4  43.0  45.6  49.8   55.6  43.3
CDC 4680   46.0  42.4   54.6  41.2    40.3  44.0  62.4  181.7  56.5  57.5
-----------------------------------------------------------------------------

The Mips SPECmark is from our runs using the upcoming 2.20 compilers.
I think they may have reported something a little higher (~46.5).

Mips 6280   SPECmark: 45.7   IntSpecs: 45.8   FPSpecs: 45.6
CDC 4680    SPECmark: 55.7   IntSpecs: 45.8   FPSpecs: 63.5
--
Richard R. Ragan   rrr@svl.cdc.com   (408) 496-4340
Control Data Corporation - Silicon Valley Operations
5101 Patrick Henry Drive, Santa Clara, CA 95054-1111
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (04/09/91)
>> On 8 Apr 91 17:02:48 GMT, rrr@u02.svl.cdc.com (Rich Ragan) said:
Rich> There has been discussion in the past about trying to separate
Rich> out the effects of compilers from the underlying hardware performance.
[......]
Rich> The only thing we changed was to use a new FORTRAN compiler jointly
Rich> developed with Kuck and Associates for the multi-processor CDC 4680.
-----------------------------------------------------------------------------
           gcc   espr.  li    eqntott spice doduc nasa7 matrix fpppp tomcatv
-----------------------------------------------------------------------------
Mips 6280  46.0  42.4   54.6  41.2    38.4  43.0  45.6  49.8   55.6  43.3
CDC 4680   46.0  42.4   54.6  41.2    40.3  44.0  62.4  181.7  56.5  57.5
-----------------------------------------------------------------------------
                                                        ^^^^^
I have worried about the inclusion of the 'matrix300' code in the SPEC
suite, as it is such a simple calculation mathematically that it is
possible that special compiler techniques can be used to greatly
enhance the performance without necessarily helping the performance of
more general codes.
In this case, the LINPACK routines SGEMV, SGEMM, and SAXPY are used.
It is well known that SGEMM can show very large improvements from
hand-coding (I get 33 MFLOPS vs. 6 MFLOPS on my IBM RS/6000-320 by
hand-coding), so an "SGEMM-recognizer" could short-circuit the
usefulness of this benchmark considerably. Note that the new
HP9000/730 gets a score of 273 on this test, which raises its SPEC
floating-point rating considerably!
This is not to suggest that Kuck & Associates did it this way, but the
block-mode approach that is so helpful on matrix operations is of much
more limited utility on more general array operations.
P.S. Tell us more about the multiprocessor CDC 4680!!!!
--
John D. McCalpin mccalpin@perelandra.cms.udel.edu
Assistant Professor mccalpin@brahms.udel.edu
College of Marine Studies, U. Del. J.MCCALPIN/OMNET
john@iastate.edu (Hascall John Paul) (04/09/91)
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:

}>> On 8 Apr 91 17:02:48 GMT, rrr@u02.svl.cdc.com (Rich Ragan) said:
}           gcc   espr.  li    eqntott spice doduc nasa7 matrix fpppp tomcatv
}Mips 6280  46.0  42.4   54.6  41.2    38.4  43.0  45.6  49.8   55.6  43.3
}CDC 4680   46.0  42.4   54.6  41.2    40.3  44.0  62.4  181.7  56.5  57.5
}                                                        ^^^^^
}I have worried about the inclusion of the 'matrix300' code in the SPEC
}suite, as it is such a simple calculation mathematically that it is
}possible that special compiler techniques can be used to greatly
}enhance the performance without necessarily helping the performance of
}more general codes. ...
 ~~~~~~~~~~~~~~~~~~

This is the important factor; it would be nice to have a goodly
number more than 10 tests...

}HP9000/730 gets a score of 273 on this test, which raises its SPEC ...

   Is it time to axe "matrix300" from the SPECsuite?  This makes the
second instance (others?) of SPECnum which can only be explained by
a compiler significantly reducing the problem.

   Also, I seem to recall there was another one on which the Snake SPEC
was "out of character" (100+ while most of the rest were 30s or 40s)?
I suppose that program suffers similarly?

   Comments?  Corrections?  Flames?  (like I need to ask)

John
--
John Hascall                        An ill-chosen word is the fool's messenger.
Project Vincent
Iowa State University Computation Center                   john@iastate.edu
Ames, IA  50011                                            (515) 294-9551
dfields@radium.urbana.mcd.mot.com (David Fields) (04/10/91)
In article <1991Apr9.052607.12055@news.iastate.edu>, john@iastate.edu
(Hascall John Paul) writes:

< stuff about CDC/KAI improvement of the matrix300 portion of the SPEC >
< suite deleted >

|>This is the important factor; it would be nice to have a goodly
|>number more than 10 tests...

SPEC has been soliciting tests from the general public for a good
while, and John Mashey and others have asked here.  I'm sure SPEC would
be willing to listen if you would like to donate some reasonably large,
portable, and real codes.

|>}HP9000/730 gets a score of 273 on this test, which raises its SPEC ...
|>
|>   Is it time to axe "matrix300" from the SPECsuite?  This makes the
|> second instance (others?) of SPECnum which can only be explained by
|> a compiler significantly reducing the problem.
|>
|>   Also, I seem to recall there was another one on which the Snake SPEC
|> was "out of character" (100+ while most of the rest were 30s or 40s)?
|> I suppose that program suffers similarly?

While I realize at least some of the limitations of the matrix300
test, I have gotten the impression that some limited number of real
codes do have at least portions which are vectorizable (sic).  If
that's true, then I don't see why anyone who doesn't believe in
the "One True Number" has a problem with it.

Dave Fields // Motorola Computer Group // dfields@urbana.mcd.mot.com
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (04/10/91)
> On 9 Apr 91 17:42:38 GMT, dfields@radium.urbana.mcd.mot.com
> (David Fields) said:

David> While I realize at least some of the limitations of the
David> matrix300 test, I have gotten the impression that some limited
David> number of real codes do have at least portions which are
David> vectorizable (sic).  If that's true, then I don't see why
David> anyone who doesn't believe in the "One True Number" has a
David> problem with it.

The problem is not that the code is vectorizable, but that it consists
entirely of one simple algorithm whose performance characteristics have
been studied exhaustively over the years.

For a matrix of order N, there are O(N^3) operations to be done on the
N^2 data elements.  The naive algorithm --- in fact, the algorithm
*used by the code* --- requires O(N^3) loads, so that the cache is not
useful and the code runs at main memory speed.  Block-mode algorithms
are available which reduce the number of loads to O(N^2).  In this
case, there is lots of operand re-use and main memory transfers are no
longer the bottleneck.  I believe that this switch in algorithms is
essentially what the Kuck and Associates front-end to the Fortran
compiler does with this test.

There is, of course, nothing wrong with this, and it clearly speeds up
the code.  What is wrong is that there are *very few* algorithms which
are as well understood as Gaussian elimination, so the probability that
the compiler will be able to do something similar to your application
is small --- especially since many applications do O(N^2) operations on
O(N^2) data, so that no "block-mode" algorithms can exist.
--
John D. McCalpin                       mccalpin@perelandra.cms.udel.edu
Assistant Professor                    mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.     J.MCCALPIN/OMNET
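[To make the blocking argument above concrete, here is a rough C sketch
of the idea: a naive triple-loop matrix multiply next to a blocked
version that performs exactly the same arithmetic in a different order.
It is not the matrix300 source (which is Fortran) and not the actual
Kuck & Associates transformation; the 300x300 size, the block size of
50, and all names are illustrative assumptions only.]

#include <stdio.h>

#define N  300
#define BS 50   /* block size: illustrative; tune so a few BSxBS tiles fit in cache */

static double a[N][N], b[N][N], c[N][N];

/* Naive triple loop: the inner loop walks b down a column, so each
   element of b is fetched from memory about N times and the cache
   gives little help: roughly O(N^3) loads for O(N^3) flops. */
static void matmul_naive(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

/* Blocked version: exactly the same arithmetic in a different order.
   Each BSxBS tile is re-used many times while it sits in the cache,
   so main-memory traffic drops by roughly a factor of BS. */
static void matmul_blocked(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0.0;

    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double aik = a[i][k];
                        for (int j = jj; j < jj + BS; j++)
                            c[i][j] += aik * b[k][j];
                    }
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = (double)(i - j);
            b[i][j] = (double)(i + j);
        }

    matmul_naive();
    double check_naive = c[N/2][N/2];

    matmul_blocked();
    printf("naive %g  blocked %g\n", check_naive, c[N/2][N/2]);
    return 0;
}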
khb@chiba.Eng.Sun.COM ((<khb@chiba Keith Bierman fpgroup>)) (04/11/91)
In article <2525@urbana.mcd.mot.com> dfields@radium.urbana.mcd.mot.com (David Fields) writes:
...
While I realize at least some of the limitations of the matrix300
test, I have gotten the impression that some limited number of real
codes do have at least portions which are vectorizable (sic).  If
that's true, then I don't see why anyone who doesn't believe in
the "One True Number" has a problem with it.
Because it provides a truly distorted component. Once upon a time, the
story was told that SPEC was looking only for real codes (to avoid
such pitfalls). However, MATRIX300 (simple vectorizable matrix
multiply) and NASA7 (Bailey's collection of computational kernels of
interest to NASA AMES) were included anyway.
MATRIX300 tells us little that linpack doesn't (albeit with a 300x300
matrix rather than a 100x100 or a 1000x1000); the coding style is
somewhat different, so slightly different transformations yield the
best results.
--
----------------------------------------------------------------
Keith H. Bierman keith.bierman@Sun.COM| khb@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33 | (415 336 2648)
Mountain View, CA 94043
prener@watson.ibm.com (Dan Prener) (04/11/91)
In article <1991Apr9.052607.12055@news.iastate.edu>, john@iastate.edu
(Hascall John Paul) writes:

|>   Is it time to axe "matrix300" from the SPECsuite?  This makes the
|> second instance (others?) of SPECnum which can only be explained by
|> a compiler significantly reducing the problem.

Why is that a problem?  What else could any set of benchmarks be
testing but the combination of compiler and hardware (and operating
system, if the benchmark isn't largely self-contained)?
--
Dan Prener  (prener @ watson.ibm.com)
streich@sgi.com (Mark Streich) (04/11/91)
|> In article <1991Apr9.052607.12055@news.iastate.edu>, john@iastate.edu (Hascall John Paul) writes:
|>
|>   Is it time to axe "matrix300" from the SPECsuite?  This makes the
|> second instance (others?) of SPECnum which can only be explained by
|> a compiler significantly reducing the problem.

Perhaps it is time to split the SPEC numbers into two pieces:

1. No change from what we get now, letting compilers do whatever they
can to the code.

2. A number based on how many integer and floating point operations
the program *actually* performs when being run.  Instead of getting
"credit" for the number of operations to be executed as defined by
the source code, "credit" is given for the runtime frequency of ops
in the executable.

Example:  A source program calculating PI uses 1 million floating
point operations to do the job.  The compiler, using pattern matching,
detects this case and simply stuffs the value of PI into a variable,
avoiding all of the calculations.  (1) would give credit for all 1
million flops (and why not? it solved the problem), but (2) would give
credit only for the operations required to store the value into the
variable.

We can't go solely to (2) because there would be less incentive to
improve the compilers, but it would give us the opportunity to compare
architectures on a more equal footing.  Would people prefer a system
that performs 1M flops per second or 1 flop per second if both solve
the same problem?

Mark Streich   streich@sgi.com
#include <std.disclaimer>
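[A hypothetical C rendering of the PI example above, to make the two
crediting schemes concrete: the source spells out roughly a million
floating point operations in a Leibniz-series loop, but a compiler that
recognizes the idiom could in principle replace the whole loop with a
single store of the answer.  The program and its constants are
illustrative only, not from any benchmark.]

#include <stdio.h>

int main(void)
{
    double pi = 0.0;
    double sign = 1.0;

    /* Leibniz series: pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...
       About 4 flops per iteration, so ~1 million flops in total
       as written in the source. */
    for (long k = 0; k < 250000; k++) {
        pi += sign / (2.0 * k + 1.0);
        sign = -sign;
    }
    pi *= 4.0;

    /* Scheme (1) credits all the flops the source specifies; scheme
       (2) would credit only the operations the executable actually
       performs, which could be a handful if a pattern-matching
       compiler folds the loop away at compile time. */
    printf("pi ~= %.6f\n", pi);
    return 0;
}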
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (04/12/91)
>>>>> On 11 Apr 91 14:35:29 GMT, streich@sgi.com (Mark Streich) said:

Mark> |> In article <1991Apr9.052607.12055@news.iastate.edu>, john@iastate.edu (Hascall John Paul) writes:
Mark> |>
Mark> |>   Is it time to axe "matrix300" from the SPECsuite?  This makes the
Mark> |> second instance (others?) of SPECnum which can only be explained by
Mark> |> a compiler significantly reducing the problem.

Mark> Perhaps it is time to split the SPEC numbers into two pieces:

Mark> 2. A number based on how many integer and floating point operations
Mark> the program *actually* performs when being run.  Instead of getting
Mark> "credit" for the number of operations to be executed as defined by
Mark> the source code, "credit" is given for the runtime frequency of ops
Mark> in the executable.

Note that this is not relevant to what is being done with the Matrix300
code.  The aggressively optimized code is doing *exactly the same*
operations as the source, but is doing them in a *significantly
different* order.

My understanding of the state of the art is that this is not really
very generally do-able --- it works for matrix300 because Gaussian
Elimination of dense matrices is very well understood.  My guess is
that the compiler would end up doing considerably less well on any
other piece of code (for which the optimum answer is not necessarily
known by the compiler writers in advance).

There was a fellow at Argonne, Wayne Cowell (I think), who had some
preprocessing tools which performed various loop unrolling and
compression tricks to reduce memory accesses.  An early version of
these codes is available with TOOLPACK.  I understand that a much
improved version is available in the NAG distribution of toolpack, but
I have not seen it.
--
John D. McCalpin                       mccalpin@perelandra.cms.udel.edu
Assistant Professor                    mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.     J.MCCALPIN/OMNET
mjs@hpfcso.FC.HP.COM (Marc Sabatella) (04/13/91)
> 2. A number based on how many integer and floating point operations
> the program *actually* performs when being run.  Instead of getting
> "credit" for the number of operations to be executed as defined by
> the source code, "credit" is given for the runtime frequency of ops
> in the executable.

Two obvious flaws:

a) How on earth would you measure that?  Have someone disassemble the
   compiled code and hand-trace its execution, counting operations?  Or
   perhaps supply hand-coded assembly versions of the program for each
   architecture?

b) The problem with some benchmarks is not that the compiler reduces
   the number of operations.  For instance, matrix300 requires some set
   number of floating point operations on an array.  By merely storing
   the array in row-major order rather than column-major order, without
   changing the number of floating point operations (or user-level
   integer operations), you can relieve the strain on the cache and TLB
   and get tremendous performance improvements.
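[A minimal C sketch of point (b): two routines that perform exactly the
same N*N floating point additions, differing only in the order the
array is walked.  It is not taken from matrix300; the array size and
names are arbitrary assumptions.]

#include <stdio.h>

#define N 1024

static double a[N][N];

/* Row-order traversal: stride-1 through memory (C stores arrays in
   row-major order), so the cache and TLB are used well. */
static double sum_row_order(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-order traversal: a stride of N doubles between consecutive
   accesses.  Same N*N additions, but far more cache and TLB misses. */
static double sum_column_order(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;

    printf("row order %g, column order %g\n",
           sum_row_order(), sum_column_order());
    return 0;
}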
preston@ariel.rice.edu (Preston Briggs) (04/13/91)
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:

>Note that this is not relevant to what is being done with the
>Matrix300 code.  The aggressively optimized code is doing *exactly the
>same* operations as the source, but is doing them in a *significantly
>different* order.

True.

>My understanding of the state of the art is that
>this is not really very generally do-able --- it works for matrix300
>because Gaussian Elimination of dense matrices is very well
>understood.  My guess is that the compiler would end up doing
>considerably less well on any other piece of code (for which the
>optimum answer is not necessarily known by the compiler writers in
>advance).

I think you're too pessimistic.  Carr and Kennedy (and others) just
know a lot about nested loops.  I think that's the key, rather than
the particular computation.

It may be that the Kuck analyser is recognizing DAXPY explicitly and
substituting a call to a hand-coded routine.  I'd say that's cheating
the spirit.  But inlining and doing lots of general loop optimizations
is just nice compiler technology (as in vectorization), and I'd think
we'd like to encourage its wide adoption.

Preston Briggs
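[For readers who haven't seen it, the DAXPY idiom mentioned above is
just the loop shape below, shown in C rather than the benchmark's
Fortran; a pattern-matching front end could plausibly replace such a
loop with a call to a hand-tuned library routine.  The C version is an
illustration only.]

/* The DAXPY operation: y = a*x + y, element by element. */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}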
streich@tigger.asd.sgi.com (Mark Streich) (04/14/91)
In article <8840027@hpfcso.FC.HP.COM>, mjs@hpfcso.FC.HP.COM (Marc Sabatella) writes:

|> > 2. A number based on how many integer and floating point operations
|> > the program *actually* performs when being run.  Instead of getting
|> > "credit" for the number of operations to be executed as defined by
|> > the source code, "credit" is given for the runtime frequency of ops
|> > in the executable.
|>
|> Two obvious flaws:
|>
|> a) How on earth would you measure that?  Have someone disassemble the
|>    compiled code and hand-trace its execution, counting operations?  Or
|>    perhaps supply hand-coded assembly versions of the program for each
|>    architecture?

There are tools to do exactly this.  There were a number of papers
presented at the recent ASPLOS IV that counted the frequency of
instructions in the SPEC benchmarks for different architectures.

Mark Streich   streich@sgi.com
#include <std.disclaimer>
mjs@hpfcso.FC.HP.COM (Marc Sabatella) (04/16/91)
Numerous people have pointed out that there exist tools to produce
instruction counts, which would aid in the counting of integer and
floating point operations.  This is true (I have written a
quick-and-dirty instruction counter myself, using ptrace), but such
tools tend to be system/architecture dependent (even my single-stepper
didn't work on SPARC), cumbersome to use, and report with different
granularity.  For instance, what exactly would constitute a floating
point operation, and how would the various tools report it?

Benchmarks should be self-contained.  Relying on extra system-dependent
hardware or software to analyze the results of a run will only serve to
guarantee apples-to-oranges comparisons.