[comp.arch] Compilers & SPECmarks...

rrr@u02.svl.cdc.com (Rich Ragan) (04/09/91)

There has been discussion in the past about trying to separate
out the effects of compilers from the underlying hardware performance.
Control Data has just submitted the following SPEC numbers for
publication.  [A Control Data 4680 is the same as a Mips 6280,
although the multi-processor version will be different. This data is
from a vanilla scalar 4680 and so should be completely equivalent to
a 6280.]

The only thing we changed was to use a new FORTRAN compiler jointly
developed with Kuck and Associates for the multi-processor CDC 4680.
As it turns out, the compiler does optimizations which help single
processor machines as well. This data provides insight into what
changing the compiler and holding the machine constant can do to the
performance of the machine/compiler combination.

-----------------------------------------------------------------------------
            gcc  espr. li   eqntott spice doduc nasa7 matrix fpppp tomcatv
-----------------------------------------------------------------------------
Mips 6280   46.0 42.4  54.6  41.2   38.4  43.0  45.6   49.8   55.6  43.3
CDC  4680   46.0 42.4  54.6  41.2   40.3  44.0  62.4  181.7   56.5  57.5
-----------------------------------------------------------------------------

The Mips SPECmark is from our runs using the upcoming 2.20 compilers.
I think they may have reported something a little higher (~46.5).

Mips 6280 SPECmark: 45.7, IntSpecs: 45.8, FPSpecs: 45.6
CDC  4680 SPECmark: 55.7, IntSpecs: 45.8, FPSpecs: 63.5

--
Richard R. Ragan   rrr@svl.cdc.com    (408) 496-4340 
Control Data Corporation - Silicon Valley Operations
5101 Patrick Henry Drive, Santa Clara, CA 95054-1111

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (04/09/91)

>> On 8 Apr 91 17:02:48 GMT, rrr@u02.svl.cdc.com (Rich Ragan) said:

Rich> There has been discussion in the past about trying to separate
Rich> out the effects of compilers from the underlying hardware performance.
		[......]
Rich> The only thing we changed was to use a new FORTRAN compiler jointly
Rich> developed with Kuck and Associates for the multi-processor CDC 4680.

-----------------------------------------------------------------------------
            gcc  espr. li   eqntott spice doduc nasa7 matrix fpppp tomcatv
-----------------------------------------------------------------------------
Mips 6280   46.0 42.4  54.6  41.2   38.4  43.0  45.6   49.8   55.6  43.3
CDC  4680   46.0 42.4  54.6  41.2   40.3  44.0  62.4  181.7   56.5  57.5
-----------------------------------------------------------------------------
						      ^^^^^

I have worried about the inclusion of the 'matrix300' code in the SPEC
suite, as it is such a simple calculation mathematically that it is
possible that special compiler techniques can be used to greatly
enhance the performance without necessarily helping the performance of
more general codes.  

In this case, the LINPACK routines SGEMV, SGEMM, and SAXPY are used.
It is well known that SGEMM can show very large improvements by
hand-coding (I get 33 MFLOPS vs 6 MFLOPS on my IBM RS/6000-320 by
hand-coding), so an "SGEMM-recognizer" could short-circuit the
usefulness of this benchmark considerably.  Note that the new
HP9000/730 gets a score of 273 on this test, which raises its SPEC
floating-point rating substantially!

This is not to suggest that Kuck & Associates did it this way, but the
block-mode approach that is so helpful on matrix operations is of much
more limited utility on more general array operations.
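
For concreteness, the kernel in question is essentially dense matrix
multiplication coded in the column-oriented LINPACK style.  The
following is only a sketch of that style (my own, not the actual SPEC
source):

C     Sketch only -- not the SPEC matrix300 source.  C = C + A*B
C     written as column-oriented SAXPY operations.  For every column
C     of C the entire matrix A is swept again, so on the order of
C     N**3 operands come in from memory and the loop runs at
C     main-memory speed unless the compiler restructures it.
      SUBROUTINE MMULT(A, B, C, N, LDA)
      INTEGER N, LDA, I, J, K
      REAL A(LDA,*), B(LDA,*), C(LDA,*)
      DO 30 J = 1, N
         DO 20 K = 1, N
            DO 10 I = 1, N
               C(I,J) = C(I,J) + A(I,K)*B(K,J)
   10       CONTINUE
   20    CONTINUE
   30 CONTINUE
      RETURN
      END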

P.S. Tell us more about the multiprocessor CDC 4680!!!!
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

john@iastate.edu (Hascall John Paul) (04/09/91)

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
}>> On 8 Apr 91 17:02:48 GMT, rrr@u02.svl.cdc.com (Rich Ragan) said:
}            gcc  espr. li   eqntott spice doduc nasa7 matrix fpppp tomcatv
}Mips 6280   46.0 42.4  54.6  41.2   38.4  43.0  45.6   49.8   55.6  43.3
}CDC  4680   46.0 42.4  54.6  41.2   40.3  44.0  62.4  181.7   56.5  57.5
}						      ^^^^^
}I have worried about the inclusion of the 'matrix300' code in the SPEC
}suite, as it is such a simple calculation mathematically that it is
}possible that special compiler techniques can be used to greatly
}enhance the performance without necessarily helping the performance of
}more general codes.  ...
 ~~~~~~~~~~~~~~~~~~
This is the important factor; it would be nice to have a goodly
number more than 10 tests...

}HP9000/730 gets a score of 273 on this test, which raises its SPEC ...

   Is it time to axe "matrix300" from the SPECsuite?  This makes the
second instance (others?) of SPECnum which can only be explained by
a compiler significantly reducing the problem.

   Also, I seem to recall there was another test on which the Snake SPEC
number was "out of character" (100+ while most of the rest were in the
30s or 40s)?
   I suppose that program suffers similarly?

   Comments?    Corrections?    Flames? (like I need to ask)

John

--
John Hascall                        An ill-chosen word is the fool's messenger.
Project Vincent
Iowa State University Computation Center                       john@iastate.edu
Ames, IA  50011                                                  (515) 294-9551

dfields@radium.urbana.mcd.mot.com (David Fields) (04/10/91)

In article <1991Apr9.052607.12055@news.iastate.edu>, john@iastate.edu
(Hascall John Paul) writes:

 < stuff about CDC/KAI improvement of the matrix300 portion of the SPEC >
 < suite deleted							>

|>This is the important factor; it would be nice to have a goodly
|>number more than 10 tests...

SPEC has been soliciting tests from the general public for a good while,
and John Mashey and others have asked here as well.  I'm sure SPEC would
be willing to listen if you would like to donate some reasonably large,
portable, and real codes.

|>}HP9000/730 gets a score of 273 on this test, which raises its SPEC ...
|>
|>   Is it time to axe "matrix300" from the SPECsuite?  This makes the
|>second instance (others?) of SPECnum which can only be explained by
|>a compiler significantly reducing the problem.
|>
|>   Also, I seem to recall there was another one on which the Snake SPEC
|>was "out of character" (100+ while most of the rest were 30s or 40s)?
|>   I suppose that program suffers similarly?

While I realize at least some of the limitations of the matrix300
test, I have gotten the impression that some limited number of real
codes do have at least portions which are vectorizable (sic).  If
that's true, then I don't see why anyone who doesn't believe in
the "One True Number" has a problem with it. 
                                  
Dave Fields // Motorola Computer Group // dfields@urbana.mcd.mot.com

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (04/10/91)

> On 9 Apr 91 17:42:38 GMT, dfields@radium.urbana.mcd.mot.com 
> (David Fields) said:

David> While I realize at least some of the limitations of the
David> matrix300 test, I have gotten the impression that some limited
David> number of real codes do have at least portions which are
David> vectorizable (sic).  If that's true, then I don't see why
David> anyone who doesn't believe in the "One True Number" has a
David> problem with it.

The problem is not that the code is vectorizable, but that it consists
entirely of one simple algorithm whose performance characteristics
have been studied exhaustively over the years.  For a matrix of order
N, there are O(N^3) operations to be done on the N^2 data elements.
The naive algorithm --- in fact, the algorithm *used by the code* ---
requires O(N^3) loads, so that the cache is not useful and the code
runs at main memory speed.  Block-mode algorithms are available which
reduce the number of loads to O(N^2).  In this case, there is lots of
operand re-use and main memory transfers are no longer the bottleneck.
I believe that this switch in algorithms is essentially what the Kuck
and Associates front-end to the Fortran compiler does with this test.

There is, of course, nothing wrong with this, and it clearly speeds up
the code.  What is wrong is that there are *very few* algorithms which
are as well understood as Gaussian elimination, so the probability
that the compiler will be able to do something similar to your
application is small --- especially since many applications do O(N^2)
operations on O(N^2) data, so that no "block-mode" algorithms can exist.
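
As a sketch of what such a block-mode rewrite looks like (my own
illustration, not what the Kuck and Associates preprocessor actually
emits), the column loop of the plain triple loop can be tiled so that
a strip of C stays in cache and each column of A is re-used:

C     Sketch of loop blocking (tiling) for C = C + A*B.  NB is chosen
C     so that an N x NB strip of C (plus a column of A) fits in cache.
C     Each column of A is then fetched from memory once per strip of C
C     rather than once per column of C, so the traffic on A falls from
C     about N**3 operands to about N**3/NB, approaching the O(N^2)
C     figure mentioned above as the blocks grow.  The arithmetic is
C     identical to the plain triple loop; only the order changes.
      SUBROUTINE MMULTB(A, B, C, N, LDA, NB)
      INTEGER N, LDA, NB, I, J, K, JJ
      REAL A(LDA,*), B(LDA,*), C(LDA,*)
      DO 40 JJ = 1, N, NB
         DO 30 K = 1, N
            DO 20 J = JJ, MIN(JJ+NB-1, N)
               DO 10 I = 1, N
                  C(I,J) = C(I,J) + A(I,K)*B(K,J)
   10          CONTINUE
   20       CONTINUE
   30    CONTINUE
   40 CONTINUE
      RETURN
      END
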
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

khb@chiba.Eng.Sun.COM ((<khb@chiba Keith Bierman fpgroup>)) (04/11/91)

In article <2525@urbana.mcd.mot.com> dfields@radium.urbana.mcd.mot.com (David Fields) writes:

...
   While I realize at least some of the limitations of the matrix300
   test, I have gotten the impression the some limited number of real
   codes do have at least portions which are vectorizable (sic).  If
   that's true then I don't see why any one who doesn't believe in
   the "One True Number" has a problem with it. 

Because it provides a truly distorted component. Once upon a time, the
story was told that SPEC was looking only for real codes (to avoid
such pitfalls). However, MATRIX300 (simple vectorizable matrix
multiply) and NASA7 (Bailey's collection of computational kernels of
interest to NASA AMES) were included anyway.

MATRIX300 tells us little that linpack doesn't (albeit with a 300x300
matrix rather than a 100x100 or a 1000x1000); the coding style is
somewhat different, so slightly different transformations yield the
best results.
--
----------------------------------------------------------------
Keith H. Bierman    keith.bierman@Sun.COM| khb@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33			 | (415 336 2648)   
    Mountain View, CA 94043

prener@watson.ibm.com (Dan Prener) (04/11/91)

In article <1991Apr9.052607.12055@news.iastate.edu>, john@iastate.edu (Hascall John Paul) writes:

|>    Is it time to axe "matrix300" from the SPECsuite?  This makes the
|> second instance (others?) of SPECnum which can only be explained by
|> a compiler significantly reducing the problem.

Why is that a problem?  What else could any set of benchmarks be testing
but the combination of compiler and hardware (and operating system, if
the benchmark isn't largely self-contained)?
-- 
                                   Dan Prener (prener @ watson.ibm.com)

streich@sgi.com (Mark Streich) (04/11/91)

|> In article <1991Apr9.052607.12055@news.iastate.edu>, john@iastate.edu (Hascall John Paul) writes:
|> 
|>    Is it time to axe "matrix300" from the SPECsuite?  This makes the
|> second instance (others?) of SPECnum which can only be explained by
|> a compiler significantly reducing the problem.

Perhaps it is time to split the SPEC numbers into two pieces:

  1.  No change from what we get now, letting compilers do whatever they 
      can to the code.
  2.  A number based on how many integer and floating point operations 
      the program *actually* performs when being run.  Instead of getting
      "credit" for the number of operations to be executed as defined by
      the source code, "credit" is given for the runtime frequency of ops
      in the executable.

Example:  A source program calculating PI uses 1 million floating point
      operations to do the job.  The compiler, using pattern matching, 
      detects this case, and simply stuffs the value of PI into a variable
      and avoids doing all of the calculations.  

      (1) would give credit for all 1 million flops (and why not? it solved
      the problem), but (2) would give credit only for the operations
      required to store the value into the variable.
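
      As an entirely hypothetical illustration, a Leibniz-series loop
      for PI is the kind of thing such a pattern matcher might collapse:

C     Hypothetical example only.  As written, the loop performs a few
C     million floating point operations; a sufficiently aggressive
C     compiler could in principle recognize that the result is a
C     compile-time constant and simply store it, which is the scenario
C     described above.
      PROGRAM PILOOP
      INTEGER I
      DOUBLE PRECISION PI, TERM
      PI = 0.0D0
      TERM = 1.0D0
      DO 10 I = 0, 999999
         PI = PI + TERM*4.0D0/(2.0D0*DBLE(I) + 1.0D0)
         TERM = -TERM
   10 CONTINUE
      PRINT *, PI
      END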

We can't go solely to (2) because there would be less incentive to improve
the compilers, but it would give us the opportunity to compare architectures
on a more equal footing.  Would people prefer a system that performs 1M flops
per second or 1 flop per second if both solve the same problem?

Mark Streich
streich@sgi.com

#include <std.disclaimer>

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (04/12/91)

>>>>> On 11 Apr 91 14:35:29 GMT, streich@sgi.com (Mark Streich) said:

Mark> |> In article <1991Apr9.052607.12055@news.iastate.edu>, john@iastate.edu (Hascall John Paul) writes:
Mark> |> 
Mark> |>    Is it time to axe "matrix300" from the SPECsuite?  This makes the
Mark> |> second instance (others?) of SPECnum which can only be explained by
Mark> |> a compiler significantly reducing the problem.

Mark> Perhaps it is time to split the SPEC numbers into two pieces:

Mark>   2.  A number based on how many integer and floating point operations 
Mark>       the program *actually* performs when being run.  Instead of getting
Mark>       "credit" for the number of operations to be executed as defined by
Mark>       the source code, "credit" is given for the runtime frequency of ops
Mark>       in the executable.

Note that this is not relevant to what is being done with the
Matrix300 code.  The aggressively optimized code is doing *exactly the
same* operations as the source, but is doing them in a *significantly
different* order.   My understanding of the state of the art is that
this is not really very generally do-able --- it works for matrix300
because Gaussian Elimination of dense matrices is very well
understood.   My guess is that the compiler would end up doing
considerably less well on any other piece of code (for which the
optimum answer is not necessarily known by the compiler writers in
advance). 

There was a fellow at Argonne, Wayne Cowell (I think), who had some
preprocessing tools which performed various loop unrolling and
compression tricks to reduce memory accesses.  An early version of
these codes is available with TOOLPACK.  I understand that a much
improved version is available in the NAG distribution of TOOLPACK, but
I have not seen it.
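
For what it is worth, the flavor of those transformations is roughly
the following (outer-loop unrolling, often called unroll-and-jam,
applied to a matrix-vector product; my own sketch, not code from
TOOLPACK):

C     Sketch of unroll-and-jam on Y = Y + A*X, assuming N is even to
C     keep the example short.  Unrolling the column loop by two and
C     fusing the bodies lets each Y(I) fetched in the inner loop be
C     used for two columns of A, roughly halving the memory traffic
C     on Y relative to the rolled loop.
      SUBROUTINE MVU2(A, X, Y, N, LDA)
      INTEGER N, LDA, I, J
      REAL A(LDA,*), X(*), Y(*)
      DO 20 J = 1, N, 2
         DO 10 I = 1, N
            Y(I) = Y(I) + A(I,J)*X(J) + A(I,J+1)*X(J+1)
   10    CONTINUE
   20 CONTINUE
      RETURN
      END
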
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

mjs@hpfcso.FC.HP.COM (Marc Sabatella) (04/13/91)

>  2.  A number based on how many integer and floating point operations 
>      the program *actually* performs when being run.  Instead of getting
>      "credit" for the number of operations to be executed as defined by
>      the source code, "credit" is given for the runtime frequency of ops
>      in the executable.

Two obvious flaws:

a) How on earth would you measure that?  Have someone disassemble the compiled
   code and hand-trace its execution, counting operations?  Or perhaps supply
   hand-coded assembly versions of the program for each architecture?

b) The problem with some benchmarks is not that the compiler reduces the number
   of operations.  For instance, matrix300 requires some set number of floating
   point operations on an array.  By merely storing the array in row-major
   order rather than column-major order, without changing the number of
   floating point operations (or user-land integer operations), you can
   relieve the strain on the cache and TLB and get tremendous performance
   improvements.
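
A small sketch of the effect (not from matrix300, and using loop
interchange rather than a storage-order change, but the point is the
same: identical floating point work, very different cache and TLB
behavior):

C     Each of the two loop nests below performs the same N*N additions
C     on the array A, which Fortran stores in column-major order.
      SUBROUTINE SWEEP(A, S, N, LDA)
      INTEGER N, LDA, I, J
      REAL A(LDA,*), S
C     Column-order sweep: unit stride through memory.
      DO 20 J = 1, N
         DO 10 I = 1, N
            S = S + A(I,J)
   10    CONTINUE
   20 CONTINUE
C     Row-order sweep: same additions, but stride LDA through memory,
C     touching a new cache line (and soon a new page) on nearly every
C     reference.
      DO 40 I = 1, N
         DO 30 J = 1, N
            S = S + A(I,J)
   30    CONTINUE
   40 CONTINUE
      RETURN
      END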

preston@ariel.rice.edu (Preston Briggs) (04/13/91)

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:

>Note that this is not relevant to what is being done with the
>Matrix300 code.  The aggressively optimized code is doing *exactly the
>same* operations as the source, but is doing them in a *significantly
>different* order. 

True.

>My understanding of the state of the art is that
>this is not really very generally do-able --- it works for matrix300
>because Gaussian Elimination of dense matrices is very well
>understood.   My guess is that the compiler would end up doing
>considerably less well on any other piece of code (for which the
>optimum answer is not necessarily known by the compiler writers in
>advance). 

I think you're too pessimistic.
Carr and Kennedy (and others) just know a lot about nested loops.
I think that's the key, rather than the particular computation.

It may be that the Kuck analyser is recognizing DAXPY explicitly
and substituting a call to a hand-coded routine.  I'd say that's
cheating the spirit of the benchmark.  But inlining and doing lots of
general loop optimizations is just nice compiler technology (as in
vectorization), and I'd think we'd like to encourage its wide adoption.
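
To make that concrete, here is my own sketch of what such idiom
recognition amounts to (an illustration, not anything the Kuck
analyser is known to do).  The first routine is the loop as a
programmer would write it; the second is what a recognizer might
substitute:

C     The loop as a programmer would write it: a textbook DAXPY,
C     Y = Y + ALPHA*X.
      SUBROUTINE UPDATE(N, ALPHA, X, Y)
      INTEGER N, I
      DOUBLE PRECISION ALPHA, X(*), Y(*)
      DO 10 I = 1, N
         Y(I) = Y(I) + ALPHA*X(I)
   10 CONTINUE
      RETURN
      END

C     What an idiom recognizer might substitute: one call to the
C     Level-1 BLAS routine DAXPY, standing in here for whatever
C     hand-tuned version the vendor links in.
      SUBROUTINE UPDATE2(N, ALPHA, X, Y)
      INTEGER N
      DOUBLE PRECISION ALPHA, X(*), Y(*)
      CALL DAXPY(N, ALPHA, X, 1, Y, 1)
      RETURN
      END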

Preston Briggs

streich@tigger.asd.sgi.com (Mark Streich) (04/14/91)

In article <8840027@hpfcso.FC.HP.COM>, mjs@hpfcso.FC.HP.COM (Marc Sabatella) writes:
|> >  2.  A number based on how many integer and floating point operations 
|> >      the program *actually* performs when being run.  Instead of getting
|> >      "credit" for the number of operations to be executed as defined by
|> >      the source code, "credit" is given for the runtime frequency of ops
|> >      in the executable.
|> 
|> Two obvious flaws:
|> 
|> a) How on earth would you measure that?  Have someone disassemble the compiled
|>    code and hand-trace its execution, counting operations?  Or perhaps supply
|>    hand-coded assembly versions of the program for each architecture?

There are tools to do exactly this.  There were a number of papers presented
at the recent ASPLOS IV that counted the frequency of instructions in the
SPEC benchmarks for different architectures.

Mark Streich
streich@sgi.com

#include <std.disclaimer>

mjs@hpfcso.FC.HP.COM (Marc Sabatella) (04/16/91)

Numerous people have pointed out that there exist tools to produce
instruction counts, which would aid in counting integer and floating
point operations.  This is true (I have written a quick-and-dirty
instruction counter myself, using ptrace), but such tools tend to be
system/architecture dependent (even my single stepper didn't work on
SPARC), cumbersome to use, and apt to report at different granularities.
For instance, what exactly would constitute a floating point operation,
and how would the various tools report it?

Benchmarks should be self-contained.  Relying on extra system dependent
hardware or software to analyze the results of a run will only serve to
guarantee apples-to-oranges comparisons.