[comp.arch] SPECmarks for RS/6000 systems - lies???

CCHD@lure.latrobe.edu.au (Huw Davies - La Trobe University Computer Centre) (10/04/90)

I have just got a copy of the September 1990 SPECwatch and I am
a bit concerned about the following paragraph:

"There also seems to be a problem with replicating IBM's RS6000
SPECmark results, and with achieving the expected levels of
performance with other code. It's known that IBM extensively
modified the compilers used to compile the benchmarks. If these
"knobs and dials" turn out to be not readily accessible to users
of the production compilers shipped with the systems, SPEC
will be faced with its first serious cheating problem. The usual
prize (a trial subscription or four month extension of an
existing subscription) for the first person to provide
independent RS/6000 SPECmark results using the compilers shipped
with the products."

Would anyone from IBM (or anywhere else for that matter) like
to comment? Has anyone with an RS/6000 and the SPEC benchmark
tape tried/succeeded in running them? I have a 530 on site and
expect two 540's this month. I'd be happy to run the benchmarks
on an idle machine but I don't have access to the tape.

Huw Davies
Computing Services
La Trobe University
Melbourne Australia

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (10/04/90)

> On 4 Oct 90 09:50:31 GMT, CCHD@lure.latrobe.edu.au (Huw Davies) said:

Huw> "There also seems to be a problem with replicating IBM's RS6000
Huw> SPECmark results, and with achieving the expected levels of
Huw> performance with other code. [....]

Huw> Would anyone from IBM (or anywhere else for that matter) like
Huw> to comment? 

The LINPACK 100x100 results are known to be valid.  In fact, I have
never gotten results quite as slow as those quoted in the
advertisements --- but then I have not been trying too hard either....
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@vax1.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

mash@mips.COM (John Mashey) (10/05/90)

In article <4734@lure.latrobe.edu.au> CCHD@lure.latrobe.edu.au (Huw Davies - La Trobe University Computer Centre) writes:
>I have just got a copy of the September 1990 SPECwatch and I am
>a bit concerned about the following paragraph:
>
>"There also seems to be a problem with replicating IBM's RS6000
>SPECmark results, and with achieving the expected levels of
>performance with other code. It's known that IBM extensively
>modified the compilers used to compile the benchmarks. If these
>"knobs and dials" turn out to be not readily accessible to users
>of the production compilers shipped with the systems, SPEC
>will be faced with its first serious cheating problem. The usual
>prize (a trial subscription or four month extension of an
>existing subscription) for the first person to provide
>independent RS/6000 SPECmark results using the compilers shipped
>with the products."

It would be interesting to see other people's ability to replicate
the results, but let's be real careful before branding things lies.
The intent of SPEC is that users understand what they have, and
what versions of things are being used to get results, and in general,
the existing forms of disclosure do that fairly well.  Unfortunately,
what they do NOT always do, in the published form, is describe all of
the compiler options.  (Sometimes they do, sometimes they don't, for
lack of space.)  Now, when a SPECtape comes out, the makefiles are
there for the people who've reported results, but if someone reports
results AFTER a release tape, you can't easily find out what options
they used.

So, it is quite possible that:
	a) Somebody at IBM ran these things, turned knobs and dials
	on the compilers appropriately (and there are plenty of
	knobs and dials on most compilers), and got these results.
	And in fact, as I have high respect for the folks at IBM
	doing the SPEC stuff, I personally believe that they got
	the answers they say they got, although I have not personally
	run them.
	b) In any such case, it is possible that:
		1) There are magic options that only the vendor knows
		about.  This is considered a no-no.
		2) There are magic tools, that only the vendor has, for
		analyzing the programs, to figure out the options that
		should be used.  One would expect that a user who reruns
		the reported result should get the same answers, even if the
		user has no obvious way to derive the right options.
		3) There are tools and explanations available to the
		user, which provide the same performance, with work,
		which would be expected to happen in the normal way that
		people would work.
		4) You just say -Ox, and if you need to do something
		else, the compiler tells you.
	Now, in this hierarchy, the ideal is 4), 3) is OK for some people, 2)
	is getting kind of marginal, and 1) is really a no-no, unless
	the magic options just aren't released yet, but will be.
	c) Of course, users must assess for themselves what to think,
	when faced with big performance differences between levels
	2, 3, and 4.
	d) SPEC is continually working to tighten this up, because the
	goal is that a user can replicate results easily, and we're
	not quite there yet, sometimes.
	e) Certainly, a good calibration is to run the SPEC stuff the
	way you normally would, by starting with the vanilla makefiles,
	and supplying the options you'd pick from a quick reading of the
	system's cc & f77 manual pages.

I often use an extensive computer==car analogy, whose performance measurement
part goes like this:

unreal:	drag-strip, short distance in a straight-line as fast as possible,
	don't care if vehicle useful on the road.
exaggerated: (Dhrystone mips): on the road, but only downhill
real exaggerated: (peak mips & mflops, guaranteed not to exceed): drive it
	off a cliff, and measure as it falls.... hard on the drivers, but
	that's the way it goes
reality: up-hill, down-hill, around curves: Monte Carlo, etc, driven
	by real people

Now, SPEC is a fairly good approximation to one slice of reality, with
the obvious niggle that the numbers reported by vendors are with real
machines, running on real programs .... but usually with skilled racing
drivers who can extract the most performance from their machines.
Sometimes, an average person can get the same performance, sometimes
not.

But, anyway, let us be careful not to characterize something as "lies"
when there are perfectly legitimate reasons, well within the rules,
that might explain this.  A far more interesting question to ask, in
general, of all of us, is:
	What effort does it take to achieve given levels of performance?
	What's the difference between -O and -O5 -x -y -z5000 -q -k300....?
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/05/90)

In article <41935@mips.mips.COM> mash@mips.COM (John Mashey) writes:

| But, anyway, let us be careful not to characterize something as "lies"
| when there are perfectly legitimate reasons, well within the rules,
| that might explain this.  A far more interesting question to ask, in
| general, of all of us, is:
| 	What effort does it take to achieve given levels of performance?
| 	What's the difference between -O and -O5 -x -y -z5000 -q -k300....?

  This is a really good point, and at the risk of overkill I will remind
readers of some benchmark results which were labeled as lies, although
they turned out to be true.

  One of the early Radio Shack 386 systems (the 3000, perhaps) was
benchmarked by RS and the results published.  Later many other people
tried to get those numbers and couldn't.  Lots of accusations followed.
It turned out that at boot time the amount of memory is tested, and
memory interleaving is turned on only if there is a multiple of 4MB in
the machine (or 2MB, my memory is dim).

  At any rate, when the system was run with less memory, say 1MB and DOS,
the results were a lot slower, since every memory access now had 1-2
wait states added.  This was not the fault of RS; they clearly stated the
memory size.  People just didn't realize that interleaving made a 30% (or
so) improvement in performance for many things.

  As John says, let's be very careful not to confuse the results of
carefully tweaked compilations, links, and kernel configuration with
something which a user just can't do. Tweaking machines and options is
an art, and if it wasn't a lot of us would be reading want ads instead
of netnews.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.

ddt@walt.cc.utexas.edu (David Taylor) (10/06/90)

It would be interesting to know WHICH benchmarks are hard to duplicate.
My guess is that 70% of them were reproducible and the 2 or 3
easily vectorizable ones (tomcatv, nasa7, etc.) were not.  And which
release are we talking about here?

The RS/6000 is a poor candidate for the SPECmark anyway, because its
strengths lie in just a couple of instructions exploitable by some
programs.  The figures from those benchmarks seriously skew the SPECmark.
Remember, it's based on the geometric mean, which doesn't reflect
performance well when the individual benchmark results are spread widely.
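
To make the skew concrete, here is a minimal C sketch (illustrative
numbers only, nothing measured): eight benchmarks at a ratio of 15 plus
two vectorizable outliers at 50 push the geometric mean to about 19,
well above what eight of the ten programs actually achieve.

    #include <math.h>
    #include <stdio.h>

    /* geometric mean of n SPEC ratios */
    static double geomean(const double *r, int n)
    {
        double logsum = 0.0;
        int i;

        for (i = 0; i < n; i++)
            logsum += log(r[i]);
        return exp(logsum / n);
    }

    int main(void)
    {
        /* made-up ratios: a "flat" machine vs. two vectorizable outliers */
        double flat[10]   = { 15, 15, 15, 15, 15, 15, 15, 15, 15, 15 };
        double skewed[10] = { 15, 15, 15, 15, 15, 15, 15, 15, 50, 50 };

        printf("flat   geometric mean: %.1f\n", geomean(flat, 10));   /* 15.0 */
        printf("skewed geometric mean: %.1f\n", geomean(skewed, 10)); /* ~19.1 */
        return 0;
    }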

I think that if you wanted a realistic evaluation of the RS/6000, it
would be safe to say that it will behave much as it did on benchmarks
like gcc, espresso, and spice2g6, and that for some vectorizable programs
it will run several times faster IF the compiler can interpret them
correctly.

	=-ddt->

abe@mace.cc.purdue.edu (Vic Abell) (10/08/90)

In article <4733@lure.latrobe.edu.au>, CCHD@lure.latrobe.edu.au (Huw Davies - La Trobe University Computer Centre) writes:
> I have just got a copy of the September 1990 SPECwatch and I am
> a bit concerned about the following paragraph:
> 
> "There also seems to be a problem with replicating IBM's RS6000
> SPECmark results,

I have been able to reproduce the SPEC ratio for the RS/6000-520 with
only minor changes needed to enable gcc.  (See the Notes/Summary of
Changes section in the report for gcc details.)

Here is a complete SPEC report for the RISC System/6000, model 520
POWERserver.  Its SPEC ratio is 0.1 larger than what IBM reported to me.


                    SPEC Throughput Method A# Results
                        for Release 1.0 Benchmarks

Results:       SPEC    IBM RS/6000                             IBM Corporation
               Ref.        520    
Benchmark      Time    Time    SPEC                       IBM RISC System/6000
No. & Name    (sec.)  (sec.)  Ratio                            POWERserver 520

001.gcc        1482     102    14.5*              Hardware
008.espresso   2266     139    16.3   Model Number:      POWERserver 520
013.spice2g6  23951    1189    20.1   CPU:               IBM POWER 20MHz
015.doduc      1863      88    21.2   FPU:               Integrated
020.nasa7     20093     771    26.1   Number of CPU's:   1
022.li         6206     398    15.6   Cache Size/CPU:    32 KB data, 8 KB ins.
023.eqntott    1101      60    18.4   Memory:            32 MB
030.matrix300  4525     263    17.2   Disk Subsystem:    2-857 MB SCSI
042.fpppp      3038      71    42.8   Network Interface: 1 Ethernet Controller
047.tomcatv    2649      47    56.4

Geometric Mean 3867.7   173.0  22.4* 

            Software                                System
O/S Type and Rev:  AIX 3.1, GA        Tuning Parameters: None in use
Compiler Rev:      XL Fortran 1.1     Background Load:   Normal Unix daemons
Other Software:    XL C 1.1           System State:      Multi-user, lightly
File System Type:  IBM Journaled                         loaded
                   File System
Firmware Level:    N/A

Tested in:      October 1990
By:             Victor A. Abell <abe@mace.cc.purdue.edu>
                Purdue University Computing Center
                Mathematical Sciences Building
                West Lafayette, IN 47907
                (317) 494-1787
SPEC License #  310

Notes/Summary of Changes:

  # Method A: Homogeneous Load

  * Portability changes were required:

    gcc:
        o  Modified alloca.o rule in Makefile to use XL C compiler.
        o  Activated #include of <time.h> in cccp.c.
        o  Disabled redefinition of ptrdiff_t and size_t in stddef.h.
        o  Modified erroneous assignment of enumerated variables to bit
           fields in rtl.h, tree.h, and varasm.c.
        o  Enabled vfork to fork redefinition in gcc.c.

Copyright 1990 Purdue Research Foundation, West Lafayette, Indiana 47907.
All rights reserved.
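
For anyone checking the arithmetic, here is a minimal C sketch (times
copied from the table above; everything else is illustrative) of how the
22.4 SPECmark falls out of the report: each SPEC ratio is the SPEC
reference time divided by the measured run time, and the SPECmark is the
geometric mean of the ten ratios.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* reference and measured times (seconds) from the report above */
        static const double ref[10] = { 1482, 2266, 23951, 1863, 20093,
                                        6206, 1101,  4525, 3038,  2649 };
        static const double run[10] = {  102,  139,  1189,   88,   771,
                                         398,   60,   263,   71,    47 };
        double logsum = 0.0;
        int i;

        for (i = 0; i < 10; i++)
            logsum += log(ref[i] / run[i]);          /* per-benchmark ratio */
        printf("SPECmark = %.1f\n", exp(logsum / 10.0));   /* prints 22.4 */
        return 0;
    }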

khb@chiba.Eng.Sun.COM (Keith Bierman - SPD Advanced Languages) (10/09/90)

In article <4734@lure.latrobe.edu.au> CCHD@lure.latrobe.edu.au (Huw Davies - La Trobe University Computer Centre) writes:

   "There also seems to be a problem with replicating IBM's RS6000
   SPECmark results, and with achieving the expected levels of
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   performance with other code.
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^

Others have already commented, at length, on the left-hand side of that
"and".  While the same applies to the right-hand side, I don't think
we can flog this pony enough.

Looking at just one number won't/can't/will never provide a reasonable
basis for performance forecasting, across differing

		arch+implementation+applications


This does not imply any evil intent on the part of the IBM designers,
compiler implementors, etc.
--
----------------------------------------------------------------
Keith H. Bierman    kbierman@Eng.Sun.COM | khb@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33			 | (415 336 2648)   
    Mountain View, CA 94043

walter@vogons.UUCP (Walter Bays) (10/13/90)

In article <37935@ut-emx.uucp> ddt@walt.cc.utexas.edu (David Taylor) writes:
>The RS/6000 is a poor candidate for the SPECmark anyway, because its
>strengths lie in just a couple of instructions exploitable by some
>programs.  The figures from those benchmarks seriously skew the SPECmark.
>Remember, it's based on the geometric mean, which doesn't reflect
>performance well when the individual benchmark results are spread widely.

It's probably more accurate to say that the RS/6000, like all machines,
has strengths and weaknesses, and that some of the SPEC release 1
benchmarks hit some of those strengths particularly well.  Presumably IBM
designed the machine to be strong in application areas they thought
particularly important, so it's not too surprising that they did well on
some of the benchmarks.

I agree with you that a single geometric mean cannot characterize the
performance of such a machine, due to very large differences between
minimum and maximum performance, most obvious now for the RS/6000,
Stardent, and Intel 860, but you will see this effect for more machines
in the future.  As CPU's get faster by exploiting more fine-grained
parallelism in different ways, the differences increase, and the
"little" machines are becoming as difficult to classify as the
supercomputers have always been.  John Mashey's "Your Mileage May Vary"
paper is a very good treatment of the issue.

Difficulty interpreting SPECmarks for these machines does not mean that
either the machines or the SPEC benchmarks are flawed, just that you have
to look beyond a single number.  If you know that your workload is
adequately represented by 6 of the benchmarks, with a fixed amount of
work to do, you could use a weighted harmonic mean of those 6.  If you
have latent demand that will consume all available resources, then a
weighted geometric mean may be more appropriate.  But in no case will
speed on tomcatv get your C compilations done more quickly, nor will
slowness on the Lisp interpreter degrade your Spice simulations.
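
For concreteness, here is a minimal C sketch of the two weightings just
described, applied to the ten RS/6000-520 ratios from the Purdue report
earlier in the thread.  The weights are purely hypothetical placeholders;
they should come from your own workload mix and must sum to 1.

    #include <math.h>
    #include <stdio.h>

    #define N 10

    int main(void)
    {
        /* RS/6000-520 SPEC ratios from the Purdue report */
        double ratio[N]  = { 14.5, 16.3, 20.1, 21.2, 26.1,
                             15.6, 18.4, 17.2, 42.8, 56.4 };
        /* hypothetical workload weights, one per benchmark, summing to 1 */
        double weight[N] = { 0.30, 0.10, 0.20, 0.05, 0.05,
                             0.10, 0.05, 0.05, 0.05, 0.05 };
        double hinv = 0.0, logsum = 0.0;
        int i;

        for (i = 0; i < N; i++) {
            hinv   += weight[i] / ratio[i];      /* fixed amount of work   */
            logsum += weight[i] * log(ratio[i]); /* latent-demand case     */
        }
        printf("weighted harmonic mean:  %.1f\n", 1.0 / hinv);
        printf("weighted geometric mean: %.1f\n", exp(logsum));
        return 0;
    }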

I think there's a big need for some commercial benchmark companies and
trade journals to get into the business of helping ordinary users
interpret the benchmark results for their own situations.

---
Double Disclaimer: speaking for myself, not for Intergraph nor for SPEC
Walter Bays		Phone (415) 852-2384	FAX (415) 856-9224
EMAIL uunet!ingr.com!bays   or   uunet!{apple.com,pyramid.com}!garth!walter
USPS: Intergraph APD, 2400 Geng Road, Palo Alto, California 94303

keller@keller.austin.ibm.com (10/18/90)

John Laskowski is the IBM SPEC representative.  I'm one of the people
who ran the SPEC suite for the announced numbers.  Here's what we've
got to say.

Several independent organizations have been able to verify the
IBM RISC System/6000 results.  Kaivalya Dixit, from Sun Microsystems,
has verified the results and won the prize offered by the publication
that questioned the RS/6000 results.

Since the RISC System/6000 was announced AFTER the SPEC Release 1.0
benchmarks were shipped, IBM-specific makefiles for the RS/6000 were
not available on the SPEC tape (DEC, HP, and MIPS were all able to
include their vendor-specific makefiles on the original tape).
This made it difficult for anyone to reproduce the RS/6000
results.  IBM-specific makefiles will be available on
the next release level of the SPEC Suite.  This will make it much
easier for anyone to verify results.  In the meantime, a copy of
the IBM-specific makefiles can be obtained by request from me.

P.S. The "exotic" compiler option used is "-O" for "optimize."
--------------------------------------------------------------
Tom W. Keller        IBM Advanced Workstation Division
IBM - 2501           internet: ibmchs!keller@cs.utexas.edu
11400 Burnet Rd      uucp:   ...!cs.utexas.edu!ibmchs!keller
Austin, TX 78758     
--------------------------------------------------------------