CCHD@lure.latrobe.edu.au (Huw Davies - La Trobe University Computer Centre) (10/04/90)
I have just got a copy of the September 1990 SPECwatch and I am a bit
concerned about the following paragraph:

"There also seems to be a problem with replicating IBM's RS6000 SPECmark
results, and with achieving the expected levels of performance with other
code. It's known that IBM extensively modified the compilers used to
compile the benchmarks. If these "knobs and dials" turn out to be not
readily accessible to users of the production compilers shipped with the
systems, SPEC will be faced with its first serious cheating problem. The
usual prize (a trial subscription or four month extension of an existing
subscription) goes to the first person to provide independent RS/6000
SPECmark results using the compilers shipped with the products."

Would anyone from IBM (or anywhere else for that matter) like to comment?
Has anyone with an RS/6000 and the SPEC benchmark tape tried/succeeded in
running them? I have a 530 on site and expect two 540's this month. I'd be
happy to run the benchmarks on an idle machine but I don't have access to
the tape.

Huw Davies
Computing Services, La Trobe University, Melbourne, Australia
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (10/04/90)
> On 4 Oct 90 09:50:31 GMT, CCHD@lure.latrobe.edu.au (Huw Davies) said:
Huw> "There also seems to be a problem with replicating IBM's RS6000
Huw> SPECmark results, and with achieving the expected levels of
Huw> performance with other code. [....]
Huw> Would anyone from IBM (or anywhere else for that matter) like
Huw> to comment?
The LINPACK 100x100 results are known to be valid. In fact, I have
never gotten results quite as slow as those quoted in the
advertisements --- but then I have not been trying too hard either....
--
John D. McCalpin mccalpin@perelandra.cms.udel.edu
Assistant Professor mccalpin@vax1.udel.edu
College of Marine Studies, U. Del. J.MCCALPIN/OMNET
mash@mips.COM (John Mashey) (10/05/90)
In article <4734@lure.latrobe.edu.au> CCHD@lure.latrobe.edu.au (Huw Davies -
La Trobe University Computer Centre) writes:
>I have just got a copy of the September 1990 SPECwatch and I am
>a bit concerned about the following paragraph:
>
>"There also seems to be a problem with replicating IBM's RS6000
>SPECmark results, and with achieving the expected levels of
>performance with other code. [...] The usual
>prize (a trial subscription or four month extension of an
>existing subscription) for the first person to provide
>independent RS/6000 SPECmark results using the compilers shipped
>with the products."

It would be interesting to see other people's ability to replicate the
results, but let's be real careful before branding things lies. The intent
of SPEC is that users understand what they have, and what versions of
things are being used to get results, and in general, the existing forms of
disclosure do that fairly well.

Unfortunately, what they do NOT do, in the published form, is describe all
of the compiler options. (Sometimes they do, sometimes they don't, for lack
of space.) Now, when a SPECtape comes out, the makefiles are there for the
people who've reported results, but if someone reports results AFTER a
release tape, you can't find that out easily. So, it is quite possible
that:

a) Somebody at IBM ran these things, turned knobs and dials on the
compilers appropriately (and there are plenty of knobs and dials on most
compilers), and got these results. In fact, as I have high respect for the
folks at IBM doing the SPEC stuff, I personally believe that they got the
answers they say they got, although I have not personally run them.
b) In any such case, it is possible that:
   1) There are magic options that only the vendor knows about. This is
      considered a no-no.
   2) There are magic tools, that only the vendor has, for analyzing the
      programs to figure out the options that should be used. One would
      expect that a user who runs the result should get the same answers,
      even if the user has no obvious way to derive the right options.
   3) There are tools and explanations available to the user, which
      provide the same performance, with work, which would be expected to
      happen in the normal way that people would work.
   4) You just say -Ox, and if you need to do something else, the compiler
      tells you.

Now, in this hierarchy, the ideal is 4), 3) is OK for some people, 2) is
getting kind of marginal, and 1) is really a no-no, unless the magic
options just aren't released yet, but will be.

c) Of course, users must assess for themselves what to think when faced
with big performance differences between levels 2, 3, and 4.

d) SPEC is continually working to tighten this up, because the goal is
that a user can replicate results easily, and we're not quite there yet,
sometimes.

e) Certainly, a good calibration is to run the SPEC stuff the way you
normally would: start with the vanilla makefiles, and supply the options
you'd pick from a quick reading of the system's cc & f77 manual pages.

I often use an extensive computer==car analogy, whose performance
measurement part goes like this:

unreal: drag-strip, short distance in a straight line as fast as possible;
   don't care if the vehicle is useful on the road.
exaggerated (Dhrystone mips): on the road, but only downhill.
real exaggerated (peak mips & mflops, guaranteed not to exceed): drive it
   off a cliff, and measure as it falls....
   hard on the drivers, but that's the way it goes.
reality: up-hill, down-hill, around curves; Monte Carlo, etc., driven by
   real people.

Now, SPEC is a fairly good approximation to one slice of reality, with the
obvious niggle that the numbers reported by vendors are with real machines,
running real programs ... but usually with skilled racing drivers who can
extract the most performance from their machines. Sometimes an average
person can get the same performance, sometimes not.

But, anyway, let us be careful not to characterize something as 'lies,'
when there are perfectly legitimate reasons, well within the rules, that
might explain this. A far more interesting question to ask, in general, of
all of us is:
   What effort does it take to achieve given levels of performance?
   What's the difference between -O and -O5 -x -y -z5000 -q -k300....?
--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:  408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/05/90)
In article <41935@mips.mips.COM> mash@mips.COM (John Mashey) writes:
| But, anyway, let us be careful not to characterize something as 'lies,'
| when there are perfectly legitimate reasons, well within the rules,
| that might explain this. A far more interesting question to ask,
| in general, of all of us is:
|    What effort does it take to achieve given levels of performance?
|    What's the difference between -O and -O5 -x -y -z5000 -q -k300....?

This is a really good point, and at the risk of overkill I will remind
readers of some benchmark results which were labeled as lies, although they
turned out to be true.

One of the early Radio Shack 386 systems (the 3000, perhaps) was
benchmarked by RS and the results published. Later, many other people tried
to get those numbers and couldn't. Lots of accusations followed. It turns
out that at boot time the amount of memory is tested, and interleave turned
on if there was a multiple of 4MB in the machine (or 2MB; my memory is
dim). At any rate, when the system was run with less memory, say 1MB and
DOS, the results were a lot slower, since every memory access now had 1-2
wait states added. Not the fault of RS; they clearly stated the memory
size, people just didn't realize that it made a 30% (or so) improvement in
performance for many things.

As John says, let's be very careful not to confuse the results of carefully
tweaked compilations, links, and kernel configuration with something which
a user just can't do. Tweaking machines and options is an art, and if it
wasn't, a lot of us would be reading want ads instead of netnews.
--
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
VMS is a text-only adventure game. If you win you can use unix.
ddt@walt.cc.utexas.edu (David Taylor) (10/06/90)
It would be interesting to know WHICH benchmarks are hard to duplicate. My guess is that 70% of them were reproducable and the 2 or 3 easily vectorizable ones (tomcatv, dasa7, etc) were not. And which release are we talking about here? The RS/6000 is a poor candidate for the SPECmark anyway, because it's strengths lie in just a couple of instructions exploitable in some programs. The figures from those benchmarks seriously skew the SPECmark. Remember, it's based on the geometric mean which doesn't reflect performance well for poorly distributed benchmark performances. I think that if you wanted a realistic evaluation of the RS/6000, it would be safe to say that it will behave much as it did for the benchmarks like gcc, espresso, and spice2g6, and that for some vectorizable programs it will run several times faster IF the compiler can interpret them correctly. =-ddt->
abe@mace.cc.purdue.edu (Vic Abell) (10/08/90)
In article <4733@lure.latrobe.edu.au>, CCHD@lure.latrobe.edu.au (Huw Davies - La Trobe University Computer Centre) writes: > I have just got a copy of the September 1990 SPECwatch and I am > a bit concerned about the following paragraph: > > "There also seems to be a problem with replicating IBM's RS6000 > SPECmark results, I have been able to reproduce the SPEC ratio for the RS/6000-520 with only minor changes needed to enable gcc. (See the Notes/Summary of Changes section in the report for gcc details.) Here is a complete SPEC report for the RISC System/6000, model 520 POWERserver. Its SPEC ratio is 0.1 larger than what IBM reported to me. SPEC Throughput Method A# Results for Release 1.0 Benchmarks Results: SPEC IBM RS/6000 IBM Corporation Ref. 520 Benchmark Time Time SPEC IBM RISC System/6000 No. & Name (sec.) (sec.) Ratio POWERserver 520 001.gcc 1482 102 14.5* Hardware 008.espresso 2266 139 16.3 Model Number: POWERserver 520 013.spice2g6 23951 1189 20.1 CPU: IBM POWER 20MHz 015.doduc 1863 88 21.2 FPU: Integrated 020.nasa7 20093 771 26.1 Number of CPU's: 1 022.li 6206 398 15.6 Cache Size/CPU: 32 KB data, 8 KB ins. 023.eqntott 1101 60 18.4 Memory: 32 MB 030.matrix300 4525 263 17.2 Disk Subsystem: 2-857 MB SCSI 042.fpppp 3038 71 42.8 Network Interface: 1 Ethernet Controller 047.tomcatv 2649 47 56.4 Geometric Mean 3867.7 173.0 22.4* Software System O/S Type and Rev: AIX 3.1, GA Tuning Parameters: None in use Compiler Rev: XL Fortran 1.1 Background Load: Normal Unix daemons Other Software: XL C 1.1 System State: Multi-user, lightly File System Type: IBM Journaled loaded File System Firmware Level: N/A Tested in: October 1990 By: Victor A. Abell <abe@mace.cc.purdue.edu> Purdue University Computing Center Mathematical Sciences Building West Lafayette, IN 47907 (317) 494-1787 SPEC License # 310 Notes/Summary of Changes: # Method A: Homogeneous Load * Portability changes were required: gcc: o Modified alloca.o rule in Makefile to use XL C compiler. 
o Activated #include of <time.h> in cccp.c. o Disabled redefinition of ptrdiff_t and size_t in stddef.h. o Modified erroneous assignment of enumerated variables to bit fields in rtl.h, tree.h, and varasm.c. o Enabled vfork to fork redefinition in gcc.c. Copyright 1990 Purdue Research Foundation, West Lafayette, Indiana 47907. All rights reserved.
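[Editor's sketch: the arithmetic behind the report is simple to check. Each
SPEC ratio is the reference time divided by the measured time, and the
SPECmark is the geometric mean of the ten ratios. The times below are taken
directly from the report above.]

```python
from math import prod

# (benchmark, SPEC reference time, measured RS/6000-520 time), seconds,
# from Vic Abell's report above.
results = [
    ("001.gcc",        1482,   102),
    ("008.espresso",   2266,   139),
    ("013.spice2g6",  23951,  1189),
    ("015.doduc",      1863,    88),
    ("020.nasa7",     20093,   771),
    ("022.li",         6206,   398),
    ("023.eqntott",    1101,    60),
    ("030.matrix300",  4525,   263),
    ("042.fpppp",      3038,    71),
    ("047.tomcatv",    2649,    47),
]

# SPEC ratio = reference time / measured time, per benchmark.
ratios = [ref / t for _, ref, t in results]

# SPECmark = geometric mean of the per-benchmark ratios.
specmark = prod(ratios) ** (1.0 / len(ratios))
print(round(specmark, 1))  # 22.4, matching the reported geometric mean
```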
khb@chiba.Eng.Sun.COM (Keith Bierman - SPD Advanced Languages) (10/09/90)
In article <4734@lure.latrobe.edu.au> CCHD@lure.latrobe.edu.au (Huw Davies - La Trobe University Computer Centre) writes:
"There also seems to be a problem with replicating IBM's RS6000
SPECmark results, and with achieving the expected levels of
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
performance with other code.
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Others have already commented on the left hand side of the and at
length. While the same is true for the right hand side, I don't think
we can flog this pony enough.
Looking at just one number won't/can't/will never provide a reasonable
basis for performance forecasting, across differing
arch+implementation+applications
This does not imply any evil intent on the part of the IBM designers,
compiler implementors, etc.
--
----------------------------------------------------------------
Keith H. Bierman kbierman@Eng.Sun.COM | khb@chiba.Eng.Sun.COM
SMI 2550 Garcia 12-33 | (415 336 2648)
Mountain View, CA 94043
walter@vogons.UUCP (Walter Bays) (10/13/90)
In article <37935@ut-emx.uucp> ddt@walt.cc.utexas.edu (David Taylor)
writes:
>The RS/6000 is a poor candidate for the SPECmark anyway, because its
>strengths lie in just a couple of instructions exploitable in some
>programs. The figures from those benchmarks seriously skew the SPECmark.
>Remember, it's based on the geometric mean which doesn't reflect
>performance well for poorly distributed benchmark performances.

It's probably more accurate to say that the RS/6000, like all machines, has
strengths and weaknesses, and some of the SPEC release 1 benchmarks hit
some of the strengths particularly well. Presumably IBM designed the
machine to be strong in application areas they thought particularly
important, so it's not too surprising that they succeeded well for some of
the benchmarks.

I agree with you that a single geometric mean cannot characterize the
performance of such a machine, due to very large differences between
minimum and maximum performance, most obvious now for the RS/6000,
Stardent, and Intel 860, but you will see this effect for more machines in
the future. As CPU's get faster by exploiting more fine-grained parallelism
in different ways, the differences increase, and the "little" machines are
becoming as difficult to classify as the supercomputers have always been.
John Mashey's "Your Mileage May Vary" paper is a very good treatment of the
issue.

Difficulty interpreting SPECmarks for these machines does not mean the
machines or the SPEC benchmarks are flawed, just that you have to look
beyond a single number. If you know that your workload is adequately
represented by 6 of the benchmarks with a fixed amount of work to do, you
could use a weighted harmonic mean of those 6. If you have latent demand
that will consume all available resources, then a weighted geometric mean
may be more appropriate. But in no case will speed on tomcatv get your C
compilations done more quickly, nor slowness on the Lisp interpreter
degrade your Spice simulations.
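[Editor's sketch of the two averages Walter mentions. The ratios below are
six of the per-benchmark SPEC ratios from Vic Abell's report earlier in the
thread; the equal weights are an invented example of a user's workload mix,
not anything SPEC prescribes.]

```python
def weighted_harmonic_mean(ratios, weights):
    """Appropriate when each benchmark represents a fixed amount of
    work: total work divided by total (weighted) time."""
    return sum(weights) / sum(w / r for r, w in zip(ratios, weights))

def weighted_geometric_mean(ratios, weights):
    """Appropriate when demand scales to consume available speed."""
    total = sum(weights)
    product = 1.0
    for r, w in zip(ratios, weights):
        product *= r ** (w / total)
    return product

# Six per-benchmark SPEC ratios from the RS/6000-520 report: gcc,
# espresso, spice2g6, li, eqntott, matrix300.
ratios = [14.5, 16.3, 20.1, 15.6, 18.4, 17.2]
weights = [1.0] * 6  # invented equal weighting

print(round(weighted_harmonic_mean(ratios, weights), 2))   # ~16.82
print(round(weighted_geometric_mean(ratios, weights), 2))  # ~16.92
```

With ratios this tightly clustered the two means nearly agree; it is the
wide spreads (14.5 to 56.4 over the full ten) where the choice of mean, and
of weights, really matters.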
I think there's a big need for some commercial benchmark companies and
trade journals to get into the business of helping ordinary users interpret
the benchmark results for their own situations.
---
Double Disclaimer: speaking for myself, not for Intergraph nor for SPEC
Walter Bays    Phone (415) 852-2384    FAX (415) 856-9224
EMAIL uunet!ingr.com!bays or uunet!{apple.com,pyramid.com}!garth!walter
USPS: Intergraph APD, 2400 Geng Road, Palo Alto, California 94303
keller@keller.austin.ibm.com (10/18/90)
John Laskowski is the IBM SPEC representative. I'm one of the people who
ran the SPEC suite for the announced numbers. Here's what we've got to say.

Several independent organizations have been able to verify the IBM RISC
System/6000 results. Kaivalya Dixit, from Sun Microsystems, has verified
the results and won the prize offered by the publication that questioned
the RS/6000 results.

Since the RISC System/6000 was announced AFTER the SPEC Release 1.0
benchmarks were shipped, IBM-specific makefiles for the RS/6000 were not
available on the SPEC tape (DEC, HP, and MIPS were all able to include
their vendor-specific makefiles in the original tape). This made it
difficult for anyone to reproduce the RS/6000 results. IBM-specific
makefiles will be available on the next release level of the SPEC Suite,
which will make it much easier for anyone to verify results. In the
meantime, a copy of the IBM-specific makefiles can be obtained by request
from me.

P.S. The "exotic" compiler option used is "-O" for "optimize."
--------------------------------------------------------------
Tom W. Keller                          IBM Advanced Workstation Division
internet: ibmchs!keller@cs.utexas.edu  IBM - 2501
uucp: ...!cs.utexas.edu!ibmchs!keller  11400 Burnet Rd
                                       Austin, TX 78758
--------------------------------------------------------------