david@torsqnt.UUCP (David Haynes) (09/14/90)
I have heard a rumour that the benchmark results that IBM posted for
their RS6000 system were the results of hand-coded, hand-optimized
assembler coding rather than the result of compiling C or FORTRAN
code.  Can anyone confirm or deny this?

-david-
-- 
David Haynes    Sequent Computer Systems (Canada) Ltd.
uunet!utai!torsqnt!david -or- david@torsqnt.UUCP
I cannot tell a lie, because I do not know the truth.
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (09/14/90)
>>>>> On 13 Sep 90 21:18:01 GMT, david@torsqnt.UUCP (David Haynes) said:
David> I have heard a rumour that the benchmark results that IBM posted for
David> their RS6000 system were the results of hand-coded, hand-optimized
David> assembler coding rather than the result of compiling C or FORTRAN
David> code. Can anyone confirm or deny this?
I can verify that the results of the LINPACK 100x100 test case are
quite valid. I have not re-run the 100x100 case specifically, but I
have run numerous cases with larger matrices, and have never gotten a
speed as slow as the IBM's rating! With one LINPACK-ish code (a
block-mode dense linear algebra solver) I get over 13 MFLOPS for
64-bit arithmetic on a model 320! With the plain vanilla LINPACK
source, I get about 8.4 MFLOPS for the 1000x1000 LINPACK test case.
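[For readers checking such figures themselves: a LINPACK MFLOPS rating is just the standard operation count for solving an n x n dense system divided by the measured time. A minimal sketch in C; the timing value in the comment is hypothetical, chosen only to show the arithmetic:]

```c
#include <stdio.h>

/* Standard LINPACK operation count for factoring and solving an
 * n x n dense system: 2n^3/3 + 2n^2 floating-point operations. */
double linpack_mflops(int n, double seconds)
{
    double nn = (double)n;
    double ops = 2.0 * nn * nn * nn / 3.0 + 2.0 * nn * nn;
    return ops / seconds / 1.0e6;
}

/* A 100x100 solve timed at a (hypothetical) 0.092 seconds would
 * rate about 7.5 MFLOPS: linpack_mflops(100, 0.092) */
```

[The 1000x1000 case quoted above uses the same formula with n = 1000.]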
The MIPS rating of the RS6000's is of course entirely bogus, as it is
based on the Dhrystone 1.0 (or 1.1?) benchmark. This benchmark is
very easy to optimize for, both in hardware and software, and in my
experience has essentially no correlation with real performance on
non-floating-point work. For example, both the Motorola 88000 and IBM
RS6000 score substantially higher than the MIPS R-3000 on the
Dhrystone 1.1 benchmark, but the three machines have very similar
scores on the 4 integer SPEC benchmarks. (I do not consider
differences of +/- 25% or less as significant).
On the SPEC benchmarks, the IBM RS6000 model 320 has a floating-point
mean performance of about 24 times a VAX 11/780, and an integer
performance of about 14 times the VAX 11/780. This level of
floating-point performance is incredible, while the integer
performance is not significantly different from what is available in
machines like the DECstation 3100 and DECstation 5000 or in the Sun
SPARCstations.
--
John D. McCalpin mccalpin@perelandra.cms.udel.edu
Assistant Professor mccalpin@vax1.udel.edu
College of Marine Studies, U. Del. J.MCCALPIN/OMNET
abe@mace.cc.purdue.edu (Vic Abell) (09/14/90)
In article <1233@torsqnt.UUCP> david@torsqnt.UUCP (David Haynes) writes:
>I have heard a rumour that the benchmark results that IBM posted for
>their RS6000 system were the results of hand-coded, hand-optimized
>assembler coding rather than the result of compiling C or FORTRAN
>code. Can anyone confirm or deny this?

Before we purchased a RISC System/6000 model 520, I ran my own versions
of the Linpack and Dhrystone tests.  I used straight C and Fortran code
and the ``-O'' compiler option.  My results confirm the figures
published by IBM.  If anything, the IBM benchmark results are
conservatively stated.

		IBM's rating	My results
	MFLOPS	     7.4	   7.55  [1]
	MIPS	    27.5	  29.4   [2]

[1] My rating comes from the average of ten runs of the unmodified
    Linpack program, matrix order 100, leading array dimension 201.
    There was no difference between single and double precision results.

[2] MIPS are derived by dividing the rating produced by version 1.1 of
    the Dhrystone test by 1,757.  I used the average of ten runs of the
    test, 51,651.

What is the source of your rumor, Sequent Marketing?  :-)
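[The conversion in note [2] above is straightforward arithmetic: 1,757 Dhrystones/second is the score of the nominally 1-MIPS VAX 11/780 on this benchmark, so dividing by it yields "VAX MIPS". A minimal sketch:]

```c
/* "VAX MIPS" from a Dhrystone score: the VAX 11/780, taken as a
 * 1-MIPS machine, runs 1,757 Dhrystones/second on Dhrystone 1.1. */
double dhrystone_mips(double dhrystones_per_sec)
{
    return dhrystones_per_sec / 1757.0;
}

/* dhrystone_mips(51651.0) gives about 29.4, matching the table above. */
```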
shawn@jdyx.UUCP (Shawn Hayes) (09/14/90)
In the testing that I did with Dhrystone 2.0 I got results that were very close to the results that IBM published. Unless the 2.1 Dhrystone benchmark is significantly different from the 2.0 version I think IBM's published MIPS values are correct. (At least they are as correct as anyone else's. :)
madd@world.std.com (jim frost) (09/15/90)
david@torsqnt.UUCP (David Haynes) writes:
>I have heard a rumour that the benchmark results that IBM posted for
>their RS6000 system were the results of hand-coded, hand-optimized
>assembler coding rather than the result of compiling C or FORTRAN
>code. Can anyone confirm or deny this?

This wouldn't be strange practice amongst vendors, but I've done some
of my own benchmarks with the RIOS and find that it does just under
three times the raw performance of a Sparcstation 1 in the Dhrystone
test.

An IO benchmark I played with showed almost exactly 1MB/sec write
throughput and 8MB/sec read throughput, which is what I would call
`pretty fast'.  The test was designed to negate the effects of
in-memory disk caching (i.e. it used a large read/write space).  File
creation times weren't particularly fast, though.

Interestingly, the machine doesn't `feel' as fast as it tests -- the
one I have here (RS/6000 model 520 w/ 32MB RAM and all the graphics
hardware I'll ever need) feels more sluggish than a 12MB Sparcstation
running the same kinds of utilities (except things like `grep', which
go quite fast).  I hear that performance becomes much better once you
get more than about 40MB RAM in the thing, so I wonder if there's not
a VM problem.  I won't know until we stuff more memory in it.

While I expect they tuned code to get stellar performance, the numbers
aren't way out of line.  I expect that OS tuning will reduce the
sluggishness I notice -- things don't seem to be very well tuned right
now.

Happy hacking,

jim frost
saber software
jimf@saber.com
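[A write-throughput test along the lines described can be sketched as below. The path and sizes are placeholders; for the result to say anything about the disk rather than the buffer cache, total_bytes must be much larger than RAM, and a real run would use a finer-grained wall clock than time().]

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Sequential-write throughput in MB/sec; returns 0.0 if the run was
 * too quick to time, -1.0 on error.  To defeat the in-memory buffer
 * cache, total_bytes must greatly exceed the machine's RAM -- small
 * smoke-test sizes say nothing about the disk.  (time() has only
 * 1-second resolution; a real run would use a finer wall clock.) */
double write_mb_per_sec(const char *path, size_t block, size_t total_bytes)
{
    char *buf = malloc(block);
    FILE *fp = fopen(path, "wb");
    if (buf == NULL || fp == NULL) {
        free(buf);
        if (fp) fclose(fp);
        return -1.0;
    }
    memset(buf, 'x', block);

    time_t t0 = time(NULL);
    size_t written = 0;
    while (written < total_bytes) {
        if (fwrite(buf, 1, block, fp) != block) break;
        written += block;
    }
    fclose(fp);                      /* flushes the stdio buffers */
    time_t t1 = time(NULL);

    free(buf);
    if (written < total_bytes) return -1.0;
    return t1 > t0 ? (double)written / (double)(t1 - t0) / 1.0e6 : 0.0;
}
```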
mherman@alias.UUCP (Michael Herman) (09/15/90)
I think the confusion probably comes from the fact that the author and
keeper of the Linpack benchmark actually keeps two sets of numbers:
(1) the results from running the benchmark on a virgin copy of the
benchmark source, and (2) the results from hand-optimizing a copy of
the virgin source.

I don't have the name of the person, but the whole thing was covered
in a recent issue of Supercomputing magazine.  It included a couple of
pages of results for single- and multi-processor machines.

I can also attest that the RS/6000 benchmark numbers are true.
wje@siia.mv.com (Bill Ezell) (09/17/90)
In <1233@torsqnt.UUCP> david@torsqnt.UUCP (David Haynes) writes:
>I have heard a rumour that the benchmark results that IBM posted for
>their RS6000 system were the results of hand-coded, hand-optimized...

This would seem unlikely to me.  According to IBM (a perhaps suspect
source), their C compiler 'consistently produces code better than
hand-coded assembler'.  This isn't too surprising when you consider
that the compiler is tailored to generate instructions that take
advantage of the pipelining inherent in the processor, a tedious
process at best when done by hand.

We've had our RS6000 for about 6 months now, and our benchmarks, which
are not always compute-bound, show the machine to be MUCH faster than
anything we've tested from DEC, or anyone else for that matter.

A fascinating look into the hardware and software technology of the
RS6000 can be found in the January issue of IBM's equivalent of the
Bell System Technical Journal.  It is truly an amazing system.

It pains me to praise IBM so highly, since I (and everyone else here)
have always been anti-IBM, sometimes rabidly so.  However, when
something this good comes along, even IBM deserves credit.
-- 
Bill Ezell
Software Innovations, Inc.
wje@siia.mv.com (603) 883-9300
pbickers@groucho (09/18/90)
In article <1990Sep14.215517.28056@world.std.com> madd@world.std.com (jim frost) writes:
>david@torsqnt.UUCP (David Haynes) writes:
>>I have heard a rumour that the benchmark results that IBM posted for
>>their RS6000 system were the results of hand-coded, hand-optimized
>>assembler coding rather than the result of compiling C or FORTRAN
>>code. Can anyone confirm or deny this?
>
I believe the IBM results, but as everybody knows (or ought to know)
numbers like MIPS and MFLOPS are well-nigh useless for evaluating a
machine.  These rumours may originate in the very patchy performance
of the RISC6000.  (See below.)

>Interestingly the machine doesn't `feel' as fast as it tests -- the
>one I have here (RS/6000 model 520 w/ 32Mb RAM and all the graphics
>hardware I'll ever need) feels more sluggish than a 12Mb sparcstation

This is interesting!

>running the same kinds of utilities (except things like `grep' which
>go quite fast). I hear that performance becomes much better once you

I'm surprised by this remark.  When we evaluated this machine, one of
the things that struck us was that the Unix was slow.  We found (on a
32 MB configuration) that grep and diff were *slower* than on a 12.5
MIPS, 0.5 MFLOP HP9000/375.

>get more than about 40Mb RAM in the thing so I wonder if there's not a
>VM problem. I won't know until we stuff more memory in it.
>
>sluggishness I notice -- things don't seem to be very well tuned right
>now.
>
One thing to note about this machine is that it gets its 7.4 MFLOP
performance from being *occasionally* able to execute more than one
instruction at once.  Thus it seems able to do very well on Fortran
code that contains some vectorizability.  It is not lightning fast on
Fortran code per se -- though in general it is still quick.

We run a lot of Fortran and tested our codes on it.  Some did show the
7.4 MFLOP performance relative to rivals, but for most it was within
10% of machines one might have expected it to thrash, e.g. MIPS R3000
and 33 MHz SPARC cpus.
On some Fortran codes the other machines were even faster.

Another thing to note about its floating-point performance is that it
depends a lot on the type of operation being performed (true of any
machine, I guess).  (It does come with a quite good Fortran compiler.
It handles VMS source quite well, and the optimizer does optimize.)

I suppose what I'm trying to say is that the variation in performance
for the RISC6000 is considerably greater than for other workstations
from SUN, DEC, HP etc. -- i.e. its rivals in the marketplace.  Thus
while it may be super fast at some things, it is awfully slow at
others, much more so than one would expect.  (There is an article in
Unix Today, May 14 1990, which bears further testimony to this.)

Thus anybody buying this machine solely on account of its 27 Dhrystone
MIPS and 7.4 MFLOP performance runs a high risk of being very
disappointed.  Consideration of other benchmarks such as Whetstone or
the Livermore Loops will give a better overall, but still incomplete,
picture.

Moral: if you're considering this machine, test *your* applications on
it (and remember that you'll be doing editing etc. as well as
executing Fortran).  You may love it, you may hate it.  Good Luck!
-- 
Paul Bickerstaff                Internet: pbickers@neon.chem.uidaho.edu
Physics Dept., Univ. of Idaho   Phone:    (208) 885 6809
Moscow ID 83843, USA            FAX:      (208) 885 6173
madd@world.std.com (jim frost) (09/18/90)
wje@siia.mv.com (Bill Ezell) writes:
>In <1233@torsqnt.UUCP> david@torsqnt.UUCP (David Haynes) writes:
>>I have heard a rumour that the benchmark results that IBM posted for
>>their RS6000 system were the results of hand-coded, hand-optimized...

>This would seem unlikely to me. According to IBM (a perhaps suspect
>source) their C compiler 'consistently produces code better than
>hand-coded assembler'. This isn't too surprising when you consider
>that the compiler is tailored to generate instructions to take
>advantage of the pipelining inherent in the processor, a tedious
>process at best when done by hand.

Actually the compiler doesn't do as good a job as you might think,
although there's no doubt in my mind that it does a better job on a
large project than you could do when hand-coding.  One thing it
doesn't seem to try very hard to do is instruction scheduling --
moving cmp instructions away from branch instructions so that the
branches execute in zero cycles, for instance.  This is surprising
since zero-cycle branches are a nice feature of the system.  I haven't
gone out of my way to see just which optimizations it does well or
poorly at, but it does seem like the compiler could do a bit better.

Happy hacking,

jim frost
saber software
jimf@saber.com
mherman@alias.UUCP (Michael Herman) (09/21/90)
Two of the major reasons why floating-point performance is very strong
on the RS/6000 are (1) a floating-point multiply-add instruction that
executes in one instruction cycle (potentially in "parallel" with an
integer subscripting-calculation instruction and a test-and-branch
instruction), and (2) a very powerful optimizer that was co-developed
with the processor hardware.

To many people, a "RISC instruction" was synonymous with a "simple
instruction" that executes in one cycle (or less).  IBM took the tack
that a RISC instruction could be as complex as possible so long as it
executed in one instruction cycle or less -- that gave them a lot of
latitude in the instructions they implemented in silicon.

Although I am less familiar with them, there are supposedly a number
of C library (i.e. zero-terminated) string functions that have also
been implemented directly in the silicon.
madd@world.std.com (jim frost) (09/22/90)
mherman@alias.UUCP (Michael Herman) writes:
>Although I am less familiar with them, there are supposedly a number of
>C library (i.e. zero-terminated) string functions that have also been
>implemented directly in the silicon.

Sort of.  There are a number of operations which act on multiple
sequential registers which presumably contain a string.  It's fast and
easy to load a bunch of registers with a bunch of sequential bytes,
act on them in the registers (often with instructions that work across
sequential registers), then write back the bytes.  Not exactly C
string handling in silicon, but it's not a lot of work to implement
the C functions, and they'll have good performance.

XLC generates some interesting code when dealing with structures that
uses the string load and store instructions.  Even more interesting is
that it often doesn't use the string instructions when it could (e.g.
when saving the argument registers to the stack in a varargs
function).

Happy hacking,

jim frost
saber software
jimf@saber.com
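[Word-at-a-time string handling of this flavour needs no special silicon at all; the classic portable trick below checks four bytes per load for the C terminating NUL. This is a generic sketch of the technique, not a description of the RS/6000 load/store-multiple instructions.]

```c
#include <stdint.h>

/* True if any byte of the 32-bit word w is zero.  Subtracting
 * 0x01010101 wraps a zero byte to 0xFF, setting its high bit; the
 * ~w term keeps bytes that were already >= 0x80 from producing a
 * false positive, and 0x80808080 isolates the flag bits. */
int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}
```

[A word-at-a-time strlen or strcpy loops over aligned words until this test fires, then finishes the last word byte by byte.]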