david@torsqnt.UUCP (David Haynes) (09/14/90)
I have heard a rumour that the benchmark results that IBM posted for
their RS6000 system were the results of hand-coded, hand-optimized
assembler coding rather than the result of compiling C or FORTRAN
code.  Can anyone confirm or deny this?

-david-
-- 
David Haynes    Sequent Computer Systems (Canada) Ltd.
uunet!utai!torsqnt!david -or- david@torsqnt.UUCP
I cannot tell a lie, because I do not know the truth.
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (09/14/90)
>>>>> On 13 Sep 90 21:18:01 GMT, david@torsqnt.UUCP (David Haynes) said:
David> I have heard a rumour that the benchmark results that IBM posted for
David> their RS6000 system were the results of hand-coded, hand-optimized
David> assembler coding rather than the result of compiling C or FORTRAN
David> code. Can anyone confirm or deny this?
I can verify that the results of the LINPACK 100x100 test case are
quite valid. I have not re-run the 100x100 case specifically, but I
have run numerous cases with larger matrices, and have never gotten a
speed as slow as the IBM's rating! With one LINPACK-ish code (a
block-mode dense linear algebra solver) I get over 13 MFLOPS for
64-bit arithmetic on a model 320! With the plain vanilla LINPACK
source, I get about 8.4 MFLOPS for the 1000x1000 LINPACK test case.
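[For readers checking such figures themselves: a LINPACK MFLOPS rating is just the standard operation count for solving an n x n dense system divided by the measured time. A minimal sketch in C; the timing value in the comment is hypothetical, chosen only to show the arithmetic:]

```c
#include <stdio.h>

/* Standard LINPACK operation count for factoring and solving an
 * n x n dense system: 2n^3/3 + 2n^2 floating-point operations. */
double linpack_mflops(int n, double seconds)
{
    double nn = (double)n;
    double ops = 2.0 * nn * nn * nn / 3.0 + 2.0 * nn * nn;
    return ops / seconds / 1.0e6;
}

/* A 100x100 solve timed at a (hypothetical) 0.092 seconds would
 * rate about 7.5 MFLOPS: linpack_mflops(100, 0.092) */
```

[The 1000x1000 case quoted above uses the same formula with n = 1000.]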
The MIPS rating of the RS6000's is of course entirely bogus, as it is
based on the Dhrystone 1.0 (or 1.1?) benchmark. This benchmark is
very easy to optimize for, both in hardware and software, and in my
experience has essentially no correlation with real performance on
non-floating-point work. For example, both the Motorola 88000 and IBM
RS6000 score substantially higher than the MIPS R-3000 on the
Dhrystone 1.1 benchmark, but the three machines have very similar
scores on the 4 integer SPEC benchmarks. (I do not consider
differences of +/- 25% or less as significant).
On the SPEC benchmarks, the IBM RS6000 model 320 has a floating-point
mean performance of about 24 times a VAX 11/780, and an integer
performance of about 14 times the VAX 11/780. This level of
floating-point performance is incredible, while the integer
performance is not significantly different from what is available in
machines like the DECstation 3100 and DECstation 5000 or in the Sun
SPARCstations.
--
John D. McCalpin mccalpin@perelandra.cms.udel.edu
Assistant Professor mccalpin@vax1.udel.edu
College of Marine Studies, U. Del. J.MCCALPIN/OMNET
abe@mace.cc.purdue.edu (Vic Abell) (09/14/90)
In article <1233@torsqnt.UUCP> david@torsqnt.UUCP (David Haynes) writes:
>I have heard a rumour that the benchmark results that IBM posted for
>their RS6000 system were the results of hand-coded, hand-optimized
>assembler coding rather than the result of compiling C or FORTRAN
>code. Can anyone confirm or deny this?

Before we purchased a RISC System/6000 model 520, I ran my own versions
of the Linpack and Dhrystone tests.  I used straight C and Fortran code
and the ``-O'' compiler option.  My results confirm the figures
published by IBM.  If anything, the IBM benchmark results are
conservatively stated.

		IBM's rating	My results
	MFLOPS	     7.4	   7.55  [1]
	MIPS	    27.5	  29.4   [2]

[1] My rating comes from the average of ten runs of the unmodified
    Linpack program, matrix order 100, leading array dimension 201.
    There was no difference between single and double precision results.

[2] MIPS are derived by dividing the rating produced by version 1.1 of
    the Dhrystone test by 1,757.  I used the average of ten runs of the
    test, 51,651.

What is the source of your rumor, Sequent Marketing?  :-)
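[The conversion in note [2] above is straightforward arithmetic: 1,757 Dhrystones/second is the score of the nominally 1-MIPS VAX 11/780 on this benchmark, so dividing by it yields "VAX MIPS". A minimal sketch:]

```c
/* "VAX MIPS" from a Dhrystone score: the VAX 11/780, taken as a
 * 1-MIPS machine, runs 1,757 Dhrystones/second on Dhrystone 1.1. */
double dhrystone_mips(double dhrystones_per_sec)
{
    return dhrystones_per_sec / 1757.0;
}

/* dhrystone_mips(51651.0) gives about 29.4, matching the table above. */
```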
shawn@jdyx.UUCP (Shawn Hayes) (09/14/90)
In the testing that I did with Dhrystone 2.0 I got results that were very close to the results that IBM published. Unless the 2.1 Dhrystone benchmark is significantly different from the 2.0 version I think IBM's published MIPS values are correct. (At least they are as correct as anyone else's. :)
madd@world.std.com (jim frost) (09/15/90)
david@torsqnt.UUCP (David Haynes) writes:
>I have heard a rumour that the benchmark results that IBM posted for
>their RS6000 system were the results of hand-coded, hand-optimized
>assembler coding rather than the result of compiling C or FORTRAN
>code. Can anyone confirm or deny this?

This wouldn't be strange practice amongst vendors, but I've done some
of my own benchmarks with the RIOS and find that it does just under
three times the raw performance of a Sparcstation 1 in the Dhrystone
test.

An IO benchmark I played with showed almost exactly 1MB/sec write
throughput and 8MB/sec read throughput, which is what I would call
`pretty fast'.  The test was designed to negate the effects of
in-memory disk caching (i.e. it used a large read/write space).  File
creation times weren't particularly fast, though.

Interestingly, the machine doesn't `feel' as fast as it tests -- the
one I have here (RS/6000 model 520 w/ 32MB RAM and all the graphics
hardware I'll ever need) feels more sluggish than a 12MB Sparcstation
running the same kinds of utilities (except things like `grep', which
go quite fast).  I hear that performance becomes much better once you
get more than about 40MB RAM in the thing, so I wonder if there's not
a VM problem.  I won't know until we stuff more memory in it.

While I expect they tuned code to get stellar performance, the numbers
aren't way out of line.  I expect that OS tuning will reduce the
sluggishness I notice -- things don't seem to be very well tuned right
now.

Happy hacking,

jim frost
saber software
jimf@saber.com
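[A write-throughput test along the lines described can be sketched as below. The path and sizes are placeholders; for the result to say anything about the disk rather than the buffer cache, total_bytes must be much larger than RAM, and a real run would use a finer-grained wall clock than time().]

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Sequential-write throughput in MB/sec; returns 0.0 if the run was
 * too quick to time, -1.0 on error.  To defeat the in-memory buffer
 * cache, total_bytes must greatly exceed the machine's RAM -- small
 * smoke-test sizes say nothing about the disk.  (time() has only
 * 1-second resolution; a real run would use a finer wall clock.) */
double write_mb_per_sec(const char *path, size_t block, size_t total_bytes)
{
    char *buf = malloc(block);
    FILE *fp = fopen(path, "wb");
    if (buf == NULL || fp == NULL) {
        free(buf);
        if (fp) fclose(fp);
        return -1.0;
    }
    memset(buf, 'x', block);

    time_t t0 = time(NULL);
    size_t written = 0;
    while (written < total_bytes) {
        if (fwrite(buf, 1, block, fp) != block) break;
        written += block;
    }
    fclose(fp);                      /* flushes the stdio buffers */
    time_t t1 = time(NULL);

    free(buf);
    if (written < total_bytes) return -1.0;
    return t1 > t0 ? (double)written / (double)(t1 - t0) / 1.0e6 : 0.0;
}
```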
mherman@alias.UUCP (Michael Herman) (09/15/90)
I think the confusion probably comes from the fact that the author and
keeper of the Linpack benchmark actually keeps two sets of numbers:
(1) the results from running the benchmark on a virgin copy of the
benchmark source, and (2) the results from hand-optimizing a copy of
the virgin source.

I don't have the name of the person, but the whole thing was covered
in a recent issue of Supercomputing magazine.  It included a couple of
pages of results for single- and multi-processor machines.

I can also attest that the RS/6000 benchmark numbers are true.
wje@siia.mv.com (Bill Ezell) (09/17/90)
In <1233@torsqnt.UUCP> david@torsqnt.UUCP (David Haynes) writes:
>I have heard a rumour that the benchmark results that IBM posted for
>their RS6000 system were the results of hand-coded, hand-optimized...

This would seem unlikely to me.  According to IBM (a perhaps suspect
source), their C compiler 'consistently produces code better than
hand-coded assembler'.  This isn't too surprising when you consider
that the compiler is tailored to generate instructions that take
advantage of the pipelining inherent in the processor, a tedious
process at best when done by hand.

We've had our RS6000 for about 6 months now, and our benchmarks, which
are not always compute-bound, show the machine to be MUCH faster than
anything we've tested from DEC, or anyone else for that matter.

A fascinating look into the hardware and software technology of the
RS6000 can be found in the January issue of IBM's equivalent of the
Bell System Technical Journal.  It is truly an amazing system.

It pains me to praise IBM so highly, since I (and everyone else here)
have always been anti-IBM, sometimes rabidly so.  However, when
something this good comes along, even IBM deserves credit.
-- 
Bill Ezell
Software Innovations, Inc.
wje@siia.mv.com (603) 883-9300
pbickers@groucho (09/18/90)
In article <1990Sep14.215517.28056@world.std.com> madd@world.std.com (jim frost) writes:
>david@torsqnt.UUCP (David Haynes) writes:
>>I have heard a rumour that the benchmark results that IBM posted for
>>their RS6000 system were the results of hand-coded, hand-optimized
>>assembler coding rather than the result of compiling C or FORTRAN
>>code. Can anyone confirm or deny this?
>
I believe the IBM results, but as everybody knows (or ought to know)
numbers like MIPS and MFLOPS are well-nigh useless for evaluating a
machine.  These rumours may originate in the very patchy performance
of the RISC6000.  (See below.)

>Interestingly the machine doesn't `feel' as fast as it tests -- the
>one I have here (RS/6000 model 520 w/ 32Mb RAM and all the graphics
>hardware I'll ever need) feels more sluggish than a 12Mb sparcstation

This is interesting!

>running the same kinds of utilities (except things like `grep' which
>go quite fast). I hear that performance becomes much better once you

I'm surprised by this remark.  When we evaluated this machine, one of
the things that struck us was that the Unix was slow.  We found (on a
32 MB configuration) that grep and diff were *slower* than on a 12.5
MIPS, 0.5 MFLOP HP9000/375.

>get more than about 40Mb RAM in the thing so I wonder if there's not a
>VM problem. I won't know until we stuff more memory in it.
>
>sluggishness I notice -- things don't seem to be very well tuned right
>now.
>
One thing to note about this machine is that it gets its 7.4 MFLOP
performance from being *occasionally* able to execute more than one
instruction at once.  Thus it seems able to do very well on Fortran
code that contains some vectorizability.  It is not lightning fast on
Fortran code per se -- though in general it is still quick.

We run a lot of Fortran and tested our codes on it.  Some did show the
7.4 MFLOP performance relative to rivals, but for most it was within
10% of machines one might have expected it to thrash, e.g. MIPS R3000
and 33 MHz SPARC cpus.
On some Fortran codes the other machines were even faster.

Another thing to note about its floating-point performance is that it
depends a lot on the type of operation being performed (true of any
machine, I guess).  (It does come with a quite good Fortran compiler.
It handles VMS source quite well, and the optimizer does optimize.)

I suppose what I'm trying to say is that the variation in performance
for the RISC6000 is considerably greater than for other workstations
from SUN, DEC, HP etc. -- i.e. its rivals in the marketplace.  Thus
while it may be super fast at some things, it is awfully slow at
others, much more so than one would expect.  (There is an article in
Unix Today, May 14 1990, which bears further testimony to this.)

Thus anybody buying this machine solely on account of its 27 Dhrystone
MIPS and 7.4 MFLOP performance runs a high risk of being very
disappointed.  Consideration of other benchmarks such as Whetstone or
the Livermore Loops will give a better overall, but still incomplete,
picture.

Moral: if you're considering this machine, test *your* applications on
it (and remember that you'll be doing editing etc. as well as
executing Fortran).  You may love it, you may hate it.  Good Luck!
-- 
Paul Bickerstaff                Internet: pbickers@neon.chem.uidaho.edu
Physics Dept., Univ. of Idaho   Phone:    (208) 885 6809
Moscow ID 83843, USA            FAX:      (208) 885 6173
madd@world.std.com (jim frost) (09/18/90)
wje@siia.mv.com (Bill Ezell) writes:
>In <1233@torsqnt.UUCP> david@torsqnt.UUCP (David Haynes) writes:
>>I have heard a rumour that the benchmark results that IBM posted for
>>their RS6000 system were the results of hand-coded, hand-optimized...

>This would seem unlikely to me. According to IBM (a perhaps suspect
>source) their C compiler 'consistently produces code better than
>hand-coded assembler'. This isn't too surprising when you consider
>that the compiler is tailored to generate instructions to take
>advantage of the pipelining inherent in the processor, a tedious
>process at best when done by hand.

Actually the compiler doesn't do as good a job as you might think,
although there's no doubt in my mind that it does a better job on a
large project than you could do when hand-coding.  One thing it
doesn't seem to try very hard to do is instruction scheduling --
moving cmp instructions away from branch instructions so that the
branches execute in zero cycles, for instance.  This is surprising
since zero-cycle branches are a nice feature of the system.  I haven't
gone out of my way to see just which optimizations it does well or
poorly at, but it does seem like the compiler could do a bit better.

Happy hacking,

jim frost
saber software
jimf@saber.com
mherman@alias.UUCP (Michael Herman) (09/21/90)
Two of the major reasons why floating-point performance is very strong
on the RS/6000 are (1) a floating-point multiply-add instruction that
executes in one instruction cycle (potentially in "parallel" with an
integer subscripting-calculation instruction and a test-and-branch
instruction), and (2) a very powerful optimizer that was co-developed
with the processor hardware.

To many people, a "RISC instruction" was synonymous with a "simple
instruction" that executes in one cycle (or less).  IBM took the tack
that a RISC instruction could be as complex as possible so long as it
executed in one instruction cycle or less -- that gave them a lot of
latitude in the instructions they implemented in silicon.

Although I am less familiar with them, there are supposedly a number
of C library (i.e. zero-terminated) string functions that have also
been implemented directly in the silicon.
madd@world.std.com (jim frost) (09/22/90)
mherman@alias.UUCP (Michael Herman) writes:
>Although I am less familiar with them, there are supposedly a number of
>C library (i.e. zero-terminated) string functions that have also been
>implemented directly in the silicon.

Sort of.  There are a number of operations which act on multiple
sequential registers which presumably contain a string.  It's fast and
easy to load a bunch of registers with a bunch of sequential bytes,
act on them in the registers (often with instructions that work across
sequential registers), then write back the bytes.  Not exactly C
string handling in silicon, but it's not a lot of work to implement
the C functions, and they'll have good performance.

XLC generates some interesting code when dealing with structures that
uses the string load and store instructions.  Even more interesting is
that it often doesn't use the string instructions when it could (e.g.
when saving the argument registers to the stack in a varargs
function).

Happy hacking,

jim frost
saber software
jimf@saber.com
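[Word-at-a-time string handling of this flavour needs no special silicon at all; the classic portable trick below checks four bytes per load for the C terminating NUL. This is a generic sketch of the technique, not a description of the RS/6000 load/store-multiple instructions.]

```c
#include <stdint.h>

/* True if any byte of the 32-bit word w is zero.  Subtracting
 * 0x01010101 wraps a zero byte to 0xFF, setting its high bit; the
 * ~w term keeps bytes that were already >= 0x80 from producing a
 * false positive, and 0x80808080 isolates the flag bits. */
int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}
```

[A word-at-a-time strlen or strcpy loops over aligned words until this test fires, then finishes the last word byte by byte.]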