[comp.sys.hp] A benchmark

csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/03/91)

Some benchmarks for a typical large-scale scientific program.

------------------------
     CPU times for the Geophysical Fluid Dynamics Model MOM_1.0 (Fortran).

     From the Geophysical Fluid Dynamics Lab - NOAA, Princeton, New Jersey.
     Written by Ron Pacanowski, Keith Dixon, and Tony Rosati; based on the
     original Cox-Bryan model.
     The code is well vectorized and consists mostly of floating-point
     multiplies/additions. The single and double precision times for
     the IBM were exactly the same. The CPU time for the two-processor
     Convex is the total over both processors (63 secs each).
     The estimate for the IBM540 is 11 Mflops, and 22 Mflops
     for the IBM550.

                                         Job
          Machine   | CPU secs | Mflop | Size MB |     Comments
        ----------------------------------------------------------------
        IBM-320         195       5.1      7.7     REAL*8, optimized.
                        195       5.1      4.0     REAL*4, "        "

        HP-720 "Coral"  197       5.1      8.0     REAL*8, optimized O2.
                        132       7.6      4.0     REAL*4, "           "

        DS5100          495       2.0      7.9     REAL*8, optimized O2.
                        330       3.0      4.0     REAL*4, "           "

   **   Convex C2400     97      10.8      8.9     REAL*8, vectorized,
                         73      14.2      4.5     REAL*4, 1 processor.
                       126/2     16.3      8.9     REAL*8, 2 processors


        Cray Y/MP        14      71.0     ~8.0     REAL*8, vectorized,
        (GFDL, NOAA)                                       1 processor.
        ----------------------------------------------------------------
**  Real time for 1 proc = 99sec, 2 proc = 78sec  using REAL*8.
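
As a cross-check on the table: Mflop x CPU secs comes out close to 1000
in every row, i.e. the model executes a fixed workload of roughly 10^9
flops, and the Mflop column is just

    \text{Mflops} \approx \frac{10^9\ \text{flops}}{\text{CPU secs} \times 10^6}

(e.g. 10^9 / (195 x 10^6) = 5.1 for the IBM-320).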

Machine Specifics:
      Machine     Memory(MB)    Disk(MB)
      ----------------------------------
      IBM-320          16         340
      HP-720           32         340
      DS5100           24        1200
      C2400           512       ~3000
      Y/MP           ~512         ?
------------------------

You'd need more dollars than sense to even think of buying a Convex.
7.6/5.1 Mflops for the HP-720 is just a touch below the 17 Mflops quoted
in the HP blurb. Perhaps it's time that ANSI created an "official" set
of benchmarks.

-- 
Rowan Hughes                                James Cook University
Marine Modelling Unit                       Townsville, Australia.
Dept. Civil and Systems Engineering         csrdh@marlin.jcu.edu.au

guscus@katzo.rice.edu (Gustavo E. Scuseria) (05/03/91)

In article <1991May3.023705.5616@marlin.jcu.edu.au> csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:

>7.6/5.1Mflops for the HP-720 is just a touch below the 17Mflops quoted in
>the HP blurb. 

 What version of the operating system are you using?
 And what compiler?
 I believe HP claims a big boost in performance with the new
 (unreleased) 8.05 version.


--
Gustavo E. Scuseria              | guscus@katzo.rice.edu
Department of Chemistry          |
Rice University                  | office: (713) 527-4082
Houston, Texas 77251-1892        | fax   : (713) 285-5155

csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/04/91)

Errata for my first two articles (see comp.benchmarks too).
i.  linpack300x300 should have been linpack100x100.
ii. "coral" should have been "cobra".

HP (Sydney) did the test for the HP9000-720 using the 8.05 compiler
with the new (Kuck?) front-end. The OS was older, 8.02 or 8.01, but that
shouldn't matter since there were no system calls, and no disk I/O.


Another philosophical note to all those
who sent venomous (snake) mail to me:

This application has been developed by ranking US government scientists
over the last 20 years; a LOT of brain power has gone into it.
It's a REAL program being used to solve REAL problems. If that
doesn't constitute a good benchmark, then what the hell does???
Perhaps a floating-point test using awk would be more acceptable to the
curly-bracket crowd.

-- 
Rowan Hughes                                James Cook University
Marine Modelling Unit                       Townsville, Australia.
Dept. Civil and Systems Engineering         csrdh@marlin.jcu.edu.au

csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/07/91)

In <1991May3.142148.6329@rice.edu> guscus@katzo.rice.edu (Gustavo E. Scuseria) writes:
> What version of the operating system are you using?
> And what compiler?
> I believe HP claims a big boost in performance with the new
> (unreleased) 8.05 version.

A quick follow-up to all those who queried the "-O2" for the HP720:
-O2 is the highest level of optimization (ftn) for the Snakes, unlike
-O3 for the 400 compilers. Thanks to Walt Underwood (HP) for clearing
that up.

-- 
Rowan Hughes                                James Cook University
Marine Modelling Unit                       Townsville, Australia.
Dept. Civil and Systems Engineering         csrdh@marlin.jcu.edu.au

chuckc@hpfcdc.HP.COM (Chuck Cairns) (05/09/91)

Hi,
Curious HP type here.
Just wondered what your memory access pattern is on this particular
benchmark. When you go from one access to the next, is it address X then
X+1, or do you take a large hop in memory? What I'm suspecting is that
your benchmark may be cache-thrashing, i.e. causing lots of accesses to
main RAM. I'd like to see the actual numbers on the higher-speed IBMs.
I kinda suspect that they wouldn't scale well with clock speed. Hmmm.
What's the cache-to-main-RAM width on the IBMs?
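
To make the question concrete, here's a hypothetical Fortran fragment
(mine, not from MOM) showing the two patterns I'm asking about; time the
two summation loops and compare:

      PROGRAM STRIDE
C Hypothetical illustration, not MOM code.  In Fortran's
C column-major layout A(I,J) and A(I+1,J) are adjacent in
C memory, while A(I,J) and A(I,J+1) are N elements apart.
      INTEGER N
      PARAMETER (N = 512)
C 512x512 REAL*8 = 2 MB, well beyond a 256 Kbyte cache.
      DOUBLE PRECISION A(N,N), S1, S2
      INTEGER I, J
      DO 20 J = 1, N
         DO 10 I = 1, N
            A(I,J) = 1.0D0
   10    CONTINUE
   20 CONTINUE
C Unit stride: consecutive addresses, one miss per cache
C line's worth of elements.
      S1 = 0.0D0
      DO 40 J = 1, N
         DO 30 I = 1, N
            S1 = S1 + A(I,J)
   30    CONTINUE
   40 CONTINUE
C Stride N: each access hops 8*N bytes, so nearly every
C reference pulls in a fresh cache line -- the "large hop".
      S2 = 0.0D0
      DO 60 I = 1, N
         DO 50 J = 1, N
            S2 = S2 + A(I,J)
   50    CONTINUE
   60 CONTINUE
C Print the sums so the compiler can't throw the loops away.
      PRINT *, S1, S2
      END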

Why?
If we assume a very fast floating-point processor, which both the HP730
and the high-end IBMs have, then the job becomes one of feeding the FP
processor with enough data, quickly enough. Since all the vendors draw
from roughly the same pool of memory components, their speed becomes at
least one limiting factor. How wide the cache-to-processor bus is, or
how many bytes are in a cache line, may not be pertinent IF you're
cache-thrashing. If we drop out of cache into main RAM then we're going
to take a major hit even with an efficient main-RAM to cache-RAM to
FP-processor path. A wider path from cache to main RAM would help ...
but it costs $$. More cache helps ... until the arrays get bigger ...
and it costs $$. Faster memory chips help, but they cost $$. Murphy's
Nth law of $$? I bet we could roughly predict the speed of a given
program by knowing the cache-to-main-RAM width and the speed of main
RAM ... IF we also know that said program IS cache-thrashing.
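
As a back-of-envelope version of that bet (numbers purely hypothetical):
take a streaming REAL*8 loop like Y(I) = Y(I) + A*X(I). Each iteration
moves two loads and a store, 24 bytes, for 2 flops, so if nothing is
reused in cache

    \text{Mflops} \approx \frac{2B}{24\ \text{bytes}} = \frac{B}{12}

with B the sustained main-RAM bandwidth in MB/sec. A machine that can
stream 120 MB/sec tops out near 10 Mflops on that loop no matter how
fast its FP unit is.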

Since I don't know your particular benchmark these ideas may not be
applicable ... Is it possible to run the benchmark with "rows and
columns" exchanged? I'd also be keen on knowing what happens when the
730 runs totally from its 256 Kbyte data cache, i.e. with smaller
arrays.

The brakes on a Pinto at 55 may seem "spongy" on a Ferrari at 135.
The RAM chips on an XYZ at 5 Mflops may seem ...

Regards, Chuck Cairns 
As usual: My opinions are my own and not HP's.

rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (05/09/91)

Chuck Cairns of HP writes:
>I'd like to see the actual numbers on the higher-speed IBMs. I kinda
>suspect that they wouldn't scale well with clock speed. Hmmm.
>What's the cache-to-main-RAM width on the IBMs?

I have done extensive benchmarks on both the IBM model 530 and model 550:
a broad range of large 2D fluids codes and misc. chaos codes, plus FFTs
and matrix multiplies. In all cases, the speedup went with the clock
ratio. This is expected, as the 550 uses faster memory and a faster bus.
The more extensive testing of the Los Alamos group (with a 540) supports
this result.

In going from a 520 to a 550, the speedup is a lot MORE than the
clock ratio, because the 550 has a larger cache. I think the superior
numerical performance of the IBM architecture is probably due to the
superscalar processor, but it may also have something to do with the
bandwidth between memory and cache, since we're talking about
out-of-cache codes here for both machines. The IBM has 128-byte cache
lines. If you have a cache miss, it takes 8 cycles to load the needed
line. Thus, in a streaming mode, the bandwidth to cache is 16
bytes/cycle (or 672 MB/sec on a 550). Given that the cache is 4-way
set-associative rather than direct-mapped, there can also be
degradation due to certain special memory access patterns. Misses in
the TLB take more cycles to fix, but are rather rare.
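
Spelling out that arithmetic, with the 550's clock at the 42 MHz the
672 figure implies:

    \frac{128\ \text{bytes}}{8\ \text{cycles}} = 16\ \text{bytes/cycle},
    \qquad 16\ \text{bytes/cycle} \times 42\ \text{MHz} = 672\ \text{MB/sec}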

I doubt that cache thrashing is a big problem with the ocean
model code. It was written very carefully for a Cyber 205
originally, and the Cyber really choked on non-unit strides. I
haven't seen the code in a while, though.  It is not
especially optimized for re-use of data, given the Cyber/Cray
architecture, but this is an equal performance hit on the
IBM and HP architectures.

So, we have a problem here.  How do we reconcile HP's published
float performance figures with the results on this real
scientific model code?  Do we perhaps have a compiler that
can do Linpack and specfp efficiently but nothing else?  That
certainly was the case on my DN10k, for example, where the
10.6 Fortran compiler clearly had some kludges in it that
had the sole effect of making compiled BLAS run faster (and
just about nothing else).

rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (05/09/91)

Another thing I forgot to mention about IBM's cache architecture: a
dirty cache line can be stored to memory concurrently with the load of
a new cache line, so in a streaming mode the bandwidth to main memory
is effectively twice what I mentioned in my previous post (about 1344
MB/sec on a 550). I'm curious whether HP's cache can do this as well.

csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/10/91)

In <5570629@hpfcdc.HP.COM> chuckc@hpfcdc.HP.COM (Chuck Cairns) writes:
....
>Just wondered what your memory access pattern is on this particular
>benchmark. When you go from one access to the next, is it address X then
>X+1, or do you take a large hop in memory? What I'm suspecting is that
>your benchmark may be cache-thrashing.

No, the program is essentially vectorized, with unit stride in most
cases. The data would have streamed through the CPU exactly linearly.
Hence it's a good measure of streaming performance; see John McCalpin's
stuff in comp.benchmarks. Data reuse in the cache would only have
existed for the scalar parts (< 15%). The vector size (for "A
benchmark") is <= 90 (double precision). I think that's why the Cray
time was a little slow; its n1/2 was the cause.
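
For those who haven't met it, n1/2 is Hockney's vector half-performance
length: the vector length at which a pipeline reaches half its asymptotic
rate r_inf. Sustained speed on length-n vectors goes roughly as

    r(n) \approx \frac{r_\infty}{1 + n_{1/2}/n}

so if the Y/MP's n1/2 were, say, 30 (an illustrative figure, not a
measurement), length-90 vectors would run at only 75% of the machine's
asymptotic rate.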

>Since I don't know your particular benchmark these ideas may not be
>applicable ... Is it possible to run the benchmark with "rows and
>columns" exchanged? I'd also be keen on knowing what happens when the
>730 runs totally from its 256 Kbyte data cache, i.e. with smaller arrays.

HP software engineers (Sydney HP) ran, and looked at, this program. They
said the row/column-major order wasn't making any difference, so I'll
have to take their word for it.
I'll publish the times for the IBM540 in about 4 weeks. I doubt that a
larger cache would help, certainly not for the vector parts anyway.

NB. MOM is NOT a benchmark, it's a real program  >:)

Cheers
-- 
Rowan Hughes                                James Cook University
Marine Modelling Unit                       Townsville, Australia.
Dept. Civil and Systems Engineering         csrdh@marlin.jcu.edu.au

system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson)) (05/11/91)

In article <1991May9.155313.8671@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:
>So, we have a problem here.  How do we reconcile HP's published
>float performance figures with the results on this real
>scientific model code?  Do we perhaps have a compiler that
>can do Linpak and specfp efficiently but nothing else?  That
>certainly was the case on my DN10k, for example, where the
>10.6 fortran compiler clearly had some kludges in it that
>had the sole effect of making compiled BLAS run faster (and
>just about nothing else).

While I agree about the Apollo DN10k compiler having stuff in it to run
LINPACK much faster, my general experience with the DN10000, IBM
RS/6000 320, SGI 4D2x0 and HP 720 would put the floating-point
performance in the following order:

DN10000    ~5 Mflops/cpu with 10.7 compiler
4D2x0      ~5 Mflops/cpu
6000/320   ~8 Mflops
720        ~15 Mflops with compiler now on demo models

This is based on relative performance on an ab initio quantum chemistry
package (basically a subset of Gaussian 8x), looking only at
CPU-intensive jobs and assuming 5 Mflops for the DN10000/4D210 (as
LINPACK would report); the other figures scale by the observed ratios,
e.g. the 720 ran these jobs about 3x as fast as the DN10000, hence ~15
Mflops. It agrees pretty well with the vendor claims too (hard as that
may be to believe :-) ). We have, however, found specific codes that
run much faster and much slower (i.e. up to factors of 2, easily) than
these general guidelines on all the systems except the 720 (not enough
time to throw a wide variety at it :-( ).

Mike.
-- 
Mike Peterson, System Administrator, U/Toronto Department of Chemistry
E-mail: system@alchemy.chem.utoronto.ca
Tel: (416) 978-7094                  Fax: (416) 978-8775

csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/11/91)

Another personal opinion:
Vector code (i.e. MOM) is probably the nastiest code that one could run
on any RISC machine. It's got large amounts of memory-to-CPU transfer,
and virtually no data reuse in the cache.

-- 
Rowan Hughes                                James Cook University
Marine Modelling Unit                       Townsville, Australia.
Dept. Civil and Systems Engineering         csrdh@marlin.jcu.edu.au