csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/03/91)
Some benchmarks for a typical large scale scientific program.

------------------------
CPU times for the Geophysical Fluid Dynamics Model MOM_1.0 (Fortran).
From the Geophysical Fluid Dynamics Lab - NOAA, Princeton, New Jersey.
Written by Ron Pacanowski, Keith Dixon, and Tony Rosati; based on the
original Cox-Bryan model.

The code is well vectorized, and consists mostly of floating point
multiplies/additions. The single and double precision times for the
IBM were exactly the same. The CPU time for the two-processor Convex
is the sum over both processors (63 secs each). The estimated
benchmark for the IBM540 is 11 Mflops, and 22 Mflops for the IBM550.

                                          Job
 Machine          CPU secs   Mflop   Size MB   Comments
 ----------------------------------------------------------------
 IBM-320             195      5.1      7.7     REAL*8, optimized.
                     195      5.1      4.0     REAL*4,    "
 HP-720 "Coral"      197      5.1      8.0     REAL*8, optimized O2.
                     132      7.6      4.0     REAL*4,    "
 DS5100              495      2.0      7.9     REAL*8, optimized O2.
                     330      3.0      4.0     REAL*4,    "
 Convex C2400 **      97     10.8      8.9     REAL*8, vectorized,
                      73     14.2      4.5     REAL*4, 1 processor.
                   126/2     16.3      8.9     REAL*8, 2 processors
 Cray Y/MP            14     71.0     ~8.0     REAL*8, vectorized,
 (GFDL, NOAA)                                  1 processor.
 ----------------------------------------------------------------
 ** Real time for 1 proc = 99 sec, 2 proc = 78 sec using REAL*8.

Machine Specifics:

 Machine    Memory(MB)   Disk(MB)
 ----------------------------------
 IBM-320        16          340
 HP-720         32          340
 DS5100         24         1200
 C2400         512        ~3000
 Y/MP         ~512            ?
 ----------------------------------

You'd need more dollars than cents to even think of buying a Convex.
7.6/5.1 Mflops for the HP-720 is just a touch below the 17 Mflops
quoted in the HP blurb. Perhaps it's time that ANSI created an
"official" set of benchmarks.

--
Rowan Hughes                               James Cook University
Marine Modelling Unit                      Townsville, Australia.
Dept. Civil and Systems Engineering        csrdh@marlin.jcu.edu.au
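A note on the Mflop column: it is just the total floating point
operation count divided by CPU time. Working backwards from the table,
one run of the model is roughly 1.0E9 operations (e.g. 195 secs x 5.1
Mflops on the IBM-320), and every other row gives much the same total.
A minimal Fortran sketch of the conversion, with the operation count
inferred from the table rather than instrumented in the code:

      PROGRAM MFLOPS
C     Back-of-envelope conversion of CPU seconds to Mflop/s.
C     FLOPS is the total operation count for one MOM run,
C     inferred from the table above (roughly 1.0E9), not an
C     instrumented count.
      REAL FLOPS, CPUSEC, RATE
      FLOPS  = 1.0E9
      CPUSEC = 195.0
      RATE   = FLOPS / (CPUSEC * 1.0E6)
      WRITE(*,*) 'Mflop/s = ', RATE
      END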
guscus@katzo.rice.edu (Gustavo E. Scuseria) (05/03/91)
In article <1991May3.023705.5616@marlin.jcu.edu.au> csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:

>7.6/5.1Mflops for the HP-720 is just a touch below the 17Mflops quoted in
>the HP blurb.

What version of the operating system are you using? And what compiler?
I believe HP claims a big boost in performance with the new
(unreleased) 8.05 version.

--
Gustavo E. Scuseria         |  guscus@katzo.rice.edu
Department of Chemistry     |
Rice University             |  office: (713) 527-4082
Houston, Texas 77251-1892   |  fax   : (713) 285-5155
csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/04/91)
Errata for my first two articles (see comp.benchmarks too).

 i.  linpack300x300 should have been linpack100x100.
 ii. "coral" should have been "cobra".

HP (Sydney) did the test for the HP9000-720 using the 8.05 compiler
with the new (Kuck?) front-end. The OS was older, 8.02 or 8.01, but
that shouldn't matter since there were no system calls, and no disk
I/O.

Another philosophical note to all those who sent venomous (snake) mail
to me: This application has been developed by ranking US government
scientists for the last 20 years; a LOT of brain power has gone into
it. It's a REAL program being used to solve REAL problems. If that
doesn't constitute a good benchmark, then what the hell does???
Perhaps a floating point test using awk is more acceptable to the
curly bracket crowd.

--
Rowan Hughes                               James Cook University
Marine Modelling Unit                      Townsville, Australia.
Dept. Civil and Systems Engineering        csrdh@marlin.jcu.edu.au
csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/07/91)
In <1991May3.142148.6329@rice.edu> guscus@katzo.rice.edu (Gustavo E. Scuseria) writes:

> What version of the operating system are you using? And what
> compiler? I believe HP claims a big boost in performance with
> the new (unreleased) 8.05 version.

A quick follow-up to all those who queried the "-O2" for the HP720:
-O2 is the highest level of optimization (ftn) for the Snakes, unlike
-O3 for the 400 compilers. Thanks to Walt Underwood (HP) for clearing
that up.

--
Rowan Hughes                               James Cook University
Marine Modelling Unit                      Townsville, Australia.
Dept. Civil and Systems Engineering        csrdh@marlin.jcu.edu.au
chuckc@hpfcdc.HP.COM (Chuck Cairns) (05/09/91)
Hi, curious HP type here.

Just wondered what your memory access pattern is on this particular
benchmark. If you access memory address X, is the next access at X+1,
or do you take a large hop in memory location? What I'm suspecting is
that your benchmark may be cache-thrashing, i.e. causing lots of
accesses to main-ram.

I'd like to see the actual numbers on the higher speed IBM's. I kinda
suspect that they wouldn't scale well with clock speed. Hmmm, what's
the cache to main-ram width on the IBM's?

Why? If we assume a very fast floating point processor, which both the
HP730 and the high end IBM's have, then the job becomes one of feeding
the FP processor with enough data quickly enough. Since all the
vendors draw from roughly the same pool of memory components, their
speed becomes at least one limiting factor. How wide the
cache-to-processor bus is, or how many bytes are in a cache line, may
not be pertinent IF you're cache-thrashing. If we drop out of cache
into main ram then we're going to take a major hit even with an
efficient main-ram to cache-ram to FP processor path. A wider path
from cache to main-ram would help ... but costs $$. More cache helps
... until the arrays get bigger ... it costs $$. Faster memory chips
help, but they cost $$. Murphy's Nth law of $$?

I bet we could roughly predict the speed of a given program by knowing
the cache to main-ram width and the speed of main-ram memory ... IF we
also know that said program IS cache-thrashing.

Since I don't know your particular benchmark these may not be
applicable ideas ... Is it possible to run the benchmark with "rows
and columns" exchanged? I'd also be keen on knowing what happens when
the 730 runs totally from its 256Kbyte data cache, i.e. ... smaller
arrays.

The brakes on a Pinto at 55 may seem "spongy" on a Ferrari at 135?
The ram chips on an XYZ at 5 MFLOPS may seem ...

Regards, Chuck Cairns
As usual: My opinions are my own and not HP's.
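To make the stride question concrete, here is a minimal Fortran sketch
of the two access patterns being asked about; the array, its size and
the loop bodies are hypothetical illustrations, not taken from MOM.
Since Fortran stores arrays with the left-most index varying fastest,
the first of the two summation nests is unit stride, while the second
hops N*8 bytes between successive references, the sort of pattern that
can thrash a cache once the array no longer fits in it:

      PROGRAM STRIDE
C     Illustrative only: contrasts unit-stride access (cache-line
C     friendly) with large-stride access (potential cache
C     thrashing).  N and the array A are hypothetical, not MOM's
C     actual layout.
      INTEGER N
      PARAMETER (N = 500)
      DOUBLE PRECISION A(N,N), S
      INTEGER I, J
      DO 20 J = 1, N
         DO 10 I = 1, N
            A(I,J) = 1.0D0
   10    CONTINUE
   20 CONTINUE
      S = 0.0D0
C     Unit stride: the inner loop walks down a column, touching
C     consecutive memory locations, so one cache line feeds
C     several iterations.
      DO 40 J = 1, N
         DO 30 I = 1, N
            S = S + A(I,J)
   30    CONTINUE
   40 CONTINUE
C     Stride N: the inner loop walks across a row, so successive
C     references are N*8 bytes apart and nearly every one can
C     miss once the array is bigger than the cache.
      DO 60 I = 1, N
         DO 50 J = 1, N
            S = S + A(I,J)
   50    CONTINUE
   60 CONTINUE
      WRITE(*,*) S
      END

Swapping the inner and outer loops of a nest like this is exactly the
"rows and columns exchanged" experiment suggested above.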
rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (05/09/91)
Chuck Cairns of HP writes:
<like to see the actual numbers on the higher speed IBM's. I kinda suspect
<that they wouldn't scale well with clock speed. Hmmm what's the cache to
<main-ram width on the IBM's.
I have done extensive benchmarks on both the IBM model 530 and model 550.
These are a broad range of large 2D fluids codes and misc. chaos
codes. Also FFT's and matrix multiplies. In all cases, the speedup
went with the clock-ratio. This is expected, as the 550 uses
faster memory and a faster bus. The more extensive testing of the
Los Alamos group (with a 540) supports this result.
In going from a 520 to a 550, the speed up is a lot MORE than the
clock ratio, because the 550 has a larger cache. I think probably
the superior numerical performance of the IBM architecture is
due to the superscalar processor, but it also may have something
to do with the bandwidth between memory and cache, since we're
talking about out-of-cache codes here for both machines. The
IBM has 128-byte cache lines. If you have a cache miss, it
takes 8 cycles to load the needed line. Thus, in a streaming
mode, the bandwidth to cache is 16 bytes/cycle (or 672MB/sec on
a 550). Given that the cache is 4-way set associative rather
than direct mapped, there can also be degradation due to certain
special memory access patterns. Misses in the TLB take more
cycles to fix, but are rather rare.
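To spell out the arithmetic behind that 672MB/sec figure, here is a
small sketch; the ~42MHz clock assumed for the 550 is back-figured
from the quoted bandwidth rather than taken from a spec sheet:

      PROGRAM CACHBW
C     Streaming bandwidth to cache = line size / line fill time.
C     A 128-byte line filled in 8 cycles gives 16 bytes/cycle;
C     the ~42MHz clock assumed for the 550 is inferred from the
C     672MB/sec quoted above, not taken from a spec sheet.
      REAL LINE, CYCLES, CLOCK, BW
      LINE   = 128.0
      CYCLES = 8.0
      CLOCK  = 42.0E6
      BW     = (LINE / CYCLES) * CLOCK / 1.0E6
      WRITE(*,*) 'MB/sec to cache = ', BW
      END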
I doubt that cache thrashing is a big problem with the ocean
model code. It was written very carefully for a Cyber 205
originally, and the Cyber really choked on non-unit strides. I
haven't seen the code in a while, though. It is not
especially optimized for re-use of data, given the Cyber/Cray
architecture, but this is an equal performance hit on the
IBM and HP architectures.
So, we have a problem here. How do we reconcile HP's published
float performance figures with the results on this real
scientific model code? Do we perhaps have a compiler that
can do Linpack and specfp efficiently but nothing else? That
certainly was the case on my DN10k, for example, where the
10.6 fortran compiler clearly had some kludges in it that
had the sole effect of making compiled BLAS run faster (and
just about nothing else).
rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) (05/09/91)
Another thing I forgot to mention about IBM's cache architecture. A dirty cache line can be stored to memory concurrently with the load of a new cache line, so in a streaming mode, the bandwidth to main memory is effectively twice what I mentioned in my previous post. I'm curious if HP's cache can do this as well.
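Taking the numbers from the previous post at face value, that would be
roughly 672MB/sec of line fills plus another 672MB/sec of write-back
traffic, i.e. something like 1.3GB/sec of aggregate main-memory
traffic on a 550; a back-of-envelope figure that assumes the overlap
is perfect and the code is purely streaming.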
csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/10/91)
In <5570629@hpfcdc.HP.COM> chuckc@hpfcdc.HP.COM (Chuck Cairns) writes:

....
>Just wondered what your memory access pattern is on this particular
>benchmark. If you access memory address X, is the next access at X+1,
>or do you take a large hop in memory location? What I'm suspecting is
>that your benchmark may be

No, the program is essentially vectorized with unit stride in most
cases. The data would have streamed through the CPU exactly linearly.
Hence it's a good measure of streaming performance; see John
McCalpin's stuff in comp.benchmarks. Data reuse in the cache would
only have existed for the scalar parts (< 15%). The vector size (for
"A benchmark") is <= 90 (double prec). I think that's why the Cray
time was a little slow; its N1/2 was the cause.

>Since I don't know your particular benchmark these may not be
>applicable ideas ... Is it possible to run the benchmark with "rows
>and columns" exchanged? I'd also be keen on knowing what happens when
>the 730 runs totally from its 256Kbyte data cache, i.e. ... smaller
>arrays.

HP software engineers ran (and looked at) this program (Sydney HP).
They said the row/col major wasn't making any difference, so I'll have
to take their word for it. I'll publish the times for the IBM540 in
about 4 weeks. I doubt if a larger cache would help, certainly not for
the vector parts anyway.

NB. MOM is NOT a benchmark, it's a real program >:)

Cheers
--
Rowan Hughes                               James Cook University
Marine Modelling Unit                      Townsville, Australia.
Dept. Civil and Systems Engineering        csrdh@marlin.jcu.edu.au
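For anyone puzzled by the N1/2 remark: the usual Hockney model for a
pipelined vector unit is rate(N) = RINF * N / (N1/2 + N), where N1/2
is the vector length at which half of the asymptotic rate RINF is
reached. A minimal sketch with placeholder numbers (RINF and N12
below are illustrative, not measured Y-MP figures), just to show why
vectors of length <= 90 land well short of the asymptotic rate when
N1/2 is a sizeable fraction of that:

      PROGRAM VECMOD
C     Hockney model for pipelined vector performance:
C        rate(N) = RINF * N / (N12 + N)
C     where N12 is the vector length at which half of the
C     asymptotic rate RINF is reached.  RINF and N12 below are
C     illustrative placeholders, not measured Y-MP figures.
      REAL RINF, N12, RN
      INTEGER N
      RINF = 300.0
      N12  = 50.0
      DO 10 N = 10, 90, 20
         RN = RINF * REAL(N) / (N12 + REAL(N))
         WRITE(*,*) 'N =', N, '   Mflop/s =', RN
   10 CONTINUE
      END

With the vector length comparable to N1/2, the achieved rate sits well
below RINF, which is the effect being blamed for the Cray time above.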
system@alchemy.chem.utoronto.ca (System Admin (Mike Peterson)) (05/11/91)
In article <1991May9.155313.8671@midway.uchicago.edu> rtp1@quads.uchicago.edu (raymond thomas pierrehumbert) writes:

>So, we have a problem here. How do we reconcile HP's published
>float performance figures with the results on this real
>scientific model code? Do we perhaps have a compiler that
>can do Linpack and specfp efficiently but nothing else? That
>certainly was the case on my DN10k, for example, where the
>10.6 fortran compiler clearly had some kludges in it that
>had the sole effect of making compiled BLAS run faster (and
>just about nothing else).

While I agree about the Apollo DN10k compiler having stuff in it to
run LINPACK much faster, my general experience with the DN10000, IBM
RS/6000 320, SGI 4D2x0 and HP 720 would put the floating point
performance in the following order:

   DN10000    ~5 Mflops/cpu    with 10.7 compiler
   4D2x0      ~5 Mflops/cpu
   6000/320   ~8 Mflops
   720       ~15 Mflops        with compiler now on demo models

This is based on relative performance on an ab initio quantum
chemistry package (basically a subset of Gaussian 8x), looking only at
cpu-intensive jobs, and assuming 5 Mflops for the DN10000/4D210 (as
LINPACK would report). It agrees pretty well with the vendor claims
too (hard as that may be to believe :-) ).

We have, however, found specific codes that run much faster and much
slower (i.e. up to factors of 2 easily) than these general guidelines
on all the systems except the 720 (not enough time to throw a wide
variety at it :-( ).

Mike.
--
Mike Peterson, System Administrator, U/Toronto Department of Chemistry
E-mail: system@alchemy.chem.utoronto.ca
Tel: (416) 978-7094     Fax: (416) 978-8775
csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/11/91)
Another personal opinion: Vector code (i.e. MOM) is probably the
nastiest code that one could run on any RISC machine. It's got large
amounts of memory-to-cpu transfer, and virtually no data reuse in the
cache.

--
Rowan Hughes                               James Cook University
Marine Modelling Unit                      Townsville, Australia.
Dept. Civil and Systems Engineering        csrdh@marlin.jcu.edu.au