csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/03/91)
Some benchmarks for a typical large-scale scientific program.

------------------------

CPU times for the Geophysical Fluid Dynamics Model MOM_1.0 (Fortran),
from the Geophysical Fluid Dynamics Lab - NOAA, Princeton, New Jersey.
Written by Ron Pacanowski, Keith Dixon, and Tony Rosati; based on the
original Cox-Bryan model.

The code is well vectorized, and consists mostly of floating point
multiplies/additions. The single and double precision times for the IBM
were exactly the same. The CPU time for the two-processor Convex is the
total summed over both processors (63 secs each). The estimated
benchmark for the IBM540 is 11 Mflops, and 22 Mflops for the IBM550.

Machine            | CPU secs | Mflops | Size MB | Comments
----------------------------------------------------------------
IBM-320                195       5.1      7.7     REAL*8, optimized.
                       195       5.1      4.0     REAL*4,   "    "
HP-720 "Coral"         197       5.1      8.0     REAL*8, optimized O2.
                       132       7.6      4.0     REAL*4,   "    "
DS5100                 495       2.0      7.9     REAL*8, optimized O2.
                       330       3.0      4.0     REAL*4,   "    "
** Convex C2400         97      10.8      8.9     REAL*8, vectorized,
                        73      14.2      4.5     REAL*4, 1 processor.
                     126/2      16.3      8.9     REAL*8, 2 processors
Cray Y/MP               14      71.0     ~8.0     REAL*8, vectorized,
(GFDL, NOAA)                                      1 processor.
----------------------------------------------------------------
** Real time for 1 proc = 99 sec, 2 proc = 78 sec using REAL*8.

Machine Specifics:

Machine    Memory(MB)   Disk(MB)
----------------------------------
IBM-320        16          340
HP-720         32          340
DS5100         24         1200
C2400         512        ~3000
Y/MP         ~512            ?

------------------------

You'd need more dollars than cents to even think of buying a Convex.
7.6/5.1 Mflops for the HP-720 is just a touch below the 17 Mflops quoted
in the HP blurb. Perhaps it's time that ANSI created an "official" set
of benchmarks.

-- 
Rowan Hughes                                James Cook University
Marine Modelling Unit                       Townsville, Australia.
Dept. Civil and Systems Engineering         csrdh@marlin.jcu.edu.au
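
A rough cross-check of the table above: if every run does the same work, CPU seconds times the Mflop rate should be about constant from row to row. The sketch below multiplies the two posted columns; the row labels and program name are made up for illustration. For the two-processor Convex row it assumes the rate was computed from the per-processor time (126/2 = 63 secs), which is the reading that makes the row consistent with the others.

```fortran
program mflop_check
  ! CPU seconds times Mflop rate = implied operation count (in millions).
  ! Numbers are copied from the posted table; labels are informal.
  implicit none
  integer, parameter :: nrows = 10
  character(len=22), parameter :: label(nrows) = [ character(len=22) :: &
       'IBM-320 REAL*8', 'IBM-320 REAL*4', &
       'HP-720 REAL*8', 'HP-720 REAL*4', &
       'DS5100 REAL*8', 'DS5100 REAL*4', &
       'C2400 REAL*8 1 proc', 'C2400 REAL*4 1 proc', &
       'C2400 REAL*8 2 proc', 'Cray Y/MP REAL*8' ]
  ! The 2-processor entry uses 63 s, i.e. 126 total CPU secs / 2 (assumption).
  real, parameter :: secs(nrows) = &
       [195., 195., 197., 132., 495., 330., 97., 73., 63., 14.]
  real, parameter :: rate(nrows) = &
       [5.1, 5.1, 5.1, 7.6, 2.0, 3.0, 10.8, 14.2, 16.3, 71.0]
  real :: work(nrows)
  integer :: i

  work = secs * rate
  do i = 1, nrows
     print '(a, f8.1, a)', label(i), work(i), ' million operations implied'
  end do
  print '(a, f8.1, a, f8.1)', 'min / max: ', minval(work), ' / ', maxval(work)
end program mflop_check
```

Every row comes out between roughly 990 and 1048 million operations, so the Mflop column looks like a fixed operation count of about 1000 million divided by the time column.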
preston@ariel.rice.edu (Preston Briggs) (05/03/91)
csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:

> CPU times for the Geophysical Fluid Dynamics Model MOM_1.0 (Fortran).
> The code is well vectorized, and consists mostly of floating point
> multiplies/additions.

It's important to remember that code that is ideal for a vector machine
is not necessarily ideal for a scalar (or super-scalar) machine.
Yes, the IBM and HP machines can cook on vector code,
but often it can be rearranged for even better performance.

For example, on a vector machine, we don't like to see recurrences
in the inner loop. On scalar machines, these are desirable.
A simple (contrived) example:

This is ok for vector machines

        DO j = 1, n
          DO i = 1, n
            A(i) = A(i) + B(j)
          ENDDO
        ENDDO

but this is better for scalar machines
(and terrible for vector machines, because of the recurrence on A(i))

        DO i = 1, n
          DO j = 1, n
            A(i) = A(i) + B(j)
          ENDDO
        ENDDO

Why? In the first case, we'll hold the inner-loop invariant B(j)
in a register. Therefore, we'll require 1 load and 1 store of A(i)
for each flop. In the 2nd case, we'll hold A(i) in a register across
the inner loop, requiring only one load (of B(j)) per flop, with no
stores in the inner loop.

We can further munch the second example by unrolling the outer loop
and jamming the resulting inner loop bodies together

        DO i = 1, n, 4
          DO j = 1, n
            A(i+0) = A(i+0) + B(j)
            A(i+1) = A(i+1) + B(j)
            A(i+2) = A(i+2) + B(j)
            A(i+3) = A(i+3) + B(j)
          ENDDO
        ENDDO

In this case, we'll hold 4 elements of A in registers, and require only
one load of B for every 4 flops. This also helps get better scheduling
for the pipelines.

So, the point is that the results of measuring "well vectorized"
code will tend to favor vector machines. By reworking the
code (a lot?), a la the Perfect Club, you should be able to achieve
even better performance on the scalar machines.

Preston Briggs
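
The unrolled loop above steps i by 4, so as written it only covers the case where n is a multiple of 4. A minimal sketch of the same unroll-and-jam with a cleanup loop for the leftover rows follows. The array size, the test data and names such as nmain and a_uj are assumptions for illustration, not taken from the post.

```fortran
program unroll_and_jam
  ! Compares the scalar-friendly loop nest with an unroll-and-jam version
  ! that also copes with n not divisible by 4. The array size and test
  ! data are arbitrary choices for this sketch.
  implicit none
  integer, parameter :: n = 10          ! deliberately not a multiple of 4
  real :: a_ref(n), a_uj(n), b(n)
  real :: bj
  integer :: i, j, nmain

  do i = 1, n
     b(i) = real(i)                     ! small integers: sums are exact
  end do
  a_ref = 0.0
  a_uj  = 0.0

  ! Reference: A(i) can stay in a register across the inner j loop.
  do i = 1, n
     do j = 1, n
        a_ref(i) = a_ref(i) + b(j)
     end do
  end do

  ! Unroll the outer loop by 4 and jam the bodies into one inner loop:
  ! one load of B(j) now feeds four independent additions.
  nmain = n - mod(n, 4)                 ! largest multiple of 4 <= n
  do i = 1, nmain, 4
     do j = 1, n
        bj = b(j)
        a_uj(i)   = a_uj(i)   + bj
        a_uj(i+1) = a_uj(i+1) + bj
        a_uj(i+2) = a_uj(i+2) + bj
        a_uj(i+3) = a_uj(i+3) + bj
     end do
  end do

  ! Cleanup loop for the 0 to 3 leftover rows.
  do i = nmain + 1, n
     do j = 1, n
        a_uj(i) = a_uj(i) + b(j)
     end do
  end do

  if (any(a_uj /= a_ref)) error stop 'unroll-and-jam result differs'
  print *, 'results agree'
end program unroll_and_jam
```

Each A(i) still accumulates B(1), B(2), ... in the same order in both versions, so the transformation does not reorder any floating point additions. Whether it actually runs faster depends on the machine and on what the compiler already does on its own.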
csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/03/91)
In <1991May3.053053.29174@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes:

>It's important to remember that code that is ideal for a vector machine
>is not necessarily ideal for a scalar (or super-scalar) machine.
>Yes, the IBM and HP machines can cook on vector code,
>but often it can be rearranged for even better performance.
...
>This is ok for vector machines
>        DO j = 1, n
>          DO i = 1, n
>            A(i) = A(i) + B(j)
>          ENDDO
>        ENDDO
>but this is better for scalar machines
>(and terrible for vector machines, because of the recurrence on A(i))
>        DO i = 1, n
>          DO j = 1, n
>            A(i) = A(i) + B(j)
>          ENDDO
>        ENDDO

I do see your point about scalar machines. A quick scan through the
hot spots in the code showed that potential chaining couldn't be had.
Any sensible vector compiler would do a loop inversion for the second
example, or possibly a chain with A(I) treated as a scalar.

The GFDL code is essentially dot and vector products, and triads, which
should be OK for both types of machines. Scalar chaining in the cache
should be regarded as a bonus, not a necessity, and benchmarks which
exclusively use it (e.g. linpack300x300 on HP9000-750) should be treated
with contempt. Chaining is quite useful for vector machines too, and
necessary in a case like SUM=SUM+A(I). You should be aware that the
backwards recurrence problem has been solved for vector compilers, and
Fujitsu's latest vector compiler permits 1st order forward recurrence.

Philosophical note:
I really hate the idea of butchering code to suit some particular
machine; computer scientists were invented to write compilers that
took care of that! The UNIX curly bracket bozos can't even write a
bug-free f77 compiler!!

-- 
Rowan Hughes                                James Cook University
Marine Modelling Unit                       Townsville, Australia.
Dept. Civil and Systems Engineering         csrdh@marlin.jcu.edu.au
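
The SUM=SUM+A(I) case mentioned above is a recurrence on SUM: each addition needs the result of the previous one. One common way to break the dependence is to accumulate several independent partial sums and combine them at the end. The sketch below only illustrates that general idea. The width of 4, the test data and all names are assumptions, and it does not claim to show what the Convex, Cray or Fujitsu compilers actually do.

```fortran
program partial_sums
  ! SUM = SUM + A(I) is a recurrence on SUM. Keeping several independent
  ! partial sums exposes additions that do not depend on each other.
  ! The width of 4 and the data are assumptions made for illustration.
  implicit none
  integer, parameter :: n = 1003, width = 4
  real :: a(n), total_serial, total_split, part(width)
  integer :: i, k, nmain

  do i = 1, n
     a(i) = real(mod(i, 7))             ! small integers keep sums exact
  end do

  ! Serial recurrence: each add waits for the previous one.
  total_serial = 0.0
  do i = 1, n
     total_serial = total_serial + a(i)
  end do

  ! Split version: the adds into part(1..width) are mutually independent.
  part = 0.0
  nmain = n - mod(n, width)             ! largest multiple of width <= n
  do i = 1, nmain, width
     do k = 1, width
        part(k) = part(k) + a(i + k - 1)
     end do
  end do
  total_split = sum(part)
  do i = nmain + 1, n                   ! leftover elements
     total_split = total_split + a(i)
  end do

  if (total_split /= total_serial) error stop 'sums differ'
  print *, 'both sums: ', total_serial
end program partial_sums
```

The exact equality test only works here because the data are small integers. With general floating point data the split version adds in a different order, so the two totals can differ in the last bits.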
guscus@katzo.rice.edu (Gustavo E. Scuseria) (05/03/91)
In article <1991May3.023705.5616@marlin.jcu.edu.au> csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:

>7.6/5.1Mflops for the HP-720 is just a touch below the 17Mflops quoted in
>the HP blurb.

What version of the operating system are you using?
And what compiler?
I believe HP claims a big boost of performance with the new
(unreleased) 8.05 version.

-- 
Gustavo E. Scuseria       | guscus@katzo.rice.edu
Department of Chemistry   |
Rice University           | office: (713) 527-4082
Houston, Texas 77251-1892 | fax   : (713) 285-5155
preston@ariel.rice.edu (Preston Briggs) (05/03/91)
csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:

>Any sensible vector compiler would do a loop inversion for the second
>example, or possibly a chain with A(I) treated as a scalar.

>Philosophical note:
>I really hate the idea of butchering code to suit some particular
>machine; computer scientists were invented to write compilers that
>took care of that!

Well, some computer scientists might disagree; however, I don't.
The vector machines typically have sensible compilers;
I believe scalar machines need them too.

HP's newest compilers (unreleased?) supposedly incorporate
a front-end from Kuck and Associates. Hopefully, they'll handle
this sort of problem. I expect it'll raise the stakes considerably
among workstation compilers.

Preston Briggs
csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/04/91)
Errata for my first two articles:

i.  linpack300x300 should have been linpack100x100.
ii. "coral" should have been "cobra".

HP (Sydney) did the test for the HP9000-720 using the 8.05 compiler
with the new (Kuck?) front-end. The OS was older, 8.02 or 8.01, but that
shouldn't matter since there were no system calls, and no disk I/O.

Another philosophical note to all those who sent venomous (snake) mail
to me:
This application has been developed by ranking US government scientists
for the last 20 years; a LOT of brain power has gone into it. It's a REAL
program being used to solve REAL problems. If that doesn't constitute a
good benchmark, then what the hell does??? Perhaps a floating point test
using awk is more acceptable to the curly bracket crowd.

-- 
Rowan Hughes                                James Cook University
Marine Modelling Unit                       Townsville, Australia.
Dept. Civil and Systems Engineering         csrdh@marlin.jcu.edu.au
csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/07/91)
In <1991May3.142148.6329@rice.edu> guscus@katzo.rice.edu (Gustavo E. Scuseria) writes:

> What version of the operating system are you using ?
> And what compiler ?
> I believe HP claims a big boost of performance with the new
> (unreleased) 8.05 version.

A quick follow-up to all those who queried the "-O2" for the HP720:
-O2 is the highest level of optimization (ftn) for the Snakes, unlike
-O3 for the 400 compilers. Thanks to Walt Underwood (HP) for clearing
that up.

-- 
Rowan Hughes                                James Cook University
Marine Modelling Unit                       Townsville, Australia.
Dept. Civil and Systems Engineering         csrdh@marlin.jcu.edu.au