[comp.benchmarks] A benchmark

csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/03/91)

Some benchmarks for a typical large-scale scientific program.

------------------------
     CPU times for the Geophysical Fluid Dynamics Model MOM_1.0 (Fortran).

     From the Geophysical Fluid Dynamics Lab - NOAA, Princeton, New Jersey.
     Written by Ron Pacanowski, Keith Dixon, and Tony Rosati; based on the
     original Cox-Bryan model.
     The code is well vectorized, and consists mostly of floating point
     multiplies/additions. The single and double precision times for
     the IBM were exactly the same. The CPU time for the two-processor
     Convex is the total over both processors (63 secs each).
     The estimated benchmark for the IBM540 is 11Mflops, and 22Mflops
     for the IBM550.

                                         Job
          Machine   | CPU secs | Mflop | Size MB |     Comments
        ----------------------------------------------------------------
        IBM-320         195       5.1      7.7     REAL*8, optimized.
                        195       5.1      4.0     REAL*4, "        "

        HP-720 "Coral"  197       5.1      8.0     REAL*8, optimized O2.
                        132       7.6      4.0     REAL*4, "           "

        DS5100          495       2.0      7.9     REAL*8, optimized O2.
                        330       3.0      4.0     REAL*4, "           "

   **   Convex C2400     97      10.8      8.9     REAL*8, vectorized,
                         73      14.2      4.5     REAL*4, 1 processor.
                       126/2     16.3      8.9     REAL*8, 2 processors


        Cray Y/MP        14      71.0     ~8.0     REAL*8, vectorized,
        (GFDL, NOAA)                                       1 processor.
        ----------------------------------------------------------------
**  Real time for 1 proc = 99sec, 2 proc = 78sec  using REAL*8.

Machine Specifics:
      Machine     Memory(MB)    Disk(MB)
      ----------------------------------
      IBM-320          16         340
      HP-720           32         340
      DS5100           24        1200
      C2400           512       ~3000
      Y/MP           ~512         ?
------------------------
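
A rough cross-check on the table: the Mflop column looks like a fixed amount
of work divided by CPU seconds. The post does not state the operation count,
so the figure below is inferred from the table rather than quoted. The module
and function names are hypothetical, and the sketch is written in free-form
Fortran 90 for brevity.

```fortran
module mflop_check
  implicit none
contains
  ! Implied work, in millions of floating point operations,
  ! from a CPU time (seconds) and a quoted rate (Mflops).
  pure function implied_mflop(cpu_secs, mflops) result(work)
    real, intent(in) :: cpu_secs, mflops
    real :: work
    work = cpu_secs * mflops
  end function implied_mflop

  ! Rate in Mflops for a given amount of work and CPU time.
  pure function rate_mflops(work, cpu_secs) result(rate)
    real, intent(in) :: work, cpu_secs
    real :: rate
    rate = work / cpu_secs
  end function rate_mflops
end module mflop_check
```

For instance, the IBM-320 row gives 195 x 5.1 = 994.5 and the Cray row gives
14 x 71.0 = 994, so both rows describe roughly the same 1000 Mflop of work.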

You'd need more dollars than cents to even think of buying a Convex.
7.6/5.1Mflops for the HP-720 is just a touch below the 17Mflops quoted in
the HP blurb. Perhaps it's time that ANSI created an "official" set
of benchmarks.

-- 
Rowan Hughes                                James Cook University
Marine Modelling Unit                       Townsville, Australia.
Dept. Civil and Systems Engineering         csrdh@marlin.jcu.edu.au

preston@ariel.rice.edu (Preston Briggs) (05/03/91)

csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:

>     CPU times for the Geophysical Fluid Dynamics Model MOM_1.0 (Fortran).

>     The code is well vectorized, and consists mostly of floating point
>     multiplies/additions.

It's important to remember that code that is ideal for a vector machine
is not necessarily ideal for a scalar (or super-scalar) machine.
Yes, the IBM and HP machines can cook on vector code,
but often it can be rearranged for even better performance.

For example, on a vector machine, we don't like to see recurrences
in the inner loop.  On scalar machines, these are desirable.

A simple (contrived) example:

This is ok for vector machines


	DO j = 1, n
	    DO i = 1, n
		A(i) = A(i) + B(j)
	    ENDDO
	ENDDO

but this is better for scalar machines
(and terrible for vector machines, because of the recurrence on A(i))

	DO i = 1, n
	    DO j = 1, n
		A(i) = A(i) + B(j)
	    ENDDO
	ENDDO

Why?  In the first case, we'll hold the inner-loop invariant B(j) in a
register.  Therefore, we'll require 1 load and 1 store of A(i) for each flop.
In the 2nd case, we'll hold A(i) in a register
across the inner loop, requiring only one load (of B(j)) per flop, with no
stores in the inner loop.
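
A minimal sketch of the two loop orders with the register-held value written
out as an explicit scalar temporary, so the memory traffic described above is
visible in the source. The module and routine names are hypothetical, and
free-form Fortran 90 is used for brevity. A compiler is free to do this
register allocation itself.

```fortran
module loop_order
  implicit none
contains
  ! Vector-friendly order: B(j) is invariant in the inner loop,
  ! but every flop loads and stores one element of A.
  subroutine add_vector_order(a, b, n)
    integer, intent(in) :: n
    real, intent(inout) :: a(n)
    real, intent(in) :: b(n)
    integer :: i, j
    real :: bj
    do j = 1, n
      bj = b(j)              ! loaded once per outer iteration
      do i = 1, n
        a(i) = a(i) + bj     ! 1 load + 1 store of A per flop
      end do
    end do
  end subroutine add_vector_order

  ! Scalar-friendly order: A(i) is carried in a temporary across
  ! the inner loop, so each flop needs only one load of B.
  subroutine add_scalar_order(a, b, n)
    integer, intent(in) :: n
    real, intent(inout) :: a(n)
    real, intent(in) :: b(n)
    integer :: i, j
    real :: ai
    do i = 1, n
      ai = a(i)              ! loaded once
      do j = 1, n
        ai = ai + b(j)       ! 1 load of B per flop, no store
      end do
      a(i) = ai              ! stored once
    end do
  end subroutine add_scalar_order
end module loop_order
```

Both routines add B(1), B(2), ..., B(n) to each A(i) in the same order, so
they should produce the same values; only the memory access pattern differs.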

We can further munch the second example by unrolling the outer loop
and jamming the resulting inner loop bodies together (assuming here that
n is a multiple of 4):

	DO i = 1, n, 4
	    DO j = 1, n
		A(i+0) = A(i+0) + B(j)
		A(i+1) = A(i+1) + B(j)
		A(i+2) = A(i+2) + B(j)
		A(i+3) = A(i+3) + B(j)
	    ENDDO
	ENDDO


In this case, we'll hold 4 parts of A in registers, and require
only one load of B for every 4 flops.  This also helps get better
scheduling for the pipelines.
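
A sketch of the same unroll-and-jam for general n, with a cleanup loop for the
0 to 3 elements left over when n is not a multiple of 4. The routine name, the
scalar temporaries, and the cleanup strategy are illustrative assumptions, not
part of the example above.

```fortran
module unroll_jam
  implicit none
contains
  subroutine add_unroll_jam(a, b, n)
    integer, intent(in) :: n
    real, intent(inout) :: a(n)
    real, intent(in) :: b(n)
    integer :: i, j, n4
    real :: a0, a1, a2, a3, bj
    n4 = n - mod(n, 4)       ! largest multiple of 4 not exceeding n
    do i = 1, n4, 4
      ! Four elements of A live in scalars across the inner loop.
      a0 = a(i); a1 = a(i+1); a2 = a(i+2); a3 = a(i+3)
      do j = 1, n
        bj = b(j)            ! one load of B feeds four adds
        a0 = a0 + bj
        a1 = a1 + bj
        a2 = a2 + bj
        a3 = a3 + bj
      end do
      a(i) = a0; a(i+1) = a1; a(i+2) = a2; a(i+3) = a3
    end do
    ! Cleanup: remaining elements, one at a time.
    do i = n4 + 1, n
      a0 = a(i)
      do j = 1, n
        a0 = a0 + b(j)
      end do
      a(i) = a0
    end do
  end subroutine add_unroll_jam
end module unroll_jam
```

The four adds in the inner loop are independent of one another, which is what
gives the scheduler room to keep the floating point pipeline busy.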

So, the point is that the results of measuring "well vectorized"
code will tend to favor vector machines.  By reworking the code
(a lot?), a la the Perfect Club, you should be able to achieve
even better performance on the scalar machines.

Preston Briggs

csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/03/91)

In <1991May3.053053.29174@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes:

>It's important to remember that code that is ideal for a vector machine
>is not necessarily ideal for a scalar (or super-scalar) machine.
>Yes, the IBM and HP machines can cook on vector code,
>but often it can be rearranged for even better performance.
...
>This is ok for vector machines


>	DO j = 1, n
>	    DO i = 1, n
>		A(i) = A(i) + B(j)
>	    ENDDO
>	ENDDO

>but this is better for scalar machines
>(and terrible for vector machines, because of the recurrence on A(i))

>	DO i = 1, n
>	    DO j = 1, n
>		A(i) = A(i) + B(j)
>	    ENDDO
>	ENDDO

I do see your point about scalar machines. A quick scan through the
hot spots in the code showed that potential chaining couldn't be had.

Any sensible vector compiler would do a loop inversion for the second
example, or possibly a chain with A(I) treated as a scalar. 
The GFDL code is essentially dot and vector products, and triads,
which should be OK for both types of machines.
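
For readers unfamiliar with the terms, generic versions of a triad and a dot
product look roughly like the sketch below. These are textbook kernels with
hypothetical names, not loops taken from the GFDL code, which is not shown in
this thread.

```fortran
module kernels
  implicit none
contains
  ! Triad: two loads, one multiply, one add, one store per element,
  ! with no dependence between iterations.
  subroutine triad(a, b, c, s, n)
    integer, intent(in) :: n
    real, intent(out) :: a(n)
    real, intent(in) :: b(n), c(n), s
    integer :: i
    do i = 1, n
      a(i) = b(i) + s * c(i)
    end do
  end subroutine triad

  ! Dot product: the accumulator carries a recurrence from one
  ! iteration to the next, but can live in a register throughout.
  function dot_kernel(x, y, n) result(d)
    integer, intent(in) :: n
    real, intent(in) :: x(n), y(n)
    real :: d
    integer :: i
    d = 0.0
    do i = 1, n
      d = d + x(i) * y(i)
    end do
  end function dot_kernel
end module kernels
```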

Scalar chaining in the cache should be regarded as a bonus, not a necessity,
and benchmarks which exclusively use it (eg linpack300x300 on HP9000-750)
should be treated with contempt.

Chaining is quite useful for vector machines too, and necessary
in a case like SUM=SUM+A(I). You should be aware that the backwards
recurrence problem has been solved for vector compilers, and Fujitsu's
latest vector compiler permits 1st order forward recurrence.
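
One common way to break the SUM=SUM+A(I) recurrence by hand is to keep
several independent partial sums and combine them at the end. The sketch
below is only an illustration of that idea, not a description of what any
particular vector compiler does; the function name and the choice of 4
partial sums are assumptions. Note that it changes the order of the
additions, so rounding can differ slightly from the simple loop.

```fortran
module partial_sums
  implicit none
contains
  function sum_partial(a, n) result(total)
    integer, intent(in) :: n
    real, intent(in) :: a(n)
    real :: total
    real :: s(4)
    integer :: i, n4
    s = 0.0
    n4 = n - mod(n, 4)       ! largest multiple of 4 not exceeding n
    do i = 1, n4, 4
      ! Four accumulators with no dependence on one another.
      s(1) = s(1) + a(i)
      s(2) = s(2) + a(i+1)
      s(3) = s(3) + a(i+2)
      s(4) = s(4) + a(i+3)
    end do
    total = (s(1) + s(2)) + (s(3) + s(4))
    ! Leftover elements when n is not a multiple of 4.
    do i = n4 + 1, n
      total = total + a(i)
    end do
  end function sum_partial
end module partial_sums
```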

Philosophical note:
I really hate the idea of butchering code to suit some particular
machine; computer scientists were invented to write compilers that
took care of that! The UNIX curly bracket bozos can't even write
a bug-free f77 compiler !!

-- 
Rowan Hughes                                James Cook University
Marine Modelling Unit                       Townsville, Australia.
Dept. Civil and Systems Engineering         csrdh@marlin.jcu.edu.au

guscus@katzo.rice.edu (Gustavo E. Scuseria) (05/03/91)

In article <1991May3.023705.5616@marlin.jcu.edu.au> csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:

>7.6/5.1Mflops for the HP-720 is just a touch below the 17Mflops quoted in
>the HP blurb. 

 What version of the operating system are you using ? 
 And what compiler ? 
 I believe HP claims a big boost of performance with the new
 (unreleased) 8.05 version.


--
Gustavo E. Scuseria              | guscus@katzo.rice.edu
Department of Chemistry          |
Rice University                  | office: (713) 527-4082
Houston, Texas 77251-1892        | fax   : (713) 285-5155

preston@ariel.rice.edu (Preston Briggs) (05/03/91)

csrdh@marlin.jcu.edu.au (Rowan Hughes) writes:

>Any sensible vector compiler would do a loop inversion for the second
>example, or possibly a chain with A(I) treated as a scalar. 

>Philosophical note:
>I really hate the idea of butchering code to suit some particular
>machine; computer scientists were invented to write compilers that
>took care of that!

Well, some computer scientists might disagree; however, I don't.
The vector machines typically have sensible compilers; I believe scalar
machines need them too.

HP's newest compilers (unreleased?) supposedly incorporate a front-end
from Kuck and Associates.  Hopefully, they'll handle this sort of
problem.  I expect it'll raise the stakes considerably among
workstation compilers.

Preston Briggs

csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/04/91)

Errata for my first two articles:
i.  linpack300x300 should have been linpack100x100.
ii. "coral" should have been "cobra".

HP (Sydney) did the test for the HP9000-720 using the 8.05 compiler
with the new (Kuck?) front-end. The OS was older, 8.02 or 8.01, but that
shouldn't matter since there were no system calls, and no disk I/O.


Another philosophical note to all those
who sent venomous (snake) mail to me:

This application has been developed by ranking US government scientists
for the last 20 years; a LOT of brain power has gone into it.
It's a REAL program being used to solve REAL problems. If that
doesn't constitute a good benchmark, then what the hell does ???
Perhaps a floating point test using awk is more acceptable to the curly
bracket crowd.

-- 
Rowan Hughes                                James Cook University
Marine Modelling Unit                       Townsville, Australia.
Dept. Civil and Systems Engineering         csrdh@marlin.jcu.edu.au

csrdh@marlin.jcu.edu.au (Rowan Hughes) (05/07/91)

In <1991May3.142148.6329@rice.edu> guscus@katzo.rice.edu (Gustavo E. Scuseria) writes:
> What version of the operating system are you using ? 
> And what compiler ? 
> I believe HP claims a big boost of performance with the new
> (unreleased) 8.05 version.

A quick follow-up to all those who queried the "-O2" for the HP720:
-O2 is the highest level of optimization (ftn) for the Snakes, unlike
the 400 compilers, where the highest is -O3. Thanks to Walt Underwood (HP)
for clearing that up.

-- 
Rowan Hughes                                James Cook University
Marine Modelling Unit                       Townsville, Australia.
Dept. Civil and Systems Engineering         csrdh@marlin.jcu.edu.au