[comp.benchmarks] Benchmark Summary - FP-intensive, 2-D CFD model

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (11/21/90)

I have been collecting the performance results on this code of mine for several
years, and since there is now a comp.benchmarks group, I offer them
for your edification.

The model is somewhat interesting from the benchmarking point of view
because the operations are about 70% vectorizable and 30%
non-vectorizable (in a recursive routine dominated by divides).  The
profiles of the runs (not shown here) show significant variability ---
most of the scalar machines are limited by the main part of the code,
while most of the vector and parallel machines are limited by the
scalar bottleneck.
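The scalar bottleneck is described above as a recursive routine dominated by divides. The Python below is an illustration only. It is not the Fishpak code, and it assumes the bottleneck behaves like a generic tridiagonal solve (the Thomas algorithm). Each step of the forward sweep needs the values from the previous step, and each step performs a divide, so the iterations cannot be issued as independent vector operations. Back substitution has the same kind of dependency, running in the other direction. The function name `thomas_solve` is hypothetical.

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with the Thomas algorithm (illustrative sketch).

    a: sub-diagonal (a[0] unused), b: diagonal, c: super-diagonal (c[-1] unused),
    d: right-hand side. All four lists have length n.
    """
    n = len(b)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward sweep: step i needs cp[i-1] and dp[i-1] (a loop-carried
    # recurrence), and every step contains a divide.
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    # Back substitution: another recurrence, running from the last row up.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

For example, the 3x3 system with diagonal 2 and off-diagonals -1 and right-hand side [1, 0, 1] has the solution [1, 1, 1].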

This is one of the few cases I have seen that shows the IBM 3090/VF in
a good light relative to the Crays and relative to the IBM 3090 in
scalar mode.


=============================================================================
November 20, 1990   Quasigeostrophic Ocean Model Benchmark   John D. McCalpin
=============================================================================

The following timings are for the compilation and execution of a floating-
point-intensive Fortran 77 program.  The program is typical of test runs of
numerical models used in meteorology/oceanography/geophysical fluid dynamics.
The model executed integrates a time-dependent equation for 60 time steps.
At each time step, an elliptic equation must be solved on the 101x41 finite
difference grid.  The Fishpak routine HWSCRT is used for this purpose.
Therefore, these benchmarks test the floating-point hardware and the array
access/manipulation efficiency of the hardware and software.
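HWSCRT solves the standard five-point finite-difference approximation to the Helmholtz equation, u_xx + u_yy + lambda*u = f, in Cartesian coordinates. The Python below only applies that five-point operator on a 101x41 grid. It does not solve the system, and it is not the model's or Fishpak's code. The function name `helmholtz_5pt` and the parameters `dx`, `dy` and `lam` are illustrative assumptions. Each interior point is computed independently of the others, and this kind of array work vectorizes well.

```python
NX, NY = 101, 41  # grid size quoted in the benchmark description


def helmholtz_5pt(u, dx, dy, lam):
    """Apply the five-point operator u_xx + u_yy + lam*u at interior points.

    u is a list of NX rows, each a list of NY values. Boundary entries of the
    result are left at 0.0. No interior point depends on another point's
    result, so the loop nest has no recurrence for a vectorizer to trip over.
    """
    nx, ny = len(u), len(u[0])
    r = [[0.0] * ny for _ in range(nx)]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            uxx = (u[i + 1][j] - 2.0 * u[i][j] + u[i - 1][j]) / dx**2
            uyy = (u[i][j + 1] - 2.0 * u[i][j] + u[i][j - 1]) / dy**2
            r[i][j] = uxx + uyy + lam * u[i][j]
    return r
```

Applied to u = x^2 + y^2 with unit spacing and lam = 0, every interior value is exactly 4.0, because second differences of a quadratic are exact.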

-------------------------------------------------------------------------------
machine                        execute              ratio to            notes
compiler                         model          VAX 11/780 (VMS)
                                 (sec)         compile     execute
-------------------------------------------------------------------------------
Cray Y/MP (UNICOS, CFT77)          1.9 *          4.76       63.15       9
Cray X/MP (UNICOS, CFT77)          1.9 *          5.02       65.16       9
Cray 2 (CFT77)                     2.6 *          2.54       46.40       9
Cray X/MP (8.5 ns, CFT 1.15)       2.3 *         24.25       52.80       8
Cray 1S (CFT 1.14)                 3.3 *         20.78       37.58       8

ETA-10G (7 ns)                     3.7 *          3.05       33.11
ETA-10E (10.5 ns, UNIX)            5.7 *          2.28       21.64
Cyber 205 (FTN200/670)             5.6 *          7.73       21.88       7
Cyber 760                         14.0 *         10.08        8.75
Cyber 850                         30.2 *            --        4.06
Cyber 835                         79.9 *          3.46        1.53
Cyber 730                        128.3 *          2.31        0.95

IBM 3090/VF (vector)               3.6 *         14.38       34.31      12
IBM 3090 (scalar)                  5.0 *         19.59       24.65      12

IBM RS/6000 Model 530              4.8 *            --       25.52
IBM RS/6000 Model 320              6.3 *            --       19.44
IBM RS/6000 Model 320              8.7            1.04       14.08

Convex C240B                       5.4 *          1.36       22.69      10
Convex C220B                       5.9 *          1.36       20.76      10
Convex C210B                       6.5 *          1.36       18.85      10
Convex C220                        7.9 *          1.36       15.51      10,11
Convex C210                        7.5 *          1.36       16.33      10,11
Convex C120                       16.4 ?          0.54        7.47      10

Alliant FX/8                      22.7 ?          0.31        5.40       2,7
Alliant FX/1                      33.2 ?          0.29        3.69       9

SGI IRIS 4D/220 (1 cpu)            8.8            1.57       13.92       9
SGI IRIS 4D/120 (1 cpu)           16.2            0.88        7.57
SGI Personal IRIS                 22.7            0.67        5.40
SGI 4D-60 Turbo                   25.2            0.78        4.86       2
SGI 3030 (w/Weitek)              151.4            4.08        0.81

DECstation 3100                   15.5              --        7.90
VAX 8700 VMS 4.6                  28.0            6.36        4.37
VAX 6210 VMS                      49.4            2.68        2.48
VAX 11/780 VMS 4.2               122.5            1.00        1.00
VAX 11/780 Ultrix                208.1            0.25        0.59
VAX 11/750 VMS 4.1               495.5            0.63        0.25
Micro VAX II VMS                 195.4            0.97        0.63
Micro VAX II Ultrix              356.1            0.25        0.34
VAXstation 2000 VMS              197.8            0.96        0.62

Apollo DN 10000                   10.9 *          1.30       11.24
HP 835 SRX turbo                  19.6            0.79        6.25

NeXT (Sun f77)                   158.4             N/A        0.77      13

SUN 4/260                         37.0            0.72        3.31       2
SUN 3/260 (Weitek)                55.3            0.49        2.19       2
SUN 3/260 (68881)                275.2               "        0.45
SUN 3/160 (Weitek)                80.9            0.26        1.51       2
SUN 3/160 (68881)                324.0            0.28        0.38       2
SUN 3/50 (no fpa)               1692.3            0.17        0.07       4

Ridge 32                          81.6            0.92        1.50

Masscomp 5400 (68881)            420.3            1.20        0.29
Masscomp 5400 (Weitek)           135.                "        0.91
Compaq 386/20 (Weitek)            59.6            2.51        2.05
IBM PC/AT (80287)               3640.0            0.10        0.03       5
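The execute ratios can be checked against the times in the table. Each ratio is the VAX 11/780 (VMS) execute time of 122.5 s divided by that machine's time, so larger means faster. The Python below recomputes a few rows using only numbers from the table. The row selection and the names `execute_ratio` and `ENTRIES` are arbitrary. The fastest machines' ratios (e.g., the Crays) do not reproduce exactly from the printed one-decimal times, presumably because the ratios were computed from unrounded timings.

```python
VAX_780_VMS_SEC = 122.5  # VAX 11/780 (VMS) execute time from the table

# (machine, execute time in seconds, execute ratio as printed in the table)
ENTRIES = [
    ("Convex C210", 7.5, 16.33),
    ("SGI IRIS 4D/220 (1 cpu)", 8.8, 13.92),
    ("DECstation 3100", 15.5, 7.90),
    ("Ridge 32", 81.6, 1.50),
    ("SGI 3030 (w/Weitek)", 151.4, 0.81),
    ("SUN 3/50 (no fpa)", 1692.3, 0.07),
]


def execute_ratio(seconds):
    """Speed relative to the VAX 11/780 (VMS): larger means faster."""
    return VAX_780_VMS_SEC / seconds


for name, secs, printed in ENTRIES:
    print(f"{name:26s} {secs:7.1f} s  ratio {execute_ratio(secs):6.2f}  (table: {printed:.2f})")
```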

--------------------------------------------------------------------------
COMMENTS:

   The code executed consists of a user program, which is essentially
   100% vectorizable, and a call to the HWSCRT library routine. About
   30% of the operations are in TRIX, a routine in the HWSCRT package
   that is not vectorizable and that uses about 75% of the total time on
   vector machines. Parallelization can help only with the 15% of the CPU
   time spent in vector code. The remaining 10% of the time is spent in
   formatted I/O.
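The time split quoted above (about 75% in TRIX, 15% in vector code, and 10% in formatted I/O on the vector machines) puts a hard ceiling on further speedups. The Python below applies Amdahl's law to that split. It assumes the fractions are exact and that only the named portion gets faster, and the function and constant names are illustrative. Even infinitely fast parallel execution of the vector code yields at most 1/0.85, about 1.18x. A replacement for the TRIX portion that is 10 times faster, comparable to the vectorized ETA-10 solver mentioned in note (7), would yield about 3.1x.

```python
import math


def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time is made `factor` times faster.

    The remaining (1 - fraction) of the time is assumed unchanged. A factor of
    math.inf models making that portion take no time at all.
    """
    return 1.0 / ((1.0 - fraction) + fraction / factor)


# Assumed time breakdown on a vector machine, taken from the comments above.
TRIX_FRACTION = 0.75    # scalar, divide-bound TRIX work
VECTOR_FRACTION = 0.15  # vectorized (and parallelizable) user code
IO_FRACTION = 0.10      # formatted I/O, never sped up here
assert abs(TRIX_FRACTION + VECTOR_FRACTION + IO_FRACTION - 1.0) < 1e-12

# Upper bound from parallelizing only the vector code.
print(f"vector code infinitely fast: {amdahl_speedup(VECTOR_FRACTION, math.inf):.2f}x")
# A 10x faster solver for the TRIX portion.
print(f"TRIX portion 10x faster:     {amdahl_speedup(TRIX_FRACTION, 10.0):.2f}x")
```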

NOTES:
(1) All codes run were computationally identical. Some changes
    were required for I/O compatibility.
    Code lengths (with comments): model.f 1129 lines; hwscrt.f 2076 lines.
    Timings marked with '*' employed 64-bit arithmetic.
    Timings marked with '?' might have used 64-bit arithmetic.
    Unmarked timings employed 32-bit arithmetic.

(2) Timings in parentheses are for the minimum settings of the
    compiler optimizer.  All other timings are with maximum
    optimization/vectorization/parallelization, where appropriate.

(3) Where possible (i.e., on the workstations), jobs were run alone
    on the machine.  The performance on the UNIX machines tended to
    degrade slightly as the load increased.
    Typically the timings would increase by 5% for each
    additional CPU-intensive job in the system.  

(4) The SUN 3/50 is an early version with a 12MHz clock. Current models
    run at 15MHz.  

(5) The IBM PC/AT used Ryan-McFarlane Professional Fortran. Other PC
    compilers failed to generate executable code.

(6) The ratios shown are the VAX 11/780 (VMS) time divided by each
    machine's time, given separately for compile time (including link)
    and execution time, so larger numbers mean faster.
    On the Cyber 760, link time is included in the execution time.

(7) The tests on the Crays, Cyber 205, ETA-10, and Alliant were run
    with the vectorizers on, but the program spends ~75% of its time
    in an unvectorized subroutine in the hwscrt package. A vectorized
    solver, about 10 times faster than the scalar code, has been
    produced for the ETA-10. Similar improvements could be made on
    each of these machines.

(8) These Cray results were provided by Mohan Ramamurthy, and utilized
    the CRAY machines at the National Center for Atmospheric Research.  

(9) These results were provided by Glenn Randers-Pehrson at the Ballistic
    Research Lab (May 9, 1989). He also provided results for the SGI IRIS
    2500T, which matched the Silicon Graphics 3030 results shown.  

(10) Convex results provided by Howard Page of Convex on May 22, 1989.
    The B series machines have faster scalar divide and square
    root hardware.  

(11) I have no immediate explanation for this reversal in ordering: the
    Convex C220 (7.9 s) ran slower than the C210 (7.5 s).

(12) These results were provided by Claudia Stelz at the University
    of Delaware.  

(13) Code was compiled and linked on a Sun 3, using the Sun f77 compiler.

Benchmarks executed and compiled by:
	John D. McCalpin
	Graduate College of Marine Studies
	The University of Delaware
	Robinson Hall
	Newark, DE 19716

	mccalpin@perelandra.cms.udel.edu	(Internet)
	DELOCN::MCCALPIN			(SPAN)
	J.MCCALPIN/OMNET			(Telemail)
	(302) 292-3686 (voice)
	(302) 451-6838 (fax)
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@vax1.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (11/21/90)

> On 20 Nov 90 18:02:11 GMT, mccalpin@perelandra.cms.udel.edu I said:

John> I collecting the performance results on this code of mine for several
      ^^^^^^^^^^^^
John> years, and since there is now a comp.benchmarks group, I offer them
John> for your edification.

Sorry for the incompetence....

crispin@csd.uwo.ca (Crispin Cowan) (11/21/90)

In article <MCCALPIN.90Nov20130211@pereland.cms.udel.edu> mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
[some very interesting data]
>Compaq 386/20 (Weitek)            59.6            2.51        2.05
>IBM PC/AT (80287)               3640.0            0.10        0.03      5
The difference between these numbers suggests that the 287 on the AT was
not used.  Is the Weitek chip really 60 times faster than a 287?  Ten,
I'd believe, but 60 looks more like the AT was doing its FP
calculations in software.
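The "60 times" figure comes straight from the execute column: 3640.0 s on the PC/AT versus 59.6 s on the Compaq. A quick Python check follows, with illustrative variable names. The ratio compares whole-program times on two different systems, CPU and memory included, not the two floating-point chips in isolation.

```python
at_80287_sec = 3640.0     # IBM PC/AT (80287) execute time from the table
compaq_weitek_sec = 59.6  # Compaq 386/20 (Weitek) execute time from the table

# Ratio of whole-program execution times (system vs system, not chip vs chip).
speed_ratio = at_80287_sec / compaq_weitek_sec
print(f"Compaq 386/20 (Weitek) ran {speed_ratio:.1f}x faster than the PC/AT (80287)")
```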

Crispin
-----
Crispin Cowan, CS grad student, University of Western Ontario
Work:  MC28-C, x3342 crispin@csd.uwo.ca
890 Elias St., London, Ontario, N5W 3P2,  432-7823
	---> Support the GST:  Canada's first fair tax <---