[comp.unix.aix] RS/6000 Model 320 FP Performance

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (10/31/90)

I recently typed in the high-performance matrix multiply routine from
the technical report by Ron Bell.  In the report he states that he
gets 43 MFLOPS on a model 530 using this code.  With cache misses kept
to a minimum (as they should be in a blocked code like this one), the
model 320 (20 MHz clock vs. the 530's 25 MHz) should run at 80% of that
speed, or about 34 MFLOPS.
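
(For readers who don't have the report handy: the code is a standard
blocked, or tiled, multiply, along the lines of the C sketch below.
This is only an illustration of the technique, with a row-major layout
and an arbitrary block size BS; it is not Bell's Fortran routine.)

/* Blocked matrix multiply, C += A*B, for n x n double-precision
 * matrices stored row-major.  The caller clears C first.  BS is the
 * block size; 16-32 is the range tried below.  Illustration only,
 * not the routine from the Bell report. */
#define BS 32

void blocked_matmul(int n, const double *a, const double *b, double *c)
{
    int ii, jj, kk, i, j, k;

    for (ii = 0; ii < n; ii += BS)
        for (jj = 0; jj < n; jj += BS)
            for (kk = 0; kk < n; kk += BS) {
                int imax = ii + BS < n ? ii + BS : n;
                int jmax = jj + BS < n ? jj + BS : n;
                int kmax = kk + BS < n ? kk + BS : n;
                /* One block of the product; keeping the three BS x BS
                 * blocks resident in the cache is what keeps the miss
                 * rate low. */
                for (i = ii; i < imax; i++)
                    for (k = kk; k < kmax; k++) {
                        double aik = a[i * n + k];
                        for (j = jj; j < jmax; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}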

My own tests (with block sizes in the range of 16 to 32) show a very
consistent 12 MFLOPS performance.  

Has anyone else run this code?  Even with lots of cache misses, the
320 should be no more than a factor of about 2.5 slower than a 530,
and here I am seeing a ratio of 3.6....
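
(For reference, the MFLOPS figures are the usual 2*N^3 operation count
for an N x N multiply divided by the run time.  A sketch of the
arithmetic, with a hypothetical timing, is below.)

#include <stdio.h>

/* MFLOPS for one n x n matrix multiply that took "seconds" to run:
 * the multiply does 2*n^3 floating-point operations (one multiply
 * plus one add per inner-loop pass). */
static double mflops(int n, double seconds)
{
    return 2.0 * n * n * n / (seconds * 1.0e6);
}

int main(void)
{
    /* Hypothetical timing: a 64 x 64 multiply in 0.044 s comes out
     * near the 12 MFLOPS measured on the 320.  The 3.6 ratio is
     * simply 43 / 12 = 3.58. */
    printf("%.1f MFLOPS\n", mflops(64, 0.044));
    printf("530/320 ratio: %.2f\n", 43.0 / 12.0);
    return 0;
}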
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@vax1.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

bowman@uiatma.atmos.uiuc.edu (11/01/90)

In article <MCCALPIN.90Oct31170825@pereland.cms.udel.edu> mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
>
>Ooops, there must have been some typo in my code.  I extracted the
>code from the tech report again and got the following absolutely
>phenomenal results!
>
>	IBM RS/6000 Model 320 Matrix Multiply Performance
>	Matrix Order   Time per MM        MFLOPS
>	        32       .002             29.789
>	        64       .019             27.594
 .
 .
 .

The value of tailoring algorithms to the architecture is apparent.  Is
anyone, including IBM, planning or willing to produce a library of basic
linear algebra subroutines that are optimized for the 6000?  Think of the
clock cycles that would be saved!

Prof. Kenneth P. Bowman
Department of Atmospheric Sciences
University of Illinois at Urbana-Champaign
105 S. Gregory Avenue
Urbana, IL   61801
217-328-3102
bowman@uiatma.atmos.uiuc.edu

sdl@adagio.austin.ibm.com (Stephen Linam) (11/01/90)

In article <1990Oct31.233855.1371@ux1.cso.uiuc.edu>,
bowman@uiatma.atmos.uiuc.edu writes:
|> In article <MCCALPIN.90Oct31170825@pereland.cms.udel.edu>
|> mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
|> >
|> >Ooops, there must have been some typo in my code.  I extracted the
|> >code from the tech report again and got the following absolutely
|> >phenomenal results!
|> >
|> >	IBM RS/6000 Model 320 Matrix Multiply Performance
|> >	Matrix Order   Time per MM        MFLOPS
|> >	        32       .002             29.789
|> >	        64       .019             27.594
|>  .
|>  .
|>  .
|> 
|> The value of tailoring the algorithms to the architecture is apparent.  Is
|> anyone, including IBM, planning or willing to produce a library of basic
|> linear algebra subroutines that are optimized for the 6000?  Think of the
|> clock cycles that would be saved!

Yes.  Look for /lib/libblas.a.  In the initial release, dgemm, sgemm,
dgemv, and sgemv are optimized.  In the update announced last Tuesday,
the library will be refreshed with 22 tuned single- and double-precision
routines: [sd]gemv, [sd]trmv, [sd]trsv, [sd]gemm, [sd]symm, [sd]ger,
[sd]trmm, [sd]trsm, [sd]syrk, [sd]axpy, and i[sd]amax.

Search for 'blas' in info for documentation on the routines.  The
interfaces are the same as the LAPACK BLAS.  The library includes the
full set of BLAS routines; however, only the ones listed above have
been optimized.
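
For what it's worth, here is a minimal sketch of calling the tuned dgemm
from a C program.  The no-underscore external name and the link line
(something like "cc test.c -lblas -lxlf") are my assumptions about the
usual xlf conventions on AIX, not a statement of the documented
interface; Fortran callers just CALL DGEMM as described in info.

/* Sketch only: compute c = a*b for 2x2 identity matrices via dgemm.
 * Assumptions: xlf-style externals with no trailing underscore,
 * Fortran column-major storage, all arguments passed by reference.
 * The hidden character-length arguments are omitted, which is
 * normally harmless for the CHARACTER*1 transpose flags. */
#include <stdio.h>

extern void dgemm(char *transa, char *transb, int *m, int *n, int *k,
                  double *alpha, double *a, int *lda,
                  double *b, int *ldb,
                  double *beta, double *c, int *ldc);

int main(void)
{
    int n = 2;
    double alpha = 1.0, beta = 0.0;
    double a[4] = { 1.0, 0.0, 0.0, 1.0 };   /* 2x2 identity, column-major */
    double b[4] = { 1.0, 0.0, 0.0, 1.0 };
    double c[4];
    char tr = 'N';                          /* no transpose */

    dgemm(&tr, &tr, &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);

    printf("c = [ %g %g ; %g %g ]\n", c[0], c[2], c[1], c[3]);
    return 0;
}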

--------------------------------------------------------------------
Stephen Linam   AWD Austin   T/L: 793-3674  Bell-net: (512) 832-3674
IBM Internet: sdl@adagio.austin.ibm.com        VNET: LINAM at AUSTIN
UUCP:  ...!cs.utexas.edu:ibmchs!auschs!adagio.austin.ibm.com!sdl