mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (10/31/90)
I recently typed in the high-performance matrix multiply routine from the technical report by Ron Bell. In the report he states that he gets 43 MFLOPS on a model 530 using this code. In the absence of cache misses (which should be minimal in a blocked code like this one) the model 320 should run at 80% of that speed, or about 34 MFLOPS.

My own tests (with block sizes in the range of 16 to 32) show a very consistent 12 MFLOPS performance. Has anyone else run this code? Even with lots of cache misses, the 320 should be no more than a factor of about 2.5 slower than a 530, and here I am seeing a ratio of 3.6....

--
John D. McCalpin                     mccalpin@perelandra.cms.udel.edu
Assistant Professor                  mccalpin@vax1.udel.edu
College of Marine Studies, U. Del.   J.MCCALPIN/OMNET
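For readers unfamiliar with the term, "blocked" here means the multiply is done tile by tile so that each tile is reused while it is still in cache. The following is only a minimal sketch of that idea, not the routine from Ron Bell's report (which is not reproduced in this thread); the function names and the list-of-lists layout are assumptions made for illustration.

```python
def naive_matmul(a, b):
    """Reference triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def blocked_matmul(a, b, block=16):
    """Multiply square matrices tile by tile (hypothetical helper).

    `block` is the tile edge; the post reports trying 16 to 32.
    A tile edge that does not divide n is handled by the min() bounds.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):          # tile row of C
        for kk in range(0, n, block):      # tile along the inner dimension
            for jj in range(0, n, block):  # tile column of C
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += a_ik * row_b[j]
    return c
```

The blocked version performs the same multiplies and adds as the naive loop; only the order of memory accesses changes, which is why the speed should depend mainly on the clock rate once cache misses are rare.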
bowman@uiatma.atmos.uiuc.edu (11/01/90)
In article <MCCALPIN.90Oct31170825@pereland.cms.udel.edu> mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
>
>Ooops, there must have been some typo in my code. I extracted the
>code from the tech report again and got the following absolutely
>phenomenal results!
>
>   IBM RS/6000 Model 320 Matrix Multiply Performance
>   Matrix Order    Time per MM    MFLOPS
>        32            .002        29.789
>        64            .019        27.594
.
.
.

The value of tailoring the algorithms to the architecture is apparent. Is anyone, including IBM, planning or willing to produce a library of basic linear algebra subroutines that are optimized for the 6000? Think of the clock cycles that would be saved!

Prof. Kenneth P. Bowman
Department of Atmospheric Sciences
University of Illinois at Urbana-Champaign
105 S. Gregory Avenue
Urbana, IL 61801
217-328-3102
bowman@uiatma.atmos.uiuc.edu
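The MFLOPS column in the quoted table can be related to the time column with a short calculation. The sketch below assumes the usual convention of counting 2*n^3 floating-point operations for an n x n matrix multiply and assumes the times are in seconds; neither is stated in the thread, although the order-64 row reproduces under these assumptions. The function names are hypothetical.

```python
def matmul_flops(n):
    """Operation count for an n x n multiply: n**3 multiplies plus n**3 adds."""
    return 2 * n ** 3


def mflops(n, seconds):
    """Millions of floating-point operations per second for one multiply."""
    return matmul_flops(n) / seconds / 1.0e6


def seconds_for(n, rate_mflops):
    """Time for one multiply implied by a quoted MFLOPS rate."""
    return matmul_flops(n) / (rate_mflops * 1.0e6)
```

With these definitions, `mflops(64, 0.019)` comes to about 27.594, matching the table, and `seconds_for(32, 29.789)` comes to about 0.0022, consistent with the .002 shown for order 32 once the time is rounded to three decimals.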
sdl@adagio.austin.ibm.com (Stephen Linam) (11/01/90)
In article <1990Oct31.233855.1371@ux1.cso.uiuc.edu>, bowman@uiatma.atmos.uiuc.edu writes:
|> In article <MCCALPIN.90Oct31170825@pereland.cms.udel.edu> mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
|> >
|> >Ooops, there must have been some typo in my code. I extracted the
|> >code from the tech report again and got the following absolutely
|> >phenomenal results!
|> >
|> >   IBM RS/6000 Model 320 Matrix Multiply Performance
|> >   Matrix Order    Time per MM    MFLOPS
|> >        32            .002        29.789
|> >        64            .019        27.594
|> .
|> .
|> .
|>
|> The value of tailoring the algorithms to the architecture is apparent. Is
|> anyone, including IBM, planning or willing to produce a library of basic
|> linear algebra subroutines that are optimized for the 6000? Think of the
|> clock cycles that would be saved!

Yes. Look for /lib/libblas.a. In the initial release dgemm, sgemm, dgemv and sgemv are optimized. In the update announced last Tuesday the library will be refreshed, with 22 single and double precision routines tuned. The tuned routines are [sd]gemv, [sd]trmv, [sd]trsv, [sd]gemm, [sd]symm, [sd]ger, [sd]trmm, [sd]trsm, [sd]syrk, [sd]axpy and i[sd]amax.

Search for 'blas' in info for documentation on the routines. The interfaces are the same as the LAPACK BLAS. The library includes the full set of BLAS routines; however, only the ones listed above have been optimized.

--------------------------------------------------------------------
Stephen Linam    AWD Austin    T/L: 793-3674    Bell-net: (512) 832-3674
IBM Internet: sdl@adagio.austin.ibm.com    VNET: LINAM at AUSTIN
UUCP: ...!cs.utexas.edu:ibmchs!auschs!adagio.austin.ibm.com!sdl
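For context on what the tuned dgemm and sgemm routines compute: the standard BLAS operation is C := alpha*op(A)*op(B) + beta*C, where op is either the identity or a transpose. The sketch below is a plain reference loop for the no-transpose case only, written to show the arithmetic rather than the library's interface. The name `gemm_reference` and the row-list layout are assumptions for illustration; the Fortran routine works on column-major arrays with leading-dimension arguments, and this code says nothing about how the tuned version is implemented.

```python
def gemm_reference(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for row-list matrices.

    a is m x k, b is k x n, c is m x n. Illustrative only: a plain
    triple loop covering the no-transpose case, not the tuned routine.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "a must be m x k"
    assert len(c) == m and all(len(row) == n for row in c), "c must be m x n"
    # Start from the scaled input C, then accumulate alpha * A * B into it.
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = alpha * a[i][p]
            for j in range(n):
                out[i][j] += a_ip * b[p][j]
    return out
```

A loop like this is mainly useful for checking results on small matrices; for real work the point of the reply above is to call the library routine instead.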