bernhold@qtp.ufl.edu (David E. Bernholdt) (03/02/89)
I understand that Cray has implemented the BLAS (Basic Linear Algebra
Subroutines) in assembler for their machines.  That means all of the
BLAS: levels 1, 2, and 3 (I'm not sure about sparse-BLAS-1, though).

Is anyone out there aware of other vendors implementing the BLAS for
their machines?  On a Sun with FPA, for example, the routines could
take advantage of the 2x2, 3x3, and 4x4 matrix and vector operations
on the FPA.

I would expect that most machines presently available could get
improved performance from special implementations of the BLAS (as
opposed to just compiling the FORTRAN version).
--
David Bernholdt                  bernhold@qtp.ufl.edu
Quantum Theory Project           bernhold@ufpine.bitnet
University of Florida
Gainesville, FL 32611            904/392 6365
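For concreteness, the "FORTRAN version" referred to here is the
portable reference source (distributed via netlib) -- plain loops like
the following simplified SAXPY sketch.  This shows the unit-stride
case only; the actual reference routine also takes INCX/INCY increment
arguments and unrolls the loop:

      subroutine saxpy(n, a, x, y)
c     Sketch of a reference Level-1 BLAS routine: y := a*x + y.
c     Unit stride assumed; the real routine handles arbitrary
c     increments, which is exactly the run-time test a hand-coded
c     or inlined version can eliminate.
      integer n, i
      real a, x(n), y(n)
      if (n .le. 0 .or. a .eq. 0.0) return
      do 10 i = 1, n
         y(i) = y(i) + a*x(i)
   10 continue
      return
      end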
rogerk@mips.COM (Roger B.A. Klorese) (03/03/89)
In article <449@orange19.qtp.ufl.edu> bernhold@qtp.ufl.edu (David E. Bernholdt) writes:
>Is anyone out there aware of other vendors implementing the BLAS for
>their machines?
>I would expect that most machines presently available could get
>improved performance from special implementations of the BLAS (as
>opposed to just compiling the FORTRAN version).

If special implementations could provide dramatically better
performance than a compiler, the compiler needs work.  Our all-FORTRAN
number is within about 10% of our coded rate.
--
Roger B.A. Klorese                      MIPS Computer Systems, Inc.
{ames,decwrl,pyramid}!mips!rogerk       928 E. Arques Ave.  Sunnyvale, CA 94086
rogerk@servitude.mips.COM (rogerk%mips.COM@ames.arc.nasa.gov)  +1 408 991-7802
"I majored in nursing, but I had to drop it.  I ran out of milk." - Judy Tenuta
per@tdb.uu.se (Per Wahlund, TDB, Uppsala) (03/03/89)
In article <449@orange19.qtp.ufl.edu>, bernhold@qtp.ufl.edu (David E. Bernholdt) writes:
> I understand that Cray has implemented the BLAS (Basic Linear Algebra
> Subroutines) in assembler for their machines.  That means all of the
> BLAS: levels 1, 2, and 3 (I'm not sure about sparse-BLAS-1, though).
>
> Is anyone out there aware of other vendors implementing the BLAS for
> their machines?

I am not aware of any vendors implementing the BLAS, although some
certainly do to some extent.  However, there is an all-European
research institute called C.E.R.F.A.C.S. in Toulouse, France, working
on large-scale computations.  They have a research group for parallel
algorithms which is at least using BLAS 1, 2, and 3 on different
machines, and I think they are also working on special
implementations.  In that case the machines are the IBM 3090, ETA-10,
and Alliant FX/80 (and perhaps also the Encore Multimax, which is sold
under the name Matra in France).

Per Wahlund
Dept. of Scientific Computing
Sturegatan 4B
S-752 23 Uppsala, SWEDEN
mccalpin@loligo.uucp (John McCalpin) (03/03/89)
In article <14500@admin.mips.COM> rogerk@mips.COM (Roger B.A. Klorese) writes:
>In article <449@orange19.qtp.ufl.edu> bernhold@qtp.ufl.edu (David E. Bernholdt) writes:
>>Is anyone out there aware of other vendors implementing the BLAS for
>>their machines?  I would expect that most machines presently available
>>could get improved performance from special implementations of the
>>BLAS (as opposed to just compiling the FORTRAN version).
>
>If special implementations could provide dramatically better performance
>than a compiler, the compiler needs work.  Our all-FORTRAN number is within
>about 10% of our coded rate.

On some vector machines, the best improvements can be obtained by
inlining BLAS calls.  This removes the subroutine call overhead and
the check for non-unit stride (which is never used in LINPACK).  On
the Cyber 205, I got an immediate factor-of-2 speedup on the order-100
LINPACK case by changing just the BLAS calls in SGEFA to in-line
vector instructions.  The compiler on the ETA-10 can do the in-lining
now, but it is not so clever about removing the extraneous stride test
(which can be evaluated at compile time).

Many scalar machines show speedups of >20% with coded BLAS on the
LINPACK test.  I consider this level of improvement sufficient for me
to want the coded BLAS -- though not sufficient for me to do it
myself :-)

---------------------- John D. McCalpin ------------------------
Dept of Oceanography & Supercomputer Computations Research Institute
mccalpin@masig1.ocean.fsu.edu          mccalpin@nu.cs.fsu.edu
--------------------------------------------------------------------
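To make the inlining point concrete, here is a sketch of the
transformation in SGEFA's column-elimination loop.  The subroutine
name "elim" is illustrative only, and the pivot-row interchange and
the rest of SGEFA are omitted for brevity:

      subroutine elim(a, lda, n, k, kp1, l)
c     Sketch of inlining a BLAS call in SGEFA.  The call form pays
c     the subroutine-call overhead plus SAXPY's run-time stride test
c     on every column:
c
c         do 30 j = kp1, n
c            t = a(l,j)
c            ...
c            call saxpy(n-k, t, a(k+1,k), 1, a(k+1,j), 1)
c   30    continue
c
c     The inlined form below exposes a plain unit-stride loop that a
c     vectorizing compiler can turn into a single vector instruction,
c     with no call overhead and no stride check.
      integer lda, n, k, kp1, l, i, j
      real a(lda,n), t
      do 30 j = kp1, n
         t = a(l,j)
         do 20 i = k+1, n
            a(i,j) = a(i,j) + t*a(i,k)
   20    continue
   30 continue
      return
      end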
ssd@sugar.hackercorp.com (Scott Denham) (03/07/89)
In article <449@orange19.qtp.ufl.edu>, bernhold@qtp.ufl.edu (David E. Bernholdt) writes:
>
> Is anyone out there aware of other vendors implementing the BLAS for
> their machines?

Well, I don't know about the complete BLAS package, but IBM's ESSL
(Engineering and Scientific Subroutine Library) conforms to some of
the BLAS calling sequences for the functions it does support.  Some of
ESSL is excellent in performance (though awkward to use), like the
FFTs, while some of it shows little or no performance advantage over
straight vector FORTRAN code until you get to arrays large enough to
start experiencing cache effects.

 Scott Denham
  Western Atlas
   Houston, TX
mccalpin@loligo (John McCalpin) (03/18/89)
In response to the following:
>In article <449@orange19.qtp.ufl.edu>, bernhold@qtp.ufl.edu (David E. Bernholdt) writes:
>> Is anyone out there aware of other vendors implementing the BLAS for
>> their machines?

someone from MIPS made the comment that the compilers should be
generating near-optimal code anyway, and said that the MIPS LINPACK
performance was within 10% of optimal for the compiler-generated code.
This did not seem to agree with my recollection, so here are the
published LINPACK results from the January 29 LINPACK summary (rates
in MFLOPS; compiler release in parentheses):

Machine            Test     Fortran (compiler)   Coded (compiler)   %Speedup
-----------------------------------------------------------------------------
M-2000  25.0 MHz  64-bit        3.6 (????)           4.0 (????)        11%
M-120/5 16.7 MHz    "           2.1 (1.30)           2.2 (1.31)         5%
M-1000  15.0 MHz    "           1.5 (1.30)           1.6 (1.21)         7%
M-800   12.5 MHz    "           1.2 (1.30)           1.1 (1.10)        -9% ***
-----------------------------------------------------------------------------
M-2000            32-bit        5.7 (????)           7.2 (????)        26%
M-120/5             "           3.9 (1.31)           4.8 (1.31)        23%
M-1000              "           3.6 (1.30)           4.3 (1.21)        19%
M-800               "           3.0 (1.30)           2.4 (1.10)       -18% ***
-----------------------------------------------------------------------------
*** In these cases, the coded results used an old (1.10) compiler, and
    so are not competitive.

The results are within 10% for the 64-bit case, but the 32-bit code
clearly benefits from hand-optimization.

On a Silicon Graphics Personal IRIS (which should have the same CPU
and clock as the M-800 and which uses level 1.31 of the compiler), I
have not been able to exceed 1.96 MFLOPS for the 32-bit all-Fortran
code, using full (-O3) optimization and a variety of loop unrolling
lengths (1-32).  I can't yet account for this discrepancy --- anyone
want to volunteer to explain it?

---------------------- John D. McCalpin ------------------------
Dept of Oceanography & Supercomputer Computations Research Institute
mccalpin@masig1.ocean.fsu.edu          mccalpin@nu.cs.fsu.edu
--------------------------------------------------------------------
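For readers unfamiliar with the unrolling being varied above, here is
a sketch of the SAXPY-style inner loop unrolled by 4 (the lengths
tried ranged from 1 to 32).  The subroutine name "saxpy4" is
illustrative only:

      subroutine saxpy4(n, a, x, y)
c     Sketch of inner-loop unrolling by 4 for y := a*x + y.
c     The main loop handles the multiple-of-4 part; the cleanup
c     loop handles the remaining mod(n,4) elements.  Larger unroll
c     factors trade instruction-count overhead against register
c     pressure and code size.
      integer n, i, m
      real a, x(n), y(n)
      m = n - mod(n,4)
      do 10 i = 1, m, 4
         y(i)   = y(i)   + a*x(i)
         y(i+1) = y(i+1) + a*x(i+1)
         y(i+2) = y(i+2) + a*x(i+2)
         y(i+3) = y(i+3) + a*x(i+3)
   10 continue
      do 20 i = m+1, n
         y(i) = y(i) + a*x(i)
   20 continue
      return
      end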