[comp.lang.fortran] Are vendors implementing BLAS?

bernhold@qtp.ufl.edu (David E. Bernholdt) (03/02/89)

I understand that Cray has implemented the BLAS (Basic Linear Algebra
Subroutines) in assembler for their machines.  That means all of the
BLAS: levels 1, 2, and 3 (I'm not sure about sparse-BLAS-1, though).

Is anyone out there aware of other vendors implementing the BLAS for
their machines?

On a Sun with an FPA, for example, the routines could take advantage
of the 2x2, 3x3, and 4x4 matrix and vector operations the FPA
provides.  I would expect that most presently available machines
could get improved performance from special implementations of the
BLAS (as opposed to just compiling the FORTRAN version).
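
(For reference, the portable FORTRAN version of a level-1 routine
like SAXPY is just a simple loop.  A minimal sketch, ignoring the
INCX/INCY stride handling that the real reference routine also does:

      SUBROUTINE SAXPY(N, SA, SX, INCX, SY, INCY)
C     Compute SY = SA*SX + SY.  Sketch only: assumes unit
C     strides and skips the reference version's unrolling.
      INTEGER N, INCX, INCY, I
      REAL SA, SX(*), SY(*)
      IF (N .LE. 0 .OR. SA .EQ. 0.0) RETURN
      DO 10 I = 1, N
         SY(I) = SY(I) + SA*SX(I)
   10 CONTINUE
      RETURN
      END

A vendor version would replace that loop with whatever the hardware
does best: vector instructions, FPA operations, and so on.)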
-- 
David Bernholdt			bernhold@qtp.ufl.edu
Quantum Theory Project		bernhold@ufpine.bitnet
University of Florida
Gainesville, FL  32611		904/392 6365

rogerk@mips.COM (Roger B.A. Klorese) (03/03/89)

In article <449@orange19.qtp.ufl.edu> bernhold@qtp.ufl.edu (David E. Bernholdt) writes:
>Is anyone out there aware of other vendors implementing the BLAS for
>their machines?
>I would expect that most presently available machines could get
>improved performance from special implementations of the BLAS (as
>opposed to just compiling the FORTRAN version).

If special implementations could provide dramatically better performance
than a compiler, the compiler needs work.  Our all-FORTRAN number is
within about 10% of our coded rate.
-- 
Roger B.A. Klorese                                  MIPS Computer Systems, Inc.
{ames,decwrl,pyramid}!mips!rogerk      928 E. Arques Ave.  Sunnyvale, CA  94086
rogerk@servitude.mips.COM (rogerk%mips.COM@ames.arc.nasa.gov)   +1 408 991-7802
"I majored in nursing, but I had to drop it.  I ran out of milk." - Judy Tenuta

per@tdb.uu.se (Per Wahlund, TDB, Uppsala) (03/03/89)

In article <449@orange19.qtp.ufl.edu>, bernhold@qtp.ufl.edu (David E. Bernholdt) writes:
> I understand that Cray has implemented the BLAS (Basic Linear Algebra
> Subroutines) in assembler for their machines.  That means all of the
> BLAS: levels 1, 2, and 3 (I'm not sure about sparse-BLAS-1, though).
> 
> Is anyone out there aware of other vendors implementing the BLAS for
> their machines?
> 

   I am not aware of any vendors implementing the BLAS themselves,
   although some certainly do to some extent.  However, there is an
   all-European research institute called C.E.R.F.A.C.S. in Toulouse,
   France, which works on large-scale computations.  It has a
   research group for parallel algorithms that is at least using
   BLAS 1, 2, and 3 on different machines, and I think they are also
   working on special implementations.  In that case the machines
   are the IBM 3090, the ETA-10, and the Alliant FX/80 (and perhaps
   also the Encore Multimax, which is sold under the name Matra in
   France).


      Per Wahlund
      Dept. of Scientific Computing
      Sturegatan 4B
      S-752 23 Uppsala, SWEDEN

mccalpin@loligo.uucp (John McCalpin) (03/03/89)

In article <14500@admin.mips.COM> rogerk@mips.COM (Roger B.A. Klorese) writes:
>In article <449@orange19.qtp.ufl.edu> bernhold@qtp.ufl.edu (David E. Bernholdt) writes:
>>Is anyone out there aware of other vendors implementing the BLAS for
>>their machines? I would expect that most presently available
>>machines could get improved performance from special implementations
>>of the BLAS (as opposed to just compiling the FORTRAN version).
>
>If special implementations could provide dramatically better performance
>than a compiler, the compiler needs work.  Our all-FORTRAN number is within
>about 10% of our coded rate.

On some vector machines, the best improvements can be obtained by inlining
BLAS calls.  This removes the subroutine call overhead and the check for
non-unit stride (which is never used in LINPACK).  On the Cyber 205, I got
an immediate factor of 2 speedup on the order 100 LINPACK case by changing
just the BLAS calls in SGEFA to in-line vector instructions. 
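
To make that concrete: the column update in SGEFA is a unit-stride
SAXPY call, and in-lining it by hand exposes the loop directly to
the vectorizer.  Roughly (a fragment; A, T, I, J, K, N as declared
in SGEFA):

C     Before: the update goes through the BLAS
      CALL SAXPY(N-K, T, A(K+1,K), 1, A(K+1,J), 1)

C     After: in-lined, with the unit stride visible at
C     compile time, so no run-time stride test is needed
      DO 10 I = K+1, N
         A(I,J) = A(I,J) + T*A(I,K)
   10 CONTINUE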

The compiler on the ETA-10 can do the in-lining now, but it is not so clever
about removing the extraneous stride test (which can be evaluated at
compile time). 

Many scalar machines show speedups of >20% with coded BLAS on the LINPACK
test.  I consider this level of improvement sufficient for me to want
the coded BLAS -- though not sufficient for me to do it myself :-)
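
(Much of what a "coded" level-1 routine does can be imitated in
FORTRAN by unrolling; the reference SAXPY already unrolls by 4,
roughly like this for the unit-stride case:

C     Cleanup loop handles the N mod 4 leftover elements
      M = MOD(N, 4)
      DO 10 I = 1, M
         SY(I) = SY(I) + SA*SX(I)
   10 CONTINUE
      DO 20 I = M+1, N, 4
         SY(I)   = SY(I)   + SA*SX(I)
         SY(I+1) = SY(I+1) + SA*SX(I+1)
         SY(I+2) = SY(I+2) + SA*SX(I+2)
         SY(I+3) = SY(I+3) + SA*SX(I+3)
   20 CONTINUE

The assembly versions go beyond this with register scheduling that
the compilers don't always manage.)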
----------------------   John D. McCalpin   ------------------------
Dept of Oceanography & Supercomputer Computations Research Institute
mccalpin@masig1.ocean.fsu.edu		mccalpin@nu.cs.fsu.edu
--------------------------------------------------------------------

ssd@sugar.hackercorp.com (Scott Denham) (03/07/89)

In article <449@orange19.qtp.ufl.edu>, bernhold@qtp.ufl.edu (David E. Bernholdt) writes:
> 
> Is anyone out there aware of other vendors implementing the BLAS for
> their machines?
> 

Well, I don't know about the complete BLAS package, but IBM's ESSL
(Engineering and Scientific Subroutine Library) conforms to the BLAS
calling sequences for some of the functions it supports.  Parts of
ESSL, like the FFTs, are excellent in performance (though awkward to
use), while other parts show little or no performance advantage over
straight vector FORTRAN code until the arrays get large enough to
start experiencing cache effects.
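
One nice consequence of the conforming calling sequences is that code
written to the reference BLAS can simply be relinked against ESSL.  A
sketch, assuming the standard level-2 SGEMV interface:

      PROGRAM TGEMV
C     Y = ALPHA*A*X + BETA*Y through the standard level-2
C     BLAS call; links against the reference BLAS or ESSL
      INTEGER M, N, LDA, I, J
      PARAMETER (M = 3, N = 3, LDA = 3)
      REAL A(LDA,N), X(N), Y(M)
      DO 20 J = 1, N
         X(J) = 1.0
         DO 10 I = 1, M
            A(I,J) = REAL(I + J)
   10    CONTINUE
   20 CONTINUE
      DO 30 I = 1, M
         Y(I) = 0.0
   30 CONTINUE
      CALL SGEMV('N', M, N, 1.0, A, LDA, X, 1, 0.0, Y, 1)
      PRINT *, (Y(I), I = 1, M)
      END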
 
    Scott Denham 
      Western Atlas
        Houston, TX

mccalpin@loligo (John McCalpin) (03/18/89)

In response to the following:

>In article <449@orange19.qtp.ufl.edu>, bernhold@qtp.ufl.edu (David E. Bernholdt) writes:
>> Is anyone out there aware of other vendors implementing the BLAS for
>> their machines?

someone from MIPS commented that the compilers should be generating
near-optimal code anyway, and said that the MIPS all-FORTRAN LINPACK
performance was within about 10% of the hand-coded rate.

This did not seem to agree with my recollection, so here are the published
LINPACK results from the January 29 LINPACK summary:

Machine   Clock     Precision   Fortran MFLOPS   Coded MFLOPS   Speedup
                                (compiler)       (compiler)
------------------------------------------------------------------------
M-2000    25.0 MHz  64-bit      3.6  (????)      4.0  (????)      11%
M-120/5   16.7 MHz    "         2.1  (1.30)      2.2  (1.31)       5%
M-1000    15.0 MHz    "         1.5  (1.30)      1.6  (1.21)       7%
M-800     12.5 MHz    "         1.2  (1.30)      1.1  (1.10)      -9% ***
------------------------------------------------------------------------
M-2000    25.0 MHz  32-bit      5.7  (????)      7.2  (????)      26%
M-120/5   16.7 MHz    "         3.9  (1.31)      4.8  (1.31)      23%
M-1000    15.0 MHz    "         3.6  (1.30)      4.3  (1.21)      19%
M-800     12.5 MHz    "         3.0  (1.30)      2.4  (1.10)     -18% ***
------------------------------------------------------------------------
*** In these cases, the coded results used an old (1.10) compiler, and
    so are not competitive.
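
(The speedup column is just the ratio of the two MFLOPS figures,
e.g. 7.2/5.7 = 1.26 for the 32-bit M-2000 case, i.e. the 26% entry.)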

The 64-bit numbers are within 10%, but the 32-bit code clearly
benefits from hand-optimization.

On a Silicon Graphics Personal IRIS (which should have the same CPU and
clock as the M-800 and which uses level 1.31 of the compiler), I have
not been able to exceed 1.96 MFLOPS for the 32-bit all Fortran code,
using full (-O3) optimization and a variety of loop unrolling lengths
(1-32). I can't yet account for this discrepancy --- anyone want to 
volunteer to explain it?
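
(For anyone who wants to check numbers: the LINPACK MFLOPS figure
comes from the standard operation count of 2*n**3/3 + 2*n**2 for
order n, so the n=100 case works out to about 0.687 million
floating-point operations divided by the solution time in seconds.)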
----------------------   John D. McCalpin   ------------------------
Dept of Oceanography & Supercomputer Computations Research Institute
mccalpin@masig1.ocean.fsu.edu		mccalpin@nu.cs.fsu.edu
--------------------------------------------------------------------