[comp.unix.cray] A Correction to IMSL/NAG comparison

benseb@nic.cerf.net (Booker Bense) (03/08/91)
This is in reference to a previous article I posted. It seems I have
put the NAG libraries in a bad light by choosing the incorrect
subroutine for comparison. First the IMSL people get on my case and
now I've got the NAG people irked at me %-)! What large software
organization can I offend in my next post?!! Seriously, this is only one
set of results for one particular class of problem on one particular
machine. Like most benchmarks, it's probably meaningless to your
particular problem. People at both organizations have been very
helpful in advising me. Ask them about your particular problem and
you might be surprised at the response. 
 
	I have obtained some interesting preliminary results and have
some ***MORE**** retractions to make.
 
--First, as far as I can tell NAG does not use BLAS in any form.
--Examining the loadmaps from the code that ran these tests reveals
--that the NAG routines are largely self-contained; the only calls
--they make are to error handling and machine constant routines.

- I was using the wrong NAG routine. Someone from NAG kindly
corrected my mistake. I was originally going to use F03AFF
(recommended in the documentation) but it was doing more than the
other routines (i.e. computing the determinant to high accuracy). So
I looked for something that was simpler; in this case, however, that
turned out to be the wrong thing to do.

- F03AFF does use BLAS level 2, SGEMV and STRSV from libsci.

-- IMSL uses BLAS level 1 calls from the
--system libraries and has its own version of some BLAS level 2
--routines (SGEMV in this example). These times are determined by
--querying the hardware performance monitor before and after the
--subroutine call. The test matrices in this case were the best possible
--case, i.e.:
--
--	 	cond(A) ~= 1
--		A(i,i) > A(i,j) for i != j.
--
--Each routine returned results accurate to machine precision.
--More difficult cases will be included in the final version.
--
--
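For readers who want to sanity-check Mflops rates of the kind reported
here without access to a hardware performance monitor, the following is a
minimal Python sketch. It assumes the nominal LU flop count of (2/3)n^3,
which is an assumption and not how the figures in this post were obtained:
the hardware performance monitor counts the operations actually executed,
so the two need not agree exactly. The function names are made up for
illustration.

```python
def lu_flops(n):
    """Nominal flop count for LU factorization of an n x n matrix.

    Uses only the leading (2/3) n^3 term; this is an assumed model,
    not the hardware performance monitor count used in the post.
    """
    return 2.0 * n ** 3 / 3.0


def mflops(n, seconds):
    """Rate in Mflops given the matrix size and the elapsed time."""
    return lu_flops(n) / seconds / 1.0e6


def seconds_at_rate(n, rate_mflops):
    """Elapsed time implied by a given Mflops rate for size n."""
    return lu_flops(n) / (rate_mflops * 1.0e6)
```

With a wall-clock or CPU timer around the factorization call,
mflops(size, elapsed) gives a figure comparable in spirit, though not
necessarily in value, to the numbers tabulated in this post.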
--SGEFA   - CRI libsci optimized version of the LINPACK routine
--F01BTF  - NAG Mark 13 (references an algorithm by Du Croz, Nugent, Reid & Taylor)
--LFTRG   - IMSLmath version 10.0 (uses the LINPACK algorithm)
--GENERIC - Fortran LINPACK compiled with vector optimization on.
--
--All rates are in Mflops. A = A(size,size)
--
--
--Size             101       203       407       815
--
--SGEFA         99.955   131.174   148.675   158.382
--
--F01BTF        77.289   105.933   131.063   146.328
--
--LFTRG         72.544   156.559   218.848   257.777
--
--
--The next set of results is from forcing IMSL to use the libsci
--version of SGEMV.
--
-- 
--Size             101       203       407       815
--
--SGEFA         97.777   130.377   149.025   157.939
--
--F01BTF        72.429   108.292   132.440   147.396
--
--LFTRG        105.384   213.625   255.089   289.730
-- 
--This result is from a run using generic fortran BLAS and Linpack routines
--from the slatec libraries.
--
--Size             101       203       407       815
--
--GENERIC        35.94    64.359    96.345   136.265
--        
--
--This set of results is from using BLAS level 1 from bcslib and SGEFA
--from bcslib
--
--Size             101       203       407       815
--
--BcsSGEFA     175.777   238.377   274.025   292.939
--
*****NEW******
  F03AFF       128.968   189.378   238.028   277.606
*****NEW******
--
--LFTRG        139.384   218.625   269.089   289.730
--
--
--
--- The Mflops rates are all from running on 1 CPU of an 8-CPU YMP in
--multi-user mode (UNICOS 5.1), i.e. around 0% idle time. I would say that
--the results have a repeatability of around 5%, with results from the
--small sizes being more repeatable. Due to the way the YMP memory is
--organized, memory fetches are a function of system load and the larger
--problems are more affected by this.
--
---Conclusions:
--
--1. It pays to read the loadmap; the only difference between run 1 and
--run 2 was in the load command.
--	   
--	   1:  segldr -limslmath,nag *.o
--	   2:  segldr -lsci,imslmath,nag *.o
--
--2. These are only best case results. I wanted to find out the
--fastest possible speed for these routines. The routines in question
--are the simplest possible; in a real problem you would probably want
--to use the more sophisticated versions and do some checking on the
--condition number before you believe the results.
--
--3. IMSL is a lot faster than I would have expected; I thought the
--speeds for SGEFA would be consistently faster than either IMSL or
--NAG. 290 Mflops is as fast as any code I've run on a single processor;
--330 is the speed you're guaranteed never to exceed. The algorithm
--quoted in the NAG reference manual is one designed for paging
--machines; I don't know how much they massaged it for the YMP. All of
--these numbers do reflect some effort at machine optimization (compare
--with generic).

F03AFF does Crout LU decomposition and is obviously a far better choice
than the original subroutine that I used. The documentation mentions
something about ``higher precision'' used for inner products. This makes
it somewhat of an ``apples & oranges'' comparison. Perhaps the difference
will be noticeable when I get the ``bad case'' version running.

--
--4. Subroutine calls are expensive; the large difference between the
--generic version and the libsci version can in part be explained by
--the increased number of subroutine calls. The libsci versions of both
--SGEMV and SGEFA have had almost all of their subroutine calls inlined.
--As the size of the problem becomes larger the generic version approaches
--the optimized version because the subroutine overhead is roughly
--linear in the problem size while the number of required flops is
--cubic. This also explains the large difference between IMSL with and
--without the libsci SGEMV for small problems.
--
--
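The argument in point 4 can be put as a toy model: total time is the
time to do the arithmetic at some peak rate plus a fixed cost per
subroutine call. The Python sketch below is only an illustration under
stated assumptions. The peak rate, the per-call cost, and the exponent
describing how the call count grows with the matrix size are all
hypothetical knobs, not values measured on the YMP, and the (2/3)n^3 flop
count is the nominal LU figure.

```python
def modeled_mflops(n, peak_mflops=300.0, call_cost_us=1.0, calls_exponent=1.0):
    """Toy model: time = flops / peak + (number of calls) * per-call cost.

    All default parameter values are hypothetical placeholders.
    """
    flops = 2.0 * n ** 3 / 3.0               # nominal LU flop count (assumption)
    compute_s = flops / (peak_mflops * 1.0e6)
    calls = float(n) ** calls_exponent       # assumed growth of the call count
    overhead_s = calls * call_cost_us * 1.0e-6
    return flops / (compute_s + overhead_s) / 1.0e6


SIZES = (101, 203, 407, 815)                 # matrix sizes used in the post
rates = [modeled_mflops(n) for n in SIZES]   # modeled rate at each size
```

As long as the call count grows more slowly than n^3, the modeled rate
rises with the matrix size and approaches the peak from below, which is
the qualitative behaviour described above. Changing calls_exponent or
call_cost_us shows how sensitive the small sizes are to call overhead.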

- It's hard to draw conclusions on speed with the routines doing
somewhat different tasks. One result that appears from the F03AFF data
is that Level 2 BLAS does not provide you with any speed advantage
until you reach a certain minimum size. One thing to note is that the
``unsophisticated'' user (i.e. one that doesn't read loadmaps) %-)
would not see the advantages in speed that using libsci BLAS provides
IMSL. Whether we messed up in our installation is another question
entirely.

- Booker C. Bense                    
prefered: benseb@grumpy.sdsc.edu	"I think it's GOOD that everyone 
NeXT Mail: benseb@next.sdsc.edu 	   becomes food " - Hobbes