benseb@nic.cerf.net (Booker Bense) (02/13/91)
A while ago I posted some remarks about math libraries, decided they were rather hasty, and figured I should probably get some facts. When I have completed the benchmark I will post the code used to obtain it. I have some interesting preliminary results and some retractions to make.

First, as far as I can tell NAG does not use the BLAS in any form. Examining the loadmaps from the code that ran these tests reveals that the NAG routines are largely self-contained; the only calls they make are to error-handling and machine-constant routines. IMSL uses BLAS level 1 calls from the system libraries and has its own version of some BLAS level 2 routines (SGEMV in this example).

These times were determined by querying the hardware performance monitor before and after the subroutine call. The test matrices were the best possible case, i.e. cond(A) ~= 1 and A(i,i) > A(i,j) for i != j. Each routine returned results accurate to machine precision. More difficult cases will be included in the final version.

The routines tested:

 SGEFA   - CRI libsci optimized version of the Linpack routine
 F01BTF  - NAG Mark 13 (references an algorithm by Du Croz, Nugent, Reid & Taylor)
 LFTRG   - IMSL math version 10.0 (uses the Linpack algorithm)
 GENERIC - Fortran Linpack compiled with vector optimization on

All rates are in Mflops; A = A(size,size).

 Size        101      203      407      815
 SGEFA     99.955  131.174  148.675  158.382
 F01BTF    77.289  105.933  131.063  146.328
 LFTRG     72.544  156.559  218.848  257.777

The next set of results is from forcing IMSL to use the libsci version of SGEMV:

 Size        101      203      407      815
 SGEFA     97.777  130.377  149.025  157.939
 F01BTF    72.429  108.292  132.440  147.396
 LFTRG    105.384  213.625  255.089  289.730

This result is from a run using generic Fortran BLAS and Linpack routines:

 Size        101      203      407      815
 GENERIC   35.94    64.359   96.345  136.265

The Mflops rates are all from runs on 1 CPU of an 8-CPU YMP in multi-user mode (UNICOS 5.1), i.e.
around 0% idle time. I would say the results have a repeatability of around 5%, with the smaller sizes being more repeatable. Due to the way the YMP memory is organized, memory fetches are a function of system load, and the larger problems are more affected by this.

Conclusions:

1. It pays to read the loadmap. The only difference between run 1 and run 2 was in the load command:

    1: segldr -limslmath,nag *.o
    2: segldr -lsci,imslmath,nag *.o

2. These are only best-case results. I wanted to find out the fastest possible speed for these routines. The routines in question are the simplest possible; in a real problem you would probably want to use the more sophisticated versions and do some checking on the condition number before you believe the results.

3. IMSL is a lot faster than I would have expected; I thought SGEFA would be consistently faster than either IMSL or NAG. 290 Mflops is as fast as any code I've run on a single processor, and 330 is the speed you're guaranteed never to exceed. The algorithm quoted in the NAG reference manual is one designed for paging machines; I don't know how much they massaged it for the YMP. All of these numbers do reflect some effort at machine optimization (compare with GENERIC).

4. Subroutine calls are expensive. The large difference between the generic version and the libsci version can in part be explained by the increased number of subroutine calls: the libsci versions of both SGEMV and SGEFA have had almost all of their subroutine calls inlined. As the problem becomes larger the generic version approaches the optimized version, because the subroutine overhead grows roughly linearly in the problem size while the number of required flops is cubic. This also explains the large difference between IMSL with and without the libsci SGEMV for small problems.

- Booker C. Bense /* benseb@grumpy.sdsc.edu */
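For readers unfamiliar with the routines being timed: the column-oriented Linpack SGEFA algorithm that LFTRG and GENERIC are built on can be sketched as below. This is a hypothetical plain-language reimplementation for illustration only (the real code is Fortran); it writes out the BLAS level 1 calls (ISAMAX, SSCAL, SAXPY) explicitly and follows the Linpack convention of storing negated multipliers below the diagonal.

```python
def isamax(x):
    """Index of the entry with largest absolute value (BLAS ISAMAX)."""
    return max(range(len(x)), key=lambda i: abs(x[i]))

def saxpy(n, alpha, x, xoff, y, yoff):
    """y[yoff:yoff+n] += alpha * x[xoff:xoff+n], in place (BLAS SAXPY)."""
    for i in range(n):
        y[yoff + i] += alpha * x[xoff + i]

def sgefa(a):
    """LU-factor a (stored as a list of columns) in place with partial
    pivoting, following the Linpack SGEFA loop structure; returns the
    pivot vector. Negated multipliers are stored below the diagonal."""
    n = len(a)
    ipvt = list(range(n))
    for k in range(n - 1):
        col = a[k]
        l = k + isamax(col[k:])            # pivot search in column k
        ipvt[k] = l
        if col[l] == 0.0:
            continue                        # zero pivot: skip the column
        col[k], col[l] = col[l], col[k]     # interchange within column k
        m = -1.0 / col[k]
        for i in range(k + 1, n):           # SSCAL: store negated multipliers
            col[i] *= m
        for j in range(k + 1, n):           # update each trailing column
            cj = a[j]
            t = cj[l]
            cj[l], cj[k] = cj[k], t         # apply the same interchange
            saxpy(n - k - 1, t, col, k + 1, cj, k + 1)
    return ipvt

# Demo: factor the 2x2 matrix [[1, 2], [4, 3]], stored as a list of columns.
a = [[1.0, 4.0], [2.0, 3.0]]
ipvt = sgefa(a)
print(ipvt, a)   # -> [1, 1] [[4.0, -0.25], [3.0, 1.25]]
```

Note the call pattern: every step k makes one ISAMAX call, one SSCAL-style scaling, and one SAXPY per trailing column, which is exactly the per-call overhead that the libsci inlining eliminates.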
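Conclusion 4 can be checked with back-of-the-envelope arithmetic. Assuming the standard 2n^3/3 flop count for LU factorization (an assumption of mine, not a figure from the performance monitor), the SGEFA loop nest makes on the order of n^2/2 BLAS level 1 calls, i.e. a count that grows linearly in the number of matrix elements while the flops grow cubically in n:

```python
def lu_flops(n):
    """Standard operation count for LU factorization: about 2/3 n^3."""
    return 2 * n ** 3 / 3

def blas1_calls(n):
    """Calls made by the SGEFA loop nest: one ISAMAX, one SSCAL, and
    n-k-1 SAXPYs at step k -- about n^2/2 in total."""
    return sum(2 + (n - k - 1) for k in range(n - 1))

# Flops per subroutine call at each benchmarked size.
for n in (101, 203, 407, 815):
    print(n, blas1_calls(n), round(lu_flops(n) / blas1_calls(n)))
```

The flops-per-call ratio grows roughly like 4n/3, so the fixed cost of each call is amortized away at the large sizes, which is consistent with GENERIC closing much of the gap at size 815.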