[comp.sys.sgi] LINPACK

bron@bronze.SGI.COM (Bron Campbell Nelson) (03/18/89)
In article <8903161616.AA10569@masig1.ocean.fsu.edu>, mccalpin@MASIG1.OCEAN.FSU.EDU ("John D. McCalpin") writes:
> So far it looks like loop unrolling buys a lot on this machine.
> On 32-bit LINPACK (order 100 case), with full optimization -O3,
> unrolling the innermost loops (the BLAS subroutines) gives a speedup
> from 1.4 to 1.9 MFLOPS (unrolled to a depth of 16).  I still can't 
> recover the 3.0 MFLOPS in the LINPACK published results for the MIPS
> M-800 (which should be the same CPU and clock speed).


Unrolling (especially of things like DAXPY) does indeed help on this machine.
In addition to all the "standard" reasons, the MIPS f.p. unit has independent
multiply and add units.  So by unrolling the loop, you provide a number
of fp adds that can happen concurrently with the multiplies; a big win.
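
For anyone who hasn't looked at what the unrolling does, here is a rough
sketch of a DAXPY-style loop unrolled to a depth of 4.  It is written in C
for illustration only; it is not the Fortran BLAS source, the function names
are made up for this example, and it assumes unit stride.  Whether the adds
actually overlap the multiplies depends on the compiler's scheduling, so
treat it as a picture of the idea rather than a tuned routine.

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop: y[i] += a * x[i], one multiply and one add per element. */
void daxpy_simple(size_t n, double a, const double *x, double *y)
{
    size_t i;

    for (i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Same operation unrolled to a depth of 4 (hypothetical name, unit stride). */
void daxpy_unrolled4(size_t n, double a, const double *x, double *y)
{
    size_t i;
    size_t rem = n % 4;   /* leftover elements when n is not a multiple of 4 */

    assert(n == 0 || (x != NULL && y != NULL));

    /* Clean-up loop: handle the leftover elements first. */
    for (i = 0; i < rem; i++)
        y[i] += a * x[i];

    /* Main loop: four independent multiply-add pairs per trip, so the
       scheduler has adds available to overlap with later multiplies. */
    for (; i < n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
}
```

The four statements in the main loop do not depend on each other, which is
what gives the multiply and add units something to do at the same time.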

I don't know why you're seeing such low numbers; they look more like the
double precision rates (rather than the 32-bit rates you say you're working
on).  I just ran the code on my machine (a 12 MHz 4D/60) and got about
2.3 MFLOPS using -O2 optimization (straight Fortran code as supplied
from Dongarra; daxpy unrolled to a depth of 4).

--
Bron Campbell Nelson
bron@sgi.com  or possibly  ..!ames!sgi!bron