bron@bronze.SGI.COM (Bron Campbell Nelson) (03/18/89)
In article <8903161616.AA10569@masig1.ocean.fsu.edu>, mccalpin@MASIG1.OCEAN.FSU.EDU ("John D. McCalpin") writes:
> So far it looks like loop unrolling buys a lot on this machine.
> On 32-bit LINPACK (order 100 case), with full optimization -O3,
> unrolling the innermost loops (the BLAS subroutines) gives a speedup
> from 1.4 to 1.9 MFLOPS (unrolled to a depth of 16). I still can't
> recover the 3.0 MFLOPS in the LINPACK published results for the MIPS
> M-800 (which should be the same CPU and clock speed).

Unrolling (especially of things like DAXPY) does indeed help on this machine. In addition to all the "standard" reasons, the MIPS f.p. unit has independent multiply and add units. So by unrolling the loop, you provide a number of f.p. adds that can happen concurrently with the multiplies; a big win.

I don't know why you're seeing such low numbers; they look more like the double precision rates than the 32-bit rates you say you're working on. I just ran the code on my machine (a 12 MHz 4D/60) and got about 2.3 MFLOPS using -O2 optimization (straight Fortran code as supplied by Dongarra; daxpy unrolled to a depth of 4).

--
Bron Campbell Nelson
bron@sgi.com or possibly ..!ames!sgi!bron
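
For readers who have not seen it, here is roughly what "daxpy unrolled to a depth of 4" looks like. This is only a sketch in C, not the Fortran source used in the runs above; the function names daxpy_simple and daxpy_unrolled4 are made up for illustration, and unit stride is assumed. Each pass through the unrolled loop gives the compiler four independent multiply-add pairs to schedule, which is what lets separate multiply and add units overlap their work.

```c
#include <stddef.h>

/* Reference DAXPY: y[i] = y[i] + a * x[i], unit stride assumed. */
void daxpy_simple(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Same operation, unrolled to a depth of 4 (illustrative name). */
void daxpy_unrolled4(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;
    size_t rem = n % 4;

    /* Cleanup loop: handle the n mod 4 leftover elements first. */
    for (; i < rem; i++)
        y[i] += a * x[i];

    /* Main loop: four independent updates per iteration, so the
       multiply for one element can overlap the add for another. */
    for (; i < n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
}
```

Both routines perform the same arithmetic on each element, so they should give identical results; only the scheduling freedom differs. Whether the unrolled form is actually faster depends on the compiler and optimization level, since an optimizer may unroll the simple loop by itself.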