[comp.arch] "Smart" cache re-loading

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (08/21/90)

I have a question on optimization of "vector" code on the IBM RS/6000
and related processors.  I will preface these remarks by saying that I
do not have a good understanding of the behavior of caches on the
writing side, though I think I understand them pretty well on the
reading side.

The IBM 320 is the machine in question.  It has a 20 MHz clock, a
64-bit memory bus, an 8-word (=64-byte) cache line size, and a fully
pipelined floating-point unit with a 2-cycle pipe length.  (At least
that is how I understand what I have read about it).

A very common operation in scientific programming that is not amenable
to caching is the long vector dyad (add or multiply).
	do i=1,n
	    a(i) = b(i)+c(i)
	end do
where 'n' is large enough (and there are enough other things going on
in the code) that b() and c() are not going to be cache-contained, and
only part of a() will fit in cache before it gets written back to main
memory. 

Suppose that a cache line reload takes 8 cycles (the machine has a
64-bit bus, so a 64-byte line needs 8 transfers).  Then the code will
require roughly 16 cycles to do the two cache loads required to do 8
operations.  (I will assume that the cache can write back a()
asynchronously for this case.)  Then the FP ops need to be performed.
I will assume that they are pipelined and all 8 require about another
8 cycles.  That is 24 cycles for 8 operations, so even though I have a
fully pipelined FP unit and a fast memory bus, I still get only 1
operation per 3 clocks.  The 8 cycles during which the FP unit is
crunching provide enough time with the memory bus free to write back
8 elements of the a() array from previous iterations of the loop.
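
To make the arithmetic explicit, here is a minimal C sketch of this
cost model.  It assumes one 8-byte bus transfer per cycle (so a
64-byte line reloads in 64/8 = 8 cycles), no overlap between line
reloads and FP work, one pipelined FP result per cycle, and a fully
hidden write-back of a().  The function name clocks_per_op and its
parameters are made up for illustration.

```c
/* Hypothetical cost model: clocks per FP operation for a streaming loop.
   Assumes line reloads and FP work do not overlap, and that the
   write-back of the result vector is completely hidden. */
double clocks_per_op(int line_bytes, int bus_bytes_per_cycle,
                     int input_streams, int elems_per_line)
{
    /* cycles to reload one cache line over the bus */
    int reload_cycles = line_bytes / bus_bytes_per_cycle;
    /* one line reload per input vector (b and c for the dyad) */
    int load_cycles = input_streams * reload_cycles;
    /* pipelined FP unit: one result per cycle once the data is in cache */
    int fp_cycles = elems_per_line;

    return (double)(load_cycles + fp_cycles) / elems_per_line;
}
```

With the numbers above, clocks_per_op(64, 8, 2, 8) gives
(16 + 8)/8 = 3 clocks per operation.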

This looks like an optimal use of the available memory bandwidth for
this type of operation, and results in a peak speed of MHz/3 MFLOPS
for vector dyad operations and MHz/1.5 MFLOPS for vector triad
operations (where one operand is a scalar).  For the IBM 320, these
numbers are 6.667 MFLOPS and 13.333 MFLOPS.
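
The two peak figures come from the same 3 clocks per element; only
the flop count per element changes.  A small C sketch, assuming the
triad has the form a(i) = b(i) + s*c(i) (two flops per element, same
two vector loads); the name peak_mflops is hypothetical:

```c
/* Hypothetical helper: bandwidth-limited peak MFLOPS.
   clock_mhz          - processor clock in MHz
   clocks_per_element - clocks spent per result element (3 in this model)
   flops_per_element  - 1 for the dyad, 2 for the triad (assumed form
                        a(i) = b(i) + s*c(i)) */
double peak_mflops(double clock_mhz, double clocks_per_element,
                   int flops_per_element)
{
    return clock_mhz * flops_per_element / clocks_per_element;
}
```

For a 20 MHz clock, peak_mflops(20.0, 3.0, 1) is 20/3 and
peak_mflops(20.0, 3.0, 2) is 20/1.5, matching the figures above.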

My best observation to date for long vector operations from FORTRAN is
8.4 MFLOPS for triad operations (average vector length=666).

What other overheads exist preventing the attainment of this peak
speed?

Is it possible to execute a load from the integer unit to force the
next desired cache line to load while the FP unit is busy?
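
A minimal sketch of the kind of "touch" load being asked about,
written in C rather than FORTRAN because the byte load is easier to
express there.  The function name dyad_touch and the constant
LINE_DOUBLES are made up for illustration.  It assumes 8 doubles per
cache line, and it does not assume the arrays are line-aligned, so
the touched address may fall anywhere inside the next line.  Whether
the load actually overlaps with the FP work depends on whether the
cache can service a miss without stalling the processor; that is an
assumption here, not something established for the RS/6000.

```c
#include <stddef.h>

enum { LINE_DOUBLES = 8 };  /* 64-byte line / 8-byte double (assumed) */

/* Hypothetical sketch: a(i) = b(i) + c(i), with an integer (byte) load
   from the next block of b and c issued before the FP work on the
   current block. */
void dyad_touch(double *a, const double *b, const double *c, size_t n)
{
    size_t i, j, end;

    for (i = 0; i < n; i += LINE_DOUBLES) {
        end = (i + LINE_DOUBLES < n) ? i + LINE_DOUBLES : n;

        if (end < n) {
            /* The volatile cast keeps the compiler from discarding
               these otherwise unused loads. */
            (void)*(const volatile unsigned char *)&b[end];
            (void)*(const volatile unsigned char *)&c[end];
        }

        /* FP work on the current block */
        for (j = i; j < end; j++)
            a[j] = b[j] + c[j];
    }
}
```

The results are identical to the plain loop; only the order in which
cache lines are first referenced changes.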
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@vax1.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET