mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (08/21/90)
I have a question on optimization of "vector" code on the IBM RS/6000 and related processors. I will preface these remarks by saying that I do not have a good understanding of the behavior of caches on the writing side, though I think I understand them pretty well on the reading side.

The IBM 320 is the machine in question. It has a 20 MHz clock, a 64-bit memory bus, an 8-word (=64-byte) cache line size, and a fully pipelined floating-point unit with a 2-cycle pipe length. (At least that is how I understand what I have read about it.)

A very common operation in scientific programming that is not amenable to caching is the long vector dyad (add or multiply):

        do i=1,n
           a(i) = b(i)+c(i)
        end do

where 'n' is large enough (and there are enough other things going on in the code) that b() and c() are not going to be cache-contained, and only part of a() will fit in cache before it gets written back to main memory.

Suppose that a cache line reload takes 8 cycles (the line is 8 words and the machine has a 64-bit bus). Then the code will require roughly 16 cycles to do the two cache line loads required to do 8 operations. (I will assume that the cache can write back a() asynchronously for this case.) Then the FP ops need to be performed. I will assume that they are pipelined and all 8 require about another 8 cycles. So even though I have a fully pipelined FP unit and a fast memory bus, I still get only 8 operations per 24 clocks, or 1 operation per 3 clocks.

The 8 cycles during which the FP unit is crunching provide enough time with the memory bus free to write back 8 elements of the a() array from previous iterations of the loop. This looks like an optimal use of the available memory bandwidth for this type of operation, and results in a peak speed of MHz/3 MFLOPS for vector dyad operations and MHz/1.5 MFLOPS for vector triad operations (where one operand is a scalar). For the IBM 320, these numbers are 6.667 MFLOPS and 13.333 MFLOPS.
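The estimate above can be written out as a small back-of-envelope model. This is only a sketch in Python under the assumptions stated in the post, not a description of how the RS/6000 actually behaves: the 1-operation-per-3-clocks figure corresponds to an 8-cycle line reload (one 64-bit word per cycle), no overlap between line reloads and FP work, and write-back of a() hidden behind the FP work. The function and parameter names are made up for illustration, and the triad is assumed to have the form a(i) = b(i) + s*c(i), costing the same 8 FP cycles per line as the dyad.

```python
# Back-of-envelope model of the cycle estimate in the post.
# Every default below is an assumption from the post, not a measured
# RS/6000 figure; the names are hypothetical.

LINE_WORDS = 8      # 64-bit words per 64-byte cache line
RELOAD_CYCLES = 8   # assumed cycles to reload one line over the 64-bit bus


def cycles_per_line(input_vectors=2, fp_cycles=8):
    """Cycles needed to produce one cache line (8 elements) of results.

    Assumes line reloads do not overlap with the FP work and that the
    write-back of a() is hidden behind the FP work.
    """
    return input_vectors * RELOAD_CYCLES + fp_cycles


def peak_mflops(clock_mhz, flops_per_element, input_vectors=2, fp_cycles=8):
    """Peak rate implied by the model, in MFLOPS."""
    flops = flops_per_element * LINE_WORDS
    return clock_mhz * flops / cycles_per_line(input_vectors, fp_cycles)


if __name__ == "__main__":
    # Dyad: one flop per element. Triad (assumed form): two flops per element.
    print("dyad : %.3f MFLOPS" % peak_mflops(20.0, 1))
    print("triad: %.3f MFLOPS" % peak_mflops(20.0, 2))
```

With a 20 MHz clock this gives 24 cycles per line, hence the MHz/3 and MHz/1.5 figures. Changing `RELOAD_CYCLES` or `fp_cycles` shows how sensitive the estimate is to those two guesses.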
My best observation to date for long vector operations from FORTRAN is 8.4 MFLOPS for triad operations (average vector length=666), well short of the 13.333 MFLOPS estimate. What other overheads prevent the attainment of this peak speed? Is it possible to execute a load from the integer unit to force the next desired cache line to load while the FP unit is busy?

--
John D. McCalpin                        mccalpin@perelandra.cms.udel.edu
Assistant Professor                     mccalpin@vax1.udel.edu
College of Marine Studies, U. Del.      J.MCCALPIN/OMNET