[comp.lang.fortran] Memory access optimization

mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (09/03/89)
I have been working on a piece of code that I would like to be able to
run on both SIMD parallel machines (e.g. the Connection Machine) and
on more traditional vector register machines. 

The code is a 3D hybrid finite-difference and finite-element
hydrodynamic model, for which the SIMD style of coding maps very
nicely to the physics. The SIMD coding style is both readable and
easily portable to the Connection Machine.

A trend that I have been seeing in the vector register machines is
that most new vector machines have insufficient memory bandwidth to
keep the vector units busy.  (Followups to this topic to comp.arch).
An excellent example of this is the Cray-2, along with all of the
"mini-supers" that I can think of (Convex, Alliant, and Ardent, for
example). I think this trend will continue in the shared-memory vector
machines that make up the mainline of scientific computers, since it
is relatively cheap to increase the speed or number of vector units
and relatively expensive to raise the memory bandwidth to match.

So what I have is lots of code that looks like this (in Fortran-8X):

	subroutine stuff ( u, v, output )
	real u(M,N,K), output(M,N,K)		! M,N,K are parameters
	real tmp1(M,N,K)			! full 3-D temporary
	! v enters only through other_stuff; c is a constant (not shown)

	tmp1 = (CSHIFT(u,1,DIM=1)+u)**2		! average u in x

	output = c*(tmp1-CSHIFT(tmp1,-1,DIM=1))	&	! d(u^2)/dx
		+ other_stuff_with_other_variables
	end subroutine stuff

It is pretty clear to a human reader that a 3-D tmp array is massive
overkill for this purpose, yet that is the "correct" coding style for
a SIMD parallel or memory-to-memory vector machine (e.g. the ETA-10).  On a
register-based machine, this uses lots of memory bandwidth
unnecessarily -- which can be a serious problem on a shared-memory
machine.  Best performance on register-based machines can be attained
by replacing the 3-D tmp array with a 1-D tmp array, and never
writing that array to memory.
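
To make the 1-D variant concrete, here is a minimal sketch of what the
hand-transformed code might look like.  It is not taken from the model
itself: the array sizes, the value of c, the name "row", and the loop
indices are assumptions made for the demo, and other_stuff is left out.
Whether a given compiler actually keeps "row" in a vector register is
up to that compiler; the sketch only shows that the two forms compute
the same thing.

```fortran
program row_temporary_sketch
  implicit none
  ! Small sizes chosen only for the demo (assumption, not from the post).
  integer, parameter :: M = 8, N = 4, K = 3
  ! The post leaves c undefined; 0.5 is an arbitrary stand-in.
  real, parameter :: c = 0.5
  real :: u(M,N,K), out_3d(M,N,K), out_row(M,N,K)
  real :: tmp1(M,N,K)      ! full 3-D temporary, as in the original code
  real :: row(M)           ! 1-D temporary holding one x-row at a time
  integer :: j, kk         ! kk rather than k, since Fortran ignores case

  call random_number(u)

  ! Reference: whole-array form with the 3-D temporary.
  tmp1   = (cshift(u, 1, dim=1) + u)**2
  out_3d = c*(tmp1 - cshift(tmp1, -1, dim=1))

  ! Row-at-a-time form: both shifts are along dimension 1, so each
  ! (j,kk) row can be processed independently of the others.
  do kk = 1, K
    do j = 1, N
      row = (cshift(u(:,j,kk), 1) + u(:,j,kk))**2
      out_row(:,j,kk) = c*(row - cshift(row, -1))
    end do
  end do

  ! The two results should agree to rounding error.
  print *, 'max abs difference:', maxval(abs(out_3d - out_row))
end program row_temporary_sketch
```

The transformation is legal here only because every shift is along the
first dimension; a shift along DIM=2 or DIM=3 would couple different
rows and need a different blocking.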

So the question is: What is the status of compiler or source-to-source
translator technology to eliminate extraneous memory traffic like
that shown above?

It would be nice if the compiler's optimizer recognized that each row
of tmp1 could be stored in a vector register until used in the
calculation of output, and then never stored any of tmp1.  The Ardent
compiler (the only one I tested) doesn't do this, though if the
statements are combined into one complicated Fortran statement, the
compiler does a very good job of managing the temporary arrays that it
creates.
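
For reference, one way the two statements could be combined is sketched
below.  This is an assumed form, not necessarily the exact statement
that was tested: it uses the fact that shifting tmp1 by -1 in x gives
(u + CSHIFT(u,-1,DIM=1))**2, so tmp1 never has to be named.  Sizes and
the value of c are again placeholders, and other_stuff is omitted.

```fortran
program single_statement_sketch
  implicit none
  ! Demo sizes and coefficient are assumptions, not from the post.
  integer, parameter :: M = 8, N = 4, K = 3
  real, parameter :: c = 0.5
  real :: u(M,N,K), tmp1(M,N,K), out_two(M,N,K), out_one(M,N,K)

  call random_number(u)

  ! Two-statement form with an explicit 3-D temporary.
  tmp1    = (cshift(u, 1, dim=1) + u)**2
  out_two = c*(tmp1 - cshift(tmp1, -1, dim=1))

  ! Single statement: element i of cshift(tmp1,-1,dim=1) is
  ! (u(i) + u(i-1))**2, which is (u + cshift(u,-1,dim=1))**2.
  out_one = c*((cshift(u, 1, dim=1) + u)**2 &
             - (u + cshift(u, -1, dim=1))**2)

  ! The two results should agree to rounding error.
  print *, 'max abs difference:', maxval(abs(out_two - out_one))
end program single_statement_sketch
```

The cost of this form is readability: with more terms in other_stuff
the single statement quickly becomes hard to follow, which is the
reason for wanting the optimizer to do the merging instead.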
--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
		   mccalpin@scri1.scri.fsu.edu
		   mccalpin@delocn.udel.edu