mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (09/03/89)
I have been working on a piece of code that I would like to be able to
run on both SIMD parallel machines (e.g. the Connection Machine) and
on more traditional vector register machines. The code is a 3D hybrid
finite-difference and finite-element hydrodynamic model, for which the
SIMD style of coding maps very nicely to the physics. The SIMD coding
style is both readable and easily portable to the Connection Machine.

A trend that I have been seeing in the vector register machines is
that most new vector machines have insufficient memory bandwidth to
keep the vector units busy. (Followups to this topic to comp.arch).
An excellent example of this is the Cray-2, along with all of the
"mini-supers" that I can think of (Convex, Alliant, and Ardent, for
example). I think this trend will continue in the shared-memory
vector machines that make up the mainline of scientific computers,
since it is relatively cheap to increase the speed or number of vector
units and relatively expensive to up the memory bandwidth to match.

So what I have is lots of code that looks like this (in Fortran-8X):

      subroutine stuff ( u, v, output )
      real u(M,N,K), output(M,N,K)        ! M,N,K are parameters
      real tmp1(M,N,K)

      tmp1 = (CSHIFT(u,1,DIM=1)+u)**2               ! average u in x
      output = c*(tmp1-CSHIFT(tmp1,-1,DIM=1))   &   ! d(u^2)/dx
                + other_stuff_with_other_variables

It is pretty clear to the human that a 3-D tmp array is massive
overkill for this purpose - yet that is the "correct" coding style for
a SIMD parallel or memory-to-memory vector machine (e.g. ETA-10).
On a register-based machine, this uses lots of memory bandwidth
unnecessarily -- which can be a serious problem on a shared-memory
machine. Best performance on register-based machines can be attained
by replacing the 3-D tmp array with a 1-D tmp array, and never writing
that array to memory.

So the question is:
    What is the status of compiler or source-to-source translator
    technology to eliminate extraneous memory traffic like that
    shown above?

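For concreteness, the hand-restructured form described above (a 1-D
temporary in place of the 3-D one) might look like the sketch below.
This is only an illustration under stated assumptions: the module name
row_temp_demo, the routine names stuff_3d and stuff_row, and passing c
as an argument are all made up for the example, and the
other_stuff_with_other_variables term is left out. Whether a given
compiler actually keeps the row temporary in a vector register is up
to that compiler; the sketch only shows the source-level restructuring.

```fortran
module row_temp_demo
  implicit none
contains

  ! Whole-array form, following the fragment in the post:
  ! tmp1 is a full 3-D temporary the same size as u.
  ! (Names and the explicit c argument are assumptions for this sketch.)
  subroutine stuff_3d(u, c, output)
    real, intent(in)  :: u(:,:,:)
    real, intent(in)  :: c
    real, intent(out) :: output(:,:,:)
    real :: tmp1(size(u,1), size(u,2), size(u,3))

    tmp1   = (cshift(u, 1, dim=1) + u)**2
    output = c*(tmp1 - cshift(tmp1, -1, dim=1))
  end subroutine stuff_3d

  ! Hand-restructured form: the shifts act only along dimension 1,
  ! so each (j,k) row can be processed independently with a 1-D
  ! temporary of length size(u,1).
  subroutine stuff_row(u, c, output)
    real, intent(in)  :: u(:,:,:)
    real, intent(in)  :: c
    real, intent(out) :: output(:,:,:)
    real :: row(size(u,1))
    integer :: j, k

    do k = 1, size(u,3)
      do j = 1, size(u,2)
        row = (cshift(u(:,j,k), 1) + u(:,j,k))**2
        output(:,j,k) = c*(row - cshift(row, -1))
      end do
    end do
  end subroutine stuff_row

end module row_temp_demo
```

Both routines should give the same output for any u, since both shifts
are along the first dimension only; the second needs M words of
temporary storage rather than M*N*K.
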
It would be nice if the compiler's optimizer recognized that each row
of tmp1 could be stored in a vector register until used in the
calculation of output, and then never stored any of tmp1. The Ardent
compiler (the only one I tested) doesn't do this, though if the
statements are combined into one complicated Fortran statement, the
compiler does a very good job of managing the temporary arrays that it
creates.
--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
                   mccalpin@scri1.scri.fsu.edu
                   mccalpin@delocn.udel.edu