prentice@triton.unm.edu (John Prentice) (03/30/91)
My thanks to those people who responsed to my query about how to collapse a three tier array into one. Unforunately, none of the responses were quite what I needed (I neglected to specify the constraints: 1) Fortran 77, 2) clean, 3) portable, 4) must calculate i,j,k, and 5) vectorizeable). The most serious of my ommissions was to neglect to say that I need the i,j,k indices. A solution I came up with that is vectorizeable is: program test parameter (imax=100,jmax=150,kmax=200) dimension a(imax,jmax,kmax) do 10 n=1,imax*jmax*kmax i=mod(n,imax) if(i.eq.0) i=imax j=n/imax+1 if(j*imax.gt.n) j=j-1 k=n/(imax*jmax)+1 if(k*imax*jmax.gt.n) k=k-1 a(i,j,k)= some function of i,j,k 10 continue end I would now appreciate it if people would tell me any pitfalls they see to this. In particular, is the compiler (cft77 4.0.2 is what I am using on a YMP) smart enough to realize that my stride is 1 here and therefore not do a gather/scatter operation when it is not necessary? One other question. What is the size of the memory banks on the YMP and hence what strides should you avoid? Also, suppose you have the following code: subroutine ralph (a,b,c,indx) dimension a(100),b(100),c(100),indx(100) do 10 n=1,100 a(n)=b(indx(n))*c(indx(n)) 10 continue end where indx(n) maps into 1 to 100. To vectorize this, the system has to do a gather/scatter. If indx(i) and indx(i+1) happen to point to locations in the same memory bank, will the system do anything to reorder the references or will this indeed cause a bank conflict? What about cache? Does the YMP have a cache and if so, what rules are there for maximizing performance with it? Many thanks. John -- John K. Prentice john@unmfys.unm.edu (Internet) Dept. of Physics and Astronomy, University of New Mexico, Albuquerque, NM, USA Computational Physics Group, Amparo Corporation, Albuquerque, NM, USA
mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (03/30/91)
>>>>> On 30 Mar 91 14:29:03 GMT, prentice@triton.unm.edu (John Prentice) said:
John> My thanks to those people who responsed to my query about how to
John> collapse a three tier array into one. Unforunately, none of the
John> responses were quite what I needed (I neglected to specify the
John> constraints: 1) Fortran 77, 2) clean, 3) portable, 4) must
John> calculate i,j,k, and 5) vectorizeable). The most serious of my
John> ommissions was to neglect to say that I need the i,j,k indices.
John> A solution I came up with that is vectorizeable is: [omitted]
Unless memory is very tight, the following solution is likely to be
*much* faster:
subroutine foobar
real a(imax,jmax,kmax)
real xi(imax,jmax,kmax),xj(imax,jmax,kmax),xk(imax,jmax,kmax)
logical first
save xi,xj,xk,first
data first /.true./
* initialize once
if (first) then
do k=1,kmax
do j=1,jmax
do xi=1,imax
xi(i,j,k) = float(i)
xj(i,j,k) = float(j)
xk(i,j,k) = float(k)
end do
end do
end do
first = .false.
endif
* now do work (presumably many times?)
do k=1,kmax
do j=1,jmax
do i=1,imax
a(i,j,k) = xi(i,j,k) + etc....
end do
end do
end do
etc......
John> One other question. What is the size of the memory banks on the
John> YMP and hence what strides should you avoid?
The number of banks depends on the machine configuration. Avoid
powers of 2 in your strides. 2 and 4 are probably OK, but higher
powers of 2 will likely cause performance penalties.
--
John D. McCalpin mccalpin@perelandra.cms.udel.edu
Assistant Professor mccalpin@brahms.udel.edu
College of Marine Studies, U. Del. J.MCCALPIN/OMNET
ftower@ncar.ucar.EDU (Francis Tower) (03/31/91)
Dear John, Most of the following I've taken from the Cray manual TR-OPT, CF77 & Standard C Features and Optimization. The rest, I just made up. 1. Each CPU on the Y-MP has 4 each buffers. Each can hold 128 instruction 'parcels'. However there is no data or instruction cache as others systems might use to speed up serial processing. 2. I don't believe 'fpp' will attempt to reorder your indices. Even if it did try, it wouldn't be able to do much because of the use of indirect indices. For irregular grids (many Finite Element problems) the indirect way is frequently used. The indicies are fixed, we know that, but the pre-compiler doesn't. 'fpp' or 'cft77' must assume the worst case. If there are bank conflicts, you eat them. 3. Each Y-MP CPU has 4 ports to memory so it's possible for each CPU to be accessing each of the Y-MP's 4 memory sections. Each memory section has 8 subsections. Depending on the amount of memory, each section is broken into banks. The minimum configuration (Y-MP 2/116 has 64 banks. The 4/132 has 128 banks, and the 8/432 upwards have 256 banks. Each bank has a 5 clock period cycle time, and all the banks are interleaved. The worst case is when the stride =0 or the number of banks you have. On a 256 bank machine: STRIDE Relative Performance 0 (or 256) 1 / 5 192 4 / 5 128 2 / 5 64 4 / 5 every-thing else 1 / 1 Sorry, I gotta run and play B-Ball. I'll Be back Francis G Tower Software QA NCAR/CGD/ICS << Middle-aged Mutant Ninja Modelers >> "Don't be misled by truth. Science is fact!"