[comp.lang.fortran] vectorization question reposed

prentice@triton.unm.edu (John Prentice) (03/30/91)

My thanks to those people who responsed to my query about how to
collapse a three tier array into one.  Unforunately, none of the
responses were quite what I needed (I neglected to specify the constraints:
1) Fortran 77, 2) clean, 3) portable, 4) must calculate i,j,k, and 5)
vectorizeable).  The most serious of my ommissions was to neglect to
say that I need the i,j,k indices.  A solution I came up with that is
vectorizeable is:


      program test
      parameter (imax=100,jmax=150,kmax=200)
      dimension a(imax,jmax,kmax)
      do 10 n=1,imax*jmax*kmax
          i=mod(n,imax)
          if(i.eq.0) i=imax
          j=n/imax+1
          if(j*imax.gt.n) j=j-1
          k=n/(imax*jmax)+1
          if(k*imax*jmax.gt.n) k=k-1
          a(i,j,k)= some function of i,j,k
   10 continue
      end

I would now appreciate it if people would tell me any pitfalls they
see to this.  In particular, is the compiler (cft77 4.0.2 is what I
am using on a YMP) smart enough to realize that my stride is 1 here
and therefore not do a gather/scatter operation when it is not necessary?

One other question.  What is the size of the memory banks on the 
YMP and hence what strides should you avoid?  Also, suppose you 
have the following code:


      subroutine ralph (a,b,c,indx)
      dimension a(100),b(100),c(100),indx(100)
      do 10 n=1,100
	  a(n)=b(indx(n))*c(indx(n))
   10 continue
      end

where indx(n) maps into 1 to 100.  To vectorize this, the system
has to do a gather/scatter.  If indx(i) and indx(i+1) happen to 
point to locations in the same memory bank, will the system do
anything to reorder the references or will this indeed cause
a bank conflict?  What about cache?  Does the YMP have a cache and
if so, what rules are there for maximizing performance with it?

Many thanks.

John
-- 
John K. Prentice    john@unmfys.unm.edu (Internet)
Dept. of Physics and Astronomy, University of New Mexico, Albuquerque, NM, USA
Computational Physics Group, Amparo Corporation, Albuquerque, NM, USA

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (03/30/91)

>>>>> On 30 Mar 91 14:29:03 GMT, prentice@triton.unm.edu (John Prentice) said:

John> My thanks to those people who responsed to my query about how to
John> collapse a three tier array into one.  Unforunately, none of the
John> responses were quite what I needed (I neglected to specify the
John> constraints: 1) Fortran 77, 2) clean, 3) portable, 4) must
John> calculate i,j,k, and 5) vectorizeable).  The most serious of my
John> ommissions was to neglect to say that I need the i,j,k indices.
John> A solution I came up with that is vectorizeable is: [omitted]

Unless memory is very tight, the following solution is likely to be
*much* faster:

	subroutine foobar
	real a(imax,jmax,kmax)
	real xi(imax,jmax,kmax),xj(imax,jmax,kmax),xk(imax,jmax,kmax)
	logical first
	save xi,xj,xk,first
	data first /.true./

*	initialize once
	if (first) then
	    do k=1,kmax
	        do j=1,jmax
		    do xi=1,imax
		        xi(i,j,k) = float(i)
		        xj(i,j,k) = float(j)
		        xk(i,j,k) = float(k)
		    end do
	        end do
	    end do
	    first = .false.
	endif

*	now do work (presumably many times?)
	do k=1,kmax
	    do j=1,jmax
		do i=1,imax
		    a(i,j,k) = xi(i,j,k) + etc....
		end do
	    end do
	end do
	etc......

John> One other question.  What is the size of the memory banks on the 
John> YMP and hence what strides should you avoid?

The number of banks depends on the machine configuration.  Avoid
powers of 2 in your strides.  2 and 4 are probably OK, but higher
powers of 2 will likely cause performance penalties.
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

ftower@ncar.ucar.EDU (Francis Tower) (03/31/91)

Dear John,

   Most of the following I've taken from the Cray manual TR-OPT,
CF77 & Standard C Features and Optimization. The rest, I just made
up. 


1.  Each CPU on the Y-MP has 4 each buffers.  Each can hold 128
instruction 'parcels'.  However there is no data or instruction
cache as others systems might use to speed up serial processing.

2.  I don't believe 'fpp' will attempt to reorder your indices.
Even if it did try, it wouldn't be able to do much because of
the use of indirect indices.  For irregular grids (many Finite 
Element problems)  the indirect way is frequently used.  The
indicies are fixed, we know that, but the pre-compiler doesn't.
'fpp' or 'cft77' must assume the worst case.  If there are 
bank conflicts, you eat them.

3.  Each Y-MP CPU has 4 ports to memory so it's possible for each
CPU to be accessing  each of the Y-MP's 4 memory sections.
Each memory section has 8 subsections.  Depending on the amount
of memory, each section is broken into banks.

The minimum configuration (Y-MP 2/116 has 64 banks. The 4/132 has
128 banks, and the 8/432 upwards have 256 banks.  Each bank has a
5 clock period cycle time, and all the banks are interleaved.  The
worst case is when the stride =0 or the number of banks you have.

On a 256 bank machine:

     STRIDE                    Relative Performance


      0 (or 256)                    1 / 5

      192                           4 / 5

      128                           2 / 5

      64                            4 / 5

   every-thing else                 1 / 1


Sorry, I gotta run and play B-Ball.  I'll Be back


Francis G Tower
Software QA
NCAR/CGD/ICS

<< Middle-aged Mutant Ninja Modelers >>
"Don't be misled by truth. Science is fact!"