[comp.lang.fortran] vectorization question

prentice@triton.unm.edu (John Prentice) (03/29/91)

Consider the following loop:

      do 30 k=1,kmax
	  do 20 j=1,jmax
	      do 10 i=1,imax
		  a(i,j,k)=...
   10         continue
   20     continue
   30 continue

On the Cray, only the inner most loop will vectorize.  Does anyone
have a suggestion for how to collapse this loop while still using
the three dimensional array?  In other words, I need a vectorized
equivalent to:

      i=0
      j=1
      k=1
      do 10 n=1,imax*jmax*kmax
	  i=i+1
	  if (i.gt.imax) then
	      i=1
	      j=j+1
	      if (j.gt.jmax) then
		  j=1
		  k=k+1
              endif
          endif
	  a(i,j,k)=...
   10 continue

Any suggestions would be welcomed.  Thanks.

John
-- 
John K. Prentice    john@unmfys.unm.edu (Internet)
Dept. of Physics and Astronomy, University of New Mexico, Albuquerque, NM, USA
Computational Physics Group, Amparo Corporation, Albuquerque, NM, USA

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (03/30/91)

>>>>> On 29 Mar 91 14:13:13 GMT, prentice@triton.unm.edu (John Prentice) said:

John> Consider the following loop:

John>     do 30 k=1,kmax
John>       do 20 j=1,jmax
John> 	      do 10 i=1,imax
John> 		a(i,j,k)=...
John>  10     continue
John>  20   continue
John>  30 continue

John> On the Cray, only the inner most loop will vectorize.  

That is not strictly true.  CFT77 will automatically collapse all
three loops if the arrays are all dimensioned (imax,jmax,*).

Furthermore, the parallelizer will strip-mine the collapsed version of
the loop in this case, which will make for the lowest possible
overhead....

John> Does anyone have a suggestion for how to collapse this loop
John> while still using the three dimensional array?  [....]

No matter what you do you will need to know the leading dimensions of
the arrays.  If (imax,jmax) are not the leading dimensions, then you
can vectorize over the whole array anyway and use a mask.  Whether or
not this will run faster than the inner-loop-vectorized code depends
on too many factors to talk about in general, but you need to take
into acount the relative sizes of imax and IDIM (etc), the complexity
of the RHS of the assignment statement, the absolute sizes of imax,
jmax, IDIM, JDIM, and perhaps a few more things.....
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

paco@rice.edu (Paul Havlak) (03/30/91)

In article <1991Mar29.141313.7418@ariel.unm.edu>, prentice@triton.unm.edu (John Prentice) writes:
|> Consider the following loop:
|> 
|>       do 30 k=1,kmax
|> 	  do 20 j=1,jmax
|> 	      do 10 i=1,imax
|> 		  a(i,j,k)=...
|>    10         continue
|>    20     continue
|>    30 continue
|> 
|> On the Cray, only the inner most loop will vectorize. 
|> ...

If the Cray compiler really doesn't catch that case, you should complain
loudly to Cray.  Multi-dimensional vectorization is not much harder than 
single-dimensional (in this case, both are trivial).  The PFC system at
Rice does multi-dimensional vectorization, as does the commercially available
KAP system from Kuck and Assoc.

The failure of compilers to catch such simple cases drives programmers to
try and trick the compiler (see the Perfect benchmark source for examples).
Unfortunately, what tricks one compiler confuses others (even those that could
have properly optimized the original code).

So please, before uglifying your code for a compiler, complain!  You may
still have to rewrite the code, but they might eventually get the message.

Good news about Fortran 90:  You can write the above loop in triplet notation:

	a(1:imax,1:jmax,1:kmax) = ...

Bad news about Fortran 90:  If "a" is a formal parameter array (dummy arg),
	it might not be contiguous (assuming the implementation of array
	sections as parameters is not copy-in/copy-out).  That inner loop
	will be hard to optimize if the stride between array elements is
	unknown.  Interprocedural analysis will help, but will it be enough?

Paul Havlak
"I'd rather optimize Fortran than write it."

bernhold@red8 (David E. Bernholdt) (03/30/91)

In article <1991Mar29.141313.7418@ariel.unm.edu> prentice@triton.unm.edu (John Prentice) writes:
>Consider the following loop:
>
>      do 30 k=1,kmax
>	  do 20 j=1,jmax
>	      do 10 i=1,imax
>		  a(i,j,k)=...
>   10         continue
>   20     continue
>   30 continue

Pretty obvious, and possibly not useful in this case, but...

How about passing a in as a 1-d array, dimension kmax*jmax*imax?

Problems:
1) if the physical dimension of the matrix is larger than the
computational dimension.
2) if the rhs has explicit dependence on i, j, k rather than the
meta-index.

If you get any better responses, please cc to me or post 'em.

-- 
David Bernholdt			bernhold@qtp.ufl.edu
Quantum Theory Project		bernhold@ufpine.bitnet
University of Florida
Gainesville, FL  32611		904/392 6365

morreale@bierstadt.scd.ucar.edu (Peter Morreale) (03/30/91)

In article <1991Mar29.141313.7418@ariel.unm.edu>, prentice@triton.unm.edu (John Prentice) writes:
> Consider the following loop:
> 
>       do 30 k=1,kmax
> 	  do 20 j=1,jmax
> 	      do 10 i=1,imax
> 		  a(i,j,k)=...
>    10         continue
>    20     continue
>    30 continue
> 
> On the Cray, only the inner most loop will vectorize.  Does anyone
> have a suggestion for how to collapse this loop while still using
> the three dimensional array?  In other words, I need a vectorized
> equivalent to:
> 

  Whether or not you can collaspe the loop structure depends entirely on
  whether i, j, or k is used on the right hand side of the assignment. 

  If the loop indices are not used, you could collaspe the loop as
  follows:

	DO 30 K = 1, IMAX*JMAX*KMAX
	   A(K,1,1) = <whatever>
     30 CONTINUE


  In any event, the Cray Fortran preprocessor (fpp) can help
  significantly in determining whether or not the loop can collaspe. 
  (since it *will* collaspe the loop if it can)  You can still trick
  fpp, but it's a terrific start.

  I'd suggest you run the routine through fpp and see what it produces;

	   % fpp sub.f > sub.fpp

-PWM
------------------------------------------------------------------
Peter W. Morreale                  email:  morreale@ncar.ucar.edu
Nat'l Center for Atmos Research    voice:  (303) 497-1293
Scientific Computing Division     
Consulting Office
------------------------------------------------------------------

bleikamp@convex.com (Richard Bleikamp) (03/30/91)

In article <1991Mar29.165126.11431@rice.edu> paco@rice.edu (Paul Havlak) writes:
>
>In article <1991Mar29.141313.7418@ariel.unm.edu>, prentice@triton.unm.edu (John Prentice) writes:
>|> Consider the following loop:
>|> 
>|>       do 30 k=1,kmax
>|> 	  do 20 j=1,jmax
>|> 	      do 10 i=1,imax
>|> 		  a(i,j,k)=...
>|>    10         continue
>|>    20     continue
>|>    30 continue
>|> 
>|> On the Cray, only the inner most loop will vectorize. 
>|> ...
>
> useful discussion deleted.
>
>Good news about Fortran 90:  You can write the above loop in triplet notation:
>
>	a(1:imax,1:jmax,1:kmax) = ...
>
> more discussion deleted.
>
>Paul Havlak

Note that substituting Fortran 90 array notation is ONLY valid when the
assignment statement "a(i,j,k)= ..." does NOT contain a loop carried dependency.

It can be non-trivial (in complicated loops) to decide if any dependencies
exist.  When they do, you can sometimes rewrite the loop nest as two or more
separate Fortran 90 assignment statements, but sometimes there is no
semantically equivalent array notation.  

--
------------------------------------------------------------------------------
Rich Bleikamp			    bleikamp@convex.com
Convex Computer Corporation

ftower@ncar.ucar.EDU (Francis Tower) (03/30/91)

Depending on what you're up to you could use:
      A(:,:,:) =  ...   or
      A = ...

for example:  A = 0.0   will zero the entire array (FORTRAN90 syntax)