[comp.sys.super] Fortran optimization - THE ANSWER!

fsjohnv@alliant1.lerc.nasa.gov (Richard Mulac) (04/05/91)

   First, thanks to all of you who took the time to analyze the problem I
posed (especially those of you who do or did work for Cray and have insight
into the internal strategy of the compiler).  I will try to summarize the
comments which were sent to me by mail, but before I do that let me paraphrase
the original posting:

   Last year I was attempting to optimize a code for the CRAY Y-MP by merging
several DO loops and ran into a situation which on the surface appeared to
contradict one of the old "rules" one abides by when trying to write efficient
Fortran code.  The reason I was doing this was that back in the CFT days if
you tried to make your loop too long you would break the optimization block
and the loop would not vectorize.  So now with CFT77 I was going to go back
and try to merge some loops hoping to reduce memory traffic and enhance
chaining.  The benefits can be seen in the following simplistic example:

 OLD CODE

       DO I=1,1000
       C(I) = A(I) * B(I)
       ENDDO
       DO I=1,1000
       E(I) = C(I) + D(I)
       ENDDO
       DO I=1,1000
       G(I) = E(I) * F(I)
       ENDDO

 NEW CODE

       DO I=1,1000
       C(I) = A(I) * B(I)
       E(I) = C(I) + D(I)
       G(I) = E(I) * F(I)
       ENDDO

   The new code should reduce memory access and should also be able to chain
the floating point add and multiply functional units together, leading to a
decrease in the computational time.  The contradiction that I found in my
real code can be described using the following program TEST, which calls
subroutines JOINED and SPLIT.

      PROGRAM TEST
      PARAMETER (IX=10000,JX=100,NX=100)
      COMMON /GLOBAL/
     .  A(IX,JX),B(IX,JX),C(IX,JX)
      CALL JOINED
      CALL SPLIT
      STOP
      END
      SUBROUTINE JOINED
      PARAMETER (IX=10000,JX=100,NX=100)
      COMMON /GLOBAL/
     .  A(IX,JX),B(IX,JX),C(IX,JX)
      DO N=1,NX
      DO I=1,IX
      A(I,  1) = A(I,  1) * B(I,  1) + C(I,  1)
      A(I,  2) = A(I,  2) * B(I,  2) + C(I,  2)
               :
      A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99)
      A(I,100) = A(I,100) * B(I,100) + C(I,100)
      A(I,  1) = A(I,  1) * B(I,  1) + C(I,  1)
      A(I,  2) = A(I,  2) * B(I,  2) + C(I,  2)
               :
      A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99)
      A(I,100) = A(I,100) * B(I,100) + C(I,100)
      ENDDO
      ENDDO
      RETURN
      END
      SUBROUTINE SPLIT
      PARAMETER (IX=10000,JX=100,NX=100)
      COMMON /GLOBAL/
     .  A(IX,JX),B(IX,JX),C(IX,JX)
      DO N=1,NX
      DO I=1,IX
      A(I,  1) = A(I,  1) * B(I,  1) + C(I,  1)
      A(I,  2) = A(I,  2) * B(I,  2) + C(I,  2)
               :
      A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99)
      A(I,100) = A(I,100) * B(I,100) + C(I,100)
      ENDDO
      DO I=1,IX
      A(I,  1) = A(I,  1) * B(I,  1) + C(I,  1)
      A(I,  2) = A(I,  2) * B(I,  2) + C(I,  2)
               :
      A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99)
      A(I,100) = A(I,100) * B(I,100) + C(I,100)
      ENDDO
      ENDDO
      RETURN
      END

   The code was compiled with the -o aggress option which vectorized the
inner loops in both JOINED and SPLIT.  It was expected that JOINED would run
faster and more efficiently than SPLIT, but the opposite result occurred.
Using the hardware performance monitor (hpm) I found that JOINED ran in 2.8
seconds at a rate of 140 Mflops, while SPLIT ran in 2.2 seconds at a rate
of 180 Mflops.  The results from this test program were similiar to those
I found when combining 10 to 50 line loops with complicated right hand sides
in the original code I was working on.
   Several people analyzed the program and found the problem to be due to the
large amount of address locations which are computed and saved for later use
within each loop (i.e., JOINED).  Because the Y-MP (or any other computer) has
a limited number of registers, a large amount of address calculations saved
within a loop will eventually be spilled from registers to memory.  This is
NOT a bug in the compiler, but simply the strategy chosen to optimize the
majority of code which will be seen by the compiler.  Most of the time it will
be more efficient to save address calculations which will be used again
later in the loop than to recalculate them, even though in some cases they may
spill out of the registers into memory.  This spillover could lead to an
increase in scalar memory references which would inhibit vector performance.
   This now leads to a grey area as to how "large" one should make their
loops in order to get the best return in performance.  Whatever the answer
is, old "rules" for optimal coding may not apply any more (This topic
expanded a bit to include other "rules" might make for a good article in Cray
Channels).
   One last note - In the original posting I did not mean to imply that this
effect was caused by a "bug" in the compiler.  The people that I had run this
by said they thought it was probably caused by a bug because they did not take
the time to analyze the problem.  I did not accept that it was a bug and thought
that maybe the experts out on the net could help me out (which you did).
Thanks again!

            Rick Mulac  ----  fsjohnv@alliant1.lerc.nasa.gov

preston@ariel.rice.edu (Preston Briggs) (04/05/91)

fsjohnv@alliant1.lerc.nasa.gov (Richard Mulac) writes:

>The contradiction that I found in my
>real code can be described using the following program TEST, which calls
>subroutines JOINED and SPLIT.
>
>      PROGRAM TEST
>      PARAMETER (IX=10000,JX=100,NX=100)
>      COMMON /GLOBAL/
>     .  A(IX,JX),B(IX,JX),C(IX,JX)
>      CALL JOINED
>      CALL SPLIT
>      STOP
>      END
>      SUBROUTINE JOINED
>      PARAMETER (IX=10000,JX=100,NX=100)
>      COMMON /GLOBAL/
>     .  A(IX,JX),B(IX,JX),C(IX,JX)
>      DO N=1,NX
>      DO I=1,IX
>      A(I,  1) = A(I,  1) * B(I,  1) + C(I,  1)
>      A(I,  2) = A(I,  2) * B(I,  2) + C(I,  2)
>               :
>      A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99)
>      A(I,100) = A(I,100) * B(I,100) + C(I,100)
>      A(I,  1) = A(I,  1) * B(I,  1) + C(I,  1)
>      A(I,  2) = A(I,  2) * B(I,  2) + C(I,  2)
>               :
>      A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99)
>      A(I,100) = A(I,100) * B(I,100) + C(I,100)
>      ENDDO
>      ENDDO
>      RETURN
>      END
>      SUBROUTINE SPLIT
>      PARAMETER (IX=10000,JX=100,NX=100)
>      COMMON /GLOBAL/
>     .  A(IX,JX),B(IX,JX),C(IX,JX)
>      DO N=1,NX
>      DO I=1,IX
>      A(I,  1) = A(I,  1) * B(I,  1) + C(I,  1)
>      A(I,  2) = A(I,  2) * B(I,  2) + C(I,  2)
>               :
>      A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99)
>      A(I,100) = A(I,100) * B(I,100) + C(I,100)
>      ENDDO
>      DO I=1,IX
>      A(I,  1) = A(I,  1) * B(I,  1) + C(I,  1)
>      A(I,  2) = A(I,  2) * B(I,  2) + C(I,  2)
>               :
>      A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99)
>      A(I,100) = A(I,100) * B(I,100) + C(I,100)
>      ENDDO
>      ENDDO
>      RETURN
>      END
>
>   The code was compiled with the -o aggress option which vectorized the
>inner loops in both JOINED and SPLIT.  It was expected that JOINED would run
>faster and more efficiently than SPLIT, but the opposite result occurred.
>Using the hardware performance monitor (hpm) I found that JOINED ran in 2.8
>seconds at a rate of 140 Mflops, while SPLIT ran in 2.2 seconds at a rate
>of 180 Mflops.  The results from this test program were similiar to those
>I found when combining 10 to 50 line loops with complicated right hand sides
>in the original code I was working on.
>   Several people analyzed the program and found the problem to be due to the
>large amount of address locations which are computed and saved for later use
>within each loop (i.e., JOINED).

Mulac concludes that the code is poorly written and there's is no compiler bug.
I agree that the code is poorly written, but I don't believe the cause
is addressing expressions.

Since A, B, and C all have the same dimensions,
we should be able to handle all the addressing in the inner loops
with a single index register and constant offsets.  All the addressing
expressions should be of the form

	I*8 + A + 0*100		I*8 + B + 0*100		I*8 + C + 0*100
	I*8 + A + 1*100		I*8 + B + 1*100		I*8 + C + 1*100
	...			...			...

Since I is always multiplied by 8, it should be strength reduced.
The rest of each addressing expression is constant.
So, index + constant.

If the code is indeed suffering from too many addressing expressions,
then Cray should look to their strength reduction and consider
reassociation of loop-invariants.

There are alternative explanations.

You're addressing A, B, and C with the 2nd index varying
(running from 1 to 100).  On a cache machine, this is a bad move.
I dunno about Crays.  Certainly you're causing non-unit stride accesses.

Further, if the compiler detects that A(I, 1) is unchanged
across the 99 intervening assignments, it may be trying to
hold 100 FP values in registers, optimistically awaiting
their reuse.

I would propose an alternative implementation

	DO N=1, NX
	  DO J = 1, JX
	    DO I=1, IX
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	    ENDDO
	  ENDDO
	ENDDO

We would hope the compiler will achieve the code below

	DO N=1, NX
	  DO J = 1, JX
	    DO I=1, IX
	      aij = A(I, J)
	      bij = B(I, J)
	      cij = C(I, J)
	      aij = aij * bij + cij
	      aij = aij * bij + cij
	      A(I, J) = aij
	    ENDDO
	  ENDDO
	ENDDO

If you must unroll, unroll the outermost loop, giving

	DO N=1, NX, 4
	  DO J = 1, JX
	    DO I=1, IX
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
	    ENDDO
	  ENDDO
	ENDDO

which the compiler will probably turn into

	DO N=1, NX, 4
	  DO J = 1, JX
	    DO I=1, IX
	      aij = A(I, J)
	      bij = B(I, J)
	      cij = C(I, J)
	      aij = aij * bij + cij
	      aij = aij * bij + cij
	      aij = aij * bij + cij
	      aij = aij * bij + cij
	      aij = aij * bij + cij
	      aij = aij * bij + cij
	      aij = aij * bij + cij
	      aij = aij * bij + cij
	      A(I, J) = aij
	    ENDDO
	  ENDDO
	ENDDO

Be aware though, that unrolling the outer loop and jamming the bodies isn't
always legal or the right thing to do.

Preston Briggs

preston@ariel.rice.edu (Preston Briggs) (04/05/91)

I wrote

>If you must unroll, unroll the outermost loop, giving
>
>	DO N=1, NX, 4
>	  DO J = 1, JX
>	    DO I=1, IX
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	      A(I, J) = A(I, J) * B(I, J) + C(I, J)
>	    ENDDO
>	  ENDDO
>	ENDDO

On further thought (!), I'd unroll the middle loop a little
(use moderation in your experiments).  Something like

	DO N=1, NX
	  DO J = 1, JX, 4
	    DO I=1, IX
	      A(I, J+0) = A(I, J+0) * B(I, J+0) + C(I, J+0)
	      A(I, J+0) = A(I, J+0) * B(I, J+0) + C(I, J+0)
	      A(I, J+1) = A(I, J+1) * B(I, J+1) + C(I, J+1)
	      A(I, J+1) = A(I, J+1) * B(I, J+1) + C(I, J+1)
	      A(I, J+2) = A(I, J+2) * B(I, J+2) + C(I, J+2)
	      A(I, J+2) = A(I, J+2) * B(I, J+2) + C(I, J+2)
	      A(I, J+3) = A(I, J+3) * B(I, J+3) + C(I, J+3)
	      A(I, J+3) = A(I, J+3) * B(I, J+3) + C(I, J+3)
	    ENDDO
	  ENDDO
	ENDDO

the idea being that the compiler would be better able to schedule
this stuff.  Instead of 1 expression, we now get 4 expressions
that can be run in parallel, hopefully filling the pipe lines.

Experiment a little with the amount of unrolling and see what happens.

Preston Briggs