fsjohnv@alliant1.lerc.nasa.gov (Richard Mulac) (04/05/91)
First, thanks to all of you who took the time to analyze the problem I posed (especially those of you who do or did work for Cray and have insight into the internal strategy of the compiler). I will try to summarize the comments which were sent to me by mail, but before I do that let me paraphrase the original posting: Last year I was attempting to optimize a code for the CRAY Y-MP by merging several DO loops and ran into a situation which on the surface appeared to contradict one of the old "rules" one abides by when trying to write efficient Fortran code. The reason I was doing this was that back in the CFT days if you tried to make your loop too long you would break the optimization block and the loop would not vectorize. So now with CFT77 I was going to go back and try to merge some loops hoping to reduce memory traffic and enhance chaining. The benefits can be seen in the following simplistic example: OLD CODE DO I=1,1000 C(I) = A(I) * B(I) ENDDO DO I=1,1000 E(I) = C(I) + D(I) ENDDO DO I=1,1000 G(I) = E(I) * F(I) ENDDO NEW CODE DO I=1,1000 C(I) = A(I) * B(I) E(I) = C(I) + D(I) G(I) = E(I) * F(I) ENDDO The new code should reduce memory access and should also be able to chain the floating point add and multiply functional units together, leading to a decrease in the computational time. The contradiction that I found in my real code can be described using the following program TEST, which calls subroutines JOINED and SPLIT. PROGRAM TEST PARAMETER (IX=10000,JX=100,NX=100) COMMON /GLOBAL/ . A(IX,JX),B(IX,JX),C(IX,JX) CALL JOINED CALL SPLIT STOP END SUBROUTINE JOINED PARAMETER (IX=10000,JX=100,NX=100) COMMON /GLOBAL/ . A(IX,JX),B(IX,JX),C(IX,JX) DO N=1,NX DO I=1,IX A(I, 1) = A(I, 1) * B(I, 1) + C(I, 1) A(I, 2) = A(I, 2) * B(I, 2) + C(I, 2) : A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99) A(I,100) = A(I,100) * B(I,100) + C(I,100) A(I, 1) = A(I, 1) * B(I, 1) + C(I, 1) A(I, 2) = A(I, 2) * B(I, 2) + C(I, 2) : A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99) A(I,100) = A(I,100) * B(I,100) + C(I,100) ENDDO ENDDO RETURN END SUBROUTINE SPLIT PARAMETER (IX=10000,JX=100,NX=100) COMMON /GLOBAL/ . A(IX,JX),B(IX,JX),C(IX,JX) DO N=1,NX DO I=1,IX A(I, 1) = A(I, 1) * B(I, 1) + C(I, 1) A(I, 2) = A(I, 2) * B(I, 2) + C(I, 2) : A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99) A(I,100) = A(I,100) * B(I,100) + C(I,100) ENDDO DO I=1,IX A(I, 1) = A(I, 1) * B(I, 1) + C(I, 1) A(I, 2) = A(I, 2) * B(I, 2) + C(I, 2) : A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99) A(I,100) = A(I,100) * B(I,100) + C(I,100) ENDDO ENDDO RETURN END The code was compiled with the -o aggress option which vectorized the inner loops in both JOINED and SPLIT. It was expected that JOINED would run faster and more efficiently than SPLIT, but the opposite result occurred. Using the hardware performance monitor (hpm) I found that JOINED ran in 2.8 seconds at a rate of 140 Mflops, while SPLIT ran in 2.2 seconds at a rate of 180 Mflops. The results from this test program were similiar to those I found when combining 10 to 50 line loops with complicated right hand sides in the original code I was working on. Several people analyzed the program and found the problem to be due to the large amount of address locations which are computed and saved for later use within each loop (i.e., JOINED). Because the Y-MP (or any other computer) has a limited number of registers, a large amount of address calculations saved within a loop will eventually be spilled from registers to memory. This is NOT a bug in the compiler, but simply the strategy chosen to optimize the majority of code which will be seen by the compiler. Most of the time it will be more efficient to save address calculations which will be used again later in the loop than to recalculate them, even though in some cases they may spill out of the registers into memory. This spillover could lead to an increase in scalar memory references which would inhibit vector performance. This now leads to a grey area as to how "large" one should make their loops in order to get the best return in performance. Whatever the answer is, old "rules" for optimal coding may not apply any more (This topic expanded a bit to include other "rules" might make for a good article in Cray Channels). One last note - In the original posting I did not mean to imply that this effect was caused by a "bug" in the compiler. The people that I had run this by said they thought it was probably caused by a bug because they did not take the time to analyze the problem. I did not accept that it was a bug and thought that maybe the experts out on the net could help me out (which you did). Thanks again! Rick Mulac ---- fsjohnv@alliant1.lerc.nasa.gov
preston@ariel.rice.edu (Preston Briggs) (04/05/91)
fsjohnv@alliant1.lerc.nasa.gov (Richard Mulac) writes: >The contradiction that I found in my >real code can be described using the following program TEST, which calls >subroutines JOINED and SPLIT. > > PROGRAM TEST > PARAMETER (IX=10000,JX=100,NX=100) > COMMON /GLOBAL/ > . A(IX,JX),B(IX,JX),C(IX,JX) > CALL JOINED > CALL SPLIT > STOP > END > SUBROUTINE JOINED > PARAMETER (IX=10000,JX=100,NX=100) > COMMON /GLOBAL/ > . A(IX,JX),B(IX,JX),C(IX,JX) > DO N=1,NX > DO I=1,IX > A(I, 1) = A(I, 1) * B(I, 1) + C(I, 1) > A(I, 2) = A(I, 2) * B(I, 2) + C(I, 2) > : > A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99) > A(I,100) = A(I,100) * B(I,100) + C(I,100) > A(I, 1) = A(I, 1) * B(I, 1) + C(I, 1) > A(I, 2) = A(I, 2) * B(I, 2) + C(I, 2) > : > A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99) > A(I,100) = A(I,100) * B(I,100) + C(I,100) > ENDDO > ENDDO > RETURN > END > SUBROUTINE SPLIT > PARAMETER (IX=10000,JX=100,NX=100) > COMMON /GLOBAL/ > . A(IX,JX),B(IX,JX),C(IX,JX) > DO N=1,NX > DO I=1,IX > A(I, 1) = A(I, 1) * B(I, 1) + C(I, 1) > A(I, 2) = A(I, 2) * B(I, 2) + C(I, 2) > : > A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99) > A(I,100) = A(I,100) * B(I,100) + C(I,100) > ENDDO > DO I=1,IX > A(I, 1) = A(I, 1) * B(I, 1) + C(I, 1) > A(I, 2) = A(I, 2) * B(I, 2) + C(I, 2) > : > A(I, 99) = A(I, 99) * B(I, 99) + C(I, 99) > A(I,100) = A(I,100) * B(I,100) + C(I,100) > ENDDO > ENDDO > RETURN > END > > The code was compiled with the -o aggress option which vectorized the >inner loops in both JOINED and SPLIT. It was expected that JOINED would run >faster and more efficiently than SPLIT, but the opposite result occurred. >Using the hardware performance monitor (hpm) I found that JOINED ran in 2.8 >seconds at a rate of 140 Mflops, while SPLIT ran in 2.2 seconds at a rate >of 180 Mflops. The results from this test program were similiar to those >I found when combining 10 to 50 line loops with complicated right hand sides >in the original code I was working on. > Several people analyzed the program and found the problem to be due to the >large amount of address locations which are computed and saved for later use >within each loop (i.e., JOINED). Mulac concludes that the code is poorly written and there's is no compiler bug. I agree that the code is poorly written, but I don't believe the cause is addressing expressions. Since A, B, and C all have the same dimensions, we should be able to handle all the addressing in the inner loops with a single index register and constant offsets. All the addressing expressions should be of the form I*8 + A + 0*100 I*8 + B + 0*100 I*8 + C + 0*100 I*8 + A + 1*100 I*8 + B + 1*100 I*8 + C + 1*100 ... ... ... Since I is always multiplied by 8, it should be strength reduced. The rest of each addressing expression is constant. So, index + constant. If the code is indeed suffering from too many addressing expressions, then Cray should look to their strength reduction and consider reassociation of loop-invariants. There are alternative explanations. You're addressing A, B, and C with the 2nd index varying (running from 1 to 100). On a cache machine, this is a bad move. I dunno about Crays. Certainly you're causing non-unit stride accesses. Further, if the compiler detects that A(I, 1) is unchanged across the 99 intervening assignments, it may be trying to hold 100 FP values in registers, optimistically awaiting their reuse. I would propose an alternative implementation DO N=1, NX DO J = 1, JX DO I=1, IX A(I, J) = A(I, J) * B(I, J) + C(I, J) A(I, J) = A(I, J) * B(I, J) + C(I, J) ENDDO ENDDO ENDDO We would hope the compiler will achieve the code below DO N=1, NX DO J = 1, JX DO I=1, IX aij = A(I, J) bij = B(I, J) cij = C(I, J) aij = aij * bij + cij aij = aij * bij + cij A(I, J) = aij ENDDO ENDDO ENDDO If you must unroll, unroll the outermost loop, giving DO N=1, NX, 4 DO J = 1, JX DO I=1, IX A(I, J) = A(I, J) * B(I, J) + C(I, J) A(I, J) = A(I, J) * B(I, J) + C(I, J) A(I, J) = A(I, J) * B(I, J) + C(I, J) A(I, J) = A(I, J) * B(I, J) + C(I, J) A(I, J) = A(I, J) * B(I, J) + C(I, J) A(I, J) = A(I, J) * B(I, J) + C(I, J) A(I, J) = A(I, J) * B(I, J) + C(I, J) A(I, J) = A(I, J) * B(I, J) + C(I, J) ENDDO ENDDO ENDDO which the compiler will probably turn into DO N=1, NX, 4 DO J = 1, JX DO I=1, IX aij = A(I, J) bij = B(I, J) cij = C(I, J) aij = aij * bij + cij aij = aij * bij + cij aij = aij * bij + cij aij = aij * bij + cij aij = aij * bij + cij aij = aij * bij + cij aij = aij * bij + cij aij = aij * bij + cij A(I, J) = aij ENDDO ENDDO ENDDO Be aware though, that unrolling the outer loop and jamming the bodies isn't always legal or the right thing to do. Preston Briggs
preston@ariel.rice.edu (Preston Briggs) (04/05/91)
I wrote >If you must unroll, unroll the outermost loop, giving > > DO N=1, NX, 4 > DO J = 1, JX > DO I=1, IX > A(I, J) = A(I, J) * B(I, J) + C(I, J) > A(I, J) = A(I, J) * B(I, J) + C(I, J) > A(I, J) = A(I, J) * B(I, J) + C(I, J) > A(I, J) = A(I, J) * B(I, J) + C(I, J) > A(I, J) = A(I, J) * B(I, J) + C(I, J) > A(I, J) = A(I, J) * B(I, J) + C(I, J) > A(I, J) = A(I, J) * B(I, J) + C(I, J) > A(I, J) = A(I, J) * B(I, J) + C(I, J) > ENDDO > ENDDO > ENDDO On further thought (!), I'd unroll the middle loop a little (use moderation in your experiments). Something like DO N=1, NX DO J = 1, JX, 4 DO I=1, IX A(I, J+0) = A(I, J+0) * B(I, J+0) + C(I, J+0) A(I, J+0) = A(I, J+0) * B(I, J+0) + C(I, J+0) A(I, J+1) = A(I, J+1) * B(I, J+1) + C(I, J+1) A(I, J+1) = A(I, J+1) * B(I, J+1) + C(I, J+1) A(I, J+2) = A(I, J+2) * B(I, J+2) + C(I, J+2) A(I, J+2) = A(I, J+2) * B(I, J+2) + C(I, J+2) A(I, J+3) = A(I, J+3) * B(I, J+3) + C(I, J+3) A(I, J+3) = A(I, J+3) * B(I, J+3) + C(I, J+3) ENDDO ENDDO ENDDO the idea being that the compiler would be better able to schedule this stuff. Instead of 1 expression, we now get 4 expressions that can be run in parallel, hopefully filling the pipe lines. Experiment a little with the amount of unrolling and see what happens. Preston Briggs