preston@titan.rice.edu (Preston Briggs) (02/11/90)
In article <4616@brazos.Rice.edu> I wrote:

>I'm still hoping to see somebody recode matrix multiply in C, using
>all the source hacks they want, so we can compare it to what we get
>using Fortran and our (not mine, friends') latest transformations.

In article <1990Feb10.160605.25254@ux1.cso.uiuc.edu>
mcdonald@aries.scs.uiuc.edu (Doug McDonald) replied:

>double a[200][200], b[200][200], c[200][200];
>
>main()
>{
>    int i, j;
>
>    for (i = 0; i < 200; i++) {
>        for (j = 0; j < 200; j++) {
>            b[i][j] = j;
>            c[i][j] = i;
>        }
>    }
>    mm(a, b, c, 200);
>}
>
>mm(a, b, c, n)
>double *a, *b, *c;
>int n;
>{
>    double t;
>    double *bb, *cc;
>    int i, j, k;
>
>    for (i = 0; i < n; i++) {
>        for (j = 0; j < n; j++) {
>            t = 0;
>            bb = b + n*i;
>            cc = c + j;
>            for (k = 0; k < n; k++) {
>                t += *bb++ * *cc;
>                cc += n;
>            }
>            a[i*n + j] = t;
>        }
>    }
>}

A less trusting person might have moved n*i out of the j loop, where it
occurs twice (loop-invariant code motion).  Additionally, 3 registers could
be saved by controlling the for loops with the pointers themselves (linear
function test replacement).  Finally, we might limit the scope of the
variables and add register declarations.  Nevertheless, the code is
reasonable and (given the MIPS optimizer to clean up the nits I mentioned)
should run pretty fast.

I reran DMcD's tests, also on a MIPS M/120, but using the system time
command.

    Fortran gave 18.1, 19.0, and 18.1 seconds on 3 runs.
    C gave 11.9, 12.1, and 11.5 seconds on 3 runs.

The problem with the Fortran is a weakness in the strength reduction
algorithm.  It was reported (months ago), and newer releases of the
compiler are said to have fixed it.  Pretty embarrassing for my arguments!

Inlining the mm routine in both examples (which avoids the weakness) gives
improved numbers for the Fortran and similar numbers for the C.

    Fortran gave 11.2, 10.9, and 10.9 seconds on 3 runs.
    C gave 11.7, 11.5, and 11.5 seconds on 3 runs.
Using a system for source-to-source transformation developed at Rice, we
can automatically transform the Fortran to run in 6.2 seconds, nearly a
factor of 2 over the C version.  For larger matrices (say 500x500), the
improvement approaches a factor of 3.

So why can't we do this in C?  You can, if you know how and feel up to the
effort; I'll show you.  Here's what the Fortran gets changed into:

      DO 20 J = 1, MOD(N, 3), 1
         DO 1001 I = 1, MOD(N, 9), 1
            A(I, J) = 0.0
            IF (1 .GT. N) GOTO 1006
            A$1$0 = A(I, J) + B(I, 1)*C(1, J)
            DO 1002 K = 2, N, 1
 1002          A$1$0 = A$1$0 + B(I, K)*C(K, J)
            A(I, J) = A$1$0
 1006       CONTINUE
 1001    CONTINUE
         DO 1022 I = MOD(N, 9) + 1, N, 9
            A(I, J) = 0.0
            A(I + 1, J) = 0.0
            A(I + 2, J) = 0.0
            A(I + 3, J) = 0.0
            A(I + 4, J) = 0.0
            A(I + 5, J) = 0.0
            A(I + 6, J) = 0.0
            A(I + 7, J) = 0.0
            A(I + 8, J) = 0.0
            IF (1 .GT. N) GOTO 1008
            C$1$0 = C(1, J)
            A$1$0 = A(I, J) + B(I, 1)*C$1$0
            A$2$0 = A(I + 1, J) + B(I + 1, 1)*C$1$0
            A$3$0 = A(I + 2, J) + B(I + 2, 1)*C$1$0
            A$4$0 = A(I + 3, J) + B(I + 3, 1)*C$1$0
            A$5$0 = A(I + 4, J) + B(I + 4, 1)*C$1$0
            A$6$0 = A(I + 5, J) + B(I + 5, 1)*C$1$0
            A$7$0 = A(I + 6, J) + B(I + 6, 1)*C$1$0
            A$8$0 = A(I + 7, J) + B(I + 7, 1)*C$1$0
            A$9$0 = A(I + 8, J) + B(I + 8, 1)*C$1$0
            DO 1023 K = 2, N, 1
               C$1$0 = C(K, J)
               A$1$0 = A$1$0 + B(I, K)*C$1$0
               A$2$0 = A$2$0 + B(I + 1, K)*C$1$0
               A$3$0 = A$3$0 + B(I + 2, K)*C$1$0
               A$4$0 = A$4$0 + B(I + 3, K)*C$1$0
               A$5$0 = A$5$0 + B(I + 4, K)*C$1$0
               A$6$0 = A$6$0 + B(I + 5, K)*C$1$0
               A$7$0 = A$7$0 + B(I + 6, K)*C$1$0
               A$8$0 = A$8$0 + B(I + 7, K)*C$1$0
 1023          A$9$0 = A$9$0 + B(I + 8, K)*C$1$0
            A(I + 8, J) = A$9$0
            A(I + 7, J) = A$8$0
            A(I + 6, J) = A$7$0
            A(I + 5, J) = A$6$0
            A(I + 4, J) = A$5$0
            A(I + 3, J) = A$4$0
            A(I + 2, J) = A$3$0
            A(I + 1, J) = A$2$0
            A(I, J) = A$1$0
 1008       CONTINUE
 1022    CONTINUE
   20 CONTINUE
      DO 1021 J = MOD(N, 3) + 1, N, 3
         DO 1024 I = 1, MOD(N, 3), 1
            A(I, J) = 0.0
            A(I, J + 1) = 0.0
            A(I, J + 2) = 0.0
            IF (1 .GT. N) GOTO 1009
            B$1$0 = B(I, 1)
            A$1$0 = A(I, J) + B$1$0*C(1, J)
            A$2$0 = A(I, J + 1) + B$1$0*C(1, J + 1)
            A$3$0 = A(I, J + 2) + B$1$0*C(1, J + 2)
            DO 1025 K = 2, N, 1
               B$1$0 = B(I, K)
               A$1$0 = A$1$0 + B$1$0*C(K, J)
               A$2$0 = A$2$0 + B$1$0*C(K, J + 1)
 1025          A$3$0 = A$3$0 + B$1$0*C(K, J + 2)
            A(I, J + 2) = A$3$0
            A(I, J + 1) = A$2$0
            A(I, J) = A$1$0
 1009       CONTINUE
 1024    CONTINUE
         DO 1027 I = MOD(N, 3) + 1, N, 3
            A(I, J) = 0.0
            A(I, J + 1) = 0.0
            A(I, J + 2) = 0.0
            A(I + 1, J) = 0.0
            A(I + 1, J + 1) = 0.0
            A(I + 1, J + 2) = 0.0
            A(I + 2, J) = 0.0
            A(I + 2, J + 1) = 0.0
            A(I + 2, J + 2) = 0.0
            IF (1 .GT. N) GOTO 1010
            C$1$0 = C(1, J)
            B$1$0 = B(I, 1)
            A$1$0 = A(I, J) + B$1$0*C$1$0
            C$2$0 = C(1, J + 1)
            A$2$0 = A(I, J + 1) + B$1$0*C$2$0
            C$3$0 = C(1, J + 2)
            A$3$0 = A(I, J + 2) + B$1$0*C$3$0
            B$2$0 = B(I + 1, 1)
            A$4$0 = A(I + 1, J) + B$2$0*C$1$0
            A$5$0 = A(I + 1, J + 1) + B$2$0*C$2$0
            A$6$0 = A(I + 1, J + 2) + B$2$0*C$3$0
            A$7$0 = A(I + 2, J) + B(I + 2, 1)*C$1$0
            A(I + 2, J + 1) = A(I + 2, J + 1) + B(I + 2, 1)*C$2$0
            A(I + 2, J + 2) = A(I + 2, J + 2) + B(I + 2, 1)*C$3$0
            DO 1028 K = 2, N, 1
               C$1$0 = C(K, J)
               B$1$0 = B(I, K)
               A$1$0 = A$1$0 + B$1$0*C$1$0
               C$2$0 = C(K, J + 1)
               A$2$0 = A$2$0 + B$1$0*C$2$0
               C$3$0 = C(K, J + 2)
               A$3$0 = A$3$0 + B$1$0*C$3$0
               B$2$0 = B(I + 1, K)
               A$4$0 = A$4$0 + B$2$0*C$1$0
               A$5$0 = A$5$0 + B$2$0*C$2$0
               A$6$0 = A$6$0 + B$2$0*C$3$0
               A$7$0 = A$7$0 + B(I + 2, K)*C$1$0
               A(I + 2, J + 1) = A(I + 2, J + 1) + B(I + 2, K)*C$2$0
 1028          A(I + 2, J + 2) = A(I + 2, J + 2) + B(I + 2, K)*C$3$0
            A(I + 2, J) = A$7$0
            A(I + 1, J + 2) = A$6$0
            A(I + 1, J + 1) = A$5$0
            A(I + 1, J) = A$4$0
            A(I, J + 2) = A$3$0
            A(I, J + 1) = A$2$0
            A(I, J) = A$1$0
 1010       CONTINUE
 1027    CONTINUE
 1021 CONTINUE

Now isn't that special.  Aren't you glad you didn't try to do it in C?

Of course, we still have to trust the MIPS compiler to do the appropriate
CSE, strength reduction, and register allocation.  Fortunately, it does
the right thing.

Why does this mess run fast?
Basically, we're keeping portions of the arrays in registers to avoid
memory accesses.  The original loop does 1 load for every FP operation.
The transformed version does 5 loads for every 9 FP operations (with more
registers, this could have been improved slightly).  Since they both do
the same number of FP operations (16 million), the transformed version
executes many fewer loads (about 9 million vs. 16 million).  Additionally,
the larger basic block in the inner-most loop can be scheduled more
effectively, getting more benefit from the asynchronous FP units and the
load pipeline.

The bottom portion of the code has been "unrolled and jammed" several
times.  The upper portion is inserted to handle the extra iterations that
don't match the unroll amounts.  Finally, the outer two loops have been
interchanged to improve cache performance.

So, what points am I trying to make?

o  Sufficiently powerful optimizers can do more than you want to do
   by hand.

o  In some cases, they can do things you've never thought of trying.

o  Certain Fortran restrictions, such as "parameters can't be aliased",
   enable some fairly hairy transformations that would be otherwise
   impossible.

o  Lots of registers can be useful, just like lots of cache.

o  Commonly available optimizers can be improved, sometimes dramatically
   (2 to 3 times).

Credit is due to the people who originated and developed these ideas:
Randy Allen (Ardent), David Callahan (Tera), Steve Carr (Rice),
John Cocke (IBM), and Ken Kennedy (Rice).