preston@titan.rice.edu (Preston Briggs) (02/11/90)
In article <4616@brazos.Rice.edu> I wrote:

>I'm still hoping to see somebody recode matrix multiply in C, using
>all the source hacks they want, so we can compare it to what we get
>using Fortran and our (not mine, friends') latest transformations.

In article <1990Feb10.160605.25254@ux1.cso.uiuc.edu>
mcdonald@aries.scs.uiuc.edu (Doug McDonald) replied:

>double a[200][200], b[200][200], c[200][200];
>
>main()
>{
>    int i, j;
>
>    for (i = 0; i < 200; i++) {
>        for (j = 0; j < 200; j++) {
>            b[i][j] = j;
>            c[i][j] = i;
>        }
>    }
>    mm(a, b, c, 200);
>}
>
>mm(a, b, c, n)
>double *a, *b, *c;
>int n;
>{
>    double t;
>    double *bb, *cc;
>    int i, j, k;
>
>    for (i = 0; i < n; i++) {
>        for (j = 0; j < n; j++) {
>            t = 0;
>            bb = b + n*i;
>            cc = c + j;
>            for (k = 0; k < n; k++) {
>                t += *bb++ * *cc;
>                cc += n;
>            }
>            a[i*n + j] = t;
>        }
>    }
>}

A less trusting person might have moved n*i out of the j loop, where it
occurs twice (loop-invariant code motion).  Additionally, 3 registers could
be saved by controlling the for loops with the pointers themselves (linear
function test replacement).  Finally, we might limit the scope of the
variables and add register declarations.  Nevertheless, the code is
reasonable and (given the MIPS optimizer to clean up the nits I mentioned)
should run pretty fast.

I reran DMcD's tests, also on a MIPS M/120, but using the system time
command.

    Fortran gave 18.1, 19.0, and 18.1 seconds on 3 runs.
    C gave 11.9, 12.1, and 11.5 seconds on 3 runs.

The problem with the Fortran is a weakness in the strength reduction
algorithm.  It was reported (months ago), and newer releases of the
compiler are said to have fixed it.  Pretty embarrassing for my arguments!

Inlining the mm routine in both examples (which avoids the weakness) gives
improved numbers for the Fortran and similar numbers for the C.

    Fortran gave 11.2, 10.9, and 10.9 seconds on 3 runs.
    C gave 11.7, 11.5, and 11.5 seconds on 3 runs.
Using a system for source-to-source transformation developed at Rice, we
can automatically transform the Fortran to run in 6.2 seconds, nearly a
factor of 2 over the C version.  For larger matrices (say 500x500), the
improvement approaches a factor of 3.

So why can't we do this in C?  You can, if you know how and feel up to the
effort; I'll show you.  Here's what the Fortran gets changed into:

      DO 20 J = 1, MOD(N, 3), 1
         DO 1001 I = 1, MOD(N, 9), 1
            A(I, J) = 0.0
            IF (1 .GT. N) GOTO 1006
            A$1$0 = A(I, J) + B(I, 1)*C(1, J)
            DO 1002 K = 2, N, 1
 1002          A$1$0 = A$1$0 + B(I, K)*C(K, J)
            A(I, J) = A$1$0
 1006       CONTINUE
 1001    CONTINUE
         DO 1022 I = MOD(N, 9) + 1, N, 9
            A(I, J) = 0.0
            A(I + 1, J) = 0.0
            A(I + 2, J) = 0.0
            A(I + 3, J) = 0.0
            A(I + 4, J) = 0.0
            A(I + 5, J) = 0.0
            A(I + 6, J) = 0.0
            A(I + 7, J) = 0.0
            A(I + 8, J) = 0.0
            IF (1 .GT. N) GOTO 1008
            C$1$0 = C(1, J)
            A$1$0 = A(I, J) + B(I, 1)*C$1$0
            A$2$0 = A(I + 1, J) + B(I + 1, 1)*C$1$0
            A$3$0 = A(I + 2, J) + B(I + 2, 1)*C$1$0
            A$4$0 = A(I + 3, J) + B(I + 3, 1)*C$1$0
            A$5$0 = A(I + 4, J) + B(I + 4, 1)*C$1$0
            A$6$0 = A(I + 5, J) + B(I + 5, 1)*C$1$0
            A$7$0 = A(I + 6, J) + B(I + 6, 1)*C$1$0
            A$8$0 = A(I + 7, J) + B(I + 7, 1)*C$1$0
            A$9$0 = A(I + 8, J) + B(I + 8, 1)*C$1$0
            DO 1023 K = 2, N, 1
               C$1$0 = C(K, J)
               A$1$0 = A$1$0 + B(I, K)*C$1$0
               A$2$0 = A$2$0 + B(I + 1, K)*C$1$0
               A$3$0 = A$3$0 + B(I + 2, K)*C$1$0
               A$4$0 = A$4$0 + B(I + 3, K)*C$1$0
               A$5$0 = A$5$0 + B(I + 4, K)*C$1$0
               A$6$0 = A$6$0 + B(I + 5, K)*C$1$0
               A$7$0 = A$7$0 + B(I + 6, K)*C$1$0
               A$8$0 = A$8$0 + B(I + 7, K)*C$1$0
 1023          A$9$0 = A$9$0 + B(I + 8, K)*C$1$0
            A(I + 8, J) = A$9$0
            A(I + 7, J) = A$8$0
            A(I + 6, J) = A$7$0
            A(I + 5, J) = A$6$0
            A(I + 4, J) = A$5$0
            A(I + 3, J) = A$4$0
            A(I + 2, J) = A$3$0
            A(I + 1, J) = A$2$0
            A(I, J) = A$1$0
 1008       CONTINUE
 1022    CONTINUE
   20 CONTINUE
      DO 1021 J = MOD(N, 3) + 1, N, 3
         DO 1024 I = 1, MOD(N, 3), 1
            A(I, J) = 0.0
            A(I, J + 1) = 0.0
            A(I, J + 2) = 0.0
            IF (1 .GT. N) GOTO 1009
            B$1$0 = B(I, 1)
            A$1$0 = A(I, J) + B$1$0*C(1, J)
            A$2$0 = A(I, J + 1) + B$1$0*C(1, J + 1)
            A$3$0 = A(I, J + 2) + B$1$0*C(1, J + 2)
            DO 1025 K = 2, N, 1
               B$1$0 = B(I, K)
               A$1$0 = A$1$0 + B$1$0*C(K, J)
               A$2$0 = A$2$0 + B$1$0*C(K, J + 1)
 1025          A$3$0 = A$3$0 + B$1$0*C(K, J + 2)
            A(I, J + 2) = A$3$0
            A(I, J + 1) = A$2$0
            A(I, J) = A$1$0
 1009       CONTINUE
 1024    CONTINUE
         DO 1027 I = MOD(N, 3) + 1, N, 3
            A(I, J) = 0.0
            A(I, J + 1) = 0.0
            A(I, J + 2) = 0.0
            A(I + 1, J) = 0.0
            A(I + 1, J + 1) = 0.0
            A(I + 1, J + 2) = 0.0
            A(I + 2, J) = 0.0
            A(I + 2, J + 1) = 0.0
            A(I + 2, J + 2) = 0.0
            IF (1 .GT. N) GOTO 1010
            C$1$0 = C(1, J)
            B$1$0 = B(I, 1)
            A$1$0 = A(I, J) + B$1$0*C$1$0
            C$2$0 = C(1, J + 1)
            A$2$0 = A(I, J + 1) + B$1$0*C$2$0
            C$3$0 = C(1, J + 2)
            A$3$0 = A(I, J + 2) + B$1$0*C$3$0
            B$2$0 = B(I + 1, 1)
            A$4$0 = A(I + 1, J) + B$2$0*C$1$0
            A$5$0 = A(I + 1, J + 1) + B$2$0*C$2$0
            A$6$0 = A(I + 1, J + 2) + B$2$0*C$3$0
            A$7$0 = A(I + 2, J) + B(I + 2, 1)*C$1$0
            A(I + 2, J + 1) = A(I + 2, J + 1) + B(I + 2, 1)*C$2$0
            A(I + 2, J + 2) = A(I + 2, J + 2) + B(I + 2, 1)*C$3$0
            DO 1028 K = 2, N, 1
               C$1$0 = C(K, J)
               B$1$0 = B(I, K)
               A$1$0 = A$1$0 + B$1$0*C$1$0
               C$2$0 = C(K, J + 1)
               A$2$0 = A$2$0 + B$1$0*C$2$0
               C$3$0 = C(K, J + 2)
               A$3$0 = A$3$0 + B$1$0*C$3$0
               B$2$0 = B(I + 1, K)
               A$4$0 = A$4$0 + B$2$0*C$1$0
               A$5$0 = A$5$0 + B$2$0*C$2$0
               A$6$0 = A$6$0 + B$2$0*C$3$0
               A$7$0 = A$7$0 + B(I + 2, K)*C$1$0
               A(I + 2, J + 1) = A(I + 2, J + 1) + B(I + 2, K)*C$2$0
 1028          A(I + 2, J + 2) = A(I + 2, J + 2) + B(I + 2, K)*C$3$0
            A(I + 2, J) = A$7$0
            A(I + 1, J + 2) = A$6$0
            A(I + 1, J + 1) = A$5$0
            A(I + 1, J) = A$4$0
            A(I, J + 2) = A$3$0
            A(I, J + 1) = A$2$0
            A(I, J) = A$1$0
 1010       CONTINUE
 1027    CONTINUE
 1021 CONTINUE

Now isn't that special.  Aren't you glad you didn't try to do it in C?

Of course, we still have to trust the MIPS compiler to do the appropriate
CSE, strength reduction, and register allocation.  Fortunately, it does
the right thing.

Why does this mess run fast?
Basically, we're keeping portions of the arrays in registers to avoid
memory accesses.  The original loop does 1 load for every FP operation.
The transformed version does 5 loads for every 9 FP operations (with more
registers, this could have been improved slightly).  Since they both do
the same number of FP operations (16 million), the transformed version
executes many fewer loads (about 9 million vs. 16 million).  Additionally,
the larger basic block in the inner-most loop can be scheduled more
effectively, getting more benefit from the asynchronous FP units and the
load pipeline.

The bottom portion of the code has been "unrolled and jammed" several
times.  The upper portion is inserted to handle the extra iterations that
don't match the unroll amounts.  Finally, the outer two loops have been
interchanged to improve cache performance.

So, what points am I trying to make?

o  Sufficiently powerful optimizers can do more than you want to do
   by hand.

o  In some cases, they can do things you've never thought of trying.

o  Certain Fortran restrictions, such as "parameters can't be aliased",
   enable some fairly hairy transformations that would be otherwise
   impossible.

o  Lots of registers can be useful, just like lots of cache.

o  Commonly available optimizers can be improved, sometimes dramatically
   (2 to 3 times).

Credit is due to the people who originated and developed these ideas:
Randy Allen (Ardent), David Callahan (Tera), Steve Carr (Rice),
John Cocke (IBM), and Ken Kennedy (Rice).