dgh@sun.uucp (David Hough) (01/23/86)
Benchmarking and the 68020 Cache

David Hough

22 January 1986

ABSTRACT

Seemingly insignificant changes in the order of subroutines in a floating point benchmark program can affect the performance by as much as 20%. Is the hardware or software to blame?

nroff source for this report is available from ucbvax!sun!dhough.

No sooner had I reported what I thought to be a definitive performance rating for the Sun-3 FPA than I learned that the issue is rather more complicated. Chris Aoki discovered that Sun-3 FPA Whetstone benchmark results can vary by 20% just by rearranging the order in which the subroutines are compiled or loaded. A similar effect would be expected in any 68020-based system in which the order of loading is variable. Even on a system in which the loader placed all subroutines in alphabetical order, the same effect would exist but would be slightly harder to demonstrate; changing the names of the subroutines rather than the compilation order would be required.

Chris correctly inferred the cause of the variation: the CPU of the Sun-3, a 16.7 MHz 68020, has an internal 256 byte direct-mapped instruction cache. This cache maps memory address X to cache address (X mod 256). Thus if location 4 contains a branch to location 260, and 260 contains a branch back to 4, there will be a cache miss every time and that two-instruction loop will execute at the worst case rate.

It so happens that the most heavily used loop in the Whetstone floating point benchmark contains a call to P3, a short subroutine. There should be no problem accommodating both P3 and the loop that calls it in the instruction cache, since together their code occupies less than 256 bytes. Unless, of course, P3 and the loop should happen to be loaded into addresses that differ by a multiple of 256 bytes.

It turns out that this is very easy to arrange. Our f77 version of the Whetstone program is a single .f source file that contains subroutines in the order: main/pa/p0/p3/pout/time. Below are the results produced after rearranging the source of the program in various ways. Note that only the order of subroutines within the source files varies; the compiler is the same, the hardware is the same, and no changes are made within a Fortran subroutine. Single precision results follow in KWIPS, thousands of Whetstone interpreter instructions per second:

                                 cache enabled   cache disabled
order of compilation/loading         KWIPS           KWIPS

p3/main/pa/p0/pout/time               2605            1760
main/p3/pa/p0/pout/time               2220            1760
main/pa/p3/p0/pout/time               2415
main/pa/p0/p3/pout/time               2407            1730
main/pa/p0/pout/p3/time               2405            1760
main/pa/p0/pout/time/p3               2134
time/pout/p3/p0/pa/main               2150

The differences with the cache disabled are larger than I expected, but they are an order of magnitude smaller than the differences with the cache enabled, confirming the role of the cache in the latter.

Rearranging the compiled assembly language code to move the P3 calling loop adjacent to P3 confirms the maximum score of 2605. This rearrangement did not change the number of instructions executed, but it did ensure that there were no cache clashes between P3 and the loop that calls it. I then took the further step of rewriting the source code slightly, changing the caller from

      IF (N8.EQ.0) GO TO 95
      DO 90 I=1,N8
      CALL P3(X,Y,Z)
   90 CONTINUE
   95 CONTINUE

to

      IF (N8.EQ.0) GO TO 95
      CALL callP3(n8,X,Y,Z)
   95 CONTINUE

and the callee from

      SUBROUTINE P3(X,Y,Z)
      IMPLICIT REAL*4 (A-H,O-Z)
      COMMON T,T1,T2,E1(4),J,K,L
      X1 = X
      Y1 = Y
      X1 = T * (X1 + Y1)
      Y1 = T * (X1 + Y1)
      Z = (X1 + Y1) / T2
      RETURN
      END

to

      subroutine callp3(n8,x,y,z)
      DO 90 I=1,N8
      CALL P3(X,Y,Z)
   90 CONTINUE
      end
      SUBROUTINE P3(X,Y,Z)
      IMPLICIT REAL*4 (A-H,O-Z)
      COMMON T,T1,T2,E1(4),J,K,L
      X1 = X
      Y1 = Y
      X1 = T * (X1 + Y1)
      Y1 = T * (X1 + Y1)
      Z = (X1 + Y1) / T2
      RETURN
      END

The cache-enabled Whetstone score increased from 2605 to 2775! Similar changes for the P0 and PA subroutines had no significant effect. This extra increment of performance was due to different register allocations caused by removing the P3 calling loop from the main program.

What does it all mean? Which one of the numbers represents the true performance of the hardware and software? What's the honest thing to tell a customer?

I have previously blamed the Whetstone benchmark for not being very realistic; floating point intensive applications do not often look like P3 and its calling loop. Usually the innermost loop or subroutine consumes enough time that cache misses outside it don't matter. I do not expect that a comparable effect would be noticeable in the Linpack, large Linpack, or Livermore loops benchmark results previously reported.

Alternatively one might blame Motorola for using a simple cache structure, but a more complicated cache might well have run slower with a net loss in overall performance, especially on real applications that do not have the structure of the Whetstone benchmark.

If everyone ran 68020 benchmarks with the cache disabled the results would be more uniform but less interesting. As technology advances, the suitability of short benchmark programs for projecting complex applications' performance declines.
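
The cache clash described earlier, in which address X maps to cache address (X mod 256), can be illustrated with a small simulation. What follows is only a rough sketch in Python, not a model of the actual Whetstone code. The 4-byte line size, the load addresses, and the 40-byte and 60-byte code sizes are assumptions chosen for illustration, and the function name count_misses is hypothetical. It also simplifies the fetch order to "all of the loop, then all of the subroutine" on each iteration.

```python
CACHE_BYTES = 256   # size of the on-chip instruction cache (from the text)
LINE_BYTES = 4      # ASSUMED line size for this sketch
NUM_LINES = CACHE_BYTES // LINE_BYTES


def count_misses(regions, iterations):
    """Count misses when the given code regions are fetched in order, repeatedly.

    regions: list of (start_address, length_in_bytes) tuples (hypothetical layout).
    """
    tags = [None] * NUM_LINES          # tag currently held by each cache line
    misses = 0
    for _ in range(iterations):
        for start, length in regions:
            for addr in range(start, start + length, LINE_BYTES):
                # Direct mapping: the line is chosen by (addr mod 256).
                index = (addr % CACHE_BYTES) // LINE_BYTES
                tag = addr // CACHE_BYTES
                if tags[index] != tag:
                    misses += 1        # line holds other code: refill it
                    tags[index] = tag
    return misses


# ASSUMED sizes and addresses: a 40-byte calling loop and a 60-byte subroutine.
LOOP = (0x1000, 40)
# Subroutine loaded a multiple of 256 bytes away from the loop: lines collide.
clashing = count_misses([LOOP, (0x1000 + 3 * 256, 60)], iterations=1000)
# Subroutine loaded immediately after the loop: no lines are shared.
adjacent = count_misses([LOOP, (0x1000 + 40, 60)], iterations=1000)

print("clashing layout misses:", clashing)
print("adjacent layout misses:", adjacent)
```

With these made-up sizes, the adjacent layout misses only while the cache fills on the first pass, whereas the clashing layout keeps evicting and refetching the overlapping lines on every iteration. The instruction count is identical in both cases; only the load addresses differ.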