dgh@sun.uucp (David Hough) (01/23/86)
Benchmarking and the 68020 Cache
David Hough
ABSTRACT
Seemingly insignificant changes in the order of
subroutines in a floating point benchmark program can
affect the performance by as much as 20%. Is the
hardware or software to blame?
nroff source for this report is available
from ucbvax!sun!dhough.
No sooner had I reported what I thought to be a definitive
performance rating for the Sun-3 FPA than I learned that the
issue is rather more complicated. Chris Aoki discovered
that Sun-3 FPA Whetstone benchmark results can vary by 20%
just by rearranging the order in which the subroutines are
compiled or loaded. A similar effect would be expected in
any 68020-based system in which the order of loading is
variable. Even on a system in which the loader placed all
subroutines in alphabetical order, the same effect would
exist but would be slightly harder to demonstrate; changing
the names of the subroutines rather than the compilation
order would be required.
Chris correctly inferred the cause of the variation: the CPU
of the Sun-3, a 16.7 MHz 68020, has an internal 256-byte
direct-mapped instruction cache. This cache maps memory
address X to cache address (X mod 256). Thus if location 4
contains a branch to location 260, and 260 contains a branch
back to 4, both locations map to cache address 4, so there
will be a cache miss every time and that two-instruction
loop will execute at the worst-case rate.
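
The mapping described above can be sketched in a few lines of
Python. This is only an illustrative model, not a description of
the 68020's actual cache logic: it follows the text's
simplification of one cache address per byte (X mod 256), and the
names cache_slot and count_misses, as well as the contrasting
address 264, are made up for the example.

```python
# Minimal model of a 256-byte direct-mapped instruction cache.
# Assumption: one cache slot per byte address (X mod 256), as in the
# text's simplified description; real hardware works on larger units.

CACHE_SIZE = 256  # bytes, as stated in the text


def cache_slot(address):
    """Return the cache address that memory address X maps to."""
    return address % CACHE_SIZE


def count_misses(fetch_addresses):
    """Replay a sequence of instruction fetches and count the misses."""
    resident = {}  # cache slot -> memory address currently held there
    misses = 0
    for address in fetch_addresses:
        slot = cache_slot(address)
        if resident.get(slot) != address:
            misses += 1
            resident[slot] = address  # the new fetch evicts the old entry
    return misses


# The loop from the text: 4 branches to 260, and 260 branches back to 4.
conflicting = [4, 260] * 1000
# Hypothetical contrast: the second branch sits at 264 instead of 260.
separate = [4, 264] * 1000

print("slots:", cache_slot(4), cache_slot(260))
print("misses, 4 <-> 260:", count_misses(conflicting))
print("misses, 4 <-> 264:", count_misses(separate))
```

In this model the 4/260 loop misses on every fetch, while the
4/264 loop misses only on its first pass.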
It so happens that the most heavily used loop in the
Whetstone floating point benchmark contains a call to P3, a
short subroutine. There should be no problem accommodating
both P3 and the loop that calls it in the instruction cache,
since their combined code occupies less than 256 bytes.
Unless, of course, P3 and the loop should happen to be
loaded at addresses that differ by a multiple of 256 bytes,
so that they compete for the same cache locations.
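
To see how much placement alone can matter, here is a rough
simulation of a loop that calls one subroutine on every
iteration. Everything specific in it is an assumption made for
illustration: the 4-byte fill unit, the 40-byte loop, the 60-byte
subroutine, the load addresses, and the function name hit_rate.
It also ignores the real fetch order within the loop, so the
numbers show the trend rather than measured behavior.

```python
CACHE_SIZE = 256   # bytes in the instruction cache, per the text
LINE = 4           # assumption: the cache is filled 4 bytes at a time


def hit_rate(loop_start, loop_len, sub_start, sub_len, iterations=1000):
    """Hit rate for a loop that calls one subroutine each iteration.

    Each iteration fetches the whole loop body and then the whole
    subroutine body, a simplification of the real fetch order.
    """
    lines = {}  # cache line index -> tag of the memory block held there
    hits = fetches = 0
    for _ in range(iterations):
        for start, length in ((loop_start, loop_len), (sub_start, sub_len)):
            for address in range(start, start + length, LINE):
                index = (address % CACHE_SIZE) // LINE
                tag = address // CACHE_SIZE
                fetches += 1
                if lines.get(index) == tag:
                    hits += 1
                else:
                    lines[index] = tag  # miss: replace the resident line
    return hits / fetches


# Hypothetical layout: a 40-byte calling loop and a 60-byte subroutine.
apart = hit_rate(0x1000, 40, 0x1000 + 40, 60)       # loaded back to back
clash = hit_rate(0x1000, 40, 0x1000 + 3 * 256, 60)  # starts differ by 768

print(f"adjacent placement:        {apart:.3f}")
print(f"multiple-of-256 placement: {clash:.3f}")
```

With these made-up sizes the adjacent layout hits almost every
time, while the clashing layout keeps evicting the loop and the
first 40 bytes of the subroutine from each other's cache lines.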
It turns out that this is very easy to arrange. Our
22 January 1986
f77 version of the Whetstone program is a single .f source
file that contains subroutines in the order
main/pa/p0/p3/pout/time. Below are the results produced
after rearranging the source of the program in various ways.
Note that only the order of subroutines within the source
file varies; the compiler is the same, the hardware is the
same, and no changes are made within any Fortran subroutine.
Single precision results follow in KWIPS, thousands of
Whetstone interpreter instructions per second:
                               cache enabled   cache disabled
order of compilation/loading:      KWIPS            KWIPS
p3/main/pa/p0/pout/time             2605             1760
main/p3/pa/p0/pout/time             2220             1760
main/pa/p3/p0/pout/time             2415
main/pa/p0/p3/pout/time             2407             1730
main/pa/p0/pout/p3/time             2405             1760
main/pa/p0/pout/time/p3             2134
time/pout/p3/p0/pa/main             2150
The differences with the cache disabled are larger than I
expected, but at 30 KWIPS they are an order of magnitude
smaller than the 471 KWIPS range with the cache enabled,
confirming the role of the cache in the latter.
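
The size of the two spreads can be checked directly from the
table. The short script below only redoes that arithmetic; the
KWIPS figures are copied from the table, entries the table leaves
blank are recorded as None, and the helper name spread is made up
for the example.

```python
# KWIPS figures copied from the table; None marks entries not reported.
results = {
    "p3/main/pa/p0/pout/time": (2605, 1760),
    "main/p3/pa/p0/pout/time": (2220, 1760),
    "main/pa/p3/p0/pout/time": (2415, None),
    "main/pa/p0/p3/pout/time": (2407, 1730),
    "main/pa/p0/pout/p3/time": (2405, 1760),
    "main/pa/p0/pout/time/p3": (2134, None),
    "time/pout/p3/p0/pa/main": (2150, None),
}


def spread(values):
    """Return (max - min, that difference as a fraction of the minimum)."""
    lo, hi = min(values), max(values)
    return hi - lo, (hi - lo) / lo


enabled = [on for on, _ in results.values()]
disabled = [off for _, off in results.values() if off is not None]

enabled_diff, enabled_rel = spread(enabled)
disabled_diff, disabled_rel = spread(disabled)

print(f"cache enabled:  {enabled_diff} KWIPS ({enabled_rel:.1%})")
print(f"cache disabled: {disabled_diff} KWIPS ({disabled_rel:.1%})")
print(f"ratio of spreads: {enabled_diff / disabled_diff:.1f}")
```

Measuring the spread against the slowest run is one choice among
several; measuring against the fastest run gives a somewhat
smaller percentage for the same data.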
Rearranging the compiled assembly language code to move the
P3 calling loop adjacent to P3 confirms the maximum score of
2605. This rearrangement did not change the number of
instructions executed, but it did ensure that there were no
cache clashes between P3 and the loop that calls it. I then
took the further step of rewriting the source code slightly,
changing the caller from
      IF (N8.EQ.0) GO TO 95
      DO 90 I=1,N8
         CALL P3(X,Y,Z)
   90 CONTINUE
   95 CONTINUE
to
      IF (N8.EQ.0) GO TO 95
      CALL callP3(n8,X,Y,Z)
   95 CONTINUE
and the callee from
      SUBROUTINE P3(X,Y,Z)
      IMPLICIT REAL*4 (A-H,O-Z)
      COMMON T,T1,T2,E1(4),J,K,L
      X1 = X
      Y1 = Y
      X1 = T * (X1 + Y1)
      Y1 = T * (X1 + Y1)
      Z = (X1 + Y1) / T2
      RETURN
      END
to
      subroutine callp3(n8,x,y,z)
      DO 90 I=1,N8
         CALL P3(X,Y,Z)
   90 CONTINUE
      end
      SUBROUTINE P3(X,Y,Z)
      IMPLICIT REAL*4 (A-H,O-Z)
      COMMON T,T1,T2,E1(4),J,K,L
      X1 = X
      Y1 = Y
      X1 = T * (X1 + Y1)
      Y1 = T * (X1 + Y1)
      Z = (X1 + Y1) / T2
      RETURN
      END
The cache-enabled Whetstone score increased from 2605 to
2775! Similar changes for the P0 and PA subroutines had no
significant effect. This extra increment of performance was
due to different register allocations caused by removing the
P3 calling loop from the main program.
What does it all mean? Which one of the numbers
represents the true performance of the hardware and
software? What's the honest thing to tell a customer?
I have previously blamed the Whetstone benchmark for not
being very realistic; floating point intensive applications
do not often look like P3 and its calling loop. Usually the
innermost loop or subroutine consumes enough time that cache
misses outside it don't matter. I do not expect that a
comparable effect would be noticeable in the Linpack, large
Linpack, or Livermore loops benchmark results previously
reported.
Alternatively, one might blame Motorola for using a simple
cache structure, but a more complicated cache might well
have run slower with a net loss in overall performance,
especially on real applications that do not have the
structure of the Whetstone benchmark.
If everyone ran 68020 benchmarks with the cache disabled,
the results would be more uniform but less interesting. As
technology advances, the suitability of short benchmark
programs for projecting complex applications' performance
declines.