[net.micro.68k] Benchmarking and the 68020 Cache

dgh@sun.uucp (David Hough) (01/23/86)
              Benchmarking and the 68020 Cache


                        David Hough



                          ABSTRACT

          Seemingly insignificant changes in the  order
     of  subroutines in a floating point benchmark pro-
     gram can affect the performance by as much as 20%.
     Is the hardware or software to blame?

          nroff source for  this  report  is  available
     from ucbvax!sun!dhough.



     No sooner had I reported what I thought to be a defini-
tive  performance  rating  for  the Sun-3 FPA than I learned
that the issue  is  rather  more  complicated.   Chris  Aoki
discovered  that  Sun-3  FPA Whetstone benchmark results can
vary by 20% just by rearranging the order in which the  sub-
routines  are compiled or loaded.  A similar effect would be
expected in any 68020-based system in  which  the  order  of
loading  is  variable.  Even on a system in which the loader
placed all  subroutines  in  alphabetical  order,  the  same
effect  would  exist  but would be slightly harder to demon-
strate; changing the names of the  subroutines  rather  than
the compilation order would be required.

     Chris correctly inferred the cause of the variation:
the CPU of the Sun-3, a 16.7 MHz 68020, has an internal
256-byte direct-mapped instruction cache.  This cache maps
memory address X to cache address (X mod 256).  Thus if
location 4 contains a branch to location 260, and location
260 contains a branch back to 4, each instruction evicts the
other, every fetch misses the cache, and that two-instruction
loop executes at the worst-case rate.
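
     As a rough illustration of that mapping (not part of the
original measurements), the following Python sketch replays a
sequence of instruction fetch addresses through a direct-mapped
cache and counts the misses.  It assumes 64 entries of one
longword (4 bytes) each, which totals the 256 bytes stated
above; the names cache_slot and count_misses are hypothetical,
and each instruction is modelled as a single fetch address,
which is a simplification.

```python
LINE_BYTES = 4    # assumption: one longword per cache entry
NUM_LINES = 64    # 64 * 4 = 256 bytes, the cache size given in the text


def cache_slot(addr):
    """Return (index, tag) for a byte address in a direct-mapped cache."""
    index = (addr // LINE_BYTES) % NUM_LINES
    tag = addr // (LINE_BYTES * NUM_LINES)
    return index, tag


def count_misses(fetch_addrs):
    """Replay instruction fetch addresses and count the cache misses."""
    tags = [None] * NUM_LINES      # tag currently held by each entry
    misses = 0
    for addr in fetch_addrs:
        index, tag = cache_slot(addr)
        if tags[index] != tag:
            misses += 1
            tags[index] = tag      # the new fetch evicts the old contents
    return misses


# Two-instruction loop from the text: 4 branches to 260, 260 back to 4.
thrashing = [4, 260] * 1000
# The same loop with the second instruction one longword further along.
friendly = [4, 264] * 1000

print(count_misses(thrashing))     # every fetch misses
print(count_misses(friendly))      # only the first pass misses
```

In this model the first loop misses on all 2000 fetches, while
the second misses only twice, once per instruction on the first
pass.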

     It so happens that the most heavily used loop in the
Whetstone floating point benchmark contains a call to P3, a
short subroutine.  There should be no problem accommodating
both P3 and the loop that calls it in the instruction cache,
since their combined code occupies fewer than 256 bytes.
Unless, of course, P3 and the loop happen to be loaded at
addresses that differ by a multiple of 256 bytes, in which
case they compete for the same cache locations.
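
     To make the collision condition concrete, here is a small
Python sketch that counts how many cache locations two code
regions compete for under the (X mod 256) mapping.  The load
addresses and code sizes below are invented for illustration;
they are not the actual addresses or sizes of P3 and its
calling loop, and the function names are hypothetical.

```python
CACHE_BYTES = 256   # cache size given in the text


def cache_footprint(start, size):
    """Set of cache offsets (address mod 256) that a code region occupies."""
    return {(start + i) % CACHE_BYTES for i in range(size)}


def clash_bytes(start_a, size_a, start_b, size_b):
    """Number of cache offsets that two code regions compete for."""
    a = cache_footprint(start_a, size_a)
    b = cache_footprint(start_b, size_b)
    return len(a & b)


# Hypothetical layout: a 40-byte calling loop and a 60-byte P3.
LOOP_START, LOOP_SIZE = 0x2000, 40
P3_SIZE = 60

# P3 loaded exactly 3 * 256 bytes after the loop: the whole loop clashes.
print(clash_bytes(LOOP_START, LOOP_SIZE, 0x2300, P3_SIZE))
# P3 loaded 16 bytes past that: a partial clash.
print(clash_bytes(LOOP_START, LOOP_SIZE, 0x2310, P3_SIZE))
# P3 loaded immediately after the loop: no clash at all.
print(clash_bytes(LOOP_START, LOOP_SIZE, LOOP_START + LOOP_SIZE, P3_SIZE))
```

With these made-up numbers the three layouts share 40, 24, and 0
cache bytes respectively, which suggests why the load order can
produce a range of scores rather than just two.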

     It turns out that this is very easy  to  arrange.   Our



                                             22 January 1986







f77  version  of the Whetstone program is a single .f source
file   that   contains    subroutines    in    the    order:
main/pa/p0/p3/pout/time.    Below  are  the results produced
after rearranging the source of the program in various ways.
Note  that  only  the order of subroutines within the source
files varies; the compiler is the same, the hardware is  the
same,  and  no changes are made within a Fortran subroutine.
Single precision results follow in KWIPS, thousands of Whet-
stone interpreter instructions per second:

order of compilation/loading    cache enabled   cache disabled
                                (KWIPS)         (KWIPS)

p3/main/pa/p0/pout/time         2605            1760
main/p3/pa/p0/pout/time         2220            1760
main/pa/p3/p0/pout/time         2415
main/pa/p0/p3/pout/time         2407            1730
main/pa/p0/pout/p3/time         2405            1760
main/pa/p0/pout/time/p3         2134
time/pout/p3/p0/pa/main         2150

The disabled cache differences are larger than  I  expected,
but  they are an order of magnitude smaller than the enabled
cache differences, confirming the role of the cache  in  the
latter.
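
     The comparison can be checked directly from the table.
The Python sketch below uses only the scores listed above,
leaving out the cells for which no cache-disabled figure was
reported; the helper names are hypothetical.

```python
# KWIPS figures copied from the table; blank cells are simply omitted.
enabled = [2605, 2220, 2415, 2407, 2405, 2134, 2150]
disabled = [1760, 1760, 1730, 1760]


def spread(scores):
    """Difference between the best and worst score."""
    return max(scores) - min(scores)


def relative_spread(scores):
    """Spread expressed as a fraction of the best score."""
    return spread(scores) / max(scores)


print(spread(enabled), round(100 * relative_spread(enabled), 1))
print(spread(disabled), round(100 * relative_spread(disabled), 1))
print(round(spread(enabled) / spread(disabled), 1))
```

The enabled-cache scores span 471 KWIPS, about 18% of the best
score (or 22% of the worst), while the disabled-cache scores
span 30 KWIPS, under 2%.  The ratio of the two spreads is a
little under 16.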

     Rearranging the compiled assembly language code to move
the P3 calling loop adjacent to P3 confirmed the maximum
score of 2605.  This rearrangement did not change the number
of instructions executed, but it did ensure that there were
no cache clashes between P3 and the loop that calls it.  I
then took the further step of rewriting the source code
slightly, changing the caller from

        IF (N8.EQ.0) GO TO 95
                DO 90 I=1,N8
                CALL P3(X,Y,Z)
   90           CONTINUE
   95   CONTINUE

to

        IF (N8.EQ.0) GO TO 95
                CALL callP3(n8,X,Y,Z)
   95   CONTINUE

and the callee from











        SUBROUTINE P3(X,Y,Z)
        IMPLICIT REAL*4 (A-H,O-Z)
        COMMON T,T1,T2,E1(4),J,K,L
        X1 = X
        Y1 = Y
        X1 = T * (X1 + Y1)
        Y1 = T * (X1 + Y1)
        Z = (X1 + Y1) / T2
        RETURN
        END

to

        subroutine callp3(n8,x,y,z)
                DO 90 I=1,N8
                CALL P3(X,Y,Z)
   90           CONTINUE
        end
        SUBROUTINE P3(X,Y,Z)
        IMPLICIT REAL*4 (A-H,O-Z)
        COMMON T,T1,T2,E1(4),J,K,L
        X1 = X
        Y1 = Y
        X1 = T * (X1 + Y1)
        Y1 = T * (X1 + Y1)
        Z = (X1 + Y1) / T2
        RETURN
        END

The cache-enabled Whetstone score increased from 2605 to
2775, about 6.5%!  Similar changes for the P0 and PA
subroutines had no significant effect.  This extra increment
of performance was due to different register allocations
caused by removing the P3 calling loop from the main program.

     What does it  all  mean?   Which  one  of  the  numbers
represents   the   true  performance  of  the  hardware  and
software?  What's the honest thing to tell a customer?

     I have previously blamed the Whetstone benchmark for
not being very realistic; floating point intensive
applications do not often look like P3 and its calling loop.
Usually the innermost loop or subroutine consumes enough time
that cache misses outside it don't matter.  I do not expect
that a comparable effect would be noticeable in the Linpack,
large Linpack, or Livermore loops benchmark results
previously reported.

     Alternatively one might blame Motorola for using a
simple cache structure, but a more complicated cache might
well have run slower, with a net loss in overall performance,
especially on real applications that do not have the
structure of the Whetstone benchmark.






     If everyone ran 68020 benchmarks with  the  cache  dis-
abled  the  results would be more uniform but less interest-
ing.  As  technology  advances,  the  suitability  of  short
benchmark programs for projecting complex applications' per-
formance declines.