          Weitek 1164/5 Floating Point Accelerators

                        David Hough


          Sun-3  Floating  Point  Accelerator  measured
     performance  exceeds 600,000 floating point opera-
     tions per second on some popular benchmarks.   Not
     all popular benchmarks are worth running, however;
     the results of the Whetstone benchmark, in partic-
     ular, are difficult to interpret.

          tbl/nroff source for this report is available
     from ucbvax!sun!dhough.

     Sun Microsystems, along with many of  its  competitors,
has  announced  a  Floating  Point Accelerator product as an
option for its new 68020-based Sun-3 systems.  These  Float-
ing  Point  Accelerators  are  often  based  on  the  Weitek
1164/1165 chip set.  The 1164/1165 set is  currently  avail-
able only as engineering samples, so few of these FPA's have
been used by customers.  Consequently there is  some  uncer-
tainty  as  to  exactly what performance to expect, although
most vendors projected similar results when  they  announced
their products.  Any performance differences among implemen-
tations are due to the hardware  surrounding  the  1164/1165
and the quality of the compiler-generated code.  The purpose
of this report is to indicate what I have measured  at  Sun,
and  to  encourage  customers  to report results they obtain
from measurement of Sun's or competitors' products.

     Here are the current single and double precision bench-
mark results for Sun's software release 3.1, currently under
development and expected to be shipped to customers in quan-
tity  in the second quarter of this year.  All programs were
compiled with f77's  -O  option  for  maximum  optimization.
Results  are measured in KFLOPS, thousands of floating point
operations per second, except Whetstone  results  which  are
measured   in  KWIPS,  thousands  of  Whetstone  interpreter
instructions per second. Note that  all  these  numbers  are
MEASURED  (not  projections)  except the spec sheet numbers,
which are estimates derived last summer.

Sun-3 SINGLE Precision KFLOPS:

f77 option           -fswitch   -f68881   -f68881   -fswitch   -ffpa    FPA
FP hardware           68881      68881     68881      FPA       FPA    spec
FP clock MHz           12.5      12.5      16.7       16.7     16.7    sheet

Whetstone KWIPS        530        860      1030       1400     2300    2000
Linpack rolled          52         86       108        180      610     450
Linpack unrolled        52         85       107        180      500     450

Large Linpack 1                    79       100                 370
Large Linpack 2                   101       130                 510
Large Linpack 4                   115       150                 630
Large Linpack 8                   105       130                 600
Large Linpack 16                   96       120                 400
Livermore max                     210       280                1200
Livermore median                   97       120                 510
Livermore harmonic                 86       110                 420
Livermore loop #6                  80       103                 430
Livermore min                      41        51                 130

Sun-3 DOUBLE Precision KFLOPS:

f77 option           -fswitch   -f68881   -f68881   -fswitch   -ffpa    FPA
FP hardware           68881      68881     68881      FPA       FPA    spec
FP clock MHz           12.5      12.5      16.7       16.7     16.7    sheet

Whetstone KWIPS        400        790       930       860      1700    1500
Linpack rolled          39         80       101       100       400     350
Linpack unrolled        39         80        99       100       310     350

Large Linpack 1                    74        92                 250
Large Linpack 2                    95       120                 370
Large Linpack 4                   109       130                 450
Large Linpack 8                    98       120                 380
Large Linpack 16                   90       108                 290
Livermore max                     200       270                 830
Livermore median                   90       110                 320
Livermore harmonic                 80       100                 280
Livermore loop #6                  75        92                 270
Livermore min                      38        48                 110

     Production Sun-3's run the 68020 CPU at  16.7  MHz  and
68881  mask  set  A79J at 12.5 MHz.  16.7 MHz 68881 mask set
A93N is currently available only as engineering samples.

     Note the difference between switched floating point
(-fswitch) and inline floating point (-f68881 or -ffpa).  A
program compiled with switched floating point will use an
FPA if one is present, or else a 68881 if one is present.  A
program compiled with inline code will run only with the
hardware for which it is compiled.  As is evident, there is
a considerable performance penalty for using switched
instead of inline floating point.

     The usual Linpack benchmark measures the time  required
to  solve  a  100x100 system of linear equations.  The inner
loop of the Linpack benchmark looks like this when rolled:

        do 1 i = 1, n
 1      x(i  ) = x(i  ) + c * y(i  )

and like this when unrolled:

        do 1 i = 1, n, 4
        x(i  ) = x(i  ) + c * y(i  )
        x(i+1) = x(i+1) + c * y(i+1)
        x(i+2) = x(i+2) + c * y(i+2)
 1      x(i+3) = x(i+3) + c * y(i+3)

The distributed version of the  Linpack  benchmark  has  the
inner loop unrolled because that was faster on certain main-
frames common in the  mid-1970's.   However,  the  unrolling
defeats many current vectorizing compilers, so supercomputer
manufacturers usually measure the rolled speed. Further com-
plicating  the  issue is that some compilers do not generate
optimum code for the inner loop whether rolled or  unrolled,
so  hand  coded assembly language is faster yet.  The situa-
tion for the usual Linpack benchmark and the Sun-3  is  that
code compiled inline for rolled loops is truly optimized and
cannot be improved by  hand  coding  in  assembly  language.
Rolled  loops  are what a programmer would be most likely to
write, so it does not bother  me  that  Sun's  f77  compiler
does  not  generate  quite  as  good code when the loops are
unrolled. The FPA spec sheet  projections  were  derived  by
considering the rolled loop; it did not occur to me that the
results would be different from unrolled  until  I  measured
the hardware.
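
     The unrolled fragment above works as written only when
n is a multiple of 4.  A minimal sketch of how such a loop
is commonly completed, with a cleanup loop for the leftover
elements, is given below.  The name AXPY4 and its argument
list are assumptions made for illustration; this is not the
benchmark's own source.

```fortran
      SUBROUTINE AXPY4(N, C, X, Y)
C     Illustrative sketch: X = X + C*Y with the main loop unrolled
C     four times.  AXPY4 is a hypothetical name, not a Linpack
C     routine.
      INTEGER N, I, M
      REAL C, X(*), Y(*)
      IF (N .LE. 0) RETURN
C     Cleanup loop: handle the first MOD(N,4) elements singly.
      M = MOD(N, 4)
      DO 10 I = 1, M
         X(I) = X(I) + C * Y(I)
   10 CONTINUE
C     Main loop: the remaining element count is a multiple of 4.
      DO 20 I = M + 1, N, 4
         X(I  ) = X(I  ) + C * Y(I  )
         X(I+1) = X(I+1) + C * Y(I+1)
         X(I+2) = X(I+2) + C * Y(I+2)
         X(I+3) = X(I+3) + C * Y(I+3)
   20 CONTINUE
      RETURN
      END
```

A caller would use CALL AXPY4(N, C, X, Y) in place of the
rolled loop; both forms perform the same 2*N floating point
operations, so any speed difference comes from loop overhead
and the quality of the generated code.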

     The usual Linpack benchmark is a good  one  for  scien-
tific  and  engineering floating point calculations, in part
because it measures the performance of hardware and compiler
in  an  indisputable  way  on  a  realistic computation.  An
optimizing compiler can't optimize away any of the  floating
point  work in the Linpack benchmark, although it can organ-
ize it more or less efficiently.

     Less widely used than the program just discussed is the
Large Linpack benchmark, which measures the time required to
solve a 300x300 system of linear equations, with the  compu-
tation  organized  rather differently than the usual Linpack
benchmark.  The program reports KFLOPS rates for solving the
problem for different source codings corresponding to unrol-
ling 1, 2, 4, 8, or 16 times.
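
     Both Linpack variants time a solve and convert the time
to a rate.  The sketch below shows that conversion, assuming
the conventional nominal operation count of 2*n**3/3 +
2*n**2 for an n by n system.  The function name RKFLOP and
the argument SECS (elapsed seconds) are hypothetical names
chosen for this example.

```fortran
      REAL FUNCTION RKFLOP(N, SECS)
C     Illustrative sketch: convert a measured solve time to KFLOPS
C     using the nominal count 2*N**3/3 + 2*N**2.
C     RKFLOP is a hypothetical name; SECS is elapsed seconds.
      INTEGER N
      REAL SECS, OPS
      OPS = 2.0 * REAL(N)**3 / 3.0 + 2.0 * REAL(N)**2
      RKFLOP = OPS / SECS / 1000.0
      RETURN
      END
```

Under that assumption the 100x100 problem amounts to about
687,000 operations, so a rate of 610 KFLOPS corresponds to a
solve time of roughly 1.1 seconds.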

     The Livermore Loops benchmark measures the time
required to perform 24 inner loops taken from important
production codes run at Livermore.  Max, min, median, and
harmonic mean KFLOPS rates are reported above for data
vectors of length 468.  The KFLOPS rating for loop #6 is
also reported; it has been identified by Patterson as the
single loop best correlating with overall Livermore Loops
performance.
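
     Of the Livermore summary statistics, the harmonic mean
is the least familiar.  A minimal sketch of its computation
from a vector of per-loop rates follows; the function name
HMEAN is an assumption for illustration and is not taken
from the benchmark source.

```fortran
      REAL FUNCTION HMEAN(N, RATE)
C     Illustrative sketch: harmonic mean of N positive rates.
C     HMEAN is a hypothetical name.
      INTEGER N, I
      REAL RATE(*), S
      S = 0.0
      DO 10 I = 1, N
         S = S + 1.0 / RATE(I)
   10 CONTINUE
      HMEAN = REAL(N) / S
      RETURN
      END
```

For rates of 100, 200, and 400 KFLOPS this returns about
171, against an arithmetic mean of about 233.  The harmonic
mean is pulled toward the slowest loops, which is why it
falls below the median in the tables above.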

     Some vendors prefer to talk about results of the  Whet-
stone benchmark, which was synthesized to mimic the instruc-
tion stream created by the Whetstone  Algol  interpreter  of
the 1960's. Hardware and software progress have rendered the
Whetstone  benchmark  obsolete  but  relevance  has   seldom
affected  the  science  of marketing.  At least one of Sun's
competitors has claimed 3000 K  Whetstone  instructions  per
second  for  single  precision,  using  the  same  68020 and
1164/1165, which is an amazing accomplishment.  Anyone  who
can  independently  verify such claims should respond and
explain how it's done!

     In the meantime I might consider how to  improve  Sun's
2300K  to 3000.  About half the time in the Whetstone bench-
mark is taken by the P3 subroutine, and on an 1164/1165 sys-
tem  about  half  the  P3  time  is consumed by the division
instruction.  The most direct way to  obtain  a  substantial
improvement  is to get rid of that division!  Looking at our
hardware architecture and  local  compiler  optimization,  I
can't  imagine  any incremental improvements that would have
significant effect.

     Certain types of global cross-procedural  optimizations
can have a profound impact, however.  Since P3's division is
by a global variable whose value happens to be 2.0, in prin-
ciple the division could be converted to a multiplication by
0.5.  Another possibility is to expand short procedures such
as  P3  inline  in  the  calling  code, then notice that the
expanded computation is invariant and could  be  removed  to
the  outside  of the do loop, leaving an empty loop.  Anyone
who built such inline expansion into  their  compiler  would
double  their Whetstone scores, and the only cost would be a
substantial diversion of software resources away from  other
projects that might actually benefit customers.  Since crit-
ical loops in real applications are usually source coded  by
the programmer to avoid division by 2.0 or invariant subrou-
tine calls, corresponding optimizations in the compiler sel-
dom  pay  off  in  realistic floating point applications, so
Sun's efforts are focused elsewhere.
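
     The first of these transformations can be sketched
against the P3 fragment quoted at the end of this report.
The variant below, under the hypothetical name P3MUL,
replaces the division by T2 with a multiplication by 0.5.
It is valid only on the assumption that T2 is known to be
exactly 2.0 at every call, which is precisely what a
compiler would have to prove across procedures.

```fortran
      SUBROUTINE P3MUL(X, Y, Z)
C     Illustrative variant of P3 (hypothetical name P3MUL): the
C     division by T2 is replaced by a multiplication by 0.5.
C     Valid only because T2 is assumed to be exactly 2.0.
      IMPLICIT REAL*4 (A-H,O-Z)
      COMMON T,T1,T2,E1(4),J,K,L
      X1 = X
      Y1 = Y
      X1 = T * (X1 + Y1)
      Y1 = T * (X1 + Y1)
      Z = (X1 + Y1) * 0.5
      RETURN
      END
```

Because 2.0 is a power of two, multiplying by 0.5 gives the
same result as dividing by 2.0 in binary floating point,
apart from underflow, so the change affects speed but not
the computed value of Z.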

     The moral of this digression is "don't pay much  atten-
tion  to Whetstone results".  If you want a single number to
characterize performance on scientific and engineering prob-
lems,  use the usual Linpack benchmark.  If you want lots of
numbers, the Livermore loops benchmark  provides  them.   If
you  want  accuracy and IEEE conformance as well as speed...
that's a topic for another report.

Code fragments from the Whetstone program...

        T = .499975
        T2 = 2.0


        DO 90 I=1,N8
                CALL P3(X,Y,Z)
   90           CONTINUE


        SUBROUTINE P3(X,Y,Z)
        IMPLICIT REAL*4 (A-H,O-Z)
        COMMON T,T1,T2,E1(4),J,K,L
        X1 = X
        Y1 = Y
        X1 = T * (X1 + Y1)
        Y1 = T * (X1 + Y1)
        Z = (X1 + Y1) / T2

Note that with the Weitek 1164/1165, the one division takes longer
than the three additions and two multiplications combined.

Does anyone have any comments about the new consortium formed to
produce a uniform network model (file servers, print servers, etc.)
among all computer vendors (except IBM)? Bell, DEC, Burroughs, CDC,
etc. announced this effort in the last week.

My feeling was this was not so much a technical move (there is a lot
of room for innovation in the area of servers, especially when one
realizes that in the future there will be voice-servers,
video-servers, parallel process servers - and locking oneself into a
standard across hybrid operating systems will constrain all operating
system development at these firms).

The actual move seems to have been more marketing focused. I think
everyone was shocked when they woke up and realized that in one day,
IBM could say no, they would not go ethernet with the PC, they would
go ring - and all the Interlan's and 3-Coms and Apples of the world
would have to dance to IBM's tune. This effort seems to be aimed at
coming up with a marketing counterforce to keep IBM from making SNA a
de facto standard.

                                                - Steve