mash@mips.UUCP (04/02/87)
MIPS Performance Brief, Issue 2.2.  April 1, 1987 [but real!]

This is a highly-crunched version of the publication, with much of the
extra explanation and commentary omitted.  Most of what's left is large
tables of comparative numbers.  We're interested in additions or
corrections.  Nice, uncrunched printed copies are available [they do
contain a little advertising that disappeared in this version!]  The
2nd page of this summarizes the benchmarks, followed by tons of detail.

1. Introduction

New Features of This Issue

New features in this issue include benchmarks using the new R2010
Floating Point chip, in both the existing MIPS M/500 and our new M/800
systems.  We've also gathered numbers from other processors for more
comparisons.  Some benchmarks are now normalized to the VAX-11/780
under VAX/VMS, rather than UNIX.  Finally, we include benchmarks for
UMIPS-BSD and UMIPS-V.

Benchmarking - Caveats and Comments

While no one benchmark can fully characterize overall system
performance, the results of several benchmarks can give some insight
into expected real performance.  A more important benchmarking
methodology is a side-by-side comparison of two systems running the
same real application.

We don't believe in characterizing a processor with only a single
number, but we follow (what seems to be) standard industry practice of
using a mips-rating that essentially describes overall integer
performance.  Thus, we label a 5 mips machine to be one that is about
5X (i.e., anywhere from 4X to 6X!) faster than a VAX 11/780 (UNIX
4.3BSD) on integer performance, since this seems to be how most people
compute mips-ratings.  Even within the same computer family,
performance ratios between processors vary widely.  For example,
[McInnis 87] characterizes a VAX 8700 as anywhere from 3X to 7X faster
than the 11/780.  Floating point performance often varies even more
than integer performance does, and scales up more slowly versus the
11/780.
Most of this paper analyzes one important aspect of overall computer
system performance - user-level CPU performance.  A small set of
operating system benchmarks is also included.

2. Benchmark Summary

The following table summarizes the benchmark results described in more
detail throughout this document.

On three rows, all VAX systems run VAX/VMS, and all numbers are
computed relative to a VAX 11/780 running VAX/VMS.  The VAX/VMS
FORTRAN numbers were often 1.1-1.2X faster than our Lawrence Livermore
Labs FORTRAN, and often 2X better than those of 4.3BSD.  We've done
this because many people are not familiar with the LLL FORTRAN, and
don't take any UNIX FORTRAN seriously unless compared with VMS.  Other
rows use 4.3BSD UNIX as the base for relative performance, but show a
few VAX/VMS numbers, i.e., those marked #.  The rows marked + are
those that we consider truly representative benchmarks.

                  Summary of Benchmark Results
              (VAX 11/780 = 1.0, Bigger is Faster)
              (VAX/VMS when possible, UNIX otherwise)

                             MIPS    VAX    MIPS    VAX   Sun-3/   VAX
                             M/800   8650   M/500   8600   160M  11/780
                             Rel.    Rel.   Rel.    Rel.   Rel.   Rel.
Benchmark                    Perf.   Perf.  Perf.   Perf.  Perf.  Perf.

Integer Benchmarks
  MIPS UNIX Total+            9.0     -      5.6    3.7    2.0    1.0
  Dhrystone: registers,       9.7    6.9#    6.6    4.7#   2.1    1.0
    no opt for MIPS, -O for others
  Stanford Integer Aggregate 12.8     -      8.5    3.6    2.2    1.0

Floating Point Benchmarks
  (Relative to VAX/VMS)
  LINPACK: FORTRAN DP+        5.7    5.0     4.1    3.5    0.7    1.0
  Whetstone: Double Prec.+    8.3    4.8     5.4    3.2    1.2    1.0
  DoDuc: Double Precision+    5.6    5.0     3.6    3.5    0.7    1.0
  (Relative to 4.3BSD UNIX)
  Spice, elh2a+               7.6     -      4.0     -     1.6    1.0
  Stanford FP Aggregate (C)  14.9     -      9.4    2.9&   1.6@   1.0

System Benchmarks
  Byte Total (UMIPS-BSD)      7.4     -      4.6    3.2    1.8    1.0
  Byte Total (UMIPS-V)         -      -      5.4    3.2    1.8    1.0

& In-house, FPA-less VAX 8600, Ultrix.
@ In-house Sun3/160 had 12.5MHz 68881.  Other FP benchmarks used
  published figures with 16.6MHz 68881s.

3. Methodology

3.1.
Tested Configurations

The computer system configurations described below do not necessarily
reflect optimal configurations, but rather the in-house systems to
which we had repeatable access.  When we've had faster results
available, we've quoted them in place of our own systems' numbers.

DEC VAX-11/780
     Main Memory: 8 Mbytes
     Floating Point: Configured with FPA board.
     Operating System: 4.3BSD UNIX.

DEC VAX 8600
     Main Memory: 20 Mbytes
     Floating Point: Configured without FPA board.
     Operating System: Ultrix V1.2 (4.2BSD with many 4.3BSD tunings).

Sun-3/160M
     CPU: 16.67 MHz MC68020
     Main Memory: 8 Mbytes
     Floating Point: 12.5 MHz MC68881 coprocessor (compiled -f68881).
     Operating System: 4.2BSD UNIX, Release 3.2.

MIPS M/500
     CPU: 8 MHz R2000, on R2300 CPU board, 16K I-cache, 8K D-cache
     Floating Point: R2010 FPA chip (8 MHz)
          (has 1 performance bug, see next page)
     Main Memory: 8 Mbytes (2 R2350 memory boards)
     Operating System: UMIPS-BSD 2.0Beta (4.3BSD);
          UMIPS-V 1.1 (System V R3) also shown where it differs from BSD

MIPS M/800
     CPU: 12.5 MHz R2000, on R2600 CPU board, 64K I-cache, 64K D-cache
     Floating Point: R2010 FPA chip (12.5 MHz)
          (has 1 performance bug, see next page)
     Main Memory: 8 Mbytes (2 R2350 memory boards)
     Operating System: UMIPS-BSD 2.0Beta (4.3BSD)

Test Conditions

All programs were compiled with -O (optimize).  C is used for all
benchmarks except Whetstone, LINPACK, and DoDuc, which use FORTRAN.
When possible, we've obtained numbers for VAX/VMS, and use them in
place of UNIX numbers.  The MIPS compilers are version 1.10.

User time was measured for all benchmarks using the /bin/time command.
Systems were tested in a normal multi-user development environment,
with load factor <0.2 (as measured by the uptime command).  (Note that
this occasionally makes them run longer, due to slight interference
from background daemons and clock handling, even on an otherwise empty
system.)  Benchmarks were run at least 3 times and averaged.
The intent is to show numbers that people can reproduce.

3.2. MIPS Results

MIPS R2300 and R2600 CPU Boards

The MIPS R2300 CPU board is based on the 8 MHz (125 ns cycle time)
version of the MIPS R2000 CPU chip.  The R2300 also includes the MIPS
R2010 Floating Point Accelerator chip, a 16 Kbyte instruction cache,
an 8 Kbyte data cache, and a four-stage write buffer.  The MIPS R2600
CPU board is similar to the R2300, but runs at 12.5 MHz (80 ns cycle
time), and has larger caches (64 KB each for instruction and data).

MIPS R2010 FPA Chip

The R2010 has the following cycle counts:

                      Single      Double
     Operation        Precision   Precision
     Add, Subtract     2 cycles    2 cycles
     Multiply          4 cycles    5 cycles
     Divide           12 cycles   19 cycles

This yields 250 ns for an Add @ 8 MHz, 160 ns @ 12.5 MHz, or 120 ns @
16.6 MHz.  In addition, the R2010 overlaps operations extensively:
load and store can be overlapped with arithmetic, while multiply,
divide, and add are mostly independent.  Note: the early parts (first
silicon) used in these benchmarks had a bug that inhibited overlap of
add, multiply, and divide.

Peak Performance Numbers

We don't believe these mean much, but people ask.  Note that
VAX-Relative and Peak mips are not the same!

                               MIPS     MIPS
     Category                  M/800    M/500
     Peak (Burst) Mips         12.5      8.0
     Peak DP Megaflops          6.25     4.0

How to Interpret the Numbers

Times (or rates, such as for Dhrystones, Whetstones, and LINPACK
KFlops) are shown for the VAX 11/780.  Other machines' times or rates
are shown, with relative performance (the "Rel." column) normalized to
the 11/780, treated as 1.0.  VAX/VMS is used as the base whenever
possible.

Compilers and Operating Systems

Unless otherwise specified, the M/500 and M/800 benchmark numbers use
Release 1.10 of the MIPS compilers and UMIPS-BSD 2.0 Beta.  Some
numbers are also given for UMIPS-V 1.1, also using compiler release
1.10.
Most user-level benchmarks have similar performance under the two
UMIPS variants, so we report only major differences, mainly for the
Byte benchmarks.

Optimization Levels

Unless otherwise specified, all benchmarks were compiled -O, i.e.,
with optimization.  UMIPS compilers call this level -O2, and it
includes global optimization.  In a few cases, we show numbers for -O3
and -O4 optimization levels, which do inter-procedural register
allocation and procedure merging.

4. Integer Benchmarks

4.1. MIPS UNIX Benchmarks

The MIPS UNIX Benchmarks described below are fairly typical of
nontrivial UNIX programs.  This benchmark suite provides the
opportunity to execute the same code across several different
machines, in contrast to the compilers and linkers for each machine,
which have substantially different code.  User time is shown; kernel
time is typically 10-15% of the user time (on the 780), so these are
good indications of integer/character compute-intensive programs.

                  MIPS UNIX Benchmark Results

           MIPS        MIPS       DEC VAX     Sun-3/      DEC VAX
           M/800       M/500       8600        160M       11/780
Benchmark  Secs Rel.  Secs Rel.  Secs Rel.   Secs Rel.   Secs Rel.
grep        0.6  7.8   1.0  4.7   1.2  3.9    2.3  2.0    4.7  1.0
diff        0.3  9.0   0.4  6.7   0.9  3.0    1.8  1.5    2.7  1.0
yacc        0.4  8.5   0.6  5.7   0.9  3.8    1.7  2.0    3.4  1.0
nroff       2.0  9.4   3.3  5.7   5.0  3.8    9.0  2.1   18.8  1.0
Total+      3.3  9.0   5.3  5.6   8.0  3.7   14.8  2.0   29.6  1.0

+ Simple summation of the time for all benchmarks.  "Total Rel." is
  the ratio of the totals.

Note: in order to assure "apples-to-apples" comparisons, we moved the
same copies of the sources for these to the various machines, compiled
them there, and ran them, to avoid surprises from different binary
versions of commands resident on those machines.  Note that the
granularity here is at the edge of UNIX timing, i.e., tenths of
seconds make large differences.

4.2. Dhrystone

Dhrystone is a synthetic programming benchmark that measures processor
and compiler efficiency in executing a ``typical'' program
[Weicker 84].
The Dhrystone results shown below are measured in Dhrystones/second,
using the 1.1 version of the benchmark.  According to [Richardson 87],
1.1 cleans up a bug, and is the correct version to use, even though
results for a given machine are typically about 15% lower with 1.1
than with 1.0.

                     Dhrystone Benchmark Results

                  MIPS          MIPS         DEC VAX      Sun-3/       DEC VAX
                  M/800         M/500         8600         160M        11/780
                  Dhry's        Dhry's       Dhry's       Dhry's       Dhry's
Benchmark         /Sec   Rel.   /Sec   Rel.  /Sec   Rel.  /Sec   Rel.  /Sec   Rel.
no Registers      12,900  8.9    8,900  6.2  4,896  3.4   2,800  1.9   1,442  1.0
Registers         15,300 10.4   10,300  7.0  5,130  3.5   3,025  2.0   1,474  1.0
-O, no Registers  18,500 11.9   12,500  8.0  5,154  3.3   3,030  1.9   1,559  1.0
-O, Registers     18,500 11.8   12,500  8.0  5,235  3.3   3,325  2.1   1,571  1.0
-O3, Registers    20,000   -    13,000   -     -     -      -     -      -     -
-O4, Registers    21,300   -    14,200   -     -     -      -     -      -     -

Advice for running Dhrystone has changed over time with regard to
optimization, and currently asks that people turn off optimizers that
are more than peephole optimizers.  [This penalizes people who have
good optimizers versus those who don't!]  However, from examination of
the published numbers, and from personal knowledge, we've found that
many people are using compilers that do substantial optimization, and
are publishing those numbers as their standard Dhrystone ratings.

We continue to include a range of numbers to show the difference
optimization technology makes on this particular benchmark, and to
provide a range for comparison when others' cited Dhrystone figures
are not clearly defined by optimization level.  For example, -O3 does
interprocedural register allocation, and -O4 does procedure inlining,
and we suspect -O4 is beyond the spirit of the benchmark.  To compare
with other systems, we'd suggest using the number 18,500 (12,500 for
the M/500), unless you know that a number was obtained without
optimization, in which case use 15,300 (10,300).
Although this commonly-used benchmark claims that the M/500 is 7-8X
faster than the 11/780, we believe that this benchmark is not
necessarily representative.  Some other published numbers of interest
include the following, all of which are taken from [Richardson 87]
unless otherwise noted.  Items marked * are those that we know (or
have good reason to believe) use optimizing compilers.  These are the
"register" versions of the numbers, i.e., the highest ones reported by
people.  Note that we standardly report the unoptimized versions (-O1)
of our numbers, although we also report -O2 numbers below.  Thus,
we're giving every other machine every possible benefit of the doubt
on this one.

             Selected Dhrystone Benchmark Results

Dhry's
/Sec    Rel.  Processor
 1,571   1.0  VAX 11/780, 4.3BSD [our number]
 1,757   1.1  VAX 11/780, VAX/VMS 4.2 [Intergraph 86]*
 3,325   2.1  Sun3/160, SunOS 3.2 [our number]
 3,773   2.4  Pyramid 98X, OSx 3.1, CLE 3.2.0
 4,061   2.6  Celerity 1260-D, 4.2BSD
 4,433   2.8  MASSCOMP MC-5700, 16.7MHz 68020, RTU 3.1*
 5,156   3.3  Intergraph InterPro 32C, SYSV R3 2.0.0, Greenhills*
 6,240   4.0  Ridge 3200, ROS 3.4
 6,329   4.0  IBM RT PC, AIX 2.1, Advanced C*
 6,340   4.0  Gould PN 9080
 6,362   4.0  Sun3/260, 25MHz 68020
 6,423   4.1  VAX 8600, 4.3BSD
 6,440   4.1  IBM 4381-2, UTS V, cc 1.11
 6,500   4.1  IBM RT PC, AIX 2.1, New Models *Dhrystone 1.0* [IBM 87]*
 7,409   4.7  VAX 8600, VAX/VMS in [Intergraph 86]*
 7,810   5.0  Intel 80386, 20MHz, 64K-cache, Green Hills*
 8,300   5.3  DG MV20000-I and MV15000-20 [Stahlman 87]
 8,309   5.3  InterPro-32C, 30MHz Clipper, Green Hills [Intergraph 86]*
 9,600   6.1  HP9000 Model 840 [Stahlman 87, estimate]
10,300   6.6  MIPS M/500, 8MHz R2000, no optimization
10,787   6.9  VAX 8650, VAX/VMS [Intergraph 86]*
12,500   8.0  MIPS M/500, 8MHz R2000, -O*
15,300   9.7  MIPS M/800, 12.5MHz R2000, no optimization
18,500  11.8  MIPS M/800, 12.5MHz R2000, -O*
28,846  18.4  Amdahl 5860, UTS-V, cc1.22
31,250  19.9  IBM 3090/200

4.3.
Stanford Small Integer Benchmarks

John Hennessy, Director of the Computer Systems Laboratory at Stanford
University, has collected a set of programs to compare the performance
of various systems.  It is well known that small benchmarks can be
misleading.  If you see claims that machine X is up to N times a VAX
on some (unspecified) benchmarks, these benchmarks are probably the
sort they're talking about.

              Stanford Small Integer Benchmark Results

            MIPS        MIPS        DEC VAX       Sun-3/       DEC VAX
            M/800       M/500        8600          160M        11/780
Benchmark   Secs  Rel.  Secs  Rel.  Secs  Rel.    Secs  Rel.   Secs  Rel.
Perm        0.13  18.0  0.18  13.0  0.63   3.7    0.72   3.3    2.34  1.0
Towers      0.18  12.8  0.24   9.6  0.63   3.7    1.07   2.2    2.30  1.0
Queen       0.11   8.5  0.15   6.3  0.27   3.5    0.50   1.9    0.94  1.0
Intmm       0.15  11.1  0.23   7.3  0.73   2.3    0.93   1.8    1.67  1.0
Puzzle      0.63  17.8  1.15   9.8  2.96   3.8    5.53   2.0   11.23  1.0
Quick       0.11  10.2  0.17   6.6  0.31   3.6    0.58   1.9    1.12  1.0
Bubble      0.12  12.6  0.19   7.9  0.44   3.4    0.97   1.6    1.51  1.0
Tree        0.23  11.8  0.34   8.0  0.69   3.9    1.05   2.6    2.72  1.0
Aggregate*  0.24  12.8  0.36   8.5  0.86   3.6    1.42   2.2    3.08  1.0
Total+      1.66  14.4  2.65   9.0  6.66   3.6   11.35   2.1   23.83  1.0

* As weighted by the Stanford Benchmark Suite
+ Simple summation of the times for all benchmarks

5. Floating Point Benchmarks

5.1. LINPACK

The LINPACK benchmark has evolved into one of the most widely used
single benchmarks for predicting relative performance in scientific
and engineering environments.  The usual LINPACK benchmark measures
the time required to solve a 100x100 system of linear equations.
LINPACK results are measured in MFlops, millions of floating point
operations per second.

The results below use compiled FORTRAN, often called "FORTRAN BLAS",
for FORTRAN Basic Linear Algebra Subroutines.  Some hand-coded, or
"Coded BLAS", results are shown later, along with results from many
other machines, mostly taken from [Dongarra 87].

         LINPACK Benchmark Results - FORTRAN and Coded BLAS

                 MIPS          MIPS         DEC VAX       Sun-3/       DEC VAX
                 M/800         M/500        8600#         160M&        11/780#
Benchmark        MFlops Rel.   MFlops Rel.  MFlops Rel.   MFlops Rel.  MFlops Rel.
Dbl Precision:
  FORTRAN          .80   5.7     .58   4.1    .49   3.5     .10   0.6    .14   1.0
  Coded BLAS      1.08   6.4     .72   4.2    .66   3.9     .10   0.6    .17   1.0
Sngl Precision:
  FORTRAN         1.46   5.8     .89   3.6    .88   3.5     .11   0.4    .24   1.0
  Coded BLAS      1.90   5.6    1.18   3.5   1.30   3.8     .11   0.3    .34   1.0

# VAX/VMS with FPA [Dongarra 87].  Actually, our LLL Fortran numbers
  were quite close: our in-house 11/780 showed .14 MFlops (DP) and
  .24 MFlops (SP).
& [Dongarra 87] and [Sun 86] give these numbers for a Sun3/160 with
  16.7MHz MC68881, so we used them instead of those for our in-house
  (12.5MHz 68881) system.

Following are some additional numbers for LINPACK MFlops, showing DP
(FORTRAN and Coded BLAS), followed by SP (FORTRAN and Coded BLAS).
Coded BLAS is obtained by hand-coding the inner loops in assembler.
All numbers are from [Dongarra 87], unless otherwise noted.

          Selected LINPACK Results - FORTRAN and Coded BLAS

  DP      DP      SP      SP
Fortran  Coded  Fortran  Coded   System
  .08      -       -       -     IBM RT PC, standard FP [IBM 87]
  .10     .10     .11     .11    Sun-3/160, 16.7MHz [Rolled BLAS]+
  .11     .11     .13     .11    Sun-3/260, 25MHz 68020 + 20MHz 68881 [Rolled BLAS]
  .14      -      .24      -     VAX 11/780, 4.3BSD, LLL Fortran [ours]
  .14     .17     .25     .34    VAX 11/780, VAX/VMS
  .20      -      .24      -     80386+80387, 20MHz, 64K cache, Green Hills
  .20     .23     .40     .51    VAX 11/785, VAX/VMS
  .29     .49     .45     .69    Intergraph IP-32C, 30MHz Clipper [Intergraph 86]
  .30      -       -       -     IBM RT PC, optional FPA [IBM 87]
  .30      -      .39      -     DG MV/10000
  .33      -      .57      -     OPUS 300PM, Greenhills, 30MHz Clipper
  .36     .59     .51     .72    Celerity C1230, 4.2BSD f77
  .38      -      .67      -     80386+Weitek 1167, 20MHz, 64K cache, Green Hills
  .39     .50     .66     .81    Ridge 3200/90
  .41     .41     .62     .62    Sun-3/160, Weitek FPA [Rolled BLAS]
  .42      -      .56      -     HP9000 Series 840 [Stahlman 87]
  .46     .46     .86     .86    Sun-3/260, Weitek FPA [Rolled BLAS]
  .47     .81     .69    1.30    Gould PN9000
  .48      -      .94      -     Harris HCX-7, CCI Power 6/32
  .49     .66     .84    1.20    VAX 8600, VAX/VMS 4.5
  .56     .85     .68     .99    Harris H1200
  .58     .72     .89    1.18    MIPS M/500 [ours]
  .61      -      .84      -     DG MV20000-I, MV15000-20 [Stahlman 87]
  .65     .76     .80     .96    VAX 8500, VAX/VMS
  .70     .96    1.30    1.90    VAX 8650, VAX/VMS
  .78      -     1.10      -     IBM 9370-90, VS FORT 1.3.0
  .80    1.08    1.46    1.90*   MIPS M/800
  .87      -     1.20      -     Concurrent 3280XP
  .97    1.13    1.40    1.70    VAX 8700/8800, VAX/VMS
 1.10    1.40    1.20    1.60    ELXSI, MOD 2 (6420 CPU, new model)
 1.6     2.0     1.6     2.0     Alliant FX-1 (1 CE)
 3.0     3.3     4.3     4.9     CONVEX C-1/XP, Fort 2.0 (Rolled BLAS)
 7.0    11.0     7.6     9.8     Alliant FX-8, 8 CEs, FX Fortran, v2.0.1.9
43.0      -     44.0      -      NEC SX-2, Fortran 77/SX (Rolled BLAS)

+ Sun FORTRAN and coded numbers are the same, since the code cannot be
  improved by hand.
* When the multiply-add overlap bug is fixed, this should rise to
  2.50, according to our simulations.

Note that relative ordering even within families is not particularly
consistent, illustrating the extreme sensitivity of these benchmarks
to memory system design.

5.2. Whetstone

Whetstone instructions are a synthetic mix of floating point and
integer arithmetic, function calls, array indexing, conditional jumps,
and transcendental functions [Curnow 76].  Whetstone results are
measured in KWips, thousands of Whetstone interpreter instructions per
second.

               Whetstone Benchmark Results - FORTRAN

                  MIPS         MIPS        DEC VAX      Sun-3/       DEC VAX
                  M/800        M/500       8600#        160M&        11/780#
Benchmark         KWips Rel.   KWips Rel.  KWips Rel.   KWips Rel.   KWips Rel.
Double Precision  6,900  8.3   4,450  5.4  2,670  3.2     960  1.2     830  1.0
Single Precision  8,900  7.1   5,800  4.6  4,590  3.7     970  0.8   1,250  1.0

# VAX/VMS, with FPA.
& This column is from another Sun-3/160 that had a 16.6MHz 68881.

Following are Whetstone figures gathered from miscellaneous sources:

              Selected Whetstone Benchmark Results

  DP          SP
KWips  Rel.  KWips  Rel.  System
  410   0.5    500   0.4  VAX 11/780, 4.3BSD, f77 [ours]
  700   0.8    810   0.6  IBM RT PC, standard FP [IBM 87]
  715   0.9  1,083   0.9  VAX 11/780, LLL compiler [ours]
  830   1.0      -    -   Symbolics 3650, with FPA [Garren 87]
  830   1.0  1,250   1.0  VAX 11/780, VAX/VMS [Intergraph 86]
  960   1.2  1,040   0.8  Sun3/160, 68881 [Intergraph 86]
1,110   1.3  1,670   1.3  VAX 11/785, VAX/VMS [Intergraph 86]
1,230   1.5  1,250   1.0  Sun3/260, 25MHz 68020, 20MHz 68881
1,400   1.7  1,600   1.3  IBM RT PC, optional FPA [IBM 87]
1,730   2.1  1,860   1.5  Intel 80386+80387, 20MHz, 64K cache, Green Hills
1,740   2.1  2,980   2.4  Intergraph InterPro-32C, 30MHz Clipper [Intergraph 86]
1,860   2.2  2,400   1.9  Sun3/160, Weitek FPA [measured elsewhere]
    -    -   2,900   2.3  DG MV/15000-8
2,590   3.1  4,170   3.3  Intel 80386+Weitek 1167, 20MHz, Green Hills
2,600   3.1  3,400   2.7  Sun3/260, Weitek FPA [measured elsewhere]
    -    -   4,300   3.4  DG MV/15000-10
2,670   3.2  4,590   3.7  VAX 8600, VAX/VMS [Intergraph 86]
    -    -   6,400   5.1  DG MV/15000-12
3,950   4.8  6,670   5.3  VAX 8700, VAX/VMS, Pascal(?) [McInnis 87]
4,000   4.8  6,900   5.5  VAX 8650, VAX/VMS [Intergraph 86]
4,120   5.0  4,930   3.9  Alliant FX/8 (1 CE) [Alliant 86]
4,450   5.4  5,800   4.6  MIPS M/500 [ours]
6,900   8.3  8,900   7.1  MIPS M/800 [ours]

5.3. DoDuc Benchmark

This benchmark [DoDuc 86] is a 5300-line FORTRAN program that
simulates aspects of nuclear reactors, has little vectorizable code,
and is thought to be representative of Monte Carlo simulations.  It
uses mostly double precision floating point, and is often viewed as a
``nasty'' benchmark, i.e., it breaks things.

The performance is given as a number R, normalized to 100 for an IBM
370/168-3 or 170 for an IBM 3033-U [ R = 48671 / (cpu time in
seconds) ], so that larger R's are better.  In order of increasing
performance, following are numbers for various machines.  All are from
[DoDuc 87] unless otherwise specified.

              Selected DoDuc Benchmark Results

DoDuc R  Relative
Factor   Perf.    System
   14      0.5    DEC uVAXII, Ultrix
   16      0.5    Apollo DN580
   17      0.7    Sun3/110, 16.7MHz
   19      0.7    Intel 80386+80387, 16MHz, iRMX
   22      0.8    DEC uVAXII, VAX/VMS
   22      0.8    Sun-3/260, 25MHz 68020, 20MHz 68881+
   26      1.0    VAX 11/780, VMS
   43      1.7    Sun-3/260, 25MHz, Weitek FPA*
   48      1.8    Celerity C1260
   50      1.9    CCI Power 6/32
   53      2.0    Edge 1
   57      2.2    DG MV/10000
   64      2.5    Harris HCX-7
   75      2.9    MIPS M/500, f77 -O2 [runs 655 seconds]
   85      3.3    Alliant FX/1
   90      3.5    IBM 4381-2
   91      3.5    DEC VAX 8600, VAX/VMS
   92      3.5    Gould PN 9050
   93      3.6    MIPS M/500, f77 -O3 [runs 527 seconds]
   97      3.7    ELXSI 6400
   99      3.8    DG MV/20000
  101      3.9    Alliant FX/8
  113      4.3    FPSystems 164
  119      4.6    Gould 32/8750
  129      5.0    DEC VAX 8650
  136      5.2    MIPS M/800, f77 -O2 [runs 358 seconds]
  136      5.2    DEC VAX 8800, VAX/VMS
  146      5.6    MIPS M/800, f77 -O3 [runs 334 seconds]
  150      5.7    Amdahl 470 V8, VM/UTS
  181      7.0    IBM 3081-G, F4H ext, opt=2
  475     18.3    Amdahl 5860
1,080     41.6    Cray X/MP [for perspective: we have a way to go yet!]

5.4. Spice Benchmarks

The Spice program is a large circuit simulator that uses floating
point heavily, but not in the same way as LINPACK.  The numbers below
use Spice 2G.6 (FORTRAN version).  We wish we had some VAX/VMS numbers
for these, but we don't.

                 Spice 2G.6 Benchmark Results

              MIPS         MIPS        Sun-3/      Sun-3/     VAX 11/780
              M/800        M/500        260         160M        4.3BSD
Benchmark     Secs  Rel.   Secs  Rel.  Secs  Rel.  Secs  Rel.   Secs  Rel.
elh1*          3.8   6.9    6.2   3.8    -    -      -    -     26.2   1.0
elh2a*        44.4   7.6   84.0   4.0   81   4.2   123   2.7   337.1   1.0
 "" Sun 68881                          163   2.1   211   1.6
cring+        17.6   7.1   31.3   4.0    -    -      -    -    124.9   1.0
mosamp2#      15.1   7.8   29.1   4.1   31   3.8    45   2.6   118.2   1.0
 "" Sun 68881                           62   1.9    78   1.5
tunnel         1.7  11.5    3.3   5.9    -    -      -    -     19.5   1.0

# The Sun mosamp2 numbers are from [Hough 86,3].  The first row for
  the Suns uses the Weitek FPA, the second uses 68881s.
* These are considered the most representative.  They are simulations
  of real circuits using modern technology.
+ This was proposed as a standard benchmark in the Usenet newsgroup
  comp.lsi.  Mosamp2 and tunnel were included in the original Berkeley
  Spice benchmarks.

5.5.
Stanford Floating Point Benchmarks

The following two benchmarks from the Stanford Benchmark Suite
emphasize floating point performance.  They exemplify the class of
programs that use tight loops and a high proportion of actual FP code,
unlike, for example, LINPACK, which depends as much on the speed of
accessing main memory as on FP performance itself.  These benchmarks
are also quite responsive to good optimizing compiler technology.

          Stanford Floating Point Benchmark Results

                 MIPS        MIPS        DEC VAX      Sun-3/      DEC VAX
                 M/800       M/500       8600#        160M#       11/780
Benchmark        Secs  Rel.  Secs  Rel.  Secs  Rel.   Secs  Rel.  Secs  Rel.
FFT              0.20  19.8  0.34  11.7  1.42   2.8   3.12  1.3   3.97  1.0
Matrix Multiply  0.13  17.8  0.26   8.9  1.37   1.7   2.14  1.1   2.31  1.0
Aggregate*       0.37  14.9  0.59   9.4  1.88   2.9   3.41  1.6   5.52  1.0
Total+           0.33  19.0  0.60  10.5  2.79   2.2   5.17  1.2   6.28  1.0

* As weighted by the Stanford Benchmark Suite (includes some
  contribution from the integer results)
+ Simple summation of the times for both benchmarks
# Performance would be substantially improved with the respective FPA
  boards.  Also, recall that the in-house Sun has only a 12.5MHz
  68881.

Benchmark Descriptions

FFT              Computes a 256-point Fast Fourier Transform (FFT)
                 twenty times.
Matrix Multiply  Multiplies two 40x40 single-precision matrices.

6. System Benchmarks

6.1. Byte Benchmarks

The Byte benchmarks are intended to provide a mixture of user- and
kernel-level actions.  Although they are used frequently, their
user:kernel balance of 30:70 is unlike normal time-sharing use, which
is more like 50:50 or even 70:30.

In the table below, "Tot" is the sum of the "User" and "Sys" numbers
reported by /bin/time, and "Rel" is the performance relative to the
11/780 (4.3BSD).  (You can derive the "User" time by subtracting "Sys"
from "Tot".)

The first group of tests (call with assignment, call with empty
function, loop, and sieve) measures a few aspects of user-level
behavior.
The second group of tests (piping data, system calls, disk write, and
disk read) measures kernel performance on narrowly-focused benchmarks,
which are especially subject to unusual cache behavior.  The third
group shows running 1, ..., 6 shell procedures together, each
containing a sort, an od | sort pipeline, a grep | tee | wc pipeline,
and an rm.  Thus, they mostly measure the efficiency of the fork and
exec system calls, with relatively little user-level computation.
Although not a particularly accurate model of a real mix, this group
is a bit more realistic than the first two groups.

We run these with a makefile that includes 3 each of calla, calle,
loop, sieve, pipes, and scall, and the time shown is the mean of the 3
times.  The others are run once.  We run the makefile 3 times and
average everything, especially since these runs are quite short, and
even a .1 second variation can cause large jumps in relative
performance.

               Byte Benchmarks - Part 1 - BSD Systems

       MIPS M/800      MIPS M/500      DEC VAX 8600   Sun-3/160M     DEC VAX11/780
       UMIPS-BSD2.0B   UMIPS-BSD2.0B   Ultrix 1.2     SunOS 3.2      4.3BSD
Bench  Sys Tot  Rel    Sys  Tot  Rel   Sys  Tot  Rel  Sys  Tot  Rel  Sys  Tot  Rel
calla  0.0 0.0   -     0.0  0.0   -    0.0  0.0   -   0.0  0.0   -   0.0  0.1  1
calle  0.0 0.0   -     0.0  0.0   -    0.0  0.2  5.5  0.0  0.2  5.5  0.0  1.1  1
loop   0.0 0.2 12.0    0.0  0.3  8.0   0.0  0.9  2.7  0.0  1.5  1.6  0.0  2.4  1
sieve  0.0 0.2  7.0    0.0  0.2  7.0   0.0  0.4  3.5  0.0  0.6  2.3  0.0  1.4  1
pipes  0.3 0.3  6.0    0.7  0.7  2.6   0.3  0.3  6.0  1.6  1.6  1.1  1.6  1.8  1
scall  0.6 0.6  6.3    0.7  0.7  5.4   0.8  1.0  3.8  2.0  2.6  1.5  3.2  3.8  1
diskw  0.1 0.1  7.0    0.2  0.2  3.5   0.2  0.2  3.5  0.5  0.5  1.4  0.7  0.7  1
diskr  0.1 0.1  5.0    0.2  0.2  2.5   0.2  0.2  2.5  0.3  0.3  1.7  0.5  0.5  1
sh 1   0.3 0.3  5.7    0.4  0.4  4.1   0.5  0.6  2.8  0.5  0.6  2.8  1.4  1.7  1
sh 2   0.5 0.5  6.4    0.7  0.8  4.0   0.9  1.1  2.9  1.1  1.3  2.5  2.6  3.2  1
sh 3   0.7 0.8  6.0    1.1  1.2  4.0   1.4  1.7  2.8  1.6  1.9  2.5  3.8  4.8  1
sh 4   1.1 1.2  5.3    1.4  1.6  4.0   1.9  2.3  2.8  2.2  2.5  2.6  5.1  6.4  1
sh 5   1.2 1.4  5.8    1.9  2.1  3.9   2.4  2.9  2.8  2.9  3.3  2.5  6.3  8.1  1
sh 6   1.5 1.7  5.8    2.2  2.5  3.9   2.9  3.5  2.8  3.4  3.9  2.5  7.7  9.8  1
Tot*   8.2 9.0  7.4   12.3 14.7  4.6  13.7 20.9  3.2 24.9 33.7  2.0 42.5 67.0  1

* Summation of times as we run them, which includes multiple instances
  of some of the tests, whose average numbers are shown.  Precisely:
  total = 3*(calla + calle + loop + sieve + pipes + scall)
          + diskw + diskr + sh1 + sh2 + sh3 + sh4 + sh5 + sh6

These numbers can be interpreted in many different ways, and are
presented because many people use them, not because we consider them
representative of overall performance.

Following are some additional Byte benchmarks, showing UMIPS-V
performance on the M/500 (and, sooner or later, the M/800), plus a
30MHz Intergraph InterPro-32C [Intergraph 86], with the Sun and VAX
numbers replicated for context.  We don't know why the IP-32C's kernel
performance is as shown, since it seems out of line with other areas
of its performance.  UMIPS-V numbers for the M/800 will be available
soon.

           Byte Benchmarks - Part 2 - Some System V Numbers

       MIPS M/800      MIPS M/500      Intergraph     Sun-3/160M     DEC VAX11/780
       UMIPS-V1.1      UMIPS-V1.1      IP-32C,SVR3    SunOS 3.2      4.3BSD
Bench  Sys Tot  Rel    Sys  Tot  Rel   Sys  Tot  Rel  Sys  Tot  Rel  Sys  Tot  Rel
calla  0.0 0.0   -     0.0  0.0   -    0.0  0.0   -   0.0  0.0   -   0.0  0.1  1
calle  0.0 0.0   -     0.0  0.0   -    0.0  0.1 11.0  0.0  0.2  5.5  0.0  1.1  1
loop   0.0 0.2 12.0    0.0  0.4  7.0   0.0  0.5  4.8  0.0  1.5  1.6  0.0  2.4  1
sieve  0.0 0.2  7.0    0.0  0.2  7.0   0.0  0.4  3.5  0.0  0.6  2.3  0.0  1.4  1
pipes  0.X 0.X  X.X    0.3  0.3  6.0   1.4  1.4  1.3  1.6  1.6  1.1  1.6  1.8  1
scall  0.X 0.X  X.X    0.6  0.6  5.4   1.4  1.5  2.5  2.0  2.6  1.5  3.2  3.8  1
diskw  0.1 0.1  7.0    0.1  0.1  7.0   0.2  0.2  3.5  0.5  0.5  1.4  0.7  0.7  1
diskr  0.1 0.1  5.0    0.1  0.1  5.0   0.3  0.3  1.7  0.3  0.3  1.7  0.5  0.5  1
sh 1   0.X 0.X  X.X    0.2  0.3  5.7   2.4  2.4  0.7  0.5  0.6  2.8  1.4  1.7  1
sh 2   0.X 0.X  X.X    0.5  0.7  4.6   2.7  2.8  1.1  1.1  1.3  2.5  2.6  3.2  1
sh 3   0.X 0.X  X.X    0.7  1.1  4.4   4.0  4.1  1.2  1.6  1.9  2.5  3.8  4.8  1
sh 4   1.X 1.X  X.X    1.0  1.5  4.3   5.8  6.0  1.1  2.2  2.5  2.6  5.1  6.4  1
sh 5   1.X 1.X  X.X    1.2  1.8  4.5   7.2  7.4  1.1  2.9  3.3  2.5  6.3  8.1  1
sh 6   1.X 1.X  X.X    1.3  2.0  4.7   8.3  8.6  1.1  3.4  3.9  2.5  7.7  9.8  1
Tot*   X.X X.X  X.X    7.8 12.4  5.4  34.3 43.5  1.5 24.9 33.7  2.0 42.5 67.0  1

7. Acknowledgements

Many people at MIPS contributed to this document, which was originally
created by Web Augustine.  We thank David Hough of Sun Microsystems,
who kindly supplied numbers for some of the Sun configurations, even
correcting a few of our numbers that were incorrectly high.  We also
thank Cliff Purkiser of Intel, who posted the Intel 80386 Whetstone
and LINPACK numbers on Usenet [Purkiser 87].

8. References

[Alliant 86] Alliant Computer Systems Corp., "FX/Series Product
Summary", October 1986.

[Curnow 76] Curnow, H. J., and Wichman, B. A., ``A Synthetic
Benchmark'', Computing Journal, Vol. 19, No. 1, February 1976, pp.
43-49.

[DoDuc 87] DoDuc, N., FORTRAN Central Processor Time Benchmark,
Framentec, June 1986, Version 13.  Newer numbers were received
03/17/87, and we used them where different.

[Dongarra 87] Dongarra, J. J., ``Performance of Various Computers
Using Standard Linear Equations in a Fortran Environment'', Technical
Memo No. 23, Argonne National Laboratory, March 1987.

[Fleming 86] Fleming, P. J., and Wallace, J. J., ``How Not to Lie
With Statistics: The Correct Way to Summarize Benchmark Results'',
Communications of the ACM, Vol. 29, No. 3, March 1986, pp. 218-221.

[Garren 87] Garren, S., ``Symbolics on Performance'', Symbolics,
Cambridge, MA, January 9, 1987.

[Hough 86,1] Hough, D., ``Weitek 1164/5 Floating Point
Accelerators'', Usenet, January 1986.

[Hough 86,2] Hough, D., ``Benchmarking and the 68020 Cache'', Usenet,
January 1986.

[Hough 86,3] Hough, D., ``Floating-Point Programmer's Guide for the
Sun Workstation'', Sun Microsystems, September 1986.  [An excellent
document, including a good set of references on IEEE floating point,
especially on micros, and good notes on benchmarking hazards.]
Sun3/260 Spice numbers are from later mail.

[IBM 87] IBM, ``IBM RT Personal Computer (RT PC) New Models,
Features, and Software Overview'', February 17, 1987.
[Intergraph 86] Intergraph Corporation, ``Benchmarks for the InterPro
32C'', December 1986.

[McInnis 87] McInnis, D., Kusik, R., and Bhandarkar, D., ``VAX 8800
System Overview'', Proc. IEEE COMPCON, San Francisco, March 1987, pp.
316-321.

[Purkiser 87] Purkiser, C., ``Whetstone and LINPACK Numbers'',
Usenet, March 1987.

[Richardson 87] Richardson, R., ``3/15/87 Dhrystone Benchmark
Results'', Usenet, March 1987.

[Stahlman 87] Stahlman, M., "The Myth of Price/Performance", Sanford
C. Bernstein & Co., Inc., New York, NY, March 17, 1987.

[Sun 86] Sun Microsystems, ``The SUN-3 Family: A Hardware Overview'',
August 1986.

[Weicker 84] Weicker, R. P., ``Dhrystone: A Synthetic Systems
Programming Benchmark'', Communications of the ACM, Vol. 27, No. 10,
October 1984, pp. 1013-1030.
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   {decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086