mash@mips.UUCP (10/30/87)
MIPS Performance Brief, PART 1 : CPU Benchmarks, Issue 3.0, October 1987.
This is a condensed version, with MAC Charts, some explanatory details,
and most advertisements deleted. For a full version on paper,
with MAC charts and tables that aren't squished together, send mail,
NOT TO ME, but to:
....{ucbvax, decvax, ihnp4}!decwrl!mips!eleanor [Eleanor Bishop]
(Please be patient: it will take a little while to get them sent.)
As usual, I tried pretty hard to get the numbers right, but let us know if we
goofed anywhere. We'll update the next time.
The posting is 1200+ lines long, with a thousand or so benchmark
numbers & performance ratios for dozens of machines and benchmarks.
If you're not a glutton for data, "n" now or after the main summary!
------------
1. Introduction
New Features of This Issue
More benchmarks are normalized to the VAX-11/780 under VAX/VMS, rather than UNIX.
Livermore Loops and Digital Review Magazine benchmarks have been added, and
the Spice section uses new public domain inputs.
The Brief has been divided into two parts: user and system. The system bench-
mark part is being greatly expanded (beyond Byte), and has been moved to a new
document "MIPS Performance Brief - Part 2". User-level performance is mostly
driven by compiler and library tuning, whereas system performance also depends
on operating system and hardware configuration issues. The two Briefs syn-
chronize with different release cycles: Part 2 will appear 1-2 months after
Part 1.
Benchmarking - Caveats and Comments
While no one benchmark can fully characterize overall system performance, the
results of a variety of benchmarks can give some insight into expected real
performance. A more important benchmarking methodology is a side-by-side com-
parison of two systems running the same real application.
We don't believe in characterizing a processor with just a single number, but
we follow (what seems to be) standard industry practice of using a mips-rating
that essentially describes overall integer performance. Thus, we label a 5-
mips machine to be one that is about 5X (i.e., anywhere from 4X to 6X!) faster
than a VAX 11/780 (UNIX 4.3BSD, unless we can get Ultrix or VAX/VMS numbers) on
integer performance, since this seems to be how most people intuitively compute
mips-ratings. Even within the same computer family, performance ratios between
processors vary widely. For example, [McInnis 87] characterizes a ``6 mips''
VAX 8700 as anywhere from 3X to 7X faster than the 11/780. Floating point speed
often varies more widely than integer speed, and scales up more slowly, relative
to the 11/780.
This paper analyzes one important aspect of overall computer system performance
- user-level CPU performance.
MIPS Computer Systems does not warrant or represent that the performance data
stated in this document will be achieved by any particular application. (We
have to say that, sorry.)
2. Benchmark Summary
2.1. Choice of Benchmarks
This brief offers both public-domain and MIPS-created benchmarks. We prefer
public domain ones, but some of the most popular ones are inadequate for accu-
rately characterizing performance. In this section, we give an overview of the
importance we attach to the various benchmarks, whose results are summarized on
the next page.
Dhrystone [DHRY 1.1] and Stanford [STAN INT] are two popular small integer
benchmarks. Compared with the fastest VAX 11/780 systems, the M/1000 is 13-14X
faster than the VAX on these tests, and yet, we rate the M/1000 as a 10-vax-
mips machine.
While we present Dhrystone and Stanford, we feel that the performance of large
UNIX utilities, such as grep, yacc, diff, and nroff is a better (but not per-
fect!) guide to the performance customers will receive. These four, which make
up our [MIPS UNIX] benchmark, demonstrate that performance ratios are not sin-
gle numbers, but range here from 8.6X to 13.7X faster than the VAX.
Even these UNIX utilities tend to overstate performance relative to large
applications, such as CAD applications. Our own vax-mips ratings are based on
a proprietary set of larger and more stressful real programs, such as our com-
piler, assembler, debugger, and various CAD programs.
For floating point, the public domain benchmarks are much better. We're still
careful not to use a single benchmark to characterize all floating point appli-
cations.
The Livermore Fortran kernels [LLNL DP] give insight into both vector and non-
vector performance for scientific applications. Linpack [LNPK DP and LNPK SP]
tests vector performance on a single scientific application, and stresses cache
performance. Spice [SPCE 2G6] and Doduc [DDUC] test a different part of the
floating point application spectrum. The codes are large and thus test both
instruction fetch bandwidth and scalar floating point. Digital Review
Magazine's benchmark [DIG REV] is a compendium of FORTRAN tests that measure a wide
variety of behavior, and seem to correlate well with some classes of real pro-
grams.
2.2. Benchmark Summary Data
This section summarizes the most important
benchmark results described in more detail throughout this document. The
numbers show performance relative to the VAX 11/780, i.e., larger numbers are
better/faster.
o A few numbers have been estimated by interpolations from closely-related
benchmarks and/or closely-related machines. The methods are given in
great detail in the individual sections.
o Several of the columns represent summaries of multiple benchmarks. For
example, the MIPS UNIX column represents 4 benchmarks, the SPICE 2G6
column 3, and LLNL DP represents 24.
o In the Integer section, MIPS UNIX is the most indicative of real perfor-
mance.
o For Floating Point, we especially like LLNL DP (Livermore FORTRAN ker-
nels), but all of these are useful, non-toy benchmarks.
o In the following table, "Pub mips" gives the manufacturer-published mips-
ratings. As in all tables in this document, the machines are listed in
increasing order of performance according to the benchmarks, in this case,
by Integer performance.
o The summary includes only those machines for which we could get measured
results on almost all the benchmarks and good estimates on the results
for the few missing data items.
Summary of Benchmark Results
(VAX 11/780 = 1.0, Bigger is Faster)
Integer (C) Floating Point (FORTRAN)
---------------- -------------------------------------
MIPS DHRY STAN LLNL LNPK LNPK SPCE DIG DDUC Publ
UNIX 1.1 INT DP DP SP 2G6 REV mips System
1 1 1 1 1 1 1 1 1 1 VAX 11/780#
2.1 1.9 1.8 1.9 2.9 2.5 1.6 *2 *1.3 2 Sun3/160 FPA
*4 4.1 4.7 2.8 3.3 3.4 2.4 *3 1.7 4 Sun3/260 FPA
5.5 7.4 7.2 2.5 4.3 3.7 3.4 4.9 3.8 5 MIPS M/500
*6 5.9 6.5 5.9 6.9 5.6 5.3 6.2 5.2 6 VAX 8700
8.0 10.8 7.3 4.5 7.9 6.4 4.1 4.4 3.5 10 Sun4/260
9.2 11.3 11.8 8.1 7.1 7.6 6.6 7.6 7.3 8 MIPS M/800
11.3 13.5 14.1 9.7 8.6 9.2 8.0 9.3 8.8 10 MIPS M/1000
# VAX 11/780 runs 4.3BSD for MIPS UNIX, Ultrix 2.0 (vcc) for Stanford, VAX/VMS
for all others. Use of 4.3BSD (no global optimizer) probably inflates the
MIPS UNIX column by about 10%.
* Although it is nontrivial to gather a full set of numbers, it is important to
avoid holes in benchmark tables, as it is too easy to be misleading. Thus,
we had to make reasoned guesses at these numbers. The MIPS UNIX values for
VAX 8700 and Sun-3/260 were taken from the Published mips-ratings, which are
consistent (+/- 10%) with experience with these machines. DIG REV and DDUC
were guessed by noting that most machines do somewhat better on DIG REV than
on SPCE, and that a Sun-3/260 is usually 1.5X faster than a Sun-3/160 on
floating-point benchmarks.
Benchmark Descriptions:
MIPS UNIX
MIPS UNIX benchmarks: grep, diff, yacc, nroff, same 4.2BSD C source compiled
and run on all machines. The summary number is the geometric mean of the 4
relative performance numbers.
DHRY 1.1
Dhrystone 1.1, any optimization except inlining.
STAN INT
Stanford Integer.
LLNL DP
Lawrence Livermore Fortran Kernels, 64-bit. The summary number is given
as the relative performance based on the geometric mean, i.e., the "middle"
of the 3 means.
LNPK DP
Linpack Double Precision, FORTRAN.
LNPK SP
Linpack Single Precision, FORTRAN.
SPCE 2G6
Spice 2G6, 3 public-domain circuits, for which the geometric mean is shown.
DIG REV
Digital Review magazine, combination of 33 benchmarks.
DDUC
Doduc Monte Carlo benchmark.
3. Methodology
Tested Configurations
When we report measured results, rather than numbers published elsewhere, the
configurations were as shown below. These system configurations do not neces-
sarily reflect optimal configurations, but rather the in-house systems to which
we had repeatable access. When faster results have been available elsewhere,
we've quoted them in place of our own systems' numbers.
DEC VAX-11/780
Main Memory: 8 Mbytes
Floating Point: Configured with FPA board.
Operating System: 4.3 BSD UNIX.
DEC VAX 8600
Main Memory: 20 Mbytes
Floating Point: Configured without FPA board.
Operating System: Ultrix V1.2. (4.2BSD with many 4.3BSD tunings).
Sun-3/160M
CPU: (16.67 MHz MC68020)
Main Memory: 8 Mbytes
Floating Point: 12.5 MHz MC68881 coprocessor (compiled -f68881).
Operating System: SunOS 3.2 (4.2BSD)
MIPS M/500
CPU: 8MHz R2000, in R2300 CPU board, 16K I-cache, 8K D-cache
Floating Point: R2010 FPA chip (8MHz)
Main Memory: 8 Mbytes (2 R2350 memory boards)
Operating System: UMIPS-BSD 2.1 (4.3BSD UNIX with NFS)
MIPS M/800
CPU: 12.5 MHz R2000, in R2600 CPU board, 64K I-cache, 64K D-cache
Floating Point: R2010 FPA chip (12.5MHz)
Main Memory: 8 Mbytes (2 R2350 memory boards)
Operating System: UMIPS-BSD 2.1
MIPS M/1000
CPU: 15 MHz R2000, in R2600 CPU board, 64K I-cache, 64K D-cache
Floating Point: R2010 FPA chip (15 MHz)
Main Memory: 16 Mbytes (4 R2350 memory boards)
Operating System: UMIPS-BSD 2.1
Test Conditions
All programs were compiled with -O (optimize), unless otherwise noted.
C is used for all benchmarks except Whetstone, LINPACK, Doduc, Spice 2g.6,
Hspice, and the Livermore Fortran Kernels, which use FORTRAN. When possible,
we've obtained numbers for VAX/VMS, and use them in place of UNIX numbers. The
MIPS compilers are version 1.21.
User time was measured for all benchmarks using the /bin/time command.
Systems were tested in normal multi-user development environment, with load
factor <0.2 (as measured by the uptime command). Note that this occasionally makes
them run longer, due to slight interference from background daemons and clock
handling, even on an otherwise empty system. Benchmarks were run at least 3
times and averaged. The intent is to show numbers that can be reproduced on
live systems.
Times (or rates, such as for Dhrystones, Whetstones, and LINPACK KFlops) are
shown for the VAX 11/780. Other machines' times or rates are shown, and their
relative performance ("Rel." column) normalized to the 11/780 treated as 1.0.
VAX/VMS is used whenever possible as the base.
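To make the normalization concrete, the following is a minimal sketch
(hypothetical helper functions, not part of any benchmark harness) of how a
"Rel." entry is derived from raw times or rates; the example figures are taken
from tables later in this document.

    #include <stdio.h>

    /* Relative performance, normalized to the VAX 11/780 = 1.0.
     * For quantities measured as times (seconds), smaller is better,
     * so Rel. = vax_time / machine_time.  For quantities measured as
     * rates (Dhrystones/sec, KWips, MFlops), bigger is better, so
     * Rel. = machine_rate / vax_rate.
     */
    double rel_from_time(double vax_secs, double machine_secs)
    {
        return vax_secs / machine_secs;
    }

    double rel_from_rate(double vax_rate, double machine_rate)
    {
        return machine_rate / vax_rate;
    }

    int main(void)
    {
        /* nroff: 18.8 secs on the 11/780 vs 1.5 secs on an M/1000. */
        printf("nroff Rel. = %.1f\n", rel_from_time(18.8, 1.5));
        /* Dhrystone: 1,757/sec on the 11/780 (VMS) vs 23,700 on an M/1000. */
        printf("Dhrystone Rel. = %.1f\n", rel_from_rate(1757.0, 23700.0));
        return 0;
    }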
Compilers and Operating Systems
Unless otherwise specified, the M-series benchmark numbers use Release 1.21 of
the MIPS compilers and UMIPS-BSD 2.1.
Optimization Levels
Unless otherwise specified, all benchmarks were compiled -O, i.e., with optimi-
zation. UMIPS compilers call this level -O2, and it includes global intra-
procedural optimization. In a few cases, we show numbers for -O3 and -O4
optimization levels, which do inter-procedural register allocation and pro-
cedure merging. -O3 is now generally available.
Now, let's look at the benchmarks. Each section title includes the (CODE NAME)
that relates it back to the earlier Summary, if it is included there.
4. Integer Benchmarks
4.1. MIPS UNIX Benchmarks (MIPS UNIX)
The MIPS UNIX Benchmarks described below are fairly typical of nontrivial UNIX
programs. This benchmark suite provides the opportunity to execute the same
code across several different machines, in contrast to the compilers and link-
ers for each machine, which have substantially different code. User time is
shown; kernel time is typically 10-15% of the user time (on the 780), so these
are good indications of integer/character compute-intensive programs. The
first 3 benchmarks were running too fast to be meaningful on our faster
machines, so we modified the input files to get larger times. The VAX 8600 ran
consistently around 3.8X faster than the 11/780 on these tests, but we sold it,
so it's started to drop out as we've changed benchmarks. These benchmarks con-
tain UNIX source code, and are thus not generally distributable.
For better statistical properties, we now report the Geometric Mean of the
Relative performance numbers, because it does not ignore the performance con-
tributions of the shorter benchmarks. (In this case, the grep ratios drag the
Geometric Mean down.) Expect real performance to lie between the Geometric Mean
and the Total Relative number.
Note: the Geometric Mean of N numbers is the Nth root of the product of those
numbers. It is necessarily used in place of the arithmetic mean when computing
the mean of performance ratios, or of benchmarks whose runtimes are quite dif-
ferent. See [Fleming 86] for a detailed discussion.
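As a concrete sketch (our own scripts differ in detail), the Geom Mean column
in the table below could be computed as follows; the sample ratios are the
M/1000 row. Compile with -lm.

    #include <math.h>
    #include <stdio.h>

    /* Geometric mean of n performance ratios: the nth root of their
     * product, computed via logarithms to avoid overflow for large n. */
    double geometric_mean(const double ratios[], int n)
    {
        double log_sum = 0.0;
        int i;

        for (i = 0; i < n; i++)
            log_sum += log(ratios[i]);
        return exp(log_sum / (double) n);
    }

    int main(void)
    {
        /* M/1000 relative performance on grep, diff, yacc, and nroff. */
        double m1000[4] = { 8.6, 13.7, 10.9, 12.5 };

        printf("Geom Mean = %.1f\n", geometric_mean(m1000, 4));   /* 11.3 */
        return 0;
    }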
MIPS UNIX Benchmarks Results
grep diff yacc nroff Total Geom System
Secs Rel. Secs Rel. Secs Rel. Secs Rel. Secs+ Rel. Mean
11.2 1.0 246.4 1.0 101.1 1.0 18.8 1.0 377.5 1.0 1.0 11/780 4.3BSD
5.6 2.0 105.3 2.3 48.1 2.1 9.0 2.1 168.0 2.2 2.1 Sun-3/160M
- - - - - - 5.0 3.8 - 3.8 3.8 DEC VAX 8600
2.4 4.7 35.8 6.9 19.5 5.2 3.3 5.7 61.0 6.2 5.5 MIPS M/500
- 7 - 8.5 - 9 - 7.5 - - 8.0 Sun-4 *
1.6 7.0 21.6 11.4 11.2 9.0 1.9 9.9 36.3 10.4 9.2 MIPS M/800
1.3 8.6 18.0 13.7 9.3 10.9 1.5 12.5 30.1 12.5 11.3 MIPS M/1000
+ Simple summation of the time for all benchmarks. "Total Rel." is ratio of
the totals.
* These numbers derived as shown on next page.
Note: in order to assure "apples-to-apples" comparisons, we moved the same
copies of the (4.2BSD) sources for these to the various machines, compiled them
there, and ran them, to avoid surprises from different binary versions of com-
mands resident on these machines.
Note that the granularity here is at the edge of UNIX timing, i.e., tenths of
seconds make differences, especially on the faster machines. The performance
ratios seen here seem typical of large UNIX commands on MIPS systems.
Estimation of Sun-4 Numbers
The 4 ratios cited in the table above were pieced together from somewhat frag-
mentary information. Earlier Briefs used shorter benchmarks for the grep,
diff, and yacc tests, for which we were able to get numbers from a Sun-4, with
SunOS 3.2L, at 16.7MHz. By comparing with M/500 and M/800 numbers on the same
tests, we can interpolate and at least estimate bounds on performance.
On grep, both Sun-4 and M/800 used .6 seconds user time, so we assumed the same
relative performance (7.0).
On yacc, both Sun-4 and M/800 used .4 seconds, so we assumed the same relative
performance (9.0).
On diff, 3 runs on the Sun-4 yielded .3, .4, and .4 seconds. A MIPS M/500 is
consistently .4, an M/800 .2, and an M/1000 usually .2, but occasionally .1.
This test was clearly too small, and it is difficult to make strong assertions.
However, it does seem that the Sun-4 is faster than the M/500 (6.9X VAX) but
noticeably closer to it than to an M/800 (11.4X). We thus estimate around
8.5X, a little lower than halfway between the two MIPS systems.
On nroff, a setup problem of ours aborted the Sun-4 run. At a seminar at Stan-
ford this summer, the following number was given by Sun: 18.5 seconds for:
troff -man csh.1
This is not sufficient information to allow exact duplication, but we tried
running it two different ways:
troff -a -man csh.1 >/dev/null
troff -man ..... csh.1
The second case used numerous arguments to actually run it for a typesetter,
whereas the first emits a representation to the standard output. The M/500
required 23.4 and 27.9 seconds user time, respectively, while the M/800 gave
user times of 11.2 and 14.1 seconds. Assuming that the troff results are simi-
lar to those of nroff, and using the worst of the M/800 times, we get a VAX-
relative estimate of:
(9.9X for M/800) X 14.1 (M/800 secs) / 18.5 (Sun-4 secs)
which yields 7.5X for the Sun-4.
All of this is obviously a gross approximation with numerous items of missing
information. Timing granularity is inadequate. Results are being generalized
from small samples. The source code may well differ for these programs. Our
csh.1 manual pages may not be identical. Sun compilers will improve, etc, etc.
We apologize for the low quality of this data, as Sun-4 access is not something
we have in good supply. We'll run the newest forms of the benchmarks as soon
as possible. However, the end result does seem quite consistent with other
benchmarks that various people in the industry have tried.
Finally, note that this benchmark set is running versus 4.3BSD, not versus
Ultrix 2.0 with vcc. Hence, the relative performance numbers are inflated
somewhat relative to VAX/VMS or VAX-Ultrix numbers. From other experience,
we'd guess that subtracting 10% from most of the computed mips-ratings would
give a good estimate of the Ultrix 2.0 (vcc)-relative mips-ratings.
4.2. Dhrystone (DHRY 1.1)
Dhrystone is a synthetic programming benchmark that measures processor and com-
piler efficiency on a ``typical'' mix of programming statements. The Dhrystone results
shown below are measured in Dhrystones / second, using the 1.1 version of the
benchmark.
We include Dhrystone because it is popular. MIPS systems do extremely well on
it. However, comparisons of systems based on Dhrystone, and especially on
Dhrystone alone, are unreliable and should be avoided. More details are given at the
end of this section. According to [Richardson 87], 1.1 cleans up a bug, and is
the correct version to use, even though results for a given machine are typi-
cally about 15% less for 1.1 than with 1.0.
Advice for running Dhrystone has changed over time with regard to optimization.
It used to ask that people turn off optimizers that were more than peephole
optimizers, because the benchmark contained a modest amount of "dead" code that
optimizers were eliminating. However, it turned out that many people were sub-
mitting optimized results, often unlabeled, confusing everyone. Currently, any
numbers can be submitted, as long as they're appropriately labeled, except that
procedure inlining (done by only a few very advanced compilers) must be
avoided.
We continue to include a range of numbers to show the difference optimization
technology makes on this particular benchmark, and to provide a range for com-
parison when others' cited Dhrystone figures are not clearly defined by optimi-
zation levels. For example, -O3 does interprocedural register allocation, and
-O4 does procedure inlining, and we know -O4 is beyond the spirit of the bench-
mark. Hence, we now cite the -O3 numbers. We're not sure what the Sun-4's -O3
level does, but we do not believe that it does inlining either.
In the table below, it is interesting to compare the performance of the two
Ultrix compilers. Also, examination of the MIPS and Sun-4 numbers shows the
performance gained by the high-powered optimizers available on these machines.
The numbers are ordered by what we think is the overall integer performance of
the processors.
Dhrystone Benchmark Results - Optimization Effects
No Opt -O -O3 -O4
NoReg Regs NoReg Regs Regs Regs
Dhry's Dhry's Dhry's Dhry's Dhry's Dhry's
/Sec /Sec /Sec /Sec /Sec /Sec System
1,442 1,474 1,559 1,571 DEC VAX 11/780, 4.3BSD
2,800 3,025 3,030 3,325 Sun-3/160M
4,896 5,130 5,154 5,235 DEC VAX 8600, Ultrix 1.2
8,800 10,200 12,300 12,300 13,000 14,200 MIPS M/500
8,000 8,000 8,700 8,700 DEC VAX 8550, Ultrix 2.0 cc
9,600 9,600 9,600 9,700 DEC VAX 8550, Ultrix 2.0 vcc
10,550 12,750 17,700 17,700 19,000 Sun-4, SunOS 3.2L
12,800 15,300 18,500 18,500 19,800 21,300 MIPS M/800
15,100 18,300 22,000 22,000 23,700 25,000 MIPS M/1000
Some other published numbers of interest include the following, all of which
are taken from [Richardson 87], unless otherwise noted. Items marked * are
those that we know (or have good reason to believe) use optimizing compilers.
These are the "register" versions of the numbers, i.e., the highest ones
reported by people.
Dhrystone Benchmark Results
Dhry's
/Sec Rel. System
1571 0.9 VAX 11/780, 4.3BSD [in-house]
1757 1.0 VAX 11/780, VAX/VMS 4.2 [Intergraph 86]*
3325 1.9 Sun3/160, SunOS 3.2 [in-house]
3856 2.2 Pyramid 98X, OSx 3.1, CLE 3.2.0
4433 2.5 MASSCOMP MC-5700, 16.7MHz 68020, RTU 3.1*
4716 2.7 Celerity 1230, 4.2BSD, v3.2
6240 3.6 Ridge 3200, ROS 3.4
6374 3.6 Sun3/260, 25MHz 68020, SunOS 3.2
6423 3.7 VAX 8600, 4.3BSD
6440 3.7 IBM 4381-2, UTS V, cc 1.11
6896 3.9 Intergraph InterPro 32C, SYSV R3 3.0.0, Greenhills, -O*
7109 4.0 Apollo DN4000 -O
7142 4.1 Sun-3/200 [Sun 87] *
7249 4.2 Convex C-1 XP 6.0, vc 1.1
7409 4.2 VAX 8600, VAX/VMS in [Intergraph 86]*
7655 4.4 Alliant FX/8 [Multiflow]
8300 4.7 DG MV20000-I and MV15000-20 [Stahlman 87]
8309 4.7 InterPro-32C,30MHz Clipper,Green Hills[Intergraph 86]*
9436 5.4 Convergent Server PC, 20MHz 80386, GreenHills*
9920 5.6 HP 9000/840S [HP 87]
10416 5.9 VAX 8550, VAX/VMS 4.5, cc 2.2*
10787 6.1 VAX 8650, VAX/VMS, [Intergraph 86]*
11215 6.4 HP 9000/840, HP-UX, full optimization*
12639 7.2 HP 9000/825S [HP 87]*
13000 7.4 MIPS M/500, 8MHz R2000, -O3*
13157 7.5 HP 825SRX [Sun 87]*
14195 8.1 Multiflow Trace 7/200 [Multiflow]
14820 8.4 CRAY 1S
15007 8.5 IBM 3081, UTS SVR2.5, cc 1.5
15576 8.9 HP 9000/850S [HP 87]
18530 10.5 CRAY X-MP
19000 10.8 Sun-4/200, -O3* [Sun 87]
19800 11.3 MIPS M/800, 12.5MHz R2000, -O3*
23700 13.5 MIPS M/1000, 15MHz R2000, -O3*
28846 16.4 Amdahl 5860, UTS-V, cc1.22
31250 17.8 IBM 3090/200
43668 24.9 Amdahl 5890/300E, cc -O
Unusual Dhrystone Attributes
We've calibrated this benchmark against many more realistic ones, and we
believe that its results must be treated with care, because the detailed pro-
gram statistics are unusual in some ways. It has an unusually low number of
instructions per function call (35-40 on our machines), where most C programs
fall in the 50-60 range or higher. Stated another way, Dhrystone does more
function calls than usual, which especially penalizes the DEC VAX, making this
a favored benchmark for inflating one's "VAX-mips" rating. Any machine with a
lean function call sequence looks a little better on Dhrystone than it does on
others.
The dynamic nesting depth of function calls inside the timed part of Dhrystone
is low (3-4). This means that most register-window RISC machines would never
even once overflow/underflow their register windows and be required to
save/restore registers.
This is not to say fast function calls or register windows are bad (they're
not!), merely that this benchmark overstates their performance effects.
Dhrystone can spend 30-40% of the time in the strcpy function, copying atypi-
cally long (30-character) strings, which happen to be alignable on word boun-
daries. More realistic programs don't spend this much time in this sort of
code, and when they do, they handle many shorter strings: 6 characters would be
much more typical.
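To illustrate the point (this is not the benchmark source, and the string
literal below is invented), the flavor of the copy Dhrystone spends its time
in, versus more typical string handling, is roughly:

    #include <string.h>

    char Dest[31];

    /* Dhrystone-style: a 30-character constant that happens to start on
     * a word boundary, the best case for a word-at-a-time copy loop. */
    void dhrystone_style(void)
    {
        strcpy(Dest, "A THIRTY CHARACTER WORD STRING");
    }

    /* More typical: many short strings, where per-call overhead and
     * byte-at-a-time startup dominate. */
    void more_typical(void)
    {
        strcpy(Dest, "typcl");
    }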
On our machines, Dhrystone uses 0-offset addressing for 50% of memory data
references (dynamic). Most real programs use 0-offsets 10-15% of the time.
Of course, Dhrystone is a fairly small benchmark, and thus fits into almost any
reasonable instruction cache.
In conclusion, Dhrystone gives some indication of user-level integer perfor-
mance, but is susceptible to surprises when comparing amongst architectures
that differ strongly. Unfortunately, the industry seems to lack a good set of
widely-available integer benchmarks that are as representative as are some of
the popular floating point ones.
4.3. Stanford Small Integer Benchmarks (STAN INT)
The Computer Systems Laboratory at Stanford University has collected a set of
programs to compare the performance of various systems. These benchmarks are
popular in some circles as they are small enough to simulate, and are respon-
sive to compiler optimizations.
It is well known that small benchmarks can be misleading. If you see claims
that machine X is up to N times a VAX on some (unspecified) benchmarks, these
benchmarks are probably the sort they're talking about.
Stanford Small Integer Benchmark Results
Perm Tower Queen Intmm Puzzle Quick Bubble Tree Aggr Rel.
Secs Secs Secs Secs Secs Secs Secs Secs Secs* Perf+ System
2.34 2.30 .94 1.67 11.23 1.12 1.51 2.72 3.08 .84 VAX 11/780 4.3BSD
2.60 1.0 VAX 11/780@
.72 1.07 .50 .93 5.53 .58 .97 1.05 1.42 1.8 Sun-3/160M [ours]
.63 .63 .27 .73 2.96 .31 .44 .69 .86 3.0 VAX 8600 Ultrix1.2
.28 .35 .17 .42 2.22 .18 .25 .35 .50 5.2 VAX 8550#
.28 .35 .13 .15 .88 .13 .17 .50 .40 6.5 VAX 8550##
.65 4.7 Sun-3/200 [Sun 87]
.18 .24 .15 .23 1.15 .17 .19 .34 .36 7.2 MIPS M/500
.36 7.3 Sun-4/200 [Sun 87]
.12 .16 .11 .13 .61 .10 .12 .22 .22 11.8 MIPS M/800
.10 .13 .10 .11 .51 .08 .10 .17 .18 14.1 MIPS M/1000
* As weighted by the Stanford Benchmark Suite
+ Ratios of the Aggregate times
@ Estimated VAX 11/780 Ultrix 2.0 vcc -O time. We get this by 3.08 *
(.40+.02)/.50 = 2.60, i.e., using the VAX 8550 numbers to estimate the
effect of optimization. The ".02" is a guess that optimization helps the
8550 a little more than it does the 11/780, because the former's cache is
big enough to hold the whole program and data, whereas the latter's is not.
much, and so optimization pays off more in removing what's left, whereas
the 11/780 will cache-miss more, and the nature of these particular tests
is that the optimizations won't fix cache-misses. (None of this is very
scientific, but it's probably within 10%!)
# Ultrix 2.0 cc -O
## Ultrix 2.0 vcc -O. The quick and bubble tests actually had errors; how-
ever, the times were in line with expectations (these two optimize well),
so we used them. All 8550 numbers thanks to Greg Pavlov
(ames!harvard!hscvax!pavlov, of Amherst, NY).
The Sun numbers are from [Sun 87]. The published Sun-4 number is .356, for
SunOS 3.2L software, i.e., it is slightly faster than the M/500.
5. Floating Point Benchmarks
5.1. Livermore Fortran Kernels (LLNL DP)
Lawrence Livermore National Labs' workload is dominated by large scientific
calculations that are largely vectorizable. The workload is primarily served
by expensive supercomputers. This benchmark was designed for evaluation of
such machines, although it has been run on a wide variety of hardware, includ-
ing workstations and PCs [McMahon 86].
The Livermore Fortran Kernels are 24 pieces of code abstracted from the appli-
cations at Lawrence Livermore Labs. These kernels are embedded in a large,
carefully engineered benchmark driver. The driver runs the kernels multiple
times on different data sets, checks for correct results, verifies timing accu-
racy, reports execution rates for all 24 kernels, and summarizes the results
with several statistics.
Unlike many other benchmarks, there is no attempt to distill the benchmark
results down to a single number. Instead all 24 kernel rates, measured in
mflops (million floating point operations per second) are presented individu-
ally for three different vector lengths (a total of 72 results). The minimum
and maximum rates define the performance range of the hardware. Various
statistics of the 24 or 72 rates, such as the harmonic, geometric, and arith-
metic means give insight into general behavior. Any one of these statistics
might suffice for comparisons of scalar machines, but multiple statistics are
necessary for comparisons involving machines with vector or parallel features.
These machines have unbalanced, bimodal performance, and a single statistic is
an insufficient characterization. McMahon asserts:
``When the computer performance range is very large the net Mflops rate
of many Fortran programs and workloads will be in the sub-range between
the equi-weighted harmonic and arithmetic means depending on the degree
of code parallelism and optimization. More accurate estimates of cpu
workload rates depend on assigning appropriate weights for each kernel.''
McMahon's analysis goes on to suggest that the harmonic mean corresponds to
approximately 40% vectorization, the geometric mean to approximately 70% vec-
torization, and the arithmetic mean to 90% vectorization. These three statis-
tics can be interpreted as different benchmarks that each characterize certain
applications. For example, there is fair agreement between the kernels' har-
monic mean and Spice performance. LINPACK, on the other hand, is better
characterized by the geometric mean.
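As a sketch of the arithmetic involved (assuming the 24 per-kernel mflops
rates are already in hand), the three summary statistics are simply:

    #include <math.h>
    #include <stdio.h>

    /* Equi-weighted harmonic, geometric, and arithmetic means of the
     * kernel rates (mflops).  Harmonic <= geometric <= arithmetic; the
     * wider the spread, the more the machine's rating depends on how
     * vectorizable the workload is assumed to be. */
    void summarize(const double mflops[], int n)
    {
        double inv_sum = 0.0, log_sum = 0.0, sum = 0.0;
        int i;

        for (i = 0; i < n; i++) {
            inv_sum += 1.0 / mflops[i];
            log_sum += log(mflops[i]);
            sum     += mflops[i];
        }
        printf("Harmonic mean:   %.2f\n", (double) n / inv_sum);
        printf("Geometric mean:  %.2f\n", exp(log_sum / (double) n));
        printf("Arithmetic mean: %.2f\n", sum / (double) n);
    }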
On the next two pages are shown a summary of results from McMahon's report,
followed by the complete M/1000 results. (Given the volume of data, we've only
done this on M/1000s. M/800s scale directly by .833X).
The complete M/1000 data shows that MIPS performance is insensitive to vector
length. The minimum to maximum variation is also small for this benchmark.
Both characteristics are typical of scalar machines with mature compilers.
Performance of vector and parallel machines, on the other hand, may span two orders
of magnitude on this benchmark, or more, depending on the kernel and the vector
length.
64-Bit Livermore FORTRAN Kernels
MegaFlops, L = 167, Sorted by Geometric Mean
Harm. Geom. Arith. Rel.*
Min Mean Mean Mean Max Geom. System
.05 .12 .12 .13 .24 .7 VAX 780 w/FPA 4.3BSD f77 [ours]
.06 .16 .17 .18 .28 1.0 VAX 780 w/FPA VMS 4.1
.11 .30 .33 .37 .87 1.9 SUN 3/160 w/FPA
.20 .42 .46 .50 1.42 2.5 MIPS M/500
.17 .43 .48 .53 1.13 2.8 SUN 3/260 w/FPA [our numbers]
.29 .58 .64 .70 1.21 3.8 Alliant FX/1 FX 2.0.2 Scalar
.38 .72 .77 .83 1.57 4.5 SUN 4/260 w/FPA [Hough 87]
.39 .94 1.00 1.04 1.64 5.9 VAX 8700 w/FPA VMS 4.1
.10 .76 1.06 1.50 5.23 6.2 Alliant FX/1 FX 2.0.2 Vector
.33 .92 1.06 1.20 2.88 6.2 Convex C-1 F77 V2.1 Scalar
.52 1.09 1.19 1.30 2.74 7.0 ELXSI 6420 EMBOS F77 MP=1
.51 1.26 1.37 1.48 2.70 8.1 MIPS M/800
.61 1.51 1.65 1.78 3.24 9.7 MIPS M/1000
1.01 1.06 1.94 3.33 12.79 11.4 Convex C-1 F77 V2.1 Vector
.28 1.24 2.32 5.11 29.20 13.7 Alliant FX/8 FX 2.0.2 MP=8*Vec
1.51 4.93 5.86 7.00 17.43 34.5 Cray-1S CFT 1.4 scalar
1.23 4.74 6.09 7.67 21.64 35.8 FPS 264 SJE APFTN64
3.43 9.29 10.68 12.15 25.89 62.8 Cray-XMP/1 COS CFT77.12 scalar
0.97 6.47 11.94 22.20 82.05 70.2 Cray-1S CFT 1.4 vector
4.47 11.35 13.08 15.20 45.07 76.9 NEC SX-2 SXOS1.21 F77/SX24 scalar
1.47 12.33 24.84 50.18 188 146 Cray-XMP/1 COS CFT77.12 vector
4.47 19.07 43.94 140 1042 258 NEC SX-2 SXOS1.21 F77/SX24 vector
32-Bit Livermore FORTRAN Kernels
MegaFlops, L = 167, Sorted by Geometric Mean
Harm. Geom. Arith. Rel.*
Min Mean Mean Mean Max Geom. System
.05 .18 .20 .23 .48 .7 VAX 780 4.3BSD f77 [ours]
.10 .28 .30 .32 .58 1.0 VAX 780 w/FPA VMS 4.1
.19 .46 .50 .56 1.26 1.7 SUN 3/160 w/FPA
.30 .65 .71 .77 1.55 2.4 SUN 3/260 w/FPA [ours]
.30 .66 .74 .83 1.60 2.5 Alliant FX/1 FX 2.0.2 Scalar
.10 .60 .90 1.31 4.23 3.0 Alliant FX/1 FX 2.0.2 Vector
.40 .97 1.05 1.14 2.08 3.5 MIPS M/500
.55 1.04 1.12 1.20 2.21 3.7 SUN 4/260 w/FPA [Hough 87]
.36 1.11 1.27 1.42 3.61 4.2 Convex C-1 F77 V2.1 Scalar
.46 1.26 1.36 1.45 2.41 4.5 VAX 8700 w/FPA VMS 4.1
.68 1.31 1.46 1.61 3.19 4.9 ELXSI 6420 EMBOS F77 MP=1
.93 2.02 2.19 2.36 3.96 7.3 MIPS M/1000
.28 1.30 2.47 5.59 33.52 8.2 Alliant FX/8 FX 2.0.2 MP=8*Vec
.12 1.27 2.73 5.44 23.60 9.1 Convex C-1 F77 V2.1 Vector
* Relative Performance, as ratio of the Geometric Mean numbers. This is a
simplistic attempt to extract a single figure-of-merit. We admit this goes
against the grain of this benchmark. The next table gives the complete
M/1000 output, in the form used by McMahon.
Livermore FORTRAN Kernels - Complete MIPS M/1000 Output
Vendor MIPS MIPS MIPS MIPS | MIPS MIPS MIPS MIPS
Model M/1000 M/1000 M/1000 M/1000 | M/1000 M/1000 M/1000 M/1000
OSystem BSD2.1 BSD2.1 BSD2.1 BSD2.1 | BSD2.1 BSD2.1 BSD2.1 BSD2.1
Compiler 1.21 1.21 1.21 1.21 | 1.21 1.21 1.21 1.21
OptLevel O2 O2 O2 O2 | O2 O2 O2 O2
Samples 72 24 24 24 | 72 24 24 24
WordSize 64 64 64 64 | 32 32 32 32
DO Span 167 19 90 471 | 167 19 90 471
Year 1987 1987 1987 1987 | 1987 1987 1987 1987
Kernel ------ ------ ------ ------ | ------ ------ ------ ------
1 2.2946 2.2946 2.3180 2.3136 | 2.9536 2.9536 2.9613 2.9727
2 1.6427 1.6427 1.8531 1.8381 | 2.1218 2.1218 2.4691 2.4758
3 2.0625 2.0625 2.1260 2.1021 | 2.8935 2.8935 2.9389 2.9853
4 1.3440 1.3440 1.7954 1.9600 | 1.6836 1.6836 2.2084 2.4248
5 1.4652 1.4652 1.4879 1.4776 | 2.0924 2.0924 2.1096 2.1374
6 1.0453 1.0453 1.3734 1.4183 | 1.3076 1.3076 1.7920 1.8517
7 3.1165 3.1165 3.1304 3.1281 | 3.9336 3.9336 3.9581 3.9623
8 2.4829 2.4829 2.5725 2.5686 | 3.1625 3.1625 3.2853 3.2612
9 3.2215 3.2215 3.2359 3.2290 | 3.8708 3.8708 3.8831 3.8632
10 1.2293 1.2293 1.2336 1.2327 | 2.4413 2.4413 2.3419 2.3263
11 1.1907 1.1907 1.2274 1.2320 | 1.5789 1.5789 1.6365 1.6559
12 1.2102 1.2102 1.2404 1.2308 | 1.6004 1.6004 1.6414 1.6471
13 0.6095 0.6095 0.6272 0.6378 | 0.9288 0.9288 0.9334 0.9428
14 0.9712 0.9712 0.9455 0.6695 | 1.2133 1.2133 1.2175 1.0092
15 0.9894 0.9894 0.9605 0.9585 | 1.3701 1.3701 1.3314 1.3314
16 1.6427 1.6427 1.6159 1.6272 | 1.7904 1.7904 1.7567 1.7500
17 2.3898 2.3898 2.2674 2.2667 | 3.7320 3.7320 3.4866 3.5210
18 2.1462 2.1462 2.3321 2.3276 | 2.5642 2.5642 2.8431 2.8365
19 1.8268 1.8268 1.8540 1.8536 | 2.2883 2.2883 2.3369 2.3466
20 2.7821 2.7821 2.7800 1.9158 | 3.7104 3.7104 3.7214 3.6583
21 1.5372 1.5372 1.6004 1.6201 | 1.9644 1.9644 2.0564 2.0855
22 1.4507 1.4507 1.4506 1.4489 | 2.0802 2.0802 2.0646 2.0658
23 2.1395 2.1395 2.3897 2.3729 | 2.5972 2.5972 2.9489 2.9532
24 1.0148 1.0148 1.0458 1.0448 | 1.1995 1.1995 1.2226 1.2281
-------------- ------ ------ ------ ------ | ------ ------ ------ ------
Standard Dev. 0.6878 0.6938 0.6902 0.6742 | 0.8702 0.8870 0.8616 0.8669
Median Dev. 0.6869 0.6445 0.7023 0.6546 | 0.9837 0.8980 1.0239 1.0360
Maximum Rate 3.2359* 3.2215 3.2359 3.2290 | 3.9623* 3.9336 3.9581 3.9623
Average Rate 1.7834* 1.7419 1.8110 1.7698 | 2.3611* 2.2949 2.3811 2.3872
Geometric Mean 1.6469* 1.6053 1.6752 1.6330 | 2.1935* 2.1238 2.2175 2.2168
Median Rate 1.6272* 1.5899 1.7056 1.7326 | 2.2084* 2.1071 2.2727 2.3365
Harmonic Mean 1.5078* 1.4724 1.5368 1.4874 | 2.0235* 1.9590 2.0503 2.0372
Minimum Rate 0.6095* 0.6095 0.6272 0.6378 | 0.9288* 0.9288 0.9334 0.9428
Maximum Ratio 1.0000 0.9955 1.0000 0.9978 | 1.0000 0.9927 0.9989 1.0000
Average Ratio 1.0000 0.9767 1.0154 0.9923 | 1.0000 0.9719 1.0084 1.0110
Geometric Ratio 1.0000 0.9747 1.0171 0.9915 | 1.0000 0.9682 1.0109 1.0106
Harmonic Mean 1.0000 0.9765 1.0192 0.9864 | 1.0000 0.9681 1.0132 1.0067
Minimum Rate 1.0000 1.0000 1.0290 1.0464 | 1.0000 1.0000 1.0049 1.0150
* These are the numbers brought forward into the summary section.
5.2. LINPACK (LNPK DP and LNPK SP)
The LINPACK benchmark has become one of the most widely used single benchmarks
to predict relative performance in scientific and engineering environments.
The usual LINPACK benchmark measures the time required to solve a 100x100 sys-
tem of linear equations using the LINPACK package. LINPACK results are meas-
ured in MFlops, millions of floating point operations per second. All numbers
are from [Dongarra 87], unless otherwise noted.
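The MFlops figure comes from dividing the conventional operation count for
factoring and solving an n-by-n system, roughly (2/3)*n^3 + 2*n^2 with n = 100
here, by the measured time; the time used below is invented for illustration.

    #include <stdio.h>

    /* MFlops for the 100x100 LINPACK case, using the benchmark's
     * conventional operation count of (2/3)*n^3 + 2*n^2. */
    double linpack_mflops(int n, double seconds)
    {
        double ops = (2.0 / 3.0) * n * n * n + 2.0 * (double) n * n;
        return ops / seconds / 1.0e6;
    }

    int main(void)
    {
        /* A machine solving the 100x100 system in 0.56 seconds would
         * be rated at about 1.2 MFlops. */
        printf("%.2f MFlops\n", linpack_mflops(100, 0.56));
        return 0;
    }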
The LINPACK package calls on a set of general-purpose utility routines called
BLAS -- Basic Linear Algebra Subroutines -- to do most of the actual computa-
tion. A FORTRAN version of the BLAS is available, and the appropriate routines
are included in the benchmark. However, vendors often provide hand-coded ver-
sions of the BLAS as a library package. Thus LINPACK results are usually cited
in two forms: FORTRAN BLAS and Coded BLAS. The FORTRAN BLAS actually come in
two forms as well, depending on whether the loops are 4X unrolled in the FOR-
TRAN source (the usual) or whether the unrolling is undone to facilitate recog-
nition of the loop as a vector instruction. According to the ground rules of
the benchmark, either may be used when citing FORTRAN BLAS results, although it
is typical to note rolled loops with the annotation ``(Rolled BLAS).''
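To make the distinction concrete, here is the DAXPY-style inner loop that
dominates LINPACK, sketched in C for illustration (the benchmark itself is
FORTRAN): the usual source unrolls it 4X by hand, while the rolled form leaves
a simple loop that a vectorizing compiler can recognize as one vector
operation.

    /* dy = dy + da*dx : the DAXPY kernel at the heart of the BLAS. */

    /* Rolled form: easy for a vectorizer to recognize. */
    void daxpy_rolled(int n, double da, const double *dx, double *dy)
    {
        int i;

        for (i = 0; i < n; i++)
            dy[i] += da * dx[i];
    }

    /* 4X hand-unrolled form, as in the usual FORTRAN BLAS source;
     * the cleanup of the n mod 4 leftover elements is omitted here. */
    void daxpy_unrolled(int n, double da, const double *dx, double *dy)
    {
        int i;

        for (i = 0; i + 3 < n; i += 4) {
            dy[i]     += da * dx[i];
            dy[i + 1] += da * dx[i + 1];
            dy[i + 2] += da * dx[i + 2];
            dy[i + 3] += da * dx[i + 3];
        }
    }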
For our own numbers, we've corrected a few to follow Dongarra more closely than
we have in the past. LINPACK output produces quite a few MFlops numbers, and
we've tended to use the fourth one in each group, which uses more iterations,
and thus is more immune to clock randomness. Dongarra uses the highest MFlops
number that appears, then rounds to two digits.
Note that relative ordering even within families is not particularly con-
sistent, illustrating the extreme sensitivity of these benchmarks to memory
system design.
100x100 LINPACK Results - FORTRAN and Coded BLAS
From [Dongarra 87], Unless Noted Otherwise
DP DP SP SP
Fortran Coded Fortran Coded System
.10 .10 .11 .11 Sun-3/160, 16.7MHz (Rolled BLAS)+
.11 .11 .13 .11 Sun-3/260, 25MHz 68020 + 20MHz 68881 (Rolled BLAS)+
.14 - - - Apollo DN4000, 25MHz (68020 o 68881) [ENEWS 87]
.14 - .24 - VAX 11/780, 4.3BSD, LLL Fortran [ours]
.14 .17 .25 .34 VAX 11/780, VAX/VMS
.20 - .24 - 80386+80387, 20MHz, 64K cache, GreenHills
.20 .23 .40 .51 VAX 11/785, VAX/VMS
.29 .49 .45 .69 Intergraph IP-32C, 30MHz Clipper [Intergraph 86]
.30 - - - IBM RT PC, optional FPA [IBM 87]
.33 - .57 - OPUS 300PM, Greenhills, 30MHz Clipper
.36 .59 .51 .72 Celerity C1230, 4.2BSD f77
.38 - .67 - 80386 + Weitek 1167, 20MHz, 64K cache, GreenHills
.39 .50 .66 .81 Ridge 3200/90
.41 .41 .62 .62 Sun-3/160, Weitek FPA (Rolled BLAS)+
.45 .54 .60 .74 HP9000 Model 840S [HP 87]
.46 .46 .86 .86 Sun-3/260, Weitek FPA (Rolled BLAS)+
.47 .81 .69 1.30 Gould PN9000
.49 .66 .84 1.20 VAX 8600, VAX/VMS 4.5
.49 .54 .62 .68 HP 9000/825S [HP 87]
.57 .72 .86 .87 HP9000 Model 850S [HP 87]
.60 .72 .93 1.2 MIPS M/500
.61 - .84 - DG MV20000-I, MV15000-20 [Stahlman, 87]
.65 .76 .80 .96 VAX 8500, VAX/VMS
.70 .96 1.3 1.9 VAX 8650, VAX/VMS
.78 - 1.1 - IBM 9370-90, VS FORT 1.3.0
.97 1.1 1.4 1.7 VAX 8550/8700/8800, VAX/VMS
1.0 1.3 1.9 3.6 MIPS M/800
1.1 1.1 1.6 1.6 SUN 4/260 (Rolled BLAS)+
1.2 1.7 1.3 1.6 ELXSI 6420
1.2 1.6* 2.3* 4.3 MIPS M/1000
1.6 2.0 1.6 2.0 Alliant FX-1 (1 CE)
2.1 - 2.4 - IBM 3081K H enhanced opt=3
3.0 3.3 4.3 4.9 CONVEX C-1/XP, Fort 2.0 (Rolled BLAS)
6.0 - - - Multiflow Trace 7/200 Fortran 1.4 (Rolled BLAS)
7.0 11.0 7.6 9.8 Alliant FX-8, 8 CEs, FX Fortran, v2.0.1.9
12 23 n.a. n.a. CRAY 1S CFT (Rolled BLAS)
39 57 n.a. n.a. CRAY X-MP CFT (Rolled BLAS)
43 - 44 - NEC SX-2, Fortran 77/SX (Rolled BLAS)
+ The Sun FORTRAN Rolled BLAS code appears to be optimal, so we used the same
numbers for Coded BLAS. The 4X unrolled numbers for Sun-4 are .86 (DP) and
1.25 (SP) [Hough 87].
* These numbers are as reported by Dongarra. We prefer the typical results,
which are slightly lower, viz. 1.2, 1.5, 2.2, and 4.3.
On the next page, we take a subset of these numbers, and normalize them to the
VAX/VMS 11/780.
100x100 LINPACK Results - FORTRAN and Coded BLAS
VAX/VMS Relative Performance
For A Subset of the Systems
Rel. Rel. Rel. Rel.
DP DP SP SP
Fortran Coded Fortran Coded System
.8 .6 .5 .3 Sun-3/260, 25MHz 68020 + 20MHz 68881 (Rolled)
1.0 1.0 1.0 1.0 VAX 11/780, VAX/VMS
1.4 - 1.0 - 80386 + 80387, 20MHz, 64K cache, GreenHills
2.0 2.9 1.8 2.0 Intergraph IP-32C, 30MHz Clipper [Intergraph 86]
2.7 - 2.7 - 80386 + Weitek 1167, 20MHz, 64K cache, GreenHills
2.9 2.4 2.5 1.8 Sun-3/160, Weitek FPA (Rolled BLAS)
3.3 2.7 3.4 2.5 Sun-3/260, Weitek FPA (Rolled BLAS)
3.5 3.9 3.4 3.5 VAX 8600, VAX/VMS 4.5
4.1 4.2 3.4 2.6 HP9000 Model 850S [HP 87]
4.3 4.2 3.7 3.5 MIPS M/500
6.9 6.6 5.6 5.0 VAX 8550/8700/8800, VAX/VMS
7.1 7.6 7.6 10.6 MIPS M/800
7.9 6.5 6.4 4.7 SUN 4/260 (Rolled BLAS)
8.6 9.4 9.2 12.6 MIPS M/1000
11.4 11.8 6.4 5.9 Alliant FX-1 (1 CE)
21.4 19.4 17.2 14.4 CONVEX C-1/XP, Fort 2.0 (Rolled BLAS)
50 65 30 28.8 Alliant FX-8, 8 CEs, FX Fortran, v2.0.1.9
307 - 176 - NEC SX-2, Fortran 77/SX (Rolled BLAS)
5.3. Spice Benchmarks (SPCE 2G6)
Spice [UCB 87] is a general-purpose circuit simulator written at U.C. Berke-
ley. Spice and its derivatives are widely used in the semiconductor industry.
It is a valuable benchmark because it shares many characteristics with other
real-world programs that are not represented in popular small benchmarks. It
uses both integer and floating-point computation heavily. The floating-point
calculations are not vector oriented, as in LINPACK. Also, the program itself
is very large and therefore tests both instruction and data cache performance.
We have chosen to benchmark Spice version 2g.6 because of its general availa-
bility. This is one of the later and more popular Fortran versions of Spice
distributed by Berkeley. We felt that the circuits distributed with the Berke-
ley distribution for testing and benchmarking were not sufficiently large and
modern to serve as benchmarks. In previous versions of this Brief, we presented
results on circuits we felt were representative, but which contained
proprietary data. This time, we gathered and produced appropriate benchmark
circuits that can be distributed, and have since been posted as public domain
on Usenet. The Spice group at Berkeley found these circuits to be up-to-date
and good candidates for Spice benchmarking. By distributing the circuits we
obtained results for many other machines. In the table below, "Geom Mean" is
the geometric mean of the 3 "Rel." columns.
Spice2G6 Benchmarks Results
digsr bipole comparator Geom
Secs Rel. Secs Rel. Secs Rel. Mean System
1354.0 0.60 439.6 0.68 460.3 0.63 .6 VAX 11/780 4.3BSD, f77 V2.0
993.5 0.81 394.3 0.76 366.9 0.80 .8 Microvax-II Ultrix 1.1, fortrel
901.9 0.90 285.1 1.0 328.6 0.89 .9 SUN 3/160 SunOS 3.2 f77 -O -f68881
848.0 0.95 312.6 0.96 302.9 0.96 1.0 VAX 11/780 4.3BSD, fortrel -opt
808.1 1.0 299.1 1.0 291.7 1.0 1.0 VAX 11/780 VMS 4.4 /optimize
744.8 1.1 221.7 1.3 266.0 1.1 1.2 SUN 3/260 SunOS 3.2 f77 -O -f68881
506.5 1.6 170.0 1.8 189.1 1.5 1.6 SUN 3/160 SunOS 3.2 f77 -O -ffpa
361.2 2.2 112.0 2.7 129.4 2.3 2.4 SUN 3/260 SunOS 3.2 f77 -O -ffpa
296.5 2.7 73.4 4.1 83.0 3.5 3.4 MIPS M/500
225.9 3.6 63.7 4.7 73.4 4.0 4.1 SUN 4/260 f77 -O3 -Qoption as -Ff0+
- - - - - - 5.3 VAX 8700 (estimate)
136.5 5.9 42.6 7.0 41.4 7.0 6.6 MIPS M/800
125.5 6.4 39.5 7.6 39.3 7.4 7.1 AMDAHL 470V7 VMSP FORTVS4.1
114.3 7.1 35.4 8.4 34.5 8.5 8.0 MIPS M/1000
48.0 16.8 12.5 23.9 17.5 16.7 18.9 FPS 20/64 VSPICE (2g6 derivative)
+ Sun numbers are from [Hough 87], who notes that the Sun-4 number was beta
software, and that a few modules did not optimize.
Benchmark descriptions:
digsr CMOS 9 bit Dynamic shift register with parallel load capability, i.e.,
SISO (Serial Input Serial Output) and PISO (Parallel Input Serial Out-
put), widely used in microprocessors. Clock period is 10 ns. Channel
length = 2 um, Gate Oxide = 400 Angstrom. Uses MOS LEVEL=2.
bipole Schottky TTL edge-triggered register. Supplied with nearly-coincident
inputs (synchronizer application).
comparator
Analog CMOS auto-zeroed comparator, composed of Input, Differential
Amplifier and Latch. Input signal is 10 microvolts. Channel Length =
3 um, Gate Oxide = 500 Angstrom. Uses MOS LEVEL=3. Each part is connected
by capacitance coupling, which is often used for offset cancellation.
(Sometimes called Toronto, in honor of its source).
Hspice is a commercial version of Spice offered by Meta-Software, which
recently published benchmark results for a variety of machines [Meta-software
87]. (Note that the M/800 number cited there was before the UMIPS-BSD 2.1 and
f77 1.21 releases, and the numbers have improved). The VAX 8700 Spice number
(5.3X) was estimated by using the Hspice numbers below for 8700 and M/800, and
the M/800 Spice number:
(5.5: 8700 Hspice) / (6.9: M/800 Hspice) X (6.6: M/800 Spice) yields 5.3X.
This section indicates that the performance ratios seem to hold for at least
one important commercial version as well.
Hspice Benchmarks Results
HSPICE-8601K
ST230
Secs Rel. System
166.5 .6 VAX 11/780, 4.2BSD
92.2 1.0 VAX 11/780 VMS
91.5 1.0 Microvax-II VMS
29.2 3.2 ELXSI 6400
29.1 3.2 Alliant FX/1
25.3 3.6 HyperSPICE (EDGE)
16.8 5.5 VAX 8700 VMS
16.3 5.7 IBM 4381-12
13.4 6.9 MIPS M/800 [ours]
11.3 8.2 MIPS M/1000 [ours]
3.27 28.2 IBM 3090
2.71 34.0 CRAY-1S
5.4. Digital Review (DIG REV)
The Digital Review magazine benchmark [DR 87] is a 3300-line FORTRAN program
that includes 33 separate tests, mostly floating-point, some integer. The
magazine reports the times for all tests, and summarizes them with the
geometric mean seconds shown below. All numbers below are from [DR 87], except
the M/500 and M/800 figures.
Digital Review Benchmarks Results
Secs Rel. System
9.17 1.0 VAXstation II/GPX, VMS 4.5
2.90 3.2 VAXstation 3200
2.32 4.0 VAX 8600, VMS 4.5
2.09 4.4 Sun-4, SunOS 3.2L
1.86 4.9 MIPS M/500 [ours]
1.584 5.8 VAX 8650
1.480 6.2 Alliant FX/8, 1 CE
1.469 6.2 VAX 8700
1.200 7.6 MIPS M/800 [ours]
1.193 7.7 ELXSI 6420
.990 9.3 MIPS M/1000*
.487 18.8 Convex C-1 XP
* The actual run number was .99, which [DR 87] reported as 1.00.
5.5. Doduc Benchmark (DDUC)
This benchmark [Doduc 86] is a 5300-line FORTRAN program that simulates aspects
of nuclear reactors, has little vectorizable code, and is thought to be
representative of Monte-Carlo simulations. It uses mostly double precision
floating point, and is often viewed as a ``nasty'' benchmark, i.e., it breaks
things, and makes machines underperform their usual VAX-mips ratings. Perfor-
mance is given as a number R normalized to 100 for an IBM 370/168-3 or 170 for
an IBM 3033-U, [ R = 48671/(cpu time in seconds) ], so that larger R's are
better.
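A quick check of the formula against two of the entries below (the helper
function is just illustrative):

    #include <stdio.h>

    /* Doduc figure of merit: R = 48671 / (cpu time in seconds). */
    double doduc_r(double cpu_seconds)
    {
        return 48671.0 / cpu_seconds;
    }

    int main(void)
    {
        printf("M/500  -O2 (553 secs): R = %.0f\n", doduc_r(553.0));  /* ~88  */
        printf("M/1000 -O3 (213 secs): R = %.0f\n", doduc_r(213.0));  /* ~229 */
        return 0;
    }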
In order of increasing performance, following are numbers for various machines.
All are from [Doduc 87] unless otherwise specified.
Double Precision Doduc Benchmark Results
DoDuc R Relative
Factor Perf. System
17 0.7 Sun3/110, 16.7MHz
19 0.7 Intel 80386 + 80387, 16MHz, iRMX
22 0.8 Sun-3/260, 25MHz 68020, 20MHz 68881
26 1.0 VAX 11/780, VMS
33 1.3 Fairchild Clipper, 30MHz, Green Hills
43 1.7 Sun-3/260, 25MHz, Weitek FPA
48 1.8 Celerity C1260
50 1.9 CCI Power 6/32
53 2.0 Edge 1
64 2.5 Harris HCX-7
85 3.3 Alliant FX/1
88 3.4 MIPS M/500, f77 -O2, runs 553 seconds
90 3.5 IBM 4381-2
90 3.5 Sun-4/200 [Hough 1987], SunOS 3.2L, runs 540 seconds
91 3.5 DEC VAX 8600, VAX/VMS
97 3.7 ELXSI 6400
99 3.8 DG MV/20000
100 3.8 MIPS M/500, f77 -O3, runs 488 seconds
101 3.9 Alliant FX/8
113 4.3 FPSystems 164
119 4.6 Gould 32/8750
129 5.0 DEC VAX 8650
136 5.2 DEC VAX 8700, VAX/VMS
150 5.7 Amdahl 470 V8, VM/UTS
178 6.8 MIPS M/800, f77 -O2, runs 273 secs
181 7.0 IBM 3081-G, F4H ext, opt=2
190 7.3 MIPS M/800, f77 -O3, runs 256 secs
218 8.4 MIPS M/1000, f77 -O2, runs 223 secs
229 8.8 MIPS M/1000, f77 -O3, runs 213 secs
236 9.1 IBM 3081-K
475 18.3 Amdahl 5860
714 27.5 IBM 3090-200, scalar mode
1080 41.6 Cray X/MP [for perspective: we have a lonnng way to go yet!]
5.6. Whetstone
Whetstone is a synthetic mix of floating point and integer arithmetic, function
calls, array indexing, conditional jumps, and transcendental functions [Curnow
76].
Whetstone results are measured in KWips, thousands of Whetstone interpreter
instructions per second. In this case, some of our numbers actually went down,
although compiled code has generally improved. First, the accuracy of several
library routines was improved, at a slight cost in performance. Second, on
machines this fast, relatively few clock ticks are actually counted, and UNIX
timing includes some variance. We've been running many runs and averaging.
We've now increased the loop counts from 10 to 1000 to increase the total run-
ning time to the point where the variance is reduced. This changed the bench-
mark slightly. Our experiences show some general uncertainty about the numbers
reported by anybody: we've heard that various different source programs are
being used.
Whetstone Benchmark Results
DP DP SP SP
KWips Rel. Kwips Rel. System
410 0.5 500 0.4 VAX 11/780, 4.3BSD, f77 [ours]
715 0.9 1,083 0.9 VAX 11/780, LLL compiler [ours]
830 1.0 1,250 1.0 VAX 11/780 VAX/VMS [Intergraph 86]
960 1.2 1,040 0.8 Sun3/160, 68881 [Intergraph 86]
1,110 1.3 1,670 1.3 VAX 11/785, VAX/VMS [Intergraph 86]
1,230 1.5 1,250 1.0 Sun3/260, 25MHz 68020, 20MHz 68881
1,400 1.7 1,600 1.3 IBM RT PC, optional FPA [IBM 87]
1,730 2.1 1,860 1.5 Intel 80386 + 80387, 20MHz, 64K cache, GreenHills
1,740 2.1 2,980 2.4 Intergraph InterPro-32C,30MHz Clipper[Intergraph86]
1,744 2.1 2,170 1.7 Apollo DN4000, 25MHz 68020, 25MHz 68881 [ENEWS 87]
1,860 2.2 2,400 1.9 Sun3/160, FPA
2,092 2.5 3,115 2.5 HP 9000/840S [HP 87]
2,433 2.9 3,521 2.8 HP 9000/825S [HP 87]
2,590 3.1 4,170 3.3 Intel 80386 + Weitek 1167, 20MHz, Green Hills
2,600 3.1 3,400 2.7 Sun3/260, Weitek FPA [measured elsewhere]
2,670 3.2 4,590 3.7 VAX 8600, VAX/VMS [Intergraph 86]
2,907 3.5 4,202 3.4 HP 9000 Model 850S [HP 87]
3,540 4.3 5,290 4.2 Sun-4 (reported secondhand, not confirmed)
- - 6,400 5.1 DG MV/15000-12
3,950 4.8 6,670 5.3 VAX 8700, VAX/VMS, Pascal(?) [McInnis, 1987]
4,000 4.8 6,900 5.5 VAX 8650, VAX/VMS [Intergraph 86]
4,120 5.0 4,930 3.9 Alliant FX/8 (1 CE) [Alliant 86]
4,200 5.1 - - Convex C-1 XP [Multiflow]
4,220 5.1 5,430 4.3 MIPS M/500
6,930 8.0 8,570 6.9 MIPS M/800
7,960 9.6 10,280 8.2 MIPS M/1000
12,605 15 - - Multiflow Trace 7/200 [Multiflow]
25,000 30 - - IBM 3090-200 [Multiflow]
35,000 42 - - Cray X-MP/12
6. Acknowledgements
Some people have noted that they seldom believe the numbers that come from cor-
porations unless accompanied by names of people who take responsibility for the
numbers. Many people at MIPS have contributed to this document, which was ori-
ginally created by Web Augustine. Particular contributors to this issue
include Mark Johnson (much Spice work, including creation of public-domain
Spice benchmarks), and especially Earl Killian (a great deal of work in various
areas, particularly floating-point). Final responsibility for the numbers in
this Brief is taken by the editor, John Mashey.
We thank David Hough of Sun Microsystems, who kindly supplied numbers for some
of the Sun configurations, even fixing a few of our numbers that were
incorrectly high, and who has also offered good comments on joint efforts look-
ing for higher-quality benchmarks.
We also thank Cliff Purkiser of Intel, who posted the Intel 80386 Whetstone and
LINPACK numbers on Usenet.
We also thank Greg Pavlov, who ran hordes of Stanford and Dhrystone benchmarks
for us on a VAX 8550, Ultrix 2.0 system.
7. References
[Alliant 86]
Alliant Computer Systems Corp, "FX/Series Product Summary", October 1986.
[Curnow 76]
Curnow, H. J., and Wichman, B. A., ``A Synthetic Benchmark'', Computing
Journal, Vol. 19, No. 1, February 1976, pp. 43-49.
[Doduc 87]
Doduc, N., FORTRAN Central Processor Time Benchmark, Framentec, June 1986,
Version 13. Newer numbers were received 03/17/87, and we used them where
different.
E-mail: seismo!mcvax!ftcsun3!ndoduc
[Dongarra 87]
Dongarra, J., ``Performance of Various Computers Using Standard Linear Equa-
tions in a Fortran Environment'', Argonne National Laboratory, August 10,
1987.
[Dongarra 87b]
Dongarra, J., Marin, J., Worlton, J., "Computer Benchmarking: paths and pit-
falls", IEEE Spectrum, July 1987, 38-43.
[DR 87]
"A New Twist: Vectors in Parallel", June 29, 1987, "The M/1000: VAX 8800
Power for Price of a MicroVAX II", August 24, 1987, and "VAXstation 3200
Benchmarks: CVAX Eclipses MicroVAX II", September 14, 1987. Digital Review,
One Park Ave., NY, NY 10016.
[ENEWS 87]
Electronic News, ``Apollo Cuts Prices on Low-End Stations'', July 6, 1987,
p. 16.
[Fleming 86]
Fleming, P.J. and Wallace, J.J.,``How Not to Lie With Statistics: The
Correct Way to Summarize Benchmark Results'', Communications of the ACM,
Vol. 29, No. 3, March 1986, 218-221.
[HP 87]
Hewlett Packard, ``HP 9000 Series 800 Performance Brief'', 5954-9903, 5/87.
(A comprehensive 40-page characterization of 825S, 840S, 850S).
[Hough 86,1]
Hough, D., ``Weitek 1164/5 Floating Point Accelerators'', Usenet, January
1986.
[Hough 86,2]
Hough, D., ``Benchmarking and the 68020 Cache'', Usenet, January 1986.
[Hough 86,3]
Hough, D., ``Floating-Point Programmer's Guide for the Sun Workstation'',
Sun Microsystems, September 1986. [an excellent document, including a good
set of references on IEEE floating point, especially on micros, and good
notes on benchmarking hazards]. Sun3/260 Spice numbers are from later mail.
[Hough 87]
Hough, D., ``Sun-4 Floating-Point Performance'', Usenet, 08/04/87.
[IBM 87]
IBM, ``IBM RT Personal Computer (RT PC) New Models, Features, and Software
Overview'', February 17, 1987.
[Intergraph 86]
Intergraph Corporation, ``Benchmarks for the InterPro 32C'', December 1986.
[Meta-Software 87]
Meta-Software, ``HSPICE Performance Benchmarks'', June 1987. 50 Curtner
Avenue, Suite 16, Campbell, CA 95008.
[McInnis 87]
McInnis, D., Kusik, R., Bhandarkar, D., ``VAX 8800 System Overview'', Proc.
IEEE COMPCON, March 1987, San Francisco, 316-321.
[McMahon 86]
``The Livermore Fortran Kernels: A Computer Test of the Numerical Perfor-
mance Range'', December 1986, Lawrence Livermore National Labs.
[MIPS 87]
MIPS Computer Systems, "A Sun-4 Benchmark Analysis", and "RISC System Bench-
mark Comparison: Sun-4 vs MIPS", July 23, 1987.
[Purkiser 87]
Purkiser, C., ``Whetstone and LINPACK Numbers'', Usenet, March 1987.
[Richardson 87]
Richardson, R., ``9/20/87 Dhrystone Benchmark Results'', Usenet, Sept. 1987.
Rick publishes the source several times a year. E-mail address:
...!seismo!uunet!pcrat!rick
[Serlin 87a]
Serlin, O., ``MIPS, DHRYSTONES, AND OTHER TALES'', Reprinted with revisions
from SUPERMICRO Newsletter, April 1986, ITOM International, P.O. Box 1450,
Los Altos, CA 94023.
Analyses on the perils of simplistic benchmark measures.
[Serlin 87b]
Serlin, O., SUPERMICRO #69, July 31, 1987. pp. 1-2.
Offers good list of attributes customers should demand of vendor benchmark-
ing.
[Stahlman 87]
Stahlman, M., "The Myth of Price/performance", Sanford C. Bernstein & Co,
Inc, NY, NY, March 17, 1987.
[Sun 86]
SUN Microsystems, ``The SUN-3 Family: A Hardware Overview'', August 1986.
[Sun 87]
SUN Microsystems, SUN-4 Product Introduction Material, July 7, 1987.
[UCB 87]
U. C. Berkeley, CAD/IC group, ``SPICE2G.6'', March 1987. Contact: Cindy
Manly, EECS/ERL Industrial Liaison Program, 479 Cory Hall, University of Cal-
ifornia, Berkeley, CA 94720.
[Weicker 84]
Weicker, R. P., ``Dhrystone: A Synthetic Systems Programming Benchmark'',
Communications of the ACM, Vol. 27, No. 10, October 1984, pp. 1013-1030.
________
UNIX is a Registered Trademark of AT&T. DEC, VAX, Ultrix, and VAX/VMS are
trademarks of Digital Equipment Corp. Sun-3 and Sun-4 are trademarks of Sun
Microsystems. Many others are trademarks of their respective companies.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

manson@tut.cis.ohio-state.edu (Bob Manson) (11/02/87)
Alright, I'm a little pissed over the recent posting of these supposed
"performance figures". Exactly what are these things supposed to mean? Well,
they are compiled programs on different systems that are run and supposed to
represent the speed of the various processors, MIPS etc. We all know that MIPS
is mostly a meaningless figure (Okay, so there's some debate there, but I see
no meaning to how many instructions per second a processor runs - "The COPYMEM
instruction blockmoves the entire memory to disk space, but takes 1,000,000
usec to execute, with an effective MIPS of 1/1000000").

Plus let's say I write my own compiler for my new machine. It has no
optimizing and is real brain-damaged - it writes code like a Forth machine
(lots of stack ops). It means nothing to try to compare my computer with my
dumb compiler to a well-developed VAX machine with a real compiler.

So why run benchmarks in compiled languages??? It's easier that way; you don't
have to write individual programs for each machine that might actually show
off their abilities and improved instructions. I'll admit that it does mean
something for compiled language users, but not really for performance
comparisons - you're comparing apples and oranges, or really the efficiency of
the compilers on the machines. If anyone has done any studies using machine
language programs I'd be very interested in that....

Bob M.
...!ihnp4!cbosgd!osu-cis!tut.cis.ohio-state.edu!manson
or manson@tut.cis.ohio-state.edu
Disclaimer: My employers don't care what I say....
--
Batches? We don't need no stinkin batches!
sjc@mips.UUCP (Steve "The" Correll) (11/03/87)
In article <864@tut.cis.ohio-state.edu>, manson@tut.cis.ohio-state.edu (Bob Manson) writes:
> Exactly what are these things supposed to mean?
> Well, they are compiled programs on different systems that are run and
> supposed to represent the speed of the various processors, MIPS etc. We
> all know that MIPS is mostly a meaningless figure...
> So why run benchmarks in compiled languages??? It's easier that way, you
> don't have to write individual programs for each machine that might actually
> show off their abilities and improved instructions. I'll admit that it does
> mean something for compiled language users but not really for performance
> comparisons-you're comparing apples and oranges, or really the efficiency
> of the compilers on the machines.

The best benchmark for person x is clearly the program which accounts for most
of the cycles that person executes. But there are so many different "x"s! We
assume that grep, nroff, and Unix system calls are important to most readers
of comp.arch, so we study them. A sizeable class of Fortran users tells us
that Linpack and the Livermore loops are representative of their programs, so
we study them. IC circuit designers tell us they execute most of their cycles
within Spice, so we pay a lot of attention to that.

If, on the other hand, you execute most of your cycles within hand-tuned
assembly language, and you are willing to revise your programs completely to
best use the instruction set of each new machine you acquire, and you are a
serious potential customer, the sales people at most computer vendors will be
happy to run your own specific benchmarks; ours do so all the time. Tuned code
is the right measurement for some people; compiled code is right for others.

I view a computer as a system, so for me it makes poor sense to omit the
effect of compilers. And since one can argue forever about how well a
hypothetical compiler _might_ use a particular instruction set, I prefer to
ask how well the best existing compiler _does_ use it. While I'm all in favor
of hand-coding inner loops and library routines to improve performance, one
can argue forever about how easy and how profitable that is; so I think the
best test of that is to measure the effects of the tradeoffs that people made
in constructing an actual OS and compiler system, rather than a hypothetical
one. An article in IEEE Micro some time back which measured assembly-coded
algorithms on 68xxx and xx86 machines seemed pretty useless to me, more like a
contest between assembly coders than an indication of the useful work I might
get out of the machines when running Unix or any other OS.

Incidentally, as explained in the Performance Brief, our definition of "mips"
is _not_ the meaningless "millions of instructions per second"; it's "number
of times faster than a Vax 780 on this particular problem", where we
arbitrarily declare a Vax 780 to have a mips rating of 1. We would be better
off using "Vax780s" rather than "mips" as our unit of measure, except that
we'd have to put so many "TM"s in the document that you wouldn't be able to
find the numbers. :-)
--
...decwrl!mips!sjc
Steve Correll