mash@mips.UUCP (10/30/87)
MIPS Performance Brief, PART 1 : CPU Benchmarks, Issue 3.0, October 1987.
This is a condensed version, with MAC Charts, some explanatory details,
and most advertisements deleted. For a full version on paper,
with MAC charts and tables that aren't squished together, send mail,
NOT TO ME, but to:
....{ucbvax, decvax, ihnp4}!decwrl!mips!eleanor [Eleanor Bishop]
(Please be patient: it will take a little while to get them sent.)
As usual, I tried pretty hard to get the numbers right, but let us know if we
goofed anywhere. We'll update the next time.
The posting is 1200+ lines long, with a thousand or so benchmark
numbers & performance ratios for dozens of machines and benchmarks.
If you're not a glutton for data, "n" now or after the main summary!
------------
1. Introduction
New Features of This Issue
More benchmarks are normalized to the VAX-11/780 under VAX/VMS, rather than UNIX.
Livermore Loops and Digital Review Magazine benchmarks have been added, and
the Spice section uses new public domain inputs.
The Brief has been divided into two parts: user and system. The system bench-
mark part is being greatly expanded (beyond Byte), and has been moved to a new
document "MIPS Performance Brief - Part 2". User-level performance is mostly
driven by compiler and library tuning, whereas system performance also depends
on operating system and hardware configuration issues. The two Briefs syn-
chronize with different release cycles: Part 2 will appear 1-2 months after
Part 1.
Benchmarking - Caveats and Comments
While no one benchmark can fully characterize overall system performance, the
results of a variety of benchmarks can give some insight into expected real
performance. A more important benchmarking methodology is a side-by-side com-
parison of two systems running the same real application.
We don't believe in characterizing a processor with just a single number, but
we follow (what seems to be) standard industry practice of using a mips-rating
that essentially describes overall integer performance. Thus, we label a 5-
mips machine to be one that is about 5X (i.e., anywhere from 4X to 6X!) faster
than a VAX 11/780 (UNIX 4.3BSD, unless we can get Ultrix or VAX/VMS numbers) on
integer performance, since this seems to be how most people intuitively compute
mips-ratings. Even within the same computer family, performance ratios between
processors vary widely. For example, [McInnis 87] characterizes a ``6 mips''
VAX 8700 as anywhere from 3X to 7X faster than the 11/780. Floating point speed
often varies more widely than integer speed, and scales up more slowly, relative
to the 11/780.
This paper analyzes one important aspect of overall computer system performance
- user-level CPU performance.
MIPS Computer Systems does not warrant or represent that the performance data
stated in this document will be achieved by any particular application. (We
have to say that, sorry.)
2. Benchmark Summary
2.1. Choice of Benchmarks
This brief offers both public-domain and MIPS-created benchmarks. We prefer
public domain ones, but some of the most popular ones are inadequate for accu-
rately characterizing performance. In this section, we give an overview of the
importance we attach to the various benchmarks, whose results are summarized on
the next page.
Dhrystone [DHRY 1.1] and Stanford [STAN INT] are two popular small integer
benchmarks. Compared with the fastest VAX 11/780 systems, the M/1000 is 13-14X
faster than the VAX on these tests, and yet, we rate the M/1000 as a 10-vax-
mips machine.
While we present Dhrystone and Stanford, we feel that the performance of large
UNIX utilities, such as grep, yacc, diff, and nroff is a better (but not per-
fect!) guide to the performance customers will receive. These four, which make
up our [MIPS UNIX] benchmark, demonstrate that performance ratios are not sin-
gle numbers, but range here from 8.6X to 13.7X faster than the VAX.
Even these UNIX utilities tend to overstate performance relative to large
applications, such as CAD applications. Our own vax-mips ratings are based on
a proprietary set of larger and more stressful real programs, such as our com-
piler, assembler, debugger, and various CAD programs.
For floating point, the public domain benchmarks are much better. We're still
careful not to use a single benchmark to characterize all floating point appli-
cations.
The Livermore Fortran kernels [LLNL DP] give insight into both vector and non-
vector performance for scientific applications. Linpack [LNPK DP and LNPK SP]
tests vector performance on a single scientific application, and stresses cache
performance. Spice [SPCE 2G6] and Doduc [DDUC] test a different part of the
floating point application spectrum. The codes are large and thus test both
instruction fetch bandwidth and scalar floating point. Digital Review
Magazine's benchmark [DIG REV] is a compendium of FORTRAN tests that measure a wide
variety of behavior, and seem to correlate well with some classes of real pro-
grams.
2.2. Benchmark Summary Data
This section summarizes the most important
benchmark results described in more detail throughout this document. The
numbers show performance relative to the VAX 11/780, i.e., larger numbers are
better/faster.
o A few numbers have been estimated by interpolations from closely-related
benchmarks and/or closely-related machines. The methods are given in
great detail in the individual sections.
o Several of the columns represent summaries of multiple benchmarks. For
example, the MIPS UNIX column represents 4 benchmarks, the SPICE 2G6
column 3, and LLNL DP represents 24.
o In the Integer section, MIPS UNIX is the most indicative of real perfor-
mance.
o For Floating Point, we especially like LLNL DP (Livermore FORTRAN ker-
nels), but all of these are useful, non-toy benchmarks.
o In the following table, "Pub mips" gives the manufacturer-published mips-
ratings. As in all tables in this document, the machines are listed in
increasing order of performance according to the benchmarks, in this case,
by Integer performance.
o The summary includes only those machines for which we could get measured
results on almost all the benchmarks and good estimates on the results
for the few missing data items.
Summary of Benchmark Results
(VAX 11/780 = 1.0, Bigger is Faster)
Integer (C) Floating Point (FORTRAN)
---------------- -------------------------------------
MIPS DHRY STAN LLNL LNPK LNPK SPCE DIG DDUC Publ
UNIX 1.1 INT DP DP SP 2G6 REV mips System
1 1 1 1 1 1 1 1 1 1 VAX 11/780#
2.1 1.9 1.8 1.9 2.9 2.5 1.6 *2 *1.3 2 Sun3/160 FPA
*4 4.1 4.7 2.8 3.3 3.4 2.4 *3 1.7 4 Sun3/260 FPA
5.5 7.4 7.2 2.5 4.3 3.7 3.4 4.9 3.8 5 MIPS M/500
*6 5.9 6.5 5.9 6.9 5.6 5.3 6.2 5.2 6 VAX 8700
8.0 10.8 7.3 4.5 7.9 6.4 4.1 4.4 3.5 10 Sun4/260
9.2 11.3 11.8 8.1 7.1 7.6 6.6 7.6 7.3 8 MIPS M/800
11.3 13.5 14.1 9.7 8.6 9.2 8.0 9.3 8.8 10 MIPS M/1000
# VAX 11/780 runs 4.3BSD for MIPS UNIX, Ultrix 2.0 (vcc) for Stanford, VAX/VMS
for all others. Use of 4.3BSD (no global optimizer) probably inflates the
MIPS UNIX column by about 10%.
* Although it is nontrivial to gather a full set of numbers, it is important to
avoid holes in benchmark tables, as it is too easy to be misleading. Thus,
we had to make reasoned guesses at these numbers. The MIPS UNIX values for
VAX 8700 and Sun-3/260 were taken from the Published mips-ratings, which are
consistent (+/- 10%) with experience with these machines. DIG REV and DDUC
were guessed by noting that most machines do somewhat better on DIG REV than
on SPCE, and that a Sun-3/260 is usually 1.5X faster than a Sun-3/160 on
floating-point benchmarks.
Benchmark Descriptions:
MIPS UNIX
MIPS UNIX benchmarks: grep, diff, yacc, nroff, same 4.2BSD C source compiled
and run on all machines. The summary number is the geometric mean of the 4
relative performance numbers.
DHRY 1.1
Dhrystone 1.1, any optimization except inlining.
STAN INT
Stanford Integer.
LLNL DP
Lawrence Livermore Fortran Kernels, 64-bit. The summary number is given
as the relative performance based on the geometric mean, i.e., the "middle"
of the 3 means.
LNPK DP
Linpack Double Precision, FORTRAN.
LNPK SP
Linpack Single Precision, FORTRAN.
SPCE 2G6
Spice 2G6, 3 public-domain circuits, for which the geometric mean is shown.
DIG REV
Digital Review magazine, combination of 33 benchmarks.
DDUC
Doduc Monte Carlo benchmark.
3. Methodology
Tested Configurations
When we report measured results, rather than numbers published elsewhere, the
configurations were as shown below. These system configurations do not neces-
sarily reflect optimal configurations, but rather the in-house systems to which
we had repeatable access. When faster results have been available elsewhere,
we've quoted them in place of our own systems' numbers.
DEC VAX-11/780
Main Memory: 8 Mbytes
Floating Point: Configured with FPA board.
Operating System: 4.3 BSD UNIX.
DEC VAX 8600
Main Memory: 20 Mbytes
Floating Point: Configured without FPA board.
Operating System: Ultrix V1.2. (4.2BSD with many 4.3BSD tunings).
Sun-3/160M
CPU: (16.67 MHz MC68020)
Main Memory: 8 Mbytes
Floating Point: 12.5 MHz MC68881 coprocessor (compiled -f68881).
Operating System: SunOS 3.2 (4.2BSD)
MIPS M/500
CPU: 8MHz R2000, in R2300 CPU board, 16K I-cache, 8K D-cache
Floating Point: R2010 FPA chip (8MHz)
Main Memory: 8 Mbytes (2 R2350 memory boards)
Operating System: UMIPS-BSD 2.1 (4.3BSD UNIX with NFS)
MIPS M/800
CPU: 12.5 MHz R2000, in R2600 CPU board, 64K I-cache, 64K D-cache
Floating Point: R2010 FPA chip (12.5MHz)
Main Memory: 8 Mbytes (2 R2350 memory boards)
Operating System: UMIPS-BSD 2.1
MIPS M/1000
CPU: 15 MHz R2000, in R2600 CPU board, 64K I-cache, 64K D-cache
Floating Point: R2010 FPA chip (15 MHz)
Main Memory: 16 Mbytes (4 R2350 memory boards)
Operating System: UMIPS-BSD 2.1
Test Conditions
All programs were compiled with -O (optimize), unless otherwise noted.
C is used for all benchmarks except Whetstone, LINPACK, Doduc, Spice 2g.6,
Hspice, and the Livermore Fortran Kernels, which use FORTRAN. When possible,
we've obtained numbers for VAX/VMS, and use them in place of UNIX numbers. The
MIPS compilers are version 1.21.
User time was measured for all benchmarks using the /bin/time command.
Systems were tested in normal multi-user development environment, with load
factor <0.2 (as measured by the uptime command). Note that this occasionally makes
them run longer, due to slight interference from background daemons and clock
handling, even on an otherwise empty system. Benchmarks were run at least 3
times and averaged. The intent is to show numbers that can be reproduced on
live systems.
Times (or rates, such as for Dhrystones, Whetstones, and LINPACK KFlops) are
shown for the VAX 11/780. Other machines' times or rates are shown, and their
relative performance ("Rel." column) normalized to the 11/780 treated as 1.0.
VAX/VMS is used whenever possible as the base.
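To make the normalization concrete, the following is a minimal sketch
(hypothetical helper functions, not part of any benchmark harness) of how a
"Rel." entry is derived from raw times or rates; the example figures are taken
from tables later in this document.

    #include <stdio.h>

    /* Relative performance, normalized to the VAX 11/780 = 1.0.
     * For quantities measured as times (seconds), smaller is better,
     * so Rel. = vax_time / machine_time.  For quantities measured as
     * rates (Dhrystones/sec, KWips, MFlops), bigger is better, so
     * Rel. = machine_rate / vax_rate.
     */
    double rel_from_time(double vax_secs, double machine_secs)
    {
        return vax_secs / machine_secs;
    }

    double rel_from_rate(double vax_rate, double machine_rate)
    {
        return machine_rate / vax_rate;
    }

    int main(void)
    {
        /* nroff: 18.8 secs on the 11/780 vs 1.5 secs on an M/1000. */
        printf("nroff Rel. = %.1f\n", rel_from_time(18.8, 1.5));
        /* Dhrystone: 1,757/sec on the 11/780 (VMS) vs 23,700 on an M/1000. */
        printf("Dhrystone Rel. = %.1f\n", rel_from_rate(1757.0, 23700.0));
        return 0;
    }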
Compilers and Operating Systems
Unless otherwise specified, the M-series benchmark numbers use Release 1.21 of
the MIPS compilers and UMIPS-BSD 2.1.
Optimization Levels
Unless otherwise specified, all benchmarks were compiled -O, i.e., with optimi-
zation. UMIPS compilers call this level -O2, and it includes global intra-
procedural optimization. In a few cases, we show numbers for -O3 and -O4
optimization levels, which do inter-procedural register allocation and pro-
cedure merging. -O3 is now generally available.
Now, let's look at the benchmarks. Each section title includes the (CODE NAME)
that relates it back to the earlier Summary, if it is included there.
4. Integer Benchmarks
4.1. MIPS UNIX Benchmarks (MIPS UNIX)
The MIPS UNIX Benchmarks described below are fairly typical of nontrivial UNIX
programs. This benchmark suite provides the opportunity to execute the same
code across several different machines, in contrast to the compilers and link-
ers for each machine, which have substantially different code. User time is
shown; kernel time is typically 10-15% of the user time (on the 780), so these
are good indications of integer/character compute-intensive programs. The
first 3 benchmarks were running too fast to be meaningful on our faster
machines, so we modified the input files to get larger times. The VAX 8600 ran
consistently around 3.8X faster than the 11/780 on these tests, but we sold it,
so it's started to drop out as we've changed benchmarks. These benchmarks con-
tain UNIX source code, and are thus not generally distributable.
For better statistical properties, we now report the Geometric Mean of the
Relative performance numbers, because it does not ignore the performance con-
tributions of the shorter benchmarks. (In this case, the grep ratios drag the
Geometric Mean down.) Expect real performance to lie between the Geometric Mean
and the Total Relative number.
Note: the Geometric Mean of N numbers is the Nth root of the product of those
numbers. It is necessarily used in place of the arithmetic mean when computing
the mean of performance ratios, or of benchmarks whose runtimes are quite dif-
ferent. See [Fleming 86] for a detailed discussion.
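As a concrete sketch (our own scripts differ in detail), the Geom Mean column
in the table below could be computed as follows; the sample ratios are the
M/1000 row. Compile with -lm.

    #include <math.h>
    #include <stdio.h>

    /* Geometric mean of n performance ratios: the nth root of their
     * product, computed via logarithms to avoid overflow for large n. */
    double geometric_mean(const double ratios[], int n)
    {
        double log_sum = 0.0;
        int i;

        for (i = 0; i < n; i++)
            log_sum += log(ratios[i]);
        return exp(log_sum / (double) n);
    }

    int main(void)
    {
        /* M/1000 relative performance on grep, diff, yacc, and nroff. */
        double m1000[4] = { 8.6, 13.7, 10.9, 12.5 };

        printf("Geom Mean = %.1f\n", geometric_mean(m1000, 4));   /* 11.3 */
        return 0;
    }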
MIPS UNIX Benchmarks Results
grep diff yacc nroff Total Geom System
Secs Rel. Secs Rel. Secs Rel. Secs Rel. Secs+ Rel. Mean
11.2 1.0 246.4 1.0 101.1 1.0 18.8 1.0 377.5 1.0 1.0 11/780 4.3BSD
5.6 2.0 105.3 2.3 48.1 2.1 9.0 2.1 168.0 2.2 2.1 Sun-3/160M
- - - - - - 5.0 3.8 - 3.8 3.8 DEC VAX 8600
2.4 4.7 35.8 6.9 19.5 5.2 3.3 5.7 61.0 6.2 5.5 MIPS M/500
- 7 - 8.5 - 9 - 7.5 - - 8.0 Sun-4 *
1.6 7.0 21.6 11.4 11.2 9.0 1.9 9.9 36.3 10.4 9.2 MIPS M/800
1.3 8.6 18.0 13.7 9.3 10.9 1.5 12.5 30.1 12.5 11.3 MIPS M/1000
+ Simple summation of the time for all benchmarks. "Total Rel." is ratio of
the totals.
* These numbers derived as shown on next page.
Note: in order to assure "apples-to-apples" comparisons, we moved the same
copies of the (4.2BSD) sources for these to the various machines, compiled them
there, and ran them, to avoid surprises from different binary versions of com-
mands resident on these machines.
Note that the granularity here is at the edge of UNIX timing, i.e., tenths of
seconds make differences, especially on the faster machines. The performance
ratios seen here seem typical of large UNIX commands on MIPS systems.
Estimation of Sun-4 Numbers
The 4 ratios cited in the table above were pieced together from somewhat frag-
mentary information. Earlier Briefs used shorter benchmarks for the grep,
diff, and yacc tests, for which we were able to get numbers from a Sun-4, with
SunOS 3.2L, at 16.7MHz. By comparing with M/500 and M/800 numbers on the same
tests, we can interpolate and at least estimate bounds on performance.
On grep, both Sun-4 and M/800 used .6 seconds user time, so we assumed the same
relative performance (7.0).
On yacc, both Sun-4 and M/800 used .4 seconds, so we assumed the same relative
performance (9.0).
On diff, 3 runs on the Sun-4 yielded .3, .4, and .4 seconds. A MIPS M/500 is
consistently .4, an M/800 .2, and an M/1000 usually .2, but occasionally .1.
This test was clearly too small, and it is difficult to make strong assertions.
However, it does seem that the Sun-4 is faster than the M/500 (6.9X VAX) but
noticeably closer to it than to an M/800 (11.4X). We thus estimate around
8.5X, a little lower than halfway between the two MIPS systems.
On nroff, a setup problem of ours aborted the Sun-4 run. At a seminar at Stan-
ford this summer, the following number was given by Sun: 18.5 seconds for:
troff -man csh.1
This is not sufficient information to allow exact duplication, but we tried
running it two different ways:
troff -a -man csh.1 >/dev/null
troff -man ..... csh.1
The second case used numerous arguments to actually run it for a typesetter,
whereas the first emits a representation to the standard output. The M/500
required 23.4 and 27.9 seconds user time, respectively, while the M/800 gave
user times of 11.2 and 14.1 seconds. Assuming that the troff results are simi-
lar to those of nroff, and using the worst of the M/800 times, we get a VAX-
relative estimate of:
(9.9X for M/800) X 14.1 (M/800 secs) / 18.5 (Sun-4 secs)
which yields 7.5X for the Sun-4.
All of this is obviously a gross approximation with numerous items of missing
information. Timing granularity is inadequate. Results are being generalized
from small samples. The source code may well differ for these programs. Our
csh.1 manual pages may not be identical. Sun compilers will improve, etc, etc.
We apologize for the low quality of this data, as Sun-4 access is not something
we have in good supply. We'll run the newest forms of the benchmarks as soon
as possible. However, the end result does seem quite consistent with other
benchmarks that various people in the industry have tried.
Finally, note that this benchmark set is running versus 4.3BSD, not versus
Ultrix 2.0 with vcc. Hence, the relative performance numbers are inflated
somewhat relative to VAX/VMS or VAX-Ultrix numbers. From other experience,
we'd guess that subtracting 10% from most of the computed mips-ratings would
give a good estimate of the Ultrix 2.0 (vcc)-relative mips-ratings.
4.2. Dhrystone (DHRY 1.1)
Dhrystone is a synthetic programming benchmark that measures processor and com-
piler efficiency on a ``typical'' mix of programming statements. The Dhrystone results
shown below are measured in Dhrystones / second, using the 1.1 version of the
benchmark.
We include Dhrystone because it is popular. MIPS systems do extremely well on
it. However, comparisons of systems based on Dhrystone, and especially on
Dhrystone alone, are unreliable and should be avoided. More details are given at the
end of this section. According to [Richardson 87], 1.1 cleans up a bug, and is
the correct version to use, even though results for a given machine are typi-
cally about 15% less for 1.1 than with 1.0.
Advice for running Dhrystone has changed over time with regard to optimization.
It used to ask that people turn off optimizers that were more than peephole
optimizers, because the benchmark contained a modest amount of "dead" code that
optimizers were eliminating. However, it turned out that many people were sub-
mitting optimized results, often unlabeled, confusing everyone. Currently, any
numbers can be submitted, as long as they're appropriately labeled, except that
procedure inlining (done by only a few very advanced compilers) must be
avoided.
We continue to include a range of numbers to show the difference optimization
technology makes on this particular benchmark, and to provide a range for com-
parison when others' cited Dhrystone figures are not clearly defined by optimi-
zation levels. For example, -O3 does interprocedural register allocation, and
-O4 does procedure inlining, and we know -O4 is beyond the spirit of the bench-
mark. Hence, we now cite the -O3 numbers. We're not sure what the Sun-4's -O3
level does, but we do not believe that it does inlining either.
In the table below, it is interesting to compare the performance of the two
Ultrix compilers. Also, examination of the MIPS and Sun-4 numbers shows the
performance gained by the high-powered optimizers available on these machines.
The numbers are ordered by what we think is the overall integer performance of
the processors.
Dhrystone Benchmark Results - Optimization Effects
No Opt -O -O3 -O4
NoReg Regs NoReg Regs Regs Regs
Dhry's Dhry's Dhry's Dhry's Dhry's Dhry's
/Sec /Sec /Sec /Sec /Sec /Sec System
1,442 1,474 1,559 1,571 DEC VAX 11/780, 4.3BSD
2,800 3,025 3,030 3,325 Sun-3/160M
4,896 5,130 5,154 5,235 DEC VAX 8600, Ultrix 1.2
8,800 10,200 12,300 12,300 13,000 14,200 MIPS M/500
8,000 8,000 8,700 8,700 DEC VAX 8550, Ultrix 2.0 cc
9,600 9,600 9,600 9,700 DEC VAX 8550, Ultrix 2.0 vcc
10,550 12,750 17,700 17,700 19,000 Sun-4, SunOS 3.2L
12,800 15,300 18,500 18,500 19,800 21,300 MIPS M/800
15,100 18,300 22,000 22,000 23,700 25,000 MIPS M/1000
Some other published numbers of interest include the following, all of which
are taken from [Richardson 87], unless otherwise noted. Items marked * are
those that we know (or have good reason to believe) use optimizing compilers.
These are the "register" versions of the numbers, i.e., the highest ones
reported by people.
Dhrystone Benchmark Results
Dhry's
/Sec Rel. System
1571 0.9 VAX 11/780, 4.3BSD [in-house]
1757 1.0 VAX 11/780, VAX/VMS 4.2 [Intergraph 86]*
3325 1.9 Sun3/160, SunOS 3.2 [in-house]
3856 2.2 Pyramid 98X, OSx 3.1, CLE 3.2.0
4433 2.5 MASSCOMP MC-5700, 16.7MHz 68020, RTU 3.1*
4716 2.7 Celerity 1230, 4.2BSD, v3.2
6240 3.6 Ridge 3200, ROS 3.4
6374 3.6 Sun3/260, 25MHz 68020, SunOS 3.2
6423 3.7 VAX 8600, 4.3BSD
6440 3.7 IBM 4381-2, UTS V, cc 1.11
6896 3.9 Intergraph InterPro 32C, SYSV R3 3.0.0, Greenhills, -O*
7109 4.0 Apollo DN4000 -O
7142 4.1 Sun-3/200 [Sun 87] *
7249 4.2 Convex C-1 XP 6.0, vc 1.1
7409 4.2 VAX 8600, VAX/VMS in [Intergraph 86]*
7655 4.4 Alliant FX/8 [Multiflow]
8300 4.7 DG MV20000-I and MV15000-20 [Stahlman 87]
8309 4.7 InterPro-32C,30MHz Clipper,Green Hills[Intergraph 86]*
9436 5.4 Convergent Server PC, 20MHz 80386, GreenHills*
9920 5.6 HP 9000/840S [HP 87]
10416 5.9 VAX 8550, VAX/VMS 4.5, cc 2.2*
10787 6.1 VAX 8650, VAX/VMS, [Intergraph 86]*
11215 6.4 HP 9000/840, HP-UX, full optimization*
12639 7.2 HP 9000/825S [HP 87]*
13000 7.4 MIPS M/500, 8MHz R2000, -O3*
13157 7.5 HP 825SRX [Sun 87]*
14195 8.1 Multiflow Trace 7/200 [Multiflow]
14820 8.4 CRAY 1S
15007 8.5 IBM 3081, UTS SVR2.5, cc 1.5
15576 8.9 HP 9000/850S [HP 87]
18530 10.5 CRAY X-MP
19000 10.8 Sun-4/200, -O3* [Sun 87]
19800 11.3 MIPS M/800, 12.5MHz R2000, -O3*
23700 13.5 MIPS M/1000, 15MHz R2000, -O3*
28846 16.4 Amdahl 5860, UTS-V, cc1.22
31250 17.8 IBM 3090/200
43668 24.9 Amdahl 5890/300E, cc -O
Unusual Dhrystone Attributes
We've calibrated this benchmark against many more realistic ones, and we
believe that its results must be treated with care, because the detailed pro-
gram statistics are unusual in some ways. It has an unusually low number of
instructions per function call (35-40 on our machines), where most C programs
fall in the 50-60 range or higher. Stated another way, Dhrystone does more
function calls than usual, which especially penalizes the DEC VAX, making this
a favored benchmark for inflating one's "VAX-mips" rating. Any machine with a
lean function call sequence looks a little better on Dhrystone than it does on
others.
The dynamic nesting depth of function calls inside the timed part of Dhrystone
is low (3-4). This means that most register-window RISC machines would never
even once overflow/underflow their register windows and be required to
save/restore registers.
This is not to say fast function calls or register windows are bad (they're
not!), merely that this benchmark overstates their performance effects.
Dhrystone can spend 30-40% of the time in the strcpy function, copying atypi-
cally long (30-character) strings, which happen to be alignable on word boun-
daries. More realistic programs don't spend this much time in this sort of
code, and when they do, they handle many shorter strings: 6 characters would be
much more typical.
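To illustrate the point (this is not the benchmark source, and the string
literal below is invented), the flavor of the copy Dhrystone spends its time
in, versus more typical string handling, is roughly:

    #include <string.h>

    char Dest[31];

    /* Dhrystone-style: a 30-character constant that happens to start on
     * a word boundary, the best case for a word-at-a-time copy loop. */
    void dhrystone_style(void)
    {
        strcpy(Dest, "A THIRTY CHARACTER WORD STRING");
    }

    /* More typical: many short strings, where per-call overhead and
     * byte-at-a-time startup dominate. */
    void more_typical(void)
    {
        strcpy(Dest, "typcl");
    }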
On our machines, Dhrystone uses 0-offset addressing for 50% of memory data
references (dynamic). Most real programs use 0-offsets 10-15% of the time.
Of course, Dhrystone is a fairly small benchmark, and thus fits into almost any
reasonable instruction cache.
In conclusion, Dhrystone gives some indication of user-level integer perfor-
mance, but is susceptible to surprises when comparing amongst architectures
that differ strongly. Unfortunately, the industry seems to lack a good set of
widely-available integer benchmarks that are as representative as are some of
the popular floating point ones.
4.3. Stanford Small Integer Benchmarks (STAN INT)
The Computer Systems Laboratory at Stanford University has collected a set of
programs to compare the performance of various systems. These benchmarks are
popular in some circles as they are small enough to simulate, and are respon-
sive to compiler optimizations.
It is well known that small benchmarks can be misleading. If you see claims
that machine X is up to N times a VAX on some (unspecified) benchmarks, these
benchmarks are probably the sort they're talking about.
Stanford Small Integer Benchmark Results
Perm Tower Queen Intmm Puzzle Quick Bubble Tree Aggr Rel.
Secs Secs Secs Secs Secs Secs Secs Secs Secs* Perf+ System
2.34 2.30 .94 1.67 11.23 1.12 1.51 2.72 3.08 .84 VAX 11/780 4.3BSD
2.60 1.0 VAX 11/780@
.72 1.07 .50 .93 5.53 .58 .97 1.05 1.42 1.8 Sun-3/160M [ours]
.63 .63 .27 .73 2.96 .31 .44 .69 .86 3.0 VAX 8600 Ultrix1.2
.28 .35 .17 .42 2.22 .18 .25 .35 .50 5.2 VAX 8550#
.28 .35 .13 .15 .88 .13 .17 .50 .40 6.5 VAX 8550##
.65 4.7 Sun-3/200 [Sun 87]
.18 .24 .15 .23 1.15 .17 .19 .34 .36 7.2 MIPS M/500
.36 7.3 Sun-4/200 [Sun 87]
.12 .16 .11 .13 .61 .10 .12 .22 .22 11.8 MIPS M/800
.10 .13 .10 .11 .51 .08 .10 .17 .18 14.1 MIPS M/1000
* As weighted by the Stanford Benchmark Suite
+ Ratios of the Aggregate times
@ Estimated VAX 11/780 Ultrix 2.0 vcc -O time. We get this by 3.08 *
(.40+.02)/.50 = 2.60, i.e., using the VAX 8550 numbers to estimate the
effect of optimization. The ".02" is a guess that optimization helps the
8550 a little more than it does the 11/780, because the former's cache is
big enough to hold the whole program and data, whereas the latter's is not.
much, and so optimization pays off more in removing what's left, whereas
the 11/780 will cache-miss more, and the nature of these particular tests
is that the optimizations won't fix cache-misses. (None of this is very
scientific, but it's probably within 10%!)
# Ultrix 2.0 cc -O
## Ultrix 2.0 vcc -O. The quick and bubble tests actually had errors; how-
ever, the times were in line with expectations (these two optimize well),
so we used them. All 8550 numbers thanks to Greg Pavlov
(ames!harvard!hscvax!pavlov, of Amherst, NY).
The Sun numbers are from [Sun 87]. The published Sun-4 number is .356, for
SunOS 3.2L software, i.e., it is slightly faster than the M/500.
5. Floating Point Benchmarks
5.1. Livermore Fortran Kernels (LLNL DP)
Lawrence Livermore National Labs' workload is dominated by large scientific
calculations that are largely vectorizable. The workload is primarily served
by expensive supercomputers. This benchmark was designed for evaluation of
such machines, although it has been run on a wide variety of hardware, includ-
ing workstations and PCs [McMahon 86].
The Livermore Fortran Kernels are 24 pieces of code abstracted from the appli-
cations at Lawrence Livermore Labs. These kernels are embedded in a large,
carefully engineered benchmark driver. The driver runs the kernels multiple
times on different data sets, checks for correct results, verifies timing accu-
racy, reports execution rates for all 24 kernels, and summarizes the results
with several statistics.
Unlike many other benchmarks, there is no attempt to distill the benchmark
results down to a single number. Instead all 24 kernel rates, measured in
mflops (million floating point operations per second) are presented individu-
ally for three different vector lengths (a total of 72 results). The minimum
and maximum rates define the performance range of the hardware. Various
statistics of the 24 or 72 rates, such as the harmonic, geometric, and arith-
metic means give insight into general behavior. Any one of these statistics
might suffice for comparisons of scalar machines, but multiple statistics are
necessary for comparisons involving machines with vector or parallel features.
These machines have unbalanced, bimodal performance, and a single statistic is
an insufficient characterization. McMahon asserts:
``When the computer performance range is very large the net Mflops rate
of many Fortran programs and workloads will be in the sub-range between
the equi-weighted harmonic and arithmetic means depending on the degree
of code parallelism and optimization. More accurate estimates of cpu
workload rates depend on assigning appropriate weights for each kernel.''
McMahon's analysis goes on to suggest that the harmonic mean corresponds to
approximately 40% vectorization, the geometric mean to approximately 70% vec-
torization, and the arithmetic mean to 90% vectorization. These three statis-
tics can be interpreted as different benchmarks that each characterize certain
applications. For example, there is fair agreement between the kernels' har-
monic mean and Spice performance. LINPACK, on the other hand, is better
characterized by the geometric mean.
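As a sketch of the arithmetic involved (assuming the 24 per-kernel mflops
rates are already in hand), the three summary statistics are simply:

    #include <math.h>
    #include <stdio.h>

    /* Equi-weighted harmonic, geometric, and arithmetic means of the
     * kernel rates (mflops).  Harmonic <= geometric <= arithmetic; the
     * wider the spread, the more the machine's rating depends on how
     * vectorizable the workload is assumed to be. */
    void summarize(const double mflops[], int n)
    {
        double inv_sum = 0.0, log_sum = 0.0, sum = 0.0;
        int i;

        for (i = 0; i < n; i++) {
            inv_sum += 1.0 / mflops[i];
            log_sum += log(mflops[i]);
            sum     += mflops[i];
        }
        printf("Harmonic mean:   %.2f\n", (double) n / inv_sum);
        printf("Geometric mean:  %.2f\n", exp(log_sum / (double) n));
        printf("Arithmetic mean: %.2f\n", sum / (double) n);
    }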
On the next two pages are shown a summary of results from McMahon's report,
followed by the complete M/1000 results. (Given the volume of data, we've only
done this on M/1000s. M/800s scale directly by .833X).
The complete M/1000 data shows that MIPS performance is insensitive to vector
length. The minimum to maximum variation is also small for this benchmark.
Both characteristics are typical of scalar machines with mature compilers.
Performance of vector and parallel machines, on the other hand, may span two orders
of magnitude on this benchmark, or more, depending on the kernel and the vector
length.
64-Bit Livermore FORTRAN Kernels
MegaFlops, L = 167, Sorted by Geometric Mean
Harm. Geom. Arith. Rel.*
Min Mean Mean Mean Max Geom. System
.05 .12 .12 .13 .24 .7 VAX 780 w/FPA 4.3BSD f77 [ours]
.06 .16 .17 .18 .28 1.0 VAX 780 w/FPA VMS 4.1
.11 .30 .33 .37 .87 1.9 SUN 3/160 w/FPA
.20 .42 .46 .50 1.42 2.5 MIPS M/500
.17 .43 .48 .53 1.13 2.8 SUN 3/260 w/FPA [our numbers]
.29 .58 .64 .70 1.21 3.8 Alliant FX/1 FX 2.0.2 Scalar
.38 .72 .77 .83 1.57 4.5 SUN 4/260 w/FPA [Hough 87]
.39 .94 1.00 1.04 1.64 5.9 VAX 8700 w/FPA VMS 4.1
.10 .76 1.06 1.50 5.23 6.2 Alliant FX/1 FX 2.0.2 Vector
.33 .92 1.06 1.20 2.88 6.2 Convex C-1 F77 V2.1 Scalar
.52 1.09 1.19 1.30 2.74 7.0 ELXSI 6420 EMBOS F77 MP=1
.51 1.26 1.37 1.48 2.70 8.1 MIPS M/800
.61 1.51 1.65 1.78 3.24 9.7 MIPS M/1000
1.01 1.06 1.94 3.33 12.79 11.4 Convex C-1 F77 V2.1 Vector
.28 1.24 2.32 5.11 29.20 13.7 Alliant FX/8 FX 2.0.2 MP=8*Vec
1.51 4.93 5.86 7.00 17.43 34.5 Cray-1S CFT 1.4 scalar
1.23 4.74 6.09 7.67 21.64 35.8 FPS 264 SJE APFTN64
3.43 9.29 10.68 12.15 25.89 62.8 Cray-XMP/1 COS CFT77.12 scalar
0.97 6.47 11.94 22.20 82.05 70.2 Cray-1S CFT 1.4 vector
4.47 11.35 13.08 15.20 45.07 76.9 NEC SX-2 SXOS1.21 F77/SX24 scalar
1.47 12.33 24.84 50.18 188 146 Cray-XMP/1 COS CFT77.12 vector
4.47 19.07 43.94 140 1042 258 NEC SX-2 SXOS1.21 F77/SX24 vector
32-Bit Livermore FORTRAN Kernels
MegaFlops, L = 167, Sorted by Geometric Mean
Harm. Geom. Arith. Rel.*
Min Mean Mean Mean Max Geom. System
.05 .18 .20 .23 .48 .7 VAX 780 4.3BSD f77 [ours]
.10 .28 .30 .32 .58 1.0 VAX 780 w/FPA VMS 4.1
.19 .46 .50 .56 1.26 1.7 SUN 3/160 w/FPA
.30 .65 .71 .77 1.55 2.4 SUN 3/260 w/FPA [ours]
.30 .66 .74 .83 1.60 2.5 Alliant FX/1 FX 2.0.2 Scalar
.10 .60 .90 1.31 4.23 3.0 Alliant FX/1 FX 2.0.2 Vector
.40 .97 1.05 1.14 2.08 3.5 MIPS M/500
.55 1.04 1.12 1.20 2.21 3.7 SUN 4/260 w/FPA [Hough 87]
.36 1.11 1.27 1.42 3.61 4.2 Convex C-1 F77 V2.1 Scalar
.46 1.26 1.36 1.45 2.41 4.5 VAX 8700 w/FPA VMS 4.1
.68 1.31 1.46 1.61 3.19 4.9 ELXSI 6420 EMBOS F77 MP=1
.93 2.02 2.19 2.36 3.96 7.3 MIPS M/1000
.28 1.30 2.47 5.59 33.52 8.2 Alliant FX/8 FX 2.0.2 MP=8*Vec
.12 1.27 2.73 5.44 23.60 9.1 Convex C-1 F77 V2.1 Vector
* Relative Performance, as ratio of the Geometric Mean numbers. This is a
simplistic attempt to extract a single figure-of-merit. We admit this goes
against the grain of this benchmark. The next table gives the complete
M/1000 output, in the form used by McMahon.
Livermore FORTRAN Kernels - Complete MIPS M/1000 Output
Vendor MIPS MIPS MIPS MIPS | MIPS MIPS MIPS MIPS
Model M/1000 M/1000 M/1000 M/1000 | M/1000 M/1000 M/1000 M/1000
OSystem BSD2.1 BSD2.1 BSD2.1 BSD2.1 | BSD2.1 BSD2.1 BSD2.1 BSD2.1
Compiler 1.21 1.21 1.21 1.21 | 1.21 1.21 1.21 1.21
OptLevel O2 O2 O2 O2 | O2 O2 O2 O2
Samples 72 24 24 24 | 72 24 24 24
WordSize 64 64 64 64 | 32 32 32 32
DO Span 167 19 90 471 | 167 19 90 471
Year 1987 1987 1987 1987 | 1987 1987 1987 1987
Kernel ------ ------ ------ ------ | ------ ------ ------ ------
1 2.2946 2.2946 2.3180 2.3136 | 2.9536 2.9536 2.9613 2.9727
2 1.6427 1.6427 1.8531 1.8381 | 2.1218 2.1218 2.4691 2.4758
3 2.0625 2.0625 2.1260 2.1021 | 2.8935 2.8935 2.9389 2.9853
4 1.3440 1.3440 1.7954 1.9600 | 1.6836 1.6836 2.2084 2.4248
5 1.4652 1.4652 1.4879 1.4776 | 2.0924 2.0924 2.1096 2.1374
6 1.0453 1.0453 1.3734 1.4183 | 1.3076 1.3076 1.7920 1.8517
7 3.1165 3.1165 3.1304 3.1281 | 3.9336 3.9336 3.9581 3.9623
8 2.4829 2.4829 2.5725 2.5686 | 3.1625 3.1625 3.2853 3.2612
9 3.2215 3.2215 3.2359 3.2290 | 3.8708 3.8708 3.8831 3.8632
10 1.2293 1.2293 1.2336 1.2327 | 2.4413 2.4413 2.3419 2.3263
11 1.1907 1.1907 1.2274 1.2320 | 1.5789 1.5789 1.6365 1.6559
12 1.2102 1.2102 1.2404 1.2308 | 1.6004 1.6004 1.6414 1.6471
13 0.6095 0.6095 0.6272 0.6378 | 0.9288 0.9288 0.9334 0.9428
14 0.9712 0.9712 0.9455 0.6695 | 1.2133 1.2133 1.2175 1.0092
15 0.9894 0.9894 0.9605 0.9585 | 1.3701 1.3701 1.3314 1.3314
16 1.6427 1.6427 1.6159 1.6272 | 1.7904 1.7904 1.7567 1.7500
17 2.3898 2.3898 2.2674 2.2667 | 3.7320 3.7320 3.4866 3.5210
18 2.1462 2.1462 2.3321 2.3276 | 2.5642 2.5642 2.8431 2.8365
19 1.8268 1.8268 1.8540 1.8536 | 2.2883 2.2883 2.3369 2.3466
20 2.7821 2.7821 2.7800 1.9158 | 3.7104 3.7104 3.7214 3.6583
21 1.5372 1.5372 1.6004 1.6201 | 1.9644 1.9644 2.0564 2.0855
22 1.4507 1.4507 1.4506 1.4489 | 2.0802 2.0802 2.0646 2.0658
23 2.1395 2.1395 2.3897 2.3729 | 2.5972 2.5972 2.9489 2.9532
24 1.0148 1.0148 1.0458 1.0448 | 1.1995 1.1995 1.2226 1.2281
-------------- ------ ------ ------ ------ | ------ ------ ------ ------
Standard Dev. 0.6878 0.6938 0.6902 0.6742 | 0.8702 0.8870 0.8616 0.8669
Median Dev. 0.6869 0.6445 0.7023 0.6546 | 0.9837 0.8980 1.0239 1.0360
Maximum Rate 3.2359* 3.2215 3.2359 3.2290 | 3.9623* 3.9336 3.9581 3.9623
Average Rate 1.7834* 1.7419 1.8110 1.7698 | 2.3611* 2.2949 2.3811 2.3872
Geometric Mean 1.6469* 1.6053 1.6752 1.6330 | 2.1935* 2.1238 2.2175 2.2168
Median Rate 1.6272* 1.5899 1.7056 1.7326 | 2.2084* 2.1071 2.2727 2.3365
Harmonic Mean 1.5078* 1.4724 1.5368 1.4874 | 2.0235* 1.9590 2.0503 2.0372
Minimum Rate 0.6095* 0.6095 0.6272 0.6378 | 0.9288* 0.9288 0.9334 0.9428
Maximum Ratio 1.0000 0.9955 1.0000 0.9978 | 1.0000 0.9927 0.9989 1.0000
Average Ratio 1.0000 0.9767 1.0154 0.9923 | 1.0000 0.9719 1.0084 1.0110
Geometric Ratio 1.0000 0.9747 1.0171 0.9915 | 1.0000 0.9682 1.0109 1.0106
Harmonic Mean 1.0000 0.9765 1.0192 0.9864 | 1.0000 0.9681 1.0132 1.0067
Minimum Rate 1.0000 1.0000 1.0290 1.0464 | 1.0000 1.0000 1.0049 1.0150
* These are the numbers brought forward into the summary section.
5.2. LINPACK (LNPK DP and LNPK SP)
The LINPACK benchmark has become one of the most widely used single benchmarks
to predict relative performance in scientific and engineering environments.
The usual LINPACK benchmark measures the time required to solve a 100x100 sys-
tem of linear equations using the LINPACK package. LINPACK results are meas-
ured in MFlops, millions of floating point operations per second. All numbers
are from [Dongarra 87], unless otherwise noted.
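The MFlops figure comes from dividing the conventional operation count for
factoring and solving an n-by-n system, roughly (2/3)*n^3 + 2*n^2 with n = 100
here, by the measured time; the time used below is invented for illustration.

    #include <stdio.h>

    /* MFlops for the 100x100 LINPACK case, using the benchmark's
     * conventional operation count of (2/3)*n^3 + 2*n^2. */
    double linpack_mflops(int n, double seconds)
    {
        double ops = (2.0 / 3.0) * n * n * n + 2.0 * (double) n * n;
        return ops / seconds / 1.0e6;
    }

    int main(void)
    {
        /* A machine solving the 100x100 system in 0.56 seconds would
         * be rated at about 1.2 MFlops. */
        printf("%.2f MFlops\n", linpack_mflops(100, 0.56));
        return 0;
    }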
The LINPACK package calls on a set of general-purpose utility routines called
BLAS -- Basic Linear Algebra Subroutines -- to do most of the actual computa-
tion. A FORTRAN version of the BLAS is available, and the appropriate routines
are included in the benchmark. However, vendors often provide hand-coded ver-
sions of the BLAS as a library package. Thus LINPACK results are usually cited
in two forms: FORTRAN BLAS and Coded BLAS. The FORTRAN BLAS actually come in
two forms as well, depending on whether the loops are 4X unrolled in the FOR-
TRAN source (the usual) or whether the unrolling is undone to facilitate recog-
nition of the loop as a vector instruction. According to the ground rules of
the benchmark, either may be used when citing FORTRAN BLAS results, although it
is typical to note rolled loops with the annotation ``(Rolled BLAS).''
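To make the distinction concrete, here is the DAXPY-style inner loop that
dominates LINPACK, sketched in C for illustration (the benchmark itself is
FORTRAN): the usual source unrolls it 4X by hand, while the rolled form leaves
a simple loop that a vectorizing compiler can recognize as one vector
operation.

    /* dy = dy + da*dx : the DAXPY kernel at the heart of the BLAS. */

    /* Rolled form: easy for a vectorizer to recognize. */
    void daxpy_rolled(int n, double da, const double *dx, double *dy)
    {
        int i;

        for (i = 0; i < n; i++)
            dy[i] += da * dx[i];
    }

    /* 4X hand-unrolled form, as in the usual FORTRAN BLAS source;
     * the cleanup of the n mod 4 leftover elements is omitted here. */
    void daxpy_unrolled(int n, double da, const double *dx, double *dy)
    {
        int i;

        for (i = 0; i + 3 < n; i += 4) {
            dy[i]     += da * dx[i];
            dy[i + 1] += da * dx[i + 1];
            dy[i + 2] += da * dx[i + 2];
            dy[i + 3] += da * dx[i + 3];
        }
    }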
For our own numbers, we've corrected a few to follow Dongarra more closely than
we have in the past. LINPACK output produces quite a few MFlops numbers, and
we've tended to use the fourth one in each group, which uses more iterations,
and thus is more immune to clock randomness. Dongarra uses the highest MFlops
number that appears, then rounds to two digits.
Note that relative ordering even within families is not particularly con-
sistent, illustrating the extreme sensitivity of these benchmarks to memory
system design.
100x100 LINPACK Results - FORTRAN and Coded BLAS
From [Dongarra 87], Unless Noted Otherwise
DP DP SP SP
Fortran Coded Fortran Coded System
.10 .10 .11 .11 Sun-3/160, 16.7MHz (Rolled BLAS)+
.11 .11 .13 .11 Sun-3/260, 25MHz 68020 + 20MHz 68881 (Rolled BLAS)+
.14 - - - Apollo DN4000, 25MHz (68020 o 68881) [ENEWS 87]
.14 - .24 - VAX 11/780, 4.3BSD, LLL Fortran [ours]
.14 .17 .25 .34 VAX 11/780, VAX/VMS
.20 - .24 - 80386+80387, 20MHz, 64K cache, GreenHills
.20 .23 .40 .51 VAX 11/785, VAX/VMS
.29 .49 .45 .69 Intergraph IP-32C, 30MHz Clipper [Intergraph 86]
.30 - - - IBM RT PC, optional FPA [IBM 87]
.33 - .57 - OPUS 300PM, Greenhills, 30MHz Clipper
.36 .59 .51 .72 Celerity C1230, 4.2BSD f77
.38 - .67 - 80386 + Weitek 1167, 20MHz, 64K cache, GreenHills
.39 .50 .66 .81 Ridge 3200/90
.41 .41 .62 .62 Sun-3/160, Weitek FPA (Rolled BLAS)+
.45 .54 .60 .74 HP9000 Model 840S [HP 87]
.46 .46 .86 .86 Sun-3/260, Weitek FPA (Rolled BLAS)+
.47 .81 .69 1.30 Gould PN9000
.49 .66 .84 1.20 VAX 8600, VAX/VMS 4.5
.49 .54 .62 .68 HP 9000/825S [HP 87]
.57 .72 .86 .87 HP9000 Model 850S [HP 87]
.60 .72 .93 1.2 MIPS M/500
.61 - .84 - DG MV20000-I, MV15000-20 [Stahlman, 87]
.65 .76 .80 .96 VAX 8500, VAX/VMS
.70 .96 1.3 1.9 VAX 8650, VAX/VMS
.78 - 1.1 - IBM 9370-90, VS FORT 1.3.0
.97 1.1 1.4 1.7 VAX 8550/8700/8800, VAX/VMS
1.0 1.3 1.9 3.6 MIPS M/800
1.1 1.1 1.6 1.6 SUN 4/260 (Rolled BLAS)+
1.2 1.7 1.3 1.6 ELXSI 6420
1.2 1.6* 2.3* 4.3 MIPS M/1000
1.6 2.0 1.6 2.0 Alliant FX-1 (1 CE)
2.1 - 2.4 - IBM 3081K H enhanced opt=3
3.0 3.3 4.3 4.9 CONVEX C-1/XP, Fort 2.0 (Rolled BLAS)
6.0 - - - Multiflow Trace 7/200 Fortran 1.4 (Rolled BLAS)
7.0 11.0 7.6 9.8 Alliant FX-8, 8 CEs, FX Fortran, v2.0.1.9
12 23 n.a. n.a. CRAY 1S CFT (Rolled BLAS)
39 57 n.a. n.a. CRAY X-MP CFT (Rolled BLAS)
43 - 44 - NEC SX-2, Fortran 77/SX (Rolled BLAS)
+ The Sun FORTRAN Rolled BLAS code appears to be optimal, so we used the same
numbers for Coded BLAS. The 4X unrolled numbers for Sun-4 are .86 (DP) and
1.25 (SP) [Hough 87].
* These numbers are as reported by Dongarra. We prefer the typical results,
which are slightly lower, viz. 1.2, 1.5, 2.2, and 4.3.
On the next page, we take a subset of these numbers, and normalize them to the
VAX/VMS 11/780.
100x100 LINPACK Results - FORTRAN and Coded BLAS
VAX/VMS Relative Performance
For A Subset of the Systems
Rel. Rel. Rel. Rel.
DP DP SP SP
Fortran Coded Fortran Coded System
.8 .6 .5 .3 Sun-3/260, 25MHz 68020 + 20MHz 68881 (Rolled)
1.0 1.0 1.0 1.0 VAX 11/780, VAX/VMS
1.4 - 1.0 - 80386 + 80387, 20MHz, 64K cache, GreenHills
2.0 2.9 1.8 2.0 Intergraph IP-32C, 30MHz Clipper [Intergraph 86]
2.7 - 2.7 - 80386 + Weitek 1167, 20MHz, 64K cache, GreenHills
2.9 2.4 2.5 1.8 Sun-3/160, Weitek FPA (Rolled BLAS)
3.3 2.7 3.4 2.5 Sun-3/260, Weitek FPA (Rolled BLAS)
3.5 3.9 3.4 3.5 VAX 8600, VAX/VMS 4.5
4.1 4.2 3.4 2.6 HP9000 Model 850S [HP 87]
4.3 4.2 3.7 3.5 MIPS M/500
6.9 6.6 5.6 5.0 VAX 8550/8700/8800, VAX/VMS
7.1 7.6 7.6 10.6 MIPS M/800
7.9 6.5 6.4 4.7 SUN 4/260 (Rolled BLAS)
8.6 9.4 9.2 12.6 MIPS M/1000
11.4 11.8 6.4 5.9 Alliant FX-1 (1 CE)
21.4 19.4 17.2 14.4 CONVEX C-1/XP, Fort 2.0 (Rolled BLAS)
50 65 30 28.8 Alliant FX-8, 8 CEs, FX Fortran, v2.0.1.9
307 - 176 - NEC SX-2, Fortran 77/SX (Rolled BLAS)
5.3. Spice Benchmarks (SPCE 2G6)
Spice [UCB 87] is a general-purpose circuit simulator written at U.C. Berke-
ley. Spice and its derivatives are widely used in the semiconductor industry.
It is a valuable benchmark because it shares many characteristics with other
real-world programs that are not represented in popular small benchmarks. It
uses both integer and floating-point computation heavily. The floating-point
calculations are not vector oriented, as in LINPACK. Also, the program itself
is very large and therefore tests both instruction and data cache performance.
We have chosen to benchmark Spice version 2g.6 because of its general availa-
bility. This is one of the later and more popular Fortran versions of Spice
distributed by Berkeley. We felt that the circuits distributed with the Berke-
ley distribution for testing and benchmarking were not sufficiently large and
modern to serve as benchmarks. In previous versions of this Brief, we presented
results on circuits we felt were representative, but which contained
proprietary data. This time, we gathered and produced appropriate benchmark
circuits that can be distributed, and have since been posted as public domain
on Usenet. The Spice group at Berkeley found these circuits to be up-to-date
and good candidates for Spice benchmarking. By distributing the circuits we
obtained results for many other machines. In the table below, "Geom Mean" is
the geometric mean of the 3 "Rel." columns.
Spice2G6 Benchmarks Results
digsr bipole comparator Geom
Secs Rel. Secs Rel. Secs Rel. Mean System
1354.0 0.60 439.6 0.68 460.3 0.63 .6 VAX 11/780 4.3BSD, f77 V2.0
993.5 0.81 394.3 0.76 366.9 0.80 .8 Microvax-II Ultrix 1.1, fortrel
901.9 0.90 285.1 1.0 328.6 0.89 .9 SUN 3/160 SunOS 3.2 f77 -O -f68881
848.0 0.95 312.6 0.96 302.9 0.96 1.0 VAX 11/780 4.3BSD, fortrel -opt
808.1 1.0 299.1 1.0 291.7 1.0 1.0 VAX 11/780 VMS 4.4 /optimize
744.8 1.1 221.7 1.3 266.0 1.1 1.2 SUN 3/260 SunOS 3.2 f77 -O -f68881
506.5 1.6 170.0 1.8 189.1 1.5 1.6 SUN 3/160 SunOS 3.2 f77 -O -ffpa
361.2 2.2 112.0 2.7 129.4 2.3 2.4 SUN 3/260 SunOS 3.2 f77 -O -ffpa
296.5 2.7 73.4 4.1 83.0 3.5 3.4 MIPS M/500
225.9 3.6 63.7 4.7 73.4 4.0 4.1 SUN 4/260 f77 -O3 -Qoption as -Ff0+
- - - - - - 5.3 VAX 8700 (estimate)
136.5 5.9 42.6 7.0 41.4 7.0 6.6 MIPS M/800
125.5 6.4 39.5 7.6 39.3 7.4 7.1 AMDAHL 470V7 VMSP FORTVS4.1
114.3 7.1 35.4 8.4 34.5 8.5 8.0 MIPS M/1000
48.0 16.8 12.5 23.9 17.5 16.7 18.9 FPS 20/64 VSPICE (2g6 derivative)
+ Sun numbers are from [Hough 87], who notes that the Sun-4 number was beta
software, and that a few modules did not optimize.
Benchmark descriptions:
digsr CMOS 9 bit Dynamic shift register with parallel load capability, i.e.,
SISO (Serial Input Serial Output) and PISO (Parallel Input Serial Out-
put), widely used in microprocessors. Clock period is 10 ns. Channel
length = 2 um, Gate Oxide = 400 Angstrom. Uses MOS LEVEL=2.
bipole Schottky TTL edge-triggered register. Supplied with nearly-coincident
inputs (synchronizer application).
comparator
Analog CMOS auto-zeroed comparator, composed of Input, Differential
Amplifier and Latch. Input signal is 10 microvolts. Channel Length =
3 um, Gate Oxide = 500 Angstrom. Uses MOS LEVEL=3. Each part is connected
by capacitance coupling, which is often used for offset cancellation.
(Sometimes called Toronto, in honor of its source).
Hspice is a commercial version of Spice offered by Meta-Software, which
recently published benchmark results for a variety of machines [Meta-software
87]. (Note that the M/800 number cited there was before the UMIPS-BSD 2.1 and
f77 1.21 releases, and the numbers have improved). The VAX 8700 Spice number
(5.3X) was estimated by using the Hspice numbers below for 8700 and M/800, and
the M/800 Spice number:
(5.5: 8700 Hspice) / (6.9: M/800 Hspice) X (6.6: M/800 Spice) yields 5.3X.
This section indicates that the performance ratios seem to hold for at least
one important commercial version as well.
Hspice Benchmarks Results
HSPICE-8601K
ST230
Secs Rel. System
166.5 .6 VAX 11/780, 4.2BSD
92.2 1.0 VAX 11/780 VMS
91.5 1.0 Microvax-II VMS
29.2 3.2 ELXSI 6400
29.1 3.2 Alliant FX/1
25.3 3.6 HyperSPICE (EDGE)
16.8 5.5 VAX 8700 VMS
16.3 5.7 IBM 4381-12
13.4 6.9 MIPS M/800 [ours]
11.3 8.2 MIPS M/1000 [ours]
3.27 28.2 IBM 3090
2.71 34.0 CRAY-1S
5.4. Digital Review (DIG REV)
The Digital Review magazine benchmark [DR 87] is a 3300-line FORTRAN program
that includes 33 separate tests, mostly floating-point, some integer. The
magazine reports the times for all tests, and summarizes them with the
geometric mean seconds shown below. All numbers below are from [DR 87], except
the M/500 and M/800 figures.
Digital Review Benchmarks Results
Secs Rel. System
9.17 1.0 VAXstation II/GPX, VMS 4.5
2.90 3.2 VAXstation 3200
2.32 4.0 VAX 8600, VMS 4.5
2.09 4.4 Sun-4, SunOS 3.2L
1.86 4.9 MIPS M/500 [ours]
1.584 5.8 VAX 8650
1.480 6.2 Alliant FX/8, 1 CE
1.469 6.2 VAX 8700
1.200 7.6 MIPS M/800 [ours]
1.193 7.7 ELXSI 6420
.990 9.3 MIPS M/1000*
.487 18.8 Convex C-1 XP
* The actual run number was .99, which [DR 87] reported as 1.00.
5.5. Doduc Benchmark (DDUC)
This benchmark [Doduc 86] is a 5300-line FORTRAN program that simulates aspects
of nuclear reactors, has little vectorizable code, and is thought to be
representative of Monte-Carlo simulations. It uses mostly double precision
floating point, and is often viewed as a ``nasty'' benchmark, i.e., it breaks
things, and makes machines underperform their usual VAX-mips ratings. Perfor-
mance is given as a number R normalized to 100 for an IBM 370/168-3 or 170 for
an IBM 3033-U, [ R = 48671/(cpu time in seconds) ], so that larger R's are
better.
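A quick check of the formula against two of the entries below (the helper
function is just illustrative):

    #include <stdio.h>

    /* Doduc figure of merit: R = 48671 / (cpu time in seconds). */
    double doduc_r(double cpu_seconds)
    {
        return 48671.0 / cpu_seconds;
    }

    int main(void)
    {
        printf("M/500  -O2 (553 secs): R = %.0f\n", doduc_r(553.0));  /* ~88  */
        printf("M/1000 -O3 (213 secs): R = %.0f\n", doduc_r(213.0));  /* ~229 */
        return 0;
    }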
In order of increasing performance, following are numbers for various machines.
All are from [Doduc 87] unless otherwise specified.
Double Precision Doduc Benchmark Results
DoDuc R Relative
Factor Perf. System
17 0.7 Sun3/110, 16.7MHz
19 0.7 Intel 80386 + 80387, 16MHz, iRMX
22 0.8 Sun-3/260, 25MHz 68020, 20MHz 68881
26 1.0 VAX 11/780, VMS
33 1.3 Fairchild Clipper, 30MHz, Green Hills
43 1.7 Sun-3/260, 25MHz, Weitek FPA
48 1.8 Celerity C1260
50 1.9 CCI Power 6/32
53 2.0 Edge 1
64 2.5 Harris HCX-7
85 3.3 Alliant FX/1
88 3.4 MIPS M/500, f77 -O2, runs 553 seconds
90 3.5 IBM 4381-2
90 3.5 Sun-4/200 [Hough 1987], SunOS 3.2L, runs 540 seconds
91 3.5 DEC VAX 8600, VAX/VMS
97 3.7 ELXSI 6400
99 3.8 DG MV/20000
100 3.8 MIPS M/500, f77 -O3, runs 488 seconds
101 3.9 Alliant FX/8
113 4.3 FPSystems 164
119 4.6 Gould 32/8750
129 5.0 DEC VAX 8650
136 5.2 DEC VAX 8700, VAX/VMS
150 5.7 Amdahl 470 V8, VM/UTS
178 6.8 MIPS M/800, f77 -O2, runs 273 secs
181 7.0 IBM 3081-G, F4H ext, opt=2
190 7.3 MIPS M/800, f77 -O3, runs 256 secs
218 8.4 MIPS M/1000, f77 -O2, runs 223 secs
229 8.8 MIPS M/1000, f77 -O3, runs 213 secs
236 9.1 IBM 3081-K
475 18.3 Amdahl 5860
714 27.5 IBM 3090-200, scalar mode
1080 41.6 Cray X/MP [for perspective: we have a lonnng way to go yet!]
5.6. Whetstone
Whetstone is a synthetic mix of floating point and integer arithmetic, function
calls, array indexing, conditional jumps, and transcendental functions [Curnow
76].
Whetstone results are measured in KWips, thousands of Whetstone interpreter
instructions per second. In this case, some of our numbers actually went down,
although compiled code has generally improved. First, the accuracy of several
library routines was improved, at a slight cost in performance. Second, on
machines this fast, relatively few clock ticks are actually counted, and UNIX
timing includes some variance. We've been running many runs and averaging.
We've now increased the loop counts from 10 to 1000 to increase the total run-
ning time to the point where the variance is reduced. This changed the bench-
mark slightly. Our experiences show some general uncertainty about the numbers
reported by anybody: we've heard that various different source programs are
being used.
Whetstone Benchmark Results
DP DP SP SP
KWips Rel. Kwips Rel. System
410 0.5 500 0.4 VAX 11/780, 4.3BSD, f77 [ours]
715 0.9 1,083 0.9 VAX 11/780, LLL compiler [ours]
830 1.0 1,250 1.0 VAX 11/780 VAX/VMS [Intergraph 86]
960 1.2 1,040 0.8 Sun3/160, 68881 [Intergraph 86]
1,110 1.3 1,670 1.3 VAX 11/785, VAX/VMS [Intergraph 86]
1,230 1.5 1,250 1.0 Sun3/260, 25MHz 68020, 20MHz 68881
1,400 1.7 1,600 1.3 IBM RT PC, optional FPA [IBM 87]
1,730 2.1 1,860 1.5 Intel 80386 + 80387, 20MHz, 64K cache, GreenHills
1,740 2.1 2,980 2.4 Intergraph InterPro-32C,30MHz Clipper[Intergraph86]
1,744 2.1 2,170 1.7 Apollo DN4000, 25MHz 68020, 25MHz 68881 [ENEWS 87]
1,860 2.2 2,400 1.9 Sun3/160, FPA
2,092 2.5 3,115 2.5 HP 9000/840S [HP 87]
2,433 2.9 3,521 2.8 HP 9000/825S [HP 87]
2,590 3.1 4,170 3.3 Intel 80386 + Weitek 1167, 20MHz, Green Hills
2,600 3.1 3,400 2.7 Sun3/260, Weitek FPA [measured elsewhere]
2,670 3.2 4,590 3.7 VAX 8600, VAX/VMS [Intergraph 86]
2,907 3.5 4,202 3.4 HP 9000 Model 850S [HP 87]
3,540 4.3 5,290 4.2 Sun-4 (reported secondhand, not confirmed)
- - 6,400 5.1 DG MV/15000-12
3,950 4.8 6,670 5.3 VAX 8700, VAX/VMS, Pascal(?) [McInnis, 1987]
4,000 4.8 6,900 5.5 VAX 8650, VAX/VMS [Intergraph 86]
4,120 5.0 4,930 3.9 Alliant FX/8 (1 CE) [Alliant 86]
4,200 5.1 - - Convex C-1 XP [Multiflow]
4,220 5.1 5,430 4.3 MIPS M/500
6,930 8.0 8,570 6.9 MIPS M/800
7,960 9.6 10,280 8.2 MIPS M/1000
12,605 15 - - Multiflow Trace 7/200 [Multiflow]
25,000 30 - - IBM 3090-200 [Multiflow]
35,000 42 - - Cray X-MP/12
6. Acknowledgements
Some people have noted that they seldom believe the numbers that come from cor-
porations unless accompanied by names of people who take responsibility for the
numbers. Many people at MIPS have contributed to this document, which was ori-
ginally created by Web Augustine. Particular contributors to this issue
include Mark Johnson (much Spice work, including creation of public-domain
Spice benchmarks), and especially Earl Killian (a great deal of work in various
areas, particularly floating-point). Final responsibility for the numbers in
this Brief is taken by the editor, John Mashey.
We thank David Hough of Sun Microsystems, who kindly supplied numbers for some
of the Sun configurations, even fixing a few of our numbers that were
incorrectly high, and who has also offered good comments on joint efforts look-
ing for higher-quality benchmarks.
We also thank Cliff Purkiser of Intel, who posted the Intel 80386 Whetstone and
LINPACK numbers on Usenet.
We also thank Greg Pavlov, who ran hordes of Stanford and Dhrystone benchmarks
for us on a VAX 8550, Ultrix 2.0 system.
7. References
[Alliant 86]
Alliant Computer Systems Corp, "FX/Series Product Summary", October 1986.
[Curnow 76]
Curnow, H. J., and Wichman, B. A., ``A Synthetic Benchmark'', Computing
Journal, Vol. 19, No. 1, February 1976, pp. 43-49.
[Doduc 87]
Doduc, N., FORTRAN Central Processor Time Benchmark, Framentec, June 1986,
Version 13. Newer numbers were received 03/17/87, and we used them where
different.
E-mail: seismo!mcvax!ftcsun3!ndoduc
[Dongarra 87]
Dongarra, J., ``Performance of Various Computers Using Standard Linear Equa-
tions in a Fortran Environment'', Argonne National Laboratory, August 10,
1987.
[Dongarra 87b]
Dongarra, J., Marin, J., Worlton, J., "Computer Benchmarking: paths and pit-
falls", IEEE Spectrum, July 1987, 38-43.
[DR 87]
"A New Twist: Vectors in Parallel", June 29, 1987, "The M/1000: VAX 8800
Power for Price of a MicroVAX II", August 24, 1987, and "VAXstation 3200
Benchmarks: CVAX Eclipses MicroVAX II", September 14, 1987. Digital Review,
One Park Ave., NY, NY 10016.
[ENEWS 87]
Electronic News, ``Apollo Cuts Prices on Low-End Stations'', July 6, 1987,
p. 16.
[Fleming 86]
Fleming, P.J. and Wallace, J.J.,``How Not to Lie With Statistics: The
Correct Way to Summarize Benchmark Results'', Communications of the ACM,
Vol. 29, No. 3, March 1986, 218-221.
[HP 87]
Hewlett Packard, ``HP 9000 Series 800 Performance Brief'', 5954-9903, 5/87.
(A comprehensive 40-page characterization of 825S, 840S, 850S).
[Hough 86,1]
Hough, D., ``Weitek 1164/5 Floating Point Accelerators'', Usenet, January
1986.
[Hough 86,2]
Hough, D., ``Benchmarking and the 68020 Cache'', Usenet, January 1986.
[Hough 86,3]
Hough, D., ``Floating-Point Programmer's Guide for the Sun Workstation'',
Sun Microsystems, September 1986. [an excellent document, including a good
set of references on IEEE floating point, especially on micros, and good
notes on benchmarking hazards]. Sun3/260 Spice numbers are from later mail.
[Hough 87]
Hough, D., ``Sun-4 Floating-Point Performance'', Usenet, 08/04/87.
[IBM 87]
IBM, ``IBM RT Personal Computer (RT PC) New Models, Features, and Software
Overview'', February 17, 1987.
[Intergraph 86]
Intergraph Corporation, ``Benchmarks for the InterPro 32C'', December 1986.
[Meta-Software 87]
Meta-Software, ``HSPICE Performance Benchmarks'', June 1987. 50 Curtner
Avenue, Suite 16, Campbell, CA 95008.
[McInnis 87]
McInnis, D., Kusik, R., Bhandarkar, D., ``VAX 8800 System Overview'', Proc.
IEEE COMPCON, March 1987, San Francisco, 316-321.
[McMahon 86]
``The Livermore Fortran Kernels: A Computer Test of the Numerical Perfor-
mance Range'', December 1986, Lawrence Livermore National Labs.
[MIPS 87]
MIPS Computer Systems, "A Sun-4 Benchmark Analysis", and "RISC System Bench-
mark Comparison: Sun-4 vs MIPS", July 23, 1987.
[Purkiser 87]
Purkiser, C., ``Whetstone and LINPACK Numbers'', Usenet, March 1987.
[Richardson 87]
Richardson, R., ``9/20/87 Dhrystone Benchmark Results'', Usenet, Sept. 1987.
Rick publishes the source several times a year. E-mail address:
...!seismo!uunet!pcrat!rick
[Serlin 87a]
Serlin, O., ``MIPS, DHRYSTONES, AND OTHER TALES'', Reprinted with revisions
from SUPERMICRO Newsletter, April 1986, ITOM International, P.O. Box 1450,
Los Altos, CA 94023.
Analyses on the perils of simplistic benchmark measures.
[Serlin 87b]
Serlin, O., SUPERMICRO #69, July 31, 1987. pp. 1-2.
Offers good list of attributes customers should demand of vendor benchmark-
ing.
[Stahlman 87]
Stahlman, M., "The Myth of Price/performance", Sanford C. Bernstein & Co,
Inc, NY, NY, March 17, 1987.
[Sun 86]
SUN Microsystems, ``The SUN-3 Family: A Hardware Overview'', August 1986.
[Sun 87]
SUN Microsystems, SUN-4 Product Introduction Material, July 7, 1987.
[UCB 87]
U. C. Berkeley, CAD/IC group, ``SPICE2G.6'', March 1987. Contact: Cindy
Manly, EECS/ERL Industrial Liaison Program, 479 Cory Hall, University of Cal-
ifornia, Berkeley, CA 94720.
[Weicker 84]
Weicker, R. P., ``Dhrystone: A Synthetic Systems Programming Benchmark'',
Communications of the ACM, Vol. 27, No. 10, October 1984, pp. 1013-1030.
________
UNIX is a Registered Trademark of AT&T. DEC, VAX, Ultrix, and VAX/VMS are
trademarks of Digital Equipment Corp. Sun-3 and Sun-4 are trademarks of Sun
Microsystems. Many others are trademarks of their respective companies.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

manson@tut.cis.ohio-state.edu (Bob Manson) (11/02/87)
Alright, I'm a little pissed over the recent posting of these supposed
"performance figures". Exactly what are these things supposed to mean? Well,
they are compiled programs on different systems that are run and supposed to
represent the speed of the various processors, MIPS etc. We all know that MIPS
is mostly a meaningless figure (Okay, so there's some debate there, but I see
no meaning to how many instructions per second a processor runs - "The COPYMEM
instruction blockmoves the entire memory to disk space, but takes 1,000,000
usec to execute, with an effective MIPS of 1/1000000").

Plus let's say I write my own compiler for my new machine. It has no
optimizing and is real brain-damaged - it writes code like a Forth machine
(lots of stack ops). It means nothing to try to compare my computer with my
dumb compiler to a well-developed VAX machine with a real compiler.

So why run benchmarks in compiled languages??? It's easier that way; you don't
have to write individual programs for each machine that might actually show
off their abilities and improved instructions. I'll admit that it does mean
something for compiled language users, but not really for performance
comparisons - you're comparing apples and oranges, or really the efficiency of
the compilers on the machines. If anyone has done any studies using machine
language programs I'd be very interested in that....

Bob M.
...!ihnp4!cbosgd!osu-cis!tut.cis.ohio-state.edu!manson
or manson@tut.cis.ohio-state.edu
Disclaimer: My employers don't care what I say....
--
Batches? We don't need no stinkin batches!
sjc@mips.UUCP (Steve "The" Correll) (11/03/87)
In article <864@tut.cis.ohio-state.edu>, manson@tut.cis.ohio-state.edu (Bob Manson) writes:
> Exactly what are these things supposed to mean?
> Well, they are compiled programs on different systems that are run and
> supposed to represent the speed of the various processors, MIPS etc. We
> all know that MIPS is mostly a meaningless figure...
> So why run benchmarks in compiled languages??? It's easier that way, you
> don't have to write individual programs for each machine that might actually
> show off their abilities and improved instructions. I'll admit that it does
> mean something for compiled language users but not really for performance
> comparisons-you're comparing apples and oranges, or really the efficiency
> of the compilers on the machines.

The best benchmark for person x is clearly the program which accounts for most
of the cycles that person executes. But there are so many different "x"s! We
assume that grep, nroff, and Unix system calls are important to most readers
of comp.arch, so we study them. A sizeable class of Fortran users tells us
that Linpack and the Livermore loops are representative of their programs, so
we study them. IC circuit designers tell us they execute most of their cycles
within Spice, so we pay a lot of attention to that.

If, on the other hand, you execute most of your cycles within hand-tuned
assembly language, and you are willing to revise your programs completely to
best use the instruction set of each new machine you acquire, and you are a
serious potential customer, the sales people at most computer vendors will be
happy to run your own specific benchmarks; ours do so all the time. Tuned code
is the right measurement for some people; compiled code is right for others.

I view a computer as a system, so for me it makes poor sense to omit the
effect of compilers. And since one can argue forever about how well a
hypothetical compiler _might_ use a particular instruction set, I prefer to
ask how well the best existing compiler _does_ use it. While I'm all in favor
of hand-coding inner loops and library routines to improve performance, one
can argue forever about how easy and how profitable that is; so I think the
best test of that is to measure the effects of the tradeoffs that people made
in constructing an actual OS and compiler system, rather than a hypothetical
one. An article in IEEE Micro some time back which measured assembly-coded
algorithms on 68xxx and xx86 machines seemed pretty useless to me, more like a
contest between assembly coders than an indication of the useful work I might
get out of the machines when running Unix or any other OS.

Incidentally, as explained in the Performance Brief, our definition of "mips"
is _not_ the meaningless "millions of instructions per second"; it's "number
of times faster than a Vax 780 on this particular problem", where we
arbitrarily declare a Vax 780 to have a mips rating of 1. We would be better
off using "Vax780s" rather than "mips" as our unit of measure, except that
we'd have to put so many "TM"s in the document that you wouldn't be able to
find the numbers. :-)
--
...decwrl!mips!sjc
Steve Correll