[net.arch] Benchmark Numbers for MIPS M/500 Dev. System long

mash@mips.UUCP (John Mashey) (10/15/86)
Having seen MIPS get dragged into the 386-68K wars over in net.micro.68k, I
posted a couple MIPS R2000 numbers, and offered to post some more here if peo-
ple were interested.  A bunch were, so here are the #s.  This is a super-
condensed excerpt from "MIPS Performance Brief - MIPS M/500 with UMIPS-BSD 1.0,
October 1986", copies of which can be obtained from Sondra Smith at MIPS
({decvax,ihnp4,ucbvax}!decwrl!mips!sondra).

This has much raw data, with minimal editorializing, so you can draw your own
conclusions.  However, please remember that these systems use a chip running at
half its design speed, whose architecture was spec'd just 18 months ago, and
run a UNIX(TM) whose source we got 3 months ago, hence untuned.  They ARE
real systems being shipped to customers in exactly the state that produced the
benchmark numbers below.

----- OVERVIEW -----
This compares results from the widely-used LINPACK, Whetstone, Byte, and Dhry-
stone benchmarks as well as the Stanford Benchmark Suite and MIPS UNIX Bench-
mark Suite run on various computers in-house at MIPS (DEC VAX-11/780(TM) and
VAX 8600, Sun-3/160M, and MIPS M/500 Development System). Live results are com-
pared with the simulation numbers published in the April 24, 1986 edition of
the Performance Brief.  Benchmarks that include floating point show live
numbers, using the interim R2360 Floating Point board, and simulations for the
forthcoming R2010 Floating Point Accelerator chip.  Here is the benchmark sum-
mary, followed by the details of configurations, actual times, etc.

                         Summary of Benchmark Results

                     (VAX 11/780 = 1.0; Bigger is Faster)
	(We call the 780 a 1Mips box, and an M/500 a 5Mips one).

                                  MIPS     MIPS    DEC VAX    Sun-3/   DEC VAX
                                 M/500    SIMUL      8600      160M     11/780
                                Relative Relative  Relative  Relative  Relative
Benchmark                         Perf     Perf      Perf      Perf      Perf

Integer Benchmarks
  MIPS UNIX Total+                  +5.4      4.9       3.7       2.0       1.0
  Dhrystone: registers, no opt       7.0      6.6       3.5       2.0       1.0
  Stanford Integer Aggregate         8.3      8.3       3.6       2.2       1.0

Floating Point Benchmarks&
  LINPACK: Double Precision+        &1.5    +&4.8       3.8       0.6       1.0
  Whetstone: Double Precision       &1.6     &6.3      #1.0       1.3       1.0
  Stanford FP Aggregate             &3.9     &9.0      #2.9       1.6       1.0

System Benchmarks
  Byte Total                         4.5        -       3.2       1.8       1.0


+ We consider these benchmarks to be the best indications of real performance.

& Note that floating point simulations show the expected performance of the
  R2010, whereas the actual benchmarks show that of the current R2360. Also,
  the current R2360 has a soon-to-be-removed restriction that impacts perfor-
  mance by 25-30%.

# These numbers are low because that 8600 has no FPA, as described later.

M/500 performance on realistic, live benchmarks can be summarized as follows,
with any futures shown in ().

-User-level integer performance now lies in the range 5.0-5.5X faster than a
VAX 11/780 running 4.3BSD UNIX.

-Kernel-intensive performance is in the range of 4.0-4.5X faster.  (After some
straightforward tuning, this should be in the 4.3-4.7X range.  Overall system
performance on normal user-kernel mixes (50u/50s to 70u/30s) should thus end up
at the 5X level, or slightly above.)

-Using the current version of the floating point board, floating-point perfor-
mance is about 1.5X.  (This will improve soon to about 2X, and then will go to
about 5X when we finish the R2010 FPA chip.)
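
As a rough check on the "5X level" estimate for user-kernel mixes in the second
item above, the two speedups can be combined with a time-weighted harmonic
mean.  This is only a sketch, not the method used in the Performance Brief: it
assumes the u/s split is measured on the 11/780, and the function name and the
midpoint speedups (5.25 user, 4.5 kernel after tuning) are illustrative choices
taken from the ranges quoted above.

```python
def overall_speedup(user_frac, user_speedup, kernel_speedup):
    """Combine per-mode speedups for a workload whose 11/780 time is
    user_frac user time and (1 - user_frac) kernel time."""
    kernel_frac = 1.0 - user_frac
    # Time on the faster machine, as a fraction of the 11/780 time.
    new_time = user_frac / user_speedup + kernel_frac / kernel_speedup
    return 1.0 / new_time

# Midpoints of the ranges in the text (assumed, for illustration only).
USER_X = 5.25    # user-level integer: 5.0-5.5X
KERNEL_X = 4.5   # kernel-intensive after tuning: 4.3-4.7X

for user_frac in (0.5, 0.7):
    mix = overall_speedup(user_frac, USER_X, KERNEL_X)
    print(f"{int(user_frac * 100)}u mix: {mix:.2f}X")
```

With these assumed midpoints the mixes come out at roughly 4.8X to 5.0X, i.e.,
in the neighborhood of the 5X estimate.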

-----HOW BENCHMARKS WERE DONE-----
The computer systems configurations described below do not necessarily reflect
optimal configurations, but rather the in-house systems to which we had repeat-
able access. We wish we had an FPA for the VAX 8600, which would give the 8600
substantially better performance on floating point.

VAX 11/780: 8MB; FPA; 4.3BSD; LLL Fortran for Whetstone & LINPACK

VAX 8600: 20MB; no FPA; Ultrix V1.2; LLL Fortran for Whetstone & LINPACK

SUN 3/160M: 8MB; 16.6Mhz 68020 + 12.5Mhz 68881; 4.2BSD, release 3.0

MIPS M/500: 8MB; R2300 CPU board (8Mhz R2000, 16K I-cache, 8K D-cache); R2360
interim Floating Point (Weitek 1164/65); 4.3BSD

Simulated numbers are also given (MIPS SIMUL columns in tables) for the MIPS
R2010 FPA chip, which is much faster than the R2360.  Cycle counts are:
FP Type Add SP  Add DP  Mult SP Mult DP Div SP  Div DP
R2360   10      12      10      14      35      66
R2010    2       2       4       5      11      18

I.e., the R2010 DP Add would be 250ns @ 8Mhz or 125ns @ 16Mhz; although I
believe in peak MegaFLOPS about as much as I believe in peak MIPS, (i.e., not
much!) people might call this a 4MFLOPS machine @ 8Mhz, or 8MFLOPS @ 16Mhz.
(Actually, it's faster than that, since it has multiple functional units and
can overlap operations).
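
The arithmetic behind those latency and peak-MFLOPS figures can be sketched as
follows.  The cycle counts are the ones in the table above; the function names
are made up for illustration, and the "peak" rate is the naive one-operation-
at-a-time figure, ignoring the overlap between functional units.

```python
# Cycle counts from the table above, in column order.
OPS = ("Add SP", "Add DP", "Mult SP", "Mult DP", "Div SP", "Div DP")
CYCLES = {
    "R2360": (10, 12, 10, 14, 35, 66),
    "R2010": (2, 2, 4, 5, 11, 18),
}

def latency_ns(cycles, clock_mhz):
    """Latency of one operation in nanoseconds at the given clock."""
    return cycles * 1000.0 / clock_mhz

def peak_mflops(cycles, clock_mhz):
    """Naive peak rate: one operation every `cycles` clocks, no overlap."""
    return clock_mhz / cycles

dp_add = CYCLES["R2010"][OPS.index("Add DP")]
for mhz in (8, 16):
    print(f"R2010 DP Add @ {mhz} MHz: {latency_ns(dp_add, mhz):.0f} ns, "
          f"{peak_mflops(dp_add, mhz):.0f} MFLOPS peak")
```

This prints 250 ns and 4 MFLOPS at 8 MHz, and 125 ns and 8 MFLOPS at 16 MHz.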

Systems were tested in a normal multi-user development environment (for fairness
to the 11/780 and 8600, whose daemons are hard to exorcise!), with load factor
<0.2 (as measured by uptime).  Benchmarks were run 3 times and averaged.

How to Interpret the Numbers

The tables include a "MIPS M/500" column and a "MIPS SIMUL" column.  The
numbers are computed as follows.  Times (or rates, such as for Dhrystones,
Whetstones, and LINPACK KFlops) are shown for the VAX 11/780.  Other machines'
times or rates are shown with their relative performance ("Rel." column) normal-
ized to the 11/780, which is treated as 1.0.
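
A minimal sketch of that normalization, with hypothetical function names: for
times the 11/780 figure goes on top (less time is faster), and for rates it
goes on the bottom (a higher rate is faster).  The sample values are taken
from the grep and Dhrystone rows in the tables below.

```python
def rel_from_time(base_secs, secs):
    """Relative performance for a time: less time means faster."""
    return base_secs / secs

def rel_from_rate(base_rate, rate):
    """Relative performance for a rate: a higher rate means faster."""
    return rate / base_rate

# grep user time: 11/780 = 4.7 secs, M/500 = 1.0 secs.
print(f"grep:      {rel_from_time(4.7, 1.0):.1f}")
# Dhrystone with registers: 11/780 = 1,474/sec, M/500 = 10,309/sec.
print(f"Dhrystone: {rel_from_rate(1474, 10309):.1f}")
```

Since the tables show rounded times, recomputing a "Rel." figure from them can
differ from the published one in the last digit.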

The SIMUL times/rates shown are those we got doing simulations 6-8 months ago.
We normalize those times/rates to the same 11/780 4.3BSD used throughout this
report.  Note that the April 24, 1986 Performance Brief shows the same
times/rates as the SIMUL column, but often shows a different relative perfor-
mance number.  Not only has our software changed, but our 11/780 went from
4.2BSD to 4.3BSD, so the base for normalization changed also.

The M/500 benchmark numbers use production release 1.0 of the MIPS compilers
and UMIPS-BSD 1.0.  The latter is a 4.3BSD UNIX port, compiled at optimization
level -O2, but otherwise untuned.

All benchmarks were compiled -O, i.e., with optimization.  UMIPS compilers call
this level -O2, and it includes serious global optimization.

----- DETAILED BENCHMARK DATA -----
The MIPS UNIX Benchmarks described below are fairly typical of nontrivial UNIX
programs.  This benchmark suite provides the opportunity to execute the same
code across several different machines.  User time is shown; kernel time is
typically 10-15% of the user time (on the 780).

                         MIPS UNIX Benchmarks Results


               MIPS          MIPS         DEC VAX       Sun-3/        DEC VAX
               M/500         SIMUL         8600          160M         11/780
Benchmark   Secs   Rel.   Secs   Rel.   Secs   Rel.   Secs   Rel.   Secs   Rel.
bm           0.2   6.0     0.2   6.0     0.3   4.0     0.6   2.0     1.2   1.0
grep         1.0   4.7     1.0   4.7     1.2   3.9     2.4   2.0     4.7   1.0
diff         0.4   6.7     0.5   5.4     0.9   3.0     1.7   1.6     2.7   1.0
yacc         0.6   5.7     0.7   4.9     0.9   3.8     1.7   2.0     3.4   1.0
nroff        3.5   5.4     3.9   4.8     5.0   3.8     9.0   2.1    18.8   1.0

Total +      5.7   5.4     6.3   4.9     8.3   3.7    15.4   2.0    30.8   1.0


+ Sum of the time for all benchmarks.  "Total Rel." is ratio of the totals.
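
To make the "ratio of the totals" point concrete, here is a small sketch using
the user times from the table above.  The helper names are hypothetical; the
mean-of-ratios function is included only for contrast and is not a figure
reported in this posting.

```python
# User seconds from the table above, in row order: bm, grep, diff, yacc, nroff.
SECS = {
    "M/500":  [0.2, 1.0, 0.4, 0.6, 3.5],
    "SIMUL":  [0.2, 1.0, 0.5, 0.7, 3.9],
    "8600":   [0.3, 1.2, 0.9, 0.9, 5.0],
    "Sun-3":  [0.6, 2.4, 1.7, 1.7, 9.0],
    "11/780": [1.2, 4.7, 2.7, 3.4, 18.8],
}

def total_rel(machine, base="11/780"):
    """Ratio of summed times: the "Total Rel." figure."""
    return sum(SECS[base]) / sum(SECS[machine])

def mean_of_rels(machine, base="11/780"):
    """Unweighted mean of per-benchmark ratios, shown only for contrast."""
    ratios = [b / m for b, m in zip(SECS[base], SECS[machine])]
    return sum(ratios) / len(ratios)

for name in SECS:
    print(f"{name:7s} total rel {total_rel(name):.1f}   "
          f"mean of rels {mean_of_rels(name):.1f}")
```

The ratio of totals weights each benchmark by its running time, so nroff
dominates here; an unweighted mean of the ratios would give the short runs
(bm, diff) as much say as the long ones.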


Dhrystone Benchmark Results, version 1.1



                  MIPS         MIPS       DEC VAX       Sun-3/      DEC VAX
                 M/500        SIMUL         8600         160M        11/780
              Dhry's       Dhry's       Dhry's       Dhry's       Dhry's
Benchmark      /Sec   Rel.  /Sec   Rel.  /Sec   Rel.  /Sec   Rel.  /Sec   Rel.
no Registers   8,855  6.1       -   -        -   -    2,730  1.9   1,442  1.0
Registers     10,309  7.0   9,510  6.6   5,130  3.5   2,954  2.0   1,474  1.0
-O, no Regs   12,323  7.9       -   -        -   -    2,945  1.9   1,559  1.0
-O, Registers 12,329  7.8  11,280  7.2       -   -    3,243  2.1   1,571  1.0


Although this commonly-used benchmark claims that the M/500 is 7-8X faster than
the 11/780, we believe that this benchmark is not necessarily representative.


Stanford Small Integer Benchmarks

These benchmarks are included primarily for perspective. It is well known that
small benchmarks can be misleading.  We would not claim that the R2300 perfor-
mance is 8-9 times that of a VAX-11/780 based on this one benchmark.

                   Stanford Small Integer Benchmark Results


                MIPS         MIPS        DEC VAX       Sun-3/       DEC VAX
                M/500        SIMUL        8600          160M         11/780
Benchmark    Secs   Rel.  Secs   Rel.  Secs   Rel.   Secs   Rel.   Secs   Rel.
Perm         0.21   11.1  0.21   11.1  0.63    3.7   0.72    3.3   2.34    1.0
Towers       0.25    9.2  0.27    8.5  0.63    3.7   1.07    2.2   2.30    1.0
Queen        0.14    6.7  0.14    6.7  0.27    3.5   0.50    1.9   0.94    1.0
Intmm        0.23    7.3  0.21    8.0  0.73    2.3   0.93    1.8   1.67    1.0
Puzzle       1.13    9.9  0.94   11.9  2.96    3.8   5.53    2.0  11.23    1.0
Quick        0.17    6.6  0.17    6.6  0.31    3.6   0.58    1.9   1.12    1.0
Bubble       0.19    7.9  0.19    7.9  0.44    3.4   0.97    1.6   1.51    1.0
Tree         0.36    7.6  0.31    8.8  0.69    3.9   1.05    2.6   2.72    1.0

Aggregate *  0.37    8.3  0.37    8.3  0.86    3.6   1.42    2.2   3.08    1.0

Total +      2.68    8.9  2.43    9.8  6.66    3.6  11.35    2.1  23.83    1.0


* As weighted by the Stanford Benchmark Suite

+ Simple summation of the times for all benchmarks


Floating Point Benchmarks

                           LINPACK Benchmark Results


               MIPS          MIPS        DEC VAX        Sun-3/       DEC VAX
              M/500@        SIMUL*        8600#         160M&         11/780
Benchmark  KFlops  Rel.  KFlops  Rel.  KFlops  Rel.  KFlops  Rel.  KFlops  Rel.
Dbl Prec      193  1.5      625  4.8      490  3.8       80  0.6      130  1.0
Sngl Prec     299  1.2      890  3.7      610  2.5       85  0.4      240  1.0


@ Recall that this uses the R2360 FP board, while SIMUL uses R2010.

* Projection based on inner loop calculations, including cache effects, using
R2010 FPA.

# VAX 8600 (apparently) with FPA [Dongarra 85]

  133 (DP) was measured on the in-house 8600 without FPA

& [Sun 86]; Sun quotes performance with 68881 as 101 (DP), 108 (SP), and with
  FPA board as 405 (DP), 625 (SP).


                           Whetstone Benchmark Results


                       MIPS        MIPS       DEC VAX      Sun-3/      DEC VAX
                      M/500       SIMUL*       8600 #      160M &      11/780 #
  Benchmark         KWips Rel.  KWips  Rel.  KWips Rel.  KWips  Rel.  KWips Rel.
  Double Precision  1,164 1.6   4,500  6.3     680 1.0     930  1.3     715 1.0
  Single Precision  1,808 1.7   6,200  5.7   1,720 1.6   1,030  1.0   1,083 1.0


* Simulation with R2010 FPA.

# UNIX 4.3 BSD, Lawrence Livermore FORTRAN compiler results

  This compiler produces results more consistent with the VMS FORTRAN compiler,
  which is known to have substantially higher performance than the f77 results
  measured at MIPS.  Remember 8600 is lacking FPA.

& Note that the in-house Sun-3/160M benchmarked better than Sun's published
  data:
      790 (DP), 860 (SP) [Hough 86,1]

  Sun quotes performance with FPA board as 1,700 (DP), 2,300 (SP) [Hough 86,1]


                    Stanford Floating Point Benchmark Results


                      MIPS         MIPS       DEC VAX       Sun-3/      DEC VAX
                      M/500       SIMUL@       8600 #       160M #      11/780
  Benchmark        Secs   Rel.  Secs  Rel.   Secs  Rel.   Secs  Rel.  Secs   Rel.
  FFT              1.02   3.9   0.44  9.0    1.42  2.8    3.12  1.3   3.97   1.0
  Matrix Multiply  1.74   1.3   0.26  8.9    1.37  1.7    2.14  1.1   2.31   1.0

  Aggregate *      1.43   3.9   0.61  9.0    1.88  2.9    3.43  1.6   5.52   1.0
  Total +          2.76   2.3   0.70  9.0    2.79  2.2    5.26  1.2   6.28   1.0


@ Simulation with R2010 FPA.

* As weighted by the Stanford Benchmark Suite
   (includes some contribution from the integer results)

+ Simple summation of the times for both benchmarks

# Performance would be substantially improved with the respective FPA boards.


System Benchmarks

Byte Benchmarks

The Byte tests are intended to provide a mixture of user and kernel-level
actions.  Although used frequently, their user:kernel balance of 30u:70s is
unlike normal time-sharing use, which is more like 50u:50s or 70u:30s.

In the table below, "Usr" and "Sys" are the CPU times reported by /bin/time,
and "Tot" is the sum of those two.  "Rel" is the speed relative to the 11/780.

We run these with a makefile that includes 3 each of calla, calle, loop, sieve,
pipes, and scall, and the time shown is the mean of the 3 times.  The others
are run once.  We run the makefile 3 times, and average everything, especially
since these runs are quite short, and even .1 second variation can cause large
jumps in relative performance.

                                Byte Benchmarks


             MIPS             DEC VAX           Sun-3/            DEC VAX
             M/500             8600              160M              11/780
Bench  Usr  Sys  Tot Rel Usr  Sys  Tot Rel Usr  Sys  Tot Rel  Usr  Sys  Tot Rel

calla  0.0  0.0  0.0   - 0.0  0.0  0.0   - 0.0  0.0  0.0   -  0.1  0.0  0.1 1.0
calle  0.0  0.0  0.0   - 0.2  0.0  0.2 5.5 0.2  0.0  0.2 5.5  1.1  0.0  1.1 1.0
loop   0.3  0.0  0.3 8.0 0.9  0.0  0.9 2.7 1.5  0.0  1.5 1.6  2.4  0.0  2.4 1.0
sieve  0.2  0.0  0.2 7.0 0.4  0.0  0.4 3.5 0.6  0.0  0.6 2.3  1.4  0.0  1.4 1.0

pipes  0.0  0.7  0.7 2.6 0.0  0.3  0.3 6.0 0.0  1.3  1.3 1.4  0.2  1.6  1.8 1.0
scall  0.1  0.7  0.8 4.8 0.2  0.8  1.0 3.8 0.5  2.0  2.5 1.5  0.6  3.2  3.8 1.0
diskw  0.0  0.2  0.2 3.5 0.0  0.2  0.2 3.5 0.0  0.5  0.5 1.4  0.0  0.7  0.7 1.0
diskr  0.0  0.2  0.2 2.5 0.0  0.2  0.2 2.5 0.0  0.3  0.3 1.7  0.0  0.5  0.5 1.0

sh 1   0.0  0.4  0.4 4.1 0.1  0.5  0.6 2.8 0.1  0.8  0.9 1.9  0.3  1.4  1.7 1.0
sh 2   0.1  0.7  0.8 4.0 0.2  0.9  1.1 2.9 0.2  1.5  1.7 1.9  0.6  2.6  3.2 1.0
sh 3   0.1  1.1  1.2 4.0 0.3  1.4  1.7 2.8 0.2  2.3  2.5 1.9  1.0  3.8  4.8 1.0
sh 4   0.2  1.5  1.7 3.8 0.4  1.9  2.3 2.8 0.4  3.1  3.5 1.8  1.3  5.1  6.4 1.0
sh 5   0.3  1.6  1.9 4.3 0.5  2.4  2.9 2.8 0.4  4.1  4.5 1.8  1.8  6.3  8.1 1.0
sh 6   0.4  2.0  2.4 4.1 0.6  2.9  3.5 2.8 0.5  5.0  5.5 1.8  2.1  7.7  9.8 1.0

Total* 2.9 11.9 14.8 4.5 7.2 13.7 20.9 3.2 9.6 27.5 38.1 1.8 24.5 42.5 67.0 1.0


* Summation of times as we run them, which includes multiple instances of some
of the tests, whose average numbers are shown.  Precisely:

  total = 3* (calla + calle + loop + pipes + scall + sieve) +
           diskw + diskr + sh1 + sh2 + sh3 + sh4 + sh5 + sh6
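
As a sketch of how that formula reproduces the Total row, here it is applied
to the "Tot" column for the M/500 and the 11/780.  The dictionary layout and
function name are illustrative only; the numbers are copied from the table.

```python
# "Tot" column (user + system CPU seconds) from the Byte table above.
TOT = {
    "M/500": {"calla": 0.0, "calle": 0.0, "loop": 0.3, "sieve": 0.2,
              "pipes": 0.7, "scall": 0.8, "diskw": 0.2, "diskr": 0.2,
              "sh1": 0.4, "sh2": 0.8, "sh3": 1.2, "sh4": 1.7,
              "sh5": 1.9, "sh6": 2.4},
    "11/780": {"calla": 0.1, "calle": 1.1, "loop": 2.4, "sieve": 1.4,
               "pipes": 1.8, "scall": 3.8, "diskw": 0.7, "diskr": 0.5,
               "sh1": 1.7, "sh2": 3.2, "sh3": 4.8, "sh4": 6.4,
               "sh5": 8.1, "sh6": 9.8},
}

# Tests that the makefile runs 3 times each; the others are run once.
RUN_3X = ("calla", "calle", "loop", "pipes", "scall", "sieve")

def byte_total(times):
    """Apply the total formula: 3x the repeated tests plus the rest once."""
    return sum(t * (3 if name in RUN_3X else 1) for name, t in times.items())

m500 = byte_total(TOT["M/500"])
vax = byte_total(TOT["11/780"])
print(f"M/500 {m500:.1f}s, 11/780 {vax:.1f}s, Rel {vax / m500:.1f}")
```

This gives 14.8 and 67.0 seconds, for a ratio of 4.5, matching the Total row.
Because the table entries are rounded averages, recomputing the other columns
this way may not reproduce every published total exactly.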

What seems clear is that, even if this benchmark were considered typical (it
isn't), the other 3 machines have not yet caught up with the tuning done on the
11/780, i.e., their performance ratios before the 4.2->4.3 11/780 conversion
were more in line with the commonly-viewed models of 5 (M/500), 4.2 (8600), and
2.0 (Sun).  We'd assume the 780 has an "unfair" advantage right now, since the
8600 and Sun are probably tuned, but don't have some of the 4.3 improvements,
and the M/500 has 4.3, but hasn't been otherwise tuned.


------ SUMMARY -------
Those are the numbers, folks.  The Performance Brief has more detail and back-
ground, caveats, conjectures on what the numbers mean, references, detailed
comparisons with past simulations, etc., but you have the meat of it here.

DEC & VAX are trademarks of Digital Equipment Corporation. UNIX is a Registered
Trademark of AT&T.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086