mash@mips.UUCP (04/02/87)
MIPS Performance Brief, Issue 2.2.  April 1, 1987 [but real!]

This is a highly-crunched version of the publication, with much of the
extra explanation and commentary omitted.  Most of what's left is large
tables of comparative numbers.  We're interested in additions or
corrections.  Nice, uncrunched printed copies are available [they do
contain a little advertising that disappeared in this version!]  The
2nd page of this summarizes the benchmarks, followed by tons of detail.

1. Introduction

New Features of This Issue

New features in this issue include benchmarks using the new R2010
Floating Point chip, in both the existing MIPS M/500 and our new M/800
systems.  We've also gathered numbers from other processors for more
comparisons.  Some benchmarks are now normalized to the VAX-11/780
under VAX/VMS, rather than UNIX.  Finally, we include benchmarks for
UMIPS-BSD and UMIPS-V.

Benchmarking - Caveats and Comments

While no one benchmark can fully characterize overall system
performance, the results of several benchmarks can give some insight
into expected real performance.  A more important benchmarking
methodology is a side-by-side comparison of two systems running the
same real application.

We don't believe in characterizing a processor with only a single
number, but we follow (what seems to be) standard industry practice of
using a mips-rating that essentially describes overall integer
performance.  Thus, we label a 5 mips machine to be one that is about
5X (i.e., anywhere from 4X to 6X!) faster than a VAX 11/780 (UNIX
4.3BSD) on integer performance, since this seems to be how most people
compute mips-ratings.  Even within the same computer family,
performance ratios between processors vary widely.  For example,
[McInnis 87] characterizes a VAX 8700 as anywhere from 3X to 7X faster
than the 11/780.  Floating point performance often varies even more
than integer performance does, and scales up more slowly versus the
11/780.
Most of this paper analyzes one important aspect of overall computer
system performance - user-level CPU performance.  A small set of
operating system benchmarks is also included.

2. Benchmark Summary

The following table summarizes the benchmark results described in more
detail throughout this document.

On three rows, all VAX systems run VAX/VMS, and all numbers are
computed relative to a VAX 11/780 running VAX/VMS.  The VAX/VMS
FORTRAN numbers were often 1.1-1.2X faster than our Lawrence Livermore
Labs FORTRAN, and often 2X better than those of 4.3BSD.  We've done
this because many people are not familiar with the LLL FORTRAN, and
don't take any UNIX FORTRAN seriously unless compared with VMS.  Other
rows use 4.3BSD UNIX as the base for relative performance, but show a
few VAX/VMS numbers, i.e., those marked #.  The rows marked + are
those that we consider truly representative benchmarks.

                  Summary of Benchmark Results
              (VAX 11/780 = 1.0, Bigger is Faster)
              (VAX/VMS when possible, UNIX otherwise)

                             MIPS    VAX    MIPS    VAX   Sun-3/   VAX
                             M/800   8650   M/500   8600   160M  11/780
                             Rel.    Rel.   Rel.    Rel.   Rel.   Rel.
Benchmark                    Perf.   Perf.  Perf.   Perf.  Perf.  Perf.

Integer Benchmarks
  MIPS UNIX Total+            9.0     -      5.6    3.7    2.0    1.0
  Dhrystone: registers,       9.7    6.9#    6.6    4.7#   2.1    1.0
    no opt for MIPS, -O for others
  Stanford Integer Aggregate 12.8     -      8.5    3.6    2.2    1.0

Floating Point Benchmarks
  (Relative to VAX/VMS)
  LINPACK: FORTRAN DP+        5.7    5.0     4.1    3.5    0.7    1.0
  Whetstone: Double Prec.+    8.3    4.8     5.4    3.2    1.2    1.0
  DoDuc: Double Precision+    5.6    5.0     3.6    3.5    0.7    1.0
  (Relative to 4.3BSD UNIX)
  Spice, elh2a+               7.6     -      4.0     -     1.6    1.0
  Stanford FP Aggregate (C)  14.9     -      9.4    2.9&   1.6@   1.0

System Benchmarks
  Byte Total (UMIPS-BSD)      7.4     -      4.6    3.2    1.8    1.0
  Byte Total (UMIPS-V)         -      -      5.4    3.2    1.8    1.0

& In-house, FPA-less VAX 8600, Ultrix.
@ In-house Sun3/160 had 12.5MHz 68881.  Other FP benchmarks used
  published figures with 16.6MHz 68881s.

3. Methodology

3.1.
Tested Configurations

The computer system configurations described below do not necessarily
reflect optimal configurations, but rather the in-house systems to
which we had repeatable access.  When we've had faster results
available, we've quoted them in place of our own systems' numbers.

DEC VAX-11/780
     Main Memory: 8 Mbytes
     Floating Point: Configured with FPA board.
     Operating System: 4.3BSD UNIX.

DEC VAX 8600
     Main Memory: 20 Mbytes
     Floating Point: Configured without FPA board.
     Operating System: Ultrix V1.2 (4.2BSD with many 4.3BSD tunings).

Sun-3/160M
     CPU: 16.67 MHz MC68020
     Main Memory: 8 Mbytes
     Floating Point: 12.5 MHz MC68881 coprocessor (compiled -f68881).
     Operating System: 4.2BSD UNIX, Release 3.2.

MIPS M/500
     CPU: 8 MHz R2000, on R2300 CPU board, 16K I-cache, 8K D-cache
     Floating Point: R2010 FPA chip (8 MHz)
          (has 1 performance bug, see next page)
     Main Memory: 8 Mbytes (2 R2350 memory boards)
     Operating System: UMIPS-BSD 2.0Beta (4.3BSD);
          UMIPS-V 1.1 (System V R3) also shown where it differs from BSD

MIPS M/800
     CPU: 12.5 MHz R2000, on R2600 CPU board, 64K I-cache, 64K D-cache
     Floating Point: R2010 FPA chip (12.5 MHz)
          (has 1 performance bug, see next page)
     Main Memory: 8 Mbytes (2 R2350 memory boards)
     Operating System: UMIPS-BSD 2.0Beta (4.3BSD)

Test Conditions

All programs were compiled with -O (optimize).  C is used for all
benchmarks except Whetstone, LINPACK, and DoDuc, which use FORTRAN.
When possible, we've obtained numbers for VAX/VMS, and use them in
place of UNIX numbers.  The MIPS compilers are version 1.10.

User time was measured for all benchmarks using the /bin/time command.
Systems were tested in a normal multi-user development environment,
with load factor <0.2 (as measured by the uptime command).  (Note that
this occasionally makes them run longer, due to slight interference
from background daemons and clock handling, even on an otherwise empty
system.)  Benchmarks were run at least 3 times and averaged.
The intent is to show numbers that people can reproduce.

3.2. MIPS Results

MIPS R2300 and R2600 CPU Boards

The MIPS R2300 CPU board is based on the 8 MHz (125 ns cycle time)
version of the MIPS R2000 CPU chip.  The R2300 also includes the MIPS
R2010 Floating Point Accelerator chip, a 16 Kbyte instruction cache,
an 8 Kbyte data cache, and a four-stage write buffer.  The MIPS R2600
CPU board is similar to the R2300, but runs at 12.5 MHz (80 ns cycle
time), and has larger caches (64 KB each for instruction and data).

MIPS R2010 FPA Chip

The R2010 has the following cycle counts:

                      Single      Double
     Operation        Precision   Precision
     Add, Subtract     2 cycles    2 cycles
     Multiply          4 cycles    5 cycles
     Divide           12 cycles   19 cycles

This yields 250 ns for an Add @ 8 MHz, 160 ns @ 12.5 MHz, or 120 ns @
16.6 MHz.  In addition, the R2010 overlaps operations extensively:
load and store can be overlapped with arithmetic, while multiply,
divide, and add are mostly independent.  Note: the early parts (first
silicon) used in these benchmarks had a bug that inhibited overlap of
add, multiply, and divide.

Peak Performance Numbers

We don't believe these mean much, but people ask.  Note that
VAX-Relative and Peak mips are not the same!

                               MIPS     MIPS
     Category                  M/800    M/500
     Peak (Burst) Mips         12.5      8.0
     Peak DP Megaflops          6.25     4.0

How to Interpret the Numbers

Times (or rates, such as for Dhrystones, Whetstones, and LINPACK
KFlops) are shown for the VAX 11/780.  Other machines' times or rates
are shown, with relative performance (the "Rel." column) normalized to
the 11/780, treated as 1.0.  VAX/VMS is used as the base whenever
possible.

Compilers and Operating Systems

Unless otherwise specified, the M/500 and M/800 benchmark numbers use
Release 1.10 of the MIPS compilers and UMIPS-BSD 2.0 Beta.  Some
numbers are also given for UMIPS-V 1.1, also using compiler release
1.10.
Most user-level benchmarks have similar performance under the two
UMIPS variants, so we report only major differences, mainly for the
Byte benchmarks.

Optimization Levels

Unless otherwise specified, all benchmarks were compiled -O, i.e.,
with optimization.  UMIPS compilers call this level -O2, and it
includes global optimization.  In a few cases, we show numbers for -O3
and -O4 optimization levels, which do inter-procedural register
allocation and procedure merging.

4. Integer Benchmarks

4.1. MIPS UNIX Benchmarks

The MIPS UNIX Benchmarks described below are fairly typical of
nontrivial UNIX programs.  This benchmark suite provides the
opportunity to execute the same code across several different
machines, in contrast to the compilers and linkers for each machine,
which have substantially different code.  User time is shown; kernel
time is typically 10-15% of the user time (on the 780), so these are
good indications of integer/character compute-intensive programs.

                  MIPS UNIX Benchmark Results

           MIPS        MIPS       DEC VAX     Sun-3/      DEC VAX
           M/800       M/500       8600        160M       11/780
Benchmark  Secs Rel.  Secs Rel.  Secs Rel.   Secs Rel.   Secs Rel.
grep        0.6  7.8   1.0  4.7   1.2  3.9    2.3  2.0    4.7  1.0
diff        0.3  9.0   0.4  6.7   0.9  3.0    1.8  1.5    2.7  1.0
yacc        0.4  8.5   0.6  5.7   0.9  3.8    1.7  2.0    3.4  1.0
nroff       2.0  9.4   3.3  5.7   5.0  3.8    9.0  2.1   18.8  1.0
Total+      3.3  9.0   5.3  5.6   8.0  3.7   14.8  2.0   29.6  1.0

+ Simple summation of the time for all benchmarks.  "Total Rel." is
  the ratio of the totals.

Note: in order to assure "apples-to-apples" comparisons, we moved the
same copies of the sources for these to the various machines, compiled
them there, and ran them, to avoid surprises from different binary
versions of commands resident on those machines.  Note that the
granularity here is at the edge of UNIX timing, i.e., tenths of
seconds make large differences.

4.2. Dhrystone

Dhrystone is a synthetic programming benchmark that measures processor
and compiler efficiency in executing a ``typical'' program
[Weicker 84].
The Dhrystone results shown below are measured in Dhrystones/second,
using the 1.1 version of the benchmark.  According to [Richardson 87],
1.1 cleans up a bug, and is the correct version to use, even though
results for a given machine are typically about 15% lower with 1.1
than with 1.0.

                     Dhrystone Benchmark Results

                  MIPS          MIPS         DEC VAX      Sun-3/       DEC VAX
                  M/800         M/500         8600         160M        11/780
                  Dhry's        Dhry's       Dhry's       Dhry's       Dhry's
Benchmark         /Sec   Rel.   /Sec   Rel.  /Sec   Rel.  /Sec   Rel.  /Sec   Rel.
no Registers      12,900  8.9    8,900  6.2  4,896  3.4   2,800  1.9   1,442  1.0
Registers         15,300 10.4   10,300  7.0  5,130  3.5   3,025  2.0   1,474  1.0
-O, no Registers  18,500 11.9   12,500  8.0  5,154  3.3   3,030  1.9   1,559  1.0
-O, Registers     18,500 11.8   12,500  8.0  5,235  3.3   3,325  2.1   1,571  1.0
-O3, Registers    20,000   -    13,000   -     -     -      -     -      -     -
-O4, Registers    21,300   -    14,200   -     -     -      -     -      -     -

Advice for running Dhrystone has changed over time with regard to
optimization, and currently asks that people turn off optimizers that
are more than peephole optimizers.  [This penalizes people who have
good optimizers versus those who don't!]  However, from examination of
the published numbers, and from personal knowledge, we've found that
many people are using compilers that do substantial optimization, and
are publishing those numbers as their standard Dhrystone ratings.

We continue to include a range of numbers to show the difference
optimization technology makes on this particular benchmark, and to
provide a range for comparison when others' cited Dhrystone figures
are not clearly defined by optimization level.  For example, -O3 does
interprocedural register allocation, and -O4 does procedure inlining,
and we suspect -O4 is beyond the spirit of the benchmark.  To compare
with other systems, we'd suggest using the number 18,500 (12,500 for
the M/500), unless you know that a number was obtained without
optimization, in which case use 15,300 (10,300).
Although this commonly-used benchmark claims that the M/500 is 7-8X
faster than the 11/780, we believe that this benchmark is not
necessarily representative.  Some other published numbers of interest
include the following, all of which are taken from [Richardson 87]
unless otherwise noted.  Items marked * are those that we know (or
have good reason to believe) use optimizing compilers.  These are the
"register" versions of the numbers, i.e., the highest ones reported by
people.  Note that we standardly report the unoptimized versions (-O1)
of our numbers, although we also report -O2 numbers below.  Thus,
we're giving every other machine every possible benefit of the doubt
on this one.

             Selected Dhrystone Benchmark Results

Dhry's
/Sec    Rel.  Processor
 1,571   1.0  VAX 11/780, 4.3BSD [our number]
 1,757   1.1  VAX 11/780, VAX/VMS 4.2 [Intergraph 86]*
 3,325   2.1  Sun3/160, SunOS 3.2 [our number]
 3,773   2.4  Pyramid 98X, OSx 3.1, CLE 3.2.0
 4,061   2.6  Celerity 1260-D, 4.2BSD
 4,433   2.8  MASSCOMP MC-5700, 16.7MHz 68020, RTU 3.1*
 5,156   3.3  Intergraph InterPro 32C, SYSV R3 2.0.0, Greenhills*
 6,240   4.0  Ridge 3200, ROS 3.4
 6,329   4.0  IBM RT PC, AIX 2.1, Advanced C*
 6,340   4.0  Gould PN 9080
 6,362   4.0  Sun3/260, 25MHz 68020
 6,423   4.1  VAX 8600, 4.3BSD
 6,440   4.1  IBM 4381-2, UTS V, cc 1.11
 6,500   4.1  IBM RT PC, AIX 2.1, New Models *Dhrystone 1.0* [IBM 87]*
 7,409   4.7  VAX 8600, VAX/VMS in [Intergraph 86]*
 7,810   5.0  Intel 80386, 20MHz, 64K-cache, Green Hills*
 8,300   5.3  DG MV20000-I and MV15000-20 [Stahlman 87]
 8,309   5.3  InterPro-32C, 30MHz Clipper, Green Hills [Intergraph 86]*
 9,600   6.1  HP9000 Model 840 [Stahlman 87, estimate]
10,300   6.6  MIPS M/500, 8MHz R2000, no optimization
10,787   6.9  VAX 8650, VAX/VMS [Intergraph 86]*
12,500   8.0  MIPS M/500, 8MHz R2000, -O*
15,300   9.7  MIPS M/800, 12.5MHz R2000, no optimization
18,500  11.8  MIPS M/800, 12.5MHz R2000, -O*
28,846  18.4  Amdahl 5860, UTS-V, cc1.22
31,250  19.9  IBM 3090/200

4.3.
Stanford Small Integer Benchmarks

John Hennessy, Director of the Computer Systems Laboratory at Stanford
University, has collected a set of programs to compare the performance
of various systems.  It is well known that small benchmarks can be
misleading.  If you see claims that machine X is up to N times a VAX
on some (unspecified) benchmarks, these benchmarks are probably the
sort they're talking about.

              Stanford Small Integer Benchmark Results

            MIPS        MIPS        DEC VAX       Sun-3/       DEC VAX
            M/800       M/500        8600          160M        11/780
Benchmark   Secs  Rel.  Secs  Rel.  Secs  Rel.    Secs  Rel.   Secs  Rel.
Perm        0.13  18.0  0.18  13.0  0.63   3.7    0.72   3.3    2.34  1.0
Towers      0.18  12.8  0.24   9.6  0.63   3.7    1.07   2.2    2.30  1.0
Queen       0.11   8.5  0.15   6.3  0.27   3.5    0.50   1.9    0.94  1.0
Intmm       0.15  11.1  0.23   7.3  0.73   2.3    0.93   1.8    1.67  1.0
Puzzle      0.63  17.8  1.15   9.8  2.96   3.8    5.53   2.0   11.23  1.0
Quick       0.11  10.2  0.17   6.6  0.31   3.6    0.58   1.9    1.12  1.0
Bubble      0.12  12.6  0.19   7.9  0.44   3.4    0.97   1.6    1.51  1.0
Tree        0.23  11.8  0.34   8.0  0.69   3.9    1.05   2.6    2.72  1.0
Aggregate*  0.24  12.8  0.36   8.5  0.86   3.6    1.42   2.2    3.08  1.0
Total+      1.66  14.4  2.65   9.0  6.66   3.6   11.35   2.1   23.83  1.0

* As weighted by the Stanford Benchmark Suite
+ Simple summation of the times for all benchmarks

5. Floating Point Benchmarks

5.1. LINPACK

The LINPACK benchmark has evolved into one of the most widely used
single benchmarks for predicting relative performance in scientific
and engineering environments.  The usual LINPACK benchmark measures
the time required to solve a 100x100 system of linear equations.
LINPACK results are measured in MFlops, millions of floating point
operations per second.

The results below use compiled FORTRAN, often called "FORTRAN BLAS",
for FORTRAN Basic Linear Algebra Subroutines.  Some hand-coded, or
"Coded BLAS", results are shown later, along with results from many
other machines, mostly taken from [Dongarra 87].

         LINPACK Benchmark Results - FORTRAN and Coded BLAS

                 MIPS          MIPS         DEC VAX       Sun-3/       DEC VAX
                 M/800         M/500        8600#         160M&        11/780#
Benchmark        MFlops Rel.   MFlops Rel.  MFlops Rel.   MFlops Rel.  MFlops Rel.
Dbl Precision:
  FORTRAN          .80   5.7     .58   4.1    .49   3.5     .10   0.6    .14   1.0
  Coded BLAS      1.08   6.4     .72   4.2    .66   3.9     .10   0.6    .17   1.0
Sngl Precision:
  FORTRAN         1.46   5.8     .89   3.6    .88   3.5     .11   0.4    .24   1.0
  Coded BLAS      1.90   5.6    1.18   3.5   1.30   3.8     .11   0.3    .34   1.0

# VAX/VMS with FPA [Dongarra 87].  Actually, our LLL Fortran numbers
  were quite close: our in-house 11/780 showed .14 MFlops (DP) and
  .24 MFlops (SP).
& [Dongarra 87] and [Sun 86] give these numbers for a Sun3/160 with
  16.7MHz MC68881, so we used them instead of those for our in-house
  (12.5MHz 68881) system.

Following are some additional numbers for LINPACK MFlops, showing DP
(FORTRAN and Coded BLAS), followed by SP (FORTRAN and Coded BLAS).
Coded BLAS is obtained by hand-coding the inner loops in assembler.
All numbers are from [Dongarra 87], unless otherwise noted.

          Selected LINPACK Results - FORTRAN and Coded BLAS

  DP      DP      SP      SP
Fortran  Coded  Fortran  Coded   System
  .08      -       -       -     IBM RT PC, standard FP [IBM 87]
  .10     .10     .11     .11    Sun-3/160, 16.7MHz [Rolled BLAS]+
  .11     .11     .13     .11    Sun-3/260, 25MHz 68020 + 20MHz 68881 [Rolled BLAS]
  .14      -      .24      -     VAX 11/780, 4.3BSD, LLL Fortran [ours]
  .14     .17     .25     .34    VAX 11/780, VAX/VMS
  .20      -      .24      -     80386+80387, 20MHz, 64K cache, Green Hills
  .20     .23     .40     .51    VAX 11/785, VAX/VMS
  .29     .49     .45     .69    Intergraph IP-32C, 30MHz Clipper [Intergraph 86]
  .30      -       -       -     IBM RT PC, optional FPA [IBM 87]
  .30      -      .39      -     DG MV/10000
  .33      -      .57      -     OPUS 300PM, Greenhills, 30MHz Clipper
  .36     .59     .51     .72    Celerity C1230, 4.2BSD f77
  .38      -      .67      -     80386+Weitek 1167, 20MHz, 64K cache, Green Hills
  .39     .50     .66     .81    Ridge 3200/90
  .41     .41     .62     .62    Sun-3/160, Weitek FPA [Rolled BLAS]
  .42      -      .56      -     HP9000 Series 840 [Stahlman 87]
  .46     .46     .86     .86    Sun-3/260, Weitek FPA [Rolled BLAS]
  .47     .81     .69    1.30    Gould PN9000
  .48      -      .94      -     Harris HCX-7, CCI Power 6/32
  .49     .66     .84    1.20    VAX 8600, VAX/VMS 4.5
  .56     .85     .68     .99    Harris H1200
  .58     .72     .89    1.18    MIPS M/500 [ours]
  .61      -      .84      -     DG MV20000-I, MV15000-20 [Stahlman 87]
  .65     .76     .80     .96    VAX 8500, VAX/VMS
  .70     .96    1.30    1.90    VAX 8650, VAX/VMS
  .78      -     1.10      -     IBM 9370-90, VS FORT 1.3.0
  .80    1.08    1.46    1.90*   MIPS M/800
  .87      -     1.20      -     Concurrent 3280XP
  .97    1.13    1.40    1.70    VAX 8700/8800, VAX/VMS
 1.10    1.40    1.20    1.60    ELXSI, MOD 2 (6420 CPU, new model)
 1.6     2.0     1.6     2.0     Alliant FX-1 (1 CE)
 3.0     3.3     4.3     4.9     CONVEX C-1/XP, Fort 2.0 (Rolled BLAS)
 7.0    11.0     7.6     9.8     Alliant FX-8, 8 CEs, FX Fortran, v2.0.1.9
43.0      -     44.0      -      NEC SX-2, Fortran 77/SX (Rolled BLAS)

+ Sun FORTRAN and coded numbers are the same, since the code cannot be
  improved by hand.
* When the multiply-add overlap bug is fixed, this should rise to
  2.50, according to our simulations.

Note that relative ordering even within families is not particularly
consistent, illustrating the extreme sensitivity of these benchmarks
to memory system design.

5.2. Whetstone

Whetstone instructions are a synthetic mix of floating point and
integer arithmetic, function calls, array indexing, conditional jumps,
and transcendental functions [Curnow 76].  Whetstone results are
measured in KWips, thousands of Whetstone interpreter instructions per
second.

               Whetstone Benchmark Results - FORTRAN

                  MIPS         MIPS        DEC VAX      Sun-3/       DEC VAX
                  M/800        M/500       8600#        160M&        11/780#
Benchmark         KWips Rel.   KWips Rel.  KWips Rel.   KWips Rel.   KWips Rel.
Double Precision  6,900  8.3   4,450  5.4  2,670  3.2     960  1.2     830  1.0
Single Precision  8,900  7.1   5,800  4.6  4,590  3.7     970  0.8   1,250  1.0

# VAX/VMS, with FPA.
& This column is from another Sun-3/160 that had a 16.6MHz 68881.

Following are Whetstone figures gathered from miscellaneous sources:

              Selected Whetstone Benchmark Results

  DP          SP
KWips  Rel.  KWips  Rel.  System
  410   0.5    500   0.4  VAX 11/780, 4.3BSD, f77 [ours]
  700   0.8    810   0.6  IBM RT PC, standard FP [IBM 87]
  715   0.9  1,083   0.9  VAX 11/780, LLL compiler [ours]
  830   1.0      -    -   Symbolics 3650, with FPA [Garren 87]
  830   1.0  1,250   1.0  VAX 11/780, VAX/VMS [Intergraph 86]
  960   1.2  1,040   0.8  Sun3/160, 68881 [Intergraph 86]
1,110   1.3  1,670   1.3  VAX 11/785, VAX/VMS [Intergraph 86]
1,230   1.5  1,250   1.0  Sun3/260, 25MHz 68020, 20MHz 68881
1,400   1.7  1,600   1.3  IBM RT PC, optional FPA [IBM 87]
1,730   2.1  1,860   1.5  Intel 80386+80387, 20MHz, 64K cache, Green Hills
1,740   2.1  2,980   2.4  Intergraph InterPro-32C, 30MHz Clipper [Intergraph 86]
1,860   2.2  2,400   1.9  Sun3/160, Weitek FPA [measured elsewhere]
    -    -   2,900   2.3  DG MV/15000-8
2,590   3.1  4,170   3.3  Intel 80386+Weitek 1167, 20MHz, Green Hills
2,600   3.1  3,400   2.7  Sun3/260, Weitek FPA [measured elsewhere]
    -    -   4,300   3.4  DG MV/15000-10
2,670   3.2  4,590   3.7  VAX 8600, VAX/VMS [Intergraph 86]
    -    -   6,400   5.1  DG MV/15000-12
3,950   4.8  6,670   5.3  VAX 8700, VAX/VMS, Pascal(?) [McInnis 87]
4,000   4.8  6,900   5.5  VAX 8650, VAX/VMS [Intergraph 86]
4,120   5.0  4,930   3.9  Alliant FX/8 (1 CE) [Alliant 86]
4,450   5.4  5,800   4.6  MIPS M/500 [ours]
6,900   8.3  8,900   7.1  MIPS M/800 [ours]

5.3. DoDuc Benchmark

This benchmark [DoDuc 86] is a 5300-line FORTRAN program that
simulates aspects of nuclear reactors, has little vectorizable code,
and is thought to be representative of Monte Carlo simulations.  It
uses mostly double precision floating point, and is often viewed as a
``nasty'' benchmark, i.e., it breaks things.

The performance is given as a number R, normalized to 100 for an IBM
370/168-3 or 170 for an IBM 3033-U [ R = 48671 / (cpu time in
seconds) ], so that larger R's are better.  In order of increasing
performance, following are numbers for various machines.  All are from
[DoDuc 87] unless otherwise specified.

              Selected DoDuc Benchmark Results

DoDuc R  Relative
Factor   Perf.    System
   14      0.5    DEC uVAXII, Ultrix
   16      0.5    Apollo DN580
   17      0.7    Sun3/110, 16.7MHz
   19      0.7    Intel 80386+80387, 16MHz, iRMX
   22      0.8    DEC uVAXII, VAX/VMS
   22      0.8    Sun-3/260, 25MHz 68020, 20MHz 68881+
   26      1.0    VAX 11/780, VMS
   43      1.7    Sun-3/260, 25MHz, Weitek FPA*
   48      1.8    Celerity C1260
   50      1.9    CCI Power 6/32
   53      2.0    Edge 1
   57      2.2    DG MV/10000
   64      2.5    Harris HCX-7
   75      2.9    MIPS M/500, f77 -O2 [runs 655 seconds]
   85      3.3    Alliant FX/1
   90      3.5    IBM 4381-2
   91      3.5    DEC VAX 8600, VAX/VMS
   92      3.5    Gould PN 9050
   93      3.6    MIPS M/500, f77 -O3 [runs 527 seconds]
   97      3.7    ELXSI 6400
   99      3.8    DG MV/20000
  101      3.9    Alliant FX/8
  113      4.3    FPSystems 164
  119      4.6    Gould 32/8750
  129      5.0    DEC VAX 8650
  136      5.2    MIPS M/800, f77 -O2 [runs 358 seconds]
  136      5.2    DEC VAX 8800, VAX/VMS
  146      5.6    MIPS M/800, f77 -O3 [runs 334 seconds]
  150      5.7    Amdahl 470 V8, VM/UTS
  181      7.0    IBM 3081-G, F4H ext, opt=2
  475     18.3    Amdahl 5860
1,080     41.6    Cray X/MP [for perspective: we have a way to go yet!]

5.4. Spice Benchmarks

The Spice program is a large circuit simulator that uses floating
point heavily, but not in the same way as LINPACK.  The numbers below
use Spice 2G.6 (FORTRAN version).  We wish we had some VAX/VMS numbers
for these, but we don't.

                 Spice 2G.6 Benchmark Results

              MIPS         MIPS        Sun-3/      Sun-3/     VAX 11/780
              M/800        M/500        260         160M        4.3BSD
Benchmark     Secs  Rel.   Secs  Rel.  Secs  Rel.  Secs  Rel.   Secs  Rel.
elh1*          3.8   6.9    6.2   3.8    -    -      -    -     26.2   1.0
elh2a*        44.4   7.6   84.0   4.0   81   4.2   123   2.7   337.1   1.0
 "" Sun 68881                          163   2.1   211   1.6
cring+        17.6   7.1   31.3   4.0    -    -      -    -    124.9   1.0
mosamp2#      15.1   7.8   29.1   4.1   31   3.8    45   2.6   118.2   1.0
 "" Sun 68881                           62   1.9    78   1.5
tunnel         1.7  11.5    3.3   5.9    -    -      -    -     19.5   1.0

# The Sun mosamp2 numbers are from [Hough 86,3].  The first row for
  the Suns uses the Weitek FPA, the second uses 68881s.
* These are considered the most representative.  They are simulations
  of real circuits using modern technology.
+ This was proposed as a standard benchmark in the Usenet newsgroup
  comp.lsi.  Mosamp2 and tunnel were included in the original Berkeley
  Spice benchmarks.

5.5.
Stanford Floating Point Benchmarks

The following two benchmarks from the Stanford Benchmark Suite
emphasize floating point performance.  They exemplify the class of
programs that use tight loops and a high proportion of actual FP code,
unlike, for example, LINPACK, which depends as much on the speed of
accessing main memory as on FP performance itself.  These benchmarks
are also quite responsive to good optimizing compiler technology.

          Stanford Floating Point Benchmark Results

                 MIPS        MIPS        DEC VAX      Sun-3/      DEC VAX
                 M/800       M/500       8600#        160M#       11/780
Benchmark        Secs  Rel.  Secs  Rel.  Secs  Rel.   Secs  Rel.  Secs  Rel.
FFT              0.20  19.8  0.34  11.7  1.42   2.8   3.12  1.3   3.97  1.0
Matrix Multiply  0.13  17.8  0.26   8.9  1.37   1.7   2.14  1.1   2.31  1.0
Aggregate*       0.37  14.9  0.59   9.4  1.88   2.9   3.41  1.6   5.52  1.0
Total+           0.33  19.0  0.60  10.5  2.79   2.2   5.17  1.2   6.28  1.0

* As weighted by the Stanford Benchmark Suite (includes some
  contribution from the integer results)
+ Simple summation of the times for both benchmarks
# Performance would be substantially improved with the respective FPA
  boards.  Also, recall that the in-house Sun has only a 12.5MHz
  68881.

Benchmark Descriptions

FFT              Computes a 256-point Fast Fourier Transform (FFT)
                 twenty times.
Matrix Multiply  Multiplies two 40x40 single-precision matrices.

6. System Benchmarks

6.1. Byte Benchmarks

The Byte benchmarks are intended to provide a mixture of user- and
kernel-level actions.  Although they are used frequently, their
user:kernel balance of 30:70 is unlike normal time-sharing use, which
is more like 50:50 or even 70:30.

In the table below, "Tot" is the sum of the "User" and "Sys" numbers
reported by /bin/time, and "Rel" is the performance relative to the
11/780 (4.3BSD).  (You can derive the "User" time by subtracting "Sys"
from "Tot".)

The first group of tests (call with assignment, call with empty
function, loop, and sieve) measures a few aspects of user-level
behavior.
The second group of tests (piping data, system calls, disk write, and
disk read) measures kernel performance on narrowly-focused benchmarks,
which are especially subject to unusual cache behavior.  The third
group shows running 1, ..., 6 shell procedures together, each
containing a sort, an od | sort pipeline, a grep | tee | wc pipeline,
and an rm.  Thus, they mostly measure the efficiency of the fork and
exec system calls, with relatively little user-level computation.
Although not a particularly accurate model of a real mix, this group
is a bit more realistic than the first two groups.

We run these with a makefile that includes 3 each of calla, calle,
loop, sieve, pipes, and scall, and the time shown is the mean of the 3
times.  The others are run once.  We run the makefile 3 times and
average everything, especially since these runs are quite short, and
even a .1 second variation can cause large jumps in relative
performance.

               Byte Benchmarks - Part 1 - BSD Systems

       MIPS M/800      MIPS M/500      DEC VAX 8600   Sun-3/160M     DEC VAX11/780
       UMIPS-BSD2.0B   UMIPS-BSD2.0B   Ultrix 1.2     SunOS 3.2      4.3BSD
Bench  Sys Tot  Rel    Sys  Tot  Rel   Sys  Tot  Rel  Sys  Tot  Rel  Sys  Tot  Rel
calla  0.0 0.0   -     0.0  0.0   -    0.0  0.0   -   0.0  0.0   -   0.0  0.1  1
calle  0.0 0.0   -     0.0  0.0   -    0.0  0.2  5.5  0.0  0.2  5.5  0.0  1.1  1
loop   0.0 0.2 12.0    0.0  0.3  8.0   0.0  0.9  2.7  0.0  1.5  1.6  0.0  2.4  1
sieve  0.0 0.2  7.0    0.0  0.2  7.0   0.0  0.4  3.5  0.0  0.6  2.3  0.0  1.4  1
pipes  0.3 0.3  6.0    0.7  0.7  2.6   0.3  0.3  6.0  1.6  1.6  1.1  1.6  1.8  1
scall  0.6 0.6  6.3    0.7  0.7  5.4   0.8  1.0  3.8  2.0  2.6  1.5  3.2  3.8  1
diskw  0.1 0.1  7.0    0.2  0.2  3.5   0.2  0.2  3.5  0.5  0.5  1.4  0.7  0.7  1
diskr  0.1 0.1  5.0    0.2  0.2  2.5   0.2  0.2  2.5  0.3  0.3  1.7  0.5  0.5  1
sh 1   0.3 0.3  5.7    0.4  0.4  4.1   0.5  0.6  2.8  0.5  0.6  2.8  1.4  1.7  1
sh 2   0.5 0.5  6.4    0.7  0.8  4.0   0.9  1.1  2.9  1.1  1.3  2.5  2.6  3.2  1
sh 3   0.7 0.8  6.0    1.1  1.2  4.0   1.4  1.7  2.8  1.6  1.9  2.5  3.8  4.8  1
sh 4   1.1 1.2  5.3    1.4  1.6  4.0   1.9  2.3  2.8  2.2  2.5  2.6  5.1  6.4  1
sh 5   1.2 1.4  5.8    1.9  2.1  3.9   2.4  2.9  2.8  2.9  3.3  2.5  6.3  8.1  1
sh 6   1.5 1.7  5.8    2.2  2.5  3.9   2.9  3.5  2.8  3.4  3.9  2.5  7.7  9.8  1
Tot*   8.2 9.0  7.4   12.3 14.7  4.6  13.7 20.9  3.2 24.9 33.7  2.0 42.5 67.0  1

* Summation of times as we run them, which includes multiple instances
  of some of the tests, whose average numbers are shown.  Precisely:
  total = 3*(calla + calle + loop + sieve + pipes + scall)
          + diskw + diskr + sh1 + sh2 + sh3 + sh4 + sh5 + sh6

These numbers can be interpreted in many different ways, and are
presented because many people use them, not because we consider them
representative of overall performance.

Following are some additional Byte benchmarks, showing UMIPS-V
performance on the M/500 (and, sooner or later, the M/800), plus a
30MHz Intergraph InterPro-32C [Intergraph 86], with the Sun and VAX
numbers replicated for context.  We don't know why the IP-32C's kernel
performance is as shown, since it seems out of line with other areas
of its performance.  UMIPS-V numbers for the M/800 will be available
soon.

           Byte Benchmarks - Part 2 - Some System V Numbers

       MIPS M/800      MIPS M/500      Intergraph     Sun-3/160M     DEC VAX11/780
       UMIPS-V1.1      UMIPS-V1.1      IP-32C,SVR3    SunOS 3.2      4.3BSD
Bench  Sys Tot  Rel    Sys  Tot  Rel   Sys  Tot  Rel  Sys  Tot  Rel  Sys  Tot  Rel
calla  0.0 0.0   -     0.0  0.0   -    0.0  0.0   -   0.0  0.0   -   0.0  0.1  1
calle  0.0 0.0   -     0.0  0.0   -    0.0  0.1 11.0  0.0  0.2  5.5  0.0  1.1  1
loop   0.0 0.2 12.0    0.0  0.4  7.0   0.0  0.5  4.8  0.0  1.5  1.6  0.0  2.4  1
sieve  0.0 0.2  7.0    0.0  0.2  7.0   0.0  0.4  3.5  0.0  0.6  2.3  0.0  1.4  1
pipes  0.X 0.X  X.X    0.3  0.3  6.0   1.4  1.4  1.3  1.6  1.6  1.1  1.6  1.8  1
scall  0.X 0.X  X.X    0.6  0.6  5.4   1.4  1.5  2.5  2.0  2.6  1.5  3.2  3.8  1
diskw  0.1 0.1  7.0    0.1  0.1  7.0   0.2  0.2  3.5  0.5  0.5  1.4  0.7  0.7  1
diskr  0.1 0.1  5.0    0.1  0.1  5.0   0.3  0.3  1.7  0.3  0.3  1.7  0.5  0.5  1
sh 1   0.X 0.X  X.X    0.2  0.3  5.7   2.4  2.4  0.7  0.5  0.6  2.8  1.4  1.7  1
sh 2   0.X 0.X  X.X    0.5  0.7  4.6   2.7  2.8  1.1  1.1  1.3  2.5  2.6  3.2  1
sh 3   0.X 0.X  X.X    0.7  1.1  4.4   4.0  4.1  1.2  1.6  1.9  2.5  3.8  4.8  1
sh 4   1.X 1.X  X.X    1.0  1.5  4.3   5.8  6.0  1.1  2.2  2.5  2.6  5.1  6.4  1
sh 5   1.X 1.X  X.X    1.2  1.8  4.5   7.2  7.4  1.1  2.9  3.3  2.5  6.3  8.1  1
sh 6   1.X 1.X  X.X    1.3  2.0  4.7   8.3  8.6  1.1  3.4  3.9  2.5  7.7  9.8  1
Tot*   X.X X.X  X.X    7.8 12.4  5.4  34.3 43.5  1.5 24.9 33.7  2.0 42.5 67.0  1

7. Acknowledgements

Many people at MIPS contributed to this document, which was originally
created by Web Augustine.  We thank David Hough of Sun Microsystems,
who kindly supplied numbers for some of the Sun configurations, even
correcting a few of our numbers that were incorrectly high.  We also
thank Cliff Purkiser of Intel, who posted the Intel 80386 Whetstone
and LINPACK numbers on Usenet [Purkiser 87].

8. References

[Alliant 86] Alliant Computer Systems Corp., "FX/Series Product
Summary", October 1986.

[Curnow 76] Curnow, H. J., and Wichman, B. A., ``A Synthetic
Benchmark'', Computing Journal, Vol. 19, No. 1, February 1976, pp.
43-49.

[DoDuc 87] DoDuc, N., FORTRAN Central Processor Time Benchmark,
Framentec, June 1986, Version 13.  Newer numbers were received
03/17/87, and we used them where different.

[Dongarra 87] Dongarra, J. J., ``Performance of Various Computers
Using Standard Linear Equations in a Fortran Environment'', Technical
Memo No. 23, Argonne National Laboratory, March 1987.

[Fleming 86] Fleming, P. J., and Wallace, J. J., ``How Not to Lie
With Statistics: The Correct Way to Summarize Benchmark Results'',
Communications of the ACM, Vol. 29, No. 3, March 1986, pp. 218-221.

[Garren 87] Garren, S., ``Symbolics on Performance'', Symbolics,
Cambridge, MA, January 9, 1987.

[Hough 86,1] Hough, D., ``Weitek 1164/5 Floating Point
Accelerators'', Usenet, January 1986.

[Hough 86,2] Hough, D., ``Benchmarking and the 68020 Cache'', Usenet,
January 1986.

[Hough 86,3] Hough, D., ``Floating-Point Programmer's Guide for the
Sun Workstation'', Sun Microsystems, September 1986.  [An excellent
document, including a good set of references on IEEE floating point,
especially on micros, and good notes on benchmarking hazards.]
Sun3/260 Spice numbers are from later mail.

[IBM 87] IBM, ``IBM RT Personal Computer (RT PC) New Models,
Features, and Software Overview'', February 17, 1987.
[Intergraph 86] Intergraph Corporation, ``Benchmarks for the InterPro
32C'', December 1986.

[McInnis 87] McInnis, D., Kusik, R., and Bhandarkar, D., ``VAX 8800
System Overview'', Proc. IEEE COMPCON, San Francisco, March 1987, pp.
316-321.

[Purkiser 87] Purkiser, C., ``Whetstone and LINPACK Numbers'',
Usenet, March 1987.

[Richardson 87] Richardson, R., ``3/15/87 Dhrystone Benchmark
Results'', Usenet, March 1987.

[Stahlman 87] Stahlman, M., "The Myth of Price/Performance", Sanford
C. Bernstein & Co., Inc., New York, NY, March 17, 1987.

[Sun 86] Sun Microsystems, ``The SUN-3 Family: A Hardware Overview'',
August 1986.

[Weicker 84] Weicker, R. P., ``Dhrystone: A Synthetic Systems
Programming Benchmark'', Communications of the ACM, Vol. 27, No. 10,
October 1984, pp. 1013-1030.
-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   {decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086