mash@mips.UUCP (John Mashey) (10/15/86)
Having seen MIPS get dragged into the 386-68K wars over in net.micro.68k, I
posted a couple MIPS R2000 numbers, and offered to post some more here if
people were interested.  A bunch were, so here are the #s.  This is a
super-condensed excerpt from "MIPS Performance Brief - MIPS M/500 with
UMIPS-BSD 1.0, October 1986", copies of which can be obtained from Sondra
Smith at MIPS ({decvax,ihnp4,ucbvax}!decwrl!mips!sondra).

This has much raw data, with minimal editorializing, so you can draw your
own conclusions.  However, please remember that these systems use a chip
running at half its design speed, whose architecture was spec'd just 18
months ago, using a UNIX(TM) whose source we got 3 months ago, and which is
therefore untuned.  They ARE real systems being shipped to customers in
exactly the state that produced the benchmark numbers below.

----- OVERVIEW -----

This compares results from the widely-used LINPACK, Whetstone, Byte, and
Dhrystone benchmarks as well as the Stanford Benchmark Suite and MIPS UNIX
Benchmark Suite run on various computers in-house at MIPS (DEC
VAX-11/780(TM) and VAX 8600, Sun-3/160M, and MIPS M/500 Development System).
Live results are compared with the simulation numbers published in the
April 24, 1986 edition of the Performance Brief.  Benchmarks that include
floating point show live numbers, using the interim R2360 Floating Point
board, and simulations for the forthcoming R2010 Floating Point Accelerator
chip.  Here is the benchmark summary, followed by the details of
configurations, actual times, etc.

Summary of Benchmark Results
(VAX 11/780 = 1.0; Bigger is Faster)
(We call the 780 a 1Mips box, and an M/500 a 5Mips one).
                                MIPS      MIPS      DEC VAX   Sun-3/    DEC VAX
                                M/500     SIMUL     8600      160M      11/780
                                Relative  Relative  Relative  Relative  Relative
Benchmark                       Perf      Perf      Perf      Perf      Perf

Integer Benchmarks
  MIPS UNIX Total+               +5.4       4.9       3.7       2.0       1.0
  Dhrystone: registers, no opt    7.0       6.6       3.5       2.0       1.0
  Stanford Integer Aggregate      8.3       8.3       3.6       2.2       1.0

Floating Point Benchmarks&
  LINPACK: Double Precision+     &1.5     +&4.8       3.8       0.6       1.0
  Whetstone: Double Precision    &1.6      &6.3      #1.0       1.3       1.0
  Stanford FP Aggregate          &3.9      &9.0      #2.9       1.6       1.0

System Benchmarks
  Byte Total                      4.5         -       3.2       1.8       1.0

+ We consider these benchmarks to be the best indications of real
  performance.
& Note that floating point simulations show the expected performance of the
  R2010, whereas the actual benchmarks show that of the current R2360.
  Also, the current R2360 has a soon-to-be-removed restriction that impacts
  performance by 25-30%.
# These numbers are low because that 8600 has no FPA, as described later.

M/500 performance on realistic, live benchmarks can be summarized as
follows, with any futures shown in ().

-User-level integer performance now lies in the range 5.0-5.5X faster than a
 VAX 11/780 running 4.3BSD UNIX.

-Kernel-intensive performance is in the range of 4.0-4.5X faster.  (After
 some straightforward tuning, this should be in the 4.3-4.7X range.  Overall
 system performance on normal user-kernel mixes (50u/50s to 70u/30s) should
 thus end up at the 5X level, or slightly above.)

-Using the current version of the floating point board, floating-point
 performance is about 1.5X.  (This will improve soon to about 2X, and then
 will go to about 5X when we finish the R2010 FPA chip.)

-----HOW BENCHMARKS WERE DONE-----

The computer systems configurations described below do not necessarily
reflect optimal configurations, but rather the in-house systems to which we
had repeatable access.  We wish we had an FPA for the VAX 8600, which would
give the 8600 substantially better performance on floating point.
VAX 11/780:  8MB; FPA; 4.3BSD; LLL Fortran for Whetstone & LINPACK
VAX 8600:    20MB; no FPA; Ultrix V1.2; LLL Fortran for Whetstone & LINPACK
SUN 3/160M:  8MB; 16.6MHz 68020 + 12.5MHz 68881; 4.2BSD, release 3.0
MIPS M/500:  8MB; R2300 CPU board (8MHz R2000, 16K I-cache, 8K D-cache);
             R2360 interim Floating Point (Weitek 1164/65); 4.3BSD

Simulated numbers are also given (MIPS SIMUL columns in tables) for the MIPS
R2010 FPA chip, which is much faster than the R2360.  Cycle counts are:

FP Type   Add SP   Add DP   Mult SP   Mult DP   Div SP   Div DP
R2360       10       12       10        14        35       66
R2010        2        2        4         5        11       18

I.e., the R2010 DP Add would be 250ns @ 8MHz or 125ns @ 16MHz; although I
believe in peak MegaFLOPS about as much as I believe in peak MIPS (i.e., not
much!), people might call this a 4MFLOPS machine @ 8MHz, or 8MFLOPS @ 16MHz.
(Actually, it's faster than that, since it has multiple functional units and
can overlap operations).

Systems were tested in a normal multi-user development environment (for
fairness to the 11/780 and 8600, whose daemons are hard to exorcise!), with
load factor <0.2 (as measured by uptime).  Benchmarks were run 3 times and
averaged.

How to Interpret the Numbers

The tables include a "MIPS M/500" column and a "MIPS SIMUL" column.  The
numbers are computed as follows.  Times (or rates, such as for Dhrystones,
Whetstones, and LINPACK KFlops) are shown for the VAX 11/780.  Other
machines' times or rates are shown, and the relative performance ("Rel."
column) is normalized to the 11/780, which is treated as 1.0.

The SIMUL times/rates shown are those we got doing simulations 6-8 months
ago.  We normalize those times/rates to the same 11/780 4.3BSD used
throughout this report.  Note that the April 24, 1986 Performance Brief
shows the same times/rates as the SIMUL column, but often shows a different
relative performance number.  Not only has our software changed, but our
11/780 went from 4.2BSD to 4.3BSD, so the base for normalization changed
also.
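For anyone who wants to check the arithmetic, the normalization and the
cycle-count conversions described above can be sketched in a few lines of
Python.  This is only an illustration, not the scripts used to produce the
Brief; the function names are made up for this sketch, and the sample
figures are copied from the tables in this posting.

```python
# Illustrative sketch only; all function names here are hypothetical.

def relative_from_times(base_secs, secs):
    """Rel. for time-based results: smaller time is faster."""
    return base_secs / secs

def relative_from_rates(base_rate, rate):
    """Rel. for rate-based results (Dhrystones/sec, KFlops, KWips)."""
    return rate / base_rate

def op_latency_ns(cycles, clock_mhz):
    """Latency of one operation in nanoseconds at the given clock."""
    return cycles * 1000.0 / clock_mhz

def peak_mflops(cycles, clock_mhz):
    """Naive peak rate: one operation every `cycles` clocks, no overlap."""
    return clock_mhz / cycles

if __name__ == "__main__":
    # nroff: 18.8 secs on the 11/780, 3.5 secs on the M/500
    print(round(relative_from_times(18.8, 3.5), 1))      # 5.4
    # Dhrystone, registers: 1,474/sec on the 11/780, 10,309/sec on the M/500
    print(round(relative_from_rates(1474, 10309), 1))    # 7.0
    # R2010 DP add is 2 cycles
    print(op_latency_ns(2, 8), op_latency_ns(2, 16))     # 250.0 125.0
    print(peak_mflops(2, 8), peak_mflops(2, 16))         # 4.0 8.0
```

Note that times divide base-by-machine while rates divide machine-by-base,
so that bigger is faster in both cases.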
The M/500 benchmark numbers use production release 1.0 of the MIPS compilers
and UMIPS-BSD 1.0.  The latter is a 4.3BSD UNIX port, compiled at
optimization level -O2, but otherwise untuned.

All benchmarks were compiled -O, i.e., with optimization.  UMIPS compilers
call this level -O2, and it includes serious global optimization.

----- DETAILED BENCHMARK DATA -----

The MIPS UNIX Benchmarks described below are fairly typical of nontrivial
UNIX programs.  This benchmark suite provides the opportunity to execute the
same code across several different machines.  User time is shown; kernel
time is typically 10-15% of the user time (on the 780).

MIPS UNIX Benchmarks Results

           MIPS         MIPS         DEC VAX      Sun-3/       DEC VAX
           M/500        SIMUL        8600         160M         11/780
Benchmark  Secs  Rel.   Secs  Rel.   Secs  Rel.   Secs  Rel.   Secs  Rel.
bm          0.2   6.0    0.2   6.0    0.3   4.0    0.6   2.0    1.2   1.0
grep        1.0   4.7    1.0   4.7    1.2   3.9    2.4   2.0    4.7   1.0
diff        0.4   6.7    0.5   5.4    0.9   3.0    1.7   1.6    2.7   1.0
yacc        0.6   5.7    0.7   4.9    0.9   3.8    1.7   2.0    3.4   1.0
nroff       3.5   5.4    3.9   4.8    5.0   3.8    9.0   2.1   18.8   1.0
Total +     5.7   5.4    6.3   4.9    8.3   3.7   15.4   2.0   30.8   1.0

+ Sum of the time for all benchmarks.  "Total Rel." is ratio of the totals.

Dhrystone Benchmark Results, version 1.1

               MIPS         MIPS         DEC VAX      Sun-3/       DEC VAX
               M/500        SIMUL        8600         160M         11/780
               Dhry's       Dhry's       Dhry's       Dhry's       Dhry's
Benchmark       /Sec  Rel.   /Sec  Rel.   /Sec  Rel.   /Sec  Rel.   /Sec  Rel.
no Registers   8,855   6.1      -     -      -     -  2,730   1.9  1,442   1.0
Registers     10,309   7.0  9,510   6.6  5,130   3.5  2,954   2.0  1,474   1.0
-O, no Regs   12,323   7.9      -     -      -     -  2,945   1.9  1,559   1.0
-O, Registers 12,329   7.8 11,280   7.2      -     -  3,243   2.1  1,571   1.0

Although this commonly-used benchmark claims that the M/500 is 7-8X faster
than the 11/780, we believe that this benchmark is not necessarily
representative.

Stanford Small Integer Benchmarks

These benchmarks are included primarily for perspective.  It is well known
that small benchmarks can be misleading.  We would not claim that the R2300
performance is 8-9 times that of a VAX-11/780 based on this one benchmark.
Stanford Small Integer Benchmark Results

             MIPS         MIPS         DEC VAX      Sun-3/       DEC VAX
             M/500        SIMUL        8600         160M         11/780
Benchmark    Secs  Rel.   Secs  Rel.   Secs  Rel.   Secs  Rel.   Secs  Rel.
Perm         0.21  11.1   0.21  11.1   0.63   3.7   0.72   3.3   2.34   1.0
Towers       0.25   9.2   0.27   8.5   0.63   3.7   1.07   2.2   2.30   1.0
Queen        0.14   6.7   0.14   6.7   0.27   3.5   0.50   1.9   0.94   1.0
Intmm        0.23   7.3   0.21   8.0   0.73   2.3   0.93   1.8   1.67   1.0
Puzzle       1.13   9.9   0.94  11.9   2.96   3.8   5.53   2.0  11.23   1.0
Quick        0.17   6.6   0.17   6.6   0.31   3.6   0.58   1.9   1.12   1.0
Bubble       0.19   7.9   0.19   7.9   0.44   3.4   0.97   1.6   1.51   1.0
Tree         0.36   7.6   0.31   8.8   0.69   3.9   1.05   2.6   2.72   1.0
Aggregate *  0.37   8.3   0.37   8.3   0.86   3.6   1.42   2.2   3.08   1.0
Total +      2.68   8.9   2.43   9.8   6.66   3.6  11.35   2.1  23.83   1.0

* As weighted by the Stanford Benchmark Suite
+ Simple summation of the times for all benchmarks

Floating Point Benchmarks

LINPACK Benchmark Results

             MIPS         MIPS         DEC VAX      Sun-3/       DEC VAX
             M/500@       SIMUL*       8600#        160M&        11/780
Benchmark   KFlops Rel.  KFlops Rel.  KFlops Rel.  KFlops Rel.  KFlops Rel.
Dbl Prec       193  1.5     625  4.8     490  3.8      80  0.6     130  1.0
Sngl Prec      299  1.2     890  3.7     610  2.5      85  0.4     240  1.0

@ Recall that this uses the R2360 FP board, while SIMUL uses R2010.
* Projection based on inner loop calculations, including cache effects,
  using R2010 FPA.
# VAX 8600 (apparently) with FPA [Dongarra 85]
  133 (DP) was measured on the in-house 8600 without FPA
& [Sun 86]; Sun quotes performance with 68881 as 101 (DP), 108 (SP), and
  with FPA board as 405 (DP), 625 (SP).

Whetstone Benchmark Results

                  MIPS        MIPS        DEC VAX     Sun-3/      DEC VAX
                  M/500       SIMUL*      8600 #      160M &      11/780 #
Benchmark         KWips Rel.  KWips Rel.  KWips Rel.  KWips Rel.  KWips Rel.
Double Precision  1,164  1.6  4,500  6.3    680  1.0    930  1.3    715  1.0
Single Precision  1,808  1.7  6,200  5.7  1,720  1.6  1,030  1.0  1,083  1.0

* Simulation with R2010 FPA.
# UNIX 4.3 BSD, Lawrence Livermore FORTRAN compiler results.
  This compiler produces results more consistent with the VMS FORTRAN
  compiler, which is known to have substantially higher performance than
  the f77 results measured at MIPS.
  Remember 8600 is lacking FPA.
& Note that the in-house Sun-3/160M benchmarked better than Sun's published
  data: 790 (DP), 860 (SP) [Hough 86,1]
  Sun quotes performance with FPA board as 1,700 (DP), 2,300 (SP)
  [Hough 86,1]

Stanford Floating Point Benchmark Results

                 MIPS        MIPS        DEC VAX     Sun-3/      DEC VAX
                 M/500       SIMUL@      8600 #      160M #      11/780
Benchmark        Secs  Rel.  Secs  Rel.  Secs  Rel.  Secs  Rel.  Secs  Rel.
FFT              1.02   3.9  0.44   9.0  1.42   2.8  3.12   1.3  3.97   1.0
Matrix Multiply  1.74   1.3  0.26   8.9  1.37   1.7  2.14   1.1  2.31   1.0
Aggregate *      1.43   3.9  0.61   9.0  1.88   2.9  3.43   1.6  5.52   1.0
Total +          2.76   2.3  0.70   9.0  2.79   2.2  5.26   1.2  6.28   1.0

@ Simulation with R2010 FPA.
* As weighted by the Stanford Benchmark Suite (includes some contribution
  from the integer results)
+ Simple summation of the times for both benchmarks
# Performance would be substantially improved with the respective FPA
  boards.

System Benchmarks

Byte Benchmarks

The Byte tests are intended to provide a mixture of user and kernel-level
actions.  Although used frequently, their user:kernel balance of 30u:70s is
unlike normal time-sharing use, which is more like 50u:50s or 70u:30s.

In the table below, "Usr" and "Sys" are the CPU times reported by /bin/time,
and "Tot" is the sum of those two.  "Rel" is the speed relative to the
11/780.  We run these with a makefile that includes 3 each of calla, calle,
loop, sieve, pipes, and scall, and the time shown is the mean of the 3
times.  The others are run once.  We run the makefile 3 times, and average
everything, especially since these runs are quite short, and even .1 second
variation can cause large jumps in relative performance.
Byte Benchmarks

        MIPS               DEC VAX            Sun-3/             DEC VAX
        M/500              8600               160M               11/780
Bench   Usr  Sys  Tot Rel  Usr  Sys  Tot Rel  Usr  Sys  Tot Rel  Usr  Sys  Tot Rel
calla   0.0  0.0  0.0   -  0.0  0.0  0.0   -  0.0  0.0  0.0   -  0.1  0.0  0.1 1.0
calle   0.0  0.0  0.0   -  0.2  0.0  0.2 5.5  0.2  0.0  0.2 5.5  1.1  0.0  1.1 1.0
loop    0.3  0.0  0.3 8.0  0.9  0.0  0.9 2.7  1.5  0.0  1.5 1.6  2.4  0.0  2.4 1.0
sieve   0.2  0.0  0.2 7.0  0.4  0.0  0.4 3.5  0.6  0.0  0.6 2.3  1.4  0.0  1.4 1.0
pipes   0.0  0.7  0.7 2.6  0.0  0.3  0.3 6.0  0.0  1.3  1.3 1.4  0.2  1.6  1.8 1.0
scall   0.1  0.7  0.8 4.8  0.2  0.8  1.0 3.8  0.5  2.0  2.5 1.5  0.6  3.2  3.8 1.0
diskw   0.0  0.2  0.2 3.5  0.0  0.2  0.2 3.5  0.0  0.5  0.5 1.4  0.0  0.7  0.7 1.0
diskr   0.0  0.2  0.2 2.5  0.0  0.2  0.2 2.5  0.0  0.3  0.3 1.7  0.0  0.5  0.5 1.0
sh 1    0.0  0.4  0.4 4.1  0.1  0.5  0.6 2.8  0.1  0.8  0.9 1.9  0.3  1.4  1.7 1.0
sh 2    0.1  0.7  0.8 4.0  0.2  0.9  1.1 2.9  0.2  1.5  1.7 1.9  0.6  2.6  3.2 1.0
sh 3    0.1  1.1  1.2 4.0  0.3  1.4  1.7 2.8  0.2  2.3  2.5 1.9  1.0  3.8  4.8 1.0
sh 4    0.2  1.5  1.7 3.8  0.4  1.9  2.3 2.8  0.4  3.1  3.5 1.8  1.3  5.1  6.4 1.0
sh 5    0.3  1.6  1.9 4.3  0.5  2.4  2.9 2.8  0.4  4.1  4.5 1.8  1.8  6.3  8.1 1.0
sh 6    0.4  2.0  2.4 4.1  0.6  2.9  3.5 2.8  0.5  5.0  5.5 1.8  2.1  7.7  9.8 1.0
Total*  2.9 11.9 14.8 4.5  7.2 13.7 20.9 3.2  9.6 27.5 38.1 1.8 24.5 42.5 67.0 1.0

* Summation of times as we run them, which includes multiple instances of
  some of the tests, whose average numbers are shown.  Precisely:
  total = 3 * (calla + calle + loop + pipes + scall + sieve)
          + diskw + diskr + sh1 + sh2 + sh3 + sh4 + sh5 + sh6

What seems clear is that, even if this benchmark were considered typical (it
isn't), the other 3 machines have not yet caught up with the tuning done on
the 11/780, i.e., their performance ratios before the 4.2->4.3 11/780
conversion were more in line with the commonly-viewed models of 5 (M/500),
4.2 (8600), and 2.0 (Sun).  We'd assume the 780 has an "unfair" advantage
right now, since the 8600 and Sun are probably tuned, but don't have some of
the 4.3 improvements, and the M/500 has 4.3, but hasn't been otherwise
tuned.
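The Byte "Total" formula given under the table can be replayed against the
"Tot" columns.  The following Python is a rough sketch for checking the
arithmetic, not anything from the Brief: the names REPEATED, SINGLE and
byte_total are invented here, and the dictionaries simply copy the M/500 and
11/780 "Tot" values from the table.  Since the table entries are themselves
rounded averages, a column need not reproduce its printed total exactly;
these two happen to.

```python
# Illustrative sketch only; names are hypothetical, data is from the table.

REPEATED = ("calla", "calle", "loop", "pipes", "scall", "sieve")  # run 3x
SINGLE = ("diskw", "diskr", "sh 1", "sh 2", "sh 3", "sh 4", "sh 5", "sh 6")

def byte_total(tot):
    """total = 3 * (repeated tests) + (tests that are run once)."""
    return 3 * sum(tot[n] for n in REPEATED) + sum(tot[n] for n in SINGLE)

# "Tot" column for the VAX 11/780
VAX_780 = {"calla": 0.1, "calle": 1.1, "loop": 2.4, "sieve": 1.4,
           "pipes": 1.8, "scall": 3.8, "diskw": 0.7, "diskr": 0.5,
           "sh 1": 1.7, "sh 2": 3.2, "sh 3": 4.8, "sh 4": 6.4,
           "sh 5": 8.1, "sh 6": 9.8}

# "Tot" column for the MIPS M/500
M500 = {"calla": 0.0, "calle": 0.0, "loop": 0.3, "sieve": 0.2,
        "pipes": 0.7, "scall": 0.8, "diskw": 0.2, "diskr": 0.2,
        "sh 1": 0.4, "sh 2": 0.8, "sh 3": 1.2, "sh 4": 1.7,
        "sh 5": 1.9, "sh 6": 2.4}

if __name__ == "__main__":
    base = byte_total(VAX_780)
    m500 = byte_total(M500)
    print(round(base, 1))         # 67.0
    print(round(m500, 1))         # 14.8
    print(round(base / m500, 1))  # 4.5, the "Byte Total" Rel for the M/500
```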
------ SUMMARY -------

Those are the numbers, folks.  The Performance Brief has more detail and
background, caveats, conjectures on what the numbers mean, references,
detailed comparisons with past simulations, etc., but you have the meat of
it here.

DEC & VAX are trademarks of Digital Equipment Corporation.
UNIX is a Registered Trademark of AT&T.
--
-john mashey  DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086