raveling@isi.edu (Paul Raveling) (07/14/90)
Last week I ran some benchmarks with interesting results for evaluating some combinations of workstations, operating systems, and C compilers.  Following the formfeed below is a rather long report on the results.  The two systems being most seriously compared were an HP 9000/370 and a Sun 4, but various results are included for a Sun 3 and a VAX 8650.

Beyond some obvious conclusions about which {hardware/OS/compiler} is fastest in various circumstances, one result that I find interesting supports an old hypothesis of mine in the area of OS theory.  This hypothesis is essentially that context switch overhead is the principal determinant of OS performance in the presence of a typical multi-process workload.

Please note that this is cross-posted to several newsgroups that may have an interest in the machines, OS's, and compilers that were compared.  It would be appropriate to edit the Newsgroups line in any followups.  Also, please be aware that I don't subscribe to most of these newsgroups; the best way to get a question to me would be by email or by a followup to comp.sys.hp.

----------------
Paul Raveling
Raveling@isi.edu

Last week I ran two suites of benchmarks to compare various combinations of workstation hardware, operating systems, and C compilers.  Emphasis was on:

    **  HP 9000/370 vs Sun 4 vs Sun 3
    **  HP-UX vs BSD
    **  Native C compilers versus gcc

One suite was the small collection that I've been using for a couple of years; the other is the BYTE UNIX benchmarks published recently on comp.sources.unix.

Some Conclusions
----------------

--  Comparisons between HP-UX and BSD on HP 9000/370's indicate that BSD is generally much faster.  The main differences are in speed of context switching and i/o.

--  Context switch overhead is probably a key determinant of overall system performance.
    The HP-UX/BSD comparison shows strong similarity between relative speed ratios for BYTE's system loading test and context switch benchmarks; the same correlation does not apply well to other low level benchmarks.

--  The C compiler that produced the fastest code at maximum optimization was the vendor's C compiler on both the HP 9000/370 and the Sun 4.  However, gcc may produce faster floating point code on the Sun 4.

--  Processor speed tests show that the HP 9000/370 and Sun 4 are about equally matched, except in two areas:  The Sun 4 is faster in floating point and recursion.

--  BYTE's I/O throughput tests showed that the Sun 4 was surprisingly slow.  Both the HP and a Sun 3 were faster.

Measured Results
----------------

All results that follow are expressed as relative speed ratios based on some measured quantity:  User process time, system time, real time, or i/o rates.

    1    is assigned to the fastest measured result.
    n    means "n times slower than fastest"; "n" is expressed to 2 fractional digits (e.g. "1.23").

I.e., the lower the speed ratio, the faster the performance.

In a few cases two or more different machines/systems/compilers produced a dead tie for the fastest measured result.  In this case both show "1" as their relative speed ratio.  "1.00" indicates a speed very slightly slower than the fastest, for which the ratio rounds to 1.00.

1.  Best optimizing compiler:

    On HP 9000/370's it was HP-UX's compiler.
    On Sun 4's it was Sun's, except that gcc was better in BYTE's floating point math tests.

    Measured results were user process time, and on benchmarks marked with "(r)", the "{dhry/whet}stones/second" rating reported by the benchmark.
    Compilers on the HP were:

        "HP-UX cc":  Native compiler from HP-UX 6.5
        "gcc":       gcc 1.37.1
        "BSD cc":    gcc 1.34, as supplied by Utah for BSD

    Compilers on the Sun 4 were:

        "Sun cc":    Native compiler from SunOs 4.0.3
        "gcc":       gcc 1.37.1

                  HP 9000/370                   Sun 4
Benchmark         HP-UX cc  gcc       BSD cc    Sun cc    gcc
---------         --------  ---       ------    ------    ---
dhrystone         1         1.26      1.19      1         1.17
dhrystone(r)      1         1.25      1.43      1         1.17
whetstone         1         1.15      1.10      1.01      1
whetstone(r)      1         1.16      1.07      1.01      1
tak               1         2.10      2.06      1         1.24
dhrystone2a(r)    1         1.15      1.41      1         1.76
dhrystone2b(r)    1         1.14      1.39      1         1.79
arithoh           1         2.18      1.76      1         1
register          1.01      1.02      1         1         10.51
short             1.12      1         1.00      1.00      1
int               1.01      1.03      1         1         1.03
long              1.01      1.03      1         1         1.02
float             1.07      1         1.60      2.42      1
double            1         1.10      1.04      1.14      1
tower of hanoi    1         1.83      1.83      1         1

2.  Relative processing [hardware] speeds:

    These results also are based on user process time.  For the HP and Sun 4, the measurements used are those for whichever compiler's executable was fastest.  Only the installed "cc" was used on the Sun 3 and the VAX.

    This doesn't precisely show relative hardware speed because it's at the mercy of the available C compilers.

Benchmark         HP 9K/370  Sun 4   Sun 3   VAX 8650
---------         ---------  -----   -----   --------
dhrystone         1.18       1       5.02    2.44
dhrystone(r)      1.18       1       5.03    2.45
whetstone         1          1.01    23.65   2.62
whetstone(r)      1          1       24.34   3.77
tak               1.88       1       3.91    3.47
dhrystone2a(r)    1.13       1       3.73    2.08
dhrystone2b(r)    1.12       1       3.74    2.16
arithoh           1          1.18    4.12    (Test failed on VAX)
register          1.38       1.40    2.39    1
short             1          1.40    2.01    1.16
int               1.26       1.50    2.17    1
long              1          1.20    1.73    1.37
float             2.23       1       67.62   4.91
double            1.54       1       39.15   2.94
tower of hanoi    1.50       1       4.25    (Test failed on VAX)

    See item 4 below for a comparison of relative i/o speeds.  These would be largely dependent on hardware, but as item 3 shows, choice of operating system is also significant.

3.  Relative operating system speeds:

    Direct comparison of HP-UX 6.5 and BSD 4.3 on identical HP 9000/370's.
    Tests included 3 types of benchmarks:

        --  Low level processor-intensive tests
        --  Low level i/o-intensive tests
        --  High level tests of a simulated workload

    Low level processor-intensive tests:

                          System Time       Real Time
                          :::::::::::       :::::::::
Benchmark                 HP-UX   BSD       HP-UX   BSD
---------                 -----   ---       -----   ---
pt [context switch]       2.08    1         2.21    1
iocall                    1       1.15      1       1.13
system call overhead      1       1.19      1       1.19
pipe throughput           1.33    1         1.28    1
pipe-based context sw.    2.74    1         2.15    1
process creation          1.33    1         1.15    1
execl throughput          1.45    1         1       1.28

    Low level i/o-intensive tests:
    Filesystem throughput, based on reported KBytes/second.

Test Time   System   Read    Write   Copy
---------   ------   ----    -----   ----
1 sec       HP-UX    1       1.17    1.27
            BSD      1.08    1       1
10 sec      HP-UX    1.27    1.29    1.91
            BSD      1       1       1
20 sec      HP-UX    1.48    1.48    1.67
            BSD      1       1       1

    High level tests of a simulated workload:
    Bourne shell script and UNIX utilities

Concurrent
Background                       ........Time........
Processes    System & Compiler   User    System  Real
---------    -----------------   ----    ------  ----
1            HP-UX cc            1.04    2.09    2.02
             HP-UX gcc           1       2.22    1.98
             BSD cc              1.29    1       1
2            HP-UX cc            1       2.22    2.15
             HP-UX gcc           1.02    2.28    2.22
             BSD cc              1.36    1       1
4            HP-UX cc            1.03    2.21    2.43
             HP-UX gcc           1       2.20    2.12
             BSD cc              1.32    1       1
8            HP-UX cc            1       2.29    2.16
             HP-UX gcc           1.01    2.30    1.73
             BSD cc              1.30    1       1

4.  Net relative OS-related system speeds, comparing all tested combinations of hardware, OS's, and C compilers.

    Comparisons in the immediately following table are based on measured real time, except for the "n-sec" i/o benchmarks.

                   HP 9000/370             Sun 4           Sun 3   VAX
                   :::::::::::::::::::::   :::::::::::::   :::::   :::
                   HP-UX           BSD     SunOS           SunOS   BSD
                   :::::::::::::   :::     :::::::::::::   :::::   :::
Benchmark          cc      gcc     cc      cc      gcc     cc      cc
---------          --      ---     --      --      ---     --      --
pt                 2.39    2.45    1.08    1       1.10    2.29    1.52
iocall             2.08    2.34    2.34    1.04    1       3.92    2.27
sys call ovhd      1.12    1.20    1.33    1       1.02    3.61    1.20
pipe th'put        2.09    3.08    1.63    1.03    1       3.36    1.44
context sw.        2.15    3.69    1       1       1       2.28    1
process creat'n    1.37    1.54    1.19    3.46    3.40    7.05    1
execl th'put       1.01    1.03    1.29    1.84    1.79    3.51    1
1-sec read         1.03    1.10    1.15    1.26    1.31    1       [0.17]
1-sec write        1.14    1.21    1       1.40    1.45    1.10    [0.08]
1-sec copy         1.43    1.15    1       1.31    1.31    1.17    [0.36]
10-sec read        1.29    1.25    1       2.50    2.25    1.50    [0.13]
10-sec write       1.29    1.29    1       2.50    2.25    1.50    [0.11]
10-sec copy        1.87    1.95    1       3.07    2.26    1.65    [0.25]
20-sec read        1.44    1.53    1       2.55    2.55    1.53    [0.14]
20-sec write       1.44    1.53    1       2.55    2.55    1.53    [0.12]
20-sec copy        1.64    1.71    1       2.12    2.25    1.33    [0.27]
sh+ut load(1)      2.02    1.98    1       1.72    1.51    1.64    1.33
sh+ut load(2)      2.15    2.22    1       5.20    1.99    1.98    1.29
sh+ut load(4)      2.43    2.12    1       1.65    1.75    1.99    1.33
sh+ut load(8)      2.16    1.73    1       1.49    1.47    1.78    1.12

Hardware Configurations
-----------------------

    HP 9000/370:  24 MB RAM, 68881 floating point (no FPA)
                  I/O via NFS mounts to another HP 9K/370 on local ethernet

    Sun 4         24 MB RAM, programs loaded from local disk,
                  other I/O via NFS mounts to VAX 8650

    Sun 3/80      8 MB RAM, programs loaded from local disk,
                  other I/O via NFS mounts to VAX 8650

    VAX 8650      20 MB RAM, I/O to local disk

Notes
-----

1.  All tests were run at least 3 times, and the BYTE benchmarks ran many tests 6 times.  The results reported are mean values for all trials.

2.  Measurements based on real time should be treated with a bit of suspicion, particularly on the VAX, which supports a substantial amount of activity in both user jobs and NFS i/o.  The BYTE benchmarks reported 95 interactive users when they started on the VAX.

    **  A notable case is that variance was unusually high for the "whetstones per second" rate reported on the VAX.  However, user process times reported for the same tests were much more consistent.

    The workstations should be fairly safe from loading by local processes, but their i/o speeds are vulnerable to loading on the local ethernet and file servers.

    **  And yes, gcc-generated code WAS slower by an order of magnitude on the Sun 4 "register" benchmark.
        This is so blatantly odd that I repeated both compiling and running this test to be sure the numbers were correct and consistent.

3.  The VAX's I/O was MUCH faster than the workstations', sometimes by up to an order of magnitude.  This may be partly due to use of only local disks rather than NFS-mounted file systems on the VAX.  However, older benchmarks had also suggested that workstations using local disks still offered much less data bandwidth than the VAX.

    In order to provide a meaningful comparison among workstations for i/o, performance ratio "1" was assigned to the fastest workstation.  This is why the VAX's performance is fractional.

    **  A particularly interesting result was that i/o to/from an NFS-mounted file system was slower on the Sun 4 than on the Sun 3.  Both machines were using the same file system on the same server.
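
For anyone who wants to reduce their own measurements the same way, here is a minimal sketch of the relative speed ratio convention described under "Measured Results" and in note 3.  It is not the script used for this report.  The function name speed_ratios, its parameters, and all the sample numbers are made up for illustration.  It assumes times are "lower is better" and i/o rates are "higher is better", and it lets the baseline be restricted to a subset of systems, which is how a system outside the baseline can end up with a fractional ratio.

```python
def speed_ratios(measured, higher_is_better=False, baseline=None):
    """Turn raw measurements into relative speed ratio strings.

    measured         -- dict of {system name: measured value}
    higher_is_better -- True for rates (e.g. KBytes/second), False for times
    baseline         -- optional list of names eligible to be "fastest";
                        defaults to every system in `measured`
    """
    names = list(baseline) if baseline is not None else list(measured)
    values = [measured[name] for name in names]
    best = max(values) if higher_is_better else min(values)

    ratios = {}
    for name, value in measured.items():
        if value == best:
            # Fastest result, or a dead tie with it: plain "1".
            ratios[name] = "1"
        else:
            # "n times slower than fastest", to 2 fractional digits.
            ratio = best / value if higher_is_better else value / best
            ratios[name] = f"{ratio:.2f}"
    return ratios


# Hypothetical user times in seconds (lower is better).
print(speed_ratios({"A": 10.0, "B": 12.3, "C": 10.001}))

# Hypothetical i/o rates in KBytes/second (higher is better), with only
# the two workstations eligible to be the baseline.
print(speed_ratios({"ws1": 200.0, "ws2": 100.0, "server": 1000.0},
                   higher_is_better=True, baseline=["ws1", "ws2"]))
```

In the first call "C" comes out as "1.00" rather than "1" because it is very slightly slower than "A".  In the second call "server" comes out below 1 because it is faster than the fastest system in the baseline.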