dan@rna.UUCP (Dan Ts'o) (07/11/84)
Hi,

Does anyone have any benchmark results of C programs on a commercially available 16000 UNIX system? Can you mail me the results or post them? What do they show in relation to 68000's and VAX's?

The only benchmarks I've seen were done at USENIX meetings a year or so ago on machines such as the LMC. They showed rather miserable performance compared with 68000 offerings; results were only slightly better than 8088's and LSI11/23's. Granted, the chips were running at 6MHz (I think). Yet all the ads you see seem to suggest that a 16k at 10MHz should be on par with 68000's at 10MHz.

So I'm looking for documentation - benchmarks done on real 16k UNIX systems - got any? Preferably with C source so I can run the same benchmarks on other systems to compare.

Thanks.

Cheers,
Dan Ts'o
Dept. Neurobiology
Rockefeller Univ.
1230 York Ave.
NY, NY 10021
212-570-7671
...cmcl2!rna!dan
beaucham@uiucuxc.UUCP (07/27/84)
#R:rna:-27100:uiucuxc:25800010:000:1820
uiucuxc!beaucham Jul 26 22:18:00 1984

We are about to buy an LMC after months of soul searching. While not the fastest machine in the world, it does very well on fairly long floating-point-intensive programs, particularly in C, but also in F77, and F.P. was our most important requirement. (There is an unfortunate initial overhead with F77 -- the entire library is loaded whether you need it or not!)

We have benchmarked it against the Dual 83/80, the Integrated Solutions 5/10, the PDP11/34, the IBM CS9000, and the VAX 11/780, both for compile and execution times on three C and four F77 programs. (The Dual and I.S. machines use the 68000 and 68010, respectively, with no FPU in our tests; also, the VAX had no FPA.)

The results show that the LMC is about 9 times slower than the VAX on compiles, with a variation from 6 to 12, and 6 times slower than the VAX on executions, varying from 2.3 to 12. However, the poor executions were for short F77 programs. For two FPU-intensive C jobs and one long FPU-intensive F77 job the LMC averaged only 3 times slower than the VAX, and we are talking about a $22,500 machine! ($16,875 with educational discount.)

These benchmarks were done under the Unity operating system with the 6 MHz clock. Switching to the Genix operating system soon is supposed to double system speed, and an increase of clock speed to 8 MHz soon is supposed to improve performance by much more than a linear increase. Also, they have 9-track tape working and will soon have an intelligent RS232 interface.

While we were disappointed with the LMC compared to the 68k machines in terms of the edit/compile/ex debug cycle for short programs, our need for good F.P. for the buck for long programs was the deciding factor in our choosing the LMC. If anyone is interested in the benchmark details, I can provide those too.
mike@hcrvax.UUCP (Mike Tilson) (08/02/84)
Some recent articles have benchmarked the UNITY system on the LMC hardware. In the light of those benchmarks, I thought that readers of this newsgroup would be interested in knowing what work will be completed on UNITY in the near future. (One should also keep in mind that LMC will be making hardware upgrades in the near future, for example the upgrading of the clock rate on the processor card.)

The current UNITY system works well on the National 32000 series hardware, and it has been adapted to quite a number of boxes (over a dozen of them). Our focus to date has been to make the system reliable and configurable to a wide variety of hardware. We now plan certain performance improvements by Q4 of this year, if not earlier. The improvements are running internally.

The current release of UNITY on the 32000 is based upon Berkeley 4.1BSD. This was chosen because at the time we did not want to reinvent a paging algorithm. As most readers know, National will be supplying UNIX System V to AT&T. In turn, HCR has been contracted by National Semiconductor to perform the conversion of UNIX System V to allow it to run on the NSC Sys 16 System (which is based on the NS32000 microprocessor family). We will be using our System V Rel. 2 implementation to immediately provide the initial basis of the next version of UNITY. (This will occur even before an official AT&T release of 32000 System V.)

From a feature point of view, this release will provide the functionality now enjoyed by the 4.1BSD based version, as we have already ported most of the UCB utilities to System V. (You'll have to use shell layers rather than job control...)

From a performance point of view, there will be a number of good results:

1. A new implementation of C and Fortran 77. Better code is generated, and a number of Fortran problems will be resolved. In particular, the "module" linking used in the current version of UNITY will be changed to a more Vax-like convention. This will speed up execution, but it will also significantly improve the performance of the compilation process. The current linkage editor is slower than it needs to be, mostly due to module table processing. This is what accounts for an unexpectedly slow showing on "compile" benchmarks. When compiling `printf("hello world\n")', the compile-assemble-ld process is around a factor of 2.5 better with the new compiler/assembler/loader.

2. The System V Rel 2 implementation is generally faster. All of the steps taken on the Vax (e.g. implementation of critical library routines in assembly language, command hashing in /bin/sh, etc.) are used on the 32000.

3. The overhead of interrupt processing will be significantly reduced.

4. Virtual memory will be implemented by our own proprietary paging algorithm. This algorithm is designed to approximate swapping performance when running small jobs, and yet provide good performance on large jobs. It is the only UNIX paging algorithm that we know of that uses a working set algorithm. Prior to any significant performance tuning, it already benchmarks better than any algorithm now available on the 32000 hardware. (Note: if and when AT&T releases its own algorithm to the "public", we will evaluate its performance and use whatever is best. Both the AT&T and HCR algorithms are transparent to user code, so a change should be possible without modification of any user programs.)

In summary, the current UNITY 32000 release has performance which is comparable to other systems. However, on compile-bound benchmarks, particularly benchmarks which emphasize small programs, the linkage editing time predominates. Our main development thrust has been to ensure the completeness and configurability of the system, so extensive performance tuning has in the past not been emphasized. The next major release will incorporate the above performance and functionality improvements.
Final note: It has been said before, but one must use great caution when attempting to draw general conclusions from benchmarks on specific hardware. If attempting to benchmark only the software, one must take into account variations in memory speed, processor clock rate, disk speed, etc. Also, it is very important to benchmark at more than one point. For example, the current release of GENIX outperforms the current release of UNITY when compiling a single small program; UNITY far outperforms GENIX when heavy memory demands cause paging activity to occur. I can flat out state that, comparing current version to current version, a switch to GENIX will *not* double performance, and in some cases could degrade performance.

/ Michael Tilson
Human Computing Resources Corp.,
10 St Mary Street, Toronto, Canada
(416) 922-1937
{decvax,utzoo,utcsrgv}!hcr!hcrvax!mike
dan@rna.UUCP (Dan Ts'o) (08/02/84)
Hi,

Well, I haven't seen any recent benchmark postings, so here's one. I recently posted a request for performance benchmarks for real 32032/16 UNIX systems and received very little response - there seems to be a real lack of functional, deliverable 32032/16 UNIX systems out there. LMC is one example, but the 32016 in it apparently is running at 6MHz. I remember playing with this machine a while back and it was slow.

I managed to run some benchmarks on another 32016 system - the AIS (American Information System) 3210. The 3210 is a Qbus CPU. It is designed to run either as a Qbus master (no other CPU required) or as a "slave". In "slave" mode, the scheme I tested, the 3210 runs National's GENIX (4.1BSD) with all disk I/O calls going through a VIOS (Virtual I/O System) to another Qbus CPU (e.g. PDP11/23). Thus all the real I/O is performed by the 11/23.

Here are the explanation and results of a series of benchmarks on the 3210, as well as a few VAXes and other machines, including a Pyramid and a MASSCOMP 500.

Quick note to start: I didn't believe the user and sys times reported by the 3210, so I don't list them (explanation below). The normalization index is real time execution (or something more reasonable) with respect to the 780. Numbers listed for each benchmark are real(r), user(u), system(s), %cpu(%), and normalization(n). Times are in seconds; the normalization index is a fraction of the 11/780.

The normalization index is the easiest number to survey, so I list just this index first. The actual data is given at the end of the article.
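The normalization index described above is just a ratio of times, and the summary rows further down are ordinary statistics over those ratios. Here is a minimal sketch in C, assuming the index is the 11/780's time divided by the machine's time (the post only says real time "or something more reasonable", so which time feeds each entry is a guess) and assuming the summary uses the plain mean and the sample (n - 1) deviation. The function names are made up for illustration and are not from the post.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: reference (11/780) time over this machine's time.
   Above 1 means faster than the 780, below 1 means slower. */
double norm_index(double ref_seconds, double machine_seconds)
{
    assert(machine_seconds > 0.0);
    return ref_seconds / machine_seconds;
}

/* Arithmetic mean of n values. */
double mean(const double *x, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}

/* Sample variance (divides by n - 1). The standard deviation is its
   square root; it is left out so the sketch needs no math library. */
double sample_variance(const double *x, size_t n)
{
    double m, sum = 0.0;
    size_t i;

    assert(n > 1);
    m = mean(x, n);
    for (i = 0; i < n; i++)
        sum += (x[i] - m) * (x[i] - m);
    return sum / (double)(n - 1);
}
```

With the LOOP user times from the appendix (2.5 s on the 780, 5.1 s on the 750), norm_index(2.5, 5.1) comes out near .49, which agrees with the 750 entry. The sample variance of the ten Pyramid indices is about .54, and its square root is close to the .74 listed.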
- LOOP, for loop of 1 million with long int index, Same as some previous UNIX conference benchmarks
- CC LOOP, cc -O loop.c, Companion C compile to above
- SIEVE, Same as published in BYTE
- CC SIEVE, cc -O sieve.c
- FLOAT, Same as published in BYTE, testing floating point performance *, /
- GETPID, for loop of 100000 getpid()'s
- GREP, grep zoom /usr/dict/words, grep through ~200kbytes
- COPY, cp /usr/dict/words /tmp/junk, copying ~200kbytes
- NROFF, nroff -ms /dev/null, load the MS macro package
- SORT, sort -r /usr/dict/words > /tmp/junk

                    PYR     780     750     11/44   11/34   11/23   MASS    3210    PC/XT   286
LOOP                2.1     1       .49     .27     .19     .1      .38     .23     .080    .16
CC LOOP             .6      1       .6      .3      .25     .17     .38     .17     .073    .17
SIEVE               2.5     1       .61     .71     .46     .26     .57     .36     .21     .56
CC SIEVE            .67     1       .57     .36     .27     .19     .4      .17     .075    .19
FLOAT               .27     1       .76     .31     .27     .034    .030    .33     .13     .0029
GETPID              2.0     1       .59     .41     .30     .15     .76     .25     .22     .55
GREP                1.3     1       .5      .44     .4      .24     .4      .2      .13     .39
COPY                2       1       1       .16     .13     .13     .25     .1      .047    .10
NROFF               1.3     1       .57     .33     .22     .14     .4      no -ms  .12     .27
SORT                1.4     1       .55     .42     .34     .20     .5      .22     .16     .41

Summary of normalizations:
mean                1.4     1       .62     .37     .28     .16     .41     .23     .12     .28
standard deviation  .74             .15     .14     .098    .067    .19     .08     .059    .19

Machine configurations:
PYR:   Pyramid, Eagle disk, no FPA, running OSx (4.2BSD)
780:   11/780, Eagle disk on SC780, FPA, 4.2BSD, 4k/1k fs
750:   11/750, Eagle disk on SC750, FPA, 4.2BSD, 4k/1k fs
11/44: CDC 9762 disk, FPU, cache, PWB/Unix (512byte/block), 50 kernel buffers
11/34: CDC 9762 disk, FPU, cache, PWB/Unix (512byte/block), 10 kernel buffers
11/23: USDC 40ms disk with read cache, FPU, no FPA, PWB/Unix, 15 kernel buffers
MASS:  Masscomp 500, no FPA, 4kb cache, virtual memory System III, 68010 10MHz
3210:  32016 8MHz, PDP11/23 IOP, 16081 FPU, GENIX (4.1BSD), no wait state mem
PC/XT: 8088 w/ 8087 FPU, Venix
286:   Intel 286/380, 80286 at 6MHz, no 80287 FPU, XENIX, Priam 3450 35Mb disk

Notes:
- All machines were running multiuser with one user. Results presented were reproduced with several trials.
  /usr/dict/words was confirmed to be of the same 200kb size +- 2kb (1%). The MS macros were not compacted/compiled.
- The 3210 used an 8MHz 32016. The company (AIS) claims that they will soon have 10MHz CPU's and will later have 10MHz 32032's, which they expect to fall between 750 and 780 performance. Right now it looks like the 3210 is roughly a 730. It's hard to say whether a 10MHz CPU with 32-bit paths would give them a 100% performance improvement.
- The 3210 version of GENIX reported nonsense user and system times under both the Cshell and /bin/time. System time was always 0.0, %cpu was almost always 16% and user time was always about 1/6 of expected. Thus, at least times() was broken and maybe the clock was running at 10Hz instead of 60Hz. I couldn't test nroff -ms; they may have it, but it wasn't on the system I tested. Other commands were broken or absent as well (e.g. ps).
- The 286 had a similar problem with user and system times. I became convinced that user, system and %cpu numbers were off by a factor of 3 (perhaps a 20Hz clock), so the times reported have been adjusted.
- As one net person pointed out, the real win with the 32032/16 is the 16081 FPU, which is basically on par with the 750 without an FPA, and the 11/44 and 11/34 FPU's. The Masscomp 500 without an FPU performed terribly, but Masscomp promises an FPU of their own design which will be several times faster than the popular SKY FPU and should alleviate this long-standing sore spot. Pyramid also promises an FPA to help its unimpressive floating point performance. As an index, both the 780 and the 750 FPA's boost floating point performance by roughly 4X.
- The floating point performance of the 286 was also terrible. A closer look reveals that the floating point was handled in system mode, probably the result of an illegal instruction trap. The version of the software tested did not support an 80287 FPU.
- I believe the I/O performance of the 11/34 to be greatly hampered by the small number of kernel buffers it had (do you care?). Changing the number of free buffers (by umount) affects the I/O performance by 2X. The 512byte/block filesystem doesn't help either. I don't know what the Masscomp filesystem blocking factor is, but it may be 1kbyte. The 4.2BSD filesystem is very fast - COPY on a 4.1BSD 780 takes 2.5X longer. 2.8 and 2.9BSD should give a performance boost to the PDP-11's in I/O and system call overhead.
- Of course, the PDP-11's were handicapped in the LOOP using a long. In raw integer performance, the 11/44 is usually slightly faster than the 750.
- Pyramid needs to speed up its C compiler.
- NROFF appears to be the best general indicator of overall performance. Comparing the normalization index, NROFF had a standard error of .048. LOOP, for example, had a s.e. of .26 (i.e. wrong by 26% of a 780). If you could only run one command on a system and wanted to know what the normalization index would be like, the command "nroff -ms /dev/null" seems to be a fair indication.
- Unfortunately, I didn't benchmark terminal I/O, memory access and addressing, or process context switching performance - other important measurements.
- Some opinions/flames not to be taken too seriously: as it turns out, those vague performance specs that DEC marketing uses seem actually on the mark. For example, the 750 is 60% of a 780, looking at the normalization numbers. Also the 11/23 is 80% of the 11/34 (uncached, the cache adds 25% average performance to the 11/34). The 785 benchmarks I've seen also jibe with the marketing talk. In contrast, other vendors are considerably more optimistic about their product - the Masscomp is supposed to be as fast as a 750 but seems really to be 70% of a 750 (an Eagle might help). The Pyramid is touted as being 2-4 times a 780 but seems like 1.4X.
  The 3210 was spec'd as "slightly less than a 750 and will be almost a 780", but is now less than 50% of a 750. Well, if DEC is also correct about the MicroVAX I being 35% of a 780, it may not be so bad after all.

I hope this info is of help. It looks like 32032/16 UNIX systems have a little maturing to do. I plan to post another series of benchmarks on more machines such as the 11/73 and the Ridge (unless I get too many flames).

Cheers,
Dan Ts'o
Dept. Neurobiology
Rockefeller Univ.
1230 York Ave.
NY, NY 10021
212-570-7671
...cmcl2!rna!dan

Appendix of times:

PYR 780 750 11/44 11/34 11/23 MASS 3210 PC/XT 286

LOOP
  r 1 2 5 9 13 25 7 11 25 15
  u 1.2 2.5 5.1 9.1 13.1 24.9 6.3 24.9 15.6
  s 0 0 .1 .1 .1 .1 .2 .1 0
  % 92 97 93 92 97 96
  n 2.1 1 .49 .27 .19 .1 .38 .23 .080 .16

CC LOOP
  r 5 3 5 10 12 18 8 18 41 18
  u .9 .7 1.3 .8 1.1 2.0 1.4 8.4 5.7
  s 1.6 1.6 2.7 2.9 4.1 6.9 2.7 9.5 3.3
  % 47 68 73 54 43 48
  n .6 1 .6 .3 .25 .17 .38 .17 .073 .17

SIEVE
  r 1 2 4 4 5 9 4 7 12 4
  u 1 2.5 4.1 3.4 5.2 9.8 4.2 11.6 4.5
  s 0 0 .1 .1 .2 .3 .2 .4 0
  % 88 99 99 107! 99 99
  n 2.5 1 .61 .71 .46 .26 .57 .36 .21 .56

CC SIEVE
  r 6 4 7 11 15 21 10 23 53 21
  u 1.5 1.7 2.8 1.6 2.7 4.5 3.4 20.2 6.9
  s 1.5 1.7 3 3.3 4.3 7.3 3 9.8 3.9
  % 41 70 74 63 56 51
  n .67 1 .57 .36 .27 .19 .4 .17 .075 .19

FLOAT
  r 5 1 1 5 5 38 44 4 10 454
  u 4.9 1.3 1.7 4.2 4.9 38 43.1 9.8 6.3
  s 0 0 0 0 0 .1 1.1 .3 448.2
  % 93 97 98 100 100 99
  n .27 1 .76 .31 .27 .034 .030 .33 .13 .0029

GETPID
  r 9 19 33 45 63 123 24 75 85 34
  u 1.6 2.5 4.1 9.9 8.6 25.5 1.3 12.1 6.9
  s 7.6 16.1 27.5 35.0 54.0 96.6 23.1 72.0 27
  % 96 96 95 101! 98 99
  n 2.0 1 .59 .41 .30 .15 .76 .25 .22 .55

GREP
  r 3 4 8 9 10 17 10 20 30 11
  u 2.6 3.5 6.9 5.5 6.7 10.8 6.6 23.0 8.7
  s .3 .5 .8 2.2 3.0 5.5 2.3 3.5 1.5
  % 84 95 97 88
  n 1.3 1 .5 .44 .4 .24 .4 .2 .13 .39

COPY
  r 1 2 2 12 16 16 8 21 43 20
  u 0 0 0 0 .1 .17 0 .2 0
  s .4 .7 .9 4.2 6.1 10.2 4 8.5 4.5
  % 21 34 41 50 20 21
  n 2 1 1 .16 .13 .13 .25 .1 .047 .10

NROFF
  r 3 4 7 12 18 29 10 no -ms 33 15
  u 1.4 2.9 5.2 7 11.1 18.8 7.7 21.2 9.0
  s .4 .6 1 2.2 3.4 5.1 2.1 5 1.8
  % 75 83 97 79 72
  n 1.3 1 .57 .33 .22 .14 .4 .12 .27

SORT
  r 26 37 67 88 110 187 74 167 226 90
  u 22.4 34.2 60.1 51.3 77.3 144.9 53.4 174.2 63.6
  s 1.1 2.1 4.2 14.3 19.4 31.3 12.8 41.3 11.4
  % 89 96 95 89 95 81
  n 1.4 1 .55 .42 .34 .20 .5 .22 .16 .41
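For anyone wanting to rerun something like LOOP, GETPID and SIEVE from the post above: the original sources were not posted, so the following is only a sketch of what such programs might look like. The function names, the volatile loop index (added so a modern optimizer does not delete the empty loop) and the times()-based timer are all assumptions. The sieve size of 8190 is recalled from the BYTE listing and is not stated anywhere in this thread.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/times.h>
#include <unistd.h>

#define SIEVE_SIZE 8190   /* assumed: size used in the BYTE listing */

/* LOOP: empty for loop with a long index (the post used 1 million). */
long loop_bench(long iterations)
{
    volatile long i;      /* volatile keeps the empty loop from being removed */

    for (i = 0; i < iterations; i++)
        ;
    return i;
}

/* GETPID: repeated getpid() calls (the post used 100000) to exercise
   system call overhead. Returns the last pid seen. */
long getpid_bench(long iterations)
{
    long i;
    pid_t last = 0;

    for (i = 0; i < iterations; i++)
        last = getpid();
    return (long)last;
}

/* SIEVE: sieve over the odd numbers 3 .. 2*SIEVE_SIZE+3, repeated
   `iterations` times. Returns the prime count of the last pass. */
long sieve_bench(long iterations)
{
    static char flags[SIEVE_SIZE + 1];
    long iter, i, k, prime, count = 0;

    for (iter = 0; iter < iterations; iter++) {
        count = 0;
        for (i = 0; i <= SIEVE_SIZE; i++)
            flags[i] = 1;
        for (i = 0; i <= SIEVE_SIZE; i++) {
            if (flags[i]) {
                prime = i + i + 3;            /* index i stands for 2i+3 */
                for (k = i + prime; k <= SIEVE_SIZE; k += prime)
                    flags[k] = 0;             /* strike odd multiples */
                count++;
            }
        }
    }
    return count;
}

/* Run one benchmark and report its user and system CPU seconds,
   measured with times() and the tick rate from sysconf(). */
void time_bench(long (*bench)(long), long iterations,
                double *user_s, double *sys_s)
{
    struct tms before, after;
    long ticks = sysconf(_SC_CLK_TCK);

    assert(ticks > 0);
    times(&before);
    (void)bench(iterations);
    times(&after);
    *user_s = (double)(after.tms_utime - before.tms_utime) / (double)ticks;
    *sys_s  = (double)(after.tms_stime - before.tms_stime) / (double)ticks;
}
```

A caller would pass one of the benchmark functions to time_bench, for example time_bench(getpid_bench, 100000L, &u, &s), and compare the reported user and system seconds across machines. With the size above, sieve_bench(1) returns 1899.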