[net.unix] 16k benchmarks ?

dan@rna.UUCP (Dan Ts'o) (08/02/84)
Hi,
	Well I haven't seen any recent benchmark postings, so here's one.
I recently posted a request for performance benchmarks for real 32032/16 UNIX
systems and received very little response - there seems to be a real lack of
functional, deliverable 32032/16 UNIX systems out there.
	LMC is one example, but the 32016 in it apparently is running at 6MHz.
I remember playing with this machine a while back and it was slow. I managed to
run some benchmarks on another 32016 system - the AIS (American Information
System) 3210.
	The 3210 is a Qbus CPU. It is designed to run either as a Qbus master
(no other CPU required) or as a "slave". In "slave" mode, the scheme I tested,
the 3210 runs National's GENIX (4.1BSD) with all disk I/O calls going through
a VIOS (Virtual I/O System) to another Qbus CPU (e.g. PDP11/23). Thus all
the real I/O is performed by the 11/23.
	Here are the explanation and results of a series of benchmarks on the
3210, as well as a few VAXes and other machines, including a Pyramid and a
MASSCOMP 500. Quick note to start: I didn't believe the user and sys times
reported by the 3210, so I don't list them (explanation below). The
normalization index is real time execution (or something more reasonable)
relative to the 780, expressed as a fraction of 11/780 performance. Numbers
listed for each benchmark are real(r), user(u), system(s), %cpu(%), and
normalization(n). Times are in seconds. The normalization index is the easiest
number to survey, so I list just this index first. The actual data is given
at the end of the article.
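
As a worked example of the index, the sketch below divides the 780's time by
each machine's time, using the SORT real times from the appendix. The helper
name normalization_index is hypothetical, and the sketch assumes plain
real-time ratios; rows that were derived from "something more reasonable" than
real time will not necessarily reproduce this way.

```python
# SORT real times in seconds, copied from the appendix of this post.
SORT_REAL = {
    "PYR": 26, "780": 37, "750": 67, "11/44": 88, "11/34": 110,
    "11/23": 187, "MASS": 74, "3210": 167, "PC/XT": 226, "286": 90,
}


def normalization_index(times, reference="780"):
    """Return each machine's speed as a fraction of the reference machine.

    Hypothetical helper: index = reference time / machine time, so a
    machine that takes twice as long as the 780 scores 0.5.
    """
    ref = times[reference]
    return {machine: ref / t for machine, t in times.items()}


if __name__ == "__main__":
    for machine, n in normalization_index(SORT_REAL).items():
        print(f"{machine:6s} {n:.2f}")
```

For SORT these ratios round to the n row given in the appendix (for example
37/67 for the 750 and 37/167 for the 3210).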

- LOOP, for loop of 1 million iterations with a long int index, same as some
	previous UNIX conference benchmarks
- CC LOOP, cc -O loop.c, companion C compile to the above
- SIEVE, same as published in BYTE
- CC SIEVE, cc -O sieve.c
- FLOAT, same as published in BYTE, testing floating point * and /
- GETPID, for loop of 100000 getpid()'s
- GREP, grep zoom /usr/dict/words, grep through ~200kbytes
- COPY, cp /usr/dict/words /tmp/junk, copying ~200kbytes
- NROFF, nroff -ms /dev/null, load the MS macro package
- SORT, sort -r /usr/dict/words > /tmp/junk
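
The r, u, s and % figures for the commands above came from the Cshell's time
and /bin/time. For anyone repeating a command on a current system, a rough
equivalent can be sketched as below. It assumes a POSIX host where os.times()
reports child CPU time; time_command is a hypothetical name, and %cpu is taken
to be (user + system) / real, which is an assumption about how the shell
computes it.

```python
import os
import subprocess
import sys
import time


def time_command(argv):
    """Run argv once and return (real, user, system, percent_cpu) in seconds.

    User and system times are the CPU time charged to child processes,
    measured as the change in os.times() across the run.
    """
    before = os.times()
    start = time.monotonic()
    subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    real = time.monotonic() - start
    after = os.times()
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    # Assumed definition of %cpu: CPU seconds as a share of elapsed seconds.
    percent = 100.0 * (user + system) / real if real > 0 else 0.0
    return real, user, system, percent


if __name__ == "__main__":
    # Placeholder command; substitute e.g. ["grep", "zoom", "/usr/dict/words"].
    r, u, s, pct = time_command([sys.executable, "-c", "pass"])
    print(f"r {r:.1f}  u {u:.1f}  s {s:.1f}  % {pct:.0f}")
```

Because real time and CPU time come from different clocks, short runs can show
a %cpu slightly over 100.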

          PYR    780    750    11/44  11/34  11/23  MASS   3210   PC/XT  286
LOOP      2.1    1      .49    .27    .19    .1     .38    .23    .080   .16
CC LOOP   .6     1      .6     .3     .25    .17    .38    .17    .073   .17
SIEVE     2.5    1      .61    .71    .46    .26    .57    .36    .21    .56
CC SIEVE  .67    1      .57    .36    .27    .19    .4     .17    .075   .19
FLOAT     .27    1      .76    .31    .27    .034   .030   .33    .13    .0029
GETPID    2.0    1      .59    .41    .30    .15    .76    .25    .22    .55
GREP      1.3    1      .5     .44    .4     .24    .4     .2     .13    .39
COPY      2      1      1      .16    .13    .13    .25    .1     .047   .10
NROFF     1.3    1      .57    .33    .22    .14    .4     no -ms .12    .27
SORT      1.4    1      .55    .42    .34    .20    .5     .22    .16    .41

Summary of normalizations:
mean      1.4    1      .62    .37    .28    .16    .41    .23    .12    .28
std dev   .74           .15    .14    .098   .067   .19    .08    .059   .19
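
The summary rows can be checked against the table. The sketch below takes the
ten 11/750 indices and computes their mean and standard deviation. That the
sample (n - 1) form was used is an assumption; it is the form that reproduces
the listed .15, while the population form gives .14.

```python
import statistics

# Normalization indices for the 11/750, one per benchmark, in table order
# (LOOP, CC LOOP, SIEVE, CC SIEVE, FLOAT, GETPID, GREP, COPY, NROFF, SORT).
VAX_750 = [.49, .6, .61, .57, .76, .59, .5, 1, .57, .55]

mean = statistics.mean(VAX_750)
# statistics.stdev is the sample standard deviation (divides by n - 1).
sd = statistics.stdev(VAX_750)

if __name__ == "__main__":
    print(f"mean {mean:.2f}  sd {sd:.2f}")
```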

Machine configurations:
PYR:	Pyramid, Eagle disk, no FPA, running OSx (4.2BSD)
780:	11/780, Eagle disk on SC780, FPA, 4.2BSD, 4k/1k fs
750:	11/750, Eagle disk on SC750, FPA, 4.2BSD, 4k/1k fs
11/44:	CDC 9762 disk, FPU, cache, PWB/Unix (512byte/block), 50 kernel buffers
11/34:	CDC 9762 disk, FPU, cache, PWB/Unix (512byte/block), 10 kernel buffers
11/23:	USDC 40ms disk with read cache, FPU, no FPA, PWB/Unix, 15 kernel buffers
MASS:	Masscomp 500, no FPA, 4kb cache, virtual memory System III, 68010 10MHz
3210:	32016 8MHz, PDP11/23 IOP, 16081 FPU, GENIX (4.1BSD), no wait state mem
PC/XT:	8088 w/ 8087 FPU, Venix
286:	Intel 286/380, 80286 at 6MHz, no 80287 FPU, XENIX, Priam 3450 35Mb disk

Notes:
	- All machines were running multiuser with one user. Results presented
were reproduced with several trials. /usr/dict/words was confirmed to be of the
same 200kb size +- 2kb (1%). The MS macros were not compacted/compiled.
	- The 3210 used an 8MHz 32016. The company (AIS) claims that they will
soon have 10MHz CPU's and will later have 10MHz 32032's, which they expect to
deliver between 750 and 780 performance. Right now it looks like the 3210 is
roughly a 730. It's hard to say whether a 10MHz CPU with 32bit paths would give
them a 100% performance improvement.
	- The 3210 version of GENIX reported nonsense user and system times
under both the Cshell and /bin/time. System time was always 0.0, %cpu was
almost always 16% and user time was always about 1/6 of expected. Thus, at
least times() was broken and maybe the clock was running at 10Hz instead of
60Hz. I couldn't test nroff -ms: they may have it, but it wasn't on the
system I tested. Other commands were broken or absent as well (e.g. ps).
	- The 286 had a similar problem with user and system times. I became
convinced that user, system and %cpu numbers were off by a factor of 3 (perhaps
a 20Hz clock), so the times reported have been adjusted.
	- As one net person pointed out, the real win with the 32032/16 is the
16081 FPU, which is basically on par with the 750 without an FPA, and with the
11/44 and 11/34 FPU's. The Masscomp 500 without an FPU performed terribly, but
Masscomp promises an FPU of their own design which will be several times faster
than the popular SKY FPU and should alleviate this long standing sore spot.
Pyramid also promises an FPA to help its unimpressive floating point performance.
As an index, both the 780 and the 750 FPA's boost floating point performance by
roughly 4X.
	- The floating point performance of the 286 was also terrible. A closer
look reveals that the floating point was handled in system mode, probably the
result of an illegal instruction trap. The version of the software tested did
not support an 80287 FPU.
	- I believe the I/O performance of the 11/34 to be greatly hampered by
the small number of kernel buffers it had (do you care ?). Changing the number
of free buffers (by umount) affects the I/O performance by 2X. The 512byte/block
filesystem doesn't help either. I don't know what the Masscomp filesystem
blocking factor is, but it may be 1kbyte. The 4.2BSD filesystem is very fast -
COPY on a 4.1BSD 780 takes 2.5X longer. 2.8 and 2.9BSD should give a performance
boost to the PDP-11's in I/O and system call overhead.
	- Of course, the PDP-11's were handicapped in the LOOP using a long.
In raw integer performance, the 11/44 is usually slightly faster than the 750.
	- Pyramid needs to speed up its C compiler.
	- NROFF appears to be the best general indicator of overall
performance. Comparing the normalization indices, NROFF had a standard error of
.048. LOOP, for example, had an s.e. of .26 (i.e. wrong by 26% of a 780). If you
could only run one command on a system and wanted to know what the normalization
index would be like, the command "nroff -ms /dev/null" seems to be a fair
indication.
	- Unfortunately, I didn't benchmark terminal I/O, memory access and
addressing or process context switching performance - other important
measurements.
	- Some opinions/flames not to be taken too seriously: as it turns out,
those vague performance specs that DEC marketing uses seem actually on the mark.
For example, the 750 is 60% of a 780, looking at the normalization numbers. Also
the 11/23 is 80% of the 11/34 (uncached, the cache adds 25% average performance
to the 11/34). The 785 benchmarks I've seen also jibe with the marketing talk.
In contrast other vendors are considerably more optimistic about their product -
the Masscomp is supposed to be as fast as a 750 but seems really to be 70% of a
750 (an Eagle might help). The Pyramid is touted as being 2-4 times a 780 but
seems like 1.4X. The 3210 was spec'd as "slightly less than a 750 and will be
almost a 780", but is now less than 50% of a 750. Well, if DEC is also correct
about the MicroVAX I being 35% of a 780, it may not be so bad after all.
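
The note on NROFF quotes standard errors of .048 and .26 without saying how
they were computed. One reading that reproduces both figures is the
root-mean-square difference between a benchmark's index and the mean index,
taken over the eight machines other than the 780 reference and the 3210 (which
has no NROFF result). That reading is an assumption, and rms_error is a
hypothetical name.

```python
import math

# Columns: PYR, 750, 11/44, 11/34, 11/23, MASS, PC/XT, 286 (from the tables).
MEAN = [1.4, .62, .37, .28, .16, .41, .12, .28]
NROFF = [1.3, .57, .33, .22, .14, .4, .12, .27]
LOOP = [2.1, .49, .27, .19, .1, .38, .080, .16]


def rms_error(index, mean):
    """Root-mean-square difference between one benchmark and the mean index."""
    squares = [(i - m) ** 2 for i, m in zip(index, mean)]
    return math.sqrt(sum(squares) / len(squares))


if __name__ == "__main__":
    print(f"NROFF {rms_error(NROFF, MEAN):.3f}")
    print(f"LOOP  {rms_error(LOOP, MEAN):.2f}")
```

Under this reading NROFF tracks the mean much more closely than LOOP, whose
error is dominated by the Pyramid's 2.1 against a mean of 1.4.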

	I hope this info is of help. It looks like 32032/16 UNIX systems have a
little maturing to do. I plan to post another series of benchmarks on more
machines such as the 11/73 and the Ridge (unless I get too many flames.)

					Cheers,
					Dan Ts'o
					Dept. Neurobiology
					Rockefeller Univ.
					1230 York Ave.
					NY, NY 10021
					212-570-7671
					...cmcl2!rna!dan

Appendix of times:

  PYR	780	750	11/44	11/34	11/23	MASS	3210	PC/XT	286

LOOP
r 1	2	5	9	13	25	7	11	25	15
u 1.2	2.5	5.1	9.1	13.1	24.9	6.3		24.9	15.6
s 0	0	.1	.1	.1	.1	.2		.1	0
% 92	97	93				92		97	96
n 2.1	1	.49	.27	.19	.1	.38	.23	.080	.16

CC LOOP
r 5	3	5	10	12	18	8	18	41	18
u .9	.7	1.3	.8	1.1	2.0	1.4		8.4	5.7
s 1.6	1.6	2.7	2.9	4.1	6.9	2.7		9.5	3.3
% 47	68	73				54		43	48
n .6	1	.6	.3	.25	.17	.38	.17	.073	.17

SIEVE
r 1	2	4	4	5	9	4	7	12	4
u 1	2.5	4.1	3.4	5.2	9.8	4.2		11.6	4.5
s 0	0	.1	.1	.2	.3	.2		.4	0
% 88	99	99				107!		99	99
n 2.5	1	.61	.71	.46	.26	.57	.36	.21	.56

CC SIEVE
r 6	4	7	11	15	21	10	23	53	21
u 1.5	1.7	2.8	1.6	2.7	4.5	3.4		20.2	6.9
s 1.5	1.7	3	3.3	4.3	7.3	3		9.8	3.9
% 41	70	74				63		56	51
n .67	1	.57	.36	.27	.19	.4	.17	.075	.19

FLOAT
r 5	1	1	5	5	38	44	4	10	454
u 4.9	1.3	1.7	4.2	4.9	38	43.1		9.8	6.3
s 0	0	0	0	0	.1	1.1		.3	448.2
% 93	97	98				100		100	99
n .27	1	.76	.31	.27	.034	.030	.33	.13	.0029

GETPID
r 9	19	33	45	63	123	24	75	85	34
u 1.6	2.5	4.1	9.9	8.6	25.5	1.3		12.1	6.9
s 7.6	16.1	27.5	35.0	54.0	96.6	23.1		72.0	27
% 96	96	95				101!		98	99
n 2.0	1	.59	.41	.30	.15	.76	.25	.22	.55

GREP
r 3	4	8	9	10	17	10	20	30	11
u 2.6	3.5	6.9	5.5	6.7	10.8	6.6		23.0	8.7
s .3	.5	.8	2.2	3.0	5.5	2.3		3.5	1.5
% 84	95	97				88
n 1.3	1	.5	.44	.4	.24	.4	.2	.13	.39

COPY
r 1	2	2	12	16	16	8	21	43	20
u 0	0	0	0	.1	.17	0		.2	0
s .4	.7	.9	4.2	6.1	10.2	4		8.5	4.5
% 21	34	41				50		20	21
n 2	1	1	.16	.13	.13	.25	.1	.047	.10

NROFF
r 3	4	7	12	18	29	10	no -ms	33	15
u 1.4	2.9	5.2	7	11.1	18.8	7.7		21.2	9.0
s .4	.6	1	2.2	3.4	5.1	2.1		5	1.8
% 	75	83				97		79	72
n 1.3	1	.57	.33	.22	.14	.4		.12	.27

SORT
r 26	37	67	88	110	187	74	167	226	90
u 22.4	34.2	60.1	51.3	77.3	144.9	53.4		174.2	63.6
s 1.1	2.1	4.2	14.3	19.4	31.3	12.8		41.3	11.4
% 89	96	95				89		95	81
n 1.4	1	.55	.42	.34	.20	.5	.22	.16	.41
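
The 286 u and s rows above are the adjusted values mentioned in the notes. The
adjustment can be sketched as follows, assuming CPU time is counted in clock
ticks and converted to seconds with a nominal 60Hz rate while the hardware
actually ticks more slowly. The function name and the sample raw value of 5.2
are illustrative only; 5.2 is simply the 286 LOOP user time of 15.6 divided
by 3, not a figure from the original measurements.

```python
def corrected_seconds(reported, assumed_hz=60, actual_hz=20):
    """Rescale a tick-derived time reported under the wrong clock rate.

    If the clock really interrupts at actual_hz but the accounting divides
    tick counts by assumed_hz, reported times are too small by the factor
    assumed_hz / actual_hz.
    """
    return reported * assumed_hz / actual_hz


if __name__ == "__main__":
    # Suspected 20Hz clock on the 286: factor of 3.
    print(f"{corrected_seconds(5.2):.1f}")
    # Suspected 10Hz clock on the 3210: factor of 6.
    print(f"{corrected_seconds(1.0, actual_hz=10):.1f}")
```

The same scaling would account for the 3210 reporting about 1/6 of the expected
user time, though its system time of 0.0 points to a further problem in
times().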