[net.micro.16k] 16k benchmarks ?

dan@rna.UUCP (Dan Ts'o) (07/11/84)

Hi,
	Does anyone have any benchmark results of C programs on a
commercially available 16000 UNIX system ? Can you mail me the results
or post them ? What do they show in relation to 68000's and VAX's ?
	The only benchmarks I've seen were done at USENIX meetings a
year or so ago on machines such as the LMC. They showed a rather
miserable performance with respect to 68000 offerings; results were
only slightly better than 8088's and LSI11/23's. Granted, the chips were
running at 6Mhz (I think).
	Yet all the ads you see seem to suggest that a 16k at 10Mhz should
be on par with 68000's at 10Mhz. So I'm looking for documentation -
benchmarks done on real 16k UNIX systems - got any ? Preferably with
C source so I can run the same benchmarks on other systems to compare.
	Thanks.

					Cheers,
					Dan Ts'o
					Dept. Neurobiology
					Rockefeller Univ.
					1230 York Ave.
					NY, NY 10021
					212-570-7671
					...cmcl2!rna!dan

beaucham@uiucuxc.UUCP (07/27/84)

uiucuxc!beaucham    Jul 26 22:18:00 1984

   We are about to buy an LMC after months of soul searching.  While not the
fastest machine in the world, it does do very well on fairly long floating
point intensive programs, particularly in C, but also in F77, and F.P. was our
most important requirement.  (there is an unfortunate initial overhead with F77
--the entire library is loaded whether you need it or not!)  We have
benchmarked it against the Dual 83/80, the Integrated Solutions 5/10, the PDP11/34,
the IBM CS9000, and the VAX 11/780, both for compile and execution times on 
three C and four F77 programs. (the Dual and I.S. machines use the 68000 and
68010, respectively, with no FPU in our tests; also, the VAX had no FPA.)
   
The results show that the LMC is about 9 times slower than the VAX on compiles,
with a variation from 6 to 12,  and 6 times slower than the VAX on executions,
varying from 2.3 to 12.  However, the poor executions were for short F77
programs. For two fpu-intensive C jobs and one fpu-intensive long F77 job the 
LMC averaged only 3x slower than the VAX, and we are talking about a $22,500
machine!  ($16875 with educational discount) 

These benchmarks were done under the Unity operating system with the 6 MHz
clock.  Switching to the Genix op. sys. soon is supposed to increase system
speed twofold and an increase of clock speed to 8 MHz soon is supposed to
improve performance by much more than a linear increase.  Also, they have
9 track tape working and will soon have an intelligent RS232 interface.
  
While we were disappointed with the LMC compared to the 68k machines in terms
of the edit/compile/ex debug cycle for short programs, our need for good F.P.
for the buck for long programs was the deciding factor in our choosing the LMC.
If anyone is interested in the benchmark details, I can provide those too.

mike@hcrvax.UUCP (Mike Tilson) (08/02/84)

Some recent articles have benchmarked the UNITY system on the LMC hardware.
In the light of those benchmarks, I thought that readers of this newsgroup
would be interested in knowing what work will be completed on UNITY in the
near future.  (One should also keep in mind that LMC will be making hardware
upgrades in the near future, for example the upgrading of the clock rate
on the processor card.)

The current UNITY system works well on the National 32000 series hardware,
and it has been adapted to quite a number of boxes (over a dozen of them).
Our focus to date has been to make the system reliable and configurable
to a wide variety of hardware.  We now plan certain performance improvements
by Q4 of this year, if not earlier.  The improvements are running internally.

The current release of UNITY on the 32000 is based upon Berkeley 4.1BSD.
This was chosen because at the time we did not want to reinvent a paging
algorithm.  As most readers know, National will be supplying UNIX System
V to AT&T.  In turn, HCR has been contracted by National Semiconductor
to perform the conversion of UNIX System V to allow it to run on the NSC
Sys 16 System (which is based on the NS32000 microprocessor family).
We will be using our System V Rel. 2 implementation to immediately provide
the initial basis of the next version of UNITY.  (This will occur even before
an official AT&T release of 32000 System V.)  From a feature point of view,
this release will provide the functionality now enjoyed by the 4.1BSD
based version, as we have already ported most of the UCB utilities to
System V.  (You'll have to use shell layers rather than job control...)
From a performance point of view, there will be a number of good results:

1.  A new implementation of C and Fortran 77.  Better code is generated,
    and a number of Fortran problems will be resolved.  In particular,
    the "module" linking used in the current version of UNITY will be
    changed to a more Vax-like convention.  This will speed up execution,
    but it will also significantly improve the performance of the
    compilation process.  The current linkage editor is slower than it
    needs to be, mostly due to module table processing.  This is what accounts
    for an unexpectedly slow showing on "compile" benchmarks.  When compiling
    `printf("hello world\n")', the compile-assemble-ld process is around
    a factor of 2.5 better with the new compiler/assembler/loader.

2.  The System V Rel 2 implementation is generally faster.  All of the steps
    taken on the Vax (e.g. implementation of critical library routines in
    assembly language, command hashing in /bin/sh, etc.) are used on the
    32000.

3.  The overhead of interrupt processing will be significantly reduced.

4.  Virtual memory will be implemented by our own proprietary paging
    algorithm.  This algorithm is designed to approximate swapping
    performance when running small jobs, and yet provide good performance
    on large jobs.  It is the only UNIX paging algorithm that we know
    of that uses a working set algorithm.  Prior to any significant
    performance tuning, it already benchmarks better than any algorithm
    now available on the 32000 hardware.  (Note:  if and when AT&T releases
    its own algorithm to the "public", we will evaluate its performance
    and use whatever is best.  Both the AT&T and HCR algorithms are
    transparent to user code, so a change should be possible without
    modification of any user programs.)
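For readers unfamiliar with the term, the working set idea mentioned in item 4
can be sketched as below. This is only the generic textbook definition (the
pages a process touched within its most recent window of references), not HCR's
proprietary algorithm, whose details are not given in this article. The
function name, the window size and the reference string are all made up for
illustration.

```python
def working_set(references, t, window):
    """Pages referenced in the last `window` references up to index t."""
    start = max(0, t - window + 1)
    return set(references[start:t + 1])

# Made-up page reference string for one process (hypothetical data).
refs = [1, 2, 1, 3, 4, 4, 4]

# A pager built on this idea would try to keep the working set resident and
# treat pages outside it as candidates for replacement.
for t in range(len(refs)):
    print(t, sorted(working_set(refs, t, window=3)))
```

A small job whose working set fits in memory would then rarely fault, which is
one plausible reading of "approximate swapping performance when running small
jobs".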

In summary, the current UNITY 32000 release has performance which is
comparable to other systems.  However, on compile-bound benchmarks,
particularly benchmarks which emphasize small programs, the linkage editing
time predominates.  Our main development thrust has been to ensure the
completeness and configurability of the system, so extensive performance tuning
has in the past not been emphasized.  The next major release will incorporate
the above performance and functionality improvements.

Final note:  It has been said before, but one must use great caution
when attempting to draw general conclusions from benchmarks on specific
hardware.  If attempting to benchmark only the software, one must take into
account variations in memory speed, processor clock rate, disk speed, etc.
Also, it is very important to benchmark at more than one point.  For
example, the current release of GENIX outperforms the current release of
UNITY when compiling a single small program; UNITY far outperforms GENIX
when heavy memory demands cause paging activity to occur.  I can
flat out state that, comparing current version to current version, a
switch to GENIX will *not* double performance, and in some cases could
degrade performance.

/ Michael Tilson
  Human Computing Resources Corp., 10 St Mary Street, Toronto, Canada
  (416) 922-1937
  {decvax,utzoo,utcsrgv}!hcr!hcrvax!mike

dan@rna.UUCP (Dan Ts'o) (08/02/84)

Hi,
	Well I haven't seen any recent benchmark postings, so here's one.
I recently posted a request for performance benchmarks for real 32032/16 UNIX
systems and received very little response - there seems to be a real lack of
functional, deliverable 32032/16 UNIX systems out there.
	LMC is one example, but the 32016 in it apparently is running at 6Mhz.
I remember playing with this machine a while back and it was slow. I managed to
run some benchmarks on another 32016 system - the AIS (American Information
System) 3210.
	The 3210 is a Qbus CPU. It is designed to run either as a Qbus master
(no other CPU required) or as a "slave". In "slave" mode, the scheme I tested,
the 3210 runs National's GENIX (4.1BSD) with all disk I/O calls going through
a VIOS (Virtual I/O System) to another Qbus CPU (e.g. PDP11/23). Thus all
the real I/O is performed by the 11/23.
	Here are the explanation and results of a series of benchmarks on the
3210, as well as a few VAXes and other machines, including a Pyramid and a
MASSCOMP 500. Quick note to start: I didn't believe the user and sys times
reported by the 3210, so I don't list them (explanation below). The
normalization index is real time execution (or something more reasonable)
with respect to the 780. Numbers listed for each benchmark are real(r), user(u),
system(s), %cpu(%), and normalization(n). Times are in seconds, normalization
index is a fraction of the 11/780. The normalization index is the easiest number
to survey, so I list just this index first. The actual data is given
at the end of the article.
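As a rough illustration of the normalization index described above (a sketch,
not the poster's actual procedure; the function name is made up here), the
index is the 780's time divided by the machine's time for the same benchmark,
so values below 1 mean slower than a 780. The sample figures are the CC LOOP
real times from the appendix of times.

```python
def normalization_index(vax780_seconds, machine_seconds):
    """Return speed relative to the 11/780 (1.0 = same speed, < 1 = slower)."""
    if machine_seconds <= 0:
        raise ValueError("machine time must be positive")
    return vax780_seconds / machine_seconds

# CC LOOP real times in seconds, taken from the appendix of times.
cc_loop_real = {"780": 3, "750": 5, "11/44": 10, "3210": 18, "PC/XT": 41}

for machine, seconds in cc_loop_real.items():
    print(machine, round(normalization_index(cc_loop_real["780"], seconds), 3))
```

Where the whole-second real times are too coarse (LOOP on the faster machines,
for example), the listed indices appear to be based on user time instead, which
is presumably the "something more reasonable" mentioned above.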

- LOOP, for loop of 1 million with long int index, Same as some previous UNIX
	conference benchmarks
- CC LOOP, cc -O loop.c, Companion C compile to above
- SIEVE, Same as published in BYTE
- CC SIEVE, cc -O sieve.c
- FLOAT, Same as published in BYTE, testing floating point performance *, /
- GETPID, for loop of 100000 getpid()'s
- GREP, grep zoom /usr/dict/words, grep through ~200kbytes
- COPY, cp /usr/dict/words /tmp/junk, copying ~200kbytes
- NROFF, nroff -ms /dev/null, load the MS macro package
- SORT, sort -r /usr/dict/words > /tmp/junk

   PYR	780	750	11/44	11/34	11/23	MASS	3210	PC/XT	286

LOOP
   2.1	1	.49	.27	.19	.1	.38	.23	.080	.16
CC LOOP
    .6	1	.6	.3	.25	.17	.38	.17	.073	.17
SIEVE
   2.5	1	.61	.71	.46	.26	.57	.36	.21	.56
CC SIEVE
    .67	1	.57	.36	.27	.19	.4	.17	.075	.19
FLOAT
    .27	1	.76	.31	.27	.034	.030	.33	.13	.0029
GETPID
   2.0	1	.59	.41	.30	.15	.76	.25	.22	.55
GREP
   1.3	1	.5	.44	.4	.24	.4	.2	.13	.39
COPY
   2	1	1	.16	.13	.13	.25	.1	.047	.10
NROFF
   1.3	1	.57	.33	.22	.14	.4	no -ms	.12	.27
SORT
   1.4	1	.55	.42	.34	.20	.5	.22	.16	.41

Summary of normalizations:
mean
  1.4	1	.62	.37	.28	.16	.41	.23	.12	.28
standard deviation
  .74		.15	.14	.098	.067	.19	.08	.059	.19
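The summary rows can be recomputed from the table with a few lines. The sketch
below assumes the standard deviation is the sample (n-1) form, which appears to
match the 750 column; the variable and function names are made up for this
example.

```python
import statistics

# Normalization indices for two columns of the table above, in benchmark order
# (LOOP, CC LOOP, SIEVE, CC SIEVE, FLOAT, GETPID, GREP, COPY, NROFF, SORT).
index_750 = [.49, .6, .61, .57, .76, .59, .5, 1, .57, .55]
# The 3210 has no NROFF entry, so only nine values contribute.
index_3210 = [.23, .17, .36, .17, .33, .25, .2, .1, .22]

def summarize(values):
    """Return (mean, sample standard deviation) of a list of indices."""
    return statistics.mean(values), statistics.stdev(values)

for name, values in (("750", index_750), ("3210", index_3210)):
    mean, sd = summarize(values)
    print(name, round(mean, 2), round(sd, 2))
```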

Machine configurations:
PYR:	Pyramid, Eagle disk, no FPA, running OSx (4.2BSD)
780:	11/780, Eagle disk on SC780, FPA, 4.2BSD, 4k/1k fs
750:	11/750, Eagle disk on SC750, FPA, 4.2BSD, 4k/1k fs
11/44:	CDC 9762 disk, FPU, cache, PWB/Unix (512byte/block), 50 kernel buffers
11/34:	CDC 9762 disk, FPU, cache, PWB/Unix (512byte/block), 10 kernel buffers
11/23:	USDC 40ms disk with read cache, FPU, no FPA, PWB/Unix, 15 kernel buffers
MASS:	Masscomp 500, no FPA, 4kb cache, virtual memory System III, 68010 10Mhz
3210:	32016 8Mhz, PDP11/23 IOP, 16081 FPU, GENIX (4.1BSD), no wait state mem
PC/XT:	8088 w/ 8087 FPU, Venix
286:	Intel 286/380, 80286 at 6Mhz, no 80287 FPU, XENIX, Priam 3450 35Mb disk

Notes:
	- All machines were running multiuser with one user. Results presented
were reproduced with several trials. /usr/dict/words was confirmed to be of the
same 200kb size +- 2kb (1%). The MS macros were not compacted/compiled.
	- The 3210 used an 8Mhz 32016. The company (AIS) claims that they will
soon have 10Mhz CPU's and will later have 10Mhz 32032's, which they expect to
deliver between 750 and 780 performance. Right now it looks like the 3210 is
roughly a 730. It's hard to say whether a 10Mhz CPU with 32bit paths would give
them a 100% performance improvement.
	- The 3210 version of GENIX reported nonsense user and system times
under both the Cshell and /bin/time. System time was always 0.0, %cpu was
almost always 16% and user time was always about 1/6 of expected. Thus, at
least times() was broken and maybe the clock was running at 10HZ instead of
60HZ. I couldn't test nroff -ms; they may have it, but it wasn't on the
system I tested. Other commands were broken or absent as well (e.g. ps).
	- The 286 had a similar problem with user and system times. I became
convinced that user, system and %cpu numbers were off by a factor of 3 (perhaps
a 20Hz clock), so the times reported have been adjusted.
	- As one net person pointed out, the real win with the 32032/16 is the
16081 FPU which is basically on par with the 750 without an FPA, and the 11/44
and 11/34 FPU's. The Masscomp 500 without an FPU performed terribly, but
Masscomp promises a FPU of their own design which will be several times faster
than the popular SKY FPU and should alleviate this long standing sore spot.
Pyramid also promises a FPA to help its unimpressive floating point performance.
As an index, both the 780 and the 750 FPA's boost floating point performance by
roughly 4X.
	- The floating point performance of the 286 was also terrible. A closer
look reveals that the floating point was handled in system mode, probably the
result of an illegal instruction trap. The version of the software tested did
not support an 80287 FPU.
	- I believe the I/O performance of the 11/34 to be greatly hampered by
the small number of kernel buffers it had (do you care ?). Changing the number
of free buffers (by umount) affects the I/O performance by 2X. The 512byte/block
filesystem doesn't help either. I don't know what the Masscomp filesystem
blocking factor is, but it may be 1kbyte. The 4.2BSD filesystem is very fast -
COPY on a 4.1BSD 780 takes 2.5X longer. 2.8 and 2.9BSD should give a performance
boost to the PDP-11's in I/O and system call overhead.
	- Of course, the PDP-11's were handicapped in the LOOP using a long.
In raw integer performance, the 11/44 is usually slightly faster than the 750.
	- Pyramid needs to speed up its C compiler.
	- NROFF appears to be the best general indicator of overall
performance. Comparing the normalization index, NROFF had a standard error of
.048. LOOP, for example, had a s.e. of .26 (i.e. wrong by 26% of a 780). If you
could only run one command on a system and wanted to know what the normalization
index would be like, the command "nroff -ms /dev/null" seems to be a fair
indication.
	- Unfortunately, I didn't benchmark terminal I/O, memory access and
addressing or process context switching performance - other important
measurements.
	- Some opinions/flames not to be taken too seriously: as it turns out,
those vague performance specs that DEC marketing uses seem actually on the mark.
For example, the 750 is 60% of a 780, looking at the normalization numbers. Also
the 11/23 is 80% of the 11/34 (uncached, the cache adds 25% average performance
to the 11/34). The 785 benchmarks I've seen also jibe with the marketing talk.
In contrast other vendors are considerably more optimistic about their product -
the Masscomp is supposed to be as fast as a 750 but seems really to be 70% of a
750 (an Eagle might help). The Pyramid is touted as being 2-4 times a 780 but
seems like 1.4X. The 3210 was spec'd as "slightly less than a 750 and will be
almost a 780", but is now less than 50% of a 750. Well, if DEC is also correct
about the MicroVAX I being 35% of a 780, it may not be so bad after all.
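The clock-rate guesses in the notes on the 3210 and the 286 amount to a simple
rescaling. If the kernel counts ticks at one rate while the reporting tool
converts them to seconds using another, the reported time is off by the ratio
of the two. The sketch below assumes the tools divide by 60 ticks per second;
the function name and the 2.0 second figure are hypothetical, not measurements.

```python
def corrected_seconds(reported_seconds, assumed_hz, actual_hz):
    """Rescale a CPU time that was converted from ticks using the wrong rate.

    The kernel counts ticks at actual_hz, but the reporting tool divides the
    tick count by assumed_hz, so the true time is larger by assumed/actual.
    """
    if actual_hz <= 0 or assumed_hz <= 0:
        raise ValueError("clock rates must be positive")
    return reported_seconds * assumed_hz / actual_hz

# Hypothetical reported figure of 2.0 s, not taken from the measurements.
print(corrected_seconds(2.0, assumed_hz=60, actual_hz=20))  # 286 guess, 3x
print(corrected_seconds(2.0, assumed_hz=60, actual_hz=10))  # 3210 guess, 6x
```

This only rescales user and system times; real time comes from the wall clock
and is why it was used for the 3210 indices.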

	I hope this info is of help. It looks like 32032/16 UNIX systems have a
little maturing to do. I plan to post another series of benchmarks on more
machines such as the 11/73 and the Ridge (unless I get too many flames.)

					Cheers,
					Dan Ts'o
					Dept. Neurobiology
					Rockefeller Univ.
					1230 York Ave.
					NY, NY 10021
					212-570-7671
					...cmcl2!rna!dan

Appendix of times:

  PYR	780	750	11/44	11/34	11/23	MASS	3210	PC/XT	286

LOOP
r 1	2	5	9	13	25	7	11	25	15
u 1.2	2.5	5.1	9.1	13.1	24.9	6.3		24.9	15.6
s 0	0	.1	.1	.1	.1	.2		.1	0
% 92	97	93				92		97	96
n 2.1	1	.49	.27	.19	.1	.38	.23	.080	.16

CC LOOP
r 5	3	5	10	12	18	8	18	41	18
u .9	.7	1.3	.8	1.1	2.0	1.4		8.4	5.7
s 1.6	1.6	2.7	2.9	4.1	6.9	2.7		9.5	3.3
% 47	68	73				54		43	48
n .6	1	.6	.3	.25	.17	.38	.17	.073	.17

SIEVE
r 1	2	4	4	5	9	4	7	12	4
u 1	2.5	4.1	3.4	5.2	9.8	4.2		11.6	4.5
s 0	0	.1	.1	.2	.3	.2		.4	0
% 88	99	99				107!		99	99
n 2.5	1	.61	.71	.46	.26	.57	.36	.21	.56

CC SIEVE
r 6	4	7	11	15	21	10	23	53	21
u 1.5	1.7	2.8	1.6	2.7	4.5	3.4		20.2	6.9
s 1.5	1.7	3	3.3	4.3	7.3	3		9.8	3.9
% 41	70	74				63		56	51
n .67	1	.57	.36	.27	.19	.4	.17	.075	.19

FLOAT
r 5	1	1	5	5	38	44	4	10	454
u 4.9	1.3	1.7	4.2	4.9	38	43.1		9.8	6.3
s 0	0	0	0	0	.1	1.1		.3	448.2
% 93	97	98				100		100	99
n .27	1	.76	.31	.27	.034	.030	.33	.13	.0029

GETPID
r 9	19	33	45	63	123	24	75	85	34
u 1.6	2.5	4.1	9.9	8.6	25.5	1.3		12.1	6.9
s 7.6	16.1	27.5	35.0	54.0	96.6	23.1		72.0	27
% 96	96	95				101!		98	99
n 2.0	1	.59	.41	.30	.15	.76	.25	.22	.55

GREP
r 3	4	8	9	10	17	10	20	30	11
u 2.6	3.5	6.9	5.5	6.7	10.8	6.6		23.0	8.7
s .3	.5	.8	2.2	3.0	5.5	2.3		3.5	1.5
% 84	95	97				88
n 1.3	1	.5	.44	.4	.24	.4	.2	.13	.39

COPY
r 1	2	2	12	16	16	8	21	43	20
u 0	0	0	0	.1	.17	0		.2	0
s .4	.7	.9	4.2	6.1	10.2	4		8.5	4.5
% 21	34	41				50		20	21
n 2	1	1	.16	.13	.13	.25	.1	.047	.10

NROFF
r 3	4	7	12	18	29	10	no -ms	33	15
u 1.4	2.9	5.2	7	11.1	18.8	7.7		21.2	9.0
s .4	.6	1	2.2	3.4	5.1	2.1		5	1.8
% 	75	83				97		79	72
n 1.3	1	.57	.33	.22	.14	.4		.12	.27

SORT
r 26	37	67	88	110	187	74	167	226	90
u 22.4	34.2	60.1	51.3	77.3	144.9	53.4		174.2	63.6
s 1.1	2.1	4.2	14.3	19.4	31.3	12.8		41.3	11.4
% 89	96	95				89		95	81
n 1.4	1	.55	.42	.34	.20	.5	.22	.16	.41