[net.arch] 11/08/85 Dhrystone Benchmark Results

gemini@hou2h.UUCP (R.GBADAMOSI) (11/09/85)

ANNOUNCEMENT
Attached, please find the 11/08/85 list of DHRYSTONE benchmark results.
The source code for the Dhrystone benchmark can be found in net.sources.

The latest list includes many new machine/compiler combinations.  Many
of the questionable results have been confirmed or corrected.  I am still
waiting for Intel 386 results; I'm sure many others are, too.

CLARIFICATION
There seems to have been a great deal of confusion over what this
benchmark measures, and how to use these results.  Let me try to clarify
this:

	1) DHRYSTONE is a measure of processor+compiler efficiency in
	   executing a 'typical' program.  The 'typical' program was
	   designed by measuring statistics on a great number of
	   'real' programs.  The 'typical' program was then written
	   by Reinhold P. Weicker using these statistics.  The
	   program is balanced according to statement type, as well
	   as data type.

	2) DHRYSTONE does not use floating point.  Typical programs don't.

	3) DHRYSTONE does not do I/O.  Typical programs do, but then
	   we'd have a whole can of worms opened up.

	4) DHRYSTONE does not contain much code that can be optimized
	   by vector processors.  That's why a CRAY doesn't look real
	   fast; they weren't built to do this sort of computing.

	5) DHRYSTONE does not measure OS performance, as it avoids
	   calling the O.S.  The O.S. is indicated in the results only
	   to help in identifying the compiler technology.

If somebody asked me to pick out the best machine for the money, I
wouldn't look at just the results of DHRYSTONE.  I'd probably:

	1) Run DHRYSTONE to get a feel for the compiler+processor
	   speed.
	2) Run any number of benchmarks to check disk I/O bandwidth,
	   using both sequential and random read/writes.
	3) Run a multitasking benchmark to check multi-user response
	   time.  Typically, these benchmarks run several types of
	   programs such as editors, shell scripts, sorts, compiles,
	   and plot the results against the number of simulated users.
	4) If appropriate for the intended use, run WHETSTONE, to determine
	   floating point performance.
	5) If appropriate for intended use, run some programs which do
	   vector and matrix computations.
	6) Figure out what the box will:
		- cost to buy
		- cost to operate and maintain
		- be worth when it is sold
		- be worth if the manufacturer goes out of business
	7) Having done the above, I probably have a handful of
	   machines which meet my price/performance requirements.
	   Now, I find out if the applications programs I'd like
	   to use will run on any of these machines.  I also find
	   out how much interest people have in writing new software
	   for the machine, and look carefully at the migration path
	   I will have to take when I reach the limits of the machine.

To summarize, DHRYSTONES by themselves are not anything more than
a way to win free beers when arguing 'Box-A versus Box-B' religion.
They do provide insight into Box-A/Compiler-A versus Box-A/Compiler-B
comparisons.

As usual, all comments and new results should be mailed directly
to me at ..{ihnp4,..others..}!houxm!castor!rer.  I will summarize
and post to the net.

Rick Richardson
PC Research, Inc.
(201) 834-1378
..!houxm!castor!rer

RESULTS
 *
 * MACHINE	MICROPROCESSOR	OPERATING	COMPILER	DHRYSTONES/SEC.
 * TYPE				SYSTEM				NO REG	REGS
 * --------------------------	------------	-----------	---------------
 * Commodore 64	6510-1MHz	C64 ROM		C Power 2.8	  36	  36
 * HP-110	8086-5.33Mhz	MSDOS 2.11	Lattice 2.14	 284	 284
 * IBM PC/XT	8088-4.77Mhz	PC/IX		cc		 257	 287
 * PERKIN-ELMER	3205		XELOS(SVR2) 	cc		 279	 296
 * IBM PC/XT	8088-4.77Mhz	COHERENT 2.3.43	MarkWilliams cc  296	 317
 * Cosmos	68000-8Mhz	UniSoft		cc		 305	 322
 * IBM PC/XT	8088-4.77Mhz	VENIX/86 2.0	cc		 297	 324
 * IBM PC	8088-4.77Mhz	MSDOS 2.0	b16cc 2.0	 310	 340
 * IBM PC	8088-4.77Mhz	MSDOS 2.0	CI-C86 2.20M	 390	 390
 * IBM PC/XT	8088-4.77Mhz	PCDOS 2.1	Wizard 2.1	 367	 403
 * IBM PC/XT	8088-4.77Mhz	PCDOS 3.1	Lattice 2.15	 403	 403 @
 * IBM PC	8088-4.77Mhz	PCDOS 3.1	Datalight 1.10	 416	 416
 * IBM PC/XT	8088-4.77Mhz	PCDOS 2.1	Microsoft 3.0	 390	 427
 * PDP-11/34	-		UNIX V7M	cc		 387	 438
 * IBM PC	8088, 4.77mhz	PC-DOS 2.1	Aztec C v3.2d	 423	 454
 * Tandy 1000	V20, 4.77mhz	MS-DOS 2.11	Aztec C v3.2d	 423	 458
 * Onyx C8002	Z8000-4Mhz	IS/1 1.1 (V7)	cc		 476	 511
 * PRO-380	11/73 with FPA	Venix 2.0 (SVR2) cc		 574	 632
 * Apollo DN550	68010-?Mhz	AegisSR9/IX	cc 3.12		 666	 666
 * HP-110	8086-5.33Mhz	MSDOS 2.11	Aztec-C		 641	 676 
 * ATT PC6300	8086-8Mhz	MSDOS 2.11	b16cc 2.0	 632	 684
 * IBM PC/AT	80286-6Mhz	PCDOS 3.0	CI-C86 2.1	 666	 684
 * Tandy 6000	68000-8Mhz	Xenix 3.0	cc		 694	 694
 * Macintosh	68000-7.8Mhz 2M	Mac Rom		Mac C 32 bit int 694	 704
 * Macintosh	68000-7.7Mhz	-		MegaMax C 2.0	 661	 709
 * Cadmus 9000	68010-10Mhz	UNIX		cc		 714	 735
 * Cadmus 9790	68010-10Mhz 1MB	SVR0,Cadmus3.7	cc		 720	 747
 * NEC PC9801F	8086-8Mhz	PCDOS 2.11	Lattice 2.15	 768	  -  @
 * ATT PC6300	8086-8Mhz	MSDOS 2.11	CI-C86 2.20M	 769	 769
 * ATT 3B2/300	WE32000-?Mhz	UNIX 5.0.2	cc		 735	 806
 * Apollo DN320	68010-?Mhz	AegisSR9/IX	cc 3.12		 806	 806
 * Atari 520ST  68000-8Mhz      TOS             DigResearch      839     846
 * IBM PC/AT	80286-6Mhz	PCDOS 3.0	MS 3.0(large)	 833	 847 LM
 * VAX 11/750	-		Ultrix 1.1	4.2BSD cc	 781	 862
 * VAX 11/750	-		Unix 4.2bsd	cc		 862	 877
 * Fast Mac	68000-7.7Mhz	-		MegaMax C 2.0	 839	 904 +
 * IBM PC/XT	8086-9.54Mhz	PCDOS 3.1	Microsoft 3.0	 833	 909 C1
 * Macintosh	68000-7.8Mhz 2M	Mac Rom		Mac C 16 bit int 877	 909 S
 * Perkin-Elmer 3220            Ed. 7 v2.3      cc		 892	 925
 * AT&T 6300	8086, 8mhz	MS-DOS 2.11	Aztec C v3.2d	 862	 943
 * IBM PC/XT	8086-9.54Mhz	PCDOS 3.1	Wizard 2.1	 892	 980 C1
 * IBM PC/XT	8086-9.54Mhz	PCDOS 3.1	Lattice 2.15	 980	 980 C1
 * PDP-11/73	KDJ11-AA 15Mhz	UNIX V7M 2.1	cc		 862     981
 * VAX 11/750	-		Unix 4.3bsd	cc		 994	 997
 * IRIS-1400	68010-10Mhz	Unix System V	cc		 909	1000
 * IBM PC/AT	80286-6Mhz	VENIX/86 2.1	cc		 961	1000
 * IBM PC/AT	80286-6Mhz	PCDOS 3.0	b16cc 2.0	 943	1063
 * IBM PC/AT	80286-6Mhz	PCDOS 3.0	MS 3.0(small)	1063	1086
 * VAX 11/750	-		VMS		VAX-11 C 2.0	 958	1091
 * Stride	68000-10Mhz	System-V/68	cc		1041	1111
 * ATT PC7300	68010-10Mhz	UNIX 5.2	cc		1041	1111
 * Stride	68000-12Mhz	System-V/68	cc		1063	1136
 * IBM PC/AT	80286-6Mhz	PCDOS 3.0	Datalight 1.10	1190	1190
 * ATT PC6300+	80286-6Mhz	MSDOS 3.1	b16cc 2.0	1111	1219
 * IBM PC/AT	80286-6Mhz	PCDOS 3.1	Wizard 2.1	1136	1219
 * Sun2/120	68010-10Mhz	Sun 4.2BSD	cc		1136	1219
 * IBM PC/AT	80286-6Mhz	PCDOS 3.0	CI-C86 2.20M	1219	1219
 * MASSCOMP 500	68010-10MHz	RTU V3.0	cc (V3.2)	1156	1238
 * Cyb DataMate	68010-12.5Mhz	Uniplus 5.0	Unisoft cc	1162	1250
 * PDP 11/70	-		UNIX 5.2	cc		1162	1250
 * IBM PC/AT	80286-6Mhz	PCDOS 3.1	Lattice 2.15	1250	1250
 * IBM PC/AT	80286-7.5Mhz	VENIX/86 2.1	cc		1190	1315 *15
 * Sun2/120	68010-10Mhz	Standalone	cc		1219	1315
 * Intel 380	80286-8Mhz	Xenix R3.0up1	cc		1250	1315 *16
 * ATT 3B2/400	WE32100-?Mhz	UNIX 5.2	cc		1315	1315
 * DG MV4000	-		AOS/VS 5.00	cc		1333	1333
 * IBM PC/AT	80286-6Mhz	MSDOS 3.0	Microsoft 3.0	1250	1388
 * ATT PC6300+	80286-6Mhz	MSDOS 3.1	CI-C86 2.20M	1428	1428
 * Cyb DataMate	68010-12.5Mhz	Uniplus 5.0	Unisoft cc	1470	1562 S
 * VAX 11/780	-		UNIX 5.2	cc		1515	1562
 * MicroVAX-II	-		-		-		1562	1612
 * VAX 11/780	-		Unix 4.3bsd	cc		1646	1662
 * Apollo DN660	-		AegisSR9/IX	cc 3.12		1666	1666
 * ATT 3B20	-		UNIX 5.2	cc		1515	1724
 * NEC PC-98XA	80286-8Mhz	PCDOS 3.1	Lattice 2.15	1724	1724 @
 * HP9000-500	B series CPU	HP-UX 4.02	cc		1724	-
 * IBM PC/STD	80286-8Mhz	MSDOS 3.0 	Microsoft 3.0	1724	1785 C2
 * DEC-2065	KL10-Model B	TOPS-20 6.1FT5	Port. C Comp.	1937	1946
 * Gould PN6005	-		UTX 1.1(4.2BSD)	cc		1675	1964
 * DEC2060	KL-10		TOPS-20		cc		2000	2000 &
 * VAX 11/785	-		UNIX 5.2	cc		2083	2083
 * VAX 11/785	-		VMS		VAX-11 C 2.0	2083	2083
 * VAX 11/785	-		Unix 4.3bsd	cc		2135	2136
 * Pyramid 90x	-		OSx 2.3		cc		2272	2272
 * ALTOS 586	8086-10Mhz	XENIX 3.0b	cc 		2194	2411 ?!
 * Pyramid 90x	FPA,cache,4Mb	OSx 2.5		cc		2777	2777
 * Pyramid 90x	-		OSx 2.5		cc		3125	3125
 * IBM-4341-II	-		VM/SP3		Waterloo C 1.2  3333	3333
 * SUN 3/75	68020-16.67Mhz	SUN 4.2 V3	cc		3333	3571
 * SUN-3/160    68020-16.67Mhz  Sun 4.2 V3.0A   cc		3381    3764
 * Sun 3/180	68020-16.67Mhz	Sun 4.2		cc		3333	3846
 * MC 5400	68020-16.67MHz	RTU V3.0	cc (V4.0)	3952	4054
 * NCR Tower32  68020-16.67Mhz  SYS 5.0 Rel 2.0 cc              3846	4545
 * Gould PN9080	-		UTX-32 1.1c	cc		-	4629
 * MC 5600/5700	68020-16.67MHz	RTU V3.0	cc (V4.0)	4504	4746 %
 * VAX 8600	-		Unix 4.3bsd	cc		7024	7088
 * VAX 8600	-		VMS		VAX-11 C 2.0	7142	7142
 * CCI POWER 6/32		COS(SV+4.2)	cc		7500	7800
 * IBM-3083	-		UTS 5.0 Rel 1	cc	       16666   12500
 * CRAY-1A	    80Mhz	CTSS		Cray C 2.0     12100   13888
 * IBM-3083	-		VM/CMS HPO 3.4	Waterloo C 1.2 13889   13889
 * Amdahl 470 V/8 		UTS/V 5.2       cc v1.23       15560   15560
 * CRAY-XMP/48	   105Mhz	CTSS		Cray C 2.0     15625   17857
 * Amdahl 580	-		UTS 5.0 Rel 1.2	cc v1.5        23076   23076
 * Amdahl 5860	 		UTS/V 5.2       cc v1.23       28970   28970
 *
 *   *  Crystal changed from 'stock' to listed value.
 *   +  This Macintosh was upgraded from 128K to 512K in such a way that
 *      the new 384K of memory is not slowed down by video generator accesses.
 *   %  Single processor; MC == MASSCOMP
 *   &  A version 7 C compiler written at New Mexico Tech.
 *   @  vanilla Lattice compiler used with MicroPro standard library
 *   S  Shorts used instead of ints
 *   LM Large Memory Model. (Otherwise, all 80x8x results are small model)
 *   C1 Univation PC TURBO Co-processor; 9.54Mhz 8086, 640K RAM
 *   C2 Seattle Telecom STD-286 board
 *   C? Unknown co-processor board?
 *   ?  I don't trust results marked with '?'.  These were sent to me with
 *      either incomplete information, or with times that just don't make sense.
 *	?? means I think the performance is too poor, ?! means too good.
 *      If anybody can confirm these figures, please respond.
 *
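
The Box-A/Compiler-A versus Box-A/Compiler-B comparison mentioned in the
summary can be sketched from the table itself.  This is only a rough
illustration: the rows are the six IBM PC/AT (80286-6Mhz, PCDOS 3.0) entries
copied from the list above, and the helper names relative_speeds and
register_gain are hypothetical, not part of the benchmark.

```python
# (compiler, NO REG, REGS) in Dhrystones/sec, copied from the
# IBM PC/AT 80286-6Mhz PCDOS 3.0 rows of the results table.
pc_at_rows = [
    ("CI-C86 2.1", 666, 684),
    ("MS 3.0 (large)", 833, 847),
    ("b16cc 2.0", 943, 1063),
    ("MS 3.0 (small)", 1063, 1086),
    ("Datalight 1.10", 1190, 1190),
    ("CI-C86 2.20M", 1219, 1219),
]


def relative_speeds(rows):
    """Scale each compiler's REGS figure to the slowest compiler in rows."""
    baseline = min(regs for _, _, regs in rows)
    return {name: regs / baseline for name, _, regs in rows}


def register_gain(no_reg, regs):
    """Fractional speedup from register declarations for one table row."""
    return (regs - no_reg) / no_reg


speeds = relative_speeds(pc_at_rows)
for name, no_reg, regs in pc_at_rows:
    print(f"{name:16s} {speeds[name]:.2f}x  register gain {register_gain(no_reg, regs):.1%}")
```

Ratios like these compare compilers on one box; as the clarification above
says, they are not a basis for choosing between boxes.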

NOTE NOTE NOTE - Do not reply to this article with 'r'.  I am using a
borrowed login, since my usual machine hasn't been getting news lately. -Rick

jqj@cornell.UUCP (J Q Johnson) (11/12/85)

Although I remain unconvinced about the validity of Dhrystone benchmarks as
a useful measure of system throughput, I applaud Richardson's well-presented
discussion of what benchmarks like this do and don't mean.  One major question
that remains in my mind with respect to Dhrystone benchmarks is the effect
of cache.  Has anyone looked at Dhrystone benchmarks with this in mind?
How much does a typical cache architecture (say a 4K 2-way associative
cache, or the onboard cache on a 68020) affect Dhrystone performance?

Dhrystone is a small program, and as a result may have a quite atypical
performance profile even if its balance of instruction types is typical.
Cache performance is one source of such variation.  Another might be the
fact that typical Dhrystone implementations exist in a single file.  Some
compilers (e.g. Tartan C) treat this as an opportunity to optimize
subroutine calling conventions (using JSR instead of CALLS on a VAX).

carter@masscomp.UUCP (Jeff Carter) (11/13/85)

In article <643@cornell.UUCP> jqj@cornell.UUCP (J Q Johnson) writes:
>that remains in my mind with respect to Dhrystone benchmarks is the effect
>of cache.  Has anyone looked at Dhrystone benchmarks with this in mind?
>How much does a typical cache architecture (say a 4K 2-way associative
>cache, or the onboard cache on a 68020) affect Dhrystone performance?
>
I ran the Dhrystone for the MASSCOMP 5000-series machines, all of which are
based on the 68020. The primary differences between the CPU modules are the
cache architecture and the Translation Buffer. The results for these machines
are extracted from Richardson's article:

 * MC 5400	68020-16.67MHz	RTU V3.0	cc (V4.0)	3952	4054
 * MC 5600/5700	68020-16.67MHz	RTU V3.0	cc (V4.0)	4504	4746 

The MC-5400 uses a cache with the following characteristics:
	8KByte size
	Direct-Mapped
	Write-Through
	8 Byte Block size
	Cache by Process Virtual Address

The MC-5600 and MC-5700 use a cache with the following:
	8KByte size
	2-Way Associative
	Write-Through
	8 Byte Block Size
	Cache by Physical Address

Of course, both use the '020 internal instruction cache. Several other
vendors using '020s with different cache architectures are represented in
the results; it is informative to examine these.

Both systems are zero wait states on read cache hit, and zero wait states on
write (regardless of cache hit/miss). The translation buffer is quite different
in these 2 systems, but due to the small program size, I doubt that this has
any effect. The effect of the 2-way cache is (probably) to cache the
instructions in one cache bank and the data in the other bank (not really,
but it serves as a nice model). In a direct-mapped cache, the instructions and
data can tend to kick each other out as the program loops.
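
The bank model above can be illustrated with a toy simulation.  This is a
minimal sketch, not a model of the MASSCOMP hardware: the LRU replacement
policy, the two example addresses, and the function name count_misses are
assumptions made for illustration.  Only the 8KByte size, the 8 byte block,
and the direct-mapped versus 2-way organization come from the lists above.

```python
from collections import OrderedDict


def count_misses(addresses, cache_bytes=8192, block_bytes=8, ways=1):
    """Count misses for a set-associative cache with LRU replacement (assumed)."""
    num_sets = cache_bytes // (block_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // block_bytes
        index = block % num_sets
        tag = block // num_sets
        lines = sets[index]
        if tag in lines:
            lines.move_to_end(tag)         # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = True
    return misses


# One instruction fetch and one data reference that are exactly 8 KBytes
# apart, so they compete for the same set in both organizations.
code_addr = 0x0000
data_addr = 0x2000
trace = [code_addr, data_addr] * 100       # a loop touching both, 100 times

direct_mapped_misses = count_misses(trace, ways=1)
two_way_misses = count_misses(trace, ways=2)
print(direct_mapped_misses, two_way_misses)
```

With this deliberately adversarial trace the direct-mapped cache misses on
every reference, while the 2-way cache misses only on the first touch of each
address.  Real programs fall somewhere in between.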

Jeff Carter
MASSCOMP
1 Technology Park
Westford, MA 01886
...!{ihnp4|decvax|allegra}!masscomp!carter

mat@amdahl.UUCP (Mike Taylor) (11/14/85)

> ...One major question that remains in my mind with respect to Dhrystone
> benchmarks is the effect of cache...
> Dhrystone is a small program, and as a result may have a quite atypical
> performance profile even if its balance of instruction types is typical.

Cache locality for instruction fetch is certainly pretty good for
Dhrystone.  However, this kind of locality is typical for many
user applications, as Dhrystone is represented to be.  Operand locality
is much worse, again the usual situation.  While there are always
exceptions, this kind of pattern is typical in the data that we collect
for applications (not supervisory code).  Based on the well-known
S/370-type machines in the list (470, 580, 3083, 4341) the results track
real experience surprisingly well.  In our experience, it is supervisory
code that really shows the differences in cache performance to extremes.
-- 
Mike Taylor                        ...!{ihnp4,hplabs,amd,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

crowl@rochester.UUCP (Lawrence Crowl) (11/14/85)

In article <643@cornell.UUCP> jqj@cornell.UUCP (J Q Johnson) writes:
>....  One major question
>that remains in my mind with respect to Dhrystone benchmarks is the effect
>of cache.  Has anyone looked at Dhrystone benchmarks with this in mind?
>....

The Dhrystone benchmark originated in:

        Reinhold P. Weicker
        Dhrystone: A Synthetic Systems Programming Benchmark
        Communications of the ACM
        Volume 27, Number 10, Pages 1013-1030, October 1984

In the article, the benchmark is timed for only one run.  This eliminates the
effect of cache on multiple runs, but not for the single run, which is as it
should be.  Multiple runs will skew the results heavily towards machines with
a cache.  Multiple runs are a timing convenience and are not in the spirit of
a 'systems' benchmark because systems programs usually do not exhibit this
behavior.  This convenience is acceptable for machines without ANY cache, but
it is not acceptable for machines with a cache.  Therefore, do not accept the
results of a multiple run timing on machines with a cache.  Insist that the
benchmarker get out a logic analyzer and time a single run.  I know this is a
major hassle, but it is the only comparable result.  I suggest you read the
article if you intend to use the results.
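
The distinction between a single timed run and a loop of repeated runs can be
sketched as follows.  This is only an illustration of the two measurements:
stand_in_workload is a hypothetical placeholder for one pass through the
benchmark body, and timings taken this way in an interpreted language reflect
far more than hardware cache behavior.

```python
import time


def time_single_and_repeated(workload, loops):
    """Return (seconds for the first run, mean seconds per run over loops runs)."""
    start = time.perf_counter()
    workload()                       # first, "cold" run
    first = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(loops):           # repeated runs can reuse cached state
        workload()
    mean = (time.perf_counter() - start) / loops
    return first, mean


def stand_in_workload():
    # Hypothetical placeholder for one pass through the benchmark body.
    return sum(range(1000))


first_run, mean_run = time_single_and_repeated(stand_in_workload, loops=1000)
runs_per_second = 1.0 / mean_run     # the figure a looped timing would report
print(f"first run {first_run:.6f}s, looped mean {mean_run:.6f}s")
```

A looped timing reports only the second number, which is the objection raised
above for machines with a cache.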
-- 

Lawrence Crowl             716-275-5766        University of Rochester
                                               Computer Science Department
...!{allegra,decvax,seismo}!rochester!crowl    Rochester, New York,  14627

stubbs@ncr-sd.UUCP (Jan Stubbs) (11/23/85)

In article <643@cornell.UUCP> jqj@cornell.UUCP (J Q Johnson) writes:
>How much does a typical cache architecture (say a 4K 2-way associative
>cache, or the onboard cache on a 68020) affect Dhrystone performance?
>
I have found the 256 byte instruction cache on the 68020 to have a small
impact ( <10% ) on Dhrystone performance, and on any other larger benchmark
program. This is easy to measure, as you can turn off the cache. Much more
important are off-chip data and instruction caches, if large enough.

Jan Stubbs sdcsvax!ncr-sd!stubbs  619 485-3052

campbell@sauron.UUCP (Mark Campbell) (11/25/85)

In article <340@ncr-sd.UUCP> stubbs@ncr-sd.UUCP (0000-Jan Stubbs) writes:
>In article <643@cornell.UUCP> jqj@cornell.UUCP (J Q Johnson) writes:
>>How much does a typical cache architecture (say a 4K 2-way associative
>>cache, or the onboard cache on a 68020) affect Dhrystone performance?
>>
>I have found the 256 byte instruction cache on the 68020 to have small impact
>on dhrystone performance ( <10% ) and in any other larger benchmark program.
>This is easy to measure as you can turn off the cache. Much more important are 
>off chip data and instruction caches if large enough.

It is difficult to evaluate the effects of cache architecture with respect to
the performance of a system without taking into account the implementation
of that architecture and the implementation of the rest of the system.  The
MC68020 internal (on-chip) cache is an excellent example: the implementation
details of the system in which the MC68020 resides are extremely important.

I also executed the Dhrystone benchmark on an MC68020-based system with the
internal cache disabled and enabled.  I found the performance degradation
due to disabling the internal cache to be a minimum of 15%, with an average
degradation of 18% (from 4545 to 3846).  Some details of the system on which
I obtained these numbers are given below:

	Machine:		NCR Tower 32
	Clock Rate:		16.67 MHz
	External Caches:	0 Wait-State, 6K Direct-Mapped Program Cache
				1 Wait-State, 2K Direct-Mapped Data Cache
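
The two Dhrystone figures above can be turned into percentages in two ways.
A minimal sketch, assuming 4545 is the cache-enabled figure and 3846 the
cache-disabled one; how the quoted 18% average was actually formed is not
stated, so the second reading below is only one possibility.

```python
enabled = 4545    # Dhrystones/sec with the internal cache enabled (assumed reading)
disabled = 3846   # Dhrystones/sec with the internal cache disabled (assumed reading)

# Loss measured against the faster, cache-enabled figure.
slowdown = (enabled - disabled) / enabled

# Gain measured against the slower, cache-disabled figure.
speedup = (enabled - disabled) / disabled

print(f"slowdown when disabling: {slowdown:.1%}")
print(f"speedup when enabling:   {speedup:.1%}")
```

The same pair of numbers gives about 15% or about 18% depending on which
figure is used as the base, which is worth keeping in mind when comparing
cache results from different postings.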

Given these results, I have a problem understanding the less than 10%
performance degradation given in the preceding article.  It seems to me
that the configuration I used would be very close to the worst case for
almost any architecture.  An n-way set associative cache might decrease
the degradation, but given the nature of the Dhrystone benchmark I doubt
that this decrease would be noticeable.

The minimum penalty for missing the MC68020 internal cache is one cycle
(60 ns at 16.67 MHz).  Decreasing the clock rate causes the minimum
penalty to increase to 80 ns, thus increasing the performance degradation
due to the disabling of the internal cache.  Likewise, increasing the number
of external program cache wait-states causes the performance degradation
to increase.

With many other architectures the disabling of the internal cache would cause
a much larger performance degradation.  The Sun 3, for example, uses a
"syncopated" clock to achieve an average memory access time of 90 ns (1.5
wait-states).  Thus the performance degradation increases dramatically without
the MC68020 cache.  [Note: Sun people...please correct me if I'm wrong -- EDN
isn't the best place to get technical information].  This will be the case in
any system in which the second level of the memory hierarchy with respect to
program accesses (after the first level, the internal cache) can't be accessed
in the very small, no wait-state timing window of external bus fetches.
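
The timing figures in the last two paragraphs follow from the clock period.
A minimal sketch of that arithmetic; the 12.5 MHz clock is an assumption,
chosen because it is the rate whose period is 80 ns, and the helper names
are hypothetical.

```python
def cycle_ns(clock_mhz):
    """Clock period in nanoseconds for a clock rate given in MHz."""
    return 1000.0 / clock_mhz


def access_ns(clock_mhz, cycles):
    """Time in nanoseconds for an access lasting the given number of clock periods."""
    return cycles * cycle_ns(clock_mhz)


penalty_fast = cycle_ns(16.67)        # one-cycle miss penalty at 16.67 MHz
penalty_slow = cycle_ns(12.5)         # same penalty at an assumed 12.5 MHz clock
sun3_access = access_ns(16.67, 1.5)   # 1.5 clock periods at 16.67 MHz

print(f"{penalty_fast:.1f} ns, {penalty_slow:.1f} ns, {sun3_access:.1f} ns")
```

These come out at roughly 60 ns, 80 ns, and 90 ns, matching the figures
quoted above.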
-- 

Mark Campbell
Phone:  (803)-791-6697
E-Mail: {decvax!mcnc, ihnp4!msdc}!ncsu!ncrcae!sauron!campbell