[comp.arch] i860 Dhrystones

mash@mips.COM (John Mashey) (03/11/89)

In article <93452@sun.uucp> garner@sun.UUCP (Robert Garner) writes:
.....
>The judgement so far is that the 22% improvement must be coming from
>the FORTRAN version.  As I've never seen a FORTRAN version of 
>Dhrystone, does anyone at Intel have the source that they could
>post on the net?  Or was the reference to the Fortran
>compiler a typo?  (The Performance brief remarks:
>"Dhrystone was developed in ADA by R. Weicker in 1984.
>Fortran and C versions of the benchmark are more commonly used.")

>It will be interesting to see how pointers and structures are handled.
>Also, which Fortran library routine was used to do the string copies?

If it is indeed true that this is no typo, it is fascinating,
as it is well-known that C's byte-by-byte copy can add 25-30% over
(for example) PASCAL, or anything that has fixed-length character strings.
(On an R3000, 34% of Dhrystone is in strcmp & strcpy.)
Of course it depends on what kind of code-generation is done, also,
i.e., in-line versus out-of-line.

In reading the Intel i860(TM) performance document, it is interesting
to note that 7 benchmarks are presented:
Dhrystone 1.1 and 2.1, Stanford Integer, SP & DP Whetstone,
and FORTRAN and Coded LINPACK.

Of these, the document claims that Dhrystone was in FORTRAN,
and Stanford and LINPACK were simulated (with zero-wait-state memory;
this makes little difference to Stanford, as it mostly fits in the cache.)

That means, that in terms of published benchmarks that seem apples-to-apples
comparable, the sum total is:  SP & DP Whetstone.

On Feb 27, Green Hills announced it was shipping C and FORTRAN compilers
for the i860, so it does seem a little strange that C wouldn't have been
used.  Maybe it is a typo, and there's some other reason.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

clif@intelca.intel.com (Ken Shoemaker) (03/14/89)

In article <15074@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
> In article <93452@sun.uucp> garner@sun.UUCP (Robert Garner) writes:
> .....
> >The judgement so far is that the 22% improvement must be coming from
> >the FORTRAN version.  As I've never seen a FORTRAN version of 

Mashey writes
> 
> If it is indeed true that this is no typo, it is fascinating,
> as it is well-known that C's byte-by-byte copy can add 25-30% over
> (for example) PASCAL, or anything that has fixed-length character strings.
> (On an R3000, 34% of Dhrystone is in strcmp & strcpy.)
> In reading the Intel i860(TM) performance document, it is interesting
> to note that 7 benchmarks are presented:
> Dhrystone 1.1 and 2.1, Stanford Integer, SP & DP Whetstone,
> and FORTRAN and Coded LINPACK.
> 
> That means, that in terms of published benchmarks that seem apples-to-apples
> comparable, the sum total is:  SP & DP Whetstone.

The i860 CPU benchmark report had a TYPO: the Dhrystone benchmark used
the Green Hills C compiler, not FORTRAN.
Sorry to disappoint everyone who thought that we were getting great
Dhrystone numbers by rewriting the benchmark in FORTRAN.

As for the simulated numbers versus actual numbers: we have an excellent
correlation (within 3%) between simulated numbers and actual numbers.

My speculation (note the word speculation) as to why the Dhrystone 
numbers are so good is: 

	Clock Frequency
	128-bit loads for string instructions
	The clocks/instruction is 1 (I imagine other RISC chips
	approach 1 clock/instruction but don't actually obtain it)

Clif Purkiser
Intel Corp.

The above views are mine and don't represent Intel's official position.

mash@mips.COM (John Mashey) (03/14/89)

In article <210@intelca.intel.com> clif@intelca.intel.com (Ken Shoemaker) writes:
...
>The i860 CPU benchmark report had a TYPO: the Dhrystone benchmark used
>the Green Hills C compiler, not FORTRAN.
>Sorry to disappoint everyone who thought that we were getting great
>Dhrystone numbers by rewriting the benchmark in FORTRAN.
>
>As for the simulated numbers versus actual numbers: we have an excellent
>correlation (within 3%) between simulated numbers and actual numbers.
>
>My speculation (note the word speculation) as to why the Dhrystone 
>numbers are so good is: 
>
>	Clock Frequency
>	128-bit loads for string instructions
>	The clocks/instruction is 1 (I imagine other RISC chips
>	approach 1 clock/instruction but don't actually obtain it)

Thanx for the correction; that certainly saves wasting some time.

1) Can you say any more words on simulations?  I.e., everybody
understands that the memory system is irrelevant for almost-100%-cache-hit
programs [Dhrystone, Stanford, Whetstone], but we'd be surprised
that a 5-wait-state machine (the measured one) and the zero-wait-state
machine (the simulated one) would be within 3% on DP LINPACK, given the
speed of the basic FP ops.  Could the zero-wait-state thing also be a typo?

2) OK, I give up.  There must be something unbelievably clever going on
to use 128-bit loads for C-language string operations. I've looked
at the i860 Programmer's Reference Manual a bunch, trying to figure
out how to use either the FP unit or the graphics unit to do this.
The string copy on page 9-5 of the manual is the "natural" strcpy
(which doesn't use anything but byte load/store, and takes about 5 cycles/byte).
I haven't been able to find anything like "branch on any byte zero", and the 860
doesn't have unaligned word operations.  For a fair test, you MUST
use str* that only assume byte alignment of operands, and
you can't inline the str*.  The only place I can think of using 128-bit
loads is in the structure-copy, and it shouldn't be used there,
unless structures whose largest entities are words are always aligned
to 4-word boundaries, which seems unlikely.

3) Anyway, various people at various companies still can't figure
out why the number can reasonably be this high, under the
normal rules, UNLESS there's some really slick trick for
getting strcpy and strcmp down around 2 cycles/byte.
There just aren't enough differences between an R3000 and an 860,
on this benchmark, to account for this otherwise. [Everything
fits in the caches; an 860 wins some places, an R3000 wins in some
places; the R3000 has essentially no write stalls on this benchmark,
so difference between write-thru and writeback is irrelevant; etc;
since something like 40% of the time is spent in str*, and the rest is
spread around; it's really the major place to look.]
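
A back-of-the-envelope sketch of why str* is the place to look, using
Amdahl's law.  The inputs are assumptions for illustration only: str*
at 40% of run time (the "something like 40%" above) and a drop from
about 5 cycles/byte to about 2 cycles/byte.  Mixing an R3000 time
fraction with i860 cycle counts only gives an order of magnitude, and
overall_speedup is a hypothetical name.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction `frac` of the run time
   is made `local_speedup` times faster and the rest is unchanged. */
double overall_speedup(double frac, double local_speedup)
{
    assert(frac >= 0.0 && frac <= 1.0);
    assert(local_speedup > 0.0);
    return 1.0 / ((1.0 - frac) + frac / local_speedup);
}
```

Under those assumptions, overall_speedup(0.40, 5.0 / 2.0) is
1 / (0.60 + 0.16), a little over 1.3, which is the same order as the
gap being discussed.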

Maybe somebody at Intel would care to post the str* routines
and educate us?
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>

cprice@mips.COM (Charlie Price) (03/16/89)

In article <93088@sun.uucp> garner@sun.UUCP (Robert Garner) writes:
>Question:  What's going on with the i860 Dhrystone/MHz ratio?
>
>The Intel "i860 Processor Performance" brief--Release 1.0, March 89--shows
>82,900 Dhrystones/sec for version 1.1 for a scaled 40-MHz i860.
>With compiler improvements and elimination of "errata on the current
>stepping of the i860 processor", it says they expect to push the value to 90K.
>
>According to the paper, 69K Dhrystones/sec were measured on a Compaq 386/20
>add-in card with a 33-MHz i860 and 8MB of SCRAM (0-wait cycles for hits,
>5-W for read miss, and 2-W for write misses). 
>
>As a sanity check to a similar micro-architecture implementation (split
>i&d caches, 1-cycle load/store), the 25-MHz R3000 value is 42,300
>Dhrystones/sec (w/ MIPS -O3, i.e., interprocedural register allocation).
>
>Since the i860 integer/cache micro-architecture is so similar to
>the R3000 integer/cache micro-architecture, and assuming that Intel's
>compiler technology is not significantly better than MIPSCo's,
>shouldn't an i860 value equal a scaled R3000 value?
>
>Scaling the 25-MHz R3000 value up to 40 MHz gives 67,680 Dhrystones/sec.
>So where did Intel get the extra 22% ?

The 25 MHz R3000 can actually get 51,800 1.1 dhrystones per second.
This number scales to 82,880 at 40 MHz, within 100 of Intel's figure.
The 51.8K number, however, is beyond the spirit of the benchmark.
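
The scaling used throughout this thread is plain proportionality to
clock rate.  A small sketch of that arithmetic, assuming (as the posts
do) that a cache-resident benchmark scales linearly with MHz; the
function names are illustrative only:

```c
#include <assert.h>

/* Scale a Dhrystones/sec figure linearly with clock rate.
   Only reasonable when the benchmark runs almost entirely from cache. */
long scale_dhrystones(long per_sec, long from_mhz, long to_mhz)
{
    assert(from_mhz > 0);
    return per_sec * to_mhz / from_mhz;
}

/* Whole-number percentage by which `claimed` exceeds `baseline`
   (truncated toward zero). */
long percent_over(long claimed, long baseline)
{
    assert(baseline > 0);
    return (claimed - baseline) * 100 / baseline;
}
```

With these, 42,300 at 25 MHz scales to 67,680 at 40 MHz, 45,300 scales
to 72,480, 51,800 scales to 82,880, and Intel's 82,900 is 22% above
67,680.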

What do dhrystone numbers mean?
MIPS has maintained for a long time that they are not especially
meaningful numbers.  Our existing results show that you need
a lot of context to have any idea what a number is telling you.

MIPS has just released Performance Brief 3.6
(March 89 -- I assume that Mashey will post it sometime)
and it has an interesting set of numbers for dhrystone.
There are numbers for two different compiler releases and one
number for the newer compilers with assembly language versions
of strcpy() and strcmp().

dhrystone 1.1 -- M/2000-8 (25 MHz R3000) (numbers are Kilo-loops)

                          Default opt   -O     -O3    -O4
1.31 compiler             32.4 (K)      39.7   42.3   45.3
2.00 compiler             32.6 (K)      39.7   43.1   46.7
2.00 with new str rtns                         47.4
2.00 with new str rtns                                51.8
  (my measurement, not in Brief)

dhrystone 2.1 -- M/2000-8 (25 MHz R3000) (numbers are Kilo-loops)

                          Default opt   -O     -O3    -O4
1.31 compiler             33.0 (K)      36.7   38.8   42.8
2.00 compiler             32.4 (K)      36.7   39.4   43.2

>(1) More aggressive compiler optimizations?
>MIPSCo's -O4 value, which according to MIPSCo's Performance Brief
>"is beyond the spirit of the benchmark", is 45,300 Dhrystones/sec.
>Scaled to 40 MHz, this still falls short at 72,480 Dhrystones/sec.

MIPS believes (and says in the Performance Brief) that -O4 is
beyond the spirit of the benchmark.  We include the -O4 results
to show what is possible, and for illumination of what a dhrystone
figure without context can mean since not all quoted figures are
The "-O4" optimizations give
7.1% for the 1.31 compiler,
8.4% for the 2.00 compiler with string routines in C, and
9.3% for the 2.00 compiler with strcmp() and strcpy() in assembler.

If you don't know exactly what the optimizations are, it is hard to
say what a dhrystone number might mean.

>(2) Faster string copy/compare with graphics instructions?

This can be quite important.
For dhrystone 1.1 on an M/2000 assembly-language routines give
9.3% increase for -O3 (within the spirit of the benchmark), and
11.6% increase for -O4 (too much optimization).
I believe that these lib routines are just assembler,
not especially tuned for the dhrystone string lengths.
If you wanted faster dhrystone numbers, you could get them with
string routines that worked generally, but especially well for
the specific length operations that dhrystone does.

Again, if you don't know much about the libraries, you can't determine
what the dhrystone number is telling you.

The range of dhrystone figures for a single MIPS machine
might be interesting because it tells you something about
the compilers, but a single "dhrystones for this machine"
just doesn't mean much.
-- 
Charlie Price    cprice@mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086

hanko@masscomp.UUCP (Jim Hanko) (03/17/89)

In article <15226@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <210@intelca.intel.com> clif@intelca.intel.com (Ken Shoemaker) writes:
>...
>>The i860 CPU benchmark report had a TYPO: the Dhrystone benchmark used
>>the Green Hills C compiler, not FORTRAN.
>>My speculation (note the word speculation) as to why the Dhrystone 
>>numbers are so good is:  ...
>>
>>	128-bit loads for string instructions
>
>
>2) OK, I give up.  There must be something unbelievably clever going on
>to use 128-bit loads for C-language string operations. ...
>...  For a fair test, you MUST
            ^^^^^^^^^
>use str* that only assume byte alignment of operands, and
>you can't inline the str*.  ...
>
>3) Anyway, various people at various companies still can't figure
>out why the number can reasonably be this high, under the
>normal rules, UNLESS there's some really slick trick for
>getting strcpy and strcmp down around 2 cycles/byte.

A couple of years ago I investigated the output of the Green Hills C compiler
on the Dhrystone benchmark (for a different architecture).  I remember being
somewhat surprised to see that the compiler had inlined the strcpy calls.  It
could do this since most of the calls were of the form: 
	strcpy(x, "a constant string");

I believe that it did not actually copy the bytes from memory but loaded long
immediate values and stored them.  
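
A minimal sketch of the kind of transformation described above, under
stated assumptions: the string "ABCDEFG", the helper pack4, and both
function names are hypothetical, and the destination is assumed to be
32-bit aligned.  It is not taken from any actual compiler output.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* What the source asks for: a call to the library routine, which
   loads and stores one byte at a time until it sees the NUL. */
void copy_by_call(char *dst)
{
    strcpy(dst, "ABCDEFG");
}

/* Pack four characters into a 32-bit word in this machine's memory
   order.  With constant arguments an optimizer can fold the result
   into a single immediate value. */
uint32_t pack4(char a, char b, char c, char d)
{
    unsigned char bytes[4];
    uint32_t w;
    bytes[0] = (unsigned char)a;
    bytes[1] = (unsigned char)b;
    bytes[2] = (unsigned char)c;
    bytes[3] = (unsigned char)d;
    memcpy(&w, bytes, sizeof w);
    return w;
}

/* The inlined form: the source string is never read from memory.
   Two word stores write all 8 bytes, NUL terminator included.
   Taking uint32_t * makes the alignment assumption explicit. */
void copy_by_word_stores(uint32_t *dst_words)
{
    dst_words[0] = pack4('A', 'B', 'C', 'D');
    dst_words[1] = pack4('E', 'F', 'G', '\0');
}
```

Both routines leave the same 8 bytes in the destination; the second
does it with no loads and no per-byte loop, which is only possible when
the source is a compile-time constant and the destination alignment is
known.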

Although strcpy is extensively called with string constants in
Dhrystone, this is relatively rare in real programs.  Therefore, such a
compiler feature seems to be targeted specifically to Dhrystone. 

I can't say that the Intel version of the compiler has this "optimization"
(or if it did that Intel knew about it), but this may explain the high
numbers.  Can anyone with access to the compiler check this?

I think it would clearly be unfair to compare Dhrystone numbers where this
trick was used to those where a strcpy subroutine was called. 

-
#include <std/disclaimer>

Jim Hanko		{uunet|decvax|harvard|mit-eddie}!masscomp!hanko

chase@Ricerca.orc.olivetti.com (David Chase) (03/17/89)

In article <955@masscomp.UUCP> hanko@masscomp.UUCP (Jim Hanko) writes:
>Although strcpy is extensively called with string constants in
>Dhrystone, this is relatively rare in real programs.  Therefore, such a
>compiler feature seems to be targeted specifically to Dhrystone. 
>
>I think it would clearly be unfair to compare Dhrystone numbers where this
>trick was used to those where a strcpy subroutine was called. 

Get real.  Nobody with half a brain should trust silly little
benchmark programs that reduce performance to a single number.
Develop benchmarks based on real code that does real work, and perhaps
compiler writers will target all those "unfair"
optimizations at code that people actually use.  Procedure inlining is
"not in the spirit of Dhrystone", but it would be stupid not to use it
for real programs if it was reasonably implemented.

When I want to compare processors, I run the programs that I use every
day.  The one that works best on those is the one that works best for me.

David

aglew@mcdurb.Urbana.Gould.COM (03/18/89)

>Although strcpy is extensively called with string constants in
>Dhrystone, this is relatively rare in real programs.
>
>Jim Hanko		{uunet|decvax|harvard|mit-eddie}!masscomp!hanko

Have you got a source for this,
or can you post numbers?

jesup@cbmvax.UUCP (Randell Jesup) (03/18/89)

In article <15226@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <210@intelca.intel.com> clif@intelca.intel.com (Ken Shoemaker) writes:
>...
>>The i860 CPU benchmark report had a TYPO: the Dhrystone benchmark used
>>the Green Hills C compiler, not FORTRAN.
>>Sorry to disappoint everyone who thought that we were getting great
>>Dhrystone numbers by rewriting the benchmark in FORTRAN.
...
>>My speculation (note the word speculation) as to why the Dhrystone 
>>numbers are so good is: 
>>
>>	Clock Frequency
>>	128-bit loads for string instructions
>>	The clocks/instruction is 1 (I imagine other RISC chips
>>	approach 1 clock/instruction but don't actually obtain it)
...
>2) OK, I give up.  There must be something unbelievably clever going on
>to use 128-bit loads for C-language string operations. I've looked
...
>doesn't have unaligned word operations.  For a fair test, you MUST
>use str* that only assume byte alignment of operands, and
>you can't inline the str*.  The only place I can think of using 128-bit
>loads is in the structure-copy, and it shouldn't be used there,
>unless structures whose largest entities are words are always aligned
>to 4-word boundaries, which seems unlikely.

	Actually, I think the statement "Green Hills C" was the giveaway.
We use Green Hills C here at Commodore for Amiga OS work, and got bitten recently
because the compiler was set up with the "dhrystone" optimizer turned on,
without our knowing it.  This causes mis-aligned strcpy()s to bus-fault on
68000, since it (a) assumes string sources AND destinations are ALWAYS word-
aligned, and (b) inlines strcpy, even though in general Green Hills doesn't
do inlining.

	So I suspect the differences are being caused by the "dhrystone" switch
in Green Hills.
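
The failure described above comes from doing word accesses through
pointers that are not word-aligned.  A minimal sketch of the check a
safe word-at-a-time copy would need first.  The names is_aligned and
can_copy_by_words are hypothetical, and reading alignment from the low
bits of a pointer converted to uintptr_t is an assumption that holds on
common flat-address machines but is not guaranteed by ISO C.

```c
#include <assert.h>
#include <stdint.h>

/* True when p is a multiple of `align`, which must be a power of two. */
int is_aligned(const void *p, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Word moves are only safe when BOTH ends are aligned; on a 68000 a
   16-bit access to an odd address traps.  Otherwise a routine has to
   fall back to byte moves. */
int can_copy_by_words(const void *dst, const void *src, uintptr_t word_bytes)
{
    return is_aligned(dst, word_bytes) && is_aligned(src, word_bytes);
}
```

An inlined copy that skips this test is correct only if the compiler
can prove the alignment, for example when both operands are objects it
laid out itself.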

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

mash@mips.COM (John Mashey) (03/18/89)

In article <39388@oliveb.olivetti.com> chase@Ricerca.UUCP (David Chase) writes:
>In article <955@masscomp.UUCP> hanko@masscomp.UUCP (Jim Hanko) writes:
....
>>I think it would clearly be unfair to compare Dhrystone numbers where this
>>trick was used to those where a strcpy subroutine was called. 
...
>Get real.  Nobody with half a brain should trust silly little
>benchmark programs that reduce performance to a single number.
>Develop benchmarks based on real code that does real work, and perhaps
>compiler writers will target all those "unfair"
>optimizations at code that people actually use.  Procedure inlining is
>"not in the spirit of Dhrystone", but it would be stupid not to use it
>for real programs if it was reasonably implemented.
>
>When I want to compare processors, I run the programs that I use every
>day.  The one that works best on those is the one that works best for me.

A lot of us do this, and we also try to publish it.
So far, Intel:
	-published the results of EXACTLY ONE integer benchmark (Dhrystone
	1.1 & 2.1) actually measured on this machine (@ 33MHz)
	-features this number prominently in its marketing claims
	(well, actually, it features the numbers that would be gotten at 40MHz,
	or sometimes 50MHz)
	-uses it frequently to claim superiority over other processors
	-in its performance document, describes Dhrystone WITHOUT THE SLIGHTEST
	TRACE OF CAVEATS about the care with which these results must be
	interpreted, despite the fact that the Dhrystone sources give such
	caveats, and that the Dhrystone table in the Intel document comes
	straight from a document that takes great pains to warn the reader
	to be very careful about interpretation.

However, Olivetti is not Intel.  Since you work at a site well-known to be
working on i860s, perhaps you could suggest for us the "programs that you
use every day" that you run on an i860, and performance thereof,
so we could have a better shot at evaluating its performance.  You could
really help the cause of realistic-benchmarking by posting
sources-of-real-programs (if any could be public domain) of such programs
and their i860 times.....  Anyway, it's no wonder that many users of
computers trust vendors about as far as they can throw them....
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>

mash@mips.COM (John Mashey) (03/18/89)

In article <6326@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
...
>	Actually, I think the statement "Green Hills C" was the giveaway.
>We use Green Hills C here at Commodore for Amiga OS work, and got bitten recently
>because the compiler was set up with the "dhrystone" optimizer turned on,
>without our knowing it.  This causes mis-aligned strcpy()s to bus-fault on
>68000, since it (a) assumes string sources AND destinations are ALWAYS word-
>aligned, and (b) inlines strcpy, even though in general Green Hills doesn't
>do inlining.
>
>	So I suspect the differences are being caused by the "dhrystone" switch
>in Green Hills.

Can you say more?  Do you mean that this is a compile-time switch,
but the default was set up to do the strcpy this way?  (You clearly must
be able to turn it off, since some real programs will fail if this is done
generally.  Maybe it only inlines strcpy when all the conditions are right
(and this was a bug)?)

Or is this something gen'd into some compilers, but not others?

In any case, do you (or anybody) know the option names for turning this
effect ON/OFF?  (maybe they're different across CPUs?)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>

cquenel@polyslo.CalPoly.EDU (56 more school days) (03/19/89)

In article <6326@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>     Actually, I think the statement "Green Hills C" was the giveaway.
>We use Green Hills C here at Commodore for Amiga OS work, and got bitten recently
>because the compiler was set up with the "dhrystone" optimizer turned on,
>without our knowing it.  This causes mis-aligned strcpy()s to bus-fault on
>68000, since it (a) assumes string sources AND destinations are ALWAYS word-
>aligned, and (b) inlines strcpy, even though in general Green Hills doesn't
>do inlining.

	Ghreenstones  --

	The benchmark number corresponding to Dhrystones 1.1 run 
	through a "dhrystone" optimizing Green Hills compiler.  
	Known to be 2 to 3 times higher than it should be.

--chris

	(I didn't write this.  You can't sue me, I don't
	 have any money! P.S.  Green Hills deserves it.)

-- 
@---@  ------------------------------------------------------------------  @---@
\. ./  | Chris Quenelle (The First Lab Rat) cquenel@polyslo.calpoly.edu |  \. ./
 \ /   |                   Better Red than dead !                       |   \ / 
==o==  ------------------------------------------------------------------  ==o==

w-colinp@microsoft.UUCP (Colin Plumb) (03/19/89)

mash@mips.COM (John Mashey) wrote:
> 2) OK, I give up.  There must be something unbelievably clever going on
> to use 128-bit loads for C-language string operations. I've looked
> at the i860 Programmer's Reference Manual a bunch, trying to figure
> out how to use either the FP unit or the graphics unit to do this.

Yeah... the Z-buffer check instructions could be used for this, but they're
only available in 16 and 32-bit versions, and you have to test the bits
from the psr, two cycles.  And even that would only be 64 bits at a time.

> The string copy on page 9-5 of the manual is the "natural" strcpy
> (which doesn't use anything but byte load/store, and takes about 5 cycles/
> byte).  I haven't been able to find anything like "branch on any byte zero",
> and the 860 doesn't have unaligned word operations.  For a fair test,
> you MUST use str* that only assume byte alignment of operands, and
> you can't inline the str*.  The only place I can think of using 128-bit
> loads is in the structure-copy, and it shouldn't be used there,
> unless structures whose largest entities are words are always aligned
> to 4-word boundaries, which seems unlikely.

Well, you can quickly cobble together some code using the
((x-0x01010101)&~x)&0x80808080 != 0 trick that works on words
at a time.  This would help in Dhrystone which, as has been observed,
has unnaturally long strings.  If you get this going with a bit of
alternation to allow for load latency, you can get strlen down to about
5 cycles/word.  Strcmp and strcpy would be slower, but would probably
be bandwidth-limited.
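
A sketch of that trick in C, as an illustration rather than anyone's
actual library routine; has_zero_byte and strlen_by_words are
hypothetical names.  Two assumptions are worth flagging.  As written
above, the expression needs extra parentheses in C, because != binds
tighter than &.  And the word loop may read up to 3 bytes past the
terminator; an aligned word never straddles a page, so this is harmless
on typical hardware, but it is formally outside ISO C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 1 if any of the four bytes in x is zero, else 0.  Subtracting 1 from
   each byte sets its top bit only if the byte was 0 or >= 0x81;
   masking with ~x discards the >= 0x80 cases. */
int has_zero_byte(uint32_t x)
{
    return (((x - 0x01010101u) & ~x) & 0x80808080u) != 0;
}

size_t strlen_by_words(const char *s)
{
    size_t n = 0;

    /* Byte steps until s + n is 4-byte aligned. */
    while (((uintptr_t)(s + n) & 3u) != 0) {
        if (s[n] == '\0')
            return n;
        n++;
    }

    /* One test per aligned word instead of one per byte. */
    for (;;) {
        uint32_t w;
        memcpy(&w, s + n, sizeof w);   /* compiles to a word load */
        if (has_zero_byte(w))
            break;
        n += 4;
    }

    /* The terminator is somewhere in this word; find it by bytes. */
    while (s[n] != '\0')
        n++;
    return n;
}
```

The test works the same for either byte order, since it only asks
whether some byte is zero, not which one.  A strcpy or strcmp built
this way also has to deal with source and destination having different
alignments, which is where the extra cost comes from.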

As for structure copies, what you want is for all structures 4 words
or larger in size to always be 4-word aligned.  Intel suggests the
stack is kept this way, for just the same reason.

(I admit this is starting to enter the realm of declining returns - you
can waste a lot of memory this way - but is still feasible.)

> Maybe somebody at Intel would care to post the str* routines
> and educate us?

I posted the instruction set - it's an exercise for the reader. :-)
-- 
	-Colin (uunet!microsoft!w-colinp)

"Don't listen to me.  I never do." - The Doctor

chase@Ozona.orc.olivetti.com (David Chase) (03/21/89)

In article <15475@winchester.mips.COM> mash@mips.COM (John Mashey) writes:

>Since you work at a site well-known to be working on i860s, perhaps you
>could suggest for us the "programs that you use every day"

X11, GNU emacs, make, cc (as, ld), tex, dvi2ps, iptex (widely accessible),
m2c, m2l, m2make, m3cfe, m3be (not accessible, not all portable either).

>that you run on an i860,

None -- several very important ones only run on 68K boxes, and have
not even been ported to SunOS 4.*.  I tend to be conservative in
moving to new hardware and software.

>and performance thereof,

I'm not sure, but I don't think I could even if I had the numbers.
When in doubt, I don't comment publicly on Olivetti endeavors, or on
companies with whom we appear to be working.  I seem to recall (from
my student days) agreements with DEC and IBM barring me from
publishing benchmarks or bug reports for products not yet released to
the rest of the world, so I'll assume that such an agreement probably
holds here.

In general, I think that I am I/O- and bad software-bound, not
CPU-bound.  (That is, I am CPU bound, but only because some critical
software is Really Stupid.  It would be helped by a faster cpu, but
that solution (though practical and probably the one I'll use) offends
me.)

Benchmarks will appear via e-mail.

David

cramer@sun.com (Sam Cramer) (03/21/89)

In article <15475@winchester.mips.COM>, mash@mips (John Mashey) writes:
>So far, Intel:
>	-published the results of EXACTLY ONE integer benchmark (Dhrystone
>	1.1 & 2.1) actually measured on this machine (@ 33MHz)
>	-features this number prominently in its marketing claims
>	(well, actually, it features the numbers that would be gotten at 40MHz,
>	or sometimes 50MHz)
>	-uses it frequently to claim superiority over other processors
>	-in its performance document, describes Dhrystone WITHOUT THE SLIGHTEST
>	TRACE OF CAVEATS about the care with which these results must be
>	interpreted, despite the fact that the Dhrystone sources give such
>	caveats, and that the Dhrystone table in the Intel document comes
>	straight from a document that takes great pains to warn the reader
>	to be very careful about interpretation.

Intel is not the only company to do this.  In the "DECstation 3100
Performance Summary" distributed by DEC, the SINGLE integer benchmark shown
is Dhrystone 2.1.  The text that accompanies these results contains no
warning regarding the tendency of Dhrystone to overstate "real-world"
integer performance.  In fact, the Dhrystone results are used to calculate
"price-performance" ratios relative to two Sun machines (a vital metric for
those users who spend all day running Dhrystones).  This section of the
document (titled "CASE and Dhrystone"!) goes on to imply that these
Dhrystone results promise good performance on a CASE workload.

It seems that when DEC acquired RISC technology from MIPS, they overlooked
MIPS's benchmarking know-how.

Sam Cramer	sun!cramer  cramer@sun.com

sclafani@jumbo.dec.com (Michael Sclafani) (03/21/89)

In article <95013@sun.Eng.Sun.COM>, cramer@sun.com (Sam Cramer) writes:
> Intel is not the only company to do this.  In the "DECstation 3100
> Performance Summary" distributed by DEC, the SINGLE integer benchmark shown
> is Dhrystone 2.1.  The text that accompanies these results contains no
> warning regarding the tendency of Dhrystone to overstate "real-world"
> integer performance.

From the DECstation 3100 Performance Summary / Part 2: Performance Details:

"Dhrystone is widely available, easy to run and is arguably the industry's
most popular Integer benchmark.  Unfortunately, the result obtained is
difficult to fairly compare amongst differing computing architectures and
is almost as sensitive to how the Dhrystone executable image is compiled
and linked as it is to the underlying processor speed.  The benchmark
documentation presents a set of ground rules for building and executing
Dhrystone.  Today, the accepted practice is to run the benchmark under any
environment you wish, as long as the environment is clearly described and
procedure inlining compiler optimization is not employed."

"Dhrystone does not seem to be the best indication of application
performance and is unusual in the following respects:

    + Unusually low dynamic nesting depth of function calls
    + Unusually low number of instructions executed per function call
    + Large percentage of time spent in "strcpy" and "strcmp" routines,
      processing unusually large character strings
    + Character strings are typically alignable on a word boundary
    + Does not show how the use of shared libraries in real workload
      with multiple concurrent applications affects performance

Results for the Sun-3/60 are not reported because the data in
[Presentation on Benchmarks given at Sun User Group Conference, Dec 5-7,
1988 by Sun Microsystems, Inc.] uses compiler optimization level 4 which
employs procedure inlining.

We include the Dhrystone benchmark in our performance evaluation because
of its popularity, but warn against using it as the sole basis of
comparing system performance and of accepting results that don't
explicitly label how the benchmark was built and what optimizations were
exploited."

The performance summary and technical information are available via
anonymous ftp as compressed postscript from gatekeeper.dec.com in
~ftp/pub:
        ds3100_perf.1a.ps.Z
        ds3100_perf.1b.ps.Z
        ds3100_perf.2.ps.Z
        ds3100_tech.ps.Z

The summary includes Linpack, Whetstone, DR Labs CPU2, Livermore FORTRAN
Kernels, Dhrystone (2.1 AND 1.1), SPICE 2G6, Doduc, Dynamic Graphics TOP
Benchmark, and X11 graphics benchmarks.

Please note that I am not a Digital spokescritter, and any opinions
presented or errors committed are my own.
-- 

Michael Sclafani      \\\  Digital Equipment Corporation
sclafani@src.dec.com  \\\  Systems Research Center, Palo Alto, CA
(415) 854-7569 (home) \\\  (415) 853-2271 (work)

david@sun.com (Academy of Pathetic Mail) (03/22/89)

In article <13641@jumbo.dec.com> sclafani@jumbo.dec.com (Michael Sclafani)
writes:
>Results for the Sun-3/60 are not reported because the data in
>[Presentation on Benchmarks given at Sun User Group Conference, Dec 5-7,
>1988 by Sun Microsystems, Inc.] uses compiler optimization level 4 which
>employs procedure inlining.

FYI, this is incorrect.  Current Sun C compilers do not perform procedure
inlining at any optimization level.

-- 
David DiGiacomo, Sun Microsystems, Mt. View, CA  sun!david david@sun.com

cramer@sun.com (Sam Cramer) (03/22/89)

Mr. Sclafani is quite right - I overlooked the section toward the back of
the document which he quotes.  My apologies for mistakenly claiming that
the "DECstation 3100 Performance Summary" contains no caveats regarding
Dhrystone.

Nonetheless, it is a bit curious that such a flawed benchmark (the results
of which are described in the section I missed as "difficult to fairly compare
amongst differing computing architectures") is used as the basis of a
price/performance comparison.

Sam Cramer	sun!cramer  cramer@sun.com

jg@jumbo.dec.com (Jim Gettys) (03/23/89)

In article <95013@sun.Eng.Sun.COM> cramer@sun.com (Sam Cramer) writes:
>Intel is not the only company to do this.  In the "DECstation 3100
>Performance Summary" distributed by DEC, the SINGLE integer benchmark shown
>is Dhrystone 2.1.  The text that accompanies these results contains no
>warning regarding the tendency of Dhrystone to overstate "real-world"
>integer performance.  In fact, the Dhrystone results are used to calculate
>"price-performance" ratios relative to two Sun machines (a vital metric for
>those users who spend all day running Dhrystones).  This section of the
>document (titled "CASE and Dhrystone"!) goes on to imply that these
>Dhrystone results promise good performance on a CASE workload.
>
>It seems that when DEC acquired RISC technology from MIPS, they overlooked
>MIPS's benchmarking know-how.

If you get Mashey's latest performance brief (Issue 3.6), you will find 
pretty well complete DECstation 3100 performance numbers for his entire
suite. I sent them to John within a few days of announcement. (I ran
his suite last fall myself).

Our marketeers blew the original summary; I was to review it before it saw the
light of day, but was in the process of moving from California to 
Massachusetts.  Due to some unfortunate problems getting a copy printed
when I arrived in Cambridge, I was unable to get this problem fixed
(only one integer benchmark) in the first version of the memo that saw the 
light of day.

To first order, you can easily estimate its performance given the fact
it is a 16.667 MHz R2000.  Due to differences in the memory subsystem,
it is slightly slower than the MIPS M/120-5.

It was the typical problem of people working to deadlines overlooking things.
The summary had to make a printer's deadline for announcement.
The marketing folks promised to update the original briefs within a couple 
weeks of the original announcement; I believe you will find that current ones 
are better.

				- Jim Gettys

mash@mips.COM (John Mashey) (03/23/89)

In article <13645@jumbo.dec.com> jg@jumbo.UUCP (Jim Gettys) writes:

>If you get Mashey's latest performance brief (Issue 3.6), you will find 
>pretty well complete DECstation 3100 performance numbers for his entire
>suite. I sent them to John within a few days of announcement. (I ran
>his suite last fall myself).
Yes.  Between Uniforum & being knocked flat with the flu, it's taken a while
to post this, although we gave a lot out at Uniforum.   I've got a couple
typos and broken references to fix, and then I'll post 3.7 (soon).
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>