[comp.arch] Understanding variations in Dhrystone performance

wck353@estevax.UUCP (HrDr Weicker Reinhold ) (05/11/89)

Some days ago, Rick Richardson has posted a new list of Dhrystone results.
For the benefit of those who use the numbers, and to warn against
overly hasty conclusions from some numbers, I'll post here an article
that I have written about variations in Dhrystone performance.
The article was prompted by the discussion in Usenet in March 1989
about the Dhrystone numbers published for new microprocessors.
It has been published in the May 1989 issue of the "Microprocessor Report"
(Ed. Michael Slater, 550 California Ave., Suite 320, Palo Alto, CA 94306).
This newsletter will also publish, in a forthcoming issue, a comparison of the
code generated and the library functions used by the compilers for
the major microprocessors. Since I have seen code listings for only a
very few processors, I'll refrain from commenting here; any comments
would necessarily be incomplete.


------------------------------------------------------------------


            Understanding Variations in Dhrystone Performance

                   By Reinhold P. Weicker, Siemens AG

                                April 1989


Microprocessor manufacturers tend to credit all the performance measured
by benchmarks to the speed of their processors; often they don't even
mention the programming language and compiler used. In their detailed
documents, usually called "performance brief" or "performance report,"
they do give more details. However, these details are often lost
in the press releases and other marketing statements. For serious
performance evaluation, it is necessary to study the code generated by
the various compilers.

Dhrystone was originally published in Ada (Communications of the ACM,
Oct. 1984). However, since good Ada compilers were rare at that time
and C, which spread together with UNIX, became more and more popular,
the C version of Dhrystone is the one now mainly used in industry. There are
"official" versions 2.1 for Ada, Pascal, and C, which are as close
together as the languages' semantic differences permit.

Dhrystone contains two statements where the programming language and its
translation play a major part in the execution time measured by the
benchmark:

(1) String assignment (in procedure Proc_0 / main)

(2) String comparison (in function Func_2)

In Ada and Pascal, strings are arrays of characters where the length of
the string is part of the type information known at compile time. In C,
strings are also arrays of characters, but there are no operators
defined in the language for assignment and comparison of strings.
Instead, functions "strcpy" and "strcmp" are used. These functions are
defined for strings of arbitrary length, and make use of the fact that
strings in C have to end with a terminating null byte. For general-purpose
calls to these functions, the implementor can assume nothing
about the length and the alignment of the strings involved.

The C version of Dhrystone spends a relatively large amount of time in
these two functions. Some time ago, I made measurements on a VAX 11/785
with the Berkeley UNIX (4.2) compilers (often-used compilers, but
certainly not the most advanced). In the C version, 23% of the time was
spent in the string functions; in the Pascal version, only 10%. On good
RISC machines (where less time is spent in the procedure calling
sequence than on a VAX) and with better optimizing compilers, the
percentage is higher; MIPS has reported 34% for an R3000.
Because of this effect, Pascal and Ada Dhrystone results are usually
better than C results (except when the optimization quality of the C
compiler is considerably better than that of the other compilers).

Several people have noted that the string operations are over-represented
in Dhrystone, mainly because the strings occurring in
Dhrystone are longer than average strings. I admit that this is true,
and have said so in my SIGPLAN Notices paper (Aug. 1988);
however, I didn't want to
generate confusion by changing the string lengths from version 1 to
version 2.

Even if they are somewhat over-represented in Dhrystone, string
operations are frequent enough that it makes sense to implement them in
the most efficient way possible, not only for benchmarking purposes.
This means that they can and should be written in assembly language
code. ANSI C also explicitly allows the string functions to be
implemented as macros, i.e., by inline code.

There is also a third way to speed up the "strcpy" statement in
Dhrystone: For this particular "strcpy" statement, the source of the
assignment is a string constant. Therefore, in contrast to calls to
"strcpy" in the general case, the compiler knows the length and
alignment of the strings involved at compile time and can generate code
in the same efficient way as a Pascal compiler (word instructions
instead of byte instructions).

This is not allowed in the case of the "strcmp" call: Here, the
addresses are formal procedure parameters, and no assumptions can be
made about the length or alignment of the strings.  Any such assumptions
would indicate an incorrect implementation. They might work for
Dhrystone, where the strings are in fact word-aligned with typical
compilers, but other programs would deliver incorrect results.

So, for an apples-to-apples comparison between processors, and not between
several possible (legal or illegal) degrees of compiler optimization,
one should check that the systems are comparable with respect to the
following three points:

(1) String functions in assembly language vs. in C

Frequently used functions such as the string functions can and should be
written in assembly language, and all serious C language systems known
to me do this. (I list this point for completeness only.) Note that
processors with an instruction that checks a word for a null byte (such
as AMD's 29000 and Intel's 80960) have an advantage here. (This
advantage shrinks, relatively speaking, if optimization (3) is applied.) Because of
the length of the strings involved in Dhrystone, this advantage may
carry disproportionate weight, but it is certainly legitimate to use
such instructions - after all, these are exactly the situations they
were invented for.

(2) String function code inline vs. as library functions

ANSI C has created a new situation, compared with the older
Kernighan/Ritchie C. In the original C, the definition of the string
function was not part of the language. Now it is, and inlining is
explicitly allowed. I probably should have stated more clearly in my
SIGPLAN Notices paper that the rule "No procedure inlining for
Dhrystone" referred to the user level procedures only and not to the
library routines.

(3) Fixed-length and alignment assumptions for the strings

Compilers should be allowed to optimize in these cases if (and only if)
it is safe to do so. For Dhrystone, this is the "strcpy" statement, but
not the "strcmp" statement (unless the "strcmp" code explicitly
checks the alignment at execution time and branches accordingly).
A "Dhrystone switch" for the compiler that
causes the generation of code that may not work under certain
circumstances is certainly inappropriate for comparisons. It has been
reported in Usenet that some C compilers provide such a
compiler option; since I don't have access to all C compilers involved,
I cannot verify this.

If the fixed-length and word-alignment assumption can be used, a wide
bus that permits fast multi-word load instructions certainly does help;
however, this fact by itself should not make a really big difference.

A check of these points - something that is necessary for a thorough
evaluation and comparison of the Dhrystone performance claims -
requires object code listings for the compiled program as well as
code listings for the string functions (strcpy, strcmp) that are
possibly called by the program.

I don't pretend that Dhrystone is a perfect tool to measure the integer
performance of microprocessors. The more it is used and discussed, the
more I myself learn about aspects that I hadn't noticed yet when I wrote
the program. And of course, the very success of a benchmark program is a
danger in that people may tune their compilers and/or hardware to it,
and in doing so make it less useful.

Whetstone and Linpack have their critical points also: The Whetstone
rating depends heavily on the speed of the mathematical functions (sine,
sqrt, ...), and Linpack is sensitive to data alignment for some cache
configurations.

Introduction of a standard set of public domain benchmark software
(something the SPEC effort attempts) is certainly a worthwhile thing.
In the meantime, people will continue to use whatever is available
and widely distributed, and Dhrystone ratings
are probably still better than MIPS ratings when the latter - as is
often the case in industry - rest on no reproducible derivation.
However, any serious performance evaluation requires more than just
a comparison of raw numbers; one has to make sure that the
numbers have been obtained in a comparable way.
-- 
Reinhold P. Weicker, Siemens AG, E STE 35, PO Box 3220, D-8520 Erlangen, Germany
Phone:		     +49-9131-720330 (Centr.Europ.Time, 8 am - 5 pm)
UUCP:		     ...!mcvax!unido!estevax!weicker
Disclaimer:	     Although I work for Siemens, I speak here only for myself

henry@utzoo.uucp (Henry Spencer) (05/16/89)

In article <474@estevax.UUCP> wck353@estevax.UUCP (HrDr Weicker Reinhold ) writes:
>... Note that
>processors with an instruction that checks a word for a null byte (such
>as AMD's 29000 and Intel's 80960) have an advantage here...

Only a small one; you can do the same check on a machine without the
fancy instruction by being clever.  Consider:

	(((x & ~0x80808080) - 0x01010101) & 0x80808080)

The result is nonzero if, and only if, there was a NUL byte in x.  This
is a bit more expensive than a single instruction, but not a whole lot
if you put the constants in registers... especially on a machine where
you can juggle the code to put most of the operations in load-delay slots.
If you're into benchmarksmanship seriously, you can omit the first "&"
if you're careful to use only ASCII (or if you expect high-bit characters
to be rare and are willing to do a more precise check afterward to eliminate
false alarms).  There are a number of variations.

>If the fixed-length and word-alignment assumption can be used, a wide
>bus that permits fast multi-word load instructions certainly does help;

Beware that there are alignment restrictions here too:  you don't want
a multi-word load to cross a page boundary unless you are sure the string
crosses it too.  Accessing the next page may cause a trap.
-- 
Subversion, n:  a superset     |     Henry Spencer at U of Toronto Zoology
of a subset.    --J.J. Horning | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

henry@utzoo.uucp (Henry Spencer) (05/17/89)

In article <1989May15.173631.3029@utzoo.uucp> I wrote:
>	(((x & ~0x80808080) - 0x01010101) & 0x80808080)
>
>The result is nonzero if, and only if, there was a NUL byte in x...

Oops, my mistake, it does get a false alarm on an 0x80.  So you do end
up needing a false-alarm filter.  There are ways around this, but they
add their own overheads.  Nevertheless, alignment permitting, using a
fast filter like this is a considerable win if you're scanning big
chunks of text.  Now, how worthwhile *any* of this is for typical C
strings is a different question -- it's hard to amortize any significant
setup overhead over the short strings typically found in real code.
-- 
Subversion, n:  a superset     |     Henry Spencer at U of Toronto Zoology
of a subset.    --J.J. Horning | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

ECULHAM@UALTAVM.BITNET (Earl Culham) (05/18/89)

In article <1989May15.173631.3029@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
 
<In article <474@estevax.UUCP> wck353@estevax.UUCP (HrDr Weicker Reinhold ) writes:
<>... Note that
<>processors with an instruction that checks a word for a null byte (such
<>as AMD's 29000 and Intel's 80960) have an advantage here...
<
<Only a small one; you can do the same check on a machine without the
<fancy instruction by being clever.  Consider:
<
<        (((x & ~0x80808080) - 0x01010101) & 0x80808080)
<
<The result is nonzero if, and only if, there was a NUL byte in x.
                       EEEEEEEEEEEEEEE
 
Actually, this also causes a false trigger if any byte contains
X'80'.

jed4885@ultb.UUCP (J.E. Dyer) (05/18/89)

In article <1989May16.172354.1417@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <1989May15.173631.3029@utzoo.uucp> I wrote:
>>	(((x & ~0x80808080) - 0x01010101) & 0x80808080)
>>
>>The result is nonzero if, and only if, there was a NUL byte in x...
>
> [ Stuff about needing a filter to detect false alarms deleted ]
>Subversion, n:  a superset     |     Henry Spencer at U of Toronto Zoology
>of a subset.    --J.J. Horning | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

	This does seem like a lot of work to test for a null byte...  Has
	anyone considered putting in a test-word-for-byte (twfb?)
	instruction on their favorite processor?  It seems to me that
	adding this kind of a function to an alu would be pretty
	trivial, and it would make a significant improvement in some
	kinds of string operations.  Of course, your strings would have
	to be aligned on word boundaries, but that shouldn't be too
	difficult to add to a compiler.  Has anyone done this sort of
	thing?  Is there any reason not to?  (I haven't designed any
	(real) processors, so it's entirely possible that I'm missing
	out on some major consideration :).

						-Jason

-sig-of-the-day-
"So, Jason, how's that graphics project going?"
BITNET: JED4885@RITVAX			UUCP: jed4885@ultb

joe@petsd.UUCP (Joe Orost) (05/18/89)

Another factor in the Ada performance is whether or not the Ada compiler
supports the pragma PACK, and to what degree.

Compilers that ignore the pragma, along with those that pack only to the
nearest byte or power of two, will do better in dhrystone than those that
pack minimally, because the 30-character strings become strings of thirty
7-bit characters.

These bit-packed strings are harder to move and harder to compare, causing
the dhrystone rating to drop.

This is not fair to implementers that try to provide better support for the
language by supplying minimal bit-packing.

Moving and comparing 7-bit character strings is not something that most users
will do, yet that is what determines the dhrystone number.

My recommendation: throw out the "pragma PACK" on the 30-character
strings.

				regards,
				joe

--

 Full-Name:  Joseph M. Orost
 UUCP:       rutgers!petsd!joe
 ARPA:	     petsd!joe@RUTGERS.EDU, joe@petsd.ccur.com
 Phone:      (201) 758-7284
 US Mail:    MS 313; Concurrent Computer Corporation; 106 Apple St
             Tinton Falls, NJ 07724

tim@crackle.amd.com (Tim Olson) (05/19/89)

In article <839@ultb.UUCP> jed4885@ultb.UUCP (J.E. Dyer (713ICS)) writes:
| 	This does seem like a lot of work to test for a null byte...  Has
| 	anyone considered putting in a test-word-for-byte (twfb?)
| 	instruction on their favorite processor?

From the Am29000 User's Manual:

CPBYTE								CPBYTE
			    Compare Bytes

Operation:	if (srca.byte0 = srcb.byte0) or
		   (srca.byte1 = srcb.byte1) or
		   (srca.byte2 = srcb.byte2) or
		   (srca.byte3 = srcb.byte3)
		then
			dest <- TRUE
		else
			dest <- FALSE

Description:
	Each byte of the srca operand is compared to the corresponding
	byte of the srcb operand.  If any corresponding bytes are equal,
	a Boolean TRUE is placed into the DEST location; otherwise, a
	Boolean FALSE is placed in the DEST location.

Assembler Syntax:

	cpbyte	rc, ra, rb
	cpbyte	rc, ra, const8


|	It seems to me that
| 	adding this kind of a function to an alu would be pretty
| 	trivial, and it would make a significant improvement in some
| 	kinds of string operations.

Yes, it is trivial.  We added it because we felt that it was an easy way
to speed up 'C' string operations (str[n]cmp, str[n]cpy, strlen) which
must constantly search for a terminating byte.  With the cpbyte
instruction, string operations can be performed a word at a time.

|	Of course, your strings would have
| 	to be aligned on word boundaries, but that shouldn't be to
| 	difficult to add to a compiler.

Not necessarily.  All you have to do is take care of the boundary
conditions correctly.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)