wck353@estevax.UUCP (HrDr Weicker Reinhold ) (05/11/89)
Some days ago, Rick Richardson has posted a new list of Dhrystone results.
For the benefit of those who use the numbers, and to warn against
overly hasty conclusions from some numbers, I'll post here an article
that I have written about variations in Dhrystone performance.
The article was prompted by the discussion in Usenet in March 1989
about the Dhrystone numbers published for new microprocessors.
It has been published in the May 1989 issue of the "Microprocessor Report"
(Ed. Michael Slater, 550 California Ave., Suite 320, Palo Alto, CA 94306).
This newsletter will also publish, in a forthcoming issue, a comparison of the
code generated and the library functions used by the compilers for
the major microprocessors. Since I have seen code listings for only a
few processors, I'll refrain from commenting here; any comments would
necessarily be incomplete.
------------------------------------------------------------------
Understanding Variations in Dhrystone Performance
By Reinhold P. Weicker, Siemens AG
April 1989
Microprocessor manufacturers tend to credit all the performance measured
by benchmarks to the speed of their processors; often they don't even
mention the programming language and compiler used. In their detailed
documents, usually called "performance brief" or "performance report,"
they do give more details. However, these details are often lost
in the press releases and other marketing statements. For serious
performance evaluation, it is necessary to study the code generated by
the various compilers.
Dhrystone was originally published in Ada (Communications of the ACM,
Oct. 1984). However, since good Ada compilers were rare at that time
and C, together with UNIX, became more and more popular, the C version
of Dhrystone is the one now mainly used in industry. There are
"official" versions 2.1 for Ada, Pascal, and C, which are as close
together as the languages' semantic differences permit.
Dhrystone contains two statements where the programming language and its
translation play a major part in the execution time measured by the
benchmark:
(1) String assignment (in procedure Proc_0 / main)
(2) String comparison (in function Func_2)
In Ada and Pascal, strings are arrays of characters where the length of
the string is part of the type information known at compile time. In C,
strings are also arrays of characters, but there are no operators
defined in the language for assignment and comparison of strings.
Instead, functions "strcpy" and "strcmp" are used. These functions are
defined for strings of arbitrary length, and make use of the fact that
strings in C have to end with a terminating null byte. For general-purpose
calls to these functions, the implementor can assume nothing
about the length and the alignment of the strings involved.
The C version of Dhrystone spends a relatively large amount of time in
these two functions. Some time ago, I made measurements on a VAX 11/785
with the Berkeley UNIX (4.2) compilers (often-used compilers, but
certainly not the most advanced). In the C version, 23% of the time was
spent in the string functions; in the Pascal version, only 10%. On good
RISC machines (where less time is spent in the procedure calling
sequence than on a VAX) and with better optimizing compilers, the
percentage is higher; MIPS has reported 34% for an R3000.
Because of this effect, Pascal and Ada Dhrystone results are usually
better than C results (except when the optimization quality of the C
compiler is considerably better than that of the other compilers).
Several people have noted that the string operations are over-represented
in Dhrystone, mainly because the strings occurring in
Dhrystone are longer than average strings. I admit that this is true,
and have said so in my SIGPLAN Notices paper (Aug. 1988);
however, I didn't want to
generate confusion by changing the string lengths from version 1 to
version 2.
Even if they are somewhat over-represented in Dhrystone, string
operations are frequent enough that it makes sense to implement them in
the most efficient way possible, not only for benchmarking purposes.
This means that they can and should be written in assembly language
code. ANSI C also explicitly allows the string functions to be
implemented as macros, i.e. by inline code.
There is also a third way to speed up the "strcpy" statement in
Dhrystone: For this particular "strcpy" statement, the source of the
assignment is a string constant. Therefore, in contrast to calls to
"strcpy" in the general case, the compiler knows the length and
alignment of the strings involved at compile time and can generate code
in the same efficient way as a Pascal compiler (word instructions
instead of byte instructions).
This is not allowed in the case of the "strcmp" call: Here, the
addresses are formal procedure parameters, and no assumptions can be
made about the length or alignment of the strings. Any such assumptions
would indicate an incorrect implementation. They might work for
Dhrystone, where the strings are in fact word-aligned with typical
compilers, but other programs would deliver incorrect results.
So, for an apple-to-apple comparison between processors, and not between
several possible (legal or illegal) degrees of compiler optimization,
one should check that the systems are comparable with respect to the
following three points:
(1) String functions in assembly language vs. in C
Frequently used functions such as the string functions can and should be
written in assembly language, and all serious C language systems known
to me do this. (I list this point for completeness only.) Note that
processors with an instruction that checks a word for a null byte (such
as AMD's 29000 and Intel's 80960) have an advantage here. (This
advantage decreases relatively if optimization (3) is applied.) Given
the length of the strings involved in Dhrystone, this advantage may be
weighted too heavily, but it is certainly legal to use such
instructions - after all, these situations are what they were
invented for.
(2) String function code inline vs. as library functions.
ANSI C has created a new situation, compared with the older
Kernighan/Ritchie C. In the original C, the definition of the string
function was not part of the language. Now it is, and inlining is
explicitly allowed. I probably should have stated more clearly in my
SIGPLAN Notices paper that the rule "No procedure inlining for
Dhrystone" referred to the user level procedures only and not to the
library routines.
(3) Fixed-length and alignment assumptions for the strings
Compilers should be allowed to optimize in these cases if (and only if)
it is safe to do so. For Dhrystone, this is the "strcpy" statement, but
not the "strcmp" statement (unless the "strcmp" code explicitly
checks the alignment at execution time and branches accordingly).
A "Dhrystone switch" for the compiler that
causes the generation of code that may not work under certain
circumstances is certainly inappropriate for comparisons. It has been
reported in Usenet that some C compilers provide such a
compiler option; since I don't have access to all C compilers involved,
I cannot verify this.
If the fixed-length and word-alignment assumption can be used, a wide
bus that permits fast multi-word load instructions certainly does help;
however, this fact by itself should not make a really big difference.
A check of these points - something that is necessary for a thorough
evaluation and comparison of the Dhrystone performance claims -
requires object code listings for the compiled program as well as
code listings for the string functions (strcpy, strcmp) that are
possibly called by the program.
I don't pretend that Dhrystone is a perfect tool to measure the integer
performance of microprocessors. The more it is used and discussed, the
more I myself learn about aspects that I hadn't noticed yet when I wrote
the program. And of course, the very success of a benchmark program is a
danger in that people may tune their compilers and/or hardware to it,
and in doing so make it less useful.
Whetstone and Linpack have their critical points also: The Whetstone
rating depends heavily on the speed of the mathematical functions (sine,
sqrt, ...), and Linpack is sensitive to data alignment for some cache
configurations.
Introduction of a standard set of public domain benchmark software
(something the SPEC effort attempts) is certainly a worthwhile thing.
In the meantime, people will continue to use whatever is available
and widely distributed, and Dhrystone ratings
are probably still better than MIPS ratings if those are - as is often
the case in industry - based on no reproducible derivation.
However, any serious performance evaluation requires more than just
a comparison of raw numbers; one has to make sure that the
numbers have been obtained in a comparable way.
--
Reinhold P. Weicker, Siemens AG, E STE 35, PO Box 3220, D-8520 Erlangen, Germany
Phone: +49-9131-720330 (Centr.Europ.Time, 8 am - 5 pm)
UUCP: ...!mcvax!unido!estevax!weicker
Disclaimer: Although I work for Siemens, I speak here only for myself

henry@utzoo.uucp (Henry Spencer) (05/16/89)
In article <474@estevax.UUCP> wck353@estevax.UUCP (HrDr Weicker Reinhold ) writes:
>... Note that
>processors with an instruction that checks a word for a null byte (such
>as AMD's 29000 and Intel's 80960) have an advantage here...

Only a small one; you can do the same check on a machine without the
fancy instruction by being clever. Consider:

	(((x & ~0x80808080) - 0x01010101) & 0x80808080)

The result is nonzero if, and only if, there was a NUL byte in x.
This is a bit more expensive than a single instruction, but not a whole
lot if you put the constants in registers... especially on a machine
where you can juggle the code to put most of the operations in
load-delay slots.

If you're into benchmarksmanship seriously, you can omit the first "&"
if you're careful to use only ASCII (or if you expect high-bit
characters to be rare and are willing to do a more precise check
afterward to eliminate false alarms). There are a number of variations.

>If the fixed-length and word-alignment assumption can be used, a wide
>bus that permits fast multi-word load instructions certainly does help;

Beware that there are alignment restrictions here too: you don't want a
multi-word load to cross a page boundary unless you are sure the string
crosses it too. Accessing the next page may cause a trap.
--
Subversion, n: a superset      | Henry Spencer at U of Toronto Zoology
of a subset.   --J.J. Horning  | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
henry@utzoo.uucp (Henry Spencer) (05/17/89)
In article <1989May15.173631.3029@utzoo.uucp> I wrote:
> (((x & ~0x80808080) - 0x01010101) & 0x80808080)
>
>The result is nonzero if, and only if, there was a NUL byte in x...

Oops, my mistake, it does get a false alarm on an 0x80. So you do end
up needing a false-alarm filter. There are ways around this, but they
add their own overheads. Nevertheless, alignment permitting, using a
fast filter like this is a considerable win if you're scanning big
chunks of text.

Now, how worthwhile *any* of this is for typical C strings is a
different question -- it's hard to amortize any significant setup
overhead over the short strings typically found in real code.
--
Subversion, n: a superset      | Henry Spencer at U of Toronto Zoology
of a subset.   --J.J. Horning  | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
ECULHAM@UALTAVM.BITNET (Earl Culham) (05/18/89)
In article <1989May15.173631.3029@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
<In article <474@estevax.UUCP> wck353@estevax.UUCP (HrDr Weicker Reinhold ) writes:
<>... Note that
<>processors with an instruction that checks a word for a null byte (such
<>as AMD's 29000 and Intel's 80960) have an advantage here...
<
<Only a small one; you can do the same check on a machine without the
<fancy instruction by being clever. Consider:
<
<	(((x & ~0x80808080) - 0x01010101) & 0x80808080)
<
<The result is nonzero if, and only if, there was a NUL byte in x.

Actually, this also causes a false trigger if any byte contains X'80'.
jed4885@ultb.UUCP (J.E. Dyer) (05/18/89)
In article <1989May16.172354.1417@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <1989May15.173631.3029@utzoo.uucp> I wrote:
>> (((x & ~0x80808080) - 0x01010101) & 0x80808080)
>>
>>The result is nonzero if, and only if, there was a NUL byte in x...
>
> [ Stuff about needing a filter to detect false alarms deleted ]

This does seem like a lot of work to test for a null byte... Has anyone
considered putting in a test-word-for-byte (twfb?) instruction on their
favorite processor? It seems to me that adding this kind of a function
to an ALU would be pretty trivial, and it would make a significant
improvement in some kinds of string operations. Of course, your strings
would have to be aligned on word boundaries, but that shouldn't be too
difficult to add to a compiler.

Has anyone done this sort of thing? Is there any reason not to? (I
haven't designed any (real) processors, so it's entirely possible that
I'm missing out on some major consideration :).

-Jason

-sig-of-the-day-
"So, Jason, how's that graphics project going?"
BITNET: JED4885@RITVAX   UUCP: jed4885@ultb
joe@petsd.UUCP (Joe Orost) (05/18/89)
Another factor in the Ada performance is whether or not the Ada compiler
supports the pragma PACK, and to what degree.
Compilers that ignore the pragma, along with those that only pack to the
nearest byte or power of 2, will do better in Dhrystone than those that
pack minimally, because the 30-character strings become strings of 30
7-bit characters.
These bit-packed strings are harder to move and harder to compare,
causing the Dhrystone rating to drop.
This is not fair to implementers that try to provide better support for the
language by supplying minimal bit-packing.
Moving and comparing 7-bit character strings is not something that most
users will do, yet that is what determines the Dhrystone number.
My recommendation: throw out the "pragma PACK" on the 30-character
strings.
regards,
joe
--
Full-Name: Joseph M. Orost
UUCP: rutgers!petsd!joe
ARPA: petsd!joe@RUTGERS.EDU, joe@petsd.ccur.com
Phone: (201) 758-7284
US Mail: MS 313; Concurrent Computer Corporation; 106 Apple St
Tinton Falls, NJ 07724

tim@crackle.amd.com (Tim Olson) (05/19/89)
In article <839@ultb.UUCP> jed4885@ultb.UUCP (J.E. Dyer (713ICS)) writes:
| This does seem like alot of work to test for a null byte... Has
| anyone considered putting in a test-word-for-byte (twfb?)
| instruction on their favorite processor?

From the Am29000 User's Manual:

CPBYTE			Compare Bytes

Operation:	if (srca.byte0 = srcb.byte0) or
		   (srca.byte1 = srcb.byte1) or
		   (srca.byte2 = srcb.byte2) or
		   (srca.byte3 = srcb.byte3)
		then dest <- TRUE
		else dest <- FALSE

Description:	Each byte of the srca operand is compared to the
		corresponding byte of the srcb operand. If any
		corresponding bytes are equal, a Boolean TRUE is
		placed into the DEST location; otherwise, a Boolean
		FALSE is placed in the DEST location.

Assembler Syntax:
		cpbyte	rc, ra, rb
		cpbyte	rc, ra, const8

| It seems to me that
| adding this kind of a function to an alu would be pretty
| trivial, and it would make a significant improvement in some
| kinds of string operations.

Yes, it is trivial. We added it because we felt that it was an easy way
to speed up 'C' string operations (str[n]cmp, str[n]cpy, strlen) which
must constantly search for a terminating byte. With the cpbyte
instruction, string operations can be performed a word at a time.

| Of course, your strings would have
| to be aligned on word boundaries, but that shouldn't be to
| difficult to add to a compiler.

Not necessarily. All you have to do is take care of the boundary
conditions correctly.
--
	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)