wck353@estevax.UUCP (HrDr Weicker Reinhold ) (05/11/89)
Some days ago, Rick Richardson posted a new list of Dhrystone results. For the benefit of those who use the numbers, and to warn against overly hasty conclusions from some of them, I'll post here an article I have written about variations in Dhrystone performance. The article was prompted by the Usenet discussion in March 1989 about the Dhrystone numbers published for new microprocessors. It has been published in the May 1989 issue of the "Microprocessor Report" (Ed. Michael Slater, 550 California Ave., Suite 320, Palo Alto, CA 94306). This newsletter will also publish, in a forthcoming issue, a comparison of the code generated and the library functions used by the compilers for the major microprocessors. Since I have seen code listings for only a few processors, I'll refrain from commenting here; such comments would necessarily be incomplete.

------------------------------------------------------------------

Understanding Variations in Dhrystone Performance
By Reinhold P. Weicker, Siemens AG
April 1989

Microprocessor manufacturers tend to credit all the performance measured by benchmarks to the speed of their processors; often they don't even mention the programming language and compiler used. In their detailed documents, usually called "performance brief" or "performance report," they do give more details. However, these details are often lost in the press releases and other marketing statements. For serious performance evaluation, it is necessary to study the code generated by the various compilers.

Dhrystone was originally published in Ada (Communications of the ACM, Oct. 1984). However, since good Ada compilers were rare at that time and C, together with UNIX, became more and more popular, the C version of Dhrystone is the one now mainly used in industry. There are "official" versions 2.1 for Ada, Pascal, and C, which are as close together as the languages' semantic differences permit.
Dhrystone contains two statements where the programming language and its translation play a major part in the execution time measured by the benchmark:

(1) String assignment (in procedure Proc_0 / main)
(2) String comparison (in function Func_2)

In Ada and Pascal, strings are arrays of characters where the length of the string is part of the type information known at compile time. In C, strings are also arrays of characters, but there are no operators defined in the language for assignment and comparison of strings. Instead, the functions "strcpy" and "strcmp" are used. These functions are defined for strings of arbitrary length, and they make use of the fact that strings in C have to end with a terminating null byte. For general-purpose calls to these functions, the implementor can assume nothing about the length and the alignment of the strings involved.

The C version of Dhrystone spends a relatively large amount of time in these two functions. Some time ago, I made measurements on a VAX 11/785 with the Berkeley UNIX (4.2) compilers (often-used compilers, but certainly not the most advanced). In the C version, 23% of the time was spent in the string functions; in the Pascal version, only 10%. On good RISC machines (where less time is spent in the procedure calling sequence than on a VAX) and with better optimizing compilers, the percentage is higher; MIPS has reported 34% for an R3000. Because of this effect, Pascal and Ada Dhrystone results are usually better than C results (except when the optimization quality of the C compiler is considerably better than that of the other compilers).

Several people have noted that the string operations are over-represented in Dhrystone, mainly because the strings occurring in Dhrystone are longer than average strings. I admit that this is true, and have said so in my SIGPLAN Notices paper (Aug. 1988); however, I didn't want to generate confusion by changing the string lengths from version 1 to version 2.
Even if they are somewhat over-represented in Dhrystone, string operations are frequent enough that it makes sense to implement them in the most efficient way possible, not only for benchmarking purposes. This means that they can and should be written in assembly language. ANSI C also explicitly allows the string functions to be implemented as macros, i.e. by inline code.

There is also a third way to speed up the "strcpy" statement in Dhrystone: for this particular "strcpy" statement, the source of the assignment is a string constant. Therefore, in contrast to calls to "strcpy" in the general case, the compiler knows the length and alignment of the strings involved at compile time and can generate code in the same efficient way as a Pascal compiler (word instructions instead of byte instructions). This is not allowed in the case of the "strcmp" call: here, the addresses are formal procedure parameters, and no assumptions can be made about the length or alignment of the strings. Any such assumptions would indicate an incorrect implementation. They might work for Dhrystone, where the strings are in fact word-aligned with typical compilers, but other programs would deliver incorrect results.

So, for an apples-to-apples comparison between processors, and not between several possible (legal or illegal) degrees of compiler optimization, one should check that the systems are comparable with respect to the following three points:

(1) String functions in assembly language vs. in C

Frequently used functions such as the string functions can and should be written in assembly language, and all serious C language systems known to me do this. (I list this point for completeness only.) Note that processors with an instruction that checks a word for a null byte (such as AMD's 29000 and Intel's 80960) have an advantage here. (This advantage decreases relatively if optimization (3) is applied.)
Due to the length of the strings involved in Dhrystone, this advantage may be given more weight than it deserves in perspective, but it is certainly legal to use such instructions - after all, these situations are what they were invented for.

(2) String function code inline vs. as library functions

ANSI C has created a new situation compared with the older Kernighan/Ritchie C. In the original C, the definition of the string functions was not part of the language. Now it is, and inlining is explicitly allowed. I probably should have stated more clearly in my SIGPLAN Notices paper that the rule "No procedure inlining for Dhrystone" referred to the user-level procedures only and not to the library routines.

(3) Fixed-length and alignment assumptions for the strings

Compilers should be allowed to optimize in these cases if (and only if) it is safe to do so. For Dhrystone, this is the "strcpy" statement, but not the "strcmp" statement (unless the "strcmp" code explicitly checks the alignment at execution time and branches accordingly). A "Dhrystone switch" for the compiler that causes the generation of code that may not work under certain circumstances is certainly inappropriate for comparisons. It has been reported in Usenet that some C compilers provide such a compiler option; since I don't have access to all C compilers involved, I cannot verify this. If the fixed-length and word-alignment assumptions can be used, a wide bus that permits fast multi-word load instructions certainly does help; however, this fact by itself should not make a really big difference.

A check of these points - something that is necessary for a thorough evaluation and comparison of the Dhrystone performance claims - requires object code listings for the compiled program as well as code listings for the string functions (strcpy, strcmp) that are possibly called by the program.

I don't pretend that Dhrystone is a perfect tool to measure the integer performance of microprocessors.
The more it is used and discussed, the more I myself learn about aspects that I hadn't noticed when I wrote the program. And of course, the very success of a benchmark program is a danger: people may tune their compilers and/or hardware to it, and thereby make it less useful. Whetstone and Linpack have their critical points also: the Whetstone rating depends heavily on the speed of the mathematical functions (sine, sqrt, ...), and Linpack is sensitive to data alignment for some cache configurations.

Introduction of a standard set of public domain benchmark software (something the SPEC effort attempts) is certainly a worthwhile goal. In the meantime, people will continue to use whatever is available and widely distributed, and Dhrystone ratings are probably still better than MIPS ratings if the latter are - as is often the case in industry - based on no reproducible derivation. However, any serious performance evaluation requires more than a comparison of raw numbers; one has to make sure that the numbers have been obtained in a comparable way.
--
Reinhold P. Weicker, Siemens AG, E STE 35, PO Box 3220, D-8520 Erlangen, Germany
Phone: +49-9131-720330 (Centr. Europ. Time, 8 am - 5 pm)
UUCP: ...!mcvax!unido!estevax!weicker
Disclaimer: Although I work for Siemens, I speak here only for myself.
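[Editorial aside, not part of Weicker's article: his point about the strcpy of a string constant - where the compiler knows length and alignment at compile time and may move words instead of bytes - can be sketched in C. The function names below are hypothetical illustrations; a real compiler would emit the word moves directly rather than call a helper.]

```c
#include <stdint.h>
#include <string.h>

/* General case: byte-at-a-time copy.  Nothing may be assumed about
 * the source's length or alignment, so every byte must be loaded,
 * stored, and tested for the terminating null. */
char *copy_bytewise(char *dst, const char *src)
{
    char *d = dst;
    while ((*d++ = *src++) != '\0')
        ;
    return dst;
}

/* Special case: the source is a string constant, so its length
 * (here including the terminating null) is known at compile time.
 * Word-granularity moves become possible; a fixed 4-byte memcpy
 * compiles down to a single word load/store when the compiler can
 * see the alignment. */
char *copy_known_length(char *dst, const char *src, size_t len)
{
    size_t i = 0;
    for (; i + sizeof(uint32_t) <= len; i += sizeof(uint32_t)) {
        uint32_t w;                    /* one word move per 4 bytes */
        memcpy(&w, src + i, sizeof w);
        memcpy(dst + i, &w, sizeof w);
    }
    for (; i < len; i++)               /* leftover bytes, incl. null */
        dst[i] = src[i];
    return dst;
}
```

The 30-character Dhrystone string plus its null byte, for example, would take 7 word moves and 3 byte moves instead of 31 byte loads each followed by a test.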
henry@utzoo.uucp (Henry Spencer) (05/16/89)
In article <474@estevax.UUCP> wck353@estevax.UUCP (HrDr Weicker Reinhold) writes:
>... Note that
>processors with an instruction that checks a word for a null byte (such
>as AMD's 29000 and Intel's 80960) have an advantage here...

Only a small one; you can do the same check on a machine without the fancy instruction by being clever. Consider:

	(((x & ~0x80808080) - 0x01010101) & 0x80808080)

The result is nonzero if, and only if, there was a NUL byte in x. This is a bit more expensive than a single instruction, but not a whole lot if you put the constants in registers... especially on a machine where you can juggle the code to put most of the operations in load-delay slots.

If you're into benchmarksmanship seriously, you can omit the first "&" if you're careful to use only ASCII (or if you expect high-bit characters to be rare and are willing to do a more precise check afterward to eliminate false alarms). There are a number of variations.

>If the fixed-length and word-alignment assumption can be used, a wide
>bus that permits fast multi-word load instructions certainly does help;

Beware that there are alignment restrictions here too: you don't want a multi-word load to cross a page boundary unless you are sure the string crosses it too. Accessing the next page may cause a trap.
--
Subversion, n: a superset     | Henry Spencer at U of Toronto Zoology
of a subset. --J.J. Horning   | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
henry@utzoo.uucp (Henry Spencer) (05/17/89)
In article <1989May15.173631.3029@utzoo.uucp> I wrote:
>	(((x & ~0x80808080) - 0x01010101) & 0x80808080)
>
>The result is nonzero if, and only if, there was a NUL byte in x...

Oops, my mistake, it does get a false alarm on an 0x80. So you do end up needing a false-alarm filter. There are ways around this, but they add their own overheads. Nevertheless, alignment permitting, using a fast filter like this is a considerable win if you're scanning big chunks of text.

Now, how worthwhile *any* of this is for typical C strings is a different question -- it's hard to amortize any significant setup overhead over the short strings typically found in real code.
--
Subversion, n: a superset     | Henry Spencer at U of Toronto Zoology
of a subset. --J.J. Horning   | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
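[Editorial aside, not part of the posts: there is a well-known exact variant of this trick that needs no false-alarm filter; it replaces the constant pre-mask with ~x. The sketch below, assuming 32-bit unsigned words, shows both forms side by side.]

```c
#include <stdint.h>

/* The masked form from the post above: cheap, but it false-triggers
 * when a byte is 0x80 (masking the high bit turns 0x80 into 0x00,
 * which then borrows). */
int has_nul_approx(uint32_t x)
{
    return (((x & ~0x80808080u) - 0x01010101u) & 0x80808080u) != 0;
}

/* Exact variant: nonzero if and only if some byte of x is 0x00.
 * A borrow can propagate upward only past a byte that is itself
 * zero, so high-bit bytes like 0x80 cause no false alarms. */
int has_nul_exact(uint32_t x)
{
    return ((x - 0x01010101u) & ~x & 0x80808080u) != 0;
}
```

The exact form costs a NOT of the data word instead of an AND with a register-held constant, so on most machines it is the same number of operations.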
ECULHAM@UALTAVM.BITNET (Earl Culham) (05/18/89)
In article <1989May15.173631.3029@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
<In article <474@estevax.UUCP> wck353@estevax.UUCP (HrDr Weicker Reinhold) writes:
<>... Note that
<>processors with an instruction that checks a word for a null byte (such
<>as AMD's 29000 and Intel's 80960) have an advantage here...
<
<Only a small one; you can do the same check on a machine without the
<fancy instruction by being clever. Consider:
<
<	(((x & ~0x80808080) - 0x01010101) & 0x80808080)
<
<The result is nonzero if, and only if, there was a NUL byte in x.

Actually, this also causes a false trigger if any byte contains X'80'.
jed4885@ultb.UUCP (J.E. Dyer) (05/18/89)
In article <1989May16.172354.1417@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <1989May15.173631.3029@utzoo.uucp> I wrote:
>>	(((x & ~0x80808080) - 0x01010101) & 0x80808080)
>>
>>The result is nonzero if, and only if, there was a NUL byte in x...
>
> [ Stuff about needing a filter to detect false alarms deleted ]

This does seem like a lot of work to test for a null byte... Has anyone considered putting in a test-word-for-byte (twfb?) instruction on their favorite processor? It seems to me that adding this kind of a function to an ALU would be pretty trivial, and it would make a significant improvement in some kinds of string operations. Of course, your strings would have to be aligned on word boundaries, but that shouldn't be too difficult to add to a compiler. Has anyone done this sort of thing? Is there any reason not to? (I haven't designed any (real) processors, so it's entirely possible that I'm missing out on some major consideration :).
	-Jason
-sig-of-the-day- "So, Jason, how's that graphics project going?"
BITNET: JED4885@RITVAX   UUCP: jed4885@ultb
joe@petsd.UUCP (Joe Orost) (05/18/89)
Another factor in the Ada performance is whether or not the Ada compiler supports the pragma PACK, and to what degree. Compilers that ignore the pragma, along with those that only pack to the nearest byte or power of 2, will do better in Dhrystone than those that pack minimally, because the 30-byte strings become bit-packed strings of 30 7-bit characters. These bit-packed strings are harder to move and harder to compare, causing the Dhrystone rating to drop.

This is not fair to implementers that try to provide better support for the language by supplying minimal bit-packing. Moving and comparing 7-bit character strings is not something that most users will do, yet that is what determines the Dhrystone number.

My recommendation: throw out the "pragma PACK" on the 30-character strings.

					regards,
					joe
--
Full-Name: Joseph M. Orost
UUCP: rutgers!petsd!joe
ARPA: petsd!joe@RUTGERS.EDU, joe@petsd.ccur.com
Phone: (201) 758-7284
US Mail: MS 313; Concurrent Computer Corporation; 106 Apple St; Tinton Falls, NJ 07724
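[Editorial aside, not from the post: to see why minimal bit-packing is costly, here is a sketch in C of fetching the i-th 7-bit character from a bit-packed buffer. The MSB-first packing convention is an assumption for illustration; every move or compare of such a string must do this shifting and masking instead of a single byte load.]

```c
#include <stdint.h>

/* Fetch the i-th character from a buffer of contiguously packed
 * 7-bit characters (no byte alignment between characters).  Bits
 * are assumed packed MSB-first within each byte. */
unsigned packed_char(const uint8_t *buf, unsigned i)
{
    unsigned bit  = i * 7;          /* starting bit offset */
    unsigned byte = bit / 8;
    unsigned off  = bit % 8;
    /* The 7 bits may straddle a byte boundary, so fetch two bytes
     * and extract the field with a shift and a mask. */
    unsigned word = ((unsigned)buf[byte] << 8) | buf[byte + 1];
    return (word >> (16 - 7 - off)) & 0x7F;
}
```

A compare of two such 30-character strings costs 30 of these extractions per operand, versus 30 single-byte loads (or 8 word loads) for byte-aligned strings.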
tim@crackle.amd.com (Tim Olson) (05/19/89)
In article <839@ultb.UUCP> jed4885@ultb.UUCP (J.E. Dyer) writes:
| This does seem like alot of work to test for a null byte... Has
| anyone considered putting in a test-word-for-byte (twfb?)
| instruction on their favorite processor?

From the Am29000 User's Manual:

CPBYTE		Compare Bytes

Operation:
	if (srca.byte0 = srcb.byte0) or
	   (srca.byte1 = srcb.byte1) or
	   (srca.byte2 = srcb.byte2) or
	   (srca.byte3 = srcb.byte3)
	then dest <- TRUE
	else dest <- FALSE

Description: Each byte of the srca operand is compared to the corresponding byte of the srcb operand. If any corresponding bytes are equal, a Boolean TRUE is placed into the DEST location; otherwise, a Boolean FALSE is placed in the DEST location.

Assembler Syntax:
	cpbyte rc, ra, rb
	cpbyte rc, ra, const8

| It seems to me that
| adding this kind of a function to an alu would be pretty
| trivial, and it would make a significant improvement in some
| kinds of string operations.

Yes, it is trivial. We added it because we felt that it was an easy way to speed up C string operations (str[n]cmp, str[n]cpy, strlen), which must constantly search for a terminating byte. With the cpbyte instruction, string operations can be performed a word at a time.

| Of course, your strings would have
| to be aligned on word boundaries, but that shouldn't be to
| difficult to add to a compiler.

Not necessarily. All you have to do is take care of the boundary conditions correctly.
--
	Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
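[Editorial aside, not AMD's library code: the "boundary conditions" remark can be sketched in C. A cpbyte-style "any byte zero?" test - simulated here in software with a well-known bit trick - lets strlen scan a word at a time, after first stepping byte-wise up to a word boundary.]

```c
#include <stdint.h>
#include <stddef.h>

/* Software stand-in for a cpbyte-against-zero instruction:
 * nonzero iff some byte of w is 0x00. */
static int word_has_nul(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

size_t strlen_wordwise(const char *s)
{
    const char *p = s;

    /* Boundary condition 1: advance byte-wise to word alignment,
     * so the word loads below are always aligned. */
    while (((uintptr_t)p & 3) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Main loop: one test per word instead of per byte.  This may
     * inspect bytes past the terminator, but an aligned word load
     * never crosses a page boundary, so no stray trap occurs. */
    {
        const uint32_t *w = (const uint32_t *)p;
        while (!word_has_nul(*w))
            w++;
        p = (const char *)w;
    }

    /* Boundary condition 2: locate the exact null within the word. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

The same head/tail structure carries over to strcpy and strcmp; only the per-word action in the main loop changes.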