pbickers@tamaluit.phys.uidaho.edu (Paul Bickerstaff) (12/15/90)
> > Another easy-to-memorize benchmark is the computation of the sum
> > of the first 10 million terms in the harmonic series.

I've also used this as a quick but (very rough) guide to floating
point speed (but mainly to orient myself on a new system).

> This is a FORTRAN version, it should not be too hard to translate
> even without f2c :-)
>
>       PROGRAM RR
>       DOUBLE PRECISION R
>       R=0.0
>       DO 10 I=1,10000000
                  ^^^^^^^^
As a general tutorial-type comment, one should always sum series by
doing the smallest terms first.  This is for numerical accuracy.  OK,
the smallest term here is 10^-7 and we're using double precision, but
the comment still stands as a matter of programming practice, and
summing in the reverse order may give very different results.

>       R=R+1/DBLE(I)
>    10 CONTINUE
>       WRITE(*,*)R,I
>       END
>
> This one is obviously testing floating-point performance only.  The

I don't think this is true.  It is also testing a tight DO loop.

> emphasis on divisions might give biased results.  It vectorizes
                        ^^^^
This and other things *will* give biased results.  Heck, *every*
benchmark gives biased results.  The trick is to choose a benchmark
(or create your own) which matches your applications.  There is not a
single benchmark, MFLOPS, MIPS, SPECmarks or whatever, that means
anything worthwhile if you don't know exactly how relevant it is to
what you're doing.

If the harmonic series has any value at all, it is in educating people
just how useless benchmarks are.  E.g. (I won't include exact code,
but mine was double precision with reverse order of summation; also, I
only summed 1 million terms):

IBM RS6000/320
        2.14    xlf
        1.41    xlf -O
        1.33    xlf -O -Q'OPT(3)'
        (July '90 results)

Mips Magnum 3000
        1.2     f77 -O0   (i.e. no optimizations)
        0.9     f77       (default level = f77 -O1)
        0.5     f77 -O2
        (Fortran 2.11, RISCos 4.51)

Times are all user times in secs.

So how come a 3.6 MFLOP machine can run Fortran at about twice the
speed of a 7.4 MFLOP machine?  (Yes, I have this the right way
around!)  Answer: Easy.

(This article is not intended as IBM bashing.  I have Fortran codes
which do run much faster on the RS6000.  The IBM does excel at the
LINPACK benchmark, but unless you do a lot of 100x100 array
manipulations the 7.4 MFLOP LINPACK number clearly doesn't mean much.
Nor do the times for the harmonic series.)

Paul Bickerstaff                 Internet: pbickers@tamaluit.phys.uidaho.edu
Physics Dept., Univ. of Idaho    Phone:    (208) 885 6809
Moscow ID 83843, USA             FAX:      (208) 885 6173
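P.S.  To see the summation-order effect for yourself without hunting
up a Fortran compiler, here is a minimal sketch in C.  It deliberately
uses single precision so that the loss of the small terms actually
shows up; the exact figures printed will vary from machine to machine.

    #include <stdio.h>

    int main(void)
    {
        const int n = 10000000;
        float fwd = 0.0f, rev = 0.0f;
        int i;

        for (i = 1; i <= n; i++)          /* largest terms first  */
            fwd += 1.0f / (float) i;

        for (i = n; i >= 1; i--)          /* smallest terms first */
            rev += 1.0f / (float) i;

        /* The forward sum stalls well short of the true value (about
           16.6953 in double precision); the reverse sum comes much
           closer, even though both perform the same 10 million
           divides and adds.                                        */
        printf("forward: %f    reverse: %f\n", fwd, rev);
        return 0;
    }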
patrick@convex.COM (Patrick F. McGehearty) (12/18/90)
Here is a numerically more precise version of the harmonic code
following the suggestion of Paul Bickerstaff
(pbickers@tamaluit.phys.uidaho.edu):

      PROGRAM RR
      DOUBLE PRECISION R
      R=0.0
      DO 10 I=10000000,1,-1
      R=R+1/DBLE(I)
   10 CONTINUE
      WRITE(*,*)R,I
      END

It runs at the same speed as the original on a Convex.  The reported
time of 8.7 is for a C120, which is our original product, currently
>6-year-old technology.  No surprise that the workstations have caught
up to it.  A C210 gets the job done in around 2.2 seconds.

As has been noted, this benchmark (assuming good compiler technology)
primarily tests divide speed, plus a little bit of summation work.  I
criticized the bc benchmark on several points, some of which could
also be applied to this harmonic benchmark or any other small
benchmark.  However, there is a critical difference:  improvements to
compilers or hardware that help this benchmark are likely to help real
customers' codes.  I examined our (Convex's) assembly code, and
identified a fairly minor optimization that would make this loop run
2% faster.  The real benefit is that the same optimization would apply
to most other codes that use DBLE(I).  Since one of my tasks here is
to identify such opportunities, this benchmark helped me do my job
better.

Of course, since the entire computation is visible at compile time,
there is the ultimate optimization which computes the sum at compile
time and just generates the assignment of 16.695311... to R, along
with an appropriate assignment to I.  The potential for this sort of
optimization is why it is so dangerous to rely on standard, well-known
benchmarks for serious evaluation purposes.

It happens, though.  Some may not realize the extent to which vendors
spend effort on optimizations which have no meaning outside the
"standard" benchmarks.  Serious comparison of vendor results can
identify a number of these types of optimizations.  For example,
Berkeley developed some tests to measure the performance of the
Berkeley kernel for a variety of system calls.  This suite is good for
the purpose for which it was written, which is to allow measurement of
the various efforts at tuning kernel performance.  However, it also
started being used to evaluate the many Unix boxes that are available.
One test in particular was used to measure the time to do a minimal
kernel call.  The test called getpid() 10,000 times.  Since the
standard getpid does almost nothing besides changing into the kernel
protection domain, initially it was a valid, easy-to-compare measure.
However, some vendors changed their libraries so that on its first
invocation, getpid saved its result in a static variable, and then on
all following invocations, it used that static value instead of
calling the kernel (roughly the trick sketched below).  This change
meant that the Berkeley test reported the time for a subroutine call
as the time for a kernel call.  (Disclaimer:  Up through the current
release, Convex getpid invokes the kernel on every getpid call.)

Besides giving misleading results, it was a misapplication of
resources.  I have not seen a real application that uses getpid
frequently.  While the optimization is fairly trivial, effort inside
the kernel at the call interface would allow all kernel calls to run
faster.  I do not want to over-criticize the technical effort of
tuning a given benchmark.  Generally, an engineer is told "Do whatever
it takes to make this program run faster."  With the typical time
pressure most work under, there is frequently not time to develop the
general optimization, so only the special case gets covered.
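To make the getpid trick concrete, here is a rough sketch in C of the
sort of thing such a library might do.  This is illustrative only, not
any particular vendor's libc, and the wrapper name is made up:

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    static pid_t cached_pid = 0;        /* 0 means "not fetched yet" */

    /* Hypothetical library routine: traps to the kernel once, then
       answers out of a static variable on every later call.  A real
       library would also have to clear the cache in fork(), which is
       one reason the trick is riskier than it looks.                */
    pid_t cached_getpid(void)
    {
        if (cached_pid == 0)
            cached_pid = getpid();      /* the only real kernel call */
        return cached_pid;
    }

    int main(void)
    {
        int i;
        pid_t p = 0;

        /* A Berkeley-style timing loop over this routine measures a
           procedure call plus a compare, not a kernel entry.        */
        for (i = 0; i < 10000; i++)
            p = cached_getpid();

        printf("pid = %ld\n", (long) p);
        return 0;
    }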
These sorts of tunings are indirectly under the control of customers
and those who report technical results.  If a benchmark is widely
reported and used, then vendors will attempt to improve their systems
to give better results on it.  The Whetstone and Linpack benchmarks
are well-known cases of such wide reporting.  Dongarra's Linpack
reports have encouraged vector architectures and compiler
optimizations which help real codes in Computational Fluid Dynamics
and other application areas.  Whetstone has encouraged faster
transcendentals and other floating point operations, which help
Computational Chemistry applications among others.  However, these two
benchmarks have been around so long that they have pretty much been
milked dry, and it is time to move on to other simple loops, or when
possible, more complex application codes.

The new SPECmark and Perfect Club benchmarks are valuable, partly due
to their size.  Detailed study and examination of them will yield
significant improvements to compilers and architectures which will
benefit many programs besides the benchmarks.  As vendors get these
benchmarks tuned, new ones need to be developed every few years to
"keep the vendors honest".  If you are considering proposing a new
standard benchmark, first consider what it is testing, and whether
that is something you want to see improved.

In summary, standard benchmarks can be useful in selecting the
"initial list" of vendors to invite to make bids, or perhaps for very
small procurements (say, less than $100,000).  However, major
procurements (> half a million dollars) deserve some effort in
selecting a "non-standard" load representative of the intended usage
of the machine or machines, in addition to reviewing the standard
benchmark values.  This applies to low-end workstations as well as the
fileserver/compute engine products.  Fifty workstations at $8,000 each
is a non-trivial piece of change.