pbickers@tamaluit.phys.uidaho.edu (Paul Bickerstaff) (12/15/90)
> > Another easy-to-memorize benchmark is the computation of the sum
> > of the first 10 million terms in the harmonic series.

I've also used this as a quick but (very rough) guide to floating
point speed (but mainly to orient myself on a new system).

> This is a FORTRAN version, it should not be too hard to translate
> even without f2c :-)
>
>       PROGRAM RR
>       DOUBLE PRECISION R
>       R=0.0
>       DO 10 I=1,10000000
                  ^^^^^^^^
As a general tutorial-type comment, one should always sum series by
doing the smallest terms first.  This is for numerical accuracy.  OK,
the smallest term here is 10^-7 and we're using double precision, but
the comment still stands as a matter of programming practice, and
summing in the reverse order may give very different results.

>       R=R+1/DBLE(I)
>    10 CONTINUE
>       WRITE(*,*)R,I
>       END
>
> This one is obviously testing floating-point performance only.  The

I don't think this is true.  It is also testing a tight DO loop.

> emphasis on divisions might give biased results.  It vectorizes
                        ^^^^
This and other things *will* give biased results.  Heck, *every*
benchmark gives biased results.  The trick is to choose a benchmark
(or create your own) which matches your applications.  There is not a
single benchmark, MFLOPS, MIPS, SPECmarks or whatever, that means
anything worthwhile if you don't know exactly how relevant it is to
what you're doing.

If the harmonic series has any value at all, it is in educating people
just how useless benchmarks are.  E.g. (I won't include exact code,
but mine was double precision with reverse order of summation; also, I
only summed 1 million terms):

IBM RS6000/320
        2.14    xlf
        1.41    xlf -O
        1.33    xlf -O -Q'OPT(3)'
        (July '90 results)

Mips Magnum 3000
        1.2     f77 -O0   (i.e. no optimizations)
        0.9     f77       (default level = f77 -O1)
        0.5     f77 -O2
        (Fortran 2.11, RISCos 4.51)

Times are all user times in secs.

So how come a 3.6 MFLOP machine can run Fortran at about twice the
speed of a 7.4 MFLOP machine?  (Yes, I have this the right way
around!)  Answer: Easy.

(This article is not intended as IBM bashing.  I have Fortran codes
which do run much faster on the RS6000.  The IBM does excel at the
LINPACK benchmark, but unless you do a lot of 100x100 array
manipulations the 7.4 MFLOP LINPACK number clearly doesn't mean much.
Nor do the times for the harmonic series.)

Paul Bickerstaff                 Internet: pbickers@tamaluit.phys.uidaho.edu
Physics Dept., Univ. of Idaho    Phone:    (208) 885 6809
Moscow ID 83843, USA             FAX:      (208) 885 6173
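P.S.  To see the summation-order effect for yourself without hunting
up a Fortran compiler, here is a minimal sketch in C.  It deliberately
uses single precision so that the loss of the small terms actually
shows up; the exact figures printed will vary from machine to machine.

    #include <stdio.h>

    int main(void)
    {
        const int n = 10000000;
        float fwd = 0.0f, rev = 0.0f;
        int i;

        for (i = 1; i <= n; i++)          /* largest terms first  */
            fwd += 1.0f / (float) i;

        for (i = n; i >= 1; i--)          /* smallest terms first */
            rev += 1.0f / (float) i;

        /* The forward sum stalls well short of the true value (about
           16.6953 in double precision); the reverse sum comes much
           closer, even though both perform the same 10 million
           divides and adds.                                        */
        printf("forward: %f    reverse: %f\n", fwd, rev);
        return 0;
    }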
patrick@convex.COM (Patrick F. McGehearty) (12/18/90)
Here is a numerically more precise version of the harmonic code
following the suggestion of Paul Bickerstaff
(pbickers@tamaluit.phys.uidaho.edu):

      PROGRAM RR
      DOUBLE PRECISION R
      R=0.0
      DO 10 I=10000000,1,-1
      R=R+1/DBLE(I)
   10 CONTINUE
      WRITE(*,*)R,I
      END

It runs at the same speed as the original on a Convex.  The reported
time of 8.7 is for a C120, which is our original product, currently
>6-year-old technology.  No surprise that the workstations have caught
up to it.  A C210 gets the job done in around 2.2 seconds.

As has been noted, this benchmark (assuming good compiler technology)
primarily tests divide speed, plus a little bit of summation work.  I
criticized the bc benchmark on several points, some of which could
also be applied to this harmonic benchmark or any other small
benchmark.  However, there is a critical difference:  improvements to
compilers or hardware that help this benchmark are likely to help real
customers' codes.  I examined our (Convex's) assembly code, and
identified a fairly minor optimization that would make this loop run
2% faster.  The real benefit is that the same optimization would apply
to most other codes that use DBLE(I).  Since one of my tasks here is
to identify such opportunities, this benchmark helped me do my job
better.

Of course, since the entire computation is visible at compile time,
there is the ultimate optimization which computes the sum at compile
time and just generates the assignment of 16.695311... to R, along
with an appropriate assignment to I.  The potential for this sort of
optimization is why it is so dangerous to rely on standard, well-known
benchmarks for serious evaluation purposes.

It happens, though.  Some may not realize the extent to which vendors
spend effort on optimizations which have no meaning outside the
"standard" benchmarks.  Serious comparison of vendor results can
identify a number of these types of optimizations.  For example,
Berkeley developed some tests to measure the performance of the
Berkeley kernel for a variety of system calls.  This suite is good for
the purpose for which it was written, which is to allow measurement of
the various efforts at tuning kernel performance.  However, it also
started being used to evaluate the many Unix boxes that are available.
One test in particular was used to measure the time to do a minimal
kernel call.  The test called getpid() 10,000 times.  Since the
standard getpid does almost nothing besides changing into the kernel
protection domain, initially it was a valid, easy-to-compare measure.
However, some vendors changed their libraries so that on its first
invocation, getpid saved its result in a static variable, and then on
all following invocations, it used that static value instead of
calling the kernel (roughly the trick sketched below).  This change
meant that the Berkeley test reported the time for a subroutine call
as the time for a kernel call.  (Disclaimer:  Up through the current
release, Convex getpid invokes the kernel on every getpid call.)

Besides giving misleading results, it was a misapplication of
resources.  I have not seen a real application that uses getpid
frequently.  While the optimization is fairly trivial, effort inside
the kernel at the call interface would allow all kernel calls to run
faster.  I do not want to over-criticize the technical effort of
tuning a given benchmark.  Generally, an engineer is told "Do whatever
it takes to make this program run faster."  With the typical time
pressure most work under, there is frequently not time to develop the
general optimization, so only the special case gets covered.
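To make the getpid trick concrete, here is a rough sketch in C of the
sort of thing such a library might do.  This is illustrative only, not
any particular vendor's libc, and the wrapper name is made up:

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    static pid_t cached_pid = 0;        /* 0 means "not fetched yet" */

    /* Hypothetical library routine: traps to the kernel once, then
       answers out of a static variable on every later call.  A real
       library would also have to clear the cache in fork(), which is
       one reason the trick is riskier than it looks.                */
    pid_t cached_getpid(void)
    {
        if (cached_pid == 0)
            cached_pid = getpid();      /* the only real kernel call */
        return cached_pid;
    }

    int main(void)
    {
        int i;
        pid_t p = 0;

        /* A Berkeley-style timing loop over this routine measures a
           procedure call plus a compare, not a kernel entry.        */
        for (i = 0; i < 10000; i++)
            p = cached_getpid();

        printf("pid = %ld\n", (long) p);
        return 0;
    }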
These sorts of tunings are indirectly under the control of customers
and those who report technical results.  If a benchmark is widely
reported and used, then vendors will attempt to improve their systems
to give better results on it.  The Whetstone and Linpack benchmarks
are well-known cases of such wide reporting.  Dongarra's Linpack
reports have encouraged vector architectures and compiler
optimizations which help real codes in Computational Fluid Dynamics
and other application areas.  Whetstone has encouraged faster
transcendentals and other floating point operations, which help
Computational Chemistry applications among others.  However, these two
benchmarks have been around so long that they have pretty much been
milked dry, and it is time to move on to other simple loops, or when
possible, more complex application codes.

The new SPECmark and Perfect Club benchmarks are valuable, partly due
to their size.  Detailed study and examination of them will yield
significant improvements to compilers and architectures which will
benefit many programs besides the benchmarks.  As vendors get these
benchmarks tuned, new ones need to be developed every few years to
"keep the vendors honest".  If you are considering proposing a new
standard benchmark, first consider what it is testing, and whether
that is something you want to see improved.

In summary, standard benchmarks can be useful in selecting the
"initial list" of vendors to invite to make bids, or perhaps for very
small procurements (say, less than $100,000).  However, major
procurements (> half a million dollars) deserve some effort in
selecting a "non-standard" load representative of the intended usage
of the machine or machines, in addition to reviewing the standard
benchmark values.  This applies to low-end workstations as well as the
fileserver/compute engine products.  Fifty workstations at $8,000 each
is a non-trivial piece of change.