[comp.benchmarks] Ole Swang's benchmark: Sum of Harmonic Series

mark@mips.COM (Mark G. Johnson) (12/16/90)

In article <OLES.90Dec13213301@kelvin.uio.no> oles@kelvin.uio.no (Ole Swang) writes:
  >
  >Another easy-to-memorize benchmark is the computation of the sum
  >of the first 10 million terms in the harmonic series.
  >This is a FORTRAN version, it should not be too hard to translate
  >even without f2c :-)
  >
  >      PROGRAM RR
  >      DOUBLE PRECISION R
  >      R=0.0
  >        DO 10 I=1,10000000
  >        R=R+1/DBLE(I)
  >10      CONTINUE
  >      WRITE(*,*)R,I
  >      END

Because the Suns that I have access to don't have Fortran compilers,
I translated (without f2c :-), giving
	#include <stdio.h>
	main() {
		double	r ;
		int	i ;
		r = 0.0;
		for(i=1; i<=10000000; i++)
		    { r += (1.0/( (double) i ) ) ; }
		printf(" r = %16.7le, i = %d\n", r, i);
	}


>It vectorizes fully on the vectorizing compilers I've tested it on
>(Cray and Convex).  It has the advantage over the bc benchmark that
>it's the same code every time.
>


Adding to Ole's list (user CPU seconds from /bin/time),


>Cray X/MP 216               0.29  *
 MIPS RC6280                 3.3            {added}
 MIPS RC3230 Magnum          8.1            {added}
>Convex C 120                8.7
 Sun SPARCstation II        10.0            {added}
>DECstation 5000/200        10.5
>DECsystem 5400             13.1
 Sun SPARCstation 1+        33.0            {added}
>VAX 6330/VMS5.3 FPA        41.9
>VAX 8650/VMS5.3 FPA        55.3
>VAX 8600/VMS5.2 FPA        77.1
>Sun 3/60 (m68881)         105.6
>
>
>* The code was modified to single precision for the Cray, as this
>yields the desired 64-bit accuracy.
>
-- 
 -- Mark Johnson	
 	MIPS Computer Systems, 930 E. Arques M/S 2-02, Sunnyvale, CA 94086
	(408) 524-8308    mark@mips.com  {or ...!decwrl!mips!mark}

mark@mips.COM (Mark G. Johnson) (12/16/90)

>
>It vectorizes fully on the vectorizing compilers I've tested it on
>(Cray and Convex).  It has the advantage over the bc benchmark that
>it's the same code every time.
>
>Cray X/MP 216               0.29
>MIPS RC6280                 3.3
>MIPS RC3230 Magnum          8.1
>Convex C 120                8.7   <**** full vectorization ???
>Sun SPARCstation II        10.0
>DECstation 5000/200        10.5
>...etc

Since the Convex measurement is "surrounded" by little bitty workstations,
it might not have been running the fully vectorized object code.  Maybe
someone can replicate the C-120 measurement above and see what the
generated code is doing.

1/2 baked random idea: this pgm performs divide-and-accumulate, so perhaps
the machines that have a multiply-accumulate atomic instruction (Apollo
DN-10000, Intel i860, IBM RS-6000, etc.) might possibly excel.  Or,
perhaps not.
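
Roughly what I have in mind, as a sketch (the single-precision divide
below only stands in for the cheap reciprocal-estimate step such
machines provide in hardware; each Newton-Raphson refinement,
x' = x*(2 - d*x), is one multiply-subtract plus one multiply, which is
where a multiply-accumulate unit would earn its keep, or not):

    #include <stdio.h>

    int main()
    {
            double r = 0.0;
            double d, x;
            long i;

            for (i = 1; i <= 10000000L; i++) {
                    d = (double) i;
                    x = 1.0f / (float) d;  /* stand-in for a hardware reciprocal estimate */
                    x = x * (2.0 - d * x); /* one refinement: roughly 48 good bits */
                    x = x * (2.0 - d * x); /* second refinement: close to full double */
                    r += x;
            }
            printf("r = %.7f, i = %ld\n", r, i);
            return 0;
    }
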
-- 
 -- Mark Johnson	
 	MIPS Computer Systems, 930 E. Arques M/S 2-02, Sunnyvale, CA 94086
	(408) 524-8308    mark@mips.com  {or ...!decwrl!mips!mark}

oles@kelvin.uio.no (Ole Swang) (12/18/90)

>>
>>It vectorizes fully on the vectorizing compilers I've tested it on
>>(Cray and Convex).  It has the advantage over the bc benchmark that
>>it's the same code every time.
>>
>>Cray X/MP 216               0.29
>>MIPS RC6280                 3.3
>>MIPS RC3230 Magnum          8.1
>>Convex C 120                8.7   <**** full vectorization ???
>>Sun SPARCstation II        10.0
>>DECstation 5000/200        10.5
>>...etc
>
>Since the Convex measurement is "surrounded" by little bitty workstations,
>it might not have been running the fully vectorized object code.  Maybe
>someone can replicate the C-120 measurement above and see what the
>generated code is doing.
>
>1/2 baked random idea: this pgm performs divide-and-accumulate, so perhaps
>the machines that have a multiply-accumulate atomic instruction (Apollo
>DN-10000, Intel i860, IBM RS-6000, etc.) might possibly excel.  Or,
>perhaps not.

The compiler insists that the loop is fully vectorized - in scalar
mode (compiler option -O1), the C120 uses 44 secs. The C120 isn't very
fast... The quantum chemical program package Gaussian 88 (tm) runs
about an order of magnitude faster on a Cray X/MP. This benchmark, however,
runs about 25 times faster on the Cray. Perhaps the C120 is relatively
weaker on divisions.
--
-----------------------------------------------------------------------
Ole Swang     assistant professor, Dept. of Chemistry, U. of Oslo
-----------------------------------------------------------------------

suitti@ima.isc.com (Stephen Uitti) (12/19/90)

In article <44125@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>Because the Suns that I have access to don't have Fortran compilers,
>I translated (without f2c :-), giving
>	#include <stdio.h>
>	main() {
>		double	r ;
>		int	i ;
>		r = 0.0;
>		for(i=1; i<=10000000; i++)
>		    { r += (1.0/( (double) i ) ) ; }
>		printf(" r = %16.7le, i = %d\n", r, i);
>	}

I translated it as:

    #include <stdio.h>
    main()
    {
            register double r;
            register long int i;
    
            r = 0.0;
            i = 10000000;
            do {
                    r += 1.0 / (double) i;
            } while (--i);
            printf("%f, %ld\n", r, i);
    }

I'm not sure why Mark used such an odd float format.  I prefer
the output "16.695311, 0" to " r =    1.6695311e+01, i = 0".
Does this match the FORTRAN output?

I used a do-while because of my PDP-11 days.  The Ritchie
compiler would produce an "sob" (subtract one & branch)
instruction for the do-while, rather than an increment, compare,
and branch.  On the 80386, the code for the "for" loop is:
        incl %eax
        cmpl $10000000,%eax
        jle .L5
and the code for the "do-while" is:
        decl %eax
        jne .L2

The do-while is faster.  On the 386/25 with 387, the "for" loop
took 101.6 seconds, and the "do-while" loop took 99.8 seconds.
Now, most people will not notice the 2% speed up, but often the
loop overhead is more significant.

It is interesting that you want to add up the small stuff first.
That coding happens to be quicker on most machines, too.

Adding to Ole's list (user CPU seconds from /bin/time),
Cray X/MP 216               0.29
MIPS RC6280                 3.3
MIPS RC3230 Magnum          8.1
Convex C 120                8.7
Sun SPARCstation II        10.0
DECstation 5000/200        10.5
DECsystem 5400             13.1
Sun SPARCstation 1+        33.0
VAX 6330/VMS5.3 FPA        41.9
Compaq 486/25              44.4 (new)
VAX 8650/VMS5.3 FPA        55.3
Compaq 386/33		   71.7 (new)
VAX 8600/VMS5.2 FPA        77.1
Compaq 386/25              99.8 (new)
Sun 3/60 (m68881)         105.6
VAX 11/780		  614.1 (new)

Does anyone remember running 20+ people on a VAX 11/780?
Here's the "uptime" output on our VAX 11/780:
  5:16pm up 34 days, 22 hrs, 21 users, load average: 3.31, 3.10, 2.71

Reliability is another item that benchmarks generally miss.
In 34 days, even a VAX 11/780 accumulates a significant number of
cycles: 34 days is roughly 2.9 million seconds, and dividing by the
614.1/0.29 speed ratio makes that worth about 1,400 seconds of Cray
X/MP 216 time for this sort of thing.

Ok, so 24 minutes of Cray time probably costs less than the
maintenance on a VAX 11/780 for a month (one would hope you've
depreciated the VAX to zero by now).

Cost effectiveness is something that hasn't been discussed much.

A 386/25 system will be obsolete in 3 years, and costs about
$4,500 for that time for a system capable of running this sort of
application.  That's $1,500 per year.  It performs 6.15 times
better than the VAX 780 (614.1 / 99.8).  This gives us a rough cost
effectiveness of about $240/VAX MIP:
        $1500 / (614.1 / 99.8) = $244, give or take

Is anyone brave enough to figure this out for a Cray?  ... a PC/XT?

Stephen.
suitti@ima.isc.com

oles@kelvin.uio.no (Ole Swang) (12/19/90)

In article <20667@dirtydog.ima.isc.com> suitti@ima.isc.com (Stephen Uitti) writes:
        < different translations into C deleted >

>
>I'm not sure why Mark used such an odd float format.  I prefer
>the output "16.695311, 0" to " r =    1.6695311e+01, i = 0".
>Does this match the FORTRAN output?

That is system dependent.  The statement WRITE(*,*) picks a default
format.  (On most systems, it _does_ match.)

   < stuff about do-while and efficiency deleted >

>It is interesting that you want to add up the small stuff first.
>That coding happens to be quicker on most machines, too.

Not on the Cray! The time went up to 0.36 secs (that's 24 percent more)
when using negative increment. Explanation, anyone?
The answer differed in the tenth digit.

Here is another update, containing Stephen's additions and a couple that
were mailed to me (thanks!)

Cray X/MP 216               0.29
MIPS RC6280                 3.3
HP9000/870                  6.7 (new)
MIPS RC3230 Magnum          8.1
Convex C 120                8.7
Sun SPARCstation II        10.0
Sun 4/75                   10.1 (new)
DECstation 5000/200        10.5
DECsystem 5400             13.1
HP9000/845                 13.7 (new)
DECstation 3100            15.9 (new)
Sun SPARCstation 1+        33.0
VAX 6330/VMS5.3 FPA        41.9
Compaq 486/25              44.4 
HP9000/825                 53.9 (new)
VAX 8650/VMS5.3 FPA        55.3
Compaq 386/33		   71.7 
VAX 8600/VMS5.2 FPA        77.1
Compaq 386/25              99.8
Sun 3/60 (m68881)         105.6
VAX 11/780		  614.1


 < comments on reliability deleted >


>Cost effectiveness is something that hasn't been discussed much.
>
>A 386/25 system will be obsolete in 3 years, and costs about
>$4,500 dollars for that time for a system capably of running this
>sort of application.  That's $1,500 per year.  It performs 6.15
>times better than the VAX 780.  This gives us a rough cost
>effectiveness of $240/VAX MIP.
>        $1500 / (614.1 / 99.8)
>
>Is anyone brave enough to figure this out for a Cray?  ... a PC/XT?

Without having investigated, I'm quite sure that a high-end Cray is,
for the time being, the worst computer in the world when it comes to
cost-effectiveness.  On the other hand, you would not want to use a
workstation for a well-vectorizing, iterative algorithm that reads a
few hundred Mb from disk on each iteration.  (That is the kind of
algorithm that eats most of the cycles in a typical quantum chemical
calculation, which is what I use computers for when I'm not reading
news.)


--
-----------------------------------------------------------------------
Ole Swang     assistant lecturer, Dept. of Chemistry, U. of Oslo
-----------------------------------------------------------------------

davidsen@sixhub.UUCP (Wm E. Davidsen Jr) (12/24/90)

In article <1990Dec19.004003.20667@dirtydog.ima.isc.com> suitti@ima.isc.com (Stephen Uitti) writes:

| The do-while is faster.  On the 386/25 with 387, the "for" loop
| took 101.6 seconds, and the "do-while" loop took 99.8 seconds.

    Nope. You have changed both the loop type and the algorithm here.
The corresponding for loop would be

	for (n=10000000; --n; ) { ... }

and the reason your version runs faster is that it avoids the compare,
not because it's a do-while.

  As in most benchmarks the effect of changing the program to make it
faster also means the numbers no longer compare to the old values in a
meaningful way.
-- 
bill davidsen - davidsen@sixhub.uucp (uunet!crdgw1!sixhub!davidsen)
    sysop *IX BBS and Public Access UNIX
    moderator of comp.binaries.ibm.pc and 80386 mailing list
"Stupidity, like virtue, is its own reward" -me

davidsen@sixhub.UUCP (Wm E. Davidsen Jr) (12/24/90)

In article <OLES.90Dec19111111@kelvin.uio.no> oles@kelvin.uio.no (Ole Swang) writes:

| Not on the Cray! The time went up to 0.36 secs (that's 24 percent more)
| when using negative increment. Explanation, anyone?
| The answer differed in the tenth digit.

  The result calculated by starting from the large values of I and
decrementing (that is, adding the small terms first) is "more
accurate" than doing it the other way, because you avoid adding a
large number to a small one, where the value of the small number can
be lost to rounding.  By adding all the small (1/large) terms first,
the running sum is kept more nearly correct.
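
  A quick way to see the effect, as a sketch (single precision is used
here purely to exaggerate it; in double precision the two orders only
differ around the tenth digit, as Ole reported):

    #include <stdio.h>

    int main()
    {
            float up = 0.0f, down = 0.0f;
            long i;

            /* large terms first: the tiny late terms are added to a
               big running sum and much of their value rounds away */
            for (i = 1; i <= 10000000L; i++)
                    up += 1.0f / (float) i;

            /* small terms first: the partial sums stay comparable in
               size, so less of each term is lost */
            for (i = 10000000L; i >= 1; i--)
                    down += 1.0f / (float) i;

            printf("ascending  %.6f\n", up);    /* comes out noticeably low */
            printf("descending %.6f\n", down);  /* much closer to 16.6953   */
            return 0;
    }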

  I use quotes on "more accurate," since the result is of no interest
and might as well be wrong.

  My guess on the timing: in the original (incrementing) order, when
the small values are added to the large running sum, the add is
aborted after normalization fails, so that order runs faster; the
decrementing order gets no such shortcut, hence the longer time.
That's purely a guess, and I haven't done the formal numerical
analysis to back it up.
-- 
bill davidsen - davidsen@sixhub.uucp (uunet!crdgw1!sixhub!davidsen)
    sysop *IX BBS and Public Access UNIX
    moderator of comp.binaries.ibm.pc and 80386 mailing list
"Stupidity, like virtue, is its own reward" -me

suitti@ima.isc.com (Stephen Uitti) (12/28/90)

In article <2720@sixhub.UUCP> davidsen@sixhub.UUCP (bill davidsen) writes:
>In article <1990Dec19.004003.20667@dirtydog.ima.isc.com> suitti@ima.isc.com (Stephen Uitti) writes:
>
>| The do-while is faster.  On the 386/25 with 387, the "for" loop
>| took 101.6 seconds, and the "do-while" loop took 99.8 seconds.
>
>    Nope. You have changed both the loop type and the algorithm here.
>The corresponding for loop would be
>
>	for (n=10000000; --n; ) { ... }
>
>and the reason your version runs faster is that it avoids the compare,
>not because it's a do-while.
>
>  As in most benchmarks the effect of changing the program to make it
>faster also means the numbers no longer compare to the old values in a
>meaningful way.

As a benchmark of machines, it is pretty bad.  It does lots of
floating point divides, with some loop overhead.  What the benchmark
might be able to tell us is something about tuning floating point
divides or loop overhead.

Floating point divides don't happen to interest me much, but loop
overhead does, since lots of programs have it.
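
One crude way to split the two apart - a sketch only, nothing
rigorous - is to time the loop as posted and then again with the
divide swapped for a multiply by a constant, and charge the
difference to the divider:

    #include <stdio.h>
    #include <time.h>

    int main()
    {
            double r1 = 0.0, r2 = 0.0;
            long i;
            clock_t t0, t1, t2;

            t0 = clock();
            for (i = 1; i <= 10000000L; i++)
                    r1 += 1.0 / (double) i;         /* divide + add + loop overhead   */
            t1 = clock();
            for (i = 1; i <= 10000000L; i++)
                    r2 += 1.0e-7 * (double) i;      /* multiply + add + loop overhead */
            t2 = clock();

            /* print the sums so the loops can't be optimized away */
            printf("%f %f\n", r1, r2);
            printf("with divide:   %.2f s\n", (t1 - t0) / (double) CLOCKS_PER_SEC);
            printf("with multiply: %.2f s\n", (t2 - t1) / (double) CLOCKS_PER_SEC);
            return 0;
    }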

I'm aware that my version wasn't the same as the original - it
can even produce a different answer.  It is interesting that your
version is not the same as my do-while, either.  For one, it doesn't
compute 1/10000000: in a for loop the condition is tested before the
body, and your decrement sits inside that condition.
	for (i = 10000000; i != 0; i--) {
is more accurate.

Oddly, this generates
	decl %eax
	jne .L5

for one compiler, and

	decl	%edi
	testl	%edi,%edi
	jne	.L70

for another on the same system.  However, the do-while generates

	decl	%edi
	jne	.L69

on the available compilers.  At one time, I thought that Dennis put
the do-while into the language just for subtract-one-and-branch
instructions.

It is more difficult for a compiler to notice that a 'for' loop
can be optimized to get rid of the compare.  In a 'for' loop, the
test really is at the top.  In a 'do-while', the test is at the
bottom, where the optimization is.

Now, getting rid of one of the loop overhead instructions on the
386/25 (with 387) speeds up the program by 1.8% - hardly
noticeable for any real program.  The problem is that the
instruction removed is one of 12 in the loop, and by comparison a
very quick one.  Loop unrolling should do better.  However,
unrolling it 10 times slows the program down to 104.6 seconds.
Stuffing 79 instructions into the loop probably means the 386's
cache isn't getting hit as often - more than undoing any benefit.
In fact, unrolling the loop 5 times is also slower than not
unrolling.  I wonder if there are compilers out there that can do
loop unrolling and that also know how big the caches are.  Are they
smart enough to optimize this benchmark?  The loop index is not
invariant - it gets used in the loop.
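
For reference, the 10-times-unrolled version was along these lines
(a sketch, not the exact code; it leans on the trip count dividing
evenly by 10, which 10,000,000 does):

    #include <stdio.h>

    int main()
    {
            double r = 0.0;
            long i;

            /* loop control is paid once per 10 terms instead of once per term */
            for (i = 10000000L; i != 0; i -= 10) {
                    r += 1.0 / (double) i;
                    r += 1.0 / (double) (i - 1);
                    r += 1.0 / (double) (i - 2);
                    r += 1.0 / (double) (i - 3);
                    r += 1.0 / (double) (i - 4);
                    r += 1.0 / (double) (i - 5);
                    r += 1.0 / (double) (i - 6);
                    r += 1.0 / (double) (i - 7);
                    r += 1.0 / (double) (i - 8);
                    r += 1.0 / (double) (i - 9);
            }
            printf("%f, %ld\n", r, i);
            return 0;
    }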

I've been attempting to get this trivial benchmark to tell me
something about the tools I have.  It doesn't have anything really
exciting to say - just give or take a couple percent.  I don't
attempt to make my compilers faster - even when I have source -
it isn't my job.  I try to use the tools available.  For example,
I use multiplies over divides.  My motto has been "Don't trust the
compiler to convert 'x / 5.0' into 'x * 0.2'".
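
What I mean, as a two-function sketch:

    /* hoist the reciprocal by hand rather than hoping the compiler
       will; note the two forms can differ in the last bit, because
       0.2 has no exact binary representation - which is exactly why
       a careful compiler won't make the change on its own */
    double scale_div(double x) { return x / 5.0; }
    double scale_mul(double x) { return x * (1.0 / 5.0); }  /* compiler can fold 1.0/5.0 */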

However, some of the older, dumber compilers produce faster code
than newer, smarter compilers.  The new ones seem to get so caught
up in trying to figure out what my pointers are doing that they
forget about the simpler optimizations that were designed into the
language.  I don't know whether to laugh or cry.

Stephen.
suitti@ima.isc.com