mark@mips.COM (Mark G. Johnson) (12/16/90)
In article <OLES.90Dec13213301@kelvin.uio.no> oles@kelvin.uio.no (Ole Swang) writes:
 >
 >Another easy-to-memorize benchmark is the computation of the sum
 >of the first 10 million terms in the harmonic series.
 >This is a FORTRAN version, it should not be too hard to translate
 >even without f2c :-)
 >
 >      PROGRAM RR
 >      DOUBLE PRECISION R
 >      R=0.0
 >      DO 10 I=1,10000000
 >      R=R+1/DBLE(I)
 >10    CONTINUE
 >      WRITE(*,*)R,I
 >      END

Because the Suns that I have access to don't have Fortran compilers,
I translated (without f2c :-), giving

	#include <stdio.h>
	main() {
		double r ;
		int i ;
		r = 0.0;
		for(i=1; i<=10000000; i++)
			{ r += (1.0/( (double) i ) ) ; }
		printf(" r = %16.7le, i = %d\n", r, i);
	}

 >It vectorizes fully on the vectorizing compilers I've tested it on
 >(Cray and Convex). It has the advantage over the bc benchmark that
 >it's the same code every time.
 >

Adding to Ole's list (user CPU seconds from /bin/time),

 >	Cray X/MP 216            0.29  *
	MIPS RC6280              3.3      {added}
	MIPS RC3230 Magnum       8.1      {added}
 >	Convex C 120             8.7
	Sun SPARCstation II     10.0      {added}
 >	DECstation 5000/200     10.5
 >	DECsystem 5400          13.1
	Sun SPARCstation 1+     33.0      {added}
 >	VAX 6330/VMS5.3 FPA     41.9
 >	VAX 8650/VMS5.3 FPA     55.3
 >	VAX 8600/VMS5.2 FPA     77.1
 >	Sun 3/60 (m68881)      105.6
 >
 >
 >* The code was modified to single precision for the Cray, as this
 >yields the wanted 64-bit accuracy.

-- 
 -- Mark Johnson
    MIPS Computer Systems, 930 E. Arques M/S 2-02, Sunnyvale, CA 94086
    (408) 524-8308    mark@mips.com  {or ...!decwrl!mips!mark}
mark@mips.COM (Mark G. Johnson) (12/16/90)
 > >It vectorizes fully on the vectorizing compilers I've tested it on
 > >(Cray and Convex). It has the advantage over the bc benchmark that
 > >it's the same code every time.
 > 
 >	Cray X/MP 216            0.29
 >	MIPS RC6280              3.3
 >	MIPS RC3230 Magnum       8.1
 >	Convex C 120             8.7     <**** full vectorization ???
 >	Sun SPARCstation II     10.0
 >	DECstation 5000/200     10.5
 >	...etc

Since the Convex measurement is "surrounded" by little bitty workstations,
it might not have been running the fully vectorized object code.  Maybe
someone can replicate the C-120 measurement above and see what the
generated code is doing.

1/2 baked random idea: this pgm performs divide-and-accumulate, so perhaps
the machines that have a multiply-accumulate atomic instruction (Apollo
DN-10000, Intel i860, IBM RS-6000, etc.) might possibly excel.  Or,
perhaps not.
-- 
 -- Mark Johnson
    MIPS Computer Systems, 930 E. Arques M/S 2-02, Sunnyvale, CA 94086
    (408) 524-8308    mark@mips.com  {or ...!decwrl!mips!mark}
oles@kelvin.uio.no (Ole Swang) (12/18/90)
 >>
 >>It vectorizes fully on the vectorizing compilers I've tested it on
 >>(Cray and Convex). It has the advantage over the bc benchmark that
 >>it's the same code every time.
 >>
 >>	Cray X/MP 216            0.29
 >>	MIPS RC6280              3.3
 >>	MIPS RC3230 Magnum       8.1
 >>	Convex C 120             8.7     <**** full vectorization ???
 >>	Sun SPARCstation II     10.0
 >>	DECstation 5000/200     10.5
 >>	...etc
 >
 >Since the Convex measurement is "surrounded" by little bitty workstations,
 >it might not have been running the fully vectorized object code.  Maybe
 >someone can replicate the C-120 measurement above and see what the
 >generated code is doing.
 >
 >1/2 baked random idea: this pgm performs divide-and-accumulate, so perhaps
 >the machines that have a multiply-accumulate atomic instruction (Apollo
 >DN-10000, Intel i860, IBM RS-6000, etc.) might possibly excel.  Or,
 >perhaps not.
 >-- 
 > -- Mark Johnson
 >    MIPS Computer Systems, 930 E. Arques M/S 2-02, Sunnyvale, CA 94086
 >    (408) 524-8308    mark@mips.com  {or ...!decwrl!mips!mark}

The compiler insists that the loop is fully vectorized - in scalar mode
(compiler option -O1), the C120 uses 44 secs.  The C120 isn't very fast...
The quantum chemical program package Gaussian 88 (tm) runs about an order
of magnitude faster on a Cray X/MP than on the C120.  This benchmark,
however, runs about 25 times faster on the Cray.  Perhaps the C120 is
relatively weaker on divisions.
--
-----------------------------------------------------------------------
Ole Swang      assistant professor, Dept. of Chemistry, U. of Oslo
-----------------------------------------------------------------------
suitti@ima.isc.com (Stephen Uitti) (12/19/90)
In article <44125@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>Because the Suns that I have access to don't have Fortran compilers,
>I translated (without f2c :-), giving
>	#include <stdio.h>
>	main() {
>		double r ;
>		int i ;
>		r = 0.0;
>		for(i=1; i<=10000000; i++)
>			{ r += (1.0/( (double) i ) ) ; }
>		printf(" r = %16.7le, i = %d\n", r, i);
>	}

I translated it as:

	#include <stdio.h>
	main() {
		register double r;
		register long int i;
		r = 0.0;
		i = 10000000;
		do {
			r += 1.0 / (double) i;
		} while (--i);
		printf("%f, %ld\n", r, i);
	}

I'm not sure why Mark used such an odd float format.  I prefer
the output "16.695311, 0" to " r = 1.6695311e+01, i = 0".
Does this match the FORTRAN output?

I used a do-while because of my PDP-11 days.  The Ritchie compiler
would produce an "sob" (subtract one & branch) instruction for the
do-while, rather than an increment, compare, and branch.  On the
80386, the code for the "for" loop is:

	incl %eax
	cmpl $10000000,%eax
	jle .L5

and the code for the "do-while" is:

	decl %eax
	jne .L2

The do-while is faster.  On the 386/25 with 387, the "for" loop
took 101.6 seconds, and the "do-while" loop took 99.8 seconds.
Now, most people will not notice the 2% speed up, but often the
loop overhead is more significant.

It is interesting that you want to add up the small stuff first.
That coding happens to be quicker on most machines, too.

Adding to Ole's list (user CPU seconds from /bin/time),

	Cray X/MP 216            0.29
	MIPS RC6280              3.3
	MIPS RC3230 Magnum       8.1
	Convex C 120             8.7
	Sun SPARCstation II     10.0
	DECstation 5000/200     10.5
	DECsystem 5400          13.1
	Sun SPARCstation 1+     33.0
	VAX 6330/VMS5.3 FPA     41.9
	Compaq 486/25           44.4   (new)
	VAX 8650/VMS5.3 FPA     55.3
	Compaq 386/33           71.7   (new)
	VAX 8600/VMS5.2 FPA     77.1
	Compaq 386/25           99.8   (new)
	Sun 3/60 (m68881)      105.6
	VAX 11/780             614.1   (new)

Does anyone remember running 20+ people on a VAX 11/780?  Here's the
"uptime" output on our VAX 11/780:

  5:16pm  up 34 days, 22 hrs,  21 users,  load average: 3.31, 3.10, 2.71

Reliability is another item that benchmarks generally miss out on.
In 34 days, even a VAX 11/780 has a significant number of cycles.
By this benchmark, it will give you 1,400 seconds of Cray X/MP 216
time for this sort of thing.  Ok, so 24 minutes of Cray time probably
costs less than the maintenance on a VAX 11/780 for a month (one would
hope you've depreciated the VAX to zero by now).

Cost effectiveness is something that hasn't been discussed much.

A 386/25 system will be obsolete in 3 years, and costs about $4,500
for that time for a system capable of running this sort of application.
That's $1,500 per year.  It performs 6.15 times better than the VAX 780.
This gives us a rough cost effectiveness of $240/VAX MIP.

	$1500 / (614.1 / 99.8)

Is anyone brave enough to figure this out for a Cray?  ... a PC/XT?

Stephen.
suitti@ima.isc.com
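As an aside (not part of Stephen's post), his cost-effectiveness arithmetic
can be written out in a few lines of C.  The variable names are invented
here; the dollar figures and benchmark times are simply the ones quoted
above, and the program just reproduces the $1500 / (614.1 / 99.8) division:

	#include <stdio.h>

	main() {
		double price    = 4500.0;	/* assumed system price, dollars          */
		double years    = 3.0;		/* assumed useful life                    */
		double vax_secs = 614.1;	/* benchmark time on the VAX 11/780        */
		double sys_secs = 99.8;		/* benchmark time on the machine of interest */

		double per_year = price / years;		/* $1,500 per year */
		double vax_mips = vax_secs / sys_secs;		/* about 6.15      */
		double cost_mip = per_year / vax_mips;		/* about $244      */

		printf("%.2f VAX MIPS, $%.0f per VAX MIP per year\n",
		       vax_mips, cost_mip);
	}

It prints roughly 6.15 VAX MIPS and $244 per VAX MIP per year, which
Stephen rounds to $240.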
oles@kelvin.uio.no (Ole Swang) (12/19/90)
In article <20667@dirtydog.ima.isc.com> suitti@ima.isc.com (Stephen Uitti) writes:

   < different translations into C deleted >

 >
 >I'm not sure why Mark used such an odd float format.  I prefer
 >the output "16.695311, 0" to " r = 1.6695311e+01, i = 0".
 >Does this match the FORTRAN output?

That is system dependent.  The statement WRITE(*,*) picks a default
format.  (On most systems, it _does_ match.)

   < stuff about do-while and efficiency deleted >

 >It is interesting that you want to add up the small stuff first.
 >That coding happens to be quicker on most machines, too.

Not on the Cray!  The time went up to 0.36 secs (that's 24 percent more)
when using negative increment.  Explanation, anyone?
The answer differed in the tenth digit.

Here is another update, containing Stephen's additions and a couple
that were mailed to me (thanks!)

	Cray X/MP 216            0.29
	MIPS RC6280              3.3
	HP9000/870               6.7   (new)
	MIPS RC3230 Magnum       8.1
	Convex C 120             8.7
	Sun SPARCstation II     10.0
	Sun 4/75                10.1   (new)
	DECstation 5000/200     10.5
	DECsystem 5400          13.1
	HP9000/845              13.7   (new)
	DECstation 3100         15.9   (new)
	Sun SPARCstation 1+     33.0
	VAX 6330/VMS5.3 FPA     41.9
	Compaq 486/25           44.4
	HP9000/825              53.9
	VAX 8650/VMS5.3 FPA     55.3
	Compaq 386/33           71.7
	VAX 8600/VMS5.2 FPA     77.1
	Compaq 386/25           99.8
	Sun 3/60 (m68881)      105.6
	VAX 11/780             614.1

   < comments on reliability deleted >

 >Cost effectiveness is something that hasn't been discussed much.
 >
 >A 386/25 system will be obsolete in 3 years, and costs about $4,500
 >for that time for a system capable of running this sort of application.
 >That's $1,500 per year.  It performs 6.15 times better than the VAX 780.
 >This gives us a rough cost effectiveness of $240/VAX MIP.
 >
 >	$1500 / (614.1 / 99.8)
 >
 >Is anyone brave enough to figure this out for a Cray?  ... a PC/XT?

Without having investigated, I'm quite sure that a high-end Cray, for
the time being, is the worst computer in the world when it comes to
cost-effectiveness.  On the other hand, you would not like to use a
workstation for a well vectorizing, iterative algorithm that reads a
few hundred Mb from disk for each iteration.  (This is the kind of
algorithm that eats most of the cycles in a typical quantum chemical
calculation, which is what I use computers for (when I'm not reading
news).)

 >Stephen.
 >suitti@ima.isc.com
--
-----------------------------------------------------------------------
Ole Swang      assistant lecturer, Dept. of Chemistry, U. of Oslo
-----------------------------------------------------------------------
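For anyone who wants to repeat Ole's negative-increment timing on another
machine: his change was presumably made in the Fortran original (a DO loop
running from 10000000 down to 1), but a C rendering of the same
reversed-order loop might look like this (an editor's sketch, not code
from the thread):

	#include <stdio.h>

	main() {
		double r;
		int i;

		r = 0.0;
		/* negative increment: start at the largest i, so the smallest
		   terms (1/10000000, 1/9999999, ...) are accumulated first */
		for (i = 10000000; i >= 1; i--)
			r += 1.0 / (double) i;
		printf(" r = %16.7le, i = %d\n", r, i);
	}

This is essentially the order Stephen's do-while already uses; why it
should be slower on the Cray is the open question, and Bill Davidsen
offers a guess about that below.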
davidsen@sixhub.UUCP (Wm E. Davidsen Jr) (12/24/90)
In article <1990Dec19.004003.20667@dirtydog.ima.isc.com> suitti@ima.isc.com (Stephen Uitti) writes:

| The do-while is faster.  On the 386/25 with 387, the "for" loop
| took 101.6 seconds, and the "do-while" loop took 99.8 seconds.

  Nope.  You have changed both the loop type and the algorithm here.
The corresponding for loop would be

	for (n = 10000000; --n; ) { ... }

and the reason your version runs faster is that it avoids the compare,
not because it's a do-while.

  As in most benchmarks, the effect of changing the program to make it
faster also means the numbers no longer compare to the old values in a
meaningful way.
-- 
bill davidsen - davidsen@sixhub.uucp (uunet!crdgw1!sixhub!davidsen)
    sysop *IX BBS and Public Access UNIX
    moderator of comp.binaries.ibm.pc and 80386 mailing list
"Stupidity, like virtue, is its own reward" -me
davidsen@sixhub.UUCP (Wm E. Davidsen Jr) (12/24/90)
In article <OLES.90Dec19111111@kelvin.uio.no> oles@kelvin.uio.no (Ole Swang) writes:

| Not on the Cray!  The time went up to 0.36 secs (that's 24 percent more)
| when using negative increment.  Explanation, anyone?
| The answer differed in the tenth digit.

  The result calculated by starting at the large values of I and
decrementing is "more accurate" than doing it the other way, because you
avoid adding a large number and a small number, where the value of the
small number might be lost due to rounding.  By adding all the small
(1/large) numbers first, the cumulative value will be kept more
correctly.  I use quotes on "more accurate," since the result is of no
interest and might as well be wrong.

  My guess about the longer time is that in the original (incrementing)
order, when the small values are added to the large running sum, the add
is aborted after normalization fails, and that version therefore runs
faster.  That's purely a guess, and I haven't done the formal numerical
analysis to back it up.
-- 
bill davidsen - davidsen@sixhub.uucp (uunet!crdgw1!sixhub!davidsen)
    sysop *IX BBS and Public Access UNIX
    moderator of comp.binaries.ibm.pc and 80386 mailing list
"Stupidity, like virtue, is its own reward" -me
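A quick way to see the rounding effect Bill describes is to run the sum
in both directions in one program and print the difference; this little
test is an editor's sketch, not part of the thread:

	#include <stdio.h>

	main() {
		double up = 0.0, down = 0.0;
		int i;

		/* forward: big terms first, so the late, tiny terms can be
		   lost against the already-large running sum */
		for (i = 1; i <= 10000000; i++)
			up += 1.0 / (double) i;

		/* reverse: tiny terms first, so they accumulate before the
		   big ones arrive and less is lost to rounding */
		for (i = 10000000; i >= 1; i--)
			down += 1.0 / (double) i;

		printf("forward  %.15f\n", up);
		printf("reverse  %.15f\n", down);
		printf("diff     %g\n", up - down);
	}

With IEEE double precision the two sums usually agree to a dozen or more
significant digits.  Ole saw a difference already in the tenth digit,
plausibly because the Cray's 64-bit format carries only a 48-bit mantissa
and because its vectorized loop also reorders the additions into partial
sums.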
suitti@ima.isc.com (Stephen Uitti) (12/28/90)
In article <2720@sixhub.UUCP> davidsen@sixhub.UUCP (bill davidsen) writes:
>In article <1990Dec19.004003.20667@dirtydog.ima.isc.com> suitti@ima.isc.com (Stephen Uitti) writes:
>
>| The do-while is faster.  On the 386/25 with 387, the "for" loop
>| took 101.6 seconds, and the "do-while" loop took 99.8 seconds.
>
>  Nope.  You have changed both the loop type and the algorithm here.
>The corresponding for loop would be
>
>	for (n = 10000000; --n; ) { ... }
>
>and the reason your version runs faster is that it avoids the compare,
>not because it's a do-while.
>
>  As in most benchmarks, the effect of changing the program to make it
>faster also means the numbers no longer compare to the old values in a
>meaningful way.

As a benchmark of machines, it is pretty bad.  It does lots of floating
point divides, with some loop overhead.  What the benchmark might be able
to tell us is something about tuning floating point divides or loop
overhead.  Floating point divides don't happen to interest me much, but
loop overhead does, since lots of programs have it.

I'm aware that my version wasn't the same as the original - it can even
produce a different answer.  It is interesting that your version is not
the same as my do-while.  For one, it doesn't compute 1/10000000.  The
condition is performed before the loop body, and your decrement is in the
condition.

	for (i = 10000000; i != 0; i--) {

is more accurate.  Oddly, this generates

	decl %eax
	jne .L5

for one compiler, and

	decl %edi
	testl %edi,%edi
	jne .L70

for another on the same system.  However, the do-while generates

	decl %edi
	jne .L69

on the available compilers.

At one time, I thought that Dennis put the do-while into the language
just for subtract-one-and-branch instructions.  It is more difficult for
a compiler to notice that a 'for' loop can be optimized to get rid of the
compare.  In a 'for' loop, the test really is at the top.  In a
'do-while', the test is at the bottom, where the optimization is.

Now, getting rid of one of the loop overhead instructions on the 386/25
(with 387) speeds up the program by 1.8% - hardly noticeable for any real
program.  The problem is that the instruction that is removed is one of
12 for the loop, and by comparison, a very quick one.

Loop unrolling should do better.  However, unrolling it 10 times slows
the program down to 104.6 seconds.  Stuffing 79 instructions into the
loop probably means the 386's cache isn't getting hit as much - more than
undoing any benefits.  In fact, unrolling the loop 5 times is also slower
than not unrolling.  I wonder if there are compilers out there that can
do loop unrolling that also know how big caches are.  Are they smart
enough to optimize this benchmark?  The loop index is not invariant - it
gets used in the loop.

I've been attempting to get this trivial benchmark to tell me something
about the tools I have.  It doesn't have anything really exciting to say
- just give or take a couple percent.  I don't attempt to make my
compilers faster - even when I have source - it isn't my job.  I try to
use the tools available.  For example, use multiplies over divides.  My
motto has been "Don't trust the compiler to convert 'x / 5.0' into
'x * 0.2'".  However, some of the older, dumber compilers produce faster
code than newer, smarter compilers.  The new ones seem to get caught up
in attempting to figure out what my pointers are doing so much that they
forget about the simpler optimizations that were designed into the
language.  I don't know whether to laugh or cry.

Stephen.
suitti@ima.isc.com
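For reference, the 10-way unrolling Stephen describes probably looked
something like the sketch below.  This is an editor's reconstruction, not
his actual test program; 10,000,000 is a multiple of 10, so no cleanup
loop is needed and every term is still added exactly once:

	#include <stdio.h>

	main() {
		register double r;
		register long int i;

		r = 0.0;
		/* ten divide-and-accumulate steps per trip; the last trip
		   covers i = 10 down through 1, then the loop exits with i == 0 */
		for (i = 10000000; i > 0; i -= 10) {
			r += 1.0 / (double)  i;
			r += 1.0 / (double) (i - 1);
			r += 1.0 / (double) (i - 2);
			r += 1.0 / (double) (i - 3);
			r += 1.0 / (double) (i - 4);
			r += 1.0 / (double) (i - 5);
			r += 1.0 / (double) (i - 6);
			r += 1.0 / (double) (i - 7);
			r += 1.0 / (double) (i - 8);
			r += 1.0 / (double) (i - 9);
		}
		printf("%f, %ld\n", r, i);
	}

On the 386/25 the ten divides still dominate the loop, so the saving in
loop overhead is small, and the much larger loop body is what Stephen
suspects spoiled the cache behavior.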