barry@pico.math.ucla.edu (Barry Merriman) (11/15/90)
I had a chance to see the NeXTStations running 2.0 today at a presentation
at UCLA. Very nice.
Being a scientist, I couldn't miss the chance for an experiment.
So, I devised the following little benchmark (gotta be
short enough to type in by hand while the NeXT rep is not looking :-)
to take with me. Essentially, it does 1,000,000 floating point multiplies,
for a quick estimate of megafloppage. You can try this at home, kids! :-)
#include <stdio.h>

main() {
    int i;
    float x;
    float m;

    x = 1.0000;
    m = 1.00001;
    for (i = 1; i < 1000000; i++) { x *= m; }
    printf("x = %f\n", x);
    exit(0);
}
On all machines, the program was compiled using ``cc -O'' and timed
using the Unix ``time'' command (do ``time a.out'').
The results are (drumroll...)
------------------------------------------------------------------------
Sun 3/50
34.840u 0.200s 0:35.40 98% 1+1k 4+1io 0pf+0w
=> 0.028 MFLOPS (Yow! Those were the bad old days!)
Sun 3/110
32.500u 0.060s 0:32.78 99% 0+0k 1+1io 0pf+0w
=> 0.031 MFLOPS
Sun 3/110 with floating point accelerator (compile with ``cc -O -ffpa'')
4.040u 0.080s 0:04.42 93% 1+1k 5+1io 0pf+0w
=> 0.25 MFLOPS
Alliant f/x 8 (scalar mode---faster in parallel, of course)
3.3u 2.1s 0:05 98% 8+4k 0+1io 0pf+0w
=> 0.3 MFLOPS
NeXT Cube (68030)
1.8u 0.0s 0:02 91% *** *** *** (I didn't write down the last 4 fields)
=> 0.56 MFLOPS
DecStation 3100:
0.5u 0.0s 0:00 100% 41+31k 0+0io 0pf+0w
=> 2.0 MFLOPS
SparcStation 1
0.430u 0.080s 0:00.65 78% 0+215k 2+0io 2pf+0w
=> 2.3 MFLOPS
NeXTStation
0.3u 0.0s 0:00 92% 0+0k 0+0io 0pf+0w
=> 3.3 MFLOPS
-------------------------------------------------------------------------
So, the clear winner is the NeXTStation!
I also tried some windowing and starting apps on the slab, and that was
fine, but hard to judge---the slab there had 20MB RAM, lots of apps open,
and a 200MB HD (yes, they have a 200MB HD option now, at a $4100
educational price).
Slabs were said to be shipping in the next couple weeks.
--
Barry Merriman
UCLA Dept. of Math
UCLA Inst. for Fusion and Plasma Research
barry@math.ucla.edu (Internet)
barry@pico.math.ucla.edu (Barry Merriman) (11/16/90)
Some folks have pointed out that, to be fair, I should use gcc as the
compiler rather than cc, since NeXT really uses gcc and it is smarter
than cc. So, here's a quick comparison, compiling my little million
multiply C program with ``gcc -O'' in all cases (= ``cc -O'' on NeXTs):

----------------------------

Sun 3/80 (thanks to Charles Purcell)

        0.39 MFLOPS

NeXT Cube (68030)

        0.56 MFLOPS

SparcStation 1 (my machine, and also thanks to Fred White)

        3.0--3.3 MFLOPS (vs. 2.3 MFLOPS using plain ``cc -O'')

NeXTStation

        3.3 MFLOPS

Also, we have a special guest appearance by the... RS/6000!
First, this editorial comment:

Ralph Seguin writes:

>The RS/6000s SCREAM. I have been using them for quite some
>time now. They give SPECmarks which kill every other machine at that price
>level.

okay, but...

IBM RS/6000 (scalar) (thanks to Charles Purcell---I don't know if this
was gcc, though)

        3.3 MFLOPS

---------------------------

So, for simple scalar floating point, the NeXTStation seems to be the
price/performance leader (and even the performance leader!)

--
Barry Merriman
UCLA Dept. of Math
UCLA Inst. for Fusion and Plasma Research
barry@math.ucla.edu (Internet)
barry@pico.math.ucla.edu (Barry Merriman) (11/16/90)
Add this to the list (thanks Eric Anderson):

DECStation 3100 (using gcc -O)

        4.0 MFLOPS (vs. about 1.3 MFLOPS using just ``cc -O'')

So, the DECStation takes the lead (in performance, but not
price/performance). But---what can you say about a company whose
default compiler is worse than a freely available one? :-)

--
Barry Merriman
UCLA Dept. of Math
UCLA Inst. for Fusion and Plasma Research
barry@math.ucla.edu (Internet)
philip@pescadero.Stanford.EDU (Philip Machanick) (11/16/90)
In article <738@kaos.MATH.UCLA.EDU>, barry@pico.math.ucla.edu (Barry
Merriman) writes:
|> I had a chance to see the NeXTStations running 2.0 today at a presentation
|> at UCLA. Very nice.
|>
|> Being a scientist, I couldn't miss the chance for an experiment.
|> So, I devised the following little benchmark (gotta be
|> short enough to type in by hand while the NeXT rep is not looking :-)
|> to take with me. Essentially, it does 1,000,000 floating point multiplies,
|> for a quick estimate of megafloppage. You can try this at home, kids! :-)

[benchmark plus other times deleted]

|> DecStation 3100:
|>
|>     0.5u 0.0s 0:00 100% 41+31k 0+0io 0pf+0w
|>
|>     => 2.0 MFLOPS
|>
|> SparcStation 1
|>
|>     0.430u 0.080s 0:00.65 78% 0+215k 2+0io 2pf+0w
|>
|>     => 2.3 MFLOPS
|>
|> NeXTStation
|>
|>     0.3u 0.0s 0:00 92% 0+0k 0+0io 0pf+0w
|>
|>     => 3.3 MFLOPS
|>
|> So, the clear winner is the NeXTStation!

Very interesting. Of course, toy benchmarks are not an indication of
real system performance (what happens when you fill the cache, how fast
is paging, etc. ...?). I'm using a DECstation 3100 for my programming at
the moment. Has anyone had the opportunity to benchmark compiles of C++
on the two? If not, can anyone lend me a NeXT so I can do it?

--
Philip Machanick
philip@pescadero.stanford.edu
barry@pico.math.ucla.edu (Barry Merriman) (11/16/90)
In the pursuit of truth, this just in: running my trivial benchmark
program (multiply 1.00001 upon itself 10^6 times, in C) using csh
instead of ksh on the RS/6000:

IBM RS/6000 (csh, not ksh!)

        3.7 MFLOPS (10% faster than under ksh)

This gives IBM a 10% lead over the 3.3 MFLOP NeXTStation. Jeez, this is
getting complicated :-).

--
Barry Merriman
UCLA Dept. of Math
UCLA Inst. for Fusion and Plasma Research
barry@math.ucla.edu (Internet)
garnett@cs.utexas.edu (John William Garnett) (11/16/90)
In article <742@kaos.MATH.UCLA.EDU> barry@pico.math.ucla.edu (Barry
Merriman) writes:
>So, here's a quick comparison, compiling my little million multiply C
>program with ``gcc -O'' in all cases (= ``cc -O'' on NeXTs)

First of all, so that we all know what's being talked about---here again
is the million (actually 999,999) multiply benchmark source:

#include <stdio.h>

main() {
    int i;
    float x;
    float m;

    x = 1.0000;
    m = 1.00001;
    for (i = 1; i < 1000000; i++) { x *= m; }
    printf("x = %f\n", x);
    exit(0);
}

>Also, we have a special guest appearance by the...RS/6000!
>First, this editorial comment:
>Ralph Seguin writes:
>
>>The RS/6000s SCREAM. I have been using them for quite some
>>time now. They give SPECmarks which kill every other machine at that price
>>level.
>
>okay, but...
>
>IBM RS/6000 (scalar) (Thanks to Charles Purcell---I don't know if this
>was gcc, though)
>
>       3.3 MFLOPS

For the program in question, this number of 3.3 appears to be in the
right ballpark. I ran the program and got 3.57. HOWEVER, if you make one
small change to the program, it performs at 6.25 (toy) MFLOPS. This
change is merely to change all occurrences of "float" to "double". If
the loop limit is increased from 1,000,000 to 10,000,000, this number
jumps to 6.49 (startup costs are amortized). Obviously IBM optimized the
machine for better performance on doubles. Note that all of these
numbers were generated using IBM's C compiler with -O on the RS/6000
Model 320, which is rated at approx 7.5 "real" MFLOPS.

>So, for simple scalar floating point, the NeXTStation seems to be the
>price/performance leader (and even the performance leader!)

These are big claims to make based on a 10 line benchmark :-).
Followups via email, comp.unix.aix, or comp.benchmarks.

--
John Garnett                          garnett@cs.utexas.edu
University of Texas at Austin
Department of Computer Science        Austin, Texas
gilgalad@caen.engin.umich.edu (Ralph Seguin) (11/16/90)
In article <742@kaos.MATH.UCLA.EDU> barry@pico.math.ucla.edu (Barry
Merriman) writes:
>Some folks have pointed out that, to be fair, I should use gcc as
>the compiler rather than cc, since next really uses gcc and this is smarter
>than cc.

There is that. But to be really fair, you should totally disregard this
type of benchmark anyway. Using a single type of instruction (floating
point multiply in this case) is NOT a good indicator of overall system
performance. Has anybody got SPECmarks for a NeXT? I'd be interested in
seeing them. Also, it is a WELL KNOWN FACT that MIPS should stand for
Meaningless Indicator of Processor Speed.

>IBM RS/6000 (scalar) (Thanks to Charles Purcell---I don't know if this
>was gcc, though)
>
>       3.3 MFLOPS

As I've said, this is not a good benchmark. BTW---how did you compile
it? When I use gcc -O I get somewhere around 7.5 megaflops on our 320s.
I'll try it on a 540 in a bit. Unlike many other processors, I find that
the POWERstations live up to the performance claims that IBM makes.

>So, for simple scalar floating point, the NeXTStation seems to be the
>price/performance leader (and even the performance leader!)

Could be the price/performance leader. Dunno. A NeXTdimension board
would be a better price/performance ratio in my mind. If you run a
thread or two on the i860, you can get some amazing performance. I know
that it is bad to quote peak performance (but that is what people are
doing in this article, isn't it 8-), but at a peak of 2 floating point
instructions per cycle, nothing even comes close to it.

Question: Will NeXT OS 2.0 allow you to create threads that run on the
i860?

>Barry Merriman
>UCLA Dept. of Math
>UCLA Inst. for Fusion and Plasma Research
>barry@math.ucla.edu (Internet)

Thanks, Ralph

Ralph Seguin                     gilgalad@dip.eecs.umich.edu
536 South Forest Apt. #915       gilgalad@caen.engin.umich.edu
Ann Arbor, MI 48104              (313) 662-4805
lerman@stpstn.UUCP (Ken Lerman) (11/16/90)
In article <742@kaos.MATH.UCLA.EDU> barry@pico.math.ucla.edu (Barry Merriman) writes:
->Some folks have pointed out that, to be fair, I should use gcc as
->the compiler rather than cc, since next really uses gcc and this is smarter
->than cc.
->
->So, for simple scalar floating point, the NeXTStation seems to be the
->price/performance leader (and even the performance leader!)
->
->--
->Barry Merriman
->UCLA Dept. of Math
->UCLA Inst. for Fusion and Plasma Research
->barry@math.ucla.edu (Internet)
I would have more confidence in the "benchmark" if the total execution
time were closer to ten seconds than to one second. How much of the time
was taken up by loading and initializing on each of the platforms?
Ken
tgingric@magnus.ircc.ohio-state.edu (Tyler S Gingrich) (11/16/90)
In article <1990Nov16.071217.8676@engin.umich.edu>
gilgalad@caen.engin.umich.edu (Ralph Seguin) writes:
>
>Question: Will NeXT OS 2.0 allow you to create threads that run on the
>i860?
>

This question was asked at the Nov OSU-NeXT Users Group meeting, and the
NeXT technical rep (John Karibiac---sorry John, I STILL can't remember
how to spell your last name!!) said this is NOT possible at the current
time under version 2.0 of the software. Whether this means it will never
be possible, or that NeXT won't support it, or that NeXT is thinking
about it---your guess is as good as mine.....

Tyler
bostrov@storm.UUCP (Vareck Bostrom) (11/17/90)
In article <738@kaos.MATH.UCLA.EDU> barry@pico.math.ucla.edu (Barry
Merriman) writes:
>DecStation 3100:
>
>    0.5u 0.0s 0:00 100% 41+31k 0+0io 0pf+0w
>
>    => 2.0 MFLOPS
>
>SparcStation 1
>
>    0.430u 0.080s 0:00.65 78% 0+215k 2+0io 2pf+0w
>
>    => 2.3 MFLOPS

Something seems wrong here---we're talking about the SPARCstation 1,
20 MHz SPARC, Weitek 3170 FPU? That seems a bit fast for one, and I am
truly surprised that it beat the DECstation 3100. Doing a lot of
raytracing, which heavily uses floating point, the DECstation 3100
ALWAYS beat the SPARCstation 1's and 1+'s that I ran it on. An Ardent
Titan-3 I ran it on has so far given the best performance, topping 20
MFLOPS easily (judging by the speed increase over an HP 9000/845, about
4.0 MFLOPS).

I am a little surprised at the NeXT's results, which also seem a tad
high, but only because NeXT says 2.8 MFLOPS and my test (DP Linpack) ran
at about 3.0 MFLOPS. 3.0 MFLOPS is certainly not slow for a workstation
or home machine. The NeXT-040 beat a SPARCserver 4/330, all the Sun-3's,
the 4/60, 4/65, DECstation-2100, and DECstation-3100 (close on this one,
but the NeXT won). It did not beat a DECstation-5000 or an HP 9000/845,
nor an Ardent with vector FPU or an 8-processor Cray X-MP (no surprise
here, eh?).

I am wondering why there isn't an i860 FPU on the NeXT motherboard. I
mean, after all, we'd be looking at 60-80 MFLOPS from an FPU like that.
I imagine that real world performance isn't better than 30-50 MFLOPS,
but that's still much better than the MC68040's 3.0 or so. This is
interesting: the NeXT video option (NeXTdimension) offers far superior
fp performance to the motherboard itself. Oh well.

- Vareck
philip@pescadero.Stanford.EDU (Philip Machanick) (11/17/90)
In article <743@kaos.MATH.UCLA.EDU>, barry@pico.math.ucla.edu (Barry
Merriman) writes:
|> Add this to the list (thanks Eric Anderson):
|>
|> DECStation 3100 (using gcc -O)
|>
|>     4.0 MFLOPS (vs about 1.3 MFLOPS using just ``cc -O'')
|>
|> So, the DECStation takes the lead (in performance, but not price/performance).
|> But---what can you say about a company whose default compiler
|> is worse than a freely available one? :-)

Now, wait a minute. In your original article, you reported the
DECstation as

> DecStation 3100:
> 0.5u 0.0s 0:00 100% 41+31k 0+0io 0pf+0w
> => 2.0 MFLOPS

Still, a factor of 2 difference looks suspicious. I decided to check
this out myself. My results looked like this on the DECstation 3100:

cc -O:
0.6u 0.0s 0:00 82% 30+30k 3+0io 0pf+0w
gcc -O:
0.2u 0.0s 0:00 86% 30+31k 0+0io 0pf+0w

Even worse---now gcc is 3 times faster. When I get bizarre results, I
check them---so I looked at the output of the program. The result of

x = 1.0000;
m = 1.00001;
for (i = 1; i < 1000000; i++) { x *= m; }

is 1.00001 to the power 1000000, which my calculators claim is
22025.365.

Output from cc:
    x = 22323.087891   (1% different from calculators)
Output from gcc:
    x = 0.000000       (seriously bogus)

So---how about another round of runs, this time checking the output?
Not that I'd advocate wasting too much time on this---toy benchmarks
don't tell you much anyway.

--
Philip Machanick
philip@pescadero.stanford.edu
barry@pico.math.ucla.edu (Barry Merriman) (11/17/90)
In article <1990Nov16.192938.17923@Neon.Stanford.EDU>
philip@pescadero.stanford.edu writes:
>Still, a factor of 2 difference looks suspicious. I decided to check this
>out myself. My results looked like this on the DECstation 3100:
>cc -O:
>0.6u 0.0s 0:00 82% 30+30k 3+0io 0pf+0w
>gcc -O:
>0.2u 0.0s 0:00 86% 30+31k 0+0io 0pf+0w
>Even worse - now gcc is 3 times faster. When I get bizarre results, I
>check them - so I looked at the output of the program. The result
>is 1.00001 to the power 1000000, which my calculators claim is 22025.365.

22025.365 is the true result (according to a 16 digit calculation in
Mathematica)---the other results are off due to accumulated errors in
the single precision arithmetic, I guess.

>Output from cc:
> x = 22323.087891 (1% different from calculators)
>Output from gcc:
> x = 0.000000 (seriously bogus)
>So - how about another round of runs, this time checking the output?

gcc gives the right output (22323.087891) on a SPARC. I don't have gcc
on our DEC, but it would appear it has a serious problem! My program
always gave the output x value, and while I didn't check it carefully, I
think it was the same on all machines I tried---definitely never 0!

>Not that I'd advocate wasting too much time on this - toy benchmarks don't
>tell you much anyway.

No, but on the other hand, at least we do know exactly what this
benchmark does, and it is highly portable, even over the sneakernet :-)

--
Barry Merriman
UCLA Dept. of Math
UCLA Inst. for Fusion and Plasma Research
barry@math.ucla.edu (Internet)
pphillip@cs.ubc.ca (Peter Phillips) (11/18/90)
In article <738@kaos.MATH.UCLA.EDU> barry@pico.math.ucla.edu (Barry
Merriman) writes:

[ saw a NeXTStation, tried a simple benchmark ]

>The results are (drumroll...)

[ other results omitted ]

>NeXTStation
>
>    0.3u 0.0s 0:00 92% 0+0k 0+0io 0pf+0w
     ^^^^ 1 digit
>
>    => 3.3 MFLOPS

Hmmm, 1/0.3 = 3.3333, but does "time" round up or down or neither? This
time could be as good as 5.0 MFLOPS or as poor as 2.5 MFLOPS (just about
as bad as a SPARCstation 1 at 2.3). Did "time" always give the same
measured speed, or did you only run the "benchmark" once?

Given similar qualities of measure, I was able to experimentally verify
that a plastic grocery bag gave me a much better overall
levitation/price ratio than Kurt Vonnegut's recent book, "Hocus Pocus".
Oddly enough, the plastic bag was actually slower to hit the ground but
cost less than the hardcover book. This might be due to the extra
overhead involved with the book; the dust jacket seemed reasonably
competitive by itself.

Seriously, benchmarking a system is a tricky business. Perhaps a real
program would have been more appropriate. (Remember to bring that OD or
MS-DOS disk with your favourite program on it NeXT time you're invited
to see a NeXTStation.) Or maybe NeXT will actually release a full SPEC
performance sheet for their systems?

On the other hand, does it really matter how fast a NeXTStation runs?
I'm sure it is faster than the original cube, and if you want to do
things which are specific to the NeXT machine you really don't have any
other choices, do you? Personally, I think that as long as NeXT sticks
to the 680?0 line, the cube's descendants will always be slower than
their competitors' models.

--
Peter Phillips                  | Just because some of us can read and write
<pphillip@cs.ubc.ca>            | and do a little math, that doesn't mean we
{alberta,uunet}!ubc-cs!pphillip | deserve to conquer the Universe. - E.D.Hartke
ea08+@andrew.cmu.edu (Eric A. Anderson) (11/19/90)
There is actually a more fundamental problem with this benchmark: csh
does not have the accuracy to measure such short times. When people
report times of .2 sec using csh on a DECstation, they would find times
of about .245 seconds using tcsh. This is a serious problem, as the
version run on the SPARCstation 2 reporting a time of .1 sec could
actually be up to about .145 or more if csh just truncates the times.

But the version of the program run under gcc on my machine gives the
correct output.

One thing I can't figure out is this: the float version runs faster than
the double version, but when I look at the assembler version of the code
I see this in the float version:

    cvt float to double
    mult double
    cvt double to float

and in the double version just:

    mult double

It doesn't seem to make sense that the float version would be faster.
And so that you know, the double version of the program claims:

    x = 22025.144256

-Eric
*********************************************************
"My life is full of additional complications spinning
 around until it makes my head snap off." -Unc. Known.
"You are very smart, now shut up." -In "The Princess Bride"
*********************************************************
na0i+@andrew.cmu.edu (Nenad Antonic) (11/19/90)
In article <1990Nov16.192938.17923@Neon.Stanford.EDU>
philip@pescadero.stanford.edu writes:
>Output from cc:
> x = 22323.087891 (1% different from calculators)
>Output from gcc:
> x = 0.000000 (seriously bogus)
>So - how about another round of runs, this time checking the output?

On my DECstation 3100 (andrew system at CMU) I got the following:

% gcc proba.c
% time ./a.out
x = 22323.087891
1.2u 0.2s 0:03 47% 30+30k 3+0io 2pf+0w
% gcc -O proba.c
% time ./a.out
x = 22323.087891
0.2u 0.0s 0:00 58% 29+29k 3+0io 2pf+0w
% cc -O proba.c
% time ./a.out
x = 22323.087891
0.6u 0.0s 0:00 74% 30+30k 3+0io 2pf+0w

So, gcc gives correct results here. The configuration is: 16Mb memory,
RZ23 hard drive, server for the rest (including .). At run time I had
X-windows on, and a bunch of other things (45% of virtual memory being
used). That did not slow down the computations.

Nenad Antoni\'c
chouw@galaxy.cps.msu.edu (Wen Hwa Chou) (11/20/90)
In article <obFjTbS00Vp50k61pn@andrew.cmu.edu> ea08+@andrew.cmu.edu
(Eric A. Anderson) writes:
>Float Version :
>cvt float to double
>mult double
>cvt double to float.
>
>And in the Double version just
>mult double.
>
>It doesn't seem to make sense that the float version would be faster.
>

The multiplication of double precision is not constant time. It varies
depending on the data you throw in.