[comp.sys.next] NeXTStation Benchmark

barry@pico.math.ucla.edu (Barry Merriman) (11/15/90)

I had a chance to see the NeXTStations running 2.0 today at a presentation
at UCLA. Very nice.

Being a scientist, I couldn't miss the chance for an experiment. 
So, I devised the following little benchmark (gotta be
short enough to type in by hand while the NeXT rep is not looking :-)
to take with me. Essentially, it does 1,000,000 floating point multiplies,
for a quick estimate of megafloppage. You can try this at home, kids! :-)

#include <stdio.h>
#include <stdlib.h>   /* for exit() */
main() {
    int i;
    float x;
    float m;

    x = 1.0000;
    m = 1.00001;
    for (i = 1; i < 1000000; i++) { x *= m; }
    printf("x = %f \n", x);
    exit(0);
}

On all machines, it was compiled using ``cc -O'', and timed using
the unix ``time'' command (do ``time a.out'').

The results are (drumroll...)

------------------------------------------------------------------------

Sun 3/50

        34.840u 0.200s 0:35.40 98% 1+1k 4+1io 0pf+0w
   
        => 0.028 MFLOPS   (Yow! Those were the bad old days!)

Sun 3/110

        32.500u 0.060s 0:32.78 99% 0+0k 1+1io 0pf+0w

        => 0.031 MFLOPS

Sun 3/110 with floating point accelerator ( compile with "cc -O -ffpa")

        4.040u 0.080s 0:04.42 93% 1+1k 5+1io 0pf+0w

        => 0.25 MFLOPS

Alliant f/x 8 (scalar mode---faster in parallel, of course)

        3.3u 2.1s 0:05 98% 8+4k 0+1io 0pf+0w

        => 0.3 MFLOPS

NeXT Cube (68030)

        1.8u 0.0s 0:02 91% *** *** *** (I didn't write down the last 4 fields)

        => 0.56 MFLOPS

DecStation 3100:

        0.5u 0.0s 0:00 100% 41+31k 0+0io 0pf+0w 

        => 2.0 MFLOPS

SparcStation 1

        0.430u 0.080s 0:00.65 78% 0+215k 2+0io 2pf+0w

        => 2.3 MFLOPS

NeXTStation

        0.3u 0.0s 0:00 92% 0+0k 0+0io 0pf+0w

        => 3.3 MFLOPS


-------------------------------------------------------------------------

So, the clear winner is the NeXTStation!

I also tried some windowing and starting apps on the slab, and that was
fine, but hard to judge---the slab there had 20MB RAM, lots of apps open,
a 200MB HD (yes, they have a 200MB HD option now, for $4100 educational price.)

Slabs were said to be shipping in the next couple weeks.


--
Barry Merriman
UCLA Dept. of Math
UCLA Inst. for Fusion and Plasma Research
barry@math.ucla.edu (Internet)

barry@pico.math.ucla.edu (Barry Merriman) (11/16/90)

Some folks have pointed out that, to be fair, I should use gcc as
the compiler rather than cc, since NeXT's cc really is gcc, which is
smarter than a typical vendor cc.

So, here's a quick comparison, compiling my little million multiply C
program with ``gcc -O'' in all cases (= ``cc -O'' on NeXTs)

----------------------------

Sun 3/80  (thanks to Charles Purcell)

  0.39 MFLOPS

NeXT Cube (68030)

   0.56 MFLOPS

SparcStation 1   (my machine, and also thanks to Fred White)

   3.0--3.3 MFLOPS    (vs. 2.3 MFLOPS using plain ``cc -O'')

NeXTStation

   3.3 MFLOPS

Also, we have a special guest appearance by the...RS/6000!
First, this editorial comment:
Ralph Seguin writes:

>The RS/6000s SCREAM.  I have been using them for quite some
>time now.  They give SPECmarks which kill every other machine at that price
>level.

okay, but...

IBM RS/6000 (scalar) (Thanks to Charles Purcell---I don't know if this was gcc, though)

  3.3 MFLOPS

---------------------------



So, for simple scalar floating point, the NeXTStation seems to be the
price/performance leader (and even the performance leader!)

--
Barry Merriman
UCLA Dept. of Math
UCLA Inst. for Fusion and Plasma Research
barry@math.ucla.edu (Internet)

barry@pico.math.ucla.edu (Barry Merriman) (11/16/90)

Add this to the list (thanks Eric Anderson):

DECStation 3100 (using gcc -O)

  4.0 MFLOPS (vs about 1.3 MFLOPS using just ``cc -O'')


So, the DECStation takes the lead (in performance, but not price/performance).

But---what can you say about a company whose default compiler
is worse than a freely available one? :-)

--
Barry Merriman
UCLA Dept. of Math
UCLA Inst. for Fusion and Plasma Research
barry@math.ucla.edu (Internet)

philip@pescadero.Stanford.EDU (Philip Machanick) (11/16/90)

In article <738@kaos.MATH.UCLA.EDU>, barry@pico.math.ucla.edu (Barry Merriman) writes:
|> I had a chance to see the NeXTStations running 2.0 today at a presentation
|> at UCLA. Very nice.
|> 
|> Being a scientist, I couldn't miss the chance for an experiment. 
|> So, I devised the following little benchmark (gotta be
|> short enough to type in by hand while the NeXT  rep is not looking :-)
|> to take with me. Essentially, it does 1,000,000 floating point multiplies,
|> for a quick estimate of megafloppage. You can try this at home, kids! :-)
[benchmark plus other times deleted]
|> DecStation 3100:
|>
|>         0.5u 0.0s 0:00 100% 41+31k 0+0io 0pf+0w 
|> 
|>         => 2.0 MFLOPS
|> 
|> SparcStation 1
|> 
|>         0.430u 0.080s 0:00.65 78% 0+215k 2+0io 2pf+0w
|> 
|>         => 2.3 MFLOPS
|> 
|> NeXTStation
|> 
|>         0.3u 0.0s 0:00 92% 0+0k 0+0io 0pf+0w
|> 
|>         => 3.3 MFLOPS
|> 
|> So, the clear winner is the NeXTStation!
Very interesting. Of course, toy benchmarks are not an indication of real
system performance (what happens when you fill the cache, how fast is
paging, etc. ...?). I'm using a DECstation 3100 for my programming at the
moment. Has anyone had the opportunity to benchmark compiles of C++ on the
two? If not, can anyone lend me a NeXT so I can do it?
-- 
Philip Machanick
philip@pescadero.stanford.edu

barry@pico.math.ucla.edu (Barry Merriman) (11/16/90)

In the pursuit of truth, this just in: 
running my trivial benchmark program (multiply 1.00001 upon itself
10^6 times, in C) using csh instead of ksh on the RS/6000:

IBM RS/6000 (csh, not ksh!)

   3.7 MFLOPS  (10% faster than under ksh)


This gives IBM a 10% lead over the 3.3 MFLOPS NeXTStation. Jeez, this
is getting complicated :-).

--
Barry Merriman
UCLA Dept. of Math
UCLA Inst. for Fusion and Plasma Research
barry@math.ucla.edu (Internet)

garnett@cs.utexas.edu (John William Garnett) (11/16/90)

In article <742@kaos.MATH.UCLA.EDU> barry@pico.math.ucla.edu (Barry Merriman) writes:
>So, here's a quick comparison, compiling my little million multiply C
>program with ``gcc -O'' in all cases (= ``cc -O'' on NeXTs)

First of all, so that we all know what's being talked about,
here again is the million (actually 999,999) multiply benchmark source.


#include <stdio.h>
#include <stdlib.h>   /* for exit() */

main()
{
	int i;
	float x;
	float m;
	x=1.0000; 
	m=1.00001;

	for (i=1;i<1000000;i++) {x *= m;}

	printf("x = %f \n",x);
	exit(0);
}

>
>Also, we have a special guest appearance by the...RS/6000!
>First, this editorial comment:
>Ralph Seguin writes:
>
>>The RS/6000s SCREAM.  I have been using them for quite some
>>time now.  They give SPECmarks which kill every other machine at that price
>>level.
>
>okay, but...
>
>IBM RS/6000 (scalar) (Thanks to Charles Purcell---I don't know if this was gcc, though)
>
>  3.3 MFLOPS

For the program in question, this number of 3.3 appears to be in the right ballpark.
I ran the program and got 3.57.  HOWEVER, if you make one small change to the
program, it performs at 6.25 (toy) MFLOPS.  The change is merely to replace all
occurrences of "float" with "double".  If the loop limit is increased
from 1,000,000 to 10,000,000, this number jumps to 6.49 (startup costs are
amortized).  Obviously IBM optimized the machine for better performance on
doubles.

Note that all of these numbers were generated using IBM's C Compiler with -O
on the RS/6000 Model 320 which is rated at approx 7.5 "real" MFLOPS.

>
>So, for simple scalar floating point, the NeXTStation seems to be the
>price/performance leader (and even the performance leader!)
>

These are big claims to make based on a 10 line benchmark :-).

Followups via email, comp.unix.aix, or comp.benchmarks.
-- 
John Garnett
                              University of Texas at Austin
garnett@cs.utexas.edu         Department of Computer Science
                              Austin, Texas

gilgalad@caen.engin.umich.edu (Ralph Seguin) (11/16/90)

In article <742@kaos.MATH.UCLA.EDU> barry@pico.math.ucla.edu (Barry Merriman) writes:
>Some folks have pointed out that, to be fair, I should use gcc as
>the compiler rather than cc, since next really uses gcc and this is smarter
>than cc.

There is that.  But to be really fair, you should disregard this type of
benchmark entirely.  Using a single type of instruction (a floating point
multiply, in this case) is NOT a good indicator of the overall performance
of a system.  Has anybody got SPECmarks for a NeXT?  I'd be interested in
seeing them.  Also, it is a WELL KNOWN FACT that MIPS should stand for
Meaningless Indicator of Processor Speed.

>IBM RS/6000 (scalar) (Thanks to Charles Purcell---I don't know if this was gcc, though)
>
>  3.3 MFLOPS

As I've said, this is not a good benchmark.  BTW- How did you compile it?
When I use gcc -O I get somewhere around 7.5 megaflops on our 320s.  I'll
try it on a 540 in a bit.  Unlike many other processors, I find that the
POWERstations live up to the performance claims that IBM makes.

>So, for simple scalar floating point, the NeXTStation seems to be the
>price/performance leader (and even the performance leader!)

Could be the price/performance leader.  Dunno.  A NeXTDimension board
would be a better price/performance buy, in my mind.  If you run a
thread or two on the i860, you can get some amazing performance.  I
know that it is bad to quote peak performance (but that is what people
are doing in this thread, isn't it 8-), but at a peak of 2 floating
point instructions per cycle, nothing even comes close to it.

Question:  Will NeXT  OS 2.0 allow you to create threads that run on the
i860?

>Barry Merriman
>UCLA Dept. of Math
>UCLA Inst. for Fusion and Plasma Research
>barry@math.ucla.edu (Internet)

			Thanks, Ralph


Ralph Seguin			gilgalad@dip.eecs.umich.edu
536 South Forest Apt. #915	gilgalad@caen.engin.umich.edu
Ann Arbor, MI 48104		(313) 662-4805

lerman@stpstn.UUCP (Ken Lerman) (11/16/90)

In article <742@kaos.MATH.UCLA.EDU> barry@pico.math.ucla.edu (Barry Merriman) writes:
->Some folks have pointed out that, to be fair, I should use gcc as
->the compiler rather than cc, since next really uses gcc and this is smarter
->than cc.
->
->So, for simple scalar floating point, the NeXTStation seems to be the
->price/performance leader (and even the performance leader!)
->
->--
->Barry Merriman
->UCLA Dept. of Math
->UCLA Inst. for Fusion and Plasma Research
->barry@math.ucla.edu (Internet)


I would have more confidence in the "benchmark" if the total execution
time were closer to ten seconds than to one second.  How much of the
time was taken by loading and initializing on each of the platforms?

Ken

tgingric@magnus.ircc.ohio-state.edu (Tyler S Gingrich) (11/16/90)

In article <1990Nov16.071217.8676@engin.umich.edu> gilgalad@caen.engin.umich.edu (Ralph Seguin) writes:
>
>Question:  Will NeXT  OS 2.0 allow you to create threads that run on the
>i860?
>
This question was asked at the Nov OSU-NeXT Users Group meeting, and the NeXT
technical rep (John Karibiac  -- sorry John, I STILL can't remember how to
spell your last name!!) said this is NOT possible at the current time under
version 2.0 of the software.  Whether this means it will never be possible,
that NeXT won't support it, or that NeXT is thinking about it -- your guess is
as good as mine.....

Tyler

bostrov@storm.UUCP (Vareck Bostrom) (11/17/90)

In article <738@kaos.MATH.UCLA.EDU> barry@pico.math.ucla.edu (Barry Merriman) writes:
>DecStation 3100:
>
>        0.5u 0.0s 0:00 100% 41+31k 0+0io 0pf+0w 
>
>        => 2.0 MFLOPS
>
>SparcStation 1
>
>        0.430u 0.080s 0:00.65 78% 0+215k 2+0io 2pf+0w
>
>        => 2.3 MFLOPS

Something seems wrong here -- we're talking about the SPARCstation 1,
20 MHz SPARC, Weitek 3170 FPU? That seems a bit fast for one, and I am
truly surprised that it beat the DECstation 3100. In a lot of raytracing,
which uses floating point heavily, the DECstation 3100 ALWAYS beat the
SPARCstation 1's and 1+'s that I ran it on. An Ardent Titan-3 I ran it
on has the best performance so far, topping 20 MFLOPS easily (judging by
the speed increase over an HP 9000/845, at about 4.0 MFLOPS).

I am a little surprised at the NeXT's results, which also seem a tad high,
but only because NeXT says 2.8 MFLOPS and my test (DP Linpack) ran at about
3.0 MFLOPS. 3.0 MFLOPS is certainly not slow for a workstation or home
machine.

The NeXT-040 beat a SPARCserver 4/330, all the Sun-3's, the 4/60 and 4/65,
the DECstation 2100, and the DECstation 3100 (close on this one, but the
NeXT won).  It did not beat a DECstation 5000, an HP 9000/845, an Ardent
with vector FPU, or an 8-processor Cray X-MP (no surprise here, eh?)

I am wondering why there isn't an i860 FPU on the NeXT motherboard.  After
all, we'd be looking at 60-80 MFLOPS peak from an FPU like that; I imagine
real world performance isn't better than 30-50 MFLOPS, but that's still
much better than the MC68040's 3.0 or so.  It is interesting that the NeXT
video option (NeXTDimension) offers far better floating point performance
than the motherboard itself.

oh well.

- Vareck

philip@pescadero.Stanford.EDU (Philip Machanick) (11/17/90)

In article <743@kaos.MATH.UCLA.EDU>, barry@pico.math.ucla.edu (Barry Merriman) writes:
|> Add this to the list (thanks Eric Anderson):
|> 
|> DECStation 3100 (using gcc -O)
|> 
|>   4.0 MFLOPS (vs about 1.3 MFLOPS using just ``cc -O'')
|> So, the DECStation takes the lead (in performance, but not price/performance).
|> But---what can you say about a company whose default compiler
|> is worse than a freely available one? :-)

Now, wait a minute. In your original article, you reported the DECstation as
> DecStation 3100:
>         0.5u 0.0s 0:00 100% 41+31k 0+0io 0pf+0w 
>         => 2.0 MFLOPS
Still, a factor of 2 difference looks suspicious. I decided to check this
out myself. My results looked like this on the DECstation 3100:
cc -O:
0.6u 0.0s 0:00 82% 30+30k 3+0io 0pf+0w
gcc -O:
0.2u 0.0s 0:00 86% 30+31k 0+0io 0pf+0w
Even worse - now gcc is 3 times faster. When I get bizarre results, I
check them - so I looked at the output of the program. The result of
  x=1.0000; 
  m=1.00001;
  for (i=1;i<1000000;i++) {x *= m;}
is 1.00001 to the power 1000000, which my calculators claim is 22025.365.
Output from cc:
  x = 22323.087891 (1% different from calculators)
Output from gcc:
  x = 0.000000 (seriously bogus)
So - how about another round of runs, this time checking the output?
Not that I'd advocate wasting too much time on this - toy benchmarks don't
tell you much anyway.
-- 
Philip Machanick
philip@pescadero.stanford.edu

barry@pico.math.ucla.edu (Barry Merriman) (11/17/90)

In article <1990Nov16.192938.17923@Neon.Stanford.EDU> philip@pescadero.stanford.edu writes:

>Still, a factor of 2 difference looks suspicious. I decided to check this
>out myself. My results looked like this on the DECstation 3100:
>cc -O:
>0.6u 0.0s 0:00 82% 30+30k 3+0io 0pf+0w
>gcc -O:
>0.2u 0.0s 0:00 86% 30+31k 0+0io 0pf+0w
>Even worse - now gcc is 3 times faster. When I get bizarre results, I
>check them - so I looked at the output of the program. The result 
>is 1.00001 to the power 1000000, which my calculators claim is 22025.365.

22025.365 is the true result (according to a 16-digit calculation in
Mathematica) -- the other results are off due to accumulated error in the
single precision arithmetic, I guess.

>Output from cc:
>  x = 22323.087891 (1% different from calculators)
>Output from gcc:
>  x = 0.000000 (seriously bogus)
>So - how about another round of runs, this time checking the output?

gcc gives the expected single-precision output (22323.087891) on a SPARC.
I don't have gcc on our DEC, but it would appear that port has a serious
problem!

My program always gave the output x value, and while I didn't check it carefully,
I think it was the same on all machines I tried---definitely never 0!

>Not that I'd advocate wasting too much time on this - toy benchmarks don't
>tell you much anyway.

No, but on the other hand, at least we do know exactly what this 
benchmark does, and it is highly portable, even over the sneakernet :-)


--
Barry Merriman
UCLA Dept. of Math
UCLA Inst. for Fusion and Plasma Research
barry@math.ucla.edu (Internet)

pphillip@cs.ubc.ca (Peter Phillips) (11/18/90)

In article <738@kaos.MATH.UCLA.EDU> barry@pico.math.ucla.edu (Barry Merriman) writes:

[ saw a NeXTStation, tried a simple benchmark ]

>The results are (drumroll...)

[ other results omitted ]

>NeXTStation
>
>        0.3u 0.0s 0:00 92% 0+0k 0+0io 0pf+0w
         ^^^^ 1 digit
>
>        => 3.3 MFLOPS

Hmmm, 1/0.3 = 3.3333, but does "time" round up or down or neither?
This time could be as good as 5.0 MFLOPS or as poor as 2.5 MFLOPS
(just about as bad as a Sparcstation 1 at 2.3).  Did "time" always
give the same measured speed or did you only run the "benchmark" once?

Given similar qualities of measure, I was able to experimentally
verify that a plastic grocery bag gave me a much better overall
levitation/price ratio than Kurt Vonnegut's recent book, "Hocus Pocus".
Oddly enough, the plastic bag was actually slower to hit the
ground but cost less than the hardcover book.  This might be
due to the extra overhead involved with the book; the dust jacket
seemed reasonably competitive by itself.

Seriously, benchmarking a system is a tricky business.  Perhaps
a real program would have been more appropriate.  (Remember to
bring that OD or MS-DOS disk with your favourite program on it
NeXT time you're invited to see a NeXTstation.)  Or maybe NeXT
will actually release a full SPEC performance sheet for their
systems?

On the other hand, does it really matter how fast a NeXTstation
runs?  I'm sure it is faster than the original cube, and if you
want to do things which are specific to the NeXT machine you
really don't have any other choice, do you?  Personally, I
think that as long as NeXT sticks to the 680?0 line, the cube's
descendants will always be slower than their competitors' models.

--
Peter Phillips                  | Just because some of us can read and write
<pphillip@cs.ubc.ca>            | and do a little math, that doesn't mean we
{alberta,uunet}!ubc-cs!pphillip | deserve to conquer the Universe. - E.D.Hartke

ea08+@andrew.cmu.edu (Eric A. Anderson) (11/19/90)

There is actually a more fundamental problem with this benchmark: csh's
time builtin does not report times precisely enough.
When people report times of .2 sec using csh on a DECstation, they
would find times of about .245 seconds using tcsh.  This is a serious
problem, as the version run on the SPARCstation 2 reporting a time of .1
sec could actually be up to about .145 or more if csh just truncates the
times.
But the version of the program run under gcc on my machine gives the
correct output.
One thing I can't figure out is this: The float version runs faster than
the double version, but when I look at the assembler version of the code
I see this

Float Version :
cvt float to double
mult double
cvt double to float.

And in the Double version just
mult double.

It doesn't seem to make sense that the float version would be faster. 

And so that you know, the double version of the program gives:
x = 22025.144256

          -Eric
*********************************************************
"My life is full of additional complications spinning around until
 it makes my head snap off."
           -Unc. Known.
"You are very smart, now shut up."
           -In "The Princess Bride"
*********************************************************

na0i+@andrew.cmu.edu (Nenad Antonic) (11/19/90)

In article <1990Nov16.192938.17923@Neon.Stanford.EDU>
philip@pescadero.stanford.edu writes:
>Output from cc:
>  x = 22323.087891 (1% different from calculators)
>Output from gcc:
>  x = 0.000000 (seriously bogus)
>So - how about another round of runs, this time checking the output?


On my DECstation 3100 (andrew system at CMU) I got the following:
% gcc proba.c
% time ./a.out
x = 22323.087891 
1.2u 0.2s 0:03 47% 30+30k 3+0io 2pf+0w
% gcc -O proba.c
% time ./a.out
x = 22323.087891 
0.2u 0.0s 0:00 58% 29+29k 3+0io 2pf+0w
% cc -O proba.c
% time ./a.out
x = 22323.087891 
0.6u 0.0s 0:00 74% 30+30k 3+0io 2pf+0w

So, gcc gives correct results here. The configuration is:
16Mb memory, RZ23 hard drive, server for the rest (including .).

At run time I had X windows running, and a bunch of other things
(45% of virtual memory in use).  That did not slow down the
computations.

	Nenad Antoni\'c

chouw@galaxy.cps.msu.edu (Wen Hwa Chou) (11/20/90)

In article <obFjTbS00Vp50k61pn@andrew.cmu.edu> ea08+@andrew.cmu.edu (Eric A. Anderson) writes:
>Float Version :
>cvt float to double
>mult double
>cvt double to float.
>
>And in the Double version just
>mult double.
>
>It doesn't seem to make sense that the float version would be faster. 
>

Double precision multiplication does not take constant time.  It
varies with the data you throw at it.