[comp.sys.sgi] Where's the SPARK in my SPARC????

corkum@csri.toronto.edu (Brent Thomas Corkum) (02/22/91)

We have two 4D/25s and have just received a new SUN SPARCstation 2 GX in order to
port some C code from the Iris to the Sun. Our code is a boundary element
stress analysis program, similar to finite elements in that you assemble
a matrix and solve it, etc. Anyway, I ported my code, and had to change it
to K&R from ANSI because the default compiler that comes with the SUN
doesn't support ANSI (yet!!). So I compiled the program and ran it, and
did I get a surprise: it ran 6.5 times slower than on the 4D/25. I thought the
SPARC2 was supposed to be 2-3 times faster in FLOPS.

Now for some information on what I did. I used the following compile commands:


cc -float *.c -lm -o compute          -> SGI
cc -fsingle *.c -lm -o compute        -> SUN

Both machines have 16MB of memory and no swapping is occurring. I use floats
instead of doubles (that's why the -float and -fsingle). I'm using the
default C compiler that comes with the SUN, but would like some information
regarding other C compilers if I can't get this straightened out.

Well, in an effort to make some sense of this problem I wrote a simple
test program to test performance. 

main()
{
  int i,j;
  float val,pi;

  pi = 3.141592654;
  val = 0.0;

  for(i=0;i<5000;++i){
    for(j=0;j<5000;++j){
      val += pi;
      val *= pi;
      val /= pi;
      val -= pi;
    }
  }
}

with the following results for float variables:

IRIS:

cc -float bm.c -o bmfs         42.8 sec cpu time (using "time bmfs").

SUN:

cc -fsingle bm.c -o bmfs         47.7 sec cpu time (using "time bmfs").


Now based on these the SGI is still a little faster, not 6.5 times,
but still faster. I would have expected the SUN to be at least twice
as fast. 

Now I tried the same program with double variables and guess what, the
SUN started performing a little better.

IRIS:

cc bm.c -o bm         74.0 sec cpu time (using "time bm").

SUN:

cc bm.c -o bm         62.8 sec cpu time (using "time bm").


Well, not twice as fast, but right now I'd take equality for my application.
Anyway, what this means, I don't know; I'm just a Civil Engineer trying
to run a stress analysis.

So if someone out there can explain to me why all this is, it would
be much appreciated.  So far the SPARC is just taking up valuable
space, and I don't need any more doorstops!

Brent Corkum
Civil Engineering
University of Toronto
Toronto Canada
corkum@boulder.civ.toronto.edu

gumby@Cygnus.COM (David Vinayak Wallace) (02/22/91)

   Date: 21 Feb 91 17:00:49 GMT
   From: corkum@csri.toronto.edu (Brent Thomas Corkum)

   We have two 4D/25s and have just received a new SUN SPARCstation 2
   GX in order to port some C code from the Iris to the Sun. Our code
   is a boundary element stress analysis program, similar to finite
   elements in that you assemble a matrix and solve it, etc. Anyway, I
   ported my code, and had to change it to K&R from ANSI because the
   default compiler that comes with the SUN doesn't support ANSI
   (yet!!). So I compiled the program and ran it, and did I get a
   surprise: it ran 6.5 times slower than on the 4D/25. I thought the
   SPARC2 was supposed to be 2-3 times faster in FLOPS.

You should use gcc, which is faster than the supplied Sun compiler and
is ANSI to boot!  I don't know if it'll be 6.5 times faster though;
you'll have to play with the optimiser switches.

Kudos to SGI's compiler folks by the way.

pagels@cs.arizona.edu (Michael A. Pagels) (02/22/91)

We have had a number of performance surprises brought to us
by SPARCs.  In all cases they were traced to the SPARC using
register windows, and the MIPS not.  Generally, if your code
nests small-procedure calls deeply -- a common occurrence in
highly structured programs -- you pay a very large overhead
on the SPARC as it spends a lot of time spilling register windows.

Michael....

dik@cwi.nl (Dik T. Winter) (02/22/91)

In article <1991Feb21.120049.5626@jarvis.csri.toronto.edu> corkum@csri.toronto.edu (Brent Thomas Corkum) writes:
 >                            Anyway, I ported my code, and had to change it
 > to K&R from Ansi because the default compiler that comes with the SUN
 > doesn't support Ansi (yet!!).
This is part of the problem.
 >                               So I compiled the program and ran it, and
 > did I get a surprise, it ran 6.5 times slower than the 4D/25. I thought the
 > SPARC2 was supposed to be 2-3 times faster in FLOPS.
...
 > cc -float *.c -lm -o compute          -> SGI
 > cc -fsingle *.c -lm -o compute        -> SUN

The major problem is that the -fsingle flag is nearly useless for original
K&R C.  Suppose you have a routine whose declaration reads:
	void rout(s) float s; { ... }
this is effectively the same as declaring:
	void rout(s) double s; { ... }
and within the routine all calculations involving s are effectively done
in double rather than float *regardless of whether the -fsingle flag was
used or not*.  There is an undocumented feature of the C compiler: the
flag -fsingle2.  When you use it, parameters are passed as floats
and not as doubles (so this is not really K&R C).  You might try
that and see what happens.  One warning though: use of this flag also
implies that floats passed to standard library routines are not expanded
to double (which they should be), so you have to take care of that problem
too (e.g. with explicit casts to double).

Hope this helps.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

kipp@warp.esd.sgi.com (Kipp Hickman) (02/22/91)

In article <1991Feb21.120049.5626@jarvis.csri.toronto.edu>, corkum@csri.toronto.edu (Brent Thomas Corkum) writes:
|> cc -float *.c -lm -o compute          -> SGI
|> cc -fsingle *.c -lm -o compute        -> SUN

I think you need to run the optimizer on both machines to achieve a better analysis of your performance.

In any case, the best way to determine a machine's performance is to run your application, not benchmarks.  SGI machines typically show up as slower in benchmarks (especially when other vendors publish them) but actually run real applications significantly faster.  SGI also tends to understate our performance numbers as there are a *HUGE* number of variables in determining real application performance.

				kipp

baskett@forest.asd.sgi.COM (Forest Baskett) (02/22/91)

There are a number of other disadvantages of register windows, too.
But I admit to a certain prejudice on this issue.

Forest Baskett
Silicon Graphics

drb@eecg.toronto.edu (David R. Blythe) (02/23/91)

In article <3001@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>
>The major problem is that the -fsingle flag is nearly useless for original
>K&R C.  Suppose you have a routine whose declaration reads:
>	void rout(s) float s; { ... }
>this is effectively the same as declaring:
>	void rout(s) double s; { ... }
>and within the routine all calculations involving s are effectively done
>in double rather than float *regardless of whether the -fsingle flag was
>used or not*.

This is also true for the SGI compiler, so it's not likely the PI got any big
performance advantage from that.
	-drb

dik@cwi.nl (Dik T. Winter) (02/23/91)

In article <1991Feb22.183550.16477@jarvis.csri.toronto.edu> drb@eecg.toronto.edu (David R. Blythe) writes:
 > In article <3001@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
 > >The major problem is that the -fsingle flag is nearly useless for original
 > >K&R C.  Suppose you have a routine whose declaration reads:
 > >	void rout(s) float s; { ... }
 > >this is effectively the same as declaring:
 > >	void rout(s) double s; { ... }
 > >and within the routine all calculations involving s are effectively done
 > >in double rather than float *regardless of whether the -fsingle flag was
 > >used or not*.
 > This is also true for the SGI compiler, so it's not likely the PI got any big
 > performance advantage from that.
But this is not true if you are using prototypes (which the original poster
did).  BTW I just found a bug in the compiler; the sequence:
	void foo(float f);
	void foo(double f) { return; }
is accepted by the compiler (as it is if float and double are interchanged).
Also all other combinations of old style K&R specification with ANSI
declaration are allowed.  This will lead to incorrect results.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

corkum@csri.toronto.edu (Brent Thomas Corkum) (02/24/91)

In article <9102220209.AA27844@forest.asd.sgi.com> baskett@forest.asd.sgi.COM (Forest Baskett) writes:
>There are a number of other disadvantages of register windows, too.
>But I admit to a certain prejudice on this issue.
>
>Forest Baskett
>Silicon Graphics


While we're on the subject can someone please explain to me what register
windows are, and why this has anything to do with the SPARC2 vs 4D/25
discrepancy in speed that I've been seeing (for my application!). 

Brent

mccoy@pixar.uucp (Dan McCoy) (02/27/91)

In article <1991Feb21.120049.5626@jarvis.csri.toronto.edu> corkum@csri.toronto.edu (Brent Thomas Corkum) writes:
>So I compiled the program and ran it, and
>did I get a surprise, it ran 6.5 times slower than the 4D/25. I thought the
>SPARC2 was supposed to be 2-3 times faster in FLOPS.

Aside from the register windows that others mentioned, another
place that SPARC often loses ground versus MIPS processors (like SGI)
is integer multiplies.  Unless they snuck them into the SparcStation 2
when I wasn't looking, SPARC still does multiply in software whereas
MIPS processors have hardware for that.

Even if you think your code is floating point bound, there could be a lot
of integer multiplies that start dominating the run time.
On the SGI, using "pixie", you can find out how many integer multiplies
you are doing.

Dan McCoy	...!{ucbvax,sun}!pixar!mccoy

dik@cwi.nl (Dik T. Winter) (02/27/91)

In article <1991Feb26.202910.27944@pixar.uucp> mccoy@pixar.uucp (Dan McCoy) writes:
 > Aside from the register windows that others mentioned, another
 > place that SPARC often loses ground versus MIPS processors (like SGI)
 > is integer multiplies.  Unless they snuck them into the SparcStation 2
 > when I wasn't looking, SPARC still does multiply in software whereas
 > MIPS processors have hardware for that.
 > 
It is true that MIPS processors do the integer multiply in hardware while
the SPARC does it in software (using multiply step instructions).  However,
when you look at number of cycles to do a multiply there is in general not
much difference.  The MIPS mult instruction takes quite some time to
complete (11 cycles) and you have to pick up the result.  Of course you
can do other things during the multiply, but you will find that in general
you do not have enough instructions to fill all those cycles.  See the
examples in Kane's book where you find 6 to 11 cycle interlocks after a
multiply.  On the other hand, Sun software is clever enough to use a
short multiply sequence if one of the operands is small, bringing down
the time needed from 34 cycles for a full multiply to 18 or 14 cycles.
So not much difference there.

Also register windows do not matter very much.  In practice there is no
tremendous speed-down from register windows.  You will only get problems
if your program repeatedly calls a (large) nested sequence of small
routines.  And then you will only see that system time goes up because
of register window overflow/underflow traps.  On the other hand, there
was no need to store a bunch of local variables explicitly on the stack.
I think that the net result is that there is not much difference.

The main point, as I have mentioned before and have now also proven, is the
distinction between honoring prototypes (MIPS C compiler) and not
honoring them (Sun C compiler).  In the latter case specifying that all
floating point operations must be done in single precision is useless for
routines that take floating point parameters (because of the implicit
promotion to double that is mandated by K&R).  I did pick up the original
source and changed it so as to allow Sun's -fsingle2 flag (pass all
floating point parameters as float, not as double).  Changes were needed
because all floating point parameters to library routines must explicitly
be cast to double.  I did timings on an SLC and got the following results
(where the flags given are in addition to a bunch of flags common to all
tests):
cc			2m33.16s
cc -fsingle		1m36.00s
cc -fsingle -fsingle2	0m20.46s
Some improvement I would say.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl