corkum@csri.toronto.edu (Brent Thomas Corkum) (02/22/91)
We have 2 4D/25's and have just received a new SUN SPARCstation 2 GX in order to port some C code from the Iris to the Sun. Our code is a boundary element stress analysis program, similar to finite elements in that you assemble a matrix and solve it, etc. Anyway, I ported my code and had to change it from ANSI to K&R because the default compiler that comes with the SUN doesn't support ANSI (yet!!). So I compiled the program and ran it, and did I get a surprise: it ran 6.5 times slower than on the 4D/25. I thought the SPARC2 was supposed to be 2-3 times faster in FLOPS.

Now for some information on what I did. I used the following compile statements:

    cc -float *.c -lm -o compute      -> SGI
    cc -fsingle *.c -lm -o compute    -> SUN

Both machines have 16MB of memory and no swapping is occurring. I use floats instead of doubles (that's why the -float and -fsingle). I'm using the default C compiler that comes with the SUN, but would like some information regarding other C compilers if I can't get this straightened out.

Well, in an effort to make some sense of this problem I wrote a simple test program to test performance.

    main()
    {
        int i,j;
        float val,pi;

        pi = 3.141592654;
        val = 0.0;
        for(i=0;i<5000;++i){
            for(j=0;j<5000;++j){
                val += pi;
                val *= pi;
                val /= pi;
                val -= pi;
            }
        }
    }

with the following results for float variables:

    IRIS: cc -float bm.c -o bmfs      42.8 sec cpu time (using "time bmfs")
    SUN:  cc -fsingle bm.c -o bmfs    47.7 sec cpu time (using "time bmfs")

Now based on these the SGI is still a little faster; not 6.5 times, but still faster. I would have expected the SUN to be at least twice as fast. Then I tried the same program with double variables and, guess what, the SUN started performing a little better.

    IRIS: cc bm.c -o bm               74.0 sec cpu time (using "time bm")
    SUN:  cc bm.c -o bm               62.8 sec cpu time (using "time bm")

Well, not twice as fast, but right now I'd take equality for my application. Anyway, what this means, I don't know; I'm just a Civil Engineer trying to run a stress analysis.
So if someone out there can explain to me why all this is, it would be much appreciated. So far the SPARC is just taking up valuable space, and I don't need any more doorstops!

Brent Corkum
Civil Engineering
University of Toronto
Toronto, Canada
corkum@boulder.civ.toronto.edu
gumby@Cygnus.COM (David Vinayak Wallace) (02/22/91)
   Date: 21 Feb 91 17:00:49 GMT
   From: corkum@csri.toronto.edu (Brent Thomas Corkum)

   We have 2 4D/25's and have just received a new SUN SPARCstation 2 GX in order to port some C code from the Iris to the Sun. Our code is a boundary element stress analysis program, similar to finite elements in that you assemble a matrix and solve it, etc. Anyway, I ported my code and had to change it from ANSI to K&R because the default compiler that comes with the SUN doesn't support ANSI (yet!!). So I compiled the program and ran it, and did I get a surprise: it ran 6.5 times slower than on the 4D/25. I thought the SPARC2 was supposed to be 2-3 times faster in FLOPS.

You should use gcc, which is faster than the supplied Sun compiler and is ANSI to boot! I don't know if it'll be 6.5 times faster, though; you'll have to play with the optimiser switches.

Kudos to SGI's compiler folks, by the way.
pagels@cs.arizona.edu (Michael A. Pagels) (02/22/91)
We have had a number of performance surprises brought to us by SPARCs. In all cases they were traced to the SPARC using register windows and the MIPS not. Generally, if your code nests small procedure calls deeply -- a common occurrence in highly structured programs -- you pay a very large overhead on the SPARC, as it spends a lot of time spilling register windows.

Michael....
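As a rough illustration of the spilling described above, here is a toy model in C. It assumes a register file with a fixed number of usable windows (the parameter `usable_windows`; the value 7 mentioned below is only an example, not a measured figure for any particular SPARC chip) and counts how many overflow and underflow traps a call chain of a given depth would trigger. The function name and the model itself are hypothetical simplifications, not a description of the actual hardware or kernel trap handlers.

```c
/*
 * Toy model of register-window traps (an illustrative assumption,
 * not real SPARC behavior in every detail).
 *
 * usable_windows: windows that can be resident at once (assumed >= 1)
 * nest_depth:     number of nested calls made below the starting routine
 * repeats:        how many times the whole call chain is executed
 *
 * Returns the number of overflow plus underflow traps.
 */
long count_window_traps(int usable_windows, int nest_depth, int repeats)
{
    long traps = 0;
    int resident = 1;   /* windows currently held in registers */
    int r, d;

    for (r = 0; r < repeats; ++r) {
        /* Descend: every call needs a fresh window. */
        for (d = 0; d < nest_depth; ++d) {
            if (resident < usable_windows)
                ++resident;     /* a free window is available */
            else
                ++traps;        /* overflow: oldest window spilled to memory */
        }
        /* Unwind: every return needs the caller's window back. */
        for (d = 0; d < nest_depth; ++d) {
            --resident;
            if (resident == 0) {
                ++traps;        /* underflow: caller's window reloaded */
                resident = 1;
            }
        }
    }
    return traps;
}
```

Under these assumptions, `count_window_traps(7, 5, 100)` reports no traps at all, while `count_window_traps(7, 10, 100)` reports 800. In this model only call chains deeper than the window file cost anything, and the cost recurs every time the deep chain is repeated.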
dik@cwi.nl (Dik T. Winter) (02/22/91)
In article <1991Feb21.120049.5626@jarvis.csri.toronto.edu> corkum@csri.toronto.edu (Brent Thomas Corkum) writes:
 > Anyways, I ported my code, had to change it
 > to K&R from Ansi because the default compiler that comes with the SUN
 > doesn't support Ansi (yet!!).

This is part of the problem.

 > So I compiled the program and ran it, and
 > did I get a surprise, it ran 6.5 times slower than the 4D/25. I thought the
 > SPARC2 was supposed to be 2-3 times faster in FLOPS.
...
 > cc -float *.c -lm -o compute -> SGI
 > cc -fsingle *.c -lm -o compute -> SUN

The major problem is that the -fsingle flag is nearly useless for original K&R C. Suppose you have a routine whose declaration reads:

	void rout(s) float s; { ... }

this is effectively the same as declaring:

	void rout(s) double s; { ... }

and within the routine all calculations involving s are effectively done in double rather than float *regardless of whether the -fsingle flag was used or not*.

There is an undocumented feature of the C compiler: the flag -fsingle2. When you use it, parameters are passed as floats and not as doubles (so this is not really K&R C). You might try that and see what happens. One warning, though: use of this flag also implies that floats passed to standard library routines are not expanded to double (which they should be), so you have to take care of that problem as well (e.g. with explicit casts to double).

Hope this helps.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
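The promotion rule can be observed with any C compiler through a variadic function, because arguments passed through `...` undergo the same default argument promotions (float to double) as arguments to an unprototyped K&R function. The sketch below is not the poster's code; `scale_proto` and `first_extra_arg` are hypothetical names chosen for illustration.

```c
#include <stdarg.h>

/* With a prototype in scope, the argument is passed and used as a float. */
float scale_proto(float s)
{
    return s * 2.0f;
}

/*
 * Arguments passed through "..." are promoted from float to double,
 * just like arguments to a function declared in the old K&R style.
 * Returns the first extra argument, or 0.0 if count is not positive.
 */
double first_extra_arg(int count, ...)
{
    va_list ap;
    double value = 0.0;

    va_start(ap, count);
    if (count > 0)
        value = va_arg(ap, double);  /* va_arg(ap, float) would be undefined behavior */
    va_end(ap);
    return value;
}
```

Calling `first_extra_arg(1, 1.5f)` hands the callee a double even though the caller wrote a float literal, whereas `scale_proto(1.5f)` keeps the value in single precision throughout.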
kipp@warp.esd.sgi.com (Kipp Hickman) (02/22/91)
In article <1991Feb21.120049.5626@jarvis.csri.toronto.edu>, corkum@csri.toronto.edu (Brent Thomas Corkum) writes:
|> cc -float *.c -lm -o compute -> SGI
|> cc -fsingle *.c -lm -o compute -> SUN

I think you need to run the optimizer on both machines to get a better analysis of your performance. In any case, the best way to determine a machine's performance is to run your application, not benchmarks. SGI machines typically show up as slower in benchmarks (especially when other vendors publish them) but actually run real applications significantly faster. SGI also tends to understate its performance numbers, as there are a *HUGE* number of variables in determining real application performance.

					kipp
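One caveat when turning on the optimizer for the small test loop from the original post: `val` is never used after the loops, so an optimizing compiler is free to remove the computation entirely. A minimal sketch of a variant that returns the result and times itself with the standard `clock()` function might look like the following. The names `run_benchmark` and `time_benchmark` are hypothetical, and the loop count is a parameter rather than the original 5000.

```c
#include <time.h>

/* Same arithmetic as the original test loop, but the result is returned
 * so the compiler has less freedom to discard the work. */
float run_benchmark(int n)
{
    float val = 0.0f;
    float pi = 3.141592654f;
    int i, j;

    for (i = 0; i < n; ++i) {
        for (j = 0; j < n; ++j) {
            val += pi;
            val *= pi;
            val /= pi;
            val -= pi;
        }
    }
    return val;
}

/* Returns the CPU time in seconds used by run_benchmark(n); the computed
 * value is stored through result when result is not a null pointer. */
double time_benchmark(int n, float *result)
{
    clock_t start = clock();
    float val = run_benchmark(n);
    clock_t stop = clock();

    if (result != 0)
        *result = val;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

A caller would print both the elapsed time and the returned value, for example after `time_benchmark(5000, &val)`, so that the result is actually consumed.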
baskett@forest.asd.sgi.COM (Forest Baskett) (02/22/91)
There are a number of other disadvantages of register windows, too. But I admit to a certain prejudice on this issue. Forest Baskett Silicon Graphics
drb@eecg.toronto.edu (David R. Blythe) (02/23/91)
In article <3001@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>
>The major problem is that the -fsingle flag is nearly useless for original
>K&R C. Suppose you have a routine whose declaration reads:
>	void rout(s) float s; { ... }
>this is effectively the same as declaring:
>	void rout(s) double s; { ... }
>and within the routine all calculations involving s are effectively done
>in double rather than float *regardless to whether the -fsingle flag was
>used or not*.

This is also true for the SGI compiler, so it's not likely the PI got any big performance advantage from that.

	-drb
dik@cwi.nl (Dik T. Winter) (02/23/91)
In article <1991Feb22.183550.16477@jarvis.csri.toronto.edu> drb@eecg.toronto.edu (David R. Blythe) writes:
 > In article <3001@charon.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
 > >The major problem is that the -fsingle flag is nearly useless for original
 > >K&R C. Suppose you have a routine whose declaration reads:
 > >	void rout(s) float s; { ... }
 > >this is effectively the same as declaring:
 > >	void rout(s) double s; { ... }
 > >and within the routine all calculations involving s are effectively done
 > >in double rather than float *regardless to whether the -fsingle flag was
 > >used or not*.
 > This is also true for the SGI compiler so its not likely the PI got any big
 > performance advantage from that.

But this is not true if you are using prototypes (which the original poster did).

BTW I just found a bug in the compiler; the sequence:

	void foo(float f);
	void foo(double f) { return; }

is accepted by the compiler (as it is if float and double are interchanged). Also all other combinations of old style K&R specification with ANSI declaration are allowed. This will lead to incorrect results.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
corkum@csri.toronto.edu (Brent Thomas Corkum) (02/24/91)
In article <9102220209.AA27844@forest.asd.sgi.com> baskett@forest.asd.sgi.COM (Forest Baskett) writes:
>There are a number of other disadvantages of register windows, too.
>But I admit to a certain prejudice on this issue.
>
>Forest Baskett
>Silicon Graphics

While we're on the subject, can someone please explain to me what register windows are, and why this has anything to do with the SPARC2 vs 4D/25 discrepancy in speed that I've been seeing (for my application!).

Brent
mccoy@pixar.uucp (Dan McCoy) (02/27/91)
In article <1991Feb21.120049.5626@jarvis.csri.toronto.edu> corkum@csri.toronto.edu (Brent Thomas Corkum) writes:
>So I compiled the program and ran it, and
>did I get a surprise, it ran 6.5 times slower than the 4D/25. I thought the
>SPARC2 was suppose to be 2-3 times faster in FLOPS.

Aside from the register windows that others mentioned, another place that SPARC often loses ground versus MIPS processors (like SGI) is integer multiplies. Unless they snuck them into the SparcStation 2 when I wasn't looking, SPARC still does multiply in software whereas MIPS processors have hardware for that.

Even if you think your code is floating point bound, there could be a lot of integer multiplies that start dominating the run time. On the SGI, using "pixie", you can find out how many integer multiplies you are doing.

Dan McCoy
...!{ucbvax,sun}!pixar!mccoy
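For readers wondering what multiplying in software involves, the classic approach is shift-and-add, with one step per significant bit of the multiplier. The C sketch below illustrates the idea only; it is not Sun's actual library routine, and `soft_mul` and its `steps` counter are hypothetical names introduced for illustration.

```c
/*
 * Shift-and-add multiplication of two unsigned integers.
 * The result wraps modulo 2^N, as unsigned arithmetic does in C.
 * If steps is not a null pointer, it receives the number of loop
 * iterations taken, which equals the number of significant bits in b.
 */
unsigned int soft_mul(unsigned int a, unsigned int b, int *steps)
{
    unsigned int product = 0;
    int n = 0;

    while (b != 0) {
        if (b & 1u)
            product += a;   /* add the shifted multiplicand for this bit */
        a <<= 1;            /* next bit is worth twice as much */
        b >>= 1;            /* move on to the next multiplier bit */
        ++n;
    }
    if (steps != 0)
        *steps = n;
    return product;
}
```

Because the loop stops once the remaining multiplier bits are all zero, a small multiplier finishes in few steps, while a multiplier with its top bit set needs one step per bit of the word.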
dik@cwi.nl (Dik T. Winter) (02/27/91)
In article <1991Feb26.202910.27944@pixar.uucp> mccoy@pixar.uucp (Dan McCoy) writes:
 > Aside from the register windows that others mentioned, another
 > place that SPARC often loses ground versus MIPS processors (like SGI)
 > is integer multiplies. Unless they snuck them into the SparcStation 2
 > when I wasn't looking, SPARC still does multiply in software whereas
 > MIPS processors have hardware for that.
 >

It is true that MIPS processors do the integer multiply in hardware while the SPARC does it in software (using multiply step instructions). However, when you look at the number of cycles needed to do a multiply, there is in general not much difference. The MIPS mult instruction takes quite some time to complete (11 cycles) and you have to pick up the result. Of course you can do other things during the multiply, but you will find that in general you do not have enough instructions to fill all those cycles. See the examples in Kane's book, where you find 6 to 11 cycle interlocks after a multiply. On the other hand, Sun software is clever enough to use a short multiply sequence if one of the operands is small, bringing down the time needed from 34 cycles for a full multiply to 18 or 14 cycles. So not much difference there.

Register windows do not matter very much either. In practice there is no tremendous slowdown from register windows. You will only get problems if your program repeatedly calls a (large) nested sequence of small routines, and then you will only see system time go up because of register window overflow/underflow traps. On the other hand, there is no need to store a bunch of local variables explicitly on the stack. I think the net result is that there is not much difference.

The main point, as I have mentioned before and have now also demonstrated, is the distinction between honoring prototypes (MIPS C compiler) and not honoring them (Sun C compiler). In the latter case, specifying that all floating point operations must be done in single precision is useless for routines that take floating point parameters (because of the implicit promotion to double that is mandated by K&R).

I picked up the original source and changed it so as to allow Sun's -fsingle2 flag (pass all floating point parameters as float, not as double). Changes were needed because all floating point parameters to library routines must explicitly be cast to double. I did timings on an SLC and got the following results (where the flags given are in addition to a bunch of flags common to all tests):

	cc			2m33.16s
	cc -fsingle		1m36.00s
	cc -fsingle -fsingle2	0m20.46s

Some improvement, I would say.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl