shenkin@cunixf.cc.columbia.edu (Peter S. Shenkin) (04/27/91)
In article <15407@helios.TAMU.EDU> jamie@archone.tamu.edu (James Price) writes:
>Has anyone done any benchmarking of the SGI matrix functions? I was curious
>and wrote the program included below....

Jamie: You ought to tell us what kind of Iris "fritz" is, and also what version of IRIX you're running. But in any case, I ran your benchmark on avogadro, a 4D/25TG running 4.2.1, with the results below. I see that avogadro is faster than fritz in every regard.

I note that compiling matperf.c with increasing levels of optimization (-O2 and -O3) SLOWS DOWN the hardware performance -- and even, in some cases, the software performance (!) -- CONSIDERABLY. Can anyone explain this? I only did one run each, but these differences are BIG, and I've noted them in the table with exclamation points. This is highly distressing, since one wants to compile with high optimization to get the max out of one's own code, and I'd hate to think that doing so necessarily slows down graphics performance.

I note that with -O2 and -O3, software performance is far better than hardware performance in its best case, at least if one needs to get the results back. :-) Thus I conclude that, at least for my machine, it doesn't make sense to do matrix multiplication using the graphics pipeline, except in the context of graphics.

Another conclusion, at least on my machine: stay away from -O3!

Caveat: My machine does not have <stdlib.h>, so I removed that #include; I do get compilation warnings about parameter mismatches, but the thing compiles. Might this be affecting performance?

I've included the results Jamie reported for comparison. All are for a command-line argument of 10000 to Jamie's matperf program.

Machine:                       fritz        ------------ avogadro -------------
GL version:                    GL4DGT-3.3   ----------- GL4DPIT-3.2 -----------
Matperf optimization level:    -O1 ??       -O1          -O2          -O3
Software - no optimization:    3.349 sec.   1.860 sec.   0.578 sec.   0.578 sec.
Software - some optimization:  1.130 sec.   0.420 sec.   0.378 sec.   0.359 sec.
Software - more optimization:  0.910 sec.   0.330 sec.   0.359 sec.  !0.677 sec.
Hardware - preserve CTM:       2.379 sec.   0.890 sec.   0.976 sec.   0.876 sec.
Hardware - destroy CTM:        2.289 sec.   0.820 sec.   1.086 sec.   0.837 sec.
Hardware - abandon results:    0.580 sec.   0.430 sec.   0.539 sec.  !0.797 sec.

-P.

************************f*u*cn*rd*ths*u*cn*gt*a*gd*jb**************************
Peter S. Shenkin, Department of Chemistry, Barnard College, New York, NY 10027
(212)854-1418 shenkin@cunixf.cc.columbia.edu(Internet) shenkin@cunixf(Bitnet)
***"In scenic New York... where the third world is only a subway ride away."***
shenkin@cunixf.cc.columbia.edu (Peter S. Shenkin) (04/28/91)
Hmmm.... I was so incredulous at my own results, just posted earlier today, that I re-ran the -O3 benchmark. This time the results were almost identical to -O2; so I take back what I said about -O3. I must say, I'm surprised at how different the two runs of -O3 were. In both cases, I was the only one logged on.

-P.
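
Since the two -O3 runs disagreed so much, one way to guard against a single noisy run is to time the same workload several times and look at the spread between the fastest and slowest run. Here is a minimal sketch of that idea in C. It is not taken from matperf.c: `timing_range`, `time_once`, `time_repeated` and `dummy_work` are names made up for this illustration, and it assumes the standard `clock()` call, which reports processor time. How matperf itself measures time is not shown in this thread.

```c
#include <assert.h>
#include <time.h>

/* Fastest and slowest timing seen over several runs of one workload. */
struct timing_range {
    double min_sec;
    double max_sec;
};

/* Time one call of work(iterations), in processor seconds via clock(). */
static double time_once(void (*work)(long), long iterations)
{
    clock_t start = clock();
    work(iterations);
    return (double)(clock() - start) / (double)CLOCKS_PER_SEC;
}

/* Run the workload `runs` times and keep the fastest and slowest timings. */
static struct timing_range time_repeated(void (*work)(long),
                                         long iterations, int runs)
{
    struct timing_range r;
    int i;

    assert(runs > 0);
    r.min_sec = r.max_sec = time_once(work, iterations);
    for (i = 1; i < runs; i++) {
        double t = time_once(work, iterations);
        if (t < r.min_sec) r.min_sec = t;
        if (t > r.max_sec) r.max_sec = t;
    }
    return r;
}

/* Hypothetical stand-in for the benchmark body (assumption: the real
   program would run its matrix multiplies here instead). */
static volatile double sink;   /* keeps the loop from being optimized away */
static int work_calls;         /* counts how many times the workload ran */

static void dummy_work(long iterations)
{
    long i;
    double acc = 0.0;

    for (i = 0; i < iterations; i++)
        acc += (double)i * 0.5;
    sink = acc;
    work_calls++;
}
```

Calling `time_repeated(dummy_work, 10000L, 5)` and printing `min_sec` and `max_sec` gives a quick feel for run-to-run variation. A wide gap between the two suggests that a one-off number, such as a single slow -O3 run, should not be trusted on its own.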
jamie@archone.tamu.edu (James Price) (04/28/91)
Newsgroups: comp.sys.sgi
Subject: Re: SGI GL matrix performance -- more benchmarks, this time on a PI
References: <1991Apr27.204135.18538@cunixf.cc.columbia.edu>
Organization: College of Architecture, Texas A&M University

fritz is a 4D/50GT, with only an 8MHz CPU.

I recompiled using the -O flags and ran my program on a few other machines. We have a PI with GL version 3.3, which yielded results similar to your avogadro numbers (sorry - couldn't resist!), but with no large penalty in the hardware when going to -O3. I have no idea what could have caused the disparity you recorded, unless you had some other applications running which accessed the GL pipeline...

Here are some interesting results from our 4D/310VGX (33 MHz, 64MB RAM):

10000 iterations on yogi, with GL version: GL4DVGX-3.3

Compiler O level:              -O0          -O1          -O2          -O3
Software - no optimization:    0.880 sec.   0.750 sec.   0.360 sec.   0.360 sec.
Software - some optimization:  0.350 sec.   0.250 sec.   0.210 sec.   0.200 sec.
Software - more optimization:  0.290 sec.   0.200 sec.   0.210 sec.   0.200 sec.
Hardware - preserve CTM:       2.700 sec.   2.700 sec.   2.690 sec.   2.690 sec.
Hardware - destroy CTM:        2.620 sec.   2.620 sec.   2.610 sec.   2.610 sec.
Hardware - abandon results:    0.430 sec.   0.440 sec.   0.450 sec.   0.430 sec.

Note that even my "slowest" implementation of a 4x4 multiply, compiled with the -O3 flag, runs faster than the hardware. And my "most optimized" version runs twice as fast as the hardware, even when compiled with -O1, the compiler default. I'm using MIPS cc Version 2.00, running under IRIX 3.3.2 on the VGX (the other SGIs here run 3.3.1).

Although these results seem to favor a software implementation on a machine with a fast enough CPU, I have a feeling that if you threw in a few other processes and some NFS file I/O, the hardware would probably end up back on top. It's good food for thought, though.

Jamie
Texas A&M University Visualization Laboratory
jamie@archone.tamu.edu
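
The three software variants being timed are not shown in this thread. As a rough illustration of how a plain 4x4 multiply and a hand-optimized one might differ, here is a sketch in C. It is only a guess at the general idea, not the code from matperf.c: `Mat4`, `mat4_mul_loop`, `mat4_mul_unrolled` and `mat4_equal` are hypothetical names introduced for this example.

```c
#include <assert.h>

/* Hypothetical 4x4 single-precision matrix type for this sketch. */
typedef float Mat4[4][4];

/* Plain triple loop: out = a * b. out must not alias a or b. */
static void mat4_mul_loop(Mat4 a, Mat4 b, Mat4 out)
{
    int i, j, k;

    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++) {
            float sum = 0.0f;
            for (k = 0; k < 4; k++)
                sum += a[i][k] * b[k][j];
            out[i][j] = sum;
        }
    }
}

/* Same product with the two inner loops unrolled by hand and each row
   of a loaded once into locals. out must not alias a or b. */
static void mat4_mul_unrolled(Mat4 a, Mat4 b, Mat4 out)
{
    int i;

    for (i = 0; i < 4; i++) {
        float a0 = a[i][0], a1 = a[i][1], a2 = a[i][2], a3 = a[i][3];

        out[i][0] = a0 * b[0][0] + a1 * b[1][0] + a2 * b[2][0] + a3 * b[3][0];
        out[i][1] = a0 * b[0][1] + a1 * b[1][1] + a2 * b[2][1] + a3 * b[3][1];
        out[i][2] = a0 * b[0][2] + a1 * b[1][2] + a2 * b[2][2] + a3 * b[3][2];
        out[i][3] = a0 * b[0][3] + a1 * b[1][3] + a2 * b[2][3] + a3 * b[3][3];
    }
}

/* Exact element-wise comparison; only meaningful for values that are
   exactly representable, such as the small integers used for checking. */
static int mat4_equal(Mat4 x, Mat4 y)
{
    int i, j;

    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            if (x[i][j] != y[i][j])
                return 0;
    return 1;
}
```

Both functions compute the same product, so a benchmark can check them against each other with `mat4_equal` on small integer inputs before timing them. Whether the unrolled form actually wins depends on the compiler and the -O level, which is exactly the sort of thing the tables above are probing.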