[comp.sys.sgi] SGI GL matrix performance -- more benchmarks, this time on a PI

shenkin@cunixf.cc.columbia.edu (Peter S. Shenkin) (04/27/91)

In article <15407@helios.TAMU.EDU> jamie@archone.tamu.edu (James Price) writes:
>Has anyone done any benchmarking of the SGI matrix functions?  I was curious
>and wrote the program included below....

Jamie:  You ought to tell us what kind of Iris "fritz" is, and also what 
version of IRIX you're running.  But in any case, I ran your benchmark on 
avogadro, a 4d25tg running 4.2.1, with the following results.  (Yours included 
for comparison.)  I see that avogadro is faster than fritz in every regard.

I note that compiling matperf.c with increasing levels of optimization
(-O2 and -O3) SLOWS DOWN the hardware performance -- and even, in some cases, 
the software performance (!) -- CONSIDERABLY.  Can anyone explain this?  I 
only did one run each, but these differences are BIG, and I've noted them 
in the table with exclamation points.  This is highly distressing, since one
wants to compile with high optimization to get the max out of one's own
code, and I'd hate to think that doing so necessarily slows down graphics
performance.

I note that with -O2 and -O3, software performance is far better than
the best-case hardware performance, at least if one needs to get the
results back.  :-)  Thus I conclude that at least for my machine, it doesn't 
make sense to do matrix multiplication using the graphics pipeline, except 
in the context of graphics.  Another conclusion, at least on my machine:
stay away from -O3 !

Caveat:  My machine does not have <stdlib.h>, so I removed that #include;
I do get compilation warnings about parameter mismatches, but the thing
compiles.  Might this be affecting performance?

I've included the results Jamie reported for comparison.  All are for a
command-line argument of 10000 to Jamie's matperf program.
  
  
Machine:                        fritz       ----------- avogadro -------------
GL version:                     GL4DGT-3.3  ---------- GL4DPIT-3.2 -----------
Matperf Optimization level:     -O1 ??      -O1         -O2         -O3

Software - no optimization:     3.349 sec.  1.860 sec.  0.578 sec.  0.578 sec. 
  
Software - some optimization:   1.130 sec.  0.420 sec.  0.378 sec.  0.359 sec. 
  
Software - more optimization:   0.910 sec.  0.330 sec.  0.359 sec. !0.677 sec.
  
Hardware - preserve CTM:        2.379 sec.  0.890 sec.  0.976 sec.  0.876 sec. 
  
Hardware - destroy CTM:         2.289 sec.  0.820 sec.  1.086 sec.  0.837 sec. 
  
Hardware - abandon results:     0.580 sec.  0.430 sec.  0.539 sec. !0.797 sec. 


	-P.
************************f*u*cn*rd*ths*u*cn*gt*a*gd*jb**************************
Peter S. Shenkin, Department of Chemistry, Barnard College, New York, NY  10027
(212)854-1418  shenkin@cunixf.cc.columbia.edu(Internet)  shenkin@cunixf(Bitnet)
***"In scenic New York... where the third world is only a subway ride away."***

shenkin@cunixf.cc.columbia.edu (Peter S. Shenkin) (04/28/91)

Hmmm....  I was so incredulous at my own results, just posted earlier today,
that I re-ran the -O3 benchmark.  This time the results were almost identical
to -O2;  so I take back what I said about -O3.  I must say, I'm surprised at 
how different the two runs of -O3 were.  In both cases, I was the only one
logged on.
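
A single run per configuration is easily thrown off by this kind of
scatter.  One common guard is to repeat each measurement several times and
keep the fastest run.  The following is a minimal sketch in C, not code from
matperf.c: best_of() and busy_loop() are hypothetical names introduced here,
and busy_loop() merely stands in for one of the benchmark cases.  Note that
clock() reports CPU time used by the process, so time spent waiting on the
graphics pipeline may not be counted; a wall-clock timer may be the better
choice for the hardware cases.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for one benchmark case (not from matperf.c). */
static volatile double sink;

void busy_loop(long iterations)
{
    long i;
    double x = 0.0;

    for (i = 0; i < iterations; i++)
        x += (double)i * 0.5;
    sink = x;   /* volatile store keeps the loop from being optimized away */
}

typedef void (*bench_fn)(long iterations);

/* Run fn(iterations) `runs` times; return the fastest run in seconds. */
double best_of(bench_fn fn, long iterations, int runs)
{
    double best = -1.0;
    double secs;
    clock_t start;
    int r;

    assert(fn != NULL && runs > 0);
    for (r = 0; r < runs; r++) {
        start = clock();
        fn(iterations);
        secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || secs < best)
            best = secs;
    }
    return best;
}
```

A call such as best_of(busy_loop, 10000L, 5) would then report the best of
five runs; a large gap between the best and worst runs suggests that something
else on the machine was interfering.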

	-P.

jamie@archone.tamu.edu (James Price) (04/28/91)

fritz is a 4D/50GT, with only an 8MHz CPU.  I recompiled using the -O flags
and ran my program on a few other machines.  We have a PI with GL version
3.3, which yielded results similar to your avogadro numbers (sorry - couldn't 
resist!), but with no large penalty in the hardware when going to -O3.  I have
no idea what could have caused the disparity you recorded, unless you had some
other applications running which accessed the GL pipeline...

Here are some interesting results from our 4D/310VGX (33 MHz, 64MB RAM):


10000 iterations on yogi, with GL version: GL4DVGX-3.3

Compiler O level:                 -O0         -O1         -O2         -O3

Software - no optimization:     0.880 sec.  0.750 sec.  0.360 sec.  0.360 sec.

Software - some optimization:   0.350 sec.  0.250 sec.  0.210 sec.  0.200 sec.

Software - more optimization:   0.290 sec.  0.200 sec.  0.210 sec.  0.200 sec.

Hardware - preserve CTM:        2.700 sec.  2.700 sec.  2.690 sec.  2.690 sec.

Hardware - destroy CTM:         2.620 sec.  2.620 sec.  2.610 sec.  2.610 sec.

Hardware - abandon results:     0.430 sec.  0.440 sec.  0.450 sec.  0.430 sec.


Note that even my "slowest" implementation of a 4x4 multiply compiled with
a -O3 flag runs faster than the hardware.  And my "most optimized" version
runs twice as fast, even when compiled with -O1, the compiler default.  I'm 
using MIPS cc Version 2.00, running under IRIX 3.3.2 on the VGX (the other 
SGIs here run 3.3.1). 
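
The source of matperf.c is not reproduced in this thread, so the following is
only a sketch of what a "no optimization" and a more hand-optimized 4x4
multiply might look like, not the actual benchmark code.  The names Mat4,
mat4_mul_loops() and mat4_mul_unrolled() are assumptions made for
illustration; Mat4 has the same float[4][4] shape as the GL Matrix type.

```c
#include <assert.h>

typedef float Mat4[4][4];

/* Straightforward triple loop: out = a * b.  out must not alias a or b. */
void mat4_mul_loops(Mat4 a, Mat4 b, Mat4 out)
{
    int i, j, k;
    float sum;

    assert(out != a && out != b);
    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++) {
            sum = 0.0f;
            for (k = 0; k < 4; k++)
                sum += a[i][k] * b[k][j];
            out[i][j] = sum;
        }
    }
}

/* Same product with the two inner loops unrolled by hand and each row
   of a held in locals.  out must not alias a or b. */
void mat4_mul_unrolled(Mat4 a, Mat4 b, Mat4 out)
{
    int i;
    float a0, a1, a2, a3;

    assert(out != a && out != b);
    for (i = 0; i < 4; i++) {
        a0 = a[i][0]; a1 = a[i][1]; a2 = a[i][2]; a3 = a[i][3];
        out[i][0] = a0 * b[0][0] + a1 * b[1][0] + a2 * b[2][0] + a3 * b[3][0];
        out[i][1] = a0 * b[0][1] + a1 * b[1][1] + a2 * b[2][1] + a3 * b[3][1];
        out[i][2] = a0 * b[0][2] + a1 * b[1][2] + a2 * b[2][2] + a3 * b[3][2];
        out[i][3] = a0 * b[0][3] + a1 * b[1][3] + a2 * b[2][3] + a3 * b[3][3];
    }
}
```

An optimizing compiler may do much of this unrolling on its own, which would
be consistent with the software rows converging at -O2 and -O3.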

Although these results seem to favor a software implementation on a machine
with a fast enough CPU, I have a feeling that if you threw in a few other 
processes and some NFS file I/O, the hardware would probably end up back 
on top.  It's good food for thought, though.
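
To put the totals in per-operation terms, the sketch below converts a timing
into microseconds per iteration and an approximate MFLOPS figure.  It assumes
(this is not stated in the thread) that each of the 10000 iterations performs
exactly one 4x4 multiply, counted as 64 multiplications plus 48 additions.
The function names are hypothetical.

```c
#include <assert.h>

/* Assumed cost of one 4x4 matrix multiply: 64 multiplies + 48 additions. */
#define FLOPS_PER_MAT4_MUL 112.0

/* Average time per iteration, in microseconds. */
double usec_per_iteration(double total_sec, long iterations)
{
    assert(iterations > 0);
    return total_sec * 1.0e6 / (double)iterations;
}

/* Approximate floating-point rate, in millions of operations per second. */
double mflops(double total_sec, long iterations)
{
    assert(total_sec > 0.0 && iterations > 0);
    return FLOPS_PER_MAT4_MUL * (double)iterations / total_sec / 1.0e6;
}
```

Under that assumption, the 0.200 sec. software figure on yogi works out to
20 microseconds per multiply (about 5.6 MFLOPS), while the 2.690 sec.
"preserve CTM" figure is 269 microseconds per multiply.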

Jamie

Texas A&M University
Visualization Laboratory
jamie@archone.tamu.edu
Newsgroups: comp.sys.sgi
Subject: Re: SGI GL matrix performance -- more benchmarks, this time on a PI
References: <1991Apr27.204135.18538@cunixf.cc.columbia.edu>
Organization: College of Architecture, Texas A&M University
Keywords: 
