[comp.sys.sgi] Processor efficiency

rgb@PHY.DUKE.EDU ("Robert G. Brown") (06/15/90)

We have a Power Series 220S in our department as a compute server.
It has 24 Mb of RAM, no graphics console, and two processors.
My question is this:  we have empirically observed that small jobs
written in C or F77 for a single processor and optimized run at
around 3.5 MFLOPS (as advertised).  The problem is that if one takes
these jobs (typically a loop containing just one equation with a
multiply, a divide, an add, and a subtract) and scales them up by
making the loop set every element of a vector and increasing the size
of the vector and the loop, there is a point (which I have not yet
tried to precisely pinpoint) where the speed degrades substantially --
by more than a factor of two.
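
For concreteness, the kind of loop I mean looks roughly like the sketch
below (an illustrative reconstruction, not our actual code; the vector
length and coefficients here are made up):

    #include <stdio.h>

    #define N 100000        /* vector length; varied to scale the job up */

    static double x[N];

    int main(void)
    {
        double a = 1.5, b = 2.5, c = 3.5, d = 4.5;
        int i;

        /* one multiply, one divide, one add, one subtract per element */
        for (i = 0; i < N; i++)
            x[i] = (a * i) / b + c - d;

        printf("%g\n", x[N - 1]); /* keep the work from being optimized away */
        return 0;
    }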

This point is >>far<< short of saturating the available RAM, and seems
independent of "normal" system load (which is usually carried by one
processor when the other is running a numerical task like this).

My current hypothesis is that this phenomenon is caused by saturation
of some internal cache on the R3000.  Has anyone else noticed or
documented this?  Is there a technical explanation that someone could
post?  Since we (of course) want to use the SG machine for fairly
large jobs, it is important for us to learn where these performance
cutoffs lie so that we can optimize around them.  On the other hand,
if there is something wrong with our SG-220, we'd like to learn that
too...

Thanks,


	Dr. Robert G. Brown 
 	System Administrator 
 	Duke University Physics Dept. 
 	Durham, NC 27706 
 	(919)-684-8130    Fax (24hr) (919)-684-8101 
 	rgb@phy.duke.edu   rgb@physics.phy.duke.edu

mccalpin@vax1.acs.udel.EDU (John D Mccalpin) (06/15/90)

In article <9006150334.AA03405@physics.phy.duke.edu> rgb@PHY.DUKE.EDU ("Robert G. Brown") writes:
>
>We have a Power Series 220S [....]
>[....] small jobs [...] run at around 3.5 MFLOPS (as advertised). 
>[...] if one takes these jobs (typically a loop containing just one 
>equation with a multiply, a divide, an add, and a subtract) and
>scales them up by making the loop set every element of a vector and
>increasing the size of the vector and the loop, there is a point
>(which I have not yet tried to precisely pinpoint) where the speed
>degrades substantially -- by more than a factor of two.

This degradation is a bit larger than is typical, but it is exactly
what one expects to find with many algorithms on a cached machine.
On my 4D/25 I typically see 25% slowdowns on dense linear algebra
benchmarks when the cache size is exceeded.

(Side note:  It is unfortunate that SGI put a 32kB data cache in
the 4D/25 as it is just a bit too small to handle the 100x100 LINPACK
benchmark case.  The rated performance is 1.6 MFLOPS for the 64-bit
case, while the Sparcstation I is rated at 2.6 MFLOPS.  Despite these
ratings, the 4D/25 is faster than the Sparcstation on almost every 
realistic FP benchmark that I have run.  Also, re-arranging the LINPACK
test case to run in block mode produces performance of up to 3.1 MFLOPS
for the same test case.)
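
To illustrate what "block mode" means here: the computation is re-arranged
so that a small tile of the matrix, sized to fit in the data cache, gets
reused many times before it is evicted.  The sketch below is a generic
blocked matrix multiply, not the actual LINPACK or LAPACK code, and the
sizes are made-up values:

    #define N  300
    #define BS  32        /* 32x32 doubles = 8 kB, comfortably inside 32 kB */

    /* C = C + A*B, computed tile by tile so that the BS x BS tiles of
       A, B, and C being worked on stay cache-resident while reused.  */
    void matmul_blocked(double A[N][N], double B[N][N], double C[N][N])
    {
        int ii, jj, kk, i, j, k;

        for (ii = 0; ii < N; ii += BS)
          for (jj = 0; jj < N; jj += BS)
            for (kk = 0; kk < N; kk += BS)
              for (i = ii; i < ii + BS && i < N; i++)
                for (j = jj; j < jj + BS && j < N; j++)
                  for (k = kk; k < kk + BS && k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }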

(Side Question:  Does anyone at SGI want to tell me what the cache 
line size and refill delays are for the 4D/25?  Thanks for any info!)

>My current hypothesis is that this phenomenon is caused by saturation
>of some internal cache on the R3000.  Has anyone else noticed or
>documented this? Dr. Robert G. Brown rgb@phy.duke.edu

Here are some numbers from the port of LAPACK that I have been playing with
on my 4D/25 (32 kB data cache).  Times are in seconds.  These use hand-coded
BLAS routines from earl@mips.com.

 size      factor     solve      total     mflops
  ------------------------------------------------
    32  0.000E+00  9.398E-03  9.398E-03  2.542E+00
    50  2.819E-02  0.000E+00  2.819E-02  3.133E+00
   100  1.692E-01  0.000E+00  1.692E-01  4.059E+00
   150  6.296E-01  1.880E-02  6.484E-01  3.539E+00
   200  1.626E+00  2.819E-02  1.654E+00  3.273E+00
   250  3.411E+00  4.699E-02  3.458E+00  3.048E+00
   300  6.137E+00  6.578E-02  6.202E+00  2.931E+00
   500  2.904E+01  1.692E-01  2.921E+01  2.870E+00

I get a bit more than 25% degradation going to the larger problems.

So what does one do about it?

Mostly it depends on the problem.  If you are doing problems that make
extensive use of reduction operations (sums and dot products) then
you should be able to improve the cache locality by unrolling the
outer loops.  This is roughly equivalent to the block-mode algorithms
used in LAPACK.
If your operations are vector<-vector+vector, then you are basically
out of luck and your problem will be memory bandwidth-limited.....
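
To make the unrolling suggestion concrete, here is a rough sketch (the
names, sizes, and unrolling depth are made up, and N is assumed to be a
multiple of 4 to keep it short).  Writing y = A*x as a set of dot
products and unrolling the loop over rows by four means that each x[j]
brought into the cache is used against four rows of A instead of one:

    #define N 400

    void matvec_unrolled(double A[N][N], double x[N], double y[N])
    {
        int i, j;

        for (i = 0; i < N; i += 4) {
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
            for (j = 0; j < N; j++) {
                double xj = x[j];       /* one load of x[j] feeds 4 rows */
                s0 += A[i][j]     * xj;
                s1 += A[i + 1][j] * xj;
                s2 += A[i + 2][j] * xj;
                s3 += A[i + 3][j] * xj;
            }
            y[i] = s0;  y[i + 1] = s1;  y[i + 2] = s2;  y[i + 3] = s3;
        }
    }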

Please let me know if I have not made myself clear!
-- 
John D. McCalpin                               mccalpin@vax1.udel.edu
Assistant Professor                            mccalpin@delocn.udel.edu
College of Marine Studies, U. Del.             mccalpin@scri1.scri.fsu.edu

mike@BRL.MIL (Mike Muuss) (06/16/90)

In January I also noticed problems with running medium-large programs on the
SGIs, but I have not yet had time to dig in very far;  too much else to do.

In my case, watching GR_OSVIEW shows a significant (like 25% of
each of the 8 CPUs) "system" loading, 10^5 interrupts/second,
and (as I recall) a large TLBFAULT rate.  My current theory is that
I may be using too many entries in the TLB.

The best description I have seen of this comes from the man page for
GR_OSVIEW, which says:

     tlbfault
             The TLB fault bar gives the number of operating-system handled
             TLB faults initiated by the processor.  There are two kinds of
             faults: double-level faults, and reference faults.

             On the 4D series, translation lookaside buffer (TLB) handling is
             performed entirely by software.  This is done by looking up the
             missing page entry in a page table, and entering the virtual to
             physical mapping into the TLB.  First-level faults are handled by
             extremely efficient low-level software.  The page tables
             themselves are virtually mapped, so when the first level TLB
             handler attempts to load a page table entry, it may fault because
             the page table isn't mapped.  This is a double-level fault, and
             must be repaired by high-level kernel routines.  A reference
             fault occurs when a page is touched, and is used by the operating
             system in keeping accurate usage information for efficient
             paging.

             A high double-level fault rate can be a problem, because of the
             overhead of kernel handling.  Each page table can map 2Mb of
             memory, but each program requires at least three segments: text,
             data and stack.  Additionally, most programs are linked with
             either the shared C library or shared graphics library, each of
             which adds two more segments to the program.  Mapping the
             graphics pipe requires another segment as well.  Gr_osview links
             with these, as well as with the shared font manager library,
             making for a total of 10 segments.  There are 62 TLB entries
             available and gr_osview uses more pages of data than this.  This
             results in a fairly high background double-level fault rate.
             However, the CPU load due to this double-level handling rate is
             not measurable for gr_osview, which is worse in these respects
             than most programs.


This may provide a clue.

By the way, my hat is off to Jim Barton, the author of GR_OSVIEW.
A superb monitoring tool!

	Best,
	 -Mike