[comp.sys.sgi] Multi-processor problems

dixons%phvax.dnet@SMITHKLINE.COM (01/12/90)

I have been working on getting a FORTRAN program to run in parallel.  I
seem to have gotten it running with reasonable load balance, etc., but
have observed a curious phenomenon which depends on the system load.
Here's what happens:
When I run on a system with no other users, I see a speedup which
scales sensibly with the number of processors used; the final speedup
with 4 processors is about 1.75x.  But if I run the same job when one
other compute-bound (single-processor, non-mp) job is running, here are
the running times as a function of the number of processors used in the
parallel job:

1 proc		2 proc		3 proc		4 proc
7:14		5:17		4:32		about 22 min

I say "about 22 minutes" since time returns these rather strange results:
real    30:35.19
user  1:06:58.08
sys         6.41

A ps on the 4-proc job just before it finishes shows the following:

  5451 ?       22:03 pdg
  5439 ?       22:54 pdg
  5452 ?       22:01 pdg
  5453 ?       21:58 pdg

In other words, using four processors suddenly takes 3 times longer
than 1 processor.  This seems to be repeatable.  Also, if two other
compute-bound jobs are each using a processor, then the problem starts
when three processors are used for the mp job.
Four single-processor versions of the same job, all running against the
same other compute-bound job, finish in about 7:20 each.

Someone else with a 240 has mentioned to me that he has seen similar
behaviour.  Have others of you observed the same?  Is there a fix for
this?   It seems to me to be a rather serious problem which would
effectively prevent multi-processor SGI boxes from being used in
parallel mode unless they were dedicated to a single compute job.
The system in question is a 4D/240 with 32 Megs running Irix 3.2.  The
programs are CPU-bound, do little I/O, and are not swapping much (almost
not at all).  CPU utilization is high (>90%) in user mode on all cpus.
I believe that similar behaviour occurred with earlier releases of Irix
as well, but I haven't gotten around to looking systematically until
now.
Scott Dixon (dixons@smithklin.com)

ktl@wag240.caltech.edu (Kian-Tat Lim) (01/13/90)

	On our 4D/240GTXB running 3.1F, we have observed similar
effects.  For a small, nearly completely parallelizable program, here
are some representative timings (the compute-bound job was running in
the background while these tests were run):

1 compute-bound job + 1 thread:  14.5u 0.1s 0:14 98%
1 compute-bound job + 2 threads:  7.0u 0.1s 0:07 98%
1 compute-bound job + 3 threads:  4.9u 0.1s 0:05 98%
1 compute-bound job + 4 threads:  6.5u 0.1s 0:07 87%
1 compute-bound job + 5 threads: 17.0u 0.1s 0:26 65%
1 thread alone: 14.5u 0.1s 0:14 98%
2 threads alone: 7.0u 0.1s 0:07 96%
4 threads alone: 3.8u 0.1s 0:04 96%

A computation-less version of the same program, which just did m_fork
calls, was timed as follows (without the compute-bound background job;
see the Fortran sketch after these timings):

1 thread:   0.3u 0.1s 0:00 73%
2 threads:  0.3u 0.0s 0:00 86%
3 threads:  0.3u 0.1s 0:00 82%
4 threads:  0.3u 0.1s 0:00 81%
5 threads: 18.9u 0.1s 0:28 66%
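
For the Fortran MP model discussed in this thread, a roughly analogous
null test would be an empty parallel loop entered over and over,
something like this sketch (the C$DOACROSS directive form is assumed;
to a non-MP compiler the directive is just a comment):

      PROGRAM NULLMP
      INTEGER I, J
      REAL X(4)
C     Enter a do-nothing parallel loop many times, so that time(1)
C     measures essentially nothing but fork/synchronization
C     overhead; the number of threads comes from the MP runtime,
C     not from the loop bounds.
      DO 20 J = 1, 10000
C$DOACROSS LOCAL(I), SHARE(X)
         DO 10 I = 1, 4
            X(I) = 0.0
 10      CONTINUE
 20   CONTINUE
      END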

Multiple parallel jobs ("time cmd & time cmd"):

2 x 2 threads: 7.4u 0.1s 0:07 98%
2 x 4 threads: 34.0u 0.1s 1:08 49%

Increasing the computation per thread by 25 times gave:

4 threads: 109.0u 1.1s 1:50 99%
2 x 4 threads: 150.8u 7.4s 5:11 50%

	There appear to be severe scheduling problems with more than
four simultaneous threads of execution.  We're in the process of
confirming these numbers and sending a letter to SGI; if someone has a
fix, we'd love to hear it.

--
Kian-Tat Lim (ktl@wagvax.caltech.edu, KTL @ CITCHEM.BITNET, GEnie: K.LIM1)

bron@bronze.wpd.sgi.com (Bron Campbell Nelson) (01/13/90)

In article <9001120157.AA15338@smithkline.com>, dixons%phvax.dnet@SMITHKLINE.COM writes:
> I have been working on getting a FORTRAN program to run in parallel.  I
> seem to have gotten it running with reasonable load balance, etc., but
> have observed a curious phenomenon which depends on the system load.
> Here's what happens:
[description deleted]
> In other words, using four processors suddenly takes 3 times longer
> than 1 processor.  This seems to be repeatable.  Also, if two other
> compute-bound jobs are each using a processor, then the problem starts
> when three processors are used for the mp job.
[more stuff deleted]
> Scott Dixon (dixons@smithklin.com)

The brief answer is: yes, there is a problem here, and the tools needed
to overcome it will be in the next major release (3.3 or whatever we
wind up calling it).

The considerably longer answer goes like this:  

The first (i.e. current) release of SGI's parallel Fortran only supports a
single model of parallel execution.  Namely, equal numbers of iterations of a
DO loop are assigned to each process.  When a parallel loop is entered, the
work is parceled out.  When a process finishes its piece of the parallel
loop, it waits at the bottom of the loop until all the other processes
finish their pieces (i.e. we do a barrier synchronization at the bottom
of each loop).
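
In outline, a parallel loop under this model looks something like the
following (a minimal sketch; the C$DOACROSS directive form is assumed
here, and since the directive is a comment to a non-MP compiler, the
code compiles either way):

      PROGRAM PLOOP
      INTEGER N
      PARAMETER (N = 1000000)
      REAL A(N), B(N)
      INTEGER I
      DO 10 I = 1, N
         B(I) = REAL(I)
 10   CONTINUE
C     The iterations are split into equal pieces, one piece per
C     process; every process then waits at the bottom of the loop
C     (a barrier) until all the others have finished their pieces.
C$DOACROSS LOCAL(I), SHARE(A, B)
      DO 20 I = 1, N
         A(I) = 2.0 * B(I) + 1.0
 20   CONTINUE
      END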

What happens in the case Scott describes is that a parallel loop is
entered, and iterations are assigned to all 4 processes of the parallel
job.  Unfortunately, the fourth process cannot run, since there is
already another compute-bound process running on the fourth cpu.  The
other 3 processes finish their pieces, and then wait for the fourth
process.  However, they must typically wait a very long time, since the
fourth process has to wait for some other process's time slice to
expire, and then do a task switch.  All in all, a very messy business.

This problem happens because the parallel job wants all 4 cpus in order to
run efficiently, but it can't get all 4 cpus because other jobs are running.
Admittedly, this is hardly surprising; it's a rare person who gets a whole
4D/240 dedicated to their personal use!

Right now, what you can do is restrict the number of cpus that a job
asks for.  Instead of trying to use all the cpus, only use half (or
whatever); for example:
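
(A sketch; the mp_set_numthreads() library call and the corresponding
MP_SET_NUMTHREADS environment variable are assumptions about the MP
runtime's interface, so check the man pages for the exact names.)

      PROGRAM CAP3
C     Cap this job at 3 of the 4 cpus, leaving one cpu free for
C     other compute-bound jobs.  (Assumed libmp entry point; setting
C     the MP_SET_NUMTHREADS environment variable before running
C     should do the same without recompiling.)
      CALL MP_SET_NUMTHREADS(3)
C     ... parallel (C$DOACROSS) loops from here on use at most
C     3 processes ...
      END
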
In the next release, there will be 2 new enhancements that will help
cure this problem:  First, the process scheduler has been enhanced to
support "gang" scheduling.  In this mode, the parallel job will have all
of its processes scheduled as a unit (i.e. "all or nothing").  This avoids
the "wait for a process to be scheduled" problem described above.  Second,
we support dynamic assignment of loop iterations to processes, so rather
than assigning some loop iterations to all the processes, the next iteration
gets assigned to the next available process.  This allows parallel loops to
complete even if some processes of the parallel job never get to run.  This is
more flexible, but since the parcelling out of iterations must now be
controlled with a critical section, the overhead is higher.
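
At the source level, the dynamic version might look like this (a
sketch; the MP_SCHEDTYPE and CHUNK clause names are assumptions about
the forthcoming directive syntax, not something to type in under 3.2):

      PROGRAM DLOOP
      INTEGER N
      PARAMETER (N = 1000000)
      REAL A(N), B(N)
      INTEGER I
      DO 10 I = 1, N
         B(I) = REAL(I)
 10   CONTINUE
C     Each idle process grabs the next unassigned iteration (the
C     grab itself is protected by a critical section, hence the
C     extra overhead), so the loop completes even if one process
C     of the job never gets a cpu.
C$DOACROSS LOCAL(I), SHARE(A, B), MP_SCHEDTYPE=DYNAMIC, CHUNK=1
      DO 20 I = 1, N
         A(I) = 2.0 * B(I) + 1.0
 20   CONTINUE
      END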

Personally, I suspect that the best way to run will be to gang schedule *and*
use only 3 cpus.  That way you won't get the whole job kicked out just
because one other process wants to run.

Hope this helps.

--
Bron Campbell Nelson
bron@sgi.com  or possibly  ..!ames!sgi!bron
These statements are my own, not those of Silicon Graphics.