dixons%phvax.dnet@SMITHKLINE.COM (01/12/90)
I have been working on getting a FORTRAN program running in parallel. I seem
to have gotten it running with reasonable load balance, etc., but have
observed a curious phenomenon which depends on the system load. Here's what
happens: When I run on a system with no other users, I see a speedup which
depends on the number of processors used in a sensible way. The final
speedup with 4 processors is about 1.75x. But if I run the same job on the
system when one other compute-bound (single-processor, non-mp) job is
running, here are the running times as a function of the number of
processors used in the parallel job:

    1 proc    2 proc    3 proc    4 proc
    7:14      5:17      4:32      about 22 min

I say about 22 minutes since time returns the rather strange results:

    real    30:35.19
    user  1:06:58.08
    sys         6.41

A ps on the 4 proc job just before it finishes shows the following:

    5451  ?  22:03  pdg
    5439  ?  22:54  pdg
    5452  ?  22:01  pdg
    5453  ?  21:58  pdg

In other words, using four processors suddenly takes 3 times longer than
1 processor. This seems to be repeatable. Also, if two other compute-bound
jobs are each using a processor, then the problem starts when three
processors are used for the mp job. Four single-processor versions of the
same job, all running against the same other compute-bound job, all finish
in about 7:20 each.

Someone else with a 240 has mentioned to me that he has seen similar
behaviour. Have others of you observed the same? Is there a fix for this?
It seems to me to be a rather serious problem which would effectively
prevent multi-processor SGI boxes from being used in parallel mode unless
they were dedicated to a single compute job.

The system in question is a 4D240 with 32 Megs running Irix 3.2. The
programs are CPU bound, do little I/O and are not swapping much (almost not
at all). CPU utilization is high (>90%) in user mode on all CPUs. I believe
that similar behaviour occurred with earlier releases of Irix as well, but
I haven't gotten around to looking systematically till now.

Scott Dixon (dixons@smithklin.com)
ktl@wag240.caltech.edu (Kian-Tat Lim) (01/13/90)
On our 4D/240GTXB running 3.1F, we have observed similar effects. For a
small, nearly-completely parallelizable program, here are some
representative timings (the compute-bound job was running in the background
while these tests were run):

    1 compute-bound job + 1 thread:   14.5u  0.1s  0:14  98%
    1 compute-bound job + 2 threads:   7.0u  0.1s  0:07  98%
    1 compute-bound job + 3 threads:   4.9u  0.1s  0:05  98%
    1 compute-bound job + 4 threads:   6.5u  0.1s  0:07  87%
    1 compute-bound job + 5 threads:  17.0u  0.1s  0:26  65%

    1 thread alone:                   14.5u  0.1s  0:14  98%
    2 threads alone:                   7.0u  0.1s  0:07  96%
    4 threads alone:                   3.8u  0.1s  0:04  96%

A computation-less version of the same program that just did m_forks was
timed as follows (without the compute-bound background job):

    1 thread:    0.3u  0.1s  0:00  73%
    2 threads:   0.3u  0.0s  0:00  86%
    3 threads:   0.3u  0.1s  0:00  82%
    4 threads:   0.3u  0.1s  0:00  81%
    5 threads:  18.9u  0.1s  0:28  66%

Multiple parallel jobs ("time cmd & time cmd"):

    2 x 2 threads:   7.4u  0.1s  0:07  98%
    2 x 4 threads:  34.0u  0.1s  1:08  49%

Increasing the computation per thread by 25 times gave:

    4 threads:      109.0u  1.1s  1:50  99%
    2 x 4 threads:  150.8u  7.4s  5:11  50%

There appear to be severe scheduling problems with more than four
simultaneous threads of execution. We're in the process of confirming these
numbers and sending a letter to SGI; if someone has a fix, we'd love to
hear it.
--
Kian-Tat Lim (ktl@wagvax.caltech.edu, KTL @ CITCHEM.BITNET, GEnie: K.LIM1)
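[Editorial note: the "computation-less" test above isolates pure thread
start-up and scheduling cost. As a rough, hedged illustration of that kind
of microbenchmark -- not the original code, which used IRIX's m_fork library
(which, as I understand it, keeps a pool of worker processes alive between
calls) -- here is a minimal POSIX-threads sketch that repeatedly spawns N
do-nothing workers and reports the wall-clock time. The thread count,
repetition count, and all names here are arbitrary choices for illustration.]

/*
 * spawn.c -- minimal sketch of a computation-less thread start-up test.
 * Build (modern system):  cc -O2 -o spawn spawn.c -lpthread
 * Run:                    ./spawn 4
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void *worker(void *arg)
{
    (void)arg;              /* no computation: only start-up cost is measured */
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = (argc > 1) ? atoi(argv[1]) : 4;
    int reps = 1000;        /* repeat so the cost is large enough to measure */
    pthread_t tid[64];
    struct timespec t0, t1;

    if (nthreads < 1 || nthreads > 64) {
        fprintf(stderr, "thread count must be 1..64\n");
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++) {
        for (int i = 0; i < nthreads; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < nthreads; i++)
            pthread_join(tid[i], NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d threads, %d spawn cycles: %.3f s wall clock\n",
           nthreads, reps, secs);
    return 0;
}

[Unlike m_fork, this creates and joins fresh threads every cycle, so it only
illustrates the shape of the measurement, not the same mechanism.]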
bron@bronze.wpd.sgi.com (Bron Campbell Nelson) (01/13/90)
In article <9001120157.AA15338@smithkline.com>, dixons%phvax.dnet@SMITHKLINE.COM writes:
> I have been working on getting a FORTRAN program running in parallel. I seem
> to have gotten it running with reasonable load balance, etc., but have
> observed a curious phenomenon which depends on the system load. Here's
> what happens: [description deleted]
> In other words, using four processors suddenly takes 3 times longer than
> 1 processor. This seems to be repeatable. Also, if two other compute-bound
> jobs are each using a processor, then the problem starts when
> three processors are used for the mp job. [more stuff deleted]
> Scott Dixon (dixons@smithklin.com)

The brief answer is: yes, there is a problem here, and the tools needed to
overcome it will be in the next major release (3.3 or whatever we wind up
calling it).

The considerably longer answer goes like this: The first (i.e. current)
release of SGI's parallel Fortran only supports a single model of parallel
execution. Namely, equal numbers of iterations of a DO loop are assigned to
each process. When a parallel loop is entered, the work is parceled out.
When a process finishes its piece of the parallel loop, it waits at the
bottom of the loop until all the other processes finish their pieces (i.e.
we do a barrier synchronization at the bottom of each loop).

What happens in the case Scott describes is that a parallel loop is entered,
and iterations are assigned to all 4 processes of the parallel job.
Unfortunately, the fourth process cannot run since there is already another
compute-bound process running on the fourth cpu. The other 3 processes
finish their pieces, and then wait for the fourth process. However, they
must typically wait a very long time, since the fourth process has to wait
for some other process's time slice to expire, and then do a task switch.
All in all, a very messy business.

This problem happens because the parallel job wants all 4 cpus in order to
run efficiently, but it can't get all 4 cpus because other jobs are running.
Admittedly, this is hardly surprising; it's a rare person who gets a whole
4D/240 dedicated to their personal use! Right now, what you can do is
restrict the number of cpus that a job asks for. Instead of trying to use
all the cpus, only use half (or whatever).

In the next release, there will be 2 new enhancements that will help cure
this problem: First, the process scheduler has been enhanced to support
"gang" scheduling. In this mode, the parallel job will have all of its
processes scheduled as a unit (i.e. "all or nothing"). This avoids the
"wait for a process to be scheduled" problem described above. Second, we
support dynamic assignment of loop iterations to processes, so rather than
assigning some loop iterations to all the processes, the next iteration
gets assigned to the next available process. This allows parallel loops to
complete even if some processes of the parallel job never get to run. This
is more flexible, but since the parcelling out of iterations must now be
controlled with a critical section, the overhead is higher.

Personally, I suspect that the best way to run will be to gang schedule
*and* use only 3 cpus. That way you won't get the whole job kicked out just
because one other process wants to run.

Hope this helps.
--
Bron Campbell Nelson  bron@sgi.com  or possibly  ..!ames!sgi!bron
These statements are my own, not those of Silicon Graphics.
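[Editorial note: the distinction Bron draws between the current static
scheme (equal blocks of iterations, barrier at the bottom of the loop) and
the forthcoming dynamic scheme (next iterations handed to the next free
process, guarded by a critical section) can be sketched in code. The sketch
below is NOT SGI's Fortran runtime; it is a generic POSIX-threads analog,
and every name in it (run, static_worker, dynamic_worker, CHUNK, and so on)
is invented purely for illustration.]

/*
 * sched.c -- illustrative sketch of the two loop-scheduling schemes.
 *   static_worker:  each of NTHREADS workers gets an equal, fixed block of
 *                   iterations, then waits at a barrier; if one worker
 *                   cannot get a cpu, every worker ends up waiting for it.
 *   dynamic_worker: workers repeatedly take the next chunk of iterations
 *                   from a shared counter inside a mutex-protected critical
 *                   section, so the loop drains even if one worker is starved.
 * Build:  cc -O2 -o sched sched.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>

#define N        1000000   /* loop trip count (arbitrary) */
#define NTHREADS 4
#define CHUNK    1000      /* iterations handed out per grab in dynamic mode */

static double a[N];
static void body(long i) { a[i] = (double)i * 0.5; }   /* stand-in loop body */

/* ---- static (block) scheduling with a barrier at the bottom ---- */
static pthread_barrier_t bar;

static void *static_worker(void *arg)
{
    long id = (long)arg;
    long lo = id * (N / NTHREADS);
    long hi = (id == NTHREADS - 1) ? N : lo + N / NTHREADS;
    for (long i = lo; i < hi; i++)
        body(i);
    pthread_barrier_wait(&bar);    /* everyone waits for the slowest worker */
    return NULL;
}

/* ---- dynamic self-scheduling: next chunk to the next free worker ---- */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long next_iter = 0;

static void *dynamic_worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);       /* the critical section: extra overhead */
        long lo = next_iter;
        next_iter += CHUNK;
        pthread_mutex_unlock(&lock);
        if (lo >= N)
            break;
        long hi = (lo + CHUNK < N) ? lo + CHUNK : N;
        for (long i = lo; i < hi; i++)
            body(i);
    }
    return NULL;
}

static void run(void *(*worker)(void *), const char *name)
{
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("%s done, a[N-1] = %g\n", name, a[N - 1]);
}

int main(void)
{
    pthread_barrier_init(&bar, NULL, NTHREADS);
    run(static_worker, "static");
    next_iter = 0;
    run(dynamic_worker, "dynamic");
    pthread_barrier_destroy(&bar);
    return 0;
}

[On an otherwise idle machine the two versions behave much the same. The
difference the posts describe shows up only when one worker is starved of a
cpu: the barrier in the static version then stalls all workers until the
straggler finally runs, while in the dynamic version the remaining workers
simply drain the leftover iterations, at the cost of the per-chunk critical
section.]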