gary@ut-emx.UUCP (Gary Smith) (10/07/89)
Following is the text from a message I posted to the unicos-l mail group approximately two weeks ago. Those of you who read both the unicos-l information and this information are probably quite tired of seeing it. Nonetheless, as there could be interested parties who read this news and not unicos-l, I wanted them to have an opportunity to read it. CRAY Research is continuing to work on the problems discussed in the paper, and indications are that a new memory scheduler will be included in UNICOS 6.0.

---------------------------------------------------------------------

Following is the text from a paper I presented at the CRAY User Group Conference in Trondheim last week. The text is quite long, but I felt it important that readers of this mailing group be made aware of our experiences with UNICOS 5.0 process scheduling. Lothar Wollschlager of KFA (BITNET: ZDV026@DJUKFA11) has had similar experiences on their Y-MP8/832. The most important section is 2.2. CRAY Research has already addressed many of these problems, but I believe a redesign of UNICOS memory scheduling is nevertheless required.


          UNICOS 5.0 Scheduling Problems on a CRAY X-MP EA/14se

                               Gary Smith
                   Internet: g.smith@chpc.utexas.edu
                        BITNET: G.SMITH@UTCHPC
                        DECnet: UTCHPC::G.SMITH

                    The University of Texas System
                 Center for High Performance Computing
                       Balcones Research Center
                          10100 Burnet Road
                      Austin, Texas 78758-4497

                          September 1, 1989

                               ABSTRACT

Process scheduling algorithms that achieve adequate batch job throughput while providing acceptable response time for interactive jobs require the careful integration of swapping-store I/O scheduling, central memory scheduling, and CPU scheduling. Direct, efficient couplings among these scheduling algorithms are very critical for modest-resource supercomputers such as the CRAY X-MP EA/14se. They must maximize user CPU and I/O utilization by maximizing the degree of multiprogramming, within any real-time and/or political constraints.

As released, the UNICOS 5.0 process scheduling subsystem has proved ineffective at maximizing user CPU and I/O utilization for typical job distributions on our CRAY X-MP EA/14se. This is not because our modest UNICOS hardware platform is intrinsically inadequate to achieve acceptable performance for our predominately batch job mix: the COS 1.16 job scheduler consistently achieves user CPU utilization which exceeds 85% and simultaneously provides high user I/O utilization with similar job mixes on our CRAY X-MP/24. Nor is it due to poorly-chosen schedv parameters. Rather, it is due to design oversight in the UNICOS 5.0 process scheduling software, software that fails to maximize the degree of multiprogramming for multiple reasons, including an ineffective strategy for avoiding central memory fragmentation.

1. Background

1.1 CHPC Established

The University of Texas System Center for High Performance Computing (CHPC) was established by the University of Texas System Board of Regents in 1986 to serve the research and instructional supercomputing needs of the academic and health science component institutions of the UT System. The initial equipment configuration included a CRAY X-MP/24 supercomputer with 32-megaword SSD, 9.6 gigabytes of DD49 mass storage and XIOP-based 3420 and 3480 magnetic tape access, all controlled by the COS 1.14 operating system. Support computers included a VAX 8600 front end running VMS and an IBM 4381 file server running MVS. The CHPC initiated production operations on May 15, 1986.
Access to the services of the CHPC is provided via the Texas Higher Education Network (THEnet), a state-wide, heterogeneous network administered by the UT System Office of Telecommunications Services (OTS). See Figure 1.

1.2 CRAY X-MP/24 Saturated

By February of 1987, the CPU resources of the CRAY X-MP/24 had become fully saturated, with typical job backlogs of up to one week. Clearly, to address the turnaround time problem without strict allocation procedures, additional CPU resources were required. Even though larger central memory was being requested by several users, four megawords had proved adequate to achieve a degree of multiprogramming that allowed user CPU utilizations of above 85% for typical job mixes.

1.3 UNICOS Migration Initiated

Over the next year, UNIX continued its emergence as an evolving de facto standard operating system for computers in all performance classes, and our decision to migrate to a homogeneous UNIX environment was justified on three primary bases. UNIX had been chosen as the operating system of the future by CRAY Research and other high-performance computer vendors. Equally important was CHPC's goal to establish a partnership between computing resources locally available at the UT component campuses, primarily UNIX-based scientific workstations and near-supercomputers, and those of the CHPC facility. Achieving this goal required the uniform network service functionality approached only in the UNIX environment. Finally, UNIX provided the best available approximation to our goal of vendor independence.

Presuming that the UT System Board of Regents would approve a major expansion of the production vector supercomputing resources provided by the CHPC (approval was subsequently granted in June, 1989), migration of the UT System CHPC facility to UNIX was approved in April of 1988. As the COS 1.16-based CRAY X-MP/24 was production-saturated, the vehicle chosen for migration was a CRAY X-MP EA/14se with 4 megawords of BMR, 8.4 gigabytes of DD39 mass storage and XIOP-based 3420 and 3480 magnetic tape access, all controlled by the UNICOS 4.0 operating system. Support systems included a CONVEX C120 access server running CONVEX UNIX 6.2 and a Sun-3/280S gateway server running SunOS 3.5. See Figures 2 and 3. New hardware was installed during October, 1988 and acceptance testing was completed by January 1, 1989.

The plan was that, after 9 to 12 months, the majority of the COS workload could be moved to the UNICOS context. At that time, either "COS stragglers" would be accommodated by reversing the roles of the COS 1.16 X-MP/24 and UNICOS 5.0 X-MP EA/14se, or UNICOS 5.0 would be run on both X-MP's.

2. Problems

2.1 UNICOS 4.0 Installed

After a brief development period, in which CHPC-specific accounting and NQS-temporary file ($TMPDIR) modifications were integrated into UNICOS 4.0, the X-MP EA/14se was made available for production. As expected, the early job mix proved to be primarily batch, yet we were routinely measuring low CPU utilization. Careful investigation of the UNICOS 4.0 source code revealed that, in an attempt to integrate batch scheduling with more-or-less standard UNIX priority-based process scheduling, the developers of UNICOS 4.0 had not provided us with the scheduling flexibility we required. We believe most sites purchase CRAY supercomputers primarily for high-performance sustainable CPU throughput.

Yet due to the simple negative-feedback-on-CPU-usage process priority function in UNICOS 4.0, the highly I/O-bound memory hogs (memhogs) received a much larger share of central memory occupancy than CPU-bound memhogs. The I/O-bound memhogs even competed effectively with small CPU-bound processes. CPU-bound process priorities decayed much faster than those of I/O-bound memhogs, resulting in less than 50% CPU utilization at times when many CPU-bound processes were runable. After some trivial coding changes to make CPU-bound memhogs available for swap-in sooner, they achieved a larger share of central memory occupancy. This allowed us to achieve nearly 75% CPU utilization with the same job mix.
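For readers unfamiliar with this style of scheduling, the sketch below models the kind of negative feedback on CPU usage described above. It is an illustrative model only, in the spirit of classical UNIX schedulers; the constants and field names are hypothetical and are not taken from the UNICOS 4.0 sources. A process that runs continuously accumulates usage and its priority worsens, while an I/O-bound process that consumes little CPU decays back toward the base priority and so is consistently favored, regardless of how much memory it occupies.

    /*
     * Illustrative model of negative-feedback-on-CPU-usage priority
     * recalculation (classical UNIX style).  Names and constants are
     * hypothetical; this is not the UNICOS 4.0 code.
     */
    #include <stdio.h>

    #define PUSER 50                    /* base user priority             */

    struct model_proc {
        int p_cpu;                      /* recent CPU usage estimate      */
        int p_nice;                     /* nice value                     */
        int p_pri;                      /* priority: larger = less favored */
    };

    /* Charge one clock tick to a process that is currently running.     */
    static void charge_tick(struct model_proc *p)
    {
        if (p->p_cpu < 255)
            p->p_cpu++;
    }

    /* Once per second: decay accumulated usage and recompute priority.
     * A CPU-bound process keeps p_cpu high; an I/O-bound memhog that
     * runs only briefly between I/O requests decays toward PUSER and so
     * always looks more deserving, however much memory it holds.        */
    static void recompute_pri(struct model_proc *p)
    {
        p->p_cpu /= 2;                  /* negative feedback on usage     */
        p->p_pri = PUSER + p->p_cpu / 4 + 2 * p->p_nice;
    }

    int main(void)
    {
        struct model_proc cpu_bound = {0, 0, PUSER};
        struct model_proc io_bound  = {0, 0, PUSER};

        for (int sec = 1; sec <= 5; sec++) {
            for (int tick = 0; tick < 60; tick++)
                charge_tick(&cpu_bound);    /* runs the whole second      */
            charge_tick(&io_bound);         /* runs one tick, then sleeps */
            recompute_pri(&cpu_bound);
            recompute_pri(&io_bound);
            printf("sec %d: CPU-bound pri %d, I/O-bound pri %d\n",
                   sec, cpu_bound.p_pri, io_bound.p_pri);
        }
        return 0;
    }

In this model the I/O-bound process always carries the better (lower) priority number, which is why, absent any memory-occupancy term, it also wins the competition for central memory.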
2.2 UNICOS 5.0 Installed

Having heard from CRAY Research employees at many levels that UNICOS 5.0 was to be the most "feature-rich" and reliable release to date, due in part to extremely thorough testing under production loads, we ordered installation materials and documentation on May 15, 1989 and installed UNICOS 5.0.10 for production on July 11. By September 1, just over seven weeks later, we had two CRITICAL, ten URGENT and two MAJOR SPR's outstanding against UNICOS 5.0, including problems with the kernel (primarily the process scheduler), TCP/IP modules, fsck, mkfs, NQS, msgdaemon and crayperf. Of these SPR's, two CRITICAL's and two URGENT's against the process scheduling subsystem represent significant problems for our site.

After some definitions and a very brief overview of UNICOS 5.0 scheduling philosophy, the remainder of this section will detail several of the process scheduling problems we have experienced. See Figure 4.

Central Memory Scheduling Terminology

cpuhog
    a process that has exceeded the CPU time limit set for processes with modest CPU requirements

memhog
    a process that has exceeded the central memory limit set for processes with modest central memory requirements

hog
    cpuhog or memhog

hogmem
    total central memory that can be captured at any time by hogs

roadhog
    an in-memory process is defined as a roadhog when the sum of its size and that of the highest-priority swap-in process exceeds available user memory

maxbad
    long-term sleeping processes

kindabad
    short-term sleeping processes

Central Memory Scheduling Parameter Definitions

Scheduling parameter weight factors: x_in defines a weight factor to be multiplied by an attribute associated with x for processes residing in central memory when computing their swap-out priority. Similarly, x_out defines a weight factor to be multiplied by an attribute associated with x for processes residing on the swap device (SWAPDEV) when computing their swap-in priority.
+ Throughput scheduling:
      mfactor:  process size
      tfactor:  process residence time

+ Political scheduling:
      pfactor:  process standard UNIX priority
      shfactor: process fair share priority
      nfactor:  process nice value

Scheduling parameter damping/clipping factors:

min_inpri
    runable processes in central memory are excluded from swap-out until their swap-out priority is lower than min_inpri, unless max_inage is exceeded

min_outpri
    runable processes on SWAPDEV are excluded from swap-in until their swap-in priority is higher than min_outpri, unless max_outage is exceeded (and the process is a hog)

max_inage
    guaranteed central memory residency time for runable processes; processes that have occupied central memory for max_inage seconds can be swapped out regardless of min_inpri

max_outage
    guaranteed SWAPDEV residency time for hog processes; hog processes that have been swapped out for max_outage seconds can be swapped in regardless of min_outpri

thrash-blks
    no more than thrash-blks blocks may be swapped per thrash-inter seconds if thrash-inter is nonzero

Central Memory Scheduling Algorithm (sched)

sched computes swap-in priorities to order swapped-out runable processes for swap-in. Similarly, sched computes swap-out priorities to order swapped-in processes for swap-out, to make room to swap processes in. Swap priorities are computed as a simple sum of products, using weight factors set via schedv. Higher numeric values indicate better priority. Thus, the process with the numerically largest priority on the swap-in queue will swap in first. The in-memory process with the numerically smallest priority will swap out first. Once the "best" candidates for swap-in and swap-out are found, their priorities are compared. If the swap-in candidate has a priority less than that of the swap-out candidate, no swap occurs. sched then suspends itself for one second, after which it reevaluates the situation. Of course, special cases occur for in-memory processes in states such as roadhog, sleeping or locked-in: any roadhog process is chosen prior to the "worst" maxbad process; any maxbad process is chosen prior to the "worst" kindabad process; any kindabad process is chosen prior to the "worst" runable process, within the constraints of any damping or clipping factors.

Swap-in priority is computed as follows:

    P = priority*pfactor_out + sharepri*shfactor_out + nice*nfactor_out
        + size*mfactor_out + time*tfactor_out

Swap-out priority is computed similarly:

    P = priority*pfactor_in + sharepri*shfactor_in + nice*nfactor_in
        + size*mfactor_in + time*tfactor_in
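To make the computation concrete, the following sketch shows the sum-of-products priority calculation and the candidate comparison just described. The parameter names follow the schedv terminology used in this paper, but the data structures and helper functions are hypothetical simplifications, not the UNICOS 5.0 kernel code.

    /*
     * Sketch of the sched swap-priority computation.  Hypothetical,
     * simplified data structures; not the UNICOS 5.0 kernel sources.
     */
    struct schedv_weights {             /* one set of *_in or *_out weights */
        double pfactor, shfactor, nfactor, mfactor, tfactor;
    };

    struct sched_proc {
        double priority;                /* standard UNIX priority           */
        double sharepri;                /* fair-share priority              */
        double nice;                    /* nice value                       */
        double size;                    /* process size                     */
        double time;                    /* residence time in memory or on SWAPDEV */
    };

    /* Simple sum of products; a higher numeric value is a better priority.
     * Called with the *_out weights for swap-in candidates and the *_in
     * weights for swap-out candidates.                                     */
    double swap_priority(const struct sched_proc *p,
                         const struct schedv_weights *w)
    {
        return p->priority * w->pfactor
             + p->sharepri * w->shfactor
             + p->nice     * w->nfactor
             + p->size     * w->mfactor
             + p->time     * w->tfactor;
    }

    /* One decision of the one-second scheduling pass: if the best swap-in
     * candidate's priority is less than that of the best swap-out
     * candidate, no swap occurs and sched sleeps for a second before
     * reevaluating.                                                        */
    int swap_should_occur(double best_swapin_pri, double best_swapout_pri)
    {
        return best_swapin_pri >= best_swapout_pri;
    }

Note that the two calls to swap_priority necessarily use different weight vectors (the *_in and *_out sets); as discussed below, that asymmetry is one source of oscillation.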
2.2.1 Central Memory Scheduler Problems

During beta testing of UNICOS 5.0 at NCAR, several problems were found in sched. One important problem was so-called "roadhog detection", wherein sched did not detect processes that were required to swap out in order to fit the highest-priority process on the swap-in queue into central memory. sched was modified such that when roadhogs were occupying central memory, they would be considered exclusively for swap-out. Unfortunately, a precondition for roadhog detection was that the process be a hog. As released, UNICOS 5.0 schedv parameters declare no hogmem, and thus roadhog deadlock detection failed until the default schedv parameters were modified. This resulted in several uncontrolled shutdowns (because we could not log in on the system console to run our shutdown script) and our first UNICOS 5.0 CRITICAL SPR.

As released, the UNICOS 5.0 default schedv parameters result in other problems as well, the most significant of which is thrashing. Following is a quote from a posting to the UNICOS mailing list by Doug Engert of Argonne Labs:

    "With the default settings, we were seeing very high swap rates. The situation went something like this. A number of processes are swapped out and their priority is increasing. One is swapped in, and its new priority is calculated, but it is less than others which are still out, and thus it gets swapped out. We had to set the min_inpri to much less than the min_outpri to slow this down. But the intent of the min_inpri and min_outpri is to have a range where the system can dynamically balance the swapping. A fudge factor was needed."

At CHPC, we were often experiencing similar behavior with our job mix. Fortunately, Doug Engert went on to determine a set of schedv parameters that alleviate the thrashing problem somewhat. Unfortunately, sched has more fundamental problems that can cause instabilities, not the least of which is oscillatory behavior due to asymmetries introduced by the two separate functions for computing swap priorities. One alternative to the fudge factor Doug implemented is the concept of thrash locks with terms proportional to process size, in addition to a constant term. This method is used in COS and works quite well for CHPC job distributions.

Other serious problems with sched are the following, first described to us by Ted Kline of CRAY Research during a two-day UNICOS 5.0 memory scheduler workshop held at CHPC on August 21-22, 1989 (during which Ted installed a fix to roadhog deadlock detection for no-hogmem systems). Suppose A and B are in-memory processes, with process C swapped out and runable. There exist conditions wherein sched will swap out A even though A will be ordered ahead of C on the swap-in queue, with the result that A is immediately swapped back in. This can continue until priority(A) < priority(B), at which time similar oscillations can begin with process B. If process C is ever ordered first on the swap-in queue, of course it can swap in, but these "scheduler loops" can go on for some time, wasting CPU and I/O resources.

Yet another problem Ted mentioned was that unless the swap-in priority of runable swapped-out processes exceeds min_outpri, sched refuses to swap them in unless max_outage is exceeded, even when adequate central memory is available!

Attempting to schedule processes strictly on the basis of share priorities, which include costs for central memory and I/O as well as CPU, we determined that the scheduler loop Ted described occurs when max_inage is zero. Unfortunately, nonzero max_inage can result in keeping sleeping processes in central memory when multiple runable processes are available on the swap-in queue! Even though the fair share CPU scheduler functions properly, in several situations the use of share priority alone for memory scheduling is unstable, for at least the following reason. Suppose CPU-bound, roadhog processes A and B are the only runable processes over some time interval. Also suppose their share priorities are extremely close. Swap-in/swap-out interchanges of A and B can then occur at very short intervals. If these swaps are at VHISP rates, the lost CPU utilization might be hardly noticed, but with disk swapping, very low CPU utilization results. Again, thrash locks proportional to process size constitute a component of the solution to this kind of problem.
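As a rough illustration of the size-proportional thrash locks advocated here (and provided by COS), the sketch below guarantees a swapped-in process a residency period with a constant term plus a term proportional to its size before it may be selected for swap-out. The names and constants are hypothetical and serve only to illustrate the mechanism.

    /*
     * Sketch of a size-proportional thrash lock.  Hypothetical names and
     * constants, chosen only to illustrate the idea.
     */
    #define LOCK_CONST_SECONDS   5      /* minimum residency, any size      */
    #define LOCK_SECONDS_PER_MW  10     /* additional residency per megaword */

    struct resident_proc {
        double size_mw;                 /* process size in megawords        */
        long   swapin_time;             /* time (seconds) of last swap-in   */
    };

    /* A process may not be chosen for swap-out while its thrash lock is
     * held.  Large processes, which are expensive to transfer over a
     * low-bandwidth swapping device, earn proportionally longer residency,
     * so the A/B interchange described above is damped.                    */
    int thrash_locked(const struct resident_proc *p, long now)
    {
        long hold = LOCK_CONST_SECONDS
                  + (long)(LOCK_SECONDS_PER_MW * p->size_mw);
        return (now - p->swapin_time) < hold;
    }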
Another inefficiency we have discovered in sched is the following. Once sched has ordered the swap-in queue and chosen the best swap-in candidate, no prescan of the in-memory processes is done to ensure that adequate central memory can be reclaimed to swap it in: sched simply assumes it can and begins sequentially swapping out all processes of lower priority than the swap-in candidate. The result is unnecessary swapping and a lower degree of multiprogramming, resulting in lower CPU utilization.

Yet another recent discovery is as follows. Suppose sched has chosen a large process A for swap-in and there exists a process B which is locked in due to raw I/O. sched will not consider B for swap-out and might swap out all other processes, including those with higher priorities than B. After all processes other than B have been swapped out, if there is inadequate contiguous central memory to swap A in, sched sets a flag for bio to idle down B. One oversight here is that B might be a roadhog and thus must be swapped out for A to fit! Even worse, it can be the case that B is the only process sched needed to swap out to fit A, and swapping out all other processes first was superfluous, not to mention inefficient!

As a final example, there is packmem, the module called by sched to pack central memory when no single gap is adequate to swap in some process. The packmem algorithm does not pack memory but rather scans processes in storage address order, attempting to move each process into a gap adequate to contain the entire process space at a lower storage address. The concept of storage-moving a process downward into its lower gap (or upward into its upper gap), independent of that gap's size, is simply not employed. packmem can actually increase fragmentation at worst, and in many cases only does unnecessary storage moves at best! Perhaps the most unfortunate side effect of the packmem algorithm is that swap-outs followed by swap-ins are sometimes required to accomplish central memory packing!
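For contrast, the sketch below shows the storage-move strategy packmem lacks: each resident process is simply slid downward against its predecessor, independent of the size of the gap beneath it, so that all free space coalesces into a single gap at the top of user memory. The data structures are hypothetical and greatly simplified (regions sorted by base address; locked-in processes ignored).

    /*
     * Sketch of packing by sliding processes downward into their lower
     * gaps.  Hypothetical, simplified structures; not the packmem code.
     */
    struct region {
        long base;                      /* base address of process image   */
        long size;                      /* length of the image             */
    };

    /* r[] holds resident processes sorted by increasing base address.
     * Each is moved down against the previous one, whatever the size of
     * the intervening gap.  Returns the base of the single free gap left
     * at the top of user memory.                                          */
    long compact_downward(struct region *r, int n)
    {
        long next_free = 0;             /* bottom of user memory           */
        for (int i = 0; i < n; i++) {
            if (r[i].base != next_free)
                r[i].base = next_free;  /* storage-move of the whole image */
            next_free = r[i].base + r[i].size;
        }
        return next_free;
    }

No swap-outs or swap-ins are required by this strategy; the price is the memory-to-memory copies themselves, which must of course respect locked-in processes in a real implementation.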
There are other scenarios which result in sched instabilities. Many of these have been discovered by Bill Jones of the CHPC, using Ted Kline's schedvsim simulator. Ted provided our systems programming staff with this valuable algorithm-modeling tool during our August workshop.

2.2.2 NQS Scheduling Problems

The first and most important problem we have with NQS is the lack of a direct coupling between NQS and the share scheduler. A simple example of the problem follows. Suppose a site prefers to schedule processes almost exclusively on the basis of share priorities. Suppose user A, who has recently used far in excess of his share of resources, submits "queue-limit" batch jobs via NQS. Now assume that user B, who has no recent usage, submits a batch job to the same queue. Since user A has used more than his fair share of resources recently, the share scheduler is going to allocate memory and CPU resources to A much less frequently than to other available batch jobs with lower recent usage, thus stalling A's progress through the system. But because A's stalled jobs nevertheless occupy the queue's run slots, B's job cannot be initiated, penalizing user B, the very user who deserves rapid turnaround!

Another problem we see is the lack of a parameter to specify filesystem resource limits. TMPDIR (and other) filesystem depletion could be controlled if we could schedule batch jobs on the basis of disk resources in addition to CPU, memory, and tape requirements.

3. Conclusions and Suggestions

3.1 Four-Megaword Development and Testing

A primary reason for our early discovery of the sched deadlocks and deficiencies related above is that such problems are very likely to occur under rich production loads on modest-resource machines. Sixteen- to sixty-four-megaword machines with VHISP swapping rates to SSD memories may mask such problems much longer than modest-resource machines. Yet four- and eight-megaword machines still constitute the majority of the X-MP installed base, a base wherein many sites plan to migrate to UNICOS prior to purchasing additional hardware. Primarily for these reasons, but also to increase developers' awareness of user memory constraints on modest-resource machines, we suggest that CRAY Research integrate modestly configured four- or eight-megaword X-MP's into the UNICOS development and production environments at Mendota Heights, so long as such machines maintain prominence in the distribution of CRAY Research's installed base.

3.2 New Category-Based Integrated Scheduler

During the two-day workshop with Ted Kline of CRAY Research, we presented some of our thoughts regarding an integrated job scheduler for UNICOS, not completely unlike that of COS. We believe the distribution of jobs on typical production supercomputers is too rich for the two simple categories of batch and interactive. Furthermore, we believe that job scheduling should be based on a combined "category-cost" concept, where batch job cost might typically be the product of process size and resource time remaining, while interactive job cost might be the product of process size and resource time accumulated since the last interaction. This tends to give modest-size and/or modest-resource-time-limit batch jobs higher priority, while simultaneously ensuring that large, long-running batch jobs receive higher priority as they near completion, allowing their resources to be freed for other jobs. The cost function suggested for interactive jobs gives truly interactive jobs good service, while the "interactive grinders" receive lower priority as they compute without interacting.

During the summary session of our workshop, Ted Kline indicated that he was considering the concept of site-specifiable categories with independent cost functions and central memory maxima for each category. This would allow each site to define its own process categories and associated cost functions. During a memory scheduler run, only runable processes would be sorted on the basis of category and cost. Central memory would then be packed in cost order within category memory cutoff limits. To maximize the degree of multiprogramming, a final pass would then be made with all category memory cutoffs set to remaining user memory. This strategy must be augmented with several complementary algorithms, such as swapping out long-term sleeping processes immediately, at least under conditions of substantial central memory contention. This is an oversimplification, of course, but it contains the essence of a paradigm we would like to see implemented.
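A minimal sketch of the category-cost idea follows. The two cost functions correspond to the batch and interactive examples above; lower cost means higher priority within a category. The names, units, and the notion of a per-category table are hypothetical.

    /*
     * Sketch of per-category cost functions for a category-based
     * integrated scheduler.  Hypothetical names and units.
     */
    struct job {
        double size;                    /* process size                     */
        double time_remaining;          /* resource time left (batch)       */
        double time_since_interact;     /* CPU time since last interaction  */
    };

    /* Batch: size times resource time remaining.  Small or short jobs, and
     * long jobs nearing completion, all end up with low cost and therefore
     * high priority.                                                       */
    double batch_cost(const struct job *j)
    {
        return j->size * j->time_remaining;
    }

    /* Interactive: size times resource time accumulated since the last
     * interaction.  Truly interactive jobs stay cheap; "interactive
     * grinders" drift toward batch-like cost as they compute without
     * interacting.                                                         */
    double interactive_cost(const struct job *j)
    {
        return j->size * j->time_since_interact;
    }

    /* During a memory-scheduler pass, runable jobs would be sorted by
     * (category, cost) and packed in cost order within each category's
     * memory cutoff, with a final pass over any remaining user memory.     */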
3.3 Central Memory Packing

The UNICOS 5.0 process scheduler suffers from a lack of emphasis on keeping central memory efficiently packed with as many runable processes as possible at all times. This maximization of the degree of multiprogramming is very critical in the single-CPU case, not to mention the case of multiple CPU's. Perhaps sched should be willing to look beyond the front of the swap-in queue when that process momentarily cannot fit. Several smaller processes may fit at such times, resulting in better resource utilization.

3.4 Size-Proportional Thrash Locks

We believe that provision of size-proportional thrash locks is also extremely important, especially for machines with low-bandwidth swapping devices.

3.5 Properly-Documented Tuning Parameters

In an attempt to allow scheduling flexibility, various tuning parameters were added to the UNICOS 5.0 process scheduler. The definitions above are paraphrased from the UNICOS 5.0 SCHEDV(1M) documentation. The parameters do not all function as specified in that documentation. More importantly, these parameters interact in some unexpected and unstable ways.

3.6 UNICOS Stability Requirement

We have made a long-term commitment to CRAY X-MP hardware and UNICOS software, predicated on the quality of both. Quality consists not only of "installability" but, more importantly, reliability, high performance, functionality, flexibility, and stability. If our user community is to migrate from our stable COS X-MP/24 platform to the UNICOS X-MP EA/14se, we must have these components of quality in all future releases of UNICOS.

4. Acknowledgements

I wish to acknowledge the help of Bill Jones and Dan Reynolds of the CHPC in both discovering and solving problems in the UNICOS 5.0 process scheduler. I wish to acknowledge Ted Kline of CRAY Research for his help in discovering and solving problems in the UNICOS 5.0 process scheduler, as well as his willingness to listen to our thoughts regarding process scheduling in the abstract. Finally, I wish to acknowledge Doug Engert for his discovery of a set of schedv parameters which partially offset the design deficiencies present in the UNICOS 5.0 memory scheduler and allow us to run production workloads for some job distributions.

__________________________
UNIX is a registered trademark of AT&T Bell Laboratories. CRAY, COS, UNICOS and X-MP are registered trademarks of CRAY Research, Inc. IBM and MVS are registered trademarks of International Business Machines. VAX and VMS are registered trademarks of Digital Equipment Corporation. CONVEX is a registered trademark of Convex Computer Corporation. Sun is a registered trademark of Sun Microsystems, Inc.