barnett@vdsvax.UUCP (09/01/87)
We have a resource problem related to insufficient swap space. We plan to get additional disks, but I was wondering if anyone has any advice on the proper way to handle this.

We have a machine (Sun 3/260) dedicated to simulations (HILO). Someone submits a large job that will take several hours, and it allocates quite a bit of virtual memory. Now someone submits a small job on the same machine, which allocates some more memory.

Now the large job runs out of swap space, and aborts after running for 20 hours. People scream, etc.

I could write a queuing daemon, but people don't want to wait for a 24 hour job to complete before their 20 minute job starts.

Assume only one machine. Assume no more disk space available.

Should we complain to the vendor for writing software that can't recover from the swap limitation? Is there any scheme that we can use that will allow us to work around this problem in the short term? (Yes, we plan to purchase a Sun 4.)

--
Bruce G. Barnett <barnett@ge-crd.ARPA> <barnett@steinmetz.UUCP>
uunet!steinmetz!barnett
hunt@spar.SPAR.SLB.COM (Neil Hunt) (09/02/87)
In article <2433@vdsvax.steinmetz.UUCP> barnett@steinmetz.UUCP (Bruce G Barnett) writes:
>
> We have a machine (Sun 3/260) dedicated to simulations (HILO).
>Someone submits a large job that will take several hours and it
>allocated quite a bit of virtual memory. Now someone submits a small
>job on the same machine, which allocates some more memory.
>
> Now the large job runs out of swap space, and aborts after
>running for 20 hours. People scream, etc.
>
> I could write a queuing daemon, but people don't want to wait
>for a 24 hour job to complete until their 20 minute job starts.
>

Here is a suggestion:

Assuming that it is [mc]alloc which is failing when you are out of swap space, try writing an alternative version of malloc which doesn't return NULL when memory cannot be allocated, but which prints a warning on the console and suspends the process.

A human, an operator, or a daemon could then wait until the system became less loaded, and restart the stopped process, which would succeed in allocating memory and proceed as if nothing had happened.

Neil/.
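A rough sketch of the suspend-and-retry allocator suggested above, in C. This is only an illustration under stated assumptions, not code from the thread: the names patient_alloc, patient_malloc, stop_self, flaky_alloc and count_wait are hypothetical, the warning goes to stderr rather than the console, and raise(SIGSTOP) assumes a POSIX-style system where an operator or daemon later sends SIGCONT. The allocator and the wait action are passed in as function pointers so the retry loop can be exercised without actually exhausting swap.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical hook types: an allocator and a "wait until resumed" action. */
typedef void *(*alloc_fn)(size_t);
typedef void (*wait_fn)(void);

/* Default wait action: stop this process until something sends it SIGCONT. */
void stop_self(void)
{
    raise(SIGSTOP);
}

/* Allocate size bytes; on failure, warn, wait, and retry instead of
 * returning NULL to the caller. */
void *patient_alloc(size_t size, alloc_fn alloc, wait_fn wait_for_memory)
{
    void *p;

    assert(alloc != NULL && wait_for_memory != NULL);
    if (size == 0)
        size = 1;   /* malloc(0) may legitimately return NULL; avoid looping on it */

    while ((p = alloc(size)) == NULL) {
        fprintf(stderr,
                "patient_alloc: cannot allocate %lu bytes; waiting to be resumed\n",
                (unsigned long)size);
        wait_for_memory();   /* execution continues here after the wait ends */
    }
    return p;
}

/* Convenience wrapper using the real malloc and a real process stop. */
void *patient_malloc(size_t size)
{
    return patient_alloc(size, malloc, stop_self);
}

/* Demo helpers (assumptions for illustration): an allocator that fails a
 * fixed number of times, and a wait action that only counts its calls. */
static int failures_left = 2;
static int waits_seen = 0;

void *flaky_alloc(size_t size)
{
    if (failures_left > 0) {
        failures_left--;
        return NULL;
    }
    return malloc(size);
}

void count_wait(void)
{
    waits_seen++;
}
```

Calling patient_alloc(64, flaky_alloc, count_wait) waits twice and then returns a usable pointer. Note that this only helps if the program can be relinked against the replacement allocator, and that on systems which overcommit memory malloc may never return NULL in the first place.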
barnett@vdsvax.steinmetz.UUCP (Bruce G Barnett) (09/04/87)
Re: my recovery from swap failure.

I have enjoyed the few suggestions I have gotten. But I believe that there is no solution for the situation I proposed. Remember - this is with a vendor's simulation program, so I can't hack the sources. (I will complain to the vendor about check-pointing.)

Even if I could, however, there is still a problem of recovery from a swap failure. To wit:

	Swap partition = 100 Meg
	Job A runs for 20 hours - allocates (say) 80 Meg
	. . .
	Job B (but same program as A) starts up, allocates 19 Meg
	. . .
	Job A needs 2 Meg more virtual memory - fails - aborts - riots start

Without check-pointing, it does no good for Job A to suspend. Job B will continue, suspend, and then Job C will start, suspend, etc.

Perhaps the software could detect a malloc failure and, given some parameter specified by the user, suspend or abort the job (small jobs abort, big jobs suspend - or oldest job suspends, newest job aborts).

As it turns out - we have a viable solution - multiple simulation machines! I will most likely implement:

	All simulation jobs go into a queue
	Big jobs going to the large machine
	Small jobs going to the big system if idle
	Otherwise, the small system(s).

Someone here has MDQS, which I will look into. Any (additional) ideas or suggestions will be appreciated.

--
Bruce G. Barnett <barnett@ge-crd.ARPA> <barnett@steinmetz.UUCP>
uunet!steinmetz!barnett
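The routing rule in the queueing plan above might be sketched in C as follows. This is an illustration only and has nothing to do with how MDQS works: the names classify_job and route_job are hypothetical, and telling big jobs from small ones by a user-declared memory estimate against a site-chosen threshold is an assumption, since the post does not say how jobs would be classified.

```c
#include <assert.h>

enum job_size { JOB_SMALL, JOB_BIG };
enum target   { TARGET_BIG_MACHINE, TARGET_SMALL_MACHINE };

/* Assumption: the user declares an expected peak memory figure (in Meg)
 * at submit time, and the site picks the threshold. */
enum job_size classify_job(unsigned long declared_meg, unsigned long threshold_meg)
{
    assert(threshold_meg > 0);
    return declared_meg >= threshold_meg ? JOB_BIG : JOB_SMALL;
}

/* Big jobs always go to the large machine (queueing there if it is busy).
 * Small jobs use the large machine only when it is idle; otherwise they
 * go to a small machine. */
enum target route_job(enum job_size size, int big_machine_idle)
{
    if (size == JOB_BIG)
        return TARGET_BIG_MACHINE;
    return big_machine_idle ? TARGET_BIG_MACHINE : TARGET_SMALL_MACHINE;
}
```

With a threshold of, say, 50 Meg, the 80 Meg Job A from the example would be classed as big and always routed to the large machine, while the 19 Meg Job B would go there only if that machine were idle.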