[comp.unix.wizards] Recovery from swap failure

barnett@vdsvax.UUCP (09/01/87)

We have a resource problem related to insufficient swap space.
We plan to get additional disks, but I was wondering if anyone has any
advice on the proper way to handle this.

	We have a machine (Sun 3/260) dedicated to simulations (HILO).
Someone submits a large job that will take several hours, and it
allocates quite a bit of virtual memory. Now someone submits a small
job on the same machine, which allocates some more memory. 

	Now the large job runs out of swap space, and aborts after
running for 20 hours. People scream, etc.

	I could write a queuing daemon, but people don't want to wait
for a 24 hour job to complete until their 20 minute job starts.

	Assume only one machine. 
	Assume no more disk space available.

	Should we complain to the vendor for writing software that
can't recover from the swap limitation?

	Is there any scheme that we can use that will allow us to work
around this problem in the short term? (Yes, we plan to purchase a Sun 4.)

-- 
	Bruce G. Barnett 	<barnett@ge-crd.ARPA> <barnett@steinmetz.UUCP>
				uunet!steinmetz!barnett

hunt@spar.SPAR.SLB.COM (Neil Hunt) (09/02/87)

In article <2433@vdsvax.steinmetz.UUCP> barnett@steinmetz.UUCP (Bruce G Barnett) writes:
>
>	We have a machine (Sun 3/260) dedicated to simulations (HILO).
>Someone submits a large job that will take several hours and it
>allocated quite a bit of virtual memory. Now someone submits a small
>job on the same machine, which allocates some more memory. 
>
>	Now the large job runs out of swap space, and aborts after
>running for 20 hours. People scream, etc.
>
>	I could write a queuing daemon, but people don't want to wait
>for a 24 hour job to complete until their 20 minute job starts.
>

Here is a suggestion:

Assuming that it is [mc]alloc that fails when you are out of swap space,
try writing an alternative version of malloc that does not return NULL
when memory cannot be allocated, but instead prints a warning on the console
and suspends the process. A human operator or a daemon could then
wait until the system becomes less loaded and restart the stopped
process, which would then succeed in allocating memory and proceed as if
nothing had happened.
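
A minimal sketch of that idea in C follows. The name patient_malloc is
hypothetical, the warning goes to stderr as a stand-in for the console,
and the program must be relinked so that it calls this routine instead
of malloc.

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * patient_malloc: hypothetical replacement for malloc() that never
 * returns NULL. When an allocation fails it prints a warning and stops
 * the calling process. Whoever later sends SIGCONT causes the
 * allocation to be retried.
 */
void *patient_malloc(size_t size)
{
    void *p;

    if (size == 0)
        size = 1;   /* malloc(0) may return NULL without being a failure */

    while ((p = malloc(size)) == NULL) {
        /* Assumption: stderr stands in for the system console. */
        fprintf(stderr,
                "pid %ld: cannot allocate %lu bytes; stopped, "
                "send SIGCONT to retry\n",
                (long)getpid(), (unsigned long)size);
        fflush(stderr);

        /* Stop ourselves; execution resumes here after SIGCONT. */
        kill(getpid(), SIGSTOP);
    }
    return p;
}
```

Once the load has dropped, an operator or a daemon resumes the job with
"kill -CONT <pid>", and the loop retries the allocation. A calloc
wrapper could be written the same way.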

Neil/.

barnett@vdsvax.steinmetz.UUCP (Bruce G Barnett) (09/04/87)

Re: my recovery from swap failure.

I have enjoyed the few suggestions I have received, but I believe
that there is no solution to the situation I described.

Remember - this is a vendor's simulation program, so I can't
hack the sources. (I will complain to the vendor about check-pointing.)

Even if I could, however, there would still be the problem of recovering from a swap failure.

To wit:
	Swap partition = 100 Meg
	Job A runs for 20 hours - allocates (say) 80 Meg
	. . .
	Job B (but same program as A) starts up, allocates 19 Meg
	. . .
	Job A needs 2 Meg more virtual memory - fails - aborts - riots start

Without check-pointing, it does no good for Job A to suspend, because a
suspended job still holds its swap space. Job B will continue and then
suspend in turn; then Job C will start and suspend, and so on.

	Perhaps the software could detect a malloc failure and, given
some parameter specified by the user, suspend or abort the job (small
jobs abort, big jobs suspend - or the oldest job suspends and the newest
job aborts).
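
As a rough illustration of such a user-specified parameter, the sketch
below maps a policy string to an action taken on allocation failure.
All names here (parse_policy, on_alloc_failure, the "suspend" and
"by-size" keywords, the size threshold) are hypothetical. The
oldest/newest variant is not covered, since it would require knowing
about the other jobs on the machine.

```c
#include <stddef.h>
#include <string.h>

/* What a job does when it cannot get more memory. */
enum swap_action { ACT_ABORT, ACT_SUSPEND };

/* Hypothetical user-selectable policies. */
enum swap_policy { POLICY_ABORT, POLICY_SUSPEND, POLICY_BY_SIZE };

/* Parse the user's setting; anything unrecognized (or NULL) means abort. */
enum swap_policy parse_policy(const char *s)
{
    if (s != NULL && strcmp(s, "suspend") == 0)
        return POLICY_SUSPEND;
    if (s != NULL && strcmp(s, "by-size") == 0)
        return POLICY_BY_SIZE;
    return POLICY_ABORT;
}

/*
 * Decide what to do after a failed allocation.
 * bytes_held    - memory this job has already allocated
 * big_job_bytes - assumed threshold at or above which a job counts as "big"
 */
enum swap_action on_alloc_failure(enum swap_policy policy,
                                  size_t bytes_held,
                                  size_t big_job_bytes)
{
    switch (policy) {
    case POLICY_SUSPEND:
        return ACT_SUSPEND;
    case POLICY_BY_SIZE:
        /* Small jobs abort, big jobs suspend. */
        return bytes_held >= big_job_bytes ? ACT_SUSPEND : ACT_ABORT;
    default:
        return ACT_ABORT;
    }
}
```

With the figures from the example above and an assumed 50 Meg
threshold, the by-size policy would suspend Job A (80 Meg held) and
abort Job B (19 Meg held).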

As it turns out - we have a viable solution - multiple simulation machines!
I will most likely implement:
	All simulation jobs go into a queue
	Big jobs go to the large machine
	Small jobs go to the large machine if it is idle
	Otherwise, they go to the small system(s)
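
The routing rule above could be sketched as a single decision function.
The names are hypothetical, and this is not the interface of MDQS or
any other queuing package. It assumes a big job simply waits in the
queue if the large machine is busy.

```c
/* Where a queued simulation job should run. */
enum machine { BIG_MACHINE, SMALL_MACHINE };

/*
 * route_job: pick a machine for the job at the head of the queue.
 * job_is_big       - nonzero if the job was submitted as a big job
 * big_machine_idle - nonzero if the large machine has nothing running
 */
enum machine route_job(int job_is_big, int big_machine_idle)
{
    /* Big jobs always go to the large machine; small jobs only if it is idle. */
    if (job_is_big || big_machine_idle)
        return BIG_MACHINE;
    return SMALL_MACHINE;
}
```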

Someone here has MDQS, which I will look into. Any (additional) ideas
or suggestions will be appreciated.


	
-- 
	Bruce G. Barnett 	<barnett@ge-crd.ARPA> <barnett@steinmetz.UUCP>
				uunet!steinmetz!barnett