[mod.computers.vax] Warning concerning SUSPEND's on VMS V4.x - beware, NANNY!

LEICHTER-JERRY@YALE.ARPA (09/06/86)

    	2) Because Nanny pokes all processes during each cycle, the
    	   VMS V4 algorithm for swapping is destroyed. As a result,
    	   Nanny will *suspend* (not swap) low priority batch jobs when
    	   memory usage is high and resume them when usage reaches
    	   a certain level (both levels are run-time variables you
    	   may change at will). If a process is suspended, that
    	   process is the best candidate for being swapped out next.

VMS V4.x has a problem with SUSPEND - it is possible to completely hang a
cluster by suspending a process!  Actually, the problem has been there for
quite some time - since V3.0 or thereabouts - but for reasons I'll describe
was extremely rare until V4.4.  It's STILL quite rare, but it CAN happen.

Basically, the problem is one of synchronization.  RMS uses the lock manager
to synchronize access to files and pieces of files.  If a process becomes
suspended while it is holding a lock, it will be unable to release the lock
until it is RESUMEd.  As a result, any other processes needing access to the
lock will wait indefinitely.

In most cases, the lock is on a user file, and only other programs needing
access to that file can get into trouble.  This is no different in principle
from what happens if you suspend a process with USER-mode locks.  HOWEVER,
there are system-wide files that processes need access to at times.  For
example, LOGINOUT must access the UAF.  If a process running LOGINOUT were
suspended at just the wrong time, the UAF could become effectively
inaccessible.  This would block all logins on an entire cluster.  Likely?
Not very.  Possible?  Yes.

Prior to V4.4, RMS held locks for the minimum time necessary - i.e., the code
was { acquire lock; handle protected object; release lock }.  This made the
window in which a SUSPEND was dangerous very small.  (In fact, it was probably
the most common case that the "handle protected object" code would finish
before another process was able to run, so there was no REAL window at all.)
V4.4 changed that; for performance reasons, RMS now holds on to some locks for
long periods of time, releasing them only when it receives a blocking AST.
Unfortunately, SUSPEND blocks AST delivery, so a suspended process never sees
the blocking AST and never gives up the lock....
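
To make the interaction concrete, here is a toy model in Python.  It is only a
sketch of the behavior described above, not of RMS or the VMS lock manager:
the Process and CachedLock classes, their methods, and the process names are
all invented for illustration.  The one rule it takes from the discussion is
that a holder keeps its lock until a blocking AST arrives, and that ASTs are
not delivered to a suspended process.

```python
class Process:
    """Toy process: ASTs queue up and are only delivered while not suspended."""

    def __init__(self, name):
        self.name = name
        self.suspended = False
        self.pending_asts = []

    def queue_ast(self, ast):
        self.pending_asts.append(ast)
        self.deliver_asts()

    def deliver_asts(self):
        if self.suspended:  # rule from the post: SUSPEND blocks AST delivery
            return
        while self.pending_asts:
            self.pending_asts.pop(0)()

    def resume(self):
        self.suspended = False
        self.deliver_asts()


class CachedLock:
    """Toy lock that its holder keeps until a blocking AST asks for it back."""

    def __init__(self):
        self.holder = None
        self.waiters = []

    def request(self, proc):
        """Return True if proc holds the lock when the call returns."""
        if self.holder is None:
            self.holder = proc
            return True
        self.waiters.append(proc)
        self._send_blocking_ast(self.holder)
        return self.holder is proc

    def _send_blocking_ast(self, owner):
        owner.queue_ast(lambda: self._give_up(owner))

    def _give_up(self, owner):
        if self.holder is not owner or not self.waiters:
            return  # stale AST, or nobody is waiting any more
        self.holder = self.waiters.pop(0)
        if self.waiters:
            self._send_blocking_ast(self.holder)


rms = Process("RMS_USER")       # hypothetical process names
login = Process("LOGINOUT")
uaf = CachedLock()

uaf.request(rms)                # rms takes the lock and keeps it cached
rms.suspended = True            # someone suspends rms
granted = uaf.request(login)    # blocking AST is queued but never delivered
print(granted, uaf.holder.name) # prints: False RMS_USER
rms.resume()                    # the queued AST is delivered at last
print(uaf.holder.name)          # prints: LOGINOUT
```

In this model the second request stays ungranted for exactly as long as the
holder stays suspended, which is the hang described above in miniature.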

So, the safety rules for SUSPEND these days are:

	- Never SUSPEND your own process while you have asynchronous RMS
		I/O outstanding;

	- Avoid programs that SUSPEND each other unless you can make sure
		that the program you are suspending isn't doing RMS I/O;

	- Pretend the DCL SET PROCESS/SUSPEND command doesn't exist, unless
		you are damn sure you know what the process you are suspending
		is doing;

	- Avoid applying the STOP/QUEUE command to batch queues containing
		executing jobs; it causes the queue manager to SUSPEND those
		jobs.

Note that because hibernating processes CAN receive AST's, it is safe for a
process to go into hibernation while it has RMS I/O active.  Unfortunately,
there is no way to put ANOTHER process into hibernation.

Should you find your system - really, cluster - hanging in a way that seems to
indicate a "fly-paper" file - for example, everyone who tries to log in hangs,
or everyone trying to run some program hangs - try restarting any stopped
batch queues and using SET PROCESS/RESUME to un-suspend any suspended
processes.  If the problem I've discussed is the cause of the hang, your system
will come back to life the moment you RESUME the right process.
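
The recovery procedure amounts to a simple search, sketched below in Python
as a toy simulation.  Nothing here talks to a real VMS system: Proc, FileLock,
and recover are made-up names, and the model assumes that once the right
process is runnable again it gives up the lock and every waiter gets through.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Proc:
    name: str
    suspended: bool = False


@dataclass
class FileLock:
    holder: Optional[Proc] = None
    waiters: List[Proc] = field(default_factory=list)

    def hung(self):
        """True when waiters are stuck behind a suspended holder."""
        return (self.holder is not None and self.holder.suspended
                and bool(self.waiters))

    def run(self):
        """A runnable holder gives up the lock; the waiters then get through."""
        if self.holder is not None and not self.holder.suspended:
            self.holder = None
            self.waiters.clear()


def recover(procs, lock):
    """Resume suspended processes one at a time.

    Returns the name of the process whose resume cleared the hang,
    or None if there was no hang or nothing cleared it.
    """
    if not lock.waiters:
        return None
    for p in procs:
        if not p.suspended:
            continue
        p.suspended = False      # stands in for SET PROCESS/RESUME
        lock.run()
        if not lock.waiters:
            return p.name
    return None


# Hypothetical cluster: two suspended processes, one of which holds the lock.
procs = [Proc("BATCH_1", True), Proc("LOGINOUT_7", True), Proc("USER_A")]
lock = FileLock(holder=procs[1], waiters=[procs[2]])
culprit = recover(procs, lock)
print(culprit)                   # prints: LOGINOUT_7
```

Note that the search resumes BATCH_1 along the way even though it was not the
culprit; as in the advice above, you generally cannot tell in advance which
suspended process is the one holding things up.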

							-- Jerry
-------