LEICHTER-JERRY@YALE.ARPA (09/06/86)
2) Because Nanny pokes all processes during each cycle, the VMS V4 algorithm for swapping is destroyed. As a result, Nanny will *suspend* (not swap) low-priority batch jobs when memory usage is high and resume them when usage reaches a certain level (both levels are run-time variables you may change at will). If a process is suspended, that process is the best candidate for being swapped out next.

VMS V4.x has a problem with SUSPEND - it is possible to completely hang a cluster by suspending a process! Actually, the problem has been there for quite some time - since V3.0 or thereabouts - but for reasons I'll describe was extremely rare until V4.4. It's STILL quite rare, but it CAN happen.

Basically, the problem is one of synchronization. RMS uses the lock manager to synchronize access to files and pieces of files. If a process becomes suspended while it is holding a lock, it will be unable to release the lock until it is RESUMEd. As a result, any other processes needing access to the lock will wait indefinitely.

In most cases, the lock is on a user file, and only other programs needing access to that file can get into trouble. This is no different in principle from what happens if you suspend a process with USER-mode locks. HOWEVER, there are system-wide files that processes need access to at times. For example, LOGINOUT must access the UAF. If a process running LOGINOUT were suspended at just the wrong time, the UAF could become effectively inaccessible. This would block all logins on an entire cluster. Likely? Not very. Possible? Yes.

Prior to V4.4, RMS held locks for the minimum time necessary - i.e., the code was { acquire lock; handle protected object; release lock }. This made the window in which a SUSPEND was dangerous very small. (In fact, it was probably the most common case that the "handle protected object" code would finish before another process was able to run, so there was no REAL window at all.)
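To make the hazard concrete, here is a minimal sketch of a holder that is "suspended" between acquiring and releasing a lock. It is a simulation in Python threads, not VMS code: the names run_demo, file_lock, holding and resume are all made up for illustration, the Event stands in for SUSPEND/RESUME, and the waiter uses a timeout only so that the demonstration terminates (a real waiter would simply hang).

```python
import threading


def run_demo(wait_timeout=0.2):
    """Simulate a lock holder that is 'suspended' inside its critical section.

    Everything here is a stand-in: file_lock plays the role of a lock on a
    shared file, and the resume event plays the role of RESUME.
    """
    file_lock = threading.Lock()   # hypothetical lock on a shared file
    holding = threading.Event()    # set once the holder owns the lock
    resume = threading.Event()     # set when the holder is "resumed"
    events = []

    def holder():
        with file_lock:            # acquire lock
            holding.set()
            resume.wait()          # "suspended": cannot reach the release
            events.append("holder: handled protected object")
        # the lock is released on leaving the with-block

    worker = threading.Thread(target=holder)
    worker.start()
    holding.wait()                 # make sure the holder really has the lock

    # A second "process" needs the same lock while the holder is suspended.
    got_it = file_lock.acquire(timeout=wait_timeout)
    events.append(f"waiter before resume: acquired={got_it}")
    if got_it:
        file_lock.release()

    resume.set()                   # the equivalent of RESUMEing the holder
    worker.join()

    got_it = file_lock.acquire(timeout=wait_timeout)
    events.append(f"waiter after resume: acquired={got_it}")
    if got_it:
        file_lock.release()
    return events


if __name__ == "__main__":
    for line in run_demo():
        print(line)
```

The waiter fails to get the lock for as long as the holder stays suspended, and succeeds as soon as the holder is resumed and runs to the release.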
V4.4 changed that; for performance reasons, RMS now holds on to some locks for long periods of time, releasing them only when it receives a blocking AST. Unfortunately, SUSPENDs block AST delivery....

So, the safety rules for SUSPEND these days are:

- Never SUSPEND your own process while you have asynchronous RMS I/O outstanding;

- Avoid programs that SUSPEND each other unless you can make sure that the program you are suspending isn't doing RMS I/O;

- Pretend the DCL SET PROCESS/SUSPEND command doesn't exist, unless you are damn sure you know what the process you are suspending is doing;

- Avoid applying the STOP/QUEUE command to batch queues containing executing jobs; it causes the queue manager to SUSPEND those jobs.

Note that because hibernating processes CAN receive ASTs, it is safe for a process to go into hibernation while it has RMS I/O active. Unfortunately, there is no way to put ANOTHER process into hibernation.

Should you find your system - really, cluster - hanging in a way that seems to indicate a "fly-paper" file - for example, everyone who tries to log in hangs, or everyone trying to run some program hangs - try re-starting any stopped batch queues and using SET PROCESS/RESUME to un-suspend any suspended processes. If the problem I've discussed is the cause of the hang, your system will come back to life the moment you RESUME the right process.

-- Jerry
-------