STEINBERGER@SRI-KL.ARPA.UUCP (03/14/87)
Sometimes a process gets into an RWAST state. If you do a SHOW SYSTEM you can see this. For example, if you have a TK-50 tape drive and you give it a command (e.g. MOUNT) before the green light comes on, the process will often get into an RWAST state and NEVER get out of it. I've had to reboot the system. I'm more careful now - I always wait for the green light!

However, when I'm debugging some new hardware and software, sometimes a process will get into an RWAST state and DCL cannot kill it. I spoke to the DEC folks in Colorado and they confirmed that reboot is the only option here (I had been using STOP/ID=nnn, which gives no error message, but doesn't do anything either).

Can anyone give me a more favorable second opinion? Is rebooting the only option to getting rid of a process in the RWAST state? If so, why does DEC allow VMS to create processes that it can't delete?

Thanks to all who reply.

Rebooting in Menlo Park, CA,

Ric Steinberger
steinberger@kl.sri.com
(415) 859 - 5985

PS - Implicit in all of this is the fact that the AST will not be delivered due to some kind of hardware or hardware/software interaction problem. The process is in a "Waiting for Godot" state.
-------
GKN@SDSC-SDS.ARPA.UUCP (03/15/87)
    From: Richard Steinberger <STEINBERGER@SRI-KL.ARPA>
    Subject: Processes in a RWAST state
    Date: Sat 14 Mar 87 10:33:01-PST

    ... Is rebooting the only option to getting rid of a process in the RWAST state? If so, why does DEC allow VMS to create processes that it can't delete?

Sigh. This is probably one of the things that causes the most frustration to people who do not have extensive backgrounds in VMS internals.

To answer your question: yes, rebooting is sometimes the only way to get rid of a process in RWAST state.

[allow me to apologize in advance for the length of this one...]

Now that that's over with... a little more information on RWAST, and resource waits (what the RW stands for) in general.

Resource wait is an involuntary wait state that VMS will place a process in while waiting for a resource to become available. This resource can usually be identified by the rest of the name (RWxxx) displayed by SHOW SYSTEM. Resource waits are a subset of the scheduler state SCH$C_MWAIT, which is shorthand for miscellaneous resource/MUTEX wait. A MUTEX is a mutual-exclusion semaphore, which VMS uses internally to protect various data structures, such as the I/O database, logical name tables, and a few other odds and ends which really aren't important in this discussion.

You can tell the difference between a MUTEX wait and a resource wait by examining the event flag wait mask (PCB$L_EFWM in the PCB, JPI$_EFWM) if the process' scheduler state is SCH$C_MWAIT. If PCB$L_EFWM is negative, then it is the system-space address of the MUTEX which the process is attempting to gain access to. If it isn't, it's a small positive integer which indicates which resource wait state the process is in.

These are the known resource waits (as of VMS V4.x) and a little bit about what they mean:

     1  RWAST  Waiting for an AST (see below).
     2  RWMBX  Mailbox full. A process attempted to write more data into a mailbox than the buffer quota for that mailbox allows.
     3  RWNPG  Waiting for non-paged pool.
     4  RWPFF  Page file full. The page file to which this process is assigned is full.
     5  RWPAG  Waiting for paged pool.
     6  RWBRK  $BRKTHRU wait.
     7  RWIMG  Image activator interlock.
     8  RWQUO  A pooled quota has been exceeded. Use SDA to figure out which one.
     9  RWLCK  Lock ID database is full? (can anybody fill me in?).
    10  RWSWP  Swap file full.
    11  RWMPE  Waiting for the modified page writer to empty the modified page list.
    12  RWMPB  Waiting for the modified page writer (to do something, but I'm not sure what. Can anybody fill me in?).
    13  RWSCS  Waiting for a systems communications services (cluster) event.
    14  RWCLU  Waiting for a cluster transition.

RWAST is sort of the catch-all resource wait state. Routines inside VMS will place a process into RWAST hoping that the next kernel mode AST queued to the process will have called SCH$RAVAIL to report resource availability for the resource in question.

By far the most popular reason to get placed in RWAST is a lack of a non-pooled quota, typically BYTLM, BIOLM, DIOLM or ASTLM (this is not an exhaustive list, but you get the idea). The BYTLM quota as it comes "out of the box" from DEC (4096 bytes) in the default account is so pitifully small that you can't even use DECnet effectively. For normal users I grant a BYTLM quota of 24000 bytes. You can tell if this is happening by using SDA on the running system to examine the PCB (the SDA SHOW PROCESS/INDEX=nn/PCB command) for the process that's stuck in RWAST. If some of the non-pooled quotas have zeroes for the "count" portion (really, the amount remaining) then there's your problem.

Another popular reason for processes getting stuck in RWAST is a hangup in last-channel deassign. You can tell if this is happening if you look at the process in question with SDA and see EXE$DASSGN+6D or so floating near the top of that process' stack (use the SDA SHOW STACK/INDEX=nnn command).
In this case R6 points to a data structure called a CCB (channel control block) which will generally have outstanding I/O that VMS is waiting to have complete before it completely deassigns the I/O channel. If a process is stuck for this reason you can sometimes unstick it by jostling whatever I/O device is involved (you can figure out which by deciphering the CCB, or using the SDA SHOW PROCESS/CHANNELS/INDEX=nnn command).

This is not an exhaustive list of why processes get placed in RWAST; it's just two cases I see a fair amount. There are probably dozens of other scenarios.

Why can't you kill a process in RWAST, you ask? Well, VMS put the process in that state to wait for a given resource. To delete that process before the resource becomes available could leave the system in an inconsistent state, or cause system data structures to be corrupted. The actual mechanics behind it are such that the special kernel mode AST to delete the process remains queued until some process in VMS calls SCH$RAVAIL with the appropriate arguments to knock the process out of resource wait, at which time the process will be deleted.

It is possible to break a process out of a quota-related resource wait by writing a little program to go patch the PCB and quota in question and call SCH$RAVAIL, but it's definitely not for the novice (and probably there are real reasons why this shouldn't be done, either, but I've successfully done it).

Could someone from DEC (or anyplace else, for that matter) please comment on the interpretations of RWxxx above if I'm completely off base on some of them? This information could really be helpful if more people had it.

gkn
--------------------------------------
Arpa:   GKN@SDSC.ARPA
Bitnet: GKN@SDSC
Span:   SDSC::GKN (5.600)
USPS:   Gerard K. Newman
        San Diego Supercomputer Center
        P.O. Box 85608
        San Diego, CA 92138
AT&T:   619.534.5076
-------
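The PCB$L_EFWM rule and the numbered list of resource waits in the post above can be sketched as a small decoder. This is an illustration in Python, not VMS code: the function name `classify_mwait` and the example mutex address are made up, and the only facts used are the ones stated in the post (a negative longword is a system-space mutex address, a small positive integer is a resource wait number).

```python
# Toy decoder for the PCB$L_EFWM rule described in the post above.
# Illustration only: classify_mwait is a hypothetical name, not a VMS routine.

RESOURCE_WAITS = {
    1: "RWAST", 2: "RWMBX", 3: "RWNPG", 4: "RWPFF", 5: "RWPAG",
    6: "RWBRK", 7: "RWIMG", 8: "RWQUO", 9: "RWLCK", 10: "RWSWP",
    11: "RWMPE", 12: "RWMPB", 13: "RWSCS", 14: "RWCLU",
}


def classify_mwait(efwm: int) -> str:
    """Classify a PCB$L_EFWM value for a process whose state is SCH$C_MWAIT."""
    efwm &= 0xFFFFFFFF                # treat the input as a raw 32-bit longword
    if efwm & 0x80000000:             # sign bit set: negative as a signed longword
        return f"MUTEX wait on mutex at {efwm:08X}"
    name = RESOURCE_WAITS.get(efwm)
    if name is None:
        return f"unknown resource wait number {efwm}"
    return f"resource wait {name}"


print(classify_mwait(1))              # a resource wait number from the list
print(classify_mwait(0x80002A40))     # made-up system-space address
```

The table is keyed by the numbers exactly as listed in the post for VMS V4.x; other versions may differ, so an unknown number is reported rather than guessed.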
LEICHTER-JERRY@YALE.ARPA.UUCP (03/17/87)
    [W]hen I'm debugging some new hardware and software, sometimes a process will get into an RWAST state and DCL cannot kill it. I spoke to the DEC folks in Colorado and they confirmed that reboot is the only option here (I had been using STOP/ID=nnn, which gives no error message, but doesn't do anything either).

    Can anyone give me a more favorable second opinion? Is rebooting the only option to getting rid of a process in the RWAST state? If so, why does DEC allow VMS to create processes that it can't delete?

RWAST is a kind of a catch-all wait state: It means the process needs some resource that it currently can't get, but it is expected that the delivery of an AST will make some of that resource available. A common example occurs when a process issues a QIO for buffered I/O but already has as many outstanding buffered I/O requests as it is allowed (BIOLM). A completely different example occurs when a process that has ACP/XQP work outstanding is deleted; it will wait in RWAST state until that work completes.

Not all RWAST processes are undeletable. The example you cite involving the TK50 is probably undeletable because it is waiting for the mag tape ACP to finish the operation it requested, which will never happen. VMS can't delete the process because it would leave the ACP confused. (The ACP might possibly be taught to deal with that, but in the case of XQP operations, which are really taking place within the process, deleting the process while the XQP is busy stands a good chance of corrupting a disk.)

There will always be cases in any operating system in which a process hangs while in some state - such as holding some important resource like a low-level disk structure, which it may or may not have modified into some intermediate state - in which it cannot be safely deleted. Such a "dead" process often isn't really doing anything harmful - it's just that the OS can't be SURE, so has to assume the worst. It costs you very little to just leave the RWAST (or whatever) process alone; in most cases, you don't HAVE to reboot.

-- Jerry
-------
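The behaviour described in this thread (a buffered I/O request stalls when BIOLM is used up, a delete request stays queued, and the wait ends only when a completion AST returns quota) can be sketched as a toy state machine. This is a simplified Python illustration, not VMS code: `ToyProcess` and its methods are hypothetical names, it models only the undeletable case, and it ignores the fact that a real process would retry the stalled request.

```python
# Toy model of a quota-related RWAST wait with a pending delete.
# Not VMS code: ToyProcess and its methods are hypothetical names.

class ToyProcess:
    def __init__(self, biolm: int):
        self.bio_remaining = biolm      # buffered I/O requests still allowed
        self.state = "RUNNING"
        self.delete_pending = False

    def start_buffered_io(self) -> None:
        """Issue one buffered I/O request, or stall if the quota is used up."""
        if self.state != "RUNNING":
            return                      # a stalled or deleted process issues nothing
        if self.bio_remaining == 0:
            self.state = "RWAST"        # wait for a completion AST to return quota
        else:
            self.bio_remaining -= 1

    def request_delete(self) -> None:
        """Like STOP/ID: in this model it is deferred while the process is in RWAST."""
        if self.state == "RWAST":
            self.delete_pending = True  # delete stays queued, no error reported
        else:
            self.state = "DELETED"

    def io_completion_ast(self) -> None:
        """One outstanding I/O finishes: quota comes back, the wait can end."""
        if self.state == "DELETED":
            return
        self.bio_remaining += 1
        if self.state == "RWAST":
            self.state = "DELETED" if self.delete_pending else "RUNNING"


p = ToyProcess(biolm=2)
p.start_buffered_io()
p.start_buffered_io()
p.start_buffered_io()       # third request exceeds the quota
print(p.state)              # state after the stalled request
p.request_delete()
print(p.state)              # state after the delete request
p.io_completion_ast()
print(p.state)              # state after one I/O completes
```

If `io_completion_ast` is never called, as with the hung TK50 request, the model stays in RWAST with the delete still queued, which is the situation the original question describes.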
LEICHTER-JERRY@YALE.ARPA.UUCP (03/17/87)
Some additional information about some of the resource waits you mention:

    4 RWPFF  Page file full. The page file to which this process is assigned is full.

Symbol and meaning defined, but never used in the code in either V3 or V4.

    5 RWPAG  Waiting for paged pool.

Never used in V3, used in V4.

    6 RWBRK  $BRKTHRU wait.

Used in V3, obsolete and never used in V4. (The V3 broadcast mechanism was very different and had hooks deep into many pieces of the system. The V4 mechanism is much cleaner and works with, rather than against, the terminal driver.)

    7 RWIMG  Image activator interlock.

Used in V3, obsolete in V4. (Implemented an interlock between INSTALL and the image activator. Now, the lock manager is used.)

    8 RWQUO  A pooled quota has been exceeded. Use SDA to figure out which one.

Symbol and meaning defined, but never used in the code in either V3 or V4.

    9 RWLCK  Lock ID database is full? (can anybody fill me in?).

Symbol and meaning defined, but never used in the code in either V3 or V4.

    12 RWMPB  Waiting for the modified page writer (to do something, but I'm not sure what. Can anybody fill me in?).

Indicates that a process tried to fault a modified page out of its working set, but the modified page list was too big (larger than MPW_WAITLIMIT). This is an emergency self-protection mechanism that triggers when insufficient page file space, or other incorrect parameter settings, cause the modified page writer to get way behind - processes that try to create even more work for the modified page writer get throttled. (Transient occurrences of RWMPB are probably common. For example, setting MPW_WAITLIMIT to the same value as MPW_HILIMIT (which AUTOGEN seems to like to do) makes this quite likely.) Since modified page writing is done by the SWAPPER, which runs as a process at priority 16, high-priority real-time processes that hogged the system could also cause this situation.

    13 RWSCS  Waiting for a systems communications services (cluster) event.

Used by the cluster-wide lock manager.

    14 RWCLU  Waiting for a cluster transition.

Used by the cluster-wide lock manager.

Some waits (RWSCS, RWCLU, for example) occur at elevated priorities and are never interruptible for things like CTRL/Y or process deletion. Others (RWAST, RWMPE) may or may not be interruptible.

The wait states noted above as "Never used in the code" or "Obsolete in V4" may show up in SDA if the process state gets corrupted somehow.

-- Jerry
-------
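The RWMPB condition described above comes down to a comparison against MPW_WAITLIMIT, and the remark about AUTOGEN comes down to how much room there is between MPW_HILIMIT and MPW_WAITLIMIT. A minimal Python sketch follows. The function names and the page counts are invented for illustration, and it assumes (this is not stated in the post) that MPW_HILIMIT is the list size at which the modified page writer starts writing.

```python
# Toy check of the RWMPB condition described above. Not VMS code: the
# function names are hypothetical; only the parameter names come from the post.

def would_enter_rwmpb(modified_list_pages: int, mpw_waitlimit: int) -> bool:
    """True if the modified page list is larger than MPW_WAITLIMIT."""
    return modified_list_pages > mpw_waitlimit


def waitlimit_headroom(mpw_hilimit: int, mpw_waitlimit: int) -> int:
    """Pages the list may grow past MPW_HILIMIT before processes are throttled.

    Assumes MPW_HILIMIT is where the modified page writer begins writing.
    """
    return mpw_waitlimit - mpw_hilimit


# Arbitrary example values, chosen only to show the two cases.
print(would_enter_rwmpb(modified_list_pages=520, mpw_waitlimit=500))
print(waitlimit_headroom(mpw_hilimit=500, mpw_waitlimit=500))
```

With the two limits equal the headroom is zero, so under the stated assumption any growth of the list past the point where the writer starts would put the faulting process into RWMPB until the writer catches up.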