[mod.computers.vax] Processes in a RWAST state

STEINBERGER@SRI-KL.ARPA.UUCP (03/14/87)

Sometimes a process gets into an RWAST state.  If you do a SHOW SYSTEM you
can see this.  For example if you have a TK-50 tape drive and you give
it a command (e.g. mount) before the green light comes on, it will often
get in a RWAST state and NEVER get out of it.  I've had to reboot the
system.  I'm more careful now - I always wait for the green light!

However when I'm debugging some new hardware and software, sometimes a
process will get into an RWAST state and DCL cannot kill it.  I spoke
to the DEC folks in Colorado and they confirmed that reboot is the only
option here (I had been using STOP/ID=nnn, which gives no error message,
but doesn't do anything either).

Can anyone give me a more favorable second opinion?  Is rebooting the only
option to getting rid of a process in the RWAST state?  If so, why does DEC
allow VMS to create processes that it can't delete?

Thanks to all who reply.


Rebooting in Menlo Park, CA,

	Ric Steinberger
	steinberger@kl.sri.com

	(415) 859 - 5985


PS - Implicit in all of this is the fact that the AST will not be delivered
due to some kind of hardware or hardware/software interaction problem.
The process is in a "Waiting for Godot" state.


-------

GKN@SDSC-SDS.ARPA.UUCP (03/15/87)

	From:	 Richard Steinberger <STEINBERGER@SRI-KL.ARPA>
	Subject: Processes in a RWAST state
	Date:	 Sat 14 Mar 87 10:33:01-PST

	... Is rebooting the only option to getting rid of a process in the
	RWAST state?  If so, why does DEC allow VMS to create processes that
	it can't delete?

Sigh.  This is probably one of the things that causes the most frustration
to people who do not have an extensive background in VMS internals.

To answer your question, yes, rebooting is sometimes the only way to get
rid of a process in RWAST state.

[allow me to apologize in advance for the length of this one...]

Now that that's over with... a little more information on RWAST, and resource
waits (what the RW stands for) in general.  Resource wait is an involuntary
wait state that VMS will place a process in while waiting for a resource to
become available.  This resource can usually be identified by the rest of the
name (RWxxx) displayed by SHOW SYSTEM.

Resource waits are a subset of the scheduler state SCH$C_MWAIT, which is
shorthand for miscellaneous resource/MUTEX wait.  A MUTEX is a mutual-exclusion
semaphore, which is something VMS uses internally to protect various data
structures, such as the I/O database, logical name tables, and a few other
odds and ends which really aren't important in this discussion.  You can tell
the difference between a MUTEX wait and a resource wait by examining the event
flag wait mask (PCB$L_EFWM in the PCB, JPI$_EFWM) if the process' scheduler
state is SCH$C_MWAIT.  If PCB$L_EFWM is negative, then it is the system-space
address of the MUTEX which the process is attempting to gain access to.  If it
isn't, it's a small positive integer indicating which resource wait state
the process is in.
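
The sign test described above can be sketched in a few lines.  This is only
an illustration in Python, not VMS code: the function name classify_mwait and
the sample values in the usage note are made up, and it assumes the EFWM field
has been read as a raw 32-bit longword.

```python
def classify_mwait(efwm):
    """Classify an MWAIT process from the raw value of PCB$L_EFWM.

    efwm is the 32-bit longword as an integer; negative Python ints are
    accepted and folded into the 0..0xFFFFFFFF range first.
    """
    efwm &= 0xFFFFFFFF
    if efwm & 0x80000000:
        # High bit set: negative as a signed longword, which is how a
        # system-space address (the MUTEX) looks.
        return ("mutex", efwm)
    # Otherwise it is a small positive resource wait number.
    return ("resource", efwm)
```

For example, classify_mwait(0x80002A10) is reported as a MUTEX wait, while
classify_mwait(1) is reported as resource wait number 1.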

These are the known resource waits (as of VMS V4.x) and a little bit about
what they mean:

	1	RWAST	Waiting for an AST (see below).
	2	RWMBX	Mailbox full.  A process attempted to write more
			data into a mailbox than the buffer quota for that
			mailbox allows.
	3	RWNPG	Waiting for non-paged pool.
	4	RWPFF	Page file full.  The page file to which this process
			is assigned is full.
	5	RWPAG	Waiting for paged pool.
	6	RWBRK	$BRKTHRU wait.
	7	RWIMG	Image activator interlock.
	8	RWQUO	A pooled quota has been exceeded.  Use SDA to figure
			out which one.
	9	RWLCK	Lock ID database is full? (can anybody fill me in?).
	10	RWSWP	Swap file full.
	11	RWMPE	Waiting for the modified page writer to empty the
			modified page list.
	12	RWMPB	Waiting for the modified page writer (to do something,
			but I'm not sure what.  Can anybody fill me in?).
	13	RWSCS	Waiting for a systems communications services (cluster)
			event.
	14	RWCLU	Waiting for a cluster transition.
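
For anyone decoding the number by hand, the table above can be written as a
small lookup.  This is a sketch in Python that simply restates the V4.x list;
the names RESOURCE_WAITS and resource_wait_name are invented for the example.

```python
# Resource wait numbers and the names SHOW SYSTEM displays, as listed
# above for VMS V4.x.
RESOURCE_WAITS = {
    1: "RWAST", 2: "RWMBX", 3: "RWNPG", 4: "RWPFF", 5: "RWPAG",
    6: "RWBRK", 7: "RWIMG", 8: "RWQUO", 9: "RWLCK", 10: "RWSWP",
    11: "RWMPE", 12: "RWMPB", 13: "RWSCS", 14: "RWCLU",
}


def resource_wait_name(number):
    """Return the RWxxx name for a resource wait number, or None if unknown."""
    return RESOURCE_WAITS.get(number)
```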

RWAST is sort of the catch-all resource wait state.  Routines inside VMS will
place a process into RWAST hoping that the next kernel mode AST queued to
the process will have called SCH$RAVAIL to report resource availability for
the resource in question.

By far the most popular reason to get placed in RWAST is a lack of a non-pooled
quota, typically BYTLM, BIOLM, DIOLM or ASTLM (this is not an exhaustive list,
but you get the idea).  The BYTLM quota as it comes "out of the box" from DEC
(4096 bytes) in the default account is so pitifully small that you can't even
use DECnet effectively.  For normal users I grant a BYTLM quota of 24000 bytes.

You can tell if this is happening by using SDA on the running system to
examine the PCB (the SDA SHOW PROCESS/INDEX=nn/PCB command) for the process
that's stuck in RWAST.  If some of the non-pooled quotas have zeroes for the
"count" portion (really, the amount remaining) then there's your problem.

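The check itself is nothing more than looking for a remaining count of zero.
A minimal sketch in Python, assuming you have copied the remaining counts off
the SDA display by hand into a dictionary; the function name and the sample
numbers in the usage note are hypothetical, not real SDA output.

```python
def exhausted_quotas(remaining):
    """Return the names of quotas whose remaining count is zero.

    remaining maps a quota name (e.g. "BYTLM") to the amount left,
    as transcribed by hand from the PCB display.
    """
    return sorted(name for name, count in remaining.items() if count == 0)
```

With made-up counts such as {"BYTLM": 0, "BIOLM": 6, "DIOLM": 6, "ASTLM": 22}
the function returns ["BYTLM"], pointing at the quota to raise.
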
Another popular reason for processes getting stuck in RWAST is a hangup in
last-channel deassign.  You can tell if this is happening if you look at the
process in question with SDA and see EXE$DASSGN+6D or so floating near the
top of that process' stack (use the SDA SHOW STACK/INDEX=nnn command).  In
this case R6 points to a data structure called a CCB (channel control block)
which will generally have outstanding I/O that VMS is waiting to have complete
before it completely deassigns the I/O channel.  If a process is stuck for
this reason you can sometimes unstick it by jostling whatever I/O device is
involved (you can figure out which by deciphering the CCB, or using the
SDA SHOW PROCESS/CHANNELS/INDEX=nnn command).

This is not an exhaustive list of why processes get placed in RWAST, it's
just two cases I see a fair amount.  There are probably dozens of other
scenarios.

Why can't you kill a process in RWAST, you ask?  Well, VMS put the process in
that state to wait for a given resource.  To delete that process before the
resource becomes available could leave the system in an inconsistent state, or
cause system data structures to be corrupted.  The actual mechanics behind it
are such that the special kernel mode AST to delete the process remains queued
until some process in VMS calls SCH$RAVAIL with the appropriate arguments to
knock the process out of resource wait, at which time the process will be
deleted.
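
The mechanics can be pictured with a toy model.  The Python below is only a
sketch of the behaviour described above, not how VMS is implemented; the class
ToyProcess and its method names are invented for illustration, and
resource_available() merely stands in for a call to SCH$RAVAIL.

```python
class ToyProcess:
    """Toy model: a delete request stays queued while in resource wait."""

    def __init__(self, name, waiting_for):
        self.name = name
        self.waiting_for = waiting_for  # resource number, or None
        self.pending_asts = []          # queued but undelivered ASTs
        self.deleted = False

    def queue_delete(self):
        # Roughly what STOP/ID does in this model: queue the delete AST.
        self.pending_asts.append("delete")
        self._deliver()

    def resource_available(self, resource):
        # Stand-in for SCH$RAVAIL reporting that a resource is available.
        if self.waiting_for == resource:
            self.waiting_for = None
            self._deliver()

    def _deliver(self):
        if self.waiting_for is not None:
            return  # still in resource wait; the AST remains queued
        while self.pending_asts:
            if self.pending_asts.pop(0) == "delete":
                self.deleted = True
```

In this model a process created with waiting_for=1 ignores queue_delete()
without any error, just like STOP/ID appears to, and only goes away once
resource_available(1) is called.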

It is possible to break a process out of a quota related resource wait by
writing a little program to go patch the PCB and quota in question and call
SCH$RAVAIL, but it's definitely not for the novice (and probably there are
real reasons why this shouldn't be done, either, but I've successfully done
it).

Could someone from DEC (or anyplace else, for that matter) please comment on
the interpretations of RWxxx above if I'm completely off base on some of them?
This information could really be helpful if more people had it.

gkn
--------------------------------------
Arpa:	GKN@SDSC.ARPA
Bitnet:	GKN@SDSC
Span:	SDSC::GKN (5.600)
USPS:	Gerard K. Newman
	San Diego Supercomputer Center
	P.O. Box 85608
	San Diego, CA 92138
AT&T:	619.534.5076
-------

LEICHTER-JERRY@YALE.ARPA.UUCP (03/17/87)

    [W]hen I'm debugging some new hardware and software, sometimes a process
    will get into an RWAST state and DCL cannot kill it.  I spoke to the
    DEC folks in Colorado and they confirmed that reboot is the only option
    here (I had been using STOP/ID=nnn, which gives no error message, but
    doesn't do anything either).
    
    Can anyone give me a more favorable second opinion?  Is rebooting the only
    option to getting rid of a process in the RWAST state?  If so, why does
    DEC allow VMS to create processes that it can't delete?

RWAST is a kind of a catch-all wait state:  It means the process needs some
resource that it currently can't get, but it is expected that the delivery of
an AST will make some of that resource available.   A common example occurs
when a process issues a QIO for buffered I/O but already has as many
outstanding buffered I/O requests as it is allowed (BIOLM).  A completely
different example occurs when a process that has ACP/XQP work outstanding is
deleted; it will wait in RWAST state until that work completes.

Not all RWAST processes are undeletable.  The example you cite involving the
TK50 is probably undeletable because it is waiting for the mag tape ACP to
finish the operation it requested, which will never happen.  VMS can't delete
the process because it would leave the ACP confused.  (The ACP might possibly
be taught to deal with that, but in the case of XQP operations, which are
really taking place within the process, deleting the process while the XQP is
busy stands a good chance of corrupting a disk.)

There will always be cases in any operating system in which a process hangs
while in some state - such as holding some important resource like a low-level
disk structure, which it may or may not have modified into some intermediate
state - in which it cannot be safely deleted.  Such a "dead" process often
isn't really doing anything harmful - it's just that the OS can't be SURE, so
has to assume the worst.  It costs you very little to just leave the RWAST (or
whatever) process alone; in most cases, you don't HAVE to reboot.
							-- Jerry
-------

LEICHTER-JERRY@YALE.ARPA.UUCP (03/17/87)

Some additional information about some of the resource waits you mention:

    	4	RWPFF	Page file full.  The page file to which this process
    			is assigned is full.
Symbol and meaning defined, but never used in the code in either V3 or V4.

    	5	RWPAG	Waiting for paged pool.
Never used in V3, used in V4.

    	6	RWBRK	$BRKTHRU wait.
Used in V3, obsolete and never used in V4.  (The V3 broadcast mechanism was
very different and had hooks deep into many pieces of the system.  The V4
mechanism is much cleaner and works with, rather than against, the terminal
driver.)

    	7	RWIMG	Image activator interlock.
Used V3, obsolete in V4.  (Implemented an interlock between INSTALL and the
image activator.  Now, the lock manager is used.)

    	8	RWQUO	A pooled quota has been exceeded.  Use SDA to figure
    			out which one.
Symbol and meaning defined, but never used in the code in either V3 or V4.

    	9	RWLCK	Lock ID database is full? (can anybody fill me in?).
Symbol and meaning defined, but never used in the code in either V3 or V4.

    	12	RWMPB	Waiting for the modified page writer (to do something,
    			but I'm not sure what.  Can anybody fill me in?).
Indicates that a process tried to fault a modified page out of its working
set, but the modified page list was too big (larger than MPW_WAITLIMIT).
This is an emergency self-protection mechanism that triggers when insufficient
page file space, or other incorrect parameter settings, cause the modified
page writer to get way behind - processes that try to create even more work
for the modified page writer get throttled.  (Transient occurrences of RWMPB
are probably common.  For example, setting MPW_WAITLIMIT to the same value as
MPW_HILIMIT (which AUTOGEN seems to like to do) makes this quite likely.)

Since modified page writing is done by the SWAPPER, which runs as a process
at priority 16, high-priority real-time processes that hogged the system
could also cause this situation.
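
The throttling condition is a simple comparison, sketched below in Python.
This is an illustration rather than VMS code: the function names and the
page counts in the usage note are made up, and the headroom calculation
assumes that MPW_HILIMIT is the list size at which the modified page writer
starts writing.

```python
def must_wait_rwmpb(modified_pages, mpw_waitlimit):
    """True if a process adding to the modified page list gets throttled.

    Mirrors the rule above: wait when the list is larger than MPW_WAITLIMIT.
    """
    return modified_pages > mpw_waitlimit


def headroom(mpw_hilimit, mpw_waitlimit):
    """Pages the list may grow past MPW_HILIMIT before processes wait.

    Assumes the modified page writer starts work at MPW_HILIMIT.
    """
    return mpw_waitlimit - mpw_hilimit
```

With both parameters set to the same value, say 500, headroom(500, 500) is
zero: the first page past the point where the writer starts is also the
first page that puts a process into RWMPB.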

    	13	RWSCS	Waiting for a systems communications services
    			(cluster) event.
Used by the cluster-wide lock manager.

    	14	RWCLU	Waiting for a cluster transition.
Used by the cluster-wide lock manager.

Some waits (RWSCS, RWCLU, for example) occur at elevated priorities and are
never interruptible for things like CTRL/Y or process deletion.  Others
(RWAST, RWMPE) may or may not be interruptible.

The wait states noted above as "Never used in the code" or "Obsolete in V4"
may show up in SDA if the process state gets corrupted somehow.

							-- Jerry
-------