info-vax@ucbvax.ARPA (02/13/85)
From: goldstein%star.DEC@decwrl.ARPA (Andy Goldstein)

Just to add some more info on two recent queries...

RWAST indeed means AST wait. What this really means in most cases is waiting for I/O completion (I/O's are always completed with a kernel AST whether you wanted a real one or not). So, you can end up in this state if you did a ctrl-Y / stop / exit and have a stuck I/O somewhere.

You can also get in this state if you're out of BYTLIM. Whatever service is trying to allocate pool waits for an I/O to come back, in the hope that it will release buffer space and get you back under quota. In some cases, waiting is not justified because the pool is being used up by something that won't come back by itself (e.g., a mailbox). There are some obscure cases that will wait anyway.

On systems crashing while they are serving disks: Yes indeed, the served disk (as seen from a surviving node) is marked as "host unavailable" because that's exactly what's going on - the driver cannot talk to the "controller" (i.e., the serving CPU). During this state, pending I/O's wait for the controller to come back. This state is also called "mount verification pending", and there are console messages to indicate that this is going on.

If the crashed machine doesn't come back soon, mount verification will time out, and the pending I/O's are failed with device offline errors. Now the disk is in MvTimeout state, which is terminal. Further I/O's will fail until you dismount and remount it.

The rationale for all this is that while the driver is sitting on the I/O's, the processes affected are essentially hung. Depending on what locks they happen to be holding, they can propagate hangs to other parts of the cluster.

The timeout is controlled by the SYSGEN parameter MVTIMEOUT. You should set this parameter to be somewhat longer than the maximum normal reboot time of machines in the cluster, so that the server can be re-established before the servee times it out.

- Andy Goldstein
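
The MVTIMEOUT sizing rule above (somewhat longer than the longest normal reboot) can be sketched as a small calculation. This is only an illustration, not anything from VMS itself: the function name, the 25% margin, the sample reboot times, and the assumption that the values are in seconds are all hypothetical choices, since the post does not say how much longer "somewhat longer" should be.

```python
import math


def suggest_mvtimeout(reboot_seconds, margin=0.25):
    """Return a candidate MVTIMEOUT value.

    reboot_seconds: observed normal reboot times for the cluster's
                    serving machines (assumed to be in seconds).
    margin:         extra headroom as a fraction of the longest reboot
                    (0.25 is an arbitrary illustrative default).
    """
    if not reboot_seconds:
        raise ValueError("need at least one observed reboot time")
    if margin < 0:
        raise ValueError("margin must not be negative")
    # Longest normal reboot, padded so the server can come back
    # before the served disk times out on the surviving nodes.
    return math.ceil(max(reboot_seconds) * (1 + margin))


# Hypothetical reboot times measured on three cluster members.
observed = [420, 515, 600]
print(suggest_mvtimeout(observed))
```

The result is only a starting point; the value actually entered in SYSGEN should still be checked against how long you are willing to let hung processes hold locks elsewhere in the cluster.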