swenson@nusc-wpn.arpa (12/27/89)
Greetings,

We are currently running Unix 3.5 on our Sun 3 and we have been plagued by exiting processes. It goes something like this:

We are using the standard tip line to a remote VME cage (i.e. just another machine). During (what appears to be) some relatively high bandwidth data transfers, the tip line loses its mind. Do a ps and the tip shows up as <exiting>. Try to kill the process -- it won't die. During fastboot we get a message like "Warning processes wouldn't die -- suggest using ps" (we are truly afraid at this point).

The questions are, why is the tip line hanging up (difficult to answer with limited information, I know), and is there a way to kill the <exiting> process without rebooting the system?

Any help with this problem would be greatly appreciated.

Thanks,
Stephen J. Swenson
SWENSON@NUSC-WPN.ARPA
dyer@spdcc.COM (Steve Dyer) (12/28/89)
In article <21875@adm.BRL.MIL> swenson@nusc-wpn.arpa writes:
> We are using the standard tip line to a remote VME cage (i.e. just
>another machine). During (what appears to be) some relatively high bandwidth
>data transfers, the tip line loses its mind. Do a ps and the tip shows up as
><exiting>. Try to kill the process -- it won't die. During fastboot we
>get a message like "Warning processes wouldn't die -- suggest using ps"
>(we are truly afraid at this point). The questions are, why is the tip
>line hanging up (difficult to answer with limited information, I know),
>and is there a way to kill the <exiting> process without rebooting the system?

Almost always, when a process is stuck in the <exiting> state, it's in the middle of a device-specific close routine called from the exit code. A process can invoke the exit code either explicitly through the exit system call or in response to most signals which have SIG_DFL handling. Here, the device-specific close routine would probably be for the serial I/O hardware.

If the device-specific close routine (or a routine it calls) sleeps, and for one reason or another there is no wakeup() forthcoming, you will get into this kind of a situation. Usually, the close routine in TTY drivers attempts to flush the characters on the output clist to the hardware before returning from the close. Now, with hardware problems or bugs in the driver itself, if the output interrupt never happens or it doesn't manage to issue a wakeup, the process will be hung up on a sleep() inside the exit code.

You can issue a "kill" as much as you want. What it will do each time, however, is to interrupt the sleep and restart the exit code. The exit code will loop through all open files and call the device-specific close routine again and get stuck one more time. Without rewriting the device driver to handle this pathological situation (or ingenious adb hacking on an active kernel), the easiest way to recover from this is to reboot.
This is a general description of what can go wrong--it isn't Sun-specific.

--
Steve Dyer
dyer@ursa-major.spdcc.com aka {ima,harvard,rayssd,linus,m2c}!spdcc!dyer
dyer@arktouros.mit.edu, dyer@hstbme.mit.edu
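The loop described in the post above can be modelled in a few lines of user-space C. This is only a toy sketch, not kernel code: every name in it (toy_tty, toy_tty_close, toy_exit, toy_kill) is hypothetical, and "going to sleep with no wakeup coming" is represented simply by the close routine reporting failure.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model; none of these names come from a real kernel. */
struct toy_tty {
    int  out_queued;     /* characters still waiting on the output queue */
    bool tx_intr_works;  /* false models dead hardware or a driver bug */
};

/* Model of a driver close routine that drains output before returning.
 * Returns true if the close completed, false if it would sleep forever
 * because no output interrupt (and so no wakeup) will ever arrive. */
bool toy_tty_close(struct toy_tty *t)
{
    while (t->out_queued > 0) {
        if (!t->tx_intr_works)
            return false;   /* asleep, waiting for a wakeup that never comes */
        t->out_queued--;    /* one output interrupt drains one character */
    }
    return true;
}

/* Model of the exit code: close every open file in turn.
 * Returns the index of the file the process is stuck on, or -1 if
 * the exit ran to completion. */
int toy_exit(struct toy_tty *files[], size_t nfiles)
{
    for (size_t i = 0; i < nfiles; i++) {
        if (files[i] != NULL && !toy_tty_close(files[i]))
            return (int)i;
    }
    return -1;
}

/* Model of kill on an exiting process as the post describes it:
 * the sleep is interrupted and the exit code simply starts over. */
int toy_kill(struct toy_tty *files[], size_t nfiles)
{
    return toy_exit(files, nfiles);
}
```

With one healthy line and one whose output interrupt never fires, toy_exit and every later toy_kill stop at the same broken line; only repairing the line (setting tx_intr_works back to true in this model) lets the exit finish.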
jjb@sequent.UUCP (Jeff Berkowitz) (12/28/89)
>In article <21875@adm.BRL.MIL> swenson@nusc-wpn.arpa writes:
>> We are using the standard tip line to a remote VME cage...
>>...the tip line loses its mind...Try to kill the process -- it won't die.

In article <1000@ursa-major.SPDCC.COM> dyer@ursa-major.spdcc.COM (Steve Dyer) writes:
>You can issue a "kill" as much as you want. What it will do each time,
>however, is to interrupt the sleep and restart the exit code.

This statement suggests that the person who wrote the driver didn't bother to look and see whether a signal occurred (and abort the close if so).

Actually it's worse than this on many flavors of UN*X. I can't speak for System 5, but on most (all?) BSD-derived systems the code in exit() does

    p->p_sigignore = ~0;

fairly early in exit() processing. The code [actually in rexit()] then loops through the descriptors, closing them. The effect is that all signals, even SIGKILL, are ignored during close processing. In fact, they're actually thrown away; the fact that they occurred is not even recorded in the p_sig word.

What this means is that (1) during exit(), any sleep in the driver is TOTALLY uninterruptible, and (2) the careful driver-writer doesn't even have the OPTION of polling the device, checking periodically for signals, and aborting the close if one arrives.

This is difficult to solve. Most signal processing should certainly be disabled while the process is half-disassembled. I don't think anyone wants to add more state to the signal code, which is already grotesque. Just throwing away buffered characters on a flow controlled line at close() time isn't really acceptable either; the device could be, say, a printer which has been punched "off-line". When it goes "on-line", the owner deserves to get the end of the listing.

--
Jeff Berkowitz N6QOM
uunet!sequent!jjb
Sequent Computer Systems Custom Systems Group
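The effect of that one assignment can be shown with a small user-space model. This is a sketch under stated assumptions, not BSD source: only the field names p_sig and p_sigignore come from the post above; the struct, the toy_ functions, the one-bit-per-signal layout, and the signal number 9 for SIGKILL are illustrative assumptions.

```c
#include <stdbool.h>

#define TOY_SIGKILL 9   /* assumed signal number, for illustration only */

/* Toy process record; only the two field names are taken from the post. */
struct toy_proc {
    unsigned long p_sig;        /* pending signals, one bit per signal */
    unsigned long p_sigignore;  /* ignored signals, one bit per signal */
};

/* Assumed bit layout: signal n occupies bit n-1. */
static unsigned long toy_sigmask(int signo)
{
    return 1UL << (signo - 1);
}

/* Post a signal. An ignored signal is discarded outright and leaves no
 * trace in p_sig. Returns true if the signal was recorded as pending. */
bool toy_psignal(struct toy_proc *p, int signo)
{
    unsigned long mask = toy_sigmask(signo);

    if (p->p_sigignore & mask)
        return false;           /* thrown away, not even remembered */
    p->p_sig |= mask;
    return true;
}

/* The step the post describes early in exit(): ignore everything. */
void toy_begin_exit(struct toy_proc *p)
{
    p->p_sigignore = ~0UL;
}

/* What a careful driver would like to poll while waiting in close. */
bool toy_signal_pending(const struct toy_proc *p)
{
    return p->p_sig != 0;
}
```

Before toy_begin_exit, a posted TOY_SIGKILL shows up in p_sig; afterwards toy_psignal returns false and toy_signal_pending stays false, so in this model a driver polling for signals during close has nothing to see.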
mario@theglove.Sun.COM (Mario Dorion SSE Sun Montreal) (01/02/90)
In article <1000@ursa-major.SPDCC.COM> dyer@ursa-major.spdcc.COM (Steve Dyer) writes:
>
>You can issue a "kill" as much as you want. What it will do each time,
>however, is to interrupt the sleep and restart the exit code. The exit
>code will loop through all open files and call the device-specific close
>routine again and get stuck one more time. Without rewriting the device
>driver to handle this pathological situation (or ingenious adb hacking
>on an active kernel), the easiest way to recover from this is to reboot.
>
>This is a general description of what can go wrong--it isn't Sun-specific.
>
>--
>Steve Dyer
>dyer@ursa-major.spdcc.com aka {ima,harvard,rayssd,linus,m2c}!spdcc!dyer
>dyer@arktouros.mit.edu, dyer@hstbme.mit.edu

Actually there is an easier way, at least with SunOS. The gcore command (syntax: /usr/ucb/gcore <pid>) has the undocumented side-effect of effectively killing an <exiting> process. The trace command (syntax: trace -p <pid>) acts the same way.

If an incoming tty line is hung in <exiting> and its corresponding process is 'killed' using one of these methods, init will not know about it and won't restart a getty on that line; you should kill and restart init. Still, you can have a script that does that job, and even have a cron entry hunting and killing exiting processes every few minutes if you have a serious exiting problem.

This should be simpler than rebooting :)

Hope this helps.
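The "hunting" half of such a cron job comes down to picking the pids of <exiting> entries out of ps output. A minimal sketch in C follows; the function name exiting_pid is hypothetical, and it assumes that the pid is the first whitespace-separated field of each ps line and that the command column reads <exiting>, which should be checked against the local ps before relying on it.

```c
#include <stdlib.h>
#include <string.h>

/* Return the pid named on one line of ps output if that line describes
 * an <exiting> process, or -1 otherwise.
 * Assumption: the pid is the first whitespace-separated field. */
long exiting_pid(const char *ps_line)
{
    char *end;
    long pid;

    if (strstr(ps_line, "<exiting>") == NULL)
        return -1;              /* ordinary process or header line */

    pid = strtol(ps_line, &end, 10);   /* strtol skips leading blanks */
    if (end == ps_line || pid <= 0)
        return -1;              /* no usable number at the start */
    return pid;
}
```

A wrapper run from cron could feed each line of ps output through exiting_pid and hand any pid it returns to gcore as described in the post above; that wrapper is left out here because it depends on local paths and policy.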