[comp.unix.wizards] <exiting> tip processes on the SUN

swenson@nusc-wpn.arpa (12/27/89)

Greetings,

	We are currently running Unix 3.5 on our Sun 3 and we have been plagued
by exiting processes.  It goes something like this:

	We are using the standard tip line to a remote VME cage (i.e. just
another machine).  During (what appears to be) some relatively high bandwidth
data transfers, the tip line loses its mind.  Do a ps and the tip shows up as
<exiting>.  Try to kill the process -- it won't die.  During fastboot we
get a message like "Warning processes wouldn't die -- suggest using ps" 
(we are truly afraid at this point).  The questions are, why is the tip
line hanging up (difficult to answer with limited information, I know),
and is there a way to kill the <exiting> process without rebooting the
system?

	Any help with this problem would be greatly appreciated.


				Thanks,
				Stephen J. Swenson
				SWENSON@NUSC-WPN.ARPA
------

dyer@spdcc.COM (Steve Dyer) (12/28/89)

In article <21875@adm.BRL.MIL> swenson@nusc-wpn.arpa writes:
>	We are using the standard tip line to a remote VME cage (i.e. just
>another machine).  During (what appears to be) some relatively high bandwidth
>data transfers, the tip line loses its mind.  Do a ps and the tip shows up as
><exiting>.  Try to kill the process -- it won't die.  During fastboot we
>get a message like "Warning processes wouldn't die -- suggest using ps" 
>(we are truly afraid at this point).  The questions are, why is the tip
>line hanging up (difficult to answer with limited information, I know),
>and is there a way to kill the <exiting> process without rebooting the system?

Almost always, when a process is stuck in the <exiting> state, it's in the
middle of a device-specific close routine called from the exit code.
A process can invoke the exit code either explicitly through the exit
system call or in response to most signals which have SIG_DFL handling.
Here, the device-specific close routine would probably be for the
serial I/O hardware.

If the device-specific close routine (or a routine it calls) sleeps, and
for one reason or another there is no wakeup() forthcoming, you will get
into this kind of a situation.  Usually, the close routine in TTY drivers
attempts to flush the characters on the output clist to the hardware before
returning from the close.  Now, with hardware problems or bugs in the driver
itself, if the output interrupt never happens or it doesn't manage to issue
a wakeup, the process will be hung up on a sleep() inside the exit code.

You can issue a "kill" as much as you want.  What it will do each time,
however, is to interrupt the sleep and restart the exit code.  The exit
code will loop through all open files and call the device-specific close
routine again and get stuck one more time.  Without rewriting the device
driver to handle this pathological situation (or ingenious adb hacking
on an active kernel), the easiest way to recover from this is to reboot.

This is a general description of what can go wrong--it isn't Sun-specific.
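
The loop described above can be sketched as a toy user-space model. This
is only an illustration, not kernel or driver code: the names toy_tty,
toy_ttyclose and toy_exit_with_kills are made up for the example, and a
"sleep" is modelled simply as returning early.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a serial line's driver state. */
struct toy_tty {
    int  outq_chars;     /* characters still queued for output */
    bool hw_interrupts;  /* false: the output interrupt never arrives */
};

enum close_result { CLOSE_DONE, CLOSE_ASLEEP };

/* Model of a tty close routine that insists on draining output first. */
enum close_result toy_ttyclose(struct toy_tty *tp)
{
    while (tp->outq_chars > 0) {
        if (!tp->hw_interrupts)
            return CLOSE_ASLEEP;  /* would sleep(); no wakeup() ever comes */
        tp->outq_chars--;         /* one character drained per interrupt */
    }
    return CLOSE_DONE;
}

/*
 * Model of the exit path: each kill interrupts the sleep, and the exit
 * code then calls the close routine again. Returns how many kills were
 * needed before the close finished, or -1 if the process is still stuck
 * after max_kills attempts.
 */
int toy_exit_with_kills(struct toy_tty *tp, int max_kills)
{
    assert(max_kills >= 0);
    for (int kills = 0; kills <= max_kills; kills++) {
        if (toy_ttyclose(tp) == CLOSE_DONE)
            return kills;
        /* a kill arrives here and the exit code starts over */
    }
    return -1;
}
```

With working interrupts the close finishes on the first pass. With
hw_interrupts set to false, no number of kills changes the outcome,
because nothing ever empties the output queue.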

-- 
Steve Dyer
dyer@ursa-major.spdcc.com aka {ima,harvard,rayssd,linus,m2c}!spdcc!dyer
dyer@arktouros.mit.edu, dyer@hstbme.mit.edu

jjb@sequent.UUCP (Jeff Berkowitz) (12/28/89)

>In article <21875@adm.BRL.MIL> swenson@nusc-wpn.arpa writes:
>>	We are using the standard tip line to a remote VME cage...
>>...the tip line loses its mind...Try to kill the process -- it won't die.

In article <1000@ursa-major.SPDCC.COM>
  dyer@ursa-major.spdcc.COM (Steve Dyer) writes:

>You can issue a "kill" as much as you want.  What it will do each time,
>however, is to interrupt the sleep and restart the exit code.

This statement suggests that the person who wrote the driver didn't bother
to look and see whether a signal occurred (and abort the close if so).

Actually it's worse than this on many flavors of UN*X.  I can't speak for
System 5, but on most (all?) BSD-derived systems the code in exit() does

    p->p_sigignore = ~0;

fairly early in exit() processing.  The code [actually in rexit()] then
loops through the descriptors, closing them.  The effect is that all
signals, even SIGKILL, are ignored during close processing.  In fact,
they're actually thrown away; the fact that they occurred is not even
recorded in the p_sig word.  What this means is that (1) during exit(),
any sleep in the driver is TOTALLY uninterruptible, and (2) the careful
driver-writer doesn't even have the OPTION of polling the device, checking
periodically for signals, and aborting the close if one arrives.
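
To make the effect concrete, here is a toy user-space model of the
behaviour just described. It is a sketch, not the BSD source: the field
names p_sig and p_sigignore follow the text above, but toy_proc,
toy_psignal, toy_begin_exit and toy_signal_pending are hypothetical
names, and real signal posting does far more than this.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SIGKILL 9
#define toy_sigmask(sig) ((uint32_t)1 << ((sig) - 1))

/* Hypothetical cut-down process structure. */
struct toy_proc {
    uint32_t p_sig;        /* signals posted and not yet handled */
    uint32_t p_sigignore;  /* signals currently being ignored */
};

/* Post a signal. An ignored signal is dropped without any record. */
void toy_psignal(struct toy_proc *p, int sig)
{
    assert(sig >= 1 && sig <= 32);
    uint32_t mask = toy_sigmask(sig);

    if (p->p_sigignore & mask)
        return;            /* thrown away; p_sig is never touched */
    p->p_sig |= mask;
}

/* The step quoted above from exit(): ignore every signal. */
void toy_begin_exit(struct toy_proc *p)
{
    p->p_sigignore = ~(uint32_t)0;
}

/* What a careful driver would like to ask while polling in close. */
int toy_signal_pending(const struct toy_proc *p)
{
    return p->p_sig != 0;
}
```

Before toy_begin_exit() a posted SIGKILL shows up in p_sig. Afterwards
the same call leaves p_sig at zero, so a driver polling
toy_signal_pending() in its close routine would never see anything.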

This is difficult to solve.  Most signal processing should certainly be
disabled while the process is half-disassembled.  I don't think anyone
wants to add more state to the signal code, which is already grotesque.
Just throwing away buffered characters on a flow controlled line at
close() time isn't really acceptable either; the device could be, say,
a printer which has been punched "off-line".  When it goes "on-line",
the owner deserves to get the end of the listing.
-- 
Jeff Berkowitz N6QOM			uunet!sequent!jjb
Sequent Computer Systems		Custom Systems Group

mario@theglove.Sun.COM (Mario Dorion SSE Sun Montreal) (01/02/90)

In article <1000@ursa-major.SPDCC.COM> dyer@ursa-major.spdcc.COM (Steve Dyer) writes:
>
>You can issue a "kill" as much as you want.  What it will do each time,
>however, is to interrupt the sleep and restart the exit code.  The exit
>code will loop through all open files and call the device-specific close
>routine again and get stuck one more time.  Without rewriting the device
>driver to handle this pathological situation (or ingenious adb hacking
>on an active kernel), the easiest way to recover from this is to reboot.
>
>This is a general description of what can go wrong--it isn't Sun-specific.

Actually there is an easier way, at least with SunOS. The gcore
command (syntax: /usr/ucb/gcore <pid>) has the undocumented
side effect of effectively killing an <exiting> process. The trace
command (syntax: trace -p <pid>) acts the same way.

If an incoming tty line is hung in <exiting> and its corresponding
process is 'killed' using one of these methods, init will not know about
it and won't restart a getty on that line; you should kill and restart
init. Still, you can have a script that does that job, and even a cron
entry hunting and killing exiting processes every few minutes, if you
have a serious exiting problem.
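
The "hunting" half of such a job might look like the sketch below. It
only picks the PIDs of <exiting> processes out of ps output. It assumes
the PID is the first field on each line, as in BSD-style "ps -ax"
output, and the function names exiting_pid and collect_exiting are
invented for the example.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the PID from one line of ps output if the line shows
 * "<exiting>", or -1 otherwise. Assumes the PID is the first field;
 * adjust the parsing for other ps formats.
 */
long exiting_pid(const char *line)
{
    long pid;

    if (strstr(line, "<exiting>") == NULL)
        return -1;
    if (sscanf(line, "%ld", &pid) != 1 || pid <= 0)
        return -1;
    return pid;
}

/*
 * Read ps output from a stream and store up to max PIDs of <exiting>
 * processes in pids[]. Returns the number of PIDs stored.
 */
int collect_exiting(FILE *ps_out, long *pids, int max)
{
    char line[512];
    int n = 0;

    while (n < max && fgets(line, sizeof line, ps_out) != NULL) {
        long pid = exiting_pid(line);
        if (pid > 0)
            pids[n++] = pid;
    }
    return n;
}
```

A small driver could hand collect_exiting() the stream returned by
popen("ps -ax", "r") and then run /usr/ucb/gcore on each PID found.
That part is left out here because it depends on the local setup.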

This should be simpler than rebooting :)

Hope this helps.