[comp.unix.questions] Need help with SCO: the process that would not die.

abcscnge@csun.UUCP (Scott Neugroschl) (11/20/87)

In article <116@citcom.UUCP> peter@citcom.UUCP (Peter Klosky) writes:
: We've been having some trouble killing off processes here,
: and would like comments or suggestions from the net.
: In specific, our PC AT running SCO XENIX 2.1.3 or 2.2 will
: get itself into a state where there will be a process
: which is a child of init which will not respond to a
: kill request.  You can send the kill over and over, and the
: process will still exist.  The signal sent is a -9, and the
: process has a WCHAN and is not marked as <defunct> in the
: ps output, so we are pretty sure it is not a zombie.
: Of course the SCO ps does not have any symbolic information
: in the WCHAN column like the better versions of ps, so we
: are not able to tell what event the process is waiting on.

I realize this isn't a Xenix question (from me), but we have a similar
problem with our Zilog S8000 running ZEUS 3.2 (Zilog's version of SYS III)
at work (not CSUN).  It appears to be related to signal processing.   Our
in-house guru tells us that the process is "locked on I/O", implying that
the signal really screwed up the kernel data.  Recommend you look at the
signal handling logic if possible, and ask the people causing the lockup
if they have done an interrupt (ctrl-c or DEL) just before it locked...

Any wizards out there know of such bugs in either kernel (xenix or zilog)?

-- 
Scott "The Pseudo-Hacker" Neugroschl
UUCP: {litvax,humboldt,sdcrdcf,rdlvax,ttidca,}\_ csun!abcscnge
      {psivax,csustan,nsc-sca,trwspf         }/
-- "They also surf who stand on waves"

lvc@tut.cis.ohio-state.edu (Lawrence V. Cipriani) (11/28/87)

In article <911@csun.UUCP>, abcscnge@csun.UUCP (Scott Neugroschl) writes:
> 
> I realize this isn't a Xenix question (from me), but we have a similar
> problem with our Zilog S8000 running ZEUS 3.2 (Zilog's version of SYS III)
> at work (not CSUN).  It appears to be related to signal processing.   Our
> in-house guru tells us that the process is "locked on I/O", implying that
> the signal really screwed up the kernel data.  Recommend you look at the
> signal handling logic if possible, and ask the people causing the lockup
> if they have done an interrupt (ctrl-c or DEL) just before it locked...
> 
> Any wizards out there know of such bugs in either kernel (xenix or zilog)?
>
> Scott "The Pseudo-Hacker" Neugroschl

It's not a bug.  This is the way UNIX and all derivatives (that I know
of) are designed.  Whether this is a good design is another question.
If the operating system is performing certain I/O operations on behalf of
your program (e.g. a close), and the operation does not complete (for whatever
reason, usually a hardware problem), your program won't die, and can't be
killed with a signal, not even SIGKILL.  You might adb the OS and fiddle some
bits, but I don't recommend it.  A reboot is the only sure way to make it
go away, though other tricks sometimes work depending on the circumstances.
The wchan is an address that can be used to identify the offending hardware:
a tty structure, a tape, or a network device, for example.  A local guru should
be able to tell you what device corresponds to the address.  If he or she
can't, they aren't much of a guru.
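
For what it's worth, the lookup from a raw wchan address to a device can be
sketched as a nearest-symbol search, which is roughly what a ps with symbolic
WCHAN output would do.  This is only a sketch in Python: the symbol names and
addresses below are invented for illustration, and on a real system they would
have to come from the kernel's own name list.

```python
import bisect

# Hypothetical kernel symbol table as (address, name) pairs.
# These values are made up; a real table would come from the kernel image.
SYMBOLS = sorted([
    (0x4000, "proc"),
    (0x6a00, "tty_table"),
    (0x7200, "tape_unit"),
    (0x7800, "buf"),
])

def wchan_to_symbol(wchan):
    """Return 'name+offset' for the nearest symbol at or below wchan."""
    addrs = [addr for addr, _ in SYMBOLS]
    i = bisect.bisect_right(addrs, wchan) - 1
    if i < 0:
        # Address is below every known symbol; fall back to the raw value.
        return hex(wchan)
    addr, name = SYMBOLS[i]
    offset = wchan - addr
    return name if offset == 0 else "%s+0x%x" % (name, offset)

print(wchan_to_symbol(0x6a40))  # prints tty_table+0x40
```

With this made-up table, a wchan of 0x6a40 falls 0x40 bytes into the tty
structures, which would point at a terminal line rather than the tape.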

This is one area where UNIX is particularly weak.  Hardware failures
ought to be handled more robustly, most certainly when they involve
noncritical devices.  I don't see any hope soon for a better strategy.
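
To make the mechanism concrete, here is a toy model of the classic
sleep/wakeup scheme, written in Python.  It is not kernel code.  It assumes
the traditional rule that a process sleeping at a priority number at or below
PZERO is not woken by signals; the PZERO value, the priorities, and the
channel addresses are all assumptions chosen for illustration.

```python
PZERO = 25  # signal-interruptibility boundary; the exact value is an assumption

class Proc:
    def __init__(self, name):
        self.name = name
        self.state = "RUN"      # RUN, SLEEP, or DEAD
        self.wchan = None       # channel (address) the process sleeps on
        self.pri = None         # sleep priority; lower is more urgent
        self.pending = set()    # signals posted but not yet acted on

def sleep(p, chan, pri):
    """Block p on chan at the given sleep priority."""
    p.state, p.wchan, p.pri = "SLEEP", chan, pri

def deliver(p):
    """Act on pending signals; only possible once the process is runnable."""
    if p.state == "RUN" and "KILL" in p.pending:
        p.state = "DEAD"

def psignal(p, sig):
    """Post a signal. It wakes a sleeper only if its priority is above PZERO."""
    if p.state == "DEAD":
        return
    p.pending.add(sig)
    if p.state == "SLEEP" and p.pri > PZERO:
        p.state, p.wchan = "RUN", None
    deliver(p)

def wakeup(procs, chan):
    """What a driver's interrupt handler would do when the I/O completes."""
    for p in procs:
        if p.state == "SLEEP" and p.wchan == chan:
            p.state, p.wchan = "RUN", None
            deliver(p)

tty_reader = Proc("tty_reader")
closer = Proc("closer")
sleep(tty_reader, chan=0x6a40, pri=28)  # interruptible wait for input
sleep(closer, chan=0x7200, pri=20)      # uninterruptible wait on the device

psignal(tty_reader, "KILL")
psignal(closer, "KILL")
print(tty_reader.state, closer.state)   # prints DEAD SLEEP

wakeup([tty_reader, closer], 0x7200)    # the interrupt finally arrives
print(closer.state)                     # prints DEAD
```

In this model the kill is only recorded as pending for the second process.
It takes effect when wakeup is called on its channel, so if the hardware never
produces the interrupt, the process sits there no matter how many signals are
sent.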

davidsen@steinmetz.steinmetz.UUCP (William E. Davidsen Jr) (12/01/87)

I believe that the process which won't die and is not a zombie has an
i/o outstanding.  The kernel thinks the i/o will terminate and it
doesn't.  Lost interrupt? This may have nothing to do with your problem,
but I have seen it on other systems. 

-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me