peter@citcom.UUCP (Peter Klosky) (11/18/87)
We've been having some trouble killing off processes here, and would like comments or suggestions from the net. In specific, our PC AT running SCO XENIX 2.1.3 or 2.2 will get itself into a state where there will be a process which is a child of init which will not respond to a kill request. You can send the kill over and over, and the process will still exist. The signal sent is a -9, and the process has a WCHAN and is not marked as <defunct> in the ps output, so we are pretty sure it is not a zombie. Of course the SCO ps does not have any symbolic information in the WCHAN column like the better versions of ps, so we are not able to tell what event the process is waiting on.

Our typical solution to this problem is to reboot the system. Sadly, this destroys the evidence. We can not reproduce this problem on demand.

One of the programmers here suggested that we call SCO support for help in gathering evidence; i.e. we wanted to ask about how to take a crash dump before rebooting. SCO responded in the usual manner for a large, insensitive institution. Plenty of music on hold, then a non-technical person took our number and said someone would be calling us back within a week or so. Needless to say, we had to reboot the machine, so we lost the evidence. We did manage a "cat /dev/mem >crash.dump" before rebooting, but we wonder how much help this will be.

Any clues on how to get a crash dump? Any clues on why this process would refuse to go away? (The process reads and writes message queues, writes the console, and does file i/o.)
--
Peter Klosky, Citcom Systems (materiel de telecommunications)
seismo!vrdxhq!baskin!citcom!peter (703) 689-2800 x 235
abcscnge@csun.UUCP (Scott Neugroschl) (11/20/87)
In article <116@citcom.UUCP> peter@citcom.UUCP (Peter Klosky) writes:
: We've been having some trouble killing off processes here,
: and would like comments or suggestions from the net.
: In specific, our PC AT running SCO XENIX 2.1.3 or 2.2 will
: get itself into a state where there will be a process
: which is a child of init which will not respond to a
: kill request. You can send the kill over and over, and the
: process will still exist. The signal sent is a -9, and the
: process has a WCHAN and is not marked as <defunct> in the
: ps output, so we are pretty sure it is not a zombie.
: Of course the SCO ps does not have any symbolic information
: in the WCHAN column like the better versions of ps, so we
: are not able to tell what event the process is waiting on.
I realize this isn't a Xenix question (from me), but we have a similar
problem with our Zilog S8000 running ZEUS 3.2 (Zilog's version of SYS III)
at work (not CSUN). It appears to be related to signal processing. Our
in-house guru tells us that the process is "locked on I/O", implying that
the signal really screwed up the kernel data. Recommend you look at the
signal handling logic if possible, and ask the people causing the lockup
if they have done an interrupt (ctrl-c or DEL) just before it locked...
Any wizards out there know of such bugs in either kernel (xenix or zilog)?
--
Scott "The Pseudo-Hacker" Neugroschl
UUCP: {litvax,humboldt,sdcrdcf,rdlvax,ttidca,}\_ csun!abcscnge
{psivax,csustan,nsc-sca,trwspf }/
-- "They also surf who stand on waves"
lvc@tut.cis.ohio-state.edu (Lawrence V. Cipriani) (11/28/87)
In article <911@csun.UUCP>, abcscnge@csun.UUCP (Scott Neugroschl) writes:
>
> I realize this isn't a Xenix question (from me), but we have a similar
> problem with our Zilog S8000 running ZEUS 3.2 (Zilog's version of SYS III)
> at work (not CSUN). It appears to be related to signal processing. Our
> in-house guru tells us that the process is "locked on I/O", implying that
> the signal really screwed up the kernel data. Recommend you look at the
> signal handling logic if possible, and ask the people causing the lockup
> if they have done an interrupt (ctrl-c or DEL) just before it locked...
>
> Any wizards out there know of such bugs in either kernel (xenix or zilog)?
>
> Scott "The Pseudo-Hacker" Neugroschl

It's not a bug. This is the way UNIX and all derivatives (that I know of) are designed. Whether this is a good design is another question. If the operating system is performing certain I/O operations on behalf of your program (e.g. a close), and the operation does not complete (for whatever reason, usually a hardware problem), your program won't die, and can't be killed with a signal, not even SIGKILL. You might adb the os and fiddle some bits, but I don't recommend it. A reboot is the only sure way to make it go away, though other tricks sometimes work depending on the circumstances.

The wchan is an address that can be used to identify the offending hardware: a tty structure, a tape, or a network device, for example. A local guru should be able to tell you what device corresponds to the address. If he or she can't, they aren't much of a guru.

This is one area where UNIX is particularly weak. Hardware failures ought to be handled more robustly, most certainly when they involve non-critical devices. I don't see any hope soon for a better strategy.
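The step of turning a numeric wchan into a device name can be sketched roughly as below. This is only an illustration of the idea, not how any particular ps or kernel debugger does it: the symbol table, its addresses, and the names `SYMBOLS` and `lookup_wchan` are all hypothetical, standing in for whatever a kernel namelist on the real system would provide.

```python
import bisect

# Hypothetical symbol table: (start address, name) pairs such as one might
# pull from a kernel namelist. The addresses and names are invented for
# illustration and do not come from any real XENIX or ZEUS kernel.
SYMBOLS = [
    (0x1000, "tty_table"),
    (0x1400, "tape_ctl"),
    (0x1800, "net_buf"),
]


def lookup_wchan(wchan, symbols=SYMBOLS):
    """Return (name, offset) for the symbol at or just below wchan.

    Returns None when wchan lies below every known symbol.
    """
    table = sorted(symbols)
    starts = [addr for addr, _ in table]
    # Index of the last symbol whose start address is <= wchan.
    i = bisect.bisect_right(starts, wchan) - 1
    if i < 0:
        return None
    addr, name = table[i]
    return name, wchan - addr


if __name__ == "__main__":
    print(lookup_wchan(0x1410))  # ('tape_ctl', 16)
```

With a table like this, a wchan of 0x1410 falls 16 bytes into the made-up `tape_ctl` entry, which would point at the tape device as the thing the process is waiting on.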
davidsen@steinmetz.steinmetz.UUCP (William E. Davidsen Jr) (12/01/87)
I believe that the process which won't die and is not a zombie has an i/o outstanding. The kernel thinks the i/o will terminate and it doesn't. Lost interrupt? This may have nothing to do with your problem, but I have seen it on other systems.
--
bill davidsen (wedu@ge-crd.arpa)
{uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me
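The distinction drawn in this thread, between a zombie and a process stuck waiting on I/O, can be sketched as a small classifier over ps fields. This is an assumption-laden illustration: the state letters (Z, D, S, R, T) follow the conventions of the modern Linux procps ps, not the SCO XENIX ps discussed above, and the function name `classify` is hypothetical.

```python
# Sketch only: state letters follow modern Linux procps conventions,
# not the SCO XENIX ps discussed in the thread.

def classify(stat, wchan):
    """Guess how a process will react to kill -9 from its STAT and WCHAN."""
    state = stat[:1]  # the first letter is the main state code
    if state == "Z":
        return "zombie: already dead, waiting for its parent to reap it"
    if state == "D":
        return ("uninterruptible sleep on %s: "
                "the signal stays pending until the wait ends" % wchan)
    if state == "S":
        return "interruptible sleep: SIGKILL should take effect"
    if state in ("R", "T"):
        return "running or stopped: SIGKILL should take effect"
    return "unknown state %r" % state


if __name__ == "__main__":
    # The wchan string here is a made-up placeholder, not a real kernel symbol.
    print(classify("D+", "some_io_wait"))
```

A process like the one in the original post would correspond to the "D" case here: it has a wait channel, it is not a zombie, and the kill cannot act until the I/O it is waiting on completes.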