Leisner.Henr@xerox.com (Marty) (08/04/89)
On a sun386 running SunOS 4.0, I was playing around with Minix boot disks in
a DOS window.

I had an infinite loop in my boot loader, and I couldn't kill the DOS task
via a
kill -9 pid
which I thought always worked (I tried doing it as root after my own
account didn't work).

I did a shutdown (geez, maybe it knows something I don't know) and it said:

"Something is hung--won't die, ps axl advised".

I give up...Was Jason in my machine?

marty
ARPA: leisner.henr@xerox.com
GV: leisner.henr
NS: leisner:wbst139:xerox
UUCP: hplabs!arisia!leisner
debra@alice.UUCP (Paul De Bra) (08/05/89)
In article <20495@adm.BRL.MIL> Leisner.Henr@xerox.com (Marty) writes:
}On a sun386 running SunOS 4.0, I was playing around with Minix boot disks in
}a DOS window.
}
}I had an infinite loop in my boot loader, and I couldn't kill the DOS task
}via a
}kill -9 pid
}which I thought always worked (I tried doing it as root after my own
}account didn't work).
}
}I did a shutdown (geez, maybe it knows something I don't know) and it said:
}
}"Something is hung--won't die, ps axl advised".
}

kill -9 pid (executed as the owner of the process or as root) is
guaranteed to work.

When the process exits (due to the kill -9) it may get stuck in a device
driver or something, so it enters a "zombie" state. This means that the
process is busy exiting, but hasn't quite gone far enough to tell init that
it's really gone.

In any case, for your purpose the kill -9 must have stopped the infinite
loop. Had you executed ps a couple of times, you should have noticed that
the process was no longer consuming CPU time. It should also have been
marked as <exiting> instead of its own name.

Paul.

>I give up...Was Jason in my machine?
>
>marty
>ARPA: leisner.henr@xerox.com
>GV: leisner.henr
>NS: leisner:wbst139:xerox
>UUCP: hplabs!arisia!leisner
>
--
------------------------------------------------------
|debra@research.att.com | uunet!research!debra |
------------------------------------------------------
guy@auspex.auspex.com (Guy Harris) (08/06/89)
>I had an infinite loop in my boot loader, and I couldn't kill the DOS task
>via a
>kill -9 pid
>which I thought always worked (I tried doing it as root after my own
>account didn't work).

Nope. Processes in UNIX (or, at least, AT&T-derived UNIXes, including
4.xBSD), when blocked, are either sleeping at "positive" or "non-positive"
priorities. (The quotes are there because all priorities are numerically
>= 0; they weren't in V6, but they're stored in a "char", and I suspect when
they ported V7 to the Interdata machines, said machines' C implementation
had unsigned "char"s, so they fixed the problem by adding PZERO to the
priority values, so "positive" means "> PZERO".)

A sleep at a "positive" priority is interruptible; if a signal arrives, the
process is woken up. A sleep at a "non-positive" priority is not
interruptible; the process stays blocked until it's explicitly woken up.

The idea is that, for example, if a process is holding onto some critical
resource, it will sleep at a "non-positive" priority, since an "interrupted"
sleep causes the moral equivalent of a "longjmp", so the process has no
chance to release said critical resource. In general, if the process has
done something that requires undoing, it would sleep at a "non-positive"
priority.

In more recent versions of UNIX, including S5Rn (for some value of "n" <= 2)
and SunOS 3.2 and later (which picked this up from S5), you can specify that
an interrupted "sleep" should, instead of "longjmp"ing, just return 1 (in
these later versions, it returns 0 if not interrupted), which gives the
process a chance to release critical resources, etc. *Some* cases of sleeps
at "non-positive" priorities can be replaced with interruptible sleeps in
those systems; I don't know that all can, though, since it may still be
extremely difficult or impossible to undo anything the process has done in
the kernel.

(In addition, processes sleeping inside the forced "close" of all open
descriptors when exiting can't be killed, either; they're already dead....)

Presumably, the process in question was sleeping at a "non-positive"
priority.
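
The rule described above can be written down as a toy model. This is a
sketch in Python, not kernel code: the PZERO value of 25 and the parameter
name "catch" (standing for "return instead of longjmp") are assumptions made
for illustration only.

```python
PZERO = 25  # assumed value, for illustration only


def on_signal(pri, catch=False):
    """What happens when a signal is posted to a process sleeping at `pri`.

    `catch` is a hypothetical name for the newer "just return 1" option.
    """
    if pri <= PZERO:
        # "Non-positive" priority: the sleep is not interruptible.
        return "stays blocked until explicitly woken"
    if catch:
        # Newer systems: the sleep returns 1 so the caller can clean up.
        return "sleep returns 1"
    # Classic behaviour: the moral equivalent of a longjmp out of the call.
    return "system call aborted via longjmp"


print(on_signal(PZERO - 5))             # e.g. waiting on a critical resource
print(on_signal(PZERO + 3))             # e.g. an interruptible wait
print(on_signal(PZERO + 3, catch=True))
```

Under this model a kill -9 sent to a process sleeping at or below PZERO is
simply left pending, which matches the behaviour the original poster saw.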
jerry@xroads.UUCP (Jerry M. Denman) (08/07/89)
In article <9748@alice.UUCP> debra@alice.UUCP () writes:
>In article <20495@adm.BRL.MIL> Leisner.Henr@xerox.com (Marty) writes:
>}I had an infinite loop in my boot loader, and I couldn't kill the DOS task
>}via a
>}kill -9 pid
>
>kill -9 pid (executed as the owner of the process or as root) is
>guaranteed to work.
>

I would have to differ in opinion on that answer. According to Bach, if a
process gets "hung" while in kernel mode, there is no way to kill it. This
is to prevent corruption of the kernel tables. If a process is in any other
mode besides kernel, then a kill -9 will terminate it.

The most common example of this is if you hang a device driver. Drivers
spend a greater share of their time executing kernel-level tasks and do tend
to drop off into never-never land without notice. Many times when this
happens a reboot is the only way to clear the process from the table.

Of course, I have been known to be wrong.
dmt@PacBell.COM (Dave Turner) (08/07/89)
In article <9748@alice.UUCP> debra@alice.UUCP () writes:
>kill -9 pid (executed as the owner of the process or as root) is
>guaranteed to work.
>

According to ps(1) in my URM SVR2.1.0 for the 3B20, if the Flag contains
a 10 the process "cannot be awakened by signal".

>when the process exits (due to the kill -9) it may get stuck in a device
>driver or something, so it enters a "zombie" state. This means that the
>process is busy exiting, but hasn't quite gone far enough to tell init that
>it's really gone.

There are times when a process cannot be killed and does not enter the
zombie state. It will not use cpu time and will live (in a coma) until
the system is rebooted. I have seen this on other systems besides 3B20s.

--
Dave Turner 415/542-1299 {att,bellcore,sun,ames,decwrl}!pacbell!dmt
Thomas_McFadden.Henr801M@xerox.com (08/08/89)
Marty,

Using kill -9 does not work when a process is waiting on I/O to complete or
when the priority of the process is set by the kernel to be less than PZERO
(defined in <sys/param.h>). Your process may have caused this if it was
doing a lot of I/O and the kill program never got a chance to post the
signal to the runaway process. Otherwise, the kernel may have raised the
priority of your process and put it to sleep waiting for some event (e.g.
I/O) to complete. The kernel does this so that another process doesn't come
along, use the same process memory space or other shared resource, and get
trashed.

Tom
hutch@lzaz.ATT.COM (R.HUTCHISON) (08/08/89)
From article <9748@alice.UUCP>, by debra@alice.UUCP (Paul De Bra):
> In article <20495@adm.BRL.MIL> Leisner.Henr@xerox.com (Marty) writes:
[stuff omitted]
> when the process exits (due to the kill -9) it may get stuck in a device
> driver or something, so it enters a "zombie" state. This means that the
> process is busy exiting, but hasn't quite gone far enough to tell init that
> it's really gone.
[stuff omitted]
> Paul.
>>I give up...Was Jason in my machine?

Small correction. If the process was hung in an exit...

Context:
    - System V, Release 0,2,3

Scenario:
    - process gets signal (any) or calls exit without closing
      all files explicitly
    - exit called
    - exit ignores all (including #9) signals
    - exit closes all open files
    - exit changes process to ZOMBIE
    - exit deallocates all memory
    - ... and so on

If the close() routine for a logical device wants to contact the physical
device and wait for a response, it should have a timer set, in case the
device doesn't respond. Sometimes the driver writer doesn't put in a timer
(mistake). The device never responds. The close() never finishes. The
signal is already being ignored. The process hasn't been changed into a
zombie yet. Its memory hasn't been deallocated yet. It's just sitting
there, wasting memory and slots in kernel tables.

Question to the original poster: If you do a ps -l on the process, what is
its value under the PRI heading? My guess is that there is a bug in the
device driver for that device and it might be hanging the process with the
priority high (low number) or with the "don't wake me up when a signal comes
in" flag (value 10 in ps listing under F heading?). The ps output would
vary depending on your version of the OS.

Bob Hutchison
att!lzaz!hutch
nagle@well.UUCP (John Nagle) (08/09/89)
OK. What's going on here is simple, but has several parts.

First, you can send a signal, including signal 9, to a process at any time.
But no action is taken on a signal until the receiving process is in a
position to receive signals, with control in user space. So, in general, if
you send a signal to a process while the process is making a system call,
the signal will not be processed until the system call is completed. This
protects the internal consistency of the kernel's tables.

(Historical note: In TOPS-20, you could kill a process while it was making
a system call. This made for an interesting kernel architecture. UNIX is
simpler internally because this is disallowed.)

Thus, if a process is making a system call, and the system call has resulted
in a wait within the kernel, sending a signal to that process will have no
effect until the wait completes.

However, to prevent processes from remaining stuck at some well-known wait
points, such as waiting for input from a terminal, there is special code
within the kernel so that some specific wait conditions are checked for when
a signal is sent, and the kernel will abort those waits. I don't have
access to kernel sources here, so I can't check this, but I think that all
kernel-buffered character device waits can be escaped. SELECT is also
escapable via signal, as I recall.

John Nagle
jeffrey@algor2.uu.net (Jeffrey Kegler) (08/09/89)
Bob Hutchison (att!lzaz!hutch) writes:
=> My guess is that there is a
=> bug in the device driver for that device and it might be hanging the
=> process with the priority high (low number) or with the "don't wake me
=> up when a signal comes in" flag
I would say that if kill -9 does not kill a process, that is by
definition a kernel bug. Almost certainly it is a device driver
causing the bug. I specialize in driver writing and have seen a lot
of marginal code in device drivers. I think a lot of people writing
them do not realize you should not sleep with signals disabled on a
hardware event, or any event which might take a while to occur. No
matter how quick you expect the response from the hardware, and how
reliable it is, hardware can fail. A timer should be thrown in to
wake up the process, in case the hardware event does not happen, if
you find it necessary to sleep with signals disabled.
Often, this is how a race condition manifests. That is, you write the
code:
1) Start board doing whatever.
2) Sleep on interrupt with signals disabled.
If the board finishes and interrupts before you sleep, you will sleep
forever, and the process will be unkillable.
In short, if you ever have this problem, ask the vendor to fire
whoever wrote the driver and hire me. It is a bug and a readily
preventable one. There are only so many sleep()'s in the code, they
can all be grep'ed out and they can all be proofed against this
problem. Anything less is driver writing malpractice.
--
Jeffrey Kegler, Independent UNIX Consultant, Algorists, Inc.
jeffrey@algor2.ALGORISTS.COM or uunet!algor2!jeffrey
1762 Wainwright DR, Reston VA 22090
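
The race described above (the board interrupts before the process sleeps)
and the suggested safety timer can be imitated in user space. The sketch
below is Python with a condition variable, not driver code; the Board class
and its method names are hypothetical. checked_sleep() also shows a second
common remedy that the post does not spell out: test a completion flag
under the lock before sleeping.

```python
import threading


class Board:
    """Hypothetical device; interrupt() stands in for the hardware interrupt."""

    def __init__(self):
        self._cond = threading.Condition()
        self.done = False  # set by the "interrupt handler"

    def interrupt(self):
        with self._cond:
            self.done = True
            self._cond.notify_all()  # wakes only threads already sleeping

    def naive_sleep(self, timeout):
        # Buggy pattern: sleep without checking whether the event already
        # happened. The timeout plays the role of the safety timer.
        with self._cond:
            return self._cond.wait(timeout)

    def checked_sleep(self, timeout):
        # Test the completion flag under the lock before (and after) sleeping.
        with self._cond:
            return self._cond.wait_for(lambda: self.done, timeout)


board = Board()
board.interrupt()                        # the board finishes before we sleep
woken = board.naive_sleep(timeout=0.2)   # wakeup was lost; the timer fires
seen = board.checked_sleep(timeout=0.2)  # flag already set; returns at once
print("naive sleep woken by interrupt:", woken)
print("checked sleep saw completion:", seen)
```

Without the timeout, naive_sleep() would block forever here, which is the
user-space analogue of the unkillable process.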
dg@lakart.UUCP (David Goodenough) (08/09/89)
dmt@PacBell.COM (Dave Turner) sez:
> In article <9748@alice.UUCP> debra@alice.UUCP () writes:
>>when the process exits (due to the kill -9) it may get stuck in a device
>>driver or something, so it enters a "zombie" state. This means that the
>>process is busy exiting, but hasn't quite gone far enough to tell init that
>>it's really gone.
>
> There are times when a process cannot be killed and does not enter the
> zombie state. It will not use cpu time and will live (in a coma) until
> the system is rebooted. I have seen this on other systems besides 3B20s.

Also, processes that have exited and not been waited for ("<defunct>")
can't be removed with a kill -9. As Chris Torek, or Doug Gwyn, or someone
said:

"There's no point shooting a corpse - it's already dead"
--
        dg@lakart.UUCP - David Goodenough               +---+
                                                IHS     | +-+-+
        ....... !harvard!xait!lakart!dg                 +-+-+ |
AKA:    dg%lakart.uucp@xait.xerox.com                     +---+
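
The point about <defunct> processes can be tried out with a short Python
sketch. It assumes a POSIX system where os.fork and os.waitid are available
(Linux, for example); the exit status 7 is arbitrary.

```python
import os
import signal

pid = os.fork()
if pid == 0:
    os._exit(7)  # child: exit at once with a recognisable status

# Block until the child has exited, but leave it unreaped (a zombie).
os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

# Signalling the zombie is accepted by the kernel but changes nothing.
os.kill(pid, signal.SIGKILL)

# The process-table slot is freed only when the parent waits for the child.
_, status = os.waitpid(pid, 0)
exited_normally = os.WIFEXITED(status)
exit_code = os.WEXITSTATUS(status) if exited_normally else None
print("exited normally:", exited_normally, "exit code:", exit_code)
```

The status collected by the parent still reports a normal exit, not death
by signal 9: the kill arrived after the process was already dead.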
gnb@melba.bby.oz (Gregory N. Bond) (08/17/89)
One source of non-interruptible sleeps in live processes is kernel-based
network services. (My experience is with SunOS 3.5 kernels, but I suspect
most NFS implementations would have the same problems.)

I have had a number of experiences with processes hanging when things like
lockd or statd or portmap are dead on a machine on the network. These are
things used by the kernel algorithms, so the sleeps are at < PZERO priority.

This morning, suntools on one workstation was frozen because one process was
trying to lock a file, and the local statd was dead. kill -9 wouldn't kill
the process. When the statd was restarted (a hairy experience, too!) the
process went away and the accumulated input events were processed by the
window system.

These are an indication that the paradigm for NFS in huge kernels is a bit
strained. Perhaps a mach-like messages-with-kernel-processes paradigm could
avoid this?

Greg.
--
Gregory Bond, Burdett Buckeridge & Young Ltd, Melbourne, Australia
Internet: gnb@melba.bby.oz.au non-MX: gnb%melba.bby.oz@uunet.uu.net
Uucp: {uunet,pyramid,ubc-cs,ukc,mcvax,prlb2,nttlab...}!munnari!melba.bby.oz!gnb