[comp.unix.wizards] when does kill -9 pid not work?

Leisner.Henr@xerox.com (Marty) (08/04/89)

On a sun386 running sunOS4.0, I was playing around with Minix boot disks in
a DOS window.

I had an infinite loop in my boot loader, and I couldn't kill the DOS task
via a
kill -9 pid
which I thought always worked (I tried doing it as root after my own
account didn't work).

I did a shutdown (geez, maybe it knows something I don't know) and it said:

"Something is hung--won't die, ps axl advised".

I give up...Was Jason in my machine?

marty
ARPA:	leisner.henr@xerox.com
GV:  leisner.henr
NS:  leisner:wbst139:xerox
UUCP:	hplabs!arisia!leisner

debra@alice.UUCP (Paul De Bra) (08/05/89)

In article <20495@adm.BRL.MIL> Leisner.Henr@xerox.com (Marty) writes:
}On a sun386 running sunOS4.0, I was playing around with Minix boot disks in
}a DOS window.
}
}I had an infinite loop in my boot loader, and I couldn't kill the DOS task
}via a
}kill -9 pid
}which I thought always worked (I tried doing it as root after my own
}account didn't work).
}
}I did a shutdown (geez, maybe it knows something I don't know) and it said:
}
}"Something is hung--won't die, ps axl advised".
}

kill -9 pid (executed as the owner of the process or as root) is
guaranteed to work.

when the process exits (due to the kill -9) it may get stuck in a device
driver or something, so it enters a "zombie" state. This means that the
process is busy exiting, but hasn't quite gone far enough to tell init that
it's really gone.

in any case for your purpose the kill -9 must have stopped the infinite
loop. had you executed ps a couple of times then you should have noticed
that the process was no longer consuming cpu-time. it should also have been
marked as <exiting> instead of its own name.

Paul.
>I give up...Was Jason in my machine?
>
>marty
>ARPA:	leisner.henr@xerox.com
>GV:  leisner.henr
>NS:  leisner:wbst139:xerox
>UUCP:	hplabs!arisia!leisner
> 

-- 
------------------------------------------------------
|debra@research.att.com   | uunet!research!debra     |
------------------------------------------------------

guy@auspex.auspex.com (Guy Harris) (08/06/89)

>I had an infinite loop in my boot loader, and I couldn't kill the DOS task
>via a
>kill -9 pid
>which I thought always worked (I tried doing it as root after my own
>account didn't work).

Nope.  Processes in UNIX (or, at least, AT&T-derived UNIXes, including
4.xBSD), when blocked, are either sleeping at "positive" or "non-positive"
priorities.  (The quotes are there because all priorities are
numerically >= 0; they weren't in V6, but they're stored in a "char",
and I suspect when they ported V7 to the Interdata machines, said
machines' C implementation had unsigned "char"s, so they fixed the
problem by adding PZERO to the priority values, so "positive" means ">
PZERO".)

A sleep at a "positive" priority is interruptable; if a signal arrives,
the process is woken up.  A sleep at a "non-positive" priority is not
interruptable; the process stays blocked until it's explicitly woken up.
The idea is that, for example, if a process is holding onto some
critical resource, it will sleep at a "non-positive" priority, since an
"interrupted" sleep causes the moral equivalent of a "longjmp", so the
process has no chance to release said critical resource.  In general, if
the process has done something that requires undoing, it would sleep at
a "non-positive" priority.

In more recent versions of UNIX, including S5Rn (for some value of "n"
<= 2) and SunOS 3.2 and later (which picked this up from S5), you can
specify that an interrupted "sleep" should, instead of "longjmp"ing,
just return 1 (in these later versions, it returns 0 if not
interrupted), which gives the process a chance to release critical
resources, etc.  *Some* cases of sleeps at "non-positive" priorities
can be replaced with interruptable sleeps in those systems; I don't know
that all can, though, since it may still be extremely difficult or
impossible to undo anything the process has done in the kernel.

(In addition, processes sleeping inside the forced "close" of all open
descriptors when exiting can't be killed, either; they're already
dead....)

Presumably, the process in question was sleeping at a "non-positive"
priority.
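
As a rough illustration of the rule above, here is a toy user-space model
in Python. It is not kernel code. The value given to PZERO, the names
sleep_outcome and locked_operation, and the catch parameter (standing in
for the "return 1 instead of longjmp" option) are all assumptions made
for this sketch.

```python
# Toy model of the sleep-priority rule described above; not kernel code.
PZERO = 25  # assumed value for illustration; the real one is in <sys/param.h>


class SleepInterrupted(Exception):
    """Stands in for the longjmp taken by an interrupted sleep."""


def sleep_outcome(pri, signal_pending, catch=False):
    """Return 0 for a normal wakeup, 1 for a caught interruption.

    pri <= PZERO ("non-positive"): signals are ignored until the wakeup.
    pri >  PZERO ("positive"): a pending signal interrupts the sleep.
    """
    if pri <= PZERO or not signal_pending:
        return 0                      # blocked until explicitly woken up
    if catch:
        return 1                      # caller gets a chance to clean up
    raise SleepInterrupted()          # older behaviour: no chance to clean up


def locked_operation(pri, signal_pending):
    """Hypothetical caller that holds a resource across the sleep."""
    resource_held = True
    interrupted = sleep_outcome(pri, signal_pending, catch=True)
    resource_held = False             # released on both paths
    return ("interrupted" if interrupted else "completed"), resource_held
```

With catch left off, a signal arriving during a "positive" sleep raises
SleepInterrupted and locked_operation would never reach the line that
releases the resource, which is the reason given above for sleeping at a
"non-positive" priority in the older systems.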

jerry@xroads.UUCP (Jerry M. Denman) (08/07/89)

In article <9748@alice.UUCP> debra@alice.UUCP () writes:
>In article <20495@adm.BRL.MIL> Leisner.Henr@xerox.com (Marty) writes:
>}I had an infinite loop in my boot loader, and I couldn't kill the DOS task
>}via a
>}kill -9 pid
>
>kill -9 pid (executed as the owner of the process or as root) is
>guaranteed to work.
>

I would have to differ in opinion on that answer.  According to Bach,
if a process gets "hung" while in kernel mode, there is no way to kill
it.  This is to prevent corruption of the kernel tables.  If a process is
in any other mode besides kernel, then a kill -9 will terminate it.  The
most common example of this is if you hang a device driver.  They spend
a greater share of the time executing kernel-level tasks and do tend to
drop off into never never land without notice.  Many times when this happens
a reboot is the only way to clear the process from the table.

Of course, I have been known to be wrong.

dmt@PacBell.COM (Dave Turner) (08/07/89)

In article <9748@alice.UUCP> debra@alice.UUCP () writes:
>kill -9 pid (executed as the owner of the process or as root) is
>guaranteed to work.
>
According to ps(1) in my URM SVR2.1.0 for the 3B20,
if the Flag contains a 10 the process "cannot be awakened by signal".

>when the process exits (due to the kill -9) it may get stuck in a device
>driver or something, so it enters a "zombie" state. This means that the
>process is busy exiting, but hasn't quite gone far enough to tell init that
>it's really gone.

There are times when a process cannot be killed and does not enter the
zombie state. It will not use cpu time and will live (in a coma) until
the system is rebooted. I have seen this on other systems besides 3B20s.


-- 
Dave Turner	415/542-1299	{att,bellcore,sun,ames,decwrl}!pacbell!dmt

Thomas_McFadden.Henr801M@xerox.com (08/08/89)

Marty,

Using kill -9 does not work when a process is waiting on i/o to complete or
when the priority of the process is set by the kernel to be less than PZERO
found in <sys/param.h>. Your process may have caused this if it was doing
a lot of i/o and the kill program never got a chance to post the signal to
the runaway process. Otherwise, the kernel may have raised the priority
of your process and put it to sleep waiting for some event (e.g. i/o) to
complete. The kernel does this so that another process doesn't come along,
use the same process memory space or other shared resource, and get
trashed.

Tom

hutch@lzaz.ATT.COM (R.HUTCHISON) (08/08/89)

From article <9748@alice.UUCP>, by debra@alice.UUCP (Paul De Bra):
> In article <20495@adm.BRL.MIL> Leisner.Henr@xerox.com (Marty) writes:
[stuff omitted]
> when the process exits (due to the kill -9) it may get stuck in a device
> driver or something, so it enters a "zombie" state. This means that the
> process is busy exiting, but hasn't quite gone far enough to tell init that
> it's really gone.
[stuff omitted]
> Paul.
>>I give up...Was Jason in my machine?

Small correction.  If the process was hung in an exit...

Context:
	- System V, Release 0,2,3
Scenario:
	- process gets signal (any) or calls exit without closing all
	  files explicitly
	- exit called
	- exit ignores all (including #9) signals
	- exit closes all open files
	- exit changes process to ZOMBIE
	- exit deallocates all memory
	- ... and so on

If the close() routine for a logical device wants to contact 
the physical device and wait for a response, it should have a 
timer set, in case the device doesn't respond.  Sometimes the 
driver writer doesn't put in a timer (mistake).  The device 
never responds.  The close() never finishes.  The signal is 
already being ignored.  The process hasn't been changed
into a zombie yet.  Its memory hasn't been deallocated yet.  It's just
sitting there, wasting memory and slots in kernel tables.
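
The idea of putting a timer on the wait can be sketched in user space
with Python's threading.Event, which is only an analogy for a driver's
sleep/wakeup. The function name close_with_timer and the 0.05 second
timeout are made up for the example.

```python
import threading


def close_with_timer(device_replied, timeout_s):
    """Wait for the device's reply, but give up after timeout_s seconds."""
    if device_replied.wait(timeout=timeout_s):
        return "closed"
    return "timed out"            # give up instead of hanging forever


dead_device = threading.Event()   # never set: the device never responds
live_device = threading.Event()
live_device.set()                 # this device has already answered

result_dead = close_with_timer(dead_device, 0.05)
result_live = close_with_timer(live_device, 0.05)
```

Without the timeout argument, the wait on dead_device would never return,
which corresponds to the close() that never finishes.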

Question to the original poster:  If you do a ps -l on the process,
what is its value under the PRI heading?  My guess is that there is a
bug in the device driver for that device and it might be hanging the
process with the priority high (low number) or with the "don't wake me
up when a signal comes in" flag (value 10 in ps listing under F
heading?).  The ps output would vary depending on your version of the OS.
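
For what it's worth, checking that flag by hand amounts to a bit test.
The sketch below assumes the F column is printed in octal with the flag
values added together, and uses the value 10 quoted earlier in this
thread from the SVR2 ps(1) page; other systems assign the bits
differently. The names are hypothetical.

```python
NO_SIGNAL_WAKE = 0o10   # value quoted from the SVR2 ps(1) page; varies by system


def cannot_be_awakened(f_field):
    """Return True if the octal F column has the 'no wake on signal' bit set."""
    return bool(int(f_field, 8) & NO_SIGNAL_WAKE)
```

For example, an F value of "11" has the bit set and "20" does not.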

Bob Hutchison
att!lzaz!hutch

nagle@well.UUCP (John Nagle) (08/09/89)

     OK.  What's going on here is simple, but has several parts.

     First, you can send a signal, including signal 9, to a process at
any time.  But no action is taken on a signal until the receiving
process is in a position to receive signals, with control in user space.
So, in general, if you send a signal to a process while the process is
making a system call, the signal will not be processed until the
system call is completed.  This protects the internal consistency of
the kernel's tables.  (Historical note: In TOPS-20, you could kill a
process while it was making a system call.  This made for an interesting
kernel architecture.  UNIX is simpler internally because this is disallowed.)

     Thus, if a process is making a system call, and the system call
has resulted in a wait within the kernel, sending a signal to that process
will have no effect until the wait completes.  

      However, to prevent processes from remaining stuck at some well-known
wait points, such as waiting for input from a terminal, there is special
code within the kernel so that some specific wait conditions are checked
for when a signal is sent, and the kernel will abort those waits.
I don't have access to kernel sources here, so I can't check this,
but I think that all kernel-buffered character device waits can be
escaped.  SELECT is also escapable via signal, as I recall.
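
The "sent at any time, acted on later" behaviour has a loose user-level
analogy in the signal mask: a blocked signal is recorded as pending and
nothing happens until the receiver chooses to take it. The sketch below
uses Python's signal module on a Unix-like system. It shows the mask,
not a wait inside the kernel, so treat it as an analogy only. SIGUSR1
is used instead of signal 9 because signal 9 cannot be blocked.

```python
import signal
import threading

SIG = signal.SIGUSR1

# Hold SIGUSR1 for this thread: it can be sent, but not acted on yet.
old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {SIG})

# Send the signal to this thread; it is recorded as pending.
signal.pthread_kill(threading.get_ident(), SIG)
pending_while_blocked = SIG in signal.sigpending()

# Only now does the thread put itself in a position to receive it.
received = signal.sigwait({SIG})

# Restore whatever mask was in effect before.
signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```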

					John Nagle

jeffrey@algor2.uu.net (Jeffrey Kegler) (08/09/89)

Bob Hutchison (att!lzaz!hutch) writes:
=> My guess is that there is a
=> bug in the device driver for that device and it might be hanging the
=> process with the priority high (low number) or with the "don't wake me
=> up when a signal comes in" flag

I would say that if kill -9 does not kill a process, that is by
definition a kernel bug.  Almost certainly it is a device driver
causing the bug.  I specialize in driver writing and have seen a lot
of marginal code in device drivers.  I think a lot of people writing
them do not realize you should not sleep with signals disabled on a
hardware event, or any event which might take a while to occur.  No
matter how quick you expect the response from the hardware, and how
reliable it is, hardware can fail.  A timer should be thrown in to
wake up the process, in case the hardware event does not happen, if
you find it necessary to sleep with signals disabled.

Often, this is how a race condition manifests.  That is, you write the
code:

	1) Start board doing whatever.
	2) Sleep on interrupt with signals disabled.

If the board finishes and interrupts before you sleep, you will sleep
forever, and the process will be unkillable.
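
The same lost wakeup can be reproduced in user space. The sketch below
uses Python's threading.Condition as a stand-in for sleep/wakeup. The
names board_interrupt and state are hypothetical, and the short timeouts
are there only so the racy case returns instead of hanging.

```python
import threading

cond = threading.Condition()
state = {"done": False}     # set by the "interrupt" when the board finishes


def board_interrupt():
    """Hypothetical interrupt handler: record completion, wake sleepers."""
    with cond:
        state["done"] = True
        cond.notify_all()


# 1) Start the board; here it finishes before we get around to sleeping.
board_interrupt()

# 2a) Racy version: sleep unconditionally.  The wakeup already happened,
#     so only the timeout saves us; without it this would block forever.
with cond:
    racy_woken = cond.wait(timeout=0.05)

# 2b) Safer version: test the completion flag under the same lock first.
with cond:
    safe_woken = cond.wait_for(lambda: state["done"], timeout=0.05)
```

Here racy_woken comes back False (the wait timed out) while safe_woken
comes back True at once. In a driver the equivalent precaution is to make
the "has it already finished?" check and the sleep atomic with respect to
the interrupt.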

In short, if you ever have this problem, ask the vendor to fire
whoever wrote the driver and hire me.  It is a bug and a readily
preventable one.  There are only so many sleep()'s in the code, they
can all be grep'ed out and they can all be proofed against this
problem.  Anything less is driver writing malpractice.
-- 

Jeffrey Kegler, Independent UNIX Consultant, Algorists, Inc.
jeffrey@algor2.ALGORISTS.COM or uunet!algor2!jeffrey
1762 Wainwright DR, Reston VA 22090

dg@lakart.UUCP (David Goodenough) (08/09/89)

dmt@PacBell.COM (Dave Turner) sez:
> In article <9748@alice.UUCP> debra@alice.UUCP () writes:
>>when the process exits (due to the kill -9) it may get stuck in a device
>>driver or something, so it enters a "zombie" state. This means that the
>>process is busy exiting, but hasn't quite gone far enough to tell init that
>>it's really gone.
> 
> There are times when a process cannot be killed and does not enter the
> zombie state. It will not use cpu time and will live (in a coma) until
> the system is rebooted. I have seen this on other systems besides 3B20s.

Also, processes that have exited and not been waited for (shown as "<defunct>")
can't be removed with a kill -9. As Chris Torek, or Doug Gwyn, or someone said:
"There's no point shooting a corpse - it's already dead."
-- 
	dg@lakart.UUCP - David Goodenough		+---+
						IHS	| +-+-+
	....... !harvard!xait!lakart!dg			+-+-+ |
AKA:	dg%lakart.uucp@xait.xerox.com		  	  +---+
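
The <defunct> case is easy to reproduce. The sketch below assumes a
Unix-like system where Python's os.fork and os.waitid are available
(Linux, for instance). The exit status 7 is arbitrary.

```python
import os
import signal

pid = os.fork()
if pid == 0:
    os._exit(7)                       # child: exit at once with status 7

# Parent: wait until the child has exited, but leave it unreaped (a zombie).
os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

os.kill(pid, signal.SIGKILL)          # accepted, but there is nothing to kill

# The zombie goes away only when the parent collects its status.
_, status = os.waitpid(pid, 0)
exited_normally = os.WIFEXITED(status)
exit_code = os.WEXITSTATUS(status)
```

The status collected at the end is still the child's own exit status,
not "killed by signal 9": the kill had no effect on the corpse.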

gnb@melba.bby.oz (Gregory N. Bond) (08/17/89)

One source of non-interruptable sleeps in live processes is
kernel-based network services. (My experience is with SunOS 3.5
kernels, but I suspect most NFS implementations would have the same
problems.) 

I have had a number of experiences with processes hanging when things
like lockd or statd or portmap are dead on a machine on the network.
These are things used by the kernel algorithms, so are < PZERO
priority. This morning, suntools on one workstation was frozen because
one process was trying to lock a file, and the local statd was dead.
kill -9 wouldn't kill the process.  When the statd was restarted (a
hairy experience, too!) the process went away and the accumulated
input events were processed by the window system.

These are an indication that the paradigm for NFS in huge kernels is a
bit strained.  Perhaps a mach-like messages-with-kernel-processes
paradigm could avoid this?

Greg.
-- 
Gregory Bond, Burdett Buckeridge & Young Ltd, Melbourne, Australia
Internet: gnb@melba.bby.oz.au    non-MX: gnb%melba.bby.oz@uunet.uu.net
Uucp: {uunet,pyramid,ubc-cs,ukc,mcvax,prlb2,nttlab...}!munnari!melba.bby.oz!gnb