[comp.unix.wizards] Defunct process

mark@cygnet.UUCP (Mark Quattrocchi) (03/09/90)

Someone out there must have the answer to this little annoyance...

I occasionally get both a <defunct> and <exiting> process after having
tip croak on me. This happens once in a while when I manually try
logging into other systems to generate chat scripts. Somehow the remote
system sends me some garbage and tip just freaks out. The only way out
is to kill my shell and log in again. So now I'm back in and I have two
processes left over from the dead tip, one <exiting> on my tty, and a
<defunct> one belonging to nobody (these processes exist even before I
kill my shell). Even as root I can't kill these damn processes and it 
leaves me with an unusable modem until I reboot the system. BTW this is 
on a Sun 3/280 using 4.0.3. Any ideas on how to get out of this state 
would be great. 

Mark Quattrocchi {3comvax|oliveb|hplabs}!cygnet!mark

pratap@hpcllcm.HP.COM (Pratap Subrahmanyam) (03/10/90)

Here's my 2 cents worth on this topic. When a process that has a lot of
children (like a shell) dies for some reason (such as a kill signal), the
OS takes it upon itself to walk down the list of children and reparent them
to the root process (or the init process). This works fine in most cases.

When a child dies it sends a signal to its parent (I think it's called a
"death_of_child_signal"). When the parent receives this signal, it resets the
process PID table, after doing several other cleanup operations (like closing
open files, pipes, etc.). Now the PID table will not contain an entry for
the child process. (This is why ps -ef will not show it.)

However, suppose there is a race condition like this: the child dies soon after
the parent is "killed", that is, the child dies before it can be reparented.
Then the signal that the child sends out will be lost in space. No process
exists to receive it. Hence it will be there, an orphan. I don't believe that
such orphan processes cause any overhead, because even the OS will not know of
their existence. This means that the process never gets scheduled again, etc.
I'm not sure what happens to the space allocated to the process image.

In any case, in this situation, the PID table doesn't get updated. That is
why you see <defunct> processes with ps -ef.


If anyone has better (or if this is a bogus) answers, please post. I'll
be interested.

- Pratap
  pratap@hpcllcm.hp.hplabs.com.

ray@ctbilbo.UUCP (Ray Ward) (03/10/90)

In article <1805@cygnet.UUCP> mark@cygnet.UUCP (Mark Quattrocchi) writes:
>I occasionally get both a <defunct> and <exiting> process after having
>tip croak on me. [...]
>                Even as root I can't kill these damn processes and it 
>leaves me with an unusable modem until I reboot the system.

From the symptoms you describe, it seems that a likely explanation
is that the driver you are using is sleeping, waiting for a high-level
interrupt to wake him up.  Unfortunately, the communications have somehow
croaked, so the interrupt will never come.  The driver has set the
interrupt level so high that a normal signal will not be able to break
him out of his sleep.  Rebooting is the only general, reliable method
I know of to remedy the problem.

Perhaps there should be another command to allow su to interrupt
sleeping beauty?  (As opposed to ad hoc hacking with the kernel...)

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Ray Ward                                          Email:  uunet!ctbilbo!ray  
Voice:  (214) 991-8338x226, (800) 331-7032        Fax  :  (214) 991-8968     
=-=-=-=-  There _are_ simple answers, just no _easy_ ones. -- R.R. -=-=-=-=

chris@mimsy.umd.edu (Chris Torek) (03/10/90)

In article <6840005@hpcllcm.HP.COM> pratap@hpcllcm.HP.COM
(Pratap Subrahmanyam) writes:
>... race condition ... The child dies soon after the parent is "killed",
>that is the child dies before it can be reparented.  Then the signal that
>the child sends out, will be lost in space.

That would be a kernel bug.  Fortunately, those who wrote the kernel were
not that sloppy.  When a parent exits, its children are passed over to
/etc/init (process 1).  If they try to exit while they are moving, nothing
happens until they finish moving; then they finish exiting and init wait()s
for them.  Then they go away.
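
A minimal sketch of that hand-off, assuming a POSIX system with Python 3.
The function name orphan_new_parent is made up for illustration. On the
systems discussed here the adopter is process 1; some modern systems can
designate a different "subreaper" process, so the sketch only reports
whichever pid the kernel chose rather than assuming it is 1.

```python
import os
import time


def orphan_new_parent(timeout=5.0):
    """Orphan a process on purpose; return (dead_parent_pid, adopter_pid)."""
    r, w = os.pipe()
    mid = os.fork()
    if mid == 0:
        # Intermediate process: remember our pid, fork once, exit at once.
        os.close(r)
        me = os.getpid()
        if os.fork() == 0:
            # Grandchild: poll until the kernel has handed us to a new parent.
            deadline = time.monotonic() + timeout
            while os.getppid() == me and time.monotonic() < deadline:
                time.sleep(0.01)
            os.write(w, str(os.getppid()).encode())
            os._exit(0)
        os._exit(0)
    # Original process: reap the intermediate, then read the orphan's report.
    os.close(w)
    os.waitpid(mid, 0)
    report = os.read(r, 64)
    os.close(r)
    return mid, int(report)


if __name__ == "__main__":
    dead, adopter = orphan_new_parent()
    print("parent", dead, "exited; orphan now belongs to pid", adopter)
```

The orphan is never waited for by the original process; whoever adopted it
is responsible for that, which is the point of the reparenting.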

>That is why you see <defunct> processes with ps -ef.

No.  There are two reasons for <defunct> or <exiting> processes: kernel
bugs (typically in device drivers), and parent processes that do not wait()
for children.
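
The second cause is easy to reproduce. The following is a sketch only,
assuming Linux and Python 3 (os.waitid is not available everywhere); the
function name zombie_lifecycle is hypothetical. It lets a child exit, looks
at it without collecting it, and then wait()s for it. Signal 0 is used purely
as an "is this pid still in the process table?" probe.

```python
import os


def pid_in_table(pid):
    """Probe the process table: signal 0 checks existence, delivers nothing."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False


def zombie_lifecycle(exit_code=7):
    """Return (peeked_status, reaped_status, present_before, present_after)."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)
    # Block until the child has exited, but leave it unreaped (WNOWAIT),
    # which is the state a parent that never wait()s leaves it in.
    info = os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    present_before = pid_in_table(pid)
    # Now do what a well-behaved parent does: collect the child.
    _, status = os.waitpid(pid, 0)
    present_after = pid_in_table(pid)
    return info.si_status, os.WEXITSTATUS(status), present_before, present_after


if __name__ == "__main__":
    print(zombie_lifecycle())
```

Between the waitid() and the waitpid() the child is exactly what ps shows as
<defunct>: gone as a running program, still present as a table entry.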
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

cpcahil@virtech.uucp (Conor P. Cahill) (03/11/90)

In article <6840005@hpcllcm.HP.COM> pratap@hpcllcm.HP.COM (Pratap Subrahmanyam) writes:
   [long story deleted]
>In any case, in this situation, the PID table doesn't get updated. That is
>why you see <defunct> processes with ps -ef.

No.  <defunct> processes are simply processes that have died, but have not
yet been waited on by their parent.  These processes have an entry in the
process table, but no associated data space, etc.  BTW, the reason that they
stay around in the process table is so that the process exit status, and other
such information, can be reported to the parent.

Since these processes do not really exist, there is no way to deliver a signal
to them and therefore killing such a process has no effect.
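
A hedged way to see this for yourself, assuming Linux and Python 3 (the
function name kill_zombie is invented for the example): let a child exit
normally, send SIGKILL to the resulting <defunct> entry, and only then
collect its status. If the signal had any effect, the status would say
"killed by signal" instead of "exited".

```python
import os
import signal


def kill_zombie(exit_code=3):
    """Return (exited_normally, killed_by_signal, exit_status) of a zombie
    that was sent SIGKILL before being waited for."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)
    # Wait until the child is dead but keep its table entry (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    # The kernel accepts the request, but there is no process left to act on it.
    os.kill(pid, signal.SIGKILL)
    # Collect the entry; the recorded status is still the original exit.
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status), os.WIFSIGNALED(status), os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(kill_zombie())
```

The only thing that removes the entry is the waitpid() call by the parent,
not any signal sent to the dead child.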

The other "unkillable" processes, those that are stuck somewhere in the
kernel (usually, if not always, in device driver code) sleeping with a
priority < PZERO, are usually stuck there due to some hardware problem,
or a device driver bug. 

/* Disclaimer - this next part may be me smoking some rope, I can't create
		the problem to test it */

I believe that once stuck there they may get changed to a <defunct> by
sending a kill -9. However, they still will not go away until the
condition that got them stuck is cleared.

-- 
Conor P. Cahill            (703)430-9247        Virtual Technologies, Inc.,
uunet!virtech!cpcahil                           46030 Manekin Plaza, Suite 160
                                                Sterling, VA 22170

allbery@NCoast.ORG (Brandon S. Allbery) (03/12/90)

As quoted from <6840005@hpcllcm.HP.COM> by pratap@hpcllcm.HP.COM (Pratap Subrahmanyam):
+---------------
| When a child dies it sends a signal to its parent (I think it's called a
| "death_of_child_signal"). When the parent receives this signal, it resets the
| process PID table, after doing several other cleanup operations (like closing
| open files, pipes, etc.). Now the PID table will not contain an entry for
| the child process. (This is why ps -ef will not show it.)
+---------------

The parent process does not do this; the kernel does.

+---------------
| However, if there is a race condition, like this .. The child dies soon after
| the parent is "killed", that is the child dies before it can be reparented.
+---------------

It will still be reparented, to 1 (init).  I don't believe the race condition
you describe exists.

In any case, this does not explain <defunct>; those processes are trapped in
process tear-down because an open file (usually a device, sometimes a socket
in buggy TCP/IP implementations) can't be closed.  It usually indicates a
buggy device driver.

And SIGCLD/SIGCHLD (depending on religious affiliation ;-) is not the trigger
for process cleanup; it's part of that cleanup.  The kernel sends it, on
behalf of the process, to its parent.  In System V, it is sent *only if the
parent is expecting to receive it*; I suspect BSD is similar, since most
processes could care less about child-death signals whatever the system.
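
For a parent that does care, the usual pattern is to install a handler and
reap from inside it. This is a sketch under assumptions, not anyone's
reference implementation: Python 3 on a POSIX system, with the hypothetical
names on_sigchld, reaped and run_child. Without the handler the default
disposition of SIGCHLD is to ignore it, so an uninterested parent never
notices the signal at all.

```python
import os
import signal
import time

reaped = {}  # pid -> exit status, filled in by the handler


def on_sigchld(signum, frame):
    """Collect every child that has exited; several may share one signal."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left at all
        if pid == 0:
            return  # children exist, but none has exited yet
        reaped[pid] = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None


def run_child(exit_code, timeout=5.0):
    """Fork a child that exits at once; return (pid, status seen by handler)."""
    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(exit_code)
        deadline = time.monotonic() + timeout
        while pid not in reaped and time.monotonic() < deadline:
            time.sleep(0.01)
        return pid, reaped.get(pid)
    finally:
        signal.signal(signal.SIGCHLD, previous)


if __name__ == "__main__":
    print(run_child(5))
```

Note that the handler calls waitpid(-1, ...), so it collects any child of the
process, not just the one forked here.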

+---------------
| If anyone has better (or if this is a bogus) answers, please post. I'll
| be interested.
+---------------

Done.  Although I expect Chris Torek will have some words to say on the
subject as well.  ;-)

++Brandon
-- 
Brandon S. Allbery (human), allbery@NCoast.ORG (Inet), BALLBERY (MCI Mail)
ALLBERY (Delphi), uunet!cwjcc.cwru.edu!ncoast!allbery (UUCP), B.ALLBERY (GEnie)
BrandonA (A-Online) ("...and a partridge in a pear tree!" ;-)

deastman@pilot.njin.net (Don Eastman) (03/12/90)

Conor P. Cahill writes:

a lot of useful information and the following speculative comment.

> /* Disclaimer - this next part may be me smoking some rope, I can't create
> 		the problem to test it */
>  
> I believe that once stuck there they may get changed to a <defunct> by
> sending a kill -9. However, they still will not go away until the
> condition that got them stuck is cleared.
>

This appears to me to be exceedingly unlikely.

A process becomes <defunct> as a result of going through the exiting sequence
where kernel resources are relinquished.  It is very likely that the device
driver stuck at a noninterruptible priority is reliant upon some of these
resources.

It is also not obvious what benefits accrue from making a special case of
SIGKILL in this scenario.  Thoughts?

Don Eastman
deastman@pilot.njin.net or ...!rutgers!pilot!deastman