davidf@uunet.uu.net (David.J.Ferbrache) (03/04/89)
root%helios.UCSC.EDU@ucscc.ucsc.edu (De Clarke (Systems Mgr)):
> 1) Has anyone managed to kill off user processes that go <exiting>? Mine
> hang about for anywhere from a day to three days before they finally drop
> off. Since they do drop off in the end (I guess the lbolt that they're
> waiting on finally hits) there must be a way to force them out by making
> the lbolt arrive sooner, right? I asked Sun and they said basically
> "reboot." (Great answer >:-( ) So thought I'd ask the public.

As far as I know (as the moderator said) there is no safe way of terminating such a process. There is, however, a slightly dodgy method, which involves manipulating the kernel proc structures.

Basically, the method involves running a program (called reaper on our systems) which scans the process table for zombie entries, examining the stat field of each process structure entry for the SZOMB state. When a zombie process is found, the program rewrites the process structure for that process, modifying the PPID field to contain the process id of the reaper program. The reaper program can then wait on its newly adopted child process, which should effect an orderly cleanup of process resources.

This method works in BSD 4.2 releases, which do not contain the rewritten process table code. The old code finds processes by a search based on a hash of the pid, with no explicit structuring of child-parent relations other than the ppid field. The new process table code replaces this with a tree structure of pointers lacing together the child and parent processes. In such a more complex environment, rewriting the pointer fields is potentially far more dangerous.

The reaper program (which runs on Sun 3 OS 3.5, Orion HLH and BSD 4.2) seems to work for all processes which are actually in the zombie state, but not for processes which are blocked waiting on a close of a file in the _exit routines.

Anyway, if anybody is remotely interested I will mail them the source.
Dave Ferbrache                  Personal mail to:
Dept of computer science        Internet <davidf@cs.hw.ac.uk>
Heriot-Watt University          Janet    <davidf@uk.ac.hw.cs>
79 Grassmarket                  UUCP     ..!mcvax!hwcs!davidf
Edinburgh,UK. EH1 2HJ           Tel (UK) 031-225-6465 ext 553
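
The scan-and-adopt idea described in the post above can be sketched as a toy model. This is only an illustration in user space, not the reaper source (which is not shown in this thread): the `Proc` class, its field names, the state names as strings, and the pids are all hypothetical stand-ins for the kernel's proc structures, and the `reap` function merely models what wait() would do once the zombies have been re-parented.

```python
from dataclasses import dataclass

# Illustrative state names; a real kernel stores integer constants in the
# stat field. These strings are placeholders for the sketch.
SRUN = "SRUN"
SZOMB = "SZOMB"


@dataclass
class Proc:
    """Toy stand-in for one process table entry (hypothetical fields)."""
    pid: int
    ppid: int
    stat: str


def adopt_zombies(table, reaper_pid):
    """Rewrite the ppid of every zombie entry to the reaper's pid.

    Returns the pids that were adopted, in table order.
    """
    adopted = []
    for p in table:
        if p.stat == SZOMB and p.ppid != reaper_pid:
            p.ppid = reaper_pid
            adopted.append(p.pid)
    return adopted


def reap(table, reaper_pid):
    """Model wait(): drop zombie entries whose parent is the reaper."""
    reaped = [p.pid for p in table
              if p.stat == SZOMB and p.ppid == reaper_pid]
    table[:] = [p for p in table if p.pid not in reaped]
    return reaped


# A made-up process table: two zombies with different parents,
# plus the reaper itself running as pid 900.
table = [
    Proc(1, 0, SRUN),
    Proc(200, 1, SRUN),
    Proc(301, 200, SZOMB),
    Proc(302, 1, SZOMB),
    Proc(900, 1, SRUN),
]

adopted = adopt_zombies(table, reaper_pid=900)
reaped = reap(table, reaper_pid=900)
print(adopted, reaped, [p.pid for p in table])
```

The model only covers the flat, ppid-based table that the post attributes to BSD 4.2. With a tree of parent and child pointers, a single field rewrite like this would leave the pointers inconsistent, which is the danger the post points out.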
ado@fcs280s.ncifcrf.gov (Arthur David Olson) (03/14/89)
We're running SunOS 4.0 (plus the "general hygiene" patches) on a 3/280. I was interested in learning just what an <exiting> process was, so I tried using "gcore" to get a dump. Serendipitously enough. . .

    elsie$ /bin/su
    Password:
    elsie# ps -l#22255
           F UID   PID  PPID CP PRI NI  SZ RSS WCHAN STAT TT  TIME COMMAND
        8601   0 22255     1  0   3  0  48   0 lbolt S    h4  0:00 <exiting>
    elsie# kill -9 22255
    elsie# ps -l#22255
           F UID   PID  PPID CP PRI NI  SZ RSS WCHAN STAT TT  TIME COMMAND
        8601   0 22255     1  0   3  0  48   0 lbolt S    h4  0:00 <exiting>
    elsie# gcore 22255 &
    [1] 26519
    elsie#
    [1] + Stopped(tty output) gcore 22255 &
    %1
    gcore 22255
    gcore: 22255 exiting - not dumped
    elsie# ps -l#22255
           F UID   PID  PPID CP PRI NI  SZ RSS WCHAN STAT TT  TIME COMMAND
    elsie# exit
    elsie$ exit

Of course this gcore behavior probably changes in 4.0.1 or 4.1.
--
Arthur David Olson    ado@ncifcrf.gov    ADO is a trademark of Ampex.

[[ I wouldn't be so sure. An exiting process usually doesn't even have any allocated memory left to gcore. --wnl ]]
davidf@uunet.uu.net (David.J.Ferbrache) (03/21/89)
To everyone who asked for a copy of my zombie reaper program for Suns: thanks.

The demand has been very high, so I have decided to take the time to tidy up the code and update it to handle the tree-based process table organisation used in Sun 4.0. Expect to see an updated version in comp.sources.misc in about a fortnight. The new version should deal with all Sun OS releases (I will test it on 3.2, 3.5 and 4.0 before release), VAX BSD 4.2 and Orion HLH.

Thanks for the interest.

Dave Ferbrache                  Personal mail to:
Dept of computer science        Internet <davidf@cs.hw.ac.uk>
Heriot-Watt University          Janet    <davidf@uk.ac.hw.cs>
79 Grassmarket                  UUCP     ..!mcvax!hwcs!davidf
Edinburgh,UK. EH1 2HJ           Tel (UK) 031-225-6465 ext 553
rta@ucbvax.berkeley.edu (Rick Ace) (03/31/89)
Here's the lowdown on exiting and zombie processes, circa SunOS 3.5. It may or may not be different under 4.0.

Process exit begins when 1) the process exits voluntarily via the "exit" syscall, or 2) it is forced to do so by an uncaught signal. The kernel enters a routine called exit() [those of you with source can sing along, the rest just have to believe me :-].

Upon entering exit() (the kernel's exit(), that is), the kernel sets the SWEXIT flag in the struct proc of the process. This flag advises the paging and swapping logic that the process is on its way out and should be held in core so its demise will be quick.

The next step is to release the user virtual memory occupied by the process. This encompasses the text, data, and stack segments, but not the kernel's "u. area" for the process (yet).

Now the kernel runs through all open file descriptors, closing each one. This can result in calls to the "close" routines within device drivers. The drivers are at liberty to suspend the process if they so choose (for example, a tty driver may suspend the process until all characters in the output queue have been delivered to the hardware). Each driver is unique in its behavior, so the reasons for suspending a process will vary. One would hope that the programmer who coded the driver would implement a timeout, which would give up and resume the user process after a reasonable amount of time, but unfortunately this is more the exception than the rule.

If a device driver should choose to suspend the process, "ps" will report the process as "exiting". In this case, the WCHAN column of the "ps" display will, in an obscure way, reflect the event that the device driver is waiting for before it wakes the process from its sleep. When "ps" reports a process as "exiting", the process is most likely delayed in the close-the-file-descriptors phase of exiting.

After all of the file descriptors are closed, the kernel discards the page tables and "u. area" of the process, and places the process in the "zombie" state, which is signified by the value SZOMB in the p_stat field of the proc structure. At this point, the proc structure is the only vestige of the process remaining on the system (it's pretty minimal, see /usr/include/sys/proc.h), and its purpose is to maintain process exit status and accounting information for the parent. A process in this state will appear as a "zombie" in the "ps" display.

When the parent reaps the process using wait(), wait3(), or whatever else is fashionable these days, the proc struct is discarded and the process is completely gone.

Regarding "gcore": the VM of the process is discarded very shortly after the kernel sets the SWEXIT flag. So when "gcore" sees SWEXIT, it concludes that the process has no VM to dump, tells you that the process is exiting, and gives up. It cannot dump memory because there is no memory left to dump.

Rick Ace
Pixar
3240 Kerner Blvd, San Rafael CA 94901
...!{sun,ucbvax}!pixar!rta

[[ Thank you very much! That was most informative. --wnl ]]
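
The last two stages described above (zombie state, then reaping by the parent) can be observed from user space on a present-day POSIX system. The following is a minimal sketch, assuming a Unix-like host where Python's `os.fork` is available; the function name `zombie_then_reap` and its parameters are made up for illustration, and nothing here touches kernel structures or reproduces the SunOS 3.5 internals.

```python
import os
import time


def zombie_then_reap(exit_code=7, linger=0.2):
    """Fork a child that exits at once, leave it unreaped briefly, then reap it.

    Returns (exit status seen by the parent, whether a second wait failed).
    """
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately without running Python-level cleanup.
        os._exit(exit_code)

    # Parent: the child exits almost immediately, but nobody has waited
    # for it yet, so for most of this sleep it is a zombie. Only its
    # exit status is being kept for us.
    time.sleep(linger)

    # Reap the child: this collects the saved exit status.
    reaped_pid, status = os.waitpid(pid, 0)
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None

    # After reaping, the process entry is gone, so waiting on the same
    # pid again fails with ECHILD (ChildProcessError in Python).
    try:
        os.waitpid(pid, 0)
        gone = False
    except ChildProcessError:
        gone = True

    return code, gone


if __name__ == "__main__":
    print(zombie_then_reap())
```

While the parent sleeps, running "ps" in another terminal should list the child as a zombie (often shown as "defunct" on current systems). Raising `linger` gives more time to look.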