[comp.unix.questions] Trouble killing processes

arosen@eagle.ulowell.edu (MFHorn) (05/17/88)

In article <1037@unmvax.unm.edu> mike@turing.UNM.EDU.UUCP (Michael I. Bushnell) writes:
>In article <6832@swan.ulowell.edu> arosen@hawk.ulowell.edu (MFHorn) writes:
>>Would a program that does the following get rid of the process?
>>
>>1: Gets the process' proc struct from the kernel.
>>2: Changes fields like the status, priority, cpu usage, wchan, exit status
>>   and maybe others so the kernel will have good reason to terminate the
>>   process.
>>3: Writes the new struct back out (open /dev/mem for write, lseek, write).
>
>Ack! no!
>
> [If you kill it, you may lose the resource for good (no process to release
> it).  You could try figuring out what the resource is and unlocking it
> yourself.]

[Good, I thought this had died..]

The two instances where I've wanted such a program were the one I
mentioned, a user permanently 'allocating' processors in a parallel
machine, and a tape drive getting hung (which happens a little too often),
making backups impossible until you reboot.

The idea behind nuke(1) would be to talk the kernel into letting the
process exit.  That means changing fields in the proc (and maybe user)
struct like its priority (send it through the roof), cpu time used (the
roof), status and exit status (hey, this process already died?!), maybe
pointing wchan to null, sending it a SIGHUP (and making sure it's not
catching or ignoring SIGHUP), etc.

Chris Torek and Guy Harris both said by mail that a nuke program
could probably work, but it would also likely nuke the system.

>your program even smarter, and have it figure out just what things were
>locked and unlock them, but remember, they may be partially modified,
>and fixing them makes this an even more daunting prospect.

Any ideas on how to release the resource?  Or even on how to find it?
In the tape drive and processor examples, either method of attack (killing
it dead, or resource preemption) should be safe, if it works at all.

The real fix would of course be in the kernel.  I would suggest setting
a timeout on each system call.  This way, an lseek on a dead tape drive,
say, would fail after n secs of cpu.  Some sort of context might need
to be saved before the syscall starts, so things can be restored.  This
could be expensive.  Comments?

Andy Rosen           | arosen@hawk.ulowell.edu | "I got this guitar and I
ULowell, Box #3031   | ulowell!arosen          |  learned how to make it
Lowell, Ma 01854     |                         |  talk" -Thunder Road
                   RD in '88 - The way it should be

guy@gorodish.Sun.COM (Guy Harris) (05/18/88)

> The real fix would of course be in the kernel.  I would suggest setting
> a timeout on each system call.  This way, an lseek on a dead tape drive,
> say, would fail after n secs of cpu.  Some sort of context might need
> to be saved before the syscall starts, so things can be restored.  This
> could be expensive.  Comments?

Probably not a good idea.

"lseek" is a bad example; in all current UNIX systems that I'm familiar with,
"lseek" only sets a "seek pointer" in memory - it never goes near the device.
This pointer is then used by the driver to position the tape before doing any
I/O operation.
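
The point that "lseek" only records an offset can be seen from user space
with a rough sketch.  This uses an ordinary temporary file, not a tape
drive, so it is only an analogy; the function name, the /tmp template and
the 1 MB offset are made up for illustration.  Seeking far past the end of
an empty file succeeds, and the file is still empty afterwards, because no
I/O was done:

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: seek far past the end of an empty temporary file,
 * then report the file's size.  Returns the size after the seek (0 if the
 * seek performed no I/O), or -1 on any error.
 */
long size_after_far_seek(void)
{
    char path[] = "/tmp/seekdemoXXXXXX";   /* illustrative template */
    struct stat st;
    long result = -1;
    off_t target = (off_t)1 << 20;         /* 1 MB, arbitrary */
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);   /* the file goes away once fd is closed */

    /* lseek only updates the offset kept with the open file. */
    if (lseek(fd, target, SEEK_SET) == target && fstat(fd, &st) == 0)
        result = (long)st.st_size;

    close(fd);
    return result;
}
```

Calling size_after_far_seek() from a main() should give 0 on a typical
system: the offset moved, the file did not grow.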

A more germane example *might* be an I/O operation or a "position the tape"
"ioctl" operation on a dead tape drive, except that the *only* reasons this
should require a timeout are that the tape driver is buggy and doesn't
immediately detect a dead drive, or that it lacks a timeout scheme *in the
driver* to detect a dead drive.  Even such a timeout could be tricky; some
magtape operations can take a *very* long time to complete.

Basically, system calls should take as long as they need to; this could very
well be infinite ("pause()" or "sigpause()") or, worse, finite but
indeterminate.  In either case, no timeout can be imposed.
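
A caller that does know a sensible limit for one particular call can
impose it from user space rather than having the kernel guess.  The
following is a minimal sketch, assuming a POSIX system with sigaction();
the function and handler names are invented here.  It blocks in "pause()"
and lets "alarm()" interrupt it.  Note that this only helps with
interruptible sleeps; it does nothing for a process wedged in a sleep that
signals cannot break.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Empty handler: its only job is to make pause() return. */
static void on_alarm(int sig)
{
    (void)sig;
}

/*
 * Hypothetical helper: block in pause() until SIGALRM arrives after
 * `seconds` seconds.  Returns the errno value left by pause() (EINTR when
 * a caught signal interrupted it), or -1 if the handler could not be set.
 */
int pause_with_alarm(unsigned seconds)
{
    struct sigaction sa, old;
    int saved;

    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGALRM, &sa, &old) < 0)
        return -1;

    alarm(seconds);     /* the caller picks the limit, not the kernel */
    pause();            /* returns only after a signal is caught */
    saved = errno;

    alarm(0);                           /* cancel any pending alarm */
    sigaction(SIGALRM, &old, NULL);     /* restore previous handler */
    return saved;
}
```

With no other signals arriving, pause_with_alarm(1) comes back after about
a second with EINTR; without the alarm() call it would sit there forever,
which is exactly the behavior "pause()" is supposed to have.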

A typical "wedged" process is either waiting for something that *must* complete
(in which case its unkillability is unfortunate but unavoidable) or is hung due
to a kernel bug (in which case the real fix is, of course, in the kernel - but
it's not to kludge in a timeout).

(P.S. the timer obviously doesn't want to be based on CPU time - a blocked
process tends to consume CPU time *extremely* slowly, if at all.)
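
The difference between CPU time and elapsed time for a blocked process is
easy to measure.  As a rough sketch (the struct and function names are
hypothetical, and "sleep()" stands in for any blocking call), this records
both clocks around a sleep:

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>
#include <unistd.h>

/* CPU seconds and wall-clock seconds used across a blocking call. */
struct times_used {
    double cpu;
    double wall;
};

/*
 * Hypothetical helper: block in sleep() for `seconds` and report how much
 * CPU time and how much elapsed time went by while blocked.
 */
struct times_used measure_blocked(unsigned seconds)
{
    struct timespec w0, w1;
    struct times_used t;
    clock_t c0;

    c0 = clock();                           /* CPU time so far */
    clock_gettime(CLOCK_MONOTONIC, &w0);    /* elapsed-time reference */

    sleep(seconds);                         /* the process is blocked here */

    clock_gettime(CLOCK_MONOTONIC, &w1);
    t.cpu = (double)(clock() - c0) / CLOCKS_PER_SEC;
    t.wall = (double)(w1.tv_sec - w0.tv_sec)
           + (double)(w1.tv_nsec - w0.tv_nsec) / 1e9;
    return t;
}
```

For measure_blocked(1) the wall figure should be about one second while
the cpu figure stays close to zero, so a limit of "n secs of cpu" would
essentially never expire on a blocked process.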