[net.unix-wizards] close problem on single-use devices

obrien@RAND-UNIX@sri-unix (08/09/82)

Date: Thursday, 22 Jul 1982 17:43-PDT
There is a problem with devices which are single-use, when a
process which has one open dies on a signal.  It would appear that there
are cases where the close routine is not called, hence locking the device
until reboot (or mucking in /dev/kmem with adb, which amounts to the same
thing).  I believe Berkeley mentioned this, but did not have a fix.  Does
anyone out there know this symptom, and have a fix (or at least an
explanation)?  This has occurred in every version of UNIX I've ever seen,
from V6 to 4.1BSD.

	It's particularly annoying when you gradually lose all of the
"/dev/imp?" devices for talking to a network.  I've also lost the magtape
drive on occasion, though not under 4.1.  It doesn't happen every time a
process dies on a signal, just sometimes.  TTY-generated signals do not
seem to cause the problem as much as other signals.

thomas (08/09/82)

I've run into this and finally concluded that it was a result of the
close routine calling sleep() with an interruptable priority level. 
If a signal occurs during the sleep, the close is forcibly exited (with
a longjump) and any cleanup following the sleep never occurs.  On the
other hand, if the sleep is called with a non-interruptable priority
level, and the awaited event never occurs, there is no way to kill the
process.  The best solution I can think of is to sleep at a
non-interruptable level, but to invoke a timeout routine to terminate
the sleep after some reasonable period.  Messy, yes, but device
interactions always are.

=Spencer
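
The bounded wait suggested above can be sketched as a toy user-space model.
This is not kernel code: toy_dev, wait_idle, toy_close and
CLOSE_TIMEOUT_TICKS are all made-up names, and a loop counter stands in for
the clock. In a real driver of that era one would presumably sleep() at a
non-interruptable priority and arrange for timeout() to call wakeup() on the
same channel. The point of the sketch is only that the wait cannot be
abandoned halfway and cannot last forever, so the cleanup after it always
runs.

```c
#include <assert.h>
#include <stdbool.h>

enum { CLOSE_TIMEOUT_TICKS = 5 };   /* hypothetical upper bound on the wait */

struct toy_dev {
    bool open;        /* the single-use lock */
    int  busy_ticks;  /* ticks until the hardware goes idle; < 0 means never */
};

/* Wait for the device to go idle, giving up after max_ticks.
 * The toy model has no signals, which stands in for sleeping at a
 * non-interruptable priority.  Returns true if the device went idle. */
static bool wait_idle(struct toy_dev *d, int max_ticks)
{
    for (int t = 0; t < max_ticks; t++) {
        if (d->busy_ticks == 0)
            return true;
        if (d->busy_ticks > 0)
            d->busy_ticks--;        /* one clock tick of progress */
    }
    return d->busy_ticks == 0;
}

/* Close routine: the wait is bounded, so the cleanup below is always
 * reached, whether or not the awaited event ever happened. */
bool toy_close(struct toy_dev *d)
{
    assert(d->open);                /* closing an unopened device is a bug */
    bool drained = wait_idle(d, CLOSE_TIMEOUT_TICKS);
    d->busy_ticks = 0;              /* abandon anything still pending */
    d->open = false;                /* release the single-use lock */
    return drained;
}
```

With busy_ticks set to 3 the close reports a clean drain; with busy_ticks set
to -1 (the event never occurs) it reports failure after five ticks. In both
cases the lock is released.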

dan@Bbn-Unix@sri-unix (08/20/82)

From: Dan Franklin <dan@Bbn-Unix>
Date:  9 Aug 1982  9:10:53 EDT (Monday)
One way I know of that locking might not get undone on a signal is the general
UNIX open bug. For any device which sleeps at interruptable priority in its
open routine, if a signal comes along while it is sleeping there, control
jumps directly back into the trap routine, bypassing any little cleanup like
deletion of the fd, clearing out the lock bit, etc. This problem was alluded
to in an earlier message about file descriptors which suggested the user-level
hack of "predicting" the fd before doing the open, and then, if the open
failed, closing this fd. 
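
A minimal sketch of that user-level hack, as I understand it. The earlier
message is not quoted here, so the exact form it used is unknown;
predict_next_fd and careful_open are names invented for this example. It
relies on the kernel handing out the lowest free descriptor, and it is only
safe when nothing else in the process opens files between the prediction and
the open. On a system without the bug the extra close() just fails harmlessly.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Guess which descriptor the next open() will return.  The kernel hands
 * out the lowest free slot, so open something harmless, note the number,
 * and close it again.  Returns -1 if no prediction could be made. */
int predict_next_fd(void)
{
    int fd = open("/dev/null", O_RDONLY);
    if (fd >= 0)
        close(fd);
    return fd;
}

/* open() wrapper: if the open fails, close the predicted descriptor in
 * case the kernel left it allocated.  errno from open() is preserved. */
int careful_open(const char *path, int flags)
{
    assert(path != NULL);
    int guess = predict_next_fd();
    int fd = open(path, flags);
    if (fd < 0 && guess >= 0) {
        int saved = errno;
        close(guess);               /* fails with EBADF if nothing leaked */
        errno = saved;
    }
    return fd;
}
```

A caller would use careful_open("/dev/imp0", O_RDWR) in place of a bare
open(); the device name is only an example.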

A simple solution for the locking problem might be to avoid locking until
after the sleep; however, this means that several processes can contend for
the device until that time, which might defeat the purpose of locking for your
application.

A general solution is to have devices which sleep in the open routine at
interruptable priority make a copy of the value of u.u_qsav and set up their
own handler for interrupts (via a "save" into u.u_qsav) just before going to
sleep. Then when an interrupt occurs, the code should clean up as necessary
and resume at the saved value of u.u_qsav. I haven't tried it yet, though; in
the one place it hurt us, in the network mailer (which would run out of fds),
'predicting' the fd was enough. 
	Dan Franklin
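
The save-and-chain idea can be modelled in user space with setjmp/longjmp.
This is a sketch under loud assumptions, not kernel code: a global jmp_buf
stands in for u.u_qsav, toy_sleep for an interruptible sleep(), and dev_open,
toy_open and dev_locked are invented names. It also assumes a jmp_buf can be
copied with memcpy, which works on common C libraries but is not promised by
the C standard.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <string.h>

static jmp_buf u_qsav;   /* stand-in for u.u_qsav */
bool dev_locked;         /* the device's single-use lock bit */

/* Stand-in for sleep() at interruptible priority: when a signal arrives,
 * control jumps straight to whatever u_qsav currently points at. */
static void toy_sleep(bool signal_arrives)
{
    if (signal_arrives)
        longjmp(u_qsav, 1);
}

/* Device open routine using the pattern: copy u_qsav, install our own
 * handler, and on interrupt clean up and resume at the saved copy. */
static void dev_open(bool signal_arrives)
{
    jmp_buf outer;

    assert(!dev_locked);                    /* single-use device */
    memcpy(outer, u_qsav, sizeof outer);    /* assumption: bytewise copy ok */
    dev_locked = true;
    if (setjmp(u_qsav)) {
        dev_locked = false;                 /* the cleanup that was skipped */
        memcpy(u_qsav, outer, sizeof outer);
        longjmp(u_qsav, 1);                 /* resume at the saved value */
    }
    toy_sleep(signal_arrives);
    memcpy(u_qsav, outer, sizeof outer);    /* normal path: restore */
}

/* Stand-in for the system call layer: returns 0 on success, or -1 if the
 * open was interrupted by a signal. */
int toy_open(bool signal_arrives)
{
    if (setjmp(u_qsav))
        return -1;
    dev_open(signal_arrives);
    return 0;
}
```

toy_open(false) returns 0 and leaves dev_locked set, as a successful open
should. toy_open(true) returns -1 with dev_locked cleared; without the inner
setjmp the lock bit would stay set after the interrupted open.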