rkc@xn.ll.mit.edu (03/28/91)
I have written an application, similar to a network database
application, in which data is stored in an NFS-accessible file. To protect
against simultaneous updates, I use the lockf subroutine to lock the
entire file. I have had numerous problems with the client lockd daemon
getting confused and failing to release a lock that is in place, while
the server thinks the unlock has succeeded. Specifically, a ps -aux
on machine B produces the following:
USER      PID  %CPU %MEM  SZ RSS TT STAT START  TIME COMMAND
username  4484  4.7  0.0  64   0 ?  D    17:45  0:39 lock_program
while machine C can successfully acquire the lock, go about its business,
and release the lock. (Machine A is the host for the filesystem where the
file resides; the filesystem is hard-mounted on the client systems.) A
final clue to my problems is that lock_program is sometimes run by the at
program.
My questions are three:
1. Once machine B is confused, how do I unconfuse it? Specifically, PID
4484 cannot be killed, as it is waiting in an uninterruptible state for a
resource. Killing and restarting both lockd and statd on the host and
client machines neither lets the above process continue nor allows others
to access the lock. The only fix appears to be rebooting the machine.
(Other users don't like this too much!)
2. Am I doing something inherently wrong? In order to avoid processes
being killed when they own the lock, I catch the following signals:
    signal( SIGHUP, clnp );  signal( SIGQUIT, clnp ); signal( SIGINT, clnp );
    signal( SIGILL, clnp );  signal( SIGIOT, clnp );  signal( SIGEMT, clnp );
    signal( SIGFPE, clnp );  signal( SIGBUS, clnp );  signal( SIGSEGV, clnp );
    signal( SIGSYS, clnp );  signal( SIGTERM, clnp );
Should I catch more?
Here's what the lock code looks like:
    for (NumAttempts = 0; NumAttempts <= NUMPOLLS; NumAttempts++) {
        if (lockf(fd, F_TLOCK, 0L) != (-1)) { success = TRUE; break; }
        sleep(2);
    }
I avoid the indefinite-wait lock (F_LOCK) because it appears to increase
the probability that the error will occur.
3. Would creating a lock file via open be a workable network solution?
Are there other workarounds (semaphores, etc.) that I should try?
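
For concreteness, here is a minimal sketch of the lock-file idea from question 3. It does not rely on open() with O_EXCL, since exclusive create is commonly documented as unreliable over older NFS versions; instead it uses the usual workaround of creating a uniquely named file and hard-linking it to the lock name, then checking the link count. The names acquire_lockfile and release_lockfile, and the hostname argument, are hypothetical and only for illustration. Note that a crashed holder leaves a stale lock file behind, which this sketch does not handle.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: try once to take the lock file `lockpath`.
 * `hostname` only serves to make the temporary name unique across clients.
 * Returns 0 if the lock was taken, -1 if it is held or an error occurred. */
int acquire_lockfile(const char *lockpath, const char *hostname)
{
    char tmppath[1024];
    struct stat st;
    int fd, n, got;

    /* Unique per host and process, in the same directory as the lock. */
    n = snprintf(tmppath, sizeof tmppath, "%s.%s.%ld",
                 lockpath, hostname, (long)getpid());
    if (n < 0 || (size_t)n >= sizeof tmppath)
        return -1;

    fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    close(fd);

    /* link() fails if lockpath already exists. Its return value is
     * ignored on purpose; the link count of the temporary file is
     * checked instead, which is the commonly recommended test over NFS. */
    (void)link(tmppath, lockpath);
    got = (stat(tmppath, &st) == 0 && st.st_nlink == 2);

    unlink(tmppath);            /* the lock name, if created, remains */
    return got ? 0 : -1;
}

/* Hypothetical helper: release the lock by removing the lock name. */
int release_lockfile(const char *lockpath)
{
    return unlink(lockpath);
}
```

A caller would poll acquire_lockfile() with a sleep between attempts, much like the F_TLOCK loop above, and call release_lockfile() from the same cleanup path that currently unlocks the file.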
I would prefer to get this to work properly using lockf, since this seems
to be exactly what lockf is designed for.
Our network consists of SPARCstation 1+'s running either SunOS 4.0.1 or
4.1, and Sun-3's running 4.0. In the near future we will also be using
DG's AViiON workstations running DG/UX.
Thanks for any help you can provide,
-Rob