[comp.sys.sun] lockf

rkc@xn.ll.mit.edu (03/28/91)
I have written an application, similar to a network database, in which
data is stored in an NFS-accessible file.  To protect against multiple
simultaneous updates, I use the lockf subroutine to lock the entire file.
I have had numerous problems with the client lockd daemon getting confused
and failing to release a lock that is in place, while the server thinks
the unlock has succeeded.  Specifically, a ps -aux on machine B produces
the following:

USER       PID %CPU %MEM   SZ  RSS TT STAT START  TIME COMMAND
username  4484  4.7  0.0   64    0 ?  D    17:45   0:39 lock_program

while machine C can successfully acquire the lock, go about its business,
and release the lock.  (Machine A is the host for the filesystem where the
file resides; the filesystem is hard-mounted on the client systems.)  A
final clue to my problems is that lock_program is sometimes run by the at
program.

	I have three questions:

1. Once machine B is confused, how do I unconfuse it?  Specifically, PID
   4484 cannot be killed, as it is waiting in a non-interruptible state for a
   resource.  Killing and restarting both lockd and statd on the host and
   client machines neither lets the above process continue nor allows others
   to access the lock.  The only fix appears to be rebooting the machine.
   (Other users don't like this too much!)

2. Am I doing something inherently wrong?  To keep a process from being
   killed while it owns the lock, I catch the following signals:

signal( SIGHUP, clnp ); signal( SIGQUIT, clnp ); signal( SIGINT, clnp );
signal( SIGILL, clnp ); signal( SIGIOT, clnp ); signal( SIGEMT, clnp );
signal( SIGFPE, clnp ); signal( SIGBUS, clnp ); signal( SIGSEGV, clnp );
signal( SIGSYS, clnp ); signal( SIGTERM, clnp );

Should I catch more?

Here's what the lock code looks like:

for (NumAttempts = 0; NumAttempts <= NUMPOLLS; NumAttempts++) {
   if (lockf(fd, F_TLOCK, 0L) != -1) { success = TRUE; break; }
   sleep(2);
}

I avoid the indefinite-wait form of the lock because it appears to increase
the probability that the error will occur.

3. Would creating a lock file via open be a workable network solution?
   Are there other workarounds (semaphores, etc.) that I should try?
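
For concreteness, the lock-file idea in question 3 might look something like
the sketch below.  It is only an illustration under stated assumptions:
take_lockfile and drop_lockfile are hypothetical names, not part of any
library, and the sketch uses link() rather than a plain exclusive open,
since exclusive create with O_EXCL has historically not been atomic over
NFS.  Each process creates a uniquely named temporary file and tries to
hard-link it to the shared lock name; a link count of 2 on the temporary
file means the link, and therefore the lock, was obtained.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 500   /* ask the C library to declare gethostname() etc. */
#endif
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: try once to take the lock named `lockpath`.
 * Returns 0 if this process now holds it, -1 otherwise. */
int take_lockfile(const char *lockpath)
{
    char host[256], tmp[1024];
    struct stat st;
    int fd, n, ok;

    /* PIDs are unique only per machine, so the temporary name includes
     * the hostname as well. */
    if (gethostname(host, sizeof host) == -1)
        return -1;
    host[sizeof host - 1] = '\0';
    n = snprintf(tmp, sizeof tmp, "%s.%s.%ld", lockpath, host, (long)getpid());
    if (n < 0 || n >= (int)sizeof tmp)
        return -1;

    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;
    close(fd);

    /* link() fails if lockpath already exists.  Its return value is
     * ignored on purpose: the link count of the temporary file is the
     * more dependable indicator of whether the link was made. */
    (void)link(tmp, lockpath);
    ok = (stat(tmp, &st) == 0 && st.st_nlink == 2);

    unlink(tmp);            /* the lock name itself stays in place */
    return ok ? 0 : -1;
}

/* Hypothetical helper: release the lock by removing the lock name. */
int drop_lockfile(const char *lockpath)
{
    return unlink(lockpath);
}
```

A caller would poll take_lockfile() in a loop with a sleep between attempts.
One drawback of this approach is that a process which dies while holding the
lock leaves the lock file behind, so some stale-lock policy would be needed.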

I would prefer to get this to work properly using lockf, since this seems
to be exactly what lockf is designed for.

Our network consists of SPARCstation 1+'s running either SunOS 4.0.1 or
4.1, and Sun-3's running 4.0.  In the near future we will also be using
DG's AViiON/UX workstations.

		Thanks for any help you can provide,
			-Rob