[comp.unix.internals] lockf, NFS, and file locking issues

rkc@xn.ll.mit.edu (05/01/91)

=This is a slight modification of a posting that has occurred elsewhere.
=It was suggested that I post these questions to these newsgroups.

I have written an application that is similar to a network database
application in which data is stored in an NFS-accessible file.  To protect
against multiple simultaneous updates, I have used the lockf subroutine to lock
the entire file.  I have had numerous problems with the lockf routine "locking
up".  The symptoms vary:

	S1. The client dies and the server doesn't realize it.  In order to
	avoid processes being killed when they own the lock, I catch the
	following signals: 
		signal( SIGHUP, clnp );
		signal( SIGQUIT, clnp );
		signal( SIGINT, clnp );
		signal( SIGILL, clnp );
		signal( SIGIOT, clnp );
		signal( SIGEMT, clnp );
		signal( SIGFPE, clnp );
		signal( SIGBUS, clnp );
		signal( SIGSEGV, clnp );
		signal( SIGSYS,	clnp );
		signal( SIGTERM, clnp );
	Should I catch more?
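
	For reference, a minimal sketch of table-driven handler installation
	using POSIX sigaction, assuming a POSIX system rather than the SunOS
	releases discussed here.  The names lock_fd and
	install_cleanup_handlers are hypothetical, and clnp here is only a
	stand-in for the real cleanup routine.  SIGEMT and SIGSYS are left
	out because they are not defined everywhere, and SIGIOT appears as
	SIGABRT.  SIGKILL and SIGSTOP cannot be caught at all, so no list of
	handlers covers every way a lock holder can die.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical global: descriptor that holds the lock, or -1 if none. */
static volatile sig_atomic_t lock_fd = -1;

/* Stand-in cleanup handler.  It uses only async-signal-safe calls.
 * Closing the descriptor drops this process's record locks on the file. */
static void clnp(int sig)
{
    if (lock_fd >= 0)
        close((int)lock_fd);
    _exit(128 + sig);
}

/* Install clnp for a portable subset of the signals listed above.
 * Returns 0 on success, -1 if any sigaction call fails. */
int install_cleanup_handlers(void)
{
    static const int sigs[] = {
        SIGHUP, SIGINT, SIGQUIT, SIGILL, SIGABRT, SIGFPE,
        SIGBUS, SIGSEGV, SIGPIPE, SIGALRM, SIGTERM
    };
    struct sigaction sa;
    size_t i;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = clnp;
    sigemptyset(&sa.sa_mask);

    for (i = 0; i < sizeof sigs / sizeof sigs[0]; i++) {
        if (sigaction(sigs[i], &sa, NULL) == -1)
            return -1;
    }
    return 0;
}
```

	Whether the NFS server actually learns that the lock was dropped
	still depends on the client's lock daemon, so this narrows the
	window rather than closing it.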
	
	FYI, here's what the lock code looks like:

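	  /* Poll with the non-blocking F_TLOCK; a size of 0L locks from the
	     current offset to end of file.  Makes NUMPOLLS+1 attempts. */
	  for (NumAttempts = 0; NumAttempts <= NUMPOLLS; NumAttempts++) {
	    if (lockf(fd, F_TLOCK, 0L) != -1) {
	      success = TRUE;
	      break;
	    }
	    sleep(2);	/* lock busy (or call failed): wait, then retry */
	  }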
	I avoid the indefinite-wait lock (F_LOCK) because it appears to
	increase the probability that an error will occur.
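
	The polling loop above treats every lockf failure as "lock busy".
	A minimal sketch of a variant that separates contention from real
	errors, assuming a POSIX system with fcntl record locking; the
	helper names poll_whole_file_lock and unlock_whole_file are
	hypothetical.  Unlike lockf with a size of 0L, it locks from offset
	0 regardless of the current file position.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: poll for an exclusive lock on the whole file.
 * Returns 0 on success.  Returns -1 with errno = EAGAIN if the lock is
 * still busy after max_polls attempts, or -1 with the original errno
 * (for example ENOLCK or EBADF) as soon as a real error is seen. */
int poll_whole_file_lock(int fd, int max_polls, unsigned delay_secs)
{
    struct flock fl;
    int attempt;

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;      /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;             /* 0 means "through end of file" */

    for (attempt = 0; attempt < max_polls; attempt++) {
        if (fcntl(fd, F_SETLK, &fl) != -1)
            return 0;
        if (errno != EACCES && errno != EAGAIN)
            return -1;        /* not contention: retrying will not help */
        if (attempt + 1 < max_polls)
            sleep(delay_secs);
    }
    errno = EAGAIN;
    return -1;
}

/* Hypothetical helper: release the whole-file lock taken above. */
int unlock_whole_file(int fd)
{
    struct flock fl;

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    return fcntl(fd, F_SETLK, &fl);
}
```

	A call such as poll_whole_file_lock(fd, NUMPOLLS + 1, 2) would
	mirror the loop above, while letting the caller log errno when the
	failure is something other than a busy lock.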

	S2. Sometimes the client doesn't die--it just hangs.  Attaching a
debugger to the hung program shows that it is stuck inside fcntl.

	S3. Occasionally, I get messages like 
		unknown klm_reply proc(0)
		unknown klm_reply proc(40)

	Does anyone have any idea where these come from?

	Other questions include:
	1. Is there any known way to unconfuse our machines and reset
state without rebooting the things?  Killing statd and lockd is not always
sufficient.

	2. I was once told that Sun released patches to their lock daemon, but
no one could direct me to them.  Does a wizard know where such things exist?

	3. If lockf cannot be made to work, would I be at risk using the old
technique of creating a "lock directory"?  I've read that with NFS this won't
work, but I've never read a good explanation of the problems with this approach.
Are there other workarounds (semaphores, etc.) that I should try?
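
For reference, a minimal sketch of the "lock directory" technique from
question 3, assuming POSIX mkdir and rmdir; the helper names
acquire_lock_dir and release_lock_dir are hypothetical.  It relies on
mkdir either creating the directory or failing with EEXIST.  One commonly
cited concern over NFS, offered here as an assumption rather than a
diagnosis of this network, is that if the reply to a successful mkdir is
lost and the client retransmits, the retry can come back as EEXIST, so the
creator believes someone else holds the lock.  A holder that dies also
leaves the directory behind until something removes it.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: try once to take the lock by creating a directory.
 * Returns 0 if we created it (lock acquired), 1 if it already exists
 * (lock busy), and -1 on any other error with errno set by mkdir. */
int acquire_lock_dir(const char *path)
{
    if (mkdir(path, 0700) == 0)
        return 0;
    return (errno == EEXIST) ? 1 : -1;
}

/* Hypothetical helper: release the lock by removing the directory.
 * Returns 0 on success, -1 on error with errno set by rmdir. */
int release_lock_dir(const char *path)
{
    return rmdir(path);
}
```

A caller would poll acquire_lock_dir in the same way as the lockf loop
earlier, and would need its own policy for deciding when a leftover
directory is stale.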

I would prefer to get this to work properly using lockf, since this seems to
be exactly what lockf is designed for.

Our network consists of SPARCstation 1+s and IPCs running either 4.0.1, 4.1 or
4.1.1, and Sun-3s running 4.0.3.  Currently the client is on one of the Sun-3s.
In the near future we will also be using DG's aviion/UX workstations.

		Thanks for any help you can provide,

			-Rob

kdenning@genesis.Naitc.Com (Karl Denninger) (05/02/91)

In article <1991May1.165813.17465@xn.ll.mit.edu> rkc@xn.ll.mit.edu writes:
>=This is a slight modification of a posting that has occurred elsewhere.
>=It was suggested that I post these questions to these newsgroups.
>
>I have written an application that is similar to a network database
>application in which data is stored in an NFS-accessible file.  To protect
>against multiple simultaneous updates, I have used the lockf subroutine to lock
>the entire file.  I have had numerous problems with the lockf routine "locking
>up".  The symptoms vary:
>
>	S1. The client dies and the server doesn't realize it.  In order to
>	avoid processes being killed when they own the lock, I catch the
>	following signals: 
>
>	S2. Sometimes the client doesn't die--it just hangs.  Attaching a
>debugger to the hung program shows that it is stuck inside fcntl.
>
>	S3. Occasionally, I get messages like 
>		unknown klm_reply proc(0)
>		unknown klm_reply proc(40)
>
>	Does anyone have any idea where these come from?

Heck, you're fortunate.

If it's a Sun you're on, get on the horn with them and raise HELL.  Sun
hasn't had a working lockd in their OS for at least three releases that I
know of (4.0.3, 4.1, and now 4.1.1).  Their patches fix some of the bugs, and
break other things.

In short, yes, it's broke.  Call the vendor and make a stink.

--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com

"The most dangerous command on any computer is the carriage return."
Disclaimer:  The opinions here are solely mine and may or may not reflect
  	     those of the company.