[comp.unix.wizards] file locking issues, NFS, lockf

rkc@xn.ll.mit.edu (05/01/91)

=This is a slight modification of a posting that occurred in comp.sys.sun.
=I received only a few answers which seemed to open as many questions as they
=answered. I now call upon the unix wizards to help me out.

I have written an application that is similar to a network database
application in which data is stored in an NFS-accessible file.  To protect
from multiple simultaneous updates, I have used the lockf subroutine to lock
the entire file.  I have had numerous problems with the lockf routine "locking
up".  The symptoms vary:

	S1. The client dies and the server doesn't realize it.  In order to
	avoid processes being killed when they own the lock, I catch the
	following signals: 
		signal( SIGHUP, clnp );
		signal( SIGQUIT, clnp );
		signal( SIGINT, clnp );
		signal( SIGILL, clnp );
		signal( SIGIOT, clnp );
		signal( SIGEMT, clnp );
		signal( SIGFPE, clnp );
		signal( SIGBUS, clnp );
		signal( SIGSEGV, clnp );
		signal( SIGSYS,	clnp );
		signal( SIGTERM, clnp );
	Should I catch more?
	
	FYI, Here's what the lock code looks like:

	  for(NumAttempts = 0;NumAttempts <= NUMPOLLS ; NumAttempts++){
	    if( lockf( fd, F_TLOCK, 0L ) != (-1))	{
	      success = TRUE;
	      break;
	    }
	    sleep(2);
	  }
	I avoid the indefinite wait lock because this appears to increase the
	probability that an error will occur.
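
For reference, the polling loop above can be written as a self-contained
sketch.  The helper name open_locked and its parameters are illustrative
assumptions, not part of the original code.  It treats only EAGAIN/EACCES as
"lock busy" and stops on any other error (for example ENOLCK), so those
failures are reported rather than retried.

```c
#define _XOPEN_SOURCE 700   /* expose lockf() and friends */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * open_locked: hypothetical helper, not from the original post.
 * Opens path read/write and polls lockf(F_TLOCK) up to max_attempts
 * times, sleeping delay_sec seconds between tries.
 * Returns the locked descriptor, or -1 with errno set.
 */
int open_locked(const char *path, int max_attempts, unsigned delay_sec)
{
    int fd, attempt, saved;

    assert(max_attempts > 0);

    /* lockf() needs a descriptor that is open for writing. */
    fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    for (attempt = 0; attempt < max_attempts; attempt++) {
        /* Length 0 means "from the current offset (0 here) to end of file". */
        if (lockf(fd, F_TLOCK, 0) == 0)
            return fd;                  /* lock acquired */
        if (errno != EAGAIN && errno != EACCES)
            break;                      /* real error, do not retry */
        if (attempt + 1 < max_attempts)
            sleep(delay_sec);           /* held elsewhere; wait and retry */
    }

    saved = errno;                      /* close() may overwrite errno */
    close(fd);
    errno = saved;
    return -1;
}
```

A caller would release the lock with lockf(fd, F_ULOCK, 0) or by closing the
descriptor.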

	S2. Sometimes the client doesn't die--it just hangs.  Attaching the
hung program indicates something hangs inside of fcntl.

	S3. Occasionally, I get messages like 
		unknown klm_reply proc(0)
		unknown klm_reply proc(40)

	Does anyone have any idea where these come from?

	Other questions include:
	1. Is there any known way to unconfuse our machines and reset
state without rebooting the things?  Killing statd and lockd is not
sufficient.

	2. I was once told that Sun released patches to their lock daemon, but
no one could direct me to them.  Does a wizard know where such things exist?

	3. If lockf cannot be made to work, would I be at risk using the old
technique of creating a "lock directory"?  I've read that with NFS this won't
work, but I've never read a good explanation of the problems with this approach.
Are there other workarounds (semaphores, etc.) that I should try?

I would prefer to get this to work properly using lockf, since this seems to
be exactly what lockf is designed for.

Our network consists of SPARCstation 1+ and IPC machines running either 4.0.1,
4.1 or 4.1.1, and Sun-3s running 4.0.3.  In the near future we will also be using
DG's aviion/UX workstations. 

		Thanks for any help you can provide,

			-Rob

thurlow@convex.com (Robert Thurlow) (05/01/91)

In <1991Apr30.192117.4730@xn.ll.mit.edu> rkc@xn.ll.mit.edu writes:

>=This is a slight modification of a posting that occurred in comp.sys.sun.
>=I received only a few answers which seemed to open as many questions as they
>=answered. I now call upon the unix wizards to help me out.

The best audience may have been comp.protocols.nfs; though NFS and the
Sun lock manager are almost completely separate, they are both based on
RPC and many companies picked them both up from Sun.

>	S1. The client dies and the server doesn't realize it.  In order to
>	avoid processes being killed when they own the lock, I catch the
>	following signals:  ...  Should I catch more?

I guess you have no idea why they are dying?  That looked like a pretty
good list to me, but I can't say why your clients might be dying.
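
As a rough sketch of the approach in the quoted text, one handler can be
installed over a table of signals with sigaction.  The names install_cleanup
and lock_fd are hypothetical; the table is a portable subset of the original
list, with SIGPIPE and SIGALRM added as an assumption about what else might
terminate the client.  SIGKILL and SIGSTOP cannot be caught, so no list covers
every way a client can die.

```c
#define _XOPEN_SOURCE 700   /* expose sigaction() and struct flock */
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical: descriptor on which the lock is held. */
static volatile sig_atomic_t lock_fd = -1;

/* Cleanup handler: drop the lock, then exit without running stdio cleanup. */
static void clnp(int sig)
{
    struct flock fl;

    (void)sig;
    /*
     * fcntl() is on the POSIX async-signal-safe list; lockf() is not.
     * Assumption: lockf() locks are fcntl() locks underneath, as on
     * SunOS and Linux, so F_UNLCK releases them.
     */
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                       /* 0 = through end of file */
    if (lock_fd >= 0)
        fcntl(lock_fd, F_SETLK, &fl);
    _exit(1);
}

/* Install clnp for every signal in the table; returns 0, or -1 on failure. */
int install_cleanup(int fd)
{
    static const int sigs[] = {
        SIGHUP, SIGINT, SIGQUIT, SIGILL, SIGABRT, SIGFPE,
        SIGBUS, SIGSEGV, SIGTERM,
        SIGPIPE, SIGALRM                /* additions, not in the original list */
    };
    struct sigaction sa;
    size_t i;

    assert(fd >= 0);
    lock_fd = fd;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = clnp;
    sigemptyset(&sa.sa_mask);

    for (i = 0; i < sizeof sigs / sizeof sigs[0]; i++)
        if (sigaction(sigs[i], &sa, NULL) != 0)
            return -1;
    return 0;
}
```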

>	I avoid the indefinite wait lock because this appears to increase the
>	probability that an error will occur.

Something you may want to try to verify:  Sun is said to have badly
broken the server side of the SunOS 4.1.x lock manager in that F_LOCK
requests that have to pend are answered with a GRANTED message with the
wrong process id.  These responses are discarded by the client kernel
as being ridiculous.  Do you have greater problems working against the
4.1.1 servers?  You may get more happiness from either backing off to
4.0.3 or yelling at Sun _really_ loudly :-)

>	S2. Sometimes the client doesn't die--it just hangs.  Attaching the
>hung program indicates something hangs inside of fcntl.

Hmmm.  Does anyone know how to get a backtrace of the kernel context
of a sleeping/waiting process on a Sun?  I could use the information,
and it would be helpful here.

>	S3. Occasionally, I get messages like 
>		unknown klm_reply proc(0)
>		unknown klm_reply proc(40)
>	Does anyone have any idea where these come from?

See S1; do you see this anywhere other than on a SunOS 4.1.x server?

>	Other questions include:
>	1. Is there any known way to unconfuse our machines and reset
>state without rebooting the things?  Killing statd and lockd is not
>sufficient.

Part (but not enough) of the lock manager lives in the kernel, and if
things get bad enough, a reboot will be necessary.  I don't find I have
to do this very much, though.  Do both daemons start and respond to
"rpcinfo -u <host> {l,n}lockmgr" requests when they're confused?

>	2. I was once told that Sun released patches to their lock daemon, but
>no one could direct me to them.  Does a wizard know where such things exist?

Sun, as far as I can tell, can give you patches, but not THE patches
needed to make the lock manager work properly.  They _are_ finally
working on it, but it's taking a while.

>	3. If lockf cannot be made to work, would I be at risk using the old
>technique of creating a "lock directory"?

No, keep after Sun for a working lock manager, because NFS doesn't do
locking-via-file-creation well.  The protocol has no O_EXCL flag, so
you can't be sure another process on another machine or on the server
didn't get to the file while your NFS daemon was trying your request,
and you can easily get false failures if your server doesn't keep track
of the retransmissions made necessary by UDP.  Rauhl Desai posted a
scheme based on symbolic link creation to comp.protocols.nfs that will
at least work better than creating files.
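
The posted scheme is not reproduced in this thread, so the following is only
a minimal sketch of the general idea, under assumptions: symlink() fails with
EEXIST if the name already exists, and the link target can carry an owner
string.  The function names and the "host:pid" owner convention are
hypothetical.  Reading the link back after EEXIST is one way to tell a real
conflict from a false failure caused by a retransmitted request.

```c
#define _XOPEN_SOURCE 700   /* expose symlink() and readlink() */
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

/*
 * symlink_lock: hypothetical helper.  owner must be unique per locker,
 * e.g. "host:pid".  Returns 0 if we hold the lock, 1 if someone else
 * does (or it vanished while we looked; retry), -1 on any other error.
 */
int symlink_lock(const char *lockpath, const char *owner)
{
    char buf[256];
    ssize_t n;

    assert(lockpath != NULL && owner != NULL);

    if (symlink(owner, lockpath) == 0)
        return 0;                       /* created the link: lock is ours */
    if (errno != EEXIST)
        return -1;

    /* The name exists.  Check whether it is our own link, which can
     * happen if the first request succeeded but its reply was lost. */
    n = readlink(lockpath, buf, sizeof buf - 1);
    if (n < 0)
        return (errno == ENOENT) ? 1 : -1;
    buf[n] = '\0';
    return strcmp(buf, owner) == 0 ? 0 : 1;
}

/* Release the lock by removing the link.  Returns 0 on success. */
int symlink_unlock(const char *lockpath)
{
    return unlink(lockpath);
}
```

Unlike lockf, a lock of this kind is not released when the holder dies, so a
stale link has to be detected and removed by some other means.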

>I would prefer to get this to work properly using lockf, since this seems to
>be exactly what lockf is designed for.

You're right; sadly, the only implementation available for a whole lot
of machines is Sun's, and it's never worked properly.  It's also never
been something Sun cared about until customers started eating them for
breakfast.  There are a lot of OEM vendors who feel helpless waiting for
Sun to fix the thing properly, as well.  Keep pressuring your Sun sales
rep for information.

>Our network consists of sparcstation 1+ and IPC's running either 4.0.1, 4.1 or
>4.1.1, and sun3's running 4.0.3.  In the near future we will also be using
>DG's aviion/UX workstations. 

Good for you; DG has done a splendid job of fixing their lock manager.

Rob T
--
Rob Thurlow, thurlow@convex.com
An employee and not a spokesman for Convex Computer Corp., Dallas, TX