rkc@xn.ll.mit.edu (05/01/91)
=This is a slight modification of a posting that has occurred elsewhere.
=It was suggested that I post these questions to these newsgroups.

I have written an application that is similar to a network database application in which data is stored in an NFS-accessible file. To protect from multiple simultaneous updates, I have used the lockf subroutine to lock the entire file. I have had numerous problems with the lockf routine "locking up". The symptoms vary:

S1. The client dies and the server doesn't realize it. In order to avoid processes being killed when they own the lock, I catch the following signals:

        signal( SIGHUP, clnp );
        signal( SIGQUIT, clnp );
        signal( SIGINT, clnp );
        signal( SIGILL, clnp );
        signal( SIGIOT, clnp );
        signal( SIGEMT, clnp );
        signal( SIGFPE, clnp );
        signal( SIGBUS, clnp );
        signal( SIGSEGV, clnp );
        signal( SIGSYS, clnp );
        signal( SIGTERM, clnp );

    Should I catch more? FYI, here's what the lock code looks like:

        for (NumAttempts = 0; NumAttempts <= NUMPOLLS; NumAttempts++) {
            if (lockf( fd, F_TLOCK, 0L ) != (-1)) {
                success = TRUE;
                break;
            }
            sleep(2);
        }

    I avoid the indefinite wait lock because this appears to increase the probability that an error will occur.

S2. Sometimes the client doesn't die--it just hangs. Attaching the hung program indicates something hangs inside of fcntl.

S3. Occasionally, I get messages like

        unknown klm_reply proc(0)
        unknown klm_reply proc(40)

    Does anyone have any idea where these come from?

Other questions include:

1. Is there any known way to unconfuse our machines and reset state without rebooting the things? Killing statd and lockd is not always sufficient.

2. I was once told that Sun released patches to their lock daemon, but no one could direct me to them. Does a wizard know where such things exist?

3. If lockf cannot be made to work, would I be at risk using the old technique of creating a "lock directory"? I've read that with NFS this won't work, but I've never read a good explanation of the problems with this approach.
Are there other workarounds (semaphores, etc.) that I should try? I would prefer to get this to work properly using lockf, since this seems to be exactly what lockf is designed for.

Our network consists of Sparcstation 1+ and IPCs running either 4.0.1, 4.1 or 4.1.1, and Sun3s running 4.0.3. Currently the client is on one of the Sun3s. In the near future we will also be using DG's aviion/UX workstations.

Thanks for any help you can provide,
-Rob
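One detail of the polling loop in S1 may be worth spelling out: a length of 0L tells lockf to lock from the current file offset to end of file, so the call covers the entire file only if the offset is 0 at that moment. The following is a minimal sketch of the same bounded-retry idea written with fcntl(F_SETLK), where the range is stated explicitly. The names try_lock_file and unlock_file and their parameters are hypothetical, not from the post, and nothing here works around a misbehaving lockd.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: apply a lock of the given type (F_WRLCK or
 * F_UNLCK) to the whole file, independent of the current offset. */
static int set_whole_file_lock(int fd, short type)
{
    struct flock fl;

    memset(&fl, 0, sizeof fl);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;   /* measure l_start from the start of the file */
    fl.l_start = 0;
    fl.l_len = 0;             /* 0 means "through end of file" */
    return fcntl(fd, F_SETLK, &fl);   /* non-blocking request */
}

/* Hypothetical helper: poll for an exclusive lock at most
 * max_polls + 1 times, sleeping delay_sec between attempts.
 * Returns 0 once the lock is held, -1 otherwise. */
int try_lock_file(int fd, int max_polls, unsigned delay_sec)
{
    int attempt;

    for (attempt = 0; attempt <= max_polls; attempt++) {
        if (set_whole_file_lock(fd, F_WRLCK) == 0)
            return 0;
        if (errno != EACCES && errno != EAGAIN)
            return -1;        /* a real error, not merely "lock is busy" */
        if (attempt < max_polls)
            sleep(delay_sec);
    }
    return -1;                /* still busy after the last attempt */
}

/* Hypothetical helper: release the whole-file lock. */
int unlock_file(int fd)
{
    return set_whole_file_lock(fd, F_UNLCK);
}
```

A caller would wrap each update in try_lock_file(fd, NUMPOLLS, 2) and unlock_file(fd). On the "Should I catch more?" question, note that SIGKILL and SIGSTOP cannot be caught at all, so a signal handler alone cannot guarantee that the lock is always released explicitly.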
kdenning@genesis.Naitc.Com (Karl Denninger) (05/02/91)
In article <1991May1.165813.17465@xn.ll.mit.edu> rkc@xn.ll.mit.edu writes:
>=This is a slight modification of a posting that has occurred elsewhere.
>=It was suggested that I post these questions to these newsgroups.
>
>I have written an application that is similar to a network database
>application in which data is stored in an NFS-accessible file. To protect
>from multiple simultaneous updates, I have used the lockf subroutine to lock
>the entire file. I have had numerous problems with the lockf routine "locking
>up". The symptoms vary:
>
> S1. The client dies and the server doesn't realize it. In order to
> avoid processes being killed when they own the lock, I catch the
> following signals:
>
> S2. Sometimes the client doesn't die--it just hangs. Attaching the
>hung program indicates something hangs inside of fcntl.
>
> S3. Occasionally, I get messages like
> unknown klm_reply proc(0)
> unknown klm_reply proc(40)
>
> Does anyone have any idea where these come from?

Heck, you're fortunate.

If it's a Sun you're on, get on the horn with them and raise HELL.

Sun hasn't had a working lockd in their OS for at least three releases that I know of (4.0.3, 4.1, and now 4.1.1). Their patches fix some of the bugs, and break other things.

In short, yes, it's broke. Call the vendor and make a stink.

--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com
"The most dangerous command on any computer is the carriage return."
Disclaimer: The opinions here are solely mine and may or may not reflect those of the company.
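If lockd cannot be relied on, one lock-file technique that does not involve the lock daemon at all is the hard-link approach described in the Linux open(2) manual page for NFS: create a uniquely named file, link() it to the lock name, and treat the lock as held if link() succeeds or if the unique file's link count has risen to 2. The sketch below assumes that approach; acquire_link_lock, release_link_lock and the path arguments are hypothetical names, the caller must supply a unique path (for example one built from hostname and pid), and it has not been checked against the SunOS releases mentioned in this thread.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: try to take the lock named lockpath by
 * hard-linking a caller-supplied unique file to it.
 * Returns 0 if the lock was acquired, -1 otherwise. */
int acquire_link_lock(const char *lockpath, const char *uniquepath)
{
    struct stat st;
    int fd;
    int got = -1;

    /* Create the unique file; only this process should use this name. */
    fd = open(uniquepath, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    close(fd);

    if (link(uniquepath, lockpath) == 0) {
        got = 0;              /* link created: we hold the lock */
    } else if (stat(uniquepath, &st) == 0 && st.st_nlink == 2) {
        /* link() reported failure, but the link count shows that it
         * actually took effect, so we hold the lock after all. */
        got = 0;
    }

    /* The unique name is no longer needed either way. */
    unlink(uniquepath);
    return got;
}

/* Hypothetical helper: drop the lock by removing the lock name. */
int release_link_lock(const char *lockpath)
{
    return unlink(lockpath);
}
```

Unlike a kernel-managed lock, a lock file of this kind is not released when its owner dies, so a real application would also need some policy for detecting and clearing stale locks.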