scl@sasha.acc.Virginia.EDU (Steve Losen) (03/07/90)
Thanks so much to everyone who responded to my recent posting about the
nfsd daemons hanging forever in a "D" (disk wait) state on our sun3
server. This turns out to be a known bug for which there is a kernel fix
(a new ufs_bmap.o file). I installed the fix and our servers have been
up almost a whole week (probably a record!).

I ended up having to get this fix by calling Sun support. To avoid this
hassle, I have made this available via anonymous ftp on virginia.EDU in
pub/nfsd.tar.Z.

[[Ed's Note: Hopefully you verified that this was okay with Sun? -bdg]]

The following is the README file that comes with the patch. The patch
contains a ufs_bmap.o for the sun2, sun3, and sun4.

README:

Problem description:

Occasionally on NFS server machines the nfsd daemons have been reported
to get into a disk wait ("DW") state, as shown in a "ps aux" listing.
When this happens, all client requests to the server fail. Sun bugIds
1017518 and 1017893 identify at least two distinct causes of this
problem, described below:

Case 1017518:

On the server system, processes go into DW state and don't return. This
problem is related to VM and may happen even in non-NFS instances. The
core dump will show _sleep, _cv_wait, _page_cv_wait, and _page_wait at
the top of the stack trace. Basically the process is blocked waiting for
the keep count on the page it wants to go to zero (meaning that the page
is available), but somehow the count was not decremented correctly and
will never reach zero.

Case 1017893:

This is a server problem similar to the client problem in bugId 1018954.
The process is blocked waiting for an mbuf structure to be released back
to NFS, but it is never released. The core dump for this problem shows
the hung process with a stack trace of _svc_sendreply,
_svckudp_send(0x7hexdigits,0x7hexdigits) + 2C, _sleep.
The routine svckudp_send is trying to send a reply to the client, but is
blocked waiting for the mbuf structure pointed to by the first
0x7hexdigits argument above. Actually, the first 0x7hexdigits argument
to svckudp_send is a SVCXPRT pointer, not an mbuf. However, it's
possible to derive the mbuf's address given this argument.

Fix description:

Case 1017518:

There are currently two patches available for this case:

1) An adb patch which sets nfsreadmap to 0:

        # adb -w /vmunix -
        nfsreadmap?W 0
        $q

   This eliminates most of the code that increments and decrements the
   keep count.

2) The included patched ufs_bmap.o files, which fix a bug in bmap()
   where "softlocked" pages were never released after failing to extend
   the original block.

It may not be necessary to apply both patches. It is recommended that
the ufs_bmap.o patch be tried first, and the adb patch added only if the
problem persists.

Case 1017893:

There is no patch available for case 1017893 at this time.

If it is not clear which case is the cause of your NFS server hang, or
if you continue to experience the problem after applying the above
patches, it will be necessary to get a system core dump sample and
submit it to Sun Customer Support so that your problem can be further
distinguished.

Install instructions:

After extracting the fix tape contents into /tmp, as root install the
appropriate sun2, sun3, or sun4 patch as follows:

        cd /sys/{sun2,sun3,sun4}/OBJ
        mv ufs_bmap.o ufs_bmap.o_orig
        cp /tmp/ufs_bmap.o_{sun2,sun3,sun4} ufs_bmap.o
        chmod 444 ufs_bmap.o

A new kernel must then be rebuilt and booted.

Bug Id: 1017518
Release summary: 4.0, 4.0.1, 4.0.3
Fixed in Release: 4.1

*******
Steve Losen     scl@virginia.edu
University of Virginia Academic Computing Center
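
For anyone wanting to check whether a server is showing the symptom
described in the README, here is a rough sketch (not part of Sun's
patch) that filters "ps aux" output for nfsd processes whose state
begins with D. The function name hung_nfsd is made up for illustration,
and the sketch assumes STAT is the eighth column, as in BSD-style
"ps aux" output; adjust the field number if your ps differs.

```shell
#!/bin/sh
# Sketch only: report nfsd processes in disk wait.
# Assumption: input is "ps aux" output with columns
#   USER PID %CPU %MEM SZ RSS TT STAT START TIME COMMAND
# so PID is field 2 and STAT is field 8.
# hung_nfsd is a hypothetical helper name, not a system command.
hung_nfsd() {
    # Skip the header line, keep lines whose STAT starts with "D"
    # and which mention nfsd anywhere on the line; print PID and STAT.
    awk 'NR > 1 && $8 ~ /^D/ && $0 ~ /nfsd/ { print $2, $8 }'
}
```

Run it as "ps aux | hung_nfsd". A D state is often only momentary, so a
PID is a candidate for the hang described above only if it keeps showing
up in D or DW state across repeated runs.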