lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) (06/27/91)
We are experiencing a peculiar bug which has appeared from time to time on our Sun-4/490 server. This system is very heavily loaded, mainly because it is running Sybase. It recently had an official clean 4.1.1 release installed, with DBE 1.1 and selected patches added. The system has a Sun VME FDDI board, and FDDI 1.1 is installed. There is heavy NFS traffic to another Sun server via FDDI (at the moment; Ethernet has also been used).

The bug has appeared in 4.1, 4.1 + various patches (almost 4.1.1), and 4.1.1, with and without DBE installed, and with and without FDDI (i.e., with NFS traffic over Ethernet). The same symptom has appeared in all cases: a process which is usually doing NFS I/O will hang in "D" state. The offending process cannot be killed, and eventually other processes start hanging as well. During this period, Sybase activity will have been very heavy. The Sybase dataserver process itself, however, never hangs (note: Sybase is set up so that its I/O is local, *and* Sybase is using its own raw partitions).

Even though Sybase itself never hangs, *if Sybase asynch. I/O is turned OFF, the problem rarely if ever appears.*

So, to cause the hang, you seem to need:

  - Sybase, with asynch I/O on.
  - A heavy Sybase load.
  - Another process doing NFS reads/writes...

Oh yes. It seems to take a while to get into this predicament. After the inevitable reboot, the system is usually OK for a while.

Has anyone else experienced this problem? It could be an NFS problem, an asynch I/O problem, a load-dependent kernel problem, ...

Any help would be much appreciated.

--
  Hugh LaMaster, M/S 233-9,     UUCP:              ames!lamaster
  NASA Ames Research Center     Internet:          lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035       With Good Mailer:  lamaster@george.arc.nasa.gov
  Phone:  415/604-1056          #include <std.disclaimer>
lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) (06/27/91)
I previously wrote:

>The bug has appeared in 4.1, 4.1 + various patches (almost 4.1.1), 4.1.1,
>with and without DBE installed, with and without FDDI (ie, with NFS
>traffic over ethernet). The same symptom has appeared in all cases:
>a process which is usually doing NFS I/O will hang in "D" state. The
>offending process cannot be killed, and eventually other processes
>start hanging as well. During this period, Sybase activity
>will have been very heavy. The Sybase datasever process itself, however,
>never hangs (note: Sybase is set up so that its I/O is local, *and*
>Sybase is using its own raw partitions). Even though Sybase itself
>never hangs, *If Sybase asych. I/O is turned OFF,
>the problem rarely if ever appears.*

1) We are not running with /tmp in swap with tmpfs. However, I understand that this can cause a similar-sounding problem, which may be related. It could be a bug somewhere in the allocation of swap space.

2) I should have made it clear that the Sybase raw partitions are local to the machine running Sybase, and that Sybase is not doing any NFS I/O on the database files. Only user-type files are mounted from the fileserver using NFS. Also, lockd and statd are not running. I believe that there is no need for them to be running, since Sybase is not reading/writing over NFS and is not complaining about lock requests failing.

3) We had another hang yesterday afternoon. The processes which hung this time looked like the following:

       F  UID   PID  PPID CP PRI NI     SZ  RSS WCHAN    STAT TT  TIME COMMAND
20008000 1002  9562  9542  0  -1  0 149376    0 kernelma DW   pa  0:00 model
20008001 1002  9529  4227  0  -1  0 149376   72 kernelma D    pb  0:00 model

A pstat -Ts showed the following:

[149] pstat -Ts
>pstat: number of files is preposterous (14019)
>1470/1470 inodes
>454/4090 processes
>460952/781032 swap
>

We have a lot of swap space allocated, to run some of these big jobs.
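To make the pstat counters easier to compare from one hang to the next, here is a minimal sketch in Python that turns the used/total lines into percentages. It is illustrative only and was not run on the server: the function name `table_usage` and the regular expression are assumptions, and the sample text is simply the output quoted above, leading ">" marks included.

```python
import re

# Matches lines such as "1470/1470 inodes"; a leading ">" quote mark is tolerated.
USAGE_LINE = re.compile(r"^>?\s*(\d+)/(\d+)\s+(\w+)\s*$")

def table_usage(pstat_text):
    """Return {name: (used, total, percent_used)} for each used/total line."""
    usage = {}
    for line in pstat_text.splitlines():
        m = USAGE_LINE.match(line)
        if m is None:
            continue  # skips warnings such as "number of files is preposterous"
        used, total = int(m.group(1)), int(m.group(2))
        pct = 100.0 * used / total if total else 0.0
        usage[m.group(3)] = (used, total, pct)
    return usage

# The pstat -Ts output quoted in the post (hypothetical variable name).
sample = """\
>pstat: number of files is preposterous (14019)
>1470/1470 inodes
>454/4090 processes
>460952/781032 swap
"""

for name, (used, total, pct) in table_usage(sample).items():
    print(f"{name:10s} {used:>7d}/{total:<7d} {pct:5.1f}%")
```

On the figures above, this shows the inode line at 100% and swap at roughly 59%. Whether either number has anything to do with the hang is not established here.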