lsf@astrosun.tn.cornell.edu (Sam Finn) (04/04/89)
Machine(s): Sun3/280-P14 (serial #831E0454), 16 Mbytes, 2x892 Mbyte disks, 6250 bpi tape; Sun4/110FCE (serial #828E0623), diskless.

Operating System: SunOS 4.0.1

Environment: The 3/280 is a server for the 4/110 plus several diskless 3/50s and 3/60s. Only root logins are enabled on the server. It serves NFS and yp. Vanilla NFS (i.e., no automount or other fancy games, no secure NFS).

Problem: Periodically the load on the server jumps to 8-10. Less frequently, the load jumps higher (up to 20), and portmap has been observed to spawn huge numbers of children. At times such as these, the world comes to a halt. Just once we were able to log in to the server, nice -15 a shell, and kill some of the portmappers to bring the load back down to 8-10. System time was in the 99% range, and the nfsds were the chief culprits.

In the particular instance documented here, we halted and restarted, only to have the same problem occur again. We halted again, disconnected all clients from the ethernet, and rebooted the server. It came up fine. Connecting the clients back one at a time (we never halted them, since we wanted to isolate the problem, which was by now fairly well established as client-related), we found that the 4/110 was paging at a very high rate and would bring the server to its knees even in the absence of all other clients. The 4/110 is connected to a multiport transceiver, and as soon as it was connected its port would dominate the transceiver box. Traffic would also report that the 4/110 was dominating the ethernet (until it stopped receiving packets, owing to the server failing to respond). We found we could get the 4/110 to respond (slowly) to commands by typing them, then disconnecting it from the ethernet for 30 seconds or so, then reconnecting, and repeating.

Conjecture: Whatever was dominating the CPU time was network-related (obviously) and was somewhat thwarted when it could not communicate. The server would also quickly recover when the 4/110 was disconnected.
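The recovery steps just described (nicing the nfsds so an interactive shell can compete, then killing surplus portmap children) can be sketched as a small shell helper. This is a hedged sketch, not the procedure actually run: the `pids_of` and `calm_server` helper names are invented for illustration, and the `ps` output layout assumed here differs from SunOS 4.x.

```shell
#!/bin/sh
# Sketch of the "keep the server alive" steps described above.
# ASSUMPTION: `ps -e`-style output with the command name in the last
# column; exact flags and columns vary between SunOS 4.x and modern ps.

# pids_of NAME: read ps-style output on stdin and print the PIDs of
# processes whose command field matches NAME exactly.
pids_of() {
    awk -v name="$1" 'NR > 1 && $NF == name { print $1 }'
}

# calm_server: renice the nfsds down to +15 so an interactive shell can
# get cycles, then kill every portmap child beyond the first.
calm_server() {
    for pid in $(ps -e | pids_of nfsd); do
        renice 15 -p "$pid" >/dev/null 2>&1
    done
    extra=$(ps -e | pids_of portmap | sed 1d)
    [ -n "$extra" ] && kill $extra 2>/dev/null
}
```

On a wedged server, even typing these commands required the nice -15 shell mentioned above; the helper only packages the same manual steps.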
We niced +15 the nfsds on the server to keep it somewhat responsive and continued to experiment. Using these tricks, we got a nice -15 shell on the 4/110. There we found four user programs, each an executable of the same trivial ~60-line program (Simpson's-rule integration). There were no arrays, nor anything that should consume significant memory. The entire set was small enough to run entirely in core, and top reported that all four were resident, had sizes on the order of 140K, were sleeping, had a -5 priority, and were running without any nicing. Using vmstat, renice, and kill, we found that they were responsible for page-ins at a rate of 1600/5 sec for all of them combined. There was no significant page-out. They did not respond to TERM or QUIT, but did respond to KILL (thank God).

We slowly killed off the four programs, watching system response. The page-in rate would drop by about 400/5 sec for each one we killed, consistent with our experimentation using renice. Even with one running, it could impact server response significantly. We killed off the final one and tried to restart one. It ran normally, with no indication of any untoward behavior.

Frankly, we are at a complete loss as to how to proceed to isolate and document the problem. This runaway server/client situation has occurred before (a frequency of about once a week), and experience tells us it will occur again. We cannot reproduce the problem at will, and have no idea even what further questions we need to be asking ourselves.

Request: 1) Contact from anyone who has encountered problems that are similar or thought to be related; 2) contact from anyone who can suggest things worth checking or trying. Please respond to lsf@astrosun.tn.cornell.edu
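The signal escalation that finally worked on the runaway processes (TERM and QUIT ignored, KILL honored) can be captured in a small helper. This is a hedged sketch: the `stop_pid` name and the 2-second grace period per signal are illustrative assumptions, not part of the original report.

```shell
#!/bin/sh
# Sketch of the signal escalation described above: try the catchable
# signals first, then fall back to the unblockable KILL.
# ASSUMPTION: the 2-second grace period per signal is illustrative only.

stop_pid() {
    pid=$1
    for sig in TERM QUIT KILL; do
        kill -"$sig" "$pid" 2>/dev/null || return 0  # already gone
        sleep 2
        kill -0 "$pid" 2>/dev/null || return 0       # it exited
    done
    return 1   # still alive even after KILL (should not happen)
}
```

A process that survives TERM and QUIT but dies to KILL, as these four did, is either catching/ignoring those signals or stuck somewhere it cannot service them; KILL cannot be caught, which is why it succeeded here.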