lsf@astrosun.tn.cornell.edu (Sam Finn) (04/04/89)
Machine(s): Sun3/280-P14 (serial #831E0454), 16 Mbytes, 2x892 Mbyte disks, 6250 bpi tape; Sun4/110FCE (serial #828E0623), diskless.

Operating System: SunOS 4.0.1

Environment: The 3/280 is a server for the 4/110 plus several diskless 3/50s and 3/60s. Only root logins are enabled on the server. It serves NFS and yp. Vanilla NFS (i.e., no automount or other fancy games, no secure NFS).

Problem: Periodically the load on the server jumps to 8-10. Less frequently, the load jumps higher (up to 20), and portmap has been observed to spawn huge numbers of children. At times such as these, the world comes to a halt. Just once we were able to log in to the server, nice -15 a shell, and kill some of the portmappers to bring the load back down to 8-10. System time was in the 99% range, and the nfsds were the chief culprits.

In the particular instance documented here, we halted and restarted, only to have the same problem occur again. We halted again, disconnected all clients from the ethernet, and rebooted the server. It came up fine. Connecting the clients back one at a time (we never halted them, since we wanted to isolate the problem, which was by now fairly well established as client-related), we found that the 4/110 was paging at a very high rate and would bring the server to its knees even in the absence of all other clients. The 4/110 is connected to a multiport transceiver, and as soon as it was connected its port would dominate the transceiver box. Traffic would also report that the 4/110 was dominating the ethernet (until it stopped receiving packets, owing to the server failing to respond). We found we could get the 4/110 to respond (slowly) to commands by typing them, then disconnecting it from the ethernet for 30 seconds or so, then reconnecting, and repeating.

Conjecture: Whatever was dominating the CPU time was network-related (obviously) and was somewhat thwarted when it could not communicate. The server would also quickly recover when the 4/110 was disconnected.
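The recovery steps just described (nicing the nfsds so an interactive shell can compete, then killing surplus portmap children) can be sketched as a small shell helper. This is a hedged sketch, not the procedure actually run: the `pids_of` and `calm_server` helper names are invented for illustration, and the `ps` output layout assumed here differs from SunOS 4.x.

```shell
#!/bin/sh
# Sketch of the "keep the server alive" steps described above.
# ASSUMPTION: `ps -e`-style output with the command name in the last
# column; exact flags and columns vary between SunOS 4.x and modern ps.

# pids_of NAME: read ps-style output on stdin and print the PIDs of
# processes whose command field matches NAME exactly.
pids_of() {
    awk -v name="$1" 'NR > 1 && $NF == name { print $1 }'
}

# calm_server: renice the nfsds down to +15 so an interactive shell can
# get cycles, then kill every portmap child beyond the first.
calm_server() {
    for pid in $(ps -e | pids_of nfsd); do
        renice 15 -p "$pid" >/dev/null 2>&1
    done
    extra=$(ps -e | pids_of portmap | sed 1d)
    [ -n "$extra" ] && kill $extra 2>/dev/null
}
```

On a wedged server, even typing these commands required the nice -15 shell mentioned above; the helper only packages the same manual steps.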
We niced +15 the nfsds on the server to keep it somewhat responsive and continued to experiment. Using these tricks, we got a nice -15 shell on the 4/110. There we found four user programs, each an executable of the same trivial ~60-line program (Simpson's-rule integration). There were no arrays, nor anything that should consume significant memory. The entire set was small enough to run entirely in core, and top reported that all four were resident, had sizes on the order of 140K, were sleeping, had a -5 priority, and were running without any nicing. Using vmstat, renice, and kill, we found that they were responsible for page-ins at a rate of 1600/5 sec for all of them combined. There was no significant page-out. They did not respond to TERM or QUIT, but did respond to KILL (thank God).

We slowly killed off the four programs, watching system response. The page-in rate would drop by about 400/5 sec for each one we killed, consistent with our experimentation using renice. Even with one running, it could impact server response significantly. We killed off the final one and tried to restart one. It ran normally, with no indication of any untoward behavior.

Frankly, we are at a complete loss as to how to proceed to isolate and document the problem. This runaway server/client situation has occurred before (a frequency of about once a week), and experience tells us it will occur again. We cannot reproduce the problem at will, and have no idea even what further questions we need to be asking ourselves.

Request: 1) Contact from anyone who has encountered problems that are similar or thought to be related; 2) contact from anyone who can suggest things worth checking or trying. Please respond to lsf@astrosun.tn.cornell.edu
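The signal escalation that finally worked on the runaway processes (TERM and QUIT ignored, KILL honored) can be captured in a small helper. This is a hedged sketch: the `stop_pid` name and the 2-second grace period per signal are illustrative assumptions, not part of the original report.

```shell
#!/bin/sh
# Sketch of the signal escalation described above: try the catchable
# signals first, then fall back to the unblockable KILL.
# ASSUMPTION: the 2-second grace period per signal is illustrative only.

stop_pid() {
    pid=$1
    for sig in TERM QUIT KILL; do
        kill -"$sig" "$pid" 2>/dev/null || return 0  # already gone
        sleep 2
        kill -0 "$pid" 2>/dev/null || return 0       # it exited
    done
    return 1   # still alive even after KILL (should not happen)
}
```

A process that survives TERM and QUIT but dies to KILL, as these four did, is either catching/ignoring those signals or stuck somewhere it cannot service them; KILL cannot be caught, which is why it succeeded here.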