[comp.sys.mips] 4.52 is it reliable?

rex@cs.su.oz (Rex Di Bona) (04/24/91)

We have just installed the 4.52 kernel on our M120s and 3240 to try to reduce
the frequency of crashes.

The symptoms we are/were having are:
	1) The machine runs out of mbufs.
	2) The machine panics and then takes a monitor exception
		Or.. it panics with 'clget: null client'
		Or.. the machine hangs (the silent death)
	3) The machine double panics or fails to talk to the fuji scsi
		controller resulting in no kernel core dump...

The load on the machines is usually 2.5+ in the run queue, the machines
are swapping, but not really heavily.

The result... well, we used to have two crashes or so a day, now we only
have one (per machine :-)

Is this a known problem? The 4.52 release notes said an mbuf leak was fixed,
but there may still be another :-)

The configuration is:
	4 machines (3 x M120, 1 x 3240) each with...
	20+ users, all on ncd-19 X terminals, 48MB of memory
	about 200 procs, 1100 inodes, 600 files, 100+ process switches and
	5000+ system calls a sec. 60 or so ethernet packets a second doing
	X traffic. (running 4.51, then 4.51 upgraded with 4.52 kernels)

We have tried increasing the number of mbufs, but the machines still
eventually crash.
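
If the remaining problem really is a leak, a bigger mbuf pool only postpones
the crash. A rough way to check is to log the mbufs-in-use count at intervals
and fit a line through it. The sketch below is only that, a sketch: the
function names, the sample numbers and the pool sizes are made up for
illustration, and how the counts get collected (e.g. by hand from netstat -m)
is left out.

```python
def leak_slope(samples):
    """Least-squares slope of (seconds, mbufs_in_use) samples, in mbufs/second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den


def seconds_until_exhausted(samples, pool_size):
    """Estimate the time left before the pool is used up.

    Returns None when usage is flat or falling (no leak visible).
    """
    slope = leak_slope(samples)
    if slope <= 0:
        return None
    _, latest = samples[-1]
    return max(0.0, (pool_size - latest) / slope)


# Hypothetical readings: (seconds since boot, mbufs in use).
readings = [(0, 100), (60, 130), (120, 160)]
for pool in (400, 800):  # hypothetical pool sizes
    print(pool, seconds_until_exhausted(readings, pool))
```

With a steady positive slope, doubling the pool lengthens the time to
exhaustion but never removes it, which would match what we see.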

Any suggestions? Fixes?
P.S. We have 3230s with similar loads which have been running fine for weeks.
--------
Rex di Bona (rex@cs.su.oz.au)
Penguin Lust is NOT immoral

kdenning@pcserver2.naitc.com (Karl Denninger) (04/25/91)

In article <2338@cluster.cs.su.oz.au> rex@cluster.cs.su.oz (Rex Di Bona) writes:
>We have just installed the 4.52 kernel on our M120s and 3240 to try to reduce
>the frequency of crashes.
>
>The symptoms we are/were having are:
>	1) The machine runs out of mbufs.
>	2) The machine panics and then takes a monitor exception
>		Or.. it panics with 'clget: null client'
>		Or.. the machine hangs (the silent death)
>	3) The machine double panics or fails to talk to the fuji scsi
>		controller resulting in no kernel core dump...

4.52 seems to be ok here, with one exception:

rpc.lockd has a HUGE memory leak.  The result is that it will eventually
consume all the physical memory in the machine (and it appears to plock()
its memory so it can't be paged out!)  This has a most undesirable outcome,
of course :-)

I have reported this to MIPS; they're working on it.

No other problems I have noticed...

Check rpc.lockd and see what its current memory usage is (this is assuming
you're exporting one or more filesystems, or mounting some remotely from
elsewhere).
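
One way to keep an eye on it is to capture ps output every so often and pull
out the size column for rpc.lockd. The following is a minimal sketch, not a
tested tool for these machines: it assumes a ps listing with a header row
containing PID and a memory column (called RSS here; use whatever your ps
prints) and with the command in the last column. The sample listing and its
numbers are invented.

```python
def process_memory(ps_text, name, column="RSS"):
    """Return {pid: value} for processes whose command contains `name`.

    ps_text is captured ps output whose header row includes PID and the
    requested column; the command is assumed to be the last column.
    """
    lines = [ln for ln in ps_text.splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    pid_idx = header.index("PID")
    col_idx = header.index(column)
    result = {}
    for line in lines[1:]:
        # Split into at most len(header) fields so a command with
        # arguments stays together in the last field.
        fields = line.split(None, len(header) - 1)
        if len(fields) < len(header):
            continue  # skip malformed or truncated rows
        if name in fields[-1]:
            result[int(fields[pid_idx])] = int(fields[col_idx])
    return result


# Invented sample listing, for illustration only.
SAMPLE = """\
  PID   RSS COMMAND
  112   840 /usr/etc/rpc.lockd
  115   212 /usr/etc/rpc.statd
"""

print(process_memory(SAMPLE, "rpc.lockd"))
```

Comparing the value across a few hours should make a steady climb obvious.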

--
Karl Denninger - AC Nielsen, Bannockburn IL (708) 317-3285
kdenning@nis.naitc.com

"The most dangerous command on any computer is the carriage return."
Disclaimer:  The opinions here are solely mine and may or may not reflect
  	     those of the company.