aco%math.tau.ac.il@cunyvm.cuny.edu (11/10/89)
> On occasions, to numerous to mention, over the last couple of weeks our
> Sun 4/280S have had the load average skyrocket (i.e. >60) after which it
> is catatonic and must be rebooted.

Our Sun 3/160S has the same problems (since SunOS 4.0.3_Export was
installed).

> When this happened the other day we tried to look around before rebooting.
> Shutting down to single user mode we got the "Something won't die, ps axl
> advised." Below is the output from the "ps axl". We have also noticed
> one of the nfsds waiting on `kernelma' just as the load starts to climb.
> This machine is also subject to the infamous `Bus Error Reg 20<TIMEOUT>'
> and `BAD TRAP' panics that have been reported here by ourselves and
> others.

We get similar behaviour: a daily Bus Error Reg 20 <TIMEOUT>. We also have
nfsd's waiting on kernelma.

Additional information: our server serves 9 clients. Most of the nfsd's
(8) seem to be working hard (they are seen as 'running' in the 'top'
display most of the time). Vmstat shows a lot of interrupts (about 10
times more than other servers with similar configuration and load), and
vmstat -i shows ie0 as the main source of interrupts. There is no cpu
problem (we have idle cpu most of the time). I/O is horrible: there is
almost no response to keyboard input. There are constantly 8 - 12
processes in the run queue and the number of context switches is huge.

All this sometimes happens when most of the clients are idle and nobody is
logged in on the server (except root). Disconnecting the Ethernet cable
reduces the number of interrupts to 'normal', empties the run queue and
seems to solve the problem. Killing 7 out of 8 nfsd's has the same effect.

Here is a comparison (of possibly relevant devices) between our machine
and psuvax1:

psuvax1                               our server

mem = 32768K (0x2000000)              mem = 12288k (0xc00000)
avail mem = 31481856                  avail mem = 11214848
xd0: <NEC D2363 ... >                 xd0: <Fujitsu-M2361 Eagle ...>
xd1: <NEC D2363 ... >                 xd1: <CDC-SABRE-1230 ...>
xdc1 at vme16d32 0xee90 vec 0x45      xyc0 at vme16d16 0xee40 vec 0x48
xd4: <NEC D2363 ... >                 xy0: <Fujitsu-M2351 Eagle ...>
si0 at vme24d16 0x200000 vec 0x40     sc0 at vme24d16 0x200000 vec 0x40
st1 at si0 slave 40                   st0 at sc0 slave 32
                                      sd0 at sc0 slave 0 <-- no disk attached
zs0 at obio 0xf1000000 pri 3          zs0 at obio 0x20000 pri 3
zs1 at obio 0xf0000000 pri 3
mcp0 at vme32d32 0x1000000 vec 0x8b   mcp0 at vme32d32 0x1000000 vec 0x8b
ie0 at obio 0xf6000000 pri 3          ie0 at obio 0xc0000 pri 3
ie1 at vme24d16 0xe88000 vec 0x75

We have similar servers (3/160 and 3/180) with similar configurations,
except for the ALM board (mcp), which is installed on the problematic host
only. There are no similar problems on these servers. Thus we suspect the
mcp, or the interaction between the 2 disk controllers (2 xdc's on
psuvax1, 1 xdc and 1 xyc on our machine), their device drivers, or the
interaction between these drivers and other parts of the kernel.

A different (?) question: The manual states that "Four seems to be a good
number" concerning the number of nfsd's. However, in the distributed
/etc/rc.local the number of nfsd's is 8.

Any comments or ideas?

Ariel Cohen
System manager
Tel-Aviv University CS lab, School of Math Sci.
rodney@taac.ipl.rpi.edu (Rodney Peck II) (11/20/89)
>>>>> On 10 Nov 89 14:06:44 GMT, aco%math.tau.ac.il@cunyvm.cuny.edu said:

aco> X-Refs: Original: v8n180
aco> X-Sun-Spots-Digest: Volume 8, Issue 197, message 2 of 12

> On occasions, to numerous to mention, over the last couple of weeks our
> Sun 4/280S have had the load average skyrocket (i.e. >60) after which it
> is catatonic and must be rebooted.

aco> Our Sun 3/160S has the same problems. (since SunOS 4.0.3_Export has been
aco> installed).

> When this happened the other day we tried to look around before rebooting.
> Shutting down to single user mode we got the "Something won't die, ps axl
> advised." Below is the output from the "ps axl". We have also noticed
> one of the nfsds waiting on `kernelma' just as the load starts to climb.
> This machine is also subject to the infamous `Bus Error Reg 20<TIMEOUT>'
> and `BAD TRAP' panics that have been reported here by ourselves and
> others.

Our Sun 4/280 file server has a similar problem. The other day, the number
of jobs went thru the roof and the nfsd's (eight of them) were each taking
about 5 to 7% of the cpu time. We ignored it for the day, hoping it would
mellow out, because we had two jobs that had been running for days on the
machine and couldn't really just bag them and reboot.

I hope someone will come up with the solution to this problem.

Rodney