mcba@newt.phys.unsw.OZ.AU (Michael C. B. Ashley) (05/27/91)
Hi,

This is a rather long message describing a problem I have with a machine crashing. If anyone could shed some light on a possible solution, I would be most grateful.

I have a DS5000/200PX running ULTRIX 4.1 (Rev. 52), and the machine crashes an average of once a day. The symptoms of the crash are that the system does not respond to keyboard entry, or to /etc/ping from another machine. If the screen saver is activated, the console screen remains off despite mouse movement or key presses. If the screen saver is not activated, then the screen remains on (no error messages visible), and the mouse will move the cursor.

No error messages are observed in the output of /etc/uerf. The machine (and memory and disks) pass every diagnostic in /usr/field, and I have run the hardware (V5.3) ROM tests for days at a time without picking up any errors. Last week the system board was replaced; however, the problem remains.

Once I noticed a message similar to "swap error" appearing in the Session Manager message area at the instant of a crash. As far as I can see my swap space is configured correctly (about 300 MBytes of swap for 48 MBytes of memory). I have tried rebuilding the kernel a few times with minor changes, all with no effect. Running /etc/sec/auditd doesn't show up anything unusual at the time of the crash (although the buffering of auditd would probably prevent the interesting information being written to disk).

The machine will run without crashing if I disconnect the ethernet. The crashes aren't related to some user's program, since there aren't any users other than root at the moment.

Needless to say, this is a very frustrating problem. Can anyone make any suggestions as to what I should do next? I have two ideas:

(1) Maybe my copy of ULTRIX is corrupt. It came from a TK50, a rather unreliable medium in my experience. I have run /etc/stl/fverify to try and check the files, and everything appears to be OK, although it is difficult to be sure: the *410.inv files show lots of checksum errors since they have been overwritten by *411.inv files.

(2) Since the crashes appear to be related to the ethernet, maybe I need the "ln*.o kernel fix" that has been mentioned recently with respect to using tcpdump and LAT with ULTRIX 4.1. Note that our ethernet is teeming with exotic packets from all sorts of machines, and regularly crashes a couple of VT1000's we have in the building (they die with "illegal opcode 28", despite a recent ROM upgrade, but that is another story ...).

Thanks for any suggestions!

Michael Ashley
mcba@newt.phys.unsw.oz.au

grr@cbmvax.commodore.com (George Robbins) (05/27/91)
In article <1606@usage.csd.unsw.oz.au> mcba@newt.phys.unsw.OZ.AU (Michael C. B. Ashley) writes:
> Hi,
>
> This is a rather long message describing a problem I have with a
> machine crashing. If anyone could shed some light on a possible
> solution, I would be most grateful.
>
> I have a DS5000/200PX running ULTRIX 4.1 (Rev. 52), and the machine
> crashes an average of once a day...
...
> No error messages are observed in the output of /etc/uerf. The machine
> (and memory and disks) pass every diagnostic in /usr/field, and I have
> run the hardware (V5.3) ROM tests for days at a time without picking up
> any errors. Last week the system board was replaced; however, the
> problem remains.
>
> Once I noticed a message similar to "swap error" appearing in the
> Session Manager message area at the instant of a crash...

Why don't you attach a DECwriter or some similar device and set up for hard-copy console mode? Then you can put in crontab periodic executions of pstat, vmstat and ps to see how things go downhill leading to the "crash". You may find that your swap space is eroding, or mountd is cancerous, or something else obvious that you lose from the CRT screen.

> The machine will run without crashing if I disconnect the ethernet. The
> crashes aren't related to some user's program, since there aren't any
> users other than root at the moment.

Hmmm...

> (2) Since the crashes appear to be related to the ethernet, maybe I
> need the "ln*.o kernel fix" that has been mentioned recently
> with respect to using tcpdump and LAT with ULTRIX 4.1. Note
> that our ethernet is teeming with exotic packets from all sorts
> of machines, and regularly crashes a couple of VT1000's we have
> in the building (they die with "illegal opcode 28", despite a
> recent ROM upgrade, but that is another story ...).

Well, you can always try to talk the support center into analyzing a crash dump...

--
George Robbins - now working for,     uucp:   {uunet|pyramid|rutgers}!cbmvax!grr
but no way officially representing:   domain: grr@cbmvax.commodore.com
Commodore, Engineering Department     phone:  215-431-9349 (only by moonlite)
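
The periodic logging suggested in the reply above (run pstat, vmstat and ps at intervals and keep the output somewhere that survives the hang) can be sketched roughly as follows. This is a minimal illustration in Python for a present-day Unix-like system, not something aimed at ULTRIX 4.1 itself. The function names (snapshot, watch), the command list and the log path are assumptions made for the example. On the machine in question the same idea would more likely be a few crontab entries writing to the hard-copy console. Each snapshot is flushed and fsync'ed so that the last entries before a hang have a chance of reaching disk.

```python
import os
import subprocess
import time
from datetime import datetime


def snapshot(commands, log_path):
    """Append the timestamped output of each command to log_path.

    `commands` is a list of argv lists, e.g. [["vmstat"], ["ps", "aux"]].
    """
    with open(log_path, "a") as log:
        for argv in commands:
            stamp = datetime.now().isoformat(timespec="seconds")
            log.write(f"=== {stamp} {' '.join(argv)} ===\n")
            try:
                result = subprocess.run(
                    argv, capture_output=True, text=True, timeout=30
                )
                log.write(result.stdout)
                log.write(result.stderr)
            except (OSError, subprocess.TimeoutExpired) as exc:
                # A missing or hung command should not stop the other snapshots.
                log.write(f"[could not run: {exc}]\n")
        # Push the data to disk now; after a hang there is no second chance.
        log.flush()
        os.fsync(log.fileno())


def watch(commands, log_path, interval=300, rounds=None):
    """Take a snapshot every `interval` seconds; run forever if rounds is None."""
    done = 0
    while rounds is None or done < rounds:
        snapshot(commands, log_path)
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

A hypothetical invocation would be watch([["vmstat"], ["ps", "aux"]], "/var/tmp/crashwatch.log"), with the command list replaced by whatever reports swap and process state on the system at hand. After a crash, the tail of the log shows the last state recorded before the machine stopped responding.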