[comp.unix.ultrix] help: random crashing of DS5000 running ULTRIX 4.1

mcba@newt.phys.unsw.OZ.AU (Michael C. B. Ashley) (05/27/91)

Hi,

This is a rather long message describing a problem I have with a
machine crashing. If anyone could shed some light on a possible
solution, I would be most grateful.

I have a DS5000/200PX running ULTRIX 4.1 (Rev. 52), and the machine
crashes an average of once a day. The symptoms of the crash are that the
system does not respond to keyboard entry, or to /etc/ping from another
machine. If the screen saver is activated, the console screen remains
off despite mouse movement or keyboard presses. If the screen saver is
not activated then the screen remains on (no error messages visible),
and the mouse will move the cursor.

No error messages are observed in the output of /etc/uerf. The machine
(and memory and disks) pass every diagnostic in /usr/field, and I have
run the hardware (V5.3) ROM tests for days at a time without picking up
any errors. Last week the system board was replaced; however, the
problem remains.

Once I noticed a message similar to "swap error" appearing in the
Session Manager message area at the instant of a crash. As far as I can
see my swap space is configured correctly (about 300 MBytes of swap for
48 MBytes of memory). I have tried rebuilding the kernel a few times
with minor changes, all with no effect. Running /etc/sec/auditd doesn't
show anything unusual at the time of the crash (although the
buffering of auditd would probably prevent the interesting information
from being written to disk).

The machine will run without crashing if I disconnect the ethernet. The
crashes aren't related to some user's program, since there aren't any
users other than root at the moment.

Needless to say, this is a very frustrating problem. Can anyone make any
suggestions as to what I should do next? I have two ideas:

  (1) Maybe my copy of ULTRIX is corrupt. It came from a TK50, a
      rather unreliable medium in my experience. I have run
      /etc/stl/fverify to try to check the files, and everything
      appears to be OK, although it is difficult to be sure because the
      *410.inv files show lots of checksum errors since they have
      been overwritten by *411.inv files.

  (2) Since the crashes appear to be related to the ethernet, maybe I
      need the "ln*.o kernel fix" that has been mentioned recently
      with respect to using tcpdump and LAT with ULTRIX 4.1. Note
      that our ethernet is teeming with exotic packets from all sorts
      of machines, and regularly crashes a couple of VT1000s we have
      in the building (they die with "illegal opcode 28", despite a
      recent ROM upgrade, but that is another story ...).

Thanks for any suggestions!
Michael Ashley mcba@newt.phys.unsw.oz.au

grr@cbmvax.commodore.com (George Robbins) (05/27/91)

In article <1606@usage.csd.unsw.oz.au> mcba@newt.phys.unsw.OZ.AU (Michael C. B. Ashley) writes:
> Hi,
> 
> This is a rather long message describing a problem I have with a
> machine crashing. If anyone could shed some light on a possible
> solution, I would be most grateful.
> 
> I have a DS5000/200PX running ULTRIX 4.1 (Rev. 52), and the machine
> crashes an average of once a day...
...
> No error messages are observed in the output of /etc/uerf. The machine
> (and memory and disks) pass every diagnostic in /usr/field, and I have
> run the hardware (V5.3) ROM tests for days at a time without picking up
> any errors. Last week the system board was replaced; however, the
> problem remains.
> 
> Once I noticed a message similar to "swap error" appearing in the
> Session Manager message area at the instant of a crash...

Why don't you attach a DECwriter or some similar device and set up
hard-copy console mode?  Then you can add periodic runs of pstat,
vmstat and ps to crontab to see how things go downhill leading up to
the "crash".

You may find that your swap space is eroding, or mountd is cancerous,
or something else obvious that you would lose from the CRT screen.
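
The periodic logging suggested above might look roughly like the
sketch below. It is only an illustration under several assumptions:
it is written in Python 3, which would have to run on a host that has
it (on ULTRIX 4.1 itself a small shell script called from crontab
would do the same job); the function name snapshot, the COMMANDS list
and the flags given to pstat and ps are hypothetical and should be
checked against the local man pages. Each run appends one timestamped
block to a log file and forces it to disk, so that the last snapshot
before a hang is not left sitting in a buffer.

```python
import os
import subprocess
import time

# Hypothetical command list. The thread names pstat, vmstat and ps;
# the flags are assumptions, so check them against the local man pages.
COMMANDS = [
    ["pstat", "-s"],   # assumed: swap space summary
    ["vmstat"],        # virtual memory statistics
    ["ps", "aux"],     # assumed: BSD-style listing of all processes
]


def snapshot(commands, log_path):
    """Append one timestamped block of command output to log_path."""
    with open(log_path, "a") as log:
        log.write("=== %s ===\n" % time.strftime("%Y-%m-%d %H:%M:%S"))
        for cmd in commands:
            log.write("$ %s\n" % " ".join(cmd))
            try:
                result = subprocess.run(
                    cmd,
                    stdout=subprocess.PIPE,
                    stderr=subprocess.STDOUT,
                    universal_newlines=True,
                    timeout=30,
                )
                log.write(result.stdout)
            except (OSError, subprocess.TimeoutExpired) as exc:
                # A missing or hung command is itself worth recording.
                log.write("(failed: %s)\n" % exc)
        # Push the data through Python's buffer and the OS cache so the
        # final snapshot survives a crash.
        log.flush()
        os.fsync(log.fileno())
```

A crontab entry could then call snapshot(COMMANDS, "/var/tmp/crashwatch.log")
every few minutes; the log path is again just a placeholder. Comparing
the last few blocks before a hang should show whether swap or some
daemon was steadily growing.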

> The machine will run without crashing if I disconnect the ethernet. The
> crashes aren't related to some user's program, since there aren't any
> users other than root at the moment.

Hmmm...

>   (2) Since the crashes appear to be related to the ethernet, maybe I
>       need the "ln*.o kernel fix" that has been mentioned recently
>       with respect to using tcpdump and LAT with ULTRIX 4.1. Note
>       that our ethernet is teeming with exotic packets from all sorts
>       of machines, and regularly crashes a couple of VT1000s we have
>       in the building (they die with "illegal opcode 28", despite a
>       recent ROM upgrade, but that is another story ...).

Well, you can always try to talk the support center into analyzing a
crash dump...

-- 
George Robbins - now working for,     uucp:   {uunet|pyramid|rutgers}!cbmvax!grr
but no way officially representing:   domain: grr@cbmvax.commodore.com
Commodore, Engineering Department     phone:  215-431-9349 (only by moonlite)