[comp.unix.ultrix] System crashing.. HELP!

parker@zaphod.Berkeley.EDU (Ross Parker) (03/06/90)

System:  Microvax-II with Emulex QD-32 disk controller and
         two Fujitsu Eagle disk drives. 13 Mb memory. Running
         Ultrix 3.0. System supports perhaps 15-20 interactive
         logins, and perhaps 20 PCs connected via Sun's PC-NFS.
         The PCs access files using standard NFS on the Microvax.

Symptom: One user on a PC can try to bring up a particular
         file under WordPerfect (version 5.0 or 5.1) on the
         PC, and, without fail, cause the Microvax to
         instantly crash. This problem just started happening.
         No system changes, either hardware or software, have
         taken place for a number of months. The user does
         not have a problem with any other files, nor does
         any other user cause the system to die. The system
         is also used for NFS operations from other Vaxen, and
         from some Sun systems, and no problems occur. The crash
         symptoms are (on the console):


Trap Type 9, code = 803771ff, pc = 80034ca0

panic: Protection fault


         and in the error log:


********************************* ENTRY    29. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             ERROR EVENT
OS EVENT TYPE                  104.     CONTROLLER ERROR
SEQUENCE NUMBER                  0.
OPERATING SYSTEM                        ULTRIX 32
OCCURRED/LOGGED ON                      Mon Mar  5 13:44:28 1990 PST
OCCURRED ON SYSTEM                      waters
SYSTEM ID                 x08000000
SYSTYPE REG.              x01010000
                                        FIRMWARE REV = 1.
PROCESSOR TYPE                          KA630

----- UNIT INFORMATION -----

UNIT CLASS                              ADAPTER/CONTROLLER
UNIT TYPE                               UDA50A
CONTROLLER NO.
UNIT NO.                         0.
ERROR SYNDROME                          CONTROLLER ERROR

********************************* ENTRY    30. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             ERROR EVENT
OS EVENT TYPE                  200.     PANIC
SEQUENCE NUMBER                  5.
OPERATING SYSTEM                        ULTRIX 32
OCCURRED/LOGGED ON                      Mon Mar  5 13:42:26 1990 PST
OCCURRED ON SYSTEM                      waters
SYSTEM ID                 x08000000
SYSTYPE REG.              x01010000
                                        FIRMWARE REV = 1.
PROCESSOR TYPE                          KA630
PANIC MESSAGE                           Protection fault

********************************* ENTRY    31. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             ERROR EVENT
OS EVENT TYPE                  109.     EXCEPTION/FAULT
SEQUENCE NUMBER                  4.
OPERATING SYSTEM                        ULTRIX 32
OCCURRED/LOGGED ON                      Mon Mar  5 13:42:26 1990 PST
OCCURRED ON SYSTEM                      waters
SYSTEM ID                 x08000000
SYSTYPE REG.              x01010000
                                        FIRMWARE REV = 1.
PROCESSOR TYPE                          KA630

----- UNIT INFORMATION -----

ERROR SYNDROME                          PROTECTION FAULT

********************************* ENTRY    32. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             OPERATIONAL EVENT
OS EVENT TYPE                  250.     ASCII MSG
SEQUENCE NUMBER                  7.
OPERATING SYSTEM                        ULTRIX 32
OCCURRED/LOGGED ON                      Mon Mar  5 13:42:40 1990 PST
OCCURRED ON SYSTEM                      waters
SYSTEM ID                 x08000000
SYSTYPE REG.              x01010000
                                        FIRMWARE REV = 1.
PROCESSOR TYPE                          KA630
MESSAGE                                 done

********************************* ENTRY    33. *********************************

----- EVENT INFORMATION -----

EVENT CLASS                             OPERATIONAL EVENT
OS EVENT TYPE                  250.     ASCII MSG
SEQUENCE NUMBER                  6.
OPERATING SYSTEM                        ULTRIX 32
OCCURRED/LOGGED ON                      Mon Mar  5 13:42:39 1990 PST
OCCURRED ON SYSTEM                      waters
SYSTEM ID                 x08000000
SYSTYPE REG.              x01010000
                                        FIRMWARE REV = 1.
PROCESSOR TYPE                          KA630
MESSAGE                                 syncing disks...
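
The listing above is not in time order: entry 29 (the controller error) is
stamped 13:44:28, about two minutes after the panic in entry 30 at 13:42:26.
A minimal sketch for sorting such a report by timestamp and counting events
per type follows. It assumes the uerf report has been saved as text in the
layout shown above; the names `parse_entries` and `sort_by_time` and the
trimmed `SAMPLE` text are illustrative and not part of uerf.

```python
import re
from collections import Counter
from datetime import datetime

# Field labels are taken from the listing above; other uerf layouts may differ.
ENTRY_RE = re.compile(r"ENTRY\s+(\d+)\.")
TYPE_RE = re.compile(r"^OS EVENT TYPE\s+(\d+)\.\s+(.*\S)")
TIME_RE = re.compile(r"^OCCURRED/LOGGED ON\s+(.*\S)")

# Three entries trimmed from the log above (assumed to be saved as plain text).
SAMPLE = """\
********************************* ENTRY    29.
OS EVENT TYPE                  104.     CONTROLLER ERROR
OCCURRED/LOGGED ON                      Mon Mar  5 13:44:28 1990 PST
********************************* ENTRY    30.
OS EVENT TYPE                  200.     PANIC
OCCURRED/LOGGED ON                      Mon Mar  5 13:42:26 1990 PST
********************************* ENTRY    31.
OS EVENT TYPE                  109.     EXCEPTION/FAULT
OCCURRED/LOGGED ON                      Mon Mar  5 13:42:26 1990 PST
"""


def parse_entries(text):
    """Split a saved report into dicts with 'entry', 'type' and 'when' keys."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = ENTRY_RE.search(line)
        if m:
            current = {"entry": int(m.group(1))}
            entries.append(current)
            continue
        if current is None:
            continue
        m = TYPE_RE.match(line)
        if m:
            current["type"] = m.group(2)
            continue
        m = TIME_RE.match(line)
        if m:
            # Drop the weekday and the time zone name, keep "Mar 5 13:44:28 1990".
            fields = m.group(1).split()[1:-1]
            current["when"] = datetime.strptime(
                " ".join(fields), "%b %d %H:%M:%S %Y"
            )
    return entries


def sort_by_time(entries):
    """Return the entries that carry a timestamp, oldest first."""
    return sorted((e for e in entries if "when" in e), key=lambda e: e["when"])


if __name__ == "__main__":
    entries = parse_entries(SAMPLE)
    for e in sort_by_time(entries):
        print(e["when"], e["entry"], e.get("type", "?"))
    print(Counter(e.get("type", "?") for e in entries))
```

Run over a full report, this gives a quick way to see how many controller
errors accompany each panic and whether they are stamped before or after it.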



         Now this certainly looks like a probable bad controller,
right? Well, we've replaced the controller with a new one, and
get an identical problem... down to identical register values in
the register dump. We've also run DEC diagnostics, and the system
passes with no problems other than (and this I'm mildly worried
about) the disk controller. However, I believe the controller
fails DEC's diags because it's a non-DEC controller and the
diags are expecting a KDA50. The controller (the new one) passed the
vendor's diags, and the diags included scanning the disks. No
problems were found anywhere.

	In addition, we can read and write any file on both disks
locally (not via NFS) and no problems occur, so the problem is
possibly related to NFS rather than to disk driver code or
whatever.

	Perusing the resultant crash dumps has not given me
much enlightenment; however, I'm not an expert at that, and I
have misplaced my list of magic incantations to have adb show
anything useful. Perhaps someone can enlighten me? Care to
bite, George? I'm sure you've done this numerous times.

	If anyone can shed any light on this, it'd be *much*
appreciated. This Monday, the system went down about 7 times
before this particular user called us to say that each crash
happened exactly when she tried to access this file!

Thanks,

Ross Parker				parker@mpre.mpr.ca
(604)293-5495				uunet!ubc-cs!mpre!parker

grr@cbmvax.commodore.com (George Robbins) (03/07/90)

In article <2080@kiwi.mpr.ca> parker@zaphod.Berkeley.EDU (Ross Parker) writes:
> 
> Symptom: One user on a PC can try to bring up a particular
>          file under WordPerfect (version 5.0 or 5.1) on the
>          PC, and, without fail, cause the Microvax to
>          instantly crash. This problem just started happening.
>          No system changes, either hardware or software, have
>          taken place for a number of months. The user does
>          not have a problem with any other files, nor does
>          any other user cause the system to die. The system
>          is also used for NFS operations from other Vaxen, and
>          from some Sun systems, and no problems occur. The crash
>          symptoms are (on the console):

Well, from your description there isn't enough info to make a good guess,
but I'd suggest a few avenues to pursue...

1) look at the crash dump information on the console and a nm -n /vmunix
   and see what kind of code the pc is in, nfs, ufs or something else.

2) fsck the disk and make sure there are no organizational problems
   associated with the file.

3) try to read the file/files in question with tar or other utility to
   make sure that the problem is (perhaps) NFS and not basic filesystem
   stuff.

4) try to correlate the errors with the UERF output.  Is there exactly
   one of the "controller errors" per crash?  Do all the crashes have
   the same PC value?

5) try the uerf -o full option to see if you get more information on the
   "controller error" that might give a clue...

6) talk to the support center about known 3.0 NFS problems (I think there
   are quite a few) and either apply the 3.0 NFS / networking patch tape
   or upgrade to 3.1 and do any appropriate patches there...
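
Step 1 above amounts to finding the last text symbol in the `nm -n /vmunix`
output whose address is at or below the faulting pc. A minimal sketch,
assuming the usual three-column "address type name" output. The symbol names
and addresses in `SAMPLE_NM` are made up for illustration (only the pc value
80034ca0 comes from the console message), and `parse_nm` and `symbol_for_pc`
are hypothetical helper names.

```python
import bisect

# Made-up symbol table in `nm -n` layout; NOT taken from a real /vmunix.
SAMPLE_NM = """\
         U _example_undefined
80034000 T _example_ufs_routine
80034c00 T _example_nfs_routine
80035000 T _example_other_routine
"""


def parse_nm(lines):
    """Parse "<hex addr> <type> <name>" lines into sorted (addr, name) pairs."""
    symbols = []
    for line in lines:
        parts = line.split()
        # Undefined symbols have no address column, so they split into two parts.
        if len(parts) != 3:
            continue
        addr, kind, name = parts
        if kind not in ("T", "t"):  # keep text (code) symbols only
            continue
        try:
            symbols.append((int(addr, 16), name))
        except ValueError:
            continue
    symbols.sort()
    return symbols


def symbol_for_pc(symbols, pc):
    """Return (name, offset) of the nearest text symbol at or below pc, else None."""
    addrs = [addr for addr, _ in symbols]
    i = bisect.bisect_right(addrs, pc) - 1
    if i < 0:
        return None
    addr, name = symbols[i]
    return name, pc - addr


if __name__ == "__main__":
    syms = parse_nm(SAMPLE_NM.splitlines())
    hit = symbol_for_pc(syms, 0x80034CA0)  # pc from the console trap message
    if hit is not None:
        print("%s+0x%x" % hit)
```

With a real symbol table piped in from `nm -n /vmunix`, the routine name
usually makes it obvious whether the pc sits in nfs, ufs or driver code.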

-- 
George Robbins - now working for,     uucp:   {uunet|pyramid|rutgers}!cbmvax!grr
but no way officially representing:   domain: grr@cbmvax.commodore.com
Commodore, Engineering Department     phone:  215-431-9349 (only by moonlite)

parker@zaphod.Berkeley.EDU (Ross Parker) (03/08/90)

In article <10020@cbmvax.commodore.com>, grr@cbmvax.commodore.com
(George Robbins) writes:
> In article <2080@kiwi.mpr.ca> parker@zaphod.Berkeley.EDU (Ross Parker) writes:
> > ... stuff about one of my systems crashing ...
> 
> George suggests a number of things to try to track down the problem...

Actually, George, I had tried all that you had suggested, and also finally
remembered some of my long-lost adb knowledge to get a stack trace from
one of the core dumps. The problem is definitely in the nfs code. My original
panicked message was written in the wee hours of the morning with my brain
on autopilot, so it wasn't very clear!

> 
> 6) talk to the support center about known 3.0 NFS problems (I think there
>    are quite a few) and either apply the 3.0 NFS / networking patch tape
>    or upgrade to 3.1 and do any appropriate patches there...
> 

This I did the next day, and (much to my surprise) they seemed to think that
the problem was a known one... I've been playing phone tag since then, but
I believe an upgrade to 3.1 plus some patches will cure my woes (exactly
what George suggests).

> George Robbins - now working for,     uucp:   {uunet|pyramid|rutgers}!cbmvax!grr
> but no way officially representing:   domain: grr@cbmvax.commodore.com
> Commodore, Engineering Department     phone:  215-431-9349 (only by moonlite)

Thanks for the reply,

Ross Parker				parker@mpre.mpr.ca
(604)293-5495				uunet!ubc-cs!mpre!parker

cmaarpc@cc.ic.ac.uk (Peter Churchyard) (03/08/90)

We have a microvax with Fujitsu drives and a QD32 controller. When I
first tried to install 3.0, the kernel would panic whenever I wrote
to the Eagle (chpt, mount, fsck, etc.). I checked with Emulex and
found that we were running OLD firmware in the QD32 (revision C).
We have now moved to revision H, and I then had to RE-FORMAT the Eagle.
This installs some diag stuff. Put up 3.0 etc. and it now works great! So
check those revs.

-------------------------------------------------------------------------
If you are reading this and you are not a DECUS member, WHY NOT?.
Peter Churchyard. DECUS Europe UNiX SIG Chair.
Real life: Group Leader Communications Section, Imperial College Comp Centre.