[comp.unix.wizards] Mysterious Sun-4 bug

lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) (06/27/91)

We are experiencing a peculiar bug which has appeared from time to time
on our Sun-4/490 server.  This system is very heavily loaded, mainly
because it is running Sybase.  It recently had an official clean 4.1.1
release installed, with DBE 1.1, and selected patches added.  The
system has a Sun VME FDDI board, and FDDI 1.1 is installed.  There is
heavy NFS traffic to another Sun server via FDDI (at the moment -
Ethernet has also been used).

The bug has appeared in 4.1, 4.1 + various patches (almost 4.1.1), 4.1.1,
with and without DBE installed, with and without FDDI (i.e., with NFS
traffic over Ethernet).  The same symptom has appeared in all cases:
a process which is usually doing NFS I/O will hang in "D" state.  The 
offending process cannot be killed, and eventually other processes
start hanging as well.   During this period, Sybase activity
will have been very heavy.  The Sybase dataserver process itself, however,
never hangs (note: Sybase is set up so that its I/O is local, *and*
Sybase is using its own raw partitions).  Even though Sybase itself
never hangs, *if Sybase asynch. I/O is turned OFF,
the problem rarely if ever appears.*

So, to cause the hang, you seem to need:

Sybase, with asynch I/O on.
A heavy Sybase load.
Another process doing NFS reads/writes...

Oh yes.  It seems to take a while to get in this predicament.  After
the inevitable reboot, the system is usually OK for a while.


Has anyone else experienced this problem?

It could be an NFS problem, an asynch I/O problem, a load dependent
kernel problem, ...

Any help would be much appreciated.



-- 
  Hugh LaMaster, M/S 233-9,  UUCP:                ames!lamaster
  NASA Ames Research Center  Internet:            lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035    With Good Mailer:    lamaster@george.arc.nasa.gov 
  Phone:  415/604-1056                            #include <std.disclaimer>

lamaster@pioneer.arc.nasa.gov (Hugh LaMaster) (06/27/91)

I previously wrote:

>The bug has appeared in 4.1, 4.1 + various patches (almost 4.1.1), 4.1.1,
>with and without DBE installed, with and without FDDI (ie, with NFS
>traffic over ethernet).  The same symptom has appeared in all cases:
>a process which is usually doing NFS I/O will hang in "D" state.  The 
>offending process cannot be killed, and eventually other processes
>start hanging as well.   During this period, Sybase activity
>will have been very heavy.  The Sybase dataserver process itself, however,
>never hangs (note: Sybase is set up so that its I/O is local, *and*
>Sybase is using its own raw partitions).  Even though Sybase itself
>never hangs, *if Sybase asynch. I/O is turned OFF,
>the problem rarely if ever appears.*

1) We are not running /tmp on tmpfs (i.e., in swap).  However, I understand
that doing so can cause a similar-sounding problem, which may be related.  It
could be a bug somewhere in the allocation of swap space.

2) I should have made it clear that the Sybase raw partitions are local
to the machine running Sybase, and that the database files are not accessed
over NFS.
Only user-type files are mounted off of the fileserver using NFS.  Also,
lockd and statd are not running.  I believe that there is no need for
them to be running, since Sybase is not reading/writing over NFS, and
is not complaining about lock requests failing.

3)  We had another hang yesterday afternoon.  The processes which hung
this time looked like the following:

       F  UID   PID  PPID CP PRI NI     SZ  RSS WCHAN    STAT TT  TIME COMMAND
20008000 1002  9562  9542  0  -1  0 149376    0 kernelma DW   pa  0:00 model
20008001 1002  9529  4227  0  -1  0 149376   72 kernelma D    pb  0:00 model
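
For anyone who wants to watch for the same symptom, here is a minimal
sketch (not something we are running) that picks the "D"-state lines out
of saved ps output and counts them by wait channel.  It assumes every
line of interest ends in WCHAN STAT TT TIME COMMAND, as in the listing
above, and that a sleeping process always has a non-empty WCHAN.  The
function names are made up for illustration.  It parses only the tail of
each line, so the leading columns may run together without harm.

```python
import re
from collections import Counter

# Tail of a long-format ps line: WCHAN STAT TT TIME COMMAND.
# Assumption: a process in "D" state always shows a non-empty WCHAN.
TAIL = re.compile(r"(\S+)\s+([A-Z][A-Z<>+]*)\s+(\S+)\s+(\d+:\d+)\s+(.+)$")


def disk_wait_processes(ps_text):
    """Return (wchan, stat, command) for each line whose STAT starts with 'D'."""
    hung = []
    for line in ps_text.splitlines():
        match = TAIL.search(line)
        if match is None:
            continue  # header line, blank line, or something unparseable
        wchan, stat, _tt, _time, command = match.groups()
        if stat.startswith("D"):
            hung.append((wchan, stat, command.strip()))
    return hung


def wchan_histogram(ps_text):
    """Count the 'D'-state processes found in ps_text by wait channel."""
    return Counter(wchan for wchan, _stat, _cmd in disk_wait_processes(ps_text))
```

Fed the two lines above, disk_wait_processes() reports both "model"
processes, and wchan_histogram() puts both of them under kernelma.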

A pstat -Ts showed the following:

[149] pstat -Ts
>pstat: number of files is preposterous (14019)
>1470/1470 inodes
>454/4090 processes
>460952/781032 swap
>

We have a lot of swap space allocated, to run some of these big jobs.
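
To track these tables over time rather than eyeballing them, a small
sketch along the following lines could turn the "used/total name" lines
of pstat -T output into ratios.  It assumes the line format shown above
(with or without the leading ">") and skips anything else, including the
"preposterous" complaint.  The function name is hypothetical.

```python
import re

# One table line of pstat -T output, e.g. "460952/781032 swap".
# A leading ">" (as in the listing above) is tolerated.
ROW = re.compile(r"^>?\s*(\d+)/(\d+)\s+(\w+)\s*$")


def table_usage(pstat_text):
    """Map each table name to a (used, total, used/total) tuple."""
    usage = {}
    for line in pstat_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # not a "used/total name" line
        used, total = int(match.group(1)), int(match.group(2))
        ratio = used / total if total else 0.0
        usage[match.group(3)] = (used, total, ratio)
    return usage
```

On the figures above this gives about 0.59 for swap, about 0.11 for
processes, and 1.0 for inodes.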
