[comp.unix.wizards] processes that get stuck

thoth@springs.cis.ufl.edu (Gilligan) (03/12/90)

  While we're all discussing defunct and exiting processes, how about
the way SunOS 4.x sometimes puts processes in permanent disk wait?
Processes show up with a D in the STAT field of a ps -gux.  These
processes are unkillable and can only be removed with a halt and a
reboot.  They tend to collect other processes as well.  If you ever
put an emacs into this totally-hosed-state it is quite likely that any
others you start up after it will follow it into D-space.
  I have on occasion been using X windows and watched the xload
skyrocket to 15 as all my shells and compilations go to hell.  The
L1-a is the only solution.
  It hasn't happened as much since we got 8 megs of ram for all of our
computers.

  Comments? explanations?
--
( My name's not really Gilligan, It's Robert Forsman, without an `e' )

antony@lbl-csam.arpa (Antony A. Courtney) (03/12/90)

In article <THOTH.90Mar11213935@springs.cis.ufl.edu> thoth@springs.cis.ufl.edu (Gilligan) writes:
>
>  While we're all discussing defunct and exiting processes, how about
>the way SunOS 4.x sometimes puts processes in permanent disk wait?
>Processes show up with a D in the STAT field of a ps -gux.  These
>processes are unkillable and can only be removed with a halt and a
>reboot.  They tend to collect other processes as well.  If you ever
>put an emacs into this totally-hosed-state it is quite likely that any
>others you start up after it will follow it into D-space.
>  I have on occasion been using X windows and watched the xload
>skyrocket to 15 as all my shells and compilations go to hell.  The
>L1-a is the only solution.
>  It hasn't happened as much since we got 8 megs of ram for all of our
>computers.
>
>  Comments? explanations?
>--
>( My name's not really Gilligan, It's Robert Forsman, without an `e' )

We noticed this a lot, too.

The best explanation that I could come up with was the following:

We frequently noticed that when this did happen, we were occasionally getting
messages in syslog about how the system "lost interrupt from controller".  My
theory on what was going on is this:

Your application requests access to some file.  The inode of the file or a
block of the file is allocated and is locked by the kernel on behalf of your
process.  Then the kernel initiates the disk controller for the read(), and
puts your process to sleep on the block pending the DMA transfer from the
disk controller.  IF THE SYSTEM LOSES THE INTERRUPT TELLING IT THE DMA TRANSFER
HAS COMPLETED, OR IF THE DMA TRANSFER NEVER OCCURS, YOUR PROCESS SLEEPS ON
THIS BLOCK INDEFINITELY.   Furthermore, any other processes which attempt to
access this file will find the particular block locked and will sleep pending
a brelse() of this block.  Since your first process is never woken up, it never
releases the block and those subsequent processes also sleep indefinitely. 
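
To make the chain of events concrete, here is a toy model of the theory in
Python.  It is only a sketch of the reasoning above, not SunOS kernel code:
the names Buf, Kernel, read_block, disk_interrupt and simulate are all made
up for illustration, and the real buffer cache is considerably more involved.

```python
# Toy model of the lost-interrupt theory.  NOT SunOS kernel code:
# every name below (Buf, Kernel, read_block, ...) is hypothetical.

class Buf:
    """One disk block with a busy (locked) flag and a list of sleepers."""
    def __init__(self, blkno):
        self.blkno = blkno
        self.busy = False    # True while some process holds the block for I/O
        self.owner = None    # process that locked the block and started the DMA
        self.waiters = []    # processes asleep, waiting for the block's release


class Kernel:
    def __init__(self):
        self.bufs = {}
        self.state = {}      # process name -> "R" (runnable) or "D" (disk wait)

    def read_block(self, proc, blkno):
        """A process asks for a block and sleeps until the block is usable."""
        buf = self.bufs.setdefault(blkno, Buf(blkno))
        self.state[proc] = "D"
        if buf.busy:
            buf.waiters.append(proc)   # block is locked: sleep on it
        else:
            buf.busy = True            # lock the block and start the transfer
            buf.owner = proc

    def disk_interrupt(self, blkno):
        """Completion interrupt: the owner wakes and releases the block."""
        buf = self.bufs[blkno]
        self.state[buf.owner] = "R"    # owner wakes up ...
        buf.busy = False               # ... and releases the block,
        buf.owner = None
        for proc in buf.waiters:       # which wakes everyone sleeping on it
            self.state[proc] = "R"
        buf.waiters.clear()


def simulate(interrupt_lost):
    """Three processes touch the same block; optionally drop the interrupt."""
    k = Kernel()
    k.read_block("emacs-1", 42)
    k.read_block("emacs-2", 42)
    k.read_block("csh", 42)
    if not interrupt_lost:
        k.disk_interrupt(42)
    return k.state


if __name__ == "__main__":
    print("interrupt delivered:", simulate(interrupt_lost=False))
    print("interrupt lost:     ", simulate(interrupt_lost=True))
```

With the interrupt delivered, all three processes end up runnable.  With it
lost, nothing ever calls disk_interrupt(), so the owner and every later
process that touched the block stay in "D" for good.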

We have not had this happen for quite a while.  (We have also been running
SunOS 4.1 Beta since X-mas.)  We may have also replaced our disk controller,
I'm not sure.

Has anyone else experienced this problem?  Is my 'theory' a valid explanation?

				antony
--
*******************************************************************************
Antony A. Courtney				antony@lbl.gov
Advanced Development Group			ucbvax!lbl-csam.arpa!antony
Lawrence Berkeley Laboratory			AACourtney@lbl.gov