[comp.unix.wizards] Can a process stop with a locked inode?

hue@coney.island.COM (Pond Scum) (06/19/91)

I'm seeing some strange behavior on file systems mounted on a device
driver (erasable optical disk) I wrote.  Almost everything seems to be ok,
I can newfs the disk, mount it, copy files to it, fsck it, etc.  However,
if I copy a huge file to the disk and send the cp process a SIGSTOP while
it's copying the file, I can't stat(2) the destination file.  It just hangs
until I either continue the cp in the foreground or background, or kill it.
No data is lost or anything else, but if I do an ls -l it hangs until
the cp starts running again or exits.  It's as if the cp is getting stopped
while holding a lock on the inode for the destination file, and the stat
is sleeping until the inode is free.  Is this even possible?  I can't think
of anything else to explain what I'm seeing.  I checked to make sure that
the driver's interrupt routine always calls the start routine until there
are no more bufs in its queue, so that's not the problem.  I can do an ls
on the directory, and an ls -l on all the other files, but ls -l on the
destination hangs, so I'm assuming that it's the stat(2) on the destination
file that causes the hang.
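
One way to confirm that the stat(2) call itself is what blocks, rather than something else ls does, is to issue it from a child process and give up after a timeout. This is only a sketch under assumptions: the name stat_completes_within and the 100 ms polling interval are made up for illustration, and it relies on POSIX calls (fork, waitpid with WNOHANG, nanosleep), not on anything specific to SunOS 4.1.1.

```c
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() and kill() */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if stat(path) finished within
 * timeout_sec seconds (whether it succeeded or failed), 0 if the
 * child was still blocked in the call, -1 if fork() failed.
 */
int stat_completes_within(const char *path, int timeout_sec)
{
    assert(path != NULL && timeout_sec > 0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct stat sb;
        (void)stat(path, &sb);    /* may sleep in the kernel on a locked inode */
        _exit(0);
    }

    const struct timespec tick = { 0, 100L * 1000L * 1000L };  /* 100 ms */
    int polls = timeout_sec * 10;
    for (int i = 0; i < polls; i++) {
        if (waitpid(pid, NULL, WNOHANG) == pid)
            return 1;             /* child returned from stat() and exited */
        nanosleep(&tick, NULL);
    }
    if (waitpid(pid, NULL, WNOHANG) == pid)
        return 1;

    /* Still blocked.  Do not wait for the child here: if it is in an
     * uninterruptible sleep it will not die until it wakes up. */
    kill(pid, SIGKILL);
    return 0;
}
```

With the cp stopped, a call on the destination file would be expected to return 0 if the stat is what hangs, and 1 again once the cp is continued or killed.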

The driver we got from the distributor that sold us the disk also exhibits
the same behavior.

The environment is a Sun SPARCStation 2 running SunOS 4.1.1.  I would greatly
appreciate any help.

Thanks
Jonathan			hue@island.COM

richard@locus.com (Richard M. Mathews) (06/26/91)

hue@coney.island.COM (Pond Scum) writes:

>No data is lost or anything else, but if I do an ls -l it hangs until
>the cp starts running again or exits.  It's as if the cp is getting stopped
>while holding a lock on the inode for the destination file, and the stat
>is sleeping until the inode is free.  Is this even possible?

It sounds like that is what is happening.  This is possible if you ever
sleep at pri>PZERO while the inode is locked.  First check the wchan
of the "ls" process.  If it points at the incore inode, then you know
SOMEONE has the inode locked, the "cp" is a good candidate, and you know
that since it did get stopped it must have been at pri>PZERO; thus this
is almost definitely the problem.
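
The rule being relied on here can be written down as a small user-space model. This is a sketch, not kernel code: the numeric values are assumed from a 4.3BSD-style <sys/param.h> and should be checked against the param.h on the system in question, PINOD and PWAIT are included only for contrast, and can_stop_holding_lock is a hypothetical name for the hazard described above.

```c
#include <assert.h>

/* Sleep priorities, 4.3BSD style (assumed values).
 * A lower number means a higher priority. */
enum {
    PINOD  = 10,   /* waiting for an inode */
    PRIBIO = 20,   /* waiting for block I/O, e.g. in biowait() */
    PZERO  = 25,   /* boundary between the two kinds of sleep */
    PWAIT  = 30    /* e.g. waiting for a child to exit */
};

/* A biowait-style sleep sits on the non-interruptible side of PZERO. */
static_assert(PRIBIO < PZERO, "PRIBIO must be below PZERO");

/* A sleep at pri > PZERO can be broken by a signal;
 * a sleep at pri <= PZERO cannot. */
int sleep_is_interruptible(int pri)
{
    return pri > PZERO;
}

/* Hypothetical predicate: a process can end up stopped while still
 * holding the inode lock only if it sleeps interruptibly with the
 * lock held. */
int can_stop_holding_lock(int inode_locked, int pri)
{
    return inode_locked && sleep_is_interruptible(pri);
}
```

Under these assumed values, a sleep at PRIBIO with the inode locked does not meet the condition, while a sleep at PWAIT with the inode locked does.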

If a quick glance at the 2nd arguments to your sleep calls doesn't find
the bad sleep call, you could use "crash" or equivalent to look at the
kernel stack of the "cp" process.  (If a program to find the kernel
stack is not available for you, you might have to check out page table
entries or page pointers in the proc structure to find that process's
kernel stack.)  That should help you find exactly where the process is
sleeping.

Richard M. Mathews			 Freedom for Lithuania
richard@locus.com				Laisve!
lcc!richard@seas.ucla.edu
...!{uunet|ucla-se|turnkey}!lcc!richard

boyd@prl.dec.com (Boyd Roberts) (06/26/91)

In article <1991Jun25.232436.1215039@locus.com>, richard@locus.com (Richard M. Mathews) writes:
> 
> It sounds like that is what is happening.  This is possible if you ever
> sleep at pri>PZERO while the inode is locked.

This is nonsense.  When the inode is locked no process, apart
from the one with the lock, can operate on it.  Sleeping with
the inode locked may make things worse, but the priority is irrelevant.

For completeness, I'll just add that sleeps at priorities > PZERO are interruptible.

> First check the wchan
> of the "ls" process.  If it points at the incore inode, then you know
> SOMEONE has the inode locked, the "cp" is a good candidate, and you know
> that since it did get stopped it must have been at pri>PZERO; thus this
> is almost definitely the problem.

Well I'd hardly call it a problem.  The `cp' will have the inode locked while
it is doing I/O on it, or stat(2)ing it, or the directory it's in will
be locked during open/creat/unlink.  You should really be asking:

    1. Is this caused by the inode being locked?
    2. If so, does `cp' have the inode locked?
    3. If so, for how long?
    4. Why is it locked for so long?

In practice this isn't a problem.

> If a quick glance at the 2nd arguments to your sleep calls doesn't find
> the bad sleep call, you could use "crash" or equivalent to look at the
> kernel stack of the "cp" process.  (If a program to find the kernel
> stack is not available for you, you might have to check out page table
> entries or page pointers in the proc structure to find that process's
> kernel stack.)  That should help you find exactly where the process is
> sleeping.

Unless you're really sure about what you're doing, any kernel data you observe
will just confuse you.  Even in a static system (a crash dump) it is often
far from obvious what is going on.  Maybe the `problem' isn't with `cp'.

Boyd Roberts			boyd@prl.dec.com

``When the going gets weird, the weird turn pro...''

hue@island.COM (Pond Scum) (06/28/91)

In article <1991Jun25.232436.1215039@locus.com> richard@locus.com (Richard M. Mathews) writes:
>hue@coney.island.COM (Pond Scum) writes:
>>the cp starts running again or exits.  It's as if the cp is getting stopped
>>while holding a lock on the inode for the destination file, and the stat
>>is sleeping until the inode is free.  Is this even possible?
>
>It sounds like that is what is happening.  This is possible if you ever
>sleep at pri>PZERO while the inode is locked.  First check the wchan

I don't call sleep except when I need a buf to do a special command,
and that isn't happening here (only through ioctl).  All this stuff
is coming through the strategy routine from the ufs file system, so I
figure someone higher up is calling iowait() (biowait()) when they need
to sleep, and sleeping at PRIBIO.

>of the "ls" process.  If it points at the incore inode, then you know
>SOMEONE has the inode locked, the "cp" is a good candidate, and you know
>that since it did get stopped it must have been at pri>PZERO; thus this
>is almost definitely the problem.

Well, I had another theory, but it doesn't explain everything.  Writes to
an ordinary file (a "fast device") are normally not interruptible because
they sleep at priorities higher than PZERO (lower numerically), correct?
In SunOS 4.1.1, they made some change to the kernel that makes writes to
ordinary files interruptible (and not restartable!).  It seems to me that
this same change would allow a stop signal to interrupt cp while it had
an inode locked and stop it, and cause the ls to hang when it tried to
stat the file.
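
The general effect, a stopped process keeping a lock so that everyone else waits until it is continued or killed, can be shown at user level. The sketch below is an analogy only: it uses an advisory flock(2) lock on a temporary file in place of the kernel's inode lock, which is a different mechanism. The function name and the temporary file path are made up for illustration.

```c
#define _DEFAULT_SOURCE 1   /* for flock() and mkstemp() under glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/file.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a lock taken by a child stays held while the child is
 * stopped and becomes free once the child is killed, 0 if not,
 * -1 on setup failure.
 */
int stopped_holder_keeps_lock(void)
{
    char path[] = "/tmp/stoplockXXXXXX";
    int sync_pipe[2];
    int fd = mkstemp(path);               /* the parent's own open file */
    if (fd < 0)
        return -1;
    if (pipe(sync_pipe) < 0) {
        close(fd);
        unlink(path);
        return -1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        /* Child plays the part of cp: take the lock, report, sleep. */
        int cfd = open(path, O_RDWR);
        if (cfd < 0 || flock(cfd, LOCK_EX) < 0)
            _exit(1);
        if (write(sync_pipe[1], "L", 1) != 1)
            _exit(1);
        for (;;)
            pause();
    }

    int result = -1;
    char c;
    close(sync_pipe[1]);                  /* so read() sees EOF if the child dies */
    if (pid > 0 && read(sync_pipe[0], &c, 1) == 1) {
        int status;
        kill(pid, SIGSTOP);
        waitpid(pid, &status, WUNTRACED); /* returns once the child has stopped */
        assert(WIFSTOPPED(status));

        /* Parent plays the part of ls: a blocking request would hang
         * here, so ask without blocking and look at the error. */
        int held_while_stopped =
            flock(fd, LOCK_EX | LOCK_NB) < 0 && errno == EWOULDBLOCK;

        kill(pid, SIGKILL);               /* killing the holder frees the lock */
        waitpid(pid, &status, 0);
        int free_after_kill = flock(fd, LOCK_EX | LOCK_NB) == 0;

        result = held_while_stopped && free_after_kill;
    } else if (pid > 0) {
        waitpid(pid, NULL, 0);            /* child failed before taking the lock */
    }

    close(sync_pipe[0]);
    close(fd);
    unlink(path);
    return result;
}
```

A return value of 1 matches the symptoms described: the second party cannot get the lock while the holder is stopped, and gets it as soon as the holder is gone.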

Jonathan Hue	Island Graphics Corporation, Graphic Arts Division
hue@island.COM	{sun,uunet}!island!hue