hue@coney.island.COM (Pond Scum) (06/19/91)
I'm seeing some strange behavior on file systems mounted on a device driver (erasable optical disk) I wrote. Almost everything seems to be OK: I can newfs the disk, mount it, copy files to it, fsck it, etc. However, if I copy a huge file to the disk and send the cp process a SIGSTOP while it's copying the file, I can't stat(2) the destination file. The stat just hangs until I either continue the cp in the foreground or background, or kill it. No data is lost or anything else, but if I do an ls -l it hangs until the cp starts running again or exits.

It's as if the cp is getting stopped while holding a lock on the inode for the destination file, and the stat is sleeping until the inode is free. Is this even possible? I can't think of anything else to explain what I'm seeing. I checked to make sure that the driver's interrupt routine always calls the start routine until there are no more bufs in its queue, so that's not the problem.

I can do an ls on the directory, and an ls -l on all the other files, but ls -l on the destination hangs, so I'm assuming that it's the stat(2) on the destination file that causes the hang. The driver we got from the distributor that sold us the disk exhibits the same behavior.

The environment is a Sun SPARCstation 2 running SunOS 4.1.1.

I would greatly appreciate any help. Thanks,

Jonathan
hue@island.COM
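One way to confirm that it is the stat(2) itself that blocks, without hanging the shell you are working in, is to issue the stat from a child process and give up after a timeout. The following is a minimal userland sketch and is not taken from the thread: the name stat_probe and the 10 ms polling interval are made up for illustration, and it uses only POSIX calls (fork, stat, waitpid, nanosleep, kill).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: run stat(2) on `path` in a child process.
 * Returns  0 if stat succeeded,
 *         -1 if stat (or fork/waitpid) failed,
 *          1 if the child had not finished after roughly timeout_ms.
 */
int stat_probe(const char *path, int timeout_ms)
{
    assert(path != NULL && timeout_ms >= 0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: this is the call that may block on a locked inode. */
        struct stat sb;
        _exit(stat(path, &sb) == 0 ? 0 : 1);
    }

    /* Parent: poll for the child instead of blocking in waitpid. */
    for (int waited = 0; waited <= timeout_ms; waited += 10) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
        if (r < 0)
            return -1;
        struct timespec ts = { 0, 10 * 1000 * 1000 };   /* 10 ms */
        nanosleep(&ts, NULL);
    }

    /*
     * Timed out. The kill only takes effect once the child leaves any
     * uninterruptible sleep, so do not wait for it here.
     */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, WNOHANG);    /* reap it if it is already gone */
    return 1;
}
```

A return value of 1 for the destination file, together with 0 for its neighbours in the same directory, would match the symptom described above. It would not by itself say who holds the lock.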
richard@locus.com (Richard M. Mathews) (06/26/91)
hue@coney.island.COM (Pond Scum) writes:
>No data is lost or anything else, but if I do an ls -l it hangs until
>the cp starts running again or exits. It's as if the cp is getting stopped
>while holding a lock on the inode for the destination file, and the stat
>is sleeping until the inode is free. Is this even possible?

It sounds like that is what is happening. This is possible if you ever sleep at pri > PZERO while the inode is locked. First check the wchan of the "ls" process. If it points at the incore inode, then you know SOMEONE has the inode locked, the "cp" is a good candidate, and you know that since it did get stopped it must have been at pri > PZERO; thus this is almost definitely the problem.

If a quick glance at the 2nd arguments to your sleep calls doesn't find the bad sleep call, you could use "crash" or equivalent to look at the kernel stack of the "cp" process. (If a program to find the kernel stack is not available for you, you might have to check out page table entries or page pointers in the proc structure to find that process's kernel stack.) That should help you find exactly where the process is sleeping.

Richard M. Mathews                      Freedom for Lithuania
richard@locus.com                       Laisve!
lcc!richard@seas.ucla.edu
...!{uunet|ucla-se|turnkey}!lcc!richard
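The reasoning in this post rests on the classic sleep(chan, pri) rule that only sleeps at priorities numerically greater than PZERO can be interrupted by signals. A small model of that rule is sketched here. It is an illustration only: the priority values are those of 4.3BSD's <sys/param.h> and are assumed, not checked, to match SunOS 4.1.1, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Sleep priorities as defined in 4.3BSD <sys/param.h>.
 * ASSUMPTION: SunOS 4.1.1 keeps the same ordering around PZERO.
 */
enum { PINOD = 10, PRIBIO = 20, PZERO = 25, PWAIT = 30, PSLEP = 40 };

/* A sleep can be woken by a signal only if pri is numerically above PZERO. */
bool sleep_is_interruptible(int pri)
{
    assert(pri >= 0);
    return pri > PZERO;
}

/*
 * Hypothetical predicate for the scenario in the post: a process can end
 * up stopped while still holding the inode lock only if it sleeps
 * interruptibly with that lock held.
 */
bool can_stop_while_locked(bool holds_inode_lock, int pri)
{
    return holds_inode_lock && sleep_is_interruptible(pri);
}
```

Under these assumed values a sleep at PRIBIO or PINOD cannot be stopped by a signal, while one at PWAIT or PSLEP can, which is why the second argument of each sleep call is the thing to inspect.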
boyd@prl.dec.com (Boyd Roberts) (06/26/91)
In article <1991Jun25.232436.1215039@locus.com>, richard@locus.com (Richard M. Mathews) writes:
>
> It sounds like that is what is happening. This is possible if you ever
> sleep at pri>PZERO while the inode is locked.

This is nonsense. When the inode is locked no process, apart from the one with the lock, can operate on it. Sleeping with the inode locked may make things worse, but the priority is irrelevant. For completeness, I'll just add that sleeps at priorities > PZERO are interruptible.

> First check the wchan
> of the "ls" process. If it points at the incore inode, then you know
> SOMEONE has the inode locked, the "cp" is a good candidate, and you know
> that since it did get stopped it must have been at pri>PZERO; thus this
> is almost definitely the problem.

Well, I'd hardly call it a problem. The `cp' will have the inode locked while it is doing I/O on it, or stat(2)ing it, or the directory it's in will be locked during open/creat/unlink. You should really be asking:

1. Is the hang caused by the inode being locked?
2. If so, does `cp' have the inode locked?
3. If so, for how long?
4. Why is it locked for so long?

In practice this isn't a problem.

> If a quick glance at the 2nd arguments to your sleep calls doesn't find
> the bad sleep call, you could use "crash" or equivalent to look at the
> kernel stack of the "cp" process. (If a program to find the kernel
> stack is not available for you, you might have to check out page table
> entries or page pointers in the proc structure to find that process's
> kernel stack.) That should help you find exactly where the process is
> sleeping.

Unless you're really sure about what you're doing, any kernel data you observe will just confuse you. Even in a static system (a crash dump) it is often far from obvious what is going on.

Maybe the `problem' isn't with `cp'.

Boyd Roberts                    boyd@prl.dec.com

``When the going gets weird, the weird turn pro...''
hue@island.COM (Pond Scum) (06/28/91)
In article <1991Jun25.232436.1215039@locus.com> richard@locus.com (Richard M. Mathews) writes:
>hue@coney.island.COM (Pond Scum) writes:
>>the cp starts running again or exits. It's as if the cp is getting stopped
>>while holding a lock on the inode for the destination file, and the stat
>>is sleeping until the inode is free. Is this even possible?
>
>It sounds like that is what is happening. This is possible if you ever
>sleep at pri>PZERO while the inode is locked. First check the wchan

I don't call sleep except when I need a buf to do a special command, and that isn't happening here (only through ioctl). All this stuff is coming through the strategy routine from the ufs file system, so I figure someone higher up is calling iowait() (biowait()) when they need to sleep, and sleeping at PRIBIO.

>of the "ls" process. If it points at the incore inode, then you know
>SOMEONE has the inode locked, the "cp" is a good candidate, and you know
>that since it did get stopped it must have been at pri>PZERO; thus this
>is almost definitely the problem.

Well, I had another theory, but it doesn't explain everything. Writes to an ordinary file (a "fast device") are normally not interruptible because they sleep at priorities higher than PZERO (lower numerically), correct? In SunOS 4.1.1, they made some change to the kernel that makes writes to ordinary files interruptible (and not restartable!). It seems to me that this same change would allow a stop signal to interrupt cp while it had an inode locked and stop it, and cause the ls to hang when it tried to stat the file.

Jonathan Hue
Island Graphics Corporation, Graphic Arts Division
hue@island.COM
{sun,uunet}!island!hue
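If writes to ordinary files can be cut short by a signal and are not restarted by the kernel, as described above for SunOS 4.1.1, then a careful userland caller has to retry by hand. The following is a minimal sketch of such a retry loop using only POSIX write(2) and errno. The helper name write_all is hypothetical, and nothing in this thread establishes whether cp on that release does anything similar.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write all `len` bytes of `buf` to `fd`,
 * retrying after EINTR and after short writes.
 * Returns the number of bytes written, or -1 on any other error.
 */
ssize_t write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    assert(buf != NULL || len == 0);

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data moved: retry */
            return -1;          /* real error */
        }
        if (n == 0)
            break;              /* no progress; stop rather than spin */
        done += (size_t)n;      /* short write: carry on from where it stopped */
    }
    return (ssize_t)done;
}
```

This only addresses the "not restartable" half of the remark. It does nothing about a process being stopped while the kernel still holds the inode lock on its behalf.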