[comp.unix.internals] Can a process stop with a locked inode?

jfh@rpp386.cactus.org (John F Haugh II) (06/27/91)

In article <1991Jun26.093356.12661@prl.dec.com> boyd@prl.dec.com (Boyd Roberts) writes:
>In article <1991Jun25.232436.1215039@locus.com>, richard@locus.com (Richard M. Mathews) writes:
>> It sounds like that is what is happening.  This is possible if you ever
>> sleep at pri>PZERO while the inode is locked.
>
>This is nonsense.  When the inode is locked no process, apart
>from the one with the lock, can operate on it.  Sleeping with
>the inode locked may make things worse, but the priority is irrelevant.

No, it isn't nonsense.  The reason that a level like "PZERO" exists is
to distinguish between things that can happen real fast (short term
sleeps) versus real slow (longer term sleeps).  PINOD is as low as it
is to ensure that the sleeping process 1) isn't interrupted (including
by SIGSTOP or SIGTSTP) and 2) gets the CPU back.  No process should =ever=
sleep at a priority above PZERO with an inode locked, for many very simple
and obvious reasons.

>Unless you're really sure about what you're doing, any kernel data you observe
>will just confuse you.  Even in a static system (a crash dump) it is often
>far from obvious what is going on.  Maybe the `problem' isn't with `cp'.

Trust me, Richard knows exactly what he is talking about.  Arguing with
him is not unlike arguing with Guy Harris or Doug Gwyn.  It gives you this
warm feeling in your stomach that is often confused with "warm fuzzies",
but which is really heartburn.  Richard's analysis is dead on.  If you
have source code access, crash is almost always your best friend.
-- 
John F. Haugh II        | Distribution to  | UUCP: ...!cs.utexas.edu!rpp386!jfh
Ma Bell: (512) 255-8251 | GEnie PROHIBITED :-) |  Domain: jfh@rpp386.cactus.org
"UNIX signals are not interrupts.  Worse, SIGCHLD/SIGCLD is not even a UNIX
 signal, it's an abomination."  -- Doug Gwyn

torek@elf.ee.lbl.gov (Chris Torek) (06/28/91)

In article <19416@rpp386.cactus.org> jfh@rpp386.cactus.org
(John F Haugh II) writes:
>... The reason that a level like "PZERO" exists is to distinguish between
>things that can happen real fast (short term sleeps) versus real slow
>(longer term sleeps).  PINOD is as low as it is to ensure that the
>sleeping process 1) isn't interrupted (including SIGSTOP or SIGTSTP)
>and 2) gets the CPU back. ...

This is largely true, but changing.

SunOS borrowed an idea from System V; the current 4BSD kernel has a
rather different approach to the same problem.

In Version 6 and many of its derivatives, the `sleep priority' is
intimately tied to the action the kernel takes when a signal is delivered
to a sleeping process.  One of two things happens:

	a. priority < PZERO: the signal is OR'ed into the `pending signals'
	   mask.  The process continues sleeping.
	b. priority > PZERO: the signal is OR'ed in, but the sleep is
	   aborted and the process is resumed via a longjmp to u.u_qsave.
	   Normally this goes back to code in syscall() that returns EINTR.

(I cannot recall offhand the boundary condition for PZERO itself.)
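
The two cases above can be modelled in user-space C.  This is only a
sketch under stated assumptions, not kernel source: the function names
signal_during_sleep() and model_syscall(), the struct proc layout, and
the numeric values of PZERO, PINOD and PWAIT are all hypothetical, and
treating priority == PZERO as uninterruptible is a guess, since the
boundary is left open above.

```c
#include <assert.h>
#include <errno.h>
#include <setjmp.h>

/* Hypothetical values; a smaller number means a more urgent sleep. */
#define PINOD 10                /* below PZERO: signals do not wake it */
#define PZERO 25
#define PWAIT 30                /* above PZERO: signals abort the sleep */

struct proc {
    unsigned long p_sig;        /* pending-signal mask (assumed layout) */
};

static jmp_buf u_qsave;         /* stand-in for u.u_qsave */

/*
 * Model a signal arriving while p sleeps at priority pri.
 * Case (a): returns 0, the sleep goes on.
 * Case (b): does not return; control resumes at the setjmp on u_qsave.
 * ASSUMPTION: pri == PZERO is treated like case (a).
 */
static int signal_during_sleep(struct proc *p, int pri, int sig)
{
    assert(sig >= 1);
    p->p_sig |= 1UL << (sig - 1);   /* both cases record the signal */
    if (pri > PZERO)
        longjmp(u_qsave, 1);        /* abort the sleep */
    return 0;
}

/* Plays the role of syscall(): an aborted sleep turns into EINTR. */
static int model_syscall(struct proc *p, int pri, int sig)
{
    if (setjmp(u_qsave) != 0)
        return EINTR;
    signal_during_sleep(p, pri, sig);
    return 0;
}
```

Calling model_syscall() with PINOD leaves the signal pending and
returns 0; calling it with PWAIT records the signal and returns EINTR.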

Thus, you *can* sleep at priority > PZERO with something locked, provided
you arrange in advance to catch longjmp()s out of sleep().
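
A minimal user-space sketch of "arranging in advance" follows.  The
names wait_with_inode_locked() and interruptible_sleep(), and the
one-field struct inode, are hypothetical.  For simplicity the catch
point returns EINTR itself; it could equally restore the saved context
and longjmp() on to the outer catch point.

```c
#include <assert.h>
#include <errno.h>
#include <setjmp.h>
#include <string.h>

struct inode {
    int i_locked;               /* hypothetical stand-in for the lock flag */
};

static jmp_buf u_qsave;         /* stand-in for u.u_qsave */

/* Model of a sleep at priority > PZERO: a signal longjmps out of it. */
static void interruptible_sleep(int signalled)
{
    if (signalled)
        longjmp(u_qsave, 1);
}

/* Sleep with the inode locked, unlocking on either exit path. */
static int wait_with_inode_locked(struct inode *ip, int signalled)
{
    jmp_buf saved;

    memcpy(saved, u_qsave, sizeof saved);   /* keep the outer catch point */
    ip->i_locked = 1;

    if (setjmp(u_qsave) != 0) {
        /* The sleep was aborted: undo the lock before leaving. */
        ip->i_locked = 0;
        memcpy(u_qsave, saved, sizeof saved);
        return EINTR;
    }

    interruptible_sleep(signalled);

    /* Normal wakeup. */
    ip->i_locked = 0;
    memcpy(u_qsave, saved, sizeof saved);
    return 0;
}
```

Either way the function returns, the inode ends up unlocked; forgetting
the setjmp() branch is exactly how a lock gets left behind.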

Newer kernels often have something called PCATCH: if you set PCATCH
on a call to sleep(), signals never longjmp() out; instead, sleep()
returns EINTR.  A `successful' sleep returns 0.  In the last SunOS I recall
(3.5? 3.2? something like that), longjmp() out of sleep still occurred
in some cases (that is, the mechanism was the union of all past approaches).

The current 4BSD kernel has taken a more radical step.  u.u_qsave is
gone.  *All* sleep calls are uninterruptible unless you set PCATCH.
Thus, if you do not set PCATCH, sleep returns zero; if you do, you must
check for error returns.  All functions unwind the call stack in the
usual way, and it is now impossible to longjmp() past an unlock.  There
are no (zero, none) calls to setjmp() or longjmp() in the kernel.
(Actually, sleep() has been replaced with tsleep(), which also takes a
timeout.)  Perhaps surprisingly, this generally speeds up system calls,
as the setjmp() in syscall() was relatively expensive and usually
unnecessary.
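
The PCATCH style can be sketched the same way.  This is a user-space
model, not the real tsleep(): model_tsleep() and wait_for_inode() are
hypothetical names, the real call takes more arguments (including the
timeout), and the flag value used for PCATCH here is an assumption.

```c
#include <assert.h>
#include <errno.h>

#define PCATCH 0x100            /* assumed value: flag OR'ed into the priority */

struct inode {
    int i_locked;               /* hypothetical stand-in for the lock flag */
};

/*
 * Model of a PCATCH-style sleep: it always returns to its caller.
 * With PCATCH set and a signal pending it returns EINTR; otherwise 0.
 */
static int model_tsleep(int pri, int sig_pending)
{
    if ((pri & PCATCH) && sig_pending)
        return EINTR;
    return 0;
}

/* Only one way out, so the unlock cannot be skipped. */
static int wait_for_inode(struct inode *ip, int pri, int sig_pending)
{
    int error;

    ip->i_locked = 1;
    error = model_tsleep(pri, sig_pending);
    ip->i_locked = 0;
    return error;               /* caller must check this if it set PCATCH */
}
```

Because the sleep returns an error code instead of jumping, the lock
and unlock sit on a single straight-line path and no setjmp() is needed.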
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov