boyd@basser.oz (Boyd Roberts) (10/24/85)
This could be a duplicate posting as my original got garbled.
Do you ever get the feeling that the news software doesn't like you?
Our VAX 780 has been crashing a lot lately with "bad memfree"'s
and "lost text"'s. While looking for this bug I'm pretty sure I've
found a bug in newproc(). It's certainly in 32V and probably in 5.0
(and other 5.N swapping systems). I'm not completely sure about 5.0,
but I am about 32V (5.0 has this "curproc" thing, and I'm not too
sure about its function as I haven't got all of the source).
Anyway, this bug may be behind the problems that we've been experiencing,
though I can't think of a scenario that gets there. Either way, it will do hideous things.
The situation is this:
The machine has run out of memory and it's desperate for core.
Some process fork()s, and during the out-swap of the child
(by the parent) the parent gets swapped by sched().
This is probably very rare because it is not certain that the parent will be
a candidate for swapping. The situation is even rarer on our system where
we've got copy-on-write fork()s. The problem is that the parent is not
prevented from becoming a candidate for swapping. There is an attempt to
do this but it just doesn't work.
The code goes like this:
"op" is the parent and "np" is the child
    np->p_stat = SRUN;
    np->p_flag = SLOAD;
    u.u_procp = np;
    ...
    if (save(u.u_ssav))
        return 1;
    if (procdup(np) == NULL)
    {
        /*
         * We've run out of core here, so swap the current
         * process to generate the copy.
         */
        ...
        op->p_stat = SIDL;
        xswap(np, 0, 0, -1);
        op->p_stat = SRUN;
    }
    ...
    u.u_procp = op;
    setrq(np);
    np->p_flag |= SSWAP;
    ...
    return 0;
Now, the "op->p_stat = SIDL" is an attempt to put the parent in
a state where it's not a candidate to be swapped. To be a candidate you've got
to be in SRUN, SSTOP or SSLEEP state and not SLOCKed. So everything
is fine until the parent goes to sleep in swap() (waiting for the
io to complete). When this happens you're in the shit.
The parent's state then becomes SSLEEP, and from that point on it will
cycle between SSLEEP and SRUN (because of the sleep()/wakeup() cycle).
Given that sched() can wake up while the parent is in either of these states,
the parent is then in a position to be swapped WHILE IT IS SWAPPING
THE CHILD. Oh dear.
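To make the state argument concrete, here is a minimal sketch of the eligibility rule as described above. It is a toy model, not the kernel's code: the numeric values, the two-field struct proc, and the helpers swappable(), model_sleep() and model_wakeup() are all assumptions made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* State and flag names follow the post; the numeric values are made up. */
enum { SSLEEP = 1, SRUN, SIDL, SSTOP };
#define SLOAD 01    /* process image is in core */
#define SLOCK 04    /* process must not be swapped */

/* Cut-down stand-in for the kernel's struct proc (assumption). */
struct proc {
    int p_stat;
    int p_flag;
};

/*
 * Hypothetical helper: the swap-out test as the post describes it.
 * A process is a candidate if it is in core, not SLOCKed, and in
 * SRUN, SSTOP or SSLEEP state.
 */
int swappable(const struct proc *p)
{
    assert(p != NULL);
    if ((p->p_flag & (SLOAD | SLOCK)) != SLOAD)
        return 0;
    return p->p_stat == SRUN || p->p_stat == SSTOP || p->p_stat == SSLEEP;
}

/* Model of sleep(): it overwrites whatever p_stat held, SIDL included. */
void model_sleep(struct proc *p)
{
    assert(p != NULL);
    p->p_stat = SSLEEP;
}

/* Model of wakeup(): a sleeping process becomes runnable again. */
void model_wakeup(struct proc *p)
{
    assert(p != NULL);
    if (p->p_stat == SSLEEP)
        p->p_stat = SRUN;
}
```

With p_flag set to SLOAD and p_stat set to SIDL, swappable() returns 0. After model_sleep() it returns 1, and it stays 1 after model_wakeup(), which is the window described above.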
Once sched() invokes xswap() on the parent you then have two xswap()s
working with the same core. The one called from sched() will cause the
core to be freed after the swap. The other won't free the core. The
results of the core being freed from underneath the xswap() in newproc()
are not really known. But, they are certainly not conducive to data integrity.
Random processes dumping core or crash city...
My fix would be to tear out that revolting mess that is text.c and
re-write it. I mean, it's time to use some algorithms and real
data-structures. That h-h-h-hideous mess that's there turns my stomach.
Do you know how partial swaps work? Unfortunately I do.
However, I'm not in a position to do that as we can't afford the development
time and my resignation becomes effective in a week and a half.
Sooo, the fix is this:
    op->p_flag |= SLOCK;
    xswap(np, 0, 0, -1);
    op->p_flag &= ~SLOCK;
Just lock the parent across the swap. Normally you don't have to worry
because xswap() is called by a process that is SSYS (sched()) or will
be SLOCKed (i.e. itself for core expansion swaps).
Also I'd change things so that across the swap the child's state is SIDL
and the parent's state is not changed (i.e. it just stays SRUN). These
are really style choices. But, changing the child's state to SIDL will
doubly protect it from being swapped (it's SLOCKed by xswap()).
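As a rough check of the SLOCK fix, the sketch below compares the original SIDL guard with the SLOCK guard across a modelled swap. Everything here is an assumption for illustration: the numeric values, the cut-down struct proc, and the helpers swappable(), model_xswap(), fork_swap_original() and fork_swap_fixed() are hypothetical stand-ins, not kernel routines.

```c
#include <assert.h>
#include <stddef.h>

/* State and flag names follow the post; the numeric values are made up. */
enum { SSLEEP = 1, SRUN, SIDL, SSTOP };
#define SLOAD 01    /* process image is in core */
#define SLOCK 04    /* process must not be swapped */

/* Cut-down stand-in for the kernel's struct proc (assumption). */
struct proc {
    int p_stat;
    int p_flag;
};

/* Hypothetical helper: swap-out test as the post describes it. */
int swappable(const struct proc *p)
{
    assert(p != NULL);
    if ((p->p_flag & (SLOAD | SLOCK)) != SLOAD)
        return 0;
    return p->p_stat == SRUN || p->p_stat == SSTOP || p->p_stat == SSLEEP;
}

/*
 * Hypothetical stand-in for xswap(): the caller sleeps waiting for
 * the swap I/O and is then woken. Returns 1 if the caller was a swap
 * candidate at any point in between, else 0.
 */
int model_xswap(struct proc *caller)
{
    int exposed = 0;

    caller->p_stat = SSLEEP;        /* sleep() inside swap() */
    exposed |= swappable(caller);
    caller->p_stat = SRUN;          /* wakeup() on I/O completion */
    exposed |= swappable(caller);
    return exposed;
}

/* Original guard: park the parent in SIDL across the swap. */
int fork_swap_original(struct proc *op)
{
    int exposed;

    op->p_stat = SIDL;
    exposed = model_xswap(op);
    op->p_stat = SRUN;
    return exposed;
}

/* Proposed guard: SLOCK the parent across the swap. */
int fork_swap_fixed(struct proc *op)
{
    int exposed;

    op->p_flag |= SLOCK;
    exposed = model_xswap(op);
    op->p_flag &= ~SLOCK;
    return exposed;
}
```

For a parent that starts out SRUN with SLOAD set, fork_swap_original() returns 1 and fork_swap_fixed() returns 0. The flag survives the sleep()/wakeup() cycle because that cycle only rewrites p_stat in this model.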
Boyd Roberts ...!seismo!munnari!basser.oz!boyd
"Stand back -- and hold this..."