meulenbr@cst.prl.philips.nl (Frans Meulenbroeks) (04/24/89)
Hi,
I've discovered the following problem in minix/st. I'm almost sure that it
applies to minix/pc as well.
symptoms:
system hangs. debugging function keys work.
caused by:
a process which does frequent alarm calls. Also other interrupts can cause this
behaviour.
analysis:
Whenever an interrupt occurs during an I/O action this action is cancelled
and EINTR is returned on the system call initiating that action.
This is done in the following way:
from withing fs/pipe.c the routine rw_dev from fs/device.c is called to
send a CANCEL message to the task servicing the I/O action.
Within rw_dev sendrec is called to send the message to the task in question.
Sendrec can return E_LOCKED if there would occur deadlock by sending the
message. If so rw_dev reads the message waiting and processes it (it can
only be a revive message). Then a new try for sendrec is done.
Hovever, if the message waiting is a completion message for the I/O action
to be cancelled
the next call of sendrec is a cancel message to a task which has no action
in progress. The task does not reply and thus the fs will be waiting ad
infinitum.
fix:
The problem can be fixed in several ways:
- always give a reply back on the CANCEL message.
however withing the pc tty driver a warning for races is given in the
do_cancel routine.
- check in the routine rw_dev for the condition described above and deal
with it.
I've chosen for the second approach.
I include the routine rw_dev from fs/device.c below.
No cdiff since my version may as well vary on other places.
Good luck,
Frans Meulenbroeks (.signature at end)
PS: this was a tough one.
/*===========================================================================*
* rw_dev *
*===========================================================================*/
PUBLIC void rw_dev(task_nr, mess_ptr)
int task_nr; /* which task to call */
message *mess_ptr; /* pointer to message for task */
{
/* All file system I/O ultimately comes down to I/O on major/minor device
* pairs. These lead to calls on the following routines via the dmap table.
*/
int r;
message m;
while ((r = sendrec(task_nr, mess_ptr)) == E_LOCKED) {
/* sendrec() failed to avoid deadlock. The task 'task_nr' is
* trying to send a REVIVE message for an earlier request.
* Handle it and go try again.
*/
if (receive(task_nr, &m) != OK)
panic("rw_dev: can't receive", NO_NUM);
/* if we're trying to send a cancel message to a task which just tries
* to send a completion reply, ignore the reply and abort the cancel
* request. The caller will do the revive for the process.
* 22-4-89 F. Meulenbroeks
*/
if ((m.REP_PROC_NR == mess_ptr->PROC_NR) && (mess_ptr->m_type == CANCEL))
return;
revive(m.REP_PROC_NR, m.REP_STATUS);
}
if (r != OK)
panic("rw_dev: can't send", NO_NUM);
}
Frans Meulenbroeks (meulenbr@cst.prl.philips.nl)
Centre for Software Technology
( or try: ...!mcvax!philmds!prle!cst!meulenbr)meulenbr@cstw01.prl.philips.nl (Frans Meulenbroeks) (04/24/89)
Oops, forgot to tell one thing: If the action cancelled is a read action, my fix discards the data read which may be annoying when reading from a tty. On the other hand, signals can come on a lot of places, and if you issue a longjump in your interrupt handler it really doesn't matter. regards, Frans Meulenbroeks (meulenbr@cst.prl.philips.nl) Centre for Software Technology ( or try: ...!mcvax!philmds!prle!cst!meulenbr)
evans@ditsyda.oz (Bruce Evans) (04/25/89)
in article <443@prles2.UUCP>, meulenbr@cst.prl.philips.nl (Frans Meulenbroeks) says: > [fs sometimes sends a CANCEL message which is never replied to] > fix: > The problem can be fixed in several ways: > - always give a reply back on the CANCEL message. > however withing the pc tty driver a warning for races is given in the > do_cancel routine. > - check in the routine rw_dev for the condition described above and deal > with it. > > I've chosen for the second approach. My PC TTY driver fixes it using the 1st approach. The race conditions are avoided by avoiding duplicate cancellations. It is unreasonable for a task not to reply to a sendrec()! The 1.3d printer driver has the same bad behaviour. It only replies to messages it understands. So an ioctl() to the printer will hang the system. > PS: this was a tough one. Same. How can we avoid duplicating the effort of fixing bugs for the PC, the ST, and out of date versions of both? Bruce Evans evans@ditsyda.oz.au