[comp.os.minix] bug in minix fs/devicce.c

meulenbr@cst.prl.philips.nl (Frans Meulenbroeks) (04/24/89)

Hi,

I've discovered the following problem in minix/st. I'm almost sure that it
applies to minix/pc as well.

symptoms:
system hangs. debugging function keys work.

caused by:
a process which does frequent alarm calls. Also other interrupts can cause this
behaviour.

analysis:
Whenever an interrupt occurs during an I/O action this action is cancelled
and EINTR is returned on the system call initiating that action.
This is done in the following way:
from withing fs/pipe.c the routine rw_dev from fs/device.c is called to 
send a CANCEL message to the task servicing the I/O action.

Within rw_dev sendrec is called to send the message to the task in question.
Sendrec can return E_LOCKED if there would occur deadlock by sending the
message. If so rw_dev reads the message waiting and processes it (it can
only be a revive message). Then a new try for sendrec is done.

Hovever, if the message waiting is a completion message for the I/O action
to be cancelled
the next call of sendrec is a cancel message to a task which has no action
in progress. The task does not reply and thus the fs will be waiting ad
infinitum.

fix:
The problem can be fixed in several ways:
- always give a reply back on the CANCEL message.
  however withing the pc tty driver a warning for races is given in the
  do_cancel routine.
- check in the routine rw_dev for the condition described above and deal
  with it.

I've chosen for the second approach.
I include the routine rw_dev from fs/device.c below.
No cdiff since my version may as well vary on other places.

Good luck,
Frans Meulenbroeks (.signature at end)

PS: this was a tough one.

/*===========================================================================*
 *				rw_dev					     *
 *===========================================================================*/
PUBLIC void rw_dev(task_nr, mess_ptr)
int task_nr;			/* which task to call */
message *mess_ptr;		/* pointer to message for task */
{
/* All file system I/O ultimately comes down to I/O on major/minor device
 * pairs.  These lead to calls on the following routines via the dmap table.
 */
  int r;
  message m;

  while ((r = sendrec(task_nr, mess_ptr)) == E_LOCKED) {
	/* sendrec() failed to avoid deadlock. The task 'task_nr' is
	 * trying to send a REVIVE message for an earlier request.
	 * Handle it and go try again.
	 */
	if (receive(task_nr, &m) != OK)
		panic("rw_dev: can't receive", NO_NUM);
	/* if we're trying to send a cancel message to a task which just tries
	 * to send a completion reply, ignore the reply and abort the cancel
	 * request. The caller will do the revive for the process. 
	 *				22-4-89 F. Meulenbroeks
	 */
	if ((m.REP_PROC_NR == mess_ptr->PROC_NR) && (mess_ptr->m_type == CANCEL))
		return;
	revive(m.REP_PROC_NR, m.REP_STATUS);
  }
  if (r != OK)
	panic("rw_dev: can't send", NO_NUM);
}
Frans Meulenbroeks        (meulenbr@cst.prl.philips.nl)
	Centre for Software Technology
	( or try: ...!mcvax!philmds!prle!cst!meulenbr)

meulenbr@cstw01.prl.philips.nl (Frans Meulenbroeks) (04/24/89)

Oops, forgot to tell one thing:

If the action cancelled is a read action, my fix discards the data read
which may be annoying when reading from a tty. 
On the other hand, signals can come on a lot of places, and if you
issue a longjump in your interrupt handler it really doesn't matter.

regards,
Frans Meulenbroeks        (meulenbr@cst.prl.philips.nl)
	Centre for Software Technology
	( or try: ...!mcvax!philmds!prle!cst!meulenbr)

evans@ditsyda.oz (Bruce Evans) (04/25/89)

in article <443@prles2.UUCP>, meulenbr@cst.prl.philips.nl (Frans Meulenbroeks) says:
> [fs sometimes sends a CANCEL message which is never replied to]
> fix:
> The problem can be fixed in several ways:
> - always give a reply back on the CANCEL message.
>   however withing the pc tty driver a warning for races is given in the
>   do_cancel routine.
> - check in the routine rw_dev for the condition described above and deal
>   with it.
> 
> I've chosen for the second approach.

My PC TTY driver fixes it using the 1st approach. The race conditions
are avoided by avoiding duplicate cancellations. It is unreasonable for
a task not to reply to a sendrec()!

The 1.3d printer driver has the same bad behaviour. It only replies to
messages it understands. So an ioctl() to the printer will hang the
system.

> PS: this was a tough one.

Same. How can we avoid duplicating the effort of fixing bugs for the PC,
the ST, and out of date versions of both?

Bruce Evans             evans@ditsyda.oz.au