meulenbr@cst.prl.philips.nl (Frans Meulenbroeks) (04/24/89)
Hi,

I've discovered the following problem in minix/st. I'm almost sure that it applies to minix/pc as well.

symptoms: system hangs. debugging function keys work.

caused by: a process which does frequent alarm calls. Other interrupts can also cause this behaviour.

analysis:
Whenever an interrupt occurs during an I/O action, the action is cancelled and EINTR is returned on the system call that initiated it. This is done in the following way: from within fs/pipe.c the routine rw_dev from fs/device.c is called to send a CANCEL message to the task servicing the I/O action. Within rw_dev, sendrec is called to send the message to the task in question. Sendrec can return E_LOCKED if sending the message would cause deadlock. If so, rw_dev reads the waiting message and processes it (it can only be a revive message). Then sendrec is tried again. However, if the waiting message is a completion message for the I/O action to be cancelled, the next call of sendrec sends a cancel message to a task which has no action in progress. The task does not reply and thus the fs will wait ad infinitum.

fix:
The problem can be fixed in several ways:
- always give a reply back on the CANCEL message. However, within the pc tty driver a warning for races is given in the do_cancel routine.
- check in the routine rw_dev for the condition described above and deal with it.

I've chosen the second approach. I include the routine rw_dev from fs/device.c below. No cdiff, since my version may well differ in other places.

Good luck,
Frans Meulenbroeks (.signature at end)

PS: this was a tough one.

/*===========================================================================*
 *				rw_dev					     *
 *===========================================================================*/
PUBLIC void rw_dev(task_nr, mess_ptr)
int task_nr;			/* which task to call */
message *mess_ptr;		/* pointer to message for task */
{
/* All file system I/O ultimately comes down to I/O on major/minor device
 * pairs.  These lead to calls on the following routines via the dmap table.
 */

  int r;
  message m;

  while ((r = sendrec(task_nr, mess_ptr)) == E_LOCKED) {
	/* sendrec() failed to avoid deadlock. The task 'task_nr' is
	 * trying to send a REVIVE message for an earlier request.
	 * Handle it and go try again.
	 */
	if (receive(task_nr, &m) != OK)
		panic("rw_dev: can't receive", NO_NUM);

	/* if we're trying to send a cancel message to a task which just tries
	 * to send a completion reply, ignore the reply and abort the cancel
	 * request. The caller will do the revive for the process.
	 * 22-4-89 F. Meulenbroeks
	 */
	if ((m.REP_PROC_NR == mess_ptr->PROC_NR) &&
	    (mess_ptr->m_type == CANCEL))
		return;

	revive(m.REP_PROC_NR, m.REP_STATUS);
  }
  if (r != OK) panic("rw_dev: can't send", NO_NUM);
}

Frans Meulenbroeks (meulenbr@cst.prl.philips.nl)
Centre for Software Technology
( or try: ...!mcvax!philmds!prle!cst!meulenbr)
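The race described in this post can be reproduced outside the kernel. What follows is a minimal user-space sketch, not MINIX code: every name and value in it (sk_sendrec, sk_receive, SK_NO_REPLY, the two state variables) is a hypothetical stand-in, and SK_NO_REPLY simply marks the point where the real fs would block forever. It only shows why the extra check in the loop avoids sending a CANCEL to a task that has nothing in progress.

```c
#include <assert.h>

/* Stand-in values for this sketch only; they are not the MINIX definitions. */
#define SK_OK        0
#define SK_E_LOCKED  (-1)
#define SK_NO_REPLY  (-2)   /* marks "the real sendrec would never return" */
#define SK_CANCEL    1

enum outcome { SENT, CANCEL_ABORTED, WOULD_HANG };

typedef struct {
    int m_type;
    int proc_nr;       /* process the request is about */
    int rep_proc_nr;   /* process a reply is about */
} sk_message;

static int pending_reply_for = -1;  /* process with a queued completion reply */
static int busy_for = -1;           /* process whose I/O the task still works on */
static int last_revived = -1;       /* last process revived inside the loop */

/* Mock sendrec: locked while a reply is queued; a CANCEL for a process the
 * task is not working on gets no answer at all. */
static int sk_sendrec(const sk_message *req)
{
    if (pending_reply_for >= 0) return SK_E_LOCKED;
    if (req->m_type == SK_CANCEL && busy_for != req->proc_nr) return SK_NO_REPLY;
    if (req->m_type == SK_CANCEL) busy_for = -1;
    return SK_OK;
}

/* Mock receive: hands over the queued completion reply. */
static void sk_receive(sk_message *m)
{
    assert(pending_reply_for >= 0);
    m->m_type = 0;
    m->proc_nr = -1;
    m->rep_proc_nr = pending_reply_for;
    pending_reply_for = -1;
}

/* The retry loop from the post, with the extra check made switchable. */
static enum outcome sk_rw_dev(const sk_message *req, int apply_fix)
{
    int r;
    sk_message m;

    while ((r = sk_sendrec(req)) == SK_E_LOCKED) {
        sk_receive(&m);
        if (apply_fix && m.rep_proc_nr == req->proc_nr && req->m_type == SK_CANCEL)
            return CANCEL_ABORTED;      /* caller revives the process itself */
        last_revived = m.rep_proc_nr;   /* stands in for revive() */
    }
    return r == SK_OK ? SENT : WOULD_HANG;
}

/* Set up the task state, then send a CANCEL for process 'cancel_for'. */
static enum outcome run(int pending, int busy, int cancel_for, int apply_fix)
{
    sk_message req;

    pending_reply_for = pending;
    busy_for = busy;
    last_revived = -1;
    req.m_type = SK_CANCEL;
    req.proc_nr = cancel_for;
    req.rep_proc_nr = -1;
    return sk_rw_dev(&req, apply_fix);
}
```

Calling run(7, -1, 7, 0) models the hang: the completion reply for process 7 is consumed, and the retried CANCEL then goes to an idle task. With the check enabled, run(7, -1, 7, 1) stops before the second send. A queued reply for a different process, as in run(3, 7, 7, 1), is revived as before and the CANCEL still goes through.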
meulenbr@cstw01.prl.philips.nl (Frans Meulenbroeks) (04/24/89)
Oops, forgot to mention one thing:
If the cancelled action is a read, my fix discards the data that was read, which may be annoying when reading from a tty. On the other hand, signals can arrive in a lot of places, and if you do a longjmp in your interrupt handler it really doesn't matter.

regards,
Frans Meulenbroeks (meulenbr@cst.prl.philips.nl)
Centre for Software Technology
( or try: ...!mcvax!philmds!prle!cst!meulenbr)
evans@ditsyda.oz (Bruce Evans) (04/25/89)
in article <443@prles2.UUCP>, meulenbr@cst.prl.philips.nl (Frans Meulenbroeks) says:
> [fs sometimes sends a CANCEL message which is never replied to]
> fix:
> The problem can be fixed in several ways:
> - always give a reply back on the CANCEL message.
>   however within the pc tty driver a warning for races is given in the
>   do_cancel routine.
> - check in the routine rw_dev for the condition described above and deal
>   with it.
>
> I've chosen for the second approach.

My PC TTY driver fixes it using the 1st approach. The race conditions are avoided by avoiding duplicate cancellations. It is unreasonable for a task not to reply to a sendrec()!

The 1.3d printer driver has the same bad behaviour. It only replies to messages it understands. So an ioctl() to the printer will hang the system.

> PS: this was a tough one.

Same. How can we avoid duplicating the effort of fixing bugs for the PC, the ST, and out of date versions of both?

Bruce Evans evans@ditsyda.oz.au
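The first approach, where a task answers every message it is sent, can be sketched as follows. This is an illustration under assumptions, not the PC TTY driver or the 1.3d printer driver: the message layout, the DRV_ constants and drv_handle are all hypothetical. It shows a handler that produces a reply for a CANCEL with nothing in progress and for a request type it does not understand, so the sender of a sendrec() is never left waiting.

```c
#include <assert.h>

/* Hypothetical request types and status codes for this sketch only. */
enum { DRV_READ = 1, DRV_CANCEL, DRV_IOCTL };
#define DRV_OK       0
#define DRV_SUSPEND  (-1)   /* request accepted, completion comes later */
#define DRV_EINTR    (-2)   /* request was cancelled */
#define DRV_EINVAL   (-3)   /* request type not understood */

typedef struct { int m_type; int proc_nr; } drv_request;
typedef struct { int rep_proc_nr; int rep_status; } drv_reply;

static int in_progress_for = -1;  /* process with an outstanding read, -1 if none */

/* Every path through this function fills in a reply; none returns without one. */
static drv_reply drv_handle(const drv_request *req)
{
    drv_reply rep;

    rep.rep_proc_nr = req->proc_nr;
    switch (req->m_type) {
    case DRV_READ:
        in_progress_for = req->proc_nr;
        rep.rep_status = DRV_SUSPEND;
        break;
    case DRV_CANCEL:
        if (in_progress_for == req->proc_nr) {
            in_progress_for = -1;          /* cancel the outstanding read */
            rep.rep_status = DRV_EINTR;
        } else {
            rep.rep_status = DRV_OK;       /* nothing to cancel, reply anyway */
        }
        break;
    default:
        rep.rep_status = DRV_EINVAL;       /* unknown request still gets a reply */
        break;
    }
    return rep;
}
```

In this sketch a second CANCEL for the same process finds nothing in progress and is answered as a harmless no-op, and an unsupported request such as DRV_IOCTL is answered with an error status instead of silence.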