pc (04/18/83)
Well, thanks to all those who responded to my plea for help. For those who don't know what this is about: I have found that one of my device drivers sometimes hangs, apparently with the close routine not being called even when all inode and file references have gone.

In terms of fixes, the general consensus seems to be:

1) If you are writing a device driver, don't allow the close() routine to sleep. (Actually I knew this, and the code makes lots of efforts to allow the close routine to complete without giving up the processor.)

2) This is a bug which has been in UNIX since V6.

My own investigations:

The main trouble with a bug like this is that it makes you begin to doubt your knowledge of the behaviour of the kernel. So, some of the questions below are to do with this - if you have any answers I would be grateful.

Now, it did seem to be tied up with another bug which has been floating around the net, i.e. inode references which sometimes change to 255. My symptoms seem to be tied up with the death of the controlling process, possibly due to a signal, so my first doubts were in the internal exit() routine.

Question???? Is exit() called twice for a process on a Berkeley 4.1BSD system? If so, then the code which says

    plock(dir);    /* dir is two entries */
    iput(dir);

should probably read

    ip = dir;
    dir = NULL;
    plock(ip);
    iput(ip);

The code which frees the file table entries already does this type of trick. If the first code is ever called twice then you get an inode on the free list with a reference count of -1. Or worse, you can close an inode which belongs to another process.

Now, rather than altering the code in exit, I decided to see whether I could get the iget/iput routines to tell me what was happening. On examination of these routines you find iput says

    if(ip->i_count == 1)
        <***do free stuff***>
    else
        ip->i_count--;

and iget says

    ip->i_count++;

This code works, but what happens on an internal error?
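To make that question concrete, here is a small user-space sketch of the iget/iput logic as quoted above. It is not the kernel source: the struct inode here is a cut-down stand-in, the i_freed field exists only for illustration, and it assumes that the "do free stuff" branch leaves the count at 0. The helper count_after_double_iput() is a hypothetical name, standing in for what a doubled exit() would do to one reference.

```c
#include <assert.h>

/* Cut-down stand-in for the in-core inode; not the kernel's struct. */
struct inode {
    int i_count;    /* reference count */
    int i_freed;    /* times the free branch ran (illustration only) */
};

/* iget as quoted in the post: just take another reference. */
static void iget(struct inode *ip)
{
    ip->i_count++;
}

/* iput as quoted in the post, with no check for a count of zero. */
static void iput(struct inode *ip)
{
    if (ip->i_count == 1) {
        ip->i_freed++;      /* stands in for <***do free stuff***> */
        ip->i_count = 0;    /* assumption: freeing leaves the count at 0 */
    } else {
        ip->i_count--;
    }
}

/* Take one reference, then release it twice, as a doubled exit() would. */
static int count_after_double_iput(void)
{
    struct inode in = { 0, 0 };

    iget(&in);
    assert(in.i_count == 1);
    iput(&in);      /* legitimate release: count 1, free branch, now 0 */
    iput(&in);      /* stray release: count 0 falls into the else branch */
    return in.i_count;
}
```

Under those assumptions, calling count_after_double_iput() from a main() returns -1: the second iput falls silently into the else branch, which is the "reference count of -1" case described above.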
I decided to put

    if(ip->i_count == 0)
        panic("Zero inode free");
    else if(ip->i_count == 1)
        etc....

And then Heisenberg struck......

Since then, the last close bug has happened twice, when before it was happening about once every two days. We have never had the panic. My conclusion is that there is a very funny timing bug in the kernel, and adding the panic test has shifted it. Of course, I cannot now take the test out........

Any ideas anyone????

Peter Collinson
lime!ukc!pc or mcvax!ukc!pc