[net.unix-wizards] Possible File table problem

pc (04/18/83)
Well, thanks to all those who responded to my plea for help.

For those who don't know what this is about:

	I have found that one of my device drivers sometimes hangs,
apparently because its close routine is not called even after all inode
and file references have gone.

In terms of fixes, the general consensus seems to be:

1)      If you are writing a device driver, don't allow the close()
routine to sleep. (Actually I knew this, and the code makes every
effort to let the close routine complete without giving up the
processor.)

2)	This is a bug which has been in UNIX since V6.

My own investigations.

The main trouble with a bug like this is that it makes you begin to
doubt your knowledge of the behaviour of the kernel. So, some of the
questions below are to do with this - if you have any answers I would
be grateful.

Now, it did seem to be tied up with another bug which has been floating
around the net, i.e. inode references which sometimes change to 255.

My symptoms seem to be tied up with the death of the controlling
process possibly due to a signal, so my first doubts were in the
internal exit() routine.

Question????
Is exit() called twice for a process on a Berkeley 4.1BSD system?

If so then:
	the code which says
		plock(dir);	/* dir is two entries */
		iput(dir);
	should probably read
		ip = dir;
		dir = NULL;
		plock(ip);
		iput(ip);
The code which frees the file table entries already does this type of
trick. If the first code is ever called twice then you get an inode on
the free list with a reference count of -1. Or worse, you can close
an inode which belongs to another process.
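
To make the difference concrete, here is a minimal user-space sketch of the two release styles. It is not the kernel code: struct inode is cut down to its reference count, struct proc_dirs and the release_dir_* names are made up for illustration, iput() just decrements, and plock() is left out entirely.

```c
#include <stddef.h>

/* Hypothetical stand-in for a kernel inode: only the reference count. */
struct inode {
	int i_count;
};

/* Hypothetical per-process slot holding a directory inode pointer. */
struct proc_dirs {
	struct inode *cdir;
};

/* Simplified iput(): drop one reference. No locking, no freeing. */
void iput(struct inode *ip)
{
	ip->i_count--;
}

/*
 * Unsafe form: the slot still points at the inode after the release,
 * so a second call drops a second reference.
 */
void release_dir_unsafe(struct proc_dirs *p)
{
	if (p->cdir != NULL)
		iput(p->cdir);
}

/*
 * Safe form: copy the pointer, clear the slot, then release.
 * A second call finds NULL and does nothing.
 */
void release_dir_safe(struct proc_dirs *p)
{
	struct inode *ip = p->cdir;

	if (ip == NULL)
		return;
	p->cdir = NULL;
	iput(ip);
}
```

Starting from a count of 1, calling release_dir_unsafe() twice leaves the count at -1, while calling release_dir_safe() twice leaves it at 0 with the slot cleared.
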
Now, rather than altering the code in exit, I decided to see whether I
could get the iget/iput routines to tell me what was happening.

On examination of these routines you find
iput says
	if(ip->i_count == 1)
		<***do free stuff***>
	else ip->i_count--;
and iget says
	ip->i_count++;

This code works but what happens on an internal error? I decided to put
	if(ip->i_count == 0)
		panic("Zero inode free");
	else
	if(ip->i_count == 1)
	etc....
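
Put together, the checked release looks roughly like the sketch below. This is a user-space model only, under several assumptions: struct inode is reduced to a count plus a made-up "freed" flag standing in for the real free work, sketch_panic() stands in for the kernel panic() and simply aborts, and the names iget_ref() and iput_checked() are invented here so as not to suggest they are the real routines.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical cut-down inode: a count and a flag marking "free stuff done". */
struct inode {
	int i_count;
	int freed;
};

/* Stand-in for the kernel panic(): report and stop. */
void sketch_panic(const char *msg)
{
	fprintf(stderr, "panic: %s\n", msg);
	abort();
}

/* Model of the iget side: take one more reference. */
void iget_ref(struct inode *ip)
{
	ip->i_count++;
}

/* Model of iput with the extra sanity test placed in front. */
void iput_checked(struct inode *ip)
{
	if (ip->i_count == 0)
		sketch_panic("Zero inode free");	/* released once too often */
	else if (ip->i_count == 1) {
		ip->freed = 1;		/* placeholder for the real free work */
		ip->i_count = 0;
	} else
		ip->i_count--;
}
```

With balanced calls the test never fires; it only trips when an inode whose count is already zero is released again, which is exactly the case a double release would produce.
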

And then Heisenberg struck......
Since then the last close bug has happened twice, whereas before it
was happening about once every two days. We have never had the panic.

My conclusion is that there is a very funny timing bug in the kernel and
adding the panic test has shifted it. Of course, I cannot now take the
test out........

Any ideas anyone????


Peter Collinson
lime!ukc!pc
or
mcvax!ukc!pc