[comp.sys.sun] Bizarre nfs problem

richard%aiai.edinburgh.ac.uk@nsfnet-relay.ac.uk (Richard Tobin) (06/02/89)

Bute is a 4/260 running 4.0.1.  Skye is a 3/280 running 3.5.  Staffa is a
3/260 running 3.2.  Bute and staffa mount /usr/skye2 (rw, hard, intr)
from skye.

User rt has a file thesis.tex in /usr/skye2, which he edits with gnu emacs
on staffa.  However, the new version of the file isn't visible on bute
where he runs latex on it.  Indeed, it looks very strange:

 34906 -rw-r--r--  1 rt           1865 May 31 14:46 thesis.tex
 34906 -rw-r--r--  1 rt           1865 May 31 14:46 thesis.tex.~38~

Two files, each with the same inode number, with but a single link each!

If user richard (me) looks at it (on bute):

 34889 -rw-r--r--  1 rt           1811 Jun  1 18:06 thesis.tex
 34906 -rw-r--r--  1 rt           1865 May 31 14:46 thesis.tex.~38~

If rt looks at it from a different machine he sees the right thing (i.e.
the same as me).

Fsck on skye says that the filesystem is fine.  So it seems to be a
problem with nfs on bute.  This seems very strange (especially that it
depends on the user - does nfs do per-user caching?).  Is there a known
bug that could account for it?
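
For anyone wanting to check a directory for the same symptom, here is a
minimal sketch in Python (not part of the original report). It flags inode
numbers that show up under more names than their link count allows. The
function names inode_anomalies and scan_directory are made up for this
illustration, and the sample entries are simply the listing rt saw on bute.
The sketch ignores device numbers, so it assumes the directory contains no
mount points.

```python
import os
from collections import defaultdict


def inode_anomalies(entries):
    """entries: iterable of (name, inode, nlink) tuples from one directory.

    Returns {inode: [names]} for every inode listed under more names
    than its reported link count permits.
    """
    names_by_inode = defaultdict(list)
    nlink_by_inode = {}
    for name, inode, nlink in entries:
        names_by_inode[inode].append(name)
        nlink_by_inode[inode] = nlink
    return {
        inode: sorted(names)
        for inode, names in names_by_inode.items()
        if len(names) > nlink_by_inode[inode]
    }


def scan_directory(directory):
    """Collect (name, inode, nlink) per entry, without following symlinks."""
    entries = []
    for name in os.listdir(directory):
        st = os.lstat(os.path.join(directory, name))
        entries.append((name, st.st_ino, st.st_nlink))
    return inode_anomalies(entries)


# The listing rt saw on bute: one inode, two names, link count 1.
seen_by_rt = [("thesis.tex", 34906, 1), ("thesis.tex.~38~", 34906, 1)]
print(inode_anomalies(seen_by_rt))
```

A consistent view should give an empty result: a file with real hard links
reports a link count at least as large as the number of names found.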

Richard

pereira@warbucks.ai.sri.com (Fernando Pereira) (06/08/89)

We have also seen this problem in a similar configuration. What seems to
be happening is the following:

1.	3.X NFS client renames "foo" to "foo~" on 3.X NFS server
2.	3.X NFS client creates new file "foo" on 3.X NFS server
	(this seems to be the sequence of events on a GNU Emacs save)
3.	4.0.X NFS client sees the old directory entry for "foo" but
	also the new directory entry for "foo~", hence the two
	different file names with one link each but the same inode.
4.	Some time later (when the system administrator starts trying
	to help the confused user and thus creates some disk activity
	flushing buffers (-:)) the 4.0.X client starts seeing things
	consistently.
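
Steps 1 and 2 can be reproduced on a local filesystem with a short Python
sketch. It only mirrors the rename-then-create sequence guessed at above; it
is not GNU Emacs's actual code, and the helper name save_with_backup is
invented for the illustration. The point is that the backup name keeps the
old inode while the saved file gets a new one.

```python
import os
import tempfile


def save_with_backup(path, new_text):
    """Save new_text to path using the rename-then-create sequence.

    Returns (backup_inode, new_inode).
    """
    backup = path + "~"
    os.rename(path, backup)        # step 1: the old inode now lives under "foo~"
    with open(path, "w") as f:     # step 2: a brand-new file, hence a new inode
        f.write(new_text)
    return os.stat(backup).st_ino, os.stat(path).st_ino


with tempfile.TemporaryDirectory() as d:
    foo = os.path.join(d, "foo")
    with open(foo, "w") as f:
        f.write("old contents\n")
    old_inode = os.stat(foo).st_ino
    backup_inode, new_inode = save_with_backup(foo, "new contents\n")
    # Both comparisons are expected to hold on a local filesystem.
    print(backup_inode == old_inode, new_inode != old_inode)
```

A client that still believes "foo" refers to the old inode, but has already
learned about "foo~", would show exactly the pair of names described in
step 3.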

One hypothesis here is that the old directory entry for "foo" is in
directory block i, while the new entry for "foo~" is in block j. The "ls"
sees a cached version of i but the new version of j. Eventually the cache
is flushed, the new block i is read back, and things appear consistent
again. Someone more knowledgeable in kernel/NFS matters might be able to
say whether this hypothesis is at all reasonable.  We've seen this
problem only on an almost unloaded 4.0.X client; the lifetime of the
problem has varied from over 10 minutes to just a few seconds. Marvels of
distributed, stateless, asynchronous file systems...
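
The block hypothesis can be played out with a toy model in Python. This is
only a sketch of the guess above, not of how the SunOS NFS client really
caches directories: a directory is a list of blocks, each a dict from name
to inode number, and the client prefers any block it holds in its cache. The
names server_save and client_listing and the block layout are assumptions
made for the illustration; the inode numbers are borrowed from Richard's
listing.

```python
def server_save(blocks, name, new_inode, i, j):
    """Rename name -> name~ and create a fresh name in the server's blocks.

    The old entry lives in block i; the backup entry is assumed to land
    in block j, and the new file reuses the slot in block i.
    """
    old_inode = blocks[i].pop(name)
    blocks[j][name + "~"] = old_inode
    blocks[i][name] = new_inode


def client_listing(server_blocks, cache):
    """List the directory, preferring a cached copy of a block if present."""
    listing = []
    for index, block in enumerate(server_blocks):
        source = cache.get(index, block)
        listing.extend(sorted(source.items()))
    return listing


server = [{"foo": 34906}, {}]           # block 0 holds "foo", block 1 is empty
cache = {0: dict(server[0])}            # client cached block 0 before the save
server_save(server, "foo", 34889, i=0, j=1)

stale_view = client_listing(server, cache)   # old block 0 plus new block 1
cache.clear()                                # the cache is eventually flushed
fresh_view = client_listing(server, cache)

print(stale_view)   # [('foo', 34906), ('foo~', 34906)]
print(fresh_view)   # [('foo', 34889), ('foo~', 34906)]
```

The stale view shows two names with one inode, and the fresh view matches
what the other clients saw. The model says nothing about why the effect
would depend on the user.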

Fernando Pereira
AI Center
SRI International