[comp.protocols.nfs] what causes a stale file handle?

rayan@cs.toronto.edu (Rayan Zachariassen) (09/10/89)

I have an application which sits as a daemon with a bunch of files open
all the time, to avoid reopening them 3 times a second when things get
busy.  One of the files it keeps open is the active file for news on an
NFS partition.  Once in a blue moon, non-repeatable but occurring in clusters,
an lseek() followed by a read() (actually rewind() and fgets()) on the open
file descriptor (pointer) will fail with a stale NFS file handle error.
This happens inside stdio, but this is the error reported by trace on the
active daemon.

To summarize:

An NFS client (SunOS 4.0.3) runs a daemon that keeps a file open for reading
on an NFS server (SunOS 3.5).  The file is frequently modified locally on the
server, and perhaps even recreated (i.e. unlinked, and a new independent file
created under the same name) by the server.

Question:

Why is the stale NFS handle error returned?  Any way to compensate within
the daemon apart from doing a close()/open() sequence?  Any way to tell
that a read() will return ESTALE without actually doing the read() ?

Hypothesis:

NFS doesn't deal as expected ("unix semantics") with disappearing file links.

Any info would be appreciated.

rayan

chuq@Apple.COM (Chuq Von Rospach) (09/10/89)

>Question:

>Why is the stale NFS handle error returned?  Any way to compensate within
>the daemon apart from doing a close()/open() sequence?  Any way to tell
>that a read() will return ESTALE without actually doing the read() ?

A stale file handle is reported when the fhandle for the file in question
becomes out of date. Every inode on an NFS server has a generation count
that is changed whenever that inode is re-used (or whenever fsirand is run,
as happens when you run newfs).

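What's probably happening to you, since this is a news partition, is that
expire is being run. As part of expire, a new active file (nactive) is
created, then active is renamed to oactive and nactive to active. Your daemon
now has an fhandle pointing to what is now oactive. That fhandle stays valid
only while oactive exists; presumably it goes stale once oactive is removed
or replaced (for instance by the next expire run) and its inode is freed.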
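
The rename sequence can be reproduced on a local POSIX filesystem. The sketch
below is illustrative only: the helper names same_file() and demo() are made
up for this example, and it runs against a temporary directory rather than a
real news spool. It shows that a descriptor opened on active follows the file
to oactive, and that comparing fstat() on the descriptor with stat() on the
path is one way to notice the swap. Over NFS, attribute caching may delay
that detection, and fstat() on an already stale handle may itself fail with
ESTALE.

```python
import os
import tempfile


def same_file(fd, path):
    """Return True if the open descriptor and the path name the same inode."""
    a = os.fstat(fd)
    try:
        b = os.stat(path)
    except FileNotFoundError:
        return False
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def demo():
    """Mimic expire's rename dance and report what the old descriptor sees."""
    with tempfile.TemporaryDirectory() as d:
        active = os.path.join(d, "active")
        oactive = os.path.join(d, "oactive")
        nactive = os.path.join(d, "nactive")

        with open(active, "w") as f:
            f.write("old contents\n")

        # The daemon's long-lived descriptor.
        fd = os.open(active, os.O_RDONLY)
        try:
            before = same_file(fd, active)

            # expire: build nactive, then active -> oactive, nactive -> active.
            with open(nactive, "w") as f:
                f.write("new contents\n")
            os.rename(active, oactive)
            os.rename(nactive, active)

            after = same_file(fd, active)         # path now names a new inode
            follows_old = same_file(fd, oactive)  # descriptor followed the file

            os.lseek(fd, 0, os.SEEK_SET)
            data = os.read(fd, 100)               # still the old data
        finally:
            os.close(fd)
        return before, after, follows_old, data
```

When same_file() returns False for the active path, a daemon could close and
reopen the file before its next read instead of waiting for an error.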

Since NFS is stateless, the fhandle has a generation count to avoid the
following scenario:

o client opens server:/tmp/foo
o server deletes /tmp/foo
o server creates new file /tmp/financial_data (which happens to re-use the
  same inode)
o client attempts to use the fhandle it was given for /tmp/foo. Without the
  generation count, the fhandle would point to an inode with a real file
  attached, so the access would succeed, even though the file you opened
  has nothing to do with the data you are now playing with.
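
The scenario above can be modelled with a toy in-memory server. This is a
sketch of the idea only, not the actual NFS server code: the ToyServer class,
its methods, and the (inode, generation) tuple used as an fhandle are all
assumptions made for illustration, and the caller picks the inode number so
that re-use can be forced.

```python
import errno


class ToyServer:
    """Hypothetical model of a server that stamps fhandles with a generation."""

    def __init__(self):
        self.generation = {}  # inode number -> current generation count
        self.contents = {}    # inode number -> data, present only while in use
        self.names = {}       # path -> inode number

    def create(self, path, data, inode):
        # Bump the generation every time the inode is put to (re)use.
        self.generation[inode] = self.generation.get(inode, 0) + 1
        self.contents[inode] = data
        self.names[path] = inode

    def delete(self, path):
        inode = self.names.pop(path)
        del self.contents[inode]

    def lookup(self, path):
        # The "fhandle" handed to the client: inode plus generation.
        inode = self.names[path]
        return (inode, self.generation[inode])

    def read(self, fhandle):
        inode, gen = fhandle
        if inode not in self.contents or self.generation[inode] != gen:
            raise OSError(errno.ESTALE, "Stale NFS file handle")
        return self.contents[inode]


def scenario():
    srv = ToyServer()
    srv.create("/tmp/foo", b"foo data", inode=7)
    fh = srv.lookup("/tmp/foo")          # client opens server:/tmp/foo
    first = srv.read(fh)                 # works while the file exists

    srv.delete("/tmp/foo")               # server deletes /tmp/foo
    srv.create("/tmp/financial_data", b"secret", inode=7)  # same inode re-used

    try:
        srv.read(fh)                     # old fhandle, new generation
    except OSError as e:
        return first, e.errno == errno.ESTALE
    return first, False
```

Without the generation comparison in read(), the last call would have
returned the contents of /tmp/financial_data to a client that only ever
opened /tmp/foo.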

Generally, you'll see stale fhandles under two circumstances:

o the filesystem you have mounted gets restored from backup and you haven't
  umounted it.
o the file you have open has been deleted and re-created.

The stale message is a way of telling you that something has happened to the
file since you opened it that makes your fhandle invalid, so that you
don't continue using an fhandle that may well be pointing to incorrect data.
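
For a daemon that keeps the file open, one way to cope is to treat ESTALE as
a signal to reopen by name and retry once. The following is a minimal sketch
under that assumption, not a description of how any existing news software
does it; the class name ReopeningReader and its methods are hypothetical, and
seek(0) plus readline() stand in for the rewind() and fgets() calls in the
original question.

```python
import errno


class ReopeningReader:
    """Keep a file open; on ESTALE, reopen it by name and retry once."""

    def __init__(self, path):
        self.path = path
        self.fh = open(path, "r")
        self.reopens = 0  # how many times a stale handle forced a reopen

    def first_line(self):
        for attempt in range(2):
            try:
                self.fh.seek(0)
                return self.fh.readline()
            except OSError as e:
                # Give up on any other error, or if the fresh handle fails too.
                if e.errno != errno.ESTALE or attempt == 1:
                    raise
                try:
                    self.fh.close()
                except OSError:
                    pass  # closing a stale handle may itself fail
                self.fh = open(self.path, "r")
                self.reopens += 1

    def close(self):
        self.fh.close()
```

The retry is limited to one attempt so that a persistently failing server
produces an error rather than a loop.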


-- 

Chuq Von Rospach      <+>     Editor,OtherRealms     <+>     Member SFWA/ASFA
         chuq@apple.com   <+>  CI$: 73317,635  <+>  AppleLink: CHUQ
      [This is myself speaking. No company can control my thoughts.]

        Perhaps I should say Dr. *Von* Rospach, Dr. Rospach? (Gasp)