[gnu.emacs.bug] Emacs bug?

rivest@THEORY.LCS.MIT.EDU (02/20/89)

Hi --

The version of emacs that the LCS Theory group is using,
(18.50.16 of Mon Oct 24 1988 on allspice) seems to be having problems
with properly detecting whether or not the disk version of a file in a
buffer has been changed.  More precisely, it frequently reports the
error message:
	"File on disk has changed, ..."
when in fact the only program manipulating that file is emacs itself.  

This occurs most frequently during the use of rmail, but it can also be
triggered by an auto-save of a file being edited.  (The most common occurrence
is when it is writing back the RMAIL file after having retrieved some new
messages.)

I'm not able to force this to occur in any deterministic manner, so I can't
give you a simple sequence of commands to execute to cause it to appear.
Nonetheless, it occurs all too frequently.  

It may be important to note that most users of the theory group use NFS 
as provided by the "allspice system" in such a way that all files accessed
by emacs are stored on a remote file server.  It is conceivable that the bug
is somehow an NFS bug and not an EMACS bug, or that it is due to some 
unfortunate interaction between NFS and EMACS.  Ray and I have experimented
to see if it might be due to clock skew between the client and the server;
our experiments indicate that this is unlikely to be the cause.

The bug appeared recently when Ray Hirschfeld reconciled our server 
(theory) with the allspice system.  This primarily brought in the X11
system to all the theory machines (although some had been running X11
previously).  It is probably not an X problem, since it has appeared
when I was logged in over a modem from a dumb terminal.  

I know this is not a crisp bug report, but the bug is not a crisp one
either.  

	-- Is this a known problem?  If so, is there a known fix?

I will attempt to gather more information on this problem, but would
appreciate any information or guidance you may have at this point...

	Thanks!
	Ron Rivest

jis@ATHENA.MIT.EDU (Jeffrey I. Schiller) (02/21/89)

	I have also run into this problem and tracked it down to an
interaction between the way GNU Emacs determines whether or not a file
has been modified, and the asynchronous nature of NFS.

	Basically when you write out a file with Emacs, it writes the
file, closes the file and then "stats" it to get its modification
time. With a local file system the modification time for the file is
stable after the close() routine completes. However with NFS the
close() routine can complete and return control to Emacs (which then
stats() the file for its modification time) while output to the NFS
file is still pending. Some amount of time later the output is done
being written to the file and the modification time stabilizes on a
value further in the future than when Emacs stat()'d it. This of
course later confuses Emacs into thinking that the file has been
changed since the last time it was written. This tends to happen more
often with large files, and RMAIL files do tend to be large (my RMAIL
file is currently at 1.5 Megabytes for example).
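The sequence described above can be sketched roughly as follows. This is a minimal illustration in Python, not Emacs's actual code; the function names `save_and_record_mtime` and `changed_on_disk` are hypothetical. On a local file system the recorded time and the later time agree; the report here is that over NFS the later one can end up newer.

```python
import os

def save_and_record_mtime(path, data):
    """Write the file, close it, then stat it for its modification time."""
    with open(path, "wb") as f:
        f.write(data)
    # The file is closed at this point. Over NFS, output may still be
    # pending, so the time read here may not be the final one.
    return os.stat(path).st_mtime_ns

def changed_on_disk(path, recorded_mtime_ns):
    """Report a change if the current mtime differs from the recorded one."""
    return os.stat(path).st_mtime_ns != recorded_mtime_ns
```

If the server settles on a later modification time after `save_and_record_mtime` returns, `changed_on_disk` reports a change even though nothing else touched the file.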

	Unfortunately I don't have a clean solution to this problem
(there isn't a way with the current interface to the kernel to
deterministically know when a file is done being written and thus its
modification time is stable).  A kludge might be to add a pause between
the time the file is close()'d and the stat() is performed. The length
of time for this pause could be proportional to the size of the file
(so unnecessary delays aren't added when small files are written).  I
don't know what a good set of parameters for the pause would be.
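A rough sketch of that kludge, in Python for illustration only. The rate `seconds_per_mb` and the limit `cap` are placeholder values invented for this example (no good parameters are known, as noted above), and the `sleep` argument is a hypothetical hook so the pause can be replaced when experimenting.

```python
import os
import time

def pause_seconds(size_bytes, seconds_per_mb=0.1, cap=2.0):
    """Pause length proportional to file size, limited to `cap` seconds.

    Both defaults are made-up placeholders, not tuned values.
    """
    return min(cap, size_bytes / 1_000_000 * seconds_per_mb)

def save_with_pause(path, data, sleep=time.sleep):
    """Write and close the file, wait, then stat it for its mtime."""
    with open(path, "wb") as f:
        f.write(data)
    # Small files get almost no delay; large ones wait longer.
    sleep(pause_seconds(len(data)))
    return os.stat(path).st_mtime_ns
```

Nothing here guarantees that the pause is long enough; it only makes the race less likely.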

			-Jeff

rivest@THEORY.LCS.MIT.EDU (02/21/89)

Thanks for your reply and the diagnosis of the problem.  I'm sorry
to hear there doesn't seem to be a clean fix to the problem.  

Actually, your suggestion to put a pause in reminds me that another
"problem" we've been having recently is that emacs seems to take 
an unreasonably long time already to save files.  It is certainly
much longer than it took, say, using Ultrix.  Maybe someone has already
stuffed a number of long pauses into the file-save code (but not enough
to catch all occurrences of this timing problem...)

I think the best fix, given the current kernel interface, is the following:

	-- when emacs detects (by means of its stats() information) that
	   a file has been ``changed'', it should actually go read the file
	   and see if it is actually any different than the buffer.  If 
	   the file and buffer contents are the same, then it should 
	   swallow its complaint and not bother the user with a spurious
	   notification that a change has occurred.  (I'd rather have a
	   little extra delay on writing than having to try to distinguish
	   real from spurious warning messages myself.)
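A minimal sketch of this check, written in Python for illustration rather than as a patch to Emacs. The function name `really_changed` is hypothetical, and the buffer is assumed to be available as a byte string.

```python
import os

def really_changed(path, recorded_mtime_ns, buffer_bytes):
    """Return True only if the file's contents differ from the buffer."""
    if os.stat(path).st_mtime_ns == recorded_mtime_ns:
        # Modification times agree, so there is nothing to complain about.
        return False
    # The times differ: read the file and compare before warning the user.
    with open(path, "rb") as f:
        return f.read() != buffer_bytes
```

The extra read happens only when the modification times disagree, so the common case costs nothing more than the stat already being done.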

Cheers,
	Ron Rivest

rivest@THEORY.LCS.MIT.EDU (02/22/89)

Jeff --

Regarding your explanation of the NFS/Emacs bug we discussed, wherein
emacs stats() the file before NFS is done getting it all written onto the
remote server, causing a spurious warning from emacs that the file has
changed on disk:

It seems to me that either the problem ought to be easily fixable, or else
the problem is more serious than I thought.  Consider the following 
scenario:
	-- I open a file for output with NFS, and write to it.
	-- I close the file.
	-- I open the file and read from it.
Presumably, NFS will guarantee that the contents of the file I read are 
identical to what I wrote.  That is, NFS should not have the bug that the
read operation can get inaccurate data because the write is not yet finished.
If NFS has this bug, then things are worse than I thought.  If NFS doesn't
have this bug, then can't we use a dummy read operation to more or less
force NFS to finish writing and close the file so stats() will work OK?

	Thanks,
	Ron

jis@ATHENA.MIT.EDU (Jeffrey I. Schiller) (02/23/89)

   Date: Wed, 22 Feb 89 13:27:27 PST
   From: guy@auspex.com (Guy Harris)

   >	-- I open a file for output with NFS, and write to it.
   >	-- I close the file.
   >	-- I open the file and read from it.
   >Presumably, NFS will guarantee that the contents of the file I read are 
   >identical to what I wrote.

   Yes, but it does *not* necessarily do so by syncing all unwritten data
   to the server before doing the "read".  The unwritten data is stored
   somewhere on the NFS client; the "read" can just pick up that data
   rather than going to the server.

	Indeed, that is what is happening: the read is going to the local
cache.

   >If NFS doesn't have this bug, then can't we use a dummy read operation
   >to more or less force NFS to finish writing and close the file so
   >stats() will work OK?

   Try doing an "fsync" before the "close", assuming your OS has "fsync"
   (SunOS and Ultrix should both have it); if "fsync" is implemented
   properly (as far as I know, it is so implemented in SunOS), it will not
   return until *all* unwritten data for the file descriptor handed to it
   has been sent to the server.  The SunOS version of "vi" does an "fsync"
   after writing out a file and before closing it.

	I believe Ron is using a Wisconsin port of the Sun NFS code.
Last time I looked at the source code for fsync, it only guarantees
that the writes have been queued (which only means that they are on
the "async_daemon" queue [the queues serviced by the /etc/biod
processes]) not that the i/o has in fact completed.
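For reference, the "fsync before close" ordering suggested above looks roughly like this in Python. It is only a sketch: the function name `save_with_fsync` is hypothetical, and whether the stat afterwards sees a stable modification time depends entirely on how fsync is implemented on the system in question, which is the point in dispute here.

```python
import os

def save_with_fsync(path, data):
    """Write the file, fsync it, close it, then stat it for its mtime."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push the language runtime's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push its pending writes out
    # The file is closed here. If fsync only queued the writes, the
    # modification time read below may still change afterwards.
    return os.stat(path).st_mtime_ns
```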

	An alternative to what I suggested in my last message might be
to change the comparison code in Emacs so that, rather than requiring
the file modification date to exactly match the buffer modification
date, it allows a certain small tolerance, proportional to file size. This
will eliminate an unnecessary pause and in effect provide the same
semantics.
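A sketch of such a tolerant comparison, in Python for illustration. The names and the rate `ns_per_mb` (one second of slack per megabyte) are assumptions made up for this example, not values taken from Emacs, and the check is symmetric here only for simplicity.

```python
import os

def tolerance_ns(size_bytes, ns_per_mb=1_000_000_000):
    """Allowed mtime slack in nanoseconds, proportional to file size.

    The default rate is a made-up placeholder, not a tuned value.
    """
    return size_bytes * ns_per_mb // 1_000_000

def changed_with_tolerance(path, recorded_mtime_ns, ns_per_mb=1_000_000_000):
    """Report a change only if the mtime is off by more than the tolerance."""
    st = os.stat(path)
    drift = abs(st.st_mtime_ns - recorded_mtime_ns)
    return drift > tolerance_ns(st.st_size, ns_per_mb)
```

The trade-off is that a genuine change by another program within the tolerance window would go unnoticed, so the window should stay small.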

			-Jeff