[comp.unix.wizards] NFS [un]reliability

karl@cbrma.UUCP (Karl Kleinpaste) (11/10/86)

mike@louis.UUCP writes:
>Recently we have been doing a study of NFS fileservers and we have
>come across unreliability in NFS (i.e., writing something to a remote
>file and finding something different when reading it back) when the
>server was under extreme load. Now we are starting to notice the same
>behaviour on our existing Sun fileservers. 
>
>The question is, have others noticed this and does anyone know why
>it happens?

[mumble]

Yes, I've seen such a thing.  At OSU, there is a small set of Suns
(11?), 3 of which are Sun-2s and the rest are recently-purchased
Sun-3s.  Unfortunately, one of the Sun-2s is the server for *all* the
rest.  Some would call this a Bad Thing, and they would be right.  It
is equipped with 2 Eagle drives for a decent amount of disc, and all
those other Suns are usually quite busy during office hours.

This problem was first noticed in, of all things, the "hack" game, and
more recently in GNU Emacs.  GNU Emacs has lisp code to detect whether
a file has changed on disc more recently than the last time the
current user either read the file in or wrote his changes out.
Periodically, when the server node is seriously overloaded (which is
the case more and more often), GNU Emacs utters the evil phrase, "File
has changed on disc; save anyway [y or n]?"  It is *believed* (that
is, we can't quite prove it yet) that this is due to the sequence of
events where [a] Joe User saves his file, which causes additional work
for an already-overloaded server, [b] GNU Emacs stat(2)'s the file to
get its modification time, but [c] the server is so overloaded that
the file wasn't finished being written at the time of the stat(2), so
[d] Joe goes on and hacks at his file a while longer, [e] issues
another save for it, at which time [f] GNU Emacs stat(2)'s the file
again, compares it against its saved write-time, and [g] finds that
the last modification time is later than the saved write-time.
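The suspected sequence [a] through [g] can be sketched roughly as below. This is only an illustration in Python, not GNU Emacs' actual lisp code: the names TrackedFile, changed_on_disk and simulate_late_write are hypothetical, and the "late" server write is faked locally by pushing the file's mtime forward with os.utime.

```python
import os


class TrackedFile:
    """Hypothetical editor buffer that remembers the mtime seen after its last save."""

    def __init__(self, path):
        self.path = path
        self.saved_mtime_ns = None

    def save(self, text):
        with open(self.path, "w") as f:
            f.write(text)
        # Steps [a]/[b]: stat right after the write. On an overloaded NFS
        # server this value may predate the moment the data really lands.
        self.saved_mtime_ns = os.stat(self.path).st_mtime_ns

    def changed_on_disk(self):
        # Steps [f]/[g]: compare the current mtime with the remembered one.
        if self.saved_mtime_ns is None:
            return False
        return os.stat(self.path).st_mtime_ns > self.saved_mtime_ns


def simulate_late_write(path, delay_seconds):
    """Assumption: stand in for step [c] by advancing the file's mtime."""
    st = os.stat(path)
    late_ns = st.st_mtime_ns + int(delay_seconds * 1_000_000_000)
    os.utime(path, ns=(st.st_atime_ns, late_ns))
```

If the mtime moves forward after save() has recorded it, changed_on_disk() reports a change even though nobody else touched the file, which is the sort of false alarm described above.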

Potent words of evil tend to get uttered by Joe when he sees GNU
Emacs' comment, because (generally speaking) he hasn't the FAINTEST
idea what caused it.

> And, of course, does anyone know how to stop it?

OSU is choosing to solve the whole problem (that is, overall
performance, not just GNU Emacs and similar programs' foolish
comments) by replacing the Sun-2 file server with >1 Sun-3 file
servers.  You do what you have to.  Unfortunately, it costs
significant $$$ to do what you have to in such cases.
-- 
Karl Kleinpaste

west@onion.cs.reading.ac.uk (Jerry West) (11/15/86)

With regard to the Emacs filetime problems on a nova of Suns hanging
off one server.... we found that we had more problems with date(1) being
different on different machines. Sun have "fixed" ls et al to allow for
files created some (small) time in the future, but GNU Emacs might be
falling foul of this. An rdate(8) in crontab keeps things in line.
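One rough way to see how far a client's clock is from the server's is to create a scratch file on the mounted directory and compare its modification time with the local clock. The sketch below assumes the server stamps newly created files with its own time and ignores client attribute caching; the function names and the tolerance parameter are made up for illustration.

```python
import os
import tempfile
import time


def estimate_clock_skew(directory):
    """Return (file mtime - local time) in seconds for a scratch file in directory.

    On an NFS mount a large value suggests the server clock differs from
    the client clock; on a local disc it should be close to zero.
    """
    before = time.time()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.close(fd)
        mtime = os.stat(path).st_mtime
    finally:
        os.unlink(path)
    after = time.time()
    # Compare against the midpoint of the local timestamps taken around the create.
    return mtime - (before + after) / 2.0


def looks_from_future(mtime, now, tolerance):
    """True if mtime is ahead of now by more than tolerance seconds."""
    return (mtime - now) > tolerance
```

A check such as looks_from_future(mtime, time.time(), 2.0) is in the spirit of tolerating files a small time in the future; the 2-second figure is arbitrary.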

Jerry West

jim@cs.strath.ac.uk (Jim Reid) (11/16/86)

In article <85@onion.cs.reading.ac.uk> west@onion.UUCP (Jerry West) writes:
>With regard to the Emacs filetime problems on a nova of Suns hanging
>off one server.... we found that we had more problems with date(1) being
>different on different machines. Sun have "fixed" ls et al to allow for
>files created some (small) time in the future, but GNU Emacs might be
>falling foul of this. An rdate(8) in crontab keeps things in line.

Of course, you should also ensure the client and server machines have
their kernels configured for the same timezone. A former colleague of
mine was baffled by time funnies until he found that the client kernel
thought it was in California (PST) while the server had been properly
configured for local time in Norway!
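To give a feel for the size of such a mismatch: if the same wall-clock reading is interpreted as PST on one machine and as Norwegian local time on another, the two machines disagree by nine hours. The sketch below assumes standard time on both sides (PST = UTC-8, CET = UTC+1) and uses a hypothetical helper name; it says nothing about how the kernels in question actually handled timezones.

```python
from datetime import datetime, timedelta, timezone

# Fixed standard-time offsets; daylight saving is deliberately ignored.
PST = timezone(timedelta(hours=-8), "PST")
CET = timezone(timedelta(hours=1), "CET")


def apparent_skew_seconds(wall_clock, assumed_zone, actual_zone):
    """Seconds by which one naive wall-clock reading differs when it is
    interpreted in assumed_zone instead of actual_zone."""
    assumed = wall_clock.replace(tzinfo=assumed_zone)
    actual = wall_clock.replace(tzinfo=actual_zone)
    return (assumed - actual).total_seconds()


if __name__ == "__main__":
    reading = datetime(1986, 11, 16, 12, 0, 0)
    print(apparent_skew_seconds(reading, PST, CET) / 3600.0, "hours")
```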

		Jim