das@harvard.harvard.edu (David Steffens) (07/25/90)
In article <9767@brazos.Rice.edu> dz@cornu.ucsb.edu (Daniel James Zerkle) writes:
> A certain quirk with NFS is causing major headaches...
> Basically, my program uses three Unix machines:
> 1. The main program runs on a Sun 3/60.
> 2. The main program, via rsh, calls a secondary program on a Masscomp.
> 3. A Sun 3/160 (I think) is a file server for the above two machines.

Gasp!  This is almost exactly the same configuration we are using
(substitute Sun4 for Sun3) and exactly the same problem we are having.
In fact, I was just getting ready to submit something to SunSpots a
couple of weeks ago when I got distracted by other things.  IMHO, this
is a Sun NFS bug, since I can reproduce it using a simple pair of test
programs, each running on a Sun and using a third Sun as a server
(i.e. _no_ Masscomp involved).

> To summarize the operation:
> a. 1 calls program on 2
> b. 2 writes first 256 bytes of a file (open-write)
> c. 1 reads those 256 bytes (open-read-close)
> d. 2 writes rest of file (write-close-exit)
> e. 1 reads rest of file

Our paradigm is similar, but not identical:
a. 1 starts a program on 2 via rsh and pauses
b. 2 opens a file on 3 for writing and signals 1 that it is ready
c. 1 waits for 2 to signal ready, then opens the same file on 3 for reading
d. 2 then continuously writes variable-length hunks of data into the
   file on 3 and tells 1 how much was written each time
e. 1 loops, reading and processing each hunk of data written by 2

> Right now, I have 1 do a periodic check to see if NFS has gotten it
> straight how big the file is.  In other words, 1 may wait about a minute
> while the file size gets straightened out.  This is totally unacceptable...

Some things which seem to improve the situation:
1. After writing a hunk of data on 2, it helps to have 2 do an fsync(2).
2. Before reading the data on 1, it helps to close the file, reopen it
   and then seek to the end of the data already read.
Kludge #1 is important for us but may not be for you.
Apparently, there is some buffering on machine 2 for optimization of
NFS writes, since the file on 3 always seems to grow by 8K increments
no matter how much (or little) is written on 2.  Therefore, Kludge #1
helps, but does not cure the problem.

Kludge #2 seems to be very important also.  Apparently, 1 asks 3 for
the attributes of the file when the file is first opened and then
caches the results.  Since 1 now thinks that it knows everything there
is to know about the file, it doesn't bother to interrogate 3 for the
current file attributes before each read, and thus doesn't see that
the file has changed size.  As a consequence of this, 1 won't ask 3
for any piece of the file beyond the size it knows.  The
close/open/seek seems to force 1 to check the attributes of the file
against reality on 3 and update its cache.  Just seeking around in the
file is insufficient -- the file must be closed and reopened for the
seek to have the desired effect.

Kludge #2 improves the situation tremendously, although even both
kludges taken together are not sufficient to guarantee success under
all conditions.  The residuum seems to be explainable by network
delays, however.

It is still a mystery to me what mechanism eventually causes 1 to
catch up with the state of affairs on 3 when 1 is left to its own
devices.  I've seen delays of a minute or more which don't seem to
correlate with anything.  As long as the program on 1 keeps the file
open, the attributes of the file remain incorrect, even as reported by
standard utilities, e.g. ls -l.

P.S.  For the skeptics out there... source for the test suite is
available by mail from me.  If you can find something wrong with our
approach or our implementation, I'd like to hear about it!

{harvard,mit-eddie,think}!eplunix!das	David Allan Steffens
243 Charles St., Boston, MA 02114	Eaton-Peabody Laboratory
(617) 573-3748 (1400-1900h EST)		Mass. Eye & Ear Infirmary

I'm a firm believer in learning from one's past mistakes...
...but why should it take so many to get a good education?
dupuy@cs.columbia.edu (07/26/90)
dz@cornu.ucsb.edu (Daniel James Zerkle) writes:
> To summarize the operation:
> a. 1 calls program on 2
> b. 2 writes first 256 bytes of a file (open-write)
> c. 1 reads those 256 bytes (open-read-close)
> d. 2 writes rest of file (write-close-exit)
> e. 1 reads rest of file

eplunix!das@harvard.harvard.edu (David Steffens) replies:
| Our paradigm is similar, but not identical:
| a. 1 starts a program on 2 via rsh and pauses
| b. 2 opens a file on 3 for writing and signals 1 that it is ready
| c. 1 waits for 2 to signal ready then opens the same file on 3 for reading
| d. 2 then continuously writes variable length hunks of data
|    into the file on 3 and tells 1 how much was written each time
| e. 1 loops reading and processing each hunk of data written by 2

> Right now, I have 1 do a periodic check to see if NFS has gotten it
> straight how big the file is.  In other words, 1 may wait about a minute
> while the file size gets straightened out.  This is totally unacceptable.

| Some things which seem to improve the situation:
| 1. After writing a hunk of data on 2, it helps to have 2 do an fsync(2).
| 2. Before reading the data on 1, it helps to close the file, reopen it
|    and then seek to the end of the data already read.

| Apparently, 1 asks 3 for the
| attributes of the file when the file is first opened and then caches the
| results.  Since 1 now thinks that it knows everything there is to know
| about the file, it doesn't bother to interrogate 3 for the current file
| attributes before each read, and thus doesn't see that the file has
| changed size.  As a consequence of this, 1 won't ask 3 for any piece of
| the file beyond the size it knows.  The close/open/seek seems to force 1
| to check the attributes of the file against reality on 3 and update its
| cache.

As David suspected, there be caches here.  What you are seeing here is
the NFS attribute cache, which normally acts to speed up programs like
ls and make (i.e.
standard unix utilities, which use pipes or stream sockets to
communicate with each other, rather than NFS-mounted files).

In SunOS 4.0 you can minimize the effects of the cache by specifying
"actimeo=1" as an option to the mount on the client.  This invalidates
the cache after 1 second, rather than the default 60 seconds.  In
SunOS 4.1, you can disable the cache entirely using the "noac" mount
option (actually, you can do this in 4.0, but it tickles a bug that
will corrupt your file, if not your filesystem).

However, might I suggest that since you wish to move data from a
process on machine X to a process on machine Y, you use sockets to
move the data instead of NFS-mounted files.  TCP is an excellent
protocol, and the performance you get should easily match the
performance of UDP-based NFS, and do substantially better over
long-haul networks, in the presence of noise, and NFS implementation
glitches (TCP hasn't had any significant ones since before NFS was
born).  Using sockets can be as easy as using pipes to rsh, or in the
case of your programs:

	machine1$ rsh machine2 program2 | program1

If you need to have the data logged into a file, you could always put
a "tee" in the pipeline somewhere.

Hope these suggestions help.

inet: dupuy@cs.columbia.edu
uucp: ...!rutgers!cs.columbia.edu!dupuy
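[Ed. note: the two mount options described above might be applied as
follows.  This is a sketch only; the server name "server", export path
"/export/data", and mount point "/mnt/data" are hypothetical, and the
exact option syntax should be checked against your system's mount(8).]

```shell
# SunOS 4.0: shorten the attribute-cache lifetime to 1 second
mount -o actimeo=1 server:/export/data /mnt/data

# SunOS 4.1: disable the attribute cache entirely
mount -o noac server:/export/data /mnt/data
```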