[comp.protocols.nfs] NFS writes and fsync

peter@objy.objy.com (Peter Moore) (10/10/90)

	My apologies if this is FAQ material, but I just have to know.

There is a (somewhat) new product out that is supposed to speed up
NFS by caching NFS writes in a (somewhat) non-volatile cache.  The
advertisement states that the NFS protocol requires that NFS-write
requests not return until the data has actually been written to disk,
i.e. only after the equivalent of a UNIX fsync() on the file.  Since
this is an EXPENSIVE operation, they get a significant speedup of NFS by
keeping a write-back cache with some sort of battery backup, and
just lying to the kernel that the data has been synced.

Given that the above description is in roughly the same neighborhood
of the truth (it has come through a noisy channel, me), I have to know:
WHY ON EARTH DOES NFS REQUIRE THE FSYNC ON WRITES?  Without that
requirement, we could get the effect of this cache board by just not
calling fsync().

Now whenever I see something ugly in NFS, it usually comes from the
stateless requirement.  But the only state dependent reason I can see is:

    Process P on machine A writes to machine B

    machine B crashes before the write is synced to disk

    machine B recovers

    Process P does another write, dependent on the first succeeding.
    This second write does succeed and now we have inconsistent data.

But in real life, I have seen situations vaguely like this, and the
writing process gets a `stale NFS handle' error.  So it seems that at
least the NFS implementations I have run into have that much state.


So what am I missing?

	Peter Moore
	peter@objy.com

thurlow@convex.com (Robert Thurlow) (10/12/90)

In <1990Oct9.152612@objy.objy.com> peter@objy.objy.com (Peter Moore) writes:
>WHY ON EARTH DOES NFS REQUIRE THE FSYNC ON WRITES?  Without that
>requirement, we could get the effect of this cache board by just not
>calling fsync().

No, you couldn't.  The cache board for PCs that I know about is a nice
unit that essentially promises you the data won't go away and keeps it
in battery-backed memory to ensure it.  That's important, since once the
write request is acknowledged, the client will not try the write again,
and may discard its copy of the data.  You can easily lose data if the
server goes down before syncing it.  Usually, too, what
waits for the acknowledgement is a block I/O daemon (biod) that will
handle your async writes for you; your process has to wait for all I/O
only when it does an fsync() or a close(), though aggregate throughput
is reduced.

I think most people would agree that the default behaviour should be to
make writes reliable, since that provides the semantics of a local
filesystem.  You are more free to buy extra throughput by upgrading the
server disk or CPU than you are to buy more reliability.  That said,
I'll add that we do provide an export option to allow you to tell the
server to acknowledge the write request immediately upon receipt, and
spool the request to its local I/O subsystem.  It can help performance
a good bit if you don't mind the risks.  It's great for filesystems all
clients mount with -soft; their processes will be gone after a server
reboot, anyway.

>Now whenever I see something ugly in NFS, it usually comes from the
>stateless requirement.  But the only state dependent reason I can see is:
>    Process P on machine A writes to machine B
>    machine B crashes before the write is synced to disk

Stop right there.  Your 'disk' has just lost data, period.  Do you
expect your local disk to ever do that?  The effects could be very
devastating, depending on what exactly cared about the data.  Think
of the havoc you could wreak on a database server.

>But in real life, I have seen situations vaguely like this, and the
>writing process gets a `stale NFS handle' error.  So it seems that at
>least the NFS implementations I have run into have that much state.

ESTALE only happens when the server can't find anything matching the
file handle on its disks, and usually happens when some other process
did a creat() or an unlink(), or the server filesystems got mounted in
a different order.  I don't see the connection here.

Hope that helped,
Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"I was so much older then, I'm younger than that now."

peter@prefect.uucp (Peter Moore) (10/14/90)

> 
> I think most people would agree that the default behaviour should be to
> make writes reliable, since that provides the semantics of a local
> filesystem.  
>

But those aren't the semantics of a local write.  See below.

> 
> Stop right there.  Your 'disk' has just lost data, period.  Do you
> expect your local disk to ever do that? 
> 

Yes I do expect my local disk to do that.  As you sort of mention,
local writes under almost all Unix systems are asynchronous.  The
write returns immediately, but the data stays in the buffer pool until
either it is pushed out to make room for more active pages or until
sync() is called.  Typically sync() is called every 30 seconds by the
update daemon.  So you have no guarantee that your last 30 seconds of
local I/O ever make it to disk unless you explicitly do a sync.  And if
something does go wrong (an unrecoverable bad block, a drive going off
line, or a full crash during the 30-second period), there is no way to signal
back to you that it failed.  Heck, your process could have exited
before the write failed.
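
(For the record, the update daemon is essentially just this loop, give
or take details; a minimal sketch, not any particular vendor's source:)

        #include <unistd.h>

        int
        main(void)
        {
                for (;;) {
                        sync();         /* schedule all dirty buffers for disk */
                        sleep(30);      /* and again in 30 seconds */
                }
        }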

>   The effects could be very
> devastating, depending on what exactly cared about the data.  Think
> of the havoc you could wreak on a database server.
>

But, as I pointed out, this effect can happen on local writes too.
That is why any `database-like' application must explicitly call
fsync() if it wishes to guarantee that pages have made it to disk. No
recoverable system can depend on the write() alone when writing to
local disk.  So synchronous NFS isn't helpful to the database people, at
least for that reason.  They are already doing the right thing with
explicit syncs just to make it work locally.
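
Concretely, the pattern such a program has to use looks something like
this (a minimal sketch; the file name and record are made up):

        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        int
        main(void)
        {
                char rec[512] = "committed transaction record";
                int fd;

                if ((fd = open("journal", O_WRONLY | O_CREAT | O_APPEND, 0666)) < 0) {
                        perror("open");
                        exit(1);
                }
                /* write() just queues the data in the buffer cache */
                if (write(fd, rec, sizeof rec) != sizeof rec) {
                        perror("write");
                        exit(1);
                }
                /* only a successful fsync() says the record is really on disk */
                if (fsync(fd) < 0) {
                        perror("fsync");
                        exit(1);
                }
                return 0;
        }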

This is why synchronous NFS writing seems unmotivated to me.
It is MORE synchronous than local Unix I/O (assuming that network
latency is a lot less than 30 seconds).  Why pay such a cost to make
it MORE synchronous than we already are willing to live with on Unix?

The strongest justification I have been able to synthesize from
various sources is basically the original one in my article:

1) Process P on machine L writes to file F on machine R
2) R crashes before syncing the changes to disk and either:
     a) recovers before P does any more writes to F, or
     b) crashes after P has finished with F.
3) P continues on blithely unaware that the writes to F failed and
    produces some data.

The reason this scenario seems unique to NFS is that most
failures of the local machine to write also involve the local machine
crashing.  So P couldn't continue after the write really failed.

    ( Actually this is still possible with local writes: P could either
    exit before the crash or the write error could be a softer error,
    perhaps a bad block, that didn't force the system to crash.  I think it
    is even possible to have later I/Os make it to disk, but earlier I/Os not.
    But I do admit this scenario is more likely in the NFS case. )

I have some hand-wavey counter-arguments to the above scenario.

   1) Only relatively naive programs will run into this.  Reliable
      programs will already be doing fsync at the proper times.  So as
      long as an asynchronous form of NFS implements fsync, they will
      be all right.

   2) Only long-lived programs are vulnerable (at least more
      vulnerable than local writes).  If a process takes much less
      than 30 seconds to run, then it is very unlikely that the process
      will be actually killed by a machine crash that wipes out its
      buffered writes.  So if the process worked with local writes, it
      must either be calling fsync, or already be willing to live
      with not knowing if its I/O made it to disk.
      
   3) This isn't all that likely.  For case 2a) you need R to crash
      and recover before P does any more I/O to F.  Considering that big
      servers I have worked with can take over half an hour to reboot,
      that is a very wide window to miss. And for case 2b) R has to
      crash within 30 seconds of the very last I/O to F; no sooner, no later.

   4) Kludges can be added.  If NFS handles were invalidated across
      reboots (perhaps by including a byte computed from the boot time in
      the handle), then at least 2a) would be impossible without an
      explicit reopen of F.  More complicated support from the local
      OS could probably even make reopens of F by P fail (though
      the vague implementations I can think of are unacceptably kludgey).
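
      A rough sketch of what I mean by invalidating handles (purely
      hypothetical; the structure is invented, not anything from the
      Sun code):

        /* A file handle that carries the server's boot time.  The server
           refuses (ESTALE) any handle whose boot_time doesn't match the
           current boot, forcing an explicit reopen after a reboot. */
        struct my_fhandle {
                unsigned long   fsid;           /* which filesystem */
                unsigned long   fileid;         /* which file (inode number) */
                unsigned long   gen;            /* inode generation number */
                unsigned long   boot_time;      /* server boot time when issued */
        };

        extern unsigned long current_boot_time; /* recorded at server boot */

        int
        fh_still_valid(struct my_fhandle *fh)
        {
                return fh->boot_time == current_boot_time;
        }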

Now none of these arguments are overwhelming, but they do add up.  I
am not trying to argue that NO one needs or wants synchronous NFS.  I
am arguing that not everybody does (and I believe, but can't defend
better than the above, that MOST people don't need it).

> I'll add that we do provide an export option to allow you to tell the
> server to acknowledge the write request immediately upon receipt, and
> spool the request to its local I/O subsystem.  It can help performance
> a good bit if you don't mind the risks.  It's great for filesystems all
> clients mount with -soft; their processes will be gone after a server
> reboot, anyway.

This is exactly the sort of thing I want.  Now I just need it on all
my machines as an option.  Guy Harris mentioned in an email thread of
this conversation that an asynchronous extension of NFS has been
considered.  This would seem the best path, allowing some protocol to
negotiate whether the NFS connection will be synced (`nfsmount -o
async` perhaps?).  This could be a big win for a lot of installations.
( Maybe even linking a RISC application over NFS could finally take a
finite amount of time.)

     Peter Moore
     peter@objy.com

        
P.S.
     While I don't know if they even want to be associated with this
     argument, I would like to thank Craig Everhart, Carl Smith, and
     Guy Harris for having the patience to discuss much of the above
     with me in email conversations.
 

thurlow@convex.com (Robert Thurlow) (10/14/90)

In <1990Oct14.082712.10811@objy.com> peter@prefect.uucp (Peter Moore) writes:

>> Stop right there.  Your 'disk' has just lost data, period.  Do you
>> expect your local disk to ever do that? 

>Yes I do expect my local disk to do that.  As you sort of mention,
>local writes under almost all Unix systems are asynchronous.  The
>write returns immediately, but the data stays in the buffer pool until
>either it is pushed out to make room for more active pages or until
>sync() is called.  Typically sync() is called every 30 seconds by the
>update daemon.  So you have no guarantee that your last 30 seconds of
>local I/O ever make it to disk unless you explicitly do a sync.  And if
>something does go wrong (an unrecoverable bad block, a drive going off line, or
>a full crash during the 30 second period), there is no way to signal
>back to you that it failed.  Heck, your process could have exited
>before the write failed.

I agree that write(2) won't return you an error in general, but processes
can, at any point they wish, call fsync() to ensure the data is secured.
That ability is lost if the server is acknowledging only the receipt of
the request.  close(2) will fail if you can't secure your writes, as well,
though people are very poor at paying attention to the return code.  If
you throw away this ability for a process to get an accurate indication
of success, you've definitely made it impossible to trust database I/O
over NFS.  You've also lost the ability to choose synchronous writes.
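
For what it's worth, a careful program's shutdown path looks something
like this (a small sketch; there's nothing NFS-specific in it):

        #include <errno.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        /* Returns 0 only if the data is known to have been secured. */
        int
        finish_writes(int fd)
        {
                int error = 0;

                if (fsync(fd) < 0) {    /* flush and wait for the disk */
                        fprintf(stderr, "fsync: %s\n", strerror(errno));
                        error = -1;
                }
                if (close(fd) < 0) {    /* may report a deferred write error */
                        fprintf(stderr, "close: %s\n", strerror(errno));
                        error = -1;
                }
                return error;
        }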

>>   The effects could be very
>> devastating, depending on what exactly cared about the data.  Think
>> of the havoc you could wreak on a database server.

>But, as I pointed out, this effect can happen on local writes too.
>That is why any `database-like' application must explicitly call
>fsync() if it wishes to guarantee that pages have made it to disk. No
>recoverable system can depend on the write() alone when writing to
>local disk.  So synchronous NFS isn't helpful to the database people, at
>least for that reason.  They are already doing the right thing with
>explicit syncs just to make it work locally.

This is the problem: after a server says "Yo!", your client need never
write that data again, FSYNC() OR NOT, because it "trusts" the server
and can in no way tell it should not.  The block may be in your buffer
cache, but I/O is marked complete; fsync() will lie, likely without even
going over the wire.  You just can't trust an fsync() anymore, period.

>This is why synchronous NFS writing seems unmotivated to me.
>It is MORE synchronous than local Unix I/O (assuming that network
>latency is a lot less than 30 seconds).  Why pay such a cost to make
>it MORE synchronous than we already are willing to live with on Unix?

Networks go down a _lot_ more often than local disks do, in my experience.
People kick out cables, router boxes fail, network adaptors hang, 
machines get powered down, and of course, the local server disk can
fail :-)

>Now none of these arguments are overwhelming, but they do add up.  I
>am not trying to argue that NO one needs or wants synchronous NFS.  I
>am arguing that not everybody does, (and I believe, but can't defend
>better than the above, that MOST people don't need it).

I like the idea of having an option.  But I'm sufficiently convinced
that it should be the default.

>> <stuff about the -async option>

>This is exactly the sort of thing I want.  Now I just need it on all
>my machines as an option.

You're right; it's a real issue getting new features out across the
installed base so that everyone can count on them; Brian Pawlowski
of Sun underlined it in his talk at the ONC/NFS Industry Networking
Conference last week.  I think we have to try to cut time-to-market
for new functionality and get this stuff out there faster.

Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."

peter@prefect.Berkeley.EDU (Peter Moore) (10/15/90)

Oops.  Further thought brings out some problems. Given the scenario:

1) Process P on machine L writes to file F on machine R
2) R crashes before syncing the changes to disk and either:
     a) recovers before P does any more writes to F, or
     b) crashes after P has finished with F.
3) P continues on blithely unaware that the writes to F failed and
    produces some data.


In my last note I said that a process that did the careful fsyncs that
even local writes require wouldn't be bothered by the above scenario.
But calling fsync during stage 3 would only commit changes since R
reconnected.  P would falsely think that all changes have been
committed. This is not good.

Two changes could fix this.  First, invalidate all NFS handles on
reboot.  This would cause fsyncs on existing file descriptors to fail.
Second, (sadly) do an implicit fsync on every close.  This would protect
processes that do: write, close, reopen, and then fsync.

So we still need implicit fsyncs, but not on every write. But now a server
crash will take down all processes with open file descriptors on that server,
whether or not the process actually lost any data.

So there are more tradeoffs to be made.  But it still seems like a
worthwhile option.

	Peter

thurlow@convex.com (Robert Thurlow) (10/15/90)

In <1990Oct15.025613.12574@objy.com> peter@prefect.Berkeley.EDU (Peter Moore) writes:

>Two changes could fix this.  First, invalidate all NFS handles on
>reboot.  This would cause fsyncs on existing file descriptors to fail.
>Second, (sadly) do an implicit fsync on every close.  This would protect
>processes that do: write, close, reopen, and then fsync.

Invalidate all file handles and you've made recovery after a reboot
damned expensive.  I don't want the process to have to think about
reopening its files, so the kernel would get a lot uglier.  I'd prefer
to leave the biod doing implicit fsyncs, thanks; I think I have other
tools to tweak performance.

Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."

guy@auspex.auspex.com (Guy Harris) (10/16/90)

>Guy Harris mentioned in an email thread of this conversation that an
>asynchronous extension of NFS has been considered.

Err, umm, no, I didn't.  What I mentioned was the WRITECACHE operation
from an NFS3 protocol spec.  The WRITECACHE operation, as proposed
therein, was (in the RPC language used in that spec, which was the
2/10/89 version):

	NFSPROC_WRITECACHE(file,flush,beginoffset,totalcount,offset,data)
	    returns(reply)

	fhandle file;
	boolean flush;
	fileoffset beginoffset;
	unsigned totalcount;
	fileoffset offset;
	nfsdata data;

	struct wcokres {
		fattr attributes;
		unsigned count;
		union flushinfo switch (bool flushed) {

		case TRUE:
			void;

		case FALSE:
			unsigned flushcnt;
			errinfo flusherror;
		} flushed;
	};

	union writecacheres switch (stat status) {

	case NFS_OK:
		wcokres	ok;

	case NFS_WARN:
		wcokres aok;
		errinfo warning;

	case NFS_ERROR:
		errinfo errinfo;
	} reply;

  This call writes "data" (which is just a bunch of data) into the
  server's data cache for a regular (ftype NFREG) file.  "Beginoffset" and
  "totalcount" describe the offset and size, in bytes, of the entire piece
  of data to be written.  These values will be the same across a set of
  WRITECACHE operations.  Each of the WRITECACHE operations in a set will
  have different values for "offset" and "data".  "Offset" is the byte
  offset into "file" where "data" should be written.

  The boolean "flush", if TRUE, causes the server to flush a whole data
  set.  That is, commit to disk the data from several WRITECACHE
  operations, whose "offset" values fall between "beginoffset" and
  "beginoffset+totalcount".  If the server's attempt to flush "totalcount"
  bytes of data starting at "beginoffset" bytes into the file is
  successful the server will return "reply.flushed" TRUE.

  If "reply.status" is NFS_OK "reply.ok.attributes" is the attributes of
  the file following the write.  If "reply.flushed" is FALSE
  "reply.ok.flushed.flushcnt" is the number of consecutive bytes that
  actually got written starting at "beginoffset", which may be less than
  "totalcount", and "reply.ok.flushed.flusherror" contains error
  information about the write operation that failed.  If "reply.status" is
  NFS_ERROR then the call failed and "reply.errinfo" contains error
  information.  If "reply.status" is NFS_WARN then all the data returned
  in the NFS_OK case is returned, along with a warning "errinfo" structure.

  The server has the option of accepting only a portion of data.  In this
  case "reply.ok.count" is the number of bytes of data that were cached
  starting at "offset" in the file.  The size of "data" must be less than
  or equal to the value of the "wsize" field in the GETFSINFO reply
  structure [this is the maximum number of bytes the server will accept in
  a "write" operation] for the file system that contains "file".  In
  addition, "totalcount" must be less than or equal to the "wcsize" field
  in the GETFSINFO reply [this is the maximum number of bytes that the
  server will let you write out in a set of WRITECACHE operations].

  IMPLEMENTATION

  The WRITECACHE operation is provided for performance only.  Servers
  are not required to support it.

  Clients can use the WRITECACHE operation to group consecutive WRITE
  operations without incurring the overhead of flushing each chunk of
  data through to disk on the server.  The client takes responsibility
  for recovering from server errors by holding on to data that has been
  written with WRITECACHE until a successful flush has occurred.  This
  way, to recover from an error the client can either retry the set of
  WRITECACHE operations or use WRITE operations to insure that the data
  is safely on the server's disk.

  The server may pre-flush cached data to disk to free up cache space. 
  If this happens the server can either return an error in response to
  the flush request and force the client to resend everything, or keep
  track of data that has already been flushed when the flush request
  comes along.  This way, if the server can account for all data as
  either in the cache or already flushed the flush request can return
  success.

So this is *NOT* the same as making writes asynchronous.  It's more like
letting a single *synchronous* write be broken up into several pieces.
Those pieces are WRITECACHE operations with the same "file",
"beginoffset", and "totalcount" values, those values being the values
that correspond to the single write.  The individual pieces are
identified by the "offset" values in the WRITECACHE operations, and the
data in the pieces are the "data" values.

The final WRITECACHE operation has a "flush" value of TRUE, the others
having a "flush" value of FALSE.  Just as is the case with a single
"write" operation, the client holds onto the data until it's *all*
flushed to disk; it must not free up stuff just because it's been sent
to the server with a WRITECACHE operation - it has to wait until the
final WRITECACHE operation succeeds.  All but the final WRITECACHE
operation resemble asynchronous writes; the final WRITECACHE is still
synchronous.
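
To make the sequencing concrete, the client-side logic would look roughly
like this (a sketch only; "writecache_call" and "fhandle_t" are stand-ins
for the RPC stub and handle type, not anything from the spec quoted above):

        typedef struct fhandle fhandle_t;       /* opaque handle type, assumed */

        /* Assumed stub: sends one WRITECACHE op, returns 0 on NFS_OK and
           sets *flushed from the reply when "flush" was TRUE. */
        extern int writecache_call(fhandle_t *fh, int flush,
            unsigned beginoffset, unsigned totalcount,
            unsigned offset, char *data, unsigned len, int *flushed);

        int
        writecache_set(fhandle_t *fh, char *buf, unsigned beginoffset,
            unsigned totalcount, unsigned wsize)
        {
                unsigned off, len;
                int flush, flushed = 0;

                /* The client must keep "buf" until the final flush succeeds,
                   so it can replay the set (or fall back to plain WRITEs)
                   if the server crashes in the middle. */
                for (off = 0; off < totalcount; off += len) {
                        len = totalcount - off;
                        if (len > wsize)
                                len = wsize;
                        flush = (off + len == totalcount);      /* last piece only */
                        if (writecache_call(fh, flush, beginoffset, totalcount,
                            beginoffset + off, buf + off, len, &flushed) != 0)
                                return -1;
                }
                return flushed ? 0 : -1;
        }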

cs@Eng.Sun.COM (Carl Smith) (10/16/90)

> Two changes could fix this.  First, invalidate all NFS handles on
> reboot.

	This would break every NFS client that I know of.  It's the server's
responsibility to ensure that file handles are as permanent as the files to
which they refer.

			Carl

mogul@wrl.dec.com (Jeffrey Mogul) (10/16/90)

This seems like a good time to insert a few comments into the NFS
synchrony debate (especially since I'm going away for a week).

One of the problems with NFS is that it manages to tangle up several
different issues, which makes it hard to solve one without breaking
something else.

For example, the reason why NFS clients do write-throughs to the server
is partly for reliability (the client could crash before a delayed
write is sent to the server) and partly a consequence of the statelessness
dogma.  This is because if you have two clients sharing the same file,
changes have to appear on the server "as soon as possible" in order
to preserve some shreds of local-Unix-like cache consistency.

If NFS clients behaved like local-disk Unix systems (only write dirty
blocks every 30 seconds), then it wouldn't matter as much if the server
acknowledged them immediately, or waited until the data was safely on
the disk.  (As has been pointed out, it would be a trivial change to
allow the client to distinguish "precious" blocks from others, just
as the local-disk Unix file system has always done.)  But, since server
disk write latency is so nakedly exposed to client applications, anything
that speeds that latency (such as a "stable-storage" cache, or faster
disks) helps a lot.

Of course, NFS isn't the last word in file systems.  Anyone interested
in a better design can read the papers on Sprite (e.g., Michael N. Nelson,
Brent B. Welch, and John K. Ousterhout, "Caching in the Sprite Network File
System", Trans. Computer Systems 6:1, pages 134-154, Feb. 1988) and Spritely
NFS (V. Srinivasan and Jeffrey C. Mogul, "Spritely NFS: Experiments with
Cache-Consistency Protocols", Proc. 12th SOSP, pages 45-57, Dec. 1989).

But for many of us (including me!), NFS is what we use, so solutions that
don't require protocol changes (such as server stable-storage boards) might
still be a win.

-Jeff

peter@prefect.Berkeley.EDU (Peter Moore) (10/16/90)

In article <thurlow.655915475@convex.convex.com>, thurlow@convex.com (Robert Thurlow) writes:

|> I agree that write(2) won't return you an error in general, but processes
|> can, at any point they wish, call fsync() to ensure the data is secured.
|> That ability is lost if the server is acknowledging only the receipt of
|> the request.

I am sorry.  I was being unclear.  I was only advocating writes become
asynchronous.  Calls like fsync (and close if we make it implicitly
fsync) would have to be synchronous.


	Peter Moore
	peter@objy.com

guy@auspex.auspex.com (Guy Harris) (10/17/90)

>For example, the reason why NFS clients do write-throughs to the server

By "do write-throughs to the server" do you mean that if a process on an
NFS *client* writes to a file, the data is immediately sent to the
server?

If so, UNIX NFS clients (or, at least, those derived from the Sun code)
do *not* do write-throughs to the server; in fact, they behave like
local-disk UNIX systems, writing dirty blocks out every 30 seconds (or
when the buffer/page is needed), unless somebody on the client has done
an "fcntl()" lock on the file (which causes writes to be done
synchronously on the client, and reads always to go to the server).
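
(The lock in question is an fcntl()-style record lock, i.e. something
like the following; per the behaviour described above, once a process
on the client has done this, writes go over the wire synchronously and
reads go to the server:)

        #include <fcntl.h>
        #include <unistd.h>

        /* Take an exclusive lock on the whole file, waiting if necessary. */
        int
        lock_whole_file(int fd)
        {
                struct flock fl;

                fl.l_type = F_WRLCK;    /* exclusive (write) lock */
                fl.l_whence = SEEK_SET;
                fl.l_start = 0;
                fl.l_len = 0;           /* zero length means "to end of file" */
                return fcntl(fd, F_SETLKW, &fl);
        }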

To quote NFSD(8):

     biod starts 'nservers' asynchronous block I/O  daemons. This
     command  is  used  on  a  NFS  client to buffer cache handle
     read-ahead and write-behind. The magic number for 'nservers'
     in here is also four.

thurlow@convex.com (Robert Thurlow) (10/18/90)

In <1990Oct16.085057.16691@objy.com> peter@prefect.Berkeley.EDU (Peter Moore) writes:

>In article <thurlow.655915475@convex.convex.com>, thurlow@convex.com (Robert Thurlow) writes:

>|> I agree that write(2) won't return you an error in general, but processes
>|> can, at any point they wish, call fsync() to ensure the data is secured.
>|> That ability is lost if the server is acknowledging only the receipt of
>|> the request.

>I am sorry.  I was being unclear.  I was only advocating writes become
>asynchronous.  Calls like fsync (and close if we make it implicitly
>fsync) would have to be synchronous.

But fsync doesn't go over the wire; if you let the server asynchronously
respond to write requests, you won't be able to trust your fsyncs, either.

Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."

stukenborg@mavplus9.rtp.dg.com (Stephen Stukenborg) (10/20/90)

(Sorry this is slightly dated, but our news feed has been whacked out all week.)

After reading the postings back and forth on the issue, I still haven't
seen anyone really hit the nail on the head on why nfs write operations 
are synchronous.  The primary reason is to make the client tolerant of 
server crashes.  If I'm an NFS client, I don't want the fact that the 
server has crashed to impact my "view" of the world.  My system is still up 
and running.  Why should I lose any data?

As has already been described by Rob Thurlow, if I mark my client buffer cache 
block as "clean" when the server acks my write, then I'm counting on that
data being on stable storage.  If the server is merely going to ack the
receipt of my write request, then I have to hold on to that buffer until
close (or the janitor daemon) verifies that everything is on the server's 
disk.  As Jeff Mogul pointed out, the primary difference between traditional 
unix file system behavior and NFS is that the NFS close operation writes 
all of a file's dirty buffers to disk.  It is this "sync-on-close" behavior
that really dictates the synchronous write policy.  (And it also provides a wimpy
consistency-control policy.)

The performance win is if the writes (before the close) are asynchronous;
then the disk arm on the server is not "forced" to seek all over the place.
This is essentially the Rev. 3 protocol spec WRITECACHE call that Guy Harris 
talked about.  It would probably be worth the added complexity to the client 
side code for the improved server performance. 

Another thought.  I'm having trouble with the several people who
have mentioned that NFS data integrity is not a big factor to some users.  
Do users really want a MIPS-like export option that says "don't do sync 
writes"?  (Note that these async writes are different from those mentioned
above.  Now I'm talking about the possibility that data will be lost 
on a server crash.)  The only reason I can think of for having this feature
is for truly wondrous benchmark results that you can wave in a 
customer's face.  If I've just edited my code module or circuit layout, 
then I want to be sure that a server crash is not going to lose any 
of my thoughts.  (Maybe the users who want this feature are the same ones 
that use "soft" mounts on filesystems mounted r/w.)


Steve Stukenborg			DG/UX Kernel Development
Data General Corporation
62 Alexander Drive			stukenborg@dg-rtp.dg.com
Research Triangle Park, NC  27709	...!mcnc!rti!xyzzy!stukenborg

beepy@ennoyab.Eng.Sun.COM (Brian Pawlowski) (10/20/90)

[This is a long response describing some aspects of NFS behaviour
on writes. Slightly delayed because of news problems.]

In article <1990Oct16.004225.22754@wrl.dec.com>,
mogul@wrl.dec.com (Jeffrey Mogul) writes:

> One of the problems with NFS is that it manages to tangle up several
> different issues, which makes it hard to solve one without breaking
> something else.

Perhaps because the issues are related. Sprite cache strategies
consider data consistency issues, as does AFS 4.0, because
aggressive cache coherency strategies must make concessions
to data consistency (cooperation amongst clients).

I find it incredibly hard to disentangle the two in any discussion,
though I find it useful to separate them in looking at problems
in distributed file systems.

Synchronous Write Behaviour, NFS and Applications
-------------------------------------------------

> For example, the reason why NFS clients do write-throughs to the server
> is partly for reliability (the client could crash before a delayed
> write is sent to the server) and partly a consquence of the statelessness
> dogma.  This is because if you have two clients sharing the same file,
> changes have to appear on the server "as soon as possible" in order
> to preserve some shreds of local-Unix-like cache consistency.

I believe you've confused me above. An application is still subject to
"sync" delays for writes to a server via NFS exactly as it is when writing
to local disk. This is the behaviour we've come to know and love on UNIX.
NFS writes are triggered by normal buffer flushing (sync activity) on a client.
Your paragraph above makes it sound like writing is synchronous to the application.
This is, generally, not the case.

As a colleague points out:

  In the normal case, I/O is handled asynchronously subject to
  the normal update syncs.  I would not consider this synchronous to the
  application! There are cases in NFS when I/O is synchronous (to the
  application), however:

	- Any time an fsync is done by the application
	- For the remaining life of the file descriptor once a lock
	  has been applied to it (even if all locks are then cleared)

Is there a reason for you to believe it works otherwise?

The NFS client requirement that servers store data in stable storage
before replying NFS_OK is not a result of statelessness "dogma".
It is a result of the stateless design.

Further, the client's assumption that servers write data to
stable storage (the semantics of a good reply) is unrelated
to the 30-second sync time consideration, which, as you state,
attempts to preserve some shred of local-UNIX-like data consistency.
Rather, the 30-second sync time makes NFS write behaviour resemble local
file write behaviour in UNIX (subject to periodic sync's).

I believe you have confused several issues here.

Client Requirements for Server Writes to Stable Storage
-------------------------------------------------------

More critical to the on-going discussion is the reasons for an NFS
client requiring servers to write data (including "meta-data" like
file size) to stable storage before replying back to the client.
It is an inherent assumption in the current design of NFS that servers
will not respond NFS_OK (that is, "Write Successful") until data from 
a client has been written to stable storage. It is not partly a
consequence of the "stateless dogma" but inherently a consequence of 
the "stateless design" of NFS.

The assumption that servers flush data to stable storage before returning
NFS_OK to the client has nothing to do with client crashes, but has
everything to do with the implications of server crashes. By requiring
the server to write its data to stable storage, the client need not
concern itself with the current server state. On receiving an NFS_OK from
the server, the client is free to reuse data buffers which held the data just
written. If the server crashes and returns (reboots), the client
will (in classic "hard" mounted situations) wait for the server to
return and continue where it left of. The server crash has not affected
the operation of clients. This is some of the behaviour usually implied
when people say "NFS is stateless".

["stateless" is a relative term--we're obviously talking about state
on a client in the form of buffers held for 30 seconds.  This is normal
"UNIX" buffering behaviour. There are other "stateless" design implications,
the other well-known one being the simple cache coherency strategy used
by NFS which results in checking the attributes of a file to validate
whether locally cached data in the client is still valid--that is,
in agreement with server data. Another is the READDIR cookie, another
is the encapsulation of file location information in the file handle.
The approach is to keep the server simple and burden the client
with the responsibility of keeping critical state. This critical state
is not shared with other clients. Servers are also not without "state"--
servers typically employ a read-ahead strategy to improve performance--
however the key here is that such server state is not critical to proper
operation of NFS.]

The semantics of an NFS write are to preserve data in the event of a server
crash (by requiring it to be on stable storage--static RAM or disk).

Suggestions on just allowing servers to return NFS_OK without flushing
to stable storage [as have been made in preceding e-mails]
are in some sense dangerous, because all existing
clients are implemented under the assumption that NFS servers only
reply okay if the data is "safe". {Assuming you didn't just lose
the server disk you wrote to during the server crash.}

It is a client data reliability issue that it flushes modified
buffers every 30 seconds (or so) in exactly the same way it is
for flushing buffers to local disk--preserve data in event of a
client crash. In this way, NFS is no different from local writing.

In addition, 30 second flushes preserve shreds of UNIX data consistency
amongst clients (as you mention above). [A useful side effect].

> If NFS clients behaved like local-disk Unix systems (only write dirty
> blocks every 30 seconds), then it wouldn't matter as much if the server
> acknowledged them immediately, or waited until the data was safely on
> the disk.  (As has been pointed out, it would be a trivial change to
> allow the client to distinguish "precious" blocks from others, just
> as the local-disk Unix file system has always done.)  But, since server
> disk write latency is so nakedly exposed to client applications, anything
> that speeds that latency (such as a "stable-storage" cache, or faster
> disks) helps a lot.

NFS clients do behave like local-disk UNIX systems... What do you mean
above?  This is where I remain confused.  Server disk latency is
not exposed so nakedly to a client application.  (See the discussion above.)

To help applications detect errors in writes, asynchronous write
errors (asynchronous with respect to the application's write system call)
are returned at close() time. This is why it is so critical for an
application to check the results of a "close" operation to detect 
such errors. I repeat again: NFS writes are not (in general) synchronous
from the client application viewpoint, only from the NFS client viewpoint.
         ----------------------------                --------------------

In effect, a file close() results in an fsync() of the file to
ensure that any asynchronous errors are seen by the application.

The current protocol has no provision for later acknowledgement
of data being on stable storage (asynchronous writing), allowing
the client to implement a "precious block" policy. Such a change
would require a protocol revision.

What do you consider "precious"? The NFS design considers user data
precious, and ensures approximately the same guarantee of reliability
to an application that is provided by the local UNIX file system.
The semantics of "close" returning any asynchronous write errors
(in effect returning following the flush of data to stable storage
on the server) provide further guarantees to the application.

The attempt is to eliminate insidious silent errors.

Stable storage caching (static RAM techniques) on the server accelerates
client applications OVERALL because latency on NFS write requests
is reduced (as read-ahead techniques reduce latency by eliminating
synchronous disk access, so writing to static RAM reduces latency
by eliminating synchronous disk write activity). The key point
here is that no one particular application's write performance
is improved, but the OVERALL NFS client's performance is improved
(thereby improving all applications).

Future Directions
-----------------

> Of course, NFS isn't the last word in file systems.  Anyone interested
> in a better design can read the papers on Sprite (e.g., Michael N. Nelson,
> Brent B. Welch, and John K. Ousterhout, "Caching in the Sprite Network File
> System", Trans. Computer Systems 6:1, pages 134-154, Feb. 1988) and Spritely
> NFS (V. Srinivasan and Jeffrey C. Mogul, "Spritely NFS: Experiments with
> Cache-Consistency Protocols", Proc. 12th SOSP, pages 45-57, Dec. 1988).
> 
> But for many of us (including me!), NFS is what we use, so solutions that
> don't require protocol changes (such as server stable-storage boards) might
> still be a win.

NFS is what we use because it is a solution available commercially
today, while the papers you reference above describe research in 
distributed file systems. I take possible exception to your term
"better" design--the NFS design met its goals, provides a good
solution, and works.

I believe that Sprite, AFS and Spritely NFS have shown a lot of promise.
In one form or another, they address the issues of (1) cache coherency,
and (2) data consistency. Compare AFS 3.0 and AFS 4.0 and you may arrive
at the dichotomy on coherency that helps me understand the differences
twixt the two. (Or maybe not.) AFS 4.0 definitely provides stronger
data consistency semantics (through the Token Manager) than AFS 3.0
(which had well-defined, but possibly moot cache consistency since
the guarantees for data consistency amongst cooperating clients
were--are--weak. See the paper by Kazar and crew in the Summer Usenix
proceedings).

AFS, Sprite and Spritely NFS provide direction for us [the NFS community]
on ways to improve performance and data consistency guarantees in
future distributed file systems. NFS improvements in data consistency
(the view of data as seen by multiple clients) are not addressed by 
stable-storage boards. Stable storage boards provide a performance
boost within the framework of the current NFS protocol while preserving 
correctness (the implicit agreement made between an NFS client and server
on write semantics).

I think it is time to consider alternative cache consistency models
for NFS, and research in the area provides several directions. HOWEVER,
I also believe that the simplicity of the current design of NFS,
particularly in regards to data reliability, are not things we should
toss aside lightly. NFS has been made available on the wide
variety of platforms because it has been both easy to port
and fairly easy to implement from the specification.

Simplicity is not a bad word. Simple error recovery semantics in
a distributed application is not a bad design. Complex error recovery
techniques may accompany complex cache coherency schemes. There
is a body of knowledge now on many of the issues. Perhaps it is
time to exploit this knowledge seriously in NFS.

Lest we lapse into a mode where we believe data and cache consistency
are the only issues, one should look around at others: operation
over unreliable networks (WAN's), administration, support for shared
file name spaces, etc.

Feedback on these issues is solicited,

> -Jeff

Brian Pawlowski

thurlow@convex.com (Robert Thurlow) (10/21/90)

In <1990Oct19.222754.17622@dg-rtp.dg.com> stukenborg@mavplus9.rtp.dg.com (Stephen Stukenborg) writes:

>After reading the postings back and forth on the issue, I still haven't
>seen anyone really hit the nail on the head on why nfs write operations 
>are synchronous.  The primary reason is to make the client tolerant of 
>server crashes.  If I'm an NFS client, I don't want the fact that the 
>server has crashed to impact my "view" of the world.  My system is still up 
>and running.  Why should I lose any data?

>As has already been described by Rob Thurlow, if I mark my client buffer cache 
>block as "clean" when the server acks my write, then I'm counting on that
>data being on stable storage.  If the server is merely going to ack the
>receipt of my write request, then I have to hold on to that buffer until
>close (or the janitor daemon) verifies that everything is on the server's 
>disk.  As Jeff Mogul pointed out, the primary difference between traditional 
>unix file system behavior and NFS is that the NFS close operation writes 
>all of a file's dirty buffers to disk.  It is this "sync-on-close" behavior
>that really dictates the synchronous write policy.  (And it also provides a wimpy
>consistency-control policy.)

Here's your problem.  There is no "open" operation, nor "close" operation,
in the NFS protocol.  If you want to open a file, your client does an
NFS getattr operation to ensure there is indeed a file by that name.  If
you want to close, you simply stop using that filehandle.  All close does
is force over-the-wire writes on all VM pages of the associated file.

In fact, the only way in the current protocol to send data is with the
write operation, and the only way to send metadata is with the set
attributes operation.  There is no way to have a second ack when I/O
is complete.  Now, maybe a future protocol will be changed; that
writecache from NeFS looks like it has a lot of potential.  But right
now, if the client doesn't at least feel free to destroy the data
when the write ack is received, it never will get any information that
will make it feel better about it in the future.

>Do users really want a MIPS-like export option that says "don't do sync 
>writes"?  (Note that these async writes are different from those mentioned
>above.  Now I'm talking about the possibility that data will be lost 
>on a server crash.)  The only reason I can think of having this feature 
>is for truly wondrous benchmark results that you can wave in a 
>customer's face.

Remember how hard a diskless node hits its NFS-mounted swap device -
I've read numbers akin to five writes to every read from /export/swap.
If I'm the sort of user who doesn't run long-lived batch jobs from my
workstation, I might enjoy the performance edge I gain with async I/O
without minding the cost of rebooting most or all the time when my
boot server crashes.

#include <iso/std/disclaimer>
Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."

vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (10/21/90)

In article <143972@sun.Eng.Sun.COM>, beepy@ennoyab.Eng.Sun.COM (Brian Pawlowski) writes:
> ...
> More critical to the on-going discussion is the reasons for an NFS
> client requiring servers to write data (including "meta-data" like
> file size) to stable storage before replying back to the client.
> It is an inherent assumption in the current design of NFS that servers
> will not respond NFS_OK (that is, "Write Successful") until data from 
> a client has been written to stable storage. It is not partly a
> consequence of the "stateless dogma" but inherently a consequence of 
> the "stateless design" of NFS.

Confounding statelessness, to the limited degree it is an attribute of NFS,
with server caching policies is bad.  Consider that "state" is the purpose
of a file system.

NFS is not now and never was stateless.  It is relatively stateless, in
that the server is not notified of open()'s, unlike several other remote
file systems developed then and since.  AT&T made a big deal that NFS was
"stateless and so bad," while Sun responded that NFS was "stateless and so
good."  It was blarney in the battle between what AT&T called the "emerging
network file system standard" (RFS) and NFS.  The battle was not just the
public one, but the internal one between Sun engineering and whatever you
call the AT&T New Jersey UNIX department.  (I worked on the first SVR3 NFS
port in '85 in Mtn.View and saw some of the smoke of the cannons.)

"XID cache" is vital for making NFS come even as close as it does to real
UNIX file system semantics, and is by itself a sufficient counter to the
old claim that "NFS is stateless."

> ...
> The assumption that servers flush data to stable storage before returning
> NFS_OK to the client has nothing to do with client crashes, but has
> everything to do with the implications of server crashes. By requiring
> the server to write its data to stable storage, the client need not
> concern itself with the current server state. On receiving an NFS_OK from
> the server, the client is free to reuse data buffers which held the data just
> written. If the server crashes and returns (reboots), the client
> will (in classic "hard" mounted situations) wait for the server to
> return and continue where it left off. The server crash has not affected
> the operation of clients. This is some of the behaviour usually implied
> when people say "NFS is stateless".

No, the phrase "NFS is stateless" has been almost devoid of meaning for
years, because it is confounded with the general notion of state, as in
your paragraph above.

> ["stateless" is a relative term--we're obviously talking about state
> on a client in the form of buffers held for 30 seconds.  This is normal
> "UNIX" buffering behaviour. There are other "stateless" design implications,
> the other well-known one being the simple cache coherency strategy used
> by NFS which results in checking the attributes of a file to validate
> whether locally cached data in the client is still valid--that is,
> in agreement with server data.

What has this to do with "statelessness"?  Please say what this "stateless"
has to do with the differences between the NFS cache coherence mechanism
and the coherency mechanisms in the distributed cache systems for files,
RAM, host names, and toaster temperatures.

>          ...                      Servers are also not without "state"--
> servers typically employ a read-ahead strategy to improve performance--
> however the key here is that such server state is not critical to proper
> operation of NFS.]

Wrong.  Without a proper XID cache, an NFS filesystem is an unacceptably
poor imitation of a UNIX filesystem.  Remember the problems at the
Connectathon before last.

Please understand that I like NFS very much and stuff many megabytes thru
NFS filesystems every day.  I think the trade-offs of Bob Lyon & co. were
great and continue to be close to optimal.  Honesty conflicts with claims
that NFS==UFS.  There are many common UNIX behaviors where NFS is a poor
imitation of a Real Filesystem(tm).

> The semantics of an NFS write are to preserve data in event of a server
> crash (by requiring it to be on stable storage--static RAM or disk).
> ...

> Suggestions on just allowing servers to return NFS_OK without flushing
> to stable storage [as have been made in preceding e-mails]
> are in some sense dangerous. Because all existing
> clients are implemented under the assumption that NFS servers only
> reply okay if the data is "safe". {Assuming you didn't just lose
> the server disk you wrote to during the server crash.}

Exactly.  Life is "dangerous" and filled with disk crashes.

> ...
> The semantics of "close" returning any asynchronous write errors
> (in effect returning following the flush of data to stable storage
> on the server) provide further guarantees to the application.
> ...
> The attempt is to eliminate insidious silent errors.
 
I understand guarantees as absolute, except where explicitly limited.  The
Federal Government and the State of Calif agree with me.  If something is
guaranteed to not lose data, then it better not.  The NFS server dogma does
not provide a valid guarantee of preserving data, or of no silent errors.
It only improves the likelihoods.  This is because there is no such thing
as absolutely stable storage.  (As I write this, I'm restoring a crashed
disk.)

In most UNIX systems, the server cache in DRAM is lost during a crash, disk
sectors are usually not lost, and there is no third medium.  There are
other possibilities.  In the 1960's I worked with "mainframes" (Kronos on
6000's) where you could push the reset button ("level 3 restart"), and not
only have all active jobs resume, but where the contents of the RAM disk
caches would be recovered.  Amdahl, Unisys, CDC, and IBM probably still
have such features.  There are also systems where there are more than 2
layers of storage.  Where would the NFS server dogma require that a system
with "permanent" optical storage (whether modern WORM or ancient
microfiche), behind slow disks, behind fast drums, behind bulk RAM, behind
fast DRAM, behind SRAM cache preserve client data?  On the most stable,
even if it takes minutes to write?

> Stable storage caching (static RAM techniques) on the server accellerate
> client applications OVERALL because latency on NFS write requests
> are reduced (as read-ahead techniques reduce latency by eliminating
> synchronous disk access, so writing to Static RAM reduces latency
> by eliminating synchronous disk write activity). The key point
> here is that no one particular application's write performance
> is improved, but an OVERAL NFS client's performance is improved
> (thereby improving all applications).
> ...

This is a strange statement.  We found years ago that violating the NFS
cache dogma improved the numbers on many NFS benchmarks, from the Sun test
suite to many others, by 50%.  (Yes, fellow Connectathon
attendees, that is one of our secrets, now disclosed in an /etc/exports
option.)


It would be less dogmatic to say that when a server returns NFS_OK, it is
saying that the MTBF of the place containing the client's data is greater
than XXX, where the MTBF includes all possibilities of failure from power
to earthquake to kernel bug.  

The NFS protocol should dictate the external characteristics of the server
file system, not its internal implementation.  Whether the server flushes
to disk is an internal implementation issue.

Rational customers buy solutions to problems.  They don't care about
violations of dogma.  They only want an appropriate engineering solution to
preserving their data.  They don't care whether server buffers are flushed
to disk.  They care only that data are sufficiently rarely lost.

I was not present when the NFS cache dogma was graven in stone, but I
wonder if it was not mostly a statement about the lack of reliability of
NFS servers of the time (i.e. 68010 UNIX systems in 1984).

The NFS cache dogma does solve problems, but those problems are of people
selling things, not of people building or buying things.



Vernon Schryver
Silicon Graphics
vjs@sgi.com

beepy@ennoyab.Eng.Sun.COM (Brian Pawlowski) (10/21/90)

[On several recommendations, I'll try to keep the verbiage down. Whoops-
total failure.]

In article <72781@sgi.sgi.com>,
vjs@rhyolite.wpd.sgi.com (Vernon Schryver) writes:

> >                                              It is not partly a
> > consequence of the "stateless dogma" but inherently a consequence of 
> > the "stateless design" of NFS.
> 
> Confounding statelessness, to the limited degree it is an attribute of NFS,
> with server caching policies is bad.  Consider that "state" is the purpose
> of a file system.
> 
> NFS is not now and never was stateless.  It is relatively stateless, in
> that the server is not notified of open()'s, unlike several other remote
> file systems developed then and since.  AT&T made a big deal that NFS was
> "stateless and so bad," while Sun responded that NFS was "stateless and so
> good."  It was blarney in the battle between what AT&T called the "emerging
> network file system standard" (RFS) and NFS.  The battle was not just the
> public one, but the internal one between Sun engineering and whatever you
> call the AT&T New Jersey UNIX department.  (I worked on the first SVR3 NFS
> port in '85 in Mtn.View and saw some of the smoke of the cannons.)

I would not disagree with this. I was simplifying the discussion. Sorry.
No, NFS is not stateless; it is relatively stateless (which I attempted to
point out below). The "stateless" wars are pointless; however, the fact that
the "relatively stateless" design of NFS has simplified implementations
should not be ignored...

"stateless" is not simply a notification of "open()" though--shared
knowledge on the part of clients and servers (particularly in the
knowledge of cache consistency) is more critical (and difficult) state
to track.

> "XID cache" is vital for making NFS come even as close as it does to real
> UNIX file system semantics, and is by itself a sufficient counter to the
> old claim that "NFS is stateless."

I like your term relatively stateless, particularly for this reason.
However, at the level of the discussion the e-mails I responded to were
at, I felt comfortable pointing out that some of the fundamental design
considerations in NFS have pretty basic implications to what one
can and cannot do in an implementation. The need for an "XID cache"
addresses a "bug" in the protocol. Suggestions to the effect of
eliminating syncing data to stable storage on a server before returning
NFS_OK on a write undermine basic assumptions made by clients.

> > ...
> > The assumption that servers flush data to stable storage before returning
> > NFS_OK to the client has nothing do with client crashes () but has
> > everything to do with the implications of server crashes. By requiring
> > the server to write its data to stable storage, the client need not
> > concern itself with the current server state. On receiving an NFS_OK from
> > the server, the client is free reuse data buffers which held the data just
> > written. If the server crashes and returns (reboots), the client
> > will (in classic "hard" mounted situations) wait for the server to
> > return and continue where it left of. The server crash has not affected
> > the operation of clients. This is some of the behaviour usually implied
> > when people say "NFS is stateless".
> 
> No, the phrase "NFS is stateless" has been almost devoid of meaning for
> years, because it is confounded with the general notion of state, as in
> your paragraph above.

Yes, perhaps I should have moved up the lower paragraph. I understand
and accept the relativity of the term "stateless".

> > ["stateless" is a relative term--we're obviously talking about state
> > on a client in the form of buffers held for 30 seconds.  This is normal
> > "UNIX" buffering behaviour. There are other "stateless" design implications,
> > the other well-known one being the simple cache coherency strategy used
> > by NFS which results in checking the attributes of a file to validate
> > whether locally cached data in the client is still valid--that is,
> > in agreement with server data.
> 
> What has this to do with "statelessness"?  Please say what this "stateless"
> has to do with the differences between the NFS cache coherence mechanism
> and the coherency mechanisms in the distributed cache systems for files,
> RAM, host names, and toaster temperatures.

Ummmm... This was my small way of saying that a bald statement
of "NFS is stateless" is untrue, putting me in violent agreement with
your "relatively stateless" statement. 

There are a lot of interesting "state" thingies agreed to by the
clients and servers. File handles are agreed to "persist" over a
crash. (Is this in the specification?) The state describing a file
handle in UNIX is information on disk.
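
[As a concrete illustration of the "simple cache coherency strategy"
mentioned above--validating locally cached file data by re-checking the
file's attributes against the server--here is a minimal user-level sketch
in C.  The structures and names are hypothetical, not Sun's client code;
a real client caches attributes per vnode with tunable timeouts.]

#include <stdbool.h>
#include <time.h>

/* Hypothetical, simplified file attributes (think of the NFS fattr). */
struct nfs_attr {
    long   size;
    time_t mtime;
};

/* Client-side cache entry: the cached data plus the attributes that
 * were current when the data was fetched. */
struct cache_entry {
    struct nfs_attr attr;          /* attributes when data was cached */
    time_t          last_checked;  /* when we last asked the server   */
    bool            valid;
    char            data[8192];
};

/* Stand-in for an over-the-wire GETATTR; stubbed so the sketch
 * compiles on its own.  A real client issues an RPC here. */
static int nfs_getattr(const char *fh, struct nfs_attr *out)
{
    (void)fh;
    out->size = 0;
    out->mtime = 0;
    return 0;
}

/* Returns true if the locally cached data may still be used.  The
 * attribute-cache timeout only bounds how stale a client can be; it
 * does not give strict consistency between two active writers. */
bool cache_still_valid(struct cache_entry *ce, const char *fh,
                       time_t now, int ac_timeout)
{
    struct nfs_attr fresh;

    if (!ce->valid)
        return false;

    /* Inside the attribute-cache window: trust what we have. */
    if (now - ce->last_checked < ac_timeout)
        return true;

    /* Window expired: ask the server for the current attributes. */
    if (nfs_getattr(fh, &fresh) != 0)
        return false;
    ce->last_checked = now;

    /* If another client changed the file, our cached data is stale. */
    if (fresh.mtime != ce->attr.mtime || fresh.size != ce->attr.size) {
        ce->valid = false;          /* caller must refetch the data */
        ce->attr = fresh;
        return false;
    }
    return true;
}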

> >          ...                      Servers are also not without "state"--
> > servers typically employ a read-ahead strategy to improve performance--
> > however the key here is that such server state is not critical to proper
> > operation of NFS.]
> 
> Wrong.  Without a proper XID cache, an NFS filesystem is an unacceptably
> poor imitation of a UNIX filesystem.  Remember the problems at the
> Connectathon before last.

Yes, the XID Reply Cache is "highly recommended state":-) for a working
NFS server. Would you allow me to separate out the XID cache solution
from things like read-ahead, which are not required for "proper
operation"?

[Also, if you have implemented an NFS server, but don't know what
the XID cache is, look at the Usenix Winter 88 paper on "Improving
Correctness and Performance in an NFS File Server"--I believe that's
the title.]
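
[For readers who have not seen one: an XID (duplicate request) cache is,
in outline, a small table keyed by transaction ID, client address, and
procedure.  Retransmissions of requests still being executed are dropped,
and retransmissions of recently completed requests get the saved reply
replayed rather than being re-executed.  The sketch below uses
hypothetical structures and user-level C--it is not any vendor's kernel
code--but it shows why the cache matters: without it, a retransmitted
non-idempotent request such as REMOVE can be executed twice and return a
spurious error to the client.]

#include <stdint.h>
#include <string.h>

#define XID_CACHE_SIZE 64          /* real servers use a few hundred */

enum req_state { SLOT_FREE, SLOT_IN_PROGRESS, SLOT_DONE };

struct xid_entry {
    uint32_t        xid;           /* RPC transaction ID   */
    uint32_t        client_addr;   /* client address       */
    uint32_t        proc;          /* NFS procedure number */
    enum req_state  state;
    int             reply_len;
    char            saved_reply[512];
};

static struct xid_entry cache[XID_CACHE_SIZE];

static struct xid_entry *xid_lookup(uint32_t xid, uint32_t addr, uint32_t proc)
{
    int i;

    for (i = 0; i < XID_CACHE_SIZE; i++)
        if (cache[i].state != SLOT_FREE && cache[i].xid == xid &&
            cache[i].client_addr == addr && cache[i].proc == proc)
            return &cache[i];
    return 0;
}

/* Called when a request arrives.  Returns:
 *   0 - new request; a slot is reserved and the caller executes it
 *   1 - duplicate of a request still in progress; drop it silently
 *   2 - duplicate of a completed request; caller resends *reply / *len */
int xid_check(uint32_t xid, uint32_t addr, uint32_t proc,
              const char **reply, int *len)
{
    struct xid_entry *e = xid_lookup(xid, addr, proc);
    int i;

    if (e) {
        if (e->state == SLOT_IN_PROGRESS)
            return 1;
        *reply = e->saved_reply;
        *len   = e->reply_len;
        return 2;
    }
    /* New request: grab any slot that is not busy (crude replacement). */
    for (i = 0; i < XID_CACHE_SIZE; i++) {
        if (cache[i].state != SLOT_IN_PROGRESS) {
            memset(&cache[i], 0, sizeof cache[i]);
            cache[i].xid = xid;
            cache[i].client_addr = addr;
            cache[i].proc = proc;
            cache[i].state = SLOT_IN_PROGRESS;
            return 0;
        }
    }
    return 1;   /* every slot busy: drop it and let the client retry */
}

/* Called when the server has executed the request and built a reply. */
void xid_done(uint32_t xid, uint32_t addr, uint32_t proc,
              const char *reply, int len)
{
    struct xid_entry *e = xid_lookup(xid, addr, proc);

    if (e) {
        if (len > (int)sizeof e->saved_reply)
            len = (int)sizeof e->saved_reply;
        memcpy(e->saved_reply, reply, len);
        e->reply_len = len;
        e->state = SLOT_DONE;
    }
}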

> Please understand that I like NFS very much and stuff many megabytes thru
> NFS filesystems every day.  I think the trade-offs of Bob Lyon & co. were
> great and continue to be close to optimal.  Honesty conflicts with claims
> that NFS==UFS.  There are many common UNIX behaviors where NFS is a poor
> imitation of a Real Filesystem(tm).

Yes. Lyon & Co. made trade-offs. And yes, NFS is not a UNIX file system.
It comes close where it counts for most situations. And when it doesn't
satisfy your requirements (strict read/write consistency without
locking, for example, or append-mode writes, ???), then you have a problem.

[Vernon: Do you feel like posting an enumerated prioritized list of
missing features in NFS--with some measure of how important that feature
is? That should start an interesting discussion--I'd like to see it.]

> > The semantics of an NFS write are to preserve data in event of a server
> > crash (by requiring it to be on stable storage--static RAM or disk).
> > ...
> 
> > Suggestions on just allowing servers to return NFS_OK without flushing
> > to stable storage [as have been made in preceding e-mails]
> > are in some sense dangerous. Because all existing
> > clients are implemented under the assumption that NFS servers only
> > reply okay if the data is "safe". {Assuming you didn't just lose
> > the server disk you wrote to during the server crash.}
> 
> Exactly.  Life is "dangerous" and filled with disk crashes.

Yes, life is dangerous, but the fact that a "server crash" might mean you
lost your disk doesn't lead to the conclusion that you should implement
your NFS server such that it doesn't synchronize NFS writes to disk,
just because you might lose your disk anyway!!! I have this paranoia that
some people are making that leap in some of the discussions I've seen
and heard. I would postulate that most server crashes don't result
in lost disks, and that clients can continue once the machine comes
back on-line (if the server "dogmatically" flushed the data to
"stable" storage).

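[A minimal sketch, with hypothetical names, of why the client can simply
continue: on a "hard" mount the client retransmits the same write, with
the same XID, until the server answers, and only recycles its buffer when
it finally sees NFS_OK--which, under synchronous server writes, means the
data reached stable storage even if the server rebooted in the middle.]

#include <stdio.h>

#define NFS_OK 0

/* Stand-in for sending one WRITE RPC and waiting up to timeout_s
 * seconds for the reply; -1 means the call timed out.  Stubbed so
 * the sketch compiles on its own. */
static int nfs_write_rpc(const void *buf, int len, long offset, int timeout_s)
{
    (void)buf; (void)len; (void)offset; (void)timeout_s;
    return NFS_OK;
}

/* Hard-mount semantics: keep retrying forever, backing off a little,
 * until the server replies.  Only then is the buffer recycled. */
int write_block_hard(const void *buf, int len, long offset)
{
    int timeout = 1;

    for (;;) {
        int status = nfs_write_rpc(buf, len, offset, timeout);
        if (status >= 0)
            return status;          /* NFS_OK or a genuine NFS error */
        fprintf(stderr, "NFS server not responding, still trying\n");
        if (timeout < 60)
            timeout *= 2;           /* back off, but never give up   */
    }
}
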
> > ...
> > The semantics of "close" returning any asynchronous write errors
> > (in effect returning following the flush of data to stable storage
> > on the server) provide further guarantees to the application.
> > ...
> > The attempt is to eliminate insidious silent errors.
>  
> I understand guarantees as absolute, except where explicitly limited.  The
> Federal Government and the State of Calif agree with me.  If something is
> guaranteed to not lose data, then it better not.  The NFS server dogma does
> not provide a valid guarantee of preserving data, or of no silent errors.
> It only improves the likelihoods.  This is because there is no such thing
> as absolutely stable storage.  (As I write this, I'm restoring a crashed
> disk.)

I agree. I have eliminated "guaranteed" from my vocabulary. Guaranteed.

> In most UNIX systems, the server cache in DRAM is lost during a crash, disk
> sectors are usually not lost, and there is no third medium.  There are
> other possibilities.  In the 1960's I worked with "mainframes" (Kronos on
> 6000's) where you could push the reset button ("level 3 restart"), and not
> only have all active jobs resume, but where the contents of the RAM disk
> caches would be recovered.  Amdahl, Unisys, CDC, and IBM probably still
> have such features.  There are also systems where there are more than 2
> layers of storage.  Where would the NFS server dogma require that a system
> with "permanent" optical storage (whether modern WORM or ancient
> microfiche), behind slow disks, behind fast drums, behind bulk RAM, behind
> fast DRAM, behind SRAM cache preserve client data?  On the most stable,
> even if it takes minutes to write?

Have you or anyone ever seen NFS servers with "intelligent" caching
disk controllers create a "loss of data" problem?

At this point I'm wondering if you are advocating throwing away
the "requirement" for a server to flush to "stable" storage? Are you?

> > Stable storage caching (static RAM techniques) on the server accelerates
> > client applications OVERALL because latency on NFS write requests
> > is reduced (as read-ahead techniques reduce latency by eliminating
> > synchronous disk access, so writing to static RAM reduces latency
> > by eliminating synchronous disk write activity). The key point
> > here is that no one particular application's write performance
> > is improved, but the OVERALL NFS client's performance is improved
> > (thereby improving all applications).
> > ...
> 
> This is a strange statement.  We found years ago that violating the NFS
> cache dogma improved the numbers on many NFS benchmarks, from the Sun test
> suite to many other benchmarks by 50%.  (Yes, fellow Connectathon
> attendees, that is one of our secrets, now disclosed in an /etc/exports
> option.)

Maybe I'm being misunderstood. Let me try again. Using static RAM as "fast"
stable storage buffering the disk enables an NFS server to speed
up writes while providing the same level of "assurance" to the NFS
client on the subject of data persistence over a server crash. Violating
the "NFS cache dogma" would increase server write performance
in the same way, but with an increase in the probability of lost data
if the server crashed. The critical question is: how much does the
probability of failure--lost data--increase? Which then leaves you with the
decision: "How lucky do I feel, given these probabilities?"
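
[A minimal sketch of such a write path, with made-up names--not
Prestoserve's or any vendor's actual code: the reply goes out as soon as
the data sits in battery-backed RAM, and a background flusher moves it to
disk later.  After a crash, boot code would replay any slots still marked
in use.]

#include <string.h>

#define NFS_OK      0
#define NVRAM_SLOTS 128

/* One staged write.  Pretend this array lives in battery-backed RAM:
 * as long as the battery holds, acknowledging the client right after
 * the copy below gives the same persistence promise as a synchronous
 * disk write, only with much lower latency. */
struct nvram_slot {
    int   in_use;
    long  offset;
    int   len;
    char  data[8192];
};

static struct nvram_slot nvram[NVRAM_SLOTS];

/* Stub for the eventual (synchronous) disk write. */
static void disk_write(long offset, const void *buf, int len)
{
    (void)offset; (void)buf; (void)len;
}

/* Server-side WRITE handler: stage into "stable" RAM, then reply. */
int nfs_write_handler(long offset, const void *buf, int len)
{
    int i;

    for (i = 0; i < NVRAM_SLOTS; i++) {
        if (!nvram[i].in_use) {
            if (len > (int)sizeof nvram[i].data)
                len = (int)sizeof nvram[i].data;
            nvram[i].offset = offset;
            nvram[i].len    = len;
            memcpy(nvram[i].data, buf, len);
            nvram[i].in_use = 1;     /* data is now "stable"          */
            return NFS_OK;           /* safe to acknowledge the write */
        }
    }
    disk_write(offset, buf, len);    /* NVRAM full: fall back to disk */
    return NFS_OK;
}

/* Background flusher: drain staged writes to disk and free the slots. */
void nvram_flush(void)
{
    int i;

    for (i = 0; i < NVRAM_SLOTS; i++) {
        if (nvram[i].in_use) {
            disk_write(nvram[i].offset, nvram[i].data, nvram[i].len);
            nvram[i].in_use = 0;
        }
    }
}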

Do you mean that SGI was doing this silently (not requiring syncs
to disk) and has now made it an external option? What's the
default? Can you send me the man page describing this option?
You firmly believe this "flush to stable storage" requirement is in 
the realm of dogma?

> It would be less dogmatic to say that when a server returns NFS_OK, it is
> saying that the MTBF of the place containing the client's data is greater
> than XXX, where the MTBF includes all possibilities of failure from power
> to earthquake to kernel bug.  
> 
> The NFS protocol should dictate the external characteristics of the server
> file system, not its internal implementation.  Whether the server flushes
> to disk is an internal implementation issue.

Actually, since we're being honest, we both know the NFS protocol
specification is none too clear on these issues. For instance, the XID 
Reply cache is not specified, whereas you imply that it is a necessary
component of an NFS server implementation (and I would not disagree).
The protocol specification dictates pretty straightforward
external characteristics.

Perhaps I should add for the interested reader that most of what we're
discussing (XID cache, consistency semantics, and other "implementation"
details) is not called out in the protocol specification, but is merely
an aspect of particular implementations. The real world intrudes
here. A lot of practical knowledge is exchanged at Connectathon every
year on how to improve implementations. [I'm of the school that
no specification eliminates the need for interoperability testing.
I think Connectathon is one of the very good things done in the
NFS community.]

> Rational customers buy solutions to problems.  They don't care about
> violations of dogma.  They only want an appropriate engineering solution to
> preserving their data.  They don't care whether server buffers are flushed
> to disk.  They care only that data are sufficiently rarely lost.

Agreed. How does your company ensure (Ah! he artfully avoids the
contentious word "guarantee") "that data are sufficiently rarely lost"?
What was the "appropriate engineering solution to preserving their
data" that was added when the requirement for synchronous writes was dropped?

> I was not present when the NFS cache dogma was graven in stone, but I
> wonder if it was not mostly a statement about the lack of reliability of
> NFS servers of the time (i.e. 68010 UNIX systems in 1984).

Is your basis simply then that today servers are more reliable, and that
in practice this is not a problem? Is server reliability the critical
factor or are external factors like power outages, errant flipping
of power switches, etc. significant? I would assume that disk MTBF's
were much greater than server MTBF's, and synchronous writes exploit
this.
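
[For a back-of-envelope feel for the relative risk--the numbers below are
placeholders, not measurements from anyone's servers--the exposure with
fully asynchronous server writes is roughly the flush window divided by
the server MTBF, times the write load:]

#include <stdio.h>

int main(void)
{
    /* Placeholder assumptions; substitute your own measurements. */
    double mtbf_hours     = 30.0 * 24.0;   /* server crashes ~monthly  */
    double flush_window_s = 30.0;          /* async data exposed ~30 s */
    double writes_per_day = 100000.0;      /* aggregate NFS write load */

    double mtbf_s = mtbf_hours * 3600.0;

    /* Chance that a crash lands inside the window following any one
     * acknowledged-but-unflushed write. */
    double p_per_write = flush_window_s / mtbf_s;

    /* Expected acknowledged-then-lost writes per day.  With synchronous
     * (or NVRAM-backed) writes this term is ~0; what remains in both
     * cases is the shared exposure to disk failure. */
    double lost_per_day = p_per_write * writes_per_day;

    printf("P(loss per async write) ~ %.2e\n", p_per_write);
    printf("expected lost writes/day ~ %.2f\n", lost_per_day);
    return 0;
}

[With those made-up numbers the purely asynchronous server silently drops
on the order of one acknowledged write per day; tighten the crash rate or
the flush window and the figure moves accordingly.]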

> The NFS cache dogma does solve problems, but those problems are of people
> selling things, not of people building or buying things.

Wow. Wow again. I'm thinking about what everyone is selling (including
you). Forget absolute failure probabilities... Do you have a relative
probability of lost data between flushing to disk and not flushing
to disk on a server before responding to the client? Or any failure data?
This has obviously been (and seems to be a growing) contentious
point between "strict" (you would say "dogmatic") NFS implementations
and "loose" (would you say "enlightened"? :-) implementations. Feedback
on how small (or non-existent) a problem this is would be of great interest
to me. And to others, as this seems to be an increasingly polarizing issue.

> Vernon Schryver
> Silicon Graphics
> vjs@sgi.com

Brian Pawlowski
Sun Microsystems
beepy@eng.sun.com

vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (10/21/90)

In article beepy@ennoyab.Eng.Sun.COM (Brian Pawlowski) writes:
> ...

We agree:
  -"NFS is stateless" is not a technical statement, but describes the
    general philosophy used by the NFS designers.
  -NFS!=UFS.
  -the "NFS server cache dogma" increases the reliability of the system,
    and is consistent with the stateless philosophy.


>                           ...           The need for an "XID cache"
> addresses a "bug" in the protocol.

I may disagree.  The external behavior produced by an XID cache should have
been specified in the beginning.  It is required by real world networks.

>                                    Suggestions to the effect of
> eliminating syncing of data to stable storage on a server before returning
> NFS_OK on a write undermine basic assumptions made by clients.

No, the synchronous server data cache is an implementation of "safe" or
high-MTBF server data storage.  What if I build a server with non-volatile
RAM (e.g. a 180-day UPS on a Sun), put cache blocks in a reserved part of
RAM, have the bootstrap code for UNIX and the diagnostics discover and
preserve all valid cache blocks, and operate in the evil async mode?
(Similarities to Prestoserve are unintended and inevitable.)  If
Prestoserve is OK, then so is this, or any other scheme with the same MTBF.

> There are a lot of interesting "state" thingies agreed to by the
> clients and servers. File handles are agreed to "persist" over a
> crash.

Good point.

> [Vernon: Do you feel like posting an enumerated prioritized list of
> missing features in NFS....

The UNIX file system is not holy.  The features NFS lacks are irrelevant,
except where users need them.  NFS needs about 6 things, including some
kind of cache operation like that discussed recently, some open-unlink
support, and a few others that our local NFS Master chants under his
breath.  The new protocol had almost all of them 2 years ago.  It's too bad
it went non-linear.

>    ... I would postulate that most server crashes don't result
> in lost disks, and that ...[sync writes work]...

Yes, synchronous operation is a good BruteForceAndIgnorance implementation
of what should be the protocol requirement.  (I like BF&I--on the first cut.)

> Have you or anyone ever seen NFS servers with "intelligent" caching
> disk controllers create a "loss of data" problem?

Good point.  I've heard rumors, but not seen anything.   (We limit
controller caches--sync. writes wait.)

> At this point I'm wondering if you are advocating throwing away
> the "requirement" for a server to flush to "stable" storage? Are you?

Yes, the stated requirement is bogus.  Pick whatever MTBF or equivalent
you wish, but please stop using an implementation to describe a protocol.
(Yes, sometimes an implementation is the best spec, but only if you can
and do say which characteristics you really care about.)

> ...
> You firmly believe this "flush to stable storage" requirement is in 
> the realm of dogma?

Yes.  It is a taboo or folk medicine like quinine water or willow bark.  It
is ok, if we have rationally chosen it instead of its equivalents, or if we
don't understand it.  Since we all understand it, let's replace the taboo
with engineering requirements.

Actually, "flush to stable storage" would be ok, if people would not
keep reading it as "call bwrite()," and if it were quantitative.

> ...
> The protocol specification dictates pretty straightforward
> external characteristics.

The protocol dictates many external characteristics in terms of an
implementation.  The only complete protocol spec comes on the Sun reference
tapes.  Still, I much prefer the NFS protocol spec-tape to the ANSI/IEEE
paper swill I've been fighting lately.

> > I wonder if it was not mostly a statement about the lack of reliability of
> > NFS servers of the time (i.e. 68010 UNIX systems in 1984).
> 
> Is your basis simply then that today servers are more reliable, and that
> in practice this is not a problem? Is server reliability the critical
> factor or are external factors like power outages, errant flipping
> of power switches, etc. significant? I would assume that disk MTBF's
> were much greater than server MTBF's, and synchronous writes exploit
> this.

Yes, careful operators, solid hardware, a UPS, and sufficiently bug-free
software are more important and effective than synchronous writes ever
were.  People do more damage to files with keyboards than with power
switches.  The standard lightning drill has always been to hit the switch
to keep power off to protect disks from flickers and surges.  The servers I
use have reasonably balanced MTBF's.  The big NFS servers I know about (all
source for everything from forever on racks of GB drives) stay up for
months, and suffer disk problems as often as all others.

> > The NFS cache dogma does solve problems, but those problems are of people
> > selling things, not of people building or buying things.
> 
> Wow. Wow again. I'm thinking about what everyone is selling (including
> you). 

I was referring to "selling" as in "the marketing department," not selling
as in verbally counting coup.  I'm not selling because I won't get any $ if
I convince you--I might get less because your boxes would be better.

>     ...    this seems to be an increasingly polarizing issue.

In my personal experience it has been very controversial since 1985.  Until
Prestoserve broke the ice, 15% of the NFS vendors had been hiding in the
closet.  It's just that now we're "coming out."


Vernon Schryver,    vjs@sgi.com

thurlow@convex.com (Robert Thurlow) (10/22/90)

In <143983@sun.Eng.Sun.COM> beepy@ennoyab.Eng.Sun.COM (Brian Pawlowski) writes:
>[Vernon: Do you feel like posting an enumerated prioritized list of
>missing features in NFS--with some measure of how important that feature
>is? That should start an interesting discussion--I'd like to see it.]

I'll bite.  I see these as problems that needed to be solved a year ago
in a minor protocol revision almost orthogonal to NeFS:

- There is nothing in the mount protocol to carry any filesystem info
  beyond the filehandle.  For example, this prevents me from warning my
  client user that the filesystem was exported read-only at mount time
  or that the filesystem needs to be accessed via the Secure protocol.
  AUTH_KERB will make this worse.

- access(2) badly needs to go over-the-wire.  "Guessing" permissions
  from attributes is completely inadequate when we have root access
  issues, UID/GID mapping issues with Secure NFS, access control lists,
  and other gremlins.  We have a perfectly good file server; why not
  ask the owner of the resource?

- It should be possible to do exclusive file creates over the wire.
  With even a bit flag, the server would be able to do the work which
  the client now has to make vague stabs at.

What hurts is that I know exactly how to implement each of these, but
can't do so until Sun blesses a protocol revision.  I found NeFS very
interesting, but would really like a way to fix some things *now*.
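
[To make the exclusive-create point concrete: the "vague stabs" a client
can make today amount to something like the sketch below (hypothetical
helper names).  A LOOKUP followed by a CREATE leaves a window in which
another client can create the same name, so two O_EXCL opens can both
appear to succeed.  A single flag in the CREATE request would let the
server do the test-and-create atomically and hand EEXIST to the loser.]

#include <errno.h>
#include <stdbool.h>

/* Stand-ins for the over-the-wire LOOKUP and CREATE calls; stubbed
 * so the sketch compiles on its own. */
static bool nfs_lookup(const char *dir_fh, const char *name)
{
    (void)dir_fh; (void)name;
    return false;                   /* pretend the name doesn't exist */
}

static int nfs_create(const char *dir_fh, const char *name)
{
    (void)dir_fh; (void)name;
    return 0;
}

/* How a client approximates open(name, O_CREAT|O_EXCL) today:
 * check, then create.  Between the two RPCs another client may
 * create the same name, and both callers believe they "won". */
int client_exclusive_create(const char *dir_fh, const char *name)
{
    if (nfs_lookup(dir_fh, name))    /* does it exist already?          */
        return -EEXIST;

    /* ---- race window: someone else can create 'name' here ----      */

    return nfs_create(dir_fh, name); /* may silently clobber the winner */
}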

#include <osi/std/disclaimer>
Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."

vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (10/22/90)

[one characteristic of news is that the long winded cannot see the audience
muttering "shut up, already!"   Sorry about this.]

Some have written of paranoia about silently losing data to a server that
crashes after returning NFS_OK and before flushing data to disk.  That
worry must be kept in perspective.  We must quantify the probabilities of
many failures, and act rationally.
 -until recently, many vendors, including almost all of those who now run
    servers synchronously, ran without UDP checksums.  I know of mutual
    customers outraged by that, because they traced silent errors to missing
    UDP checksums.
 -many of us use VME or other busses, which do not have parity, and so
    will occasionally have undetected data corruption.
 -many systems have only byte-parity on RAM, so two cosmic rays in the
    same byte will cause silent corruption.  The rest have ECC, which does
    not detect all soft errors.
 -using 1500 byte ethernet blocks, even with UDP checksums, increases the
    likelihood of undetected errors compared to using 64 byte blocks by >
    30 times.  Recall that one of the determinants of the maximum FDDI
    block size was the probability of an undetected error given 500
    stations and the limits on LER.  We could substantially improve the
    calculated reliability of NFS transmissions by using small blocks.  Why
    doesn't the protocol require 64 byte blocks?  Why is everyone using 4KB
    FDDI blocks, with the same old 32-bit FCS?
 -all systems have zillions of circuits that are known to have "metastable"
    or "resolver" failures--that is, we know hardware will sometimes decide
    your bit was 1 or 0 when it was really a 0 or 1.  We all try to choose
    things so the rate of such failures is low compared to any other.


 -in the average crash, the update daemon would have you lose changes of the
    last 15 seconds.  The more modern bdflush comes closer to losing no work,
    since it tries to keep the disk continually busy flushing blocks.
 -the early 1980's decision to run without UDP checksums, but to run
    servers synchronously says volumes about the relative probabilities
    of server crashes and network corruption then.  My recollections are that
    a server that stayed up for days was a wonder.  The market requires
    and many of us deliver a different order of server reliability today.
    That Sun now runs with UDP checksums turned on says volumes about
    the low relative probability of Sun server failure today.
 -is it the sticky bit that Sun has used for the last 3 years to tell
    the server that a file should be written asynchronously?  I remember
    hearing about it at Connectathon-before-last.


Vernon Schryver,    vjs@sgi.com

geoff@bodleian.East.Sun.COM (Geoff Arnold @ Sun BOS - R.H. coast near the top) (10/22/90)

Quoth thurlow@convex.com (Robert Thurlow) (in <thurlow.656467376@convex.convex.com>):
#>Do users really want a MIPS-like export option that says "don't do sync 
#>writes"?  (Note that these async writes are different that those mentioned 
#>above.  Now I'm talking about the possibility that data will be lost 
#>on a server crash.)  The only reason I can think of having this feature 
#>is for truly wondrous benchmark results that you can wave in a 
#>customer's face.
#
#Remember how hard a diskless node hits its NFS-mounted swap device -
#I've read numbers akin to five writes to every read from /export/swap.
#If I'm the sort of user who doesn't run long-lived batch jobs from my
#workstation, I might enjoy the performance edge I gain with async I/O
#without minding the cost of rebooting most or all the time when my
#boot server crashes.

Simple-minded question: if you want to introduce this best-effort
behaviour, why not do so on the client? I know it makes the
implementation simpler to shove the responsibility to the server, but it's
really a client issue: only the client knows whether or not data is
"precious" (to borrow an earlier attribute). It's pretty convoluted to
force the server to create and export a filesystem with your whizzy
"don't sync" attribute to solve this.

The other reason not to fix this "problem" on the server is that it's
so damned unilateral! Suppose some zealous system administrator decides
to earn himself a bonus by "improving" the network performance by
enabling your attribute on all file systems.  Most users may say "sure,
it's probably a good tradeoff, so go for it." The upshot is that it
becomes impossible to reliably fsync a file, even when you want to! If
you argue that this simply points up the need for a protocol revision,
I won't disagree on the desirability of revising and fixing the
protocol, but I would point out that this particular problem arises
from solving the asynchrony issue on the server rather than the client,
and is therefore pretty bogus...

Bottom line: given the choice between changing the protocol and
its well-known semantics or changing an implementation of that
protocol, I'd rather change the implementation. The _right_
implementation, of course!

-- Geoff Arnold, PC-NFS architect, Sun Microsystems. (geoff@East.Sun.COM)   --
   *** "Now is no time to speculate or hypothecate, but rather a time ***
   *** for action, or at least not a time to rule it out, though not  ***
   *** necessarily a time to rule it in, either." - George Bush       ***

dricejb@drilex.UUCP (Craig Jackson drilex1) (10/23/90)

In the discussion of how async writes on the server are not correct NFS,
and how sync writes to a controller with battery-backed RAM cache are OK
NFS, isn't this another example of relativity?  What if the battery goes
dead before the power is restored?  On the other hand, what if the server is
on a UPS?  Would it then be OK to do async writes?
-- 
Craig Jackson
dricejb@drilex.dri.mgh.com
{bbn,axiom,redsox,atexnet,ka3ovk}!drilex!{dricej,dricejb}

thurlow@convex.com (Robert Thurlow) (10/24/90)

In <3012@jaytee.East.Sun.COM> geoff@bodleian.East.Sun.COM (Geoff Arnold @ Sun BOS - R.H. coast near the top) writes:

>Simple-minded question: if you want to introduce this best-effort
>behaviour, why not do so on the client?

Well, to a degree, we already have some control of this.  I can set up
to use synchronous writes through the buffer cache if I have a process
that wants to see write(2) fail unless the server commits, or control
timeouts on my attribute cache to force more consistency checks.  But
it's true that there's no per-filesystem control on the client.  I'm
not sure this is really important.

>it's really a client issue: only the client knows whether or not data is
>"precious" (to borrow an earlier attribute).

I don't agree at all.  The server is providing the storage, and the
sysadmin of the server machine is the person best equipped to judge
its overall value; after all, he has to do the backups!  I think
individual client processes have a need to secure data, and that the
server is the machine that really needs per-filesystem control over
flushing policy.  More client control in the form of a mount option
would be welcome if it was easy to implement, though.

>Bottom line: given the choice between changing the protocol and
>its well-known semantics or changing an implementation of that
>protocol, I'd rather change the implementation.

I agree with the caveat that we need to tweak a few specific things
ASAP without getting sidetracked by PostScript-based protocols.

#include <osi/std/disclaimer>
Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."

ddp+@andrew.cmu.edu (Drew Daniel Perkins) (10/25/90)

> Excerpts from netnews.comp.protocols.nfs: 20-Oct-90 Re: NFS writes and
> fsync(). Brian Pawlowski@ennoyab. (11773)

> > One of the problems with NFS is that it manages to tangle up several
> > different issues, which makes it hard to solve one without breaking
> > something else.

> Perhaps because the issues are related. Sprite cache strategies
> consider data consistency issues, as does AFS 4.0, because
> aggressive cache coherency strategies must make concessions
> to data consistency (cooperation amongst clients).


One area it definitely tangled up was transport vs. session protocol
layering.  I'm particularly referring to mixing "I acknowledge receipt
of your data" (a transport function) with "here are the results of your
request" (arguably a session function).  Retransmissions shouldn't occur
because a request simply hasn't finished yet.  If these issues weren't
tangled, then one function of the Chet cache (tossing out requests which
are already in progress) wouldn't have been necessary.


>   In the normal case, I/O is handled asynchronously subject to
>   the normal update syncs.  I would not consider this synchronous to the
>   application! There are cases in NFS when I/O is synchronous (to the
>   application), however:

> 	- Any time an fsync is done by the application
> 	- For the remaining life of the file descriptor once a lock
> 	  has been applied to it (even if all locks are then cleared)

You do mention it later, but I believe that "Any time a close is done by
the application" is an important enough case of synchronous application
behaviour that it should be elevated in importance and mentioned here.

Drew