peter@objy.objy.com (Peter Moore) (10/10/90)
My apologies if this is FAQ material, but I just have to know.  There is
a (somewhat) new product out that is supposed to speed up NFS by caching
NFS writes in a (somewhat) non-volatile cache.  The advertisement states
that the NFS protocol requires that NFS-write requests not return until
the data has actually been written to disk, i.e. only after the
equivalent of a UNIX fsync() on the file.  Since this is an EXPENSIVE
operation, they get a significant speedup of NFS by keeping a write-back
cache with some sort of battery backup, and just lying to the kernel
that the data has been synced.

Given that the above description is in roughly the same neighborhood of
the truth (it has come through a noisy channel, me), I have to know:
WHY ON EARTH DOES NFS REQUIRE THE FSYNC ON WRITES?  Without that
requirement, we could get the effect of this cache board by just not
calling fsync().

Now whenever I see something ugly in NFS, it usually comes from the
stateless requirement.  But the only state-dependent reason I can see
is:

	Process P on machine A writes to machine B
	machine B crashes before the write is synced to disk
	machine B recovers
	Process P does another write, dependent on the first succeeding

This second write does succeed, and now we have inconsistent data.  But
in real life, I have seen situations vaguely like this, and the writing
process gets a `stale NFS handle' error.  So it seems that at least the
NFS implementations I have run into have that much state.

So what am I missing?

	Peter Moore
	peter@objy.com
thurlow@convex.com (Robert Thurlow) (10/12/90)
In <1990Oct9.152612@objy.objy.com> peter@objy.objy.com (Peter Moore) writes:

>WHY ON EARTH DOES NFS REQUIRE THE FSYNC ON WRITES?  Without that
>requirement, we could get the effect of this cache board by just not
>calling fsync().

No, you couldn't.  The cache board for PCs that I know about is a nice
unit that essentially promises you the data won't go away, and keeps it
in battery-backed memory to ensure it.  That's important, since once
the write request is acknowledged, the client will not try the write
again, and may discard its copy of the data.  You can easily lose data
when the server goes down without having synced it.  Usually, too, what
waits for the acknowledgement is a block I/O daemon (biod) that will
handle your async writes for you; your process has to wait for all I/O
only when it does an fsync() or a close(), though aggregate throughput
is reduced.

I think most people would agree that the default behaviour should be to
make writes reliable, since that provides the semantics of a local
filesystem.  You are more free to buy extra throughput by upgrading the
server disk or CPU than you are to buy more reliability.

That said, I'll add that we do provide an export option to allow you to
tell the server to acknowledge the write request immediately upon
receipt, and spool the request to its local I/O subsystem.  It can help
performance a good bit if you don't mind the risks.  It's great for
filesystems all clients mount with -soft; their processes will be gone
after a server reboot, anyway.

>Now whenever I see something ugly in NFS, it usually comes from the
>stateless requirement.  But the only state-dependent reason I can see
>is:
>	Process P on machine A writes to machine B
>	machine B crashes before the write is synced to disk

Stop right there.  Your 'disk' has just lost data, period.  Do you
expect your local disk to ever do that?  The effects could be very
devastating, depending on what exactly cared about the data.  Think of
the havoc you could wreak on a database server.

>But in real life, I have seen situations vaguely like this, and the
>writing process gets a `stale NFS handle' error.  So it seems that at
>least the NFS implementations I have run into have that much state.

ESTALE only happens when the server can't find anything matching the
file handle on its disks, and usually happens when some other process
did a creat() or an unlink(), or the server filesystems got mounted in
a different order.  I don't see the connection here.

Hope that helped,

Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"I was so much older then, I'm younger than that now."
peter@prefect.uucp (Peter Moore) (10/14/90)
> I think most people would agree that the default behaviour should be to
> make writes reliable, since that provides the semantics of a local
> filesystem.

But that isn't the semantics of a local write.  See below.

> Stop right there.  Your 'disk' has just lost data, period.  Do you
> expect your local disk to ever do that?

Yes, I do expect my local disk to do that.  As you sort of mention,
local writes under almost all Unix systems are asynchronous.  The write
returns immediately, but the data stays in the buffer pool until either
it is pushed out to make room for more active pages or until sync() is
called.  Typically sync() is called every 30 seconds by the update
daemon.  So you have no guarantee that your last 30 seconds of local
I/O ever made it to disk unless you explicitly do a sync.  And if
something does go wrong (an unrecoverable bad block, a drive off line,
or a full crash during the 30-second period), there is no way to signal
back to you that it failed.  Heck, your process could have exited
before the write failed.

> The effects could be very
> devastating, depending on what exactly cared about the data.  Think
> of the havoc you could wreak on a database server.

But, as I pointed out, this effect can happen on local writes too.
That is why any `database-like' application must explicitly call
fsync() if it wishes to guarantee that pages have made it to disk.  No
recoverable system can depend on write() alone when writing to local
disk.  So synchronous NFS isn't helpful to the database people, at
least for that reason.  They are already doing the right thing with
explicit syncs just to make it work locally.

This is why synchronous NFS writes seem unmotivated to me.  It is MORE
synchronous than local Unix I/O (assuming that network latency is a lot
less than 30 seconds).  Why pay such a cost to make it MORE synchronous
than we are already willing to live with on Unix?
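[The explicit-sync discipline described above can be sketched in C.
This is a toy illustration only; the function write_reliably and its
error handling are invented for this sketch, not taken from any of the
systems under discussion.]

```c
/* Sketch: the write discipline a careful `database-like' application
 * needs even on a local Unix disk.  A successful write() only means
 * the data reached the buffer cache; fsync() is what forces it to
 * stable storage, and both fsync() and close() must be checked. */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

int write_reliably(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t done = 0;
    while (done < len) {                /* write() may be partial */
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0) {
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }

    if (fsync(fd) < 0) {                /* force data to stable storage */
        close(fd);
        return -1;
    }
    return close(fd);                   /* deferred errors show up here */
}
```

An application that follows this pattern knows, when the call returns
zero, that its data survived; one that only checks write() does not.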
The strongest justification I have been able to synthesize from various
sources is basically the original one in my article:

	1) Process P on machine L writes to file F on machine R
	2) R crashes before syncing the changes to disk and either:
	   a) recovers before P does any more writes to F, or
	   b) crashes after P has finished with F.
	3) P continues on blithely unaware that the writes to F failed
	   and produces some data.

The reason this scenario is seemingly unique to NFS is that most
failures of the local machine to write also involve the local machine
crashing, so P couldn't continue after the write really failed.
(Actually this is still possible with local writes: P could either exit
before the crash, or the write error could be a softer error, perhaps a
bad block, that didn't force the system to crash.  I think it is even
possible to have later I/Os make it to disk, but earlier I/Os not.  But
I do admit this scenario is more likely in the NFS case.)

I have some hand-wavey counter-arguments to the above scenario:

1) Only relatively naive programs will run into this.  Reliable
programs will already be doing fsync at the proper times.  So as long
as an asynchronous form of NFS implements fsync, they will be all
right.

2) Only long-lived programs are vulnerable (at least more vulnerable
than with local writes).  If a process takes much less than 30 seconds
to run, then it is very unlikely that the process will actually be
killed by a machine crash that wipes out its buffered writes.  So if
the process worked with local writes, it must either be calling fsync,
or already be willing to live with not knowing if its I/O made it to
disk.

3) This isn't all that likely.  For case 2a) you need R to crash and
recover before P does any more I/O to F.  Considering that big servers
I have worked with can take over half an hour to reboot, that is a very
wide window to miss.  And for case 2b) R has to crash within 30 seconds
of the very last I/O to F; no sooner, no later.

4) Kludges can be added.
If NFS handles were invalidated across reboots (perhaps by including a
byte computed from the boot time in the handle), then at least 2a)
would be impossible without an explicit reopen of F.  More complicated
support from the local OS could probably even make reopens of F by P
fail (though the vague implementations I can think of are unacceptably
kludgey).

Now none of these arguments is overwhelming, but they do add up.  I am
not trying to argue that NO one needs or wants synchronous NFS.  I am
arguing that not everybody does (and I believe, but can't defend better
than the above, that MOST people don't need it).

> I'll add that we do provide an export option to allow you to tell the
> server to acknowledge the write request immediately upon receipt, and
> spool the request to its local I/O subsystem.  It can help performance
> a good bit if you don't mind the risks.  It's great for filesystems all
> clients mount with -soft; their processes will be gone after a server
> reboot, anyway.

This is exactly the sort of thing I want.  Now I just need it on all my
machines as an option.  Guy Harris mentioned in an email thread of this
conversation that an asynchronous extension of NFS has been considered.
This would seem the best path, allowing some protocol to negotiate
whether the NFS connection will be synced (`nfsmount -o async'
perhaps?).  This could be a big win for a lot of installations.  (Maybe
even linking a RISC application over NFS could finally take a finite
amount of time.)

	Peter Moore
	peter@objy.com

P.S.  While I don't know if they even want to be associated with this
argument, I would like to thank Craig Everhart, Carl Smith, and Guy
Harris for having the patience to discuss much of the above with me in
email conversations.
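[The boot-time byte suggested above might look something like the toy
model below.  This is purely illustrative; the structure, field names,
and functions are all invented for this sketch and bear no relation to
any real file-handle format.]

```c
/* Toy model of the "invalidate handles across reboots" kludge: fold a
 * generation stamp derived from the boot time into each handle, and
 * have the server reject handles minted before its last boot.  Note a
 * single byte can collide if two boot times share a low byte. */
#include <assert.h>
#include <time.h>

struct toy_fhandle {
    unsigned long fileid;       /* stands in for the real handle data */
    unsigned char boot_gen;     /* low byte of the server's boot time */
};

static unsigned char server_boot_gen;   /* set once at "boot" */

void toy_server_boot(time_t boot_time)
{
    server_boot_gen = (unsigned char)(boot_time & 0xff);
}

struct toy_fhandle toy_make_handle(unsigned long fileid)
{
    struct toy_fhandle fh = { fileid, server_boot_gen };
    return fh;
}

/* Returns nonzero (ESTALE-like) if the handle predates the last boot. */
int toy_handle_stale(struct toy_fhandle fh)
{
    return fh.boot_gen != server_boot_gen;
}
```

After a simulated reboot, any handle made before the crash fails the
check, which is exactly what forces the explicit reopen of F in case
2a) above.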
thurlow@convex.com (Robert Thurlow) (10/14/90)
In <1990Oct14.082712.10811@objy.com> peter@prefect.uucp (Peter Moore) writes:

>> Stop right there.  Your 'disk' has just lost data, period.  Do you
>> expect your local disk to ever do that?

>Yes, I do expect my local disk to do that.  As you sort of mention,
>local writes under almost all Unix systems are asynchronous.  The
>write returns immediately, but the data stays in the buffer pool until
>either it is pushed out to make room for more active pages or until
>sync() is called.  Typically sync() is called every 30 seconds by the
>update daemon.  So you have no guarantee that your last 30 seconds of
>local I/O ever made it to disk unless you explicitly do a sync.  And
>if something does go wrong (an unrecoverable bad block, a drive off
>line, or a full crash during the 30-second period), there is no way to
>signal back to you that it failed.  Heck, your process could have
>exited before the write failed.

I agree that write(2) won't return you an error in general, but
processes can, at any point they wish, call fsync() to ensure the data
is secured.  That ability is lost if the server is acknowledging only
the receipt of the request.  close(2) will fail if you can't secure
your writes, as well, though people are very poor at paying attention
to the return code.  If you throw away this ability for a process to
get an accurate indication of success, you've definitely made it
impossible to trust database I/O over NFS.  You've also lost the
ability to choose synchronous writes.

>> The effects could be very
>> devastating, depending on what exactly cared about the data.  Think
>> of the havoc you could wreak on a database server.

>But, as I pointed out, this effect can happen on local writes too.
>That is why any `database-like' application must explicitly call
>fsync() if it wishes to guarantee that pages have made it to disk.  No
>recoverable system can depend on write() alone when writing to local
>disk.  So synchronous NFS isn't helpful to the database people, at
>least for that reason.  They are already doing the right thing with
>explicit syncs just to make it work locally.

This is the problem: after a server says "Yo!", your client need never
write that data again, FSYNC() OR NOT, because it "trusts" the server
and can in no way tell that it should not.  The block may be in your
buffer cache, but its I/O is marked complete; fsync() will lie, likely
without even going over the wire.  You just can't trust an fsync()
anymore, period.

>This is why synchronous NFS writes seem unmotivated to me.  It is MORE
>synchronous than local Unix I/O (assuming that network latency is a
>lot less than 30 seconds).  Why pay such a cost to make it MORE
>synchronous than we are already willing to live with on Unix?

Networks go down a _lot_ more often than local disks in my experience.
People kick out cables, router boxes fail, network adaptors hang,
machines get powered down, and of course, the local server disk can
fail :-)

>Now none of these arguments is overwhelming, but they do add up.  I am
>not trying to argue that NO one needs or wants synchronous NFS.  I am
>arguing that not everybody does (and I believe, but can't defend
>better than the above, that MOST people don't need it).

I like the idea of having an option.  But I'm sufficiently convinced
that it should be the default.

>> <stuff about the -async option>

>This is exactly the sort of thing I want.  Now I just need it on all
>my machines as an option.

You're right; it's a real issue getting new features out across the
installed base so that everyone can count on them; Brian Pawlowski of
Sun underlined it in his talk at the ONC/NFS Industry Networking
Conference last week.  I think we have to try to cut time-to-market for
new functionality and get this stuff out there faster.
Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."
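[Rob's point that fsync() will lie once the server has acknowledged the
write can be made concrete with a toy model.  The structure and
function names below are invented for illustration and have no relation
to any real NFS client code.]

```c
/* Toy model of why an early server ack poisons fsync(): the client
 * marks a buffer clean as soon as the server acknowledges the write,
 * so a later fsync() sees nothing dirty and reports success even if
 * the data never reached the server's disk. */
#include <assert.h>

struct toy_buf {
    int dirty;              /* needs to go to the server */
    int on_server_disk;     /* did it really reach stable storage? */
};

/* Server acks; data_reached_disk is 0 if the ack was premature. */
void toy_server_ack(struct toy_buf *b, int data_reached_disk)
{
    b->dirty = 0;           /* client trusts the ack unconditionally */
    b->on_server_disk = data_reached_disk;
}

/* fsync() only pushes dirty buffers; it cannot tell an honest ack
 * from a premature one. */
int toy_fsync(const struct toy_buf *b)
{
    return b->dirty ? -1 : 0;   /* -1: another write pass needed */
}
```

The client has no field left that records the truth; once the ack
arrives, every later consistency check passes vacuously.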
peter@prefect.Berkeley.EDU (Peter Moore) (10/15/90)
Oops.  Further thought brings out some problems.  Given the scenario:

	1) Process P on machine L writes to file F on machine R
	2) R crashes before syncing the changes to disk and either:
	   a) recovers before P does any more writes to F, or
	   b) crashes after P has finished with F.
	3) P continues on blithely unaware that the writes to F failed
	   and produces some data.

In my last note I said that a process that did the careful fsyncs that
even local writes require wouldn't be bothered by the above scenario.
But calling fsync during stage 3 would only commit changes made since R
reconnected.  P would falsely think that all changes had been
committed.  This is not good.

Two changes could fix this.  First, invalidate all NFS handles on
reboot.  This would cause fsyncs on existing file descriptors to fail.
Second, (sadly) do an implicit fsync on every close.  This would
protect processes that do: write, close, reopen, and then fsync.  So we
still need implicit fsyncs, but not on every write.

But now a server crash will take down all processes with open file
descriptors on that server, whether or not the process actually lost
any data.  So there are more tradeoffs to be made.  But it still seems
like a worthwhile option.

	Peter
thurlow@convex.com (Robert Thurlow) (10/15/90)
In <1990Oct15.025613.12574@objy.com> peter@prefect.Berkeley.EDU (Peter Moore) writes:

>Two changes could fix this.  First, invalidate all NFS handles on
>reboot.  This would cause fsyncs on existing file descriptors to fail.
>Second, (sadly) do an implicit fsync on every close.  This would
>protect processes that do: write, close, reopen, and then fsync.

Invalidate all file handles and you've made recovery after a reboot
damned expensive.  I don't want the process to have to think about
reopening its files, so the kernel would get a lot uglier.  I'd prefer
to leave the biod doing implicit fsyncs, thanks; I think I have other
tools to tweak performance.

Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."
guy@auspex.auspex.com (Guy Harris) (10/16/90)
>Guy Harris mentioned in an email thread of this conversation that a
>asynchronous extension of NFS has been considered.

Err, umm, no, I didn't.  What I mentioned was the WRITECACHE operation
from an NFS3 protocol spec.  The WRITECACHE operation, as proposed
therein, was (in the RPC language used in that spec, which was the
2/10/89 version):

	NFSPROC_WRITECACHE(file, flush, beginoffset, totalcount,
			   offset, data) returns(reply)
		fhandle    file;
		boolean    flush;
		fileoffset beginoffset;
		unsigned   totalcount;
		fileoffset offset;
		nfsdata    data;

		struct wcokres {
			fattr    attributes;
			unsigned count;
			union flushinfo switch (bool flushed) {
			case TRUE:
				void;
			case FALSE:
				unsigned flushcnt;
				errinfo  flusherror;
			} flushed;
		};

		union writecacheres switch (stat status) {
		case NFS_OK:
			wcokres ok;
		case NFS_WARN:
			wcokres aok;
			errinfo warning;
		case NFS_ERROR:
			errinfo errinfo;
		} reply;

This call writes "data" (which is just a bunch of data) into the
server's data cache for a regular (ftype NFREG) file.  "Beginoffset"
and "totalcount" describe the offset and size, in bytes, of the entire
piece of data to be written.  These values will be the same across a
set of WRITECACHE operations.  Each of the WRITECACHE operations in a
set will have different values for "offset" and "data".  "Offset" is
the byte offset into "file" where "data" should be written.

The boolean "flush", if TRUE, causes the server to flush a whole data
set; that is, commit to disk the data from several WRITECACHE
operations whose "offset" values fall between "beginoffset" and
"beginoffset+totalcount".  If the server's attempt to flush
"totalcount" bytes of data starting at "beginoffset" bytes into the
file is successful, the server will return "reply.flushed" TRUE.  If
"reply.status" is NFS_OK, "reply.ok.attributes" is the attributes of
the file following the write.
If "reply.flushed" is FALSE, "reply.ok.flushed.flushcnt" is the number
of consecutive bytes that actually got written starting at
"beginoffset", which may be less than "totalcount", and
"reply.ok.flushed.flusherror" contains error information about the
write operation that failed.

If "reply.status" is NFS_ERROR then the call failed and "reply.errinfo"
contains error information.  If "reply.status" is NFS_WARN then all the
data of the NFS_OK case is returned, plus a warning "errinfo"
structure.

The server has the option of accepting only a portion of the data.  In
this case "reply.ok.count" is the number of bytes of data that were
cached starting at "offset" in the file.

The size of "data" must be less than or equal to the value of the
"wsize" field in the GETFSINFO reply structure [this is the maximum
number of bytes the server will accept in a "write" operation] for the
file system that contains "file".  In addition, "totalcount" must be
less than or equal to the "wcsize" field in the GETFSINFO reply [this
is the maximum number of bytes that the server will let you write out
in a set of WRITECACHE operations].

IMPLEMENTATION

The WRITECACHE operation is provided for performance only.  Servers are
not required to support it.  Clients can use the WRITECACHE operation
to group consecutive WRITE operations without incurring the overhead of
flushing each chunk of data through to disk on the server.  The client
takes responsibility for recovering from server errors by holding on to
data that has been written with WRITECACHE until a successful flush has
occurred.  This way, to recover from an error the client can either
retry the set of WRITECACHE operations or use WRITE operations to
ensure that the data is safely on the server's disk.

The server may pre-flush cached data to disk to free up cache space.
If this happens, the server can either return an error in response to
the flush request and force the client to resend everything, or keep
track of data that has already been flushed when the flush request
comes along.  This way, if the server can account for all data as
either in the cache or already flushed, the flush request can return
success.

So this is *NOT* the same as making writes asynchronous.  It's more
like letting a single *synchronous* write be broken up into several
pieces.  Those pieces are WRITECACHE operations with the same "file",
"beginoffset", and "totalcount" values, those values being the ones
that correspond to the single write.  The individual pieces are
identified by the "offset" values in the WRITECACHE operations, and the
data in the pieces are the "data" values.  The final WRITECACHE
operation has a "flush" value of TRUE, the others having a "flush"
value of FALSE.

Just as is the case with a single "write" operation, the client holds
onto the data until it's *all* flushed to disk; it must not free up
stuff just because it's been sent to the server with a WRITECACHE
operation - it has to wait until the final WRITECACHE operation
succeeds.  All but the final WRITECACHE operation resemble asynchronous
writes; the final WRITECACHE is still synchronous.
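[Client-side usage, as I read the spec above, would look roughly like
the following simulation.  This is a toy in-memory model, not RPC; the
"server" is just a buffer, and every name here is invented for the
sketch.]

```c
/* Toy model of a client splitting one logical write into a set of
 * WRITECACHE pieces, with flush set only on the final piece.  The
 * client must keep its copy of the data until the flushing call
 * succeeds, so it can retry the whole set on error. */
#include <assert.h>
#include <string.h>

#define TOY_CHUNK   4           /* stands in for the server's wsize */
#define TOY_FILE_SZ 64

static char toy_server_file[TOY_FILE_SZ];   /* the "disk" */

/* One WRITECACHE op: cache the piece; if flush is set, "commit". */
static int toy_writecache(unsigned offset, const char *data,
                          unsigned len, int flush)
{
    memcpy(toy_server_file + offset, data, len);
    (void)flush;                /* a real server would sync here */
    return 0;                   /* NFS_OK */
}

/* Split [beginoffset, beginoffset+totalcount) into pieces. */
int toy_write_set(unsigned beginoffset, const char *data,
                  unsigned totalcount)
{
    unsigned done = 0;
    while (done < totalcount) {
        unsigned len = totalcount - done;
        if (len > TOY_CHUNK)
            len = TOY_CHUNK;
        int last = (done + len == totalcount);  /* flush on last piece */
        if (toy_writecache(beginoffset + done, data + done, len,
                           last) != 0)
            return -1;          /* caller retries the whole set */
        done += len;
    }
    return 0;
}
```

Only when toy_write_set returns success may the caller discard its
buffer, which mirrors Guy's point that the final WRITECACHE is still
synchronous.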
cs@Eng.Sun.COM (Carl Smith) (10/16/90)
> Two changes could fix this.  First, invalidate all NFS handles on
> reboot.

This would break every NFS client that I know of.  It's the server's
responsibility to ensure that file handles are as permanent as the
files to which they refer.

			Carl
mogul@wrl.dec.com (Jeffrey Mogul) (10/16/90)
This seems like a good time to insert a few comments into the NFS
synchrony debate (especially since I'm going away for a week).

One of the problems with NFS is that it manages to tangle up several
different issues, which makes it hard to solve one without breaking
something else.  For example, the reason why NFS clients do
write-throughs to the server is partly for reliability (the client
could crash before a delayed write is sent to the server) and partly a
consequence of the statelessness dogma.  This is because if you have
two clients sharing the same file, changes have to appear on the server
"as soon as possible" in order to preserve some shreds of
local-Unix-like cache consistency.

If NFS clients behaved like local-disk Unix systems (only writing dirty
blocks every 30 seconds), then it wouldn't matter as much if the server
acknowledged them immediately, or waited until the data was safely on
the disk.  (As has been pointed out, it would be a trivial change to
allow the client to distinguish "precious" blocks from others, just as
the local-disk Unix file system has always done.)  But, since server
disk write latency is so nakedly exposed to client applications,
anything that speeds up that latency (such as a "stable-storage" cache,
or faster disks) helps a lot.

Of course, NFS isn't the last word in file systems.  Anyone interested
in a better design can read the papers on Sprite (e.g., Michael N.
Nelson, Brent B. Welch, and John K. Ousterhout, "Caching in the Sprite
Network File System", ACM Trans. Computer Systems 6(1), pages 134-154,
Feb. 1988) and Spritely NFS (V. Srinivasan and Jeffrey C. Mogul,
"Spritely NFS: Experiments with Cache-Consistency Protocols", Proc.
12th SOSP, pages 45-57, Dec. 1989).  But for many of us (including
me!), NFS is what we use, so solutions that don't require protocol
changes (such as server stable-storage boards) might still be a win.

-Jeff
peter@prefect.Berkeley.EDU (Peter Moore) (10/16/90)
In article <thurlow.655915475@convex.convex.com>, thurlow@convex.com (Robert Thurlow) writes:

|> I agree that write(2) won't return you an error in general, but processes
|> can, at any point they wish, call fsync() to ensure the data is secured.
|> That ability is lost if the server is acknowledging only the receipt of
|> the request.

I am sorry.  I was being unclear.  I was only advocating that writes
become asynchronous.  Calls like fsync (and close, if we make it
implicitly fsync) would have to be synchronous.

	Peter Moore
	peter@objy.com
guy@auspex.auspex.com (Guy Harris) (10/17/90)
>For example, the reason why NFS clients do write-throughs to the server
By "do write-throughs to the server" do you mean that if a process on an
NFS *client* writes to a file, the data is immediately sent to the
server?
If so, UNIX NFS clients (or, at least, those derived from the Sun code)
do *not* do write-throughs to the server; in fact, they behave like
local-disk UNIX systems, writing dirty blocks out every 30 seconds (or
when the buffer/page is needed), unless somebody on the client has done
an "fcntl()" lock on the file (which causes writes to be done
synchronously on the client, and reads always to go to the server).
To quote NFSD(8):

	biod starts 'nservers' asynchronous block I/O daemons.  This
	command is used on an NFS client to handle buffer-cache
	read-ahead and write-behind.  The magic number for 'nservers'
	in here is also four.
thurlow@convex.com (Robert Thurlow) (10/18/90)
In <1990Oct16.085057.16691@objy.com> peter@prefect.Berkeley.EDU (Peter Moore) writes:

>In article <thurlow.655915475@convex.convex.com>, thurlow@convex.com (Robert Thurlow) writes:
>|> I agree that write(2) won't return you an error in general, but processes
>|> can, at any point they wish, call fsync() to ensure the data is secured.
>|> That ability is lost if the server is acknowledging only the receipt of
>|> the request.

>I am sorry.  I was being unclear.  I was only advocating writes become
>asynchronous.  Calls like fsync (and close if we make it implicitly
>fsync) would have to be synchronous.

But fsync doesn't go over the wire; if you let the server
asynchronously respond to write requests, you won't be able to trust
your fsyncs, either.

Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."
stukenborg@mavplus9.rtp.dg.com (Stephen Stukenborg) (10/20/90)
(Sorry this is slightly dated, but our news feed has been whacked out
all week.)

After reading the postings back and forth on the issue, I still haven't
seen anyone really hit the nail on the head on why NFS write operations
are synchronous.  The primary reason is to make the client tolerant of
server crashes.  If I'm an NFS client, I don't want the fact that the
server has crashed to impact my "view" of the world.  My system is
still up and running; why should I lose any data?  As has already been
described by Rob Thurlow, if I mark my client buffer-cache block as
"clean" when the server acks my write, then I'm counting on that data
being on stable storage.  If the server is merely going to ack the
receipt of my write request, then I have to hold on to that buffer
until close (or the janitor daemon) verifies that everything is on the
server's disk.

As Jeff Mogul pointed out, the primary difference between traditional
Unix file system behavior and NFS is that the NFS close operation
writes all of a file's dirty buffers to disk.  It is this
"sync-on-close" behavior that really dictates the synchronous write
policy.  (And it also provides a wimpy consistency-control policy.)
The performance win is if the writes (before the close) are
asynchronous; then the disk arm on the server is not "forced" to seek
all over the place.  This is essentially the Rev. 3 protocol spec
WRITECACHE call that Guy Harris talked about.  It would probably be
worth the added complexity in the client-side code for the improved
server performance.

Another thought.  I'm having trouble with the several people who have
mentioned that NFS data integrity is not a big factor to some users.
Do users really want a MIPS-like export option that says "don't do sync
writes"?  (Note that these async writes are different than those
mentioned above.  Now I'm talking about the possibility that data will
be lost on a server crash.)
The only reason I can think of for having this feature is truly
wondrous benchmark results that you can wave in a customer's face.  If
I've just edited my code module or circuit layout, then I want to be
sure that a server crash is not going to lose any of my thoughts.
(Maybe the users who want this feature are the same ones that use
"soft" mounts on filesystems mounted r/w.)

	Steve Stukenborg			stukenborg@dg-rtp.dg.com
	DG/UX Kernel Development		...!mcnc!rti!xyzzy!stukenborg
	Data General Corporation
	62 Alexander Drive
	Research Triangle Park, NC 27709
beepy@ennoyab.Eng.Sun.COM (Brian Pawlowski) (10/20/90)
[This is a long response describing some aspects of NFS behaviour on
writes.  Slightly delayed because of news problems.]

In article <1990Oct16.004225.22754@wrl.dec.com>, mogul@wrl.dec.com (Jeffrey Mogul) writes:

> One of the problems with NFS is that it manages to tangle up several
> different issues, which makes it hard to solve one without breaking
> something else.

Perhaps because the issues are related.  Sprite cache strategies
consider data consistency issues, as does AFS 4.0, because aggressive
cache coherency strategies must make concessions to data consistency
(cooperation amongst clients).  I find it incredibly hard to
disentangle the two in any discussion, though I find it useful to
separate them in looking at problems in distributed file systems.

Synchronous Write Behaviour, NFS and Applications
-------------------------------------------------

> For example, the reason why NFS clients do write-throughs to the server
> is partly for reliability (the client could crash before a delayed
> write is sent to the server) and partly a consequence of the
> statelessness dogma.  This is because if you have two clients sharing
> the same file, changes have to appear on the server "as soon as
> possible" in order to preserve some shreds of local-Unix-like cache
> consistency.

I believe you've confused me above.  An application is still subject to
"sync" delays for writes to a server via NFS exactly the same as when
writing to local disk.  This is the behaviour we've come to know and
love on UNIX.  NFS writes are triggered by normal buffer flushing (sync
activity) on a client.  The above makes it sound as if writing is
synchronous to the application.  This is, generally, not the case.  As
a colleague points out: in the normal case, I/O is handled
asynchronously subject to the normal update syncs.  I would not
consider this synchronous to the application!
There are cases in NFS when I/O is synchronous (to the application),
however:

	- Any time an fsync is done by the application
	- For the remaining life of the file descriptor once a lock has
	  been applied to it (even if all locks are then cleared)

Is there reason for you to believe it works otherwise than this?

NFS client requirements for servers to store data in stable storage
before replying NFS_OK are not a result of statelessness "dogma".  They
are a result of stateless design.  Further, the client's assumption
about the writing by servers of data to stable storage (the semantics
of a good reply) is unrelated to the 30-second sync time consideration,
which, as you state, attempts to preserve some shred of local-UNIX-like
data consistency.  However, the 30-second sync time more resembles
local file write behaviour in UNIX (subject to periodic syncs).  I
believe you have confused several issues here.

Client Requirements for Server Writes to Stable Storage
-------------------------------------------------------

More critical to the on-going discussion are the reasons for an NFS
client requiring servers to write data (including "meta-data" like file
size) to stable storage before replying back to the client.  It is an
inherent assumption in the current design of NFS that servers will not
respond NFS_OK (that is, "Write Successful") until data from a client
has been written to stable storage.  It is not partly a consequence of
"stateless dogma" but inherently a consequence of the "stateless
design" of NFS.

The assumption that servers flush data to stable storage before
returning NFS_OK to the client has nothing to do with client crashes,
but has everything to do with the implications of server crashes.  By
requiring the server to write its data to stable storage, the client
need not concern itself with the current server state.  On receiving an
NFS_OK from the server, the client is free to reuse the data buffers
which held the data just written.
If the server crashes and returns (reboots), the client will (in classic "hard" mounted situations) wait for the server to return and continue where it left off. The server crash has not affected the operation of clients. This is some of the behaviour usually implied when people say "NFS is stateless".

["stateless" is a relative term--we're obviously talking about state on a client in the form of buffers held for 30 seconds. This is normal "UNIX" buffering behaviour. There are other "stateless" design implications, the other well-known one being the simple cache coherency strategy used by NFS, which results in checking the attributes of a file to validate whether locally cached data in the client is still valid--that is, in agreement with server data. Another is the READDIR cookie; another is the encapsulation of file location information in the file handle. The approach is to keep the server simple and to burden the client with the responsibility of keeping critical state. This critical state is not shared with other clients. Servers are also not without "state"--servers typically employ a read-ahead strategy to improve performance--however, the key here is that such server state is not critical to proper operation of NFS.]

The semantics of an NFS write are to preserve data in the event of a server crash (by requiring it to be on stable storage--static RAM or disk). Suggestions on just allowing servers to return NFS_OK without flushing to stable storage [as have been made in preceding e-mails] are in some sense dangerous, because all existing clients are implemented under the assumption that NFS servers only reply okay if the data is "safe". {Assuming you didn't just lose the server disk you wrote to during the server crash.} It is a client data reliability issue that it flushes modified buffers every 30 seconds (or so), in exactly the same way it is for flushing buffers to local disk--to preserve data in the event of a client crash.
In this way, NFS is no different from local writing. In addition, 30 second flushes preserve shreds of UNIX data consistency amongst clients (as you mention above). [A useful side effect.]

> If NFS clients behaved like local-disk Unix systems (only write dirty
> blocks every 30 seconds), then it wouldn't matter as much if the server
> acknowledged them immediately, or waited until the data was safely on
> the disk. (As has been pointed out, it would be a trivial change to
> allow the client to distinguish "precious" blocks from others, just
> as the local-disk Unix file system has always done.) But, since server
> disk write latency is so nakedly exposed to client applications, anything
> that speeds that latency (such as a "stable-storage" cache, or faster
> disks) helps a lot.

NFS clients do behave like local-disk UNIX systems... What do you mean above--this is where I remain confused? Server disk latency is not so nakedly exposed to a client application. (See the discussion above.) To help applications detect errors in writes, write errors that are asynchronous to the execution of the write system call by the application are returned at close() time. This is why it is so critical for an application to check the result of a close() operation to detect such errors. I repeat again: NFS writes are not (in general) synchronous from the client application viewpoint, only from the NFS client viewpoint.

In effect, a file close() results in an fsync() of the file to ensure that any asynchronous errors are seen by the application. The current protocol has no provision for later acknowledgement of data being on stable storage (asynchronous writing), which would allow the client to implement a "precious block" policy. Such a change would require a protocol revision. What do you consider "precious"?
The NFS design considers user data precious, and ensures approximately the same guarantee of reliability to an application as is provided by the local UNIX file system. The semantics of close() returning any asynchronous write errors (in effect, returning after the flush of data to stable storage on the server) provide further guarantees to the application. The attempt is to eliminate insidious silent errors.

Stable storage caching (static RAM techniques) on the server accelerates client applications OVERALL because latency on NFS write requests is reduced (just as read-ahead techniques reduce latency by eliminating synchronous disk access, so writing to static RAM reduces latency by eliminating synchronous disk write activity). The key point here is that no one particular application's write performance is improved, but an OVERALL NFS client's performance is improved (thereby improving all applications).

Future Directions
-----------------

> Of course, NFS isn't the last word in file systems. Anyone interested
> in a better design can read the papers on Sprite (e.g., Michael N. Nelson,
> Brent B. Welch, and John K. Ousterhout, "Caching in the Sprite Network File
> System", Trans. Computer Systems 6:1, pages 134-154, Feb. 1988) and Spritely
> NFS (V. Srinivasan and Jeffrey C. Mogul, "Spritely NFS: Experiments with
> Cache-Consistency Protocols", Proc. 12th SOSP, pages 45-57, Dec. 1988).
>
> But for many of us (including me!), NFS is what we use, so solutions that
> don't require protocol changes (such as server stable-storage boards) might
> still be a win.

NFS is what we use because it is a solution available commercially today, while the papers you reference above describe research in distributed file systems. I take possible exception to your term "better" design--the NFS design met its goals, provides a good solution, and works. I believe that Sprite, AFS and Spritely NFS have shown a lot of promise.
In one form or another, they address the issues of (1) cache coherency, and (2) data consistency. Compare AFS 3.0 and AFS 4.0 and you may arrive at the dichotomy on coherency that helps me understand the differences twixt the two. (Or maybe not.) AFS 4.0 definitely provides stronger data consistency semantics (through the Token Manager) than AFS 3.0 (which had well-defined, but possibly moot, cache consistency, since the guarantees for data consistency amongst cooperating clients were--are--weak. See the paper by Kazar and crew in the Summer Usenix proceedings). AFS, Sprite and Spritely NFS provide direction for us [the NFS community] on ways to improve performance and data consistency guarantees in future distributed file systems.

NFS improvements in data consistency (the view of data as seen by multiple clients) are not addressed by stable-storage boards. Stable storage boards provide a performance boost within the framework of the current NFS protocol while preserving correctness (the implicit agreement made between an NFS client and server on write semantics). I think it is time to consider alternative cache consistency models for NFS, and research in the area provides several directions. HOWEVER, I also believe that the simplicity of the current design of NFS, particularly in regard to data reliability, is not something we should toss aside lightly. NFS has been made available on a wide variety of platforms because it has been both easy to port and fairly easy to implement from the specification. Simplicity is not a bad word. Simple error recovery semantics in a distributed application are not a bad design. Complex error recovery techniques may accompany complex cache coherency schemes. There is a body of knowledge now on many of the issues. Perhaps it is time to exploit this knowledge seriously in NFS.
Lest we lapse into a mode where we believe data and cache consistency are the only issues, one should look around at the others: operation over unreliable networks (WANs), administration, support for shared file name spaces, etc. Feedback on these issues is solicited.

> -Jeff

Brian Pawlowski
thurlow@convex.com (Robert Thurlow) (10/21/90)
In <1990Oct19.222754.17622@dg-rtp.dg.com> stukenborg@mavplus9.rtp.dg.com (Stephen Stukenborg) writes:

>After reading the postings back and forth on the issue, I still haven't
>seen anyone really hit the nail on the head on why NFS write operations
>are synchronous. The primary reason is to make the client tolerant of
>server crashes. If I'm an NFS client, I don't want the fact that the
>server has crashed to impact my "view" of the world. My system is still up
>and running. Why should I lose any data?

>As has already been described by Rob Thurlow, if I mark my client buffer cache
>block as "clean" when the server acks my write, then I'm counting on that
>data being on stable storage. If the server is merely going to ack the
>receipt of my write request, then I have to hold on to that buffer until
>close (or the janitor daemon) verifies that everything is on the server's
>disk. As Jeff Mogul pointed out, the primary difference between traditional
>Unix file system behavior and NFS is that the NFS close operation writes
>all of a file's dirty buffers to disk. It is this "sync-on-close" behavior
>that really dictates the synchronous write policy. (And also provides a wimpy
>consistency-control policy.)

Here's your problem. There is no "open" operation, nor "close" operation, in the NFS protocol. If you want to open a file, your client does an NFS getattr operation to ensure there is indeed a file by that name. If you want to close, you simply stop using that filehandle. All close does is force over-the-wire writes on all VM pages of the associated file. In fact, the only way in the current protocol to send data is with the write operation, and the only way to send metadata is with the set attributes operation. There is no way to have a second ack when I/O is complete. Now, maybe a future protocol will be changed; that writecache from NeFS looks like it has a lot of potential.
But right now, if the client doesn't at least feel free to destroy the data when the write ack is received, it will never get any information that will make it feel better about it in the future.

>Do users really want a MIPS-like export option that says "don't do sync
>writes"? (Note that these async writes are different than those mentioned
>above. Now I'm talking about the possibility that data will be lost
>on a server crash.) The only reason I can think of for having this feature
>is for truly wondrous benchmark results that you can wave in a
>customer's face.

Remember how hard a diskless node hits its NFS-mounted swap device - I've read numbers akin to five writes for every read from /export/swap. If I'm the sort of user who doesn't run long-lived batch jobs from my workstation, I might enjoy the performance edge I gain with async I/O without minding the cost of rebooting most or all the time when my boot server crashes.

#include <iso/std/disclaimer>

Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."
vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (10/21/90)
In article <143972@sun.Eng.Sun.COM>, beepy@ennoyab.Eng.Sun.COM (Brian Pawlowski) writes:

> ...
> More critical to the on-going discussion are the reasons for an NFS
> client requiring servers to write data (including "meta-data" like
> file size) to stable storage before replying back to the client.
> It is an inherent assumption in the current design of NFS that servers
> will not respond NFS_OK (that is, "Write Successful") until data from
> a client has been written to stable storage. It is not partly a
> consequence of the "stateless dogma" but inherently a consequence of
> the "stateless design" of NFS.

Confounding statelessness, to the limited degree it is an attribute of NFS, with server caching policies is bad. Consider that "state" is the purpose of a file system.

NFS is not now and never was stateless. It is relatively stateless, in that the server is not notified of open()'s, unlike several other remote file systems developed then and since. AT&T made a big deal that NFS was "stateless and so bad," while Sun responded that NFS was "stateless and so good." It was blarney in the battle between what AT&T called the "emerging network file system standard" (RFS) and NFS. The battle was not just the public one, but the internal one between Sun engineering and whatever you call the AT&T New Jersey UNIX department. (I worked on the first SVR3 NFS port in '85 in Mtn. View and saw some of the smoke of the cannons.)

The "XID cache" is vital for making NFS come even as close as it does to real UNIX file system semantics, and is by itself a sufficient counter to the old claim that "NFS is stateless."

> ...
> The assumption that servers flush data to stable storage before returning
> NFS_OK to the client has nothing to do with client crashes but has
> everything to do with the implications of server crashes. By requiring
> the server to write its data to stable storage, the client need not
> concern itself with the current server state.
> On receiving an NFS_OK from
> the server, the client is free to reuse the data buffers which held the
> data just written. If the server crashes and returns (reboots), the client
> will (in classic "hard" mounted situations) wait for the server to
> return and continue where it left off. The server crash has not affected
> the operation of clients. This is some of the behaviour usually implied
> when people say "NFS is stateless".

No, the phrase "NFS is stateless" has been almost devoid of meaning for years, because it is confounded with the general notion of state, as in your paragraph above.

> ["stateless" is a relative term--we're obviously talking about state
> on a client in the form of buffers held for 30 seconds. This is normal
> "UNIX" buffering behaviour. There are other "stateless" design implications,
> the other well-known one being the simple cache coherency strategy used
> by NFS which results in checking the attributes of a file to validate
> whether locally cached data in the client is still valid--that is,
> in agreement with server data.

What has this to do with "statelessness"? Please say what this "stateless" has to do with the differences between the NFS cache coherence mechanism and the coherency mechanisms in the distributed cache systems for files, RAM, host names, and toaster temperatures.

> ... Servers are also not without "state"--
> servers typically employ a read-ahead strategy to improve performance--
> however the key here is that such server state is not critical to proper
> operation of NFS.]

Wrong. Without a proper XID cache, an NFS filesystem is an unacceptably poor imitation of a UNIX filesystem. Remember the problems at the Connectathon before last.

Please understand that I like NFS very much and stuff many megabytes thru NFS filesystems every day. I think the trade-offs of Bob Lyon & co. were great and continue to be close to optimal. Honesty conflicts with claims that NFS==UFS.
There are many common UNIX behaviors where NFS is a poor imitation of a Real Filesystem(tm).

> The semantics of an NFS write are to preserve data in event of a server
> crash (by requiring it to be on stable storage--static RAM or disk).
> ...
> Suggestions on just allowing servers to return NFS_OK without flushing
> to stable storage [as have been made in preceding e-mails]
> are in some sense dangerous. Because all existing
> clients are implemented under the assumption that NFS servers only
> reply okay if the data is "safe". {Assuming you didn't just lose
> the server disk you wrote to during the server crash.}

Exactly. Life is "dangerous" and filled with disk crashes.

> ...
> The semantics of "close" returning any asynchronous write errors
> (in effect returning following the flush of data to stable storage
> on the server) provide further guarantees to the application.
> ...
> The attempt is to eliminate insidious silent errors.

I understand guarantees as absolute, except where explicitly limited. The Federal Government and the State of Calif agree with me. If something is guaranteed to not lose data, then it better not. The NFS server dogma does not provide a valid guarantee of preserving data, or of no silent errors. It only improves the likelihoods. This is because there is no such thing as absolutely stable storage. (As I write this, I'm restoring a crashed disk.)

In most UNIX systems, the server cache in DRAM is lost during a crash, disk sectors are usually not lost, and there is no third medium. There are other possibilities. In the 1960's I worked with "mainframes" (Kronos on 6000's) where you could push the reset button ("level 3 restart"), and not only have all active jobs resume, but where the contents of the RAM disk caches would be recovered. Amdahl, Unisys, CDC, and IBM probably still have such features. There are also systems where there are more than 2 layers of storage.
Where would the NFS server dogma require that a system with "permanent" optical storage (whether modern WORM or ancient microfiche), behind slow disks, behind fast drums, behind bulk RAM, behind fast DRAM, behind SRAM cache preserve client data? On the most stable, even if it takes minutes to write?

> Stable storage caching (static RAM techniques) on the server accelerates
> client applications OVERALL because latency on NFS write requests
> is reduced (as read-ahead techniques reduce latency by eliminating
> synchronous disk access, so writing to static RAM reduces latency
> by eliminating synchronous disk write activity). The key point
> here is that no one particular application's write performance
> is improved, but an OVERALL NFS client's performance is improved
> (thereby improving all applications).
> ...

This is a strange statement. We found years ago that violating the NFS cache dogma improved the numbers on many NFS benchmarks--from the Sun test suite to many others--by 50%. (Yes, fellow Connectathon attendees, that is one of our secrets, now disclosed in an /etc/exports option.)

It would be less dogmatic to say that when a server returns NFS_OK, it is saying that the MTBF of the place containing the client's data is greater than XXX, where the MTBF includes all possibilities of failure from power to earthquake to kernel bug.

The NFS protocol should dictate the external characteristics of the server file system, not its internal implementation. Whether the server flushes to disk is an internal implementation issue.

Rational customers buy solutions to problems. They don't care about violations of dogma. They only want an appropriate engineering solution to preserving their data. They don't care whether server buffers are flushed to disk. They care only that data are sufficiently rarely lost.
I was not present when the NFS cache dogma was graven in stone, but I wonder if it was not mostly a statement about the lack of reliability of NFS servers of the time (i.e. 68010 UNIX systems in 1984). The NFS cache dogma does solve problems, but those problems are of people selling things, not of people building or buying things.

Vernon Schryver
Silicon Graphics
vjs@sgi.com
beepy@ennoyab.Eng.Sun.COM (Brian Pawlowski) (10/21/90)
[On several recommendations, I'll try to keep the verbiage down. Whoops--total failure.]

In article <72781@sgi.sgi.com>, vjs@rhyolite.wpd.sgi.com (Vernon Schryver) writes:

> > It is not partly a
> > consequence of the "stateless dogma" but inherently a consequence of
> > the "stateless design" of NFS.
>
> Confounding statelessness, to the limited degree it is an attribute of NFS,
> with server caching policies is bad. Consider that "state" is the purpose
> of a file system.
>
> NFS is not now and never was stateless. It is relatively stateless, in
> that the server is not notified of open()'s, unlike several other remote
> file systems developed then and since. AT&T made a big deal that NFS was
> "stateless and so bad," while Sun responded that NFS was "stateless and so
> good." It was blarney in the battle between what AT&T called the "emerging
> network file system standard" (RFS) and NFS. The battle was not just the
> public one, but the internal one between Sun engineering and whatever you
> call the AT&T New Jersey UNIX department. (I worked on the first SVR3 NFS
> port in '85 in Mtn. View and saw some of the smoke of the cannons.)

I would not disagree with this. I was simplifying the discussion. Sorry. No, NFS is not stateless; it is relatively so (which I attempted to point out below). The "stateless" wars are pointless; however, the fact that the "relatively stateless" design of NFS has simplified implementations should not be ignored... "Stateless" is not simply a matter of notification of open(), though--shared knowledge on the part of clients and servers (particularly knowledge bearing on cache consistency) is more critical (and difficult) state to track.

> "XID cache" is vital for making NFS come even as close as it does to real
> UNIX file system semantics, and is by itself a sufficient counter to the
> old claim that "NFS is stateless."

I like your term "relatively stateless", particularly for this reason.
However, at the level of the discussion the e-mails I responded to were at, I felt comfortable pointing out that some of the fundamental design considerations in NFS have pretty basic implications for what one can and cannot do in an implementation. The need for an "XID cache" addresses a "bug" in the protocol. Suggestions to the effect of eliminating syncing data to stable storage on a server before returning NFS_OK on a write undermine basic assumptions made by clients.

> > ...
> > The assumption that servers flush data to stable storage before returning
> > NFS_OK to the client has nothing to do with client crashes but has
> > everything to do with the implications of server crashes. By requiring
> > the server to write its data to stable storage, the client need not
> > concern itself with the current server state. On receiving an NFS_OK from
> > the server, the client is free to reuse the data buffers which held the
> > data just written. If the server crashes and returns (reboots), the client
> > will (in classic "hard" mounted situations) wait for the server to
> > return and continue where it left off. The server crash has not affected
> > the operation of clients. This is some of the behaviour usually implied
> > when people say "NFS is stateless".
>
> No, the phrase "NFS is stateless" has been almost devoid of meaning for
> years, because it is confounded with the general notion of state, as in
> your paragraph above.

Yes, perhaps I should have moved up the lower paragraph. I understand and accept the relativity of the term "stateless".

> > ["stateless" is a relative term--we're obviously talking about state
> > on a client in the form of buffers held for 30 seconds. This is normal
> > "UNIX" buffering behaviour.
> > There are other "stateless" design implications,
> > the other well-known one being the simple cache coherency strategy used
> > by NFS which results in checking the attributes of a file to validate
> > whether locally cached data in the client is still valid--that is,
> > in agreement with server data.
>
> What has this to do with "statelessness"? Please say what this "stateless"
> has to do with the differences between the NFS cache coherence mechanism
> and the coherency mechanisms in the distributed cache systems for files,
> RAM, host names, and toaster temperatures.

Ummmm... This was my small way of saying that a bald statement of "NFS is stateless" is untrue, putting me in violent agreement with your "relatively stateless" statement. There are a lot of interesting "state" thingies agreed to by the clients and servers. File handles are agreed to "persist" over a crash. (Is this in the specification?) The state describing a file handle in UNIX is information on disk.

> > ... Servers are also not without "state"--
> > servers typically employ a read-ahead strategy to improve performance--
> > however the key here is that such server state is not critical to proper
> > operation of NFS.]
>
> Wrong. Without a proper XID cache, an NFS filesystem is an unacceptably
> poor imitation of a UNIX filesystem. Remember the problems at the
> Connectathon before last.

Yes, the XID Reply Cache is "highly recommended state":-) for a working NFS server. Would you allow me to separate the XID cache solution from things like read-ahead, which are not required for "proper operation"? [Also, if you have implemented an NFS server but don't know what the XID cache is, look at the Usenix Winter 88 paper on "Improving Correctness and Performance in an NFS File Server"--I believe that's the title.]

> Please understand that I like NFS very much and stuff many megabytes thru
> NFS filesystems every day. I think the trade-offs of Bob Lyon &co.
were > great and continue to be close to optimimal. Honesty conflicts with claims > that NFS==UFS. There are many common UNIX behaviors where NFS is a poor > imitation of a Real Filesystem(tm). Yes. Lyon & Co. made trade-offs. And yes, NFS is not a UNIX file system. It comes close where it counts for most situations. And when it doesn't satisfy your requirements (strict read/write consistency without locking example, append mode writes, ???) then you have a problem. [Vernon: Do you feel like posting an enumerated prioritized list of missing features in NFS--with some measure of how important that feature is? That should start an interesting discussion--I'd like to see it.] > > The semantics of an NFS write are to preserve data in event of a server > > crash (by requiring it ot be on stable storage--static RAM or disk). > > ... > > > Suggestions on just allowing servers to return NFS_OK without flushing > > to stable storage [as have been made in preceding e-mails] > > are in some sense dangerous. Because all existing > > clients are implemented under the assumption that NFS servers only > > reply okay if the data is "safe". {Assuming you didn't just lose > > the server disk you wrote to during the server crash.} > > Exactly. Life is "dangerous" and filled with disk crashes. Yes, life is dangerous, but because your "server crash" might mean you lost your disk doesn't lead to the conclusion that you should implement your NFS server such that it doesn't synchronize NFS writes to disk because you might lose your disk!!! I have this paranoia that some people are making that leap in some of the discussions I've seen and heard. I would postulate that most server crashes don't result in lost disks, and that clients can continue once the machine comes back on-line (if the server "dogmatically" flushed the data to "stable" storage).. > > ... 
> > The semantics of "close" returning any asynchronous write errors
> > (in effect returning following the flush of data to stable storage
> > on the server) provide further guarantees to the application.
> > ...
> > The attempt is to eliminate insidious silent errors.
>
> I understand guarantees as absolute, except where explicitly limited. The
> Federal Government and the State of Calif agree with me. If something is
> guaranteed to not lose data, then it better not. The NFS server dogma does
> not provide a valid guarantee of preserving data, or of no silent errors.
> It only improves the likelihoods. This is because there is no such thing
> as absolutely stable storage. (As I write this, I'm restoring a crashed
> disk.)

I agree. I have eliminated "guaranteed" from my vocabulary. Guaranteed.

> In most UNIX systems, the server cache in DRAM is lost during a crash, disk
> sectors are usually not lost, and there is no third medium. There are
> other possibilities. In the 1960's I worked with "mainframes" (Kronos on
> 6000's) where you could push the reset button ("level 3 restart"), and not
> only have all active jobs resume, but where the contents of the RAM disk
> caches would be recovered. Amdahl, Unisys, CDC, and IBM probably still
> have such features. There are also systems where there are more than 2
> layers of storage. Where would the NFS server dogma require that a system
> with "permanent" optical storage (whether modern WORM or ancient
> microfiche), behind slow disks, behind fast drums, behind bulk RAM, behind
> fast DRAM, behind SRAM cache preserve client data? On the most stable,
> even if it takes minutes to write?

Have you or anyone else ever seen NFS servers with "intelligent" caching disk controllers create a "loss of data" problem? At this point I'm wondering if you are advocating throwing away the "requirement" for a server to flush to "stable" storage? Are you?
> > Stable storage caching (static RAM techniques) on the server accelerates
> > client applications OVERALL because latency on NFS write requests
> > is reduced (as read-ahead techniques reduce latency by eliminating
> > synchronous disk access, so writing to static RAM reduces latency
> > by eliminating synchronous disk write activity). The key point
> > here is that no one particular application's write performance
> > is improved, but an OVERALL NFS client's performance is improved
> > (thereby improving all applications).
> > ...
>
> This is a strange statement. We found years ago that violating the NFS
> cache dogma improved the numbers on many NFS benchmarks, from the Sun test
> suite to many other benchmarks, by 50%. (Yes, fellow Connectathon
> attendees, that is one of our secrets, now disclosed in an /etc/exports
> option.)

Maybe I'm being misunderstood. Let me try again. Using static RAM as "fast" stable storage in front of the disk enables an NFS server to speed up writes while providing the same level of "assurance" to the NFS client on the subject of data persistence over a server crash. Violating the "NFS cache dogma" would increase server write performance in the same way, but with an increase in the probability of lost data if a server crashed. The critical question is: how much greater is the possibility of failure--of lost data? Which then leaves you with the decision: "How lucky do I feel, given these probabilities?"

Do you mean that SGI was doing this silently (not requiring syncs to disk) and has now made it an external option? What's the default? Can you send me the man page describing this option? You firmly believe this "flush to stable storage" requirement is in the realm of dogma?

> It would be less dogmatic to say that when a server returns NFS_OK, it is
> saying that the MTBF of the place containing the client's data is greater
> than XXX, where the MTBF includes all possibilities of failure from power
> to earthquake to kernel bug.
>
> The NFS protocol should dictate the external characteristics of the server
> file system, not its internal implementation. Whether the server flushes
> to disk is an internal implementation issue.

Actually, since we're being honest, we both know the NFS protocol specification is none too clear on these issues. For instance, the XID Reply Cache is not specified, whereas you imply that it is a necessary component of an NFS server implementation (and I would not disagree). The protocol specification dictates pretty straightforward external characteristics. Perhaps I should add for the interested reader that most of what we're discussing (the XID cache, consistency semantics, and other "implementation" details) is not called out in the protocol specification; these are merely aspects of particular implementations. The real world intrudes here. A lot of practical knowledge is exchanged at Connectathon every year on how to improve implementations. [I'm of the school that no specification eliminates the need for interoperability testing. I think Connectathon is one of the very good things done in the NFS community.]

> Rational customers buy solutions to problems. They don't care about
> violations of dogma. They only want an appropriate engineering solution to
> preserving their data. They don't care whether server buffers are flushed
> to disk. They care only that data are sufficiently rarely lost.

Agreed. How does your company ensure (Ah! he artfully avoids the contentious word "guarantee") "that data are sufficiently rarely lost"? What was the "appropriate engineering solution to preserving their data" added when the requirement for synchronous writes was dropped?

> I was not present when the NFS cache dogma was graven in stone, but I
> wonder if it was not mostly a statement about the lack of reliability of
> NFS servers of the time (i.e. 68010 UNIX systems in 1984).
Is your basis simply then that today's servers are more reliable, and
that in practice this is not a problem? Is server reliability the
critical factor, or are external factors like power outages, errant
flipping of power switches, etc. significant? I would assume that disk
MTBFs were much greater than server MTBFs, and that synchronous writes
exploit this.

> The NFS cache dogma does solve problems, but those problems are of people
> selling things, not of people building or buying things.

Wow. Wow again. I'm thinking about what everyone is selling (including
you).

Forget absolute failure probabilities: do you have a relative
probability of lost data between flushing to disk and not flushing to
disk on a server before responding to the client? Or any failure data?
This has obviously been (and seems to be a growing) point of contention
between "strict" (you would say "dogmatic") NFS implementations and
"loose" (would you say "enlightened"? :-) implementations. Feedback on
how small (or non-existent) a problem this is would be of great
interest to me, and to others, as this seems to be an increasingly
polarizing issue.

> Vernon Schryver
> Silicon Graphics
> vjs@sgi.com

Brian Pawlowski
Sun Microsystems
beepy@eng.sun.com
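For readers following the argument, the two server write policies being
debated can be put side by side in a few lines. This is an illustrative
sketch, not code from any NFS implementation; the function names and
return strings are invented for the example.

```python
# Sketch of the durability contract a server honors before returning NFS_OK.
# A "strict" server forces data to stable storage before replying; a "loose"
# (async) server replies first and lets the buffer cache flush later.
import os

def nfs_write_strict(fd, data, offset):
    """Acknowledge only after the data is on stable storage."""
    os.lseek(fd, offset, os.SEEK_SET)
    os.write(fd, data)
    os.fsync(fd)          # the expensive step the thread is arguing about
    return "NFS_OK"       # safe: data survives a server crash from here on

def nfs_write_async(fd, data, offset):
    """Acknowledge immediately; a crash before the flush loses the data."""
    os.lseek(fd, offset, os.SEEK_SET)
    os.write(fd, data)
    return "NFS_OK"       # optimistic: relies on server MTBF, UPS, NVRAM...
```

A Prestoserve-style board does not change this picture so much as make
the strict path cheap: the fsync() equivalent lands in battery-backed
RAM instead of waiting for a disk head.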
vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (10/21/90)
In article beepy@ennoyab.Eng.Sun.COM (Brian Pawlowski) writes:
> ...

We agree:
- "NFS is stateless" is not a technical statement, but describes the
  general philosophy used by the NFS designers.
- NFS != UFS.
- the "NFS server cache dogma" increases the reliability of the system,
  and is consistent with the stateless philosophy.

> ... The need for an "XID cache"
> addresses a "bug" in the protocol.

I may disagree. The external behavior produced by an XID cache should
have been specified in the beginning. It is required by real-world
networks.

> Suggestions to the effect of
> eliminating syncing data to stable storage on a server before returning
> NFS_OK on a write undermine basic assumptions made by clients.

No, the synchronous server data cache is an implementation of "safe" or
high-MTBF server data storage. What if I build a server with
non-volatile RAM (e.g. a 180-day UPS on a Sun), put cache blocks in a
reserved part of RAM, have the bootstrap code for UNIX and the
diagnostics discover and preserve all valid cache blocks, and operate
in the evil async mode? (Similarities to Prestoserve are unintended and
inevitable.) If Prestoserve is OK, then so is this, or any other scheme
with the same MTBF.

> There are a lot of interesting "state" thingies agreed to by the
> clients and servers. File handles are agreed to "persist" over a
> crash.

Good point.

> [Vernon: Do you feel like posting an enumerated prioritized list of
> missing features in NFS....

The UNIX file system is not holy. The NFS lacks are irrelevant, except
where they are needed by users. NFS needs about 6 things, including
some kind of cache operation like that discussed recently, some
open-unlink support, and a few others that our local NFS Master chants
under his breath. The new protocol had almost all of them 2 years ago.
It's too bad it went non-linear.

> ... I would postulate that most server crashes don't result
> in lost disks, and that ...[sync writes work]...
Yes, synchronous operation is a good BruteForceAndIgnorance
implementation of what should be the protocol requirement. (I like
BF&I, on the first cut.)

> Have you or anyone ever seen NFS servers with "intelligent" caching
> disk controllers create a "loss of data" problem?

Good point. I've heard rumors, but not seen anything. (We limit
controller caches; synchronous writes wait.)

> At this point I'm wondering if you are advocating throwing away
> the "requirement" for a server to flush to "stable" storage? Are you?

Yes, the stated requirement is bogus. Pick whatever MTBF or equivalent
you wish, but please stop using an implementation to describe a
protocol. (Yes, sometimes an implementation is the best spec, but only
if you can and do say which characteristics you really care about.)

> ...
> You firmly believe this "flush to stable storage" requirement is in
> the realm of dogma?

Yes. It is a taboo or folk medicine, like quinine water or willow bark.
It is OK if we have rationally chosen it instead of its equivalents, or
if we don't understand it. Since we all understand it, let's replace
the taboo with engineering requirements. Actually, "flush to stable
storage" would be OK if people would not keep reading it as "call
bwrite()," and if it were quantitative.

> ...
> The protocol specification dictates pretty straightforward
> external characteristics.

The protocol dictates many external characteristics in terms of an
implementation. The only complete protocol spec comes on the Sun
reference tapes. Still, I much prefer the NFS protocol spec-tape to the
ANSI/IEEE paper swill I've been fighting lately.

> > I wonder if it was not mostly a statement about the lack of reliability of
> > NFS servers of the time (i.e. 68010 UNIX systems in 1984).
>
> Is your basis simply then that today's servers are more reliable, and that
> in practice this is not a problem?
> Is server reliability the critical
> factor or are external factors like power outages, errant flipping
> of power switches, etc. significant? I would assume that disk MTBFs
> were much greater than server MTBFs, and synchronous writes exploit
> this.

Yes, careful operators, solid hardware, a UPS, and sufficiently
bug-free software are more important and effective than synchronous
writes ever were. People do more damage to files with keyboards than
with power switches. The standard lightning drill has always been to
hit the switch to keep power off, to protect disks from flickers and
surges. The servers I use have reasonably balanced MTBFs. The big NFS
servers I know about (all source for everything from forever, on racks
of GB drives) stay up for months, and suffer disk problems as often as
all others.

> > The NFS cache dogma does solve problems, but those problems are of people
> > selling things, not of people building or buying things.
>
> Wow. Wow again. I'm thinking about what everyone is selling (including
> you).

I was referring to "selling" as in "the marketing department," not
selling as in verbally counting coup. I'm not selling, because I won't
get any $ if I convince you; I might get less, because your boxes would
be better.

> ... this seems to be an increasingly polarizing issue.

In my personal experience it has been very controversial since 1985.
Until Prestoserve broke the ice, 15% of the NFS vendors have been
hiding in the closet. It's just that now we're "coming out."

Vernon Schryver, vjs@sgi.com
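Since both posters treat the XID reply cache as something every real
server needs, a minimal sketch may help readers who haven't met it. The
class and method names are mine, not from any specification: the idea
is simply that UDP retransmissions can deliver the same non-idempotent
request twice, so the server remembers recent replies and replays them
rather than re-executing the operation.

```python
# Minimal sketch of an XID (duplicate request) reply cache: the server keys
# on (client, transaction id) and replays the cached reply for a
# retransmitted request instead of running it again.
class XidReplyCache:
    def __init__(self, capacity=128):
        self.capacity = capacity
        self.cache = {}           # (client, xid) -> cached reply
        self.order = []           # FIFO eviction order, oldest first

    def handle(self, client, xid, execute):
        key = (client, xid)
        if key in self.cache:
            return self.cache[key]      # duplicate: replay, don't re-execute
        reply = execute()               # first sighting: run the operation
        if len(self.order) >= self.capacity:
            self.cache.pop(self.order.pop(0), None)
        self.cache[key] = reply
        self.order.append(key)
        return reply
```

Without this, a retransmitted REMOVE or CREATE would run twice and the
second run would fail spuriously, which is exactly the "required by
real-world networks" point above.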
thurlow@convex.com (Robert Thurlow) (10/22/90)
In <143983@sun.Eng.Sun.COM> beepy@ennoyab.Eng.Sun.COM (Brian Pawlowski) writes:

>[Vernon: Do you feel like posting an enumerated prioritized list of
>missing features in NFS--with some measure of how important that feature
>is? That should start an interesting discussion--I'd like to see it.]

I'll bite. I see these as problems that needed to be solved a year ago
in a minor protocol revision, almost orthogonal to NeFS:

- There is nothing in the mount protocol to carry any filesystem info
  beyond the filehandle. For example, this prevents me from warning my
  client user at mount time that the filesystem was exported read-only
  or that the filesystem needs to be accessed via the Secure protocol.
  AUTH_KERB will make this worse.

- access(2) badly needs to go over-the-wire. "Guessing" permissions
  from attributes is completely inadequate when we have root access
  issues, UID/GID mapping issues with Secure NFS, access control lists,
  and other gremlins. We have a perfectly good file server; why not ask
  the owner of the resource?

- It should be possible to do exclusive file creates over the wire.
  With even a bit flag, the server would be able to do the work at
  which the client now has to make vague stabs.

What hurts is that I know exactly how to implement each of these, but
can't do so until Sun blesses a protocol revision. I found NeFS very
interesting, but would really like a way to fix some things *now*.

#include <osi/std/disclaimer>
Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."
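Thurlow's third item, the exclusive-create race, is worth spelling out.
The sketch below is illustrative only (a dict stands in for the server,
and all names are invented): a client emulating O_EXCL with a
lookup-then-create sequence has a window in which another client's
create can land, while a single over-the-wire exclusive create closes
that window.

```python
# Why over-the-wire exclusive create matters: client-side emulation of
# O_EXCL via LOOKUP followed by CREATE is not atomic across clients.
server_fs = {}

def lookup(name):
    return name in server_fs

def create(name):
    server_fs[name] = b""
    return "NFS_OK"

def client_excl_create_racy(name):
    """What clients do today: guess from a LOOKUP, then CREATE."""
    if lookup(name):
        return "EEXIST"
    # <-- another client's CREATE can land right here, and both
    #     clients will then believe they created the file
    return create(name)

def server_excl_create(name):
    """What a protocol bit flag would buy: one atomic check-and-create,
    performed entirely on the server."""
    if name in server_fs:
        return "EEXIST"
    server_fs[name] = b""
    return "NFS_OK"
```

The same shape of argument applies to his access(2) point: the server
is the only party that can answer the question atomically and
authoritatively.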
vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (10/22/90)
[One characteristic of news is that the long-winded cannot see the
audience muttering "shut up, already!" Sorry about this.]

Some have written of paranoia about silently losing data to a server
that crashes after returning NFS_OK and before flushing data to disk.
That worry must be kept in perspective. We must quantify the
probabilities of many failures, and act rationally.

- Until recently, many vendors, including almost all of those who now
  run servers synchronously, ran without UDP checksums. I know of
  mutual customers outraged by that, because they traced silent errors
  to missing UDP checksums.

- Many of us use VME or other busses, which do not have parity, and so
  will occasionally have undetected data corruption.

- Many systems have only byte parity on RAM, so two cosmic rays in the
  same byte will cause silent corruption. The rest have ECC, which does
  not detect all soft errors.

- Using 1500-byte Ethernet blocks, even with UDP checksums, increases
  the likelihood of undetected errors by more than 30 times compared to
  using 64-byte blocks. Recall that one of the determinants of the
  maximum FDDI block size was the probability of an undetected error
  given 500 stations and the limits on LER. We could substantially
  improve the calculated reliability of NFS transmissions by using
  small blocks. Why doesn't the protocol require 64-byte blocks? Why is
  everyone using 4KB FDDI blocks, with the same old 32-bit FCS?

- All systems have zillions of circuits that are known to have
  "metastable" or "resolver" failures; that is, we know hardware will
  sometimes decide your bit was 1 or 0 when it was really a 0 or 1. We
  all try to choose things so that the MTBF of such a failure is low
  compared to any other.

- In the average crash, the update daemon would have you lose the
  changes of the last 15 seconds. The more modern bdflush comes closer
  to losing no work, since it tries to keep the disk continually busy
  flushing blocks.
- The early-1980s decision to run without UDP checksums, but to run
  servers synchronously, says volumes about the relative probabilities
  of server crashes and network corruption then. My recollection is
  that a server that stayed up for days was a wonder. The market
  requires, and many of us deliver, a different order of server
  reliability today. That Sun now runs with UDP checksums turned on
  says volumes about the low relative probability of Sun server failure
  today.

- Is it the sticky bit that Sun has used for the last 3 years to tell
  the server that a file should be written asynchronously? I remember
  hearing about it at Connectathon-before-last.

Vernon Schryver, vjs@sgi.com
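The UDP checksum these reliability comparisons lean on is the standard
16-bit ones'-complement sum of RFC 768 (computed as in RFC 1071). A
small sketch shows both how it works and how weak a net it is: the sum
is order-independent, so, for example, two swapped 16-bit words slip
through undetected.

```python
# The 16-bit ones'-complement Internet checksum used by UDP.  Any error
# pattern that preserves the ones'-complement sum goes undetected, which is
# part of why larger blocks raise the odds of silent corruption.
def inet_checksum(data):
    if len(data) % 2:
        data += b"\x00"              # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold carry back in
    return ~total & 0xFFFF
```

For instance, `inet_checksum(b"\x00\x01\xf2\x03\xf4\xf5\xf6\xf7")`
yields 0x220d, and a packet with two of its 16-bit words transposed
checksums identically to the original.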
geoff@bodleian.East.Sun.COM (Geoff Arnold @ Sun BOS - R.H. coast near the top) (10/22/90)
Quoth thurlow@convex.com (Robert Thurlow) (in
<thurlow.656467376@convex.convex.com>):

#>Do users really want a MIPS-like export option that says "don't do sync
#>writes"? (Note that these async writes are different than those mentioned
#>above. Now I'm talking about the possibility that data will be lost
#>on a server crash.) The only reason I can think of for having this feature
#>is for truly wondrous benchmark results that you can wave in a
#>customer's face.
#
#Remember how hard a diskless node hits its NFS-mounted swap device -
#I've read numbers akin to five writes to every read from /export/swap.
#If I'm the sort of user who doesn't run long-lived batch jobs from my
#workstation, I might enjoy the performance edge I gain with async I/O
#without minding the cost of rebooting most or all the time when my
#boot server crashes.

Simple-minded question: if you want to introduce this best-effort
behaviour, why not do so on the client? I know it makes the
implementation simpler to shove the responsibility onto the server, but
it's really a client issue: only the client knows whether or not data
is "precious" (to borrow an earlier attribute). It's pretty convoluted
to force the server to create and export a filesystem with your whizzy
"don't sync" attribute to solve this.

The other reason not to fix this "problem" on the server is that it's
so damned unilateral! Suppose some zealous system administrator decides
to earn himself a bonus by "improving" network performance by enabling
your attribute on all file systems. Most users may say "sure, it's
probably a good tradeoff, so go for it." The upshot is that it becomes
impossible to reliably fsync a file, even when you want to!
If you argue that this simply points up the need for a protocol revision, I won't disagree on the desirability of revising and fixing the protocol, but I would point out that this particular problem arises from solving the asynchrony issue on the server rather than the client, and is therefore pretty bogus... Bottom line: given the choice between changing the protocol and its well-known semantics or changing an implementation of that protocol, I'd rather change the implementation. The _right_ implementation, of course! -- Geoff Arnold, PC-NFS architect, Sun Microsystems. (geoff@East.Sun.COM) -- *** "Now is no time to speculate or hypothecate, but rather a time *** *** for action, or at least not a time to rule it out, though not *** *** necessarily a time to rule it in, either." - George Bush ***
dricejb@drilex.UUCP (Craig Jackson drilex1) (10/23/90)
In the discussion of how async writes on the server are not correct
NFS, and how sync writes to a controller with battery-backed RAM cache
are OK NFS, isn't this another example of relativity? What if the
battery goes dead before the power is restored? On the other hand, what
if the server is on a UPS? Would it then be OK to do async writes?
--
Craig Jackson
dricejb@drilex.dri.mgh.com
{bbn,axiom,redsox,atexnet,ka3ovk}!drilex!{dricej,dricejb}
thurlow@convex.com (Robert Thurlow) (10/24/90)
In <3012@jaytee.East.Sun.COM> geoff@bodleian.East.Sun.COM (Geoff Arnold
@ Sun BOS - R.H. coast near the top) writes:

>Simple-minded question: if you want to introduce this best-effort
>behaviour, why not do so on the client?

Well, to a degree, we already have some control of this. I can set up
to use synchronous writes through the buffer cache if I have a process
that wants to see write(2) fail unless the server commits, or control
timeouts on my attribute cache to force more consistency checks. But
it's true that there's no per-filesystem control on the client. I'm not
sure this is really important.

>it's really a client issue: only the client knows whether or not data is
>"precious" (to borrow an earlier attribute).

I don't agree at all. The server is providing the storage, and the
sysadmin of the server machine is the person best equipped to judge its
overall value; after all, he has to do the backups! I think individual
client processes have a need to secure data, and that the server is the
machine that really needs per-filesystem control over flushing policy.
More client control in the form of a mount option would be welcome if
it were easy to implement, though.

>Bottom line: given the choice between changing the protocol and
>its well-known semantics or changing an implementation of that
>protocol, I'd rather change the implementation.

I agree, with the caveat that we need to tweak a few specific things
ASAP without getting sidetracked by PostScript-based protocols.

#include <osi/std/disclaimer>
Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."
ddp+@andrew.cmu.edu (Drew Daniel Perkins) (10/25/90)
> Excerpts from netnews.comp.protocols.nfs: 20-Oct-90 Re: NFS writes and
> fsync(). Brian Pawlowski@ennoyab. (11773)
> > One of the problems with NFS is that it manages to tangle up several
> > different issues, which makes it hard to solve one without breaking
> > something else.
> Perhaps because the issues are related. Sprite cache strategies
> consider data consistency issues, as does AFS 4.0, because
> aggressive cache coherency strategies must make concessions
> to data consistency (cooperation amongst clients).

One area it definitely tangled up was transport vs. session protocol
layering. I'm particularly referring to mixing "I acknowledge receipt
of your data" (a transport function) with "here are the results of your
request" (arguably a session function). Retransmissions shouldn't occur
because a request simply hasn't finished yet. If these issues weren't
tangled, then one function of the Chet cache (tossing out requests
which are already in progress) wouldn't have been necessary.

> In the normal case, I/O is handled asynchronously subject to
> the normal update syncs. I would not consider this synchronous to the
> application! There are cases in NFS when I/O is synchronous (to the
> application), however:
> - Any time an fsync is done by the application
> - For the remaining life of the file descriptor once a lock
>   has been applied to it (even if all locks are then cleared)

You do mention it later, but I believe that "any time a close is done
by the application" is an important enough case of synchronous
application behaviour that it should be elevated in importance and
mentioned here.

Drew
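The synchronous cases Pawlowski lists and Drew amends (fsync, close,
and locking) can be sketched as a tiny client-side write-behind cache.
This is an illustration only; the class and its names are invented, and
a plain list stands in for the biod-buffered pages.

```python
# Sketch of NFS client write-behind: write() is asynchronous through
# biod-style buffering, while fsync(), close(), and locking force the flush
# and make the application wait.
class NfsClientFile:
    def __init__(self, send_rpc):
        self.send_rpc = send_rpc  # callable taking bytes: the WRITE RPC
        self.dirty = []           # buffered, unacknowledged writes
        self.locked = False

    def write(self, data):
        if self.locked:
            self.send_rpc(data)   # locking disables write-behind entirely
        else:
            self.dirty.append(data)   # returns before the server has it

    def fsync(self):
        for chunk in self.dirty:
            self.send_rpc(chunk)  # wait for every outstanding WRITE
        self.dirty = []

    def close(self):
        self.fsync()              # close implies a flush, Drew's point

    def lock(self):
        self.fsync()              # flush, then go synchronous for the
        self.locked = True        # remaining life of the descriptor
```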