peter@prefect.Berkeley.EDU (Peter Moore) (12/15/90)
I have some questions on cacheing and synchronization that I hope some of you NFS implementors can answer. I realize that the answers to these questions can be very implementation-specific, but I am interested in almost any system's answer to these questions (but in particular Ultrix, SunOS and AIX3.1).

The Unix implementations I have seen use the daemons biod and nfsd to do the actual client and server (resp.) NFS calls. (I understand that at least biod is optional, but let's assume that daemons are used at both ends.) The question (actually worry) is how much buffering is done in these daemons and whether that buffering can be controlled. In particular:

biod:
    I have seen biod described as read-ahead and write-behind. Which implies it both caches writes (in the sense that it returns before the write is actually done) and it actually reads more blocks than requested, in anticipation of the additional blocks being used in future calls. So:

    a) Does it return before the actual NFS-write is complete?

    b) Again, if a) is true, is there any way for the user to find that the write failed?

    c) If so, is there any way a user process can assure that a particular block or all of its writes have been written? In particular does fsync work or is it (as I have heard) a no-op?

    d) Does biod actually read-ahead?

    e) If so, how does it decide when to flush the cached data and actually re-read the data?

    f) Is there any way a user process can affect that cacheing?

nfsd:
    a) Does the nfsd write back directly to disk, or maintain a personal cache? (My understanding is that modulo WRITECACHE, it definitely does not, in fact it even flushes the OS cache).

    b) If (heaven forbid and presto-serve not installed) it does cache writes, can this be flushed under user control?

    c) Does it do any read-ahead/read-cacheing? (I would certainly hope it wouldn't.)

    d) If (again, heaven forbid) it does do read-cacheing, can that be flushed under user control?
My guess is that nfsd doesn't do any cacheing (except for the implicit cacheing of the OS buffer pool), biod does write-behind and read-ahead, but there is no way to control any of it at the user process level. But I hope this is not true, since it makes NFS mounted file systems pure poison for anyone doing distributed database work.

Most, if not all, of these problems can be eliminated by directly connecting to the nfsd and doing the RPC calls directly, but that is fairly drastic.

Anyway, thanks for whatever help you can give me,

Peter Moore
brent@terra.Eng.Sun.COM (Brent Callaghan) (12/18/90)
In article <1990Dec15.071319.16674@objy.com>, peter@prefect.Berkeley.EDU (Peter Moore) writes:
>
> I have some questions on cacheing and synchronization that I hope some
> of you NFS implementors can answer.
>
> biod:
> I have seen biod described as read-ahead and write-behind.
> Which implies it both caches writes (in the sense that it
> returns before the write is actually done) and it actually
> reads more blocks than requested, in anticipation of the
> additional blocks being used in future calls. So:

I can speak for the SunOS implementation:

> a) Does it return before the actual NFS-write is complete?

You don't need biod's for this to be true. Writes go into mapped file pages and are cached there. The cached data gets flushed only if a write crosses a page boundary, if the file is closed, or if the page daemon flushes it. At flush time the page is scheduled for a biod. If no biod's are available (perhaps they're all busy) then the flush is done in the process context (it becomes synchronous on the client).

In general, yes: thanks to client caching, writes will return before the data is written to the server.

> b) Again, if a) is true, is there any way for the user to find
> that the write failed?

Yes, failed writes will be recorded with the file's rnode. The error can be tested for on subsequent writes or the file close.

> c) If so, is there any way a user process can assure that a
> particular block or all of its writes have been
> written? In particular does fsync work or is it (as I
> have heard) a no-op?

The biod's are invoked only for asynchronous I/O. An fsync implies a mandatory synchronous I/O and indeed that's how it's implemented in SunOS. An fsync will not return until the file changes are flushed to the server's disk.

> d) Does biod actually read-ahead?

Yes.

> e) If so, how does it decide when to flush the cached data and
> actually re-read the data?

When cached attributes time out they are refreshed on demand.
If the new attributes indicate that the file has changed then the file pages are invalidated.

> f) Is there any way a user process can affect that cacheing?

Not really - except to use such heavy-handed techniques as mounting with the "noac" option, or using fsync to force updates to the server.

> nfsd:
> a) Does the nfsd write back directly to disk, or maintain
> a personal cache? (My understanding is that modulo
> WRITECACHE, it definitely does not, in fact it even flushes
> the OS cache).

Yup, the nfsd is required to do synchronous writes to stable storage (disk). The nfsd doesn't maintain a write cache.

> b) If (heaven forbid and presto-serve not installed) it does cache
> writes, can this be flushed under user control?

NA.

> c) Does it do any read-ahead/read-cacheing (I would certainly
> hope it wouldn't)

Not explicitly. Only if the underlying VFS does it.

> d) If (again, heaven forbid) it does do read-cacheing, can that
> be flushed under user control?

No. BTW: read caching on the server is fine so long as it's write-through to stable storage.

> My guess is that nfsd doesn't do any cacheing (except for the implicit
> cacheing of the OS buffer pool),

Right.

> biod does write-behind and read ahead,
> but there is no way to control any of it at the user process level.
> But I hope this is not true, since it makes NFS mounted file systems
> pure poison for anyone doing distributed database work.

You've got it in a nutshell.

--
Made in New Zealand --> Brent Callaghan @ Sun Microsystems
Email: brent@Eng.Sun.COM phone: (415) 336 1051
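The advice in the reply above (a write-behind failure shows up on a later write or at close, and fsync forces the data out synchronously) can be sketched from the application side. This is only an illustrative sketch, not code from the thread: the helper name durable_write is hypothetical, and it uses nothing beyond Python's os.open, os.write, os.fsync and os.close. Whether fsync really reaches the server's disk depends on the client implementation, as discussed above.

```python
import os
import tempfile


def durable_write(path, data):
    """Write data to path; return the byte count only after fsync and close succeed.

    Hypothetical helper for illustration. Any OSError raised by write, fsync
    or close propagates to the caller, because a deferred (write-behind)
    failure may only be reported by one of those later calls.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        os.fsync(fd)  # block until the cached data has been flushed
    finally:
        os.close(fd)  # close can also raise; the error is not swallowed
    return len(data)


if __name__ == "__main__":
    # Demonstration on a local temporary directory (an assumption; on an
    # NFS mount the same calls would go through the client cache).
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "record.dat")
        print(durable_write(target, b"committed record\n"))
```

The point of the design is simply that the return values of fsync and close are never ignored; a caller that treats the data as committed should do so only after this function returns.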
liam@cs.qmw.ac.uk (William Roberts;) (12/21/90)
In <1990Dec15.071319.16674@objy.com> peter@prefect.Berkeley.EDU (Peter Moore) writes:

> biod:
> I have seen biod described as read-ahead and write-behind.
> Which implies it both caches writes (in the sense that it
> returns before the write is actually done) and it actually
> reads more blocks than requested, in anticipation of the
> additional blocks being used in future calls. So:

No - biods are about extra, kernel-initiated I/O which isn't directly associated with an ordinary process. Specifically, if the kernel decides that you are reading a file sequentially and that it should attempt to read the next block of the file in advance of your process actually asking for it, then it would normally just add that block address to the list of things for the disk device driver to do. For NFS, the biod is used instead so that the kernel can keep track of the request it made to the server and what to do when the answer comes back. The existing kernel mechanisms need process slot entries to wait for I/O, so the biods provide such slots. Similarly, the biods can be used to handle cache flushes to remote files.

> a) Does it return before the actual NFS-write is complete?

No - the server should only reply when the write has actually reached the disk service.

> b) Again, if a) is true, is there any way for the user to find
> that the write failed?

If the biod did the write, then the user process finds out at the next operation on that file: remember to check the value returned by close()!

> c) If so, is there any way a user process can assure that a
> particular block or all of its writes have been
> written? In particular does fsync work or is it (as I
> have heard) a no-op?

The fsync() call works as it should do, namely it flushes all locally cached writes to the disk surface (local OR remote) and returns only when every block has been completely written out.

> d) Does biod actually read-ahead?

Yes.
The kernel decides that a read-ahead is required, then a free biod is chosen to ask the server for that block.

> e) If so, how does it decide when to flush the cached data and
> actually re-read the data?

Flushing cached data is about write-behind, not read-ahead. The only way to get at data in a file on the NFS server is via a file handle (i.e. it isn't block-level access). All NFS servers provide a "last modified date" on their files, so the clients can do a loose form of cache checking by recording (in the vnode) the last modify time of the file. Every operation on the file returns the new modify date. If our local record of the modify date is older than 3 seconds (typically; see the actimeo option in later SunOS systems) then we stat the remote file to see if anyone else has modified it. If the modify time is unchanged then our cached information is valid. If someone else has modified the file, we retaliate by flushing *our* changes.... basically this is going to be even worse than two processes writing to the same file in a local filesystem.

> f) Is there any way a user process can affect that cacheing?

SunOS 4.x provides a mount option called "actimeo" which allows you to set the time. You can force a cache flush using the standard "sync" command or the sync() system call.

> nfsd:
> a) Does the nfsd write back directly to disk, or maintain
> a personal cache? (My understanding is that modulo
> WRITECACHE, it definitely does not, in fact it even flushes
> the OS cache).

nfsds write to the file SYNCHRONOUSLY, otherwise a server crash could lose information. The client is assumed to forget what it told the server once the NFS write has completed, so if the server didn't store that data on its disk then a server crash loses the whole lot. Writecache is an abomination.

> b) If (heaven forbid and presto-serve not installed) it does cache
> writes, can this be flushed under user control?

Only by doing sync() system calls on the server.

> c) Does it do any read-ahead/read-cacheing (I would certainly
> hope it wouldn't)

nfsds probably don't do the read-ahead, because they are stateless and don't remember what they were asked for before. They do operate through the server disk cache as normal (they have to, to get easy access to the code which understands the file system structure) so the data they read is placed into the cache while they sleep waiting for it. They also benefit from cache hits etc.

> d) If (again, heaven forbid) it does do read-cacheing, can that
> be flushed under user control?

There is no way, to my knowledge, that *anyone* can invalidate the whole cache short of unmounting the disk. The nfsds live in the *server*, remember, so they stand to gain from the standard caching of reads & writes to their local filesystem.

> ... But I hope this is not true, since it makes NFS mounted file systems
> pure poison for anyone doing distributed database work.

Yep, 100% not-appropriate poison. NFS is very good for shared read-only things such as binaries, libraries etc, and fairly good for read/write from a single client. It is therefore very useful for personal filestores, since there is only one little me and I'm not often modifying files from two different clients at the same time. But don't try to use it for distributed databases: that isn't what it was designed for and it won't work.

> Most, if not all, of these problems can be eliminated by directly
> connecting to the nfsd and doing the RPC calls directly, but that is
> fairly drastic.

Wrong again. If you are prepared to do RPC, then write your own database server that does what you actually want to do: NFS contains no user-serviceable parts!

> Anyway, thanks for whatever help you can give me,

You're welcome. You should look up the original Sandberg paper on NFS, which is

%A R. Sandberg
%T The Sun Network File System: Design, Implementation and Experience
%D Summer 1985
%J Sun Technical Report
%I Sun Microsystems Inc.
It was probably presented at USENIX at around that sort of time, but I don't have a better reference for it.

PS. Does comp.protocols.nfs have a "Frequently asked questions"? I'd imagine that "What does a biod do?" and "What does an nfsd do?" would be near the top of the list :-)

--
William Roberts                  ARPA: liam@cs.qmw.ac.uk
Queen Mary & Westfield College   UUCP: liam@qmw-cs.UUCP
Mile End Road                    AppleLink: UK0087
LONDON, E1 4NS, UK               Tel: 071-975 5250 (Fax: 081-980 6533)
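The loose cache check described in the reply above (trust the cached copy until the recorded modify time is older than a timeout, then ask the server and discard the cache if the modify time changed) can be modelled in a few lines. This is a toy user-level sketch under stated assumptions, not how any kernel implements it: the class CachedFile, its callbacks and the injected clock are all hypothetical names, and the 3-second default is just the typical figure quoted above.

```python
import time


class CachedFile:
    """Toy model of one client's cached view of a remote file (hypothetical)."""

    def __init__(self, get_remote_mtime, read_remote, timeout=3.0,
                 clock=time.monotonic):
        self._get_remote_mtime = get_remote_mtime  # stands in for a remote stat
        self._read_remote = read_remote            # stands in for a remote read
        self._timeout = timeout                    # attribute cache lifetime
        self._clock = clock
        self._data = None
        self._mtime = None
        self._checked_at = None

    def _refill(self, now):
        # Fetch fresh attributes and data, and restart the timeout.
        self._mtime = self._get_remote_mtime()
        self._data = self._read_remote()
        self._checked_at = now

    def read(self):
        now = self._clock()
        if self._data is None:
            self._refill(now)
        elif now - self._checked_at >= self._timeout:
            # Cached attributes have timed out: re-check the modify time.
            mtime = self._get_remote_mtime()
            self._checked_at = now
            if mtime != self._mtime:
                self._refill(now)  # file changed, so cached data is invalid
        return self._data


if __name__ == "__main__":
    server = {"mtime": 1, "data": b"v1"}
    now = [0.0]
    f = CachedFile(lambda: server["mtime"], lambda: server["data"],
                   clock=lambda: now[0])
    print(f.read())                       # first read fills the cache
    server.update(mtime=2, data=b"v2")    # another client modifies the file
    now[0] = 1.0
    print(f.read())                       # inside the timeout: cached copy
    now[0] = 3.5
    print(f.read())                       # after the timeout: re-validated
```

The window between the remote change and the next re-validation, during which the client still serves the old data, is the behaviour that makes this kind of caching a poor fit for multi-client database updates.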