[comp.protocols.nfs] Buffering in biod and nfsd.

peter@prefect.Berkeley.EDU (Peter Moore) (12/15/90)

I have some questions on cacheing and synchronization that I hope some
of you NFS implementors can answer.  I realize that the answers to
these questions can be very implementation specific, but I am
interested in almost any system's answer to these questions (but in
particular Ultrix, SunOS and AIX3.1).

The Unix implementations I have seen use the daemons biod and nfsd to
do the actual client and server (resp.) NFS calls.  (I understand that at
least biod is optional, but let's assume that daemons are used at both ends).
The question (actually worry) is how much buffering is done in these
daemons and whether that buffering can be controlled.  In particular:

biod:
	I have seen biod described as read-ahead and write-behind.
	Which implies it both caches writes (in the sense that it
	returns before the write is actually done) and it actually
	reads more blocks than requested, in anticipation of the
	additional blocks being used in future calls.  So:
	
	a) Does it return before the actual NFS-write is complete?
	b) Again, if a) is true, is there any way for the user to find
	   that the write failed?
	c) If so, is there any way a user process can assure that a
	    particular block or all of its writes have been
	    written yet?  In particular does fsync work or is it (as I
	    have heard) a no-op?
	d) Does biod actually read-ahead?
	e) If so, how does it decide when to flush the cached data and
	   actually re-read the data?
	f) Is there any way a user process can affect that cacheing?


nfsd:
	a) Does the nfsd write back directly to disk, or maintain
	   a personal cache?  (My understanding is that modulo
	   WRITECACHE, it definitely does not, in fact it even flushes
	   the OS cache).
	b) If (heaven forbid and presto-serve not installed) it does cache
	   writes, can this be flushed under user control?
	c) Does it do any read-ahead/read-cacheing (I would certainly
	   hope it wouldn't)
	d) If (again, heaven forbid) it does do read-cacheing, can that
	   be flushed under user control?


My guess is that nfsd doesn't do any cacheing (except for the implicit
cacheing of the OS buffer pool), biod does write-behind and read ahead,
but there is no way to control any of it at the user process level.
But I hope this is not true, since it makes NFS mounted file systems
pure poison for anyone doing distributed database work.

Most, if not all, of these problems can be eliminated by directly
connecting to the nfsd and doing the RPC calls directly, but that is
fairly drastic.


      Anyway, thanks for whatever help you can give me,

	      Peter Moore

brent@terra.Eng.Sun.COM (Brent Callaghan) (12/18/90)

In article <1990Dec15.071319.16674@objy.com>, peter@prefect.Berkeley.EDU (Peter Moore) writes:
> 
> I have some questions on cacheing and synchronization that I hope some
> of you NFS implementors can answer.
> 
> biod:
> 	I have seen biod described as read-ahead and write-behind.
> 	Which implies it both caches writes (in the sense that it
> 	returns before the write is actually done) and it actually
> 	reads more blocks than requested, in anticipation of the
> 	additional blocks being used in future calls.  So:

I can speak for the SunOS implementation:

> 	a) Does it return before the actual NFS-write is complete?

You don't need biods for this to be true.  Writes go into mapped file
pages and are cached there.  The cached data gets flushed only if a
write crosses a page boundary, if the file is closed, or if the
page daemon flushes it.  At flush time the page is scheduled for
a biod.  If no biods are available (perhaps they're all busy) then the
flush is done in the process context (it becomes synchronous on the client).
In general, yes: thanks to client caching, writes will return before the
data is written to the server.
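
The flush path described above can be modelled as a bounded hand-off queue. This is a toy sketch in Python, not SunOS kernel code: `schedule_flush`, `biod_queue`, and `write_to_server` are hypothetical names, and a full queue stands in for "all biods are busy".

```python
import queue

def schedule_flush(page, biod_queue, write_to_server):
    """Flush one dirty page, asynchronously if a biod slot is free.

    Returns "async" when the page was handed off, or "sync" when no slot
    was free and the caller had to perform the write itself.
    Hypothetical model for illustration only.
    """
    try:
        biod_queue.put_nowait(page)      # a biod would pick this up later
        return "async"
    except queue.Full:
        write_to_server(page)            # no biod free: write in process context
        return "sync"
```

With a queue of capacity one, the first dirty page is handed off and the second is written by the caller, which is the case where the writing process itself ends up waiting on the server.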

> 	b) Again, if a) is true, is there any way for the user to find
> 	   that the write failed?

Yes, failed writes will be recorded with the file's rnode.  The error
can be tested for on subsequent writes or the file close.

> 	c) If so, is there any way a user process can assure that a
> 	    particular block or all of its writes have been
> 	    written yet?  In particular does fsync work or is it (as I
> 	    have heard) a no-op?

The biods are invoked only for asynchronous IO.  An fsync implies
a mandatory synchronous IO and indeed that's how it's implemented
in SunOS.  An fsync will not return until the file changes are
flushed to the server's disk.
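
A minimal sketch of what the answers to b) and c) mean for a user program, written in Python rather than C for brevity. The helper name `durable_write` is hypothetical. The point is only that the results of write, fsync, and close all have to be checked, because a deferred write error may surface at any of them.

```python
import os

def durable_write(path, data):
    """Write data to path and return only after fsync and close succeed.

    Hypothetical helper for illustration. A failed write-behind may be
    reported by a later write, by fsync, or by close, so an OSError from
    any of the three is allowed to propagate to the caller.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)   # may write fewer bytes than asked
            view = view[written:]
        os.fsync(fd)                       # block until the data is flushed
    finally:
        os.close(fd)                       # close can also report an error
```

A caller that needs the data on stable storage should treat any exception from this function as "the write may not have happened" and retry or report it.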

> 	d) Does biod actually read-ahead?

Yes.

> 	e) If so, how does it decide when to flush the cached data and
> 	   actually re-read the data?

When cached attributes time out they are refreshed on demand.  If the
new attributes indicate that the file has changed then the file pages
are invalidated.

> 	f) Is there any way a user process can affect that cacheing?

Not really - except to use such heavy-handed techniques as mounting with
the "noac" option, or using fsync to force updates to the server.

> nfsd:
> 	a) Does the nfsd write back directly to disk, or maintain
> 	   a personal cache?  (My understanding is that modulo
> 	   WRITECACHE, it definitely does not, in fact it even flushes
> 	   the OS cache).

Yup, the nfsd is required to do synchronous writes to stable storage
(disk).  The nfsd doesn't maintain a write cache.

> 	b) If (heaven forbid and presto-serve not installed) it does cache
> 	   writes, can this be flushed under user control?

NA.

> 	c) Does it do any read-ahead/read-cacheing (I would certainly
> 	   hope it wouldn't)

Not explicitly.  Only if the underlying VFS does it.

> 	d) If (again, heaven forbid) it does do read-cacheing, can that
> 	   be flushed under user control?

No. BTW: read caching on the server is fine so long as it's write-through
to stable storage.

> My guess is that nfsd doesn't do any cacheing (except for the implicit
> cacheing of the OS buffer pool),

Right.

>biod does write-behind and read ahead,
> but there is no way to control any of it at the user process level.
> But I hope this is not true, since it makes NFS mounted file systems
> pure poison for anyone doing distributed database work.

You've got it in a nutshell.
--

Made in New Zealand -->  Brent Callaghan  @ Sun Microsystems
			 Email: brent@Eng.Sun.COM
			 phone: (415) 336 1051

liam@cs.qmw.ac.uk (William Roberts) (12/21/90)

In <1990Dec15.071319.16674@objy.com> peter@prefect.Berkeley.EDU (Peter Moore) 
writes:



>biod:
>	I have seen biod described as read-ahead and write-behind.
>	Which implies it both caches writes (in the sense that it
>	returns before the write is actually done) and it actually
>	reads more blocks than requested, in anticipation of the
>	additional blocks being used in future calls.  So:


No - biods are about extra, kernel-initiated I/O which isn't directly
associated with an ordinary process. Specifically, if the kernel decides that
you are reading a file sequentially and that it should attempt to read the 
next block of the file in advance of your process actually asking for it, then 
it would normally just add that block address to the list of things for the 
disk device driver to do. For NFS, the biod is used instead so that the kernel 
can keep track of the request it made to the server and what to do when the 
answer comes back. The existing kernel mechanisms need process slot entries
to wait for I/O, so the biods provide such slots.

Similarly, the biods can be used to handle cache flushes to remote files.
>	
>	a) Does it return before the actual NFS-write is complete?
No - the server should only reply when the write has actually reached the disk 
service.

>	b) Again, if a) is true, is there any way for the user to find
>	   that the write failed?
If the biod did the write, then the user process finds out at the next 
operation on that file: remember to check the value returned by close()!

>	c) If so, is there any way a user process can assure that a
>	    particular block or all of its writes have been
>	    written yet?  In particular does fsync work or is it (as I
>	    have heard) a no-op?
The fsync() call works as it should do, namely it flushes all locally cached 
writes to the disk surface (local OR remote) and returns only when every block 
has been completely written out.

>	d) Does biod actually read-ahead?
Yes. The kernel decides that a read-ahead is required, then a free biod is 
chosen to ask the server for that block.

>	e) If so, how does it decide when to flush the cached data and
>	   actually re-read the data?
Flushing cached data is about write-behind, not read-ahead. The only way to 
get at data in a file on the NFS server is via a file handle (i.e. it isn't 
block level access). All NFS servers provide a "last modified date" on their 
files, so the clients can do a loose form of cache checking by recording (in 
the vnode) the last modify time of the file. Every operation on the file 
returns the new modify date. If our local record of the modify date is older 
than 3 seconds (typically, see the actimeo option in later SunOS systems) then 
we stat the remote file to see if anyone else has modified it. If the modify 
time is unchanged then our cached information is valid. If someone else has 
modified the file, we retaliate by flushing *our* changes.... basically this 
is going to be even worse than two processes writing to the same file in a 
local filesystem.
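
The timeout-then-compare check described above can be sketched as a toy model in Python. The class `CachedFile`, the `fetch_mtime` callback (standing in for an attribute request to the server), and the injectable `clock` are all hypothetical names for illustration. The 3 second default is the typical value mentioned above, and on a detected change the sketch simply drops the cached pages.

```python
import time

class CachedFile:
    """Toy model of a client's cached view of one remote file."""

    def __init__(self, fetch_mtime, timeout=3.0, clock=time.monotonic):
        self.fetch_mtime = fetch_mtime   # stands in for asking the server
        self.timeout = timeout           # how long cached attributes are trusted
        self.clock = clock
        self.mtime = fetch_mtime()
        self.checked_at = clock()
        self.pages = {}                  # block number -> cached bytes

    def validate(self):
        """Return True if the cache is still trusted, False if it was dropped."""
        now = self.clock()
        if now - self.checked_at < self.timeout:
            return True                  # attributes still fresh: no server call
        new_mtime = self.fetch_mtime()   # attributes timed out: ask the server
        self.checked_at = now
        if new_mtime == self.mtime:
            return True                  # unchanged, cached pages remain valid
        self.mtime = new_mtime
        self.pages.clear()               # someone else modified the file
        return False
```

Note the window this leaves open: a change made on the server within the timeout is not noticed until the next check, which is exactly why two clients updating the same file cannot rely on seeing each other's writes promptly.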

>	f) Is there any way a user process can affect that cacheing?
SunOS 4.x provides a mount option called "actimeo" which allows you to set the 
time. You can force a cache flush using the standard "sync" command or the 
sync() system call.


>nfsd:
>	a) Does the nfsd write back directly to disk, or maintain
>	   a personal cache?  (My understanding is that modulo
>	   WRITECACHE, it definitely does not, in fact it even flushes
>	   the OS cache).
nfsds write to the file SYNCHRONOUSLY otherwise a server crash could lose 
information. The client is assumed to forget what it told the server once the 
NFS write has completed, so if the server didn't store that data on its disk 
then a server crash loses the whole lot. Writecache is an abomination.

>	b) If (heaven forbid and presto-serve not installed) it does cache
>	   writes, can this be flushed under user control?

Only by doing sync() system calls on the server.

>	c) Does it do any read-ahead/read-cacheing (I would certainly
>	   hope it wouldn't)
nfsds probably don't do the readahead, because they are stateless and don't
remember what they were asked for before. They do operate through the server
disk cache as normal (they have to, to get easy access to the code which
understands the file system structure) so the data they read is placed into 
the cache while they sleep waiting for it. They also benefit from cache hits 
etc.

>	d) If (again, heaven forbid) it does do read-cacheing, can that
>	   be flushed under user control?
There is no way, to my knowledge, that *anyone* can invalidate the whole cache 
short of unmounting the disk. The nfsds live in the *server* remember, so they 
stand to gain from the standard caching of reads & writes to their local 
filesystem.

>... But I hope this is not true, since it makes NFS mounted file systems
>pure poison for anyone doing distributed database work.
Yep, 100% not-appropriate poison. NFS is very good for shared read-only things 
such as binaries, libraries etc, and fairly good for read/write from a single 
client. It is therefore very useful for personal filestores, since there is 
only one little me and I'm not often modifying files from two different clients
at the same time. But don't try to use it for distributed databases: that
isn't what it was designed for and it won't work.

>Most, if not all, of these problems can be eliminated by directly
>connecting to the nfsd and doing the RPC calls directly, but that is
>fairly drastic.
Wrong again. If you are prepared to do RPC, then write your own database 
server that does what you actually want to do: NFS contains no user-serviceable
parts!

>      Anyway, thanks for whatever help you can give me,
You're welcome. You should look up the original Sandberg paper on NFS, which is

   %A R. Sandberg
   %T The Sun Network File System: Design, Implementation and Experience
   %D Summer 1985
   %J Sun Technical Report
   %I Sun Microsystems Inc.

It was probably presented at USENIX at around that sort of time, but I don't 
have a better reference for it.

PS. Does comp.protocols.nfs have a "Frequently asked questions"? I'd imagine 
that "What does a biod do?" and "What does an nfsd do?" would be near the top 
of the list :-)
--

William Roberts                 ARPA: liam@cs.qmw.ac.uk
Queen Mary & Westfield College  UUCP: liam@qmw-cs.UUCP
Mile End Road                   AppleLink: UK0087
LONDON, E1 4NS, UK              Tel:  071-975 5250 (Fax: 081-980 6533)