[comp.protocols.nfs] NFS not idempotent

jtkohl@athena.mit.edu (John T Kohl) (09/06/89)

In article <1199@sequent.cs.qmc.ac.uk> liam@cs.qmc.ac.uk (William Roberts) writes:

   This is a difference between user-level RCP and kernel-level RPC.
   The kernel level *knows* that its NFS RPC requests are
   idempotent 

Unfortunately, some of them ARE NOT idempotent, and that has caused
great troubles to us at MIT.

Consider:
	create directory  (we've had mkdir return 'already exists' when
				it has actually just been created)
	set attributes with length = 0 (truncate--with packet reordering
				and multiple biod's/nfsd's, this can lead to
				truncation occurring after a subsequent
				write()--poof, your file contents are gone)
	rename
	move
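
To make the mkdir failure concrete, here is a toy sketch (modern Python,
not anything from an actual NFS implementation) of why a retransmitted
MKDIR whose first reply was lost on the wire comes back as 'already
exists': a stateless server cannot tell the retry from a new request.

```python
# Toy illustration: a stateless server treats a retransmitted MKDIR
# as a brand-new request, so the retry sees the directory it just made.

class StatelessServer:
    def __init__(self):
        self.dirs = set()

    def mkdir(self, name):
        if name in self.dirs:
            return "EEXIST"          # looks like an error to the client
        self.dirs.add(name)
        return "OK"

server = StatelessServer()
first = server.mkdir("proj")         # succeeds, but the reply is lost
retry = server.mkdir("proj")         # client retransmits the same call
print(first, retry)                  # -> OK EEXIST
```

The client only ever sees the second reply, so a successful mkdir is
reported to the application as a failure.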

John Kohl <jtkohl@ATHENA.MIT.EDU> or <jtkohl@Kolvir.Brookline.MA.US>
Digital Equipment Corporation/Project Athena
(The above opinions are MINE.  Don't put my words in somebody else's mouth!)

liam@cs.qmc.ac.uk (William Roberts) (09/06/89)

In article <14068@bloom-beacon.MIT.EDU> jtkohl@athena.mit.edu (John T Kohl) writes:
>In article <1199@sequent.cs.qmc.ac.uk> liam@cs.qmc.ac.uk (William Roberts) writes:
>
>   This is a difference between user-level RCP and kernel-level RPC.
>   The kernel level *knows* that its NFS RPC requests are
>   idempotent
>
>Unfortunately, some of them ARE NOT idempotent, and that has caused
>great troubles to us at MIT.

The server implementations around keep a cache of recent
requests so that they can replay the previous reply if an
incoming request happens to be a retransmission. The previous
reply is repeated IFF the cached xid matches the xid of the
request, which implies that the request is actually a
retransmission. This extra machinery is what makes the
underlying RPC calls idempotent, rather than the protocol
requests themselves.

The requests for which this caching seems to be happening on my
systems (NFS 3.0) are:

    rfs_create, rfs_remove, rfs_link, rfs_mkdir, rfs_rmdir
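
A minimal sketch of such an xid-keyed reply cache (hypothetical Python,
not kernel C; the cache here never expires entries, which a real server
would have to do):

```python
# Sketch of the reply cache described above: the server keeps the last
# reply per transaction id (xid) and replays it verbatim when the same
# xid arrives again, so the retried call behaves idempotently.

class ReplyCache:
    def __init__(self):
        self.cache = {}              # xid -> cached reply

    def handle(self, xid, arg, do_op):
        if xid in self.cache:        # retransmission: replay old answer
            return self.cache[xid]
        reply = do_op(arg)           # new request: execute for real
        self.cache[xid] = reply
        return reply

dirs = set()

def mkdir(name):
    if name in dirs:
        return "EEXIST"
    dirs.add(name)
    return "OK"

rc = ReplyCache()
r1 = rc.handle(42, "proj", mkdir)    # original request -> "OK"
r2 = rc.handle(42, "proj", mkdir)    # same xid retried -> cached "OK"
```

With the cache, the retry gets the original "OK" instead of EEXIST.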

>Consider:
>       create directory  (we've had mkdir return 'already exists' when
>                               it has actually just been created)
>       set attributes with length = 0 (truncate--with packet reordering
>                               and multiple biod's/nfsd's, this can lead to
>                               truncation occurring after a subsequent
>                               write()--poof, your file contents are gone)
>       rename
>       move

Move and rename are certainly a potential problem since their
answers aren't cached. I thought file truncation was done using
rfs_create rather than rfs_setattr, but the multiple biods are
a problem since they can interleave the requests and indeed
give rise to a write being actioned before the truncation!
I can't see how mkdir can have done that to you (unless it was
that System V nonsense about "create ., create ..").

Question: What's the best way to fix the reordering problem?

My personal suggestion is to make "significant" operations such
as create act synchronously, so that the creating process
cannot issue a subsequent write request before the create has
definitely occurred.
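
A toy model of the hazard (assumed semantics, not real biod code):
the client means "truncate, then write data", but multiple async
daemons can deliver the two requests to the server in either order.

```python
# Toy model: apply a sequence of file operations in arrival order.

def apply_ops(ops):
    contents = b"old contents"
    for op, arg in ops:
        if op == "truncate":
            contents = b""
        elif op == "write":
            contents = arg
    return contents

intended  = [("truncate", None), ("write", b"new data")]
reordered = [("write", b"new data"), ("truncate", None)]

print(apply_ops(intended))           # -> b'new data'
print(apply_ops(reordered))          # -> b''  (poof, contents are gone)

# Making the truncate synchronous means the client does not issue the
# write until the truncate's reply has arrived, so only the intended
# order can ever reach the server.
```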


-- 

William Roberts         ARPA: liam@cs.qmc.ac.uk
Queen Mary College      UUCP: liam@qmc-cs.UUCP    AppleLink: UK0087
190 Mile End Road       Tel:  01-975 5250
LONDON, E1 4NS, UK      Fax:  01-980 6533

clancy@mfci.UUCP (Pat Clancy) (09/07/89)

In article <14068@bloom-beacon.MIT.EDU> jtkohl@athena.mit.edu (John T Kohl) writes:
>Unfortunately, some of them ARE NOT idempotent, and that has caused
>great troubles to us at MIT.
>
>Consider:
>	create directory  (we've had mkdir return 'already exists' when
>				it has actually just been created)
>	set attributes with length = 0 (truncate--with packet reordering
>				and multiple biod's/nfsd's, this can lead to
>				truncation occurring after a subsequent
>				write()--poof, your file contents are gone)

There's a paper on "Improving Performance/Correctness of an NFS
Server" in the Winter 89 Usenix Proceedings, that has a good
discussion of problems in this area; some of this behavior can
be caused by race conditions which are exposed when the server
is overloaded.  This can apparently be partially fixed with
some simple changes, e.g. examining the cache (server side) of
recently completed request records for every new incoming request
and discarding duplicates.
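
A rough sketch of that fix (my paraphrase in Python, not the paper's
code): before executing a request, check a record of requests already
in progress and simply drop duplicates, rather than running the same
non-idempotent operation a second time in parallel.

```python
# Sketch: drop duplicates of requests the server is already working on,
# so a retransmission cannot race its own original.

in_progress = set()
executed = []                        # what actually got run, and how often

def serve(xid, op):
    if xid in in_progress:
        return "DROPPED"             # duplicate: let the first copy answer
    in_progress.add(xid)
    executed.append(op)              # do the real work exactly once
    return "DONE"

r1 = serve(7, "rmdir /tmp/x")
r2 = serve(7, "rmdir /tmp/x")        # retransmitted while still in progress
```

A real server would also have to remove the xid once the reply is sent
and retain it in a completed-request cache, as described above.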


Pat Clancy
Multiflow Computer

news@bbn.COM (News system owner ID) (09/14/89)

liam@cs.qmc.ac.uk (William Roberts) writes:
< jtkohl@athena.mit.edu (John T Kohl) writes:
< > liam@cs.qmc.ac.uk (William Roberts) writes:
< >
< >   This is a difference between user-level RCP and kernel-level RPC.
< >   The kernel level *knows* that its NFS RPC requests are
< >   idempotent
< >
< >Unfortunately, some of them ARE NOT idempotent, and that has caused
< >great troubles to us at MIT.

As near as I can tell, claiming that NFS ops are idempotent is nothing
more than "marketing blurf" from Sun marketing.  The problem is that
Sun didn't think quite hard enough about what a remote file system has
to do before going off and writing one; and now we are stuck with it.

< The server implementations around keep a cache of recent
< requests so that they can recite the previous reply if this
< happens to be a retransmission.

This is what is known generally as "a hack" (a workaround, if you
prefer).

The trouble is, as W.R. points out, it doesn't really work all the
time.  It makes the situation better, but it doesn't really cure it.

The really funny thing is that Sun tried using TCP (rather than UDP)
for NFS at first, but it was too slow, so they switched to UDP.  Now,
as part of cleaning up all this mess, they are adding practically all
of the capabilities of TCP to RPC/UDP.  Meanwhile, Jacobson
demonstrated that TCP isn't slow by nature, just by implementation.

< Question: What's the best way to fix the reordering problem?

Rewrite NFS from scratch, including (_especially_) the entire protocol.
(Only about 1/4 :-) -- it _really_ needs to be done).

< My personal suggestion is to make "significant" operations such
< as create act synchronously, so that the creating process
< cannot issue a subsequent write request before the create has
< definitely occurred.

This is about the same thing that many simpler communications
protocols do (like, say, Kermit): open, close, create, etc. are
synchronous, even if the actual data transmission is async (or
windowed).

Actually, it's kinda odd that given a synchronous RPC system, these
operations were made async anyway.  Of course performance is slower
when creates are synchronous, but which do you want: fast performance,
or a reliable system?

What we really need is for some capable group to go off and write a
networked file sharing system that combines the best features of NFS
(error and failure recovery), (AT&T's) RFS (_full_ Unix semantics,
incl. remote /dev, when talking to another Unix machine), and Andrew
(caching remote mounted files to a more local machine), and make it as
available as Sun has made NFS.  Being able to run a pair of servers
as a redundant, reliable, read/write file system would be a nice
bonus.

		-- Paul Placeway <pplaceway@bbn.com>
		   (speaking for myself)

markb@unix386.Convergent.COM (Mark Beyer) (09/15/89)

I've just started reading this thread, so my apologies if this has
already been discussed:

In article <45595@bbn.COM>, news@bbn.COM (News system owner ID) writes:
> liam@cs.qmc.ac.uk (William Roberts) writes:
> < jtkohl@athena.mit.edu (John T Kohl) writes:
> < > liam@cs.qmc.ac.uk (William Roberts) writes:

[stuff about problems with NFS operations not being idempotent]

There's a paper in the 1989 Winter USENIX conference proceedings
by Chet Juszczak at DEC that describes a modification to the
NFS server that may be germane.

As I understand, the problem is that a sequence of non-idempotent
client requests like "create, write, create" may get jumbled at the
server because of server load and the stateless nature of the NFS
protocol.

This paper describes an extension of the server cache mechanism.
The cache is expanded to track all request types and the NFS layer
makes greater use of the cache as well.

Actually, the major reason for doing this was to improve
performance by sorting out duplicate requests, but it looks like
they thought it would solve these jumbled request problems.

Anyone from MIT or DEC care to comment on this ?  How well has it worked ?
Has anyone else implemented this idea in their NFS server ?

Many thanks,
Mark Beyer
{pyramid,pacbell,decwrl,sri-unix}!ctnews!markb

news@bbn.COM (News system owner ID) (09/19/89)

In article <45595@bbn.COM> pplaceway@izar.bbn.com (Paul W. Placeway)
*I* wrote:
< As near as I can tell, claiming that NFS ops are idempotent is nothing
< more than "marketing blurf" from Sun marketing.  The problem is that
< Sun didn't think quite hard enough about what a remote file system has
< to do before going off and writing one; and now we are stuck with it.

I didn't really intend this to be flamable, but it came out that way.
To those who are offended, I apologize.

I don't mean to drag Sun through the mud, just point out that most of
the world is using NFS v.2, and so here are these problems, some of
which could have been caught in the original design and testing
phases.

Several things that NFS should (or could) do (NFS v.3 does at least
some of these):

The client should try to predict the server's mean delay time, a la
the Jacobson and Karels TCP timeout predictor.  I believe NFS 3 does
something like this.
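
The Jacobson/Karels estimator keeps a smoothed round-trip time and a
smoothed mean deviation, and sets the retransmit timeout to the mean
plus a multiple of the deviation. A sketch of the standard formulation
(the gains and starting values here are illustrative, not from any NFS
source):

```python
# Jacobson/Karels-style retransmit-timeout estimator: srtt tracks the
# mean RTT, rttvar tracks the mean deviation, and the timeout (rto)
# is srtt plus k times rttvar.

def update_rto(srtt, rttvar, sample, g=0.125, h=0.25, k=4):
    err = sample - srtt
    srtt = srtt + g * err                 # exponentially weighted mean
    rttvar = rttvar + h * (abs(err) - rttvar)
    rto = srtt + k * rttvar               # variance-padded timeout
    return srtt, rttvar, rto

srtt, rttvar = 100.0, 50.0                # arbitrary starting values (ms)
for sample in (120.0, 80.0, 300.0):       # a spike inflates the timeout
    srtt, rttvar, rto = update_rto(srtt, rttvar, sample)
```

Because the timeout is padded by the measured deviation, a loaded
server with erratic response times gets a longer timeout instead of a
flood of spurious retransmissions (which is exactly what aggravates
the non-idempotency problems above).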

The client and server should both do something about reducing the read
and write sizes in the face of a lot of failures, so the sys admin
doesn't have to guess about the right size to set them in /etc/fstab.

NFS has this split personality problem of being both a UNIX file
system sharing system, and a heterogeneous system file sharing system.
It would be nice if more things could be done between like systems (so
that I could mount another UNIX server's /dev just like RFS, but still
mount normal files from VMS, VM, etc.).  Some idea of levels and types
of conformance would probably bridge the gap here.

It would be _real_ nice to be able to set up two servers as a single,
reliable, mirrored NFS file server.  There are already experimental
systems to do this, but it would be nice if Sun developed one for
general consumption.

Whatever.

	More random ramblings from:
		-- Paul Placeway
		   <pplaceway@bbn.com>, <paul@cis.ohio-state.edu>

guy@auspex.auspex.com (Guy Harris) (09/21/89)

 >I don't mean to drag Sun through the mud, just point out that most of
 >the world is using NFS v.2, and so here are these problems, some of
 >which could have been caught in the original design and testing
 >phases.
 >
 >Several things that NFS should (or could) do (NFS v.3 does at least
 >some of these):

I presume by "V.2" and "V.3" you're referring to versions of the
protocol, not the implementation; some of the items you list are
implementation issues, not protocol issues - e.g., the

 >The client should try to predict the server's mean delay time, a la
 >Jacobson and Karels TCP timeout predictor.  I believe NFS 3 does
 >something like this.

item is an implementation issue, and I think Sun's working on a client
implementation for protocol version 2 that does that.  You may be able
to make changes to the protocol that make it work better, but that
doesn't mean it's impossible in version 2.