[comp.unix.internals] On the silliness of close

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (10/19/90)

In article <1990Oct18.200939.17427@athena.mit.edu> jik@athena.mit.edu (Jonathan I. Kamens) writes:
> In article <19547:Oct1818:25:2690@kramden.acf.nyu.edu>, brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
    [ NFS should deal with quotas correctly ]
> Hogwash.  If close() is specified as returning an error value, then it is
> reasonable for it to sometimes return an error value, and it is also
> reasonable for the programmer to be expected to check and deal with its return
> value.

That's a truism. Yes, any system call might return -1/EINTR. Yes,
close() should return -1/EBADF if the descriptor isn't open. These two
errors are documented and reasonable for close() to return, and it's
reasonable to expect the programmer to deal with them.

No other close() errors make sense.

You say that close() should be able to return -1/EDQUOT. That's hogwash.
EDQUOT can and should be detected immediately upon the write() that
triggers it. There's no reason that the system call for saying ``Okay,
forget about this descriptor, I'm done with it'' should produce an errno
saying ``But wait! I neglected to mention this to you when you actually
wrote the data, but you're out of space! Don't you dare forget about
this descriptor when you're out of space! Hold the presses!''

Perhaps you don't yet see how silly that is. Has it occurred to you that
the application may have erased the data that it wrote to disk? Are you
going to insist that every write() be backed up by temporary buffers
that accumulate a copy of all data written until the program dies? Well?

>   Just as it is reasonable for write() to return an error when the disk is
> full, it is reasonable for close() to do so, and the man page should be
> updated to reflect that.

I fail to see your logic. Can I substitute any two system calls there?
``Just as it is reasonable for unlink() to return an error when the file
doesn't exist, it is reasonable for setuid() to do so, and the man page
should be updated to reflect that.''

Now what is your argument?

>   You seem to be saying that filesystems should never be allowed to postpone
> writing to disk until close(),

No. I am saying that there is no excuse for a filesystem to let a
program write() more than the user's quota without giving an error.

``But then it has to send a request immediately over the network and
wait to find out how much space is left!'' you say. Not so. Does TCP
force each side to stay synchronized on every packet? Of course not. The
file server can just pass out ``allocations'' of disk space. Things only
slow down when you're very close to the hard limit.
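[Editorial sketch of the allocation scheme Dan describes: the server grants the client a chunk of quota up front, the client charges write()s against it locally, and only goes back to the server when the chunk runs low. Every name and number below is invented for illustration; this is not code from any real NFS implementation.]

```c
#include <assert.h>

#define CHUNK 8192L                    /* bytes granted per round trip (made up) */

static long server_quota_left = 20000; /* stand-in for the server's quota state */
static long client_allocation = 0;     /* what the client currently holds */

/* Server side: hand out another allocation (0 means the hard limit is hit). */
static long request_allocation(void)
{
    long grant = server_quota_left < CHUNK ? server_quota_left : CHUNK;
    server_quota_left -= grant;
    return grant;
}

/* Client side: charge nbytes against the local allocation, topping up
 * from the server only when needed.  Returns 0 on success, -1 when the
 * quota is exhausted -- the moment write() could return EDQUOT. */
int charge_write(long nbytes)
{
    while (client_allocation < nbytes) {
        long grant = request_allocation();
        if (grant == 0)
            return -1;          /* hard limit reached: report EDQUOT now */
        client_allocation += grant;
    }
    client_allocation -= nbytes;
    return 0;
}
```

Only every CHUNK bytes does the client talk to the server, so, as claimed, things slow down only near the hard limit.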

---Dan

thurlow@convex.com (Robert Thurlow) (10/19/90)

In <24048:Oct1822:23:2090@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:

>["I don't mind close returning -1/EINTR and -1/EBADF."]
>No other close() errors make sense.

So how do you pick up errors on asynchronous writes?  You are never
going to know about them if close can't return an error here.  It makes
sense to do an implicit fsync on close, and I think the error code from
that operation has to be propagated.  You may not be able to do a lot
of constructive rebuilding in this case, but your program can at least
let the user know that, "hey, something really evil happened."  I'd
shoot a program (like Berkeley's "cp") that didn't at least let me know
via an exit status.  You may choose to let your program retry the close
until it works, too.
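[The discipline Rob describes -- check write() and close() both, and let neither error be dropped -- is only a few lines of C. This is a bare sketch; a real cp would also report errno and the file name.]

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Write a buffer to a file, checking both write() and close(). */
int copy_out(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len) {
        close(fd);          /* best effort; the write error wins */
        return -1;
    }
    if (close(fd) < 0)      /* deferred errors (ENOSPC, EDQUOT over */
        return -1;          /* NFS) can surface right here          */
    return 0;
}
```

A caller that exits with this return value gives the user at least the exit-status notification Rob asks for.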

>EDQUOT can and should be detected immediately upon the write() that
>triggers it.

This probably can be arranged, though it takes some care to not get
hosed by multiple processes.

>Perhaps you don't yet see how silly that is. Has it occurred to you that
>the application may have erased the data that it wrote to disk? Are you
>going to insist that every write() be backed up by temporary buffers
>that accumulate a copy of all data written until the program dies? Well?

This is ridiculous.  If a program wants to _know_ the data is secured,
it can call fsync before it frees the buffers or overwrites the data.
If you don't care enough about the data to do this, go another step
and cast the return value of close to "void".
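[The fsync-before-you-let-go pattern is equally short. A sketch, assuming fsync really does force the dirty pages out, as Rob argues below that it must:]

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Before freeing or overwriting a buffer, fsync() the descriptor so
 * any deferred write error surfaces while the data is still in hand
 * to retry or report. */
int flush_before_discard(int fd)
{
    if (fsync(fd) < 0)
        return -1;  /* data not known safe: keep the buffer */
    return 0;       /* now it is OK to free or overwrite it */
}
```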

>``But then it has to send a request immediately over the network and
>wait to find out how much space is left!'' you say. Not so. Does TCP
>force each side to stay synchronized on every packet? Of course not. The
>file server can just pass out ``allocations'' of disk space.

"Allocations"?  I won't lightly put *any* state into my NFS server, never
mind state to take care of frivolities like close returning EDQUOT.

Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."

jik@athena.mit.edu (Jonathan I. Kamens) (10/19/90)

  Dan, I think this is one of those issues where we are arguing because you
are convinced that the way you think Unix should be is the One True Way and
that everyone who thinks differently from you is wrong.  Or, at least, that's
the tone you project in your postings on this topic (you projected exactly the
same tone when we discussed syslog).

In article <24048:Oct1822:23:2090@kramden.acf.nyu.edu>, brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
|> In article <1990Oct18.200939.17427@athena.mit.edu> jik@athena.mit.edu (Jonathan I. Kamens) writes:
|> > In article <19547:Oct1818:25:2690@kramden.acf.nyu.edu>, brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
|>     [ NFS should deal with quotas correctly ]

  Here's the first example of what I mean.  You summarize what you posted as
"NFS should deal with quotas correctly."  Dan, if we all agreed that the way
you want NFS to deal with quotas is the "correct" way, then we wouldn't be
arguing.  But no, everyone who disagrees with you must be wrong, since you
know the One True Way that Unix should work, so you have the right to call
your ideas "correct" and everyone else's "wrong".

|> That's a truism. Yes, any system call might return -1/EINTR. Yes,
|> close() should return -1/EBADF if the descriptor isn't open. These two
|> errors are documented and reasonable for close() to return, and it's
|> reasonable to expect the programmer to deal with them.

  Actually, my close(2) man page says nothing about EINTR.  Does that mean my program
doesn't have to be prepared to deal with an interrupted close()?  No, it means
that the man page should be updated.  Which is what I said about close(2) and
EDQUOT in my last posting.  Point for me, I think.

|> No other close() errors make sense.

  It is your *opinion* that no other close() errors make sense.  There are a
whole lot of people, including the designers of NFS and the designers of AFS,
who may not agree with you.

  Actually, I take that back.  I have heard that the designers of AFS are
working on a fix for a later release which will cause quota problems to
be reported to users in a more timely manner, although I don't know what
mechanism they are using to do so.

  However, that doesn't change my main point, which is that I do not think it
is reasonable to place a restriction on all future filesystems developed for
Unix that they all must detect quota problems immediately.  You may be able to
foresee what kinds of filesystems we'll be developing in ten or fifteen years
(when, of course, Unix will have taken over the world :-), but I don't think I
can, and I think it's completely reasonable to expect that there will probably
be filesystem developers who will want to be able to defer quota errors until
close(), and it may even be a justifiable thing to do on the filesystems they
are developing.

  The only justification you have given for your claim that no other close()
errors "make sense" is that you don't think filesystems should work that way. 
I'm afraid that's just not good enough.

  And your comment, "No other close() errors make sense," is yet another
example of your One True Way attitude.  Lighten up.

|> You say that close() should be able to return -1/EDQUOT. That's hogwash.
|> EDQUOT can and should be detected immediately upon the write() that
|> triggers it. There's no reason that the system call for saying ``Okay,
|> forget about this descriptor, I'm done with it'' should produce an errno
|> saying ``But wait! I neglected to mention this to you when you actually
|> wrote the data, but you're out of space! Don't you dare forget about
|> this descriptor when you're out of space! Hold the presses!''

  You are still back in the days when all the world was UFS.  Close() doesn't
mean what you say above anymore (actually, it's possible to argue that it
never did).  What it means is, "I'm done with this file descriptor, so please
do any finishing touches that you need to on it and then close it up, and let
me know if you have any problems with that."

  Your argument above boils down to, "Close() didn't used to be able to return
EDQUOT, so it shouldn't be able to do so now."  Sorry, but that just doesn't
cut it.  Try again.

|> Perhaps you don't yet see how silly that is. Has it occurred to you that
|> the application may have erased the data that it wrote to disk? Are you
|> going to insist that every write() be backed up by temporary buffers
|> that accumulate a copy of all data written until the program dies? Well?

  First of all, the problem isn't as large as you are trying to make it out to
be, because most of the programs I've worked on or used don't even deal well with a
write() failing because of a quota problem; they just say, "Oops, I can't
write, I'm going to give up completely," and forget about whatever they were
doing; they make hardly any effort to preserve data.  Therefore, an error on
close() wouldn't be any worse than an error on write() for most programs.

  For the few programs that are planning on doing serious error recovery if
they can't write to the disk and that are planning on trying to preserve all
data even if a disk write fails, then yes, I expect them to do some sort of
buffering.  If you want a robust program, then you put more work into it.

|> I fail to see your logic. Can I substitute any two system calls there?
|> ``Just as it is reasonable for unlink() to return an error when the file
|> doesn't exist, it is reasonable for setuid() to do so, and the man page
|> should be updated to reflect that.''
|> 
|> Now what is your argument?

  Although not everyone may agree that the behavior we are discussing is
reasonable for remote filesystems, I think even the most thick-headed reader
of this group can recognize that it is more reasonable for a system call that
operates on the filesystem to encounter quota-related errors than it is for a
system call that changes your UID in the kernel to encounter quota-related
errors.  Your introduction of fallacious reductio ad absurdum arguments into
the discussion just makes it harder to discuss the subject matter with you
intelligently.

  My comparison of write() to close() was reasonable because both deal with
writing files.  Your comparison of unlink() to setuid() is ludicrous.

  And your comment, "Now what is your argument?" is another One True Way
comment.  Once again, lighten up.

|> No. I am saying that there is no excuse for a filesystem to let a
|> program write() more than the user's quota without giving an error.

  This is your opinion, and you are entitled to it.  You are also entitled to
write code that does not check close() for EDQUOT, in which case your code
will lose data when writing to remote filesystems and your customers will
complain to you, and you can tell them that you can't fix it because close()
*shouldn't* return EDQUOT.

  When you've written a filesystem as successful as NFS, which works as well
as AFS, which doesn't have the close() problem we're discussing, I'll gladly
use it, and I'll gladly try to get other people to use it, and I'll thank you
a thousand times.  But until then, I'll stick with the people who *have* done
it.  And I'll admit that I may not be able to see forever into the future and
tell that we will never ever be able to justify a filesystem that doesn't
detect quota problems on write().

|> ``But then it has to send a request immediately over the network and
|> wait to find out how much space is left!'' you say. Not so. Does TCP
|> force each side to stay synchronized on every packet? Of course not. The
|> file server can just pass out ``allocations'' of disk space. Things only
|> slow down when you're very close to the hard limit.

  Please explain what you're proposing here -- I don't quite get it.

-- 
Jonathan Kamens			              USnail:
MIT Project Athena				11 Ashford Terrace
jik@Athena.MIT.EDU				Allston, MA  02134
Office: 617-253-8085			      Home: 617-782-0710

gt0178a@prism.gatech.EDU (Jim Burns) (10/19/90)

It seems to me the close() problem is no different than the fclose()
problem. fclose() is known to do final flushing of stdio, and if you run
out of space YSOOL.

As far as quotas go, I think it is possible to check that, tho' absolute
reliability may incur performance penalties. If quota(1) can tell you what
you are using, then either the write does the equivalent of quota(1) to
determine free=quota-used EVERY time, or it does it once at the beginning
of the program and tries to keep an accurate estimation of what you are
writing in your program. (Yuech, but it can be done.) (Then again, who
ever said addons like quotas ever seamlessly interface w/the standard
system. :-)

The much more serious problem is out of filesystem space. I've lost more
files in vi(1) than I care to remember because it didn't tell me there was
an error. (I don't use ZZ anymore - I do :w, then switch to another window
and check that it got there.) If you are going to buffer (explicitly in
stdio, or implicitly in NFS) you are going to have to expect (f)close errors.
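[A sketch of the (f)close discipline Jim is pointing at: check fflush() first, so the error is reported while the stream is still open, then check fclose() too.]

```c
#include <assert.h>
#include <stdio.h>

/* stdio analogue of the close() problem: fclose() does the final
 * flush, so a full disk can first surface there.  Flushing explicitly
 * reports the error while the FILE is still open. */
int careful_fclose(FILE *fp)
{
    int err = 0;
    if (fflush(fp) == EOF)  /* buffered data could not be written */
        err = -1;
    if (fclose(fp) == EOF)  /* close (or its final flush) failed  */
        err = -1;
    return err;
}
```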
-- 
BURNS,JIM
Georgia Institute of Technology, Box 30178, Atlanta Georgia, 30332
uucp:	  ...!{decvax,hplabs,ncar,purdue,rutgers}!gatech!prism!gt0178a
Internet: gt0178a@prism.gatech.edu

bzs@world.std.com (Barry Shein) (10/20/90)

From: gt0178a@prism.gatech.EDU (Jim Burns)
>As far as quotas go, I think it is possible to check that, tho' absolute
>reliability may incur performance penalties. If quota(1) can tell you what
>you are using, then either the write does the equivalent of quota(1) to
>determine free=quota-used EVERY time, or it does it once at the beginning
>of the program and tries to keep an accurate estimation of what you are
>writing in your program. (Yuech, but it can be done.) (Then again, who
>ever said addons like quotas ever seamlessly interface w/the standard
>system. :-)

What is with all the hand-wringing here? To implement quota you have
to keep one (perhaps two, soft/hard) integers up to date. BFD. On a
single system it has to be kept user-global, in the user struct
basically, one per major/minor device. A pointer to the correct ints
can be thrown in the per-file struct so it all devolves to something
like:

	if((*file[fd].quota - nwrite) < 0)
		bitch;
	else
		*file[fd].quota -= nwrite;

(with appropriate adjustments for blocks and soft/hard) on each write.
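[Filled in with the soft/hard adjustment Barry alludes to, the check might look like the following. Field names are invented and block rounding is ignored.]

```c
#include <assert.h>
#include <errno.h>

struct quota_rec {
    long soft;  /* bytes left before the warning threshold */
    long hard;  /* bytes left before writes are refused    */
};

/* Charge nwrite bytes: 0 = ok, 1 = over soft limit (warn the user),
 * -1 = over hard limit (the write fails with EDQUOT). */
int charge_quota(struct quota_rec *q, long nwrite)
{
    if (q->hard - nwrite < 0) {
        errno = EDQUOT;     /* the "bitch" branch: refuse the write */
        return -1;
    }
    q->hard -= nwrite;
    q->soft -= nwrite;
    return q->soft < 0 ? 1 : 0;
}
```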

For a distributed file system it's harder because of the possibility
of multiple writing processes. But writes in NFS are synchronous, so
that's not hard, the server knows and will return the error and the
write() doesn't return until it's committed. If you shut that off for
performance gains you're on your own.

Am I missing something hard here?
-- 
        -Barry Shein

Software Tool & Die    | {xylogics,uunet}!world!bzs | bzs@world.std.com
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD

bzs@world.std.com (Barry Shein) (10/20/90)

>>["I don't mind close returning -1/EINTR and -1/EBADF."]
>>No other close() errors make sense.
>
>So how do you pick up errors on asynchronous writes?  You are never
>going to know about them if close can't return an error here.

Well, someone has to say it...

OS/360/370 interrupted (signal) when there was an I/O error
(SYNAD=handler) of any sort and you could pick apart what happened in
the handler() routine. SYNAD stood for Syndrome Address if I remember
correctly.

The only hard part was knowing where your program was when the error
struck. A typical technique was to use a global integer and just set
it some value to indicate what you were about to do (often called
"phase", as in "phase = SAVEBUFS", you could hide that sort of thing
pretty well in macros.)

They also interrupted on EOF.

This is where all the ON CONDITION stuff in PL/I makes sense, it
basically models the OS/370 internals.

I am tempted at this point to say "if you want MVS, you know where to
find it", but I won't :-)

I've certainly encountered programs which would have been much easier
to debug if I just could have enabled some sort of "interrupt on any
syscall error", set a bunch of handlers in main and let it fly in a
debugger until one struck, and then look back up the stack to see who
did it. These were typically programs managing to pass garbage args to
system calls but not checking for errors, or writing to bad fd's etc.
Unfortunately many programs can go on for quite a while after such an
event so it's hard to figure out where the chaos started.

It would be pretty easy to implement at user-level by just having a
replacement library for the libc syscalls which checked for an error
return and just did a kill(0,SIGUSR1) or similar if something was
wrong.
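[A user-level sketch of that idea: a replacement for one libc call that raises a signal on any error, so a debugger stops at the culprit. SIGUSR1 and the name checked_write are arbitrary choices here; raise() is used instead of kill(0, ...), which would signal the whole process group. A real shim library would wrap every syscall the same way.]

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

ssize_t checked_write(int fd, const void *buf, size_t n)
{
    ssize_t r = write(fd, buf, n);
    if (r < 0)
        raise(SIGUSR1); /* breakpoint here; errno says what went wrong */
    return r;
}
```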
-- 
        -Barry Shein

Software Tool & Die    | {xylogics,uunet}!world!bzs | bzs@world.std.com
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (10/20/90)

In article <thurlow.656303314@convex.convex.com> thurlow@convex.com (Robert Thurlow) writes:
> In <24048:Oct1822:23:2090@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> >["I don't mind close returning -1/EINTR and -1/EBADF."]
> >No other close() errors make sense.
> So how do you pick up errors on asynchronous writes?

That is an excellent question. I suspect that if UNIX had truly
asynchronous I/O, my objections would disappear, as the whole system
would work much more cleanly. Somehow I'm not sure the latest mound of
P1003.4 is the right way to approach asynchronous I/O, but anyway...
``What asynchronous writes?''

  [ I object that programs can't afford to keep data around in case of ]
  [ possible problems; errors must be returned in a timely fashion ]
> This is ridiculous.  If a program wants to _know_ the data is secured,
> it can call fsync before it frees the buffers or overwrites the data.

I sympathize with what you're trying to say, but have you noticed that
fsync() isn't required to flush data over NFS, any more than write() is
required to return EDQUOT correctly? If write()'s errors aren't
accurate, I don't know how you expect fsync() to work.

> "Allocations"?  I won't lightly put *any* state into my NFS server, never
> mind state to take care of frivolities like close returning EDQUOT.

No good remote file system is stateless. I think every complaint I've
heard about NFS is caused by the ``purity'' of its stateless
implementation.

---Dan

gt0178a@prism.gatech.EDU (Jim Burns) (10/20/90)

in article <BZS.90Oct19233255@world.std.com>, bzs@world.std.com (Barry Shein) says:

> OS/360/370 interrupted (signal) when there was an I/O error
> (SYNAD=handler) of any sort and you could pick apart what happened in
> the handler() routine.

> The only hard part was knowing where your program was when the error
> struck.

And what do you do if the I/O is done as the result of a close? And worse,
if the close is a result of exit processing, and the process doesn't exist
anymore to get the interrupt? Without timely notification, it's hard to
recover.

-- 
BURNS,JIM
Georgia Institute of Technology, Box 30178, Atlanta Georgia, 30332
uucp:	  ...!{decvax,hplabs,ncar,purdue,rutgers}!gatech!prism!gt0178a
Internet: gt0178a@prism.gatech.edu

bzs@world.std.com (Barry Shein) (10/20/90)

From: gt0178a@prism.gatech.EDU (Jim Burns)
>in article <BZS.90Oct19233255@world.std.com>, bzs@world.std.com (Barry Shein) says:
>
>> OS/360/370 interrupted (signal) when there was an I/O error
>> (SYNAD=handler) of any sort and you could pick apart what happened in
>> the handler() routine.
>
>> The only hard part was knowing where your program was when the error
>> struck.
>
>And what do you do if the I/O is done as the result of a close? And worse,
>if the close is a result of exit processing, and the process doesn't exist
>anymore to get the interrupt? Without timely notification, it's hard to
>recover.

Uh, we're going around in circles here. The assumption in that thread
was that none of what you mention was the case. Obviously if you care
about errors you better do your own close() and not leave it to the
process rundown (but there's no reason that can't interrupt also.)
-- 
        -Barry Shein

Software Tool & Die    | {xylogics,uunet}!world!bzs | bzs@world.std.com
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD

bzs@world.std.com (Barry Shein) (10/21/90)

From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)
>I sympathize with what you're trying to say, but have you noticed that
>fsync() isn't required to flush data over NFS, any more than write() is
>required to return EDQUOT correctly? If write()'s errors aren't
>accurate, I don't know how you expect fsync() to work.

I assume by NFS you mean the NFS from Sun. Writes are always
synchronous in NFS or must appear to be (or are non-compliant and
you're on your own.) So fsync() for writes is a no-op and irrelevant
in that case.
-- 
        -Barry Shein

Software Tool & Die    | {xylogics,uunet}!world!bzs | bzs@world.std.com
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD

thurlow@convex.com (Robert Thurlow) (10/21/90)

In <BZS.90Oct19231046@world.std.com> bzs@world.std.com (Barry Shein) writes:
>For a distributed file system it's harder because of the possibility
>of multiple writing processes. But writes in NFS are synchronous, so
>that's not hard, the server knows and will return the error and the
>write() doesn't return until it's committed. If you shut that off for
>performance gains you're on your own.

write(2) is not synchronous to the process; the NFS write operation is,
but the writes may be done on your behalf by block I/O daemon processes
at a later time.  We allow synchronous writes, but that isn't what a
naive process gets.

#include <osi/std/disclaimer>
Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."

thurlow@convex.com (Robert Thurlow) (10/21/90)

In <9681:Oct2004:06:3090@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:

>In article <thurlow.656303314@convex.convex.com> thurlow@convex.com (Robert Thurlow) writes:
>> In <24048:Oct1822:23:2090@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
>> >["I don't mind close returning -1/EINTR and -1/EBADF."]
>> >No other close() errors make sense.
>> So how do you pick up errors on asynchronous writes?

>That is an excellent question. I suspect that if UNIX had truly
>asynchronous I/O, my objections would disappear, as the whole system
>would work much more cleanly. Somehow I'm not sure the latest mound of
>P1003.4 is the right way to approach asynchronous I/O, but anyway...
>``What asynchronous writes?''

I have them by my definition - "my process can have control before my
data has been committed to permanent storage".  What definition are you
using, and why don't you feel my write(2) calls are asynchronous?

>  [ I object that programs can't afford to keep data around in case of ]
>  [ possible problems; errors must be returned in a timely fashion ]
>> This is ridiculous.  If a program wants to _know_ the data is secured,
>> it can call fsync before it frees the buffers or overwrites the data.

>I sympathize with what you're trying to say, but have you noticed that
>fsync() isn't required to flush data over NFS, any more than write() is
>required to return EDQUOT correctly? If write()'s errors aren't
>accurate, I don't know how you expect fsync() to work.

Our fsync, like Sun's, ensures there are no pages in the VM system
marked as "dirty", and it does this by forcing and waiting for I/O
on each such page.  The I/O involves an NFS write, and any I/O errors
are detected.  I consider any system that doesn't do this for me when
I call fsync to be broken.  If you can think of a way that this can
fail to return an error indication to me, please send me a test case.

>> "Allocations"?  I won't lightly put *any* state into my NFS server, never
>> mind state to take care of frivolities like close returning EDQUOT.

>No good remote file system is stateless. I think every complaint I've
>heard about NFS is caused by the ``purity'' of its stateless
>implementation.

No doubt, but I appreciate the advantages of the simplicity this allows.
When it is clear what state we need to introduce to make a more robust
implementation, it'll probably happen.

#include <osi/std/disclaimer>
Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."

thurlow@convex.com (Robert Thurlow) (10/21/90)

In <BZS.90Oct20172142@world.std.com> bzs@world.std.com (Barry Shein) writes:
>I assume by NFS you mean the NFS from Sun. Writes are always
>synchronous in NFS or must appear to be (or are non-compliant and
>you're on your own.) So fsync() for writes is a no-op and irrelevant
>in that case.

This is exactly wrong;  Sun ships a biod(8) daemon to support read-ahead
and write-behind async I/O over NFS, and fsync() is certainly needed.

#include <osi/std/disclaimer>
Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."

jik@athena.mit.edu (Jonathan I. Kamens) (10/22/90)

In article <BZS.90Oct19231046@world.std.com>, bzs@world.std.com (Barry Shein) writes:
|> Am I missing something hard here?

  I log into machine A and start a process that's going to write a big file
into AFS.  Then, I log into machine B and do the same thing, but another file.
Either of the files, when finished, will fit into my quota, but not both of
them.

  The process on A and the process on B both query the AFS server to get the
starting quota, and then go ahead and start writing.

  Both of them finish.  The first one to finish will have no problems with the
close().  The second will get EDQUOT.

  As for NFS, as I believe someone else has already pointed out, the NFS
*protocol* is synchronous, but NFS client kernels do not guarantee that the
*client* will get synchronous behavior.

  Assuming that a process/kernel on one machine can accurately keep track of
quotas on a remote filesystem by only making one quota call before doing any
operations on the device is just asking for trouble.  And don't forget that
you're going to have to update the quota status in the kernel after *any* file
operation on that device -- when a file is removed, for example, the local
quota record will have to be updated.  As far as I know, NFS client kernels
don't have to ask how big a file is before they remove it, but if we do what
you suggest, they'd have to do that, because they'd have to be able to
subtract the size of the file from the local quota record.

  All this to avoid EDQUOT on close(), something which I still think is quite
reasonable.

-- 
Jonathan Kamens			              USnail:
MIT Project Athena				11 Ashford Terrace
jik@Athena.MIT.EDU				Allston, MA  02134
Office: 617-253-8085			      Home: 617-782-0710

richard@aiai.ed.ac.uk (Richard Tobin) (10/23/90)

In article <BZS.90Oct20172142@world.std.com> bzs@world.std.com (Barry Shein) writes:
>I assume by NFS you mean the NFS from Sun. Writes are always
>synchronous in NFS or must appear to be (or are non-compliant and
>you're on your own.) So fsync() for writes is a no-op and irrelevant
>in that case.

The writes performed by the client kernel to the remote server must be
synchronous, so that a server crash is transparent.  Writes by the
application don't need to be synchronous - the client kernel may buffer
them - since it is not required that client crashes be transparent (:-).

-- Richard
-- 
Richard Tobin,                       JANET: R.Tobin@uk.ac.ed             
AI Applications Institute,           ARPA:  R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk
Edinburgh University.                UUCP:  ...!ukc!ed.ac.uk!R.Tobin

jeff@ingres.com (Jeff Anton) (10/23/90)

|>  When you've written a filesystem as successful as NFS, which works as well
|>as AFS, which doesn't have the close() problem we're discussing, I'll gladly
|>use it, and I'll gladly try to get other people to use it, and I'll thank you
|>a thousand times.  But until then, I'll stick with the people who *have* done
|>it.  And I'll admit that I may not be able to see forever into the future and
|>tell that we will never ever be able to justify a filesystem that doesn't
|>detect quota problems on write().

NFS, though successful, is hardly a system to present as a good example
of robust out of space error reporting.  ENOSPC handling is less than
optimal even in a buffered situation.  NFS 'remembers' if a write caused
an out of space condition and refuses to query the server about further
writes until the file is closed and reopened.  (And the close() returns
ENOSPC as well).  Two problems with this optimization, first you can't
seek backwards in the file to write an indication that you ran out of space
because you can't overwrite existing allocated blocks - the no further
writes rule disallows this so you can't recover from ENOSPC even if you
checked for the case, and second if two processes have the file open they
have to communicate and close the file together to clear the no further
writes condition.  I think this behavior is to prevent the stupid program
which ignores errors from write from killing the NFS server & network
performance.  A simple fix would be to have lseek clear the no further
writes condition on the grounds that after seeking a write might succeed.

Also, what do you do if close() reports an error like ENOSPC?  Did close
release the file descriptor or not?  You would have to fstat it to
determine if it is closed.  And if it was not closed how would you close
it?  We need documentation!

This might be a simple bug, but no vendor has admitted it.  It's just
another performance vs. reliability trade off.  (O_SYNC doesn't help).
(Actually, I've not tested for this bug in the last year or so, it might
be fixed in some places.)
					Jeff Anton

greywolf@unisoft.UUCP (The Grey Wolf) (10/25/90)

[ Heeeere, newsposter.  Heeeeeere boy.  C'mon, lap it up.... ]

My thanks to everyone who answered my question, as silly as it must have
seemed, and my profuse and profound apologies for opening up a can of
worms.  I must confess that I did not expect that such a simple question as
I asked would cause such multiple demon instantiation (pandemonium) as it
did.

Thanks again.
-- 
"This is *not* going to work!"
				"Well, why didn't you say so before?"
"I *did* say so before!"
...!{ucbvax,acad,uunet,amdahl,pyramid}!unisoft!greywolf

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (10/26/90)

Skip to ``function'' if you want to see a couple of nasty challenges
rather than all the details.

In article <1990Oct19.055913.7103@athena.mit.edu> jik@athena.mit.edu (Jonathan I. Kamens) writes:
>   Dan, I think this is one of those issues where we are arguing because you
> are convinced that the way you think Unix should be is the One True Way and
> that everyone who thinks differently from you is wrong.

One of the problems with your writing style is that you put ``I think''
before each of your opinions. One of the problems with your reading
style is that you assume everyone else shares your stilted view of
presentation.

I don't pretend that what I say is gospel. *Obviously* it's only what I
believe. Stop making me out to be some sort of religious fanatic, and
start presenting rational arguments that might convince me of something.

> Or, at least, that's
> the tone you project in your postings on this topic (you projected exactly the
> same tone when we discussed syslog).

Oh? Here's the model for your syslog argument: ``1. Dan says a logging
system shouldn't be *forced* under central control. 2. Dan says that
central control is a reasonable *default*. 3. syslog has central control
as its *only* option. 4. Therefore syslog is sufficient for Dan.''

That's simply ridiculous. I've italicized above the words that you kept
ignoring in that discussion. I can configure stderr, even though central
control is a good *default*. I can use stderr securely, even though
insecure /dev/log may be a good *default*. I can use stderr reliably,
even though unreliable syslog may be a good *default*.

Your final article went through each of the above points and many
others. You kept repeating ``well, you say that central control and
/dev/log are enough, so why don't you like syslog?''

Now *that's* a ``One True Way'' attitude.

Someday I hope you will realize how restrictive syslog is. I was never
really angry at it until I tested a new (slightly buggy, of course)
version of telnetd. Was I supposed to disable all the syslog()s while
testing? Eventually I had to do just that, so that I wouldn't flood the
real system logs with messages that could not be distinguished from
important messages from the running telnetd. I hope you hit a similar
wall sometime.

In the meantime, don't accuse *me* of ``projecting'' a ``tone.'' It was
your refusal to respond rationally that told me you weren't paying
attention.

>   Actually, my close(2) says nothing about EINTR.  Does that mean my program
> doesn't have to be prepared to deal with an interrupted close()?

I talk about this in another article.

> |> No other close() errors make sense.
>   It is your *opinion* that no other close() errors make sense.  There are a
> whole lot of people, including the designers of NFS and the designers of AFS,
> who may not agree with you.

This is exactly what I hate the most about your argument style. I
express opinion X. You can't figure out a rational response to X. So you
say (substitute appropriately): ``It is your *opinion* that X. You are
extremely opinionated. There are lots of people who disagree with you. I
disagree with you.''

I say that EINTR and EBADF make sense, given the documented function of
close(). You say they don't. I challenge you to explain why EDQUOT is a
sensible error for the ``delete a descriptor'' function to return.

Skip to ``Suppose'' for a second challenge.

>   Actually, I take that back.  I have heard that the designers of AFS are
> working on a fix for a later release which will cause quota problems to
> be reported to users in a more timely manner, although I don't know what
> mechanism they are using to do so.

Excellent. It's good to hear that filesystem designers are joining the
real world.

>   However, that doesn't change my main point, which is that I do not think it
> is reasonable to place a restriction on all future filesystems developed for
> Unix that they all must detect quota problems immediately.

Did I ever claim such a restriction? No. In fact, I'm strongly against
any rules that restrict future extensions. However, it is most logical
for EDQUOT to be detected upon the offending write()---and current
filesystems *can do so* without undue implementation difficulty. So it's
silly for close() to return EDQUOT.

>   The only justification you have given for your claim that no other close()
> errors "make sense" is that you don't think filesystems should work that way. 
> I'm afraid that's just not good enough.

Here goes your argument style again. Jon, you are the one arguing for a
change from historical implementations, so it's incumbent upon you to
prove your point. You say that EDQUOT ``makes sense'' for close(), but
the only justification you've given is that NFS does it that way. Surely
if you feel so strongly about this issue then you have some rational
reason for your belief? How about sharing this reason with the rest of
us?

> |> You say that close() should be able to return -1/EDQUOT. That's hogwash.
> |> EDQUOT can and should be detected immediately upon the write() that
> |> triggers it. There's no reason that the system call for saying ``Okay,
> |> forget about this descriptor, I'm done with it'' should produce an errno
> |> saying ``But wait! I neglected to mention this to you when you actually
> |> wrote the data, but you're out of space! Don't you dare forget about
> |> this descriptor when you're out of space! Hold the presses!''
>   You are still back in the days when all the world was UFS.  Close() doesn't
> mean what you say above anymore (actually, it's possible to argue that it
> never did).  What it means is, "I'm done with this file descriptor, so please
> do any finishing touches that you need to on it and then close it up, and let
> me know if you have any problems with that."

You're fantasizing. You say close() means something entirely different
from what Bach, and my man pages, and lots of other references say.

> |> Perhaps you don't yet see how silly that is. Has it occurred to you that
> |> the application may have erased the data that it wrote to disk? Are you
> |> going to insist that every write() be backed up by temporary buffers
> |> that accumulate a copy of all data written until the program dies? Well?
>   First of all, the problem isn't as large as you are trying to make it out to
> be, because most of the programs I've worked/used don't even deal well with a
> write() failing because of a quota problem; they just say, "Oops, I can't
> write, I'm going to give up completely,"

I agree with your pragmatic point, but we're talking about reliability
here.

>   For the few programs that are planning on doing serious error recovery if
> they can't write to the disk and that are planning on trying to preserve all
> data even if a disk write fails, then yes, I expect them to do some sort of
> buffering.  If you want a robust program, then you put more work into it.

Sorry, Jon, this fails. Has it occurred to you that several processes
may write to a file before one of them does the close() that writes it
out to disk? How the hell do you expect data to be buffered in the
meantime? Your ``robust program'' is another fantasy.

An I/O error makes reliable disk operations very difficult. At some
level there has to be enough replication to make failures invisible.
Unless the OS does this, each program has to write data to enough places
that an I/O error can be handled. If it does so, EIO is not a disaster.

A quota error, however, *must* be reported as soon as possible. If
EDQUOT is saved up until some system buffer is flushed, a program may
have all its work destroyed, even if it checks your precious close()
error. After all, if some other program happens to be accessing the same
file, each write() and close() could succeed. Do you understand the
problem here?

     [ Jon makes an entirely unsupported claim, and I satirize his ]
     [ lack of justification: ]
> |> I fail to see your logic. Can I substitute any two system calls there?
> |> ``Just as it is reasonable for unlink() to return an error when the file
> |> doesn't exist, it is reasonable for setuid() to do so, and the man page
> |> should be updated to reflect that.''
> |> Now what is your argument?
  [ Jon criticizes my analogy, failing to realize that he's implicitly ]
  [ condemning his own lack of logic ]
> Your introduction of fallacious reductio ad absurdum arguments into
> the discussion just makes it harder to discuss the subject matter with you
> intelligently.

Read carefully and think about what you just wrote.

> |> No. I am saying that there is no excuse for a filesystem to let a
> |> program write() more than the user's quota without giving an error.
>   This is your opinion, and you are entitled to it.  You are also entitled to
> write code that does not check close() for EDQUOT, in which case your code
> will lose data when writing to remote filesystems and your customers will
> complain to you, and you can tell them that you can't fix it because close()
> *shouldn't* return EDQUOT.

Okay, wiseass. My pty program writes output on the pseudo-tty to its own
output, the normal stdout passed as usual from pty's invoker.

Now let's say I'm trying to obey your ``reliability'' constraints.
Rather than letting the system close the descriptor upon exit(), I take
care to close() it myself. I check the return value and error code.

Suppose it now returns an error. Uh-oh, EDQUOT. How do I handle
it? I'm following your prescriptions, and I've saved everything I've
written in an internal buffer. But I have no idea what other programs
might have written to the same descriptor. They won't have found out
about the EDQUOT, even though they checked close()---because the system
buffer hasn't yet been flushed. How am I supposed to handle this error?

Jon, I don't think you've thought out the consequences of this problem.
I don't think you've even considered the above situation. I challenge
you to describe a robust EDQUOT handler.

Feel free to apologize through e-mail rather than in public.

> And I'll admit that I may not be able to see forever into the future and
> tell that we will never ever be able to justify a filesystem that doesn't
> detect quota problems on write().

I agree with your general point; this is one of the many arguments I've
put forth in comp.std.unix for why files shouldn't be forced into the
filesystem abstraction. However, I'm not talking about future
restrictions. I'm talking about what we have *now*. The only example of
real-world EDQUOT close() behavior (that won't be fixed soon) is NFS,
which everyone agrees is a botch. You may be right, and timely EDQUOT
reporting may be difficult on some future file type. But it isn't now.

> |> ``But then it has to send a request immediately over the network and
> |> wait to find out how much space is left!'' you say. Not so. Does TCP
> |> force each side to stay synchronized on every packet? Of course not. The
> |> file server can just pass out ``allocations'' of disk space. Things only
> |> slow down when you're very close to the hard limit.
>   Please explain what you're proposing here -- I don't quite get it.

Windows. Just like TCP windows. This idea, like most everything else
in computer ``science,'' has been around since at least the sixties. You
make the most common operations faster by keeping the information they
need readily available.

---Dan

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (10/26/90)

In article <thurlow.656468483@convex.convex.com> thurlow@convex.com (Robert Thurlow) writes:
> In <9681:Oct2004:06:3090@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> >In article <thurlow.656303314@convex.convex.com> thurlow@convex.com (Robert Thurlow) writes:
> >> In <24048:Oct1822:23:2090@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> >> >["I don't mind close returning -1/EINTR and -1/EBADF."]
> >> >No other close() errors make sense.
> >> So how do you pick up errors on asynchronous writes?
> >That is an excellent question. I suspect that if UNIX had truly
> >asynchronous I/O, my objections would disappear, as the whole system
> >would work much more cleanly. Somehow I'm not sure the latest mound of
> >P1003.4 is the right way to approach asynchronous I/O, but anyway...
> >``What asynchronous writes?''
> I have them by my definition - "my process can have control before my
> data has been committed to permanent storage".  What definition are you
> using, and why don't you feel my write(2) calls are asynchronous?

I was asking the same question several months ago, before Larry McVoy
and Jim Giles explained to me what I was missing. Frankly, I didn't find
their explanations too elucidating at first, so here's the scoop.

``Asynchronous I/O'' can mean two different things. One is buffered
I/O: output or input that is cached on its way to or from the disk.
This is what UNIX has always had. It doesn't imply multiprocessing, or
a change in programming philosophy, or a different interface from that
of the synchronous, unbuffered I/O in a primitive OS. For all you know,
your data might never leave the buffers---or buffers might not be used
at all. The advantage of buffers is that they reduce turnaround time:
you don't block waiting for data to go to disk or to another process,
as long as it fits into a buffer.

Another is truly asynchronous I/O: reads or writes that happen
asynchronously, concurrently with your program. The right way to see the
difference between synchronous and asynchronous I/O is to look at the
lowest level of I/O programming. A synchronous write has the CPU taking
time off from executing your program. It copies bytes of data, one by
one, into an output port. An asynchronous write has a separate I/O
processor doing the work. Your CPU takes only a moment to let the I/O
processor know about a block of data; then it returns to computation.
The CPU wouldn't run any faster if there were no I/O to do.

UNIX has never let truly asynchronous I/O show through fully to the
programmer. Although any real computer does have some sort of I/O
processor doing asynchronous reads and writes at the lowest levels, UNIX
sticks at least one level of buffering between programs and this
asynchronicity. Disk-to-buffer I/O asynchronicity would have no more
impact on programs than any other scheduling problem.

Truly asynchronous I/O---without buffering---involves a change in
programming style. Since the data is not copied by the CPU, your process
has to know when it's safe to access that area of memory. This implies
that processes have to see a signal when the I/O really finishes. In
other words, truly asynchronous I/O is much closer to the level of the
machine, where scheduling I/O and waiting for a signal is the norm.

I hope this clears up what I mean when I say that UNIX doesn't have
asynchronous I/O. (Btw, I'm finishing up a signal-schedule, aka
non-preemptive threads, library. Anyone want to see particular features
in it? It won't give you asynchronous I/O without kernel support, but
it'll provide a framework for easing async syscalls into your code.)

> >  [ I object that programs can't afford to keep data around in case of ]
> >  [ possible problems; errors must be returned in a timely fashion ]
> >> This is ridiculous.  If a program wants to _know_ the data is secured,
> >> it can call fsync before it frees the buffers or overwrites the data.
> >I sympathize with what you're trying to say, but have you noticed that
> >fsync() isn't required to flush data over NFS, any more than write() is
> >required to return EDQUOT correctly? If write()'s errors aren't
> >accurate, I don't know how you expect fsync() to work.
> Our fsync, like Suns, ensures there are no pages in the VM system
> marked as "dirty", and it does this by forcing and waiting for I/O
> on each such page.  The I/O involves an NFS write, and any I/O errors
> are detected.

Are you sure? Suppose the remote side is a symbolic link to yet another
NFS-mounted directory. Is the fsync() really propagated?

This begs the real question: Why should I have to waste all that traffic
on periodic fsync()s, when the traffic for timely EDQUOT detection would
be a mere fraction of the amount? I can't afford to buffer everything
and do just an fsync() before the final close().

> >> "Allocations"?  I won't lightly put *any* state into my NFS server, never
> >> mind state to take care of frivolities like close returning EDQUOT.
> >No good remote file system is stateless. I think every complaint I've
> >heard about NFS is caused by the ``purity'' of its stateless
> >implementation.
> No doubt, but I appreciate the advantages of the simplicity this allows.

Minor advantages at best.

> When it is clear what state we need to introduce to make a more robust
> implementation, it'll probably happen.

I hope so.

---Dan

michael@fts1.uucp (Michael Richardson) (10/26/90)

In article <thurlow.656468286@convex.convex.com> thurlow@convex.com (Robert Thurlow) writes:
>In <BZS.90Oct19231046@world.std.com> bzs@world.std.com (Barry Shein) writes:
>>of multiple writing processes. But writes in NFS are synchronous, so
>>that's not hard, the server knows and will return the error and the
>>write() doesn't return until it's committed. If you shut that off for
>>performance gains you're on your own.

>write(2) is not synchronous to the process; the NFS write operation is,
>but they may be done later by block I/O daemon processes on your behalf
>at a later time.  We allow synchronous writes, but that isn't what a
>naive process gets.

  Seems to me that the above two statements solve the problem:
  If NFS write()s are synchronous, then the server can certainly
take care of EDQUOT during the request. The write will fail. A simple
modification to the NFS write operation would return the quota left
on that device on each request.  (Disclaimer: it has been over a year
since I did anything serious with Suns or equivalent NFS machines, two
years since I read the NFS and XDR stuff.)
  Now, the biod processes are the ones that get you the asynchronous
write(2) operation on remote file systems, so it seems that THEY
should be the ones that worry about whether there is enough disk
quota left. At the point where the number of blocks left is
the maximum number that could be queued, the write() operation
becomes synchronous. 
  (biod's are kernel processes, so it is the kernel that is 
worrying about the quota. Same difference.)

  Checking for EDQUOT on close() might be a good idea 
(like checking return codes for ANY system or library call),
but what you'd do after getting it (and taking the data from
a pipe, or a tty --- a user) is beyond me. 
  Someone's .sig said something like "Only check for errors
you know how to deal with." --- I like the spirit of this.

-- 
   :!mcr!:            |    The postmaster never          |  So much mail,
   Michael Richardson |            resolves twice.       |  so few cycles.
 Play: mcr@julie.UUCP Work: michael@fts1.UUCP Fido: 1:163/109.10 1:163/138
    Amiga----^     - Pay attention only to _MY_ opinions. -   ^--Amiga--^

thurlow@convex.com (Robert Thurlow) (10/28/90)

In <12045:Oct2604:56:3290@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:

>In article <thurlow.656468483@convex.convex.com> thurlow@convex.com (Robert Thurlow) writes:
>> In <9681:Oct2004:06:3090@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
>> >In article <thurlow.656303314@convex.convex.com> thurlow@convex.com (Robert Thurlow) writes:

>> >> So how do you pick up errors on asynchronous writes?

>> >``What asynchronous writes?''

>> [The ones I get when I don't use O_SYNC.]

> [A discussion of async I/O that sounds like the VMS model.]

I think that the model you described would be a nicer way to do some
things; you'd be able to avoid a buffer copy, and a program would be
better able to tell exactly what was going on.  But I don't agree
that it is impossible to get multiprocessing with the Unix behaviour.

>Another is truly asynchronous I/O: reads or writes that happen
>asynchronously, concurrently with your program. The right way to see the
>difference between synchronous and asynchronous I/O is to look at the
>lowest level of I/O programming. A synchronous write has the CPU taking
>time off from executing your program. It copies bytes of data, one by
>one, into an output port. An asynchronous write has a separate I/O
>processor doing the work. Your CPU takes only a moment to let the I/O
>processor know about a block of data; then it returns to computation.
>The CPU wouldn't run any faster if there were no I/O to do.

With a buffer cache, your CPU has to do a copy from user space into the
buffer cache, but it can then just submit an I/O request and get on
with other stuff.  If the I/O processor has access to the buffer cache
and the buffers are properly semaphored, the CPU can do what it wants
until it maps another I/O request onto the same block or it gets an
'I/O complete' interrupt.  In this case, a 'synchronous write' just
means your proc will remain asleep until after the I/O has been done.
The buffer copy kinda sucks, and I agree that the programmer can't use
your alternate I/O programming style, but I don't see that this makes
this "less asynchronous".

>> Our fsync, like Suns, ensures there are no pages in the VM system
>> marked as "dirty", and it does this by forcing and waiting for I/O
>> on each such page.  The I/O involves an NFS write, and any I/O errors
>> are detected.

>Are you sure? Suppose the remote side is a symbolic link to yet another
>NFS-mounted directory. Is the fsync() really propagated?

It doesn't go over the wire, which is another reason why we have to rely
on the actual NFS write operation being synchronous.  And the server won't
hand out filehandles that resolve to symlinks or accept them when given
as arguments to a write operation.  What kind of failure mode was on your
mind?

>This begs the real question: Why should I have to waste all that traffic
>on periodic fsync()s, when the traffic for timely EDQUOT detection would
>be a mere fraction of the amount? I can't afford to buffer everything
>and do just an fsync() before the final close().

I don't object to network traffic for EDQUOT detection a tenth as much
as adding state to my servers and junk to my protocol to allow a server
to "dole out" a disk allocation to a client and be able to get it back
when it runs low.  If you want to build a bulletproof system that is as
immune to server and client failure as the current implementation of
NFS, I'd consider its merits, but if you ask me, we need a bug-free lock
manager/status monitor a lot more than we need something that just lets
you ignore error returns from close.

Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."

thurlow@convex.com (Robert Thurlow) (10/29/90)

In <1990Oct26.050448.26816@fts1.uucp> michael@fts1.uucp (Michael Richardson) writes:

>  Now, the biod processes are the ones that get you the asynchronous
>write(2) operation on remote file systems, so it seems that THEY
>should be the one that worries about whether there is enough disk
>quota left. At some point where the number of blocks left is
>the maximum number that could be queued, the write() operation
>becomes synchronous. 

This might be workable; the only problem is the NFS protocol change
required to carry the quota information, because we're still waiting
for a couple of more serious things to be fixed.

>  Checking for EDQUOT on close() might be a good idea 
>(like checking return codes for ANY system or library call),
>but what you'd do after getting it (and taking the data from
>a pipe, or a tty --- a user) is beyond me. 

While I'd like to handle errors, I _demand_ to know about them so that
I can warn my users.  Even here, a workaround might be to have the
process retry the close so the kernel will retry the NFS writes, after
telling the user he is over quota so that he can try to delete some
files on the server.  If your process exited, _close() could just go
ahead and burn the blocks out of the cache.

>  Someone's .sig said something like "Only check for errors
>you know how to deal with." --- I like the spirit of this.

Hey, warning a user is still better than ignoring an error condition
your program can't handle.

#include <osi/std/disclaimer>
Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."