brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (10/19/90)
In article <1990Oct18.200939.17427@athena.mit.edu> jik@athena.mit.edu (Jonathan I. Kamens) writes:
> In article <19547:Oct1818:25:2690@kramden.acf.nyu.edu>, brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:

[ NFS should deal with quotas correctly ]

> Hogwash. If close() is specified as returning an error value, then it is
> reasonable for it to sometimes return an error value, and it is also
> reasonable for the programmer to be expected to check and deal with its
> return value.

That's a truism. Yes, any system call might return -1/EINTR. Yes, close() should return -1/EBADF if the descriptor isn't open. These two errors are documented and reasonable for close() to return, and it's reasonable to expect the programmer to deal with them.

No other close() errors make sense.

You say that close() should be able to return -1/EDQUOT. That's hogwash. EDQUOT can and should be detected immediately upon the write() that triggers it. There's no reason that the system call for saying ``Okay, forget about this descriptor, I'm done with it'' should produce an errno saying ``But wait! I neglected to mention this to you when you actually wrote the data, but you're out of space! Don't you dare forget about this descriptor when you're out of space! Hold the presses!''

Perhaps you don't yet see how silly that is. Has it occurred to you that the application may have erased the data that it wrote to disk? Are you going to insist that every write() be backed up by temporary buffers that accumulate a copy of all data written until the program dies? Well?

> Just as it is reasonable for write() to return an error when the disk is
> full, it is reasonable for close() to do so, and the man page should be
> updated to reflect that.

I fail to see your logic. Can I substitute any two system calls there? ``Just as it is reasonable for unlink() to return an error when the file doesn't exist, it is reasonable for setuid() to do so, and the man page should be updated to reflect that.'' Now what is your argument?

> You seem to be saying that filesystems should never be allowed to postpone
> writing to disk until close(),

No. I am saying that there is no excuse for a filesystem to let a program write() more than the user's quota without giving an error. ``But then it has to send a request immediately over the network and wait to find out how much space is left!'' you say. Not so. Does TCP force each side to stay synchronized on every packet? Of course not. The file server can just pass out ``allocations'' of disk space. Things only slow down when you're very close to the hard limit.

---Dan
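[Editor's sketch: Dan's ``allocations'' idea can be modelled as a toy in C. Everything here — the struct layout, the chunk size, the function names — is invented for illustration and comes from none of the systems under discussion; it only shows how a client could charge writes against a locally held grant and go back to the server when the grant runs dry.]

```c
#include <errno.h>

/* Toy model of the "allocations" idea above: the server hands out quota
   in chunks; the client charges each write against its local grant and
   contacts the server again only when the grant runs dry.  All names
   and the chunk size are invented for illustration. */

#define GRANT_CHUNK 8192L

struct server { long quota_left; };                 /* authoritative */
struct client { struct server *srv; long grant; };  /* local allocation */

/* Server side: grant up to `want` bytes of quota. */
static long grant_space(struct server *srv, long want)
{
    long give = want < srv->quota_left ? want : srv->quota_left;
    srv->quota_left -= give;
    return give;
}

/* Client side: charge a write locally; only near exhaustion does this
   cost a "round trip" to the server.  Returns 0, or -1 with EDQUOT. */
int client_write(struct client *c, long nwrite)
{
    if (c->grant < nwrite) {
        long need = nwrite - c->grant;
        long want = need > GRANT_CHUNK ? need : GRANT_CHUNK;
        c->grant += grant_space(c->srv, want);
        if (c->grant < nwrite) {
            errno = EDQUOT;    /* the quota really is exhausted */
            return -1;
        }
    }
    c->grant -= nwrite;
    return 0;
}
```

Most writes cost one comparison and one subtraction; only when the local grant is nearly used up does the client have to synchronize with the server — which is Dan's claim that things only slow down very close to the hard limit.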
thurlow@convex.com (Robert Thurlow) (10/19/90)
In <24048:Oct1822:23:2090@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:

>["I don't mind close returning -1/EINTR and -1/EBADF."]
>No other close() errors make sense.

So how do you pick up errors on asynchronous writes? You are never going to know about them if close can't return an error here. It makes sense to do an implicit fsync on close, and I think the error code from that operation has to be propagated. You may not be able to do a lot of constructive rebuilding in this case, but your program can at least let the user know that, "hey, something really evil happened." I'd shoot a program (like Berkeley's "cp") that didn't at least let me know via an exit status. You may choose to let your program retry the close until it works, too.

>EDQUOT can and should be detected immediately upon the write() that
>triggers it.

This probably can be arranged, though it takes some care to not get hosed by multiple processes.

>Perhaps you don't yet see how silly that is. Has it occurred to you that
>the application may have erased the data that it wrote to disk? Are you
>going to insist that every write() be backed up by temporary buffers
>that accumulate a copy of all data written until the program dies? Well?

This is ridiculous. If a program wants to _know_ the data is secured, it can call fsync before it frees the buffers or overwrites the data. If you don't care enough about the data to do this, go another step and cast the return value of close to "void".

>``But then it has to send a request immediately over the network and
>wait to find out how much space is left!'' you say. Not so. Does TCP
>force each side to stay synchronized on every packet? Of course not. The
>file server can just pass out ``allocations'' of disk space.

"Allocations"? I won't lightly put *any* state into my NFS server, never mind state to take care of frivolities like close returning EDQUOT.
Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."
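[Editor's sketch: Rob's prescription — fsync() before you discard the data, check close(), and report failures via exit status — can be spelled out in C. This is a minimal sketch, not code from any system discussed here; careful_write and its error conventions are invented for illustration.]

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write buf to path, checking every step.  The data is forced out with
   fsync() *before* the caller might reuse or free the buffer, so an
   EDQUOT/ENOSPC deferred by the filesystem shows up while recovery is
   still possible.  Returns 0 on success, -1 with errno set on failure. */
int careful_write(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd < 0)
        return -1;

    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted system call: retry */
            int saved = errno;
            (void) close(fd);
            errno = saved;
            return -1;
        }
        done += (size_t) n;
    }

    /* Flush write-behind buffers now, while the data is still in hand. */
    if (fsync(fd) < 0) {
        int saved = errno;
        (void) close(fd);
        errno = saved;
        return -1;
    }

    /* A quota or space error deferred all the way to close() is still
       caught here rather than silently dropped. */
    return close(fd);
}
```

A caller that cannot afford to lose data keeps its buffer until careful_write() returns 0, and exits nonzero otherwise — exactly the behavior Rob wants from Berkeley's "cp".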
jik@athena.mit.edu (Jonathan I. Kamens) (10/19/90)
Dan, I think this is one of those issues where we are arguing because you are convinced that the way you think Unix should be is the One True Way and that everyone who thinks differently from you is wrong. Or, at least, that's the tone you project in your postings on this topic (you projected exactly the same tone when we discussed syslog).

In article <24048:Oct1822:23:2090@kramden.acf.nyu.edu>, brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
|> In article <1990Oct18.200939.17427@athena.mit.edu> jik@athena.mit.edu (Jonathan I. Kamens) writes:
|> > In article <19547:Oct1818:25:2690@kramden.acf.nyu.edu>, brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
|> [ NFS should deal with quotas correctly ]

Here's the first example of what I mean. You summarize what you posted as "NFS should deal with quotas correctly." Dan, if we all agreed that the way you want NFS to deal with quotas is the "correct" way, then we wouldn't be arguing. But no, everyone who disagrees with you must be wrong, since you know the One True Way that Unix should work, so you have the right to call your ideas "correct" and everyone else's "wrong".

|> That's a truism. Yes, any system call might return -1/EINTR. Yes,
|> close() should return -1/EBADF if the descriptor isn't open. These two
|> errors are documented and reasonable for close() to return, and it's
|> reasonable to expect the programmer to deal with them.

Actually, my close(2) says nothing about EINTR. Does that mean my program doesn't have to be prepared to deal with an interrupted close()? No, it means that the man page should be updated. Which is what I said about close(2) and EDQUOT in my last posting. Point for me, I think.

|> No other close() errors make sense.

It is your *opinion* that no other close() errors make sense. There are a whole lot of people, including the designers of NFS and the designers of AFS, who may not agree with you.

Actually, I take that back. I have heard that the designers of AFS are working on a fix for a later release which will cause quota problems to be reported to users in a more timely manner, although I don't know what mechanism they are using to do so.

However, that doesn't change my main point, which is that I do not think it is reasonable to place a restriction on all future filesystems developed for Unix that they all must detect quota problems immediately. You may be able to foresee what kinds of filesystems we'll be developing in ten or fifteen years (when, of course, Unix will have taken over the world :-), but I don't think I can, and I think it's completely reasonable to expect that there will probably be filesystem developers who will want to be able to defer quota errors until close(), and it may even be a justifiable thing to do on the filesystems they are developing.

The only justification you have given for your claim that no other close() errors "make sense" is that you don't think filesystems should work that way. I'm afraid that's just not good enough. And your comment, "No other close() errors make sense," is yet another example of your One True Way attitude. Lighten up.

|> You say that close() should be able to return -1/EDQUOT. That's hogwash.
|> EDQUOT can and should be detected immediately upon the write() that
|> triggers it. There's no reason that the system call for saying ``Okay,
|> forget about this descriptor, I'm done with it'' should produce an errno
|> saying ``But wait! I neglected to mention this to you when you actually
|> wrote the data, but you're out of space! Don't you dare forget about
|> this descriptor when you're out of space! Hold the presses!''

You are still back in the days when all the world was UFS. Close() doesn't mean what you say above anymore (actually, it's possible to argue that it never did). What it means is, "I'm done with this file descriptor, so please do any finishing touches that you need to on it and then close it up, and let me know if you have any problems with that."

Your argument above boils down to, "Close() didn't used to be able to return EDQUOT, so it shouldn't be able to do so now." Sorry, but that just doesn't cut it. Try again.

|> Perhaps you don't yet see how silly that is. Has it occurred to you that
|> the application may have erased the data that it wrote to disk? Are you
|> going to insist that every write() be backed up by temporary buffers
|> that accumulate a copy of all data written until the program dies? Well?

First of all, the problem isn't as large as you are trying to make it out to be, because most of the programs I've worked on/used don't even deal well with a write() failing because of a quota problem; they just say, "Oops, I can't write, I'm going to give up completely," and forget about whatever they were doing; they make hardly any effort to preserve data. Therefore, an error on close() wouldn't be any worse than an error on write() for most programs.

For the few programs that are planning on doing serious error recovery if they can't write to the disk and that are planning on trying to preserve all data even if a disk write fails, then yes, I expect them to do some sort of buffering. If you want a robust program, then you put more work into it.

|> I fail to see your logic. Can I substitute any two system calls there?
|> ``Just as it is reasonable for unlink() to return an error when the file
|> doesn't exist, it is reasonable for setuid() to do so, and the man page
|> should be updated to reflect that.''
|>
|> Now what is your argument?

Although not everyone may agree that the behavior we are discussing is reasonable for remote filesystems, I think even the most thick-headed reader of this group can recognize that it is more reasonable for a system call that operates on the filesystem to encounter quota-related errors than it is for a system call that changes your UID in the kernel to encounter quota-related errors. Your introduction of fallacious reductio ad absurdum arguments into the discussion just makes it harder to discuss the subject matter with you intelligently. My comparison of write() to close() was reasonable because both deal with writing files. Your comparison of unlink() to setuid() is ludicrous. And your comment, "Now what is your argument?" is another One True Way comment. Once again, lighten up.

|> No. I am saying that there is no excuse for a filesystem to let a
|> program write() more than the user's quota without giving an error.

This is your opinion, and you are entitled to it. You are also entitled to write code that does not check close() for EDQUOT, in which case your code will lose data when writing to remote filesystems and your customers will complain to you, and you can tell them that you can't fix it because close() *shouldn't* return EDQUOT.

When you've written a filesystem as successful as NFS, which works as well as AFS, which doesn't have the close() problem we're discussing, I'll gladly use it, and I'll gladly try to get other people to use it, and I'll thank you a thousand times. But until then, I'll stick with the people who *have* done it. And I'll admit that I may not be able to see forever into the future and tell that we will never ever be able to justify a filesystem that doesn't detect quota problems on write().

|> ``But then it has to send a request immediately over the network and
|> wait to find out how much space is left!'' you say. Not so. Does TCP
|> force each side to stay synchronized on every packet? Of course not. The
|> file server can just pass out ``allocations'' of disk space. Things only
|> slow down when you're very close to the hard limit.

Please explain what you're proposing here -- I don't quite get it.

--
Jonathan Kamens                           USnail:
MIT Project Athena                        11 Ashford Terrace
jik@Athena.MIT.EDU                        Allston, MA  02134
Office: 617-253-8085                      Home: 617-782-0710
gt0178a@prism.gatech.EDU (Jim Burns) (10/19/90)
It seems to me the close() problem is no different than the fclose() problem. fclose() is known to do final flushing of stdio, and if you run out of space YSOOL.

As far as quotas go, I think it is possible to check that, tho' absolute reliability may incur performance penalties. If quota(1) can tell you what you are using, then either the write does the equivalent of quota(1) to determine free=quota-used EVERY time, or it does it once at the beginning of the program and tries to keep an accurate estimation of what you are writing in your program. (Yuech, but it can be done.) (Then again, who ever said addons like quotas ever seamlessly interface w/the standard system. :-)

The much more serious problem is out of filesystem space. I've lost more files in vi(1) than I care to remember because it didn't tell me there was an error. (I don't use ZZ anymore - I do :w, then switch to another window and check that it got there.) If you are going to buffer (explicitly in stdio, or implicitly in NFS) you are going to have to expect (f)close errors.

--
BURNS,JIM
Georgia Institute of Technology, Box 30178, Atlanta Georgia, 30332
uucp: ...!{decvax,hplabs,ncar,purdue,rutgers}!gatech!prism!gt0178a
Internet: gt0178a@prism.gatech.edu
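[Editor's sketch: Jim's point about fclose() doing the final flush is easy to demonstrate. This is an invented illustration (save_text is not from any post); it shows the one return value that a surprising number of programs throw away.]

```c
#include <stdio.h>

/* stdio buffers output, so fputs() can "succeed" with the data still
   sitting in the buffer.  The flush happens at fclose(), which is where
   a full disk or exceeded quota finally surfaces - so its return value
   must be checked.  Returns 0 on success, EOF on any failure. */
int save_text(const char *path, const char *text)
{
    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return EOF;

    if (fputs(text, fp) == EOF) {
        fclose(fp);
        return EOF;
    }

    /* The final flush: ignoring this is how files get lost silently. */
    return fclose(fp);
}
```

This is exactly the (f)close error Jim says you have to expect whenever anything buffers on your behalf, whether stdio or NFS.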
bzs@world.std.com (Barry Shein) (10/20/90)
From: gt0178a@prism.gatech.EDU (Jim Burns)

>As far as quotas go, I think it is possible to check that, tho' absolute
>reliability may incur performance penalties. If quota(1) can tell you what
>you are using, then either the write does the equivalent of quota(1) to
>determine free=quota-used EVERY time, or it does it once at the beginning
>of the program and tries to keep an accurate estimation of what you are
>writing in your program. (Yuech, but it can be done.) (Then again, who
>ever said addons like quotas ever seamlessly interface w/the standard
>system. :-)

What is with all the hand-wringing here? To implement quota you have to keep one (perhaps two, soft/hard) integers up to date. BFD.

On a single system it has to be kept user-global, in the user struct basically, one per major/minor device. A pointer to the correct ints can be thrown in the per-file struct so it all devolves to something like:

	if ((*file[fd].quota - nwrite) < 0)
		bitch;
	else
		*file[fd].quota -= nwrite;

(with appropriate adjustments for blocks and soft/hard) on each write.

For a distributed file system it's harder because of the possibility of multiple writing processes. But writes in NFS are synchronous, so that's not hard, the server knows and will return the error and the write() doesn't return until it's committed. If you shut that off for performance gains you're on your own.

Am I missing something hard here?

--
        -Barry Shein

Software Tool & Die    | {xylogics,uunet}!world!bzs | bzs@world.std.com
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD
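[Editor's sketch: Barry's two-line check can be spelled out as a compilable toy. The struct and names are invented for illustration — this is a user-level model, not kernel code — but it shows that the per-write cost is one compare and one subtract.]

```c
#include <errno.h>

/* Toy user-level model of Barry's scheme: each open file carries a
   pointer to the owner's remaining-quota counter (per user, per device),
   and every write is charged against it before any data moves. */
struct qfile {
    long *quota_left;          /* shared remaining-quota counter, bytes */
};

/* Charge nwrite bytes against the quota.  Returns nwrite on success,
   or -1 with errno = EDQUOT - the timely error Dan wants from write(). */
long quota_charge(struct qfile *f, long nwrite)
{
    if (*f->quota_left - nwrite < 0) {
        errno = EDQUOT;
        return -1;
    }
    *f->quota_left -= nwrite;
    return nwrite;
}
```

Several open files on the same device would share one counter through the pointer, which is Barry's "user-global, one per major/minor device" point.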
bzs@world.std.com (Barry Shein) (10/20/90)
>>["I don't mind close returning -1/EINTR and -1/EBADF."]
>>No other close() errors make sense.
>
>So how do you pick up errors on asynchronous writes? You are never
>going to know about them if close can't return an error here.

Well, someone has to say it...

OS/360/370 interrupted (signal) when there was an I/O error (SYNAD=handler) of any sort and you could pick apart what happened in the handler() routine. SYNAD stood for Syndrome Address if I remember correctly.

The only hard part was knowing where your program was when the error struck. A typical technique was to use a global integer and just set it to some value to indicate what you were about to do (often called "phase", as in "phase = SAVEBUFS"; you could hide that sort of thing pretty well in macros.)

They also interrupted on EOF. This is where all the ON CONDITION stuff in PL/I makes sense, it basically models the OS/370 internals. I am tempted at this point to say "if you want MVS, you know where to find it", but I won't :-)

I've certainly encountered programs which would have been much easier to debug if I just could have enabled some sort of "interrupt on any syscall error", set a bunch of handlers in main and let it fly in a debugger until one struck, and then look back up the stack to see who did it. These were typically programs managing to pass garbage args to system calls but not checking for errors, or writing to bad fd's etc. Unfortunately many programs can go on for quite a while after such an event so it's hard to figure out where the chaos started.

It would be pretty easy to implement at user level by just having a replacement library for the libc syscalls which checked for an error return and just did a kill(0,SIGUSR1) or similar if something was wrong.

--
        -Barry Shein

Software Tool & Die    | {xylogics,uunet}!world!bzs | bzs@world.std.com
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (10/20/90)
In article <thurlow.656303314@convex.convex.com> thurlow@convex.com (Robert Thurlow) writes:
> In <24048:Oct1822:23:2090@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> >["I don't mind close returning -1/EINTR and -1/EBADF."]
> >No other close() errors make sense.
> So how do you pick up errors on asynchronous writes?

That is an excellent question. I suspect that if UNIX had truly asynchronous I/O, my objections would disappear, as the whole system would work much more cleanly. Somehow I'm not sure the latest mound of P1003.4 is the right way to approach asynchronous I/O, but anyway... ``What asynchronous writes?''

[ I object that programs can't afford to keep data around in case of ]
[ possible problems; errors must be returned in a timely fashion ]

> This is ridiculous. If a program wants to _know_ the data is secured,
> it can call fsync before it frees the buffers or overwrites the data.

I sympathize with what you're trying to say, but have you noticed that fsync() isn't required to flush data over NFS, any more than write() is required to return EDQUOT correctly? If write()'s errors aren't accurate, I don't know how you expect fsync() to work.

> "Allocations"? I won't lightly put *any* state into my NFS server, never
> mind state to take care of frivolities like close returning EDQUOT.

No good remote file system is stateless. I think every complaint I've heard about NFS is caused by the ``purity'' of its stateless implementation.

---Dan
gt0178a@prism.gatech.EDU (Jim Burns) (10/20/90)
in article <BZS.90Oct19233255@world.std.com>, bzs@world.std.com (Barry Shein) says:

> OS/360/370 interrupted (signal) when there was an I/O error
> (SYNAD=handler) of any sort and you could pick apart what happened in
> the handler() routine.

> The only hard part was knowing where your program was when the error
> struck.

And what do you do if the I/O is done as the result of a close? And worse, if the close is a result of exit processing, and the process doesn't exist anymore to get the interrupt? Without timely notification, it's hard to recover.

--
BURNS,JIM
Georgia Institute of Technology, Box 30178, Atlanta Georgia, 30332
uucp: ...!{decvax,hplabs,ncar,purdue,rutgers}!gatech!prism!gt0178a
Internet: gt0178a@prism.gatech.edu
bzs@world.std.com (Barry Shein) (10/20/90)
From: gt0178a@prism.gatech.EDU (Jim Burns)

>in article <BZS.90Oct19233255@world.std.com>, bzs@world.std.com (Barry Shein) says:
>
>> OS/360/370 interrupted (signal) when there was an I/O error
>> (SYNAD=handler) of any sort and you could pick apart what happened in
>> the handler() routine.
>
>> The only hard part was knowing where your program was when the error
>> struck.
>
>And what do you do if the I/O is done as the result of a close? And worse,
>if the close is a result of exit processing, and the process doesn't exist
>anymore to get the interrupt? Without timely notification, it's hard to
>recover.

Uh, we're going around in circles here. The assumption in that thread was that none of what you mention was the case. Obviously if you care about errors you'd better do your own close() and not leave it to the process rundown (but there's no reason that can't interrupt also.)

--
        -Barry Shein

Software Tool & Die    | {xylogics,uunet}!world!bzs | bzs@world.std.com
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD
bzs@world.std.com (Barry Shein) (10/21/90)
From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)

>I sympathize with what you're trying to say, but have you noticed that
>fsync() isn't required to flush data over NFS, any more than write() is
>required to return EDQUOT correctly? If write()'s errors aren't
>accurate, I don't know how you expect fsync() to work.

I assume by NFS you mean the NFS from Sun. Writes are always synchronous in NFS or must appear to be (or are non-compliant and you're on your own.) So fsync() for writes is a no-op and irrelevant in that case.

--
        -Barry Shein

Software Tool & Die    | {xylogics,uunet}!world!bzs | bzs@world.std.com
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD
thurlow@convex.com (Robert Thurlow) (10/21/90)
In <BZS.90Oct19231046@world.std.com> bzs@world.std.com (Barry Shein) writes:

>For a distributed file system it's harder because of the possibility
>of multiple writing processes. But writes in NFS are synchronous, so
>that's not hard, the server knows and will return the error and the
>write() doesn't return until it's committed. If you shut that off for
>performance gains you're on your own.

write(2) is not synchronous to the process; the NFS write operation is, but the writes may be done by block I/O daemon processes on your behalf at a later time. We allow synchronous writes, but that isn't what a naive process gets.

#include <osi/std/disclaimer>

Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."
thurlow@convex.com (Robert Thurlow) (10/21/90)
In <9681:Oct2004:06:3090@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
>In article <thurlow.656303314@convex.convex.com> thurlow@convex.com (Robert Thurlow) writes:
>> In <24048:Oct1822:23:2090@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
>> >["I don't mind close returning -1/EINTR and -1/EBADF."]
>> >No other close() errors make sense.
>> So how do you pick up errors on asynchronous writes?

>That is an excellent question. I suspect that if UNIX had truly
>asynchronous I/O, my objections would disappear, as the whole system
>would work much more cleanly. Somehow I'm not sure the latest mound of
>P1003.4 is the right way to approach asynchronous I/O, but anyway...
>``What asynchronous writes?''

I have them by my definition - "my process can have control before my data has been committed to permanent storage". What definition are you using, and why don't you feel my write(2) calls are asynchronous?

> [ I object that programs can't afford to keep data around in case of ]
> [ possible problems; errors must be returned in a timely fashion ]

>> This is ridiculous. If a program wants to _know_ the data is secured,
>> it can call fsync before it frees the buffers or overwrites the data.

>I sympathize with what you're trying to say, but have you noticed that
>fsync() isn't required to flush data over NFS, any more than write() is
>required to return EDQUOT correctly? If write()'s errors aren't
>accurate, I don't know how you expect fsync() to work.

Our fsync, like Sun's, ensures there are no pages in the VM system marked as "dirty", and it does this by forcing and waiting for I/O on each such page. The I/O involves an NFS write, and any I/O errors are detected. I consider any system that doesn't do this for me when I call fsync to be broken. If you can think of a way that this can fail to return an error indication to me, please send me a test case.

>> "Allocations"? I won't lightly put *any* state into my NFS server, never
>> mind state to take care of frivolities like close returning EDQUOT.

>No good remote file system is stateless. I think every complaint I've
>heard about NFS is caused by the ``purity'' of its stateless
>implementation.

No doubt, but I appreciate the advantages of the simplicity this allows. When it is clear what state we need to introduce to make a more robust implementation, it'll probably happen.

#include <osi/std/disclaimer>

Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."
thurlow@convex.com (Robert Thurlow) (10/21/90)
In <BZS.90Oct20172142@world.std.com> bzs@world.std.com (Barry Shein) writes:

>I assume by NFS you mean the NFS from Sun. Writes are always
>synchronous in NFS or must appear to be (or are non-compliant and
>you're on your own.) So fsync() for writes is a no-op and irrelevant
>in that case.

This is exactly wrong; Sun ships a biod(8) daemon to support read-ahead and write-behind async I/O over NFS, and fsync() is certainly needed.

#include <osi/std/disclaimer>

Rob T
--
Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu
----------------------------------------------------------------------
"This opinion was the only one available; I got here kind of late."
jik@athena.mit.edu (Jonathan I. Kamens) (10/22/90)
In article <BZS.90Oct19231046@world.std.com>, bzs@world.std.com (Barry Shein) writes:
|> Am I missing something hard here?

I log into machine A and start a process that's going to write a big file into AFS. Then, I log into machine B and do the same thing, but another file. Either of the files, when finished, will fit into my quota, but not both of them. The process on A and the process on B both query the AFS server to get the starting quota, and then go ahead and start writing. Both of them finish. The first one to finish will have no problems with the close(). The second will get EDQUOT.

As for NFS, as I believe someone else has already pointed out, the NFS *protocol* is synchronous, but NFS client kernels do not guarantee that the *client* will get synchronous behavior.

Assuming that a process/kernel on one machine can accurately keep track of quotas on a remote filesystem by only making one quota call before doing any operations on the device is just asking for trouble. And don't forget that you're going to have to update the quota status in the kernel after *any* file operation on that device -- when a file is removed, for example, the local quota record will have to be updated. As far as I know, NFS client kernels don't have to ask how big a file is before they remove it, but if we do what you suggest, they'd have to do that, because they'd have to be able to subtract the size of the file from the local quota record.

All this to avoid EDQUOT on close(), something which I still think is quite reasonable.

--
Jonathan Kamens                           USnail:
MIT Project Athena                        11 Ashford Terrace
jik@Athena.MIT.EDU                        Allston, MA  02134
Office: 617-253-8085                      Home: 617-782-0710
richard@aiai.ed.ac.uk (Richard Tobin) (10/23/90)
In article <BZS.90Oct20172142@world.std.com> bzs@world.std.com (Barry Shein) writes:

>I assume by NFS you mean the NFS from Sun. Writes are always
>synchronous in NFS or must appear to be (or are non-compliant and
>you're on your own.) So fsync() for writes is a no-op and irrelevant
>in that case.

The writes performed by the client kernel to the remote server must be synchronous, so that a server crash is transparent. Writes by the application don't need to be synchronous - the client kernel may buffer them - since it is not required that client crashes be transparent (:-).

-- Richard
--
Richard Tobin,                       JANET: R.Tobin@uk.ac.ed
AI Applications Institute,           ARPA:  R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk
Edinburgh University.                UUCP:  ...!ukc!ed.ac.uk!R.Tobin
jeff@ingres.com (Jeff Anton) (10/23/90)
|> When you've written a filesystem as successful as NFS, which works as well
|> as AFS, which doesn't have the close() problem we're discussing, I'll gladly
|> use it, and I'll gladly try to get other people to use it, and I'll thank you
|> a thousand times. But until then, I'll stick with the people who *have* done
|> it. And I'll admit that I may not be able to see forever into the future and
|> tell that we will never ever be able to justify a filesystem that doesn't
|> detect quota problems on write().
|>
|> --
|> Jonathan Kamens                        USnail:
|> MIT Project Athena                     11 Ashford Terrace
|> jik@Athena.MIT.EDU                     Allston, MA  02134
|> Office: 617-253-8085                   Home: 617-782-0710

NFS, though successful, is hardly a system to present as a good example of robust out-of-space error reporting. ENOSPC handling is less than optimal even in a buffered situation. NFS 'remembers' if a write caused an out-of-space condition and refuses to query the server about further writes until the file is closed and reopened. (And the close() returns ENOSPC as well.)

Two problems with this optimization. First, you can't seek backwards in the file to write an indication that you ran out of space, because you can't overwrite existing allocated blocks - the no-further-writes rule disallows this - so you can't recover from ENOSPC even if you checked for the case. Second, if two processes have the file open they have to communicate and close the file together to clear the no-further-writes condition.

I think this behavior is to prevent the stupid program which ignores errors from write from killing the NFS server & network performance. A simple fix would be to have lseek clear the no-further-writes condition on the grounds that after seeking a write might succeed.

Also, what do you do if close() reports an error like ENOSPC? Did close release the file descriptor or not? You would have to fstat it to determine if it is closed. And if it was not closed, how would you close it? We need documentation!

This might be a simple bug, but no vendor has admitted it. It's just another performance vs. reliability trade-off. (O_SYNC doesn't help.) (Actually, I've not tested for this bug in the last year or so; it might be fixed in some places.)

					Jeff Anton
greywolf@unisoft.UUCP (The Grey Wolf) (10/25/90)
[ Heeeere, newsposter. Heeeeeere boy. C'mon, lap it up.... ] My thanks to everyone who answered my question, as silly as it must have seemed, and my profuse and profound apologies for opening up a can of worms. I must confess that I did not expect that such a simple question as I asked would cause such multiple demon instantiation (pandemonium) as it did. Thanks again. -- "This is *not* going to work!" "Well, why didn't you say so before?" "I *did* say so before!" ...!{ucbvax,acad,uunet,amdahl,pyramid}!unisoft!greywolf
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (10/26/90)
Skip to ``function'' if you want to see a couple of nasty challenges rather than all the details. In article <1990Oct19.055913.7103@athena.mit.edu> jik@athena.mit.edu (Jonathan I. Kamens) writes: > Dan, I think this is one of those issues where we are arguing because you > are convinced that the way you think Unix should be is the One True Way and > that everyone who thinks differently from you is wrong. One of the problems with your writing style is that you put ``I think'' before each of your opinions. One of the problems with your reading style is that you assume everyone else shares your stilted view of presentation. I don't pretend that what I say is gospel. *Obviously* it's only what I believe. Stop making me out to be some sort of religious fanatic, and start presenting rational arguments that might convince me of something. > Or, at least, that's > the tone you project in your postings on this topic (you projected exactly the > same tone when we discussed syslog). Oh? Here's the model for your syslog argument: ``1. Dan says a logging system shouldn't be *forced* under central control. 2. Dan says that central control is a reasonable *default*. 3. syslog has central control as its *only* option. 4. Therefore syslog is sufficient for Dan.'' That's simply ridiculous. I've italicized above the words that you kept ignoring in that discussion. I can configure stderr, even though central control is a good *default*. I can use stderr securely, even though insecure /dev/log may be a good *default*. I can use stderr reliably, even though unreliable syslog may be a good *default*. Your final article went through each of the above points and many others. You kept repeating ``well, you say that central control and /dev/log are enough, so why don't you like syslog?'' Now *that's* a ``One True Way'' attitude. Someday I hope you will realize how restrictive syslog is. I was never really angry at it until I tested a new (slightly buggy, of course) version of telnetd. 
Was I supposed to disable all the syslog()s while testing? Eventually I had to do just that, so that I wouldn't flood the real system logs with messages that could not be distinguished from important messages from the running telnetd. I hope you hit a similar wall sometime. In the meantime, don't accuse *me* of ``projecting'' a ``tone.'' It was your refusal to respond rationally that told me you weren't paying attention. > Actually, my close(2) says nothing about EINTR. Does that mean my program > doesn't have to be prepared to deal with an interrupted close()? I talk about this in another article. > |> No other close() errors make sense. > It is your *opinion* that no other close() errors make sense. There are a > whole lot of people, including the designers of NFS and the designers of AFS, > who may not agree with you. This is exactly what I hate the most about your argument style. I express opinion X. You can't figure out a rational response to X. So you say (substitute appropriately): ``It is your *opinion* that X. You are extremely opinionated. There are lots of people who disagree with you. I disagree with you.'' I say that EINTR and EBADF make sense, given the documented function of close(). You say they don't. I challenge you to explain why EDQUOT is a sensible error for the ``delete a descriptor'' function to return. Skip to ``Suppose'' for a second challenge. > Actually, I take that back. I have heard that the designers of AFS are > working on a fix for a later release which will cause quota problems to > be reported to users in a more timely manner, although I don't know what > mechanism they are using to do so. Excellent. It's good to hear that filesystem designers are joining the real world. > However, that doesn't change my main point, which is that I do not think it > is reasonable to place a restriction on all future filesystems developed for > Unix that they all must detect quota problems immediately. Did I ever claim such a restriction? No. 
In fact, I'm strongly against any rules that restrict future extensions. However, it is most logical for EDQUOT to be detected upon the offending write()---and current filesystems *can do so* without undue implementation difficulty. So it's silly for close() to return EDQUOT. > The only justification you have given for your claim that no other close() > errors "make sense" is that you don't think filesystems should work that way. > I'm afraid that's just not good enough. Here goes your argument style again. Jon, you are the one arguing for a change from historical implementations, so it's incumbent upon you to prove your point. You say that EDQUOT ``makes sense'' for close(), but the only justification you've given is that NFS does it that way. Surely if you feel so strongly about this issue then you have some rational reason for your belief? How about sharing this reason with the rest of us? > |> You say that close() should be able to return -1/EDQUOT. That's hogwash. > |> EDQUOT can and should be detected immediately upon the write() that > |> triggers it. There's no reason that the system call for saying ``Okay, > |> forget about this descriptor, I'm done with it'' should produce an errno > |> saying ``But wait! I neglected to mention this to you when you actually > |> wrote the data, but you're out of space! Don't you dare forget about > |> this descriptor when you're out of space! Hold the presses!'' > You are still back in the days when all the world was UFS. Close() doesn't > mean what you say above anymore (actually, it's possible to argue that it > never did). What it means is, "I'm done with this file descriptor, so please > do any finishing touches that you need to on it and then close it up, and let > me know if you have any problems with that." You're fantasizing. You say close() means something entirely different from what Bach, and my man pages, and lots of other references say. > |> Perhaps you don't yet see how silly that is. 
Has it occurred to you that > |> the application may have erased the data that it wrote to disk? Are you > |> going to insist that every write() be backed up by temporary buffers > |> that accumulate a copy of all data written until the program dies? Well? > First of all, the problem isn't as large as you are trying to make it out to > be, because most of the programs I've worked on or used don't even deal well with a > write() failing because of a quota problem; they just say, "Oops, I can't > write, I'm going to give up completely," I agree with your pragmatic point, but we're talking about reliability here. > For the few programs that are planning on doing serious error recovery if > they can't write to the disk and that are planning on trying to preserve all > data even if a disk write fails, then yes, I expect them to do some sort of > buffering. If you want a robust program, then you put more work into it. Sorry, Jon, this fails. Has it occurred to you that several processes may write to a file before one of them does the close() that writes it out to disk? How the hell do you expect data to be buffered in the meantime? Your ``robust program'' is another fantasy. An I/O error makes reliable disk operations very difficult. At some level there has to be enough replication to make failures invisible. Unless the OS does this, each program has to write data to enough places that an I/O error can be handled. If it does so, EIO is not a disaster. A quota error, however, *must* be reported as soon as possible. If EDQUOT is saved up until some system buffer is flushed, a program may have all its work destroyed, even if it checks your precious close() error. After all, if some other program happens to be accessing the same file, each write() and close() could succeed. Do you understand the problem here? [ Jon makes an entirely unsupported claim, and I satirize his ] [ lack of justification: ] > |> I fail to see your logic. Can I substitute any two system calls there? 
> |> ``Just as it is reasonable for unlink() to return an error when the file > |> doesn't exist, it is reasonable for setuid() to do so, and the man page > |> should be updated to reflect that.'' > |> Now what is your argument? [ Jon criticizes my analogy, failing to realize that he's implicitly ] [ condemning his own lack of logic ] > Your introduction of fallacious reductio ad absurdum arguments into > the discussion just makes it harder to discuss the subject matter with you > intelligently. Read carefully and think about what you just wrote. > |> No. I am saying that there is no excuse for a filesystem to let a > |> program write() more than the user's quota without giving an error. > This is your opinion, and you are entitled to it. You are also entitled to > write code that does not check close() for EDQUOT, in which case your code > will lose data when writing to remote filesystems and your customers will > complain to you, and you can tell them that you can't fix it because close() > *shouldn't* return EDQUOT. Okay, wiseass. My pty program writes output on the pseudo-tty to its own output, the normal stdout passed as usual from pty's invoker. Now let's say I'm trying to obey your ``reliability'' constraints. Rather than letting the system close the descriptor upon exit(), I take care to close() it myself. I check the return value and error code. Suppose it now returns an error. Uh-oh, EDQUOT. How do I handle it? I'm following your prescriptions, and I've saved everything I've written in an internal buffer. But I have no idea what other programs might have written to the same descriptor. They won't have found out about the EDQUOT, even though they checked close()---because the system buffer hasn't yet been flushed. How am I supposed to handle this error? Jon, I don't think you've thought out the consequences of this problem. I don't think you've even considered the above situation. I challenge you to describe a robust EDQUOT handler. 
Feel free to apologize through e-mail rather than in public. > And I'll admit that i may not be able to see forever into the future and > tell that we will never ever be able to justify a filesystem that doesn't > detect quota problems on write(). I agree with your general point; this is one of the many arguments I've put forth in comp.std.unix for why files shouldn't be forced into the filesystem abstraction. However, I'm not talking about future restrictions. I'm talking about what we have *now*. The only example of real-world EDQUOT close() behavior (that won't be fixed soon) is NFS, which everyone agrees is a botch. You may be right, and timely EDQUOT reporting may be difficult on some future file type. But it isn't now. > |> ``But then it has to send a request immediately over the network and > |> wait to find out how much space is left!'' you say. Not so. Does TCP > |> force each side to stay synchronized on every packet? Of course not. The > |> file server can just pass out ``allocations'' of disk space. Things only > |> slow down when you're very close to the hard limit. > Please explain what you're proposing here -- I don't quite get it. Windows. Just like TCP windows. This idea, like mostly everything else in computer ``science,'' has been around since at least the sixties. You make the most common operations faster by keeping the information they need readily available. ---Dan
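[ Editor's note: Dan's ``allocations'' scheme can be sketched concretely. The following C fragment is purely illustrative -- grant_window(), charge_write(), and the chunk size are all invented here; no real NFS operation works this way. ]

```c
/* Sketch of TCP-window-style quota "allocations": the server hands a
 * client a window of blocks it may consume with no further round trips,
 * and grants only shrink near the hard limit.  All names and the chunk
 * size are hypothetical. */

#define GRANT_CHUNK 64L  /* blocks per grant; a made-up tuning constant */

/* Server side: how many blocks to grant a client whose user has `used`
 * blocks against `hard_limit`.  Returns 0 at or over quota, so only
 * then must the client fail writes (or go synchronous). */
long grant_window(long used, long hard_limit)
{
    long left = hard_limit - used;
    if (left <= 0)
        return 0;
    return left < GRANT_CHUNK ? left : GRANT_CHUNK;
}

/* Client side: charge a write against the current window.  Returns the
 * remaining window, or -1 meaning "ask the server for another grant" --
 * and only that request can come back over quota. */
long charge_write(long window, long blocks)
{
    return blocks > window ? -1 : window - blocks;
}
```

Far from the limit, every grant is a full GRANT_CHUNK and writes generate no per-write traffic; within one chunk of the hard limit the grants shrink, which is exactly the ``things only slow down when you're very close to the hard limit'' behavior described above.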
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (10/26/90)
In article <thurlow.656468483@convex.convex.com> thurlow@convex.com (Robert Thurlow) writes: > In <9681:Oct2004:06:3090@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: > >In article <thurlow.656303314@convex.convex.com> thurlow@convex.com (Robert Thurlow) writes: > >> In <24048:Oct1822:23:2090@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: > >> >["I don't mind close returning -1/EINTR and -1/EBADF."] > >> >No other close() errors make sense. > >> So how do you pick up errors on asynchronous writes? > >That is an excellent question. I suspect that if UNIX had truly > >asynchronous I/O, my objections would disappear, as the whole system > >would work much more cleanly. Somehow I'm not sure the latest mound of > >P1003.4 is the right way to approach asynchronous I/O, but anyway... > >``What asynchronous writes?'' > I have them by my definition - "my can process have control before my > data has been committed to permanent storage". What definition are you > using, and why don't you feel my write(2) calls are asynchronous? I was asking the same question several months ago, before Larry McVoy and Jim Giles explained to me what I was missing. Frankly, I didn't find their explanations too elucidating at first, so here's the scoop. ``Asynchronous I/O'' can mean two different things. One is buffered I/O: output or input that is cached on its way to or from the disk. This is what UNIX has always had. It doesn't imply multiprocessing, or a change in programming philosophy, or a different interface from that of the synchronous, unbuffered I/O in a primitive OS. For all you know, your data might never leave the buffers---or buffers might not be used at all. The advantage of buffers is that they reduce turnaround time: you don't block waiting for data to go to disk or to another process, as long as it fits into a buffer. 
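[ Editor's note: the buffered-I/O point is easy to see from a program's side. A minimal sketch using only standard Unix calls; the function name and path are invented for illustration. ]

```c
#include <fcntl.h>
#include <unistd.h>

/* Buffered ("asynchronous" only in the loose sense) writing: a
 * successful write() promises merely that the kernel has copied the
 * bytes into a buffer; fsync() is the call that waits for the disk
 * (or the NFS server) and surfaces any delayed I/O errors. */
int buffered_write_demo(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Returns as soon as the data is in the buffer cache; the disk
     * may not have seen it yet. */
    if (write(fd, "hello\n", 6) != 6) {
        close(fd);
        return -1;
    }

    /* Only here do we block until the buffers are flushed. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);  /* and even this can fail on some filesystems */
}
```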
Another is truly asynchronous I/O: reads or writes that happen asynchronously, concurrently with your program. The right way to see the difference between synchronous and asynchronous I/O is to look at the lowest level of I/O programming. A synchronous write has the CPU taking time off from executing your program. It copies bytes of data, one by one, into an output port. An asynchronous write has a separate I/O processor doing the work. Your CPU takes only a moment to let the I/O processor know about a block of data; then it returns to computation. The CPU wouldn't run any faster if there were no I/O to do. UNIX has never let truly asynchronous I/O show through fully to the programmer. Although any real computer does have some sort of I/O processor doing asynchronous reads and writes at the lowest levels, UNIX sticks at least one level of buffering between programs and this asynchronicity. Disk-to-buffer I/O synchronicity would have no more impact on programs than any other scheduling problem. Truly asynchronous I/O---without buffering---involves a change in programming style. Since the data is not copied by the CPU, your process has to know when it's safe to access that area of memory. This implies that processes have to see a signal when the I/O really finishes. In other words, truly asynchronous I/O is much closer to the level of the machine, where scheduling I/O and waiting for a signal is the norm. I hope this clears up what I mean when I say that UNIX doesn't have asynchronous I/O. (Btw, I'm finishing up a signal-schedule, aka non-preemptive threads, library. Anyone want to see particular features in it? It won't give you asynchronous I/O without kernel support, but it'll provide a framework for easing async syscalls into your code.) > > [ I object that programs can't afford to keep data around in case of ] > > [ possible problems; errors must be returned in a timely fashion ] > >> This is ridiculous. 
If a program wants to _know_ the data is secured, > >> it can call fsync before it frees the buffers or overwrites the data. > >I sympathize with what you're trying to say, but have you noticed that > >fsync() isn't required to flush data over NFS, any more than write() is > >required to return EDQUOT correctly? If write()'s errors aren't > >accurate, I don't know how you expect fsync() to work. > Our fsync, like Sun's, ensures there are no pages in the VM system > marked as "dirty", and it does this by forcing and waiting for I/O > on each such page. The I/O involves an NFS write, and any I/O errors > are detected. Are you sure? Suppose the remote side is a symbolic link to yet another NFS-mounted directory. Is the fsync() really propagated? This raises the real question: Why should I have to waste all that traffic on periodic fsync()s, when the traffic for timely EDQUOT detection would be a mere fraction of the amount? I can't afford to buffer everything and do just an fsync() before the final close(). > >> "Allocations"? I won't lightly put *any* state into my NFS server, never > >> mind state to take care of frivolities like close returning EDQUOT. > >No good remote file system is stateless. I think every complaint I've > >heard about NFS is caused by the ``purity'' of its stateless > >implementation. > No doubt, but I appreciate the advantages of the simplicity this allows. Minor advantages at best. > When it is clear what state we need to introduce to make a more robust > implementation, it'll probably happen. I hope so. ---Dan
michael@fts1.uucp (Michael Richardson) (10/26/90)
In article <thurlow.656468286@convex.convex.com> thurlow@convex.com (Robert Thurlow) writes: >In <BZS.90Oct19231046@world.std.com> bzs@world.std.com (Barry Shein) writes: >>of multiple writing processes. But writes in NFS are synchronous, so >>that's not hard, the server knows and will return the error and the >>write() doesn't return until it's committed. If you shut that off for >>performance gains you're on your own. >write(2) is not synchronous to the process; the NFS write operation is, >but they may be done later by block I/O daemon processes on your behalf >at a later time. We allow synchronous writes, but that isn't what a >naive process gets. Seems to me that the above two statements solve the problem: If NFS write()s are synchronous, then the server can certainly take care of EDQUOT during the request. The write will fail. A simple modification to the NFS write operation would return the quota left on that device on each request. (Disclaimer: it has been over a year since I did anything serious with SUNs or equivalent NFS'd machines, two years since I read the NFS and XDR stuff.) Now, the biod processes are the ones that get you the asynchronous write(2) operation on remote file systems, so it seems that THEY should be the ones that worry about whether there is enough disk quota left. At some point where the number of blocks left is the maximum number that could be queued, the write() operation becomes synchronous. (biod's are kernel processes, so it is the kernel that is worrying about the quota. Same difference.) Checking for EDQUOT on close() might be a good idea (like checking return codes for ANY system or library call), but what you'd do after getting it (and taking the data from a pipe, or a tty --- a user) is beyond me. Someone's .sig said something like "Only check for errors you know how to deal with." --- I like the spirit of this. -- :!mcr!: | The postmaster never | So much mail, Michael Richardson | resolves twice. | so few cycles. 
Play: mcr@julie.UUCP Work: michael@fts1.UUCP Fido: 1:163/109.10 1:163/138 Amiga----^ - Pay attention only to _MY_ opinions. - ^--Amiga--^
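[ Editor's note: Michael's fallback rule -- go synchronous once the quota remaining could be covered by what the biods might have queued -- reduces to a one-line predicate. This is one reading of his proposal, with invented names: ]

```c
/* A client keeps writing asynchronously (via biods) while the quota
 * remaining exceeds the most it could possibly have in flight; at or
 * below that threshold every write() goes synchronous, so EDQUOT
 * surfaces on the write() itself rather than at close(). */
int must_write_synchronously(long quota_left_blocks, long max_queued_blocks)
{
    return quota_left_blocks <= max_queued_blocks;
}
```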
thurlow@convex.com (Robert Thurlow) (10/28/90)
In <12045:Oct2604:56:3290@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: >In article <thurlow.656468483@convex.convex.com> thurlow@convex.com (Robert Thurlow) writes: >> In <9681:Oct2004:06:3090@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes: >> >In article <thurlow.656303314@convex.convex.com> thurlow@convex.com (Robert Thurlow) writes: >> >> So how do you pick up errors on asynchronous writes? >> >``What asynchronous writes?'' >> [The ones I get when I don't use O_SYNC.] > [A discussion of async I/O that sounds like the VMS model.] I think that the model you described would be a nicer way to do some things; you'd be able to avoid a buffer copy, and a program would be better able to tell exactly what was going on. But I don't agree that it is impossible to get multiprocessing with the Unix behaviour. >Another is truly asynchronous I/O: reads or writes that happen >asynchronously, concurrently with your program. The right way to see the >difference between synchronous and asynchronous I/O is to look at the >lowest level of I/O programming. A synchronous write has the CPU taking >time off from executing your program. It copies bytes of data, one by >one, into an output port. An asynchronous write has a separate I/O >processor doing the work. Your CPU takes only a moment to let the I/O >processor know about a block of data; then it returns to computation. >The CPU wouldn't run any faster if there were no I/O to do. With a buffer cache, your CPU has to do a copy from user space into the buffer cache, but it can then just submit an I/O request and get on with other stuff. If the I/O processor has access to the buffer cache and the buffers are properly semaphored, the CPU can do what it wants until it maps another I/O request onto the same block or it gets an 'I/O complete' interrupt. In this case, a 'synchronous write' just means your proc will remain asleep until after the I/O has been done. 
The buffer copy kinda sucks, and I agree that the programmer can't use your alternate I/O programming style, but I don't see that this makes this "less asynchronous". >> Our fsync, like Sun's, ensures there are no pages in the VM system >> marked as "dirty", and it does this by forcing and waiting for I/O >> on each such page. The I/O involves an NFS write, and any I/O errors >> are detected. >Are you sure? Suppose the remote side is a symbolic link to yet another >NFS-mounted directory. Is the fsync() really propagated? It doesn't go over the wire, which is another reason why we have to rely on the actual NFS write operation being synchronous. And the server won't hand out filehandles that resolve to symlinks or accept them when given as arguments to a write operation. What kind of failure mode was on your mind? >This raises the real question: Why should I have to waste all that traffic >on periodic fsync()s, when the traffic for timely EDQUOT detection would >be a mere fraction of the amount? I can't afford to buffer everything >and do just an fsync() before the final close(). I don't object to network traffic for EDQUOT detection a tenth as much as adding state to my servers and junk to my protocol to allow a server to "dole out" a disk allocation to a client and be able to get it back when it runs low. If you want to build a bulletproof system that is as immune to server and client failure as the current implementation of NFS, I'd consider its merits, but if you ask me, we need a bug-free lock manager/status monitor a lot more than we need something that just lets you ignore error returns from close. Rob T -- Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu ---------------------------------------------------------------------- "This opinion was the only one available; I got here kind of late."
thurlow@convex.com (Robert Thurlow) (10/29/90)
In <1990Oct26.050448.26816@fts1.uucp> michael@fts1.uucp (Michael Richardson) writes: > Now, the biod processes are the ones that get you the asynchronous >write(2) operation on remote file systems, so it seems that THEY >should be the one that worries about whether there is enough disk >quota left. At some point where the number of blocks left is >the maximum number that could be queued, the write() operation >becomes synchronous. This might be workable; the only problem is the NFS protocol change required to carry the quota information, because we're still waiting for a couple of more serious things to be fixed. > Checking for EDQUOT on close() might be a good idea, >(like checking return codes for ANY system of library call) >but what you'd do after getting it (and taking the data from >a pipe, or a tty --- a user) is beyond me. While I like to handle errors, I _demand_ to know about them so that I can warn my users. Even here, a workaround might be to have the process retry the close so the kernel will retry the NFS writes, after telling the user he is over quota so that he can try to delete some files on the server. If your process exited, _close() could just go ahead and burn the blocks out of the cache. > Someone's .sig said something like "Only check for errors >you know how to deal with." --- I like the spirit of this. Hey, warning a user is still better than ignoring an error condition your program can't handle. #include <osi/std/disclaimer> Rob T -- Rob Thurlow, thurlow@convex.com or thurlow%convex.com@uxc.cso.uiuc.edu ---------------------------------------------------------------------- "This opinion was the only one available; I got here kind of late."
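[ Editor's note: Rob's warn-and-retry workaround might look like the sketch below. This is only one possible reading of it: POSIX does not promise the descriptor is still usable after a failed close(), so the retry leans on the NFS client behaviour he describes, and the retry count and messages are arbitrary. ]

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Close fd, reporting (never hiding) delayed write errors such as
 * EDQUOT.  On EDQUOT, warn the user and retry, giving them a chance
 * to free space on the server so the cached writes can still flush.
 * Hypothetical sketch; not portable, per the note above. */
int close_loudly(int fd, int max_retries)
{
    int tries;
    for (tries = 0; tries <= max_retries; tries++) {
        if (close(fd) == 0)
            return 0;
        if (errno != EDQUOT)
            break;  /* some other failure: report it and give up */
        fprintf(stderr, "close: over quota on server; "
                        "free some space and the data may still flush\n");
        sleep(1);   /* give the user a moment to delete files */
    }
    return -1;      /* caller warns the user; data may be lost */
}
```

Even when no recovery is possible, this at least honors the "I _demand_ to know about them so that I can warn my users" position: the error is surfaced instead of silently dropped at exit().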