cline@suntan.ece.clarkson.edu (Marshall Cline) (07/01/89)
Craig Jackson (drilex!dricej@bbn.com, drilex!dricejb@bbn.com) and I
(Marshall Cline, cline@sun.soe.clarkson.edu) have been having an email
discussion, some of the results of which I'd like to post.

The gist is a set of new system calls which allow "sync()" to occur on a
file-by-file basis. Right now "update()" syncs the entire filesystem
occasionally. But _some_ file (and only the user knows which) might need
to be kept up-to-date more often to avoid disaster in the event of a
system crash....

In article <CLINE.89Jun22135208@suntan.ece.clarkson.edu> M.Cline writes:
> ...One of the philosophical issues upon which Un*x
>is built is the separation of policy and mechanism. The kernel takes
>care of mechanism, with as little policy as possible. (No one would
>claim Un*x does a perfect at this separation, but it tries)....
> Avoiding delayed-write would, in general, bring performance down
>pretty badly. Given an OS that _does_ have delayed-write, the only
>question is: how often should we "sync" the system? Personally, I
>think this _policy_ decision should be left _OUT_ of the kernel.

dricejb@drilex.UUCP (Craig Jackson) replies:
>Many Unixes do not offer me my policy preferences: force all directory writes,
>and *certain* file writes, directly to the disk. When forcing a file write,
>tell me when all previously-written data has actually reached the disk,
>so I can proceed with other writes.

I thought that sounded like a "neat enough" idea to post it and get other
ideas (and undoubtedly a few flames :-).

I propose the following _mechanism_ (system calls) could be added to
facilitate the _policy_. The default situation (on file descriptors which
haven't been changed via one of these system calls) could be equivalent to
the current delayed-write-and-update-occasionally method.

__________________________ NEW SYSTEM CALLS ___________________________

int num_dirty_blocks(int fd);
/* "fd" is the file descriptor of an open file.
 * Returns the number of dirty blocks associated with the file descriptor.
 * If "update_policy()" has been set to IMMEDIATE_WRITE, always returns 0.
 * Returns -1 on error (and sets errno, blah, blah, blah).
 */

int sync_fd(int fd, int wait_until_done);
/* Causes precisely one file to be "sync()"ed (see sync(3)).
 * This gives users the fine-grained control over which files they
 * consider critical in the event of a system crash.
 * Thus the policy is in the hands of the users, not the kernel.
 * If "wait_until_done" is non-zero, and if the "fd" has any dirty
 * blocks (see "num_dirty_blocks()"), the calling process is blocked
 * until all dirty blocks of the "fd" are successfully written to disk.
 * Returns 0 on success, -1 on error (and sets errno, blah, blah, blah).
 */

typedef enum {DELAYED_WRITE, IMMEDIATE_WRITE, BAD_POLICY} sync_t;

sync_t update_policy(int fd, sync_t mechanism);
/* "fd" is the file descriptor of an open file.
 * This directs the kernel regarding the desired update policy for "fd".
 * If "mechanism" is DELAYED_WRITE, doesn't write dirty data blocks until
 * a "sync()" or a "sync_fd()" call (which might happen due to "update(8)").
 * If "mechanism" is IMMEDIATE_WRITE, any changed blocks are immediately
 * written to disk. This option should be used only on files where the
 * file's data is critical with respect to a system crash.
 * Returns the update_policy that formerly was associated with the file.
 * If "fd" is not an open file descriptor, or if "mechanism" isn't one of
 * the above values, no action is performed, and BAD_POLICY is returned.
 */

The end result would of course be that you could, on a file-by-file basis,
specify what needs to be kept consistent. Those files which ABSOLUTELY
MUST be CONTINUALLY preserved intact could have their update policy set to
IMMEDIATE_WRITE. Those files which OCCASIONALLY need to have their
contents "flushed" to disk could use "sync_fd()", similarly to the way one
fflushes stdout so incomplete lines appear on the screen.
Anyway, the point is this: Keep the policy out of the kernel.

Any thoughts?
Marshall
--
________________________________________________________________
Marshall P. Cline       ARPA:   cline@sun.soe.clarkson.edu
ECE Department          UseNet: uunet!sun.soe.clarkson.edu!cline
Clarkson University     BitNet: BH0W@CLUTX
Potsdam, NY 13676       AT&T:   315-268-6591
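
[Editor's illustration, not part of the original post.] The "occasionally
flushed" case described above (write something, then make sure that one
file has reached the disk before carrying on) might look roughly like the
sketch below. It assumes a system that provides fsync(2), which stands in
here for the proposed "sync_fd(fd, 1)"; the helper name
append_record_durably is hypothetical and was chosen only for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): append one record to "path"
 * and do not return until the kernel reports that this file's modified
 * data has been written out.  fsync(2) plays the role of the proposed
 * sync_fd(fd, 1); other files keep the default delayed-write behaviour.
 * Returns 0 on success, -1 on any error.
 */
int append_record_durably(const char *path, const char *rec)
{
    size_t len, done = 0;
    int fd, rc;

    assert(path != NULL && rec != NULL);
    len = strlen(rec);

    fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    /* write(2) may accept fewer bytes than requested, so loop. */
    while (done < len) {
        ssize_t n = write(fd, rec + done, len - done);
        if (n <= 0) {           /* error, or no progress: give up */
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }

    /* Per-file flush: blocks until this file's dirty data is written. */
    rc = fsync(fd);
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

A caller would use this only for the few files it considers critical, for
example append_record_durably("journal.log", "balance=42\n"). Note that
this sketch flushes the file itself; whether a newly created directory
entry is also on disk is a separate question that it does not address.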
chris@mimsy.UUCP (Chris Torek) (07/06/89)
In article <CLINE.89Jun30171140@suntan.ece.clarkson.edu>
cline@suntan.ece.clarkson.edu (Marshall Cline) posts one copy, each
separate, to comp.unix.questions and comp.unix.wizards, of the following.
I have redirected followups to comp.unix.questions only.

>... I propose the following _mechanism_ (system calls) could be added
>to facilitate the _policy_. ...

>int num_dirty_blocks(int fd);
>/* "fd" is the file descriptor of an open file.
> * Returns the number of dirty blocks associated with the file descriptor.

There is now no system call that deals with `blocks' rather than `bytes',
and I believe this should be maintained. (Actually, on 4.3BSD and similar,
`stat' fills in an st_blocks variable, but this is in addition to the
st_size field.) In any case, I am not sure why one would care how many
bytes or blocks were not yet written: the only interesting question seems
to be `any' versus `none', with the object of changing `any' into `none'
if necessary.

>int sync_fd(int fd, int wait_until_done);
>/* Causes precisely one file to be "sync()"ed (see sync(3)).

[there is no sync(3); perhaps you mean sync(2)]

> * This gives users the fine-grained control over which files they
> * consider critical in the event of a system crash.

4BSD already has `fsync', which

     ... causes all modified data and attributes of /fd/ to be moved to
     a permanent storage device. This normally results in all in-core
     modified copies of buffers for the associated file to be written
     to a disk.

     /Fsync/ should be used by programs that require a file to be in a
     known state, for example, in building a simple transaction
     facility.

(/x/ represents italics). The only difference here is that you are forced
to wait (fsync uses bwrite, not bawrite).

>typedef enum {DELAYED_WRITE, IMMEDIATE_WRITE, BAD_POLICY} sync_t;
>
>sync_t update_policy(int fd, sync_t mechanism);

SysV has an open() flag that causes writes to be immediate. Presumably
this can also be set and cleared with fcntl(F_SETFL).
4BSD might acquire this someday as well (it is not hard to implement).
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu    Path: uunet!mimsy!chris
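
[Editor's illustration, not part of the original post.] The
immediate-write open() flag mentioned above can be sketched as follows.
This assumes the flag in question is the one POSIX spells O_SYNC; the post
itself does not name it. The helper name open_immediate_write is
hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): open "path" for appending
 * with O_SYNC, so that each write(2) on the returned descriptor completes
 * only after the data has been written through to the storage device.
 * This corresponds roughly to the proposed IMMEDIATE_WRITE policy, chosen
 * per descriptor at open time.
 * Returns a file descriptor, or -1 on error (errno set by open).
 */
int open_immediate_write(const char *path)
{
    assert(path != NULL);
    return open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
}
```

Ordinary write(fd, buf, n) calls on the returned descriptor then behave
synchronously, at the performance cost discussed earlier in the thread.
Whether fcntl(F_SETFL) can switch the flag on or off for an already-open
descriptor varies between systems; the Linux fcntl(2) manual page, for
one, says O_SYNC cannot be changed that way, so setting it at open() time
is the safer assumption.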