cline@suntan.ece.clarkson.edu (Marshall Cline) (07/01/89)
Craig Jackson (drilex!dricej@bbn.com, drilex!dricejb@bbn.com) and I
(Marshall Cline, cline@sun.soe.clarkson.edu) have been having an email
discussion, some of the results of which I'd like to post.

The gist is a set of new system calls which allow "sync()" to occur on a
file-by-file basis. Right now "update()" syncs the entire filesystem
occasionally. But _some_ file (and only the user knows which) might need
to be kept up-to-date more often to avoid disaster in the event of a
system crash....

In article <CLINE.89Jun22135208@suntan.ece.clarkson.edu> M.Cline writes:
> ...One of the philosophical issues upon which Un*x
>is built is the separation of policy and mechanism. The kernel takes
>care of mechanism, with as little policy as possible. (No one would
>claim Un*x does a perfect job at this separation, but it tries)....
> Avoiding delayed-write would, in general, bring performance down
>pretty badly. Given an OS that _does_ have delayed-write, the only
>question is: how often should we "sync" the system? Personally, I
>think this _policy_ decision should be left _OUT_ of the kernel.

dricejb@drilex.UUCP (Craig Jackson) replies:
>Many Unixes do not offer me my policy preferences: force all directory writes,
>and *certain* file writes, directly to the disk. When forcing a file write,
>tell me when all previously-written data has actually reached the disk,
>so I can proceed with other writes.

I thought that sounded like a "neat enough" idea to post it and get other
ideas (and undoubtedly a few flames :-).

I propose the following _mechanism_ (system calls) could be added
to facilitate the _policy_. The default situation (on file descriptors
which haven't been changed via one of these system calls) could be
equivalent to the current delayed-write-and-update-occasionally method.

__________________________ NEW SYSTEM CALLS ___________________________

int num_dirty_blocks(int fd);
/* "fd" is the file descriptor of an open file.
 * Returns the number of dirty blocks associated with the file descriptor.
 * If "update_policy()" has been set to IMMEDIATE_WRITE, always returns 0.
 * Returns -1 on error (and sets errno, blah, blah, blah).
 */

int sync_fd(int fd, int wait_until_done);
/* Causes precisely one file to be "sync()"ed (see sync(3)).
 * This gives users the fine-grained control over which files they
 * consider critical in the event of a system crash.
 * Thus the policy is in the hands of the users, not the kernel.
 * If "wait_until_done" is non-zero, and if the "fd" has any dirty
 * blocks (see "num_dirty_blocks()"), the calling process is blocked
 * until all dirty blocks of the "fd" are successfully written to disk.
 * Returns 0 on success, -1 on error (and sets errno, blah, blah, blah).
 */

typedef enum {DELAYED_WRITE, IMMEDIATE_WRITE, BAD_POLICY} sync_t;

sync_t update_policy(int fd, sync_t mechanism);
/* "fd" is the file descriptor of an open file.
 * This directs the kernel regarding the desired update policy for "fd".
 * If "mechanism" is DELAYED_WRITE, doesn't write dirty data blocks until
 * a "sync()" or a "sync_fd()" call (which might happen due to "update(8)").
 * If "mechanism" is IMMEDIATE_WRITE, any changed blocks are immediately
 * written to disk. This option should be used only on files where the
 * file's data is critical with respect to a system crash.
 * Returns the update_policy that formerly was associated with the file.
 * If "fd" is not an open file descriptor, or if "mechanism" isn't one of
 * the above values, no action is performed, and BAD_POLICY is returned.
 */

The end result would of course be that you could, on a file-by-file basis,
specify what needs to be kept consistent. Those files which ABSOLUTELY
MUST be CONTINUALLY preserved intact could have their update policy set to
IMMEDIATE_WRITE. Those files which OCCASIONALLY need to have their
contents "flushed" to disk could use "sync_fd()", similarly to the way one
fflushes stdout so incomplete lines appear on the screen.
Anyway, the point is this: Keep the policy out of the kernel.

Any thoughts?

Marshall
--
________________________________________________________________
Marshall P. Cline     ARPA:   cline@sun.soe.clarkson.edu
ECE Department        UseNet: uunet!sun.soe.clarkson.edu!cline
Clarkson University   BitNet: BH0W@CLUTX
Potsdam, NY 13676     AT&T:   315-268-6591
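
[Editorial sketch, not part of the original post.] The blocking form of the
proposed sync_fd() can be roughly approximated in user space today. The code
below is only an illustration under stated assumptions: it assumes the system
provides fsync(2), it cannot offer the non-blocking (wait_until_done == 0)
behaviour, and it leaves out num_dirty_blocks() and update_policy() because no
portable call reports dirty-buffer counts or a per-descriptor write policy.
save_critical() is a hypothetical helper invented for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Approximation of the proposed sync_fd(), layered on fsync(2).
 * ASSUMPTION: fsync() is available. It always blocks, so a zero
 * wait_until_done cannot be honoured; the parameter only mirrors the
 * proposed signature. Returns 0 on success, -1 on error (errno set). */
int sync_fd(int fd, int wait_until_done)
{
    (void)wait_until_done;
    return fsync(fd);
}

/* Hypothetical helper (not part of the proposal): replace the contents of
 * `path` with `buf` and flush that one file before returning. */
int save_critical(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd;

    assert(path != NULL && buf != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before writing: retry */
            close(fd);
            return -1;
        }
        p += n;                     /* short write: advance and go again */
        len -= (size_t)n;
    }

    if (sync_fd(fd, 1) < 0) {       /* flush only this file, not the system */
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A caller that treats one file as critical would call something like
save_critical("journal.dat", rec, sizeof rec) and leave every other file to
the ordinary delayed-write path, which keeps the policy choice in the program
rather than in the kernel.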
chris@mimsy.UUCP (Chris Torek) (07/06/89)
In article <CLINE.89Jun30171140@suntan.ece.clarkson.edu>
cline@suntan.ece.clarkson.edu (Marshall Cline) posts one copy, each
separate, to comp.unix.questions and comp.unix.wizards, of the following.
I have redirected followups to comp.unix.questions only.

>... I propose the following _mechanism_ (system calls) could be added
>to facilitate the _policy_. ...

>int num_dirty_blocks(int fd);
>/* "fd" is the file descriptor of an open file.
> * Returns the number of dirty blocks associated with the file descriptor.

There is now no system call that deals with `blocks' rather than `bytes',
and I believe this should be maintained. (Actually, on 4.3BSD and
similar, `stat' fills in an st_blocks variable, but this is in addition
to the st_size field.) In any case, I am not sure why one would care
how many bytes or blocks were not yet written: the only interesting
question seems to be `any' versus `none', with the object of changing
`any' into `none' if necessary.

>int sync_fd(int fd, int wait_until_done);
>/* Causes precisely one file to be "sync()"ed (see sync(3)).

[there is no sync(3); perhaps you mean sync(2)]

> * This gives users the fine-grained control over which files they
> * consider critical in the event of a system crash.

4BSD already has `fsync', which

    ... causes all modified data and attributes of /fd/ to be moved
    to a permanent storage device. This normally results in all
    in-core modified copies of buffers for the associated file to be
    written to a disk.

    /Fsync/ should be used by programs that require a file to be in
    a known state, for example, in building a simple transaction
    facility.

(/x/ represents italics). The only difference here is that you are
forced to wait (fsync uses bwrite, not bawrite).

>typedef enum {DELAYED_WRITE, IMMEDIATE_WRITE, BAD_POLICY} sync_t;
>
>sync_t update_policy(int fd, sync_t mechanism);

SysV has an open() flag that causes writes to be immediate. Presumably
this can also be set and cleared with fcntl(F_SETFL). 4BSD might
acquire this someday as well (it is not hard to implement).
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu   Path: uunet!mimsy!chris
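
[Editorial sketch, not part of the original post.] The two existing
mechanisms described above, a synchronous-write open() flag and fsync(),
might be combined roughly as follows. This is a minimal illustration, not
anyone's actual implementation: open_immediate() and write_durable() are
hypothetical names, the code assumes the flag is spelled O_SYNC in <fcntl.h>,
and it falls back to calling fsync() after each write where that macro is
missing. The flag is set at open() time because some systems (Linux, for one)
do not let fcntl(F_SETFL) change it afterwards.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open `path` for writing so that each write() is on
 * disk before it returns. ASSUMPTION: O_SYNC, where defined, has the
 * "writes are immediate" meaning described in the post. */
int open_immediate(const char *path)
{
    assert(path != NULL);
#ifdef O_SYNC
    return open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
#else
    return open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
#endif
}

/* Hypothetical helper: write one record and report success only if the
 * whole record was accepted (and, without O_SYNC, flushed by fsync()). */
int write_durable(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);

    if (n < 0 || (size_t)n != len)
        return -1;                  /* error or short write */
#ifndef O_SYNC
    if (fsync(fd) < 0)              /* fallback: block until flushed */
        return -1;
#endif
    return 0;
}
```

Either path blocks the caller until the data has been handed to the disk,
which matches the observation that fsync forces you to wait; neither offers
the asynchronous "start the write and return" variant from the proposal.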
tjo@Fulcrum.BT.CO.UK (Tim Oldham) (07/07/89)
In article <18410@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>SysV has an open() flag that causes writes to be immediate. Presumably
>this can also be set and cleared with fcntl(F_SETFL). [yes - Tim]

Not a correction, but an addition and word of warning-ish. O_SYNC, the
flag in question, is only defined in the SVID addendum (V.3).

The word of warning, at the risk of teaching grandma to suck eggs: many
companies say "hey, we're SVID conformant". But check the SVID
conformance level. Sometimes it's without the addendum features. Even
then, check the kernel extensions conformance, otherwise you might be in
for a shock when you come to use things like ptrace(2). This is
particularly true of systems that aren't derived from AT&T source, and
just look like SysV *IX.

As a quick question, why does the SVID say that the sticky bit is
`reserved'? Usually this means that "no, you're not allowed to use that
- we're going to say what it's for later". They seem to mean "do what
you want to with this bit", which is usually described as `undefined'.
Comments?

Tim.
--
Tim Oldham   tjo@fulcrum.bt.co.uk or ...!mcvax!ukc!axion!fulcrum!tjo
#include <stdisclaim>
Why have coffee, when caffeine tastes this good?