[comp.unix.questions] Fine grained sync policy control

cline@suntan.ece.clarkson.edu (Marshall Cline) (07/01/89)

Craig Jackson (drilex!dricej@bbn.com, drilex!dricejb@bbn.com) and
I (Marshall Cline, cline@sun.soe.clarkson.edu) have been having an
email discussion, some of the results of which I'd like to post.

The gist is a set of new system calls which allow "sync()" to occur on a
file-by-file basis.  Right now "update(8)" syncs the entire filesystem
periodically.  But _some_ files (and only the user knows which) might need
to be kept up-to-date more often to avoid disaster in the event of a system
crash....

In article <CLINE.89Jun22135208@suntan.ece.clarkson.edu> M.Cline writes:
>   ...One of the philosophical issues upon which Un*x
>is built is the separation of policy and mechanism.  The kernel takes
>care of mechanism, with as little policy as possible.  (No one would
>claim Un*x does a perfect job of this separation, but it tries)....
>   Avoiding delayed-write would, in general, bring performance down
>pretty badly.  Given an OS that _does_ have delayed-write, the only
>question is: how often should we "sync" the system?  Personally, I
>think this _policy_ decision should be left _OUT_ of the kernel.

dricejb@drilex.UUCP (Craig Jackson) replies:
>Many Unixes do not offer me my policy preferences: force all directory writes,
>and *certain* file writes, directly to the disk.  When forcing a file write,
>tell me when all previously-written data has actually reached the disk,
>so I can proceed with other writes.

I thought that sounded like a "neat enough" idea to post it and get other
ideas (and undoubtedly a few flames :-).  I propose the following _mechanism_
(system calls) could be added to facilitate the _policy_.  The default
situation (on file descriptors which haven't been changed via one of these
system calls) could be equivalent to the current delayed-write-and-
update-occasionally method.

__________________________ NEW SYSTEM CALLS ___________________________

int num_dirty_blocks(int fd);
/* "fd" is the file descriptor of an open file.
 * Returns the number of dirty blocks associated with the file descriptor.
 * If "update_policy()" has been set to IMMEDIATE_WRITE, always returns 0.
 * Returns -1 on error (and sets errno, blah, blah, blah).
 */

int sync_fd(int fd, int wait_until_done);
/* Causes precisely one file to be "sync()"ed (see sync(3)).
 * This gives users the fine-grained control over which files they
 * consider critical in the event of a system crash.
 * Thus the policy is in the hands of the users, not the kernel.
 * If "wait_until_done" is non-zero, and if the "fd" has any dirty
 * blocks (see "num_dirty_blocks()"), the calling process is blocked
 * until all dirty blocks of the "fd" are successfully written to disk.
 * Returns 0 on success, -1 on error (and sets errno, blah, blah, blah).
 */

typedef enum {DELAYED_WRITE, IMMEDIATE_WRITE, BAD_POLICY} sync_t;

sync_t update_policy(int fd, sync_t mechanism);
/* "fd" is the file descriptor of an open file.
 * This directs the kernel regarding the desired update policy for "fd".
 * If "mechanism" is DELAYED_WRITE, doesn't write dirty data blocks until
 * a "sync()" or a "sync_fd()" call (which might happen due to "update(8)").
 * If "mechanism" is IMMEDIATE_WRITE, any changed blocks are immediately
 * written to disk.  This option should be used only on files where the
 * file's data is critical with respect to a system crash.
 * Returns the update_policy that formerly was associated with the file.
 * If "fd" is not an open file descriptor, or if "mechanism" isn't one of
 * the above values, no action is performed, and BAD_POLICY is returned.
 */


The end result would of course be that you could, on a file-by-file basis,
specify what needs to be kept consistent.  Those files which ABSOLUTELY
MUST be CONTINUALLY preserved intact could have their update policy set to
IMMEDIATE_WRITE.  Those files which OCCASIONALLY need to have their contents
"flushed" to disk could use "sync_fd()", similarly to the way one fflushes
stdout so incomplete lines appear on the screen.
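
The fflush analogy can be made concrete with calls that already exist on
some systems.  What follows is only a sketch, and it assumes the system
provides fsync(2), which the proposal above does not rely on.  The name
"flush_to_disk" is hypothetical.  fflush() moves data from the stdio buffer
into the kernel's buffer cache; a per-file sync is the second step, from
the buffer cache to the disk.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: push a stdio stream's pending output to disk.
 * Assumes fsync(2) is available on this system.
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
int flush_to_disk(FILE *fp)
{
    if (fflush(fp) == EOF)        /* stdio buffer -> kernel buffer cache */
        return -1;
    if (fsync(fileno(fp)) == -1)  /* kernel buffer cache -> disk */
        return -1;
    return 0;
}
```

A program would call this at the points where it now calls fflush() alone
and cares about surviving a crash.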

Anyway, the point is this: Keep the policy out of the kernel.
Any thoughts?

Marshall
--
	________________________________________________________________
	Marshall P. Cline	ARPA:	cline@sun.soe.clarkson.edu
	ECE Department		UseNet:	uunet!sun.soe.clarkson.edu!cline
	Clarkson University	BitNet:	BH0W@CLUTX
	Potsdam, NY  13676	AT&T:	315-268-6591

chris@mimsy.UUCP (Chris Torek) (07/06/89)

In article <CLINE.89Jun30171140@suntan.ece.clarkson.edu>
cline@suntan.ece.clarkson.edu (Marshall Cline) posts separate copies of
the following to comp.unix.questions and comp.unix.wizards.  I have
redirected followups to comp.unix.questions only.

>... I propose the following _mechanism_ (system calls) could be added
>to facilitate the _policy_. ...

>int num_dirty_blocks(int fd);
>/* "fd" is the file descriptor of an open file.
> * Returns the number of dirty blocks associated with the file descriptor.

There is currently no system call that deals with `blocks' rather than
`bytes', and I believe this should be maintained.  (Actually, on 4.3BSD
and similar, `stat' fills in an st_blocks variable, but this is in
addition to the st_size field.)  In any case, I am not sure why one
would care how many bytes or blocks were not yet written: the only
interesting question seems to be `any' versus `none', with the object
of changing `any' into `none' if necessary.
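
To illustrate the st_blocks remark, here is a small sketch that prints both
fields.  It assumes a system whose struct stat has an st_blocks member (as
on 4.3BSD); the unit of st_blocks is system-dependent, so check stat(2)
before doing arithmetic with it.  The function name is made up for the
example.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: show a file's size in bytes next to its block count.
 * Assumes struct stat has st_blocks, which not every system provides.
 * Returns 0 on success, -1 if stat() fails. */
int print_size_and_blocks(const char *path)
{
    struct stat st;

    if (stat(path, &st) == -1)
        return -1;
    printf("%s: %ld bytes, %ld blocks\n",
           path, (long)st.st_size, (long)st.st_blocks);
    return 0;
}
```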

>int sync_fd(int fd, int wait_until_done);
>/* Causes precisely one file to be "sync()"ed (see sync(3)).
[there is no sync(3); perhaps you mean sync(2)]
> * This gives users the fine-grained control over which files they
> * consider critical in the event of a system crash.

4BSD already has `fsync', which

	... causes all modified data and attributes of /fd/ to be
	moved to a permanent storage device.  This normally results
	in all in-core modified copies of buffers for the associated
	file to be written to a disk.

	/Fsync/ should be used by programs that require a file to be
	in a known state, for example, in building a simple transaction
	facility.

(/x/ represents italics).  The only difference here is that you are
forced to wait (fsync uses bwrite, not bawrite).

>typedef enum {DELAYED_WRITE, IMMEDIATE_WRITE, BAD_POLICY} sync_t;
>
>sync_t update_policy(int fd, sync_t mechanism);

SysV has an open() flag that causes writes to be immediate.  Presumably
this can also be set and cleared with fcntl(F_SETFL).  4BSD might
acquire this someday as well (it is not hard to implement).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

tjo@Fulcrum.BT.CO.UK (Tim Oldham) (07/07/89)

In article <18410@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>SysV has an open() flag that causes writes to be immediate.  Presumably
>this can also be set and cleared with fcntl(F_SETFL). [yes - Tim]

Not a correction, but an addition and word of warning-ish. O_SYNC, the
flag in question, is only defined in the SVID addendum (V.3).
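
A sketch of both ways of asking for immediate writes, assuming <fcntl.h> on
the target system defines O_SYNC.  The function names are made up for the
example.  Whether fcntl(F_SETFL) really changes O_SYNC on an open
descriptor varies: some systems (Linux, for one) accept the call but
silently ignore the flag, so setting it at open() time is the safer bet.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open a file so that every write() is synchronous.
 * Assumes O_SYNC is defined. Returns a descriptor, or -1 on error. */
int open_sync(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
}

/* Hypothetical helper: try to switch synchronous writes on (enable != 0)
 * or off (enable == 0) for an already-open descriptor.
 * Returns 0 if fcntl() accepted the request, -1 on error. Acceptance does
 * not guarantee the system honoured the O_SYNC change. */
int set_sync_writes(int fd, int enable)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    if (enable)
        flags |= O_SYNC;
    else
        flags &= ~O_SYNC;
    return fcntl(fd, F_SETFL, flags) == -1 ? -1 : 0;
}
```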

The word of warning, at the risk of teaching grandma to suck eggs: many
companies say "hey, we're SVID conformant". But check the SVID conformance
level. Sometimes it's without the addendum features. Even then, check the
kernel extensions conformance, otherwise you might be in for a shock when
you come to use things like ptrace(2). This is particularly true of
systems that aren't derived from AT&T source, and just look like SysV *IX.

As a quick question, why does the SVID say that the sticky bit is
`reserved'? Usually this means that "no, you're not allowed to use that -
we're going to say what it's for later". They seem to mean "do what you
want to with this bit", which is usually described as `undefined'.
Comments?

	Tim.
-- 
Tim Oldham      tjo@fulcrum.bt.co.uk  or  ...!mcvax!ukc!axion!fulcrum!tjo
#include	<stdisclaim>
Why have coffee, when caffeine tastes this good?