[comp.unix.wizards] Fine grained sync policy control

cline@suntan.ece.clarkson.edu (Marshall Cline) (07/01/89)

Craig Jackson (drilex!dricej@bbn.com, drilex!dricejb@bbn.com) and
I (Marshall Cline, cline@sun.soe.clarkson.edu) have been having an
email discussion, some of the results of which I'd like to post.

The gist is a set of new system calls which allow "sync()" to occur on a
file-by-file basis.  Right now "update(8)" syncs the entire filesystem
occasionally.  But _some_ files (and only the user knows which) might need
to be kept up-to-date more often to avoid disaster in the event of a system
crash....

In article <CLINE.89Jun22135208@suntan.ece.clarkson.edu> M.Cline writes:
>   ...One of the philosophical issues upon which Un*x
>is built is the separation of policy and mechanism.  The kernel takes
>care of mechanism, with as little policy as possible.  (No one would
>claim Un*x does a perfect job of this separation, but it tries)....
>   Avoiding delayed-write would, in general, bring performance down
>pretty badly.  Given an OS that _does_ have delayed-write, the only
>question is: how often should we "sync" the system?  Personally, I
>think this _policy_ decision should be left _OUT_ of the kernel.

dricejb@drilex.UUCP (Craig Jackson) replies:
>Many Unixes do not offer me my policy preferences: force all directory writes,
>and *certain* file writes, directly to the disk.  When forcing a file write,
>tell me when all previously-written data has actually reached the disk,
>so I can proceed with other writes.

I thought that sounded like a "neat enough" idea to post it and get other
ideas (and undoubtedly a few flames :-).  I propose the following _mechanism_
(system calls) could be added to facilitate the _policy_.  The default
situation (on file descriptors which haven't been changed via one of these
system calls) could be equivalent to the current delayed-write-and-
update-occasionally method.

__________________________ NEW SYSTEM CALLS ___________________________

int num_dirty_blocks(int fd);
/* "fd" is the file descriptor of an open file.
 * Returns the number of dirty blocks associated with the file descriptor.
 * If "update_policy()" has been set to IMMEDIATE_WRITE, always returns 0.
 * Returns -1 on error (and sets errno, blah, blah, blah).
 */

int sync_fd(int fd, int wait_until_done);
/* Causes precisely one file to be "sync()"ed (see sync(3)).
 * This gives users the fine-grained control over which files they
 * consider critical in the event of a system crash.
 * Thus the policy is in the hands of the users, not the kernel.
 * If "wait_until_done" is non-zero, and if the "fd" has any dirty
 * blocks (see "num_dirty_blocks()"), the calling process is blocked
 * until all dirty blocks of the "fd" are successfully written to disk.
 * Returns 0 on success, -1 on error (and sets errno, blah, blah, blah).
 */

typedef enum {DELAYED_WRITE, IMMEDIATE_WRITE, BAD_POLICY} sync_t;

sync_t update_policy(int fd, sync_t mechanism);
/* "fd" is the file descriptor of an open file.
 * This directs the kernel regarding the desired update policy for "fd".
 * If "mechanism" is DELAYED_WRITE, doesn't write dirty data blocks until
 * a "sync()" or a "sync_fd()" call (which might happen due to "update(8)").
 * If "mechanism" is IMMEDIATE_WRITE, any changed blocks are immediately
 * written to disk.  This option should be used only on files where the
 * file's data is critical with respect to a system crash.
 * Returns the update_policy that formerly was associated with the file.
 * If "fd" is not an open file descriptor, or if "mechanism" isn't one of
 * the above values, no action is performed, and BAD_POLICY is returned.
 */


The end result would of course be that you could, on a file-by-file basis,
specify what needs to be kept consistent.  Those files which ABSOLUTELY
MUST be CONTINUALLY preserved intact could have their update policy set to
IMMEDIATE_WRITE.  Those files which OCCASIONALLY need to have their contents
"flushed" to disk could use "sync_fd()", similarly to the way one fflushes
stdout so incomplete lines appear on the screen.
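
To make the fflush analogy concrete, here is a minimal sketch in C.  It is
not the proposed "sync_fd()": it assumes a system that provides fsync(2) and
uses that as a stand-in, so it always waits for the write to finish.  The
helper name "flush_to_disk" is made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L  /* for fileno() and fsync() under strict C */
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: push a stdio stream's data all the way to disk.
 * Step 1: fflush() moves data from the stdio buffer into the kernel.
 * Step 2: fsync() asks the kernel to write the file's dirty buffers out
 *         and blocks until that is done.
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 */
int flush_to_disk(FILE *fp)
{
    if (fflush(fp) == EOF)
        return -1;
    if (fsync(fileno(fp)) == -1)
        return -1;
    return 0;
}
```

A caller would write a critical record with fputs() or fprintf() and then
call flush_to_disk(fp) before treating the record as safe.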

Anyway, the point is this: Keep the policy out of the kernel.
Any thoughts?

Marshall
--
	________________________________________________________________
	Marshall P. Cline	ARPA:	cline@sun.soe.clarkson.edu
	ECE Department		UseNet:	uunet!sun.soe.clarkson.edu!cline
	Clarkson University	BitNet:	BH0W@CLUTX
	Potsdam, NY  13676	AT&T:	315-268-6591

chris@mimsy.UUCP (Chris Torek) (07/06/89)

In article <CLINE.89Jun30171140@suntan.ece.clarkson.edu>
cline@suntan.ece.clarkson.edu (Marshall Cline) posts one copy, each
separate, to comp.unix.questions and comp.unix.wizards, of the
following.  I have redirected followups to comp.unix.questions only.

>... I propose the following _mechanism_ (system calls) could be added
>to facilitate the _policy_. ...

>int num_dirty_blocks(int fd);
>/* "fd" is the file descriptor of an open file.
> * Returns the number of dirty blocks associated with the file descriptor.

There is currently no system call that deals with `blocks' rather than
`bytes', and I believe this should be maintained.  (Actually, on 4.3BSD
and similar, `stat' fills in an st_blocks variable, but this is in
addition to the st_size field.)  In any case, I am not sure why one
would care how many bytes or blocks were not yet written: the only
interesting question seems to be `any' versus `none', with the object
of changing `any' into `none' if necessary.

>int sync_fd(int fd, int wait_until_done);
>/* Causes precisely one file to be "sync()"ed (see sync(3)).
[there is no sync(3); perhaps you mean sync(2)]
> * This gives users the fine-grained control over which files they
> * consider critical in the event of a system crash.

4BSD already has `fsync', which

	... causes all modified data and attributes of /fd/ to be
	moved to a permanent storage device.  This normally results
	in all in-core modified copies of buffers for the associated
	file to be written to a disk.

	/Fsync/ should be used by programs that require a file to be
	in a known state, for example, in building a simple transaction
	facility.

(/x/ represents italics).  The only difference here is that you are
forced to wait (fsync uses bwrite, not bawrite).
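
As an illustration of the "known state" use the manual describes, here is a
minimal sketch of appending one record and not returning until fsync has
completed.  The function name "append_durably" and its error handling are
assumptions made for this example, not part of any system interface.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: append len bytes from buf to the file at path and
 * wait until fsync() reports that the data has been written out.
 * Returns 0 on success, -1 on error with errno set.
 */
int append_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int saved;
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);

    if (fd == -1)
        return -1;

    /* write() may accept fewer bytes than requested, so loop. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;           /* interrupted before writing: retry */
            saved = errno;
            close(fd);
            errno = saved;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Block until the file's modified data has been moved to the disk. */
    if (fsync(fd) == -1) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return close(fd);
}
```

A simple transaction log could call append_durably() once per commit record
and acknowledge the commit only after it returns 0.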

>typedef enum {DELAYED_WRITE, IMMEDIATE_WRITE, BAD_POLICY} sync_t;
>
>sync_t update_policy(int fd, sync_t mechanism);

SysV has an open() flag that causes writes to be immediate.  Presumably
this can also be set and cleared with fcntl(F_SETFL).  4BSD might
acquire this someday as well (it is not hard to implement).
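
A minimal sketch of the open() approach, assuming the flag in question is
the one POSIX calls O_SYNC.  The helper names are made up for illustration.
The sketch sets the flag at open time only, because whether fcntl(F_SETFL)
can change it later varies by system (Linux, for one, ignores the attempt).

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open path for appending so that each write()
 * returns only after the data has been written through to the disk.
 * Returns a file descriptor, or -1 on error with errno set.
 */
int open_immediate_write(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
}

/* Hypothetical helper: report whether O_SYNC is set on fd.
 * Returns 1 if set, 0 if not, -1 on error.
 */
int has_immediate_write(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return (flags & O_SYNC) == O_SYNC;
}
```

This covers the IMMEDIATE_WRITE half of the proposed "update_policy()" for
files the program opens itself; it does not help with a descriptor that was
inherited already open.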
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris