[comp.databases] ensuring output has reached the disc

ok@quintus.uucp (Richard A. O'Keefe) (07/21/88)

In article <302@infmx.UUCP> aland@infmx.UUCP (Alan S. Denney @ Informix)
writes:
>The funniest thing is that they choose the Suns to run their benchmarks.
>Sun 3.X machines DO NOT SUPPORT synchronous writes anyway (no O_SYNC flag
>here, folks), so any claim that their benchmarks are hurt by "integrity
>issues" on these machines is BOGUS.  The only way to force i/o is to use
>raw devices; Oracle decries raw devices as being "complex" in their current
>ad.   (If my understanding of Sun 3.X synchronicity is wrong, I will post
>a followup and apology.  I confirmed this with a friend at Sun about a month 
>ago).  The current ads do *not* indicate that this integrity claim
>applies only to certain (e.g. SVR2+) ports, that I recall.

I'm using SunOS 3.2.  A quick "grep" through /usr/include/*/*.h confirmed
the absence of an O_SYNC flag.  But "man -k sync" turned up
	fsync (2) - synchronize a file's in-core state with that on disk
and "man 2 fsync" says that fsync(fd)
     fsync moves all modified data and attributes of fd to a
     permanent storage device: all in-core modified copies of
     buffers for the associated file have been written to a disk
     when the call returns.  Note that this is different than
     sync(2) which schedules disk I/O for all files (as though an
     fsync had been done on all files) but returns before the I/O
     completes.

     fsync should be used by programs which require a file to be
     in a known state; for example, a program which contains a
     simple transaction facility might use it to ensure that all
     modifications to a file or files caused by a transaction
     were recorded on disk.
    [Sun Release 3.2 Last Change: 16 July 1986]
				  ^^^^^^^^^^^^
This appears to claim that the changed information has actually been
written on the disc, not merely scheduled for writing.  What more do
you want, exactly?
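
For concreteness, here is a minimal sketch of the usage the man page
describes, written in present-day POSIX C rather than anything
specific to SunOS 3.2. The helper names write_all and durable_append
are made up for illustration; only open, write, fsync and close are
real calls. It assumes fsync behaves as the man page says, i.e. that
it does not return until the file's dirty buffers are on disk.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes, retrying after short writes.
 * Returns 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: append one record to path and return 0 only
 * after fsync has reported the data written to disk. */
int durable_append(const char *path, const char *rec, size_t len)
{
    int fd, rc;

    assert(path != NULL && rec != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;
    rc = write_all(fd, rec, len);
    if (rc == 0 && fsync(fd) < 0)   /* blocks until the I/O completes */
        rc = -1;
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

A caller would treat a nonzero return as "not known to be on disk"
and act accordingly, e.g. durable_append("txn.log", "commit\n", 7).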

eric@pyrps5 (Eric Bergan) (07/21/88)

In article <179@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>In article <302@infmx.UUCP> aland@infmx.UUCP (Alan S. Denney @ Informix)
>writes:
>>Sun 3.X machines DO NOT SUPPORT synchronous writes anyway (no O_SYNC flag
>>here, folks),

>I'm using SunOS 3.2.  A quick "grep" through /usr/include/*/*.h confirmed
>the absence of an O_SYNC flag.  But "man -k sync" turned up
>	fsync (2) - synchronize a file's in-core state with that on disk
>and "man 2 fsync" says that fsync(fd)

>This appears to claim that the changed information has actually been
>written on the disc, not merely scheduled for writing.  What more do
>you want, exactly?

	The problem is that fsync is very inefficient compared with
O_SYNC, which itself is slower than writing directly to a raw disk.
The reason for the performance difference is that fsync is supposed
to flush all dirty blocks for the file. To do that, it has to find
every dirty block belonging to the file descriptor. O_SYNC guarantees
each individual write, and so avoids the overhead of either scanning
for dirty blocks or maintaining a list of them for each file
descriptor. Raw disk just skips the file system code entirely, which
gives you guaranteed contiguous disk space and no worries about
indirect blocks. But the DBMS code then has to do the disk management
(which blocks are used, which are free, etc.) itself.
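
To make the O_SYNC alternative concrete, here is a rough sketch in
present-day POSIX C. The names open_log and log_write are invented
for illustration. Since some systems have no O_SYNC flag at all, the
sketch tests for it at compile time and falls back to an fsync after
each write where the flag is missing. That fallback is my assumption
about how one might port such code, not a description of what any
particular DBMS does.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open a log file so that, where O_SYNC exists,
 * every write() returns only after the data is on disk. */
int open_log(const char *path)
{
    int flags = O_WRONLY | O_CREAT | O_APPEND;

    assert(path != NULL);
#ifdef O_SYNC
    flags |= O_SYNC;            /* each write is individually forced */
#endif
    return open(path, flags, 0644);
}

/* Write one record; returns 0 once it is believed to be on disk. */
int log_write(int fd, const char *rec, size_t len)
{
    ssize_t n;

    assert(rec != NULL);
    n = write(fd, rec, len);
    if (n < 0 || (size_t)n != len)
        return -1;
#ifndef O_SYNC
    if (fsync(fd) < 0)          /* no O_SYNC: flush the whole file */
        return -1;
#endif
    return 0;
}
```

With O_SYNC the kernel knows exactly which blocks the write touched,
so there is nothing to search for; the fsync branch is the case where
it has to work out which of the file's blocks are dirty.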

	The write that needs this assurance is the write to the log. Since
transactions can't finish their commit and release their locks until
the log entry has been written to disk, this becomes a big
bottleneck. I believe INGRES used to rely on fsync, but switched to
O_SYNC on systems that support it.
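
The ordering constraint can be sketched as follows. This is a toy,
not how INGRES or any other DBMS is actually written: struct txn,
txn_commit and the locks_held field are all invented for the
example, and a single integer stands in for a real lock table. The
only point is that the locks are released strictly after the log
write has been forced, so every commit waits on the disk.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Toy transaction state (hypothetical). */
struct txn {
    int log_fd;      /* open descriptor for the log file */
    int locks_held;  /* stand-in for a real lock table */
};

/* Append the commit record, force it to disk, and only then release
 * the locks.  On any failure the locks stay held and -1 is returned. */
int txn_commit(struct txn *t, const char *rec)
{
    size_t len;
    ssize_t n;

    assert(t != NULL && rec != NULL);
    len = strlen(rec);
    n = write(t->log_fd, rec, len);
    if (n < 0 || (size_t)n != len)
        return -1;              /* not logged: locks stay held */
    if (fsync(t->log_fd) < 0)
        return -1;              /* not known to be on disk */
    t->locks_held = 0;          /* safe to release only now */
    return 0;
}
```

Everything between the write and the lock release is time other
transactions may spend blocked, which is why the cost of the forcing
call matters so much.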

john@anasaz.UUCP (John Moore) (07/23/88)

In article <32133@pyramid.pyramid.com> eric@pyrps5.UUCP (Eric Bergan) writes:
%In article <179@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
%>In article <302@infmx.UUCP> aland@infmx.UUCP (Alan S. Denney @ Informix)
%>writes:
%maintaining a per file descriptor list of dirty blocks. Raw disk just
%skips the file system code entirely, which provides you guaranteed
%contiguous disk space, and also no worries about indirect blocks. But
%the DBMS code has to do the disk management (what blocks are used, which

On a couple of machines, we have observed that raw device I/O does
not seem to have seek optimization and is extremely unfair between
processes (one process gets lots of I/Os, another gets none, when
both have the same priority and are doing the same thing). Apparently
the raw device driver sends only one request at a time to the strategy
routine, and waits for it to finish before sending another, thus
defeating the seek optimization. Does anyone know how common this
is, and which systems do provide optimization on raw devices?
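
To show why one-at-a-time submission hurts, here is a small
self-contained model in C. It is not driver code: the function names
and the cylinder numbers are invented, and a simple elevator sweep
stands in for whatever ordering a real strategy routine applies. It
compares total head movement when requests are serviced strictly in
arrival order with the movement when the whole queue is visible and
can be sorted.

```c
#include <assert.h>
#include <stdlib.h>

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Head movement when requests are serviced in the order given, as
 * happens if the driver only ever sees one request at a time. */
long seek_distance(long start, const long *cyl, size_t n)
{
    long pos = start, total = 0;
    size_t i;

    assert(cyl != NULL || n == 0);
    for (i = 0; i < n; i++) {
        total += labs(cyl[i] - pos);
        pos = cyl[i];
    }
    return total;
}

/* Head movement for a simple elevator sweep over the whole queue:
 * service everything at or above the head going up, then the rest
 * coming back down.  Sorts cyl in place. */
long elevator_distance(long start, long *cyl, size_t n)
{
    long pos = start, total = 0;
    size_t i, first_up = n;

    if (n == 0)
        return 0;
    qsort(cyl, n, sizeof cyl[0], cmp_long);
    for (i = 0; i < n; i++) {
        if (cyl[i] >= start) {
            first_up = i;
            break;
        }
    }
    for (i = first_up; i < n; i++) {      /* upward sweep */
        total += labs(cyl[i] - pos);
        pos = cyl[i];
    }
    for (i = first_up; i > 0; i--) {      /* downward sweep */
        total += labs(cyl[i - 1] - pos);
        pos = cyl[i - 1];
    }
    return total;
}
```

With the head at cylinder 50 and requests arriving for cylinders
10, 90, 20, 80, 30, arrival order costs 300 cylinders of movement and
the sorted sweep costs 120. The sweep is only possible if the queue
holds more than one request at once.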
-- 
John Moore (NJ7E)           {decvax, ncar, ihnp4}!noao!mcdsun!nud!anasaz!john
(602) 861-7607 (day or eve) {gatech, ames, rutgers}!ncar!...
The opinions expressed here are obviously not mine, so they must be
someone else's.