[comp.unix.wizards] Write-Behind

pope@vatican (John Pope) (10/23/88)

In article <931@riddle.UUCP>, domo@riddle (Dominic Dunlop) writes:
>In article <107@minya.UUCP> jc@minya.UUCP (John Chambers) writes:
>>
>>[...] According to several
>>manuals, the main difference between /dev/dsk* and /dev/rdsk* is that
>>there is no buffering for the latter.  Reads always delay for physical
>>I/O, and writes always go immediately to disk (though with DMA, the
>>write may not be complete when write() returns). 
>
> Worse, the disk controller hardware may be ``intelligent'', buffering write
> data in its private memory for an indeterminate time before actually
> writing it onto the physical disk medium.

I agree, this is worse. Since the controller doesn't discriminate
between raw and filesystem I/O, it will do "write-behind" in the
filesystem case as well. This could lead to the controller saying the
I/O is done when it really isn't, which could in turn lead to
filesystem integrity problems if the system should crash.  Although
such integrity problems already exist, doing write-behind on the
controller exacerbates the situation, as you now have a race between
writing updated filesystem control information and actually writing
the block the controller already told you it wrote. 

It's unclear that such a scheme buys you much anyway, since the kernel
already does "write-behind" for filesystem I/O (unless, as you point
out, the O_SYNC flag is set), while if you're working on the raw disk,
your application will probably want to control the data buffering itself.

> UNIX System V, release 3 and later (but not the POSIX standard, IEEE
> 1003.1, or, as an example of a BSD-derived system, SunOS) has a
> synchronous write facility (enabled with the O_SYNC flag...

FYI, the latest version of SunOS (4.0) does have O_SYNC.

> This is where I get to admit my ignorance: maybe five years ago,
> intelligent controllers which defeated the ``write straight to disk''
> characteristic of raw devices were dismayingly commonplace.  

Which ones? The UDA-50 didn't do this, did it #:-o ??

> One thing that makes me think that the problem may still be around is
> this: there has been recent discussion elsewhere about the ability to do
> single byte reads and writes to raw disks on certain computers.  Hmmm.
> Sounds as though the controllers have their own buffers, doesn't it? 

I haven't seen the discussion, but it sounds like they are targeted
towards database or other systems where doing a quick single byte
read-modify-write cycle on the raw device is common enough to make a
special case for it. I don't think this necessarily involves
write-behind, though...

>-- 
> Dominic Dunlop
> domo@sphinx.co.uk  domo@riddle.uucp
John Pope
	Sun Microsystems, Inc. 
		pope@sun.COM

chris@mimsy.UUCP (Chris Torek) (10/23/88)

In article <74161@sun.uucp> pope@vatican [indeed!] (John Pope) writes:
>[such a controller] will do "write-behind" in the filesystem case
>as well. This could lead to ...  filesystem integrity problems if
>the system should crash.

For whatever reasons, some hardware manufacturers either never think
of this, or else do it and keep quiet about it anyway.

>It's unclear that such a scheme buys you much anyway, since the kernel
>already does "write-behind" for filesystem I/O (unless, as you point
>out, the O_SYNC flag is set), while if you're working on the raw disk,
>your application will probably want to control the data buffering itself.

Not all kernels have buffering.  VMS, for instance, does not (at the
ODS-II level: RMS does its own buffering and no doubt uses ASTs and
such).  Why do you think DEC thought the 16 kB in the UDA50 was such a
wonderful buffer?  (Maybe it was 32k, but anyway, small enough not to
matter---it does not even buffer one track of an RA81.) (It is always
amusing to watch the reaction of salespersons when you tell them that
your machine already uses over a megabyte of buffering, and their 64k
is not interesting.)

>>... maybe five years ago, intelligent controllers which defeated the
>>``write straight to disk'' characteristic of raw devices were
>>dismayingly commonplace.  

>Which ones? The UDA-50 didn't do this, did it #:-o ??

No, it does not report the transfer as done until it has in fact been
written.  It *may* reorder writes, however (it is hard to tell for sure).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

pope@vatican (John Pope) (10/25/88)

In article <14122@mimsy.UUCP>, chris@mimsy (Chris Torek) writes:

>Why do you think DEC thought the 16 kB in the UDA50 was such a
>wonderful buffer? 

I guess because it was 16k more than they had before *:o) . My UDA-50
documentation has virtually nothing in it about controller
optimization - maybe it's in one of the parts with "this section
deliberately omitted" on it.

>[the UDA-50] does not report the transfer as done until it has in fact been
>written.  It *may* reorder writes, however (it is hard to tell for sure).

Most of the "smart" SMD and IPI controllers seem to do this now. I
believe most of them treat reads and writes the same within the
elevator (or whatever) algorithm, but I've seen at least one
implementation of disksort() in the operating system that optimized
reads ahead of writes on the theory that processes don't wait for
writes.  Such algorithms may also have made their way on board a disk
controller or two...
John Pope
	Sun Microsystems, Inc. 
		pope@sun.COM

hedrick@geneva.rutgers.edu (Charles Hedrick) (10/25/88)

The only smart controller I know about is the Ciprico 32xx.  (This is
a VME controller with 512K of cache.  We use it on our Suns.)  It
distinguishes 3 kinds of transaction: raw and two kinds of normal I/O
(the two kinds are intended to be used for small and large transfers).
The driver tells the controller board which kind of transfer is being
done.  The controller board has separate parameters for each kind, so
you can decide what sort of caching and read-ahead is to be done for
each.  Here is the help message from our script, which describes the
options.  As far as I know, writes are never delayed, for reasons of
filesystem integrity.

  echo "rfset raw fullblocks shortblocks"
  echo "  where arguments are in hex, with bits"
  echo "  200 - readahead will cross cylinders"
  echo "  100 - readahead will cross tracks"
  echo "   80 - disable zero latency"
  echo "   20 - reorder writes"
  echo "   10 - reorder reads"
  echo "    8 - cache write data"
  echo "    4 - force completion of readaheads before doing other I-O"
  echo "    2 - cache read data"
  echo "    1 - use cache when reading"
  echo "  arguments are:"
  echo "   raw: control transactions involving raw disk or swapping"
  echo "   fullblocks: normal I/O, multiples of 4K"
  echo "   shortblocks: normal I/O, other sizes"

henry@utzoo.uucp (Henry Spencer) (10/28/88)

In article <74266@sun.uucp> pope@vatican (John Pope) writes:
>... I've seen at least one
>implementation of disksort() in the operating system that optimized
>reads ahead of writes on the theory that processes don't wait for
>writes.  Such algorithms may also have made their way on board a disk
>controller or two...

I hope not, since they are known to exhibit pathological behavior under
stress.  If your system is disk-limited, as many are, it can generate
disk i/o requests faster than the controller can service them.  Most of
those requests are reads.  The result is that writes can wait a very
long time.  Worse, they occupy buffers while waiting, meaning that your
buffer pool slowly fills up with pending writes, and the effectiveness
of buffer caching drops dramatically.

And the amusing part is that this all results from a bug!  The read-
before-write algorithm was in V7.  But it wasn't supposed to be:  the
code was supposed to put writes before reads!  (I asked Dennis.)  The
"reads are synchronous, writes are not" explanation appears to have been
invented after the fact by someone trying to figure out the motive for
the (incorrect!) code.
-- 
The dream *IS* alive...         |    Henry Spencer at U of Toronto Zoology
but not at NASA.                |uunet!attcan!utzoo!henry henry@zoo.toronto.edu

mangler@cit-vax.Caltech.Edu (Don Speck) (10/29/88)

In article <14122@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
>	  Why do you think DEC thought the 16 kB in the UDA50 was such a
> wonderful buffer?  (Maybe it was 32k, but anyway, small enough not to
> matter---it does not even buffer one track of an RA81.) (It is always
> amusing to watch the reaction of salespersons when you tell them that
> your machine already uses over a megabyte of buffering, and their 64k
> is not interesting.)

It's been a couple of years since I used a UDA-50, but I recall that
repeated reads of the same sector proceeded no faster than the
rotation rate of the disk.  Thus, the UDA-50 does not use its buffer
for caching.

The buffer is there for speed-matching.  The DMA rate is 800 KB/sec
at best.  An RA80, which transfers 1 MB/sec, is not too badly
mismatched, so the early UDA-50's with 4K of buffering could
transfer a whole track of an RA80 without falling too far behind.
But the RA81 transfers twice as fast as the DMA rate.  The small
4K buffer would be full after reading only 1/3 of a track; reading
would have to stop while it spent 1/3 of a revolution draining the
buffer, and then 2/3 of a revolution would be wasted waiting for the
next sector to come back around to continue the read.  The later
models of the UDA-50 have 16K bytes of buffering, which is more
than the DMA can move in one revolution, so DMA can run continuously
once started, and additional buffering would gain nothing.

The buffer in the Xylogics 451 is similarly used for speed-matching,
but 8K is too small.  Word transfers on a Sun VMEbus or Multibus can
move 23K in 1/60 of a second.

The DEC Massbus had only 512 bytes of buffering, but it didn't need
any more than that, because the DMA rate was fast enough.

Don Speck   mangler@csvax.caltech.edu	{amdahl,ames!elroy}!cit-vax!mangler