pope@vatican (John Pope) (10/23/88)
In article <931@riddle.UUCP>, domo@riddle (Dominic Dunlop) writes:
>In article <107@minya.UUCP> jc@minya.UUCP (John Chambers) writes:
>>
>>[...] According to several
>>manuals, the main difference between /dev/dsk* and /dev/rdsk* is that
>>there is no buffering for the latter.  Reads always delay for physical
>>I/O, and writes always go immediately to disk (though with DMA, the
>>write may not be complete when write() returns).
>
> Worse, the disk controller hardware may be ``intelligent'', buffering write
> data in its private memory for an indeterminate time before actually
> writing it onto the physical disk medium.

I agree, this is worse.  Since the controller doesn't discriminate
between raw and filesystem I/O, it will do "write-behind" in the
filesystem case as well.  This could lead to the controller saying the
I/O is done when it really isn't, which could in turn lead to
filesystem integrity problems if the system should crash.  Although
such integrity problems already exist, doing write-behind on the
controller exacerbates the situation, as you now have a race between
writing updated filesystem control information and actually writing
the block the controller already told you it wrote.

It's unclear that such a scheme buys you much anyway, since the kernel
already does "write-behind" for filesystem I/O (unless, as you point
out, the O_SYNC flag is set), while if you're working on the raw disk,
your application will probably want to control the data buffering
itself.

> UNIX System V, release 3 and later (but not the POSIX standard, IEEE
> 1003.1, or, as an example of a BSD-derived system, SunOS) has a
> synchronous write facility (enabled with the O_SYNC flag...

FYI, the latest version of SunOS (4.0) does have O_SYNC.

> This is where I get to admit my ignorance: maybe five years ago,
> intelligent controllers which defeated the ``write straight to disk''
> characteristic of raw devices were dismayingly commonplace.

Which ones?  The UDA-50 didn't do this, did it #:-o ??

> One thing that makes me think that the problem may still be around is
> this: there has been recent discussion elsewhere about the ability to do
> single byte reads and writes to raw disks on certain computers.  Hmmm.
> Sounds as though the controllers have their own buffers, doesn't it?

I haven't seen the discussion, but it sounds like they are targeted
towards database or other systems where doing a quick single-byte
read-modify-write cycle on the raw device is common enough to make a
special case for it.  I don't think this necessarily involves
write-behind, though...

>--
> Dominic Dunlop
> domo@sphinx.co.uk  domo@riddle.uucp

	John Pope
	Sun Microsystems, Inc.
	pope@sun.COM
chris@mimsy.UUCP (Chris Torek) (10/23/88)
In article <74161@sun.uucp> pope@vatican [indeed!] (John Pope) writes:
>[such a controller] will do "write-behind" in the filesystem case
>as well.  This could lead to ... filesystem integrity problems if
>the system should crash.

For whatever reasons, some hardware manufacturers either never think
of this, or else do it and keep quiet about it anyway.

>It's unclear that such a scheme buys you much anyway, since the kernel
>already does "write-behind" for filesystem I/O (unless, as you point
>out, the O_SYNC flag is set), while if you're working on the raw disk,
>your application will probably want to control the data buffering itself.

Not all kernels have buffering.  VMS, for instance, does not (at the
ODS-II level: RMS does its own buffering and no doubt uses ASTs and
such).  Why do you think DEC thought the 16 kB in the UDA50 was such a
wonderful buffer?  (Maybe it was 32k, but anyway, small enough not to
matter---it does not even buffer one track of an RA81.)  (It is always
amusing to watch the reaction of salespersons when you tell them that
your machine already uses over a megabyte of buffering, and their 64k
is not interesting.)

>>... maybe five years ago, intelligent controllers which defeated the
>>``write straight to disk'' characteristic of raw devices were
>>dismayingly commonplace.
>Which ones?  The UDA-50 didn't do this, did it #:-o ??

No, it does not report the transfer as done until it has in fact been
written.  It *may* reorder writes, however (it is hard to tell for
sure).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris
pope@vatican (John Pope) (10/25/88)
In article <14122@mimsy.UUCP>, chris@mimsy (Chris Torek) writes:
>Why do you think DEC thought the 16 kB in the UDA50 was such a
>wonderful buffer?

I guess because it was 16k more than they had before *:o) .  My UDA-50
documentation has virtually nothing in it about controller
optimization - maybe it's in one of the parts with "this section
deliberately omitted" on it.

>[the UDA-50] does not report the transfer as done until it has in fact been
>written.  It *may* reorder writes, however (it is hard to tell for sure).

Most of the "smart" SMD and IPI controllers seem to do this now.  I
believe most of them treat reads and writes the same within the
elevator (or whatever) algorithm, but I've seen at least one
implementation of disksort() in the operating system that optimized
reads ahead of writes on the theory that processes don't wait for
writes.  Such algorithms may also have made their way on board a disk
controller or two...

	John Pope
	Sun Microsystems, Inc.
	pope@sun.COM
hedrick@geneva.rutgers.edu (Charles Hedrick) (10/25/88)
The only smart controller I know about is the Ciprico 32xx.  (This is
a VME controller with 512K of cache.  We use it on our Suns.)  It
distinguishes 3 kinds of transaction: raw and two kinds of normal I/O
(the two kinds are intended to be used for small and large transfers).
The driver tells the controller board which kind of transfer is being
done.  The controller board has separate parameters for each kind, so
you can decide what sort of caching and read-ahead is to be done for
each.  Here is the help message from our script, which describes the
options.  As far as I know, write is never delayed, for reasons of
filesystem integrity.

echo "rfset raw fullblocks shortblocks"
echo "  where arguments are in hex, with bits"
echo "   200 - readahead will cross cylinders"
echo "   100 - readahead will cross tracks"
echo "    80 - disable zero latency"
echo "    20 - reorder writes"
echo "    10 - reorder reads"
echo "     8 - cache write data"
echo "     4 - force completion of readaheads before doing other I-O"
echo "     2 - cache read data"
echo "     1 - use cache when reading"
echo "  arguments are:"
echo "   raw: control transactions involving raw disk or swapping"
echo "   fullblocks: normal I/O, multiples of 4K"
echo "   shortblocks: normal I/O, other sizes"
henry@utzoo.uucp (Henry Spencer) (10/28/88)
In article <74266@sun.uucp> pope@vatican (John Pope) writes:
>... I've seen at least one
>implementation of disksort() in the operating system that optimized
>reads ahead of writes on the theory that processes don't wait for
>writes.  Such algorithms may also have made their way on board a disk
>controller or two...

I hope not, since they are known to exhibit pathological behavior
under stress.  If your system is disk-limited, as many are, it can
generate disk i/o requests faster than the controller can service
them.  Most of those requests are reads.  The result is that writes
can wait a very long time.  Worse, they occupy buffers while waiting,
meaning that your buffer pool slowly fills up with pending writes, and
the effectiveness of buffer caching drops dramatically.

And the amusing part is that this all results from a bug!  The read-
before-write algorithm was in V7.  But it wasn't supposed to be: the
code was supposed to put writes before reads!  (I asked Dennis.)  The
"reads are synchronous, writes are not" explanation appears to have
been invented after the fact by someone trying to figure out the
motive for the (incorrect!) code.
-- 
The dream *IS* alive...         |    Henry Spencer at U of Toronto Zoology
but not at NASA.                |  uunet!attcan!utzoo!henry henry@zoo.toronto.edu
mangler@cit-vax.Caltech.Edu (Don Speck) (10/29/88)
In article <14122@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes:
> Why do you think DEC thought the 16 kB in the UDA50 was such a
> wonderful buffer?  (Maybe it was 32k, but anyway, small enough not to
> matter---it does not even buffer one track of an RA81.)  (It is always
> amusing to watch the reaction of salespersons when you tell them that
> your machine already uses over a megabyte of buffering, and their 64k
> is not interesting.)

It's been a couple of years since I used a UDA-50, but I recall that
repeated reads of the same sector proceeded no faster than the
rotation rate of the disk.  Thus, the UDA-50 does not use its buffer
for caching.

The buffer is there for speed-matching.  The DMA rate is 800 KB/sec at
best.  An RA80, which transfers 1 MB/sec, is not too badly mismatched,
so the early UDA-50's with 4K of buffering could transfer a whole
track of an RA80 without falling too far behind.  But the RA81
transfers twice as fast as the DMA rate.  The small 4K buffer would be
full after reading only 1/3 of a track; reading would have to stop
while it spent 1/3 of a revolution draining the buffer, and then 2/3
of a revolution would be wasted waiting for the next sector to come
back around to continue the read.  The later models of the UDA-50 have
16K bytes of buffering, which is more than the DMA can move in one
revolution, so DMA can run continuously once started, and additional
buffering would gain nothing.

The buffer in the Xylogics 451 is similarly used for speed-matching,
but 8K is too small.  Word transfers on a Sun VMEbus or Multibus can
move 23K in 1/60 of a second.  The DEC Massbus had only 512 bytes of
buffering, but it didn't need any more than that, because the DMA rate
was fast enough.

Don Speck   mangler@csvax.caltech.edu   {amdahl,ames!elroy}!cit-vax!mangler