chris@mimsy.UUCP (Chris Torek) (06/01/87)
In article <2716@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (System Mangler) writes:
>... contention from 4.3 BSD dump's multiple processes.  In a
>disk-limited situation, the context switch rate will soar to 300
>per second, as processes fight over one raw I/O buffer header.
>An easy change to hpread(), allowing any of its raw I/O buffers to
>be used with any of its disks, improves dump throughput by 25%.

According to Don, only about 10% is due to reduced context switching, the rest coming from better I/O overlap.

With a lot of help from Don (why did not *I* think of borrowing swap I/O headers?) we have come up with a scheme for eliminating the contention, and at the same time simplifying all the disk drivers, chipping a few bytes off that 4.3BSD-sized kernel. Being in a loquacious mood (and finding myself wide awake when I should sleep), and having just written the code but being unable to test it (I am at home), I thought to take some time writing first about the existing physio, and then about our scheme. So:

A number of programs use what is called `raw I/O', including fsck, ncheck, icheck, dcheck, and (drum roll) dump. Most of these work in a very sequential manner, reading or writing one block at a time, then examining the result. This, too, was true of dump in all BSD releases up through 4.2BSD. The revised dump on the 4.3BSD release, written by Don Speck, has multiple processes passing tokens through a ring of pipes so as to allow parallel access to disk blocks while maintaining serial tape writes. This is probably the first major use of the raw I/O system by more than one process per device, and it points out some problems in the existing implementation.

The `block I/O' system, which arranges for read-ahead, write-behind, and caching, mediates all `normal' Unix file system access. It moves data between user space and devices by copying it into dedicated kernel memory (`buffers').
This makes it extremely simple to use; one need not worry about sectors or blocking factors. It also adds a fair bit of overhead.

Raw I/O bypasses all of this, transferring data directly to or from user space. Almost all of the device driver code is identical; indeed, on Vaxen, *all* of the code is identical, and only the Unibus and Massbus meta-driver code distinguishes between raw and block transfers.

The kernel summarises each transfer in a `buffer header'. The information in this header includes the internal device number, the disk or tape block number, flags such as `read' (B_READ) or `write', and a pointer to the data buffer to be read or written. The flags also tell whether the transfer is from user space (B_PHYS) or kernel space. (On the Vax, there are three virtual `user' spaces, B_PAGET, B_UAREA, and `regular raw', but this is largely an irrelevant complication.) The kernel hands the buffer header to the device strategy routine, which queues the transfer and touches the appropriate device registers to get things moving.

The block I/O system finds the strategy routine by looking up the `major part' of the device number in the `bdevsw' (Block DEVice SWitch) table, bdevsw[major(dev)].d_strategy. The raw I/O system finds it in a very different way.

The heart of the raw I/O system is the routine `physio()', found near the end of /sys/sys/vm_swp.c. The kernel provides only indirect access to physio(). Each disk and tape driver has, in addition to its row in the block device table, a row in the character device switch table `cdevsw'. This table does not include a strategy pointer, but does provide both `read' and `write' routines.
These routines are virtually identical through all the disk and tape drivers:

	xxread(dev, uio)
		dev_t dev;
		struct uio *uio;
	{

		return (physio(xxstrategy, &rxxbuf[xxunit(dev)],
		    dev, B_READ, minphys, uio));
	}

	xxwrite(dev, uio)
		dev_t dev;
		struct uio *uio;
	{

		return (physio(xxstrategy, &rxxbuf[xxunit(dev)],
		    dev, B_WRITE, minphys, uio));
	}

(Many, if not all, of the drivers also test the disk or tape unit number `xxunit(dev)' to be sure it represents a proper device. Since there is no way for someone to open an invalid device---the driver checks the number before allowing the open---this test is pointless paranoia.)

Here the `xx' driver implements both reading and writing by calling physio, telling it to use xxstrategy, and giving it a pointer to the buffer header `&rxxbuf[xxunit(dev)]' that the xx driver reserves for raw I/O on this device. It also gives physio the transfer direction (B_READ or B_WRITE) and the information about the transfer (dev and uio), and passes `minphys' as the routine to restrict transfer sizes. `minphys' breaks any transfer of more than 63*1024 bytes into chunks of 63K, plus any remainder.

At least one driver (ht.c) checks later to see if a particular transfer buffer header is the raw buffer for that tape drive, &rhtbuf[htunit(dev)], but no disk driver makes such checks. (Indeed, for that matter, the ht driver has a bug in this very section of code, for it will not retry some correctable errors when they occur on the block device.)

It seems rather odd that drivers should reserve one special buffer header per disk or tape drive, when most of the time these headers go unused, and, since each can describe only one transfer, when they *are* needed, they force transfers to proceed serially! But until recently few people have tried to do anything but purely serial transfers.

This, then, is the `easy change to hpread': it can look through all its reserved raw transfer headers, find one that is not busy, and give that to physio.
If all are busy, any one will do.

	hpread(dev, uio)
		dev_t dev;
		struct uio *uio;
	{
		register int i;

		/* stop at the first idle header; if none is idle, i ends at the last one */
		for (i = 0; i < NHP-1; i++)
			if ((rhpbuf[i].b_flags & B_BUSY) == 0)
				break;
		return (physio(hpstrategy, &rhpbuf[i],
		    dev, B_READ, minphys, uio));
	}

But this must be done to every driver, and it makes each one bigger. Surely there is a better solution.

And there is.

	hpread(dev, uio)
		dev_t dev;
		struct uio *uio;
	{

		return (physio(hpstrategy, (struct buf *)NULL,
		    dev, B_READ, minphys, uio));
	}

Give physio a NULL pointer (of the proper type!) and have it choose a buffer header from a pool it will maintain. All that is left is finding a good pool of headers, and indeed, there is one ready and waiting in the very same source file that holds the physio routine: the swap I/O headers.

But lo! Now we have many read and write routines that are nearly identical. `rkread' looks just like `hpread', except that it passes `rkstrategy' rather than `hpstrategy'. Why not find the strategy routine in the block device switch table? Well, because we cannot: the major device number in `dev' is for the character switch table, and there is no necessary correlation between character and block switch entries.

Well, no problem: just add the strategy pointer to the character device table as well. And now we have just one read routine, and one write routine:

	rawread(dev, uio)
		dev_t dev;
		struct uio *uio;
	{

		return (physio(cdevsw[major(dev)].d_strategy,
		    (struct buf *)NULL, dev, B_READ, minphys, uio));
	}

	rawwrite(dev, uio)
		dev_t dev;
		struct uio *uio;
	{

		return (physio(cdevsw[major(dev)].d_strategy,
		    (struct buf *)NULL, dev, B_WRITE, minphys, uio));
	}

At a cost of four bytes per `cdevsw' entry, we can get rid of many more bytes of code and data space per disk driver, and perhaps per tape driver as well, by eliminating the raw I/O buffer headers and the read and write routines.

Once everything is written and tested, someone will post diffs.
(Perhaps I can convince Keith to send them to 4bsd-fixes, in his copious spare time :-) no doubt).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
Domain:	chris@mimsy.umd.edu	Path:	seismo!mimsy!chris
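
For readers who want to poke at the idea outside a kernel, here is a rough user-space sketch of the scheme described in the post above: one read routine that finds the strategy routine through the character switch and lets physio borrow a header from a shared pool. It is not the 4.3BSD code. The names `getphysbuf', `NPOOL', `MAXCHUNK' and `cdevsw_entry', the flag values, the major() encoding, the call counters, and the byte count standing in for `uio' are all assumptions made for illustration, and a real physio would sleep until a header came free rather than return an error.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel definitions; values are arbitrary. */
#define B_READ   0x1
#define B_BUSY   0x8
#define NPOOL    4               /* assumed size of the shared header pool */
#define MAXCHUNK (63 * 1024)     /* the minphys limit quoted in the post */
#define major(dev) ((dev) >> 8)  /* assumed split: high byte is the major number */

struct buf {                     /* cut-down buffer header */
    int    b_flags;
    int    b_dev;
    size_t b_bcount;
};

typedef void (*strategy_fn)(struct buf *);

struct cdevsw_entry {
    strategy_fn d_strategy;      /* the pointer the post proposes adding */
};

static struct buf pool[NPOOL];   /* plays the role of the swap I/O headers */
static int hp_calls, rk_calls;   /* counters so the sketch can be checked */
static size_t hp_bytes;

static void hpstrategy(struct buf *bp) { hp_calls++; hp_bytes += bp->b_bcount; }
static void rkstrategy(struct buf *bp) { (void)bp; rk_calls++; }

static struct cdevsw_entry cdevsw[] = { { hpstrategy }, { rkstrategy } };

/* Claim the first idle header; a real kernel would sleep instead of failing. */
static struct buf *getphysbuf(void)
{
    for (int i = 0; i < NPOOL; i++) {
        if ((pool[i].b_flags & B_BUSY) == 0) {
            pool[i].b_flags = B_BUSY;
            return &pool[i];
        }
    }
    return NULL;
}

/* Simplified physio: a byte count stands in for the uio structure. */
int physio(strategy_fn strat, struct buf *bp, int dev, int rw, size_t len)
{
    assert(strat != NULL);
    if (bp == NULL && (bp = getphysbuf()) == NULL)
        return -1;
    while (len > 0) {
        size_t n = len > MAXCHUNK ? MAXCHUNK : len;  /* what minphys enforces */
        bp->b_flags = B_BUSY | rw;
        bp->b_dev = dev;
        bp->b_bcount = n;
        strat(bp);               /* hand one chunk to the driver */
        len -= n;
    }
    bp->b_flags = 0;             /* release the header */
    return 0;
}

/* One read routine for every raw device, dispatching through cdevsw. */
int rawread(int dev, size_t len)
{
    return physio(cdevsw[major(dev)].d_strategy, NULL, dev, B_READ, len);
}
```

In this sketch, a 130K read on major 0 reaches hpstrategy three times (63K, 63K, and a 4K remainder), and neither driver owns a header array of its own; the only per-driver piece left is its row in the switch table.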
gnu@hoptoad.UUCP (06/04/87)
Bravo, Chris! There has always been way too much mumbo jumbo in the disk and tape drivers. Seventeen routines that do the same thing with the "ab" changed to "cd" indeed! That's almost as bad as the six copies of "tgetent" used for gettytab, remote, printcap, ... :-)

Not only is it cleaner and simpler, it runs faster, and in less space!
-- 
Copyright 1987 John Gilmore; you may redistribute only if your recipients may.
(This is an effort to bend Stargate to work with Usenet, not against it.)
{sun,ptsfa,lll-crg,ihnp4,ucbvax}!hoptoad!gnu	gnu@ingres.berkeley.edu