[comp.unix.wizards] getting more

chris@mimsy.UUCP (Chris Torek) (06/01/87)

In article <2716@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu
(System Mangler) writes:
>... contention from 4.3 BSD dump's multiple processes.  In a
>disk-limited situation, the context switch rate will soar to 300
>per second, as processes fight over one raw I/O buffer header.
>An easy change to hpread(), allowing any of its raw I/O buffers to
>be used with any of its disks, improves dump throughput by 25%.

According to Don, only about 10% is due to reduced context switching,
the rest coming from better I/O overlap.  With a lot of help from
Don (why did not *I* think of borrowing swap I/O headers?) we have
come up with a scheme for eliminating the contention, and at the
same time simplifying all the disk drivers, chipping a few bytes
off that 4.3BSD-sized kernel.

Being in a loquacious mood (and finding myself wide awake when I
should sleep), and having just written the code but being unable
to test it (I am at home), I thought to take some time writing
first about the existing physio, and then about our scheme.  So:

A number of programs use what is called `raw I/O', including fsck,
ncheck, icheck, dcheck, and (drum roll) dump.  Most of these work
in a very sequential manner, reading or writing one block at a
time, then examining the result.  This, too, was true of dump in
all BSD releases up through 4.2BSD.  The revised dump on the 4.3BSD
release, written by Don Speck, has multiple processes passing
tokens through a ring of pipes so as to allow parallel access
to disk blocks while maintaining serial tape writes.  This is
probably the first major use of the raw I/O system by more than
one process per device, and it points out some problems in the
existing implementation.
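
The token-passing idea can be tried out in user space.  What follows
is only a sketch of the general technique, not dump's actual code:
the `ring_' names, the three slaves, and the integer `records' are
all made up for illustration.  Each slave may prepare its record
whenever it likes, but writes it to the shared output pipe (standing
in for the tape) only while it holds the token.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define RING_NSLAVES 3		/* arbitrary; any small number will do */

/*
 * Fork RING_NSLAVES children.  Child i produces records i, i+N, i+2N, ...
 * and writes each one to the shared pipe only while holding the token.
 * The parent stores what arrives in out[] (at most max entries) and
 * returns the number stored, or -1 if setup fails.
 */
int ring_run(int rounds, int *out, int max)
{
	int tok[RING_NSLAVES][2];	/* tok[i]: token pipe read by slave i */
	int tape[2];			/* shared, strictly ordered output */
	pid_t pid[RING_NSLAVES];
	char t = 'T';
	int i, r, rec, n = 0;

	assert(rounds >= 0 && out != NULL);
	if (pipe(tape) < 0)
		return (-1);
	for (i = 0; i < RING_NSLAVES; i++)
		if (pipe(tok[i]) < 0)
			return (-1);

	for (i = 0; i < RING_NSLAVES; i++) {
		if ((pid[i] = fork()) < 0)
			return (-1);
		if (pid[i] != 0)
			continue;		/* parent: fork the next one */
		for (r = 0; r < rounds; r++) {
			/* the "disk read"; this part can overlap freely */
			rec = r * RING_NSLAVES + i;
			/* wait for the token, write, pass the token on */
			if (read(tok[i][0], &t, 1) != 1 ||
			    write(tape[1], &rec, sizeof rec) !=
				(ssize_t)sizeof rec ||
			    write(tok[(i + 1) % RING_NSLAVES][1], &t, 1) != 1)
				_exit(1);
		}
		_exit(0);
	}

	close(tape[1]);			/* so read() sees EOF at the end */
	if (write(tok[0][1], &t, 1) != 1)	/* start the ring */
		return (-1);
	while (n < max &&
	    read(tape[0], &rec, sizeof rec) == (ssize_t)sizeof rec)
		out[n++] = rec;
	for (i = 0; i < RING_NSLAVES; i++)
		(void) waitpid(pid[i], (int *)0, 0);
	close(tape[0]);
	for (i = 0; i < RING_NSLAVES; i++) {
		close(tok[i][0]);
		close(tok[i][1]);
	}
	return (n);
}
```

Calling ring_run(2, out, 6) should leave the six records in out[] in
ascending order, however the scheduler interleaves the children,
because a record is written only by whichever child holds the token.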

The `block I/O' system, which arranges for read-ahead, write-behind,
and caching, mediates all `normal' Unix file system access.  It
moves data between user space and devices by copying it into
dedicated kernel memory (`buffers').  This makes it extremely simple
to use; one need not worry about sectors or blocking factors.  It
also adds a fair bit of overhead.  Raw I/O bypasses all of this,
transferring data directly to or from user space.  Almost all of
the device driver code is identical; indeed, on Vaxen, *all* of
the code is identical, and only the Unibus and Massbus meta-driver
code distinguishes between raw and block transfers.

The kernel summarises each transfer in a `buffer header'.  The
information in this header includes the internal device number,
the disk or tape block number, flags such as `read' (B_READ) or
`write', and a pointer to the data buffer to be read or written.
The flags also tell whether the transfer is from user space (B_PHYS)
or kernel space.  (On the Vax, there are three virtual `user'
spaces, B_PAGET, B_UAREA, and `regular raw', but this is largely
an irrelevant complication.)  The kernel hands the buffer header
to the device strategy routine, which queues the transfer and
touches the appropriate device registers to get things moving.  The
block I/O system finds the strategy routine by looking up the `major
part' of the device number in the `bdevsw' (Block DEVice SWitch)
table, bdevsw[major(dev)].d_strategy.  The raw I/O system finds it
in a very different way.
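
As a rough user-space model of the block-side lookup just described:
the structure below is a heavily cut-down stand-in for the kernel's
buffer header, and every `bio_' name, the flag values, and the 16-bit
device number layout are assumptions made for the sketch rather than
the kernel's own definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag values; the kernel's own are in <sys/buf.h>. */
#define BIO_B_READ	0x1	/* transfer is a read; clear means write */
#define BIO_B_PHYS	0x2	/* data is in user space (raw I/O) */

/* Hypothetical device number: major part in the high byte. */
typedef unsigned short bio_dev_t;
#define bio_major(d)		(((d) >> 8) & 0xff)
#define bio_makedev(ma, mi)	((bio_dev_t)(((ma) << 8) | (mi)))

/* A cut-down buffer header: one of these summarises one transfer. */
struct bio_buf {
	int	b_flags;	/* BIO_B_READ, BIO_B_PHYS, ... */
	bio_dev_t b_dev;	/* internal device number */
	long	b_blkno;	/* disk or tape block number */
	long	b_bcount;	/* bytes to move */
	char	*b_addr;	/* the data buffer */
};

struct bio_bdevsw {
	void	(*d_strategy)(struct bio_buf *);
};

int bio_last_major = -1;	/* records which strategy routine ran */

void bio_hpstrategy(struct bio_buf *bp) { (void)bp; bio_last_major = 0; }
void bio_rkstrategy(struct bio_buf *bp) { (void)bp; bio_last_major = 1; }

struct bio_bdevsw bio_bdevsw[] = {
	{ bio_hpstrategy },	/* major 0 */
	{ bio_rkstrategy },	/* major 1 */
};
#define BIO_NBLKDEV	(sizeof bio_bdevsw / sizeof bio_bdevsw[0])

/* Hand a filled-in header to the driver named by its major number. */
void bio_start(struct bio_buf *bp)
{
	assert(bp != NULL);
	assert((size_t)bio_major(bp->b_dev) < BIO_NBLKDEV);
	(*bio_bdevsw[bio_major(bp->b_dev)].d_strategy)(bp);
}
```

A real strategy routine would queue the header and poke the device
registers; the two here only record that they were reached.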

The heart of the raw I/O system is the routine `physio()', found
near the end of /sys/sys/vm_swp.c.  The kernel provides only indirect
access to physio().  Each disk and tape driver has, in addition to
its row in the block device table, a row in the character device
switch table `cdevsw'.  This table does not include a strategy
pointer, but does provide both `read' and `write' routines.  These
routines are virtually identical through all the disk and tape
drivers:

	xxread(dev, uio)
		dev_t dev;
		struct uio *uio;
	{

		return (physio(xxstrategy, &rxxbuf[xxunit(dev)],
			dev, B_READ, minphys, uio));
	}

	xxwrite(dev, uio)
		dev_t dev;
		struct uio *uio;
	{

		return (physio(xxstrategy, &rxxbuf[xxunit(dev)],
			dev, B_WRITE, minphys, uio));
	}

(Many, if not all, of the drivers also test the disk or tape unit
number `xxunit(dev)' to be sure it represents a proper device.
Since there is no way for someone to open an invalid device---the
driver checks the number before allowing the open---this test is
pointless paranoia.)

Here the `xx' driver implements both reading and writing by calling
physio, telling it to use xxstrategy, and giving it a pointer to
the buffer header `&rxxbuf[xxunit(dev)]' that the xx driver reserves
for raw I/O on this device.  It also gives physio the transfer
direction (B_READ or B_WRITE) and the information about the transfer
(dev and uio), and passes `minphys' as the routine to restrict
transfer sizes.  `minphys' breaks any transfer of more than 63*1024
bytes into chunks of 63K, plus any remainder.
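
The arithmetic of that chunking can be sketched as below.  This is a
stand-alone illustration, not the kernel's code: `mp_clamp' and
`mp_split' are made-up names, and only the 63*1024 limit comes from
the description above.

```c
#include <assert.h>

#define MP_MAXPHYS	(63L * 1024)	/* largest single transfer */

/* Restrict one transfer to at most MP_MAXPHYS bytes. */
long mp_clamp(long bcount)
{
	return (bcount > MP_MAXPHYS ? MP_MAXPHYS : bcount);
}

/*
 * Split a request of `total' bytes the way a caller that clamps each
 * transfer would: full 63K pieces, then whatever is left over.
 * Stores the piece sizes in sizes[] and returns how many were stored.
 */
int mp_split(long total, long *sizes, int max)
{
	int n = 0;

	assert(total >= 0 && sizes != 0);
	while (total > 0 && n < max) {
		sizes[n] = mp_clamp(total);
		total -= sizes[n];
		n++;
	}
	return (n);
}
```

A 150K request, for instance, comes out as two pieces of 64512 bytes
and a remainder of 24576.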

At least one driver (ht.c) checks later to see if a particular
transfer buffer header is the raw buffer for that tape drive,
&rhtbuf[htunit(dev)], but no disk driver makes such checks.  (Indeed,
for that matter, the ht driver has a bug in this very section of
code, for it will not retry some correctable errors when they occur
on the block device.)  It seems rather odd that drivers should
reserve one special buffer header per disk or tape drive.  Most of
the time these headers go unused; and when they *are* needed, each
can describe only one transfer, so they force transfers on that
drive to proceed serially!  But until recently few people have
tried to do anything but purely serial transfers.

This, then, is the `easy change to hpread': it can look through all
its reserved raw transfer headers, find one that is not busy, and
give that to physio.  If all are busy, any one will do.

	hpread(dev, uio)
		dev_t dev;
		struct uio *uio;
	{
		register int i;

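		/*
		 * Stop at NHP-1: if every earlier header is busy,
		 * i is left indexing the last one, which will do.
		 */
		for (i = 0; i < NHP-1; i++)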
			if ((rhpbuf[i].b_flags & B_BUSY) == 0)
				break;
		return (physio(hpstrategy, &rhpbuf[i], dev, B_READ,
			minphys, uio));
	}

But this must be done to every driver, and it makes each one bigger.
Surely there is a better solution.

And there is.

	hpread(dev, uio)
		dev_t dev;
		struct uio *uio;
	{

		return (physio(hpstrategy, (struct buf *)NULL, dev, B_READ,
			minphys, uio));
	}

Give physio a NULL pointer (of the proper type!) and have it choose
a buffer header from a pool it will maintain.  All that is left is
finding a good pool of headers, and indeed, there is one ready and
waiting in the very same source file that holds the physio routine:
the swap I/O headers.
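
The header selection might look something like the following
user-space model.  It is a guess at the shape of the thing, not the
code we wrote: the `pool_' names, the flag values, and the pool size
are invented, and where this sketch returns NULL on an empty pool a
kernel would sleep until a header came free.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_B_BUSY	0x1	/* header is describing a transfer */
#define POOL_B_POOL	0x2	/* header belongs to the shared pool */
#define POOL_NHDR	4	/* pool size, chosen arbitrarily here */

struct pool_buf {
	int	b_flags;
	struct pool_buf *b_next;	/* free-list link */
};

struct pool_buf pool_hdrs[POOL_NHDR];
struct pool_buf *pool_free;

/* Put every pool header on the free list. */
void pool_init(void)
{
	int i;

	pool_free = NULL;
	for (i = 0; i < POOL_NHDR; i++) {
		pool_hdrs[i].b_flags = POOL_B_POOL;
		pool_hdrs[i].b_next = pool_free;
		pool_free = &pool_hdrs[i];
	}
}

/*
 * Choose the header for one transfer.  A non-NULL bp is the caller's
 * own header and is used as given; NULL asks for one from the pool.
 * Returns NULL if the pool is empty (a kernel would sleep instead).
 */
struct pool_buf *pool_get(struct pool_buf *bp)
{
	if (bp == NULL) {
		if ((bp = pool_free) == NULL)
			return (NULL);
		pool_free = bp->b_next;
	}
	/* this sketch never waits, so a busy header is a caller error */
	assert((bp->b_flags & POOL_B_BUSY) == 0);
	bp->b_flags |= POOL_B_BUSY;
	return (bp);
}

/* Transfer done: clear busy, and give pool headers back. */
void pool_rel(struct pool_buf *bp)
{
	bp->b_flags &= ~POOL_B_BUSY;
	if (bp->b_flags & POOL_B_POOL) {
		bp->b_next = pool_free;
		pool_free = bp;
	}
}
```

Because headers are handed out per transfer rather than per drive,
several processes reading one disk each get their own header, up to
the size of the pool.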

But lo!  Now we have many read and write routines that are nearly
identical.  `rkread' looks just like `hpread', except that it passes
`rkstrategy' rather than `hpstrategy'.  Why not find the strategy
routine in the block device switch table?  Well, because we cannot:
the major device number in `dev' is for the character switch table,
and there is no necessary correlation between character and block
switch entries.  Well, no problem: just add the strategy pointer
to the character device table as well.  And now we have just one
read routine, and one write routine:

	rawread(dev, uio)
		dev_t dev;
		struct uio *uio;
	{

		return (physio(cdevsw[major(dev)].d_strategy,
			(struct buf *)NULL, dev, B_READ, minphys, uio));
	}

	rawwrite(dev, uio)
		dev_t dev;
		struct uio *uio;
	{

		return (physio(cdevsw[major(dev)].d_strategy,
			(struct buf *)NULL, dev, B_WRITE, minphys, uio));
	}

At a cost of four bytes per `cdevsw' entry, we can get rid of many
more bytes of code and data space per disk driver, and perhaps per
tape driver as well, by eliminating the raw I/O buffer headers and
the read and write routines.
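
To see the dispatch in isolation, here is a small user-space model of
a character switch carrying a strategy pointer.  Everything in it is
a stand-in: the `raw_' names, the two-entry table, the device number
layout, and especially `raw_physio', which only fills in a header and
calls the strategy routine.

```c
#include <assert.h>
#include <stddef.h>

#define RAW_B_WRITE	0x0
#define RAW_B_READ	0x1

/* Hypothetical device number: major part in the high byte. */
typedef unsigned short raw_dev_t;
#define raw_major(d)		(((d) >> 8) & 0xff)
#define raw_makedev(ma, mi)	((raw_dev_t)(((ma) << 8) | (mi)))

struct raw_buf {
	int	b_flags;
	raw_dev_t b_dev;
};

/* Character switch entry, reduced to the one added member. */
struct raw_cdevsw {
	void	(*d_strategy)(struct raw_buf *);
};

int raw_last_major = -1;	/* which driver's strategy ran */
int raw_last_flags = -1;	/* and with what direction */

void raw_hpstrategy(struct raw_buf *bp)
{
	raw_last_major = 0;
	raw_last_flags = bp->b_flags;
}

void raw_rkstrategy(struct raw_buf *bp)
{
	raw_last_major = 1;
	raw_last_flags = bp->b_flags;
}

struct raw_cdevsw raw_cdevsw[] = {
	{ raw_hpstrategy },	/* character major 0 */
	{ raw_rkstrategy },	/* character major 1 */
};

/* Stand-in for physio: fill in a header and hand it to the driver. */
int raw_physio(void (*strat)(struct raw_buf *), struct raw_buf *bp,
    raw_dev_t dev, int rw)
{
	struct raw_buf local;

	assert(strat != NULL);
	if (bp == NULL)
		bp = &local;	/* the real thing would use a pooled header */
	bp->b_flags = rw;
	bp->b_dev = dev;
	(*strat)(bp);
	return (0);
}

/* One read routine and one write routine serve every driver. */
int raw_read(raw_dev_t dev)
{
	return (raw_physio(raw_cdevsw[raw_major(dev)].d_strategy,
	    (struct raw_buf *)NULL, dev, RAW_B_READ));
}

int raw_write(raw_dev_t dev)
{
	return (raw_physio(raw_cdevsw[raw_major(dev)].d_strategy,
	    (struct raw_buf *)NULL, dev, RAW_B_WRITE));
}
```

The four bytes mentioned above are the size of one function pointer
on the Vax; on another machine the cost per entry is whatever
sizeof(void (*)()) happens to be there.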

Once everything is written and tested, someone will post diffs.
(Perhaps I can convince Keith to send them to 4bsd-fixes, in his
copious spare time :-) no doubt).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
Domain:	chris@mimsy.umd.edu	Path:	seismo!mimsy!chris

gnu@hoptoad.UUCP (06/04/87)

Bravo, Chris!  There has always been way too much mumbo jumbo in
the disk and tape drivers.  Seventeen routines that do the same thing
with the "ab" changed to "cd" indeed!  That's almost as bad as the six
copies of "tgetent" used for gettytab, remote, printcap, ...	:-)

Not only is it cleaner and simpler, it runs faster, and in less space!
-- 
Copyright 1987 John Gilmore; you may redistribute only if your recipients may.
(This is an effort to bend Stargate to work with Usenet, not against it.)
{sun,ptsfa,lll-crg,ihnp4,ucbvax}!hoptoad!gnu	       gnu@ingres.berkeley.edu