[comp.unix.wizards] Disk striping?

arnold@emory.uucp (Arnold D. Robbins {EUCC}) (12/03/87)

Something that I've been wondering about for a while, and which
promises to expose my ignorance, is as follows. In standard 4.3 BSD, a
disk partition is a partition is a partition. You cannot put two 7 Meg
'A' partitions together and use them like one 14 Meg partition. (Some
vendors do allow this, e.g. Convex, but mostly it's few and far between.)

However, there is a limited form of disk striping already in the BSD
kernel: /dev/swap! One can swap on more than one physical disk at a time.

My question is, how general or non-general are the mechanisms used in
putting together the swap device? Is it unreasonable to adapt it for
more general purposes? It seems to me that this would be really useful,
as I'll bet lots of people have lots of unused 'A' and 'B' partitions.

Is there any chance disk striping will be in 4.4 BSD?

As they used to say, "Thanks in Advance".
-- 
Arnold Robbins
ARPA, CSNET:	arnold@emory.ARPA	BITNET: arnold@emory
UUCP: { decvax, gatech, }!emory!arnold	DOMAIN: arnold@emory.edu (soon)
	``csh: just say NO!''

ron@topaz.rutgers.edu (Ron Natalie) (12/04/87)

First, striping is not defined precisely as you indicate.

There are several ways one might use multiple drives on a UNIX
system.

1.  First method (currently in use).  Chop your filesystem into
    parts and place them on separate drives.  Provided you have
    a controller that allows you to do overlapped seeks you can
    arrange for better access times by careful placement of filesystems.
    Note that / and /tmp are the heaviest usage filesystems on most
    systems.  If you have multiple controllers, then you can actually
    do two transfers at a time, providing for better throughput.

2.  Second method, span a file system across multiple physical devices.
    This was done in some Unixes, especially on disk drives like RS04's
    whose size is prohibitively small.  Of course, this isn't that hot
    of an idea as it guarantees that only one drive will be active at
    a time.

3.  Stripe the drives.  Interleave the filesystems between two drives
    so that you can transfer data requests from both drives at once.
    This only beats out method #1 if you have the following:
	1.  A controller capable of transferring to/from each drive
	    simultaneously (or doing the same with multiple controllers).
	2.  A controller with a "zero latency" feature.  Most controllers,
	    even when the I/O is a full track (or a substantial fraction
	    of it), wait until the beginning of the track comes under the
	    head before starting the transfer.  Using these controllers
	    causes you to head towards the maximum latency, as you must
	    wait for the slowest of the drives, which will be
	    statistically higher than the single-drive latency.  Newer
	    controllers will start transferring as soon as they realize
	    that there is valid data under the head that is part of the
	    request.
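
The latency point in item 2 can be made concrete with a small worked
estimate.  This is only a sketch under stated assumptions that are not
in the post itself: each drive's rotational wait is taken to be an
independent uniform fraction U_i of one rotation period T (spindles not
synchronized), and a striped request completes only when the slowest of
the n drives has started its transfer.

```latex
E[\text{wait}_1] = T \int_0^1 u \, du = \frac{T}{2},
\qquad
E[\text{wait}_n] = T \cdot E[\max(U_1,\dots,U_n)]
                 = T \int_0^1 u \cdot n u^{n-1} \, du
                 = \frac{n}{n+1}\, T
```

Under those assumptions two drives wait 2T/3 on average instead of T/2,
and the figure creeps toward a full rotation T as drives are added.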

-Ron

mangler@cit-vax.Caltech.Edu (Don Speck) (12/10/87)

In article <2369@emory.uucp>, arnold@emory.uucp (Arnold D. Robbins {EUCC}) writes:
> However, there is a limited form of disk striping already in the BSD
> kernel: /dev/swap! One can swap on more than one physical disk at a time.
>
> My question is, how general or non-general are the mechanisms used in
> putting together the swap device?

During a time when it looked like my boss would only buy small disks
(Eagles), I started writing a pseudo-device to stripe several partitions
together.

Chris Torek pointed out that it's easy to do for the raw device:
supply physio() with an appropriate (*minphys)() routine that will
chop up the I/O and bump the blkno and buffer address.
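
A rough user-space sketch of that idea, for illustration only.  The
struct xfer type, the chop() loop and the stripe-unit parameter are
hypothetical stand-ins, not the real struct buf or the real physio();
the sketch only shows how a minphys-style routine can clamp each piece
so that it never crosses a stripe-unit boundary while the caller bumps
the block number for the next piece.

```c
#include <assert.h>

enum { SECT = 512 };            /* assumed sector size in bytes */

/* Hypothetical stand-in for the two struct buf fields of interest. */
struct xfer {
    long blkno;                 /* starting block, in sectors */
    long bcount;                /* bytes in this transfer */
};

/* minphys-style hook: shrink bcount so the transfer stops at the
 * end of the current stripe unit (unit_blocks sectors long). */
static void stripe_minphys(struct xfer *bp, long unit_blocks)
{
    assert(unit_blocks > 0 && bp->blkno >= 0);
    long left = (unit_blocks - bp->blkno % unit_blocks) * SECT;
    if (bp->bcount > left)
        bp->bcount = left;
}

/* Imitation of the physio-style loop: clamp, "transfer", advance.
 * Records each piece in out[] and returns the number of pieces. */
static int chop(long blkno, long bytes, long unit_blocks,
                struct xfer *out, int max)
{
    int n = 0;
    assert(bytes % SECT == 0);  /* raw I/O in whole sectors */
    while (bytes > 0 && n < max) {
        struct xfer x = { blkno, bytes };
        stripe_minphys(&x, unit_blocks);
        out[n++] = x;
        blkno += x.bcount / SECT;   /* bump the block number */
        bytes -= x.bcount;          /* (a real driver also bumps
                                       the buffer address) */
    }
    return n;
}
```

With a 4-sector stripe unit, an 8-sector request starting at block 6
comes out as three pieces starting at blocks 6, 8 and 12.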

But for the block device it's messy.  You can't just fudge the blkno
and buffer address, because those are the property of the buffer cache,
and may be inspected by getblk() at ANY TIME.  So you have to construct
one or more new buf structures, and use B_CALL to merge the relevant info
back to the original buf.  (Which means you can't use the stripe driver
for a swap device.  What's that you say?  Oh, the real swap device gets
around this by only providing a character device).

The really ugly part is allocating those new buf structures.  You can
rip them off from either the buffer cache or the swap headers, but
either way, there may not be any available when you ask, so you might
have to sleep().  The problem is that strategy routines aren't supposed
to sleep(), because they might be called at interrupt time.  I think ND
may be the only thing that does this, but I'm not sure.

I never got around to testing the code, partly due to lack of empty
partitions.

So no, the mechanism used in the swap device is not very general.

Don Speck   speck@vlsi.caltech.edu  {amdahl,scgvaxd}!cit-vax!speck

mangler@cit-vax.Caltech.Edu (Don Speck) (02/15/88)

On December 10 I wrote that disk striping has a couple of rather
serious restrictions.  At the beginning of February I finally had
a pressing need for disk striping (to piece together two small
partitions into a usable-size filesystem after losing a disk),
so I finally debugged the striping pseudo-device driver that I'd
written, and found that neither restriction was necessary.

The basic method is for the strategy routine to copy the buf, fudge
the dev/blkno fields in the copy, and set B_CALL in the copy (NOT in
the original).  At iodone time, a routine is called, which copies
back b_resid, b_error, and (only) the B_ERROR bit of b_flags to the
original buf, and does an iodone() on the original.  The temporary
buf is then freed.
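
The method just described can be imitated in user space roughly as
follows.  This is not the code in stripe.tar: the struct buf here is a
cut-down mock, the flag values are arbitrary, iodone() is a simplified
stand-in for the kernel routine, and the b_orig back pointer and the
names stripe_clone() and stripe_done() are assumptions made up for the
illustration.

```c
#include <assert.h>
#include <stddef.h>

#define B_ERROR 0x1             /* flag values are arbitrary here */
#define B_CALL  0x2
#define B_DONE  0x4

struct buf {                    /* mock: only the fields we touch */
    int    b_flags;
    int    b_dev;
    long   b_blkno;
    long   b_resid;
    int    b_error;
    void (*b_iodone)(struct buf *);
    struct buf *b_orig;         /* hypothetical link to the original */
};

/* Simplified stand-in for the kernel's iodone(). */
static void iodone(struct buf *bp)
{
    bp->b_flags |= B_DONE;
    if (bp->b_flags & B_CALL) {
        bp->b_flags &= ~B_CALL;
        (*bp->b_iodone)(bp);
    }
}

/* Called at iodone time on the copy: merge status into the original. */
static void stripe_done(struct buf *cp)
{
    struct buf *bp = cp->b_orig;

    bp->b_resid = cp->b_resid;
    bp->b_error = cp->b_error;
    bp->b_flags |= cp->b_flags & B_ERROR;   /* only the error bit */
    iodone(bp);
    /* a real driver would now return cp to its pool */
}

/* Strategy-time setup: copy the buf and fudge only the copy. */
static void stripe_clone(struct buf *bp, struct buf *cp,
                         int dev, long blkno)
{
    *cp = *bp;
    cp->b_dev = dev;
    cp->b_blkno = blkno;
    cp->b_flags |= B_CALL;      /* set in the copy, NOT the original */
    cp->b_iodone = stripe_done;
    cp->b_orig = bp;
}
```

The point of the arrangement is that the original's b_dev and b_blkno
are never modified, so the buffer cache can inspect them at any time.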

To avoid the possibility of having to sleep on buf allocation,
requests that cannot immediately allocate a buf are linked into a
list.  By having a private pool of bufs, we're assured that a buf
will soon be freed up by an interrupt, and when that happens the
list of waiting requests is examined.  Ripping off swap buffers
doesn't work, since the swapper may hog them and the only way it
has to tell you when one becomes free is via sleep/wakeup, which
strategy routines are NOT supposed to use.
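
A minimal sketch of that bookkeeping, again in user space and under
assumptions of my own: struct req, struct pool and the function names
are hypothetical, the pool is reduced to a counter of free bufs, and
the interrupt masking a real driver would need around the list
manipulation is left out.

```c
#include <assert.h>
#include <stddef.h>

struct req {                    /* hypothetical pending request */
    int         started;        /* set once a private buf is assigned */
    struct req *next;
};

struct pool {
    int         nfree;          /* private bufs currently free */
    struct req *head, *tail;    /* FIFO of requests waiting for one */
};

static void pool_init(struct pool *p, int nbufs)
{
    p->nfree = nbufs;
    p->head = p->tail = NULL;
}

/* Strategy path: never sleeps.  Returns 1 if the request got a buf
 * and was started, 0 if it was linked onto the waiting list. */
static int stripe_start(struct pool *p, struct req *r)
{
    r->next = NULL;
    if (p->nfree > 0) {
        p->nfree--;
        r->started = 1;
        return 1;
    }
    if (p->tail != NULL)
        p->tail->next = r;
    else
        p->head = r;
    p->tail = r;
    return 0;
}

/* Interrupt path: a private buf has just been freed.  Hand it to the
 * oldest waiter if there is one, else put it back in the pool.
 * Returns the request that was started, or NULL. */
static struct req *stripe_release(struct pool *p)
{
    struct req *r = p->head;

    if (r == NULL) {
        p->nfree++;
        return NULL;
    }
    p->head = r->next;
    if (p->head == NULL)
        p->tail = NULL;
    r->next = NULL;
    r->started = 1;
    return r;
}
```

Because every buf in the pool is tied to an I/O already in flight, each
waiter is picked up by some later completion interrupt and nothing in
the strategy path has to call sleep().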

With those changes, it should be safe to use with Sun ND, etc.
(I don't see why it couldn't be used recursively if you wanted).

I tried various interleave factors, and found that with a single
disk controller, it's best to interleave by cylinders.  Trying to
interleave by filesystem blocks messes up the rotdelay optimization.
Reading large files does not go any faster than with a single disk;
you only gain throughput if you have several independent readers.
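
For readers unfamiliar with interleave factors, the mapping from a
logical block to a (disk, block) pair can be sketched as below.  This
is a generic round-robin formula and not necessarily what stripe.tar
does; struct loc, stripe_map() and the "unit" parameter are names
chosen for the illustration.  Setting unit to the number of blocks per
cylinder gives interleaving by cylinders, and setting it to the
filesystem block size (in sectors) gives interleaving by filesystem
blocks.

```c
#include <assert.h>

struct loc {                    /* hypothetical result type */
    int  disk;                  /* which member partition */
    long blkno;                 /* block within that partition */
};

/* Round-robin striping: consecutive runs of `unit` blocks go to
 * consecutive disks. */
static struct loc stripe_map(long blkno, int ndisks, long unit)
{
    struct loc l;
    long chunk;

    assert(blkno >= 0 && ndisks > 0 && unit > 0);
    chunk = blkno / unit;                       /* which stripe unit */
    l.disk = (int)(chunk % ndisks);             /* whose turn it is */
    l.blkno = (chunk / ndisks) * unit + blkno % unit;
    return l;
}
```

With two disks and a 100-block unit, logical blocks 0-99 land on disk 0,
blocks 100-199 on disk 1, and block 250 on disk 0 at block 150.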

cit-vax has been using this to hold netnews since February 7.
The code can be obtained by anonymous ftp from csvax.caltech.edu
(10.1.0.54), file pub/stripe.tar.  Feedback is welcome; this is
still pretty experimental.

Don Speck   speck@vlsi.caltech.edu  {amdahl,ames!elroy}!cit-vax!speck