arnold@emory.uucp (Arnold D. Robbins {EUCC}) (12/03/87)
Something that I've been wondering about for a while, and which promises to expose my ignorance, is as follows.

In standard 4.3 BSD, a disk partition is a partition is a partition. You cannot put two 7 Meg 'A' partitions together and use them like one 14 Meg partition. (Some vendors do allow this, e.g. Convex, but such support is few and far between.) However, there is a limited form of disk striping already in the BSD kernel: /dev/swap! One can swap on more than one physical disk at a time.

My question is: how general or non-general are the mechanisms used in putting together the swap device? Would it be unreasonable to adapt them for more general purposes? It seems to me that this would be really useful, as I'll bet lots of people have lots of unused 'A' and 'B' partitions. Is there any chance disk striping will be in 4.4 BSD?

As they used to say, "Thanks in Advance".
-- 
Arnold Robbins
ARPA, CSNET: arnold@emory.ARPA	BITNET: arnold@emory
UUCP: { decvax, gatech, }!emory!arnold	DOMAIN: arnold@emory.edu (soon)
``csh: just say NO!''
ron@topaz.rutgers.edu (Ron Natalie) (12/04/87)
First, striping is not defined precisely as you indicate. There are several ways one might use multiple drives on a UNIX system.

1. First method (currently in use). Chop your filesystem into parts and place them on separate drives. Provided you have a controller that allows overlapped seeks, you can arrange for better access times by careful placement of filesystems. Note that / and /tmp are the most heavily used filesystems on most systems. If you have multiple controllers, then you can actually do two transfers at a time, providing better throughput.

2. Second method: span a file system across multiple physical devices. This was done in some Unixes, especially on disk drives like RS04's, whose size is prohibitively small. Of course, this isn't that hot an idea, as it guarantees that only one drive will be active at a time.

3. Stripe the drives. Interleave the filesystem between two drives so that you can transfer data from both drives at once. This only beats out method #1 if you have the following:

   1. A controller capable of transferring to/from each drive simultaneously (or doing the same with multiple controllers).

   2. A controller with a "zero latency" feature. Most controllers, even when the I/O is a full track (or a substantial fraction of one), wait until the beginning of the track comes under the head before starting the transfer. With such controllers you tend toward the worst case, since you must wait out the rotational latency on each drive, which will be statistically higher than the latency of a single drive. Newer controllers will start transferring as soon as they see valid data under the head that is part of the request.

-Ron
mangler@cit-vax.Caltech.Edu (Don Speck) (12/10/87)
In article <2369@emory.uucp>, arnold@emory.uucp (Arnold D. Robbins {EUCC}) writes:
> However, there is a limited form of disk striping already in the BSD
> kernel: /dev/swap! One can swap on more than one physical disk at a time.
>
> My question is, how general or non-general are the mechanisms used in
> putting together the swap device?

During a time when it looked like my boss would only buy small disks (Eagles), I started writing a pseudo-device to stripe several partitions together. Chris Torek pointed out that it's easy to do for the raw device: supply physio() with an appropriate (*minphys)() routine that will chop up the I/O and bump the blkno and buffer address.

But for the block device it's messy. You can't just fudge the blkno and buffer address, because those are the property of the buffer cache, and may be inspected by getblk() at ANY TIME. So you have to construct one or more new buf structures, and use B_CALL to merge the relevant info back into the original buf. (Which means you can't use the stripe driver for a swap device. What's that you say? Oh, the real swap device gets around this by only providing a character device.)

The really ugly part is allocating those new buf structures. You can rip them off from either the buffer cache or the swap headers, but either way, there may not be any available when you ask, so you might have to sleep(). The problem is that strategy routines aren't supposed to sleep(), because they might be called at interrupt time. I think ND may be the only thing that does this, but I'm not sure.

I never got around to testing the code, partly due to lack of empty partitions. So no, the mechanism used in the swap device is not very general.

Don Speck	speck@vlsi.caltech.edu	{amdahl,scgvaxd}!cit-vax!speck
mangler@cit-vax.Caltech.Edu (Don Speck) (02/15/88)
On December 10 I wrote that disk striping has a couple of rather serious restrictions. At the beginning of February I finally had a pressing need for disk striping (to piece together two small partitions into a usable-size filesystem after losing a disk), so I finally debugged the striping pseudo-device driver that I'd written, and found that neither restriction was necessary.

The basic method is for the strategy routine to copy the buf, fudge the dev/blkno fields in the copy, and set B_CALL in the copy (NOT in the original). At iodone time, a routine is called which copies back b_resid, b_error, and (only) the B_ERROR bit of b_flags, and does an iodone() on the original. The temporary buf is then freed.

To avoid the possibility of having to sleep on buf allocation, requests that cannot immediately allocate a buf are linked into a list. By having a private pool of bufs, we're assured that a buf will soon be freed up by an interrupt, and when that happens the list of waiting requests is examined. Ripping off swap buffers doesn't work, since the swapper may hog them, and the only way it has to tell you when one becomes free is via sleep/wakeup, which strategy routines are NOT supposed to use. With those changes, it should be safe to use with Sun ND, etc. (I don't see why it couldn't be used recursively if you wanted.)

I tried various interleave factors, and found that with a single disk controller, it's best to interleave by cylinders. Trying to interleave by filesystem blocks messes up the rotdelay optimization. Reading large files does not go any faster than with a single disk; you only gain throughput if you have several independent readers.

cit-vax has been using this to hold netnews since February 7. The code can be obtained by anonymous ftp from csvax.caltech.edu (10.1.0.54), file pub/stripe.tar. Feedback is welcome; this is still pretty experimental.

Don Speck	speck@vlsi.caltech.edu	{amdahl,ames!elroy}!cit-vax!speck