john@compugen. (John Beaudin) (11/29/89)
Something along the lines of Raxco's Rabbit, or DiskKeeper, or Squeezpak. How about it? There's money to be made here.
alan@shodha.dec.com ( Alan's Home for Wayward Notes File.) (11/30/89)
In article <2095@compugen.>, john@compugen. (John Beaudin) writes:
> Something along the lines of Raxco's Rabbit, or DiskKeeper, or Squeezpak.

ULTRIX(+) uses the Berkeley Fast File System [1], which is in a way very resistant to fragmentation. Of course "very resistant" to fragmentation doesn't mean that it can't happen.

A problem I have considered from time to time is how to determine whether a file system is badly enough fragmented to warrant trying to clean it up. It is not only a matter of having a file with blocks scattered all over, but also one of having FILES scattered all over, and of the access patterns of the I/O to those files. It has been shown that frequently accessed file systems should live towards the middle of a disk to get the best access times. I would guess that if you could determine which files are the most frequently accessed, you should arrange for them to live closest to the center. A defragmenter for the FFS isn't a "simple" matter of compacting all the blocks of a file into one nice little area, but it is an interesting problem, if the problem exists at all.

[1] - A Fast File System for UNIX*, revised July 27, 1983, Marshall Kirk McKusick, William N. Joy, Samuel J. Leffler, Robert S. Fabry.

(*) UNIX is a trademark of AT&T.
(+) ULTRIX is a trademark of Digital Equipment Corporation.
--
Alan Rollow alan@nabeth.enet.dec.com
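The question of deciding whether a file system is fragmented enough to be worth cleaning up can be made concrete with a minimal sketch of a per-file measurement. This is not from any of the posts in this thread: the block numbers below are invented, and a real tool would have to read them out of each file's inode (the direct and indirect block pointers).

/*
 * Minimal sketch: measure how fragmented one file's block list is by
 * counting the logical-block transitions that are not physically
 * adjacent on disk.  The block numbers are invented for illustration.
 */
#include <stdio.h>

/* Fraction of transitions between consecutive logical blocks that are
 * not contiguous on disk.  0.0 means perfectly contiguous. */
static double frag_ratio(const long *blocks, int nblocks)
{
    int i, breaks = 0;

    if (nblocks < 2)
        return 0.0;
    for (i = 1; i < nblocks; i++)
        if (blocks[i] != blocks[i - 1] + 1)
            breaks++;
    return (double)breaks / (nblocks - 1);
}

int main(void)
{
    /* Hypothetical block list for one file. */
    long blocks[] = { 100, 101, 102, 340, 341, 900, 901, 902 };
    int n = sizeof(blocks) / sizeof(blocks[0]);

    printf("fragmentation ratio: %.2f\n", frag_ratio(blocks, n));
    return 0;
}

As the posts below point out, a low ratio is not necessarily the goal on an FFS file system; a file deliberately laid out with rotational gaps will score as "fragmented" by this measure.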
chris@mimsy.umd.edu (Chris Torek) (11/30/89)
In article <2095@compugen.> john@compugen. (John Beaudin) writes:
>Something along the lines of Raxco's Rabbit, or DiskKeeper, or Squeezpak.
>How about it? There's money to be made here.

No, there is no money to make, except perhaps from suckers. The 4.2BSD file system is effectively self-defragmenting. Files are allocated badly only when the disk is over-full. (This is one of two reasons for the 10% free space minimum.)
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@cs.umd.edu Path: uunet!mimsy!chris
krs0@GTE.COM (Rod Stephens) (12/01/89)
I was at a DECUS seminar where someone asked how disk fragmentation affected performance on Unix systems. The answer was that the file system works best if things are NOT contiguous. (This started a series of jokes about disk fragmenters ;) Unfortunately that's all I know. Does anyone know what this guy meant?
--
Rod Stephens GTE Laboratories, Inc (617)466-4182
alan@shodha.dec.com ( Alan's Home for Wayward Notes File.) (12/02/89)
In article <7862@bunny.GTE.COM>, krs0@GTE.COM (Rod Stephens) writes:
> I was at a DECUS seminar where someone asked how disk fragmentation
> affected performance on Unix systems. The answer was that the file
> system works best if things are NOT contiguous. (This started a series
> of jokes about disk fragmenters ;) Unfortunately that's all I know.
> Does anyone know what this guy meant?

First you should refer to the article "A Fast File System for UNIX*" by Marshall Kirk McKusick, William N. Joy, Samuel J. Leffler and Robert S. Fabry. A copy is in the System Manager volume of the ULTRIX+ Supplementary Documents. ULTRIX uses the Berkeley Fast File System (FFS) for local disk accesses. Many other UNIX systems also use FFS, but some probably still use the old UNIX file system. Among the features of the FFS are:

o Optimal storage utilization (blocks and fragments).
o File system parameterization (rotation layout).
o A different block layout policy.

There are others, but these are the ones that reduce the affects of fragmentation in the file system.

The first allows a large allocation block size for files (4KB and 8KB in the ULTRIX implementation). Left to itself, though, this would waste large amounts of space when many small files are created. To reduce the amount of wasted space, small files are allowed to allocate fragments of blocks. Fragment sizes of 1/8th, 1/4th, 1/2 and the block size are allowed. Both blocks and fragments are allocated contiguously, so for files whose sizes are less than or equal to the block size, fragmentation isn't a problem.

The second feature attempts to lay out file system blocks at rotationally optimal locations for the capabilities of the disk. A number of file system parameters are provided for tuning the file system to work best for the disk it is on(1). The optimal location for a block is calculated based on the rotational speed of the disk and the time it takes the system to dispatch an I/O request to the disk. A simple example of this might be a file that consists of two 8KB blocks. Even if the file system is doing read-ahead, it will take two reads to read the file (one for each 8KB block). If the blocks are allocated contiguously, it is possible the second block will have rotated past the disk head before the request gets to the disk, and so you'll have to wait for the block to come back around. If a gap is placed between the blocks that is long enough to allow the second request to show up, the request can be satisfied more quickly. If the disk/controller hardware allows it, it is possible to specify long strings of blocks that can be read/written contiguously(1).

The affect of rotational optimization on fragmentation is that files are already fragmented in such a way as to allow optimal sequential access at the file system block size. Depending on the disk speed, controller speed and latency, and CPU speed, the best layout may be to have these rotational gaps, or it may be best to lay out the blocks as contiguously as possible. This is something that you may have to determine by experimentation for your hardware and file system access.
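Going back to the first feature (blocks and fragments), the space-saving idea can be shown with a minimal sketch. This is not the FFS code; the 8KB block and 1KB (1/8th) fragment sizes are assumed example values in the spirit of the ULTRIX configuration described above, and a real file system fixes its fragment size when it is created.

/*
 * Minimal sketch of block/fragment allocation: a file's last partial
 * block is stored in the smallest number of fragments that will hold
 * it, rather than in a whole block.  Sizes are assumed for the example.
 */
#include <stdio.h>

#define BLKSIZE  8192          /* assumed file system block size        */
#define FRAGSIZE (BLKSIZE / 8) /* assumed fragment size: 1/8th of block */

/* Bytes of disk space used to hold the last "tail" bytes of a file. */
static long tail_alloc(long tail)
{
    long frags = (tail + FRAGSIZE - 1) / FRAGSIZE; /* round up to fragments */

    if (frags * FRAGSIZE >= BLKSIZE)
        return BLKSIZE;        /* needs a whole block anyway */
    return frags * FRAGSIZE;
}

int main(void)
{
    long sizes[] = { 500, 1024, 3000, 7000, 8192 };
    int i;

    for (i = 0; i < 5; i++)
        printf("tail of %5ld bytes -> %5ld bytes allocated\n",
               sizes[i], tail_alloc(sizes[i]));
    return 0;
}

A 500-byte file thus costs one 1KB fragment instead of a full 8KB block, which is where the "optimal storage utilization" in the feature list above comes from.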
The third feature is an attempt to keep directories of files close together and spread the free space equally across the disk. The disk is divided into groups of cylinders, where each group is treated like a small file system. It has a copy of the superblock, a section of inodes and free space. When a new directory is created it is placed in the cylinder group with the most free space. Files created in that directory are allocated to the same cylinder group. In order to try and keep a reasonable amount of free space in the cylinder group, large files are limited in the amount of space they can use out of any one cylinder group and have further blocks allocated in other groups. If a cylinder group is full, a quadratic hash is used to find space, and if that fails an exhaustive search is performed.

Performance studies of the FFS at Berkeley showed that the performance was fairly constant until the file system reached around 90% full. This is the reason that the file system attempts to keep 10% free. This threshold can be adjusted with tunefs(8), but if it is going to be a long-term situation you should attempt to find more free space somewhere.

One potential disadvantage of the block layout is that files get scattered all over the disk. The files in a given directory may be close together, but two different directories (two users for example) may be far apart. To help get around this, Berkeley added request sorting to the disk drivers so that when the disk queues became full the requests would be served in such a way as to get the best global throughput from the disk. The Digital Storage Architecture (DSA) controllers also do request sorting. In ULTRIX-32 V1.0 the DSA (UDA50) driver still had the Berkeley sort routine in it. It was removed in V1.1 in the belief that there was no need to sort the requests twice.

I believe that most of what I have written is accurate, but I haven't read the FFS article recently, so my memory may be faulty. Any corrections would be appreciated.

*UNIX is a trademark of AT&T.
+ULTRIX is a trademark of Digital Equipment Corporation.

1. Refer to the tunefs(8) manual page.
--
Alan Rollow alan@nabeth.enet.dec.com
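The directory placement policy described above can be sketched in a few lines. This is not the actual FFS allocator (which also weighs the number of inodes and directories already in each group); the group count and free-block figures are invented for the example.

/*
 * Minimal sketch of the FFS directory placement idea: a new directory
 * goes into the cylinder group with the most free space, and files
 * created in it are then allocated from that same group.
 */
#include <stdio.h>

#define NCG 8   /* number of cylinder groups, assumed for the example */

/* Return the cylinder group with the most free blocks. */
static int pick_cg_for_dir(const long free_blocks[NCG])
{
    int cg, best = 0;

    for (cg = 1; cg < NCG; cg++)
        if (free_blocks[cg] > free_blocks[best])
            best = cg;
    return best;
}

int main(void)
{
    long free_blocks[NCG] = { 120, 340, 95, 510, 400, 60, 280, 330 };

    printf("new directory placed in cylinder group %d\n",
           pick_cg_for_dir(free_blocks));
    return 0;
}

Spreading new directories this way is what keeps the free space roughly even across the disk, at the cost of scattering unrelated directories, which is the disadvantage Alan notes above.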
chris@mimsy.umd.edu (Chris Torek) (12/02/89)
In article <7862@bunny.GTE.COM> krs0@GTE.COM (Rod Stephens) writes:
>I was at a DECUS seminar where someone asked how disk fragmentation
>affected performance on Unix systems. The answer was that the file
>system works best if things are NOT contiguous. (This started a series
>of jokes about disk fragmenters ;) Unfortunately that's all I know.
>Does anyone know what this guy meant?

He was probably talking about matching rotational and processing latency.

The average Unix program reads files using stdio. Stdio reads files one block (underlying file system block size, typically 4K, 8K, 16K, 32K, or 64K, although sizes greater than 8K do not occur on current VAX Unix systems) at a time. At the same time as the current block is read, the next block (of up to 12 total, since only direct block entries participate in this) is brought in to the buffer cache via breada(). The two read requests are passed to the disk device driver as two separate requests, and most disk devices can handle only one request at a time [assumption #1]. Thus, the first request (for the desired block) is sent to the disk, and the second is placed on a queue. When the first one finishes transferring into main memory, the device interrupts the CPU, which has to figure out what has happened, notice the next read request, and pass that on to the device. During that time, the disk may have passed the point where it can read the next sector immediately [assumption #2].

If the second block of the file is contiguous with the first, the disk head will be over the middle of that block, and the CPU will have to wait for one complete revolution of the disk (typically 1/60th of a second) for the second block [assumption (hidden this time) #3], by which time the application reading the file is likely to already have requested the second block. In other words, the application will have to wait. If the application has to wait, the next read request will proceed much as the previous one, and the application will have to wait for each disk block, despite the read-ahead.

If the blocks are separated by a gap, however, the analysis changes as follows. The first block works more or less as before, but this time the second block is not already under the disk head, so it comes in as soon as the head reaches that block. The CPU has only to wait for the current block to pass under the head, and for the next block to transfer into memory. On a `typical' Fujitsu Eagle (48 sectors/track) this is essentially 2/3 (32/48) the time for a full rotation (for 8K byte blocks). On a larger disk (with more sectors per track) the ratio is better. Now the application need not wait as long, or---if it is sufficiently slow---perhaps not even at all. If it does not wait at all, the next read request (issued at the same time the read-ahead block is returned to the application) might have to wait for nearly a full rotation again (if we have bad luck and the application uses the first read-ahead block just as the disk head is over the third file block) or might not. (If the application is REALLY slow, random block placement works just as well as anything else, since the read-ahead always has plenty of time to move the disk head to the next block, so we can ignore this case.)
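The arithmetic behind the 2/3 figure can be written down in a short sketch. The numbers come from the Eagle example above (48 sectors per track, 8KB blocks of sixteen 512-byte sectors); the 3600 RPM spindle speed is assumed to match the "one revolution is about 1/60th of a second" figure, and the sketch is an illustration, not a measurement.

/*
 * Minimal sketch of the rotational-latency comparison: the wait for the
 * read-ahead block when it is contiguous (about a full extra
 * revolution) versus behind a rotational gap (roughly 32/48 of a
 * revolution: the current block passes under the head, then the next
 * block transfers).  Geometry and RPM are assumed example values.
 */
#include <stdio.h>

#define SECTORS_PER_TRACK 48
#define RPM               3600
#define BLOCK_SECTORS     16    /* 8KB block / 512-byte sectors */

int main(void)
{
    double rev_ms = 60000.0 / RPM;                 /* one revolution, ms */
    double sector_ms = rev_ms / SECTORS_PER_TRACK; /* one sector, ms     */

    /* Contiguous: the request arrives after the block has started
     * passing the head, so we wait about one full revolution. */
    double contiguous_ms = rev_ms;

    /* Gapped: the current block passes under the head, then the next
     * block transfers -- two block times, 32 of 48 sectors. */
    double gapped_ms = 2 * BLOCK_SECTORS * sector_ms;

    printf("one revolution:        %.2f ms\n", rev_ms);
    printf("contiguous next block: %.2f ms wait\n", contiguous_ms);
    printf("gapped next block:     %.2f ms wait (%.0f%% of a revolution)\n",
           gapped_ms, 100.0 * gapped_ms / rev_ms);
    return 0;
}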
Now, the disk sometimes also has to switch heads and/or cylinders. This can take time. Typical head switch delays are 0 (disks with servo tracks) or ~.2ms (embedded sync) or ridiculous (some DEC RA series, embedded sync plus stupidity); typical track-to-track seek times are ~1ms. On head or cylinder switch, one also wants the next block not to be contiguous, no matter what the application's speed may be, since the system can try to match the rotational delay with the switch delay. The BSD FFS does not do a good job of this, so I am not going to say much more here.

Here are the assumptions marked above:

#1: disk does only one thing at a time. True for VAX except DEC RA disks on DEC MSCP controllers. Many of these controllers are so horribly slow (UDA50 => 8085 w/ 16 or 32 K of cache) that this does not matter anyway. Others (HSC => LSI-11, presumably with decent amounts of RAM) can do a better job (which is not to say that they necessarily do so).

#2: CPU+device is slow enough to let sector headers pass by during interrupt handling. True on many machines, becomes false as the machines get faster or the drivers get better. Once the machine is fast enough, contiguous becomes good. (VAXen are not getting faster very fast.)

#3: disk+controller does not do read-ahead. True on most (all?) VAX systems, false on modern SCSI systems. A good read-ahead cache on the disk changes everything.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@cs.umd.edu Path: uunet!mimsy!chris
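Assumption #2 has a simple break-even point: once the system can get the next request to the disk in less than one sector time, contiguous allocation stops costing a revolution. The sketch below shows the comparison; the dispatch times are invented, and the geometry and RPM are the same assumed example values as before.

/*
 * Minimal sketch of assumption #2: can the next contiguous sector still
 * be caught, given how long the system takes to field the interrupt and
 * issue the next request?  Dispatch times are invented for illustration.
 */
#include <stdio.h>

#define SECTORS_PER_TRACK 48
#define RPM               3600

static int contiguous_ok(double dispatch_ms)
{
    double sector_ms = (60000.0 / RPM) / SECTORS_PER_TRACK;

    return dispatch_ms < sector_ms;
}

int main(void)
{
    double slow_cpu_ms = 1.0;   /* hypothetical slow CPU + dumb controller */
    double fast_cpu_ms = 0.1;   /* hypothetical faster machine or driver   */

    printf("sector time: %.3f ms\n", (60000.0 / RPM) / SECTORS_PER_TRACK);
    printf("slow machine: contiguous %s\n",
           contiguous_ok(slow_cpu_ms) ? "wins" : "loses a revolution");
    printf("fast machine: contiguous %s\n",
           contiguous_ok(fast_cpu_ms) ? "wins" : "loses a revolution");
    return 0;
}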
chris@mimsy.umd.edu (Chris Torek) (12/03/89)
In article <504@shodha.dec.com> alan@shodha.dec.com ( Alan's Home for Wayward Notes File.) writes:
[a bunch of reasonable stuff deleted, although he consistently misspells `effects' (this from the American boy who spells -ize words with -ise :-) )]
>One potential disadvantage of the block layout is that files
>get scattered all over the disk. The files in a given directory
>may be close together, but two different directories (two users
>for example) may be far apart. To help get around this, Berkeley
>added request sorting to the disk drivers so that when the disk
>queues became full the requests would be served in such a way
>as to get the best global throughput from the disk.

Unix has had elevator sorting in the disk drivers since at least V7. The BSD disksort code is not quite the same as the original, but it was not `added', merely `preserved'.

>... In ULTRIX-32 V1.0 the DSA (UDA50) driver still had the Berkeley
>sort routine in it.

Did it? I thought the 4.2BSD uda.c (the Ultrix 1.x driver was a variant thereof) never called disksort().

>It was removed in V1.1 in the belief that there was no need to sort the
>requests twice.

Indeed. It is, however, worth pointing out that even on many multi-user machines, the disk queue is rarely deeper than one entry. I put some statistics in our kernels when I was working on the KDB driver for the 8250 for the BSD Tahoe release, and found that 85% of all I/O was through the buffer cache, and that about 97% of all I/O consisted of at most two requests (two is not surprising, as breada() generates two). One of them gets handed to the device immediately, and the second gets queued (on hp/up/rk/... disks) or is also immediately handed to the device (uda/kdb disks), but in either case one is being done when the second is generated, so the queue depth is 1. The only time the queue gets deep on a typical multi-user VAX is when it starts paging heavily. The pageout daemon can generate upwards of 20 write requests at once. . . .

It would be interesting to add disk queue statistics to large NFS file servers. A professor here is doing disk studies, including this sort of thing, but I am not sure what results he has for queue depth. (He has other results that show that the disk stays on-cylinder 1/3 of the time, which is pretty good. Seek distance on the 2/3 misses is not as good, however.)
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@cs.umd.edu Path: uunet!mimsy!chris
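The elevator sorting mentioned above can be illustrated with a minimal sketch. This is not the BSD disksort() code, only the idea: pending requests are kept in sweep order so the head services them in one pass across the disk rather than seeking back and forth. The block numbers and head position are invented.

/*
 * Minimal sketch of elevator-style request sorting.  The queue is kept
 * as two ascending runs: blocks at or beyond the current head position
 * (served on this sweep), then blocks behind it (served on the next).
 */
#include <stdio.h>
#include <stdlib.h>

struct req {
    long blkno;
    struct req *next;
};

static void elevator_insert(struct req **head, long headpos, struct req *r)
{
    struct req **pp = head;

    if (r->blkno >= headpos) {
        /* first run: requests at or beyond the head, ascending */
        while (*pp && (*pp)->blkno >= headpos && (*pp)->blkno <= r->blkno)
            pp = &(*pp)->next;
    } else {
        /* skip the first run, then keep the second run ascending */
        while (*pp && (*pp)->blkno >= headpos)
            pp = &(*pp)->next;
        while (*pp && (*pp)->blkno <= r->blkno)
            pp = &(*pp)->next;
    }
    r->next = *pp;
    *pp = r;
}

int main(void)
{
    long blocks[] = { 700, 120, 950, 300, 50 };
    long headpos = 400;              /* assumed current head position */
    struct req *q = NULL, *r;
    int i;

    for (i = 0; i < 5; i++) {
        r = malloc(sizeof(*r));
        r->blkno = blocks[i];
        elevator_insert(&q, headpos, r);
    }
    printf("service order from block %ld:", headpos);
    for (r = q; r != NULL; r = r->next)
        printf(" %ld", r->blkno);
    printf("\n");
    return 0;
}

With the head at block 400 the sketch serves 700 and 950 on the way out, then 50, 120 and 300 on the next sweep, which is the global-throughput behavior the quoted paragraph describes. As Chris notes, with a typical queue depth of one there is rarely anything to sort.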