alan@mn-at1.k.mn.org (Alan Klietz) (06/18/88)
In article <6963@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (Don Speck) writes:
<In article <23326@bu-cs.BU.EDU>, bzs@bu-cs.BU.EDU (Barry Shein) writes:
<[why UNIX I/O is so slow compared to big mainframe OS]

A useful model is to partition the time spent by every I/O request
into fixed and variable portions.  Let tf be the fixed overhead to
reset the interface hardware, queue the I/O request, wait for the
data to rotate under the head (for networks, the time to process all
of the headers), etc.  Let td be the marginal cost of transferring
one unit of data (byte, block, whatever).  The total I/O utilization
of a channel is then characterized by

               n * td
      D = ---------------
           tf + n * td

for n units of data, with lim D = 1.0 as n -> infinity.  td is
typically very small (microseconds); tf is typically orders of
magnitude higher (milliseconds).  The curve usually has a knee; UNIX
I/O is often on the left side of the knee while most mainframe OS's
are on the right side.

<For all the optimizations that these I/O processors are supposed to do,
<Unix rarely gives them the chance.  Unless there's more than two requests
<outstanding at once, once they finish one, there's only one request to
<choose from.  Unix has minimal readahead, so that's as many requests as
<a single process can generate.  Raw I/O is even worse.

Yep, Unix needs to do larger I/O transfers.  Case in point: the
Cray-2 has a 16 Gbyte/sec I/O throughput capability with incredibly
expensive 80+ Mbit/s parallel-head disks (often striped).  And yet,
typing

    cp bigfile bigfile2

measures a transfer performance of only 18 Mbit/s, because BUFSIZ is
4K.

<Asynchronous reads would be the obvious way to get enough requests in
<the queue to optimize, but that seems unlikely to happen.  Rather,
<explicit read commands are giving way to memory-mapped files (in Mach
<and SunOS 4.0) where readahead becomes synonymous with prepaging.  It
<remains to be seen whether much attention is put into this.

There have been comments that SunOS 4.0 I/O overhead is 2 or 3 times
greater than under 3.0.  Demand-paged I/O introduces all of the
Turing divination problems of trying to predict which pages (I/O
blocks) the program will use next.  IMHO, this is a step backward.

<Barry credits the asynchronous nature of I/O on mainframe OS's to the
<access methods, like RMS on VMS.  People avoid those when they want
<speed (imagine using dbm to do sequential reads).  For instance, the
<VMS "copy" command bypasses RMS when copying disk-to-disk, with the
<curious result that it's faster to copy to a disk than to the null
<device, because the null device is record-oriented, requiring RMS.

Access methods like RMS developed through evolution ("survival of
the fastest?") to their current state of being I/O marvels.  Hence
MVS preallocation requirements, VMS RMS, asynchronous channel I/O,
etc.

<As DMR demonstrates, parallel-transfer disks are great for big files.
<They're horrendously expensive though, and it's hard enough to find
<controllers that keep up with even 3 MB/s, much less 10 MB/s.

Disk prices are dropping fast.  8" 1 GB dual-head disks (6 MB/s)
will be common in about a year for $5000-$9000 in quantity 1.  The
ANSI X3T9 IPI (Intelligent Peripheral Interface) is now a full standard and starts
at 10 Mb/s and goes up to 25 Mb/s in the current configurations.
N.B. the vendors pushing this standard are IBM, CDC, Unisys,
Fujitsu, NEC, and Hitachi (big mainframe manufacturers).  Unix in
its current incarnation is unable to take advantage of this new disk
technology.
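To see where that knee falls, here is a minimal C sketch that simply
evaluates D for growing n.  The tf and td values are illustrative
assumptions only (20 ms of fixed overhead, 2 us per 1K block), not
measurements from any particular machine.

    /*
     * Evaluate channel utilization D = (n * td) / (tf + n * td).
     * tf and td below are assumed values, chosen only to show the
     * shape of the curve, not measured on any real hardware.
     */
    #include <stdio.h>

    int main(void)
    {
        double tf = 20e-3;   /* assumed fixed overhead per request (s) */
        double td = 2e-6;    /* assumed marginal cost per 1K block (s) */
        long n;

        for (n = 1; n <= 1L << 20; n *= 4) {
            double D = (n * td) / (tf + n * td);
            printf("n = %8ld blocks   D = %.4f\n", n, D);
        }
        return 0;
    }

With these assumed numbers the curve is still below D = 0.5 at
n = 4096 blocks and only reaches about 0.87 at n = 65536, which is
the sense in which small-transfer UNIX I/O sits on the wrong side of
the knee.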
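The cp example above can be turned around in user code by issuing
larger requests directly with read(2) and write(2).  A hedged
sketch: the 1 MB buffer size is an arbitrary assumption, and on a
stock kernel the buffer cache still splits the transfer into file
system blocks, so this shows only what a larger request stream looks
like, not a guaranteed speedup.

    /* bigcp: copy a file using a large transfer size instead of
       stdio's BUFSIZ.  The buffer size is an assumption for
       illustration; tune it to the disk and controller at hand. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define BIGBUF (1024 * 1024)

    int main(int argc, char **argv)
    {
        char *buf = malloc(BIGBUF);
        int in, out;
        ssize_t n;

        if (argc != 3 || buf == NULL) {
            fprintf(stderr, "usage: bigcp from to\n");
            return 1;
        }
        in = open(argv[1], O_RDONLY);
        out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0666);
        if (in < 0 || out < 0) {
            perror("open");
            return 1;
        }
        while ((n = read(in, buf, BIGBUF)) > 0)
            if (write(out, buf, n) != n) {
                perror("write");
                return 1;
            }
        close(in);
        close(out);
        return 0;
    }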
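And for the mapped-file style of I/O quoted above, a sketch of
reading a file through mmap().  This uses the modern POSIX mmap()
signature rather than the 1988 Mach/SunOS interfaces, so treat it as
an illustration of the idea (page faults replace explicit reads;
readahead becomes prepaging), not of either historical API.

    /* Count newlines in a file via a read-only mapping.  Touching
       each page faults it in, so readahead becomes prepaging. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
        int fd;
        struct stat st;
        char *p;
        off_t i;
        long nl = 0;

        if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
            fprintf(stderr, "usage: mapcount file\n");
            return 1;
        }
        fstat(fd, &st);
        if (st.st_size == 0)
            return 0;
        p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        for (i = 0; i < st.st_size; i++)
            if (p[i] == '\n')
                nl++;
        printf("%ld lines\n", nl);
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }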
<they can be simulated with ordinary disks by striping across multiple
<controllers, *if* the disks rotate as one.  Does anyone know of a cost-
<effective disk that can phase-lock its spindle motor to that of a second
<disk, or perhaps with the AC line?  With direct-drive electronically-
<controlled motors becoming common, this should be possible.  The Eagle
<has such a motor, but no provision for external sync.  I recall stories
<of Cray's using phase-locked disks to advantage.

The thesis of my paper "Turbo NFS" (*) shows how you can get good
I/O performance without phase-locked disks by reorganizing the file
system contiguously.  Cylinders of data are prefetched from selected
disks at a rate commensurate with the rate at which the data is
consumed by the program.  Extents are allocated contiguously by
powers of 2; the organization is called a "fractal file system".
Phillip Koch did the original work in this area (**).

<Of course, to get the most from high transfer rates, you need large
<blocksizes; DMR's example looked like about one revolution.  Hence
<the extent-based file allocation of mainframe OS's, etc.  Perhaps
<it's time to pester Berkeley to double MAXBSIZE to 16384 bytes?

Berkeley should start over.  The whole business with "cylinder groups"
tries to keep sets of blocks relatively near each other.  With the new
disks today, the average SEEK TIME IS OFTEN FASTER THAN THE ROTATIONAL
DELAY.  You don't want to keep blocks "near" each other, instead you want
to make each extent as large as possible.  Sorry, but cylinder groups are
archaic.

<The one point that nobody mentioned is that you don't want the CPU
<copying the data around between kernel and user address spaces when
<there's a lot!  (Maybe it was just too obvious).

Here is an area where paged I/O has an advantage.  The first UNIX
vendor to do contiguous file systems + paged I/O + prefetching will
win big in the disk I/O race.

<Don Speck   speck@vlsi.caltech.edu   {amdahl,ames!elroy}!cit-vax!speck

(*)  "Turbo NFS: Fast Shared Access for Cray Disk Storage", A. Klietz
     (MN Supercomputer Center), Proceedings of the Cray User Group,
     Spring 1988.

(**) "Disk File Allocation Based on the Buddy System", P. D. L. Koch
     (Dartmouth), ACM TOCS, Vol. 5, No. 3, November 1987.
--
Alan Klietz
Minnesota Supercomputer Center (*)
1200 Washington Avenue South
Minneapolis, MN  55415      UUCP:  alan@mn-at1.k.mn.org
Ph: +1 612 626 1836         ARPA:  alan@uc.msc.umn.edu  (was umn-rei-uc.arpa)
(*) An affiliate of the University of Minnesota
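As a footnote to the buddy-system citation (**) above, here is a toy
C sketch of power-of-2 extent allocation: a free extent of order k
splits into two "buddies" of order k-1 until the requested order is
reached.  Freeing and coalescing are omitted, and none of the names
correspond to the actual Turbo NFS data structures.

    #include <stdio.h>

    #define MAXORDER 10            /* largest extent: 2^10 = 1024 blocks */

    struct extent { long start; struct extent *next; };

    static struct extent pool[4096];   /* toy node pool, never freed */
    static int npool;
    static struct extent *freelist[MAXORDER + 1];

    static void insert(int order, long start)
    {
        struct extent *e = &pool[npool++];
        e->start = start;
        e->next = freelist[order];
        freelist[order] = e;
    }

    /* Allocate 2^order contiguous blocks; return start block or -1. */
    static long alloc_extent(int order)
    {
        int k;
        long start;
        struct extent *e;

        for (k = order; k <= MAXORDER; k++)     /* smallest fit */
            if (freelist[k] != NULL)
                break;
        if (k > MAXORDER)
            return -1;
        e = freelist[k];
        freelist[k] = e->next;
        start = e->start;
        while (k > order) {                     /* split off buddies */
            k--;
            insert(k, start + (1L << k));
        }
        return start;
    }

    int main(void)
    {
        insert(MAXORDER, 0);    /* one free 1024-block extent at block 0 */
        printf("64 blocks at %ld\n", alloc_extent(6));
        printf("256 blocks at %ld\n", alloc_extent(8));
        return 0;
    }

Because every extent is a power of 2 in size and naturally aligned,
a big file ends up as a handful of large contiguous runs that can be
prefetched a cylinder at a time.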
gwyn@brl-smoke.ARPA (Doug Gwyn ) (06/18/88)
In article <441@mn-at1.k.mn.org> alan@mn-at1.UUCP (0000-Alan Klietz) writes:
-Berkeley should start over. The whole business with "cylinder groups"
-tries to keep sets of blocks relatively near each other. With the new
-disks today, the average SEEK TIME IS OFTEN FASTER THAN THE ROTATIONAL
-DELAY. You don't want to keep blocks "near" each other, instead you want
-to make each extent as large as possible. Sorry, but cylinder groups are
-archaic.
Such considerations should lead to the conclusion that each type of
filesystem may need its own access algorithms (perhaps in an I/O
processor). This is easy to arrange via the File System Switch.
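A minimal sketch of the switch idea in C: a per-filesystem-type
table of operations, so each type brings its own allocation policy.
The names here are invented for illustration and do not match the
actual SVR3 File System Switch declarations.

    #include <stdio.h>

    struct fsops {
        const char *fs_name;
        long (*fs_alloc)(long nblocks);   /* per-type allocation policy */
    };

    /* s5-style: hand out one block at a time, near the last one. */
    static long s5_alloc(long nblocks) { (void)nblocks; return 1; }

    /* extent-style: round the request up to one contiguous extent. */
    static long extent_alloc(long nblocks)
    {
        long n = 1;
        while (n < nblocks)
            n <<= 1;
        return n;
    }

    static struct fsops fstab[] = {
        { "s5",     s5_alloc     },
        { "extent", extent_alloc },
    };

    int main(void)
    {
        int i;
        for (i = 0; i < 2; i++)
            printf("%s grants %ld blocks for a 100-block request\n",
                   fstab[i].fs_name, (*fstab[i].fs_alloc)(100));
        return 0;
    }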
alan@mn-at1.k.mn.org (Alan Klietz) (06/18/88)
In article <441@mn-at1.k.mn.org> alan@mn-at1.UUCP (0000-Alan Klietz) writes:
< The ANSI X3T9 IPI
<(Intelligent Peripheral Interface) is now a full standard and starts
<at 10 Mb/s and goes up to 25 Mb/s in the current configurations.
^^^^ ^^^^^^^
That's "megabytes" per second, not "megabits" per second.
(80-200 Mbits/s == 10-25 Mbytes/s). Sorry for the confusion.
--
Alan Klietz
Minnesota Supercomputer Center (*)
1200 Washington Avenue South
Minneapolis, MN 55415 UUCP: alan@mn-at1.k.mn.org
Ph: +1 612 626 1836 ARPA: alan@uc.msc.umn.edu (was umn-rei-uc.arpa)
(*) An affiliate of the University of Minnesota
mike@turing.unm.edu (Michael I. Bushnell) (06/19/88)
In article <441@mn-at1.k.mn.org> alan@mn-at1.UUCP (0000-Alan Klietz) writes:
>Berkeley should start over.  The whole business with "cylinder groups"
>tries to keep sets of blocks relatively near each other.  With the new
>disks today, the average SEEK TIME IS OFTEN FASTER THAN THE ROTATIONAL
>DELAY.  You don't want to keep blocks "near" each other, instead you want
>to make each extent as large as possible.  Sorry, but cylinder groups are
>archaic.

Yet Another Fast File System Misunderstanding (YAFFSM).  FFS doesn't
try to put blocks "relatively near each other" in the sense you
suggest.  It minimizes seek time by cylinder grouping, AND it
minimizes rotational delay by the allocation strategies inside
cylinder groups.  Also, FFS does try to do big transfers: 8K is the
norm for a block.  Compare that to AT&T's 512 bytes!
--
                N u m q u a m   G l o r i a   D e o
Michael I. Bushnell             HASA - "A" division
mike@turing.unm.edu
{ucbvax,gatech}!unmvax!turing.unm.edu!mike
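The rotational-allocation point above can be made concrete with a
toy calculation in the spirit of FFS's rotdelay parameter.  The disk
geometry and CPU turnaround time below are assumed purely for
illustration.

    /* If the CPU needs rotdelay_ms between finishing one block and
       issuing the next read, allocating strictly consecutive blocks
       costs nearly a full revolution per block.  Skipping a few
       block positions lets the next block arrive under the head
       just as the next request is issued. */
    #include <stdio.h>

    int main(void)
    {
        double rpm = 3600.0;        /* assumed disk speed */
        int blkpertrack = 8;        /* assumed 8K blocks per track */
        double rotdelay_ms = 4.0;   /* assumed request turnaround */

        double ms_per_rev = 60000.0 / rpm;             /* 16.67 ms */
        double ms_per_blk = ms_per_rev / blkpertrack;  /* ~2.08 ms */
        int skip = (int)(rotdelay_ms / ms_per_blk) + 1;

        printf("skip %d block positions between allocations\n", skip);
        return 0;
    }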
lm@arizona.edu (Larry McVoy) (06/29/88)
In article <8124@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>Such considerations should lead to the conclusion that each type of
>filesystem may need its own access algorithms (perhaps in an I/O
>processor).  This is easy to arrange via the File System Switch.

Do the wizards have a preference (based on logic, not religion, one
presumes) between the file system switch and the vnode method of
virtualizing file systems?  Anyone looked into both?
--
Larry McVoy     laidbak!lm@sun.com      1-800-LAI-UNIX x286
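For concreteness, the architectural difference comes down to where
the operations vector lives: the switch dispatches through a global
table indexed by filesystem type, while a vnode carries a pointer to
its own ops.  A toy C sketch, with names invented for illustration
(they are not the actual Sun vnode declarations):

    #include <stdio.h>

    struct vnodeops {
        int (*vop_read)(void *vp, char *buf, int len);
    };

    struct vnode {
        struct vnodeops *v_op;   /* ops travel with the object... */
        void *v_data;            /* ...along with per-fs private data */
    };

    static int nfs_read(void *vp, char *buf, int len)
    {
        (void)vp; (void)buf;
        printf("nfs_read of %d bytes\n", len);
        return len;
    }

    static struct vnodeops nfs_ops = { nfs_read };

    int main(void)
    {
        struct vnode vn = { &nfs_ops, NULL };
        char buf[512];

        /* Dispatch needs no global table and no type index. */
        return (*vn.v_op->vop_read)(&vn, buf, sizeof buf) < 0;
    }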