[comp.unix.wizards] Why UNIX I/O is so slow

alan@mn-at1.k.mn.org (Alan Klietz) (06/18/88)

In article <6963@cit-vax.Caltech.Edu> mangler@cit-vax.Caltech.Edu (Don Speck) writes:
<In article <23326@bu-cs.BU.EDU>, bzs@bu-cs.BU.EDU (Barry Shein) writes:

	[why UNIX I/O is so slow compared to big mainframe OS]

A useful model is to partition the time spent on every I/O request
into a fixed portion and a variable portion.  tf is the fixed overhead to
reset the interface hardware, queue the I/O request, wait for the
data to rotate under the head (for networks, the time to process all
of the headers), etc.  td is the marginal cost of transferring a unit of
data (byte, block, whatever).  The total I/O utilization of a channel
in this case is characterized by

	        n td
	D = ------------
	     tf + n td

	for n units of data.  The lim  D = 1.0. 
				  n->inf

td is typically very small (microseconds); tf is typically orders
of magnitude larger (milliseconds).  The curve usually has a knee;
UNIX I/O is often on the left side of the knee while most mainframe
OS's are on the right side.
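
To see where the knee falls, here is a throwaway program that plugs
made-up round numbers into the model above (20 ms of fixed overhead per
request, 0.1 usec per byte transferred -- assumptions for the sake of
illustration, not measurements of any real channel):

	/* util.c - channel utilization D = n*td / (tf + n*td) for an
	   n-byte request, with illustrative (assumed) values of tf, td */
	#include <stdio.h>

	int main(void)
	{
	    double tf = 20e-3;    /* fixed overhead per request: 20 ms */
	    double td = 0.1e-6;   /* marginal cost per byte: 0.1 usec  */
	    long   n;

	    for (n = 512; n <= 16L * 1024 * 1024; n *= 8) {
	        double D = (n * td) / (tf + n * td);
	        printf("%10ld bytes/request: utilization %.4f\n", n, D);
	    }
	    return 0;
	}

With those assumed numbers, 512-byte requests keep the channel busy
less than 1% of the time; it takes requests in the megabyte range to
approach D = 1.0.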

<For all the optimizations that these I/O processors are supposed to do,
<Unix rarely gives them the chance.  Unless there's more than two requests
<outstanding at once, once they finish one, there's only one request to
<choose from.  Unix has minimal readahead, so that's as many requests as
<a single process can generate.	Raw I/O is even worse.

Yep, Unix needs to do larger I/O transfers.  Case in point: the Cray-2
has a 16 Gbyte/sec I/O throughput capability with incredibly expensive
80+ Mbit/s parallel-head disks (often striped).  And yet, typing

	cp bigfile bigfile2

measures a transfer performance of only 18 Mbit/s, because BUFSIZ is 4K.
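
The fix is available to applications even without kernel changes: size
the copy buffer from the filesystem's preferred block size (st_blksize,
a 4.2BSD-ism) and read/write in large multiples of it.  A minimal
sketch (the 1 Mbyte buffer is an arbitrary illustration, not a tuned
value):

	/* bigcopy.c - copy a file using a large buffer rounded to the
	   filesystem's preferred block size, instead of stdio's BUFSIZ */
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
	    int in, out;
	    struct stat sb;
	    size_t blk, bufsize;
	    char *buf;
	    ssize_t n;

	    if (argc != 3) {
	        fprintf(stderr, "usage: %s from to\n", argv[0]);
	        return 1;
	    }
	    if ((in = open(argv[1], O_RDONLY)) < 0) {
	        perror(argv[1]); return 1;
	    }
	    if ((out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0666)) < 0) {
	        perror(argv[2]); return 1;
	    }
	    if (fstat(in, &sb) < 0) {
	        perror("fstat"); return 1;
	    }

	    /* round 1 Mbyte up to a multiple of the preferred block size */
	    blk = sb.st_blksize > 0 ? (size_t)sb.st_blksize : 8192;
	    bufsize = ((1024 * 1024 + blk - 1) / blk) * blk;

	    if ((buf = malloc(bufsize)) == NULL) {
	        perror("malloc"); return 1;
	    }
	    while ((n = read(in, buf, bufsize)) > 0)
	        if (write(out, buf, n) != n) {
	            perror("write"); return 1;
	        }
	    if (n < 0) {
	        perror("read"); return 1;
	    }
	    return 0;
	}

Each read(2)/write(2) then moves hundreds of filesystem blocks per
system call instead of one BUFSIZ-worth.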

<Asynchronous reads would be the obvious way to get enough requests in
<the queue to optimize, but that seems unlikely to happen.  Rather,
<explicit read commands are giving way to memory-mapped files (in Mach
<and SunOS 4.0) where readahead becomes synonymous with prepaging.  It
<remains to be seen whether much attention is put into this.

There have been comments that SunOS 4.0 I/O overhead is 2 or 3 times
greater than under 3.0.  Demand-paged I/O introduces all of the Turing
divination problems of trying to predict which pages (I/O blocks) the
program will use next.  IMHO, this is a step backward.
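
What the traditional kernel actually does is about the simplest
prediction possible: one-block readahead when the access pattern looks
sequential.  A toy user-level rendition of that heuristic (the
bookkeeping here is a sketch of the idea, not any kernel's actual
code):

	/* rahead.c - toy rendition of the classic one-block readahead
	   heuristic: if the block just requested immediately follows the
	   last block read, assume sequential access and fetch one more */
	#include <stdio.h>

	struct filestate {
	    long lastr;               /* last logical block read */
	};

	static int
	want_readahead(struct filestate *fp, long b)
	{
	    int sequential = (b == fp->lastr + 1);

	    fp->lastr = b;
	    return sequential;
	}

	int main(void)
	{
	    struct filestate f;
	    long requests[] = { 0, 1, 2, 7, 8, 3 };
	    int i;

	    f.lastr = -1;
	    for (i = 0; i < 6; i++)
	        printf("read block %ld -> %s\n", requests[i],
	            want_readahead(&f, requests[i]) ?
	                "also start block+1" : "no readahead");
	    return 0;
	}

Anything fancier -- strided access, multiple streams, prefetching whole
cylinders -- is exactly the divination problem above, and it gets no
easier when the unit is a page instead of a block.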

<Barry credits the asynchronous nature of I/O on mainframe OS's to the
<access methods, like RMS on VMS.  People avoid those when they want
<speed (imagine using dbm to do sequential reads).  For instance, the
<VMS "copy" command bypasses RMS when copying disk-to-disk, with the
<curious result that it's faster to copy to a disk than to the null
<device, because the null device is record-oriented, requiring RMS.
 
Access methods like RMS developed through evolution ("survival of the
fastest?") to their current state of being I/O marvels.  Hence MVS
preallocation requirements, VMS asynchronous channel I/O, etc.

<As DMR demonstrates, parallel-transfer disks are great for big files.
<They're horrendously expensive though, and it's hard enough to find
<controllers that keep up with even 3 MB/s, much less 10 MB/s.

Disk prices are dropping fast.  8" 1 Gb dual-head disks (6 MB/s) will be
common in about a year for $5000-$9000 qty 1.  The ANSI X3T9 IPI
(Intelligent Peripheral Interface) is now a full standard.  It starts
at 10 Mb/s and goes up to 25 Mb/s in the current configurations.
N.B. the vendors pushing this standard are IBM, CDC, Unisys, Fujitsu,
NEC, and Hitachi (big mainframe manufacturers).  Unix in its current
incarnation is unable to take advantage of this new disk technology.

<they can be simulated with ordinary disks by striping across multiple
<controllers, *if* the disks rotate as one.  Does anyone know of a cost-
<effective disk that can phase-lock its spindle motor to that of a second
<disk, or perhaps with the AC line?  With direct-drive electronically-
<controlled motors becoming common, this should be possible.  The Eagle
<has such a motor, but no provision for external sync.  I recall stories
<of Cray's using phase-locked disks to advantage.
 
The thesis in my paper "Turbo NFS" (*) shows how you can get good
I/O performance without phase-locked disks by reorganizing the
file system contiguously.  Cylinders of data are prefetched from
selected disks at a rate commensurate with the rate at which the
data is consumed by the program.   Extents are allocated contiguously
by powers of 2.  The organization is called a "fractal file system".
Phillip Koch did the original work in this area (**).
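
For readers who don't have the Koch paper handy, the flavor of
powers-of-2 (buddy) extent allocation can be sketched in a few dozen
lines.  This is a toy, user-level illustration -- the free-list
representation, the sizes, and the absence of a free/coalesce path are
simplifications, not the data structures of the actual file systems:

	/* buddy.c - toy buddy-system extent allocator over a simulated
	   1024-block disk; extents are powers of 2, split on demand */
	#include <stdio.h>
	#include <stdlib.h>

	#define MAX_ORDER 10                  /* largest extent: 2^10 blocks */

	struct extent {
	    long start;
	    struct extent *next;
	};

	static struct extent *freelist[MAX_ORDER + 1];

	static void
	insert(int order, long start)
	{
	    struct extent *e = malloc(sizeof *e);

	    e->start = start;
	    e->next = freelist[order];
	    freelist[order] = e;
	}

	/* allocate a contiguous extent of at least nblocks blocks */
	static long
	ext_alloc(long nblocks)
	{
	    struct extent *e;
	    long start;
	    int order = 0, k;

	    while ((1L << order) < nblocks)   /* round up to a power of 2 */
	        order++;

	    /* smallest non-empty free list at or above the wanted order */
	    for (k = order; k <= MAX_ORDER && freelist[k] == NULL; k++)
	        ;
	    if (k > MAX_ORDER)
	        return -1;                    /* nothing big enough left */

	    e = freelist[k];
	    freelist[k] = e->next;
	    start = e->start;
	    free(e);

	    /* split in half until the extent is the requested size,
	       returning each unused upper half (buddy) to its free list */
	    while (k > order) {
	        k--;
	        insert(k, start + (1L << k));
	    }
	    return start;
	}

	int main(void)
	{
	    insert(MAX_ORDER, 0);      /* one free extent spans the disk */
	    printf("70-block file: extent at block %ld (rounded to 128)\n",
	        ext_alloc(70));
	    printf(" 3-block file: extent at block %ld (rounded to   4)\n",
	        ext_alloc(3));
	    return 0;
	}

In this toy version large files end up in a few big contiguous extents
that can be read a cylinder at a time, at the cost of some rounding
waste for files whose sizes aren't powers of 2.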

<Of course, to get the most from high transfer rates, you need large
<blocksizes; DMR's example looked like about one revolution.  Hence
<the extent-based file allocation of mainframe OS's, etc.  Perhaps
<it's time to pester Berkeley to double MAXBSIZE to 16384 bytes?

Berkeley should start over.  The whole business with "cylinder groups"
tries to keep sets of blocks relatively near each other.  With the new
disks today, the average SEEK TIME IS OFTEN FASTER THAN THE ROTATIONAL
DELAY.  You don't want to keep blocks "near" each other, instead you want
to make each extent as large as possible.  Sorry, but cylinder groups are
archaic.
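
For a sense of scale, with round numbers of the era (assumed for
illustration, not quoted from any spec sheet):

	/* latency.c - back-of-the-envelope disk latency numbers
	   (assumed, illustrative values -- not any particular drive) */
	#include <stdio.h>

	int main(void)
	{
	    double rpm = 3600.0;
	    double rev_ms = 60.0 * 1000.0 / rpm;   /* one revolution   */
	    double avg_rot_ms = rev_ms / 2.0;      /* avg rotational   */
	    double trk_seek_ms = 4.0;              /* track-to-track   */
	    double avg_seek_ms = 15.0;             /* average (assumed)*/

	    printf("revolution %.1f ms, avg rotational delay %.1f ms\n",
	        rev_ms, avg_rot_ms);
	    printf("track-to-track seek ~%.0f ms, average seek ~%.0f ms\n",
	        trk_seek_ms, avg_seek_ms);
	    return 0;
	}

The point being made: once a seek costs no more than the rotation you
wait for anyway, making each extent as large as possible buys more than
keeping many small blocks merely near one another.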

<The one point that nobody mentioned is that you don't want the CPU
<copying the data around between kernel and user address spaces when
<there's a lot!	(Maybe it was just too obvious).

Here is an area where paged I/O has an advantage.  The first UNIX vendor
to do contiguous file systems + paged I/O + prefetching will win big in
the disk I/O race.
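
As a sketch of what that buys at the program level, here is reading a
file through the page cache with no per-byte kernel-to-user copy.  The
interface shown is the mmap(2) that SunOS 4.0 introduced, written in
the modern POSIX spelling, so treat it as an illustration of the idea
rather than 1988 code:

	/* mapread.c - read (checksum) a file via mmap: pages are faulted
	   in by the VM system, and readahead becomes prepaging */
	#include <sys/types.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
	    int fd;
	    struct stat sb;
	    char *p;
	    unsigned long sum = 0;
	    off_t i;

	    if (argc != 2) {
	        fprintf(stderr, "usage: %s file\n", argv[0]);
	        return 1;
	    }
	    if ((fd = open(argv[1], O_RDONLY)) < 0) {
	        perror("open"); return 1;
	    }
	    if (fstat(fd, &sb) < 0) {
	        perror("fstat"); return 1;
	    }
	    p = mmap(NULL, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
	    if (p == (char *)MAP_FAILED) {
	        perror("mmap"); return 1;
	    }

	    /* touch every byte without a single read(2) */
	    for (i = 0; i < sb.st_size; i++)
	        sum += (unsigned char)p[i];
	    printf("%ld bytes, checksum %lu\n", (long)sb.st_size, sum);

	    munmap(p, sb.st_size);
	    close(fd);
	    return 0;
	}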
 
<Don Speck   speck@vlsi.caltech.edu  {amdahl,ames!elroy}!cit-vax!speck

(*) "Turbo NFS: Fast Shared Access for Cray Disk Storage", A. Klietz
(MN Supercomputer Center)  Proceedings of the Cray User Group, Spring 1988.

(**) "Disk File Allocation Based on the Buddy System", P. D. L. Koch
(Dartmouth)  ACM TOCS, Vol 5, No 3, November 1987.

--
Alan Klietz
Minnesota Supercomputer Center (*)
1200 Washington Avenue South
Minneapolis, MN  55415    UUCP:  alan@mn-at1.k.mn.org
Ph: +1 612 626 1836       ARPA:  alan@uc.msc.umn.edu  (was umn-rei-uc.arpa)

(*) An affiliate of the University of Minnesota

gwyn@brl-smoke.ARPA (Doug Gwyn ) (06/18/88)

In article <441@mn-at1.k.mn.org> alan@mn-at1.UUCP (0000-Alan Klietz) writes:
-Berkeley should start over.  The whole business with "cylinder groups"
-tries to keep sets of blocks relatively near each other.  With the new
-disks today, the average SEEK TIME IS OFTEN FASTER THAN THE ROTATIONAL
-DELAY.  You don't want to keep blocks "near" each other, instead you want
-to make each extent as large as possible.  Sorry, but cylinder groups are
-archaic.

Such considerations should lead to the conclusion that each type of
filesystem may need its own access algorithms (perhaps in an I/O
processor).  This is easy to arrange via the File System Switch.
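
The idea is simply a per-filesystem-type table of entry points that the
kernel dispatches through.  A compilable sketch of the shape of it (the
structures and names here are illustrative, not the actual System V
fstypsw declarations):

	/* fsswitch.c - sketch of a file system switch: each filesystem
	   type supplies its own policy routines; the kernel-side caller
	   just indexes the table.  Names and fields are illustrative. */
	#include <stdio.h>

	struct fsops {
	    const char *name;
	    void (*readahead)(long lblkno);   /* per-fs prefetch policy */
	};

	static void
	ffs_readahead(long lblkno)            /* block-at-a-time policy */
	{
	    printf("ffs: prefetch block %ld\n", lblkno + 1);
	}

	static void
	extfs_readahead(long lblkno)          /* extent/cylinder policy */
	{
	    printf("extentfs: prefetch cylinder containing block %ld\n",
	        lblkno + 1);
	}

	static struct fsops fsswitch[] = {
	    { "ffs",      ffs_readahead },
	    { "extentfs", extfs_readahead },
	};

	/* what a kernel-side caller would do, given the mount's fs type */
	static void
	fs_readahead(int fstype, long lblkno)
	{
	    (*fsswitch[fstype].readahead)(lblkno);
	}

	int main(void)
	{
	    fs_readahead(0, 10);    /* an FFS mount          */
	    fs_readahead(1, 10);    /* an extent-based mount */
	    return 0;
	}

The vnode scheme asked about later in the thread is the same idea, with
the operations vector hung off each file object rather than indexed by
filesystem type.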

alan@mn-at1.k.mn.org (Alan Klietz) (06/18/88)

In article <441@mn-at1.k.mn.org> alan@mn-at1.UUCP (0000-Alan Klietz) writes:
< The ANSI X3T9 IPI
<(Intelligent Peripheral Interface) is now a full standard and starts
<at 10 Mb/s and goes up to 25 Mb/s in the current configurations.
       ^^^^                ^^^^^^^
That's "megabytes" per second, not "megabits" per second.
(80-200 Mbits/s == 10-25 Mbytes/s).   Sorry for the confusion.

--
Alan Klietz
Minnesota Supercomputer Center (*)
1200 Washington Avenue South
Minneapolis, MN  55415    UUCP:  alan@mn-at1.k.mn.org
Ph: +1 612 626 1836       ARPA:  alan@uc.msc.umn.edu  (was umn-rei-uc.arpa)

(*) An affiliate of the University of Minnesota

mike@turing.unm.edu (Michael I. Bushnell) (06/19/88)

In article <441@mn-at1.k.mn.org> alan@mn-at1.UUCP (0000-Alan Klietz) writes:
>Berkeley should start over.  The whole business with "cylinder groups"
>tries to keep sets of blocks relatively near each other.  With the new
>disks today, the average SEEK TIME IS OFTEN FASTER THAN THE ROTATIONAL
>DELAY.  You don't want to keep blocks "near" each other, instead you want
>to make each extent as large as possible.  Sorry, but cylinder groups are
>archaic.

Yet Another Fast File System Misunderstanding (YAFFSM).

FFS doesn't try to put blocks "relatively near each other" in the sense
you suggest.  It minimizes seek time by cylinder grouping, AND it minimizes
rotational delay by the allocation strategies inside cylinder groups.

Also, FFS does try to do big transfers.  8K is the norm for a block.
Compare that to AT&T's 512-byte blocks!
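
The arithmetic behind that exclamation point, in terms of the
fixed-plus-variable cost model at the top of the thread (the 1 Mbyte
file size is just an example):

	/* blocks.c - disk transfers needed to read a 1 Mbyte file at
	   512-byte vs. 8K block sizes (illustrative only) */
	#include <stdio.h>

	int main(void)
	{
	    long filesize = 1024L * 1024;

	    printf("512-byte blocks: %ld transfers\n", filesize / 512);
	    printf("  8K blocks:     %ld transfers\n", filesize / 8192);
	    return 0;
	}

That is 2048 transfers versus 128, and each one pays the fixed
overhead tf.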




-- 
                N u m q u a m   G l o r i a   D e o 

			Michael I. Bushnell
			HASA - "A" division
			mike@turing.unm.edu
	    {ucbvax,gatech}!unmvax!turing.unm.edu!mike

lm@arizona.edu (Larry McVoy) (06/29/88)

In article <8124@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>Such considerations should lead to the conclusion that each type of
>filesystem may need its own access algorithms (perhaps in an I/O
>processor).  This is easy to arrange via the File System Switch.

Do the wizards have a preference (based on logic, not religion, one presumes)
between the file system switch and the vnode method of virtualizing file
systems?  Anyone looked into both?
-- 
Larry McVoy	laidbak!lm@sun.com	1-800-LAI-UNIX x286