[comp.lang.misc] UNIX does *not* fully support asynchronous I/O

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (08/25/90)

In article <1990Aug21.223350.7595@esegue.segue.boston.ma.us> johnl@esegue.segue.boston.ma.us (John R. Levine) writes:
> In article <60345@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
> >From article <126800008@.Prime.COM>, by EAF@.Prime.COM:
> >> If your language I/O library is intelligent and you are reading sequential
> >> data, the language library will call on the OS to read the next disk 
> >> block into memory, often before it is required.
> >Not on UNIX it won't.  There is no system call for the library to use ...
  [ John talks about simple caching schemes ]

I'm afraid Jim is right, though he drastically overestimates the effect
of this failure on small machines. Let me explain.

Say a program computes some numbers. Computes them optimally, in fact,
leaving them in an array. Now it wants to write the array to disk.

If the operating system weren't in the way, the program would simply
call upon the disk device to copy the data---through DMA, of course---to
the disk.

Under UNIX, there's at least one big extra step. write(fd,buf,n) first
*copies* the data to a buffer inside the kernel's space. This takes CPU
time. Do you see now what Jim is complaining about?
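
Concretely, the extra step is in this call (results[] and its size are
made up for illustration):

    #include <stdio.h>
    #include <unistd.h>

    static double results[1000000];   /* the array the program computed */

    void save(int fd)
    {
        /* Before write() returns, the kernel has copied all of results[]
         * into its own buffers; that copy is the extra step, and the CPU
         * does it.  (A careful program would also handle short writes.) */
        if (write(fd, results, sizeof results) < 0)
            perror("write");
    }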

Of course, on most machines disk transfer is much slower than CPU
transfer, so once you've gotten rid of the disk seek by caching, any
further asynchronicity is silly. But Jim works with very fast disks, and
a lot of them at once.

mmap() is a partial solution: it does its job well and gets rid of the
extra step, but doesn't fit into the ``UNIX model'' as well as it could.
How do you use mmap() on a pipe, for example? If two programs are
communicating via a pipe, they should be able to write data and read it
with *zero* copies in the middle. Under standard UNIX, there are two
extra copies at least: one for read() and one for write().
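
For plain files, here is a minimal sketch of what mmap() buys, in the
modern spelling of the interface (SunOS's original mmap() differs in
details): the program computes directly into the mapped pages, which are
the same pages the kernel writes out, so there is no separate
user-to-kernel copy.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 1 << 20;                    /* one megabyte of output */
        int fd = open("out.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || ftruncate(fd, len) < 0) {  /* file must be long enough to map */
            perror("open/ftruncate");
            return 1;
        }
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        memset(p, 'x', len);       /* "compute" directly into the mapped pages */
        msync(p, len, MS_SYNC);    /* force the pages out to disk */
        munmap(p, len);
        close(fd);
        return 0;
    }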

I've proposed a solution: make a call analogous to writev() that uses
the iovecs directly. Introduce another call that says whether a
particular iovec has been written or not. Also introduce a way to wait
on this status, similar to select(). Similarly for reading.
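
To make the proposal concrete, here is the sort of interface I have in
mind; the names are invented for illustration and exist in no UNIX:

    #include <sys/uio.h>

    /* Queue iov[0..iovcnt-1] for output on fd and return immediately.
     * The kernel transfers straight from the caller's memory, so the
     * buffers must not be touched until the transfer completes. */
    int awritev(int fd, const struct iovec *iov, int iovcnt);

    /* Has the transfer queued for this iovec finished?  1 = yes, 0 = no. */
    int awritev_done(int fd, const struct iovec *iov);

    /* Block until the transfers for the listed iovecs finish, in the
     * spirit of select().  A matching areadv() family handles reading. */
    int awritev_wait(int fd, const struct iovec *iov, int iovcnt);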

---Dan

moss@cs.umass.edu (Eliot Moss) (08/26/90)

Such multi-buffered asynchronous I/O is supported by DEC under Ultrix
for access to raw disk devices ....		Eliot
--

		J. Eliot B. Moss, Assistant Professor
		Department of Computer and Information Science
		Lederle Graduate Research Center
		University of Massachusetts
		Amherst, MA  01003
		(413) 545-4206; Moss@cs.umass.edu

jlg@lanl.gov (Jim Giles) (08/28/90)

From article <11576:Aug2503:18:3790@kramden.acf.nyu.edu>, by brnstnd@kramden.acf.nyu.edu (Dan Bernstein):
> [...]
> I'm afraid Jim is right, though he drastically overestimates the effect
> of this failure on small machines. Let me explain.

True.  But many "small" machines aren't very small any more, and this
will increasingly be the case.

> [...]
> Under UNIX, there's at least one big extra step. write(fd,buf,n) first
> *copies* the data to a buffer inside the kernel's space. This takes CPU
> time. Do you see now what Jim is complaining about?

It takes more than CPU time.  If the program is producing and consuming
data at (or beyond) the maximum I/O rate, the buffers will always be
busy when I make a request, so I will be forced to wait.  With real
asynchronous I/O, I wait only when I need the next I/O to have
_completed_, not when I make the request.  This means that in the UNIX
style _none_ of the I/O will overlap user calculation, while with
genuine asynchronous I/O _all_ of it will.
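
In code, the pattern I mean is the classic double-buffering loop,
sketched here with the POSIX-style aio calls (a sketch only: produce()
stands in for the user calculation, and error handling is omitted):

    #include <aio.h>
    #include <string.h>
    #include <sys/types.h>

    #define BUFSZ (1 << 20)
    static char buf[2][BUFSZ];

    void produce(char *dst, long step);   /* fills dst with BUFSZ bytes of results */

    void run(int fd, long nsteps)
    {
        struct aiocb cb[2];
        memset(cb, 0, sizeof cb);

        for (long step = 0; step < nsteps; step++) {
            int i = step & 1;                      /* alternate between two buffers */

            if (step >= 2) {                       /* buffer i still has a write outstanding */
                const struct aiocb *list[1] = { &cb[i] };
                aio_suspend(list, 1, NULL);        /* blocks only if it hasn't completed yet */
                aio_return(&cb[i]);                /* harvest the finished request */
            }

            produce(buf[i], step);                 /* this calculation overlaps the other write */

            cb[i].aio_fildes = fd;
            cb[i].aio_buf    = buf[i];
            cb[i].aio_nbytes = BUFSZ;
            cb[i].aio_offset = step * (off_t)BUFSZ;
            aio_write(&cb[i]);                     /* queue the write and return at once */
        }
        /* A real program would also drain the last two writes here. */
    }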

J. Giles

P.S. Dan:  I have been on vacation and will get to your email when I
catch up with other things.

seanf@sco.COM (Sean Fagan) (08/28/90)

In article <11576:Aug2503:18:3790@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
>Say a program computes some numbers. Computes them optimally, in fact,
>leaving them in an array. Now it wants to write the array to disk.
>
>If the operating system weren't in the way, the program would simply
>call upon the disk device to copy the data---through DMA, of course---to
>the disk.

Uhm, *where* is it going to put it?

Or does the user program just automagically know which sectors on the disk
are free to write to, and, of course, it's going to update the inode
information properly, and all the directory information?

Having worked on an OS that does asynchronous I/O quite well (NOS, for
those who don't read comp.arch 8-)), I can tell you it's a bit different
from what brnstnd says.

Under NOS, you tell the OS you want to write some data to disk (or tape, or
whatnot); you set a bit in the instruction words (each request needs a block
of data describing the file and the data) indicating whether you wish to
return immediately or not.  You then write the "syscall number" into word 0,
and one of the I/O processors will pick it up and act accordingly.
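
To make the request-block idea concrete, here is a generic sketch of the
sort of descriptor involved; this is an illustration of the pattern, not
NOS's actual control-word layout, and the field names are invented:

    /* Generic request block polled by an I/O processor (invented layout). */
    struct io_request {
        volatile long opcode;     /* the "syscall number" goes in word 0; nonzero = pending */
        int           unit;       /* which file or device the request is for */
        void         *buffer;     /* the data, still in the program's own memory */
        long          nbytes;     /* how much to transfer */
        int           no_wait;    /* the bit: return immediately rather than block */
        volatile int  complete;   /* set by the I/O processor when the transfer finishes */
    };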

Note that unix machines generally don't have I/O processors capable of
this much intelligence.  (Sorry, an 8259 won't cut it.)

-- 
Sean Eric Fagan  | "let's face it, finding yourself dead is one 
seanf@sco.COM    |   of life's more difficult moments."
uunet!sco!seanf  |   -- Mark Leeper, reviewing _Ghost_
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (08/29/90)

In article <1990Aug27.223445.4474@sco.COM> seanf@sco.COM (Sean Fagan) writes:
> In article <11576:Aug2503:18:3790@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> >Say a program computes some numbers. Computes them optimally, in fact,
> >leaving them in an array. Now it wants to write the array to disk.
> >If the operating system weren't in the way, the program would simply
> >call upon the disk device to copy the data---through DMA, of course---to
> >the disk.
> Uhm, *where* is it going to put it?
> Or does the user program just automagically know which sectors on the disk
> are free to write in, and, of course, it's going to up date the inode
> information properly, and all directory information.

Who cares? It can use any disk management scheme it likes. This is
absolutely irrelevant to the issue: the time to update, say, the inodes
is at most a fraction of the time it takes to copy a large block of data
from one place in memory to another.

> Having worked on an OS that does asynchronous I/O quite well (NOS, for
> those who don't read comp.arch 8-)), I can tell you it's a bit different
> from what brnstnd says.

I was describing how asynchronous I/O works when there's no OS in the
way. You start talking about how a particular OS does asynchronous I/O.
``It's a bit different''---``No shit, Sherlock!'' RTFABYFU.

---Dan

les@chinet.chi.il.us (Leslie Mikesell) (08/29/90)

In article <29639:Aug2903:48:4290@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:

>I was describing how asynchronous I/O works when there's no OS in the
>way. You start talking about how a particular OS does asynchronous I/O.

Does this have something to do with unix?  The usual concept of unix I/O
is that something that is written to disk is likely to be read back and
thus should live in the common buffer pool for a while.  

Les Mikesell
  les@chinet.chi.il.us

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (08/31/90)

In article <1990Aug29.155253.18634@chinet.chi.il.us> les@chinet.chi.il.us (Leslie Mikesell) writes:
> Does this have something to do with unix?

Yes. (As A. Lester Buck pointed out to me in e-mail, the functions I
proposed for truly asynchronous I/O are already in POSIX.)
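
For reference, in today's spelling that interface (the POSIX aio_*
calls) looks roughly like this; the draft current in 1990 may name
things differently, so take it as a sketch of the shape: aio_write()
plays the role of the queueing call, aio_error() answers "has it been
written yet?", aio_suspend() is the select()-style wait, and
lio_listio() covers the writev()-style list case.

    #include <aio.h>
    #include <errno.h>
    #include <string.h>
    #include <unistd.h>

    void put(int fd, char *data, size_t n)
    {
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = data;
        cb.aio_nbytes = n;

        aio_write(&cb);                        /* queue the request and return at once */

        if (aio_error(&cb) == EINPROGRESS) {   /* has it been written yet? */
            const struct aiocb *list[1] = { &cb };
            aio_suspend(list, 1, NULL);        /* wait on that status, as with select() */
        }
        (void)aio_return(&cb);                 /* collect the byte count once it's done */
    }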

> The usual concept of unix I/O
> is that something that is written to disk is likely to be read back and
> thus should live in the common buffer pool for a while.  

Ummm, no. The LRU philosophy tells us that something recently *written*
is likely to be *written* in the near future; something recently *read*
is likely to be *read* in the near future. That's the most important
reason that things are kept in buffers: it's stupid to make a one-byte
modification to a block when more bytes are probably on their way. A
second reason is so that writes can be scheduled to avoid disk seeks and
not keep the process waiting while the disk catches up.

Now it is true that pipes, for instance, are likely to be read very soon
after they are written. But pipes aren't inherently disk-based objects,
and if they're implemented purely in memory, asynchronous I/O means that
a pipe read-write takes zero data copies.

---Dan