brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (08/25/90)
In article <1990Aug21.223350.7595@esegue.segue.boston.ma.us> johnl@esegue.segue.boston.ma.us (John R. Levine) writes:
> In article <60345@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
> >From article <126800008@.Prime.COM>, by EAF@.Prime.COM:
> >> If your language I/O library is intelligent and you are reading sequential
> >> data, the language library will call on the OS to read the next disk
> >> block into memory, often before it is required.
> >Not on UNIX it won't. There is no system call for the library to use ...
[ John talks about simple caching schemes ]

I'm afraid Jim is right, though he drastically overestimates the effect of this failure on small machines. Let me explain.

Say a program computes some numbers. Computes them optimally, in fact, leaving them in an array. Now it wants to write the array to disk.

If the operating system weren't in the way, the program would simply call upon the disk device to copy the data---through DMA, of course---to the disk.

Under UNIX, there's at least one big extra step. write(fd,buf,n) first *copies* the data to a buffer inside the kernel's space. This takes CPU time. Do you see now what Jim is complaining about?

Of course, on most machines disk transfer is much slower than CPU transfer, so once you've gotten rid of the disk seek by caching, any further asynchronicity is silly. But Jim works with very fast disks, and a lot of them at once.

mmap() is a partial solution: it does its job well and gets rid of the extra step, but doesn't fit into the ``UNIX model'' as well as it could. How do you use mmap() on a pipe, for example? If two programs are communicating via a pipe, they should be able to write data and read it with *zero* copies in the middle. Under standard UNIX, there are two extra copies at least: one for read() and one for write().

I've proposed a solution: make a call analogous to writev() that uses the iovecs directly. Introduce another call that says whether a particular iovec has been written or not. Also introduce a way to wait on this status, similar to select(). Similarly for reading.

---Dan
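[Editorial illustration, not part of the original post.] The difference between the write() path and the mmap() path described above can be sketched in Python, whose os.write and mmap.mmap are thin wrappers over the corresponding system calls. The function names, the 64-bit record format, and the squares "calculation" are assumptions made for the example. Python adds copies of its own, so this shows only the shape of the two paths, not their cost.

```python
import mmap
import os
import struct

def compute_into(buf, count):
    """Stand-in for the calculation: store i*i as 64-bit ints in buf."""
    for i in range(count):
        struct.pack_into("<Q", buf, 8 * i, i * i)

def save_with_write(path, count):
    """Compute in user memory, then let write() copy it into the kernel."""
    array = bytearray(8 * count)
    compute_into(array, count)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(array)
        while view:                      # write() may accept only part
            view = view[os.write(fd, view):]
    finally:
        os.close(fd)

def save_with_mmap(path, count):
    """Compute straight into the file's mapped pages; no write() call."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.ftruncate(fd, 8 * count)      # the mapping needs a sized file
        with mmap.mmap(fd, 8 * count) as m:
            compute_into(m, count)       # results land in the mapping
            m.flush()
    finally:
        os.close(fd)
```

Both functions leave the same bytes in the file. In the second, the array and the file's pages are the same memory, which is the extra step mmap() removes. It only works because a regular file can be mapped, so it does not help with the pipe case raised above.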
moss@cs.umass.edu (Eliot Moss) (08/26/90)
Such multi-buffered asynchronous I/O is supported by DEC under Ultrix for access to raw disk devices ....

Eliot
--
J. Eliot B. Moss, Assistant Professor
Department of Computer and Information Science
Lederle Graduate Research Center
University of Massachusetts
Amherst, MA 01003
(413) 545-4206; Moss@cs.umass.edu
jlg@lanl.gov (Jim Giles) (08/28/90)
From article <11576:Aug2503:18:3790@kramden.acf.nyu.edu>, by brnstnd@kramden.acf.nyu.edu (Dan Bernstein):
> [...]
> I'm afraid Jim is right, though he drastically overestimates the effect
> of this failure on small machines. Let me explain.

True. But many "small" machines aren't very small any more. This will increasingly be true.

> [...]
> Under UNIX, there's at least one big extra step. write(fd,buf,n) first
> *copies* the data to a buffer inside the kernel's space. This takes CPU
> time. Do you see now what Jim is complaining about?

Takes more than CPU time. If the program is consuming and producing data at (or exceeding) the maximum I/O rate, the buffers will always be busy when I make a request. I will, therefore, be forced to wait. With real asynchronous I/O, I only wait when I need the next I/O to have _completed_, not when I make the request. This means that _none_ of the I/O will overlap user calculation in the UNIX style, while _all_ of the I/O will be overlapped with genuine asynchronous I/O.

J. Giles

P.S. Dan: I have been on vacation and will get to your email when I catch up with other things.
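[Editorial illustration, not part of the original post.] The pattern of issuing a request and waiting only when its completion is actually needed can be sketched with a single worker thread standing in for an asynchronous I/O facility. This is an emulation under stated assumptions, not kernel-level asynchronous I/O: produce_block, write_all, and run_overlapped are hypothetical names, and the block contents are placeholders for a real calculation.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def produce_block(i, size):
    """Stand-in for the user calculation: build one output block."""
    return bytes([i % 256]) * size

def write_all(fd, data):
    """Write the whole buffer, retrying after short writes."""
    view = memoryview(data)
    while view:
        view = view[os.write(fd, view):]

def run_overlapped(fd, nblocks, size):
    """Compute block i while the write of block i-1 is still in flight."""
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = None
        for i in range(nblocks):
            block = produce_block(i, size)   # overlaps the previous write
            if pending is not None:
                pending.result()             # wait for completion only now
            pending = io.submit(write_all, fd, block)  # returns at once
        if pending is not None:
            pending.result()                 # drain the last request
```

The submit call corresponds to making the request, and result() to needing the I/O to have completed. The program blocks only at the second point, so each write can proceed while the next block is being computed.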
seanf@sco.COM (Sean Fagan) (08/28/90)
In article <11576:Aug2503:18:3790@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
>Say a program computes some numbers. Computes them optimally, in fact,
>leaving them in an array. Now it wants to write the array to disk.
>
>If the operating system weren't in the way, the program would simply
>call upon the disk device to copy the data---through DMA, of course---to
>the disk.

Uhm, *where* is it going to put it?

Or does the user program just automagically know which sectors on the disk are free to write in, and, of course, it's going to update the inode information properly, and all directory information.

Having worked on an OS that does asynchronous I/O quite well (NOS, for those who don't read comp.arch 8-)), it's a bit different from what brnstnd says. Under NOS, you tell the OS you want to write some data to disk (or tape, or whatnot); you set a bit in the instruction words (each request needs a block of data describing the file and data) on whether you wish to return immediately or not. You then write the "syscall number" into word 0, and one of the I/O processors will pick it up, and act accordingly.

Note that unix machines generally don't have the I/O processors capable of this much intelligence. (Sorry, an 8259 won't cut it.)

--
Sean Eric Fagan  | "let's face it, finding yourself dead is one
seanf@sco.COM    | of life's more difficult moments."
uunet!sco!seanf  |   -- Mark Leeper, reviewing _Ghost_
(408) 458-1422   | Any opinions expressed are my own, not my employers'.
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (08/29/90)
In article <1990Aug27.223445.4474@sco.COM> seanf@sco.COM (Sean Fagan) writes:
> In article <11576:Aug2503:18:3790@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
> >Say a program computes some numbers. Computes them optimally, in fact,
> >leaving them in an array. Now it wants to write the array to disk.
> >If the operating system weren't in the way, the program would simply
> >call upon the disk device to copy the data---through DMA, of course---to
> >the disk.
> Uhm, *where* is it going to put it?
> Or does the user program just automagically know which sectors on the disk
> are free to write in, and, of course, it's going to update the inode
> information properly, and all directory information.

Who cares? It can use any disk management scheme it likes. This is absolutely irrelevant to the issue: the time to update, say, the inodes is at most a fraction of the time it takes to copy a large block of data from one place in memory to another.

> Having worked on an OS that does asynchronous I/O quite well (NOS, for
> those who don't read comp.arch 8-)), it's a bit different from what brnstnd
> says.

I was describing how asynchronous I/O works when there's no OS in the way. You start talking about how a particular OS does asynchronous I/O. ``It's a bit different''---``No shit, Sherlock!'' RTFABYFU.

---Dan
les@chinet.chi.il.us (Leslie Mikesell) (08/29/90)
In article <29639:Aug2903:48:4290@kramden.acf.nyu.edu> brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
>I was describing how asynchronous I/O works when there's no OS in the
>way. You start talking about how a particular OS does asynchronous I/O.

Does this have something to do with unix? The usual concept of unix I/O is that something that is written to disk is likely to be read back and thus should live in the common buffer pool for a while.

Les Mikesell
les@chinet.chi.il.us
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (08/31/90)
In article <1990Aug29.155253.18634@chinet.chi.il.us> les@chinet.chi.il.us (Leslie Mikesell) writes:
> Does this have something to do with unix?

Yes. (As A. Lester Buck pointed out to me in e-mail, the functions I proposed for truly asynchronous I/O are already in POSIX.)

> The usual concept of unix I/O
> is that something that is written to disk is likely to be read back and
> thus should live in the common buffer pool for a while.

Ummm, no. The LRU philosophy tells us that something recently *written* is likely to be *written* in the near future; something recently *read* is likely to be *read* in the near future. That's the most important reason that things are kept in buffers: it's stupid to make a one-byte modification to a block when more bytes are probably on their way. A second reason is so that writes can be scheduled to avoid disk seeks and not keep the process waiting while the disk catches up.

Now it is true that pipes, for instance, are likely to be read very soon after they are written. But pipes aren't inherently disk-based objects, and if they're implemented purely in memory, asynchronous I/O means that a pipe read-write takes zero data copies.

---Dan
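[Editorial illustration, not part of the original post.] The two reasons given above for buffering writes, absorbing many small modifications to one block and scheduling the eventual disk writes, can be sketched with a toy write-back cache. Everything here is an assumption made for the example: the WriteBackCache class, the 512-byte block size, and a bytearray standing in for the disk (its length is assumed to be a multiple of the block size). It is not a description of any real UNIX buffer cache.

```python
from collections import OrderedDict

BLOCK_SIZE = 512  # assumed block size for the sketch

class WriteBackCache:
    """Toy LRU block cache that delays and coalesces writes."""

    def __init__(self, disk, capacity):
        self.disk = disk              # bytearray standing in for the device
        self.capacity = capacity      # maximum number of cached blocks
        self.blocks = OrderedDict()   # block number -> bytearray, LRU first
        self.dirty = set()            # blocks modified since last flush
        self.disk_writes = 0          # count of whole-block device writes

    def _load(self, n):
        if n in self.blocks:
            self.blocks.move_to_end(n)        # mark as most recently used
        else:
            if len(self.blocks) >= self.capacity:
                self._evict()
            start = n * BLOCK_SIZE
            self.blocks[n] = bytearray(self.disk[start:start + BLOCK_SIZE])
        return self.blocks[n]

    def _flush(self, n):
        if n in self.dirty:
            start = n * BLOCK_SIZE
            self.disk[start:start + BLOCK_SIZE] = self.blocks[n]
            self.dirty.discard(n)
            self.disk_writes += 1

    def _evict(self):
        n = next(iter(self.blocks))           # least recently used block
        self._flush(n)
        del self.blocks[n]

    def write_byte(self, offset, value):
        n, i = divmod(offset, BLOCK_SIZE)
        self._load(n)[i] = value              # modify the cached copy only
        self.dirty.add(n)

    def sync(self):
        for n in sorted(self.dirty):          # ascending order limits seeks
            self._flush(n)
```

With this sketch, a hundred one-byte writes that fall in the same block cost a single device write at sync() time, and dirty blocks go out in ascending block order rather than in the order the program happened to touch them.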