lewine@dg-rtp.dg.com (Donald Lewine) (08/29/90)
In article <27619@nuchat.UUCP>, steve@nuchat.UUCP (Steve Nuchia) writes:
|> (For completeness I will note that to use this scheme intelligently
|> you must be able to discover the relevant properties of the memory
|> management implementation.  This is nothing new for high-performance
|> programs in a paged environment, but unless it's been added recently
|> there isn't a standard way to do it.  Whether this is properly a
|> language or a system interface issue is best left to another debate.)

That last remark defeats your entire suggestion.  If I have to
"discover the relevant properties of the memory management
implementation", all dusty decks will fail.

If you blindly map the page(s) containing "buf" out of the user's
address space, you will map out other variables that the user may
want.  It is not possible for the compiler to know that buf must be
on a page by itself.  How could you implement your scheme?

Also, the read() and write() functions return the number of characters
read or written.  How do you know this before the read() or write()
completes?  Do you assume that all disks are error-free and never
fail?  That is a poor assumption!

I don't think that your idea works at all.  For a scheme that does
almost exactly this, but with the cooperation of the user program,
look at PMAP under DEC's TOPS-20 operating system or ?SPAGE under
DG's AOS/VS.
--------------------------------------------------------------------
Donald A. Lewine                 (508) 870-9008
Data General Corporation         (508) 366-0750 FAX
4400 Computer Drive. MS D112A
Westboro, MA 01580  U.S.A.
uucp: uunet!dg!lewine   Internet: lewine@cheshirecat.webo.dg.com
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (08/31/90)
In article <861@dg.dg.com> uunet!dg!lewine writes:
> In article <27619@nuchat.UUCP>, steve@nuchat.UUCP (Steve Nuchia) writes:
> > (For completeness I will note that to use this scheme intelligently
> > you must be able to discover the relevant properties of the memory
> > management implementation.
> That last remark defeats your entire suggestion.  If I have to
> "discover the relevant properties of the memory management
> implementation", all dusty decks will fail.

No.  Steve's point was that on paged architectures, he can get a
low-cost speedup out of some programs without any change in
semantics.  This is a worthwhile change.  Discovering memory
management can be as simple as having a system call getpage() that
returns a char buffer taking up exactly one page.  Any code that
understands this can take full advantage of asynchronous I/O.

> If you blindly map the page(s) containing "buf" out of the users
> address space you will map out other variables that the user may
> want.  It is not possible for the compiler to know that buf must
> be on a page by itself.  How could you implement your scheme?

So what if there are other variables on the page?  The worst that
happens is that the page gets mapped out and then back in; on paged
hardware, this cost is negligible.  The best that happens is that the
program uses getpage() and guarantees that it will wait for the I/O
to finish on that page.

> Also the read() and write() functions return the number of characters
> read or written.  How do you know this before the read() or write()
> completes?  Do you assume that all disks are error free and never
> fail?  That is a poor assumption!

So what?  read() and write() already return before data gets written
out to disk; assuming that you see all I/O errors before a sync is
the poor assumption!  This is irrelevant to the issue at hand.

---Dan
stripes@eng.umd.edu (Joshua Osborne) (08/31/90)
In article <61535@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>From article <27619@nuchat.UUCP>, by steve@nuchat.UUCP (Steve Nuchia):
[...]
>> Have read(2) and write(2) calls map the pages containing the buffers
>> out of the user address space and return immediately.  Once the
>> data have been copied (DMAed?) to/from the buffers, map the pages back in.
>> [...]
>
>Yes, this will work.  I believe that MACH already does this.
>Unfortunately, this idea has two problems: 1) [omitted]
>2) not all I/O requests are a multiple of the pagesize.
>The pagesize problem just means that you'd have to map out more
>memory than is actually involved in the I/O request.  This means
>that the user might get blocked on memory that is really perfectly
>safe to access - a minor source of slowdown.

It shouldn't be a source of slowdown in the read case; normally the
program would not get control until after the read was done, so the
new worst case is exactly the same.  However, for writes the old best
case is far better than the new worst case, and it could happen
relatively often.  The old best case: you do a large write, the
kernel copies the data, lets you run, and writes it out some time
later; you re-use the buffer.  If you do that under the new system:
you do the large write, the kernel maps out the pages and gives you
control, you re-use the buffer, and the kernel makes you sleep until
it can do the write.  You lose out.  A lot of programs do this.
Currently stdio does this.

Of course stdio would need a bit of tweaking anyway (align page-sized
buffers on page boundaries).  While we are in there we could make
writes use 2 page buffers, and flush alternate ones... and do huge
writes directly out of the user's space.  (Just things to think
about; I do like the idea...)
--
stripes@eng.umd.edu           "Security for Unix is like
Josh_Osborne@Real_World,The    Multitasking for MS-DOS"
"The dyslexic programmer"        - Kevin Lockwood
"Isn't that a shell script?" - D. MacKenzie
"Yeah, kinda sticks out like a sore thumb in the middle of a kernel"
                             - K. Lidl
peter@ficc.ferranti.com (Peter da Silva) (08/31/90)
In article <861@dg.dg.com> uunet!dg!lewine writes:
> That last remark defeats your entire suggestion.  If I have to
> "discover the relevant properties of the memory management
> implementation", all dusty decks will fail.

You only have to do this if you want to get some additional
performance.  It should still work regardless.

> Also the read() and write() functions return the number of characters
> read or written.  How do you know this before the read() or write()
> completes?  Do you assume that all disks are error free and never
> fail?  That is a poor assumption!

Well, it's an assumption made for write() anyway.  For read() you can
just treat it like any disk error on any other faulted-in page and
blow off the process.  Disk errors are a very rare occurrence, and
almost always require human intervention anyway.  Any other return
value for read is known ahead of time.
--
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
peter@ferranti.com
pplacewa@bbn.com (Paul W Placeway) (09/01/90)
peter@ficc.ferranti.com (Peter da Silva) writes:
< ... Disk errors are a very rare occurrence, and almost always require
< human intervention anyway. Any other return value for read is known
< ahead of time.
Unless your "disk" is RFS and the remote machine crashes, or soft
mounted NFS and any one of about a zillion things happens...
-- Paul Placeway
eliot@dg-rtp.dg.com (Topher Eliot) (09/01/90)
In article <12023:Aug3017:24:1590@kramden.acf.nyu.edu>, brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
|>
|> > Also the read() and write() functions return the number of characters
|> > read or written.  How do you know this before the read() or write()
|> > completes?  Do you assume that all disks are error free and never
|> > fail?  That is a poor assumption!
|>
|> So what?  read() and write() already return before data gets written out
|> to disk; assuming that you see all I/O errors before a sync is the poor
|> assumption!  This is irrelevant to the issue at hand.

I think this point very well covers the case of writing: if you want
to be sure it really got to disk, you need to do an fsync(), and even
then I'm not sure you can be sure (doesn't the fsync just *mark* the
buffers for writing out?).  We could certainly arrange for fsync to
block until everything is really on disk.

But consider the reading case.  Here, we could tell the process how
many bytes it had "gotten" (was going to get), even if it was less
than the process had requested (presumably the kernel knows how big
the file is, without having to read all those bytes off disk).  The
application then might do something *other than examining the bytes
it just "read"* based on this knowledge of a "successful read".  If
the disk then fails (or the net fails, or whatever), the application
would have acted incorrectly.  Moreover, instead of the application
learning about the failure by getting -1 back from a read call, it
will learn about it by receiving a signal or some such.

So, can anyone think of an application that behaves in this manner
(i.e. acts upon the return value from a read by doing something
important, that does not involve examining the read buffer)?  I
can't.  Perhaps more significant is the issue of the application not
getting a -1 back from the read call.
--
Topher Eliot
Data General Corporation              eliot@dg-rtp.dg.com
62 T. W. Alexander Drive              {backbone}!mcnc!rti!dg-rtp!eliot
Research Triangle Park, NC 27709      (919) 248-6371
Obviously, I speak for myself, not for DG.
hunt@dg-rtp.dg.com (Greg Hunt) (09/01/90)
In article <1990Aug31.190751.12522@dg-rtp.dg.com>, eliot@dg-rtp.dg.com (Topher Eliot) writes:
>
> But consider the reading case.  Here, we could tell the process how
> many bytes it had "gotten" (was going to get), even if it was less
> than the process had requested (presumably the kernel knows how big
> the file is, without having to read all those bytes off disk).  The
> application then might do something *other than examining the bytes
> it just "read"* based on this knowledge of a "successful read".  If
> the disk then fails (or the net fails, or whatever), the application
> would have acted incorrectly.  Moreover, instead of the application
> learning about the failure by getting -1 back from a read call, it
> will learn about it by receiving a signal or some such.
>
> So, can anyone think of an application that behaves in this manner
> (i.e. acts upon the return value from a read by doing something
> important, that does not involve examining the read buffer)?  I can't.
> Perhaps more significant is the issue of the application not getting a
> -1 back from the read call.
>
> Topher Eliot

Yes, I can.  Under Data General's AOS/VS OS, I wrote a program that
read blocks from tape drives and checked the sizes of the blocks
read.  The program was for verifying that labeled backup tapes were
physically readable.

The header label on the tape contained buffersize information which
the program used to read the data blocks.  As each block was read,
the size of the read returned by the OS was checked against the
buffersize to ensure that full buffer reads were done.  It also
counted the number of blocks read.  It discarded the contents of the
read buffer without looking at them.

The trailer label on the tape contained block count information that
was written by the OS.  The OS's block count was compared against the
block count seen by the program.  All of these checks were only to
ensure that the tape could be physically read.  Using it eliminated
ALL bad backup tapes that I was encountering.  Sometimes I found that
I could write a tape, but not read it again.  Tapes that could not be
verified by this program were discarded.

The program did nothing to ensure that the tape could be logically
read by the load program used to restore files, so it did nothing to
guard against bugs in the dump/load programs themselves.

I have yet to port the program to DG/UX to verify UNIX backup tapes
in a similar manner.  I believe the program could be made to serve a
similar purpose, but I'd probably have to change the header/trailer
handling since AOS/VS uses ANSI standard tape labels and I don't
think that UNIX does.

Does this example meet the behavior that you were wondering about?
It may be a specialized use of the results of a read and not be
representative of what applications-level software does.
--
Greg Hunt                        Internet: hunt@dg-rtp.dg.com
DG/UX Kernel Development         UUCP:     {world}!mcnc!rti!dg-rtp!hunt
Data General Corporation
Research Triangle Park, NC       These opinions are mine, not DG's.
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (09/01/90)
In article <1990Aug31.190751.12522@dg-rtp.dg.com> eliot@dg-rtp.dg.com writes:
> I think this point very well covers the case of writing: if you want
> to be sure it really got to disk, you need to do an fsync(), and even
> then I'm not sure you can be sure (doesn't the fsync just *mark* the
> buffers for writing out?).  We could certainly arrange for fsync to
> block until everything is really on disk.

fsync() will certainly do that, independently of this mechanism.
(It's sync() that just marks buffers for writing.  BSD's fsync()
truly writes the data to disk, giving the transaction control you
need for reliable databases.  I have no idea what you poor System V
folks do.)

> But consider the reading case.  [ what happens upon failure? ]

As Peter pointed out, this case is fatal.  How many disk errors have
you had over the last year?  How many did the programs involved
recover from?  Yeah, thought so.

I guess you're right in principle: Steve's proposal is only
completely transparent for writing (which is the more important case
anyway).

---Dan
bzs@world.std.com (Barry Shein) (09/02/90)
Having somehow missed the fact that comp.unix.wizards has, um,
changed status (what happened anyhow?  I still don't know), I'll
repeat here a query I just made there:

Does anyone have any papers or references on performance benchmarking
of regular vs. asynch I/O under UNIX (not other OS's; it's not
applicable)?  Failing that, does anyone have any informal results?
Anything?
--
        -Barry Shein

Software Tool & Die    | {xylogics,uunet}!world!bzs | bzs@world.std.com
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD
buck@siswat.UUCP (A. Lester Buck) (09/02/90)
In article <1990Aug30.222226.20866@cbnewsm.att.com> lfd@cbnewsm.att.com (leland.f.derbenwick) writes:
>In essentially any serious database application, a completed
>write() to a raw disk is treated as a guarantee that the data
>block has been _physically written to the device_.  (This is
>needed to ensure reliable transaction behavior in the presence
>of potential system crashes.)  Since your suggestion would void
>that guarantee, it is not benign.

Close, but not quite.  The guarantee is that the _controller_ has
accepted the data.  If/when the bits actually hit the media is not
fully under the control of the OS.  Remember SCSI has a READ BUFFERED
DATA command for error recovery.  SCSI disks are coming with bigger
caches all the time, and a power hit can take out a significant
amount of data.  If the database really must remain consistent, a UPS
is probably required.

As to Steve's idea, it has a certain elegance to recommend it.  But
its practical value is low.  Sure, it can be made to have full Unix
semantics, but at the price of the common case reducing almost
exactly to synchronous I/O.  Or imagine the case of an I/O server
process sharing memory with dozens of clients.  Each shared memory
segment will have to keep a list of every process that must block on
a page fault.  The practical effect will be that an _arbitrary_
number of processes will potentially block for every I/O, instead of
doing useful work in their own address spaces.

This scheme falls into the general class of YANSUAIOM (Yet Another
Non-Standard Unix Asynchronous I/O Mechanism), as do the schemes with
ioctl's or select'ing on disk.

What may be difficult to understand at this point, when Unix has not
had a standard asynchronous I/O facility, is that we will program
_differently_ when it is widely available.  The semantics of I/O must
change (broaden).  The structure and flow of a program will be
significantly different when it uses asynchronous I/O, in the same
way that the availability of real threads leads to new programming
paradigms to take advantage of those facilities.  We may have to look
at schemes used in the realtime Unix versions, VMS (gag), or even MVS
(gag!!), which have had asynchronous I/O facilities for up to
decades, to adapt to this new mindset.

The only reason one designs an asynchronous I/O facility is to
efficiently overlap computation with I/O transfers, and that can take
some careful thought to achieve maximum speedup.  For example, Chris
Torek recently traced the path of a raw synchronous I/O, which
eventually sleeps in physio() in the context of the calling process.
A large transfer will loop through physio(), with a wakeup/sleep
cycle for every chunk (limited by how much physical memory the OS
wants to lock down at once).  Each sleep/wakeup cycle is an expensive
context switch, involving reloading the virtual memory state of the
caller.  But a fully asynchronous I/O scheme drags along enough state
to start the next I/O chunk entirely within the driver interrupt
routine, with the calling process completely out of context.  Of
course, it is a bit(!) more complicated if non-resident pages are
found in the next chunk that needs to be page-locked...

The POSIX.4 asynchronous I/O facilities are moving toward final
ballot and present a rich set of asynchronous I/O primitives.  These
include the obvious aread/awrite, and listio, similar to readv/writev
for synchronous transfers, which can fire off a large number of aio's
at once and optionally be notified only when they are all complete.
Iosuspend is a more advanced version of select that waits for
completion of any operations in a list.

The process can learn of I/O completion in at least four ways:
1) return codes written into the process' asynchronous I/O control
block, 2) receiving a completely asynchronous "fixed" (queued,
tagged) signal/event which runs a handler, 3) synchronously
suspending for I/O completion (iosuspend), or 4) synchronously
suspending or polling for the signal/event posting I/O completion.
[Suspending is familiar, but the committee added polling, where a
process can sleep until one of a selected signal/event class is
posted while taking signal/events not being polled for completely
asynchronously.]
--
A. Lester Buck     buck@siswat.lonestar.org  ...!uhnix1!lobster!siswat!buck
steve@nuchat.UUCP (Steve Nuchia) (09/03/90)
In article <555@siswat.UUCP> buck@siswat.UUCP (A. Lester Buck) writes:
>As to Steve's idea, it has a certain elegance to recommend it.  But its
>practical value is low.  Sure, it can be made to have full Unix semantics,

Hmm.  Whatever happened to aesthetic value having some measure of
parity with "practical" value?

To update everyone on my thinking, there have been two serious (in my
view) objections raised.  The fact that at the extrema of system
performance requirements my assumption of a paging environment is
false is the most damaging, and probably is sufficient to relegate
the whole idea to the scrap-heap.  Secondly, the cases in which the
return value (especially for read) cannot be predicted productively,
though few, are important enough to cast serious doubt on the
sufficiency of my scheme.

>What may be difficult to understand at this point, when Unix has not had a
>standard asynchronous I/O facility, is that we will program _differently_

An excellent point, one we would all do well to keep in mind.  I
would have added to Lester's list of examples the event-driven style
imposed by modern user interface construction.

>The POSIX.4 asynchronous I/O facilities are moving toward final ballot and
>present a rich set of asynchronous I/O primitives.  These include the

It is precisely this "rich set" of "primitives" (!!!) that I am
striving to avoid.  If one may, as you suggest, learn something by
examining MVS and VMS and all those other UPPER CASE operating
systems, one of those things should be that a proliferation of system
interface mechanisms (calls, whatever) is Not Good.

>The process can learn of I/O completion in at least four ways: 1) return
>codes written into the process' asynchronous I/O control block, 2) receiving
>a completely asynchronous "fixed" (queued, tagged) signal/event which runs a
>handler, 3) synchronously suspending for I/O completion (iosuspend), or 4)
>synchronously suspending or polling for the signal/event posting I/O
>completion.  [Suspending is familiar, but the committee added polling, where

AAAARRRRRRRGGGGHHHH!!!!!

The time has come to wander in the desert for a while, I think.

For what it's worth, I've solidified my thinking on the completion
discovery mechanism I hinted at in my original posting.  It is a
general-purpose VM interface call with the following properties:

    int ret = vm_avail ( base, len, mode )
    char *base;   /* user virtual address */
    int  len;     /* extent of block in bytes */
    int  mode;    /* read or write or both */

    ret = number of contiguous bytes, starting at base, available
          for the specified access mode(s) without faulting.

As a side effect, every page in the block is "scheduled" to become
available, as if an access attempt had been made to it.  This may
involve faulting a page in from disk, making a copy-on-write copy, or
whatever.  vm_avail will not block, but it may take an arbitrary
amount of time (linear in len), e.g. for zero-fill or block copying.

This could be used, if one implemented my original asynch-I/O scheme,
to check for (incremental!) read/write completion.  It could also be
used to read ahead a memory-mapped file.  A real-time program that
had some portions of its image pageable could also use it to avoid
taking page faults.

Aren't I just full of it?  :-)
--
Steve Nuchia          South Coast Computing Services      (713) 964-2462
"To learn which questions are unanswerable, and _not_to_answer_them;
this skill is most needful in times of stress and darkness."
                Ursula LeGuin, _The_Left_Hand_of_Darkness_
steve@nuchat.UUCP (Steve Nuchia) (09/03/90)
In article <BZS.90Aug31173255@world.std.com> bzs@world.std.com (Barry Shein) writes:
>Do there exist any benchmark or other test results which indicate that
>adding asynch i/o to unix actually yields a performance improvement?

There is little question that it can dramatically speed up selected
applications.  Applications, mostly in the backup arena (afio, ddd),
get substantially improved throughput even using really stupid
mechanisms.  In afio's case the mechanism is to do a fork for every
write, and it is still a lot faster than running without the feature
enabled.

One poster mentioned as fact that the point behind asynch I/O was to
overlap computation and I/O.  At least in the backup arena the point
is to overlap some I/O (tape) with other I/O (disk).  In this case it
should be clear that the overall system throughput is not damaged by
the presence of an application using (well-implemented) asynch I/O if
the remaining job mix is CPU intensive.

Reference: Winter 88 Usenix, "A Faster UNIX Dump Program", Jeff Polk
and Rob Kolstad (both then of CONVEX).
--
Steve Nuchia          South Coast Computing Services      (713) 964-2462
"To learn which questions are unanswerable, and _not_to_answer_them;
this skill is most needful in times of stress and darkness."
                Ursula LeGuin, _The_Left_Hand_of_Darkness_
schwartz@groucho.cs.psu.edu (Scott Schwartz) (09/03/90)
In article <27813@nuchat.UUCP> steve@nuchat.UUCP (Steve Nuchia) writes:
| >The POSIX.4 asynchronous I/O facilities are moving toward final ballot and
| >present a rich set of asynchronous I/O primitives.  These include the
|
| It is precisely this "rich set" of "primitives" (!!!) that I am
| striving to avoid.

I was just thinking the same thing.  Isn't it the case that
lightweight processes (Mach-style threads, say) with shared memory
for communication solve the asynch-I/O problem?  I'd prefer that to a
new set of async-I/O routines, I think.
bzs@world.std.com (Barry Shein) (09/03/90)
From: steve@nuchat.UUCP (Steve Nuchia)

[responding to my note]
>>Do there exist any benchmark or other test results which indicate that
>>adding asynch i/o to unix actually yields a performance improvement?
>
>There is little question that it can dramatically speed up selected
>applications.

I guess that's a way of saying "no, I don't know of any results..."

>Applications, mostly in the backup arena (afio, ddd)
>get substantially improved throughput even using really stupid mechanisms.

Backups almost always use raw I/O; how does this affect this
observation?  Is it worth doing for filesystem (disk) I/O?  Why are
these mechanisms "stupid"?  Are you sure the same speed-up would be
seen with async I/O?

Are there any other applications besides intensive disk-to-tape work
that would benefit from this, or might this be a singular example?
--
        -Barry Shein

Software Tool & Die    | {xylogics,uunet}!world!bzs | bzs@world.std.com
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD
peter@ficc.ferranti.com (Peter da Silva) (09/05/90)
In article <1990Aug31.190751.12522@dg-rtp.dg.com> eliot@dg-rtp.dg.com writes:
> So, can anyone think of an application that behaves in this manner
> (i.e. acts upon the return value from a read by doing something
> important, that does not involve examining the read buffer)?  I can't.

Any program that reads a whole file at a time, like GNU Emacs or TCL.
--
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
peter@ferranti.com
peter@ficc.ferranti.com (Peter da Silva) (09/06/90)
In article <27813@nuchat.UUCP> steve@nuchat.UUCP (Steve Nuchia) writes:
> An excellent point, one we would all do well to keep in mind.  I would
> have added to Lester's list of examples the event-driven style imposed
> by modern user interface construction.
     ^^^^^^

Event loops are basically single-loop control systems, such as are
found in the simplest of embedded controllers: microwave ovens, for
example.  For them to have become synonymous with modern user
interfaces borders on the obscene.

The best way to implement a modern user interface is with multiple
loops of control, such as Haitex's spreadsheet "Haicalc" on the
Commodore Amiga... or, for a workstation environment, in NeWS.

> AAAARRRRRRRGGGGHHHH!!!!!

Sympathy.
--
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
peter@ferranti.com
peter@ficc.ferranti.com (Peter da Silva) (09/06/90)
In article <Fyu&?$-1@cs.psu.edu> schwartz@groucho.cs.psu.edu (Scott Schwartz) writes:
> I was just thinking the same thing.  Isn't it the case that
> lightweight processes (mach style threads, say) with shared memory for
> communication solve the asynch-io problem?  I'd prefer that to a new
> set of async-io routines, I think.

If you have asynchronous I/O, you can implement threads, and vice
versa.  A simple asynch I/O mechanism, combined with user-mode
threads, is quite adequate.  The complexities involved in POSIX
1003.4 are slightly (but only slightly) overkill.
--
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
peter@ferranti.com