[comp.unix.internals] UNIX semantics do permit full support for asynchronous I/O

lewine@dg-rtp.dg.com (Donald Lewine) (08/29/90)

In article <27619@nuchat.UUCP>, steve@nuchat.UUCP (Steve Nuchia) writes:
|> (For completeness I will note that to use this scheme intelligently
|> you must be able to discover the relevant properties of the memory
|> management implementation.  This is nothing new for high performance
|> programs in a paged environment, but unless its been added recently
|> there isn't a standard way to do it.  Whether this is properly a
|> language or a system interface issue is best left to another debate.)
|> -- 

That last remark defeats your entire suggestion.  If I have to
"discover the relevant properties of the memory management 
implementation", all dusty decks will fail.  

If you blindly map the page(s) containing "buf" out of the users
address space you will map out other variables that the user may
want.  It is not possible for the compiler to know that buf must
be on a page by itself.  How could you implement your scheme?

Also the read() and write() functions return the number of characters
read or written.  How do you know this before the read() or write()
completes?  Do you assume that all disks are error free and never
fail?  That is a poor assumption!

I don't think that your idea works at all.  For a scheme that does
almost exactly this, but with the cooperation of the user program,
look at PMAP under DEC's TOPS-20 operating system or ?SPAGE under
DG's AOS/VS.

--------------------------------------------------------------------
Donald A. Lewine                (508) 870-9008
Data General Corporation        (508) 366-0750 FAX
4400 Computer Drive, MS D112A
Westboro, MA 01580  U.S.A.

uucp: uunet!dg!lewine   Internet: lewine@cheshirecat.webo.dg.com

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (08/31/90)

In article <861@dg.dg.com> uunet!dg!lewine writes:
> In article <27619@nuchat.UUCP>, steve@nuchat.UUCP (Steve Nuchia) writes:
> > (For completeness I will note that to use this scheme intelligently
> > you must be able to discover the relevant properties of the memory
> > management implementation.
> That last remark defeats your entire suggestion.  If I have to
> "discover the relevant properties of the memory management 
> implementation", all dusty decks will fail.  

No. Steve's point was that on paged architectures, he can get a low-cost
speedup out of some programs without any change in semantics. This is a
worthwhile change.

Discovering memory management can be as simple as having a system call
getpage() that returns a char buffer taking up exactly one page. Any
code that understands this can take full advantage of asynchronous I/O.
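
In today's terms the hypothetical getpage() is easy to approximate in user
space; a minimal sketch (getpage is the made-up name from above, and
posix_memalign/sysconf are modern calls, not part of the original proposal):

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical getpage(): hand back a buffer occupying exactly one
   page, so the kernel could map it out during I/O without taking any
   unrelated variables along with it.  Approximated in user space;
   a real implementation would be a system call. */
char *getpage(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    void *p;

    if (posix_memalign(&p, (size_t)pagesize, (size_t)pagesize) != 0)
        return NULL;            /* out of memory */
    return p;                   /* page-aligned, exactly one page long */
}
```

Any buffer obtained this way shares its page with nothing else, which is the
whole point: mapping it out during I/O cannot block the program on an
unrelated variable.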

> If you blindly map the page(s) containing "buf" out of the users
> address space you will map out other variables that the user may
> want.  It is not possible for the compiler to know that buf must
> be on a page by itself.  How could you implement your scheme?

So what if there are other variables on the page? The worst that happens
is that the page gets mapped out and then back in; on paged hardware,
this cost is negligible. The best that happens is that the program uses
getpage() and guarantees that it will wait for the I/O to finish on that
page.

> Also the read() and write() functions return the number of characters
> read or written.  How do you know this before the read() or write()
> completes?  Do you assume that all disks are error free and never
> fail?  That is a poor assumption!

So what? read() and write() already return before data gets written out
to disk; assuming that you see all I/O errors before a sync is the poor
assumption! This is irrelevant to the issue at hand.

---Dan

stripes@eng.umd.edu (Joshua Osborne) (08/31/90)

In article <61535@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>From article <27619@nuchat.UUCP>, by steve@nuchat.UUCP (Steve Nuchia):
[...]
>> Have read(2) and write(2) calls map the pages containing the buffers
>> out of the user address space and return immediately.  Once the
>> data have been copied (DMAed?) to/from the buffers, map the pages back in.
>> [...]
>
>Yes, this will work.  I believe that MACH already does this.
>Unfortunately, this idea has two problems: 1) [omitted]
>2) not all I/O requests are a multiple of the pagesize.
>The pagesize problem
>just means that you'd have to map out more memory than is actually
>involved in the I/O request.  This means that the user might get
>blocked on memory that is really perfectly safe to access - a minor
>source of slowdown.

It shouldn't be a source of slowdown in the read case; normally the program
would not get control until after the read was done, so the new worst case is
exactly the same.  However, for writes the old best case is far better than
the new worst case, and it could happen relatively often.

The old best case: you do a large write, the kernel copies the data, lets
you run, and writes it out some time later; you re-use the buffer.

If you do that under the new system: you do the large write, the kernel maps
out the pages and gives you control, you re-use the buffer, and the kernel
makes you sleep until it can do the write.  You lose out.  A lot of programs
do this.  Currently stdio does this.  Of course stdio would need a bit of
tweaking anyway (align page-sized buffers on page boundaries).  While we are
in there we could make writes use 2-page buffers and flush alternate ones...
and do huge writes directly out of the user's space.
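
The 2-page-buffer idea can be sketched as a copy loop that alternates
between two page-aligned buffers; plain write(2) stands in for the
asynchronous write here, and double_buffered_copy is an illustrative name:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NBUF 2

/* Alternate between two page-aligned buffers: with real asynchronous
   writes, one buffer would be "in flight" while the other fills, so
   re-using a buffer never blocks on a page the kernel mapped out. */
int double_buffered_copy(int infd, int outfd, size_t pagesize)
{
    char *buf[NBUF];
    int cur = 0;
    ssize_t n;

    for (int i = 0; i < NBUF; i++)
        if (posix_memalign((void **)&buf[i], pagesize, pagesize) != 0)
            return -1;

    while ((n = read(infd, buf[cur], pagesize)) > 0) {
        if (write(outfd, buf[cur], (size_t)n) != n)
            return -1;
        cur = (cur + 1) % NBUF;     /* hand off; refill the other buffer */
    }
    for (int i = 0; i < NBUF; i++)
        free(buf[i]);
    return n == 0 ? 0 : -1;
}
```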

(just things to think about, I do like the idea...)
-- 
           stripes@eng.umd.edu          "Security for Unix is like
      Josh_Osborne@Real_World,The          Mutitasking for MS-DOS"
      "The dyslexic porgramer"                  - Kevin Lockwood
"Isn't that a shell script?"                                    - D. MacKenzie
"Yeah, kinda sticks out like a sore thumb in the middle of a kernel" - K. Lidl

peter@ficc.ferranti.com (Peter da Silva) (08/31/90)

In article <861@dg.dg.com> uunet!dg!lewine writes:
> That last remark defeats your entire suggestion.  If I have to
> "discover the relevant properties of the memory management 
> implementation", all dusty decks will fail.  

You only have to do this if you want to get some additional
performance. It should still work regardless.

> Also the read() and write() functions return the number of characters
> read or written.  How do you know this before the read() or write()
> completes?  Do you assume that all disks are error free and never
> fail?  That is a poor assumption!

Well, it's an assumption made for write() anyway. For read() you can just
treat it like any disk error on any other faulted-in page and blow off the
process. Disk errors are a very rare occurrence, and almost always require
human intervention anyway. Any other return value for read is known
ahead of time.
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
peter@ferranti.com

pplacewa@bbn.com (Paul W Placeway) (09/01/90)

peter@ficc.ferranti.com (Peter da Silva) writes:

< ... Disk errors are a very rare occurrence, and almost always require
< human intervention anyway. Any other return value for read is known
< ahead of time.

Unless your "disk" is RFS and the remote machine crashes, or soft
mounted NFS and any one of about a zillion things happens...


		-- Paul Placeway

eliot@dg-rtp.dg.com (Topher Eliot) (09/01/90)

In article <12023:Aug3017:24:1590@kramden.acf.nyu.edu>,
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:
|> 
|> > Also the read() and write() functions return the number of characters
|> > read or written.  How do you know this before the read() or write()
|> > completes?  Do you assume that all disks are error free and never
|> > fail?  That is a poor assumption!
|> 
|> So what? read() and write() already return before data gets written out
|> to disk; assuming that you see all I/O errors before a sync is the poor
|> assumption! This is irrelevant to the issue at hand.

I think this point very well covers the case of writing:  if you want
to be sure it really got to disk, you need to do an fsync(), and even
then I'm not sure you can be sure (doesn't the fsync just *mark* the
buffers for writing out?).  We could certainly arrange for fsync to
block until everything is really on disk.
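
The arrangement described, fsync blocking until everything is really on
disk, is how BSD-style fsync(2) behaves; a minimal sketch, with
durable_write() as a made-up name:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write, then force the data to stable storage before reporting
   success; either failure means the transaction did not happen. */
int durable_write(int fd, const void *buf, size_t len)
{
    if (write(fd, buf, len) != (ssize_t)len)
        return -1;      /* data may not even have reached the cache */
    if (fsync(fd) != 0)
        return -1;      /* data not guaranteed on stable storage */
    return 0;           /* fsync blocked until the blocks were on disk */
}
```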

But consider the reading case.  Here, we could tell the process how
many bytes it had "gotten" (was going to get), even if it was less
than the process had requested (presumably the kernel knows how big
the file is, without having to read all those bytes off disk).  The
application then might do something *other than examining the bytes
it just "read"* based on this knowledge of a "successful read".  If
the disk then fails (or the net fails, or whatever), the application
would have acted incorrectly.  Moreover, instead of the application
learning about the failure by getting -1 back from a read call, it
will learn about it by receiving a signal or some such.

So, can anyone think of an application that behaves in this manner
(i.e.  acts upon the return value from a read by doing something
important, that does not involve examining the read buffer)?  I can't.
Perhaps more significant is the issue of the application not getting a
-1 back from the read call.

--
Topher Eliot
Data General Corporation                eliot@dg-rtp.dg.com
62 T. W. Alexander Drive               
{backbone}!mcnc!rti!dg-rtp!eliot
Research Triangle Park, NC 27709        (919) 248-6371
Obviously, I speak for myself, not for DG.

hunt@dg-rtp.dg.com (Greg Hunt) (09/01/90)

In article <1990Aug31.190751.12522@dg-rtp.dg.com>, eliot@dg-rtp.dg.com
(Topher Eliot) writes:
>
> But consider the reading case.  Here, we could tell the process how
> many bytes it had "gotten" (was going to get), even if it was less
> than the process had requested (presumably the kernel knows how big
> the file is, without having to read all those bytes off disk).  The
> application then might do something *other than examining the bytes
> it just "read"* based on this knowledge of a "successful read".  If
> the disk then fails (or the net fails, or whatever), the application
> would have acted incorrectly.  Moreover, instead of the application
> learning about the failure by getting -1 back from a read call, it
> will learn about it by receiving a signal or some such.
> 
> So, can anyone think of an application that behaves in this manner
> (i.e.  acts upon the return value from a read by doing something
> important, that does not involve examining the read buffer)?  I can't.
> Perhaps more significant is the issue of the application not getting a
> -1 back from the read call.
> 
> Topher Eliot
>

Yes, I can.  Under Data General's AOS/VS OS, I wrote a program that
read blocks from tape drives and checked the sizes of the blocks read.
The program was for verifying that labeled backup tapes were
physically readable.

The header label on the tape contained buffersize information which
the program used to read the data blocks.  As each block was read,
the size of the read returned by the OS was checked against the
buffersize to ensure that full buffer reads were done.  It also
counted the number of blocks read.  It discarded the contents of the
read buffer without looking at them.

The trailer label on the tape contained block count information that
was written by the OS.  The OS's block count was compared against the
block count seen by the program.
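
In UNIX terms the verification loop described above might look like the
following; verify_blocks() is a hypothetical name, and the buffer contents
are deliberately never examined:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Read fixed-size blocks, insisting each read returns a full block.
   Returns the number of full blocks read, or -1 if a short or failed
   read occurs before end-of-medium. */
long verify_blocks(int fd, char *buf, size_t blocksize)
{
    long nblocks = 0;
    ssize_t n;

    while ((n = read(fd, buf, blocksize)) > 0) {
        if ((size_t)n != blocksize)
            return -1;          /* short read: physically unreadable */
        nblocks++;              /* contents deliberately ignored */
    }
    return n == 0 ? nblocks : -1;
}
```

The caller would then compare the returned block count against the count
recorded in the trailer label, exactly as the AOS/VS program did.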

All of these checks were only to ensure that the tape could be
physically read.  Using it eliminated ALL bad backup tapes that I was
encountering.  Sometimes I found that I could write a tape, but not
read it again.  Tapes that could not be verified by this program were
discarded.  The program did nothing to ensure that the tape could
be logically read by the load program used to restore files, so it
did nothing to guard against bugs in the dump/load programs
themselves.

I have yet to port the program to DG/UX to verify UNIX backup tapes in
a similar manner.  I believe the program could be made to serve a
similar purpose, but I'd probably have to change the header/trailer
handling since AOS/VS uses ANSI standard tape labels and I don't think
that UNIX does.

Does this example meet the behavior that you were wondering about?  It
may be a specialized use of the results of a read and not be
representative of what applications-level software does.

--
Greg Hunt                        Internet: hunt@dg-rtp.dg.com
DG/UX Kernel Development         UUCP:     {world}!mcnc!rti!dg-rtp!hunt
Data General Corporation
Research Triangle Park, NC       These opinions are mine, not DG's.

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (09/01/90)

In article <1990Aug31.190751.12522@dg-rtp.dg.com> eliot@dg-rtp.dg.com writes:
> I think this point very well covers the case of writing:  if you want
> to be sure it really got to disk, you need to do an fsync(), and even
> then I'm not sure you can be sure (doesn't the fsync just *mark* the
> buffers for writing out?).  We could certainly arrange for fsync to
> block until everything is really on disk.

fsync() will certainly do that, independently of this mechanism. (It's
sync() that just marks buffers for writing. BSD's fsync() truly writes
the data to disk, giving the transaction control you need for reliable
databases. I have no idea what you poor System V folks do.)

> But consider the reading case.
  [ what happens upon failure? ]

As Peter pointed out, this case is fatal. How many disk errors have you
had over the last year? How many did the programs involved recover from?
Yeah, thought so.

I guess you're right in principle: Steve's proposal is only completely
transparent for writing (which is the more important case anyway).

---Dan

bzs@world.std.com (Barry Shein) (09/02/90)

Having somehow missed the fact that comp.unix.wizards has, um, changed
status (what happened anyhow? I still don't know) I'll repeat a query
I just made there here:

Does anyone have any papers or references on performance benchmarking
of regular vs asynch I/O under UNIX (not other OS's, it's not
applicable)?

Failing that, does anyone have any informal results? Anything?
-- 
        -Barry Shein

Software Tool & Die    | {xylogics,uunet}!world!bzs | bzs@world.std.com
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD

buck@siswat.UUCP (A. Lester Buck) (09/02/90)

In article <1990Aug30.222226.20866@cbnewsm.att.com> lfd@cbnewsm.att.com (leland.f.derbenwick) writes:
>In essentially any serious database application, a completed
>write() to a raw disk is treated as a guarantee that the data
>block has been _physically written to the device_.  (This is
>needed to ensure reliable transaction behavior in the presence
>of potential system crashes.)  Since your suggestion would void
>that guarantee, it is not benign.

Close, but not quite.  The guarantee is that the _controller_ has accepted
the data.  If/when the bits actually hit the media is not fully under the
control of the OS.  Remember SCSI has a READ BUFFERED DATA command for error
recovery.  SCSI disks are coming with bigger caches all the time, and a
power hit can take out a significant amount of data.  If the database really
must remain consistent, a UPS is probably required.

As to Steve's idea, it has a certain elegance to recommend it.  But its
practical value is low.  Sure, it can be made to have full Unix semantics,
but at the price of the common case reducing almost exactly to synchronous
I/O.  Or imagine the case of an I/O server process sharing memory
with dozens of clients.  Each shared memory segment will have to keep a
list of every process that must block on a page fault. The practical effect
will be that an _arbitrary_ number of processes will potentially block
for every I/O, instead of doing useful work in their own address spaces.
This scheme falls into the general class of YANSUAIOM (Yet Another
Non-Standard Unix Asynchronous I/O Mechanism), as do the schemes with
ioctl's or select'ing on disk.

What may be difficult to understand at this point, when Unix has not had a
standard asynchronous I/O facility, is that we will program _differently_
when it is widely available.  The semantics of I/O must change (broaden).
The structure and flow of a program will be significantly different when it
uses asynchronous I/O, in the same way that the availability of real
threads leads to new programming paradigms to take advantage of those
facilities.  We may have to look at schemes used in the realtime Unix
versions, VMS (gag) or even MVS (gag!!), which have had asynchronous I/O
facilities for up to decades, to adapt to this new mindset.

The only reason one designs an asynchronous I/O facility is to efficiently
overlap computation with I/O transfers, and that can take some careful
thought to achieve maximum speedup.  For example, Chris Torek recently
traced the path of a raw synchronous I/O, which eventually sleeps in
physio() in the context of the calling process.  A large transfer will loop
through physio, with a wakeup/sleep cycle for every chunk (limited by how
much physical memory the OS wants to lock down at once).  Each sleep/wakeup
cycle is an expensive context switch, involving reloading the virtual memory
state of the caller.  But a fully asynchronous I/O scheme drags along enough
state to start the next I/O chunk all within the driver interrupt routine,
with the calling process completely out of context.  Of course, it is a
bit(!) more complicated if non-resident pages are found in the next chunk
that needs to be page-locked...

The POSIX.4 asynchronous I/O facilities are moving toward final ballot and
present a rich set of asynchronous I/O primitives.  These include the
obvious aread/awrite, and listio, similar to readv/writev for synchronous
transfers, which can fire off a large number of aio's at once and optionally
be notified only when they are all complete.  Iosuspend is a more advanced
version of select that waits for completion of any operations in a list.
The process can learn of I/O completion in at least four ways:  1) return
codes written into the process' asynchronous I/O control block, 2) receiving
a completely asynchronous "fixed" (queued, tagged) signal/event which runs a
handler, 3) synchronously suspending for I/O completion (iosuspend), or 4)
synchronously suspending or polling for the signal/event posting I/O
completion.  [Suspending is familiar, but the committee added polling, where
a process can sleep until one of a selected signal/event class is posted
while taking signal/events not being polled for completely asynchronously.]
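
For reference, the aread/iosuspend pair described above eventually
standardized under the names aio_read/aio_suspend/aio_return in POSIX.1b;
a minimal sketch using those later names (async_read itself is an
illustrative wrapper, and the computation overlap is elided):

```c
#include <aio.h>
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Queue a read, (notionally) overlap computation, then suspend until
   it completes and collect the return code from the control block. */
ssize_t async_read(int fd, void *buf, size_t len, off_t off)
{
    struct aiocb cb;
    const struct aiocb *list[1] = { &cb };

    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;

    if (aio_read(&cb) != 0)        /* queue the request; returns at once */
        return -1;

    /* ... computation would overlap the transfer here ... */

    aio_suspend(list, 1, NULL);    /* block until the read completes */
    return aio_return(&cb);        /* byte count, as read(2) would give */
}
```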

-- 
A. Lester Buck    buck@siswat.lonestar.org  ...!uhnix1!lobster!siswat!buck

steve@nuchat.UUCP (Steve Nuchia) (09/03/90)

In article <555@siswat.UUCP> buck@siswat.UUCP (A. Lester Buck) writes:
>As to Steve's idea, it has a certain elegance to recommend it.  But its
>practical value is low.  Sure, it can be made to have full Unix semantics,

Hmm.  What ever happened to aesthetic value having some measure of
parity with "practical" value?

To update everyone on my thinking, there have been two serious
(in my view) objections raised.  The fact that at the extrema
of system performance requirements my assumption of a paging
environment is false is the most damaging, and probably is sufficient
to relegate the whole idea to the scrap-heap.  Secondly, the cases
in which the return value (especially for read) cannot be predicted
productively, though few, are important enough to cast serious
doubt on the sufficiency of my scheme.

>What may be difficult to understand at this point, when Unix has not had a
>standard asynchronous I/O facility, is that we will program _differently_

An excellent point, one we would all do well to keep in mind.  I would
have added to Lester's list of examples the event-driven style imposed
by modern user interface construction.

>The POSIX.4 asynchronous I/O facilities are moving toward final ballot and
>present a rich set of asynchronous I/O primitives.  These include the

It is precisely this "rich set" of "primitives" (!!!) that I am
striving to avoid.  If one may, as you suggest, learn something
by examining MVS and VMS and all those other UPPER CASE operating
systems, one of those things should be that a proliferation of
system interface mechanisms (calls, whatever) is Not Good.

>The process can learn of I/O completion in at least four ways:  1) return
>codes written into the process' asynchronous I/O control block, 2) receiving
>a completely asynchronous "fixed" (queued, tagged) signal/event which runs a
>handler, 3) synchronously suspending for I/O completion (iosuspend), or 4)
>synchronously suspending or polling for the signal/event posting I/O
>completion.  [Suspending is familiar, but the committee added polling, where

AAAARRRRRRRGGGGHHHH!!!!!

The time has come to wander in the desert for a while, I think.

For what its worth, I've solidified my thinking on the completion
discovery mechanism I hinted at in my original posting.  It is a
general purpose VM interface call with the following properties:

	int ret = vm_avail ( base, len, mode )
		char	*base;	/* user virtual address */
		int	len;	/* extent of block in bytes */
		int	mode;	/* read or write or both */
	
	ret = number of continuous bytes, starting at base,
		available for the specified access mode(s)
		without faulting.

	As a side effect, every page in the block is "scheduled"
	to become available, as if an access attempt had been
	made to it.  This may involve faulting a page in from
	disk, making a copy-on-write copy, or whatever.
	
	vm_avail will not block, but it may take an arbitrary
		amount of time (linear in len), e.g. for zero-fill or block copying.

This could be used, if one implemented my original asynch-io
scheme, to check for (incremental!) read/write completion.  It could
also be used to read-ahead a memory mapped file.  A real-time program
that had some portions of its image pageable could also use it to
avoid taking page faults.

Aren't I just full of it?  :-)
-- 
Steve Nuchia	      South Coast Computing Services      (713) 964-2462
"To learn which questions are unanswerable, and _not_to_answer_them;
this skill is most needful in times of stress and darkness."
		Ursula LeGuin, _The_Left_Hand_of_Darkness_

steve@nuchat.UUCP (Steve Nuchia) (09/03/90)

In article <BZS.90Aug31173255@world.std.com> bzs@world.std.com (Barry Shein) writes:
>Do there exist any benchmark or other test results which indicate that
>adding asynch i/o to unix actually yields a performance improvement?

There is little question that it can dramatically speed up selected
applications.  Applications, mostly in the backup arena (afio, ddd)
get substantially improved throughput even using really stupid mechanisms.
In afio's case the mechanism is to do a fork for every write, and it
is still a lot faster than running without the feature enabled.

One poster mentioned as fact that the point behind asynch I/O was
to overlap computation and I/O.  At least in the backup arena the
point is to overlap some I/O (tape) with other I/O (disk).  In this
case it should be clear that the overall system throughput is not
damaged by the presence of an application using (well-implemented)
asynch I/O if the remaining job mix is CPU intensive.

Reference: Winter 88 Usenix, "A Faster UNIX Dump Program", Jeff Polk
and Rob Kolstad (both then of CONVEX).
-- 
Steve Nuchia	      South Coast Computing Services      (713) 964-2462
"To learn which questions are unanswerable, and _not_to_answer_them;
this skill is most needful in times of stress and darkness."
		Ursula LeGuin, _The_Left_Hand_of_Darkness_

schwartz@groucho.cs.psu.edu (Scott Schwartz) (09/03/90)

In article <27813@nuchat.UUCP> steve@nuchat.UUCP (Steve Nuchia) writes:
| >The POSIX.4 asynchronous I/O facilities are moving toward final ballot and
| >present a rich set of asynchronous I/O primitives.  These include the
| 
| It is precisely this "rich set" of "primitives" (!!!) that I am
| striving to avoid. 


I was just thinking the same thing.  Isn't it the case that
lightweight processes (mach style threads, say) with shared memory for
communication solve the asynch-io problem?  I'd prefer that to a new
set of async-io routines, I think.

bzs@world.std.com (Barry Shein) (09/03/90)

From: steve@nuchat.UUCP (Steve Nuchia) [responding to my note]
>>Do there exist any benchmark or other test results which indicate that
>>adding asynch i/o to unix actually yields a performance improvement?
>
>There is little question that it can dramatically speed up selected
>applications.

I guess that's a way of saying "no, I don't know of any results..."

>Applications, mostly in the backup arena (afio, ddd)
>get substantially improved throughput even using really stupid mechanisms.

Backups almost always use raw I/O, how does this affect this
observation?

Is it worth doing for filesystem (disk) I/O?

Why are these mechanisms "stupid"? Are you sure the same speed-up
would be seen with async I/O? Are there any other applications besides
intensive disk to tape which would benefit from this, or might this be
a singular example?
-- 
        -Barry Shein

Software Tool & Die    | {xylogics,uunet}!world!bzs | bzs@world.std.com
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD

peter@ficc.ferranti.com (Peter da Silva) (09/05/90)

In article <1990Aug31.190751.12522@dg-rtp.dg.com> eliot@dg-rtp.dg.com writes:
> So, can anyone think of an application that behaves in this manner
> (i.e.  acts upon the return value from a read by doing something
> important, that does not involve examining the read buffer)?  I can't.

Any program that reads a whole file at a time, like GNU Emacs or TCL.
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
peter@ferranti.com

peter@ficc.ferranti.com (Peter da Silva) (09/06/90)

In article <27813@nuchat.UUCP> steve@nuchat.UUCP (Steve Nuchia) writes:
> An excellent point, one we would all do well to keep in mind.  I would
> have added to Lester's list of examples the event-driven style imposed
> by modern user interface construction.
     ^^^^^^

Event loops are basically single loop control systems, such as are found
in the simplest of embedded controllers: microwave ovens, for example. For
them to have become synonymous with modern user interfaces borders on the
obscene. The best way to implement a modern user interface is with multiple
loops of control, such as Haitex' spreadsheet "Haicalc" on the Commodore
Amiga...

Or, for a workstation environment, in NeWS.

> AAAARRRRRRRGGGGHHHH!!!!!

Sympathy.
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
peter@ferranti.com

peter@ficc.ferranti.com (Peter da Silva) (09/06/90)

In article <Fyu&?$-1@cs.psu.edu> schwartz@groucho.cs.psu.edu (Scott Schwartz) writes:
> I was just thinking the same thing.  Isn't it the case that
> lightweight processes (mach style threads, say) with shared memory for
> communication solve the asynch-io problem?  I'd prefer that to a new
> set of async-io routines, I think.

If you have asynchronous I/O, you can implement threads, and vice versa.
A simple asynch I/O mechanism, combined with user-mode threads, is quite
adequate. The complexities involved in POSIX 1003.4 are slightly (but
only slightly) overkill.
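
One direction of that equivalence is easy to sketch: a thread plus shared
memory turns plain read(2) into an asynchronous read. All names here are
illustrative:

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

struct aio_req {
    int     fd;
    void   *buf;
    size_t  len;
    ssize_t result;     /* shared memory: the "completion" channel */
};

static void *reader(void *arg)
{
    struct aio_req *r = arg;
    r->result = read(r->fd, r->buf, r->len);    /* the thread blocks, not us */
    return NULL;
}

/* Fire off the read; the caller keeps computing and joins later. */
int aio_start(struct aio_req *r, pthread_t *tid)
{
    return pthread_create(tid, NULL, reader, r);
}
```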
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
peter@ferranti.com