[comp.unix.wizards] UNIX semantics do permit full support for asynchronous I/O

steve@nuchat.UUCP (Steve Nuchia) (08/29/90)

On the subject of asynchronous I/O in Unix:  I've come up with
what I consider a rather slick way of making it fit neatly
into Unix's way of doing things:

Have read(2) and write(2) calls map the pages containing the buffers
out of the user address space and return immediately.  Once the
data have been copied (DMAed?) to/from the buffers, map the pages back in.

A user program that is not aware of the subterfuge will then run
along for some (probably short) time and trap on an attempt to
refill or inspect the buffer.  It will then be blocked until
the request completes.  A savvy program will do something else
for as long as it can, then take a peek at the buffer when it
has run out of busy work.  One would probably also provide
(grudgingly, in my case) an explicit call for discovering the status.
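
To make the "savvy program" concrete, here is a rough sketch of what
one might look like under this scheme.  Nothing here is a real
interface beyond plain read(); the page-aligned static buffer and the
stand-in do_useful_work() are assumptions of the example.

    #include <unistd.h>

    #define BUFSIZE 8192                /* assume a whole number of pages */

    static char buf[BUFSIZE];           /* assumed page-aligned for the example */

    static void do_useful_work(void) { /* whatever else the program has to do */ }

    void
    process_one_block(int fd)
    {
        /* Under the proposed scheme this returns at once, with the
         * pages of buf mapped out of our address space until the
         * transfer finishes. */
        if (read(fd, buf, sizeof(buf)) <= 0)
            return;

        /* Overlap computation with the I/O. */
        do_useful_work();

        /* First touch of buf: if the transfer hasn't finished yet we
         * fault here and block until it has.  An unaware program
         * behaves identically; it just blocks sooner. */
        if (buf[0] == '\0')
            return;

        /* ... use the data ... */
    }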

(Note that such a call might be useful to a program that
wished to control its paging behaviour if it were written with
sufficient generality.)

The scheme will only provide asynchronicity in cases where the
return value for the read or write call is known early.  This
will be the case primarily for files, but other cases can be
made to take advantage of it.

The performance characteristics would be similar to using mmap
but would apply to programs written in normal unix style and
to dusty decks.  One must of course take some care in implementing
the scheme, and there is no doubt the usual raft of gotchas that
comes up when doing anything involving memory management.  The
case of write(1,buf,read(0,buf,sizeof(buf))), for instance, is
entertaining to contemplate.

Good performance with V7 file system call semantics.  Programs
work whether the feature is in the kernel or not and whether they
are written to take advantage of it or not.  I sure wish I
had thought of it a long time ago (like while the standards
were still soft).

I would appreciate any comments that wizards with more kernel
internals experience might have.  If I've rediscovered something
well-known again I think I shall slit my wrists.

(For completeness I will note that to use this scheme intelligently
you must be able to discover the relevant properties of the memory
management implementation.  This is nothing new for high performance
programs in a paged environment, but unless it's been added recently
there isn't a standard way to do it.  Whether this is properly a
language or a system interface issue is best left to another debate.)
-- 
Steve Nuchia	      South Coast Computing Services      (713) 964-2462
"To learn which questions are unanswerable, and _not_to_answer_them;
this skill is most needful in times of stress and darkness."
		Ursula LeGuin, _The_Left_Hand_of_Darkness_

rsc@merit.edu (Richard Conto) (08/30/90)

In article <27619@nuchat.UUCP> steve@nuchat.UUCP (Steve Nuchia) writes:
>On the subject of asynchronous I/O in Unix:  I've come up with
>what I consider a rather slick way of making it fit neatly
>into Unix's way of doing things:
>
>Have read(2) and write(2) calls map the pages containing the buffers
>out of the user address space and return immediately.  Once the
>data have been copied (DMAed?) to/from the buffers, map the pages back in.
>
>A user program that is not aware of the subterfuge will then run
>along for some (probably short) time and trap on an attempt to
>refill or inspect the buffer.  It will then be blocked until
>the request completes.  A savvy program will do something else
>for as long as it can, then take a peek at the buffer when it
>has run out of busy work.  One would probably also provide
>(grudgingly, in my case) an explicit call for discovering the status.
 
A buffer is not necessarily aligned on a page boundary. And a page
may contain more than one variable.  The savvy program would have to
design its data structures (including local variable arrangement, if
a buffer happens to be there) to be aware of whatever peculiar way
the compiler lays out variables and whatever peculiar granularity the
OS has for pages.

Make it simpler. Have a routine that requests an I/O operation. Another
routine that can check its status. A way of specifying a routine to be
called when the I/O operation completes might be yet another option.
I'm afraid that your idea adds unnecessary complexity (and system dependencies).
And using constructs like 'write(fdout, buf, read(fdin, buf, sizeof(buf)))' is
asking for trouble when 'read()' returns an error condition.
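
Concretely, the three routines might look something like the sketch
below.  Every name is invented for illustration, and the stub bodies
are synchronous, only to pin down what the calls would mean; a real
implementation would queue the transfer and return at once.

    #include <unistd.h>

    typedef void (*io_done_fn)(int id, int result);

    static int pending_result;          /* one outstanding request in this toy */

    /* Start an I/O operation; returns a request id. */
    static int
    io_request(int fd, char *buf, int len, io_done_fn done)
    {
        pending_result = read(fd, buf, len);    /* EOF is glossed over here */
        if (done != 0)
            done(0, pending_result);            /* completion callback */
        return 0;
    }

    /* Status inquiry: <0 error, 0 still pending, >0 bytes transferred. */
    static int
    io_status(int id)
    {
        return pending_result;
    }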

--- Richard

jlg@lanl.gov (Jim Giles) (08/30/90)

From article <27619@nuchat.UUCP>, by steve@nuchat.UUCP (Steve Nuchia):
> On the subject of asynchronous I/O in Unix:  I've come up with
> what I consider a rather slick way of making it fit neatly
> into Unix's way of doing things:
> 
> Have read(2) and write(2) calls map the pages containing the buffers
> out of the user address space and return immediately.  Once the
> data have been copied (DMAed?) to/from the buffers, map the pages back in.
> [...]

Yes, this will work.  I believe that MACH already does this.
Unfortunately, this idea has two problems: 1) not all machines are
paged/segmented; 2) not all I/O requests are a multiple of the
pagesize.  The first problem is more severe - hardware designers avoid
pages/segments when designing for speed.  The paging hardware costs
roughly 10% in speed, or about that much in extra hardware, so it is
left out (Crays have neither pages nor segments).  The pagesize problem
just means that you'd have to map out more memory than is actually
involved in the I/O request.  This means that the user might get
blocked on memory that is really perfectly safe to access - a minor
source of slowdown.
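
For a sense of the scale of that mismatch, the pages that would have
to be mapped out for an arbitrary buffer can be computed directly.
The buffer size below is an arbitrary choice, and getpagesize() is
the old BSD call.

    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        char buf[1000];                 /* an arbitrary, unaligned buffer */
        unsigned long pagesize = (unsigned long)getpagesize();
        unsigned long start = (unsigned long)buf & ~(pagesize - 1);
        unsigned long end = ((unsigned long)(buf + sizeof(buf)) + pagesize - 1)
                                & ~(pagesize - 1);

        /* Everything from start to end would have to be mapped out,
         * even though only sizeof(buf) bytes are part of the transfer. */
        printf("buffer: %lu bytes, pages mapped out: %lu bytes\n",
               (unsigned long)sizeof(buf), end - start);
        return 0;
    }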

J. Giles

merriman@ccavax.camb.com (08/30/90)

In article <1990Aug29.170931.10853@terminator.cc.umich.edu>, 
	rsc@merit.edu (Richard Conto) writes:

> 
> Make it simpler. Have a routine that requests an I/O operation. Another
> routine that can check its status. A way of specifying a routine to be
> called when the I/O operation completes might be yet another option.

Sure sounds like VMS QIO calls.

bdsz@cbnewsl.att.com (bruce.d.szablak) (08/30/90)

In article <1990Aug29.170931.10853@terminator.cc.umich.edu>, rsc@merit.edu (Richard Conto) writes:
> In article <27619@nuchat.UUCP> steve@nuchat.UUCP (Steve Nuchia) writes:
> >Have read(2) and write(2) calls map the pages containing the buffers
> >out of the user address space and return immediately.
>  
> A buffer is not necessarily aligned on a page boundary. And a page
> may contain more than one variable.

Actually, the OS only has to mark the pages as copy-on-write.  This sort
of thing is often done when a process forks, to avoid making a copy of the
data space for the child.  Whether it's worth it is another matter.
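
The same mechanism is visible from user level at fork() time: the two
processes share pages until one of them writes, and only the touched
page gets copied.  The program below can only demonstrate the
semantics (each process ends up with its own data), not the lazy
per-page copying underneath.

    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char data[64 * 1024];        /* several pages, shared until written */

    int
    main(void)
    {
        memset(data, 'p', sizeof(data));

        if (fork() == 0) {              /* child */
            data[0] = 'c';              /* first write forces a page to be copied */
            _exit(0);
        }
        wait(NULL);
        printf("parent still sees '%c'\n", data[0]);    /* prints 'p' */
        return 0;
    }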

lfd@cbnewsm.att.com (leland.f.derbenwick) (08/31/90)

In article <27619@nuchat.UUCP>, steve@nuchat.UUCP (Steve Nuchia) writes:
> On the subject of asynchronous I/O in Unix:  I've come up with
> what I consider a rather slick way of making it fit neatly
> into Unix's way of doing things:
> 
> Have read(2) and write(2) calls map the pages containing the buffers
> out of the user address space and return immediately.  Once the
> data have been copied (DMAed?) to/from the buffers, map the pages back in.
> 
> A user program that is not aware of the subterfuge will then run
> along for some (probably short) time and trap on an attempt to
> refill or inspect the buffer.  It will then be blocked until
> the request completes.  A savvy program will do something else
> for as long as it can, then take a peek at the buffer when it
> has run out of busy work.  One would probably also provide
> (grudgingly, in my case) an explicit call for discovering the status.

Apart from the implementation problems that others have mentioned,
_this suggestion breaks existing code_.

In essentially any serious database application, a completed
write() to a raw disk is treated as a guarantee that the data
block has been _physically written to the device_.  (This is
needed to ensure reliable transaction behavior in the presence
of potential system crashes.)  Since your suggestion would void
that guarantee, it is not benign.

On the other hand, I like your idea of implementing asynchronous
behavior using the ordinary read() and write() calls.  So how
difficult would it be to add a couple of ioctls to the existing
raw disk driver to support that?

One ioctl would select sync/async reads/writes (the default would
be the present behavior: sync read, sync write).  The other ioctl
would do the status inquiry.  With these, asynchronous behavior
is available on demand, and the OS doesn't need to jump through
any hoops to make it transparent: it's up to the user to use the
facility properly.
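
For illustration, using such a pair of ioctls from a program might
look like this.  The request codes DIOCASYNC and DIOCSTATUS and their
semantics are made up for the sketch, since no such driver exists yet.

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>

    #define DIOCASYNC   _IOW('d', 1, int)   /* hypothetical: 1 = async, 0 = sync */
    #define DIOCSTATUS  _IOR('d', 2, int)   /* hypothetical: transfers outstanding */

    int
    async_raw_write(char *rawdev, char *buf, int len)
    {
        int fd, on = 1, pending;

        if ((fd = open(rawdev, O_RDWR)) < 0)
            return -1;

        ioctl(fd, DIOCASYNC, &on);      /* switch the driver to async mode */
        write(fd, buf, len);            /* returns before the data is on disk */

        /* ... other work could go here ... */

        do {                            /* status inquiry: poll until done */
            ioctl(fd, DIOCSTATUS, &pending);
        } while (pending > 0);

        close(fd);
        return 0;
    }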

This is a lot cleaner than implementing asynchronous I/O in user
mode with shared memory and a background process...

 -- Speaking strictly for myself,
 --   Lee Derbenwick, AT&T Bell Laboratories, Warren, NJ
 --   lfd@cbnewsm.ATT.COM  or  <wherever>!att!cbnewsm!lfd

utoddl@uncecs.edu (Todd M. Lewis) (08/31/90)

In article <31445.26dc0466@ccavax.camb.com> merriman@ccavax.camb.com writes:
>In article <1990Aug29.170931.10853@terminator.cc.umich.edu>, 
>	rsc@merit.edu (Richard Conto) writes:
>
>> 
>> Make it simpler. Have a routine that requests an I/O operation. Another
>> routine that can check its status. A way of specifying a routine to be
>> called when the I/O operation completes might be yet another option.
>
>Sure sounds like VMS QIO calls.

Sounds like the Amiga's OS to me.  And UNIX doesn't do this?
I'm trying to be a UNIX nut in training, but I keep hearing about
these new tricks that seem to be rather hard to teach the
old dog.  I'd hate to wake up in 5 years and realize that UNIX
had become to workstations what MS-DOS is to PCs now.  Somebody
pinch me.

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (09/01/90)

In article <1990Aug30.222226.20866@cbnewsm.att.com> lfd@cbnewsm.att.com (leland.f.derbenwick) writes:
> In article <27619@nuchat.UUCP>, steve@nuchat.UUCP (Steve Nuchia) writes:
> > Have read(2) and write(2) calls map the pages containing the buffers
> > out of the user address space and return immediately.  Once the
> > data have been copied (DMAed?) to/from the buffers, map the pages back in.
> Apart from the implementation problems that others have mentioned,
> _this suggestion breaks existing code_.

No, it does not. On many paged machines, an implementation of Steve's
suggestion takes virtually [sic] no time. The worst that happens is your
original efficiency. The best that happens is a noticeable speedup,
especially of pipe read-writes. A program that uses, say, a getpage()
call to allocate a page-aligned buffer can guarantee the best case.
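
One way a program could arrange that page-aligned buffer: valloc() is
the old BSD allocator, and a hypothetical getpage() call would amount
to the same thing.  Rounding the length up to whole pages is this
sketch's own addition.

    #include <stdlib.h>
    #include <unistd.h>

    /* Round the request up to whole pages and hand back a page-aligned
     * buffer. */
    char *
    get_io_buffer(unsigned long want, unsigned long *got)
    {
        unsigned long pagesize = (unsigned long)getpagesize();

        *got = (want + pagesize - 1) & ~(pagesize - 1);
        return valloc(*got);
    }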

> In essentially any serious database application, a completed
> write() to a raw disk is treated as a guarantee that the data
> block has been _physically written to the device_.

No. Any database application that claims to recover after crashes
without fsync()ing its write()s is lying. (This says some interesting
things about certain System V database programs.)
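
A minimal sketch of that discipline (error handling abbreviated): the
record is not considered committed until the write has been both
issued and forced out.

    #include <unistd.h>

    int
    commit_record(int fd, char *rec, int len)
    {
        if (write(fd, rec, len) != len)
            return -1;
        if (fsync(fd) < 0)              /* force the block(s) to the device */
            return -1;
        return 0;                       /* only now is the record durable */
    }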

---Dan

meissner@osf.org (Michael Meissner) (09/01/90)

In article <29290:Aug3120:10:5590@kramden.acf.nyu.edu>
brnstnd@kramden.acf.nyu.edu (Dan Bernstein) writes:

| No. Any database application that claims to recover after crashes
| without fsync()ing its write()s is lying. (This says some interesting
| things about certain System V database programs.)

Ah, but System V has O_SYNC, which does the equivalent of an fsync
after every write (or so the man page claims...).
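
For what it's worth, a sketch of the O_SYNC route: with the flag set,
each write() is supposed to return only after the data has gone out,
so no separate fsync() is needed.

    #include <fcntl.h>
    #include <unistd.h>

    int
    write_synchronously(char *path, char *rec, int len)
    {
        int fd, ok;

        fd = open(path, O_WRONLY | O_APPEND | O_SYNC);
        if (fd < 0)
            return -1;
        ok = (write(fd, rec, len) == len);  /* blocks until the data is out */
        close(fd);
        return ok ? 0 : -1;
    }
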
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Do apple growers tell their kids money doesn't grow on bushes?

bzs@world.std.com (Barry Shein) (09/01/90)

Do there exist any benchmark or other test results which indicate that
adding asynch i/o to unix actually yields a performance improvement?

Papers, pointers, etc. appreciated.  I am not interested in results from
other operating systems; I don't believe they would have any
applicability to the question.

However, very informal results would be appreciated.
-- 
        -Barry Shein

Software Tool & Die    | {xylogics,uunet}!world!bzs | bzs@world.std.com
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD

stripes@eng.umd.edu (Joshua Osborne) (09/02/90)

In article <1990Aug30.222226.20866@cbnewsm.att.com> lfd@cbnewsm.att.com (leland.f.derbenwick) writes:
>Apart from the implementation problems that others have mentioned,
>_this suggestion breaks existing code_.
>
>In essentially any serious database application, a completed
>write() to a raw disk is treated as a guarantee that the data
>block has been _physically written to the device_.  (This is
>needed to ensure reliable transaction behavior in the presence
>of potential system crashes.)  Since your suggestion would void
>that guarantee, it is not benign.

Then that program is quite broken.  Unix guarantees no such thing.
If you want it you need to use fsync(fd), or open the file in
sync mode.  Currently Unix copies data to be written into its disk buffers,
returns control to the user, and doesn't write them until it is forced to
(sync, fsync, buffer shortage) or decides that it is a good time to write.

>One ioctl would select sync/async reads/writes (the default would
>be the present behavior: sync read, sync write).  The other ioctl
>would do the status inquiry.  With these, asynchronous behavior
>is available on demand, and the OS doesn't need to jump through
>any hoops to make it transparent: it's up to the user to use the
>facility properly.

The default should be async for both read & write, because the default write
is already async & the async read would be transparent.  There should be
a way to select sync read/write on a file-by-file basis 'tho.
-- 
           stripes@eng.umd.edu          "Security for Unix is like
      Josh_Osborne@Real_World,The          Mutitasking for MS-DOS"
      "The dyslexic porgramer"                  - Kevin Lockwood
"Isn't that a shell script?"                                    - D. MacKenzie
"Yeah, kinda sticks out like a sore thumb in the middle of a kernel" - K. Lidl

chris@mimsy.umd.edu (Chris Torek) (09/02/90)

>In article <1990Aug30.222226.20866@cbnewsm.att.com>
>lfd@cbnewsm.att.com (leland.f.derbenwick) writes:
>>In essentially any serious database application, a completed
>>write() to a raw disk is treated as a guarantee that the data
>>block has been _physically written to the device_. ...

In article <1990Sep1.185221.8718@eng.umd.edu> stripes@eng.umd.edu
(Joshua Osborne) writes:
>Unix guarantees no such thing.  If you want it you need to use
>fsync(fd), or open the file in sync mode.  Currently Unix copies
>data to be written into its disk buffers, ....

Look again: he said `raw disk'.  Raw I/O calls physio; physio calls
vslock (in 4BSD anyway); vslock pages in and locks in core all the
memory needed for the transfer; physio calls physstrat or the device
strategy routine (depending on the particular variant of 4BSD);
physstrat (if it exists) calls the device strategy routine; the device
routine queues the transfer and, if necessary, starts the device, then
returns; and then physio/physstrat *WAITS*.  Finally, physio calls
vsunlock (which may also mark the pages as modified) and returns.
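
In outline the path looks like this (a paraphrase, not the real code;
every name below is a stand-in for the routine it is commented with):

    struct buf;                                         /* the transfer descriptor */

    extern void lock_user_pages(char *addr, int len);   /* ~ vslock   */
    extern void unlock_user_pages(char *addr, int len); /* ~ vsunlock */
    extern void device_strategy(struct buf *bp);        /* queue and start the device */
    extern void wait_for_completion(struct buf *bp);    /* the *WAIT* */

    void
    raw_io_sketch(struct buf *bp, char *addr, int len)
    {
        lock_user_pages(addr, len);     /* wire the user buffer in core */
        device_strategy(bp);            /* transfer queued, device started */
        wait_for_completion(bp);        /* synchronous: this is the step an
                                         * asynchronous variant would skip */
        unlock_user_pages(addr, len);   /* unwire, mark the pages modified */
    }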

It would be useful to be able to start raw transfers without waiting.
I once (actually, twice) wrote a driver that did this.  Sort of a hack,
but it worked.  It required changes to vsunlock() (to allow it to be
called at interrupt time) and exit() (to avoid throwing away the
process VM until the device close routines finished up).  It would be
better to do this more directly, though.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

jc@minya.UUCP (John Chambers) (09/03/90)

In article <BZS.90Aug31173255@world.std.com>, bzs@world.std.com (Barry Shein) writes:
> 
> Do there exist any benchmark or other test results which indicate that
> adding asynch i/o to unix actually yields a performance improvement?

Well, I don't have the papers any more, and in any case they were internal
documents at the company I was working for, but I did a study along this
line about 5 years ago.  We used an assortment of Sys/V, Sys/III, and BSD
systems, including a couple (e.g., Masscomp) that implemented contiguous
files.  I wrote a set of test programs that did various patterns of file
access (sequential, random, random-followed-by-sequential, etc.), and also
tested a number of the company's existing applications.  Files were opened
with/without O_SYNC, and were either contiguous or normally allocated on
different tests.  The tests were run both alone and together with other
applications, giving a total of 8 combinations for each test, which were
run long enough to give statistically significant results.
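
For anyone who wants to repeat the experiment, the core of such a test
program is small.  This sketch just times a sequential pass and a
random pass over one file; the block size, read count, and the
gettimeofday() timing are arbitrary choices.

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    #define BLKSIZE 8192
    #define NREADS  1000

    static double
    elapsed(struct timeval *a, struct timeval *b)
    {
        return (b->tv_sec - a->tv_sec) + (b->tv_usec - a->tv_usec) / 1e6;
    }

    int
    main(int argc, char **argv)
    {
        char buf[BLKSIZE];
        struct timeval t0, t1;
        off_t size;
        int i, fd;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
            return 1;
        size = lseek(fd, (off_t)0, SEEK_END);
        if (size < BLKSIZE)
            return 1;

        lseek(fd, (off_t)0, SEEK_SET);
        gettimeofday(&t0, 0);
        for (i = 0; i < NREADS; i++)            /* sequential pass */
            if (read(fd, buf, sizeof(buf)) <= 0)
                lseek(fd, (off_t)0, SEEK_SET);  /* wrap at end of file */
        gettimeofday(&t1, 0);
        printf("sequential: %.3f sec\n", elapsed(&t0, &t1));

        gettimeofday(&t0, 0);
        for (i = 0; i < NREADS; i++) {          /* random pass */
            lseek(fd, (off_t)(rand() % (int)(size / BLKSIZE)) * BLKSIZE, SEEK_SET);
            read(fd, buf, sizeof(buf));
        }
        gettimeofday(&t1, 0);
        printf("random:     %.3f sec\n", elapsed(&t0, &t1));

        close(fd);
        return 0;
    }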

The results were disappointing for those who wanted these features.  Most
of the tests showed no significant differences among the combinations.  In 
the few cases where there was a difference, the "normal" case (no syncing,
not contiguous) was the winner by a small margin.

It was particularly interesting that we couldn't find a single application
that ran faster with contiguous files than with normal files.  I'm sure
that some exist, but we couldn't construct them.  I was less surprised that
automatic syncing didn't benefit anyone; I had predicted that.  After all,
forcing a block to be written causes it to be marked "clean", so it becomes
a good candidate for re-use if buffer space is low.  As a result, such files
tend to have a somewhat smaller fraction of their data in buffers, so random
reads are somewhat less likely to have a hit.  It's not big, but for random
I/O, it is measurable.

It'd be interesting to hear of cases where these features are worth their
price (in kernel code, programmer time, etc.).



-- 
Zippy-Says: Imagine ... a world without clothing folds, chiaroscuro, or marital difficulties ...
Home: 1-617-484-6393 Work: 1-508-952-3274
Uucp: ...!{harvard.edu,ima.com,eddie.mit.edu,ora.com}!minya!jc (John Chambers)
Uucp-map: minya	adelie(DEAD)

peter@ficc.ferranti.com (Peter da Silva) (09/06/90)

In article <27813@nuchat.UUCP> steve@nuchat.UUCP (Steve Nuchia) writes:
> An excellent point, one we would all do well to keep in mind.  I would
> have added to Lester's list of examples the event-driven style imposed
> by modern user interface construction.
     ^^^^^^

Event loops are basically single loop control systems, such as are found
in the simplest of embedded controllers: microwave ovens, for example. For
them to have become synonymous with modern user interfaces borders on the
obscene. The best way to implement a modern user interface is with multiple
loops of control, such as Haitex' spreadsheet "Haicalc" on the Commodore
Amiga...

Or, for a workstation environment, in NeWS.

> AAAARRRRRRRGGGGHHHH!!!!!

Sympathy.
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
peter@ferranti.com