[comp.arch] mmap

gingell%opus@Sun.COM (Rob Gingell) (02/13/90)

In article <3556@rti.UUCP> trt@rti.UUCP (Thomas Truscott) writes:
>[someone wrote a replacement for "sum", called "fastsum" that uses mmap().]
>
>You are comparing your efficient "fastsum" that happens to use mmap()
>against a sluggardly "sum" that happens to use read().
>(Actually it uses getchar(), which calls _filbuf(),
>maybe _filbuf() uses mmap()?!)

As it happens, no.  This remains a potential change; however, we have not
made it because, to date, we have not found that stdio would benefit
from such a change -- the principal advantage would be saving buffer-copy
time and memory loading, but we haven't found a large population
of programs where these factors are dominant.  Perhaps it is because our
stdio is otherwise so inefficient, perhaps it is because the applications
themselves are inherently not I/O buffer-copy limited, or perhaps simply
because those programs that were already so limited long ago converted
to direct read()/write() operations.

>The following would be a more appropriate test:
>Change your fastsum routine so that instead of mmap()ing
>a megabyte at a time, it does a read() of a megabyte at a time. 
>Compare the mmap() and read() versions of this program.
>I suspect you will find they take about the same amount of time.

I don't think so.  At the very least, the read() version will be slower
than the mmap() version by the amount of time required to effect the
copies from kernel to program buffers.  And this assumes an "optimum"
situation in which the overhead of buffer management in the kernel does
not become significant -- which it will for a large amount of data.  And
it ignores the system effects of essentially doubling the memory load
on the system for both the original file pages and the pages used to
buffer the copies in the application.

>On a Sparcstation 1, try timing "cp" vs. the following program:
>
>    #include <unistd.h>
>
>    int main(void)
>    {
>	    char bfr[8192];
>	    register int n;
>
>	    while ((n = read(0, bfr, sizeof(bfr))) > 0)
>		    write(1, bfr, n);
>	    return 0;
>    }
>
>I did "/bin/time cp /vmunix /tmp/x"
>and "/bin/time a.out < /vmunix > /tmp/x" several times.
>The results were essentially identical.
>(I did not experiment with buffer sizes; I suspect 16k would be faster.)

I'd be astonished if the results did not always show that access through
mmap() is faster (and they are for this program running on my 3/160.)  To be a
valid experiment, you should be sure that both /vmunix and /tmp/x are
completely flushed from memory after each test run -- otherwise the system's
buffering of the two files will skew the results.  I've never observed a
proper experiment in which mmap() was not faster, though the difference is not
always large.

>There is no inherent reason that read() should be slower
>than mmap() for sequential I/O, since read() is doing precisely
>what is wanted.  Indeed read() should be faster since
>it is conceptually simpler.

Not true.  read() operates by mmap()ing the file inside the kernel and
copying it out.  And, due to limitations in the address space available
inside the kernel, read() must often perform more, smaller "mmap()-like"
chunk operations than a single application mmap() would, using even more
CPU time in the process.

>Note that read() can be implemented with memory mapping, in some cases:
>it could map the address of "bfr" to a copy-on-modify kernel page.

This is also not true, though it is a common belief and one that arose
repeatedly during development.  read() gives you a copy of the file data
at the time that the call is executed.  That copy is immutable save any
action performed by your program.  If read() were implemented *as* mmap(),
then while it is possible to deal with side effects introduced in *your*
machine, it is not, in general, possible to deal with side effects introduced
in other machines -- such as file modifications performed by DOS PC's living
in your network.  Such an assumption might be tenable were it not for
heterogeneous environments.  However, it should be noted that neither
MULTICS nor TENEX/TOPS-20 (the latter being the more direct parent of
mmap(), with MULTICS as a more remote ancestor) attempted such an 
optimization either.

>As others have pointed out, read() and write() are generally useful
>on streams, and mmap() is not.
>(The SunOS "cp" command falls back to read/write if mmap() fails.
>But since read/write is as fast as mmap(),
>why bother with mmap() in the first place?!)
>
>So what is mmap() good for?  Plenty.
>But it is NOT a replacement for read/write.

Nor is it advertised as such.  Though Mr. Truscott has not done so, those
deprecating mmap() for not being "device independent" or lacking other
attributes of read()/write() miss the point -- which was never that mmap()
replace read() or write() or otherwise represent some "grail" in the search
for computing enlightenment.  Rather it was to provide an abstraction of
operations in which the system was already engaged (namely file buffering and
physical store multiplexing) in a way that was accessible to applications and
which can increase their flexibility.  A good test of the sufficiency of such
an abstraction is that it is capable of becoming a primitive which you can use
to replace older and various implementations with a common framework -- and in
this we believe mmap() to have been a success.  We also believe it to be an
effective abstraction for those requiring its properties.  But neither do we
believe that everyone does, for mmap() is certainly a "lower-level"
abstraction than read()/write(), a primitive out of which the latter can be
constructed on memory objects in the same way device drivers provide a
primitive for transfer operations.

Because mmap() is *more* primitive than read()/write(), it can be (as Dennis
Ritchie points out) more cumbersome to use than the equivalent sequence of
read() or write() -- but so would access to raw devices.  If you're
programming around it, it's probably an indication that operating at this
level of the system isn't suitable for your needs; you should use the higher
abstractions.  The fact that the system supplies an abstraction that isn't
suitable for your use does not lessen the fact that it is an effective
abstraction for others, as well as an effective one for the system to use
in the implementation of abstractions that *are* appropriate for your use.
It's been my experience that most frustrations in the use of memory mapping
techniques in MULTICS, TENEX/TOPS-20, and now with mmap() have come from the
expectation that somehow mmap() was a higher-level operation than it really
is.  

henry@utzoo.uucp (Henry Spencer) (02/13/90)

In article <131682@sun.Eng.Sun.COM> gingell@sun.UUCP (Rob Gingell) writes:
>... At the very least, the read() version will be slower
>than the mmap() version by the amount of time required to effect the
>copies from kernel to program buffers...

Assuming your MMU can do copy-on-write, why copy?

>... read() gives you a copy of the file data
>at the time that the call is executed.  That copy is immutable save any
>action performed by your program.  If read() were implemented *as* mmap(),
>then while it is possible to deal with side effects introduced in *your*
>machine, it is not, in general, possible to deal with side effects introduced
>in other machines...

Why are other machines relevant?  Can they reach into your machine and
mess with memory?  Or are you assuming an implementation where other
machines are not told "I'm using this data, and want to be told before
anyone else starts to change it"?

Clearly, implementing read with copy-on-write mapping requires a proper
implementation of copy-on-write, in which *any* attempt to mess with the
data triggers a copy operation.  If some defective file system, call it
NFS (name picked purely at random :-)), is not capable of supporting
copy-on-write, then you can't do this optimization when the file is on
the other end of NFS.  If, on the other hand, you have a real file system,
it's not a problem.  Concurrent access to files is actually quite rare
except for directories and certain system files; a copy-on-write read
would almost never need to copy.
-- 
"The N in NFS stands for Not, |     Henry Spencer at U of Toronto Zoology
or Need, or perhaps Nightmare"| uunet!attcan!utzoo!henry henry@zoo.toronto.edu

mrc@Tomobiki-Cho.CAC.Washington.EDU (Mark Crispin) (02/13/90)

In article <1990Feb13.003010.23356@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <131682@sun.Eng.Sun.COM> gingell@sun.UUCP (Rob Gingell) writes:
>>... At the very least, the read() version will be slower
>>than the mmap() version by the amount of time required to effect the
>>copies from kernel to program buffers...
>
>Assuming your MMU can do copy-on-write, why copy?

I don't see anything that says that read() has to be on a page or
even a word boundary.  I haven't seen a description of mmap(), but
usually file/memory mapping is whole pages on page boundaries, so the
swapper can swap in/out the pages directly from/to the file.

 _____     ____ ---+---   /-\   Mark Crispin           Atheist & Proud
 _|_|_  _|_ ||  ___|__   /  /   6158 Lariat Loop NE    R90/6 pilot
|_|_|_| /|\-++- |=====| /  /    Bainbridge Island, WA  "Gaijin! Gaijin!"
 --|--   | |||| |_____|   / \   USA  98110-2098        "Gaijin ha doko ka?"
  /|\    | |/\| _______  /   \  +1 (206) 842-2385      "Niichan ha gaijin."
 / | \   | |__| /     \ /     \ mrc@CAC.Washington.EDU "Chigau. Gaijin ja nai.
kisha no kisha ga kisha de kisha-shita                  Omae ha gaijin darou."
sumomo mo momo, momo mo momo, momo ni mo iroiro aru    "Iie, boku ha nihonjin."
uraniwa ni wa niwa, niwa ni wa niwa niwatori ga iru    "Souka. Yappari gaijin!"

ccplumb@lion.waterloo.edu (Colin Plumb) (02/13/90)

In article <1990Feb13.003010.23356@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <131682@sun.Eng.Sun.COM> gingell@sun.UUCP (Rob Gingell) writes:
>>... At the very least, the read() version will be slower
>>than the mmap() version by the amount of time required to effect the
>>copies from kernel to program buffers...
>
>Assuming your MMU can do copy-on-write, why copy?

Because the odds that the user's buffers are properly aligned are not very
good.  If read() allocated a buffer and returned a pointer to that, they'd
be pretty much the same call.

One thing Mach is showing is that Unix was *not* designed with paging
in mind and you can do all sorts of neat tricks if you have it.
-- 
	-Colin

henry@utzoo.uucp (Henry Spencer) (02/14/90)

In article <20837@watdragon.waterloo.edu> ccplumb@lion.waterloo.edu (Colin Plumb) writes:
>>[read()] Assuming your MMU can do copy-on-write, why copy?
>
>Because the odds that the user's buffers are properly aligned are not very
>good.  If read() allocated a buffer and returned a pointer to that, they'd
>be pretty much the same call.

The odds are very high that if the user is pulling in large chunks of data,
either he's malloced the buffer or it's a static variable.  Either way, it's
no big trick, given a bit of cooperation from library and compiler, to
ensure that any large buffer is page-aligned.  Sure, once in a while you'll
have to copy; not often.

More generally, there is *NO REASON* why the worst-case code and the
typical-case code have to be one and the same.  Usually it is a massive
performance win to optimize the bejesus out of the typical case and let the
worst case just muddle along as best it can.  Once in a long while you run
into a real application which triggers the worst case, and has to be fixed
(usually in some trivial way) for acceptable performance.  For example,
read() clearly has to be able to handle reads into buffers at any byte
alignment.  Back when Unix ran mostly on the pdp11, very few people ever
realized what a colossal performance hit pdp11 Unix took if the buffer
was at an odd address.  I suspect it's fairly significant even on most
modern machines.  Nobody cares, because it's very rare for a program
to ever do that.  I found out about it because one of the early Boyer-Moore
egreps did odd-aligned reads, and was actually slower than old egrep on
the 11.  20 lines of code fixed it.

gingell%opus@Sun.COM (Rob Gingell) (02/14/90)

In article <1990Feb13.003010.23356@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <131682@sun.Eng.Sun.COM> gingell@sun.UUCP (Rob Gingell) writes:
>>... At the very least, the read() version will be slower
>>than the mmap() version by the amount of time required to effect the
>>copies from kernel to program buffers...
>
>Assuming your MMU can do copy-on-write, why copy?

See >> below...

>>... read() gives you a copy of the file data
>>at the time that the call is executed.  That copy is immutable save any
>>action performed by your program.  If read() were implemented *as* mmap(),
>>then while it is possible to deal with side effects introduced in *your*
>>machine, it is not, in general, possible to deal with side effects introduced
>>in other machines...
>
>Why are other machines relevant?  Can they reach into your machine and
>mess with memory?  Or are you assuming an implementation where other
>machines are not told "I'm using this data, and want to be told before
>anyone else starts to change it"?

I'm assuming an environment in which it is not possible to tell such things.
It is, of course, possible to assume other environments.  TOPS-20, for
instance, assumed that all machines that shared mapping to a file were TOPS-20,
and therefore had all the TOPS-20 semantics about sharing, copy-on-write, etc.
TOPS-20 had other convenient properties such as a page-based file system that
also helped simplify the problem and provide a reasonably powerful and "pure"
environment.  However, such assumptions made it impossible for other systems to
share access to the files involved -- which, in my experience anyway, was a
real and important shortcoming.

>Clearly, implementing read with copy-on-write mapping requires a proper
>implementation of copy-on-write, in which *any* attempt to mess with the
>data triggers a copy operation.  

Yes.  It also applies only in those few cases in which the data is so
conveniently aligned as to make this optimization possible.  One has to
question whether the complexity of cobbling up something as semantically
simple as the read() call is worth it.  And it also excludes every system
that is not capable of cooperating with these somewhat rigorous semantic
requirements: PC's, most IBM operating systems on any hardware -- in short,
the vast majority of hardware in the world.

>If some defective file system, call it
>NFS (name picked purely at random :-)), is not capable of supporting
>copy-on-write, then you can't do this optimization when the file is on
>the other end of NFS.  

Gee, you still got any ax left after all these years of grinding? :-)

The argument has nothing to do with NFS.  It has to do with heterogeneity.  It
may be that accommodating heterogeneous hardware and software environments is a
complexity that some can live without.  Others have not been so fortunate, and
must instead deal with a melting pot reality.  You can certainly design
everything around simplifying assumptions, but at some point you have to wonder
whether or not the assumption set has excluded most of the real world.  

As it happens, the viewpoint I'm describing predates personal familiarity with
the NFS, and was derived from experiences in trying to deal with mixing the
relative "purity" of the TOPS-20 assumption set with real installations where
TOPS-20 (or any system) exclusivity was not possible.  And, independent of any
particular notion of exclusivity, it's also a recognition that transparent
cache coherence is not always possible given heterogeneous hardware and
interconnects.  In the general environment it may not even be desirable due to
the cost of maintaining the coherence.  While this is most evident today in
systems involving networks such as CI busses, Ethernet, FDDI, etc., it seems
increasingly likely that we will have to build architectures around a variety
of assumptions involving weakly-ordered operations as we deal with more
and more independent caches and buffers.

It is these experiences, more than the existence of NFS, that have driven
much of the semantic definition.  That the NFS also dealt with heterogeneity
and that the mutual assumption sets mesh well is either the result of
serendipity or congruence.

>If, on the other hand, you have a real file system,
>it's not a problem.  

This of course assumes we have a definition of a "real" filesystem.  I guess
NFS isn't "real" because it isn't "UNIX", and you're assuming that "real" ==
"UNIX".  But NFS isn't a UNIX file system over networks; it's a network file
system of which UNIX can be a client.  Yes, it looks a lot like UNIX, but
that's its heritage -- most people designing something new carry most of
their experience into it.  And just as UNIX can be an NFS client, so can MVS,
PC-DOS, VMS, and a variety of other non-UNIX systems -- systems which, even
if NFS (or whatever) supported the required "copy-on-write" semantics, would
still be incapable of responding to them.

tihor@acf4.NYU.EDU (Stephen Tihor) (02/14/90)

Whoa.  At this point heterogeneity of NFS-supported OS's counts as a
significant design goal [at least by reasonable metrics].  I thought
MS-DOS compatibility (at least) was a day-one goal, but someone closer can
probably correct that.

dworkin@salgado.Solbourne.COM (Dieter Muller) (02/14/90)

In article <131682@sun.Eng.Sun.COM> gingell@sun.UUCP (Rob Gingell) writes:
>... At the very least, the read() version will be slower
>than the mmap() version by the amount of time required to effect the
>copies from kernel to program buffers...

A few days ago, I posted the results of sum versus fastsum (which
used mmap).  Someone rightly pointed out that the former was going
through lots of grotty stdio code.  Well, I just took fastsum and
(effectively) replaced the mmap calls with read calls.

	salgado {47} time fastsum /image/os/4.0C/upgrade/USR.tar
	36.1u 11.8s 0:53 89% 0+640k 2+0io 3614pf+0w
	salgado {48} time ./gerbil /image/os/4.0C/upgrade/USR.tar
	35.2u 14.0s 1:18 63% 0+1136k 3601+0io 3645pf+0w

User time is a little less, but system time is significantly greater.
I suspect most of the difference is in copying the data around in the
kernel.  The data buffer, btw, was declared global and static, so
alignment should be the best you can expect w/o hand-tuning things.

The data file and general conditions were the same as in my previous
posting.

	Dworkin
--
Martha, the platypus is into the rutabagas again!
boulder!stan!dworkin  dworkin%stan@boulder.colorado.edu  dworkin@solbourne.com
Flamer's Hotline: (303) 678-4624 (1000 - 1800 Mountain Time)