[comp.unix.wizards] mmap

rbk@sequent.UUCP (Bob Beck) (06/24/87)

Sequent's version of UNIX (Dynix) implements a version of mmap(), with some
of the "blanks" form the original 4.2bsd system description filled in.  This
supports paged shared memory (nice for parallel programming ;-), and can
support SystemV shared-memory via library code (libc).

If more than one process maps overlapping portions of a file for PROT_RDWR
and MAP_SHARED you get shared memory.  MAP_PRIVATE ==> local mods to the
data.  No address-space alignment restrictions other than page boundaries.
Mapped files are "coherent" with normal reads and writes, so "cp mapped-file
somewhere" will get the latest data.  Mapped address space is dumped into
core files, providing a "snapshot" of the data when/if the program dies.  In
the upcomming release of Dynix (v3.0) "shared text" is finally gone and is
done as a sub-case of mapped files.  Also mmap() can map over any previously
existing address space (or it can populate new address space) -- the last
thing mapped is what's there.  munmap() makes address space go away (ie,
creates a "hole").  mmap() also supports device-drivers providing physical or
paged maps for various "custom" reasons.

If anyone wants more details, let me know.

					Bob Beck
					Sequent Computer Systems

throopw@xyzzy.UUCP (Wayne A. Throop) (06/27/87)

> chris@mimsy.UUCP (Chris Torek)
>> rwhite@nu3b2.UUCP (Robert C. White Jr.)
>>OK, so whare does the file come from??  Is it in a special disk
>>partition, and loaded at system boot time?
> The file comes from the file system, of course.

Indeed.  Quite like the text area of an executable file, it is paged
directly from the filesystem.  Unlike the text area of an executable
file, it may have to have pages written to it from memory that has been
modified.  An interesting question is whether sync() should force this
to happen, or whether some other kernel hint should be used.

Chris's summary is of course excellent:

>>As I said... What's the point?
> There are several:
>   - The new kernel runs faster (for some selected set of benchmarks);
>   - The new kernel code is simpler;
>   - Mapped file semantics are more convenient for some programs;
>   - Mapped files provide shared memory.

... but I thought I'd expand on the "mapped files provide shared memory"
point a little.  In fact, they provide it in a much cleaner and more
convenient form than the shm calls from SysV.  To touch on only one
point, the very first problem one runs into when actually using SysV
shared memory is how to allocate shmids and communicate them to all the
cooperating processes.  Usually, some scheme with a central
administrator process and some fifos or whatnot has to be kludged
together.  On the other hand, with the file system providing the
"schmids", which are just files (or inodes if you will), there doesn't
need to be any complication.  Just create a file where it will do the
most good, in a place in the file system that has the protections you
want, and the overflow checking you want, and all the other nice things
the file system already provides.  It even gives you a shared memory
segement that persists longer than the machine stays up, without having
to utter any extraneious incantations in the /etc/rc or equivalent.

All in all, mmap-like shared memory and file-to-memory mapping is far
superior, far more flexible, and could be far better integrated into the
rest of unix than is SysV shmem.  

--
Adam and Eve had many advantages, but the principal one
was, that they escaped teething.
                --- Pudd'nhead Wilson's Calendar (Mark Twain)
-- 
Wayne Throop      <the-known-world>!mcnc!rti!xyzzy!throopw

Kemp@dockmaster.ncsc.mil (02/07/90)

Doug Gwyn writes:
 > Larry McVoy writes:
 >> Anyway, it's tough to do this otherwise.  Protection is implemented
 >> via the MMU.  I can't think of any other reasonable (performance)
 >> way to do it.
 >
 > I appreciate that, but you do see the point, I hope.
 > A user-mode application has no reasonable handle on the notion of
 > "page alignment"; . . .
 >
 > . . . lacking a system-provided function like
 >  void *PageAlign( void *base_pointer, size_t extent );
 >   . . . there is no way I can see to use mmap() portably even among
 > systems on which it exists.

I fail to understand why mmap(2) can't be used portably.

 caddr_t mmap( caddr_t addr, int len, int prot, int flags,
             int fd, off_t off );

 " mmap() establishes a mapping between the process's address space
 at an address paddr for len bytes to the memory object represented
 by fd at off for len bytes.  The value of paddr is an implementation
 dependent function of the parameter addr and the values of flags,
 . . .  A successful mmap() returns paddr as its result."

In other words, a user mode application should have *no* handle at all
on the notion of "page alignment".  It should just regard the value
returned by mmap as a pointer to memory that is valid for the particular
hardware on which it is running.  In fact, page alignment alone may not
be a sufficient condition for validity.

What's non-portable about that?

   Dave Kemp <Kemp@dockmaster.ncsc.mil>

gwyn@smoke.BRL.MIL (Doug Gwyn) (02/07/90)

In article <22368@adm.BRL.MIL> Kemp@dockmaster.ncsc.mil writes:
> caddr_t mmap( caddr_t addr, int len, int prot, int flags,
>             int fd, off_t off );
> " mmap() establishes a mapping between the process's address space
> at an address paddr for len bytes to the memory object represented
> by fd at off for len bytes.  The value of paddr is an implementation
> dependent function of the parameter addr and the values of flags,
> . . .  A successful mmap() returns paddr as its result."
>What's non-portable about that?

Gee, suppose I need N bytes mapped.  I cannot just use N for the `len'
argument.  What should I use?  Presumably N+{page_size}-1 would suffice,
assuming of course that my application has arranged for addr to point to
that much valid storage.  However, {page_size} is not generally available
to me.  Some *non-POSIX conformant* systems seem to provide a way to
find out via sysconf(PAGESIZE).  (PAGESIZE should not be defined in
<unistd.h>.)  SVID3 also seems to say that using an addr of 0 will result
in a random location in the middle of my process being used, which would
surely be a horrible design botch; one hopes it means that the necessary
storage will be allocated from the process's heap, in which case that
would be the best way to use this function.

Note also that SVID3 says that -1 is returned on error.  I have no idea
what that means, as -1 is not a caddr_t.  You may recall that we discuss
this sort of botch (as with sbrk()) every so often, and yet here these
idiots go and make the same mistake yet again.  The proper error return
should have been a null pointer.

lwa@osf.org (Larry Allen) (02/08/90)

If you want to map "len" bytes using mmap, you ask to
map "len" bytes.  If mmap is successful, it guarantees
to return a pointer to an address a such that addresses
a through a+len-1 are valid in your address space.  Addresses
beyond this (say, up to the next page boundary) may also
be valid, but you can't depend on that.

In any case, I'm not sure what your point about non-Posix
conformant systems is.  An application using mmap is not
Posix conformant.  If Posix is ever extended to include
mmap, presumably it will also include an interface to
get the page size.

Now, I think there *are* a couple of things about mmap
to which you could have raised valid objections (what happens
on a machine with multiple page sizes, such as the ETA-10?
what are the semantics for memory sharing on a multiprocessor
machine?  how do accesses to a file via mapping and accesses
to a file via the file system interact?)  But I don't see
the page size dependency as being a big problem.
						-Larry Allen
						 Open Software Foundation

gwyn@smoke.BRL.MIL (Doug Gwyn) (02/08/90)

In article <3399@paperboy.OSF.ORG> lwa@osf.org (Larry Allen) writes:
>If you want to map "len" bytes using mmap, you ask to
>map "len" bytes.  If mmap is successful, it guarantees
>to return a pointer to an address a such that addresses
>a through a+len-1 are valid in your address space.

That doesn't mean they're safe to use!  They could overlay program
variables.  That is why I wanted a function to help determine page
alignment, so I could safely set up an array to use.

>In any case, I'm not sure what your point about non-Posix
>conformant systems is.  An application using mmap is not
>Posix conformant.

It is the implementation that is not POSIX conformant, due to
sticking PAGESIZE gratuitously in <unistd.h> (according to SVID3).

ndjc@hobbit.UUCP (Nick Crossley) (02/09/90)

In article <12087@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
>Gee, suppose I need N bytes mapped.  I cannot just use N for the `len'
>argument.

SVID Edition 3 is very clear that 'the parameter len need not meet a size
or alignment constraint'.  Any partial page at the end is zero-filled,
and is not written back out to the file if modified.

>SVID3 also seems to say that using an addr of 0 will result
>in a random location in the middle of my process being used, which would
>surely be a horrible design botch; one hopes it means that the necessary
>storage will be allocated from the process's heap, in which case that
>would be the best way to use this function.

Again, SVID 3 is clear: 'When the system selects <an address>, it will
never place a mapping at address zero, nor replace any extant mapping,
nor map into areas considered part of the potential stack or data segments.'
I take this to mean that the default mmap virtual addresses must be far
distant from both stack and sbrk/malloc addresses.  This is not guaranteed,
and is frequently not true, for shmat, which makes it difficult to use
shared memory and malloc extensively in the same program.  I would expect
most systems to use a similar choice of mappings for mmap and shmat, so
in practice, that problem may also disappear.

-- 

<<< standard disclaimers >>>
Nick Crossley, ICL NA, 9801 Muirlands, Irvine, CA 92718-2521, USA 714-458-7282
uunet!ccicpg!ndjc  /  ndjc@ccicpg.UUCP

guy@auspex.auspex.com (Guy Harris) (02/10/90)

>Gee, suppose I need N bytes mapped.  I cannot just use N for the `len'
>argument.

SVID89 says (and so does SunOS 4.x, whence both man page and
implementation were derived):

     When MAP_FIXED is not set, the system uses addr as a hint in
     an  implementation-defined  manner  to arrive at paddr.  The
     paddr so chosen will be an area of the address  space  which
     the  system deems suitable for a mapping of len bytes to the
     specified object.

Said area may be larger than "len", if that's what's necessary to be
suitable for such a mapping, even if "N" isn't a multiple of a page
size.  The documentation should perhaps be clearer on this, but that
*is* how it works....

>SVID3 also seems to say that using an addr of 0 will result
>in a random location in the middle of my process being used, which would
>surely be a horrible design botch; one hopes it means that the necessary
>storage will be allocated from the process's heap, in which case that
>would be the best way to use this function.

It picks a location which is, in most implementations, a large distance
below the bottom of the area reserved for the stack.  (I don't know what
it does on the WE32K, or other machines where the stack grows upward; I
would guess that it would place it a large distance above the top of the
area reserved for the stack.)

The way that area is reserved for the stack is by setting the stack
limit to some value.  The default stack size seems to be 2MB on Sun-3s
and 8MB on Sun-4s; if you need a larger stack, you can set it to a
larger value. 

There are reasons why having the system choose the address is a win; one
reason is that machines exist where alignment constraints other than
page-alignment constraints affect cacheability of shared segments
(Sun-3/2xx and Sun-4/2xx, for example).

>Note also that SVID3 says that -1 is returned on error.  I have no idea
>what that means, as -1 is not a caddr_t.  You may recall that we discuss
>this sort of botch (as with sbrk()) every so often, and yet here these
>idiots go and make the same mistake yet again.  The proper error return
>should have been a null pointer.

I won't defend that decision as being right; it was probably done in
SunOS 4.x for binary compatibility with the limited "mmap" in earlier
releases, but unless source compatibility was important (and I don't
know that source compatibility is provided for "mmap" at all) it could
have been given a different trap number and done differently.

coleman@cam.nist.gov (Sean Sheridan Coleman X5672) (02/05/91)

Does anyone have any examples of mmap for sun's. I am especially
interested in being able to  open one file and copy it to
another  one. I also would like to see some examples that 
utilize the EXEC proto.


I am also looking for examples that use madvise, mcntl, and
mlock.


Thanks very much

Sean Coleman
NIST
coleman@bldrdoc.gov

lm@slovax.Eng.Sun.COM (Larry McVoy) (02/05/91)

In article <6991@alpha.cam.nist.gov> coleman@cam.nist.gov (Sean Sheridan Coleman X5672) writes:
>Does anyone have any examples of mmap for sun's. I am especially
>interested in being able to  open one file and copy it to
>another  one. I also would like to see some examples that 
>utilize the EXEC proto.

Don't have a quickie for EXEC.  Check this out, just some gunk I'm playing
with.


# This is a shell archive.  Remove anything before this line, then
# unpack it by saving it in a file and typing "sh file".  (Files
# unpacked will be owned by you and have default permissions.)
#
# This archive contains:
# mmapcp.c mmaplib.c

echo x - mmapcp.c
cat > "mmapcp.c" << '//E*O*F mmapcp.c//'
/*
 * a version of copy that maps in the data & writes it with mmap.
 *
 * Both sets of data are synced of memory.  This is designed to move data
 * fairly quickly but still disturb the system as little as possible.
 *
 * @(#)mmapcp.c	1.2
 */

#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

#define	SYNCSIZ	(256*1024)

main(ac, av)
	char **av;
{
	int	i, bytes;
	int	in, out;
	char	*ibuf, *obuf, *ip, *op;

	if (ac != 3) {
		printf("usage: %s src dest\n", av[0]);
		exit(0);
	}
	if ((in = open(av[1], 0)) == -1) {
		perror(av[1]);
		exit(1);
	}
	if ((out = open(av[2], O_RDWR|O_CREAT, 0644)) == -1) {
		perror(av[2]);
		exit(1);
	}
	if (mmap_init(in, &ibuf, 0, -1, 0) == -1) {
		perror("mmap_init in");
		exit(1);
	}
	if (mmap_init(out, &obuf, 0, size(in), 1) == -1) {
		perror("mmap_init out");
		exit(1);
	}
	ip = ibuf;
	op = obuf;
	for (i = size(in); i > 0; i -= SYNCSIZ) {
		bytes = SYNCSIZ < i ? SYNCSIZ : i;
		bcopy(ip, op, bytes);
		mmap_flush(ip, bytes, 1);
		mmap_flush(op, bytes, 0);	/* 1 takes longer */
		ip += bytes;
		op += bytes;
	}
	close(in);
	close(out);
	exit(0);
}
//E*O*F mmapcp.c//

echo x - mmaplib.c
cat > "mmaplib.c" << '//E*O*F mmaplib.c//'
/*
 * @(#)mmaplib.c	1.3
 */
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

/*
 * Set things up to do I/O over the indicated range
 *
 * XXX - up to user to be sure that mmap is an OK thing to do (tapes).
 */
mmap_init(fd, basepp, off, bytes, writeable)
	caddr_t	*basepp;
	off_t	bytes;
{
	caddr_t	base;
	int	protbits;

	protbits = PROT_READ;
	if (writeable) {
		protbits |= PROT_WRITE;
	}
	if (!regfile(fd)) {
		return (-1);
	}
	if (bytes == (off_t)-1) {
		bytes = size(fd);
	}
	if (writeable && ftruncate(fd, off + bytes) == -1) {
		return (-1);
	}
	base = mmap((caddr_t)0, bytes, protbits, MAP_SHARED, fd, off);
	if (base == (caddr_t)-1) {
		return (-1);
	}
	madvise(base, bytes, MADV_SEQUENTIAL);
	*basepp = base;
	return (0);
}

/*
 * flush writes
 */
mmap_flush(addr, len, clean)
	caddr_t	addr;
	off_t	len;
{
	msync(addr, len, MS_ASYNC);
	if (clean)
		madvise(addr, len, MADV_DONTNEED);
}

static	struct stat sb;
static	lastfd = -1;

regfile(fd)
{
	if (lastfd != fd) {
		if (fstat(fd, &sb) == -1) {
			return (-1);
		}
		lastfd = fd;
	}
	return (S_ISREG(sb.st_mode));
}

size(fd)
{
	if (lastfd != fd) {
		if (fstat(fd, &sb) == -1) {
			return (-1);
		}
		lastfd = fd;
	}
	return (sb.st_size);
}
//E*O*F mmaplib.c//

echo Possible errors detected by \'wc\' [hopefully none]:
temp=/tmp/shar$$
trap "rm -f $temp; exit" 0 1 2 3 15
cat > $temp <<\!!!
      57     188    1124 mmapcp.c
      78     198    1239 mmaplib.c
     135     386    2363 total
!!!
wc  mmapcp.c mmaplib.c | sed 's=[^ ]*/==' | diff -b $temp -
exit 0
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

cgy@cs.brown.edu (Curtis Yarvin) (03/26/91)

There is nothing in mmap(2) on Suns to indicate that I cannot mmap a socket.
What will happen if I try?  Will new pages be mapped, sequentially, into
memory as the socket receives data?

Curtis

"I tried living in the real world
 Instead of a shell
 But I was bored before I even began." - The Smiths