[comp.unix.wizards] What new system calls do you want in BSD?

moss@takahe.cs.umass.edu (Eliot &) (01/31/90)

Another system call (that I feel any virtual memory system ought to support)
is one to move a range of pages from one location in the address space to
another. This is sometimes desirable for garbage collectors, etc. Note that
one need not move the *data*, only the page table entries (though it is
certainly more complicated than just a block move within the OS). Note that
the call should work even if the source and destination ranges overlap. It is
the page-level analog of memmove (memcpy, but safe for overlapping ranges). The
size of the region should probably *not* be specified in pages, but in bytes,
and the source and destination given as byte addresses, too. The call should
fail if the addresses are not page aligned or the quantity to move is not a
multiple of the page size.
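
As a rough illustration of the argument rules just described, here is a
minimal sketch of the validation such a call might perform before touching
any page table entries. The name pagemove_check is hypothetical (no such call
exists in BSD), and the code assumes the page size is a power of two.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper, not an existing system call.
 * Returns 0 if the arguments satisfy the proposed rules (byte addresses and
 * byte count, all page aligned), or -1 with errno set to EINVAL otherwise. */
int pagemove_check(const void *src, const void *dst, size_t nbytes)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0) {
        errno = EINVAL;
        return -1;
    }
    /* Assumption: the page size is a power of two, so ps - 1 is a bit mask. */
    uintptr_t mask = (uintptr_t)ps - 1;
    if (((uintptr_t)src & mask) != 0 ||
        ((uintptr_t)dst & mask) != 0 ||
        ((uintptr_t)nbytes & mask) != 0) {
        errno = EINVAL;
        return -1;
    }
    return 0;   /* overlap between the ranges is allowed by design */
}
```

Overlap is deliberately not rejected here, since the proposal asks for the
move to work when the ranges overlap.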

If one allows additional arguments for adjusting the protection, and allows
the source and/or destination to be associated with different processes and/or
files, a general move/messaging operator results.
								Eliot
--

		J. Eliot B. Moss, Assistant Professor
		Department of Computer and Information Science
		Lederle Graduate Research Center
		University of Massachusetts
		Amherst, MA  01003
		(413) 545-4206; Moss@cs.umass.edu

jay@emtek.UUCP (Jay Elston) (02/01/90)

I've heard that there might be interest in a "is this memory location
readable/writable" function a la:

NAME
    probe - probe memory locations
SYNOPSIS
    #include <sys/probe.h>
    bool probe (address, length, mode)
    unknown *address;
    unsigned length;
    TBD mode;
+-
| Jay Elston, EMTEK Health Care System, Inc.  (602) 431-9343
| uunet!emtek!jay
+-

guy@auspex.auspex.com (Guy Harris) (02/02/90)

>>>	vm_offset_t	*addr;		/* Where to map to (page aligned) */
>>Yes, and so do SunOS 4.x and System V Release 4.  What's more, both of
>>them implement "mmap", which bears a startling resemblance to "map_fd". 
>
>For a user-mode function, I strongly dislike the page-alignment constraint.
>Does mmap have a similar requirement?

Yes.  The SunOS 4.x/S5R4 VM subsystem implements "mmap()" as an
interface to the VM system's mechanism for setting up mappings between
pages in your address space and objects such as files; given that said
VM mechanism can't, for example, say "bytes 23 through 47 of this
particular page are backed by bytes from thus-and-such a vnode", it
requires that the address be page-aligned.  (I.e., if you really mean
"map", there's a *lot* of work involved in lifting the page-alignment
restriction.)

However, "mmap()" lets you ask the system to assign an address, which is
almost always what you want, so applications don't need to worry about
page alignment.  (If they need to get a page-aligned address, they can
use "getpagesize()" and work from there.)
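
A minimal sketch of the "let the system assign the address" usage, assuming
a POSIX-style mmap(). The helper name map_whole_file is made up for
illustration; the caller never computes a page-aligned address itself.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a whole file read-only at an address chosen by
 * the kernel. Returns the mapping, or NULL on failure (including an empty
 * file). On success *len_out receives the mapped length in bytes. */
void *map_whole_file(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    /* A NULL address means "pick one for me"; the result is page aligned. */
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}
```

The caller releases the mapping with munmap(p, len) when done.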

dwc@cbnewsh.ATT.COM (Malaclypse the Elder) (02/03/90)

In article <12067@smoke.BRL.MIL>, gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
> In article <2863@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
> >>	vm_offset_t	*addr;		/* Where to map to (page aligned) */
> >Yes, and so do SunOS 4.x and System V Release 4.  What's more, both of
> >them implement "mmap", which bears a startling resemblance to "map_fd". 
> 
> For a user-mode function, I strongly dislike the page-alignment constraint.
> Does mmap have a similar requirement?

yes mmap has a similar requirement.  this is because the page is
the unit of mapping that the hardware supports.  the alternative
is some HUGE overhead of faulting on any reference to a page that
has an address that is not page aligned and doing any copying necessary
to update that page.

i'm not sure, but doesn't mmap return the address that something was
actually attached to?  the *addr above is just a hint to the system
about where to attach?  so user level code doesn't have to know?

danny chen
att!hocus!dwc

peter@ficc.uu.net (Peter da Silva) (02/03/90)

In article <12071@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
> there is no way I can see to use mmap() portably
> even among systems on which it exists.  This puts a serious crimp
> in its potential usability.

Except that map_fd (and mmap) will accept a NULL address, allocate
the memory, and return a pointer to it.
-- 
 _--_|\  Peter da Silva. +1 713 274 5180. <peter@ficc.uu.net>.
/      \
\_.--._/ Xenix Support -- it's not just a job, it's an adventure!
      v  "Have you hugged your wolf today?" `-_-'

dstewart@fas.ri.cmu.edu (David B Stewart) (02/06/90)

How about implementing good ol' semaphores, with a much simpler interface
than Sys V?  I've written a front end to Sys V semaphores (under SunOS x.x)
which gives me the good old classical P() and V() operations.
I don't need all that other stuff associated with Sys V.  Why not provide a
simple system call which works just as described in all those operating system
textbooks?  The same goes for shared memory; if BSD 4.4 doesn't have
lightweight processes with shared memory, then provide some sort of system call
to allow processes to establish shared memory segments.
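
For readers who have not seen such a front end, here is a minimal sketch of
classical P() and V() layered on the Sys V calls. It is not the poster's
code; the names csem_create and csem_remove are invented for illustration,
and error handling is kept to the bare minimum.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#if defined(__linux__)
/* On Linux the caller has to declare union semun itself. */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};
#endif

/* Create a private one-element semaphore set with the given initial count.
 * Returns the semaphore id, or -1 on error. */
int csem_create(int initial)
{
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    union semun arg;
    arg.val = initial;
    if (semctl(id, 0, SETVAL, arg) < 0) {
        semctl(id, 0, IPC_RMID);
        return -1;
    }
    return id;
}

/* Add delta to the semaphore; a negative delta blocks until it can apply. */
static int csem_adjust(int id, short delta)
{
    struct sembuf op;
    op.sem_num = 0;
    op.sem_op = delta;
    op.sem_flg = 0;
    return semop(id, &op, 1);   /* may fail with EINTR if a signal arrives */
}

int P(int id) { return csem_adjust(id, -1); }   /* wait / down */
int V(int id) { return csem_adjust(id, 1); }    /* signal / up */

/* Remove the semaphore set from the system. */
int csem_remove(int id) { return semctl(id, 0, IPC_RMID); }
```

Sys V semaphores outlive the process that created them, so a real wrapper
would take care to call csem_remove() on every exit path.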

Another useful call would be peek(address), which would return 1 if
address is legal within the process's address space, or 0 (i.e. error)
if a read or write to that address would cause a bus error or
segmentation fault.  I know that the SunOS kernel provides such a routine,
but there is no way for a user program to access it.
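
Lacking such a call, one user-level approximation is to let the kernel do
the probing: hand the address to write() on a pipe and see whether it comes
back with EFAULT. The sketch below checks readability of a single byte only.
It assumes the system reports bad user addresses as EFAULT, which common
systems do but which is not guaranteed everywhere, and it is not the SunOS
kernel routine mentioned above.

```c
#include <errno.h>
#include <unistd.h>

/* User-level approximation of the proposed peek(), read access only.
 * Returns 1 if one byte at addr is readable by this process, 0 if the
 * kernel reports EFAULT, and -1 if the probe itself could not be made. */
int peek(const void *addr)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    int result;
    /* The kernel must read *addr in order to copy it into the pipe. */
    ssize_t n = write(fds[1], addr, 1);
    if (n == 1)
        result = 1;
    else if (n < 0 && errno == EFAULT)
        result = 0;
    else
        result = -1;

    close(fds[0]);
    close(fds[1]);
    return result;
}
```

This costs three system calls per probe, and the answer can go stale if
another part of the program unmaps the page afterwards.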

~dave
-- 
David B. Stewart, Dept. of Elec. & Comp. Engr., and The Robotics Institute, 
	Carnegie Mellon University,  email: stewart@faraday.ece.cmu.edu 
The following software is now available; ask me for details
        CHIMERA II, A Real-time OS for Sensor-Based Control Applications

brnstnd@stealth.acf.nyu.edu (02/08/90)

In article <7848@pt.cs.cmu.edu> dstewart@fas.ri.cmu.edu (David B Stewart) writes:
> How about implementing good ol' semaphores, with a much simpler interface
> than Sys V.

Actually, I just built a simple threads library on top of my signal
library. Everything is shared. You get threadfork(), threadexit(),
threadup() and threaddown() for semaphores, and a few other calls.

It ain't Mach but it works.

---Dan

lm@snafu.Sun.COM (Larry McVoy) (02/08/90)

In article <2212.21:08:11@stealth.acf.nyu.edu> brnstnd@stealth.acf.nyu.edu (Dan Bernstein) writes:
>In article <7848@pt.cs.cmu.edu> dstewart@fas.ri.cmu.edu (David B Stewart) writes:
>> How about implementing good ol' semaphores, with a much simpler interface
>> than Sys V.
>
>Actually, I just built a simple threads library on top of my signal
>library. Everything is shared. You get threadfork(), threadexit(),
>threadup() and threaddown() for semaphores, and a few other calls.
>
>It ain't Mach but it works.
>
>---Dan

Yeah, I did the same thing once for kicks.  It's cute, but essentially useless.
You don't have threadkill() or threadsignal(), but you do have 
block_all_threads() in the form of read(), write(), select(), and any other
system call that can block().

Bottom line: threads without kernel support are largely useless.
---
What I say is my opinion.  I am not paid to speak for Sun, I'm paid to hack.
    Besides, I frequently read news when I'm drjhgunghc, err, um, drunk.
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

del@thrush.semi.harris-atd.com (Don Lewis) (02/08/90)

In article <23449@stealth.acf.nyu.edu> brnstnd@stealth.acf.nyu.edu (Dan Bernstein) writes:
>In article <1990Jan24.193433.3332@semi.harris-atd.com> del@thrush.semi.harris-atd.com (Don Lewis) writes:
>>   open(file,O_PEEK)
>
>This could be a flag on any open, meaning simply ``update ctime rather
>than atime or mtime.'' Crackers already know about utimes(); perhaps an
>O_PEEK flag would educate inexperienced sysadmins.
>
>---Dan
I don't want it to update the ctime either.  I shouldn't have to dump the
file just because someone read it.  It only needs to keep atime from being
changed if you read the file.  If you write to the file, it is still ok
to change ctime and mtime.   I'm looking to preserve the atime so I can still
find unused files (candidates for deletion) even if the hierarchy containing
them has been tar'ed or cpio'ed.  Besides, it's a performance win because
the kernel doesn't have to go back and update all those inodes, sort of like
treating the filesystem as read-only.
--
Don "Truck" Lewis                      Harris Semiconductor
Internet:  del@semi.harris-atd.com     PO Box 883   MS 62A-028
UUCP:      rutgers!soleil!thrush!del   Melbourne, FL  32901              
Phone:     (407) 729-5205

peralta@pinocchio.Encore.COM (Rick Peralta) (02/09/90)

How about resource weighting.
For example:
	. have the free memory weight be tunable
	  That is to say instead of taking a compile time
	  value (maybe 10% free) for when to start flushing
	  things to swap, accept changes from the sysadmin.
	. memory usage priority (even on the page level)
	  lock this page in memory, this one is largely filler, etc.
	. I/O priority, get my I/O done ASAP or after everyone else
	. heavy CPU priority (more than just nice)
	  this thread is to be done at hardware interrupt level x,
	  after hardware interrupts, as part of the idle loop

Things like sync could be placed in the lowest priority loop and called
more frequently.  Things like servers that tend to be bottlenecks and
don't consume lots of resources could be placed in the upper categories.
Kernel code running in user space (whoops, I forgot to mention that)
could be executed in satisfactory time for most driver applications.
Things like compiles could get high memory priority and things like
Emacs could get a lower memory priority but higher I/O priority.


 - Rick "Just a fwe stray synapses..."

brnstnd@stealth.acf.nyu.edu (02/09/90)

In article <1990Feb8.080645.4458@semi.harris-atd.com> del@thrush.semi.harris-atd.com (Don Lewis) writes:
> In article <23449@stealth.acf.nyu.edu> brnstnd@stealth.acf.nyu.edu (Dan Bernstein) writes:
> >In article <1990Jan24.193433.3332@semi.harris-atd.com> del@thrush.semi.harris-atd.com (Don Lewis) writes:
> >>   open(file,O_PEEK)
> >This could be a flag on any open, meaning simply ``update ctime rather
> >than atime or mtime.'' Crackers already know about utimes(); perhaps an
> >O_PEEK flag would educate inexperienced sysadmins.
> I don't want it to update the ctime either.

That would be a security violation.

---Dan

peter@ficc.uu.net (Peter da Silva) (02/09/90)

In article <131446@sun.Eng.Sun.COM> lm@sun.UUCP (Larry McVoy) writes:
> Bottom line: threads without kernel support are largely useless.

Which is one reason I want *clean* asynchronous I/O, in the form of
some equivalent of my aread/awrite/await proposal.

brnstnd@stealth.acf.nyu.edu (02/09/90)

In article <131446@sun.Eng.Sun.COM> lm@sun.UUCP (Larry McVoy) writes:
> In article <2212.21:08:11@stealth.acf.nyu.edu> brnstnd@stealth.acf.nyu.edu (Dan Bernstein) writes:
> >In article <7848@pt.cs.cmu.edu> dstewart@fas.ri.cmu.edu (David B Stewart) writes:
> >> How about implementing good ol' semaphores, with a much simpler interface
> >> than Sys V.
> >Actually, I just built a simple threads library on top of my signal
> >library. Everything is shared. You get threadfork(), threadexit(),
> >threadup() and threaddown() for semaphores, and a few other calls.
> 
> Yeah, I did the same thing once for kicks.  It's cute, but essentially useless.
> You don't have threadkill() or threadsignal(),

As I said, everything is shared---including signals. What's wrong with
this? As long as all the threads set up their signal handlers through my
library, they won't interfere with each other.

> but you do have 
> block_all_threads() in the form of read(), write(), select(), and any other
> system call that can block().

This is based on the faulty assumptions that I demolish in a companion
message. When an I/O system call blocks, it can be interrupted! (Larry's
point is correct for certain system calls that do not perform I/O. For
example, if you open() a pty without anyone to talk to, you'll block in
kernel mode, as ps shows.)

> Bottom line: threads without kernel support are largely useless.

Bottom line: When I'm done with the libraries I'll give them to Rich and
see whether the rest of the world thinks they're so useless.

---Dan

del@thrush.semi.harris-atd.com (Don Lewis) (02/09/90)

In article <5068.16:48:52@stealth.acf.nyu.edu> brnstnd@stealth.acf.nyu.edu (Dan Bernstein) writes:
>In article <1990Feb8.080645.4458@semi.harris-atd.com> del@thrush.semi.harris-atd.com (Don Lewis) writes:
>> In article <23449@stealth.acf.nyu.edu> brnstnd@stealth.acf.nyu.edu (Dan Bernstein) writes:
>> >In article <1990Jan24.193433.3332@semi.harris-atd.com> del@thrush.semi.harris-atd.com (Don Lewis) writes:
>> >>   open(file,O_PEEK)
>> >This could be a flag on any open, meaning simply ``update ctime rather
>> >than atime or mtime.'' Crackers already know about utimes(); perhaps an
>> >O_PEEK flag would educate inexperienced sysadmins.
>> I don't want it to update the ctime either.
>
>That would be a security violation.
In what way?  The only information that I lose is that I can't tell if
someone has been looking at my files.  If I cared then I would make them
something other than rw-r--r--. Even in the present scheme, if I read my
file after the "cracker" has, then I can't tell if it was previously read.

If the filesystem is mounted read-only, the atime doesn't get updated, is
this a security violation?

dstewart@fas.ri.cmu.edu (David B Stewart) (02/09/90)

Another feature that would be useful as a BSD system call is to
lock down one or more pages in physical memory, and allow other
processors on a common backplane to mmap it.  Of course, this assumes
appropriate hardware architecture.

As an example, suppose one CPU is running BSD UNIX, while all others have
some kind of Real-Time OS (our current situation, except we have SunOS). 
It is possible for the UNIX machine to mmap part of the other CPUs'
memory, but the reverse is not possible: the Real-Time CPU cannot
mmap part of the BSD UNIX memory.  Such a mapping could greatly
increase the speed of communication between the Real-Time and non-real-time
environments.  On the Sun, this is possible using DVMA (Direct Virtual
Memory Access), but it is rather awkward to use.  The space reserved
for DVMA is available to only kernel routines.  User routines do
not have access to that memory.  Replacing this functionality with
a system call would allow user processes to access the reserved memory
on the UNIX system, while at the same time letting other CPUs on the
backplane also access the memory.  

I really have no clue if the above type of system call is feasible
to implement in BSD, since I am not familiar with the internals of
BSD.  Any further insight is welcome.

~dave


peter@ficc.uu.net (Peter da Silva) (02/09/90)

> > Bottom line: threads without kernel support are largely useless.

> Bottom line: When I'm done with the libraries I'll give them to Rich and
> see whether the rest of the world thinks they're so useless.

From your other message, it looks like you *have* kernel support.

jfh@rpp386.cactus.org (John F. Haugh II) (02/10/90)

In article <1990Feb9.025853.8202@semi.harris-atd.com> del@thrush.semi.harris-atd.com (Don Lewis) writes:
>>> I don't want it to update the ctime either.
>>
>>That would be a security violation.
>In what way?  The only information that I lose is that I can't tell if
>someone has been looking at my files.  If I cared then I would make them
>something other than rw-r--r--. Even in the present scheme, if I read my
>file after the "cracker" has, then I can't tell if it was previously read.

In some sense of the word "secure" information about whether a particular
file has been referenced is "security relevant".

In the typical "insecure" UNIX environment it would be useful to be
able to read a file without updating the atime or ctime.  In that
sense an "O_PEEK" flag would be Real Nice(tm).  File backup utilities
are forced to use utimes(), which as a side effect changes the ctime,
not really a nice thing to do considering that file backup utilities
tend to use the ctime for selecting files to dump ...
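
The side effect is easy to see in a sketch of what such a backup utility
does today: read the file, then put the old atime back with utimes(). The
function name below is invented for illustration, and sub-second timestamp
precision is ignored.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

/* Read a whole file, then restore its previous atime and mtime.
 * Returns 0 on success, -1 on error.
 * The utimes() call is itself an inode change, so ctime becomes "now". */
int read_then_restore_atime(const char *path)
{
    struct stat before;
    if (stat(path, &before) < 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        ;                       /* a real backup would write buf out here */
    close(fd);
    if (n < 0)
        return -1;

    struct timeval tv[2];
    tv[0].tv_sec = before.st_atime;     /* access time, whole seconds only */
    tv[0].tv_usec = 0;
    tv[1].tv_sec = before.st_mtime;     /* modification time */
    tv[1].tv_usec = 0;
    return utimes(path, tv);
}
```

After the call, atime looks untouched but ctime has moved forward, which is
exactly what makes a ctime-driven incremental dump pick the file up again.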

In a more secure environment it is possible to track references to
individual files with more granularity than "yes/no" and when.  A
feature like "O_PEEK" probably wouldn't matter in this case either
since interesting files are going to be tracked with other mechanisms.

Dan's assertion that O_PEEK is a "security violation" is only true
in the most simplistic sense.  It most certainly is not a "security
violation" in any "official" sense of the word.  This is borne out
by 2.2.2.2 of the TCSEC - the different times in the i-node DO NOT
provide the function required for conformance.  So I do not see how
any possible mis-use could be construed to be a lack of protection.

Indeed, were such a feature to be provided the only requirement would
be that its use be restricted in some fashion and that use of this
feature be auditable.
-- 
John F. Haugh II                             UUCP: ...!cs.utexas.edu!rpp386!jfh
Ma Bell: (512) 832-8832                           Domain: jfh@rpp386.cactus.org

jfh@rpp386.cactus.org (John F. Haugh II) (02/10/90)

In article <7904@pt.cs.cmu.edu> dstewart@fas.ri.cmu.edu (David B Stewart) writes:
>Another feature that would be useful as a BSD system call is to
>lock down one or more pages in physical memory, and allow other
>processors on a common backplane to mmap it.  Of course, this assumes
>appropriate hardware architecture.

It is actually possible to mmap() files over the wire - including
such transport mechanisms as SL/IP or Morse Code over a spark gap
rig.

>As an example, suppose one CPU is running BSD UNIX, while all others have
>some kind of Real-Time OS (our current situation, except we have SunOS). 
>It is possible for the UNIX machine to mmap part of the other CPUs
>memory; but the reverse is not possible.

Anything is possible.  Just sit down and dream up some way to make it
work.  There is nothing special about "real time", provided the "real
time" constraints are met.

steve@nuchat.UUCP (Steve Nuchia) (02/12/90)

In article <1990Feb9.025853.8202@semi.harris-atd.com> del@thrush.semi.harris-atd.com (Don Lewis) writes:
>If the filesystem is mounted read-only, the atime doesn't get updated, is
>this a security violation?

Hmm... maybe we don't need a new sys call, or a new argument/flag for
old ones, to let backup avoid updating the inode.  Maybe we just need
to remove an arbitrary restriction on an old one.  Namely, allow devices
to be mounted more than once.  If the second mount is RO then you
can back up from it and get most of what you want.

While I'm on the subject, a thought on disk/partition/file system
organization:

The current situation is really a mess, with all the partitioning
and defect mapping buried down in *each* disk driver.  What we
need is truly raw drivers for specific hardware and a generic
indirect driver implementing "cooked" features -- mapping,
partitioning, striping, mirroring -- in a *standard* way.

Neither of these suggestions is particularly difficult, unless
I missed something when I last looked at the relevant code.
-- 
Steve Nuchia	      South Coast Computing Services      (713) 964-2462
"If the conjecture `You would rather I had not disturbed you
 by sending you this.' is correct, you may add it to the list of
 uncomfortable truths."   - Edsger Dijkstra

les@chinet.chi.il.us (Leslie Mikesell) (02/13/90)

In article <19451@nuchat.UUCP> steve@nuchat.UUCP (Steve Nuchia) writes:
>>If the filesystem is mounted read-only, the atime doesn't get updated, is
>>this a security violation?

>Hmm... maybe we don't need a new sys call, or a new argument/flag for
>old ones, to let backup avoid updating the inode.  Maybe we just need
>to remove an arbitrary restriction on an old one.  Namely, allow devices
>to be mounted more than once.  If the second mount is RO then you
>can back up from it and get most of what you want.

Yes!  I'll second that one.  In fact, I'd go even further and let
arbitrary directories be mapped as read-only mount points.  The mechanisms
are probably mostly in place already in RFS and/or NFS.  Just provide
a local-loopback and take away the restriction of only mounting a
resource in one place on a machine (RFS has this, I don't know about
NFS).  I've wanted this in RFS anyway to give "public" read-only access
via one mount point while having "system" read/write access at the same
time through a different mount point.

Les Mikesell
 les@chinet.chi.il.us

webb@bass.tcspa.ibm.com (Bill Webb) (02/15/90)

> I'm asking about calls that don't require lots of code or fundamental
> changes in the system; that provide a useful service unavailable with
> current system calls; that, hopefully, simplify other calls; that don't
> hurt security; that don't hurt anything if they're not used.

How about a system call to indicate either the maximum number of file
descriptors allocated or, better still, a bit-map of the allocated
descriptors, e.g.
	nfound = getfds(nfds, fds);
	int nfound, nfds;
	fd_set *fds;

The parameters are similar to those of select(2), except that bits are
set for any valid file descriptors in the range 0...(nfds-1).

It might be useful to specify two fd_set parameters, e.g.

	nfound = getfds(nfds, readfds, writefds);
	int nfound, nfds;
	fd_set *readfds, *writefds;

where "readfds" returns bits for file descriptors open for reading,
"writefds" returns bits for file descriptors open for writing, and bits are
set in both for file descriptors open for both reading and writing.

These calls would mostly be useful to programs (such as shells) that have
to manage file descriptors and either don't want to clobber existing 
file descriptors, or want to know what file descriptors to close in certain
circumstances. This is particularly important now that the number of
file descriptors allowed is significantly increased in some Unix
implementations. This call should nicely complement 'getdtablesize' which
tells you how many bits you will need to hold the resulting information.
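
Until such a call exists, the two-set variant can be emulated in user space
by asking fcntl() about each descriptor in turn. The sketch below follows
the proposed signature, but it is only a stand-in: it costs one system call
per descriptor, which is what a real getfds() would avoid, and it is limited
to FD_SETSIZE descriptors.

```c
#include <fcntl.h>
#include <sys/select.h>

/* User-space stand-in for the proposed getfds().
 * Sets a bit in readfds for each descriptor below nfds that is open for
 * reading, and in writefds for each one open for writing (both for
 * read/write). Returns the number of open descriptors found. */
int getfds(int nfds, fd_set *readfds, fd_set *writefds)
{
    int nfound = 0;

    FD_ZERO(readfds);
    FD_ZERO(writefds);
    if (nfds > FD_SETSIZE)
        nfds = FD_SETSIZE;              /* an fd_set cannot hold more */

    for (int fd = 0; fd < nfds; fd++) {
        int flags = fcntl(fd, F_GETFL);
        if (flags < 0)
            continue;                   /* EBADF: descriptor is not open */

        int mode = flags & O_ACCMODE;
        if (mode == O_RDONLY || mode == O_RDWR)
            FD_SET(fd, readfds);
        if (mode == O_WRONLY || mode == O_RDWR)
            FD_SET(fd, writefds);
        nfound++;
    }
    return nfound;
}
```

A shell could call getfds(getdtablesize(), &r, &w) before a fork to decide
which inherited descriptors to close.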

----------------------------------------------------------------
The above views are my own, not those of my employer.
Bill Webb (IBM AWD Palo Alto), (415) 855-4457.
UUCP: ...!uunet!ibmsupt!webb