[comp.os.misc] Reducing system calls overhead

goshen@ccicpg.UUCP (Shmuel Goshen) (08/25/88)

I have been looking recently at ways to reduce the system call overhead.
One approach would be to move *simple* system calls from the kernel out
to the user. This is quite limited since many kernel data structures
cannot be accessed from User Mode.

The second approach would be to introduce a simple and quick interface
for "light weight" system calls (like getpid(), umask() etc'), which 
perform simple tasks and never sleep.  This quick interface will run 
in kernel mode, thus enabling access to kernel data structures, but 
will not include operations like saving the context (setjmp), signal 
delivery and allowing rescheduling upon return from the call. 
The net effect is almost as having the system call run in user mode. 

I am looking for opinions on these approaches. My main concern is side 
effects caused by eliminating signal delivery and rescheduling until 
a "software forced exception" or a *heavy* system call is executed 
by the process. (Consider, for example, a program running getpid() 
calls in a loop).

-- 

Shmuel Goshen				(714) 951-8053	
Computer Consoles Inc.			(714) 458-7282
Irvine, CA.		  {allegra!hplabs!felix,seismo!rlgvax}!ccicpg!goshen

vixie@decwrl.dec.com (Paul Vixie) (08/26/88)

In article <21606@ccicpg.UUCP> goshen@ccicpg.UUCP (Shmuel Goshen) writes:
# [...] This quick interface will run 
# in kernel mode, thus enabling access to kernel data structures, but 
# will not include operations like saving the context (setjmp), signal 
# delivery and allowing rescheduling upon return from the call. 
# The net effect is almost as having the system call run in user mode. 

It depends on the CPU and MMU.  If you're changing to supervisor mode,
chances are you're switching memory maps / page tables, which on many
CPU/MMU combos means: kiss your cache goodbye (sorry :-)).
-- 
Paul Vixie
Digital Equipment Corporation	Work:  vixie@dec.com	Play:  paul@vixie.UUCP
Western Research Laboratory	 uunet!decwrl!vixie	   uunet!vixie!paul
Palo Alto, California, USA	  +1 415 853 6600	   +1 415 864 7013

jfh@rpp386.UUCP (The Beach Bum) (08/26/88)

In article <21606@ccicpg.UUCP> goshen@ccicpg.UUCP (Shmuel Goshen) writes:

[ thoughts on lightening the system overhead of system calls ]

>I am looking for opinions on these approaches. My main concern is side 
>effects caused by eliminating signal delivery and rescheduling until 
>a "software forced exception" or a *heavy* system call is executed 
>by the process. (Consider, for example, a program running getpid() 
>calls in a loop).

i did optimization work on trap() to lighten the syscall overhead.
the first approximation was to check the number of arguments which
were expected and jump around the code which copied the arguments.

the second approximation checked the number of return arguments,
and so on.  i believe the final solution (which i had nothing to
do with) added several additional fields to the sysent structure
to indicate how complex the call was.

the net result was a big boost in benchmark performance.  i don't
remember WHICH system call a certain benchmark company used to
measure overhead, but that metric was improved by 100%.
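
[A minimal sketch of the idea above, not the actual trap() code. It assumes
a table entry that records the argument count plus one "light" flag. The
post does not describe the real sysent fields, so SY_LIGHT, sysent_sketch,
sysent_tab, dispatch_sketch and the two stand-in handlers are all
hypothetical names.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag: the post does not say which fields were added. */
#define SY_LIGHT 0x01   /* call never sleeps; skip the heavy entry/exit work */

struct sysent_sketch {
    int  sy_narg;                        /* arguments the call expects */
    int  sy_flags;                       /* SY_LIGHT or 0 */
    long (*sy_call)(const long *args);   /* handler */
};

/* Stand-in handlers for the demo; the pid value is a fixed placeholder. */
static long sk_getpid(const long *args) { (void)args; return 1234; }
static long sk_add(const long *args)    { return args[0] + args[1]; }

static const struct sysent_sketch sysent_tab[] = {
    { 0, SY_LIGHT, sk_getpid },   /* code 0: trivial, no arguments */
    { 2, 0,        sk_add    },   /* code 1: treated as heavy for the demo */
};
#define NSYS_SKETCH (sizeof sysent_tab / sizeof sysent_tab[0])

int heavy_path_count;   /* counts trips through the simulated heavy path */

long dispatch_sketch(size_t code, const long *uargs)
{
    long args[8] = { 0 };
    const struct sysent_sketch *ent;
    int i;

    assert(code < NSYS_SKETCH);
    ent = &sysent_tab[code];
    assert(ent->sy_narg <= 8);

    /* Copy only what is expected; with sy_narg == 0 the loop body never runs. */
    for (i = 0; i < ent->sy_narg; i++)
        args[i] = uargs[i];

    if (ent->sy_flags & SY_LIGHT)
        return ent->sy_call(args);       /* light path: straight in and out */

    heavy_path_count++;                  /* stands in for context save, signal
                                            check and reschedule check */
    return ent->sy_call(args);
}
```

[Calling dispatch_sketch(0, NULL) takes the light path and leaves
heavy_path_count untouched; code 1 goes through the counted heavy path.]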
-- 
John F. Haugh II (jfh@rpp386.UUCP)                           HASA, "S" Division

    "If the code and the comments disagree, then both are probably wrong."
                -- Norm Schryer

jack@cwi.nl (Jack Jansen) (08/26/88)

I don't think that the solution you pose for simple calls would
help anything. First, calls like getuid, getpid, etc. are very
rare, and, second, there's a much simpler solution for these calls:
just do the call the first time and cache the result.

For getpid() you would need some help from fork/vfork, but nothing
undoable.
This would get the same results, and even cheaper.
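
[A minimal sketch of the caching idea, assuming it lives in the C library
and that programs reach fork() through a wrapper. cached_getpid and
caching_fork are hypothetical names. vfork is not handled here: a vfork
child shares the parent's memory, so resetting the cache there would also
clear the parent's copy.]

```c
#include <sys/types.h>
#include <unistd.h>

static pid_t cached_pid;          /* 0 means "not fetched yet" */

/* Ask the kernel once, then answer from the cached copy. */
pid_t cached_getpid(void)
{
    if (cached_pid == 0)
        cached_pid = getpid();
    return cached_pid;
}

/* The "help from fork": the child must forget the parent's pid. */
pid_t caching_fork(void)
{
    pid_t child = fork();

    if (child == 0)
        cached_pid = 0;           /* child: refetch on next cached_getpid() */
    return child;
}
```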

Now, two calls that *would* benefit from optimization are stat() and
ioctl().
Some system call traces I've seen showed that these calls
are done incredibly often: stat() by programs that first check
modes before they create a file; and ioctl() by all sorts of
shells, etc. that have command line editing.

Unfortunately the quick-entry system doesn't work here because both
these calls can block....

	Jack Jansen, jack@cwi.nl (or jack@mcvax.uucp)
	The shell is my oyster.

henry@utzoo.uucp (Henry Spencer) (08/27/88)

In article <21606@ccicpg.UUCP> goshen@ccicpg.UUCP (Shmuel Goshen) writes:
>I have been looking recently at ways to reduce the system call overhead.
>... introduce a simple and quick interface
>for "light weight" system calls (like getpid(), umask() etc'), which 
>perform simple tasks and never sleep...

Better profile your system load before putting a lot of effort into this.
Or look at the Bach&Gomes paper in the Spring 88 EUUG proceedings, which
observes that (after allowing for some oddities on their system) nearly
all system calls are file-system-related.  The light-weight calls you're
talking about simply don't occur frequently enough to be worth much effort.
-- 
Intel CPUs are not defective,  |     Henry Spencer at U of Toronto Zoology
they just act that way.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

ast@cs.vu.nl (Andy Tanenbaum) (08/29/88)

In article <21606@ccicpg.UUCP> goshen@ccicpg.UUCP (Shmuel Goshen) writes:
>I have been looking recently at ways to reduce the system call overhead.

I would suggest you look closely at MULTICS and OS/2, which uses many ideas
taken straight from MULTICS.  In these systems, the entire operating system is
contained within the address space of each user process, with appropriate
protection mechanisms to keep users from doing naughty things.  As a result,
a system call is just a procedure call to an operating system routine, and the
complete overhead is only slightly more than a normal procedure call.  This
gets the time down from milliseconds to microseconds.  

Andy Tanenbaum (ast@cs.vu.nl)

bga@raspail.UUCP (Bruce Albrecht) (08/31/88)

In article <1316@ast.cs.vu.nl>, ast@cs.vu.nl (Andy Tanenbaum) writes:
> In article <21606@ccicpg.UUCP> goshen@ccicpg.UUCP (Shmuel Goshen) writes:
> >I have been looking recently at ways to reduce the system call overhead.
> 
> I would suggest you look closely at MULTICS and OS/2, which uses many ideas
> taken straight from MULTICS.  In these systems, the entire operating system is
> contained within the address space of each user process, with appropriate
> protection mechanisms to keep users from doing naughty things.  

Control Data's NOS/VE system on their Cyber 180's works the same way.  However,
this really only works when the memory scheme is based on hierarchical segments.
I can't speak for OS/2, but for Multics and NOS/VE, the memory is set up so
that each segment has ring attributes associated with it, and the ring 
attributes determine what accesses are allowed.  Most, if not all system data
is stored in segments which have ring attributes making them inaccessible to
the user's normal ring level, and the system calls are in segments that are
accessible to the less protected rings.  When a user calls a system procedure,
the ring mechanism automatically lowers the working ring (if the system
procedure's ring attributes indicate this is necessary), so that the heretofore
system data is now accessible.  If one is using an architecture that is not
segmented (such as the 68000 or NSC 32000), it's pretty hard, if not impossible
to automatically make system data accessible for read or read/write with
just a subroutine call.

Another thing I would like to point out is that containing the entire OS within
the user address space usually cuts down on the maximum user address space.
On the Cyber 180, this is not a problem, because the user has access to 2047
segments of size 2**31.  If you are working with a 68000, you've only got
a single segment of size 2**24 (2**31 for 68020/68030).  If you have system
data that you think the user should be able to read, but not write, you could
map some system data into the user's address space, as part of the system
calling mechanisms (run time library), and make those calls that don't modify
system data subroutine calls rather than system calls (traps).

It is possible to simulate a hierarchical segmented memory schema on a 
non-segmented processor.  Unfortunately, it requires that you trap any call
to a ring outside of the current ring, so that you can reset all of the
read/write/execute attributes on all of the segments (pages) known to the
user.  This is likely to be a lot more expensive than system calls, though.

Bruce Albrecht ({backbone}!shamash!raspail!bga)

dlm@cuuxb.ATT.COM (Dennis L. Mumaugh) (09/01/88)

In article <7622@boring.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
        I don't think that the solution you pose for simple calls
        would  help  anything.  First, calls like getuid, getpid,
        etc. are very rare, and, second, there's a  much  simpler
        solution for these calls: just do the call the first time
        and cache the result.

1).  We have a system call trace program that reports on each and
every system call a process makes -- useful for support to figure
out what a program  REALLY  is  doing.  The  most  commonly  used
system call seems to be
	lseek(fd,0L,1) ==> tell me where I am.
I have seen one program issue  50  of  these  in  a  row  without
intervening  I/O  or  other system calls. [Result of the standard
I/O system design.]

Caching the lseek and redesign of the I/O library could be a real
win.  With the release of the system call tracer in the future as
a standard tool, it will be possible for developers  to  look  at
dynamic program behaviour and catch some of these problems.
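
A minimal sketch of caching the offset in the I/O library, so that
repeated "where am I" queries cost one lseek() instead of fifty.  The
names tracked_fd, tracked_tell and tracked_read are hypothetical, and
the sketch assumes nothing else moves the offset (no dup'ed or
inherited descriptor sharing it, no O_APPEND writes).

```c
#include <sys/types.h>
#include <unistd.h>

struct tracked_fd {
    int   fd;
    off_t pos;          /* offset as last known to the library */
    int   valid;        /* nonzero once pos has been fetched */
    long  lseek_calls;  /* how often the kernel was really asked */
};

/* Equivalent of lseek(fd, 0L, 1), answered from the cache when possible. */
off_t tracked_tell(struct tracked_fd *t)
{
    if (!t->valid) {
        t->lseek_calls++;
        t->pos = lseek(t->fd, 0L, SEEK_CUR);
        if (t->pos == (off_t)-1)
            return (off_t)-1;
        t->valid = 1;
    }
    return t->pos;
}

/* Reads go through the library so the cached offset stays in step. */
ssize_t tracked_read(struct tracked_fd *t, void *buf, size_t n)
{
    ssize_t got = read(t->fd, buf, n);

    if (got > 0 && t->valid)
        t->pos += got;
    else if (got < 0)
        t->valid = 0;   /* offset uncertain after an error: refetch later */
    return got;
}
```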

        Unfortunately the quick-entry system  doesn't  work  here
        because both these calls can block....


2).  Fix the sysent table with a flag  for  blocking/non-blocking
syscalls.  On guaranteed non-blocking calls use a faster process.
On the WE32100 chip with gate tables, it was possible to  specify
different  entry  points  for  each  system call.  The developers
chose to use a single entry point for all system calls,  BUT  the
hardware  does  support all different entry points.  Thus syscall
handling  could  be  re-written  to  allow  light-weight  syscall
handling.

3).  Or, map the user's u-block (and proc table entry) into  some
funny  virtual  area  of the user's space (read-only).  Then many
syscalls could be eliminated in favor of simple code:

Instead of getpid() being a system call it becomes:
	#define getpid() (PROC->p_pid)
and similarly:
	#define getppid() (PROC->p_ppid)
	#define getuid() (UBLOCK->u_uid)
and so forth.

With better design of the user's virtual memory  space  one  could
"allow"  a  lot less kernel activity.  The trade-off of course is
the necessity of having  enough  MMU  entries  available,  kernel
interlocking  so  the  extra  pages are there when needed and the
extra setup for a context switch.
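
A user-level sketch of the read-only mapping in 3), under loud
assumptions: an anonymous page filled in once and then write-protected
stands in for the kernel-maintained u-block and proc entry, and the two
structures are folded into one.  proc_view, PROC_VIEW, map_proc_view
and the fast_* macros are hypothetical names; a real kernel would own
and update the page itself.

```c
#define _DEFAULT_SOURCE            /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Fields a kernel might expose; layout is an assumption. */
struct proc_view {
    pid_t p_pid;
    pid_t p_ppid;
    uid_t u_uid;
};

static const struct proc_view *PROC_VIEW;   /* read-only view of the page */

/* Simulates the kernel side: fill the page, then drop write permission. */
int map_proc_view(void)
{
    struct proc_view *page;

    assert(PROC_VIEW == NULL);               /* map only once */
    page = mmap(NULL, sizeof *page, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    page->p_pid  = getpid();
    page->p_ppid = getppid();
    page->u_uid  = getuid();

    if (mprotect(page, sizeof *page, PROT_READ) != 0) {
        munmap(page, sizeof *page);
        return -1;
    }
    PROC_VIEW = page;
    return 0;
}

/* After setup these are plain memory reads, with no trap into the kernel. */
#define fast_getpid()  (PROC_VIEW->p_pid)
#define fast_getppid() (PROC_VIEW->p_ppid)
#define fast_getuid()  (PROC_VIEW->u_uid)
```

In this simulation the values go stale after fork() or setuid(); that
is the "kernel interlocking" cost noted above, since the kernel would
have to keep the page current.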

-- 
=Dennis L. Mumaugh
 Lisle, IL       ...!{att,lll-crg}!cuuxb!dlm  OR cuuxb!dlm@arpa.att.com

jpd@usl-pc.usl.edu (DugalJP) (09/06/88)

In article <1316@ast.cs.vu.nl> ast@cs.vu.nl (Andy Tanenbaum) writes:
...
>I would suggest you look closely at MULTICS and OS/2, which uses many ideas
>taken straight from MULTICS.  In these systems, the entire operating system is
>contained within the address space of each user process, with appropriate
...

I've been a systems analyst on our Multics system for 13 years now, and
OS/2 does indeed have great possibilities, BUT only when the superior
features of the 386 processor are used.  Doesn't OS/2 use only techniques
possible on the 286?

-- 
-- James Dugal,	N5KNX		USENET: ...!{dalsqnt,killer}!usl!jpd
Associate Director		Internet: jpd@usl.edu
Computing Center		US Mail: PO Box 42770  Lafayette, LA  70504
University of Southwestern LA.	Tel. 318-231-6417	U.S.A.

jerry@olivey.olivetti.com (Jerry Aguirre) (09/07/88)

In article <21606@ccicpg.UUCP> goshen@ccicpg.UUCP (Shmuel Goshen) writes:
>
>The second approach would be to introduce a simple and quick interface
>for "light weight" system calls (like getpid(), umask() etc'), which 
>perform simple tasks and never sleep.  This quick interface will run 
>in kernel mode, thus enabling access to kernel data structures, but 
>will not include operations like saving the context (setjmp), signal 
>delivery and allowing rescheduling upon return from the call. 
>The net effect is almost as having the system call run in user mode. 

I once did this for a proprietary OS (not Unix).  The general system
call and return took slightly more than 300us.  I added a test in the
software interrupt routine to check if the request was a simple one.  If
it was then the code would perform the service and return directly to
the caller.  This reduced the time to slightly more than 30us.

This was all done with interrupts disabled but I calculated that the
routine did the requested service and returned in less time than the
normal code required to save context and enable interrupts.

For that OS this was a big win.  In effect it had system calls doing
system calls so many of the common calls had their overhead multiplied
several times doing trivial functions like getting and returning
parameters.  (This was a result of an attempt to maintain structure
while limited to a 16 bit logical address space.)

There are certainly several Unix system calls that can be considered
trivial (getpid, umask, gettimeofday) but as others have pointed out
they are not normally used very often.  The exact overhead in checking
for trivial services is going to depend on the hardware facilities
available.  In some cases it may be zero.  In others it is going to add
a small amount of overhead to every non-trivial call.  Only measurements
under a "normal" load are going to tell if the result is a net gain.
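
[A rough cost model for that trade-off, as a sketch only. The 300us and
30us figures in the usage note are the ones quoted above for the
proprietary OS; the cost of the extra check is not given, so any value
used for check_us is an assumption. Function names are hypothetical.]

```c
/*
 * Mean cost per call when a fraction f of calls are trivial.
 * Trivial calls cost fast_us (check included); all others pay the
 * full path plus the added check.
 */
double mean_cost_us(double f, double full_us, double fast_us, double check_us)
{
    return f * fast_us + (1.0 - f) * (full_us + check_us);
}

/*
 * Fraction of trivial calls at which the mean cost equals the old
 * cost (full_us on every call).  Above this fraction the check pays
 * for itself.
 */
double break_even_fraction(double full_us, double fast_us, double check_us)
{
    return check_us / (full_us + check_us - fast_us);
}
```

[With a full path of 300us, a fast path of 30us and an assumed 2us check,
break_even_fraction(300, 30, 2) is 2/272, so under this model the check
wins once a bit under 1% of calls are trivial. Real numbers still have to
come from measurement.]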

It would certainly look nice on some of the benchmarks though.
				Jerry Aguirre

james@bigtex.uucp (James Van Artsdalen) (09/08/88)

In article <29@usl-pc.usl.edu>, jpd@usl-pc.UUCP (DugalJP) wrote:

> [..] OS/2 does indeed have great possibilities, BUT only when the superior
> features of the 386 processor are used.  Doesn't OS/2 use only techniques
> possible on the 286?

Indeed OS/2 uses only 286 features.  But this isn't nearly as big a
problem as the file system: OS/2 uses the MS-DOS 3.x file system!  Not
only is this slow, but it limits the capacity of any logical drive to
32meg.  Even MS-DOS 4.0 can do better than this....
-- 
James R. Van Artsdalen    ...!uunet!utastro!bigtex!james     "Live Free or Die"
Home: 512-346-2444 Work: 328-0282; 110 Wild Basin Rd. Ste #230, Austin TX 78746

mouse@mcgill-vision.UUCP (der Mouse) (09/11/88)

In article <28575@oliveb.olivetti.com>, jerry@olivey.olivetti.com (Jerry Aguirre) writes:
> In article <21606@ccicpg.UUCP> goshen@ccicpg.UUCP (Shmuel Goshen) writes:
>> The second approach would be to introduce a simple and quick
>> interface for "light weight" system calls [...].
> I once did this for a proprietary OS (not Unix).  [description of how
> this was a win in this case.]

We have a MicroVAX with an auxiliary CPU in it, something for which
very little software appears to exist.  We were building some, which
involved writing a kernel to run on the auxiliary.  In this particular
system, there was one syscall which was used very heavily, and it was a
very simple one: set a couple of flags and return (except for once in a
long while, in which case it did a good deal of work).  Well, it turned
out that adding a couple of tests to the chmk handler to test for that
particular syscall got us a factor of about 5 improvement.  Quite a
win.  (Ultimately, we got an even bigger win by making those particular
flags reside in a piece of memory which was made writeable by user
mode, so no mode-change was necessary.  Turned a full syscall, with all
its overhead, into an assign, test, plus a syscall in the "hard" case.)
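
[A sketch of the assign/test/syscall shape, not the actual MicroVAX code.
A static struct stands in for the memory made writeable by user mode, and
the condition that selects the "hard" case is not described above, so the
need_kernel field is an assumption. All names are hypothetical.]

```c
/* Stands in for the piece of memory shared between kernel and user mode. */
struct shared_flags {
    volatile int want;          /* the flag the old syscall used to set */
    volatile int need_kernel;   /* assumed: kernel raises this when the
                                   rare, expensive work is due */
};

struct shared_flags shared_sketch;
int hard_case_calls;            /* counts simulated real syscalls */

/* Stand-in for the real syscall taken only in the hard case. */
static void hard_case_syscall_sketch(void)
{
    hard_case_calls++;
    shared_sketch.need_kernel = 0;
}

/* Common case: one store and one test, no change of mode. */
void set_flag_fast(int value)
{
    shared_sketch.want = value;             /* assign */
    if (shared_sketch.need_kernel)          /* test */
        hard_case_syscall_sketch();         /* syscall, hard case only */
}
```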

					der Mouse

			old: mcgill-vision!mouse
			new: mouse@larry.mcrcim.mcgill.edu

rich@eday.pilchuck.Data-IO.COM (Rich Wallick) (09/13/88)

In article <29@usl-pc.usl.edu>, jpd@usl-pc.usl.edu (DugalJP) writes:
> OS/2 does indeed have great possibilities, BUT only when the superior
> features of the 386 processor are used.  Doesn't OS/2 use only techniques
> possible on the 286?

OS/2 just uses protect mode of the 286;  it will (i believe) recognize
if it is running on a 386 box and use the 386's ability to switch
between protect and real mode. this increases speed of high memory
(>1meg) access when in the real mode box over the 8042/8047
reset and loadall instruction.  i don't see any other 386 capabilities
being used.

			-<<O>>-