[comp.os.research] How light weight can a process get?

jms@central.cis.upenn.edu (Jonathan M. Smith) (11/21/90)

Notes on "lightweight" processes

1. I wanted to see how minimal a process could be created, assuming I was
willing to give up address space protection. For experimental purposes, I
also focused on an ideal scheduling method, i.e., co-routine scheduling.
The method I used was to use the facilities setjmp() and longjmp() which
are provided by the C library. They provide the ability to achieve a non-local
goto, that is, one which crosses routine boundaries. setjmp() saves the current
"state" of the program (i.e., a minimal set of registers, floating registers,
frame pointers, etc) into a jmp_buf structure 
(see /usr/include/setjmp.h) and longjmp(),
given a jmp_buf, restores the specified state, including the program counter.

2. The experiment I chose was to implement a null filter (like cat) using
a reader and writer process communicating using a small shared memory, in this
case a buffer of one byte. The code (mpx.c) follows:

#include <stdio.h>
#include <setjmp.h>

#define NPROC 2
#define READER 0
#define WRITER 1
#define S_READ 0
#define S_WRITE 1
#define S_DONE 2

#define SIZE 1
struct control {
	int status;
	char buf[SIZE];
};

struct control ctl;
jmp_buf proc[NPROC];

/*
 * co-routine design (toy program) to emulate "cat" and give some timing
 * data.
 * Writer() is called first, to initialize its jmp_buf. Then reader() is
 * called, and it co-routine schedules with writer() until they are done.
 * sort of minimal, non-preemptive, event-driven scheduling!
 */

main()
{
	ctl.status = S_READ;
	writer();
	reader();
	exit( 0 );
}

writer()
{
	_setjmp( proc[WRITER] );

	if( ctl.status == S_WRITE )
	{
		putchar( ctl.buf[0] );
		ctl.status = S_READ;
		_longjmp( proc[READER], 0);
	}

}

reader()
{
	int c;

	_setjmp( proc[READER] );

	if( ctl.status == S_READ )
	{
		c = getchar();
		if( c == EOF )
			exit( 0 );
		else
		{
			ctl.buf[0] = c;
			ctl.status = S_WRITE;
			_longjmp( proc[WRITER], 0 );
		}
	}
}

3. Notes:
_setjmp() and _longjmp() are specialized versions of setjmp() and longjmp()
which do not restore the SIGVEC values under 4.3BSD-like systems, and are
thus faster. Note that they must save all machine registers, including 
floating values, which are numerous. Note also that due to the minimal state
save that _setjmp()/_longjmp() do, such things as local variables are not saved
and restored. Type "volatile" could be used
as a cautionary measure in the control structure
declarations, although it shouldn't matter.

4.
In order to give a valid comparison between a "multiprocessing" reader/writer
scheme, and the straightforward method, a la "cat", a simplistic null filter
program ("mycat.c") was written. All of the overhead of character at a time
I/O done through stdio is still here, but no "context-switching", just a 
straightforward while() loop:

#include <stdio.h>

/*
 */

main()
{
	int c;

	while( (c=getchar()) != EOF )
	{
		putchar( c );
	}
	exit( 0 );
}

Certainly, both codes could be considerably optimized by using bigger buffers
and using system calls directly rather than through stdio's buffering. But
that's OK, because the focus of the experiments is minimal cost
multiprocessing, not writing fast null filters. The multiprocessing code could
be made more sophisticated by having a real "scheduler" rather than straight
co-routining. But from the point of minimality, that's excess baggage
(it actually can be done quite trivially here, as the reader and writer could
both pass control to the "scheduler" when they have finished their small
chunk of work - the "scheduler" then decides which to restart next based on
the state variable).

Let's see how they perform:

5. The experimental environment is an IBM RS/6000 running AIX. To keep things
easy, I will use a single big test file, namely "/unix", which is 1261933
bytes long on this system. ./mpx is the compiled mpx.c, and ./mycat is the 
analogous mycat.c. The buffer cache was initialized by running both binaries
and reading the /unix file before the timings were made.

$ time ./mycat </unix >/dev/null

real    0m4.90s
user    0m4.72s
sys     0m0.16s

$ time ./mpx </unix >/dev/null

real    0m13.90s
user    0m13.60s
sys     0m0.21s

As an additional timing experiment, I recompiled both programs using
optimization, which is fairly sophisticated on this machine. The results:

$ time ./mpx </unix >/dev/null

real    0m10.59s
user    0m10.38s
sys     0m0.15s
$ time ./mycat </unix >/dev/null

real    0m1.97s
user    0m1.81s
sys     0m0.16s

The optimization seems to help "mycat" a good deal more than it helps "mpx";
I suspect the reason is that "mpx" has copying it cannot get around. But let's
see what these numbers mean. 

6. If we subtract the user time for "mpx" from the
user time for "mycat" (using the optimized times - of course we'll optimize our
OS!), we get: 10.38-1.81, or 8.57 seconds. In this 8.57 seconds, we have 
processed 1261933 characters, the contents of /unix. For each character
processed, we have:
	1. Restored the reader
	2. Read the character
	3. Stashed it to a buffer (shared memory(!))
	4. Restored the writer
	5. Written out the character

I will call the pair of 1. and 4. a "context switch", since a "process"
has been switched out and another switched in.
So the timing could be interpreted as
the time for 1261933 context switches, meaning that for such minimal 
"processes" we can get about 150,000 context switches per second. Or that a
context switch can be done in about 7 microseconds, if I've done my division
right. Of course, this is a relatively fast machine, and everything fits in
the cache....

So this minimal case is pretty good. I don't see a way of making it
much smaller, unless:
	1. You change the language model so that registers are not used.
	Then you don't need to save them. But I expect the performance hit
	from not using registers is much worse than that from slower
	_setjmp()/_longjmp()!
	2. You provide hardware to give something like a "set-associative"
	register cache and just switch pointers - I think the Sun-4s do
	something like this but I am not sure of the details.

						-Jonathan

pardo@cs.washington.edu (David Keppel) (11/29/90)

jms@central.cis.upenn.edu (Jonathan M. Smith) writes:
>Notes on "lightweight" processes

>[Using setjmp/longjmp]

In case anybody cares, that's not a portable implementation as, e.g.,
a system may unwind the stack on a |longjmp|.

>[7 microseconds per context switch on a RS/6000.]
>[Context switch includes save/restore all machine registers.]

I'm not familiar with the processor architecture.  I assume 32 integer,
32 floating-point registers and a 30MHz (30ns) clock and no stalls,
then in 7 microseconds I can execute 7000/30 = 233 instructions.
Saving and restoring 64 registers should take at least 128 cycles, but
the rest of the context switch could be pretty cheap - a few dozen
instructions.  So you're within a factor of two, but some shaving could
be done.

Another data point: GCC (the GNU C compiler) has the capability to say
``certain registers are clobbered by this |asm()|''.  I have used this
capability to implement a nonpreemptive multiprocessor lightweight
process package on the Sequent i386-based multiprocessor that performs
(multiprocessor locked) thread swaps on a 16MHz 6-register i386
(including the overhead of two procedure calls with a dynamically-bound
target address) in slightly over 30 microseconds.  Using the |asm()|
feature lets me do *no* register saves and restores in the context
switch routine, relying instead on the compiler to insert them
callee-saves in the procedure in which the context switch is performed.

The unfortunate thing is that the 1.x series GCC simply avoids using
certain registers instead of treating the |asm()| as a register `kill',
so the code generated around the context switch is inferior.  That
could be avoided with a superior register allocator (in 2.x maybe?)

>[Or, you can use a register window pointer like the Sun-4s.]

Unfortunately, massaging the window pointer on a conforming SPARC is a
kernel-reserved operation, so even a minimalist implementation has to
pay the kernel trap cost.  The default primitives that are provided
with e.g., SunOS are targeted for procedure calls not threads, and so
you can't really do threads using windows under SunOS on the SPARC;
further, saving the register file on overflow/underflow has been
optimized but the general `flush' appears to be significantly slower.

The APRIL project [Agarwal, Lim, Kranz, and Kubiatowicz; Proceedings of
the 17th Annual International Symposium on Computer Architecture, IEEE
Computer Society Press, pg. 104, 1990] will implement context switching
using a slightly modified SPARC in an estimated 11 cycles for
``in-cache'' context swaps.  Of that, 6 cycles is for the trap handler.
Up to 4 threads may be in-cache at any given time, ``out of cache''
context swaps require saving and restoring 32 integer and 8
floating-point registers, so the context switch time will be like that
of a conventional processor (no kernel access will be required, since
only the current window must be saved and restored).

How light weight can a process get?  Pretty light.

	;-D on  ( Tastes great, less filling, swaps fast )  Pardo
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo

lance@uunet.UU.NET (11/30/90)

jms@central.cis.upenn.edu (Jonathan M. Smith) writes:



>Notes on "lightweight" processes

>1. I wanted to see how minimal a process could be created, assuming I was
>willing to give up address space protection. For experimental purposes, I
>also focused on an ideal scheduling method, i.e., co-routine scheduling.
>The method I used was to use the facilities setjmp() and longjmp() which
>are provided by the C library. They provide the ability to achieve a non-local
>goto, that is, one which crosses routine boundaries. setjmp() saves the current
>"state" of the program (i.e., a minimal set of registers, floating registers,
>frame pointers, etc) into a jmp_buf structure 
>(see /usr/include/setjmp.h) and longjmp(),
>given a jmp_buf, restores the specified state, including the program counter.


A "neutrino-weight" process:

process_a() {
	static state = 0;

	switch (state) {
		case 0:		do_something(); state = 1; break;
		case 1:		do_something_else(); state = 0; break;
	}
}

The setjmp/longjmp hack is handy.  The only problem with it is that
you have to allocate separate stack RAM for each thread, and if your
operating system doesn't support a separate signal stack, running out
of stack causes the program to die.

Lance