jms@central.cis.upenn.edu (Jonathan M. Smith) (11/21/90)
Notes on "lightweight" processes 1. I wanted to see how minimal a process could be created, assuming I was willing to give up address space protection. For experimental purposes, I also focused on an ideal scheduling method, i.e., co-routine scheduling. The method I used was to use the facilities setjmp() and longjmp() which are provided by the C library. They provide the ability to achieve a non-local goto, that is, one which crosses routine boundaries. setjmp() saves the current "state" of the program (i.e., a minimal set of registers, floating registers, frame pointers, etc) into a jmp_buf structure (see /usr/include/setjmp.h) and longjmp(), given a jmp_buf, restores the specified state, including the program counter. 2. The experiment I chose was to implement a null filter (like cat) using a reader and writer process communicating using a small shared memory, in this case a buffer of one byte. The code (mpx.c) follows: #include <stdio.h> #include <setjmp.h> #define NPROC 2 #define READER 0 #define WRITER 1 #define S_READ 0 #define S_WRITE 1 #define S_DONE 2 #define SIZE 1 struct control { int status; char buf[SIZE]; }; struct control ctl; jmp_buf proc[NPROC]; /* * co-routine design (toy program) to emulate "cat" and give some timing * data. * Writer() is called first, to initialize its jmp_buf. Then reader() is * called, and it co-routine schedules with writer() until they are done. * sort of minimal, non-preemptive, event-driven scheduling! */ main() { ctl.status = S_READ; writer(); reader(); exit( 0 ); } writer() { _setjmp( proc[WRITER] ); if( ctl.status == S_WRITE ) { putchar( ctl.buf[0] ); ctl.status = S_READ; _longjmp( proc[READER], 0); } } reader() { int c; _setjmp( proc[READER] ); if( ctl.status == S_READ ) { c = getchar(); if( c == EOF ) exit( 0 ); else { ctl.buf[0] = c; ctl.status = S_WRITE; _longjmp( proc[WRITER], 0 ); } } } 3. Notes: _setjmp() and _longjmp() are specialized versions of setjmp() and longjmp() which do not restore the SIGVEC values under 4.3BSD-like systems, and are thus faster. Note that they must save all machine registers, including floating values, which are numerous. Note also that due to the minimal state save that _setjmp()/_longjmp() do, such things as local variables are not saved and restored. Type "volatile" could be used as a cautionary measure in the control structure declarations, although it shouldn't matter. 4. In order to give a valid comparison between a "multiprocessing" reader/writer scheme, and the straightforward method, a la "cat", a simplistic null filter program ("mycat.c") was written. All of the overhead of character at a time I/O done through stdio is still here, but no "context-switching", just a straightforward while() loop: #include <stdio.h> /* */ main() { int c; while( (c=getchar()) != EOF ) { putchar( c ); } exit( 0 ); } Certainly, both codes could be considerably optimized by using bigger buffers and using system calls directly rather than through stdio's buffering. But that's OK, because the focus of the experiments is minimal cost multiprocessing, not writing fast null filters. The multiprocessing code could be made more sophisticated by having a real "scheduler" rather than straight co-routining. But from the point of minimality, that's excess baggage (it actually can be done quite trivially here, as the reader and writer could both pass control to the "scheduler" when they have finished their small chunk of work - the "scheduler" then decides which to restart next based on the state variable). Let's see how they perform: 5. The experimental environment is an IBM RS/6000 running AIX. To keep things easy, I will use a single big test file, namely "/unix", which is 1261933 bytes long on this system. ./mpx is the compiled mpx.c, and ./mycat is the analogous mycat.c. The buffer cache was initialized by running both binaries and reading the /unix file before the timings were made. $ time ./mycat </unix >/dev/null real 0m4.90s user 0m4.72s sys 0m0.16s $ time ./mpx </unix >/dev/null real 0m13.90s user 0m13.60s sys 0m0.21s As an additional timing experiment, I recompiled both programs using optimization, which is fairly sophisticated on this machine. The results: $ time ./mpx </unix >/dev/null real 0m10.59s user 0m10.38s sys 0m0.15s $ time ./mycat </unix >/dev/null real 0m1.97s user 0m1.81s sys 0m0.16s The optimization seems to help "mycat" a good deal more than it helps "mpx"; I suspect the reason is that "mpx" has copying it cannot get around. But let's see what these numbers mean. 6. If we subtract the user time for "mpx" from the user time for "mycat" (using the optimized times - of course we'll optimize our OS!), we get: 10.38-1.81, or 8.57 seconds. In this 8.57 seconds, we have processed 1261933 characters, the contents of /unix. For each character processed, we have: 1. Restored the reader 2. Read the character 3. Stashed it to a buffer (shared memory(!)) 4. Restored the writer 5. Written out the character I will call the pair of 1. and 4. a "context switch", since a "process" has been switched out and another switched in. So the timing could be interpreted as the time for 1261933 context switches, meaning that for such minimal "processes" we can get about 150,000 context switches per second. Or that a context switch can be done in about 7 microseconds, if I've done my division right. Of course, this is a relatively fast machine, and everything fits in the cache.... So this minimal case is pretty good. I don't see a way of making it much smaller, unless: 1. You change the language model so that registers are not used. Then you don't need to save them. But I expect the performance hit from not using registers is much worse than that from slower _setjmp()/_longjmp()! 2. You provide hardware to give something like a "set-associative" register cache and just switch pointers - I think the Sun-4s do something like this but I am not sure of the details. -Jonathan
pardo@cs.washington.edu (David Keppel) (11/29/90)
jms@central.cis.upenn.edu (Jonathan M. Smith) writes: >Notes on "lightweight" processes >[Using setjmp/longjmp] In case anybody cares, that's not a portable implementation as, e.g., a system may unwind the stack on a |longjmp|. >[7 microseconds per context switch on a RS/6000.] >[Context switch includes save/restore all machine registers.] I'm not familiar with the processor architecture. I assume 32 integer, 32 floating-point registers and a 30MHz (30ns) clock and no stalls, then in 7 microseconds I can execute 7000/30 = 233 instructions. Saving and restoring 64 registers should take at least 128 cycles, but the rest of the context switch could be pretty cheap - a few dozen instructions. So you're within a factor of two, but some shaving could be done. Another data point: GCC (the GNU C compiler) has the capability to say ``certain registers are clobbered by this |asm()|''. I have used this capability to implement a nonpreemptive multiprocessor lightweight process package on the Sequent i386-based multiprocessor that performs (multiprocessor locked) thread swaps on a 16MHz 6-register i386 (including the overhead of two procedure calls with a dynamically-bound target address) in slightly over 30 microseconds. Using the |asm()| feature lets me do *no* register saves and restores in the context switch routine, relying instead on the compiler to insert them callee-saves in the procedure in which the context switch is performed. The unfortunate thing is that the 1.x series GCC simply avoids using certain registers instead of treating the |asm()| as a register `kill', so the code generated around the context switch is inferior. That could be avoided with a superior register allocator (in 2.x maybe?) >[Or, you can use a register window pointer like the Sun-4s.] Unfortunately, massaging the window pointer on a conforming SPARC is a kernel-reserved operation, so even a minimalist implementation has to pay the kernel trap cost. The default primitives that are provided with e.g., SunOS are targeted for procedure calls not threads, and so you can't really do threads using windows under SunOS on the SPARC; further, saving the register file on overflow/underflow has been optimized but the general `flush' appears to be significantly slower. The APRIL project [Agarwal, Lim, Kranz, and Kubiatowicz; Proceedings of the 17th Annual International Symposium on Computer Architecture, IEEE Computer Society Press, pg. 104, 1990] will implement context switching using a slightly modified SPARC in an estimated 11 cycles for ``in-cache'' context swaps. Of that, 6 cycles is for the trap handler. Up to 4 threads may be in-cache at any given time, ``out of cache'' context swaps require saving and restoring 32 integer and 8 floating-point registers, so the context switch time will be like that of a conventional processor (no kernel access will be required, since only the current window must be saved and restored). How light weight can a process get? Pretty light. ;-D on ( Tastes great, less filling, swaps fast ) Pardo -- pardo@cs.washington.edu {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo
lance@uunet.UU.NET (11/30/90)
jms@central.cis.upenn.edu (Jonathan M. Smith) writes: >Notes on "lightweight" processes >1. I wanted to see how minimal a process could be created, assuming I was >willing to give up address space protection. For experimental purposes, I >also focused on an ideal scheduling method, i.e., co-routine scheduling. >The method I used was to use the facilities setjmp() and longjmp() which >are provided by the C library. They provide the ability to achieve a non-local >goto, that is, one which crosses routine boundaries. setjmp() saves the current >"state" of the program (i.e., a minimal set of registers, floating registers, >frame pointers, etc) into a jmp_buf structure >(see /usr/include/setjmp.h) and longjmp(), >given a jmp_buf, restores the specified state, including the program counter. A "neutrino-weight" process: process_a() { static state = 0; switch (state) { case 0: do_something(); state = 1; break; case 1: do_something_else(); state = 0; break; } } The setjmp/longjmp hack is handy. The only problem with it is that you have to allocate separate stack RAM for each thread, and if your operating system doesn't support a separate signal stack, running out of stack causes the program to die. Lance