jms@central.cis.upenn.edu (Jonathan M. Smith) (11/09/90)
Some Notes on Experiences Programming Shared Memory

1. Purpose

I, and others, have claimed that using shared memory support for programming
can reduce the copying required in IPC. Being a "doubting Thomas", I decided
to play a little and apply shared memory to a task for which other, highly
optimized mechanisms exist. Specifically, a System V UNIX system provides
both pipes and shared memory facilities.

In my experiments, I used an HP9000 Series 370 with 16 megabytes of main
memory, running HPUX version 6.5 (deans.cis.upenn.edu). The system runs
diskless except for a 70MB swap device, and my home directory is on
"central.cis.upenn.edu".

2. Experimental Description

The experiment was as follows:

1. Write a two-process program using pipes to communicate between reader
   and writer.
2. Rewrite the program using shared memory for the same communication.
3. Compare the writing effort.
4. Measure the performance of the two implementations.

3. Pipes

"piper.c" follows:

#include <stdio.h>

#ifndef PIPESIZ
#define PIPESIZ 8192
#endif

main()
{
    int pfd[2], pid, fork(), nread, nwrite;
    char buf[PIPESIZ];

    if( pipe( pfd ) < 0 ) {
        fprintf( stderr, "Can't open pipe. Exiting.\n" );
        exit( 1 );
    }
    switch( (pid = fork()) ) {
    case -1:
        fprintf( stderr, "Can't fork. Exiting.\n" );
        exit( 2 );
        break;
    case 0: /* child: copy from the pipe to standard output */
        close( pfd[1] );
        while( (nread = read( pfd[0], buf, PIPESIZ )) > 0 ) {
            nwrite = write( 1, buf, nread );
            if( nread != nwrite ) {
                fprintf( stderr, "IO error. Exiting.\n" );
                exit( 3 );
            }
        }
        break;
    default: /* parent: copy from standard input to the pipe */
        close( pfd[0] );
        while( (nread = read( 0, buf, PIPESIZ )) > 0 ) {
            nwrite = write( pfd[1], buf, nread );
            if( nread != nwrite ) {
                fprintf( stderr, "IO error. Exiting.\n" );
                exit( 4 );
            }
        }
    }
    exit( 0 );
}

The program was written essentially as displayed; the only change was making
the buffer size PIPESIZ rather than the original BUFSIZ, so that the hidden
buffering of pipe data provided by UNIX did not distort the measurements.
The program took about 15 minutes to type in and debug (the errors were
mainly lexical). I am an experienced UNIX programmer, and IPC by pipe is the
idiom, so this is to be expected. The program (if you're counting) is about
55 lines of source text.

4. Shared Memory

I tried to keep the same model: a parent process reads from the standard
input and hands the data to a child, which writes it to the standard output.
Thus, in their IPC relationship, the parent is the "WRITER" and the child is
the "READER". The shared data structure was a buffer plus a counter, set up
to be the same size as in "piper.c".

I realized that I needed mutual exclusion, and had to give it some thought.
At first I envisioned using the memory itself, but one problem is that C
provides no access to an atomic "test-and-set" style instruction (a sketch
of such a primitive appears just after the lists of changes below), and with
preemptive multiprocessing such an instruction is obviously necessary.
Having used semaphores while writing code for my thesis, I was attracted to
that mechanism for mutual exclusion, and the HPUX kernel provides
semaphores. So I had to add code to:

A. allocate the shared memory for the buffer
B. allocate the semaphore
C. initialize the semaphore
D. perform mutual exclusion when accessing the shared buffer
E. explicitly deallocate the semaphore and shared memory on completion

I was able to subtract the code to:

A. create the pipe
B. read from / write to the pipe
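For concreteness, here is roughly what the missing primitive looks like in a
language that does expose one. This is only a sketch, written in the
notation of the much later C11 <stdatomic.h> header; the names spin_lock()
and spin_unlock() are illustrative. As noted above, a lock that spins is a
poor citizen under preemptive scheduling, which is part of why semaphores
were the better choice here:

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void spin_lock(void)
{
    /* test-and-set returns the flag's previous value; we hold the
     * lock only when we are the one who flipped it from clear to set */
    while (atomic_flag_test_and_set(&lock))
        ;   /* spin: on a uniprocessor this wastes the whole quantum */
}

void spin_unlock(void)
{
    atomic_flag_clear(&lock);
}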
In some initial trials, I found that the shared memory program ("mem")
performed far worse than "piper". This surprised me, so I profiled the code
to find out where the time was going: it was being eaten up by the calls to
"semop()", which are used to implement P() and V(). But the cost of the
calls was not the problem; it was the *number* of calls! For a file (my
"mbox") with 184 8192-byte blocks, about 30-40 K calls to semop() were made!
This didn't seem possible from examining the code, and I must admit to a lot
of hacking trying to "fix" the problem, which I was sure was a BUG.

The problem was that pipes are well-integrated with process synchronization
and scheduling. Executing a "V()" operation freed the semaphore, but it did
not give up the processor, so the process which had just executed the V()
could immediately execute a P() again. Once I saw this, I rewrote the
program with two semaphores, so that better control over process scheduling
(in the manner of "sleep" and "wakeup") and response to events could be
achieved. The correct (and reasonably fast) "mem.c" is given below:

#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/sem.h>

#ifndef PIPESIZ
#define PIPESIZ 8192
#endif

struct shared_buf {
    char buf[PIPESIZ];
    int count;
};

#define READ 0
#define WRITE 1
#define NSEM 2

int Semid;
struct sembuf Sops[1];

get_sema()
{
    if( (Semid = semget( IPC_PRIVATE, NSEM, 0600 )) < 0 ) {
        fprintf( stderr, "get of sema failed. Exiting.\n" );
        exit( 1 );
    }
}

set_sema()
{
    int i;

    for( i = 0; i < NSEM; i += 1 ) {
        if( semctl( Semid, i, SETVAL, 1 ) < 0 ) {
            fprintf( stderr, "set of sema failed. Exiting.\n" );
            exit( 1 );
        }
    }
    return;
}

P( sema )
int sema;
{
    Sops[0].sem_flg = 0;
    Sops[0].sem_num = sema;
    Sops[0].sem_op = -1;
    semop( Semid, Sops, 1 );
    return;
}

V( sema )
int sema;
{
    Sops[0].sem_flg = 0;
    Sops[0].sem_num = sema;
    Sops[0].sem_op = 1;
    semop( Semid, Sops, 1 );
    return;
}

main()
{
    int shmid, pid, fork(), shmget();
    struct shared_buf *sbp;
    char *shmat();

    if( (shmid = shmget( IPC_PRIVATE, sizeof( *sbp ), 0600 )) < 0 ) {
        fprintf( stderr, "Can't get shared mem. Exiting.\n" );
        exit( 1 );
    }
    sbp = (struct shared_buf *) shmat( shmid, (char *) 0, 0 );
    if( sbp == (struct shared_buf *) -1 ) {
        fprintf( stderr, "Can't attach shared mem. Exiting.\n" );
        exit( 1 );
    }
    get_sema();
    set_sema();
    P( READ );  /* taken pre-fork, so the child blocks until data is ready */
    switch( (pid = fork()) ) {
    case -1:
        fprintf( stderr, "Can't fork. Exiting.\n" );
        exit( 2 );
        break;
    case 0: /* child: the READER */
        do {
            P( READ );
            write( 1, sbp->buf, sbp->count );
            V( WRITE );
        } while( sbp->count > 0 );
        /* clean-up */
        shmdt( sbp );
        shmctl( shmid, IPC_RMID, 0 );
        semctl( Semid, 0, IPC_RMID, 0 );
        break;
    default: /* parent: the WRITER */
        do {
            P( WRITE );
            sbp->count = read( 0, sbp->buf, PIPESIZ );
            V( READ );
        } while( sbp->count > 0 );
    }
    exit( 0 );
}

This program is about 130 lines of code, although the length of the code
inside main() is about the same as before. The main processing loops are
much more elegant, but the setup is uglier. To force the WRITER (parent) to
go first, the parent calls P( READ ) before forking, so that the child
blocks until the writer calls V( READ ) after it has read data into the
buffer.

It took me quite a long time (4-7 hours, depending on how you count, spread
over several days) to get the program working, although I am now sure I
could apply the paradigm effectively in a much shorter period of time, since
I understand the details. Note that a program with multiple readers and
writers would require a semaphore for mutual exclusion as well as a
mechanism for scheduling; a sketch follows.
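To make that note concrete, here is a sketch (not part of the measured
program) of a writer's loop with a third semaphore added. MUTEX, NSEM = 3,
and the local int n are my additions; P(), V(), sbp, and PIPESIZ are those
of mem.c above:

#define READ  0
#define WRITE 1
#define MUTEX 2     /* new: guards updates to the shared structure */
#define NSEM  3     /* get_sema()/set_sema() now manage three semaphores */

        /* each writer's loop: */
        do {
            P( WRITE );     /* wait for the buffer to be writable */
            P( MUTEX );     /* exclude other writers while updating */
            n = sbp->count = read( 0, sbp->buf, PIPESIZ );
            V( MUTEX );
            V( READ );      /* hand the buffer to a reader */
        } while( n > 0 );

(The local n keeps the loop test from touching the shared structure outside
the critical section.) For this single-slot buffer, P( WRITE ) already
admits only one writer at a time, so MUTEX buys nothing; it becomes
necessary once the shared structure carries more state than one
buffer-plus-count, for instance a multi-slot queue with head and tail
indices.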
(I originally included such a semaphore, MUTEX, but discarded it after I
concluded it was unnecessary here.)

5. Comparative Measurements

The programs take essentially the same time! My "mbox" file is a 1.5MB text
file. The programs were compiled with optimization turned on and run with
the file as input and /dev/null as output, as shown below:

$ time ./piper <mbox >/dev/null
real    0m4.04s
user    0m0.00s
sys     0m0.78s
$ time ./piper <mbox >/dev/null
real    0m3.96s
user    0m0.02s
sys     0m0.70s
$ time ./mem <mbox >/dev/null
real    0m3.50s
user    0m0.08s
sys     0m0.62s
$ time ./mem <mbox >/dev/null
real    0m3.46s
user    0m0.00s
sys     0m0.66s

The "mem" runs seem slightly faster, but there is no substantial difference.
Once again I profiled (using PROFDIR=. in the environment, so that I could
examine the profiles of the child and the parent separately). Here are the
top few lines for the parent:

$ prof -m2765.mem mem
%Time  Seconds  Cumsecs  #Calls  msec/call  Name
 78.8     0.52     0.52     184       2.83  _read
 15.2     0.10     0.62     369       0.27  _semop
  3.0     0.02     0.64       1      20.    _main
  3.0     0.02     0.66       1      20.    _fork
  0.0     0.00     0.66     185       0.00  _P

and the child:

$ prof -m2766.mem mem
%Time  Seconds  Cumsecs  #Calls  msec/call  Name
 72.0     0.16     0.16     369       0.43  _semop
 18.0     0.06     0.22     185       0.32  _write
  0.0     0.00     0.22       1       0.    _main
  0.0     0.00     0.22     185       0.00  _P

(There was a bogus timing for fork() here, and of course many lines of calls
taking no time have been edited out for the reader's sanity.)

For piper, the parent's profile was:

$ prof -m2779.piper piper
%Time  Seconds  Cumsecs  #Calls  msec/call  Name
 58.1     0.50     0.50     184       2.72  _read
 37.2     0.32     0.82     184       1.74  _write
  2.3     0.02     0.84       1      20.    _main
  2.3     0.02     0.86       1      20.    _pipe

and the child's:

$ prof -m2780.piper piper
%Time  Seconds  Cumsecs  #Calls  msec/call  Name
 82.0     0.36     0.36     184       1.96  _read
  9.0     0.04     0.40     184       0.22  _write
  4.5     0.02     0.42       1      20.    _main
  4.5     0.02     0.44       1      20.    _pipe

Thus we see that it is the external I/O (read+write = .58s in "mem") and the
semaphores (.26s) that are eating up the time. The shared memory version
looks faster overall (.84s versus 1.30s of profiled time), and its IPC is
definitely faster: subtracting the I/O due to external sources gives .30s
versus .72s, more than a factor of two. I will try to cook up an experiment
a little later to get timings based only on the IPC time (a sketch of one
way to do this appears at the end of this article).

6. Summary (so far)

It is helpful to observe the following facts:

1. IPC may not be the dominant cost in any case (here, it is external I/O).
2. There was no major difference in the response times of "mem" and "piper".
3. This may be a best-case comparison for pipes: pipes are highly optimized
   on UNIX, and pipe flow control is tightly integrated with scheduling and
   process control.

In fact, I suspect that fact 3 is the most reasonable explanation for
fact 2, given the profile results.
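As a down payment on that experiment, here is a sketch of how an IPC-only
timing might be taken: the same two-semaphore protocol as mem.c, but with
memcpy() standing in for the external read() and write(), so that running it
under time(1) measures only synchronization plus copying through the shared
segment. NITER, the sem_change() helper, and the minimal error handling are
my choices; none of the figures above came from this program.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/sem.h>

#define PIPESIZ 8192
#define NITER   184     /* same block count as the mbox runs above */
#define READ    0
#define WRITE   1

static int semid;

static void sem_change( int sema, int by )
{
    struct sembuf op;

    op.sem_num = sema;
    op.sem_op = by;
    op.sem_flg = 0;
    if( semop( semid, &op, 1 ) < 0 ) {
        perror( "semop" );
        exit( 1 );
    }
}
#define P(s) sem_change( (s), -1 )
#define V(s) sem_change( (s), 1 )

int main()
{
    int shmid, i;
    char *buf, src[PIPESIZ], dst[PIPESIZ];

    if( (shmid = shmget( IPC_PRIVATE, PIPESIZ, 0600 )) < 0 ||
        (buf = (char *) shmat( shmid, 0, 0 )) == (char *) -1 ||
        (semid = semget( IPC_PRIVATE, 2, 0600 )) < 0 ) {
        perror( "setup" );
        exit( 1 );
    }
    semctl( semid, READ, SETVAL, 0 );   /* reader starts blocked */
    semctl( semid, WRITE, SETVAL, 1 );  /* writer goes first */
    memset( src, 'x', sizeof src );

    if( fork() == 0 ) {                 /* child: the READER */
        for( i = 0; i < NITER; i += 1 ) {
            P( READ );
            memcpy( dst, buf, PIPESIZ );    /* stands in for write(1,...) */
            V( WRITE );
        }
        return 0;
    }
    for( i = 0; i < NITER; i += 1 ) {   /* parent: the WRITER */
        P( WRITE );
        memcpy( buf, src, PIPESIZ );        /* stands in for read(0,...) */
        V( READ );
    }
    wait( (int *) 0 );
    shmdt( buf );
    shmctl( shmid, IPC_RMID, 0 );
    semctl( semid, 0, IPC_RMID, 0 );
    return 0;
}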
pardo@cs.washington.edu (David Keppel) (11/11/90)
jms@central.cis.upenn.edu (Jonathan M. Smith) writes:
>[Programming with pipes vs. shared memory]

Some other lessons from the experiment:

* Programming with good, familiar primitives is always easier.

* Never confuse MODEL with IMPLEMENTATION -- there are pipes that are
  implemented using shared memory (Mach, Topaz) and shared memory
  implemented using pipes (more or less, anyway; Ivy).

* The results are always different on a multiprocessor: the shared-memory
  version has a simple implementation that (after setup) makes no kernel
  calls and does no copying -- at least on some platforms.

* There is also a pipe model, implemented using shared memory, that runs on
  (some) multiprocessors with no kernel calls and no copying; a sketch
  appears below.

* Simple, understandable examples are almost controversial :^)

Thank you for this opportunity to generalize...
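To sketch that last model (as an illustration only, not the Mach or Topaz
code): a single-producer, single-consumer ring in a shared segment needs no
kernel calls once it is set up, just loads and stores in the right order,
and by queuing *pointers* into the shared segment rather than the bytes
themselves it avoids copying too. Written here with C11 atomics, which
postdate this thread but make the required ordering explicit; it assumes the
segment is mapped at the same address in both processes (otherwise store
offsets instead of pointers):

#include <stdatomic.h>
#include <stddef.h>

#define SLOTS 64                /* must be a power of two */

struct ring {
    _Atomic size_t head;        /* advanced only by the producer */
    _Atomic size_t tail;        /* advanced only by the consumer */
    void *slot[SLOTS];          /* pointers into the shared segment */
};

/* producer side: returns 0 when the ring is full (spin or yield, retry) */
int ring_put(struct ring *r, void *p)
{
    size_t h = atomic_load_explicit(&r->head, memory_order_relaxed);

    if (h - atomic_load_explicit(&r->tail, memory_order_acquire) == SLOTS)
        return 0;
    r->slot[h % SLOTS] = p;
    /* release: the slot store above is visible before the new head is */
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return 1;
}

/* consumer side: returns NULL when the ring is empty */
void *ring_get(struct ring *r)
{
    size_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    void *p;

    if (atomic_load_explicit(&r->head, memory_order_acquire) == t)
        return NULL;
    p = r->slot[t % SLOTS];
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return p;
}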
;-D on ( General Semaphore?  This is Private Memory.  Sir! )  Pardo
--
pardo@cs.washington.edu
{rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo

ajudge@maths.tcd.ie (Alan Judge) (11/13/90)
In <8754@darkstar.ucsc.edu> jms@central.cis.upenn.edu (Jonathan M. Smith)
writes:
>Some Notes on Experiences Programming Shared Memory

Just out of interest, I thought that I would try piper and mem on a
different machine type; it should give a better idea of the effects of
differing kernel implementations on the performance. Below are the results
of piper and mem run on a DECsystem 5810 (MIPS based) running Ultrix 3.1C.
The system has 32MB of memory and runs from local disks. The size of the
buffer cache should remove the effects of the disks after the first
execution. The test file used, /vmunix, is 2.8MB in size.

parade 1% time pipe < /vmunix > /dev/null
0.0u 2.0s 0:05 39% 3+4k 1+0io 0pf+0w av:8k m:4k refpf: 0
parade 2% time pipe < /vmunix > /dev/null
0.0u 2.0s 0:02 67% 3+4k 0+0io 0pf+0w av:8k m:4k refpf: 0
parade 2% time mem < /vmunix > /dev/null
0.0u 1.1s 0:01 80% 3+2k 3+1io 0pf+0w av:6k m:5k refpf: 0
parade 3% time mem < /vmunix > /dev/null
0.0u 1.1s 0:01 91% 3+2k 0+0io 0pf+0w av:6k m:5k refpf: 0

It is interesting to note that, on this system, shared memory is almost
twice as fast as pipes (1.1s versus 2.0s of system time).
--
Alan Judge  ajudge@maths.tcd.ie  a.k.a. amjudge@cs.tcd.ie  +353-1-772941 x1782
File names are infinite in length where infinity is set to 255 characters.
        - Peter Collinson, "The Unix File System"
bdf@rex.cs.tulane.edu (Brett Fleisch) (11/14/90)
In article <8754@darkstar.ucsc.edu> jms@central.cis.upenn.edu (Jonathan M.
Smith) writes:
>Some Notes on Experiences Programming Shared Memory
>
>In some initial trials, I found that the shared memory program ("mem")
>performed far worse than "piper". This surprised me, so I profiled the
>code to find out where the time was going: it was being eaten up by the
>calls to "semop()", which are used to implement P() and V(). But the cost
>of the calls was not the problem; it was the *number* of calls! For a file
>(my "mbox") with 184 8192-byte blocks, about 30-40 K calls to semop() were
>made! This didn't seem possible from examining the code, and I must admit
>to a lot of hacking trying to "fix" the problem, which I was sure was a
>BUG.
>
>The problem was that pipes are well-integrated with process
>synchronization and scheduling. Executing a "V()" operation freed the
>semaphore, but it did not give up the processor, so the process which had
>just executed the V() could immediately execute a P() again. Once I saw
>this, I rewrote the program with two semaphores, so that better control
>over process scheduling (in the manner of "sleep" and "wakeup") and
>response to events could be achieved.

I observed a similar problem in Mirage. You might read over the article
"Mirage: A Coherent Distributed Shared Memory Design" in SOSP 12, noting the
yield() operation which was added to better control UNIX synchronization and
scheduling interactions. If you have questions, contact me:

Brett D. Fleisch
Assistant Professor
Computer Science Department
Tulane University
301 Stanley Thomas Hall
New Orleans, LA 70118-5698
(504) 865-5840
bdf@cs.tulane.edu
--
brett@cs.tulane.edu
aglew@crhc.uiuc.edu (Andy Glew) (11/14/90)
..> Pipes versus shared memory, spinning on a non-running processor,
..> and scheduling.  [jms@central.cis.upenn.edu (Jonathan M. Smith)]

>I observed a similar problem in Mirage. You might read over the article
>"Mirage: A Coherent Distributed Shared Memory Design" in SOSP 12, noting
>the yield() operation which was added to better control UNIX
>synchronization and scheduling interactions.
>[Brett D. Fleisch]

The yield() operation sounds very much like the suspend()/resume() facility
found in many real-time executives, which I put into Gould UTX-32/RT. As I
recall, it took me 45 minutes to prototype -- it was essentially a variant
of SIGSTOP with much reduced overhead -- and maybe a man-month for someone
else to turn it into a product, do the after-the-fact design, and so on.

Suspend() and resume(), and the combined suspres(), were low-overhead system
calls to suspend the current process and start another process running. So,
in the shared memory pipe example, the producer would suspend itself and
start the consumer when the buffer is full, and the consumer would do the
reverse when the buffer is empty. (A sketch of this handoff pattern in
portable terms appears below.)

By the way, some UNIX pipe implementations provide another feature: they
give a priority boost, PPIPE, to a process made runnable by a pipe.
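For readers without a Gould handy, the shape of that handoff can be sketched
with plain POSIX signals: SIGUSR1 plays the part of resume(), and
sigsuspend() plays the part of suspend(). Keeping SIGUSR1 blocked except
inside sigsuspend() supplies the atomicity that the combined suspres() call
provided, avoiding the lost-wakeup race a naive SIGSTOP/SIGCONT scheme would
have. The NITER count and the printf() calls are stand-ins for real buffer
work:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define NITER 5

static void resume_handler(int sig)
{
    (void)sig;   /* exists only to let the signal interrupt sigsuspend() */
}

int main(void)
{
    struct sigaction sa;
    sigset_t block, wait_mask;
    pid_t peer;
    int i;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = resume_handler;
    sigaction(SIGUSR1, &sa, NULL);

    /* Block SIGUSR1: a "resume" sent while we are running stays pending
     * and is picked up by our next "suspend", so no wakeup is lost. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    sigprocmask(SIG_BLOCK, &block, &wait_mask);
    sigdelset(&wait_mask, SIGUSR1);     /* mask used inside sigsuspend() */

    if ((peer = fork()) == 0) {         /* child: the consumer */
        peer = getppid();
        for (i = 0; i < NITER; i++) {
            sigsuspend(&wait_mask);     /* suspend(): wait to be resumed */
            printf("consumer: got block %d\n", i);
            kill(peer, SIGUSR1);        /* resume() the producer */
        }
        return 0;
    }
    for (i = 0; i < NITER; i++) {       /* parent: the producer */
        printf("producer: filled block %d\n", i);
        kill(peer, SIGUSR1);            /* resume() the consumer... */
        if (i < NITER - 1)
            sigsuspend(&wait_mask);     /* ...and suspend() ourselves */
    }
    wait(NULL);
    return 0;
}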
--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]

jesup@cbmvax.cbm.commodore.com (Randell Jesup) (01/26/91)
In article <8754@darkstar.ucsc.edu> jms@central.cis.upenn.edu (Jonathan M.
Smith) writes:
>I, and others, have claimed that using shared memory support for
>programming can reduce the copying required in IPC.

I agree that it can, and have some evidence (albeit from another OS). I
rewrote your examples for AmigaDos (which is based on lightweight
processes). I also coded up a null version (reads from input, writes to
output) to subtract out the I/O overhead of reading and writing the
input/output channels. I ran all I/O to and from the ramdisk to reduce
filesystem overhead (the ramdisk does about 3 to 5MB/sec for read/write on
this machine). Times for a 2.5MB file were (averaged over several runs):

null:  0.60 sec
msg:   0.70 sec
pipe:  2.92 sec

So it looks like 0.10 seconds for msg versus 2.32 seconds for pipes; I think
23:1 is a pretty clear win for shared memory.

Note that on a system built around shared memory, there's no need to
reinvent the wheel in order to have processes share memory. I simply pass
messages back and forth telling the child that the buffer is available and
how many bytes are in it. Messages are shared, and the entire OS is built on
the message primitives (GetMsg, PutMsg, ReplyMsg, WaitPort). It took no
longer for me to write the shared-memory version than the pipe version
(AmigaDos uses named pipes). By the way, PutMsg in AmigaDos merely links a
message into the list of messages waiting at a port and does whatever
notification is needed (usually a signal to a process).

For those that care, this is an Amiga 3000: 25MHz '030, 8MB of on-board 80ns
SCRAM, and 8MB of expansion-bus RAM (Zorro-III bus) about half as fast as
the on-board RAM. All execution was out of the on-board RAM. The AmigaDos
version is 2.0.

>6. Summary (so far)
>It is helpful to observe the following facts:
>1. IPC may not be the dominant cost in any case (here, it is external I/O).

In many cases, true. However, extensive use of shared-memory techniques can
produce large improvements in OS/application performance, especially on a
system where most I/O is handled by filesystem processes and device
processes (i.e. not running as part of a kernel, nor special in any way).
This can be quite important in interactive processes, especially graphical
ones, where any delay in response is noticeable and annoying to the user.

>2. There was no major difference in the response times of "mem" and "piper".

In my case there was (including filesystem overhead, it's still 3:1).

>3. This may be a best-case comparison for pipes: pipes are highly optimized
>on UNIX, and pipe flow control is tightly integrated with scheduling and
>process control.

And of course AmigaDos is optimized for fast task switching (where most of
that 0.1 seconds went) and fast messaging; pipes are just another user-mode
filesystem. Most Unix systems have more state to switch (and caches to
flush; with lightweight processes no flush is needed).
Here's the source code for those that are interested:

/* pipe.c */
#include <dos/dos.h>
#include <dos/dostags.h>
#include <clib/dos_protos.h>
#include <pragmas/dos_pragmas.h>
#include <stdio.h>

#define PIPESIZ 8192
#define PIPENAME "Pipe:bar"

extern struct DosLibrary *DOSBase;

struct Process *parent;
int sig;
void child();

struct TagItem tags[] = {
    NP_Entry,       child,
    NP_Output,      NULL,
    NP_CloseOutput, FALSE,
    TAG_END,        0
};

char buffer[PIPESIZ];

int main (argc, argv)
{
    BPTR pipe, input;
    struct Process *p;
    LONG len;

    parent = (struct Process *) FindTask(0);
    input = Input();
    if (!(pipe = Open(PIPENAME, MODE_NEWFILE))) {
        printf("Can't open pipe!\n");
        return RETURN_ERROR;
    }
    if ((sig = AllocSignal(-1)) == -1) {
        printf("Can't get signal?\n");
        Close(pipe);
        return RETURN_ERROR;
    }
    SetSignal(0, 1<<sig);       /* clear signal, paranoia */

    tags[1].ti_Data = Output(); /* fill in NP_Output */
    p = CreateNewProc(tags);
    if (!p) {
        printf("Can't create process!\n");
        FreeSignal(sig);
        Close(pipe);
        return RETURN_ERROR;
    }
    while ((len = Read(input, buffer, PIPESIZ)) > 0) {
        Write(pipe, buffer, len);
    }
    if (len < 0)
        PrintFault(IoErr(), "Error on Read:");
    Close(pipe);
    Wait(1<<sig);               /* wait for child to exit */
    FreeSignal(sig);
    return RETURN_OK;
}

char child_buffer[PIPESIZ];

void child ()
{
    BPTR pipe, output;
    LONG len;

    output = Output();
    if (!(pipe = Open(PIPENAME, MODE_OLDFILE))) {
        Signal(parent, 1<<sig);
        return;
    }
    while ((len = Read(pipe, child_buffer, PIPESIZ)) > 0) {
        Write(output, child_buffer, len);
    }
    Close(pipe);
    Signal(parent, 1<<sig);
}

/* msg.c */
#include <dos/dos.h>
#include <dos/dostags.h>
#include <clib/dos_protos.h>
#include <pragmas/dos_pragmas.h>
#include <clib/exec_protos.h>
#include <pragmas/exec_pragmas.h>
#include <stdio.h>

#define PIPESIZ 8192
#define PIPENAME "Pipe:bar"

extern struct ExecBase *SysBase;
extern struct DosLibrary *DOSBase;

struct Process *parent;
LONG sig;
void child();

struct TagItem tags[] = {
    NP_Entry,       child,
    NP_Output,      NULL,
    NP_CloseOutput, FALSE,
    TAG_END,        0
};

char buffer[PIPESIZ];

int main (argc, argv)
{
    struct Process *p;
    struct Message msg;
    struct MsgPort *port;
    BPTR input;
    LONG len;

    parent = (struct Process *) FindTask(0);
    input = Input();
    if ((sig = AllocSignal(-1)) == -1) {
        printf("Can't get signal?\n");
        return RETURN_ERROR;
    }
    SetSignal(0, 1<<sig);       /* clear signal, paranoia */

    tags[1].ti_Data = Output(); /* fill in NP_Output */
    p = CreateNewProc(tags);
    if (!p) {
        printf("Can't create process!\n");
        FreeSignal(sig);
        return RETURN_ERROR;
    }
    Wait(1<<sig);               /* wait for child to initialize */
    port = FindPort(PIPENAME);
    if (!port) {
        printf("No port!\n");
        FreeSignal(sig);
        return RETURN_ERROR;
    }
    msg.mn_ReplyPort = &(parent->pr_MsgPort);
    while ((len = Read(input, buffer, PIPESIZ)) > 0) {
        msg.mn_Length = len;
        PutMsg(port, &msg);
        WaitPort(&(parent->pr_MsgPort));    /* wait for reply */
        GetMsg(&(parent->pr_MsgPort));
    }
    if (len < 0)
        PrintFault(IoErr(), "Error on Read:");
    msg.mn_Length = 0;          /* tell child to exit */
    PutMsg(port, &msg);
    WaitPort(&(parent->pr_MsgPort));
    GetMsg(&(parent->pr_MsgPort));
    Wait(1<<sig);               /* wait for child to exit */
    FreeSignal(sig);
    return RETURN_OK;
}

void child ()
{
    struct MsgPort *port;
    struct Message *msg;
    BPTR output;

    if (!(port = CreateMsgPort())) {
        Signal(parent, 1<<sig);
        return;
    }
    port->mp_Node.ln_Name = PIPENAME;
    AddPort(port);              /* add to system list */
    Signal(parent, 1<<sig);     /* tell parent to wake up */
    output = Output();
    while (1) {
        WaitPort(port);
        msg = GetMsg(port);
        if (msg->mn_Length) {
            Write(output, buffer, msg->mn_Length);
            ReplyMsg(msg);
        } else {
            ReplyMsg(msg);
            break;              /* EOF */
        }
    }
    RemPort(port);
    DeleteMsgPort(port);
    Signal(parent, 1<<sig);
}

/* null.c */
#include <dos/dos.h>
#include <dos/dostags.h>
#include <clib/dos_protos.h>
#include <pragmas/dos_pragmas.h>
#include <stdio.h>

#define PIPESIZ 8192

extern struct DosLibrary *DOSBase;

char buffer[PIPESIZ];

int main (argc, argv)
{
    BPTR output, input;
    LONG len;

    input = Input();
    output = Output();
    while ((len = Read(input, buffer, PIPESIZ)) > 0) {
        Write(output, buffer, len);
    }
    if (len < 0)
        PrintFault(IoErr(), "Error on Read:");
    return RETURN_OK;
}