[comp.arch] fork, spawn, vfork: an overview

chris@mimsy.umd.edu (Chris Torek) (08/06/90)

Since there seems to be a great deal of confusion and ignorance
surrounding the `vfork issue', it might help if I describe how these
things work.

The basic idea behind `fork' is quite simple:  make an exact duplicate
of the running process---copy all the user information and all the
kernel information, except of course the process ID and special
internal-only implementation data---and then return from the fork in
the old process giving the new process's ID, and return from the fork
in the new process giving a 0 (so that it can tell it is the clone
rather than the original).

This operation tends to be expensive since it involves copying data,
potentially a great deal of data.  There are a number of solutions:

 a. Avoid fork entirely (the spawn issue):  `If the copy is merely
    being made so that the clone can throw away its user data and run
    another program, why bother copying?'  This is a sensible idea, but
    every time someone tries to do it, the spawn call winds up being
    terribly complex.  Perhaps this is because the wrong people do it;
    but perhaps it is because the spawn approach is inherently flawed.
    I am not going to argue one way or the other here.

 b. Try not to copy.  Anything that can never be changed (e.g., the
    program's code or `text' section) obviously need not be physically
    copied.  (Unix fork() calls have always shared text.)  In addition,
    if the hardware supports it (e.g., via copy-on-write or copy-on-
    reference), the data and stack segments also need not be copied.
    This is not always the best policy; sometimes the expense of
    copying on write or reference outweighs the savings from never
    copying some pages.  There are some tricks that can be applied
    here as well (e.g., always copy the current stack page, leave the
    rest copy-on-write).

 c. Cheat.  This is the `vfork' approach.

Even if fork can be done via copy-on-write, on some machines it remains
expensive for large processes, because the page tables must be copied
anyway and these can also be large.  (Some of the pages of PTEs can be
shared, but this rapidly gets tricky.)  And of course, there are
machines on which copy-on-write cannot be done, either because the
hardware never supported it (e.g., where writes cannot be restarted) or
because of a bug in the microcode (e.g., the infamous VAX-11/750 bug);
here copy-on-reference can be used, but it tends to be more expensive
(most processes read many more locations than they write).  Thus, the
`virtual fork' trick.

Suppose that, instead of making a copy of the data and stack segments
*and* the kernel per-process information, the kernel were to make a
copy of just the kernel per-process information, then suspend the
parent process entirely, `handing over' the entire virtual memory image
to the clone (by giving the original's PTE pages as well as its data
and stack segments to the clone)?  The clone would then be running in
the very same virtual (and physical) memory that the original had
before it `virtually forked'.  Nothing has been copied except the
kernel data.  The main issue here is that, if the original process
were allowed to run, both the parent and the child would be competing
for the same data and stack memory---much like threads, but disastrous
since the stacks would rapidly collide---so the parent must be marked
as having no memory at all, and cannot run while the child uses the
space.  The child, too, must be marked specially, so that when it is
ready to give up its virtual space (via exec() or exit()), the kernel
knows instead to `hand back' the entire virtual memory to the parent.

This is how vfork works.  Only the kernel data (the u. area) are copied.
The old virtual memory is not so much `shared' as `handed over' to the
child: it uses the exact same resources the parent process had.  When
the child process is done with them, they are `handed back' (using the
same kernel routine, in fact; it is called `vpassvm') and the parent is
allowed to continue.  There are several problems with this scheme:

 - Since the child process (the clone) has already returned from the
   vfork() system call, the stack frame needed for this return has been
   wiped out.  This is handled by avoiding the return in the first
   place:  The vfork call is written in assembly, and pops the stack
   frame before doing the system call at all.  On return from the
   system call the assembly code jumps to the place a return would have
   gone, having pre-set the registers as appropriate.

 - Since the old VM is handed over to the child, any changes the child
   makes show up in the parent as well.  Occasionally some software
   winds up depending on this (e.g., csh).

 - Since the parent is suspended, the order of execution (child runs to
   exec() or exit()) is defined.  Occasionally some software winds up
   depending on this as well (again, csh is the prime example).

 - Although the kernel data are copied (so that changes made in the
   child to, e.g., the signal dispositions are not reflected in the
   parent), some of these describe the virtual memory resources.  When
   the child hands the virtual memory back to the parent, the parent
   process must account for any changes the child has made to the
   virtual memory itself---for instance, if the child has made the heap
   or stack larger, the parent must update its size information.  This
   is a good place for bugs to hide.

Buggy user programs aside, it is always possible for a kernel to
implement vfork as a true fork, if this is somehow cheaper.  Thus there
can never be an efficiency argument in favor of fork; vfork never costs
more.  Moreover, even given a copy-on-write fork, vfork may still be
cheaper, because vfork does not have to copy PTEs.  On a machine like
the Butterfly, a vfork helps because the parent process is suspended:
the child can run in the parent's space until it exits or, more likely,
gets another processor to do an exec.  In essence, vfork is `fork with
a promise to finish soon' or even `spawn, but let me do a few things
before I go'.

When you come right down to it, the real disadvantage to vfork is that,
conceptually, it is downright ugly.  Viewing it as `spawn, but let me
run a few instructions first' helps a lot here.  The meaning of

	pid = vfork();
	if (pid == 0) {
		(void) dup2(fd, 0);
		(void) close(fd);
		(void) execve(path, av, env);
		_exit(1);
	}
	...

is not `make a copy; if this is the child, fiddle with descriptors
and run something else' but rather `run something else, but first
fiddle with descriptors using the following code'.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris
	(New campus phone system, active sometime soon: +1 301 405 2750)

tbray@watsol.waterloo.edu (Tim Bray) (08/06/90)

chris@mimsy.umd.edu (Chris Torek) writes an informative and very helpful
backgrounder on the subject of fork() and vfork().

But there is a piece of information missing.  How much does fork() cost?  In
an environment with a lot of fork()ing going on, (e.g. a heavily-timeshared
system with a lot of shell scripts running?), how much of the system
resources are burned up forking?  Quantitatively, how much does vfork()
save?  Is the picture significantly different on a workstation which spends
much of its time running a window system and a few technical applications?

Somebody must have measured this at some time.  Or did somebody in the bsd
gang just say "That looks inefficient. Let's fix it" and create vfork()?

I.e., can anyone prove whether the ugliness of vfork() really buys anything?

Cheers, Tim Bray, Open Text Systems

jkenton@pinocchio.encore.com (Jeff Kenton) (08/06/90)

From article <25904@mimsy.umd.edu>, by chris@mimsy.umd.edu (Chris Torek):
> Since there seems to be a great deal of confusion and ignorance
> surrounding the `vfork issue', it might help if I describe how these
> things work.
> 
>	[Good survey of the issues] 
> 
>                                                    On a machine like
> the Butterfly, a vfork helps because the parent process is suspended:
> the child can run in the parent's space until it exits or, more likely,
> gets another processor to do an exec.  In essence, vfork is `fork with
> a promise to finish soon' or even `spawn, but let me do a few things
> before I go'.

The one issue you haven't dealt with is the case where a program wants
to spawn many copies of itself but never execs after the fork.  In this
case you need a fork() which is biased toward moving to another processor
and has no way to guarantee the "child runs first" semantics (which the
man page warns against relying upon).  On the Butterfly, with >100 CPUs
and non-uniform memory, this is a reasonable scenario.  Traditional
vfork() buys very little, and it's ugly.

Beauty is only skin deep, but ugly goes right to the bone.








- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
      jeff kenton  ---	temporarily at jkenton@pinocchio.encore.com	 
		   ---  always at (617) 894-4508  ---
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

chris@mimsy.umd.edu (Chris Torek) (08/06/90)

In article <1990Aug6.010044.8638@watdragon.waterloo.edu>
tbray@watsol.waterloo.edu (Tim Bray) asks the very reasonable question:
>How much does fork() cost? ... Quantitatively, how much does vfork()
>save?  Is the picture significantly different on a workstation which spends
>much of its time running a window system and a few technical applications?
>
>Somebody must have measured this at some time.  Or did somebody in the bsd
>gang just say "That looks inefficient. Let's fix it" and create vfork()?

I have only two data points on this myself (the vfork thing was done
before my time---I only got into this Unix business in late 1981...).
I know that the Franz Lisp group at Berkeley drove some of the changes
to the VM system between various 4BSD releases (including, I suspect,
vadvise); and I know that the VM group at Sun who did the SunOS 4.x
virtual memory took vfork out (made it a simple copy-on-write fork) and
then put it back.  Whether the latter was only temporary, I cannot say,
but surely someone from Sun should remember:  Was this for performance
or just to accomodate broken programs like csh?

Anyway, old Berkeley papers should describe the vfork savings (which
for 6+ MB Lisp jobs on a VAx-11/780 without a copy-on-write fork will
no doubt be quite impressive); but keep in mind that these will compare
`copy everything' against `copy nothing' without mentioning a middle
ground (`copy PTEs and written pages').
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris
	(New campus phone system, active sometime soon: +1 301 405 2750)

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (08/06/90)

In article <12382@encore.Encore.COM> jkenton@pinocchio.encore.com
	(Jeff Kenton) writes:

>Beauty is only skin deep, but ugly goes right to the bone.

True.

The problem here is that the semantics of fork-exec is really ugly,
because it is virtual memory ineffecient.

First, by simple implementation, it requires unnecessary copying
of virtual memory.

By COW workaround, copying is not necessary. But the fundamental ineffeciency
on virtual memory is still there. So, a large process can't fork.

By not-pre-allocating-swap-space workaround, a large process can now fork.
But the fundamental ineffeciency on virtual memory is only hidden from the
sight and still exists. So, now, there is possibility of deadlock.

So, the ugliness of virtual memory ineffeciency is really goes
right to the bone.

The solution is to get rid of the ugly virtual memory ineffeciency.

Thus
	vfork - spawn new process in a virtual memory efficient way

To appreciate beauty of vfork, you should first think of general
purpose segment sharing mechanism. Then, by using the mechanism,
you can construct a virtual memory effecient way to spawn a
process.

BTW, for the complete beauty, vfork should have been designed so
that changes to registers in a child context are also reflected
to the parent process. That is important for a highly optimized compiler.

						Masataka Ohta

peter@ficc.ferranti.com (Peter da Silva) (08/06/90)

I agree that vfork is a better mechanism. It also has one further advantage:

	- It can be implemented on machines with no MMU.

This is what makes this such a frustrating subject. I'm not attacking
fork() and vfork(). I think they're the best pair of ideas since sliced
bread. But I have to deal with real-time operating systems where fork()
can't be implemented at all, and vfork() can't be implemented cheaply
outside the kernel.

I also believe that the changes needed to make spawn() usable without
excessive complexity are valuable in their own right. Come now, wouldn't
you rather save your current directory so you can chdir() back instead
of having to burn cycles calling getcwd()? How about having the nicer
semantics of sigset() universally available via signal()?
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
<peter@ficc.ferranti.com>

peter@ficc.ferranti.com (Peter da Silva) (08/06/90)

In article <12382@encore.Encore.COM> jkenton@pinocchio.encore.com (Jeff Kenton) writes:
> The one issue you haven't dealt with is the case where a program wants
> to spawn many copies of itself but never execs after the fork.

In this case vfork() isn't even usable. Doesn't matter if you're on a
Butterfly or not.
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
<peter@ficc.ferranti.com>

jack@cwi.nl (Jack Jansen) (08/08/90)

A reasonably clean solution to the vfork() hack would be to split
fork in two: one call that creates a new kernel environment (u area,
etc), and another one that starts a new process in that environment.

You could then program like

	pre_fork();
	/*
	** From now on, all system calls we do refer to the kernel
	** state of the *new* process.
	*/
	close files;
	open other files;
	change signal;
	etc;
	pid = spawn("a.out", ....);
	/*
	** Here we're back in our own kernel environment again.
	*/

The only problem I see is that there are some calls for which it is
unclear what the semantics should be between the pre_fork() and the
spawn() call: sbrk() comes to mind. Also, if you crash between the
pre_fork() and the spawn it is unclear which of the two u areas you
would want to store in the coredump.
--
--
Een volk dat voor tirannen zwicht	| Oral:     Jack Jansen
zal meer dan lijf en goed verliezen	| Internet: jack@cwi.nl
dan dooft het licht			| Uucp:     hp4nl!cwi.nl!jack

mcdonald@aries.scs.uiuc.edu (Doug McDonald) (08/08/90)

In article <1940@charon.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
>A reasonably clean solution to the vfork() 

PLEASE folks, this is not a computer architecture issue. It is a
question that has to do with the software of one very particular
operating system. Let's move it appropriate os-specific group? OK.
Most OS's don;t seem to have any such weird difficulties.

Doug MCDonald

peter@ficc.ferranti.com (Peter da Silva) (08/08/90)

In article <1940@charon.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
> A reasonably clean solution to the vfork() hack would be to split
> fork in two: one call that creates a new kernel environment (u area,
> etc), and another one that starts a new process in that environment.

You just re-invented vfork! The only difference is that the pid is returned
in a different place.

> 	pre_fork();

This is equivalent to "if((pid=vfork()) == 0) {".

> 	pid = spawn("a.out", ....);

And this can be replaced by "exec(...); exit(ERROR_CODE); }"
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
<peter@ficc.ferranti.com>

peter@ficc.ferranti.com (Peter da Silva) (08/09/90)

In article <1990Aug8.133546.27806@ux1.cso.uiuc.edu> mcdonald@aries.scs.uiuc.edu (Doug McDonald) writes:
> PLEASE folks, this is not a computer architecture issue.

Depends on how you define "architecture". Software versus hardware versus
industrial design of cabinets...
-- 
Peter da Silva.   `-_-'
+1 713 274 5180.   'U`
<peter@ficc.ferranti.com>