[comp.sys.sgi] bug in vfork semantics under IRIX 3.3.1

chk%alias@csri.toronto.edu (C. Harald Koch) (11/29/90)

I was just applying the latest patches to ELM, after upgrading to 3.3.1.
Suddenly elm was no longer able to read my mailbox! After long and detailed
debugging, I eventually found the problem:

ELM runs set group-id mail so that it can create lock files. This is a
potential security hole, so ELM uses subprocesses to verify certain file
access permissions using your real gid rather than your effective gid. This
is to prevent users from getting access to files that are readable by the
mail group (i.e. other users' mailboxes).

Under 3.3.1, ELM configuration detects the existence of vfork() and uses it
instead of fork(). Then, in the child, ELM calls setgid() to set the
group-id to your real group-id, performs the test, and exits with a status.
The parent reads this status back.

On most systems with vfork(), the two processes inherit the same address
space, BUT DIFFERENT KERNEL U-AREAS. This means that the setgid() call
doesn't affect the parent.

Under IRIX, the vfork() call is actually implemented using sproc(), which is
a more primitive way to get multiple processes. It DOES NOT give you a
separate u-area. So the setgid() call affects the parent!

As a result, the parent process is no longer set group-id mail, and so it
cannot generate lock files in the mail directory!

I discovered this quite accidentally; I was using DBX to attempt some
debugging and found that vfork() confused DBX, so I recompiled elm to use
fork() instead. Suddenly, everything worked fine! So I wrote a simple test
program which runs set group-id, vforks, and does a setgid(getgid()) in the
child. Sure enough, the group-id in the parent changes!
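
A minimal sketch of such a test program, not the original one. It assumes a
modern POSIX system (waitpid() and _exit() as used here may not match what
IRIX 3.3.1 offered), and the function name is made up for illustration. Note
that a vfork() child is formally only supposed to call _exit() or an exec
function, so this deliberately probes behaviour outside that contract.

```c
#define _DEFAULT_SOURCE 1   /* glibc: make sure vfork() is declared */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the parent's effective gid is
 * unchanged after a vfork() child calls setgid(getgid()), 0 if the
 * child's setgid() leaked into the parent, and -1 on error.
 */
int parent_egid_survives_vfork(void)
{
    gid_t before = getegid();
    int status;
    pid_t pid = vfork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: drop to the real gid, as ELM's access check does. */
        if (setgid(getgid()) != 0)
            _exit(1);
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    /* Parent: with separate credentials this is still the old egid. */
    return getegid() == before;
}
```

To reproduce the situation described above, the binary has to be installed
set group-id (so that real and effective gid differ); otherwise the check
trivially returns 1.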

vfork() also causes problems with Perl. I strongly suggest not using it at
all, unless you *really* need the performance improvement that it gives.

	Whee!

--
C. Harald Koch  VE3TLA                Alias Research, Inc., Toronto ON Canada
chk%alias@csri.utoronto.ca      chk@gpu.utcs.toronto.edu      chk@chk.mef.org
"Open the Zamboni! We're coming out!" - Kathrin Garland and Anson James, 2299

eggert@twinsun.com (Paul Eggert) (11/30/90)

chk%alias@csri.toronto.edu (C. Harald Koch) writes:

	Under IRIX, the vfork() call is actually implemented using sproc(),
	which is a more primitive way to get multiple processes.  It DOES NOT
	give you a separate u-area.  So the setgid() call affects the parent!
	...  I strongly suggest not using it at all....

I second this suggestion.  Under IRIX 3.3, vfork() also botches file
descriptors: e.g. if a vfork() child process closes a file, IRIX mistakenly
closes the corresponding file descriptor in the parent.  I ran into this
problem porting RCS 5.4 to SGI.
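
A minimal sketch of a check for the descriptor problem, assuming a modern
POSIX system; the function name and the use of /dev/null are illustrative
only and do not come from the RCS port.

```c
#define _DEFAULT_SOURCE 1   /* glibc: make sure vfork() is declared */
#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a descriptor closed in a vfork()
 * child is still open in the parent afterwards, 0 if the close leaked
 * into the parent, and -1 on error.
 */
int parent_fd_survives_vfork_close(void)
{
    int status, still_open;
    pid_t pid;
    int fd = open("/dev/null", O_RDONLY);

    if (fd < 0)
        return -1;
    pid = vfork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: close its descriptor, then exit at once. */
        close(fd);
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    /* fcntl(F_GETFD) fails with EBADF if fd is no longer open here. */
    still_open = (fcntl(fd, F_GETFD) != -1);
    if (still_open)
        close(fd);
    return still_open;
}
```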

I couldn't find any SGI documentation for vfork(),
so I suspect it's both undocumented and unsupported.
Even so, surely it is unwise for SGI to supply such a nonstandard vfork(),
because too many people will run into similar problems.

jmb@patton.wpd.sgi.com (Doctor Software) (12/03/90)

In article <1990Nov30.041307.15489@twinsun.com>, eggert@twinsun.com
(Paul Eggert) writes:
> 
> chk%alias@csri.toronto.edu (C. Harald Koch) writes:
> 
> 	Under IRIX, the vfork() call is actually implemented using sproc(),
> 	which is a more primitive way to get multiple processes.  It DOES NOT
> 	give you a separate u-area.  So the setgid() call affects the parent!
> 	...  I strongly suggest not using it at all....
> 

I guess I missed the original posting on this one, but the assumption
here is wrong. sproc(2) creates a new process which shares certain
resources with the parent process. The caller of sproc() is in control
of which are shared. The list includes VM, file descriptors, user and
group IDs, and others. Thus, there >is< a separate u-area for both.
sproc() and vfork() aren't even in the same league - sproc() creates
multiple threads of execution, while vfork() was implemented solely to
speed up the fork/exec sequence.

> I second this suggestion.  Under IRIX 3.3, vfork() also botches file
> descriptors: e.g. if a vfork() child process closes a file, IRIX mistakenly
> closes the corresponding file descriptor in the parent.  I ran into this
> problem porting RCS 5.4 to SGI.
> 
> I couldn't find any SGI documentation for vfork(),
> so I suspect it's both undocumented and unsupported.

Wherever you found a vfork() routine, the underlying sproc() undoubtedly
shares file descriptors with the parent, and thus your problems. As to
support, indeed, vfork() is unsupported by SGI. Please note that vfork()
is supposed to be semantically equivalent to fork(), unless the child
messes up the parent's address space (which the 4.3 manual page warns about).

Vfork() is actually itself a kludge, because it is a response to the
poor performance of fork() on Bezerkley based machines. Because BSD VM
copies the entire address space on fork, fork is expensive. Modern VM
systems (including IRIX) implement copy-on-write as well as text
sharing, so the cost of a fork() call is very small. The proper thing to
do is to fix fork()'s poor performance, not kludge in a system call to
get around it.

Thus, the easiest way to implement vfork() is to add this to the program:

# define vfork	fork

and leave it at that.
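
A sketch of how that define might sit in a program, assuming a modern POSIX
system. VFORK_IS_BROKEN is a hypothetical macro that a configuration script
or a porter would set by hand, and run_shell() is an illustrative helper.
The child only calls execl() or _exit(), which is the fork/exec pattern for
which fork() and vfork() are meant to be interchangeable.

```c
#define _DEFAULT_SOURCE 1   /* glibc: make sure vfork() is declared */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical switch: define VFORK_IS_BROKEN to fall back to fork(). */
#ifdef VFORK_IS_BROKEN
#define vfork fork
#endif

/*
 * Illustrative helper: run "sh -c cmd" and return its exit status,
 * or -1 if the child could not be created or did not exit normally.
 */
int run_shell(const char *cmd)
{
    int status;
    pid_t pid = vfork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: exec immediately; touch nothing else. */
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);     /* only reached if execl() failed */
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Compiling with -DVFORK_IS_BROKEN changes nothing in the caller; only the
primitive underneath is swapped.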

> Even so, surely it is unwise for SGI to supply such a nonstandard vfork(),
> because too many people will run into similar problems.

Surely you jest. The entire computer industry would grind to a halt if
every company tried to make sure it only shipped "documented" entry
points to its libraries.

-- Jim Barton
   Silicon Graphics Computer Systems
   jmb@sgi.com

eggert@twinsun.com (Paul Eggert) (12/03/90)

jmb@patton.wpd.sgi.com (Doctor Software) writes:

	Thus, the easiest way to implement vfork() is ...
		# define vfork	fork

Several programs (e.g. perl, rn) have autoconfiguration scripts that determine
whether the host has a vfork(), and use fork() otherwise.
This strategy fails under IRIX 3.3, which has a vfork() that doesn't work.
A programmer who knows about the IRIX vfork bug can work around this by hand,
but it's unrealistic to expect this of every perl and rn maintainer.
If SGI wants to make it easy to port software to their machines,
IRIX should have either a working vfork(), or no vfork() at all.


	The entire computer industry would grind to a halt if every company
	tried to make sure it only shipped "documented" entry points to its
	libraries.

If IRIX's vfork() is indeed just an undocumented library entry point that has
nothing to do with BSD's vfork(), then it should be given a different name.

guido@cwi.nl (Guido van Rossum) (12/03/90)

jmb@patton.wpd.sgi.com (Doctor Software) writes:

>[Reasonable article omitted]
>
>Vfork() is actually itself a kludge, because it is a response to the
>poor performance of fork() on Bezerkley based machines.
                               ^^^^^^^^^
(As an SGI employee you should know better than using such pejoratives.)

From the rest of your story it is clear that SGI is aware that some
programs need vfork().  You also claim that SGI's fork() has adequate
performance to be used instead of vfork().  Then why did someone at SGI
bother to whip up an inadequate vfork() substitute using sproc() when
it could be implemented trivially using fork(), with better
preservation of the semantics?  (Note that the shared memory semantics
of vfork() are explicitly undefined, whereas the non-shared u-area is
essential.)  I'm sure this can be fixed in the next release.

--
Guido van Rossum, CWI, Amsterdam <guido@cwi.nl>
"A thing of beauty is a joy till sunrise"

jmb@patton.wpd.sgi.com (Doctor Software) (12/04/90)

In article <1990Dec3.024237.23749@twinsun.com>, eggert@twinsun.com (Paul
Eggert) writes:
> ...
> If SGI wants to make it easy to port software to their machines,
> IRIX should have either a working vfork(), or no vfork() at all.
> 
> 
> 	The entire computer industry would grind to a halt if every company
> 	tried to make sure it only shipped "documented" entry points to its
> 	libraries.
> 
> If IRIX's vfork() is indeed just an undocumented library entry point that has
> nothing to do with BSD's vfork(), then it should be given a different name.

You're right of course on both counts. The solution has already started
to work its way through the mill out here ...

-- Jim Barton
   Silicon Graphics Computer Systems
   jmb@sgi.com