[comp.sys.apollo] Failure of fork

hanche@imf.unit.no (Harald Hanche-Olsen) (03/07/91)
In article <5021316a.1bc5b@pisa.citi.umich.edu> rees@pisa.citi.umich.edu (Jim Rees) writes:

   In article <HANCHE.91Feb28201637@hufsa.imf.unit.no>, hanche@imf.unit.no (Harald Hanche-Olsen) writes:

     Status         03010002: process had a fatal error (process manager/process manager)
     In routine     "/sys/node_data.a0b0/systmp/global_readonly" offset 363A
     Called from    "pgm_$exec_uid_pn" line 1450
     Called from    "pgm_$exec_xoid_pn" line 1287

   This happens when an exec fails after the process is unrecoverably committed
   to running the new program.  Since it can't get back to the original program
   at this point, the process just exits.  This "shouldn't happen" (be glad
   it's not a kernel panic).

You're right it shouldn't happen, at least not when the exec'd file
really does exist and is eminently executable.  Why it does happen is
one of those questions we would maybe rather not know the answer to...

I am apparently not the only one seeing this problem.  I also had mail
from ericb@caen.engin.umich.edu (Eric Bratton) who has seen /bin/sh do
the same thing, and "can sometimes reproduce the bug in /bin/sh by
doing a very large /com/cpt command across the network".

Now I have learned a really good way to provoke this bug, and that is
by running xdm and Xapollo from the MIT X11R4 distribution.  And it is
not only exec() that fails; fork() can also fail, though not in quite
the same destructive manner.  Here is the story:

THE FAILURE OF fork()

In a separate posting I have told about our troubles with the R4
Xapollo server and xdm, wherein Xapollo fails in the initialization,
forcing xdm to start a new one.  Well, that works fine much of the
time, but once in a while the fork() call fails with `no more
processes' which, of course, should not be so when the total number of
processes is below twenty.  It happens anyway.

Now, I did a clever (I thought) workaround:  First, I modified xdm to
just fork() again if it failed, but that never succeeded.  Sooo, I
figured, somehow that process's tables are all screwed up.  Other
processes can fork(), I mean, I can run a shell and run all kinds of
commands in it, right?  So I modified xdm to exit with a special exit
code if it could not fork(), and wrote a tiny parent process (called
it xdmfix) which, when the child exits with that status code, just
starts up a new one.  Works fine, xdmfix itself never reported
difficulties with fork(), but it often reports having to restart xdm,
which it does just fine.  But...
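
For readers who want to try the same trick, here is a minimal sketch of
what a restarting parent like xdmfix might look like.  It is not the
actual xdmfix source: the names run_once and supervise, the exit code
77, and the restart limit are all made up for illustration, and it
assumes an ordinary POSIX-style fork()/execv()/waitpid().

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical exit code by which the child says
 * "my fork() failed, please start a fresh copy of me". */
#define RESTART_STATUS 77

/* Run argv[0] once as a child process.  Returns the child's exit
 * status, or -1 if it could not be started or did not exit normally. */
static int run_once(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        execv(argv[0], argv);
        perror("execv");        /* reached only if execv() returned */
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Keep restarting the child for as long as it exits with
 * RESTART_STATUS, at most max_restarts times.  Returns the status of
 * the last run. */
static int supervise(char *const argv[], int max_restarts)
{
    int status = run_once(argv);
    int restarts = 0;

    while (status == RESTART_STATUS && restarts < max_restarts) {
        restarts++;
        fprintf(stderr, "%s asked for a restart (%d)\n", argv[0], restarts);
        status = run_once(argv);
    }
    return status;
}
```

The child side is then just a matter of calling exit(RESTART_STATUS)
when fork() returns -1.  The restart limit is there so that a child
which can never fork does not keep the parent spinning forever.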

THE FAILURE OF exec()

Yes, this is where I first saw exec() failing.  My clever xdmfix
process sometimes cannot exec(), and hence dies an unnatural death.
Here is the top end of a typical traceback:

Program        /bsd4.3/usr/bin/X11-R4/xdmfix
Status         03010002: process had a fatal error (process manager/process manager)
In routine     "<UID 4FD535A6.4001996A>" offset 363A
Called from    "pgm_$exec_uid_pn" line 1450
Called from    "pgm_$exec_xoid_pn" line 1287
Called from    "execve" line 224
Called from    "execv" line 146
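
For contrast, the normal contract is that execv() returns only when it
fails, so the caller can report the error and exit with a status of its
own choosing.  The sketch below shows how a parent could tell that
ordinary kind of failure apart from a child that was killed outright.
The names spawn_and_classify and enum outcome, and the use of 127 as
the "exec failed" status, are assumptions for illustration; how
Domain/OS reports the fatal error above to a waiting parent is not
something established here.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical classification of what happened to the child. */
enum outcome { RAN_OK, RAN_FAILED, EXEC_FAILED, KILLED, SPAWN_ERROR };

/* Status the child uses when execv() returns; 127 follows the usual
 * shell convention, which is an assumption here. */
#define EXEC_FAILED_STATUS 127

static enum outcome spawn_and_classify(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0) {
        fprintf(stderr, "fork: %s\n", strerror(errno));
        return SPAWN_ERROR;
    }
    if (pid == 0) {
        execv(argv[0], argv);
        /* Only reached when execv() failed cleanly and returned. */
        fprintf(stderr, "execv %s: %s\n", argv[0], strerror(errno));
        _exit(EXEC_FAILED_STATUS);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return SPAWN_ERROR;
    if (WIFSIGNALED(status))
        return KILLED;              /* child died without exiting */
    if (!WIFEXITED(status))
        return SPAWN_ERROR;

    switch (WEXITSTATUS(status)) {
    case 0:                  return RAN_OK;
    case EXEC_FAILED_STATUS: return EXEC_FAILED;
    default:                 return RAN_FAILED;
    }
}
```

A program that itself exits with 127 would be misread as a failed
exec, so this is a diagnostic aid rather than a reliable test.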

At the same time, I found it impossible to log in on the node via
TCP/IP.  I got a connection, but nothing happened.  After rebooting the
hard way, I did a traceback and found that inetd had crashed just like
xdmfix in the traceback above.  I didn't report this problem here,
however, until the same problem (with inetd) showed up on a node
running vanilla HP-supplied software, showing that this is likely to
affect other users too.  Why the X11R4 server and xdm provoke this bug
(bugs?) so consistently, while most nodes are unaffected most of the
time, I don't know.  As always, I am happy for any suggestions, though
I am not too optimistic at the moment.  If nothing else, maybe here is
more fuel for the current flames directed at Domain/OS...

- Harald Hanche-Olsen <hanche@imf.unit.no>
  Division of Mathematical Sciences
  The Norwegian Institute of Technology
  N-7034 Trondheim, NORWAY