[comp.unix] Children's exit

lenb@houxs.UUCP (02/23/88)

  Okay UNIX Sys V hackers, here's a question for you.
  In the following scenario, how should a parent process
  wait for its children to complete:

REQUIREMENT:

  I have a parent process that forks 30 identical children.
  The children conduct some measurements, and when done,
  each sends a single IPC message with results back to the
  parent and exits.

  The children are identical, so they should all have roughly
  equal life span, though that time may vary between 5 and 15 minutes.
  
  The parent needs to be woken when the first child exits --
  a straightforward wait().  The parent must also know if
  any children complete in error.

  It is preferable that the parent check the children's exit status
  for any errors, since the system may indicate strange situations in
  the exit status, and the children are already designed to use exit(code).
  
POSSIBLE SOLUTIONS:

  Here's what I've thought of so far:

  There seem to be 2 types of solutions, either use wait() with or
  without SIGCLD, or use blocking message receives.
  
  I'd like to use wait(), because the children have a meaningful
  exit status.  The question is: is it possible that my program will
  be woken up only 20 times for 30 children?  I.e., could I miss
  child deaths because several occur "simultaneously"?  ("Simultaneously"
  meaning that while I'm awake checking one child's return code, another
  2 children die -- the next wait() missing one or both of them.)
  
  If I *do* miss child deaths, then upon each wakeup from wait()
  I could kill(pid, 0) each of the children to see if they're all dead.
  I wouldn't miss any deaths that way, but I'd still miss some exit codes.
  
  If I'm going to miss exit codes, I could use signal(SIGCLD, SIG_IGN)
  after the first child's death to wait() for the last child's death.
  Then I'd check to see if I have 30 messages waiting.  There are
  warnings about using this signal in signal(2), so this is no good.

  Another possibility is to have the children send a software signal
  to the parent just before they die.  I wouldn't miss any deaths,
  but this is no help with exit codes.
  
  Another solution is to use vanilla blocking message receives.
  I know how many children I have, and could expect that number
  of messages.  I'd have to change the children to not send a message
  if they encountered a problem -- the message in effect acting as
  a "normal" return code.  However, error codes from built in exit()s
  would be lost, unless redesigned to send the code in a message before
  exiting.  I'd also lose any system information encoded in the exit code.

Has anybody out there run into this type of situation?
Any facts, clues or pointers appreciated.  If you reply,
please cc: email since I don't often read news.  Thanks.

Len Brown
201-949-0092
{ ihnp4 etc. }!houxs!lenb

davek@heurikon.UUCP (David Klann) (05/20/88)

In article <4626@mcdchg.UUCP> you write:
|
|  Okay UNIX Sys V hackers, here's a question for you.
|  In the following scenario, how should a parent process
|  wait for its children to complete:
|
|  
|POSSIBLE SOLUTIONS:
|
|  Here's what I've thought of so far:
|
|  There seem to be 2 types of solutions, either use wait() with or
|  without SIGCLD, or use blocking message receives.
|  
|  I'd like to use wait(), because the children have a meaningful
|  exit status.  The question is, is it possible that my program
|  be woken up only 20 times, for 30 children.  Ie. could I miss
|  child deaths because several occur "simultaneously".  (simultaneously
|  meaning while I'm awake checking one child's return code, another
|  2 children die -- the next wait() missing one or both of them.)

I've recently finished porting BSD csh(1) to System V Rel. 3.0.  I
agree that using wait(2) with SIGCLD is possibly a bad way to go
because of the warnings in signal(2), but I used it anyway.

The way to ensure catching all of your children is to use the Release
3 sigset(2) group of calls (I assume you're running Release 3).  Use
sigset( SIGCLD, function ) to trap SIGCLD.  Then, when you wait() and
receive the first SIGCLD, the system will automatically hold any
further SIGCLD signals.  Before leaving your signal catching function,
be sure to call sigrelse() to release any held SIGCLD signals.

If you're not running Release 3, I'd suggest you use the message-based
solution.

Good Luck!
David Klann
{ihnp4 | uwvax}!heurikon!davek

dopey@ihlpe.ATT.COM (James Blasius) (05/20/88)

It appears from the manual (though I won't swear by it) that a
child process can't go away until it's waited for - either by
the parent, or process 1 if the parent goes away.  So it would
seem "wait" would give you thirty responses.

The manual does say that if there are no children to wait for,
it will fail - so you won't be stuck waiting forever (like you
might if you do  "alarm" wrong).

James Blasius
ihnp4!gomez!dopey

asa@unisoft.UUCP (Asa Romberger) (05/20/88)

In article <4626@mcdchg.UUCP> lenb@houxs.UUCP writes:
>
>  An extended message about child deaths

The one signal that Sys V makes sure that you do not miss
is child death. You will not miss any either with SIGCLD
or wait.

abcscnge@csuna.UUCP (Scott Neugroschl) (05/20/88)

In article <4626@mcdchg.UUCP> lenb@houxs.UUCP writes:
>                The question is, is it possible that my program
>  be woken up only 20 times, for 30 children.  Ie. could I miss
>  child deaths because several occur "simultaneously".  (simultaneously
>  

As I understand it, wait() returns the PID of one child.  Therefore,
you should not get signals from the two children.  Another possibility:
nice() the children, so that the parent has a higher priority than the
child tasks. i.e.:

	for (i = 0; i < NUM_PROCS; i++)
	{
	    if ((pid = fork()) == -1)
	    {
		/* BAD FORK() */
	    }
	    else if (pid == 0)   /* child process */
	    {
		nice(5);		/* lower my priority */
		/* do child process stuff, then exit() so the
		   child does not continue around this loop */
	    }
	    else
	    {
		/* do parent process stuff */
	    }
	}

Just some thoughts...
-- 
Scott "The Pseudo-Hacker" Neugroschl
UUCP: {litvax,humboldt,sdcrdcf,rdlvax,ttidca,}\_ csun!abcscnge
      {psivax,csustan,nsc-sca,trwspf         }/
-- "They also surf who stand on waves"

lenb@houxs.UUCP (Len Brown) (05/20/88)

Thanks for the numerous replies, here's a summary:

"What is the proper way for a parent process to wait for multiple
 child processes to exit?"

People have warned me that when many children exit simultaneously,
a parent executing a wait(2) loop may miss the return status of
some of them.  The truth is that the parent will never miss a child's
exit status if the loop is constructed properly.

When a child exits, UNIX queues it as a zombie <defunct> until its
parent does a wait(2).  If the parent exits, init will inherit the
children and wait(2) for them.
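
As a rough illustration of the zombie queueing described above, here is
a minimal sketch in C.  It assumes a system with the usual <sys/wait.h>
and the default SIGCLD/SIGCHLD disposition; the function name
count_reaped() is made up for this example.  The children all exit
while the parent is deliberately not waiting, and the parent then
counts how many statuses wait() still hands back.

```c
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Fork n children that exit at once, let them all die before the
 * parent looks, then count how many statuses wait() still returns.
 * Returns the number of children reaped, or -1 if a fork failed.
 * (count_reaped is a hypothetical name used only for this sketch.)
 */
int count_reaped(int n)
{
    int i, status, reaped = 0;
    pid_t pid;

    for (i = 0; i < n; i++) {
        pid = fork();
        if (pid == -1)
            return -1;      /* fork failed; earlier children are left unreaped */
        if (pid == 0)
            _exit(i);       /* child: exit immediately with its index */
    }

    sleep(1);               /* parent looks away while the children die */

    /* Each wait() returns one queued zombie; -1 with ECHILD means none remain. */
    while ((pid = wait(&status)) != -1 || errno == EINTR) {
        if (pid != -1)
            reaped++;
    }
    return reaped;
}
```

If the description above holds, count_reaped(30) should report all 30
children even though every one of them died before the first wait().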

A simple wait(2) loop would suffice, except that sometimes wait(2)
returns when you don't expect it to.  The frequently cited example is
when the parent process is in a pipeline: sh(1) may make other pipeline
members children of the parent, and when they exit, they wake up the
parent.  So all one has to do is keep a list of the known children's
PIDs.  When wait(2) returns, check the list to be sure it is a known
child that woke you up.
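
A minimal sketch of such a loop follows.  It is not taken from any of
the replies; NKIDS, find_kid() and run_children() are hypothetical
names, and the child body is a stand-in (every tenth child exits with
code 1 to play the part of a failed measurement).  It assumes the
WIFEXITED/WEXITSTATUS macros from <sys/wait.h> are available.

```c
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define NKIDS 30    /* matches the 30 children in the question */

/* Return the index of pid in kids[], or -1 if it is not one of ours. */
static int find_kid(const pid_t kids[], int n, pid_t pid)
{
    int i;

    for (i = 0; i < n; i++)
        if (kids[i] == pid)
            return i;
    return -1;
}

/*
 * Fork NKIDS children, then wait for exactly those children.
 * Returns how many known children were reaped; *failed is set to the
 * number that exited with a non-zero code or did not exit normally.
 */
int run_children(int *failed)
{
    pid_t kids[NKIDS], pid;
    int i, n, status, done = 0;

    *failed = 0;
    for (n = 0; n < NKIDS; n++) {
        pid = fork();
        if (pid == -1)
            break;                      /* wait only for the ones we started */
        if (pid == 0)
            _exit(n % 10 == 0 ? 1 : 0); /* stand-in for the measurement */
        kids[n] = pid;
    }

    while (done < n) {
        pid = wait(&status);
        if (pid == -1) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal; retry */
            break;                      /* ECHILD: nothing left to wait for */
        }
        i = find_kid(kids, n, pid);
        if (i == -1)
            continue;                   /* not a child we forked; ignore it */
        kids[i] = 0;                    /* mark this slot as reaped */
        done++;
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            (*failed)++;
    }
    return done;
}
```

With the stand-in child body, children 0, 10 and 20 exit with code 1,
so a caller should see 30 children reaped and 3 of them counted as
failed.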

Len Brown,  AT&T Data Systems Group, Holmdel, NJ.
{ihnp4 etc}!houxs!lenb

chris@mimsy.UUCP (Chris Torek) (06/03/88)

In article <7963@mcdchg.UUCP> davek@heurikon.UUCP (David Klann) writes:
>... using wait(2) with SIGCLD is a possibly bad way to go
>because of the warnings in signal(2) ...

SIGCLD is specific to System V (well, there is an alias for it in
current 4BSD <signal.h> files, but ignore that).  On System V, with the
exception of the code in VR3 that was taken from Berkeley's `jobs'
library, it is true that there is no reliable way to catch most
signals.  SIGCLD, however, is special.  It is unlike every other signal
in SysV in everything except the delivery mechanism.

Because of the special way SIGCLD is implemented (read: a kludge :-) ),
if your signal catching function reads

	/* ARGSUSED */
	child(signo)
		int signo;
	{
		int status, w;

		w = wait(&status);
		... do stuff with w & status ...

		/* this should be the last line */
		(void) signal(SIGCLD, child);
	}

you will never miss a child signal, even though two children that exit
`simultaneously' generate only one signal.  The reason is
that the special kludge *regenerates* the signal on the way out of
child(), if there happens to be another exit to pick up.

If you put the signal() call before the wait(), you will get an
endless recursion since there will always be at least one exit status
ready.

>The way to ensure catching all of your children is to use the Release
>3 sigset(2) group of calls (I assume you're running Release 3). ...

This will work only when the kludge is present.  Even given reliable
signals, a signal catcher may still run only once for multiple
signals.  So as long as you are relying on the SIGCLD kludge, you
might as well use the simpler interface above.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

paul@morganucodon.cis.ohio-state.edu (Paul Placeway) (06/14/88)

In article <10105@mcdchg.UUCP> daveb@laidbak.UUCP (Dave Burton) writes:
< Actually, this is more effort than is required.
< wait(2) will return a -1 and set errno to ECHILD if there are no more
< children to wait() on. So a simple loop will work:
< 
< 	while (wait(&status) != -1 || errno != ECHILD)
< 		;	/* check status here if desired */

Ah yes, but what if you want to catch all of your exited children
while others are still running, and you _don't_ want to wait(2) for
the running ones too?  Under BSD you could do that as:

	/* wait3() with WNOHANG returns 0 when children remain but
	   none has exited yet, so stop on anything that is not a pid */
	while ((pid = wait3(&status, WNOHANG, NULL)) > 0) {
		/* stuff */
	}

But what about USG?
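
For what it is worth, here is a hedged sketch of one answer.  It
assumes a system that provides the POSIX waitpid(2) call with WNOHANG,
which is not part of the System V releases discussed in this thread;
reap_exited() is a made-up name.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Collect every child that has already exited, without blocking on
 * children that are still running.  Returns the number reaped.
 * (Assumes POSIX waitpid(); reap_exited is a hypothetical name.)
 */
int reap_exited(void)
{
    int status, n = 0;
    pid_t pid;

    for (;;) {
        pid = waitpid(-1, &status, WNOHANG);
        if (pid > 0) {
            n++;            /* one exited child collected; check status here */
            continue;
        }
        if (pid == -1 && errno == EINTR)
            continue;       /* interrupted by a signal; retry */
        break;              /* 0: none has exited yet; -1/ECHILD: no children */
    }
    return n;
}
```

Called with no children at all, it returns 0 immediately rather than
blocking.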

		--Paul