lenb@houxs.UUCP (02/23/88)
Okay, UNIX Sys V hackers, here's a question for you. In the following scenario, how should a parent process wait for its children to complete?

REQUIREMENT:

I have a parent process that forks 30 identical children. The children conduct some measurements, and when done, each sends a single IPC message with results back to the parent and exits. The children are identical, so they should all have roughly equal life spans, though that time may vary between 5 and 15 minutes.

The parent needs to be woken when the first child exits -- a straightforward wait(). The parent must also know if any children complete in error. It is preferable that the parent check the children's exit status for any errors, since the system may indicate strange situations in the exit status, and the children are already designed to use exit(code).

POSSIBLE SOLUTIONS:

Here's what I've thought of so far:

There seem to be 2 types of solutions: either use wait(), with or without SIGCLD, or use blocking message receives.

I'd like to use wait(), because the children have a meaningful exit status. The question is: is it possible for my program to be woken up only 20 times for 30 children? I.e., could I miss child deaths because several occur "simultaneously"? (Simultaneously meaning that while I'm awake checking one child's return code, another 2 children die -- the next wait() missing one or both of them.)

If I *do* miss child deaths, then upon each wake-up from wait() I could kill(pid, 0) each of the children to see if they're all dead. I wouldn't miss any deaths that way, but I'd still miss some exit codes.

If I'm going to miss exit codes, I could use signal(SIGCLD, SIG_IGN) after the first child's death to wait() for the last child's death. Then I'd check to see if I have 30 messages waiting. There are warnings about using this signal in signal(2), so this is no good.

Another possibility is to have the children send a software signal to the parent just before they die. I wouldn't miss any deaths, but this is no help with exit codes.

Another solution is to use vanilla blocking message receives. I know how many children I have, and could expect that number of messages. I'd have to change the children to not send a message if they encountered a problem -- the message in effect acting as a "normal" return code. However, error codes from built-in exit()s would be lost, unless the children were redesigned to send the code in a message before exiting. I'd also lose any system information encoded in the exit code.

Has anybody out there run into this type of situation? Any facts, clues or pointers appreciated. If you reply, please cc: email since I don't often read news.

Thanks.
Len Brown 201-949-0092
{ ihnp4 etc. }!houxs!lenb
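The message-based alternative described in this post could look roughly like the following. This is only a sketch under assumptions: the message layout (struct result_msg), the function name collect_results(), and the use of a private System V message queue are all hypothetical, since the post does not say how the children's results are formatted or which IPC mechanism is used.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical message layout; the real result format is not given. */
struct result_msg {
    long mtype;     /* message type, must be > 0 for msgsnd() */
    int  child_no;  /* which child sent this message */
    int  value;     /* stand-in for the measurement result */
};

#define BODY_SIZE (sizeof(struct result_msg) - sizeof(long))

/*
 * Fork n children, have each send one message, and block until one
 * message per forked child has arrived.  Returns the sum of the
 * reported values, or -1 on error.
 */
int collect_results(int n)
{
    struct result_msg m;
    int qid, i, forked = 0, sum = 0, failed = 0;

    assert(n > 0);
    qid = msgget(IPC_PRIVATE, 0600);    /* queue private to this family */
    if (qid == -1)
        return -1;

    for (i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == -1) {
            failed = 1;                 /* collect what was started */
            break;
        }
        if (pid == 0) {                 /* child: report and exit */
            m.mtype = 1;
            m.child_no = i;
            m.value = i;
            _exit(msgsnd(qid, &m, BODY_SIZE, 0) == -1 ? 1 : 0);
        }
        forked++;
    }

    /*
     * Expect exactly one message per child.  If a child dies without
     * sending, this blocks forever: the weakness the post points out.
     */
    for (i = 0; i < forked; i++) {
        ssize_t r;
        do {
            r = msgrcv(qid, &m, BODY_SIZE, 0, 0);
        } while (r == -1 && errno == EINTR);
        if (r == -1) {
            failed = 1;
            break;
        }
        sum += m.value;
    }

    /* Reap the children and remove the queue. */
    while (wait(NULL) != -1 || errno == EINTR)
        ;
    msgctl(qid, IPC_RMID, NULL);

    return failed ? -1 : sum;
}
```

Note that nothing here looks at exit status, which is the drawback raised above: a child that fails before calling msgsnd() leaves the parent blocked in msgrcv().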
davek@heurikon.UUCP (David Klann) (05/20/88)
In article <4626@mcdchg.UUCP> you write:
|
| Okay UNIX Sys V hackers, here's a question for you.
| In the following scenario, how should a parent process
| wait for its children to complete:
|
|
|POSSIBLE SOLUTIONS:
|
| Here's what I've thought of so far:
|
| There seem to be 2 types of solutions, either use wait() with or
| without SIGCLD, or use blocking message receives.
|
| I'd like to use wait(), because the children have a meaningful
| exit status. The question is, is it possible that my program
| be woken up only 20 times, for 30 children. Ie. could I miss
| child deaths because several occur "simultaneously". (simultaneously
| meaning while I'm awake checking one child's return code, another
| 2 children die -- the next wait() missing one or both of them.)

I've recently finished porting BSD csh(1) to System V Rel. 3.0. I agree that using wait(2) with SIGCLD is possibly a bad way to go because of the warnings in signal(2), but I used it anyway.

The way to ensure catching all of your children is to use the Release 3 sigset(2) group of calls (I assume you're running Release 3). Use sigset( SIGCLD, function ) to trap SIGCLD. Then when you wait() and receive the first SIGCLD, the system will automatically SIGHOLD all SIGCLD signals. Before leaving your signal catching function, be sure to call sigrelse() to release any held SIGCLD signals.

If you're not running Release 3, I'd suggest you use the message-based solution.

Good Luck!
David Klann
{ihnp4 | uwvax}!heurikon!davek
dopey@ihlpe.ATT.COM (James Blasius) (05/20/88)
It appears from the manual (though I won't swear by it) that a child process can't go away until it's waited for - either by the parent, or by process 1 if the parent goes away. So it would seem "wait" would give you thirty responses. The manual does say that if there are no children to wait for, it will fail - so you won't be stuck waiting forever (like you might if you do "alarm" wrong).

James Blasius
ihnp4!gomez!dopey
asa@unisoft.UUCP (Asa Romberger) (05/20/88)
In article <4626@mcdchg.UUCP> lenb@houxs.UUCP writes:
>
> An extended message about child deaths

The one signal that Sys V makes sure that you do not miss is child death. You will not miss any with either SIGCLD or wait().
abcscnge@csuna.UUCP (Scott Neugroschl) (05/20/88)
In article <4626@mcdchg.UUCP> lenb@houxs.UUCP writes:
> The question is, is it possible that my program
> be woken up only 20 times, for 30 children. Ie. could I miss
> child deaths because several occur "simultaneously". (simultaneously
>

As I understand it, wait() returns the PID of one child. Therefore, you should not get signals from the two children.

Another possibility: nice() the children, so that the parent has a higher priority than the child tasks, i.e.:

    for (i = 0; i < NUM_PROCS; i++) {
        if ((pid = fork()) == -1) {
            /* BAD FORK() */
        } else if (pid == 0) {
            /* child process */
            nice(5);    /* lower my priority */
            /* ... do the child's work, then exit() so the
               child never falls back into this loop ... */
        } else {
            /* do parent process stuff */
        }
    }

Just some thoughts...
--
Scott "The Pseudo-Hacker" Neugroschl
UUCP:  {litvax,humboldt,sdcrdcf,rdlvax,ttidca,}\_ csun!abcscnge
       {psivax,csustan,nsc-sca,trwspf        }/
--
"They also surf who stand on waves"
lenb@houxs.UUCP (Len Brown) (05/20/88)
Thanks for the numerous replies. Here's a summary:

"What is the proper way for a parent process to wait for multiple child processes to exit?"

People have warned me that when many children exit simultaneously, a parent executing a wait(2) loop may miss the return status of some of them.

The truth is that the parent will never miss a child's exit status if the loop is constructed properly. When a child exits, UNIX queues it as a zombie <defunct> until its parent does a wait(2). If the parent exits, init will inherit the children and wait(2) for them.

A simple wait(2) loop would suffice, except that sometimes wait(2) returns when you don't expect it to. The frequently cited example is when the parent process is in a pipeline. sh(1) may make the other pipeline members children of the parent. When they exit, they wake up the parent.

So all one has to do is keep a list of the known children's pids. When wait(2) returns, check the list to be sure it is a known child that woke you up.

Len Brown, AT&T Data Systems Group, Holmdel, NJ.
{ihnp4 etc}!houxs!lenb
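The loop described in this summary might be sketched as follows. The function name run_children(), the MAX_CHILDREN limit, and the children's stand-in workload (child i simply exits with status i % 2) are all hypothetical; the status macros are the POSIX ones from <sys/wait.h>, which is an assumption about the target system.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CHILDREN 64     /* arbitrary limit for this sketch */

/*
 * Fork n children, then wait() until every known child has been
 * accounted for.  Returns the number of known children that failed
 * (nonzero exit status or killed by a signal), or -1 if fork() failed.
 */
int run_children(int n)
{
    pid_t known[MAX_CHILDREN];
    int i, remaining, failures = 0, fork_failed = 0;

    assert(n > 0 && n <= MAX_CHILDREN);

    for (i = 0; i < n; i++) {
        known[i] = fork();
        if (known[i] == -1) {
            n = i;              /* only wait for those that started */
            fork_failed = 1;
            break;
        }
        if (known[i] == 0)
            _exit(i % 2);       /* stand-in for the real measurement */
    }

    remaining = n;
    while (remaining > 0) {
        int status;
        pid_t pid = wait(&status);

        if (pid == -1) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal; retry */
            break;              /* ECHILD: nothing left to wait for */
        }

        /* Is this one of the children we started ourselves? */
        for (i = 0; i < n; i++)
            if (known[i] == pid)
                break;
        if (i == n)
            continue;           /* unknown child, e.g. a pipeline member */

        known[i] = 0;           /* mark this child as accounted for */
        remaining--;
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failures++;
    }

    return fork_failed ? -1 : failures;
}
```

Counting down only on known pids means that an unexpected child (such as a pipeline sibling handed over by the shell) is reaped but never mistaken for one of the 30.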
chris@mimsy.UUCP (Chris Torek) (06/03/88)
In article <7963@mcdchg.UUCP> davek@heurikon.UUCP (David Klann) writes:
>... using wait(2) with SIGCLD is a possibly bad way to go
>because of the warnings in signal(2) ...

SIGCLD is specific to System V (well, there is an alias for it in current 4BSD <signal.h> files, but ignore that). On System V, with the exception of the code in VR3 that was taken from Berkeley's `jobs' library, it is true that there is no reliable way to catch most signals. SIGCLD, however, is special. It is unlike every other signal in SysV in everything except the delivery mechanism. Because of the special way SIGCLD is implemented (read: a kludge :-) ), if your signal catching function reads

    /* ARGSUSED */
    child(signo)
        int signo;
    {
        int status, w;

        w = wait(&status);
        /* ... do stuff with w & status ... */
        /* this should be the last line */
        (void) signal(SIGCLD, child);
    }

you will never miss a child signal, even though if two children exit `simultaneously' they will only generate one signal. The reason is that the special kludge *regenerates* the signal on the way out of child(), if there happens to be another exit to pick up. If you put the signal() call before the wait(), you will get an endless recursion since there will always be at least one exit status ready.

>The way to ensure catching all of your children is to use the Release
>3 sigset(2) group of calls (I assume you're running Release 3). ...

This will work only when the kludge is present. Even given reliable signals, a signal catcher may still run only once for multiple signals. So as long as you are relying on the SIGCLD kludge, you might as well use the simpler interface above.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu
Path: uunet!mimsy!chris
paul@morganucodon.cis.ohio-state.edu (Paul Placeway) (06/14/88)
In article <10105@mcdchg.UUCP> daveb@laidbak.UUCP (Dave Burton) writes:
< Actually, this is more effort than is required.
< wait(2) will return a -1 and set errno to ECHILD if there are no more
< children to wait() on. So a simple loop will work:
<
<     while (wait(&status) != -1 || errno != ECHILD)
<         ; /* check status here if desired */

Ah yes, but what if you want to catch all of your exited children, you might still have running children, and you _don't_ want to wait(2) for them too?

Under BSD you could do that as:

    /* wait3() returns 0 when children remain but none has exited yet */
    while (wait3(&status, WNOHANG, NULL) > 0) {
        /* stuff */
    }

But what about USG?

--Paul
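On a system that provides the POSIX waitpid() call (an assumption; the thread does not say whether the poster's system has it), the same non-blocking sweep can be written without wait3(). The sketch below uses hypothetical names, reap_exited() and demo_poll(). The key detail is that with WNOHANG a return value of 0 means "children exist, but none has exited yet", so the loop must stop there instead of spinning.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Reap every child that has already exited, without blocking on the
 * ones still running.  Returns the number reaped, or -1 on an
 * unexpected error.
 */
int reap_exited(void)
{
    int status, count = 0;
    pid_t pid;

    for (;;) {
        pid = waitpid(-1, &status, WNOHANG);
        if (pid > 0) {
            count++;            /* check status here if desired */
            continue;
        }
        if (pid == 0)
            break;              /* children remain, none has exited */
        if (errno == EINTR)
            continue;
        if (errno == ECHILD)
            break;              /* no children at all */
        return -1;
    }
    return count;
}

/*
 * Demonstration: one child exits quickly, one keeps running.  Returns
 * what reap_exited() reported while the slow child was still alive.
 */
int demo_poll(void)
{
    pid_t quick, slow, r;
    int n;

    quick = fork();
    if (quick == -1)
        return -1;
    if (quick == 0)
        _exit(0);

    slow = fork();
    if (slow == -1) {
        waitpid(quick, NULL, 0);
        return -1;
    }
    if (slow == 0) {
        sleep(60);
        _exit(0);
    }

    sleep(1);                   /* crude: let the quick child exit */
    n = reap_exited();          /* returns without waiting for `slow` */

    kill(slow, SIGKILL);        /* clean up the long-running child */
    r = waitpid(slow, NULL, 0);
    assert(r == slow);
    return n;
}
```

Under these assumptions demo_poll() should report a single reaped child, the quick one, and return well before the slow child's 60 seconds are up.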