vrm@cathedral.cerc.wvu.wvnet.edu (Vasile R. Montan) (01/22/91)
I have a program which occasionally forks to do some processing. In order to avoid having zombie processes hang around, I put the following in my main routine:

void dowait()
{
    wait(0);
}

main()
{
    ...
    signal(SIGCHLD, dowait);
    ...
}

However, in another place in the code, I do a system() call and look at the return status to see if an error has occurred. Without the signal in the main routine, the system() call works fine, but with the signal, the system() call always returns a -1.

Is there an easy way to fix this?

**************** The above opinions are mine, all mine. *****************
Vasile R. Montan          Bell Atlantic Software Systems
                          9 South High Street
vrm@cerc.wvu.wvnet.edu    Morgantown, WV 26505
yang@nff.ncl.omron.co.jp (YANG Liqun) (01/23/91)
In article <15745vrm@cathedral.cerc.wvu.wvnet.edu> Vasile R. Montan writes:

> ... I put the following in my main routine:
>
>    void dowait()
>{
>    wait(0)

It should be wait((int *)0).

>main()
>  {
>    ...
>    signal(SIGCHLD, dowait);
>    ...
>  }

When a child process stopped or exited, SIGCHLD signal will be sent to the
process and wait system call itself will catch the SIGCHLD signal from a
child. So you do not need to use
signal(SIGCHLD, dowait);
just use
    wait(&ret_val)
in parent process.

I think the problem of your code is that a SIGCHLD signal is sent to
parent process when a child process dies, but the signal is caught and
then invoke a wait system call which will wait for another SIGCHLD signal.

Yang.

-----
Li-qun Yang                 OMRON Computer Technology R&D lab
yang@nff.ncl.omron.co.jp    tel: 075-951-5111 fax: 075-956-7403
diamond@jit345.swstokyo.dec.com (Norman Diamond) (01/24/91)
In article <YANG.91Jan23133130@newyork.nff.ncl.omron.co.jp> yang@nff.ncl.omron.co.jp (YANG Liqun) writes:
>In article <15745vrm@cathedral.cerc.wvu.wvnet.edu> Vasile R. Montan writes:
>> ... I put the following in my main routine:
>>    void dowait()
>>{
>>    wait(0)
>
>It should be wait((int *)0).

It should be wait((union wait *)0) in BSD. I don't know what it should be in System V.

>>main()
>>  {
>>    ...
>>    signal(SIGCHLD, dowait);
>>    ...
>>  }
>
>When a child process stopped or exited, SIGCHLD signal will be sent to the
>process and wait system call itself will catch the SIGCHLD signal from a
>child.

wait() does not catch a signal.

>So you do not need to use
>signal(SIGCHLD, dowait);
>just use
>    wait(&ret_val)
>in parent process.

This is true, he doesn't have to use signal(). However, he wants his main process to do other operations while the child is still running. If he calls wait() right away, then his main process is suspended. So he doesn't want to call wait() until after he receives a signal, when he knows that it will be a "short wait."

>I think the problem of your code is that a SIGCHLD signal is sent to
>parent process when a child process dies, but the signal is caught and
>then invoke a wait system call which will wait for another SIGCHLD signal.

No, this is not the problem. His method is a common one. He has a problem with the system() library call misbehaving, and neither you nor I know the answer to that one.
--
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.
gwyn@smoke.brl.mil (Doug Gwyn) (01/24/91)
In article <1991Jan24.023750.19569@tkou02.enet.dec.com> diamond@jit345.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>In article <YANG.91Jan23133130@newyork.nff.ncl.omron.co.jp> yang@nff.ncl.omron.co.jp (YANG Liqun) writes:
>>It should be wait((int *)0).
>It should be wait((union wait *)0) in BSD.

No, it's wait((int*)0) in all flavors of UNIX and POSIX.

"union wait" was a bogus attempt by somebody to give names to the
subfields of the status word, but it was never a correct description
of how wait() actually works and has been repudiated by IEEE 1003.1.
davel@cai.uucp (David W. Lauderback) (01/24/91)
In article <YANG.91Jan23133130@newyork.nff.ncl.omron.co.jp> yang@nff.ncl.omron.co.jp (YANG Liqun) writes:
>
>In article <15745vrm@cathedral.cerc.wvu.wvnet.edu> Vasile R. Montan writes:
>
>> ... I put the following in my main routine:
>>
>>    void dowait()
>>{
>>    wait(0)
>
>It should be wait((int *)0).
>

This could be important, but probably isn't the cause of the leftover processes.

>>main()
>>  {
>>    ...
>>    signal(SIGCHLD, dowait);
>>    ...
>>  }
>
>When a child process stopped or exited, SIGCHLD signal will be sent to the
>process and wait system call itself will catch the SIGCHLD signal from a
>child. So you do not need to use
>signal(SIGCHLD, dowait);
>just use
>    wait(&ret_val)
>in parent process.

If he didn't wait until a signal came in, the parent process would stop until the child dies. This is probably not the desired effect.

>
>I think the problem of your code is that a SIGCHLD signal is sent to
>parent process when a child process dies, but the signal is caught and
>then invoke a wait system call which will wait for another SIGCHLD signal.
>
>Yang.
>

Calling wait returns when: a signal occurs OR when a child's status is ready OR when there are no outstanding children. So the code above should work except for a timing problem if another signal comes in just as the process calls wait.

However, if you are just trying to get rid of left-over child processes
"zombie processes",
just use:
    signal(SIGCHLD,SIG_IGN);
instead of:
    signal(SIGCHLD, dowait);
and you need no wait. (see signal(2) or signal(3c) in BSD)

FYI: The zombie process is storing the child process' exit status, so must
remain until its parent process has read this information. SIG_IGN to SIGCHLD
states this process' child's return value should be discarded.
--
David W. Lauderback (a.k.a. uunet!cai!davel)
Century Analysis Incorporated
Disclaimer: Any relationship between my opinions and my employer's opinions is purely accidental.
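For readers following the handler approach discussed in this post, here is a minimal sketch of a SIGCHLD handler that collects children without blocking. It assumes a system that provides the POSIX calls waitpid(), sigaction() and sigsuspend(), which not every 1991 system in this thread did. The names reap_children and spawn_and_count are made up for the illustration. One plausible explanation for the system() symptom, not confirmed anywhere in the thread, is that a handler calling wait() collects the child that system() itself is waiting for, so system()'s own wait fails.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Number of children collected by the handler (illustrative only). */
static volatile sig_atomic_t reaped = 0;

/* SIGCHLD handler: collect every child that has already terminated,
 * without blocking when others are still running. */
static void reap_children(int signo)
{
    int saved_errno = errno;    /* waitpid() may overwrite errno */
    int status;

    (void)signo;
    while (waitpid(-1, &status, WNOHANG) > 0)
        reaped++;
    errno = saved_errno;
}

/* Hypothetical demo: fork n children that exit at once and let the
 * handler collect them. Returns the number collected, or -1 on error. */
int spawn_and_count(int n)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, wait_mask;
    int i, forked = 0, result;

    assert(n >= 0);
    reaped = 0;

    sa.sa_handler = reap_children;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */
    if (sigaction(SIGCHLD, &sa, &old_sa) == -1)
        return -1;

    /* Keep SIGCHLD blocked except inside sigsuspend(), so a signal
     * cannot slip in between the loop test and the wait. */
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old_mask);
    wait_mask = old_mask;
    sigdelset(&wait_mask, SIGCHLD);

    for (i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);           /* child: exit immediately */
        if (pid == -1)
            break;              /* stop forking, still collect the rest */
        forked++;
    }
    while (reaped < forked)
        sigsuspend(&wait_mask); /* atomically unblock SIGCHLD and sleep */

    result = (forked == n) ? reaped : -1;
    sigprocmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(SIGCHLD, &old_sa, NULL);
    return result;
}
```

The loop around waitpid() matters because several children may exit before the handler runs once, and pending SIGCHLD signals are not queued individually.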
vrm@babcock.cerc.wvu.wvnet.edu (Vasile R. Montan) (01/25/91)
From article <1991Jan24.084230.12153@cai.uucp>, by davel@cai.uucp (David W. Lauderback):
> However, if you are just trying to get rid of left-over child processes
> "zombie processes",
> just use:
>     signal(SIGCHLD,SIG_IGN);
> instead of:
>     signal(SIGCHLD, dowait);
> and you need no wait. (see signal(2) or signal(3c) in BSD)
>
> FYI: The zombie process is storing the child process' exit status, so must
> remain until its parent process has read this information. SIG_IGN to SIGCHLD
> states this process' child's return value should be discarded.

I have seen this solution proposed many times, but it doesn't work for me. I am using a Sun4 SunOS4.1. I have created the following test routine. Maybe someone could tell me if I am doing something wrong or if it is the operating system.

#include <signal.h>

main()
{
    int i;

    signal(SIGCHLD, SIG_IGN);
    for (i=0; i< 10; i++) {
        if (! fork())
            exit (0);
    }
    while (1) {}
}

It generates 10 children, which immediately exit then it waits in an infinite loop. When I do a "ps", I see all 10 <defunct> processes hanging around.
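The test above depends on how a given system treats an ignored SIGCHLD. As a hedged sketch, the check below assumes the later POSIX.1-2001 rule: with SIGCHLD set to SIG_IGN, terminated children do not become zombies, and wait() blocks until all children are gone and then fails with ECHILD. That rule postdates this thread and did not hold on BSD-derived systems of the time. The function name children_discarded is invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical check: ignore SIGCHLD, fork n children that exit at
 * once, then see whether wait() finds anything to collect.
 * Returns 1 if the system discarded every child by itself,
 * 0 if at least one child still had to be collected, -1 on error. */
int children_discarded(int n)
{
    struct sigaction sa, old_sa;
    int i, result = 1;

    assert(n >= 0);

    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGCHLD, &sa, &old_sa) == -1)
        return -1;

    for (i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);           /* child: exit immediately */
        if (pid == -1) {
            result = -1;        /* fork failed; still drain below */
            break;
        }
    }

    /* Drain: under the POSIX.1-2001 rule this loop never sees a pid;
     * wait() simply fails with ECHILD once all children are gone. */
    for (;;) {
        pid_t w = wait(NULL);
        if (w > 0) {
            if (result == 1)
                result = 0;     /* a child had to be collected by us */
            continue;
        }
        if (errno == EINTR)
            continue;           /* interrupted by another signal */
        break;                  /* ECHILD: nothing left */
    }

    sigaction(SIGCHLD, &old_sa, NULL);
    return result;
}
```

A return value of 0 would correspond to the behaviour reported in the post, where the children stay around until the parent collects them.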
diamond@jit345.swstokyo.dec.com (Norman Diamond) (01/25/91)
In article <14965@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>In article <1991Jan24.023750.19569@tkou02.enet.dec.com> diamond@jit345.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>>In article <YANG.91Jan23133130@newyork.nff.ncl.omron.co.jp> yang@nff.ncl.omron.co.jp (YANG Liqun) writes:
>>>It should be wait((int *)0).
>>It should be wait((union wait *)0) in BSD.
>
>No, it's wait((int*)0) in all flavors of UNIX and POSIX.

No, it's wait((union wait *)0) in systems that implement the bogus attempt
that we all know about.

>"union wait" was a bogus attempt by somebody to give names to the
>subfields of the status word, but it was never a correct description
>of how wait() actually works

It's rather disorganized but wait() does actually work in the same disorganized manner, on those systems.

>and has been repudiated by IEEE 1003.1.

I'm glad to hear that. Unfortunately, some computers aren't running IEEE 1003.1 yet.
--
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.
gwyn@smoke.brl.mil (Doug Gwyn) (01/26/91)
In article <1991Jan25.022950.10683@tkou02.enet.dec.com> diamond@jit345.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>In article <14965@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>>In article <1991Jan24.023750.19569@tkou02.enet.dec.com> diamond@jit345.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>>>In article <YANG.91Jan23133130@newyork.nff.ncl.omron.co.jp> yang@nff.ncl.omron.co.jp (YANG Liqun) writes:
>>>>It should be wait((int *)0).
>>>It should be wait((union wait *)0) in BSD.
>>No, it's wait((int*)0) in all flavors of UNIX and POSIX.
>No, it's wait((union wait *)0) in systems that implement the bogus attempt
>that we all know about.

No, it's wait((int*)0) in all flavors of UNIX and POSIX.
pt@geovision.uucp (Paul Tomblin) (01/29/91)
gwyn@smoke.brl.mil (Doug Gwyn) writes:
>In article <1991Jan25.022950.10683@tkou02.enet.dec.com> diamond@jit345.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>>In article <14965@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>>>In article <1991Jan24.023750.19569@tkou02.enet.dec.com> diamond@jit345.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>>>>In article <YANG.91Jan23133130@newyork.nff.ncl.omron.co.jp> yang@nff.ncl.omron.co.jp (YANG Liqun) writes:
>>>>>It should be wait((int *)0).
>>>>It should be wait((union wait *)0) in BSD.
>>>No, it's wait((int*)0) in all flavors of UNIX and POSIX.
>>No, it's wait((union wait *)0) in systems that implement the bogus attempt
>>that we all know about.
>No, it's wait((int*)0) in all flavors of UNIX and POSIX.

Sorry to add to this 'did not- did too' level of discussion, but a "man 2 wait" on several machines shows the following results:

union wait *wait_id:
    Dec RISC/Ultrix 4.0 on a DS 3100
    Sun OS 4.0.1 on a Sun 3/60
    Dec VAX/Ultrix 4.0 on a VaxStation 3600

int *wait_id:
    Sun OS 4.1 on a Sun 4/360
    AIX 3.1 on a RS/6000

and our VMS machine is temporarily down, so I can't check VMS 5.4, but it's int *wait_id in the VAX C Language Summary for VMS 4.0+ and VAX C 2.0+.

So Doug: is Ultrix not a flavour of unix? Tastes pretty close to the same to me!
--
Paul Tomblin, Department of Redundancy Department.       ! My employer does
The Romanian Orphans Support Group needs your help,      ! not stand by my
Ask me for details.                                      ! opinions....
pt@geovision.gvc.com or {cognos,uunet}!geovision!pt      ! Me neither.
gwyn@smoke.brl.mil (Doug Gwyn) (01/31/91)
In article <1356@geovision.UUCP> pt@geovision.gvc.com writes:
>gwyn@smoke.brl.mil (Doug Gwyn) writes:
>>No, it's wait((int*)0) in all flavors of UNIX and POSIX.
>So Doug: is Ultrix not a flavour of unix?

Well, that's a question I prefer not to answer. However, I explained the "union wait" situation previously. Here it is again:

Some "helpful" soul at UCB decided that it would be "nicer" to declare a union type for the wait() status, with bit field members designating the "subfields" of the status, than to simply announce, as had been the case universally in UNIX to that point, which bits of the int-type status had which meanings. Unfortunately, because of the lack of standard bit-field allocation semantics, to accommodate all previously existing C code that had been written according to the rules to that point, porting 4.nBSD to a new platform always required that the BSD porter check the bit-field definition and if necessary adjust it to accurately reflect the REAL definition of the wait() status, which has always been in terms of the lowest 16 bits of an int representation.

I just examined the 4.3BSD kernel source code and found no use of the w_* identifiers that are declared/defined in <sys/wait.h>. I did, however, find places where the kernel treated the wait() status as type int. This even more strongly indicates that the true type is int and that <sys/wait.h> is simply a bogus invention.

Note that int* and union wait* need not have the same representation (although they do in many implementations including VAX 4.3BSD PCC), so it does matter what the argument type really is. It is int*.
torek@elf.ee.lbl.gov (Chris Torek) (02/14/91)
(This really belongs in a Unix newsgroup; however, I expect no further followups, i.e., I think this will be the decisive answer.)

In various articles (see the references line) Doug Gwyn and Norman Diamond argue over the type of the argument to wait(2).

In article <1356@geovision.UUCP> pt@geovision.gvc.com writes:
>Sorry to add to this 'did not- did too' level of discussion, but a
>"man 2 wait" on several machines shows [both].

Although I am a known BSDite (`BSD pervert' to some :-) ), I have to side with Doug here. The mess came about for historical reasons.

In the days of Version 6 Unix, there was only one wait() system call; it took a pointer to int. V6 begat V7 and PWB; PWB grew (via a long and convoluted path) into System V while V7 grew into 32V and eventually to 4BSD. (There were various cross-fertilizations along the way, but by and large the systems split apart sometime between V6 and V7.)

As Doug has already noted, certain persons who shall remain nameless--- not to protect the guilty, but rather, simply, because I am not certain who---changed both wait() and wait3() at about the same time as job control (and wait3() itself) were added to the Berkeley kernel. (Wait() and wait3() were in fact the same system call, distinguished by, of all things, the condition codes in the VAX PSL. The whole setup was a botch. Fortunately, all is now repaired.) Since wait3() could and did return more information than did wait()%, it seemed convenient to make a union describing the different return values. While all this went on, no one changed the kernel: the union was carefully tailored to match the actual kernel code, which still used `int's.

-----
% Ignore that masked ptrace() behind the curtain
-----

Because the kernel was unchanged, the fields in the union were byte order dependent. When 4.3BSD was ported to the Tahoe, a big-endian machine, our industrious kernel hackers added byte-order macros and made use of them in defining the wait union. This made the same names work on the two different machines. Unfortunately, the resulting union definition was still not right: the byte order of any given machine does not uniquely determine the bit order of that machine. With the advent of POSIX our industrious kernel hackers finally gave up, sighed, and replaced the union with accessor macros.

Meanwhile, on all those machines that still use the old Berkeley union, it `just happens' (for the reasons given above) that `int's also work. New machines that conform to POSIX standards will use `int's. Therefore, all new software should use `int's. The new Berkeley <sys/wait.h> will still work with old software as well (there is some hackery in the accessor macros to accomplish this).

The answer, then, is that to wait for a process whose id is `pid' you should use:

    int w, status;

    if (check_other_wait_results(pid, &status))     /* if necessary */
        while ((w = wait(&status)) != pid) {
            if (w == -1 && errno == EINTR)  /* ugly but sometimes... */
                continue;                   /* ...necessary */
            record_other_wait_result(w, status); /* if necessary */
        }

The exit status of the process, if any, is then `status >> 8' and the signal, if any, that caused the process to die is then `status & 0177'. The process left a core dump (`image' or `traceback data' to non-Unix folks) if `status & 0200' is nonzero.

This *will work* on systems that currently have the union. It will draw warnings from lint, but then, lint does not know *every*thing.
--
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA            Domain: torek@ee.lbl.gov
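The traditional bit layout described in this post can be written down as a small decoder. This is only a sketch of that layout: the struct wait_info and the function decode_traditional are names invented here, and the extra `& 0377' mask on the exit code is an addition, meant to keep the result in the range 0..255 on machines where int is wider than 16 bits.

```c
#include <assert.h>

/* Hypothetical container for the three fields named in the post. */
struct wait_info {
    int exit_code;    /* meaningful only when term_signal == 0 */
    int term_signal;  /* 0 if the process was not killed by a signal */
    int dumped_core;  /* 1 if a core dump was left, else 0 */
};

/* Decode a wait() status using the traditional encoding:
 * low 7 bits = signal, bit 0200 = core dump, next byte = exit code. */
struct wait_info decode_traditional(int status)
{
    struct wait_info info;

    info.term_signal = status & 0177;
    info.dumped_core = (status & 0200) != 0;
    info.exit_code   = (status >> 8) & 0377;
    return info;
}
```

For example, a status of `3 << 8' decodes to exit code 3 with no signal, and a status of `0200 | 11' decodes to signal 11 with a core dump.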
gwc@root.co.uk (Geoff Clare) (02/18/91)
In comp.lang.c<9882@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>(This really belongs in a Unix newsgroup; however, I expect no further
>followups, i.e., I think this will be the decisive answer.)

Sorry to disappoint Chris, but I have something to add to his "decisive answer". I have cross-posted to comp.unix.programmer and directed follow-ups there. The discussion does have some relevance to 'C' since it is about the format of the status returned by wait(), and on UNIX systems this format also applies to the return value of the system() function.

>The answer, then, is that to wait for a process whose id is `pid' you
>should use:
>    int w, status;
>    if (check_other_wait_results(pid, &status))     /* if necessary */
>        while ((w = wait(&status)) != pid) {
>            if (w == -1 && errno == EINTR)  /* ugly but sometimes... */
>                continue;                   /* ...necessary */
>            record_other_wait_result(w, status); /* if necessary */
>        }
>The exit status of the process, if any, is then `status >> 8' and the
>signal, if any, that caused the process to die is then `status & 0177'.
>The process left a core dump (`image' or `traceback data' to non-Unix
>folks) if `status & 0200' is nonzero.

POSIX does not specify the precise encoding of information in the status returned by wait(), system(), etc., so portable programs should not rely on the traditional encoding Chris describes above. Instead macros are provided in <sys/wait.h> to extract the relevant data from the status:

WIFEXITED(status) is non-zero if the child exited normally, in which case WEXITSTATUS(status) gives the exit code.

WIFSIGNALED(status) is non-zero if the child was terminated by a signal, and WTERMSIG(status) gives the signal number.

WIFSTOPPED(status) is non-zero if the child was stopped by a signal, and WSTOPSIG(status) gives the signal number.
--
Geoff Clare <gwc@root.co.uk> (Dumb American mailers: ...!uunet!root.co.uk!gwc)
UniSoft Limited, London, England. Tel: +44 71 729 3773 Fax: +44 71 729 3273
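A short sketch of the macros listed in this post, assuming a POSIX system with waitpid() and <sys/wait.h>. The functions child_exit_code and child_term_signal, and the helper wait_for, are hypothetical names used only for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Wait for one specific child, retrying if a signal interrupts us.
 * Returns 0 and fills *status on success, -1 on failure. */
static int wait_for(pid_t pid, int *status)
{
    pid_t w;

    do {
        w = waitpid(pid, status, 0);
    } while (w == -1 && errno == EINTR);
    return (w == pid) ? 0 : -1;
}

/* Fork a child that exits with `code`; return the exit code as
 * reported by WEXITSTATUS, or -1 if anything else happened. */
int child_exit_code(int code)
{
    int status;
    pid_t pid;

    assert(code >= 0 && code <= 255);
    pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(code);
    if (wait_for(pid, &status) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Fork a child that sends itself `sig`; return the signal number as
 * reported by WTERMSIG, or -1 if the child was not killed by a signal. */
int child_term_signal(int sig)
{
    int status;
    pid_t pid;

    pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        kill(getpid(), sig);
        _exit(0);               /* reached only if the signal did not kill us */
    }
    if (wait_for(pid, &status) == -1)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Comparing against the symbolic name (for instance SIGKILL) rather than a literal number keeps the check independent of how a given system numbers its signals.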
gwyn@smoke.brl.mil (Doug Gwyn) (02/19/91)
In article <2608@root44.co.uk> gwc@root.co.uk (Geoff Clare) writes:
>POSIX does not specify the precise encoding of information in the status
>returned by wait(), system(), etc., so portable programs should not
>rely on the traditional encoding Chris describes above. Instead macros
>are provided in <sys/wait.h> to extract the relevant data from the status:

(1) PORTABLE programs MUST follow Chris's recommendation; not all existing UNIX environments provide the macros to which you alluded. PORTABLE != POSIX

(2) Does POSIX really neglect to specify the bits? Certainly as of the trial-use 1003.1 standard the bits were specified. In any case, all UNIX systems must continue to act as Chris described, regardless of whether POSIX requires additional facilities for this.
gwc@root.co.uk (Geoff Clare) (02/20/91)
In <15239@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>In article <2608@root44.co.uk> gwc@root.co.uk (Geoff Clare) writes:
>>POSIX does not specify the precise encoding of information in the status
>>returned by wait(), system(), etc., so portable programs should not
>>rely on the traditional encoding Chris describes above. Instead macros
>>are provided in <sys/wait.h> to extract the relevant data from the status:
>(1) PORTABLE programs MUST follow Chris's recommendation; not all
>existing UNIX environments provide the macros to which you alluded.
>PORTABLE != POSIX

I think Doug has misunderstood my meaning. Dan Bernstein gave a similar reaction in comp.unix.programmer. Perhaps I wasn't very clear, or maybe I used some English expression which doesn't mean quite the same to Americans.

Anyway, what I was trying to say is that because the wait status encoding is not specified by POSIX, it may not be the same on future POSIX systems, so programs should not rely on it. For maximum portability programs should use the POSIX macros *if they are defined*. If they are not defined, programs which are required to be portable to non-POSIX systems should of course revert to the traditional encoding. See my reply to Dan's follow-up in comp.unix.programmer for more details.

>(2) Does POSIX really neglect to specify the bits?

Yes.

By the way, Doug, why did you ignore the "Followup-To:" in my article? Are you using a broken newsreader? Follow-ups to this article are directed to comp.unix.programmer (again :-)
--
Geoff Clare <gwc@root.co.uk> (Dumb American mailers: ...!uunet!root.co.uk!gwc)
UniSoft Limited, London, England. Tel: +44 71 729 3773 Fax: +44 71 729 3273
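The "use the macros if they are defined, otherwise fall back" rule from this post might look roughly like the sketch below. The PORT_* wrapper names and the function command_exit_code are invented for the example. It also assumes that <sys/wait.h> exists and that a shell is available to system(), neither of which is guaranteed on every system discussed in the thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical wrappers: use the POSIX macros when <sys/wait.h>
 * provides them, otherwise fall back to the traditional encoding. */
#ifdef WIFEXITED
#define PORT_IFEXITED(s)   WIFEXITED(s)
#define PORT_EXITSTATUS(s) WEXITSTATUS(s)
#define PORT_IFSIGNALED(s) WIFSIGNALED(s)
#define PORT_TERMSIG(s)    WTERMSIG(s)
#else
#define PORT_IFEXITED(s)   (((s) & 0377) == 0)
#define PORT_EXITSTATUS(s) (((s) >> 8) & 0377)
#define PORT_IFSIGNALED(s) (((s) & 0177) != 0 && ((s) & 0177) != 0177)
#define PORT_TERMSIG(s)    ((s) & 0177)
#endif

/* Run `cmd` through system() and decode the result.
 * Returns the exit code (0..255), -1 if system() itself failed,
 * -2 if the command was killed by a signal, -3 for anything else. */
int command_exit_code(const char *cmd)
{
    int status;

    assert(cmd != NULL);
    status = system(cmd);
    if (status == -1)
        return -1;
    if (PORT_IFEXITED(status))
        return PORT_EXITSTATUS(status);
    if (PORT_IFSIGNALED(status))
        return -2;
    return -3;
}
```

Keeping the fallback behind one set of wrappers means the rest of a program never needs to know which of the two encodings is in use.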