[comp.lang.c] Catching termination of child process and system

vrm@cathedral.cerc.wvu.wvnet.edu (Vasile R. Montan) (01/22/91)

   I have a program which occasionally forks to do some processing.
In order to avoid having zombie processes hanging around, I put the
following in my main routine:

void dowait()
{
  wait(0);
}

main()
{
   ...
   signal(SIGCHLD, dowait);
   ...
}

   However, in another place in the code, I call system() and look
at the return status to see if an error has occurred.  Without the
signal handler installed in the main routine, system() works fine, but
with it, system() always returns -1.  Is there an easy way to fix this?

**************** The above opinions are mine, all mine. *****************
Vasile R. Montan                           Bell Atlantic Software Systems 
                                           9 South High Street             
vrm@cerc.wvu.wvnet.edu                     Morgantown, WV 26505

yang@nff.ncl.omron.co.jp (YANG Liqun) (01/23/91)

In article <15745vrm@cathedral.cerc.wvu.wvnet.edu> Vasile R. Montan writes:

> ... I put the following in my main routine:
> 
> void dowait()
>{
> wait(0)

It should be wait((int *)0).

>main()
> {
>   ...
>  signal(SIGCHLD, dowait);
> ...
> }

When a child process stops or exits, a SIGCHLD signal is sent to the
parent process, and the wait system call itself will catch the SIGCHLD
signal from the child. So you do not need to use
signal(SIGCHLD, dowait);
just use
   wait(&ret_val)
in the parent process.

I think the problem with your code is that a SIGCHLD signal is sent to
the parent process when a child process dies, but the signal is caught
and the handler then invokes a wait system call, which will wait for
another SIGCHLD signal.

Yang.

-----
Li-qun Yang                  OMRON Computer Technology R&D lab
yang@nff.ncl.omron.co.jp     tel: 075-951-5111  fax: 075-956-7403

diamond@jit345.swstokyo.dec.com (Norman Diamond) (01/24/91)

In article <YANG.91Jan23133130@newyork.nff.ncl.omron.co.jp> yang@nff.ncl.omron.co.jp (YANG Liqun) writes:
>In article <15745vrm@cathedral.cerc.wvu.wvnet.edu> Vasile R. Montan writes:
>> ... I put the following in my main routine:
>> void dowait()
>>{
>> wait(0)
>
>It should be wait((int *)0).

It should be wait((union wait *)0) in BSD.  I don't know what it should
be in System V.

>>main()
>> {
>>   ...
>>  signal(SIGCHLD, dowait);
>> ...
>> }
>
>When a child process stops or exits, a SIGCHLD signal is sent to the
>parent process, and the wait system call itself will catch the SIGCHLD
>signal from the child.

wait() does not catch a signal.

>So you do not need to use 
>signal(SIGCHLD, dowait);
>just use
>   wait(&ret_val)
>in the parent process.

This is true, he doesn't have to use signal().  However, he wants his main
process to do other operations while the child is still running.  If he
calls wait() right away, then his main process is suspended.  So he doesn't
want to call wait() until after he receives a signal, when he knows that
it will be a "short wait."

>I think the problem with your code is that a SIGCHLD signal is sent to
>the parent process when a child process dies, but the signal is caught
>and the handler then invokes a wait system call, which will wait for
>another SIGCHLD signal.

No, this is not the problem.  His method is a common one.  He has a problem
with the system() library call misbehaving, and neither you nor I know the
answer to that one.
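
One commonly cited explanation for the -1 is that a SIGCHLD handler which
calls wait() can collect the exit status of the shell started by system()
before system() itself does, so that system()'s own wait fails.  The sketch
below is one way to sidestep that, assuming this is indeed the cause on the
system in question: it resets SIGCHLD to its default action around the call.
The helper name system_without_handler is hypothetical and not from the
thread.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Handler in the style of the original post: reap one child per signal. */
static void dowait(int sig)
{
    (void)sig;
    wait((int *)0);
}

/*
 * Hypothetical helper (not from the thread): run a command with SIGCHLD
 * temporarily reset to its default action, so that no handler can collect
 * the shell's exit status before system() does.
 */
static int system_without_handler(const char *cmd)
{
    void (*old)(int);
    int status;

    assert(cmd != NULL);
    old = signal(SIGCHLD, SIG_DFL);
    if (old == SIG_ERR)
        return -1;
    status = system(cmd);
    signal(SIGCHLD, old);   /* put the previous handler back */
    return status;
}
```

Children that exit while the handler is removed are not reaped by it, so a
program using this approach would still have to collect them later.
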
--
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.

gwyn@smoke.brl.mil (Doug Gwyn) (01/24/91)

In article <1991Jan24.023750.19569@tkou02.enet.dec.com> diamond@jit345.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>In article <YANG.91Jan23133130@newyork.nff.ncl.omron.co.jp> yang@nff.ncl.omron.co.jp (YANG Liqun) writes:
>>It should be wait((int *)0).
>It should be wait((union wait *)0) in BSD.

No, it's wait((int*)0) in all flavors of UNIX and POSIX.

"union wait" was a bogus attempt by somebody to give names to the
subfields of the status word, but it was never a correct description
of how wait() actually works and has been repudiated by IEEE 1003.1.

davel@cai.uucp (David W. Lauderback) (01/24/91)

In article <YANG.91Jan23133130@newyork.nff.ncl.omron.co.jp> yang@nff.ncl.omron.co.jp (YANG Liqun) writes:
>
>In article <15745vrm@cathedral.cerc.wvu.wvnet.edu> Vasile R. Montan writes:
>
>> ... I put the following in my main routine:
>> 
>> void dowait()
>>{
>> wait(0)
>
>It should be wait((int *)0).
>
This could be important, but it probably isn't the cause of the leftover
processes.
>>main()
>> {
>>   ...
>>  signal(SIGCHLD, dowait);
>> ...
>> }
>
>When a child process stops or exits, a SIGCHLD signal is sent to the
>parent process, and the wait system call itself will catch the SIGCHLD
>signal from the child. So you do not need to use 
>signal(SIGCHLD, dowait);
>just use
>   wait(&ret_val)
>in the parent process.
If he didn't wait until a signal came in, the parent process would stop until
the child dies.  This is probably not the desired effect.
>
>I think the problem with your code is that a SIGCHLD signal is sent to
>the parent process when a child process dies, but the signal is caught
>and the handler then invokes a wait system call, which will wait for
>another SIGCHLD signal.
>
>Yang.
>
Calling wait returns when a signal occurs, OR when a child's status is ready,
OR when there are no outstanding children.  So the code above should work,
except for a timing problem if another signal comes in just as the process
calls wait.
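
A handler that never blocks is one way to make the timing less delicate.  The
sketch below assumes a POSIX system that provides waitpid(), sigaction() and
SA_RESTART (not every system discussed in this thread does).  It collects
every child that has already exited, because several exits may be reported by
a single SIGCHLD.  The names reap_children, install_reaper, start_worker and
do_processing are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX declarations such as sigaction */
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Number of children collected so far; kept only for illustration. */
static volatile sig_atomic_t reaped = 0;

/* SIGCHLD handler: collect every child that has exited, without blocking. */
static void reap_children(int sig)
{
    int saved_errno = errno;    /* waitpid() may change errno */
    int status;

    (void)sig;
    while (waitpid(-1, &status, WNOHANG) > 0)
        reaped++;
    errno = saved_errno;
}

/* Install the handler; returns 0 on success, -1 on failure. */
static int install_reaper(void)
{
    struct sigaction sa;

    sa.sa_handler = reap_children;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls where possible */
    return sigaction(SIGCHLD, &sa, (struct sigaction *)0);
}

/* Placeholder for the real processing done in the child. */
static void do_processing(void)
{
}

/* Fork a child that runs work() and exits; returns the child's pid or -1. */
static pid_t start_worker(void (*work)(void))
{
    pid_t pid;

    assert(work != NULL);
    pid = fork();
    if (pid == 0) {
        work();
        _exit(0);
    }
    return pid;
}
```

Note that a handler like this still collects the child created by system()
if system() does not block SIGCHLD itself, so it does not by itself address
the -1 return value.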

However, if you are just trying to get rid of left-over child processes
("zombie processes"), just use:
	signal(SIGCHLD,SIG_IGN);
instead of:
	signal(SIGCHLD, dowait);
and you need no wait. (see signal(2) or signal(3c) in BSD)

FYI: The zombie process is storing the child process's exit status, so it must
remain until its parent process has read this information.  Setting SIGCHLD to
SIG_IGN states that the exit status of this process's children should be
discarded.
-- 
David W. Lauderback (a.k.a. uunet!cai!davel)
Century Analysis Incorporated
Disclaimer: Any relationship between my opinions and my employer's
	    opinions is purely accidental.

vrm@babcock.cerc.wvu.wvnet.edu (Vasile R. Montan) (01/25/91)

From article <1991Jan24.084230.12153@cai.uucp>, by davel@cai.uucp (David W. Lauderback):
> However, if you are just trying to get rid of left-over child processes
> ("zombie processes"), just use:
> 	signal(SIGCHLD,SIG_IGN);
> instead of:
> 	signal(SIGCHLD, dowait);
> and you need no wait. (see signal(2) or signal(3c) in BSD)
> 
> FYI: The zombie process is storing the child process's exit status, so it must
> remain until its parent process has read this information.  Setting SIGCHLD to
> SIG_IGN states that the exit status of this process's children should be
> discarded.

   I have seen this solution proposed many times, but it doesn't work
for me. I am using a Sun4 SunOS4.1.  I have created the following test
routine.  Maybe someone could tell me if I am doing something wrong or
if it is the operating system.

#include <signal.h>

main()
{
  int i;
  signal(SIGCHLD, SIG_IGN);

  for (i=0; i< 10; i++) {
    if (! fork()) exit (0);
  }
  while (1) {}
}

It generates 10 children, which exit immediately; then it waits in an
infinite loop.  When I do a "ps", I see all 10 <defunct> processes
hanging around.
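
Where SIG_IGN does not prevent zombies, as reported here for SunOS 4.1, one
widely used alternative that does not depend on signal semantics is the
"double fork".  It is not proposed anywhere in this thread, and the sketch
below is only a minimal illustration of it, assuming a system with waitpid().
The parent waits briefly for a first child that exits at once, and the
grandchild that does the work is no longer the parent's child.  The names
spawn_detached and do_processing are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder for the real processing done in the background. */
static void do_processing(void)
{
}

/*
 * Run work() in a grandchild that the caller never has to wait for.
 * Returns 0 if the grandchild was started, -1 otherwise.
 */
static int spawn_detached(void (*work)(void))
{
    pid_t pid;
    int status;

    assert(work != NULL);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* First child: fork again and exit at once. */
        pid_t inner = fork();

        if (inner == 0) {
            work();             /* grandchild does the real work */
            _exit(0);
        }
        _exit(inner < 0 ? 1 : 0);
    }
    /* Parent: the first child exits immediately, so this wait is short. */
    while (waitpid(pid, &status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The cost is that the parent can no longer learn the exit status of the
process that did the work.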

diamond@jit345.swstokyo.dec.com (Norman Diamond) (01/25/91)

In article <14965@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>In article <1991Jan24.023750.19569@tkou02.enet.dec.com> diamond@jit345.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>>In article <YANG.91Jan23133130@newyork.nff.ncl.omron.co.jp> yang@nff.ncl.omron.co.jp (YANG Liqun) writes:
>>>It should be wait((int *)0).
>>It should be wait((union wait *)0) in BSD.
>
>No, it's wait((int*)0) in all flavors of UNIX and POSIX.

No, it's wait((union wait *)0) in systems that implement the bogus attempt
that we all know about.

>"union wait" was a bogus attempt by somebody to give names to the
>subfields of the status word, but it was never a correct description
>of how wait() actually works

It's rather disorganized but wait() does actually work in the same
disorganized manner, on those systems.

>and has been repudiated by IEEE 1003.1.

I'm glad to hear that.  Unfortunately, some computers aren't running
IEEE 1003.1 yet.
--
Norman Diamond       diamond@tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.

gwyn@smoke.brl.mil (Doug Gwyn) (01/26/91)

In article <1991Jan25.022950.10683@tkou02.enet.dec.com> diamond@jit345.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>In article <14965@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>>In article <1991Jan24.023750.19569@tkou02.enet.dec.com> diamond@jit345.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>>>In article <YANG.91Jan23133130@newyork.nff.ncl.omron.co.jp> yang@nff.ncl.omron.co.jp (YANG Liqun) writes:
>>>>It should be wait((int *)0).
>>>It should be wait((union wait *)0) in BSD.
>>No, it's wait((int*)0) in all flavors of UNIX and POSIX.
>No, it's wait((union wait *)0) in systems that implement the bogus attempt
>that we all know about.

No, it's wait((int*)0) in all flavors of UNIX and POSIX.

pt@geovision.uucp (Paul Tomblin) (01/29/91)

gwyn@smoke.brl.mil (Doug Gwyn) writes:
>In article <1991Jan25.022950.10683@tkou02.enet.dec.com> diamond@jit345.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>>In article <14965@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:
>>>In article <1991Jan24.023750.19569@tkou02.enet.dec.com> diamond@jit345.enet@tkou02.enet.dec.com (Norman Diamond) writes:
>>>>In article <YANG.91Jan23133130@newyork.nff.ncl.omron.co.jp> yang@nff.ncl.omron.co.jp (YANG Liqun) writes:
>>>>>It should be wait((int *)0).
>>>>It should be wait((union wait *)0) in BSD.
>>>No, it's wait((int*)0) in all flavors of UNIX and POSIX.
>>No, it's wait((union wait *)0) in systems that implement the bogus attempt
>>that we all know about.
>No, it's wait((int*)0) in all flavors of UNIX and POSIX.

Sorry to add to this 'did not- did too' level of discussion, but a 
"man 2 wait" on several machines shows the following results:


union wait *wait_id:

Dec RISC/Ultrix 4.0 on a DS 3100
Sun OS 4.0.1 on a Sun 3/60
Dec VAX/Ultrix 4.0 on a VaxStation 3600

int *wait_id:

Sun OS 4.1 on a Sun 4/360
AIX 3.1 on a RS/6000

and our VMS machine is temporarily down, so I can't check VMS 5.4, but it's
int *wait_id in the VAX C Language Summary for VMS 4.0+ and VAX C 2.0+.


So Doug: is Ultrix not a flavour of unix?  Tastes pretty close to the same to
me!
-- 
Paul Tomblin, Department of Redundancy Department.       ! My employer does 
The Romanian Orphans Support Group needs your help,      ! not stand by my
Ask me for details.                                      ! opinions.... 
pt@geovision.gvc.com or {cognos,uunet}!geovision!pt      ! Me neither.

gwyn@smoke.brl.mil (Doug Gwyn) (01/31/91)

In article <1356@geovision.UUCP> pt@geovision.gvc.com writes:
>gwyn@smoke.brl.mil (Doug Gwyn) writes:
>>No, it's wait((int*)0) in all flavors of UNIX and POSIX.
>So Doug: is Ultrix not a flavour of unix?

Well, that's a question I prefer not to answer.
However, I explained the "union wait" situation previously.
Here it is again:

Some "helpful" soul at UCB decided that it would be "nicer" to declare
a union type for the wait() status, with bit field members designating
the "subfields" of the status, than to simply announce, as had been
the case universally in UNIX to that point, which bits of the int-type
status had which meanings.  Unfortunately, because of the lack of
standard bit-field allocation semantics, to accommodate all previously
existing C code that had been written according to the rules to that
point, porting 4.nBSD to a new platform always required that the BSD
porter check the bit-field definition and if necessary adjust it to
accurately reflect the REAL definition of the wait() status, which has
always been in terms of the lowest 16 bits of an int representation.

I just examined the 4.3BSD kernel source code and found no use of the
w_* identifiers that are declared/defined in <sys/wait.h>.  I did,
however, find places where the kernel treated the wait() status as
type int.  This even more strongly indicates that the true type is int
and that <sys/wait.h> is simply a bogus invention.  Note that int* and
union wait* need not have the same representation (although they do in
many implementations including VAX 4.3BSD PCC), so it does matter what
the argument type really is.  It is int*.

torek@elf.ee.lbl.gov (Chris Torek) (02/14/91)

(This really belongs in a Unix newsgroup; however, I expect no further
followups, i.e., I think this will be the decisive answer.)

In various articles (see the references line) Doug Gwyn and Norman Diamond
argue over the type of the argument to wait(2).

In article <1356@geovision.UUCP> pt@geovision.gvc.com writes:
>Sorry to add to this 'did not- did too' level of discussion, but a 
>"man 2 wait" on several machines shows [both].

Although I am a known BSDite (`BSD pervert' to some :-) ), I have to side
with Doug here.

The mess came about for historical reasons.  In the days of Version 6
Unix, there was only one wait() system call; it took a pointer to int.
V6 begat V7 and PWB; PWB grew (via a long and convoluted path) into
System V while V7 grew into 32V and eventually to 4BSD.  (There were
various cross-fertilizations along the way, but by and large the systems
split apart sometime between V6 and V7.)

As Doug has already noted, certain persons who shall remain nameless---
not to protect the guilty, but rather, simply, because I am not certain
who---changed both wait() and wait3() at about the same time as job
control (and wait3() itself) were added to the Berkeley kernel.
(Wait() and wait3() were in fact the same system call, distinguished
by, of all things, the condition codes in the VAX PSL.  The whole setup
was a botch.  Fortunately, all is now repaired.)  Since wait3() could
and did return more information than did wait()%, it seemed convenient
to make a union describing the different return values.  While all this
went on, no one changed the kernel: the union was carefully tailored
to match the actual kernel code, which still used `int's.
-----
% Ignore that masked ptrace() behind the curtain
-----

Because the kernel was unchanged, the fields in the union were byte
order dependent.  When 4.3BSD was ported to the Tahoe, a big-endian
machine, our industrious kernel hackers added byte-order macros and
made use of them in defining the wait union.  This made the same names
work on the two different machines.  Unfortunately, the resulting union
definition was still not right: the byte order of any given machine
does not uniquely determine the bit order of that machine.  With the
advent of POSIX our industrious kernel hackers finally gave up, sighed,
and replaced the union with accessor macros.

Meanwhile, on all those machines that still use the old Berkeley union,
it `just happens' (for the reasons given above) that `int's also work.
New machines that conform to POSIX standards will use `int's.  Therefore,
all new software should use `int's.  The new Berkeley <sys/wait.h> will
still work with old software as well (there is some hackery in the
accessor macros to accomplish this).

The answer, then, is that to wait for a process whose id is `pid' you
should use:

	int w, status;

	if (check_other_wait_results(pid, &status))	/* if necessary */
	while ((w = wait(&status)) != pid) {
		if (w == -1 && errno == EINTR)	/* ugly but sometimes... */
			continue;		/* ...necessary */
		record_other_wait_result(w, status);	/* if necessary */
	}

The exit status of the process, if any, is then `status >> 8' and the
signal, if any, that caused the process to die is then `status & 0177'.
The process left a core dump (`image' or `traceback data' to non-Unix
folks) if `status & 0200' is nonzero.  This *will work* on systems
that currently have the union.  It will draw warnings from lint, but
then, lint does not know *every*thing.
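
The three expressions above can be gathered into one small function.  This is
only a sketch of the traditional encoding as described here; the struct and
function names are hypothetical.  The exit code is additionally masked to
eight bits and the status is assumed to fit in the low 16 bits of the int,
neither of which the text above spells out.

```c
#include <assert.h>

/* Hypothetical container for the three pieces of a wait() status. */
struct wait_info {
    int exit_code;      /* meaningful only when signal == 0 */
    int signal;         /* 0 if the process was not killed by a signal */
    int core_dumped;    /* nonzero if a core dump was left */
};

/* Decode a status word using the traditional (non-POSIX-macro) layout. */
static struct wait_info decode_traditional(int status)
{
    struct wait_info info;

    /* Assumption: only the low 16 bits of the status are in use. */
    assert((status & ~0177777) == 0);

    info.signal = status & 0177;
    info.core_dumped = (status & 0200) != 0;
    info.exit_code = (status >> 8) & 0377;
    return info;
}
```
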
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

gwc@root.co.uk (Geoff Clare) (02/18/91)

In comp.lang.c<9882@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:

>(This really belongs in a Unix newsgroup; however, I expect no further
>followups, i.e., I think this will be the decisive answer.)

Sorry to disappoint Chris, but I have something to add to his "decisive
answer".  I have cross-posted to comp.unix.programmer and directed
follow-ups there.  The discussion does have some relevance to 'C' since
it is about the format of the status returned by wait(), and on UNIX
systems this format also applies to the return value of the system()
function.

>The answer, then, is that to wait for a process whose id is `pid' you
>should use:

>	int w, status;

>	if (check_other_wait_results(pid, &status))	/* if necessary */
>	while ((w = wait(&status)) != pid) {
>		if (w == -1 && errno == EINTR)	/* ugly but sometimes... */
>			continue;		/* ...necessary */
>		record_other_wait_result(w, status);	/* if necessary */
>	}

>The exit status of the process, if any, is then `status >> 8' and the
>signal, if any, that caused the process to die is then `status & 0177'.
>The process left a core dump (`image' or `traceback data' to non-Unix
>folks) if `status & 0200' is nonzero.

POSIX does not specify the precise encoding of information in the status
returned by wait(), system(), etc., so portable programs should not
rely on the traditional encoding Chris describes above.  Instead macros
are provided in <sys/wait.h> to extract the relevant data from the status:

     WIFEXITED(status) is non-zero if the child exited normally, in which
case WEXITSTATUS(status) gives the exit code.

     WIFSIGNALED(status) is non-zero if the child was terminated by a signal,
and  WTERMSIG(status) gives the signal number.

     WIFSTOPPED(status) is non-zero if the child was stopped by a signal,
and  WSTOPSIG(status) gives the signal number.
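
A short sketch of how these macros might be used on the value returned by
system(), assuming a system whose <sys/wait.h> defines them.  The enum and
the function names classify_status and run_command are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

enum child_fate { CHILD_EXITED, CHILD_SIGNALED, CHILD_STOPPED, CHILD_UNKNOWN };

/*
 * Classify a status from wait() or system().  *detail receives the exit
 * code or the signal number, depending on the result.
 */
static enum child_fate classify_status(int status, int *detail)
{
    assert(detail != NULL);
    if (WIFEXITED(status)) {
        *detail = WEXITSTATUS(status);
        return CHILD_EXITED;
    }
    if (WIFSIGNALED(status)) {
        *detail = WTERMSIG(status);
        return CHILD_SIGNALED;
    }
    if (WIFSTOPPED(status)) {
        *detail = WSTOPSIG(status);
        return CHILD_STOPPED;
    }
    return CHILD_UNKNOWN;
}

/* Run a shell command and classify how it ended. */
static enum child_fate run_command(const char *cmd, int *detail)
{
    int status;

    assert(cmd != NULL);
    status = system(cmd);
    if (status == -1)           /* system() itself failed */
        return CHILD_UNKNOWN;
    return classify_status(status, detail);
}
```
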
-- 
Geoff Clare <gwc@root.co.uk>  (Dumb American mailers: ...!uunet!root.co.uk!gwc)
UniSoft Limited, London, England.   Tel: +44 71 729 3773   Fax: +44 71 729 3273

gwyn@smoke.brl.mil (Doug Gwyn) (02/19/91)

In article <2608@root44.co.uk> gwc@root.co.uk (Geoff Clare) writes:
>POSIX does not specify the precise encoding of information in the status
>returned by wait(), system(), etc., so portable programs should not
>rely on the traditional encoding Chris describes above.  Instead macros
>are provided in <sys/wait.h> to extract the relevant data from the status:

(1)  PORTABLE programs MUST follow Chris's recommendation; not all
existing UNIX environments provide the macros to which you alluded.
PORTABLE != POSIX

(2)  Does POSIX really neglect to specify the bits?  Certainly as of
the trial-use 1003.1 standard the bits were specified.  In any case,
all UNIX systems must continue to act as Chris described, regardless of
whether POSIX requires additional facilities for this.

gwc@root.co.uk (Geoff Clare) (02/20/91)

In <15239@smoke.brl.mil> gwyn@smoke.brl.mil (Doug Gwyn) writes:

>In article <2608@root44.co.uk> gwc@root.co.uk (Geoff Clare) writes:
>>POSIX does not specify the precise encoding of information in the status
>>returned by wait(), system(), etc., so portable programs should not
>>rely on the traditional encoding Chris describes above.  Instead macros
>>are provided in <sys/wait.h> to extract the relevant data from the status:

>(1)  PORTABLE programs MUST follow Chris's recommendation; not all
>existing UNIX environments provide the macros to which you alluded.
>PORTABLE != POSIX

I think Doug has misunderstood my meaning.  Dan Bernstein gave a similar
reaction in comp.unix.programmer.  Perhaps I wasn't very clear, or maybe
I used some English expression which doesn't mean quite the same to
Americans.  Anyway, what I was trying to say is that because the wait
status encoding is not specified by POSIX, it may not be the same on 
future POSIX systems, so programs should not rely on it.  For maximum
portability programs should use the POSIX macros *if they are defined*.
If they are not defined, programs which are required to be portable
to non-POSIX systems should of course revert to the traditional encoding.
See my reply to Dan's follow-up in comp.unix.programmer for more details.
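
The "use the macros if they are defined, otherwise fall back" approach could
look roughly like the sketch below.  The fallback definitions follow the
traditional encoding described earlier in the thread and are an assumption
about pre-POSIX systems, not something any standard guarantees; the sketch
also assumes that <sys/wait.h> exists at all.  The function name
command_exit_code is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Fall back to the traditional encoding only where the macros are missing. */
#ifndef WIFEXITED
#define WIFEXITED(s)    (((s) & 0177) == 0)
#endif
#ifndef WEXITSTATUS
#define WEXITSTATUS(s)  (((s) >> 8) & 0377)
#endif
#ifndef WIFSIGNALED
#define WIFSIGNALED(s)  (((s) & 0177) != 0 && ((s) & 0177) != 0177)
#endif
#ifndef WTERMSIG
#define WTERMSIG(s)     ((s) & 0177)
#endif

/*
 * Run a shell command; return its exit code, or -1 if system() failed
 * or the command did not exit normally.
 */
static int command_exit_code(const char *cmd)
{
    int status;

    assert(cmd != NULL);
    status = system(cmd);
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```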

>(2)  Does POSIX really neglect to specify the bits?

Yes.

By the way, Doug, why did you ignore the "Followup-To:" in my article?
Are you using a broken newsreader?

Follow-ups to this article are directed to comp.unix.programmer (again :-)
-- 
Geoff Clare <gwc@root.co.uk>  (Dumb American mailers: ...!uunet!root.co.uk!gwc)
UniSoft Limited, London, England.   Tel: +44 71 729 3773   Fax: +44 71 729 3273