[comp.unix.wizards] bug in pclose

jmm@ecijmm.UUCP (12/20/88)

I just tripped over a bug in the pclose library routine.  If you do a
pclose on a file descriptor (that was opened with popen of course) in
some circumstances it will hang forever.  For example:

    pin = popen( "comm1", "r" );
    pout1 = popen( "comm2", "w" );
    pout2 = popen( "comm3", "w" );

    while( <read pin until eof> )
	<write portions to pout1 and pout2>

    pclose( pout1 );
    pclose( pin );		/* this one hangs */
    pclose( pout2 );

If the pclose( pin ) is switched with either of the other two
pcloses, then there is no hang - if it is first, then all works
perfectly, if it is last, then it fails with the error "no child".

It is clear what is happening - the pclose( pout1 ) closes its
file descriptor and then does a wait to get the status of the
pipe child; but it first gets (and discards) the status of the
pin pipe child.  When the pclose( pin ) is done, it closes the
file descriptor, and then it tries to wait for the child.  But
the child's status has already been discarded, and a different
child still exists, so the system can't return an error, so the
wait just hangs - since the pout2 child won't terminate until it
get eof, and that won't happen until this process closes the file
descriptor.

There are many other variations possible.  (I first encountered
this in a perl program with only two pipes, but there were system(3)
calls done in the body of the loop, which presumably were doing
their own wait.)  I ended up with the above version when I tried
to reduce the problem to determine whether it was Perl's fault.

I have isolated this behaviour on two different systems - Interactive
Systems System 5/386 (I should know the release number, ask me tomorrow)
and Xenix System III for 68000.

(Finally the question...)

Is this behaviour common to all System 5 variations?  To BSD
derivatives?  SunOS?  AIX?  Your favourite here?

Is there even a good general solution?  I can see only one good way
to handle all of the variations of some routine wanting to wait for
a specific child and getting the termination info for a different child
instead (which will eventually be waited for - perhaps by a totally
different routine).  That would be to provide some new library routines:
waitfor( child, &status ) and postwait( child, status ).  Waitfor would
wait for a specified child (and save information internally on any other
children that terminate in the meantime).  Postwait would allow a routine
that had done a wait call and gotten the termination for a child that
it didn't know to pass that info into the mechanism for saving used by
waitfor.  These routines could be used internally by pclose, system, and
any other library routines that have waiting for a specific child as a
part of their semantics, as well as being provided to the user as a new
pair of library routines for building additional capabilities that include
forked children as a part of their implementation.
-- 
--
John Macdonald

mlandau@bbn.com (Matt Landau) (12/21/88)

[Earlier article describes some race conditions that can lead to 
 program hanging if they popen() multiple things and then close them 
 in a certain order. . .]

I've found the same sort of behavior in BSD-derived systems, with both
popen and system.  The basic problem is that in both cases, the library
routines will wait for the process they're interested in to terminate,
but will catch and discard termination information for other processes
silently.

The only completely reliable solution for our application turned out
to be the rule "don't use popen/pclose or system for general process
control functions."  Instead, we wrote a set of library routines to 
manage multiple child processes cleanly - they know how to start, send
input to, get input from, and kill any of the set of running children
without affecting the others.

chris@mimsy.UUCP (Chris Torek) (12/21/88)

In article <261@ecijmm.UUCP> jmm@ecijmm.UUCP writes:
>in some circumstances [pclose] will hang forever. ...
>Is this behaviour common to all System 5 variations?  To BSD
>derivatives?  SunOS?  AIX?  Your favourite here?

It is partially fixed in 4.3BSD (and, one hopes, derivatives), where
pclose() will return -1 if the process has already been reaped.

>Is there even a good general solution?

4.4BSD will have a `wait4' or `waitfor' syscall, to wait for a
particular process (or for any process, if the given pid is 0).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

malcolm@otc.oz (Malcolm Purvis ) (12/22/88)

In article <261@ecijmm.UUCP> jmm@ecijmm.UUCP (John Macdonald) writes:

	[Stuff about pclose(3) hanging when there is more than one child...]

>Is this behaviour common to all System 5 variations?  To BSD
>derivatives?  SunOS?  AIX?  Your favourite here?

	It's all of them as for as I know. See below.
>
>Is there even a good general solution?  I can see only one good way
>to handle all of the variations of some routine wanting to wait for
>a specific child and getting the termination info for a different child
>instead (which will eventually be waited for - perhaps by a totally
>different routine).  That would be to provide some new library routines:
>waitfor( child, &status ) and postwait( child, status ).  Waitfor would
>wait for a specified child (and save information internally on any other
>children that terminate in the meantime).  Postwait would allow a routine
>that had done a wait call and gotten the termination for a child that
>it didn't know to pass that info into the mechanism for saving used by
>waitfor.  These routines could be used internally by pclose, system, and
>any other library routines that have waiting for a specific child as a
>part of their semantics, as well as being provided to the user as a new
>pair of library routines for building additional capabilities that include
>forked children as a part of their implementation.
>-- 
>--
>John Macdonald

	I had to solve this problem recently for a C++ subprocess class and
as John said it stems from the inability to wait for a specific child to
die; you can only wait for the next one.  The solution I came up with under
BSD4.3 was to put a structure associated with each child in a linked list
and use a SIGCHLD handler to do the actual waiting.  When the signal
arrives, the wait(2) is done in the handler and the list is searched for the
pid of the dead child and, when it is found, it is marked as dead and taken
off the list.  The inside of the waitfor() call would then look something
like:

		while (!child.dead)
			sigpause(0); /* wait for this process to die. */

		*status = child.return_value;

	This works wonderfully in C++ because as the child structure goes
out of scope the child process is automatically waited for, so stopping
children not having a data structure associated with it, and also each caller
gets the right return value. I don't know, however, how you could do this in
C.  The only problem with all this is that is falls over if the child dies
before it gets inserted into the list (eg: The exec() fails because the
program isn't there) so care must be taken over the order of list insertion
and forking.

	I hope this helps.

			    Malcolm Purvis (vacation student)
			    |||| OTC ||

			ACSnet:  malcolm@otc.oz
			  UUCP:  {uunet,mcvax}!otc.oz!malcolm
			 Snail:  GPO Box 7000, Sydney 2001, Australia

allbery@ncoast.UUCP (Brandon S. Allbery) (12/27/88)

As quoted from <261@ecijmm.UUCP> by jmm@ecijmm.UUCP:
+---------------
| I just tripped over a bug in the pclose library routine.  If you do a
| pclose on a file descriptor (that was opened with popen of course) in
| some circumstances it will hang forever.  For example:
+---------------

From the manual page for popen(3) under System III:

	BUGS
		Only one stream opened by popen can be in use at once.

This is less a bug than a misfeature; popen() is intended for casual use, not
for complex tricks.  If BSD Unix has a way to wait for a specific process ID,
that could be used by pclose() (but I don't remember wait3() supporting
that); the alternative is to have popen()/pclose() store the PID in a linked
list, but then system() must also use that linked list -- and either new
wrappers for exec?() must be made available or the use of system()/popen()
and exec?() in the same program must be firmly discouraged.  (And then
someone will do it without reading the manual and complain about the "bug".)

Moral:  if you're going to play with multiple concurrent -- or even
potentially concurrent -- subprocesses, do all the work manually.  It takes
more work to write the program, but the result is far more robust in a multi-
process situation.

++Brandon
-- 
Brandon S. Allbery, comp.sources.misc moderator and one admin of ncoast PA UN*X
uunet!hal.cwru.edu!ncoast!allbery		    ncoast!allbery@hal.cwru.edu
comp.sources.misc is moving off ncoast -- please do NOT send submissions direct
      Send comp.sources.misc submissions to comp-sources-misc@<backbone>.

decot@hpisod2.HP.COM (Dave Decot) (01/04/89)

The IEEE Standard 1003.1 (POSIX) requires conforming implementations to have
a new function called waitpid() that does effectively what the previous
respondant described as 4.4BSD's waitfor() or wait4() function.  I hope BSD
changes the name before it's too late.

Dave

rml@hpfcdc.HP.COM (Bob Lenk) (01/05/89)

> The IEEE Standard 1003.1 (POSIX) requires conforming implementations to have
> a new function called waitpid() that does effectively what the previous
> respondant described as 4.4BSD's waitfor() or wait4() function.  I hope BSD
> changes the name before it's too late.

Functionally (not necessarily syntactically):

wait4 == the union of wait3 and waitpid
      == wait3 + pid
      == waitpid + rusage

I'm confidant that waitpid() will be a library using wait4 in 4.4BSD.

		Bob Lenk
		hplabs!hpfcla!rml
		rml%hpfcla@hplabs.hp.com

guy@auspex.UUCP (Guy Harris) (01/05/89)

 >The IEEE Standard 1003.1 (POSIX) requires conforming implementations to have
 >a new function called waitpid() that does effectively what the previous
 >respondant described as 4.4BSD's waitfor() or wait4() function.  I hope BSD
 >changes the name before it's too late.

Since "waitfor()"/"wait4()" does things that "waitpid()" doesn't
(namely, optionally fill in a "struct rusage"), I see no reason why the
people at Berkeley would want to rename "wait{4|for}" to "waitpid". 
They may want to provide "waitpid" as a procedure that calls
"wait{4|for}", but that's a different matter. 

egisin@mks.UUCP (Eric Gisin) (01/07/89)

In article <803@auspex.UUCP>, guy@auspex.UUCP (Guy Harris) writes:
> Since "waitfor()"/"wait4()" does things that "waitpid()" doesn't
> (namely, optionally fill in a "struct rusage"), I see no reason why the
> people at Berkeley would want to rename "wait{4|for}" to "waitpid". 
> They may want to provide "waitpid" as a procedure that calls
> "wait{4|for}", but that's a different matter. 

Declare waitpid as
	pid_t waitpid (pid_t pid, int *statp, int options, ...);
	#define	WRUSAGE	0x10
The option WRUSAGE can indicate that a fourth parameter
pointing to a struct rusage exists.

There's no need for half-a-dozen wait-like functions in 4.4 BSD.

gwyn@smoke.BRL.MIL (Doug Gwyn ) (01/08/89)

In article <633@mks.UUCP> egisin@mks.UUCP (Eric Gisin) writes:
>Declare waitpid as
>	pid_t waitpid (pid_t pid, int *statp, int options, ...);

I don't have IEEE Std 1003.1 at hand at the moment,
but I'm fairly sure this declaration is incompatible with
the one in the Standard.  If you want a different interface,
use a new name.  Don't people EVER learn from experience?

guy@auspex.UUCP (Guy Harris) (01/10/89)

>Declare waitpid as
>	pid_t waitpid (pid_t pid, int *statp, int options, ...);

It's already been defined (by POSIX), and probably not as a varargs
function.  Yeah, maybe it'll work on most of the machines you and I can
think of (just as calling "open" with 2 arguments will), but why take
chances?

ado@elsie.UUCP (Arthur David Olson) (01/17/89)

> I don't have IEEE Std 1003.1 at hand at the moment,
> but I'm fairly sure this declaration is incompatible with
> the one in the Standard.  If you want a different interface,
> use a new name.  Don't people EVER learn from experience?

No they don't, as a look at the changes made to function library interfaces
by the dpANS shows.
-- 
	Arthur David Olson    ado@ncifcrf.gov    ADO is a trademark of Ampex.

gwyn@smoke.BRL.MIL (Doug Gwyn ) (01/18/89)

In article <8608@elsie.UUCP> ado@elsie.UUCP (Arthur David Olson) writes:
>>Don't people EVER learn from experience?
>No they don't, as a look at the changes made to function library interfaces
>by the dpANS shows.

What changes?  Having just checked over my System V "lint" library for
pANS conformance, I didn't notice any actual changes.

ado@elsie.UUCP (Arthur David Olson) (01/19/89)

> > >Don't people EVER learn from experience?
> >No they don't, as a look at the changes made to function library interfaces
> >by the dpANS shows.
> 
> What changes?  Having just checked over my System V "lint" library for
> pANS conformance, I didn't notice any actual changes.

Check out the memory allocation functions (with their changed treatment of
NULL arguments) and fopen (where "rb+" is given new meaning) for starters--the
ground is covered in X3J11 correspondence that should be part of its records.
-- 
	Arthur David Olson    ado@ncifcrf.gov    ADO is a trademark of Ampex.

gwyn@smoke.BRL.MIL (Doug Gwyn ) (01/19/89)

In article <9003@elsie.UUCP> ado@elsie.UUCP (Arthur David Olson) writes:
>> > >Don't people EVER learn from experience?
>> >No they don't, as a look at the changes made to function library interfaces
>> >by the dpANS shows.
>> What changes?  Having just checked over my System V "lint" library for
>> pANS conformance, I didn't notice any actual changes.
>Check out the memory allocation functions (with their changed treatment of
>NULL arguments) and fopen (where "rb+" is given new meaning) for starters

The only changes regarding memory allocation and NULL arguments are that
in an ANSI C Standard conforming hosted implementation, realloc() acts
like malloc() when fed a null pointer, and free() of a null pointer is
benign.  Both those are upward compatible with previously existing
practice; correct code using these functions will not be broken by the
extensions.  Similarly for the fopen() example.  "rb+" simply HAD no
correct meaning in old UNIX implementations; using it there would have
been a (possibly undetected) bug.  Again this is an extension.  UNIX
(or POSIX) implementations need not use "rb+" at all; "r+" is equivalent
in those environments; thus no existing correct code has been invalidated
by this (necessary!) extension.

The suggestion that prompted this discussion was INCOMPATIBLE with the
already-established standard usage for the function.  Therefore a new
name should have been used, not the standardized name.

ado@elsie.UUCP (Arthur David Olson) (01/22/89)

Thanks to Doug Gwyn for confirming that there are changes to the C language
function interface in X3J11's work, and for clarifying that it it's only
non-upwardly-compatible (rather than incompatible) changes that are seen as
a problem.
-- 
	Arthur David Olson    ado@ncifcrf.gov    ADO is a trademark of Ampex.