[comp.sys.hp] hp-ux 7.0/800 select

MAH@awiwuw11.wu-wien.ac.at (Michael Haberler) (08/30/90)

I have encountered a strange behaviour of several programs which use
select(2) on hp-ux 7.0 on the Series 800. All of these programs are
'ported' BSD code, so I have the suspicion there's something in common:

It seems that programs which have select(2) in their inner loop sometimes
start using enormous amounts of system CPU time, just as if the select()
call were returning immediately, as though it were polling. Among those programs
are Xemacs 18.55, Greg Minshall's tn3270, and named 4.8.3.

Xemacs tends to do this especially if the X server terminates before emacs.
I didn't find an explanation for named's behaviour. With tn3270, it looks like
a modem disconnect, and thus EOF on the tty, causes tn3270 to loop.

tn3270 seems to spend its time in select() itself, while named returns to
user mode and immediately calls select again. One can see this by attaching
the debugger to the process in question (xdb -P <pid> <program>).

Since several programs show this behaviour, I suspect it has to do with the
way select() is implemented in hp-ux 7.0.


Has anybody else encountered this behaviour? Is this a bug? If so, is there
a workaround?

- michael

staffan@isy.liu.se (Staffan Bergstrom) (09/03/90)

MAH@awiwuw11.wu-wien.ac.at (Michael Haberler) writes:

>I have encountered a strange behaviour of several programs which use
>select(2) on hp-ux 7.0 on the Series 800. All of these programs are
>'ported' BSD code, so I have the suspicion there's something in common:

>It seems that programs which have select(2) in their inner loop sometimes
>start using enormous amounts of system CPU time, just as if the select()
>call were returning immediately, as though it were polling. Among those programs
>are Xemacs 18.55, Greg Minshall's tn3270, and named 4.8.3.

-
-
-

>Has anybody else encountered this behaviour? Is this a bug? If so, is there
>a workaround?

>- michael

I have had similar problems, but it turned out that the macros
FD_SET, FD_CLR, etc. were the cause of the problem.
FD_SET is defined as follows:
#define FD_SET(n, p)    ((p)->fds_bits[(n)/NFDBITS] |= (1 << ((n) % NFDBITS)))
One has to be careful when using them on closed files; otherwise
it can cause an attempt to do a negative shift.
I had a program that worked fine on sun3, sun4, and hp300 (hp-ux 7.0),
but did not work at all when I tried to port it to the hp 800, because of this.

/Staffan

cph@zurich.ai.mit.edu (Chris Hanson) (09/06/90)

   From: MAH@awiwuw11.wu-wien.ac.at (Michael Haberler)
   Date: 30 Aug 90 15:16:05 GMT

   I have encountered a strange behaviour of several programs which use
   select(2) on hp-ux 7.0 on the Series 800. All of these programs are
   'ported' BSD code, so I have the suspicion there's something in common:

   It seems that programs which have select(2) in their inner loop sometimes
   start using enormous amounts of system CPU time, just as if the select()
   call were returning immediately, as though it were polling. Among those programs
   are Xemacs 18.55, Greg Minshall's tn3270, and named 4.8.3.

   Xemacs tends to do this especially if the X server terminates before emacs.
   I didn't find an explanation for named's behaviour. With tn3270, it looks like
   a modem disconnect, and thus EOF on the tty, causes tn3270 to loop.

I managed to get emacs into that state last night, and debugged it.
What happened was as follows.

I normally run several subprocesses under emacs.  At the time that the
problem occurred, there were two active subprocesses, and two exited
subprocesses.  Emacs still had all four subprocesses in its tables.
Emacs's command reader checks all of the subprocesses periodically for
input, using the `select' call on the input file descriptors of the
processes, and due to some peculiarities of its design, it was
checking all four of the subprocesses, even though two of them no
longer existed.

This `select' call was returning with a single bit set, which
indicated that the input file descriptor from one of the dead
subprocesses had some input that could be read.  Emacs then dutifully
went into a `read' call on that descriptor, which fortunately was set
to non-blocking mode, and the `read' call returned saying that of
course there was no data.

In summary: we have two processes and a pipe from one to the other.
The read side of the pipe has been set to non-blocking mode by the use
of O_NONBLOCK.  The process on the write side of the pipe finishes by
calling `exit'.  The process on the read side receives SIGCHLD and
uses `waitpid' to extract the exit status of the now-dead subprocess.
It then does a `select' on the read side of the pipe, which returns
indicating that the pipe has some data to be read.  The process calls
`read' on the pipe, which returns zero indicating no data is
available.  Etc.

Now I'm no expert, but it's my belief that `select' shouldn't indicate
that the pipe has input in this situation.

For information: this behavior has been observed (by others) when the
subprocess is using a PTY to communicate with emacs, although it has
not been debugged and thoroughly examined in such a case.

PS: Emacs is being changed so that it does not attempt to use `select'
on connections to dead processes.  Version 18.56 will not have this
problem.  If anyone is interested in a patch for 18.55, they should
contact me directly by e-mail.

eliot@dg-rtp.dg.com (Topher Eliot) (09/07/90)

|>    It seems that programs which have select(2) in their inner loop sometimes
|>    start using enormous amounts of system cpu time, just as if the select()
|>    call would return immediately as if it were polling. Among those programs
|>    are Xemacs 18.55, Greg Minshall's tn3270, and named4.8.3.
|> 
|> This `select' call was returning with a single bit set, which
|> indicated that the input file descriptor from one of the dead
|> subprocesses had some input that could be read.  Emacs then dutifully
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|> went into a `read' call on that descriptor, which fortunately was set
|> to non-blocking mode, and the `read' call returned saying that of
|> course there was no data.
|> 
|> In summary: we have two processes and a pipe from one to the other.
|> The read side of the pipe has been set to non-blocking mode by the use
|> of O_NONBLOCK.  The process on the write side of the pipe finishes by
      ^^^^^^^^^^
|> calling `exit'.  The process on the read side receives SIGCHLD and
|> uses `waitpid' to extract the exit status of the now-dead subprocess.
|> It then does a `select' on the read side of the pipe, which returns
|> indicating that the pipe has some data to be read.  The process calls
|> `read' on the pipe, which returns zero indicating no data is
|> available.  Etc.
|> 
|> Now I'm no expert, but it's my belief that `select' shouldn't indicate
|> that the pipe has input in this situation.

Well, in fact, it isn't.

I've bumped into this problem before, in a different context.  I can't
remember what any of the applicable documentation said, but the bottom line
was that the semantics of select are that it will return with a particular
bit set if a read on the corresponding file descriptor WILL NOT BLOCK.  It
is NOT saying that there is data to be read there.  In my opinion, in such
cases the correct way to handle this is that all reads should be prepared
to detect that the descriptor from which they are reading has been closed,
or reached end of file, or whatever, and handle that appropriately.
Apparently emacs does not do so in this case.

--
Topher Eliot
Data General Corporation                eliot@dg-rtp.dg.com
62 T. W. Alexander Drive                {backbone}!mcnc!rti!dg-rtp!eliot
Research Triangle Park, NC 27709        (919) 248-6371
Obviously, I speak for myself, not for DG.

cph@zurich.ai.mit.edu (Chris Hanson) (09/09/90)

   From: eliot@dg-rtp.dg.com (Topher Eliot)
   Date: 6 Sep 90 17:21:44 GMT

   |> This `select' call was returning with a single bit set, which
   |> indicated that the input file descriptor from one of the dead
   |> subprocesses had some input that could be read.  Emacs then dutifully
		   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |> went into a `read' call on that descriptor, which fortunately was set
   |> to non-blocking mode, and the `read' call returned saying that of
   |> course there was no data.
   |> 
   |> In summary: we have two processes and a pipe from one to the other.
   |> The read side of the pipe has been set to non-blocking mode by the use
   |> of O_NONBLOCK.  The process on the write side of the pipe finishes by
	 ^^^^^^^^^^
   |> calling `exit'.  The process on the read side receives SIGCHLD and
   |> uses `waitpid' to extract the exit status of the now-dead subprocess.
   |> It then does a `select' on the read side of the pipe, which returns
   |> indicating that the pipe has some data to be read.  The process calls
   |> `read' on the pipe, which returns zero indicating no data is
   |> available.  Etc.
   |> 
   |> Now I'm no expert, but it's my belief that `select' shouldn't indicate
   |> that the pipe has input in this situation.

   Well, in fact, it isn't.

   I've bumped into this problem before, in a different context.  I can't
   remember what any of the applicable documentation said, but the bottom line
   was that the semantics of select are that it will return with a particular
   bit set if a read on the corresponding file descriptor WILL NOT BLOCK.  It
   is NOT saying that there is data to be read there.  In my opinion, in such
   cases the correct way to handle this is that all reads should be prepared
   to detect that the descriptor from which they are reading has been closed,
   or reached end of file, or whatever, and handle that appropriately.
   Apparently emacs does not do so in this case.

Since I posted the original message, I've changed my mind.  The only
thing I now believe about `select' is that it should return a
"readable" bit when there is data to be read from that channel.  I
have no opinion about what it should do in any other case.

The real problem here is that the documentation for `select' doesn't
define what it does.  The documentation defines the results as "ready
for reading, writing, or has an exceptional condition", but fails to
say what that means.  For example, I spoke to an HP engineer recently
who had no idea what an "exceptional condition" is in this context.
And I have no idea what it means either -- despite the fact that I'm
quite knowledgeable about unix.

Please, HP documenters, rewrite this man page so that it is possible
for us to know what it means!  It's no excuse that every other unix
says the same thing.

An aside: if it were the case that "ready for reading" meant "a read
on this channel will not block", then `select' would always say
"readable" for every non-blocking channel.  But emacs did a `select'
on four channels, all non-blocking, and it indicated only one of them
was "readable".  So I don't believe this definition is correct.

To paraphrase what I said above, I dunno what to believe.

In any case, emacs is now fixed so that it doesn't care what `select'
says in this case.

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (09/09/90)

In article <CPH.90Sep9022935@kleph.ai.mit.edu> cph@zurich.ai.mit.edu (Chris Hanson) writes:
> The real problem here is that the documentation for `select' doesn't
> define what it does.  The documentation defines the results as "ready
> for reading, writing, or has an exceptional condition", but fails to
> say what that means.

That a read or write wouldn't block if the descriptor were blocking.
Exceptional conditions are defined by the device.

Since passing nonblocking descriptors to application programs is a
serious violation of convention, you shouldn't run into problems unless
you're creating them.

---Dan

chris@mimsy.umd.edu (Chris Torek) (09/09/90)

In article <CPH.90Sep9022935@kleph.ai.mit.edu> cph@zurich.ai.mit.edu
(Chris Hanson) writes:
>The real problem here is that the documentation for `select' doesn't
>define what it does. ...

Well, the most likely reason is that select does not *do* ANYthing,
except time out.  The `selecting' is all done in lower layers, much
like ioctl.  There is no way that ioctl(2) can list what ioctl()
does, because it really does not do *any*thing.

>"... reading, writing, or has an exceptional condition"

Of course, the lower layers try to do something sensible.  For `read',
select is supposed to return true whenever a read() system call would
not block, regardless of any `non-blocking' mode on the file
descriptor.  That is, it should return true when there are data, and it
should return true when there is a `boundary condition' like an `EOF'
on a tty or a socket.  For `write', it is supposed to return true when
the lower layer can accept (some) more data without blocking.
`Exceptions' are left entirely up to the lower layers, and you have to
look at those (or read their manuals, provided that someone bothered to
document them properly) to find out which descriptor entities (sockets,
ptys, ttys, disks, tapes, ...) actually do something, and what, with
`exceptions'.

>An aside: if it were the case that "ready for reading" meant "a read
>on this channel will not block", then `select' would always say
>"readable" for every non-blocking channel.  But emacs did a `select'
>on four channels, all non-blocking, and it indicated only one of them
>was "readable".  So I don't believe this definition is correct.

No, but it is close---closer than `there are data', at any rate.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

cph@zurich.ai.mit.edu (Chris Hanson) (09/10/90)

In article <26445@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:

   From: chris@mimsy.umd.edu (Chris Torek)
   Date: 9 Sep 90 12:53:02 GMT

   In article <CPH.90Sep9022935@kleph.ai.mit.edu> cph@zurich.ai.mit.edu
   (Chris Hanson) writes:
   >The real problem here is that the documentation for `select' doesn't
   >define what it does. ...

   Well, the most likely reason is that select does not *do* ANYthing,
   except time out.  The `selecting' is all done in lower layers, much
   like ioctl.  There is no way that ioctl(2) can list what ioctl()
   does, because it really does not do *any*thing.

   >"... reading, writing, or has an exceptional condition"

   Of course, the lower layers try to do something sensible.  For `read',
   select is supposed to return true whenever a read() system call would
   not block, regardless of any `non-blocking' mode on the file
   descriptor.  That is, it should return true when there are data, and it
   should return true when there is a `boundary condition' like an `EOF'
   on a tty or a socket.  For `write', it is supposed to return true when
   the lower layer can accept (some) more data without blocking.
   `Exceptions' are left entirely up to the lower layers, and you have to
   look at those (or read their manuals, provided that someone bothered to
   document them properly) to find out which descriptor entities (sockets,
   ptys, ttys, disks, tapes, ...) actually do something, and what, with
   `exceptions'.

OK, now I think I understand -- thanks.  I guess my complaint is that
the man page for `select' could have contained something of what you
said in these two paragraphs.

Here is a suggestion: why not define the meaning of "readable" and
"writable" in the `select' man page, since it seems that most devices
will satisfy this definition in the same way.  Also have a sentence
that says "exceptional conditions" are device-specific, because the
current man page doesn't say anything of the kind.  Then have specific
devices document (in section 7) how their "readable" and "writable"
differ from the standard (if at all), and what their "exceptional
conditions" are.  This would be a great improvement over the current
situation, because then it would be possible to understand how this
works.

eliot@chutney.rtp.dg.com (Topher Eliot) (09/11/90)

In article <CPH.90Sep9022935@kleph.ai.mit.edu>, cph@zurich.ai.mit.edu (Chris Hanson) writes:
|>    From: me
|> 
|>    I've bumped into this problem before, in a different context.  I can't
|>    remember what any of the applicable documentation said, but the bottom line
|>    was that the semantics of select are that it will return with a particular
|>    bit set if a read on the corresponding file descriptor WILL NOT BLOCK.  It
|>    is NOT saying that there is data to be read there.  In my opinion, in such
|>    cases the correct way to handle this is that all reads should be prepared
|>    to detect that the descriptor from which they are reading has been closed,
|>    or reached end of file, or whatever, and handle that appropriately.
|>    Apparently emacs does not do so in this case.
|> 
|> An aside: if it were the case that "ready for reading" meant "a read
|> on this channel will not block", then `select' would always say
|> "readable" for every non-blocking channel.  But emacs did a `select'
|> on four channels, all non-blocking, and it indicated only one of them
|> was "readable".  So I don't believe this definition is correct.

Well, this shows that your kernel is different from the one I dealt with,
because with ours, select did indeed say that the descriptor was "readable"
all the time.

-- 
Topher Eliot
Data General Corporation                eliot@dg-rtp.dg.com
62 T. W. Alexander Drive                {backbone}!mcnc!rti!dg-rtp!eliot
Research Triangle Park, NC 27709        (919) 248-6371
Obviously, I speak for myself, not for DG.