[comp.windows.x] "fatal IO error 32" problems in Sun Server

stevel@rtech.rtech.com (Steve Langley) (11/01/89)

We have a problem with the default Sun server that I hope someone at MIT
can comment on.

From time to time some X applications we are building die with the
following error message:

XIO:  fatal IO error 32 (Broken pipe) on X server "unix:0.0"
      after 8499 requests (8181 known processed) with 44 events remaining.
      The connection was probably broken by a server shutdown or KillClient.

The only thing the failures have in common is they seem to happen when
the event queue on the server side is filling up with unprocessed events.

For example, after a button is pressed, a callback routine might go into
a frenzy of activity by creating, destroying, moving, resizing, mapping,
and unmapping widgets. This generates a large number of requests to the
server and events for the client to read. But since the callback is doing
this without ever returning to XtMainLoop, the events just queue up until
we return from the callback. So far no problem.

But every once in a while the above error occurs. I was able to track it
down to the server/os/4.2bsd/io.c routine. (We are running SunOS 4.0 on
Sun 3/60s, X11R3 with (I think) fixes 1 through 8 installed.) As far as I
can tell, Dispatch calls FlushAllOutput, which calls

	FlushClient(client,oc, (char*)NULL,0);

FlushClient calls the writev routine, and most of the time everything works.
But sometimes there is no I/O to be written, and so iovCnt==0. The
writev routine doesn't like this, and fails with an error (EINVAL) because
of invalid arguments. Because errno is not equal to either EWOULDBLOCK
or EBADF, FlushClient assumes the write has failed and the client has died,
leading it to close the connection. If 'notdef' had been defined in io.c
you would see the message:

	Closing connection xx because write failed

Ultimately this results in the client seeing the 'fatal IO error' above.
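
The decision described above can be sketched roughly as follows. This is
not the actual io.c code; the enum, the function name classify_flush, and
the choice to keep the connection open for both EWOULDBLOCK and EBADF are
assumptions made only to mirror the description in this post. The point it
illustrates is that EINVAL from an empty writev falls into the "close the
connection" branch even though the client is perfectly healthy.

```c
#include <errno.h>

/* Hypothetical outcome codes; these names do not come from the X server. */
enum flush_outcome {
    FLUSH_OK,        /* the write call succeeded */
    FLUSH_KEEP_OPEN, /* failed, but with an errno that is tolerated */
    FLUSH_CLOSE      /* failed with any other errno: client assumed dead */
};

/*
 * Classify the result of a write attempt the way the post describes:
 * only EWOULDBLOCK and EBADF avoid the "client has died" conclusion.
 * 'written' is the return value of the write call, 'err' the errno seen.
 */
enum flush_outcome classify_flush(long written, int err)
{
    if (written >= 0)
        return FLUSH_OK;
    if (err == EWOULDBLOCK || err == EBADF)
        return FLUSH_KEEP_OPEN;
    /* EINVAL from a zero-length writev lands here. */
    return FLUSH_CLOSE;
}
```

Under these assumptions, classify_flush(-1, EINVAL) yields FLUSH_CLOSE,
which matches the behaviour observed: the server drops a client that has
done nothing wrong.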

Now, is this a bug or a feature? Is there a known problem like this in the
Sun server and I just haven't picked up the fix? Or am I inadvertently
doing something in my application that makes this happen? Is it a Bad Thing
to generate a lot of events, and if so how many are a "lot"?

I put a line of code in FlushClient which just returns without doing any
I/O if iovCnt == 0; this seems to cure the problem. (I added an ErrorF
message to tell me when this happens; every now and then the message
appears and the client keeps on running, apparently with no problems.)
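
For illustration, the shape of that workaround is roughly the following.
This is a standalone sketch, not the patched FlushClient: the helper name
flush_iov and its signature are hypothetical, and fprintf to stderr stands
in for the server's ErrorF. It only shows the guard itself, which skips
writev entirely when there is nothing to send.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper (not the real FlushClient): write the given iovec
 * array to fd, but return early when the array is empty so that writev
 * is never called with a count of zero.
 * Returns the number of bytes written, 0 if there was nothing to write,
 * or -1 with errno set if writev itself fails.
 */
ssize_t flush_iov(int fd, const struct iovec *iov, int iov_cnt)
{
    assert(iov_cnt >= 0);

    if (iov_cnt == 0) {
        /* Stand-in for the ErrorF diagnostic mentioned in the post. */
        fprintf(stderr, "flush_iov: nothing to write on fd %d\n", fd);
        return 0;
    }
    return writev(fd, iov, iov_cnt);
}
```

Returning 0 rather than an error keeps the caller from treating an empty
flush as a dead client, which is the whole point of the workaround.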

So, any comments? If the answer is "fixed in R4" that's okay (since I seem
to have a workaround) but I'd appreciate some more information on what's
happening here.


+--------------------------------------------------------------------------+
| Steve Langley                        | Phone: (415)748-3658              |
| Relational Technology, Inc.          | Internet: stevel@ws58s.rtech.com  |
| P.O. Box 4008                        |                                   |
| 1080 Marina Village Parkway          |                                   |
| Alameda, California 94501            |                                   |
+--------------------------------------------------------------------------+

kucharsk@uts.amdahl.com (William Kucharski) (11/01/89)

This type of problem is seen quite often among a number of different servers.
When events pile up without being processed, broken pipe errors occur.

However, WHY this is occurring on your Sun server is a mystery to me.  We've
been running the sample Xsun server since we got the X11R3 tape with few
problems, and we haven't noticed this one.
-- 
===============================================================================
| ARPA:	kucharsk@uts.amdahl.com			    |	William Kucharski     |
| UUCP:	...!{ames,apple,sun,uunet}!amdahl!kucharsk  |	Amdahl Corporation    |
===============================================================================
| Saying: "It's a window system named 'X,' NOT a system named 'X Windows'"    |
===============================================================================
| Disclaimer:  "The opinions expressed above may not agree with mine at any   |
|              other moment in time, so they certainly can't be those of my   |
|              employer."						      |
===============================================================================

rws@EXPO.LCS.MIT.EDU (Bob Scheifler) (11/01/89)

    But sometimes there is no I/O to be written, and so iovCnt==0. The
    writev routine doesn't like this, and fails with an error (EINVAL) because
    of invalid arguments.

Yup, there was such a bug in R3, and it is fixed in R4.  Your workaround
should do the job.