[comp.windows.x] XIO errors again

jlo@elan.elan.com (Jeff Lo) (07/20/89)

This subject has appeared before, but I never heard any real definitive
answers or solutions to the problem.  The problem is that sometimes an
X client seems to fall behind the server, or a very large amount of data
is being sent between the client and the server, and the server appears
to send a KillClient, and consequently the client dies.  I have heard
some say that there is a bug in writev and it returns an incorrect
error code.  Others have said that it is caused by buggy unix domain
sockets (we've gotten the error when client and server were on the same
machine and when they were not).  In any case, it is causing us a lot
of grief, so I was wondering if anyone has found a fix, a good explanation,
or even a "Fixed in R4" comment.  Thanks!
-- 
Jeff Lo, Elan Computer Group, Inc.
jlo@elan.com, ..!{ames,uunet}!elan!jlo
888 Villa Street, Third Floor, Mountain View, CA 94041, 415-964-2200

rws@EXPO.LCS.MIT.EDU (07/20/89)

    I was wondering if anyone has found a fix, a good explanation,
    or even a "Fixed in R4" comment.

I suspect various bugs fall under "XIO error".  As I recall, one was the OS
writev returning an error code the server didn't expect, another may have
been buggy socket code, and there were also a couple of bugs relating to
zero-length reads and writes in the R3 server code.  I believe the R4 server
will not have these problems, but we're not in a position to get out patches
at this point; too much has changed since R3.

rbj@DSYS.NCSL.NIST.GOV (Root Boy Jim) (07/20/89)

? From: elan!jlo@AMES.ARC.NASA.GOV  (Jeff Lo)

? This subject has appeared before, but I never heard any real definitive
? answers or solutions to the problem.  The problem is that sometimes an
? X client seems to fall behind the server, or a very large amount of data
? is being sent between the client and the server, and the server appears
? to send a KillClient, and consequently the client dies.  I have heard
? some say that there is a bug in writev and it returns an incorrect
? error code.  Others have said that it is caused by buggy unix domain
? sockets (we've gotten the error when client and server were on the same
? machine and when they were not).  In any case, it is causing us a lot
? of grief, so I was wondering if anyone has found a fix, a good explanation,
? or even a "Fixed in R4" comment.  Thanks!

You asked for it. I will repeat my posting. Gurus should save it and
repost it whenever this topic reappears.

? From: sumax!amc-gw!brian@beaver.cs.washington.edu  (Brian Crowley)

? I am trying to build and run the R3 distribution on a 3/60CG4 using
? gcc-1.35 and the PURDUE speedups.

? After successfully compiling the World, I try to execute the server
? with the command:

? 	xinit xterm -- -dev /dev/cgfour0

? The grey stipple pattern comes up along with the cursor, then after 3-4
? seconds, I get the error message:

? 	XIO:  fatal IO error 32 (Broken pipe) on X server "unix:0.0"
?               after 38 requests (28 known processed) with 0 events remaining.
?               The connection was probably broken by a server shutdown or
? 	      KillClient.

? and the whole thing dies (leaving the keyboard in a funny state that I can
? only fix with the kbd_mode -a command).

? My questions:

? 	1.  Can anybody tell me what this error message *really* means?

? 	2.  Has anybody else successfully compiled using gcc-1.35?
? 	    Should I be using gcc-1.34?

? 	3.  If I should be using gcc-1.34, anyone tell me where I can
? 	    snarf a copy (UUCP or ftp).

? 	4.  Observations?  Suggestions?

? Of course, thanks in advance.

**** WARNING **** WARNING **** WARNING **** WARNING **** WARNING ****
**** WARNING **** WARNING **** WARNING **** WARNING **** WARNING ****
**** WARNING **** WARNING **** WARNING **** WARNING **** WARNING ****

Do *NOT*, ever, never use `unix:0' as your display. Unless it works.

There are bugs in certain vendors' operating systems regarding them.
Use `localhost:0' or your explicit hostname instead.

I have been thru this before. I am running SunOS 3.5 on a 3/180,
with the 4.0 nameserver kit installed. It redefines netdb.h and
some routines in libc.a, notably gethostbyname.

I see the same behavior with gcc and regular cc.

However, I got a little further than you did.

Symptoms:

I do xinit with no args. An xterm comes up. I can type small amounts
of output with no problems. However, a `cat /etc/termcap' types a few
pages, and then the window dies. I can get about 4K. Sound familiar?
That is the pipe limit. Pipes are unix domain socketpairs. It makes no
difference whether I run uwm or not. I have applied patches 1-9.

I am not sure whether this is your problem, or which vendors and
operating system versions my notes apply to. Perhaps using the
new networking stuff is the problem; I once had unix:0 working.

The point is that unix domain sockets have always been buggy, and
there is no reason to expect them to work now. Avoid them.

Gurus, please take note. When someone poses you a stumper question
that has this particular XIO error in it, and you have no answer,
please utter this warning with the caveats that it is a last resort,
that it should work, but doesn't always in practice.

In fact, I was led to my conclusions by something RWS said about
the way certain OS's handle writev's in server/4.2bsd/io.c:FlushClient.
He mentioned that EINVAL could be returned under certain circumstances.
Perhaps the specific UNIX error should be reported as well. Perhaps
an attempt should be made to handle EINTR and retry the operation.
I don't see how EINTR could happen, but you never know.
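
The two suggestions above (retry on EINTR, and report the specific UNIX
error) can be sketched as follows. This is only an illustration, not the
code in server/4.2bsd/io.c: the name writev_retry is hypothetical, and it
assumes a system with the usual writev() from <sys/uio.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper, not part of the X sources: call writev(), retry
 * when the call is interrupted by a signal, and report the specific
 * UNIX error on any other failure. */
static ssize_t writev_retry(int fd, const struct iovec *iov, int iovcnt)
{
    ssize_t n;

    assert(iov != NULL);
    do {
        n = writev(fd, iov, iovcnt);
    } while (n == -1 && errno == EINTR);

    if (n == -1) {
        int saved = errno;      /* fprintf may overwrite errno */

        fprintf(stderr, "writev on fd %d failed: %s (errno %d)\n",
                fd, strerror(saved), saved);
        errno = saved;          /* leave errno intact for the caller */
    }
    return n;
}
```

A caller would still have to cope with short writes; the sketch only
covers the interrupted call and the error report.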

We now return you to your regularly scheduled program.

**** WARNING **** WARNING **** WARNING **** WARNING **** WARNING ****
**** WARNING **** WARNING **** WARNING **** WARNING **** WARNING ****
**** WARNING **** WARNING **** WARNING **** WARNING **** WARNING ****

? Jeff Lo, Elan Computer Group, Inc.
? jlo@elan.com, ..!{ames,uunet}!elan!jlo
? 888 Villa Street, Third Floor, Mountain View, CA 94041, 415-964-2200

	Root Boy Jim
	Have GNU, Will Travel.

lmjm@doc.imperial.ac.UK (07/21/89)

I too was plagued by "XIO error" problems.  Here are the pair of
reports I sent out ages ago containing patches that seem to fix most
of these problems.  One fixes some weird case where writev is asked
to write 0 bytes; the other makes sure that when the server uses writev
to send to a client it gives writev manageable chunks.

Hope these are useful.
	Lee

-------------------------------------------------------------------------
			  X Window System Bug Report
			    xbugs@expo.lcs.mit.edu




VERSION:
    R3

CLIENT MACHINE and OPERATING SYSTEM:
    HLH Clipper Orion running 4.2 BSD

DISPLAY:
    HLH StarPoint

WINDOW MANAGER:
    awm

AREA:
    Xlib (xterm)

SYNOPSIS:
    xterm using a Unix domain socket will quit unexpectedly when
    listing long files.

DESCRIPTION:
    Xlib in XlibInt.c in the routine _XSend() somehow ends up passing a
    0 as the third arg to WritevToServer().  This causes the writev() then
    to fail with an EINVAL error.  After detailed tracing of the code
    I have no idea why this occurs.  It only happens with Unix domain
    sockets - not with TCP sockets.

REPEAT BY:
    setenv DISPLAY unix:0
    xterm
      in the xterm window turn on jump scroll
      then do "cat /etc/termcap"

    after 3 pages or so xterm will quit with an error message of:
	xterm: invalid arg

SAMPLE FIX:
    Since it can never reach WritevToServer() without having
    something to write, and that would have to be in iov, I just ensure
    i is at least one.

*** XlibInt.c.orig	Thu Jan 12 23:19:54 1989
--- XlibInt.c	Fri Jan 13 22:39:40 1989
***************
*** 495,500 ****
--- 495,505 ----
  	    InsertIOV(pad, padlength[size & 3])
      
  	    errno = 0;
+ 
+ 	    /* Always using at least iov[ 0 ] */
+ 	    if (i == 0)
+ 		    i = 1;
+ 
  	    if ((len = WritevToServer(dpy->fd, iov, i)) >= 0) {
  		skip += len;
  		total -= len;
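
A different way to guard against the same failure is to refuse to call
writev() with an empty vector at all. The sketch below is not what the
patch above does (the patch forces the count to 1); the wrapper name
writev_nonempty is hypothetical, and it assumes that treating an empty
vector as a zero-byte write is acceptable to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical wrapper: never hand writev() an empty vector, since the
 * report above saw EINVAL come back in that case. */
static ssize_t writev_nonempty(int fd, const struct iovec *iov, int iovcnt)
{
    assert(iovcnt >= 0);
    if (iovcnt == 0)
        return 0;       /* nothing to send; report a zero-byte write */
    return writev(fd, iov, iovcnt);
}
```

A caller that loops until every byte has gone out would need to treat a
zero return with data still pending as an error rather than retry, or it
could spin forever.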

-------------------------------------------------------------------------

VERSION:
    R3

CLIENT MACHINE and OPERATING SYSTEM:
    HLH Clipper Orion running 4.2 BSD

DISPLAY:
    HLH StarPoint

WINDOW MANAGER:
    awm

AREA:
    X server 

SYNOPSIS:
    The server fails to write any message greater than 8K back to the client.

DESCRIPTION:
    Under 4.2 BSD the max size of a message you can send on a pipe is
    8K (at least on the few 4.2 BSD's I could find).  In
    server/os/4.2bsd/io.c FlushClient() allows blocks of any size to
    be written.

REPEAT BY:
    With all the core fonts available try:
	xlsfonts
    This fails with
Connection # 3 to server broken.
XIO: Broken pipe

SAMPLE FIX:
    I've made the code #ifdef'd hlh - the machine I wrote it for.  It should
    really be for any machine with a socket message size limit but I couldn't
    find any suitable #define and didn't feel up to adding one - it's been
    a long day.

*** io.c.old	Mon Jan 16 18:19:18 1989
--- io.c	Mon Jan 16 20:52:39 1989
***************
*** 314,353 ****
      int connection = oc->fd,
      	total, n, i, notWritten, written,
  	iovCnt = 0;
      struct iovec iov[3];
      char padBuffer[3];
  
      total = 0;
      if (oc->count)
      {
! 	total += iov[iovCnt].iov_len = oc->count;
! 	iov[iovCnt++].iov_base = (caddr_t)oc->buf;
          /* Notice that padding isn't needed for oc->buf since
             it is alreay padded by WriteToClient */
      }
      if (extraCount)
      {
! 	total += iov[iovCnt].iov_len = extraCount;
! 	iov[iovCnt++].iov_base = extraBuf;
  	if (extraCount & 3)
  	{
! 	    total += iov[iovCnt].iov_len = padlength[extraCount & 3];
! 	    iov[iovCnt++].iov_base = padBuffer;
  	}
      }
  
      notWritten = total;
      while ((n = writev (connection, iov, iovCnt)) != notWritten)
      {
  #ifdef hpux
  	if (n == -1 && errno == EMSGSIZE)
  	    n = swWritev (connection, iov, 2);
  #endif
          if (n > 0) 
          {
  	    notWritten -= n;
  	    for (i = 0; i < iovCnt; i++)
              {
  		if (n > iov[i].iov_len)
  		{
  		    n -= iov[i].iov_len;
--- 314,406 ----
      int connection = oc->fd,
      	total, n, i, notWritten, written,
  	iovCnt = 0;
+ #define _AddToIov( bytes, len ) \
+     total += iov[iovCnt].iov_len = (len); \
+     iov[iovCnt++].iov_base = (caddr_t)(bytes);
+ #ifndef hlh
      struct iovec iov[3];
+ #define AddToIov(bytes, len) _AddToIov(bytes, len)
+ #else
+     int iovs;
+     struct iovec iov[100]; /* Enough to avoid the need for dynamic allocation */
+ #define MAX_MSG 8192 /* Max size of a single iov to writev */
+ #define AddToIov( bytes, len ) \
+     { \
+ 	    char *buf = bytes; \
+ 	    int towrite = len; \
+ 	    while( towrite > MAX_MSG ){ \
+ 		    _AddToIov( buf, MAX_MSG ); \
+ 		    towrite -= MAX_MSG; \
+ 		    buf += MAX_MSG; \
+ 	    } \
+ 	    _AddToIov( buf, towrite ); \
+     }
+ #endif
      char padBuffer[3];
  
      total = 0;
      if (oc->count)
      {
! 	AddToIov( oc->buf, oc->count );
          /* Notice that padding isn't needed for oc->buf since
             it is alreay padded by WriteToClient */
      }
      if (extraCount)
      {
! 	AddToIov( extraBuf, extraCount );
  	if (extraCount & 3)
  	{
! 	    AddToIov( padBuffer, padlength[extraCount & 3] );
  	}
      }
  
      notWritten = total;
+ #ifndef hlh
      while ((n = writev (connection, iov, iovCnt)) != notWritten)
+ #else
+     iovs = iovCnt;
+     while ((n = writev (connection, iov, iovs)) != notWritten)
+ #endif
      {
  #ifdef hpux
  	if (n == -1 && errno == EMSGSIZE)
  	    n = swWritev (connection, iov, 2);
  #endif
+ #ifdef hlh
+ 	if (n == -1 && errno == EMSGSIZE){
+ 		/* Too large a lump to write.
+ 		 * try with a fewer iov's.
+ 		 */
+ 		int siz = 0;
+ 		struct iovec *ip;
+ 
+ 		/* How many iov's are less than the max? */
+ 		iovs = 0;
+ 		ip = &iov[0];
+ 		while( (siz + ip->iov_len) <= MAX_MSG ){
+ 			siz += ip->iov_len;
+ 			ip++;
+ 			iovs++;
+ 		}
+ 		continue;
+ 	}
+ 	else {
+ 		/* Once its succeeded then try to write the rest - the
+ 		 * code in the if statement below should prevent iov's from
+ 		 * being resent */
+ 		iovs = iovCnt;
+ 	}
+ #endif
          if (n > 0) 
          {
  	    notWritten -= n;
  	    for (i = 0; i < iovCnt; i++)
              {
+ #ifdef hlh
+ 		/* ignore buffers that have been written already */
+ 		if (iov[i].iov_len == 0 )
+ 			continue;
+ #endif
  		if (n > iov[i].iov_len)
  		{
  		    n -= iov[i].iov_len;
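
The chunking idea behind the AddToIov macro in the patch above can also
be written as an ordinary function. This is a sketch only: the name
split_into_iov and its capacity check are not in the patch, which uses a
fixed 100-entry array instead. MAX_MSG is the 8192-byte limit taken from
the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define MAX_MSG 8192    /* per-entry limit, taken from the patch above */

/* Hypothetical helper: describe buf[0..len) as iovec entries of at most
 * MAX_MSG bytes each.  Returns the number of entries used, or -1 if the
 * caller's array (capacity max_iov) is too small. */
static int split_into_iov(char *buf, size_t len, struct iovec *iov, int max_iov)
{
    int count = 0;

    assert(iov != NULL && max_iov >= 0);
    while (len > 0) {
        size_t chunk = len > MAX_MSG ? MAX_MSG : len;

        if (count == max_iov)
            return -1;  /* would overflow the caller's array */
        iov[count].iov_base = buf;
        iov[count].iov_len = chunk;
        count++;
        buf += chunk;
        len -= chunk;
    }
    return count;
}
```

For a 20000-byte buffer this yields three entries of 8192, 8192 and 3616
bytes. Limiting each entry does not by itself limit the total handed to
one writev() call, which is why the patch also retries with fewer
entries on EMSGSIZE.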

-------------------------------------------------------------------------

--
Lee McLoughlin		phone: 01 589 5111 X 5028  	fax: 01 581 8024
Department of Computing,Imperial College,180 Queens Gate,London SW7 2BZ, UK
Janet: lmjm@uk.ac.ic.doc	Uucp:  lmjm@icdoc.UUCP (or ..!ukc!icdoc!lmjm)
DARPA: lmjm@doc.ic.ac.uk (or lmjm%uk.ac.ic.doc@nsfnet-relay.ac.uk)

stevep@stellar.stellar.COM (Steve Pitschke) (07/24/89)

>> This subject has appeared before, but I never heard any real definitive
>> answers or solutions to the problem.  The problem is that sometimes an
>> X client seems to fall behind the server, or a very large amount of data
>> is being sent between the client and the server, and the server appears
>> to send a KillClient, and consequently the client dies.  I have heard
>> some say that there is a bug in writev and it returns an incorrect
>> error code.  Others have said that it is caused by buggy unix domain
>> sockets (we've gotten the error when client and server were on the same
>> machine and when they were not).  In any case, it is causing us a lot
>> of grief, so I was wondering if anyone has found a fix, a good explanation,
>> or even a "Fixed in R4" comment.  Thanks!
>> -- 
>> Jeff Lo, Elan Computer Group, Inc.
>> jlo@elan.com, ..!{ames,uunet}!elan!jlo
>> 888 Villa Street, Third Floor, Mountain View, CA 94041, 415-964-2200

I spent a fair amount of time tracking down cases of this for our implementation
and thus have some info for you.  The general rule for the sample implementation
server socket calls (in libos) is to perform the system call and, if it returns
an error, to silently do a close() on the socket, leaving the user
in the dark.

(What we do here is to send any error messages out thru the sys log daemon :=)

Two things that can cause the error, which we have actually observed are:

	1) Under heavy load the system (if it is a Unix (tm) derivative) returns
	   either ENOBUFS or ENOMEM when the X server tries to write into the socket.

	2) During the X connection handshake, the server saves the time at
	   which the connection handshake started, and if the handshake does
	   not complete before a time out period (default 60 sec.), again
	   silently close()s the connection.

The two cases can be differentiated via the XIO message.  In the latter case,
0 requests will have been processed.  (As a heuristic, using time out values
in non-real time O.S.'s often works, but can infrequently fail. :=)

I believe the thing which needs to be done is to have the server implementor
write meaningful error messages to a message log when either of these cases
occurs.  You may then be able to reconfigure your O.S. or your use of X to avoid
the situations of heavy load which cause the underlying problem.  Having an error
message is a necessary precursor, in order to recognize what the problem was.
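
A minimal sketch of that kind of logging, assuming the standard syslog()
interface is available. The helper name log_and_close and the message
wording are hypothetical; this is not the code from any particular
server.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <syslog.h>
#include <unistd.h>

/* Hypothetical helper: record why a client connection is being dropped,
 * then close it.  "what" names the failing operation and err is the
 * errno value saved right after the failure. */
static int log_and_close(int fd, const char *what, int err)
{
    assert(what != NULL);
    if (err == ENOBUFS || err == ENOMEM)
        syslog(LOG_ERR, "fd %d: %s failed under load: %s",
               fd, what, strerror(err));
    else
        syslog(LOG_ERR, "fd %d: %s failed: %s", fd, what, strerror(err));
    return close(fd);
}
```

With something like this in place of a bare close(), the ENOBUFS and
ENOMEM cases would at least leave a trace in the system log.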