jlo@elan.elan.com (Jeff Lo) (07/20/89)
This subject has appeared before, but I never heard any real definitive answers or solutions to the problem. The problem is that sometimes an X client seems to fall behind the server, or a very large amount of data is being sent between the client and the server, and the server appears to send a KillClient, and consequently the client dies. I have heard some say that there is a bug in writev and it returns an incorrect error code. Others have said that it is caused by buggy unix domain sockets (we've gotten the error when client and server were on the same machine and when they were not).

In any case, it is causing us a lot of grief, so I was wondering if anyone has found a fix, a good explanation, or even a "Fixed in R4" comment. Thanks!

--
Jeff Lo, Elan Computer Group, Inc.
jlo@elan.com, ..!{ames,uunet}!elan!jlo
888 Villa Street, Third Floor, Mountain View, CA 94041, 415-964-2200
rws@EXPO.LCS.MIT.EDU (07/20/89)
> I was wondering if anyone has found a fix, a good explanation, or even
> a "Fixed in R4" comment.

I suspect various bugs fall under "XIO error". As I recall, one was the OS writev returning an error code the server didn't expect, and another may have been buggy socket code, but there were also a couple of bugs relating to zero-length reads and writes in the R3 server code. I believe the R4 server will not have these problems, but we're not in a position to get out patches at this point; too much has changed since R3.
rbj@DSYS.NCSL.NIST.GOV (Root Boy Jim) (07/20/89)
? From: elan!jlo@AMES.ARC.NASA.GOV (Jeff Lo)

? This subject has appeared before, but I never heard any real definitive
? answers or solutions to the problem. The problem is that sometimes an
? X client seems to fall behind the server, or a very large amount of data
? is being sent between the client and the server, and the server appears
? to send a KillClient, and consequently the client dies. I have heard
? some say that there is a bug in writev and it returns an incorrect
? error code. Others have said that it is caused by buggy unix domain
? sockets (we've gotten the error when client and server were on the same
? machine and when they were not). In any case, it is causing us a lot
? of grief, so I was wondering if anyone has found a fix, a good explanation,
? or even a "Fixed in R4" comment. Thanks!

You asked for it. I will repeat my posting. Gurus should save it and repost it whenever this topic reappears.

? From: sumax!amc-gw!brian@beaver.cs.washington.edu (Brian Crowley)

? I am trying to build and run the R3 distribution on a 3/60CG4 using
? gcc-1.35 and the PURDUE speedups.
? After successfully compiling the World, I try to execute the server
? with the command:
?	xinit xterm -- -dev /dev/cgfour0
? The grey stipple pattern comes up along with the cursor, then after 3-4
? seconds, I get the error message:
?	XIO: fatal IO error 32 (Broken pipe) on X server "unix:0.0"
?	after 38 requests (28 known processed) with 0 events remaining.
?	The connection was probably broken by a server shutdown or
?	KillClient.
? and the whole thing dies (leaving the keyboard in a funny state that I can
? only fix with the kbd_mode -a command).
? My questions:
?	1. Can anybody tell me what this error message *really* means?
?	2. Has anybody else successfully compiled using gcc-1.35?
?	   Should I be using gcc-1.34?
?	3. If I should be using gcc-1.34, anyone tell me where I can
?	   snarf a copy (UUCP or ftp).
?	4. Observations? Suggestions?
? Of course, thanks in advance.
**** WARNING **** WARNING **** WARNING **** WARNING **** WARNING ****
**** WARNING **** WARNING **** WARNING **** WARNING **** WARNING ****
**** WARNING **** WARNING **** WARNING **** WARNING **** WARNING ****

Do *NOT*, ever, never use `unix:0' as your display. Unless it works. There are bugs in certain vendors' operating systems regarding unix domain sockets. Use `localhost:0' or your explicit hostname instead.

I have been through this before. I am running SunOS 3.5 on a 3/180, with the 4.0 nameserver kit installed. It redefines netdb.h and some routines in libc.a, notably gethostbyname. I see the same behavior with gcc and regular cc.

However, I got a little further than you did. Symptoms: I do xinit with no args. An xterm comes up. I can type small amounts of output with no problems. However, a `cat /etc/termcap' types a few pages, and then the window dies. I can get about 4K. Sound familiar? That is the pipe limit. Pipes are unix domain socketpairs. It makes no difference whether I run uwm or not. I have applied patches 1-9.

I am not sure whether this is your problem, or which vendors and operating system versions my notes apply to. Perhaps using the new networking stuff is a problem; I once had unix:0 working. The point is that unix domain sockets have always been buggy, and there is no reason to expect them to work now. Avoid them.

Gurus, please take note. When someone poses you a stumper question that has this particular XIO error in it, and you have no answer, please utter this warning with the caveats that it is a last resort, and that unix:0 should work, but doesn't always in practice.

In fact, I was led to my conclusions by something RWS said about the way certain OS's handle writev's in server/os/4.2bsd/io.c:FlushClient. He mentioned that EINVAL could be returned under certain circumstances. Perhaps the specific UNIX error should be reported as well. Perhaps an attempt should be made to handle EINTR and retry the operation. I don't see how EINTR could happen, but you never know.
We now return you to your regularly scheduled program.

**** WARNING **** WARNING **** WARNING **** WARNING **** WARNING ****
**** WARNING **** WARNING **** WARNING **** WARNING **** WARNING ****
**** WARNING **** WARNING **** WARNING **** WARNING **** WARNING ****

? Jeff Lo, Elan Computer Group, Inc.
? jlo@elan.com, ..!{ames,uunet}!elan!jlo
? 888 Villa Street, Third Floor, Mountain View, CA 94041, 415-964-2200

	Root Boy Jim
	Have GNU, Will Travel.
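The two suggestions above (retry the operation on EINTR, and report the specific UNIX error) could look roughly like the following. This is only a sketch in present-day C, not the R3 server code: writev_retry is a made-up name, the message goes to stderr purely for illustration, and none of the real FlushClient logic is reproduced.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of any X release): call writev(),
 * retry if a signal interrupted it, and print the specific errno
 * text on any other failure instead of failing silently.
 * Returns the number of bytes written, or -1 with errno preserved.
 */
ssize_t writev_retry(int fd, const struct iovec *iov, int iovcnt)
{
    ssize_t n;

    assert(iov != NULL && iovcnt > 0);

    do {
        n = writev(fd, iov, iovcnt);
    } while (n == -1 && errno == EINTR);   /* interrupted: try again */

    if (n == -1) {
        int saved = errno;                 /* fprintf may change errno */

        fprintf(stderr, "writev on fd %d failed: errno %d (%s)\n",
                fd, saved, strerror(saved));
        errno = saved;
    }
    return n;
}
```

A caller still has to cope with a short write (a return value smaller than the total requested), which this helper deliberately leaves alone.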
lmjm@doc.imperial.ac.UK (07/21/89)
I too was plagued by "XIO error" problems. Here are the pair of reports I sent out ages ago containing patches that seem to fix most of these problems. One fixes a weird case where writev is asked to write 0 bytes; the other makes sure that when the server uses writev to send to a client, it gives writev manageable chunks.

Hope these are useful.

	Lee

-------------------------------------------------------------------------

			  X Window System Bug Report
			    xbugs@expo.lcs.mit.edu

VERSION:
	R3

CLIENT MACHINE and OPERATING SYSTEM:
	HLH Clipper Orion running 4.2 BSD

DISPLAY:
	HLH StarPoint

WINDOW MANAGER:
	awm

AREA:
	Xlib (xterm)

SYNOPSIS:
	xterm using a Unix domain socket will quit unexpectedly when
	listing long files.

DESCRIPTION:
	Xlib in XlibInt.c in the routine _XSend() somehow ends up
	passing a 0 as the third arg to WritevToServer(). This causes
	the writev() then to fail with an EINVAL error. After detailed
	tracing of the code I have no idea why this occurs. It only
	happens with Unix domain sockets - not with TCP sockets.

REPEAT BY:
	setenv DISPLAY unix:0
	xterm
	in the xterm window turn on jump scroll then do "cat /etc/termcap"
	after 3 pages or so xterm will quit with an error message of:
		xterm: invalid arg

SAMPLE FIX:
	Since it can never reach the WritevToServer() without having
	something to write, and that would have to be in iov, I just
	ensure i is at least one.

*** XlibInt.c.orig	Thu Jan 12 23:19:54 1989
--- XlibInt.c	Fri Jan 13 22:39:40 1989
***************
*** 495,500 ****
--- 495,505 ----
          InsertIOV(pad, padlength[size & 3])
  
          errno = 0;
+ 
+         /* Always using at least iov[ 0 ] */
+         if (i == 0)
+                 i = 1;
+ 
          if ((len = WritevToServer(dpy->fd, iov, i)) >= 0) {
              skip += len;
              total -= len;

-------------------------------------------------------------------------

VERSION:
	R3

CLIENT MACHINE and OPERATING SYSTEM:
	HLH Clipper Orion running 4.2 BSD

DISPLAY:
	HLH StarPoint

WINDOW MANAGER:
	awm

AREA:
	X server

SYNOPSIS:
	The server fails to write any message greater than 8K back
	to the client.

DESCRIPTION:
	Under 4.2 BSD the max size of a message you can send on a
	pipe is 8K (at least on the few 4.2 BSD's I could find).
	In server/os/4.2bsd/io.c FlushClient() allows blocks of any
	size to be written.

REPEAT BY:
	With all the core fonts available try:
		xlsfonts
	This fails with
		Connection # 3 to server broken.
		XIO: Broken pipe

SAMPLE FIX:
	I've made the code #ifdef'd hlh - the machine I wrote it for.
	It should really be for any machine with a socket message size
	limit but I couldn't find any suitable #define and didn't feel
	up to adding one - it's been a long day.

*** io.c.old	Mon Jan 16 18:19:18 1989
--- io.c	Mon Jan 16 20:52:39 1989
***************
*** 314,353 ****
      int connection = oc->fd,
          total, n, i, notWritten, written,
          iovCnt = 0;
      struct iovec iov[3];
      char padBuffer[3];
  
      total = 0;
      if (oc->count)
      {
!         total += iov[iovCnt].iov_len = oc->count;
!         iov[iovCnt++].iov_base = (caddr_t)oc->buf;
          /* Notice that padding isn't needed for oc->buf since
             it is alreay padded by WriteToClient */
      }
      if (extraCount)
      {
!         total += iov[iovCnt].iov_len = extraCount;
!         iov[iovCnt++].iov_base = extraBuf;
          if (extraCount & 3)
          {
!             total += iov[iovCnt].iov_len = padlength[extraCount & 3];
!             iov[iovCnt++].iov_base = padBuffer;
          }
      }
  
      notWritten = total;
      while ((n = writev (connection, iov, iovCnt)) != notWritten)
      {
  #ifdef hpux
          if (n == -1 && errno == EMSGSIZE)
              n = swWritev (connection, iov, 2);
  #endif
          if (n > 0)
          {
              notWritten -= n;
              for (i = 0; i < iovCnt; i++)
              {
                  if (n > iov[i].iov_len)
                  {
                      n -= iov[i].iov_len;
--- 314,406 ----
      int connection = oc->fd,
          total, n, i, notWritten, written,
          iovCnt = 0;
+ #define _AddToIov( bytes, len ) \
+         total += iov[iovCnt].iov_len = (len); \
+         iov[iovCnt++].iov_base = (caddr_t)(bytes);
+ #ifndef hlh
      struct iovec iov[3];
+ #define AddToIov(bytes, len) _AddToIov(bytes, len)
+ #else
+     int iovs;
+     struct iovec iov[100]; /* Enough to avoid the need for dynamic allocation */
+ #define MAX_MSG 8192    /* Max size of a single iov to writev */
+ #define AddToIov( bytes, len ) \
+         { \
+                 char *buf = bytes; \
+                 int towrite = len; \
+                 while( towrite > MAX_MSG ){ \
+                         _AddToIov( buf, MAX_MSG ); \
+                         towrite -= MAX_MSG; \
+                         buf += MAX_MSG; \
+                 } \
+                 _AddToIov( buf, towrite ); \
+         }
+ #endif
      char padBuffer[3];
  
      total = 0;
      if (oc->count)
      {
!         AddToIov( oc->buf, oc->count );
          /* Notice that padding isn't needed for oc->buf since
             it is alreay padded by WriteToClient */
      }
      if (extraCount)
      {
!         AddToIov( extraBuf, extraCount );
          if (extraCount & 3)
          {
!             AddToIov( padBuffer, padlength[extraCount & 3] );
          }
      }
  
      notWritten = total;
+ #ifndef hlh
      while ((n = writev (connection, iov, iovCnt)) != notWritten)
+ #else
+     iovs = iovCnt;
+     while ((n = writev (connection, iov, iovs)) != notWritten)
+ #endif
      {
  #ifdef hpux
          if (n == -1 && errno == EMSGSIZE)
              n = swWritev (connection, iov, 2);
  #endif
+ #ifdef hlh
+         if (n == -1 && errno == EMSGSIZE){
+                 /* Too large a lump to write.
+                  * try with a fewer iov's.
+                  */
+                 int siz = 0;
+                 struct iovec *ip;
+ 
+                 /* How many iov's are less than the max? */
+                 iovs = 0;
+                 ip = &iov[0];
+                 while( (siz + ip->iov_len) <= MAX_MSG ){
+                         siz += ip->iov_len;
+                         ip++;
+                         iovs++;
+                 }
+                 continue;
+         }
+         else {
+                 /* Once its succeeded then try to write the rest - the
+                  * code in the if statement below should prevent iov's from
+                  * being resent */
+                 iovs = iovCnt;
+         }
+ #endif
          if (n > 0)
          {
              notWritten -= n;
              for (i = 0; i < iovCnt; i++)
              {
+ #ifdef hlh
+                 /* ignore buffers that have been written already */
+                 if (iov[i].iov_len == 0 )
+                         continue;
+ #endif
                  if (n > iov[i].iov_len)
                  {
                      n -= iov[i].iov_len;

-------------------------------------------------------------------------
--
Lee McLoughlin		phone: 01 589 5111 X 5028	fax: 01 581 8024
Department of Computing,Imperial College,180 Queens Gate,London SW7 2BZ, UK
Janet: lmjm@uk.ac.ic.doc	Uucp: lmjm@icdoc.UUCP (or ..!ukc!icdoc!lmjm)
DARPA: lmjm@doc.ic.ac.uk (or lmjm%uk.ac.ic.doc@nsfnet-relay.ac.uk)
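The idea common to both reports above (never hand the kernel a zero-length write, and never hand it more than the per-message limit in one call) can be sketched on its own. This is not Lee's patch: it uses plain write() in a loop rather than splitting iovecs for writev(), write_chunked is a hypothetical name, and the 8192 limit is simply the figure from the report, assumed rather than queried from the system.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_MSG 8192    /* assumed per-call limit, taken from the report */

/*
 * Hypothetical helper: write len bytes to fd, never asking for more
 * than MAX_MSG bytes in one call and never issuing a zero-length
 * write. Handles short writes and EINTR. Returns len on success,
 * or -1 with errno set by the failing write().
 */
ssize_t write_chunked(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    assert(buf != NULL || len == 0);

    while (done < len) {                /* len == 0: no system call at all */
        size_t want = len - done;
        ssize_t n;

        if (want > MAX_MSG)
            want = MAX_MSG;             /* cap the size of this chunk */

        n = write(fd, buf + done, want);
        if (n == -1) {
            if (errno == EINTR)
                continue;               /* interrupted: retry this chunk */
            return -1;
        }
        done += (size_t)n;              /* a short write just advances less */
    }
    return (ssize_t)done;
}
```

On a blocking descriptor this loops until everything is sent; a server using non-blocking sockets would also have to handle EWOULDBLOCK, which the sketch does not attempt.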
stevep@stellar.stellar.COM (Steve Pitschke) (07/24/89)
>> This subject has appeared before, but I never heard any real definitive
>> answers or solutions to the problem. The problem is that sometimes an
>> X client seems to fall behind the server, or a very large amount of data
>> is being sent between the client and the server, and the server appears
>> to send a KillClient, and consequently the client dies. I have heard
>> some say that there is a bug in writev and it returns an incorrect
>> error code. Others have said that it is caused by buggy unix domain
>> sockets (we've gotten the error when client and server were on the same
>> machine and when they were not). In any case, it is causing us a lot
>> of grief, so I was wondering if anyone has found a fix, a good explanation,
>> or even a "Fixed in R4" comment. Thanks!
>> --
>> Jeff Lo, Elan Computer Group, Inc.
>> jlo@elan.com, ..!{ames,uunet}!elan!jlo
>> 888 Villa Street, Third Floor, Mountain View, CA 94041, 415-964-2200

I spent a fair amount of time tracking down cases of this for our implementation and thus have some info for you. The general rule for the sample implementation's server socket calls (in libos) is to perform the system call and, if it returns an error, to silently close() the socket, leaving the user in the dark. (What we do here is to send any error messages out through the syslog daemon :=)

Two things that can cause the error, both of which we have actually observed, are:

1) Under heavy load the system (if it is a Unix (tm) derivative) returns either ENOBUFS or ENOMEM when the X server tries to write into the socket.

2) During the X connection handshake, the server saves the time at which the handshake started, and if the handshake does not complete before a timeout period (default 60 sec.), it again silently close()s the connection.

The two cases can be differentiated via the XIO message. In the latter case, 0 requests will have been processed. (As a heuristic, using timeout values in non-real-time O.S.'s often works, but can infrequently fail. :=)

I believe the thing which needs to be done is to have the server implementor write meaningful error messages to a message log when either of these cases occurs. You may then be able to reconfigure your O.S. or your use of X to avoid the heavy-load situations which cause the underlying problem. Having an error message is a necessary precursor to recognizing what the problem was.
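Logging the reason before closing, as suggested above, might look something like the following. This is an illustrative sketch in present-day C, not Stellar's implementation and not the sample server's libos code: describe_write_error and close_client_logged are hypothetical names, and the message wording is invented.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <syslog.h>
#include <unistd.h>

/*
 * Hypothetical helper: turn the errno from a failed socket write into
 * text that points at the likely cause. ENOBUFS and ENOMEM are the
 * heavy-load case described in the post.
 */
const char *describe_write_error(int err)
{
    switch (err) {
    case ENOBUFS:
    case ENOMEM:
        return "kernel out of buffers or memory (system under heavy load?)";
    case EPIPE:
        return "peer has closed the connection";
    default:
        return strerror(err);
    }
}

/*
 * Hypothetical helper: record why a client connection is being
 * dropped via syslog(), then close it, rather than closing silently.
 * Returns the result of close().
 */
int close_client_logged(int fd, int client, int err)
{
    assert(fd >= 0);

    syslog(LOG_ERR, "client %d (fd %d): write failed, closing: %s (errno %d)",
           client, fd, describe_write_error(err), err);
    return close(fd);
}
```

The handshake-timeout case would want its own message (for example, one recording how long the client took), since no errno is involved there.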