dlc@c3.lanl.gov (Dale Carstensen) (11/17/90)
My experience with X11R3 and X11R4, which has been mostly with Suns, has been that if the server dies (just the "X" process dies, or the whole host reboots), the remote clients that had been connected to it continue to run, usually. I have to go on a search and destroy mission to clean them up, sometimes finding clients patiently waiting on the expired server for a month or more. On the other hand, unreliability in the network connection between client and server can terminate clients while the server is still running. I would think that the clients could tell that the server has been gone for some sufficiently long time and terminate themselves, or, better yet, reconnect to the same "DISPLAY" (and the same user, I would hope) once the user has brought a new server up. If I actively terminate the server by finishing the last remaining active program from .xinitrc, the clients do get cleaned up, but when something more drastic kills off the server, a mess is left. I use xinit exclusively, so maybe with xdm or other ways to start the server, this problem doesn't happen or is somewhat different. I need the option to run other window systems, so replacing getty with xdm is not an option.
meo@rsiatl.UUCP (Miles ONeal) (11/19/90)
I agree. In fact, ideally, I want both (well, of *course* you do - here, have a bridge, too).

When timeout is selected (known by server, who tells client at connect, if they are R5 or later, anyway - also settable via client command line?), each client polls the server (or expects a watchdog signal from the server - I don't care) every n minutes (def = 10? - also user selectable). If the test fails x times in a row, the clients give up and gracefully go away.

Alternatively, when the server comes up, it announces who it is (say, fred:0) on the usual well known socket, and anybody who thought they were already talking to fred:0 reattaches, and the server sends them an expose event. Could be annoying with bogus unix:0 announcements, but so is having to go find and kill 20 processes when someone is debugging their login on a new x terminal that refuses to work like the old one did.

User selectable as to which behavior is desired, of course.

-Miles

Miles O'Neal
{uunet | emory}!rsiatl!meo (home)
meo@sware.com (work)
{uunet | emory}!sware!meo (work)
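The polling scheme described above (probe the server every n minutes, give up after x consecutive failures) can be sketched roughly as follows. This is only an illustration of the counting logic, not anything that exists in Xlib: the names `ServerWatchdog` and `run_watchdog` are hypothetical, the `probe` callable stands in for whatever liveness test a client would really use, and the default of 3 failures is an assumption (the post only suggests 10 minutes for the interval).

```python
import time
from typing import Callable


class ServerWatchdog:
    """Counts consecutive failed liveness probes (hypothetical helper)."""

    def __init__(self, max_failures: int = 3) -> None:
        if max_failures < 1:
            raise ValueError("max_failures must be at least 1")
        self.max_failures = max_failures
        self.consecutive_failures = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result; return True once the client should exit."""
        if probe_ok:
            # Any successful probe resets the count: only failures
            # "in a row" are supposed to trigger a shutdown.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.max_failures


def run_watchdog(
    probe: Callable[[], bool],
    interval_seconds: float = 600.0,  # 10 minutes, the default floated in the post
    max_failures: int = 3,            # assumed value for "x times in a row"
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until `probe` fails max_failures times in a row.

    Returns the number of probes made, so the caller can log it before
    shutting down gracefully.
    """
    watchdog = ServerWatchdog(max_failures)
    probes = 0
    while True:
        probes += 1
        if watchdog.record(probe()):
            return probes
        sleep(interval_seconds)
```

A client would call `run_watchdog` from a background thread or timer and exit cleanly when it returns. Passing `sleep` as a parameter is just a convenience so the loop can be exercised without waiting.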
donn@MILTON.U.WASHINGTON.EDU (Donn Cave) (11/20/90)
excerpts from <6220@lanl.gov> (Dale Carstensen):

> ... with X11R3 and X11R4 ... if the server dies ... remote clients that
> had been connected to it continue to run, usually.

Even if you wanted to use xdm, it isn't a complete cure for this one, since it only gets the clients directly descended from it - not clients started from the shell command line, not clients running on other hosts.

We were able to install a fairly trivial patch to Xlib, so that the socket is created with a keep-alive option. Clients that would normally accept prolonged inactivity on the connection now fail at some point when the keep-alive comes up. Since the twm client seems to be particularly hardy in disconnected state, I'm now running a test twm client linked with the revised Xlib routine, and at least with some TCPs it seems to fix Problem 1.

Unfortunately, it seems to slightly aggravate Problem 2:

> On the other hand, unreliability in the network connection between client
> and server can terminate clients while the server is still running.

We get a lot of "Network down" (E_NETDOWN) errors, particularly on one host whose Ethernet hardware is not the industry's best. These errors tend to afflict only one client at a time, rather than shutting down everyone at once, and they don't seem to bother ftp or telnet in the least. I'm told that ftp and telnet re-try when they encounter these conditions - would such re-trying be another trivial modification to Xlib? Are there fundamental reasons why X should give up immediately when it encounters this network error?

Donn Cave
University Computing Services, University of Washington
donn@cac.washington.edu
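The patch itself is not shown in the post. As a rough idea of what "created with a keep-alive option" means at the socket level, here is a minimal sketch in Python rather than the C of Xlib. The function names `enable_keepalive` and `connect_to_display` are hypothetical, and the sketch assumes a plain IPv4 TCP connection to the conventional X port (6000 plus the display number).

```python
import socket

# Display N conventionally listens on TCP port 6000 + N.
X_TCP_BASE_PORT = 6000


def enable_keepalive(sock: socket.socket) -> None:
    """Ask the TCP stack to probe an idle connection (SO_KEEPALIVE)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)


def connect_to_display(host: str, display_number: int = 0) -> socket.socket:
    """Open a TCP connection to an X display with keep-alive turned on.

    Hypothetical stand-in for the connection setup inside Xlib; it does
    not speak the X protocol, it only shows where the option is set.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Set the option before connecting, matching the description
        # of a socket "created with a keep-alive option".
        enable_keepalive(sock)
        sock.connect((host, X_TCP_BASE_PORT + display_number))
    except OSError:
        sock.close()
        raise
    return sock
```

How long an idle connection survives before the probes begin, and how many unanswered probes are tolerated, is decided by the TCP implementation, which fits the observation that the fix works "at least with some TCPs".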
milton@en.ecn.purdue.edu (Milton D Miller) (11/20/90)
In article <9011191810.AA01979@milton.u.washington.edu>, donn@MILTON.U.WASHINGTON.EDU (Donn Cave) writes:
>excerpts from <6220@lanl.gov> (Dale Carstensen):
>
>> ... with X11R3 and X11R4 ... if the server dies ... remote clients that
>> had been connected to it continue to run, usually.
>
>Even if you wanted to use xdm, it isn't a complete cure for this one, since
>it only gets the clients directly descended from it - not clients started from
>the shell command line, not clients running on other hosts.
>
>We were able to install a fairly trivial patch to Xlib, so that the socket
>is created with a keep-alive option. ...
>Unfortunately, it seems to slightly aggravate Problem 2:
>
>> On the other hand, unreliability in the network connection between client
>> and server can terminate clients while the server is still running.
>
>We get a lot of "Network down" (E_NETDOWN) errors, particularly on one host
>whose Ethernet hardware is not the industry's best. ....

There is currently a discussion in comp.protocols.tcp-ip about (not using) keepalives. As was pointed out there today:

> From: barmar@think.com (Barry Margolin)
> Subject: Re: Warning: Keep-Alive considered harmful

[excerpt follows:]
... The connection shouldn't be killed as a result of keep-alive timeouts. Instead, the purpose of keep-alives should be to elicit RSTs from the other host. Timeouts can be due to any number of reasons, but a RST indicates unambiguously that the connection is unusable, because the other end rebooted or closed the connection itself (perhaps network problems prevented the FIN from getting through). If a host crashes, the keepalive won't actually notice this until it comes back up, which is probably good enough.
[end of excerpt]

I agree :-)

Also notice, in the case of X terminals, there is usually a "close all connections" option, which is essentially a reboot of the tcp. The other end probably is not given FIN or RST, and the condition won't show up until the other end is poked.
(For xterms, invoking "write" to yourself will usually push the connection into destruction, and may return a "Not logged on there" to your write command).

>I'm told that
>ftp and telnet re-try when they encounter these conditions - would such
>re-trying be another trivial modification to Xlib?

It is the responsibility of TCP to do the retrying. It *Should* be up to the application when to give up (See also Host Requirements RFC), but that is not generally available :-(.

>Are there fundamental
>reasons why X should give up immediately when it encounters this network error?
>

None that I can think of; the servers already buffer for each client, I don't see why the clients shouldn't buffer for the server when not explicitly requesting a sync (do they do this already?) They (clients) may need to give up if the buffering is taking too much space; and response time may suffer with no apparent reason (to the other servers/users) if multiple servers are open.

milton
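To make the distinction in this exchange concrete, here is a hedged sketch of an application-level write loop that treats a reset as final but retries "network down" style errors a limited number of times. It is not how Xlib behaves, and retransmission proper remains TCP's job as noted above. The function name `send_all_with_retry`, the choice of which errno values count as soft, and the retry limits are all assumptions made for illustration. The errno constant is spelled `ENETDOWN` in Python's `errno` module. Whether a given TCP stack reports such an error on a connection that is still usable varies by implementation.

```python
import errno
import time
from typing import Callable

# Assumed set of "soft" errors that may clear up on their own.
SOFT_ERRORS = {errno.ENETDOWN, errno.ENETUNREACH, errno.EHOSTUNREACH}


def send_all_with_retry(
    sock,
    data: bytes,
    max_retries: int = 5,          # assumed limit, not from the thread
    delay_seconds: float = 1.0,    # assumed pause between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Send all of `data`, retrying soft network errors a few times.

    `sock` needs only a send(buffer) method returning the byte count.
    Anything outside SOFT_ERRORS, such as ECONNRESET after the peer
    sends a RST, is re-raised immediately because the connection is
    unambiguously unusable.
    """
    view = memoryview(data)
    retries = 0
    while len(view) > 0:
        try:
            # A failed send() call accepted no bytes, so repeating it
            # cannot duplicate data in the stream.
            sent = sock.send(view)
        except OSError as exc:
            if exc.errno not in SOFT_ERRORS or retries >= max_retries:
                raise
            retries += 1
            sleep(delay_seconds)
            continue
        view = view[sent:]
        retries = 0  # progress was made; start counting afresh
```

The retry limit is where the application decides when to give up, which is the policy choice the reply above says should belong to the application.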