[comp.windows.x] R5 wish list -- client recovery

dlc@c3.lanl.gov (Dale Carstensen) (11/17/90)

My experience with X11R3 and X11R4, which has been mostly with Suns, has been
that if the server dies (just the "X" process dies, or the whole host reboots), the
remote clients that had been connected to it continue to run, usually.  I have to go
on a search and destroy mission to clean them up, sometimes finding clients
patiently waiting on the expired server for a month or more.  On the other hand,
unreliability in the network connection between client and server can terminate
clients while the server is still running.  I would think that the clients could
tell that the server has been gone for some sufficiently long time and terminate
themselves, or, better yet, reconnect to the same "DISPLAY" (and the same user, I
would hope) once the user has brought a new server up.

If I actively terminate the server by finishing the last remaining active program
from .xinitrc, the clients do get cleaned up, but when something more drastic kills
off the server, a mess is left.  I use xinit exclusively, so maybe with xdm or other
ways to start the server, this problem doesn't happen or is somewhat different.  I
need the option to run other window systems, so replacing getty with xdm is not an
option.

meo@rsiatl.UUCP (Miles ONeal) (11/19/90)

I agree. In fact, ideally, I want both (well, of *course*
you do - here, have a bridge, too).

When timeout is selected (known by server, who tells client
at connect, if they are R5 or later, anyway - also settable
via client command line?), each client polls the server (or
expects a watchdog signal from the server - I don't care)
every n minutes (def = 10? - also user selectable). If the
test fails x times in a row, the clients give up and
gracefully go away.
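
The give-up rule above can be sketched in C as a counter of consecutive
failed polls.  This is only an illustration of the proposal: the names
(struct watchdog, watchdog_init, watchdog_record) are hypothetical and are
not part of Xlib, and the poll itself (made every n minutes) is left to the
caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical client-side watchdog state; not an Xlib structure. */
struct watchdog {
    int max_failures;   /* "x": failed polls in a row before giving up */
    int failures;       /* failed polls seen in a row so far */
};

void watchdog_init(struct watchdog *w, int max_failures)
{
    assert(w != NULL && max_failures > 0);
    w->max_failures = max_failures;
    w->failures = 0;
}

/* Record the outcome of one poll of the server.
 * Returns true once the client should shut down gracefully. */
bool watchdog_record(struct watchdog *w, bool server_answered)
{
    if (server_answered)
        w->failures = 0;    /* any success resets the streak */
    else
        w->failures++;
    return w->failures >= w->max_failures;
}
```

Resetting the count on any successful poll means a flaky network only
kills the client if the server stays unreachable for x polls in a row.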

Alternatively, when the server comes up, it announces who it
is (say, fred:0) on the usual well known socket, and anybody
who thought they were already talking to fred:0 reattaches,
and the server sends them an expose event. Could be annoying
with bogus unix:0 announcements, but so is having to go find
and kill 20 processes when someone is debugging their login
on a new x terminal that refuses to work like the old one did.

User selectable as to which behavior is desired, of course.

-Miles

Miles O'Neal
{uunet | emory}!rsiatl!meo (home)
meo@sware.com              (work)
{uunet | emory}!sware!meo  (work)

donn@MILTON.U.WASHINGTON.EDU (Donn Cave) (11/20/90)

excerpts from <6220@lanl.gov> (Dale Carstensen):

>  ... with X11R3 and X11R4 ... if the server dies ... remote clients that
> had been connected to it continue to run, usually.

Even if you wanted to use xdm, it isn't a complete cure for this one, since
it only gets the clients directly descended from it - not clients started from
the shell command line, not clients running on other hosts.

We were able to install a fairly trivial patch to Xlib, so that the socket
is created with a keep-alive option.  Clients that would normally accept
prolonged inactivity on the connection, now fail at some point when the
keep-alive comes up.  Since the twm client seems to be particularly hardy
in disconnected state I'm now running a test twm client linked with the
revised Xlib routine, and at least with some TCPs it seems to fix Problem 1.
Unfortunately, it seems to slightly aggravate Problem 2:

> On the other hand, unreliability in the network connection between client
> and server can terminate clients while the server is still running.
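
A minimal sketch of the kind of keep-alive change described above,
assuming a BSD-style sockets API.  The function name enable_keepalive is
hypothetical, and this is not the actual Xlib patch; it only shows the
SO_KEEPALIVE option being set on a connection's socket.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Ask TCP to probe an idle connection so that a vanished peer is
 * eventually noticed.  Returns 0 on success, -1 on failure with errno
 * set by setsockopt(). */
int enable_keepalive(int fd)
{
    int on = 1;
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
}
```

The idle interval before the first probe is chosen by the TCP
implementation, not by this call.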

We get a lot of "Network down" (ENETDOWN) errors, particularly on one host
whose Ethernet hardware is not the industry's best.  These errors tend to
afflict only one client at a time, rather than shutting down everyone at
once, and they don't seem to bother ftp or telnet in the least.  I'm told that
ftp and telnet re-try when they encounter these conditions - would such
re-trying be another trivial modification to Xlib?  Are there fundamental
reasons why X should give up immediately when it encounters this network error?

		Donn Cave
		University Computing Services, University of Washington
		donn@cac.washington.edu

milton@en.ecn.purdue.edu (Milton D Miller) (11/20/90)

In article <9011191810.AA01979@milton.u.washington.edu>,
    donn@MILTON.U.WASHINGTON.EDU (Donn Cave) writes:
>excerpts from <6220@lanl.gov> (Dale Carstensen):
>
>>  ... with X11R3 and X11R4 ... if the server dies ... remote clients that
>> had been connected to it continue to run, usually.
>
>Even if you wanted to use xdm, it isn't a complete cure for this one, since
>it only gets the clients directly descended from it - not clients started from
>the shell command line, not clients running on other hosts.
>
>We were able to install a fairly trivial patch to Xlib, so that the socket
>is created with a keep-alive option. 
...
>Unfortunately, it seems to slightly aggravate Problem 2:
>
>> On the other hand, unreliability in the network connection between client
>> and server can terminate clients while the server is still running.
>
>We get a lot of "Network down" (ENETDOWN) errors, particularly on one host
>whose Ethernet hardware is not the industry's best.  
....

There is currently a discussion in comp.protocols.tcp-ip about (not using)
keepalives.  As was pointed out there today:

>	From: barmar@think.com (Barry Margolin)
>	Subject: Re: Warning: Keep-Alive considered harmful

[excerpt follows:]
... The connection shouldn't be killed as a result of keep-alive timeouts.
Instead, the purpose of keep-alives should be to elicit RSTs from the other
host.  Timeouts can be due to any number of reasons, but a RST indicates
unambiguously that the connection is unusable, because the other end
rebooted or closed the connection itself (perhaps network problems
prevented the FIN from getting through).  If a host crashes, the keepalive
won't actually notice this until it comes back up, which is probably good
enough.
[end of excerpt]
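
The distinction drawn in the excerpt can be sketched as a test on the
errno value a failed read or write returns.  This is an assumption-laden
illustration: connection_is_dead is a hypothetical name, and which errno
values a given TCP actually reports for each condition varies between
systems.

```c
#include <errno.h>
#include <stdbool.h>

/* True only when the error says the connection itself is gone, as
 * opposed to the network being slow or temporarily unreachable. */
bool connection_is_dead(int err)
{
    switch (err) {
    case ECONNRESET:    /* the other end answered with an RST */
    case EPIPE:         /* write on a connection already torn down */
        return true;
    default:            /* ETIMEDOUT, ENETDOWN, etc.: may recover */
        return false;
    }
}
```

Under this rule a timeout or "Network down" error alone would not
terminate a client; only an unambiguous reset would.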

I agree :-)  Also notice, in the case of X terminals, there is usually
a "close all connections" option, which is essentially a reboot of the
tcp.  The other end probably is not given FIN or RST, and the condition
won't show up until the other end is poked.  (For xterms, invoking "write"
to yourself will usually push the connection into destruction, and may
return a "Not logged on there" to your write command).  

>I'm told that
>ftp and telnet re-try when they encounter these conditions - would such
>re-trying be another trivial modification to Xlib?

It is the responsibility of TCP to do the retrying.   It *Should* be up
to the application when to give up (See also Host Requirements RFC), but 
that is not generally available :-(.  

>Are there fundamental
>reasons why X should give up immediately when it encounters this network error?
>

None that I can think of; the servers already buffer for each client,
I don't see why the clients shouldn't buffer for the server when not
explicitly requesting a sync (do they do this already?)

They (clients) may need to give up if the buffering is taking too much
space; and response time may suffer with no apparent reason (to the
other servers/users) if multiple servers are open.  
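
As a rough sketch of client-side buffering with a space limit: the
structure, the function name and the 4096-byte cap below are all
hypothetical and are not taken from Xlib.  Requests are queued while the
connection is stalled, and a false return marks the point where the
client would have to give up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define OUTBUF_LIMIT 4096   /* hypothetical cap on queued bytes */

/* Hypothetical per-connection queue of requests not yet sent. */
struct outbuf {
    unsigned char data[OUTBUF_LIMIT];
    size_t used;            /* bytes currently queued */
};

/* Queue len bytes for later delivery to the server.
 * Returns false, leaving the buffer unchanged, if they do not fit. */
bool outbuf_queue(struct outbuf *b, const void *src, size_t len)
{
    if (len > OUTBUF_LIMIT - b->used)
        return false;
    memcpy(b->data + b->used, src, len);
    b->used += len;
    return true;
}
```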

milton