[comp.sys.isis] detecting client failures

ken@gvax.cs.cornell.edu (Ken Birman) (03/22/90)

As part of the isis_remote() mechanism I am implementing a facility
by which ISIS will periodically poke a client program to see if it
is alive.

Here's the proposed interface; I am interested in comments or feedback:

1) Initializing the mechanism.
	application calls	isis_probe(freq, timeout)
	where
		freq    = interval between probes, in seconds
		timeout = time allowed for the expected reply, in seconds
2) Defaults:
	disabled for "local" clients of ISIS
	freq = 60, timeout = 30 for "remote" clients
3) Implementation:
        client starts sending a HELLO message every freq. seconds
        ISIS has a timer; if it doesn't get a HELLO within freq+timeout secs
		it "kills off" the client
        killed client that was really alive calls isis_failed, then panics
		with message "killed by ISIS" unless isis_failed traps the
		failure (i.e. by reconnecting).
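
A minimal sketch of the timer rule in item 3, written in Python rather than
against the ISIS C library.  The class and method names are hypothetical and
the clock is passed in explicitly; only the freq+timeout rule and the 60/30
defaults come from the proposal.

```python
class ProbeMonitor:
    """Server-side bookkeeping for HELLO messages (hypothetical names)."""

    def __init__(self, freq=60, timeout=30):
        self.freq = freq          # seconds between expected HELLOs
        self.timeout = timeout    # extra time allowed, in seconds
        self.last_hello = {}      # client id -> time of last HELLO

    def hello(self, client, now):
        """Record a HELLO (or the initial connection) from a client."""
        self.last_hello[client] = now

    def expire(self, now):
        """Return and forget the clients whose HELLO is overdue."""
        deadline = self.freq + self.timeout
        dead = [c for c, t in self.last_hello.items() if now - t > deadline]
        for c in dead:
            del self.last_hello[c]
        return dead


monitor = ProbeMonitor()            # defaults for "remote" clients
monitor.hello("a", now=0)
monitor.hello("b", now=0)
monitor.hello("a", now=60)          # "a" keeps sending; "b" goes silent
print(monitor.expire(now=91))       # prints ['b']: more than 60+30 seconds
```

What happens to a client returned by expire() (the "kill off" step and the
client's isis_failed call) is outside this sketch.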

In addition, a remote client monitors its mother machine and calls
isis_failed if the mother machine dies; the isis_probe() constants
are used for this purpose, too.

Robert Cooper has an idea for a very fancy mechanism that would let
a client (any client) drop "offline" for a while and then come back.
For example, say that you are doing a stock trading system and the
broker presses "analyze".  You might want to stop monitoring stocks,
crunch hard for a few seconds and do some fancy display, then show
all the updates that arrived while you were crunching as a batch and
go back to monitoring passively.

We are thinking about how this could be added to ISIS.  It wouldn't be
trivial, but could probably be done.  Also being considered are ways to
connect remote non-UNIX clients (PC's, etc).  I doubt that either of these
features will ever be in the public version of ISIS, but my guess is that
IDS will have products extending ISIS in these ways late this year...

jhanley@oracle.com (John Hanley) (03/23/90)

In article <38955@cornell.UUCP> ken@gvax.cs.cornell.edu (Ken Birman) writes:
>Robert Cooper has an idea for a very fancy mechanism that would let
>a client (any client) drop "offline" for a while and then come back.
>...
>We are thinking about how this could be added to ISIS.  It wouldn't be
>trivial, but could probably be done.  Also being considered are ways to
>connect remote non-UNIX clients (PC's, etc)....

I hope that the "are you alive?" probe can be done by either the server
or client, but that the result "machine1 poked machine2" is available
to both server and client, with a timestamp.  If it would mean less work
for the Isis server I suppose the client should take primary responsibility
for doing the pinging, though the overhead on the server is unclear to me.
For sanity's sake I _think_ it would make sense for one or the other of
the machines to "usually" do the vast majority of the pings, with the other
machine making a query and a log entry if no word has been received from
its partner in a while.

In the event that a fileserver in the machine room loses touch with 50 PC's
because a thin-ether terminator came loose, it might become aware of this
failure sooner, with less scheduling overhead, if it was expecting pings
from its clients, randomly distributed over a 60-second interval or
whatever.  It would be interesting to include a "network location" attribute
in the sites file and to test the hypothesis "ether cable such-and-such
failed" when a machine on that cable is observed to timeout (by pinging its
neighbors).  This would be a biggie for detecting network partitions, say
from an important router being disconnected.  Loss of power to an entire
building is also an interesting group-failure case.
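
One way the "network location" hypothesis test might look, as a sketch in
Python.  The mapping from host to segment stands in for the proposed sites
file attribute; the function name and the data are made up for illustration.

```python
from collections import defaultdict


def suspect_segments(location, timed_out):
    """Return the segments on which every known host has timed out.

    location:  dict mapping host -> network segment (the hypothetical
               "network location" attribute from the sites file)
    timed_out: set of hosts currently observed to have timed out
    """
    hosts_on = defaultdict(set)
    for host, segment in location.items():
        hosts_on[segment].add(host)
    # A segment is suspect only if all of its hosts are silent.
    return sorted(seg for seg, hosts in hosts_on.items() if hosts <= timed_out)


location = {"pc1": "ether-a", "pc2": "ether-a",
            "ws1": "ether-b", "ws2": "ether-b"}
print(suspect_segments(location, {"pc1", "pc2", "ws1"}))  # prints ['ether-a']
```

A single silent host (ws1 here) does not implicate its cable while a
neighbor on the same segment still answers.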

One implication of this "usually pinging from one side" strategy is that
although the interval and timeout parameters might very well be different
for the two sides, at the time they are negotiated some sanity check and
readjustment would be done to ensure that 99% of all pings will come from
the same side.  Deciding whether you prefer having the server or client do
it is your option, though I suspect that having a single-tasked MeSsy-DOS
client speak up rather than listen attentively is an easier proposition.

As long as you're going to the trouble of pinging, it might be worthwhile to
communicate some load metric to the client, such as load avg, or # of active
users, or # of pages on the free list.  If the metric was scaled to a
uniform range of 1 to 100, applications that distribute calculations across
a lot of workstations could choose to not bog down busy hosts based on this,
in a portable way.  The advertised metric might be artificially increased by
a workstation user, or by an agent on his behalf, to control how willing or
unwilling the user is at the moment to donate resources to the Isis
community.  Isis clients might test-ping alternate servers and change the
host they are using as an Isis server based on excessive load (good-neighbor
load levelling policy) or high observed latency on responses (selfish
gimme-performance policy).  The interval and timeout parameters might
normally be very small, to rapidly detect failure of the partner or the
network, with adaptive increases if the server is becoming very busy.  The
log noting loss of contact with the server might also note the load average
ramping up on responses just before the timeout.  A lack of correlation
between the server's load metric and observed latencies may indicate that
the server is quite healthy but that the network is melting down.  In any
event, the metric returned by server would have to be very cheap to compute,
and could probably be cached by the server's Isis so that the host operating
system wouldn't be interrogated for the metric more often than every 5
seconds or so.
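
A rough Python sketch of the two mechanical pieces above: scaling a raw
reading onto 1 to 100, and caching it so the operating system is asked at
most once every 5 seconds.  All names are hypothetical, and the choice of
raw reading and its lo/hi bounds is left to the caller.

```python
import time


def scale_metric(raw, lo, hi):
    """Map a raw reading onto the uniform 1..100 range, clamping at the ends."""
    if hi <= lo:
        raise ValueError("hi must exceed lo")
    fraction = (raw - lo) / (hi - lo)
    fraction = min(1.0, max(0.0, fraction))
    return 1 + round(fraction * 99)


class CachedMetric:
    """Re-read the raw metric at most once every max_age seconds."""

    def __init__(self, read_raw, lo, hi, max_age=5.0, clock=time.monotonic):
        self.read_raw = read_raw      # callable that interrogates the host OS
        self.lo, self.hi = lo, hi
        self.max_age = max_age
        self.clock = clock
        self._value = None
        self._stamp = None

    def get(self):
        now = self.clock()
        if self._stamp is None or now - self._stamp >= self.max_age:
            self._value = scale_metric(self.read_raw(), self.lo, self.hi)
            self._stamp = now
        return self._value
```

The value returned by get() is what would ride along on each ping reply; a
user or agent wanting to donate fewer cycles could add a bias before
scaling.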

If you choose to have the server generate most of the pings, substantial
efficiencies could be obtained by taking advantage of broadcast media.
Another argument for having the server generate pings is that they could be
generated by simply sending messages to a standard Isis group of which all
clients are members.

Also, I suspect that the "drop offline for a little while" functionality
isn't too hard to do by
   A) demanding an estimate of how long you plan to crunch offline for
   B) renegotiating the interval and timeout parameters to much larger values,
      and then shrinking them back to normal when rejoining the Isis community
In trading off the lag in detecting a failure against the resources needed
for pinging, widely varying client needs would argue for choosing to make
(at least some) clients generate lots of server pings.
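
Steps A and B might be sketched like this in Python.  This models only the
client's local copy of the parameters; actually renegotiating with the
server is not shown, and stretching both values to the offline estimate is
just one assumed policy.

```python
from contextlib import contextmanager


class ProbeSettings:
    """Local copy of the probe parameters (hypothetical)."""

    def __init__(self, freq=60, timeout=30):
        self.freq = freq
        self.timeout = timeout


@contextmanager
def offline(settings, estimate):
    """Stretch the parameters for about `estimate` seconds, then restore."""
    saved = (settings.freq, settings.timeout)
    settings.freq = max(settings.freq, estimate)
    settings.timeout = max(settings.timeout, estimate)
    try:
        yield settings
    finally:
        # Shrink back to normal when rejoining, even if the crunch failed.
        settings.freq, settings.timeout = saved


settings = ProbeSettings()
with offline(settings, estimate=300):
    pass  # crunch hard here; probes are now expected far less often
```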

Finally, I would like to observe that the single most important server
resource in our environment is physical memory.  The greatest objection that
people have to running a protos process on their workstation all the time is
that it always has a resident set of several hundred K, regardless of
whether it is doing any real work or not.  Optimizing the set of pages touched
whilst assuring others that "I'm up" would be a great boon.

And thanks for the good work, Ken.
						--JH

ken@gvax.cs.cornell.edu (Ken Birman) (03/23/90)

In article <1990Mar23.093050.23923@oracle.com> jhanley@oracle.com (John Hanley) writes:
>I hope that the "are you alive?" probe can be done by either the server
>or client....

The implementation is pretty simple.  Actually, nobody does any probing at
all.  The client starts sending "I'm alive" messages to the server once
every <frequency> seconds.  The server expects these and if one is late
by <timeout> seconds it kills off the client.  I'll probably tune things
so that "I'm alive" messages are only sent if there has been no other traffic
to the server, but this isn't likely to be a major issue.

The client expects acks from the server, so it will notice within about
30 seconds if the server isn't acknowledging an "I'm alive" transmission.
This is also how it finds out that it has been killed off.

In this case (client was up but got killed off, or server died but client
survived) the client code calls isis_failed(), which has the option of
reconnecting to ISIS or aborting.  
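
A sketch of the client side of this, in Python with hypothetical names.  The
on_failed callback stands in for isis_failed() and is assumed to return
True if it reconnected; the 30-second ack wait is the "about 30 seconds"
above.

```python
class ClientLiveness:
    """Decides when to send "I'm alive" and when to give up on the server."""

    def __init__(self, freq=60, ack_wait=30, on_failed=None):
        self.freq = freq
        self.ack_wait = ack_wait
        self.on_failed = on_failed or (lambda: False)
        self.last_sent = 0        # time of last message of any kind
        self.awaiting_ack = None  # send time of an unacked "I'm alive"

    def sent_traffic(self, now):
        """Other traffic to the server postpones the next "I'm alive"."""
        self.last_sent = now

    def ack(self):
        self.awaiting_ack = None

    def tick(self, now):
        """Return 'send', 'idle', 'reconnected', or 'failed'."""
        if self.awaiting_ack is not None and now - self.awaiting_ack >= self.ack_wait:
            self.awaiting_ack = None
            if self.on_failed():          # handler chose to reconnect
                self.last_sent = now
                return "reconnected"
            return "failed"               # caller should abort
        if self.awaiting_ack is None and now - self.last_sent >= self.freq:
            self.last_sent = now
            self.awaiting_ack = now
            return "send"
        return "idle"
```

The same ack timeout covers both cases described above: the server died, or
the server killed off the client and stopped answering it.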

>As long as you're going to the trouble of pinging, it might be worthwhile to
>communicate some load metric to the client, such as load avg, or # of active
>users, or # of pages on the free list...

We are planning to solve this class of problem through Meta, which has
"sensors" that include the sorts of per-process load factors you cite.
Meta is a major ISIS application developed by Keith Marzullo and Mark
Wood.  It provides a network-wide user-extensible database of sensors
(e.g. things like load, but also potentially things like the amount of
space left on a disk or the length of a job-queue or even the temperature
in the machine room).  It also has actuators (trigger actions).  Meta
has several ways to query the database of sensors/actuators and supports
a "when" clause (will support, at any rate) that watches in a fault-tolerant
way for some event and triggers a specified action.  Oh, and you can also
combine sensors to build composite ones, e.g. average load or something...
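
To make the sensor, composite, and "when" ideas concrete, here is a toy
sketch in Python.  This is not Meta's interface; every name is invented,
and the fault-tolerant watching is not modeled at all.

```python
class Sensor:
    """A named value that can be read on demand."""

    def __init__(self, name, read):
        self.name = name
        self.read = read


def composite_average(name, sensors):
    """Build a sensor whose value is the mean of other sensors."""
    return Sensor(name, lambda: sum(s.read() for s in sensors) / len(sensors))


class When:
    """Run `action` each time `condition` goes from false to true."""

    def __init__(self, sensor, condition, action):
        self.sensor = sensor
        self.condition = condition
        self.action = action
        self.was_true = False

    def poll(self):
        is_true = self.condition(self.sensor.read())
        if is_true and not self.was_true:
            self.action(self.sensor.name)
        self.was_true = is_true


loads = {"hostA": 1.0, "hostB": 3.0}
per_host = [Sensor(h, lambda h=h: loads[h]) for h in loads]
avg = composite_average("avg_load", per_host)
watch = When(avg, lambda v: v > 2.5, lambda name: print(name, "is high"))
watch.poll()            # average is 2.0, nothing happens
loads["hostA"] = 3.0
watch.poll()            # prints "avg_load is high"
```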

We are mailing out a few technical reports in the next week or so, and
one covers Meta (another is on the bypass code, and another is just a
progress/status report for the project).

>Also, I suspect that the "drop offline for a little while" functionality
>isn't too hard to do by
>   A) demanding an estimate of how long you plan to crunch offline for
>   B) renegotiating the interval and timeout parameters to much larger values,
>      and then shrinking them back to normal when rejoining the Isis community

Well, this would force ISIS to buffer potentially large amounts of data.
With lots of such data ISIS would congest and kill the client to shed load...
What Robert had in mind was more along the lines of a way to have the remote
client tell its services (gracefully) that it wants to drop offline,
a period during which it would be offline and the services would buffer
or archive data for it, and a graceful re-join mechanism that would
bring them up to date again.  Obviously, you could implement this now,
but the question is whether we couldn't come up with a "generic" tool
for this purpose.
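
As a sketch of what a service could do today, here is a per-client
subscription in Python that buffers while the client is offline and
delivers one batch on rejoin.  The names are hypothetical, and bounding the
buffer by dropping the oldest updates is an assumption made to cap the
memory cost mentioned above.

```python
from collections import deque


class Subscription:
    """Per-client delivery state kept by a service (hypothetical)."""

    def __init__(self, deliver, max_buffered=1000):
        self.deliver = deliver                     # called with a list of updates
        self.online = True
        self.backlog = deque(maxlen=max_buffered)  # oldest dropped when full

    def publish(self, update):
        if self.online:
            self.deliver([update])
        else:
            self.backlog.append(update)

    def go_offline(self):
        """Client announced (gracefully) that it is dropping offline."""
        self.online = False

    def rejoin(self):
        """Deliver everything buffered as one batch, then resume live delivery."""
        batch = list(self.backlog)
        self.backlog.clear()
        self.online = True
        if batch:
            self.deliver(batch)
```

In the stock trading example, go_offline() would be called when the broker
presses "analyze" and rejoin() when the display is done.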

>Finally, I would like to observe that the single most important server
>resource in our environment is physical memory....  Optimizing the set of
>pages touched whilst assuring others that "I'm up" would be a great boon.

I'm confused.  Doesn't the isis_remote() stuff address this?  With the
remote client code, protos won't be on the workstations at all, only on
the servers that run things like NFS or the main database system.  The
client code is down to something a bit more modest by now, 168k of ISIS
related library text total.  So, the typical cost on a workstation dedicated
to some application and running only as a "remote" client is 168k/process,
which could be further reduced using shared libraries (one of those things
we ought to get around to...)

Ken

ken@gvax.cs.cornell.edu (Ken Birman) (03/24/90)

> From: rich@sendai.ann-arbor.mi.us (K. Richard Magill) (via email)
>   From: ken@gvax.cs.cornell.edu (Ken Birman)
>   Newsgroups: comp.sys.isis
>   Date: 22 Mar 90 15:50:46 GMT
>   Reply-To: ken@gvax.cs.cornell.edu (Ken Birman)

>	   application calls	isis_probe(freq, timeout)

>Could be tricky to set these initially.  Maybe something based on
>typical round trip message time would be easier to deal with than hard
>seconds.

I've played with round-trip numbers and they really don't work for
crashes.  Crashes tend to be unusual events -- so are long delays --
and just because you were running fast a few seconds ago I can't
safely assume that you won't be updating a display for 10 seconds
sometime soon...  ISIS is full of hard-coded constants at this level,
as is any protocol implementation.

>   3) Implementation:
>	   client starts sending a HELLO message every freq. seconds
>	   ISIS has a timer; if it doesn't get a HELLO within freq+timeout secs
>		   it "kills off" the client

>I'd prefer that isis poll the client before killing.  I can imagine a
>situation, perhaps macOs, where the tasking and time division is such
>that periodic hello's might be difficult, but where polls could be
>answered immediately.

Good idea.  I'll add this feature.
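
Roughly, the poll-before-kill rule could look like the Python sketch below.
The function name and the poll_wait parameter are assumptions; nothing here
is the actual ISIS code.

```python
def check_client(last_heard, now, freq=60, timeout=30,
                 poll_wait=10, polled_at=None):
    """Decide what the server should do about one client.

    Returns "ok", "poll" (send an explicit are-you-alive query),
    "wait" (a poll is outstanding), or "kill".  The caller updates
    last_heard and clears polled_at whenever anything arrives from
    the client, including a poll reply.
    """
    if now - last_heard <= freq + timeout:
        return "ok"
    if polled_at is None:
        return "poll"
    if now - polled_at < poll_wait:
        return "wait"
    return "kill"


print(check_client(last_heard=0, now=91))                 # prints poll
print(check_client(last_heard=0, now=101, polled_at=91))  # prints kill
```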

>	   killed client that was really alive calls isis_failed, then panics
>		   with message "killed by ISIS" unless isis_failed traps the
>		   failure (i.e. by reconnecting).

>If the client's system has a clock, which it must in order to know
>frequency and timeout, then isis on the client can recognize when it
>has missed a poll.  This implies that some kind of dynamic backoff
>might be in order before isis outright "disowns" the client.

Unclear on what you mean by this...  The clib clock is currently driven by
SIGALRM interrupts (once per second), but the "I'm alive" message is sent
only when ISIS gets scheduled.  So, what I am really defining as liveness
is that the ISIS scheduler gets scheduled at least once every "frequency"
seconds...

>   Robert Cooper has an idea for a very fancy mechanism that would let
>   a client (any client) drop "offline" for a while and then come back...

>I know precisely the problem here and I'd be sorely disappointed to
>see you spend the time to solve it within isis rather than with isis.

>My recommendation would be for users with this need to do one of the
>following. Use a cross between isis bcast from the server to the client
>and an rpc-ish from the client to the server.  Or, withdraw from the group
>that receives the data then rejoin.  Personally, I think I'd stay in the
>group, but register and unregister for the bcasts from server.  That
>is, message flow control, and persistence becomes the server's
>problem, not isis's.

I don't know; seems like it would be more in the spirit of a toolkit to
provide the servers with a cleaner warning that the client wants to do
this and a way to spool data conveniently.  But, you might be right.  After
all, your company would be a typical user of this sort of facility,
so if you don't see it as a necessary add-on...

rich@sendai.sendai.ann-arbor.mi.us (K. Richard Magill) (03/24/90)

In article <38995@cornell.UUCP> ken@gvax.cs.cornell.edu (Ken Birman) writes:

   > From: rich@sendai.ann-arbor.mi.us (K. Richard Magill) (via email)
   >   From: ken@gvax.cs.cornell.edu (Ken Birman)
   >   Newsgroups: comp.sys.isis
   >   Date: 22 Mar 90 15:50:46 GMT
   >   Reply-To: ken@gvax.cs.cornell.edu (Ken Birman)

   >   Robert Cooper has an idea for a very fancy mechanism that would let
   >   a client (any client) drop "offline" for a while and then come back...

   >I know precisely the problem here and I'd be sorely disappointed to
   >see you spend the time to solve it within isis rather than with isis.

   I don't know; seems like it would be more in the spirit of a
   toolkit to provide the servers with a cleaner warning that the
   client wants to do this and a way to spool data conveniently.  But,
   you might be right.  After all, your company would be a typical
   user of this sort of facility, so if you don't see it as a
   necessary add-on...

Hmm..  I suspect that no matter how you implement such a facility, it
would only be useful to me for a short while.  In any case, I might be
tempted to use it.  What I meant was that this is functionality that
can be provided at the user/above-isis level.  I'd rather see the isis
project apply their resources to the items already on their wishlist.
:-).  For instance, I think the scaling issue would be of use to more
potential users.