tim@capmkt.COM (Tim Edwards) (11/11/89)
How tight can you reasonably set the -A param to isis and the -f param for protos and still have everything work as expected? I have experimented with several values for both parameters, some of which partition the network when a site crashes and restarts (the crashed node comes up in its own partition, and doesn't know about the other nodes). The behaviour also doesn't seem to be consistent; sometimes the node will get joined back up with the rest of the pack and sometimes it won't.

So the question is: how tight can I set these parameters and still get reliable system behaviour? The goal here is to detect a failure as soon as possible while minimizing the "waiting X minutes to restart isis.." time. I am currently using -A1 and -f15, which doesn't seem to work very well. We are running ~10 nodes on our net. Any suggestions appreciated. Thanks.

Tim Edwards
capmkt!tim@uunet.uu.net
Capital Market Technology, Inc.
Berkeley, CA   (415) 540-6400
rcbc@honir.cs.cornell.edu (Robert Cooper) (11/13/89)
This concerns two parameters that you can change when setting up ISIS. The first is the "auto restart interval", specified by the -A option to the "isis" command. This controls how long ISIS must remain down at a site before restarting; it must be long enough that other ISIS sites have time to notice the failure before the site restarts.

Which brings us to the second parameter, the "failure detector timeout", settable by the -f option to the "protos" utility. This controls how long a site must be inoperative before another site will consider it to have failed. Clearly these two parameters must have mutually consistent values, and must also take into account how "slow" and "lossy" normal message traffic is in your environment.

We "recommend" the default values supplied with ISIS, namely an auto restart interval of 5 minutes and a failure timeout of 60 seconds, and we don't recommend changing them unless you have very specific requirements. Here at Cornell we often set the failure timeout to much less than 60 seconds to speed up testing of failure modes, but we normally leave the restart interval at five minutes. Basically, restarts shouldn't happen at all (a restart means that ISIS has crashed even though the machine hasn't). If ISIS does crash often, you should get back to us.

                                        -- Robert Cooper
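PS: For concreteness, the two values are supplied roughly like this. Everything on each command line other than the flag itself varies by installation, so it is elided here:

    isis -A5 ...        auto restart interval: 5 minutes (the default)
    protos -f60 ...     failure detector timeout: 60 seconds (the default)

Note that the defaults respect the constraint above: 5 minutes of enforced downtime leaves the 60-second failure detector plenty of time to fire first.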
ken@gvax.cs.cornell.edu (Ken Birman) (11/13/89)
I guess this -f option has people a bit confused. Robert's comments are correct, of course, but I wonder if it wouldn't also help to explain the sequencing of events controlled by these flags. Say that your value for the -f parameter is FTIMEOUT seconds and your -A value is ATIMEOUT minutes. Also, assume that sites A, B and C are running during this dialog:

1. A, B, C exchange messages per your code. If your code doesn't send any messages at all, A sends a message to B, B to C, and C to A every FTIMEOUT seconds (with -f10, for example, B would send to C every 10 seconds). The sites are organized as a ring: each site hears from the site on its left and sends to the site on its right with this frequency. It follows that B can monitor the status of A, and C can monitor the status of B.

   If X gets a message from Y and X has no message of its own to send back to Y within a short time, X sends an ACK-only message to Y; ACK-only messages are not themselves acknowledged. ISIS initially retransmits after 4 seconds, but it adapts this to the "average" delay before an ack is received: if the average measured delay before packets are acked is <davg> seconds, ISIS retransmits after <davg>+1 seconds, but never waits less than 2 seconds.

2. Now, say that B becomes unresponsive or crashes.

2.1 Soon after, say at time t0, A or C will try to send a message to B.

2.2 Not getting an ACK, the sender will retransmit this message at, say, t0+4 secs, t0+8 secs, etc. Say that A sent the message.

2.3a After retrying MAX_RET times (currently hardwired to 3), A logs the message

        "Transmitted same packet %d times, giving up (len %d)\n"

     and declares B to have failed. For the default case, this means that a site is declared down if it doesn't respond within 12 seconds after you send it a packet, but the value could drop as low as 6 seconds or rise much higher, depending on how sluggish the destination has been.

2.3b Alternatively, after not hearing from B for max(30, RTDELAY*FTIMEOUT/2) seconds, C declares B to have failed. (Recall from step 1 that C was expecting to hear from B periodically.) E.g., if the current adaptive retransmission delay RTDELAY is 4 seconds and you specified -f10, this rule kicks in after max(30, 4*10/2) = 30 seconds. For -f60, the default, it kicks in after max(30, 4*60/2) = 120 seconds. This time you get the logged message

        "Timeout: site %d/%d unresponsive for %d secs\n"

3. OK, so now C thinks B is down. Problem is, A doesn't know. So we run the "failure detection protocol", which does a sort of 2-phase commit. If no other site is down, this protocol runs essentially instantly. However, if other sites are also down, we might not notice the problem until now, forcing a further delay. E.g., in a larger system B and C could both fail, and although some site D would notice that C was down, since C was supposed to watch B we wouldn't find out that B was down until step 2.3a ran for the failure agreement protocol itself, slowing things down even more!

4. In the case where B was just running very slowly, it now finds out that someone decided it was down, prints the message "fd_iamdead: %d/%d told me to die", and shuts down.

5. If you specified the -A parameter, then after ATIMEOUT minutes (default = 5) the isis monitor program restarts the site and it should come up on its own.
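To pin down the arithmetic in 2.3a and 2.3b, here is a small C sketch. The function names are made up for illustration; this is just the timing rule as described above, not the actual ISIS source:

    /* Sketch of the two failure-detection timers described above.
       Names are illustrative, not taken from the ISIS sources. */
    #include <stdio.h>

    #define MAX_RET 3               /* retransmissions before giving up */

    /* Adaptive retransmission interval: average ack delay plus one
       second, but never less than 2 seconds. */
    int retrans_interval(int davg)
    {
        return (davg + 1 < 2) ? 2 : davg + 1;
    }

    /* Step 2.3a: give up after MAX_RET retransmissions. */
    int giveup_delay(int rtdelay)
    {
        return MAX_RET * rtdelay;
    }

    /* Step 2.3b: keepalive timeout, max(30, RTDELAY*FTIMEOUT/2). */
    int keepalive_timeout(int rtdelay, int ftimeout)
    {
        int t = rtdelay * ftimeout / 2;
        return (t < 30) ? 30 : t;
    }

    int main(void)
    {
        /* With the default 4-second retransmission delay: */
        printf("give up after %d secs\n", giveup_delay(4));           /* 12 */
        printf("-f10 keepalive %d secs\n", keepalive_timeout(4, 10)); /* 30 */
        printf("-f60 keepalive %d secs\n", keepalive_timeout(4, 60)); /* 120 */
        return 0;
    }

Note how giveup_delay(retrans_interval(1)) reproduces the 6-second floor mentioned in 2.3a.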
So, how long should you expect all this to take? Depends on whether someone was sending to the site when it crashed.

If so, a good guess is that the failure detector will run after about 12 seconds and the site will be declared down more or less immediately. If not, we'll notice the failure when the expected "keepalive" message fails to arrive: for a value like -f10, a single failure will be detected after about 30 seconds. For the default setting of FTIMEOUT = 60, the delay would again be something between 12 seconds and 2 minutes, and more for multiple failures.

We recommend that most sites use -f60 (the default) when using ISIS for actual monitoring or control of an application. The problem is that slow machines, NFS servers that hang while printing messages, YP servers that hang briefly, and even clock resets can all trigger cascades of failures if the parameters are set to detect failures quickly. For networks of very uniform machines that never hang due to NFS or YP problems, I guess one could run with a value like -f15 or -f10, but obviously this won't be true for a network of overloaded SUN 2 workstations...

Now, regarding one of the other issues that was raised, let me comment on network partitions. A network partition occurs if for some reason sites can't talk to each other. Say that B was temporarily unplugged from the net and that this is what triggered all the problems. If a large number of sites were up (> 3), ISIS kills off sites on the minority side of the partition, so B would crash in this case with the message:

    fd_localcommit: Possible partition, with this site being in minority partition

Otherwise, B might not be killed off and would just keep running as a partition with a single site in it. Even if B is killed off, after a while (when the ATIMEOUT interval expires) B will restart. At that point it won't find the other ISIS sites on the network, so it will come up by itself, again running as a partition with one site in it.

This last situation will be corrected in the version of ISIS that IDS will release sometime in 1990. Meanwhile, you just need to be aware that it represents a problem...

Please let us know if you observe event sequences that don't match this algorithm. My guess is that it "explains" all behaviors people have managed to get out of ISIS.

Ken
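PS: In rough C terms, the minority rule above amounts to the following test. The name and the exact handling of small systems and even splits are guesses from the behavior described, not the actual fd code:

    /* A site shuts itself down if the system is large enough (> 3
       sites) and its side of the partition holds a minority. */
    int in_minority_partition(int sites_on_my_side, int sites_total)
    {
        return sites_total > 3 && 2 * sites_on_my_side < sites_total;
    }

With 4 sites and B alone on its side, in_minority_partition(1, 4) is true and B is killed off; with only 3 sites the rule doesn't apply, and B just keeps running alone.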
ken@gvax.cs.cornell.edu (Ken Birman) (11/17/89)
In article <372@capmkt.COM> tim@capmkt.COM (Tim Edwards) writes:
>how tight can you reasonably set the -A param to isis and the -f param for
>protos and still have everything work as expected? .... (etc) ...

I actually don't know. I guess we need a simulator with which we could experiment a little. However, -f15 is probably much too small for what your group is doing.

To fill others in, Capital Market Technologies is using ISIS in a financial trading setting (or they will, if our faster broadcast is fast enough; I guess the one in ISIS V1.3 is a bit sluggish for this setting). They run on a mix of SUN 2, SUN 3 and Apple Mac II systems under A/UX, and last I heard they were planning a port of ISIS to the latter. On such a mix you see very long timeouts from the SUN 2 systems, which are old technology and short of memory. -f15 is very fast for such a setting, and in fact a previous person at CMT urged me strongly to support -f120! I guess -f15 might work for closely matched machines with lots of memory, though.

As for the -A value, I think -A1 or -A2 should be fine. In ISIS, if a site recovers unexpectedly fast, we just run the failure and recovery protocols both at once.

Now, this leads to the strange part. Tim mentions that he gets a lot of "partitioned" executions (e.g. site 3 gets killed off, then restarts quickly and comes up all by itself, not talking to sites 1, 2 and 4, which stayed operational the whole time). This is unusual, and it suggests that the root problem at CMT is actually that the network itself is flaky. We are working on ISIS and hope that by mid 1990 we will be able to release a version that runs right through partitions and heals itself automatically when the network recovers. The current version of ISIS (V1.3, and also V2.0 when we release it) won't do that, and hence doesn't "tolerate" partition failures. Neither does anything else you can buy or run...

My suggestion is that you start by uncovering the cause of these frequent communication problems: a gateway that crashes, someone kicking the wire on his/her SUN, or whatever. Maybe there is a hardware problem here that can be corrected? It certainly sounds like an abnormal situation.

If not, perhaps you can run two versions of ISIS, with different sites files, one on each "side" of the flaky line. You would need to partition your application itself, but the new long_haul facility (see man spool(3)) includes a number of facilities for this, and is being extended even as I type (Messac is adding a very fast uucp-style file copy, implementing long-haul cbcast and abcast protocols, and building several demo applications with them). For example, one of our users is running ISIS on LANs in Norway, Sweden, DC, San Diego, and elsewhere, and is using this approach to interconnect the LANs. But the application is physically partitioned as well; no process groups try to transparently span the long-haul lines or anything. Far in the future, ISIS might make all this transparent, but for short- and mid-term planning you need to design with partitions in mind.

Ken

PS: We have our V2.0 broadcast protocols running solidly now at Cornell, and the RPC timing looks quite good. We are still tuning, but should be able to post something on this shortly. The reason I mention this is to emphasize that our stress right now is on speed, with partitioning to be addressed in 1990 after V2.0 is out. This is probably the right order of priorities for CMT, since V1.3 is really too slow to use in a broadcast-intensive setting like a trading room.