Mills@UDEL.EDU.UUCP (03/28/87)
Folks, Yesterday I happened to watch the EGP tables in one gateway just after another gateway crashed. The intent was to see how long old information about nets serviced by that gateway persisted and whether naughty things might be happening as it dissipated in the various caches. The old information persisted well over an hour. Shocked out of my socks, I started digging further.

The gateway that crashed represented the only path to two innocent networks; however, reachability was being advertised via EGP to the core and also along a different path to the NSFNET swamps, a situation typical of many universities these days. Therefore, I looked for route loops, cache relatches and other phenomena characteristic of the algorithms used on these wetlands. I can't speak for all implementations and in fact can speak authoritatively only for the ones used by the fuzzballs, which are scattered all over the swamps.

The routing algorithms used by the fuzzballs (sometimes called "hello") are very careful about dissipating old information and avoiding loops, at least between neighbors. A two-minute hold-down interval is enforced once a previously reachable path goes down, during which reachability claims are ignored. This mechanism, used at various times in other algorithms, including the ARPANET's, is designed to avoid re-introduction of old, now-bogus information. The interval is selected to be at least as long as the maximum time needed to spread routing information throughout the system, which can be a pretty long time.

Obviously, in the present case we have to look beyond the fuzzballs to find where the old information is being stashed. Note that this can happen anywhere in the world: it takes only one rogue which, with an innocently funny idea of how long this information should live in its cache, re-introduces it into the system, whence it finds its way back to the fuzzballs, core gateways or whatever, and the whole process starts all over again. Why should we care about the problem? 
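The hold-down discipline described above can be sketched in a few lines. This is an illustrative Python sketch, not fuzzball code; the entry structure and names are assumptions, and only the two-minute interval comes from the text.

```python
HOLD_DOWN = 120.0  # seconds; the two-minute interval described above


class RouteEntry:
    """One destination's state, with a hold-down timer (illustrative)."""

    def __init__(self):
        self.reachable = False
        self.hold_down_until = 0.0  # ignore reachability claims until then

    def mark_down(self, now):
        """The path went down: start the hold-down interval."""
        self.reachable = False
        self.hold_down_until = now + HOLD_DOWN

    def offer_reachability(self, now):
        """A neighbor claims the destination is reachable again.
        Accept only if the hold-down has expired; otherwise the claim
        may be old, now-bogus information circulating back to us."""
        if now < self.hold_down_until:
            return False  # claim ignored during hold-down
        self.reachable = True
        return True
```

An implementation with no hold-down at all, as suspected below, would accept the stale claim immediately and re-inject it into its own advertisements.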
It is typical of such phenomena that route loops form; thus, traffic to the now-unreachable destinations must circulate somewhere until the TTL fields expire. The key indicator of that is the ICMP Time Exceeded message. If these are popping up at your host, the problem is also yours. My observations might suggest a hold-down should be on the order of an hour, but I don't think this is the case. More likely one or more implementations have no hold-down at all.

Lessons learned: Operations people (this includes the INOC, NSF and related monitoring centers) must do more than react to reported problems; they must go out looking for them. This comment is not meant in any way to detract from their excellent service and prompt response given for a long time, but might suggest additional, specialized staff could be necessary. The lookers might start by periodically rummaging over the data bases at strategic spots looking for excessive "churn" (entries changing very often) and inconsistencies. Occasional tests should be done involving a net being turned off, watching the systemic response and so forth. I've been doing these things informally for some years now and occasionally reporting the results to this list. Now the system has grown too big for me to do that.

The important implementor's lesson is that a coordinated hold-down (or equivalent) is absolutely necessary. My guess is that two minutes is not enough and maybe twice that is more appropriate. We also need to examine those cases where, for various reasons, information must be cached for much longer than this interval, such as when public networks are involved. One rule might be that, if you have to cache something for longer than the hold-down interval, you can't ever tell anybody else about it. And so forth. 
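The churn check suggested above is easy to automate. A minimal sketch, assuming routing-table snapshots are polled as dictionaries mapping destination to (next hop, metric); the function names and the flagging threshold are illustrative, not anything a real monitoring center runs.

```python
def churn(old_table, new_table):
    """Count entries that changed (or appeared) between two snapshots
    of a routing table, each mapping dest -> (next_hop, metric)."""
    return sum(1 for dest, route in new_table.items()
               if old_table.get(dest) != route)


def is_churning(old_table, new_table, threshold=0.2):
    """Flag a table for human inspection when more than some fraction
    of its entries changed between consecutive polls. The 20% default
    is purely illustrative."""
    if not new_table:
        return False
    return churn(old_table, new_table) / len(new_table) > threshold
```

Run against snapshots taken at strategic spots, a persistently high churn count for the same destinations is exactly the "entries changing very often" symptom worth chasing.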
As hinted recently, it might be time to re-examine the "wiretapping" issue, where a gateway observes an ICMP error message wandering by to a host on one of its connected networks and examines it for possible hints that might be useful in its forwarding and routing functions. A sufficient number of these for the same destination within some time should be grounds to declare the destination unreachable, thus avoiding needless congestion, looping and other antisocial behavior. Yes, I know the layer violations implicit in the above may drive many up the wall. Please show this message to them the next time their TCP/TELNET connection times out. At least two readers of this list will notice this could be the first toenail in the "fair-queueing" closet. Dave
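P.S. As a postscript, the wiretap heuristic above might look something like this sketch. The window and threshold values are pure assumptions (the text says only "a sufficient number ... within some time"), and no real gateway is claimed to work this way.

```python
from collections import defaultdict, deque

WINDOW = 60.0   # seconds over which sightings count; illustrative
THRESHOLD = 5   # Time Exceeded messages before giving up; illustrative


class IcmpWiretap:
    """A gateway counts ICMP Time Exceeded messages it sees passing by,
    per destination, and declares the destination unreachable once
    enough arrive within the window."""

    def __init__(self):
        self.sightings = defaultdict(deque)  # dest -> arrival times

    def observe_time_exceeded(self, dest, now):
        """Record one sighting; return True if dest should now be
        declared unreachable (the layer violation in question)."""
        q = self.sightings[dest]
        q.append(now)
        while q and now - q[0] > WINDOW:  # drop sightings outside window
            q.popleft()
        return len(q) >= THRESHOLD
```

Declaring the destination unreachable here short-circuits the loop: packets stop being forwarded into it, so they stop circulating until their TTLs expire.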