mills@HUEY.UDEL.EDU.UUCP (11/24/86)
Folks,

I spent long hours this weekend trying to find and fix problems which were destabilizing NSFnet access via DCN-GATEWAY. I found that relatively obscure problems in FORDnet and UMDnet were causing earthquakes all over the system. Since these are examples of how relatively innocent misbehavior can have profound implications for Internet service everywhere, I am distributing the following saga to this list, in spite of the rather intricate and specialized technical details involved.

I spent most of Saturday digging into the U Maryland local net UMDnet via the UMD1 fuzzball and trying to explain why I couldn't complete a largish FTP transfer. I found that MIMSY, the "official" UMDnet gateway, was having trouble keeping its MILnet peering partner(s) up, with reachability cycling up and down every twenty minutes or so. This behavior was similar to that of our (U Delaware) gateway before we jacked up maxacquire from one to three in the Unix EGP gateway daemon and raised the number of our peering partners accordingly. We found that up/down cycles create nasty routing loops in the core gateway system, with many ICMP time-exceeded messages flying about, as well as hijacking system bandwidth for spasms of core-gateway update messages.

EGP reachability problems are immediately evident if you watch the hop counts for some time (which I happened to do with a fuzzball) and observe which ones are cycling. I found several nets that were doing that, which suggests the MIMSY problem may also be happening at the gateway(s) servicing these nets. It isn't clear why the problems are occurring at all, even with only one peering partner. Fuzzball gateway DCN-GATEWAY seems to have no trouble sustaining EGP reachability, which might indicate something bust in the Unix EGP code itself, or possibly an incompatibility between it and the core gateway code.

Another problem was found on the FORD1 host on FORDnet, which is also connected via DCN-GATEWAY.
Apparently, cables between its serial-line interfaces and modems were switched (for reasons unknown), which caused a routing loop on the access line from DCnet. The result was massive congestion on DCnet, through which traffic for a large portion of NSFnet and its clients passes, and service was badly degraded. The reason the loop occurred in the first place was that some FORDnet hosts have no routing algorithm and so require a handcrafted FORD1 routing table and dedicated interface. The problem was exacerbated because the line speed is relatively low (9600 bps), the TTL fields used by many hosts were large (255), and many hosts retransmitted before the TTLs had expired.

This lesson again points up the need for all hosts in the Internet to use realistic TTLs (values between 30 and 60 have been suggested), use conservative values for initial RTT estimates (values of at least 5 seconds have been suggested) and back off upon retransmission. It also points up the need for comprehensive self-configuration mechanisms, either in the form of a reachability protocol, routing algorithm or some other mechanism with sufficient functionality to deal with broken configurations. It suggests, too, that we should be exploring the fairness principles suggested by John Nagle and others, especially the Nagle Conjecture (a derivative of Murphy's Law): "If it can break, don't bet on it."

Finally, I blame myself for a bizarre behavior of some hosts speaking the Network Time Protocol (NTP). I found a wee beastie crawling deep inside the fuzzball NTP code which caused frequent clock discontinuities (time warps), rather than continuous slewing, in some neighbors. These neighbors, having reset their clocks, also reset their link-delay calculations, which are necessary to drive the routing algorithm. Eventually, the routing algorithm starved for lack of delay updates and declared the neighbor down.
This problem has occurred before with the fuzzball routing protocol, due to broken hardware and/or software, but only in local nets all speaking the same protocol. In the NTP case, further analysis disclosed the ominous fact that large chunks of Internet real estate can be chipped away by destabilizing local clocks (using NTP, UDP or whatever clock-synchronization mechanism is handy).

The fuzzball and Unix (Mike Petry) NTP implementations use recursive median filters to deglitch the synchronization mechanism (which is where the fuzzy bug was). These filters can be spoofed, intentionally or otherwise, to cause glitches to happen anyway.

Goodness, gracious, but our Internet is getting sophisticatedly sneaky.

Dave
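
As a rough illustration of the deglitching idea mentioned above, here is a minimal Python sketch. It is not the fuzzball or Unix NTP code, and it uses a plain sliding-window median rather than a recursive one. The class name, the window size and the sample offsets are all invented for the example. It shows a single bad sample being suppressed, and a sustained run of bad samples getting through anyway once they outnumber the good ones in the window.

```python
from collections import deque
from statistics import median


class MedianDeglitcher:
    """Sliding-window median over recent clock-offset samples.

    Hypothetical helper for illustration only; a simplified stand-in
    for the recursive median filters mentioned in the message.
    """

    def __init__(self, window=5):
        # Keep only the most recent `window` samples.
        self.samples = deque(maxlen=window)

    def update(self, offset_ms):
        """Record a new offset sample and return the filtered offset."""
        self.samples.append(offset_ms)
        return median(self.samples)


def run(offsets, window=5):
    """Feed a sequence of offsets through a fresh filter."""
    f = MedianDeglitcher(window)
    return [f.update(x) for x in offsets]


# One isolated glitch (500 ms) among good samples near 10 ms:
# the median never follows it.
single = run([10, 11, 9, 500, 10, 11, 10])

# Three bad samples in a row fill the majority of a 5-sample window,
# so the filtered output jumps to the bad value.
burst = run([10, 11, 9, 500, 500, 500, 10])

print(single)
print(burst)
```

The point of the second case is that a median filter only protects against outliers that stay in the minority; whoever can supply a majority of the samples in the window controls the output.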