[mod.protocols.tcp-ip] Tsunami in the swamps

mills@HUEY.UDEL.EDU.UUCP (11/24/86)

Folks,

I spent long hours this weekend trying to find and fix problems which were
destabilizing NSFnet access via DCN-GATEWAY. I found that relatively obscure
problems in FORDnet and UMDnet were causing earthquakes all over the system.
Since these are examples of how relatively innocent misbehavior can have
profound implications on Internet service everywhere, I am distributing the
following saga to this list, in spite of the rather intricate and specialized
technical details involved.

I spent most of Saturday digging into the U Maryland local net UMDnet via the
UMD1 fuzzball and trying to explain why I couldn't complete a largish FTP
transfer. I found that MIMSY, the "official" UMDnet gateway, was having
trouble keeping its MILnet peering partner(s) up and cycling up/down every
twenty minutes or so. This behavior was similar to that of our (U Delaware)
gateway before we jacked up maxacquire from one to three in the Unix EGP
gateway daemon and upped our peering partners accordingly. We found that
up/down cycles create nasty routing loops in the core gateway system, with
many ICMP time-exceeded messages flying about, as well as hijacking system
bandwidth for spasms of core-gateway update messages.

EGP reachability problems are immediately evident by watching the hop counts
(which I happened to do with a fuzzball) for some time and observing which
ones are cycling. I found several nets that were doing that, which suggests
the MIMSY problem may be happening at the gateway(s) servicing these nets. It
isn't clear why the problems are occurring at all, even with only one peering
partner. Fuzzball gateway DCN-GATEWAY seems to have no trouble sustaining EGP
reachability, which might indicate something bust in the Unix EGP code itself,
or possibly an incompatibility between it and the core gateway code.
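
For those who want to automate the hop-count watching, here is a minimal
sketch in Python of what "observing which ones are cycling" amounts to. The
names (count_transitions, cycling_nets), the sample data and the threshold of
four transitions are all made up for illustration, and it assumes that a
distance of 255 means "unreachable"; it is not what the fuzzball actually
runs.

```python
UNREACHABLE = 255  # assumption: this distance value means "net unreachable"


def count_transitions(samples, unreachable=UNREACHABLE):
    """Count how often a net flips between reachable and unreachable."""
    transitions = 0
    for prev, cur in zip(samples, samples[1:]):
        if (prev >= unreachable) != (cur >= unreachable):
            transitions += 1
    return transitions


def cycling_nets(history, min_transitions=4):
    """Return the nets whose hop counts flipped at least min_transitions times.

    history maps a (hypothetical) net name to its hop counts, oldest first.
    """
    return sorted(
        net
        for net, samples in history.items()
        if count_transitions(samples) >= min_transitions
    )


if __name__ == "__main__":
    # Hypothetical samples taken once per polling interval.
    history = {
        "net-a": [3, 255, 3, 255, 3],  # cycling up/down
        "net-b": [2, 2, 2, 2, 2],      # stable
    }
    print(cycling_nets(history))
```

A net that merely went down once and stayed down shows a single transition
and is not flagged; only repeated flips count as cycling.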

Another problem was found on the FORD1 host on FORDnet, which is also
connected via DCN-GATEWAY. Apparently, cables between its serial-line
interfaces and modems were switched (for reasons unknown), which caused a
routing loop on the access line from DCnet. The result was massive congestion
on DCnet, through which traffic for a large portion of NSFnet and its clients
passes, and service was badly degraded. The reason the loop occurred in the first
place was that some FORDnet hosts have no routing algorithm and so require a
handcrafted FORD1 routing table and dedicated interface. The problem was
exacerbated because the line speed is relatively low (9600 bps), the TTL
fields used by many hosts were large (255) and many hosts retransmitted before
the TTLs had expired.
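
To put rough numbers on why a large TTL hurts so much on a slow line, here
is a back-of-the-envelope sketch in Python. The 576-byte datagram size is an
assumption for illustration (I did not measure packet sizes), as is the
simplification that the TTL drops by exactly one per hop and that every hop
crosses the 9600-bps line.

```python
def line_seconds_per_packet(ttl, packet_bytes, line_bps):
    """Worst-case line time consumed by one packet caught in a loop.

    Assumes the TTL is decremented by one per hop and that every hop
    crosses the slow line, so the packet is transmitted ttl times.
    """
    seconds_per_crossing = packet_bytes * 8 / line_bps
    return ttl * seconds_per_crossing


if __name__ == "__main__":
    # Compare the large TTL seen on FORDnet with the suggested range.
    for ttl in (255, 60, 30):
        print(ttl, round(line_seconds_per_packet(ttl, 576, 9600), 1))
```

Under those assumptions a single looping datagram can tie up about two
minutes of line time at TTL 255, against about 14 seconds at TTL 30, and
every retransmission launched in the meantime adds another such packet.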

This lesson again points up the need for all hosts in the Internet to use
realistic TTLs (values between 30 and 60 have been suggested), use
conservative values for initial RTT estimates (values of at least 5 seconds have
been suggested) and back off upon retransmission. It also points up the need
for comprehensive self-configuration mechanisms, either in the form of a
reachability protocol, routing algorithm or some other mechanism with
sufficient functionality to deal with broken configurations. Finally, it
suggests we should be exploring the fairness principles suggested by John
Nagle and others, especially the Nagle Conjecture (a derivative of Murphy's
Law): "If it can break, don't bet on it."
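
As an illustration of the retransmission advice, here is a minimal sketch in
Python of a timer that starts from a conservative RTT estimate and backs off
on each timeout. The class name, the smoothing constants (alpha, beta, in the
style of the RFC 793 smoothed-RTT calculation), the doubling backoff and the
60-second ceiling are my assumptions for the example, not a description of
any particular host's code.

```python
class RetransmitTimer:
    """Smoothed-RTT retransmission timer with exponential backoff (sketch)."""

    def __init__(self, initial_rtt=5.0, alpha=0.875, beta=2.0, max_rto=60.0):
        self.srtt = initial_rtt  # conservative initial RTT estimate, seconds
        self.alpha = alpha       # smoothing gain (assumed value)
        self.beta = beta         # RTO = beta * SRTT (assumed value)
        self.max_rto = max_rto   # upper bound on the timeout (assumed value)
        self.backoff = 1         # multiplier, doubled on every timeout

    def rto(self):
        """Current retransmission timeout in seconds."""
        return min(self.max_rto, self.beta * self.srtt * self.backoff)

    def on_timeout(self):
        """No acknowledgment arrived in time: back off before trying again."""
        self.backoff *= 2

    def on_ack(self, measured_rtt):
        """Fold a fresh RTT measurement into the estimate and reset backoff."""
        self.srtt = self.alpha * self.srtt + (1 - self.alpha) * measured_rtt
        self.backoff = 1


if __name__ == "__main__":
    timer = RetransmitTimer()
    print(timer.rto())   # first timeout, before any measurement
    timer.on_timeout()
    print(timer.rto())   # doubled after one timeout
    timer.on_ack(1.0)
    print(timer.rto())   # eases down as real measurements arrive
```

The point of the large starting value is that a host which has not yet
measured the path waits rather than guessing low and flooding a slow line
with duplicates.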

Finally, I blame myself for a bizarre behavior of some hosts speaking the
Network Time Protocol (NTP). I found a wee beastie crawling deep inside the
fuzzball NTP code which caused frequent clock discontinuities (time warps),
rather than continuous slewing, in some neighbors. These neighbors, having
reset their clocks, also reset their link-delay calculations, which are
necessary to drive the routing algorithm. Eventually, the routing algorithm
starved for lack of delay updates and declared the neighbor down. Previously,
this problem has occurred with the fuzzball routing protocol due to broken
hardware and/or software, but only in local nets all speaking the same
protocol.

In the NTP case, further analysis disclosed the ominous fact that large chunks
of Internet real estate can be chipped away by destabilizing local clocks
(using NTP, UDP or whatever clock-synchronization mechanism is handy). The
fuzzball and Unix (Mike Petry) NTP implementations use recursive median
filters to deglitch the synchronization mechanism (which is where the fuzzy
bug was). These filters can be spoofed, intentionally or otherwise, to cause
glitches to happen anyway. Goodness, gracious, but our Internet is getting
sophisticatedly sneaky.
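
To show the flavor of the spoofing problem, here is a minimal sketch in
Python using a plain sliding-window median. This is a simplification I chose
for illustration, not the recursive median filter in the fuzzball or Unix
code, and the window size of five and the sample offsets are made up.

```python
from collections import deque
from statistics import median


class MedianFilter:
    """Sliding-window median over the most recent clock-offset samples."""

    def __init__(self, size=5):
        self.window = deque(maxlen=size)

    def update(self, offset):
        """Add one offset sample (seconds) and return the filtered value."""
        self.window.append(offset)
        return median(self.window)


def run(samples, size=5):
    """Feed samples through a fresh filter and return every output."""
    filt = MedianFilter(size)
    return [filt.update(s) for s in samples]


if __name__ == "__main__":
    # One wild sample among good ones is rejected by the median.
    print(run([0.01, 0.02, 9.0, 0.01, 0.02]))
    # Once bad samples are the majority of the window, the glitch gets out.
    print(run([0.01, 0.02, 9.0, 0.01, 0.02, 9.0, 9.0]))
```

A single glitch never reaches the output, but as soon as bad samples hold a
majority of the window the filter passes the bogus offset along, and the
clock takes the time warp anyway.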

Dave
-------