Mills@UDEL.EDU (06/30/87)
Folks,

Alerted by an increasing incidence of timewarps (unexpectedly large excursions in measured time discrepancies between Internet time servers), I found that the core gateway system seems to have lost its marbles. Faithful old fuzzsoldier macom1 (10.0.0.111) squawks EGP with the purdue, bbn2 and isi gateways, so presumably those gateways should have a reasonably consistent and stable routing data base. Following is a sample of the routing tables in these three gateways:

purdue  UP: macom1 10.0.0.111 (Arpanet) (EGP or indirect via EGP)
        upenn      128.91     4 hops (ext) via macom1 10.0.0.111 (Arpanet)
        Washington 192.5.8    1 hop  (ext) via macom1 10.0.0.111 (Arpanet)
        cpw-psc    192.5.146  4 hops (ext) via macom1 10.0.0.111 (Arpanet)

bbn2    UP: macom1 10.0.0.111 (Arpanet) (EGP or indirect via EGP)
        upenn      128.91     4 hops (ext) via macom1 10.0.0.111 (Arpanet)
        cpw-psc    192.5.146  4 hops (ext) via macom1 10.0.0.111 (Arpanet)

isi     UP: macom1 10.0.0.111 (Arpanet) (EGP or indirect via EGP)
        Washington 192.5.8    1 hop  (ext) via macom1 10.0.0.111 (Arpanet)
        cpw-psc    192.5.146  4 hops (ext) via macom1 10.0.0.111 (Arpanet)

Network Washington (since retired, now macomnet) is directly connected to macom1, so it should indicate 1 hop, while networks upenn and cpw-psc are indirectly connected via NSFNET and should indicate 4 hops. All three of these networks should be reachable via macom1; however, the above data indicate that some of the networks are known to only some of the gateways. Clearly the core gateways do not have consistent routing tables.

This would not be so bad if the tables were stable and did not oscillate with time. Unfortunately, the tables are bobbing like corks, as shown by the following sample made a few minutes after the previous one:

purdue  UP: macom1 10.0.0.111 (Arpanet) (EGP or indirect via EGP)
        upenn      128.91     4 hops (ext) via macom1 10.0.0.111 (Arpanet)
        cpw-psc    192.5.146  4 hops (ext) via macom1 10.0.0.111 (Arpanet)

bbn2    UP: macom1 10.0.0.111 (Arpanet) (EGP or indirect via EGP)
        upenn      128.91     4 hops (ext) via macom1 10.0.0.111 (Arpanet)
        Washington 192.5.8    1 hop  (ext) via macom1 10.0.0.111 (Arpanet)
        cpw-psc    192.5.146  4 hops (ext) via macom1 10.0.0.111 (Arpanet)

isi     UP: macom1 10.0.0.111 (Arpanet) (EGP or indirect via EGP)
        cpw-psc    192.5.146  4 hops (ext) via macom1 10.0.0.111 (Arpanet)

Gateway macom1 happens to be the primary entry point for macomnet (192.5.8), so one might ask how the core system thinks it might be reached. Alas, the three gateways have curious, inconsistent ways of reaching it, some quite bizarre:

purdue  192.5.8  1 hop  (ext) via macom1      10.0.0.111 (Arpanet)
bbn2    192.5.8  4 hops (ext) via psc.psc.edu 10.4.0.14  (Arpanet)
isi     192.5.8  1 hop  (ext) via macom1      10.0.0.111 (Arpanet)

Well, even that might not break the bank, except for the fact that the tables are oscillating like crazy. Here is a sample recorded a few minutes after the above:

purdue  192.5.8  2 hops (ext) via ISI    10.3.0.27  (Arpanet)
bbn2    192.5.8  1 hop  (ext) via macom1 10.0.0.111 (Arpanet)
isi     192.5.8  2 hops (ext) via BBN2   10.7.0.63  (Arpanet)

The last time I saw this behavior, bugs were found in the core gateway code and subsequently fixed. The present behavior is at least as bad as any I have ever seen and suggests very serious instabilities have recurred. In fact, the logs on macom1 and other fuzzballs I can reach out and touch indicate intermittent loss of connectivity and general flakiness consistent with the above observations.
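[A minimal sketch of the consistency check implied by the tables above: given per-gateway routing snapshots, flag networks that only some gateways know about and entries that change between samples. The data structures and the parsed values here are hypothetical stand-ins for whatever the actual gateway dumps contain; this is not any of the original tools.]

    # Hypothetical parsed snapshots: gateway -> {network: (hops, next_hop)}
    snapshot_1 = {
        "purdue": {"128.91": (4, "macom1"), "192.5.8": (1, "macom1"), "192.5.146": (4, "macom1")},
        "bbn2":   {"128.91": (4, "macom1"), "192.5.146": (4, "macom1")},
        "isi":    {"192.5.8": (1, "macom1"), "192.5.146": (4, "macom1")},
    }
    snapshot_2 = {
        "purdue": {"128.91": (4, "macom1"), "192.5.146": (4, "macom1")},
        "bbn2":   {"128.91": (4, "macom1"), "192.5.8": (1, "macom1"), "192.5.146": (4, "macom1")},
        "isi":    {"192.5.146": (4, "macom1")},
    }

    def missing_entries(snapshot):
        """Networks present in some gateways' tables but absent from others."""
        all_nets = set()
        for table in snapshot.values():
            all_nets |= table.keys()
        return {gw: sorted(all_nets - table.keys()) for gw, table in snapshot.items()}

    def churn(snap_a, snap_b):
        """Entries that appeared, vanished, or changed between two snapshots."""
        changes = []
        for gw in snap_a:
            nets = set(snap_a[gw]) | set(snap_b.get(gw, {}))
            for net in sorted(nets):
                before = snap_a[gw].get(net)
                after = snap_b.get(gw, {}).get(net)
                if before != after:
                    changes.append((gw, net, before, after))
        return changes

    print(missing_entries(snapshot_1))    # e.g. bbn2 has no entry for 192.5.8
    print(churn(snapshot_1, snapshot_2))  # entries bobbing between samples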
To add to the fun above, I discovered that gateway ngp.utexas.edu (10.0.0.62) is mercilessly beating on macom1 as the gateway to host brl-aos (192.5.22.82), probably the domain nameserver there. Poor macom1 has been returning several thousand ICMP Redirect messages to what it thinks is a broken host (how would it know differently?). Gateway ngp.utexas.edu presumably discards redirects as messages from the Devil. The routing data base of all three core EGP speakers, not to mention the poor fuzzballs, clearly states that network 192.5.22 is reachable only via the ARPANET/MILNET gateways, not macom1. I can't raise the ngp keepers on hf, vhf or space relay to correct this nonsense. Can somebody drop something heavy on their heads?

Finally, it would be fun to find out why so much traffic is sloshing by macom1 for the NSFNET swamps, in spite of the fact that very little traffic is destined for the nets it advertises. I suspect some j-random ARPANET hosts have a very sticky gateway table that locks on to a gateway that happens to volunteer reachability when times are bad and doesn't have sense enough to backtrack to the good guys when times are good. This isn't necessarily a fault on the part of the hosts; however, an EGP gateway may not know that, although it can reach a network, other EGP gateways are a better choice. More thought needed on that. Comments from the gateway crew at BBN would be highly prized.

Dave
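[A purely illustrative sketch of the "sticky gateway table" failure mode suspected above: a host fails over to whichever gateway volunteers reachability when its first choice looks dead, never backtracks when the first choice recovers, and ignores ICMP Redirects. This is not any actual host's code; the gateway names are placeholders.]

    class StickyHost:
        def __init__(self, gateways):
            self.gateways = list(gateways)   # in order of preference
            self.current = gateways[0]

        def gateway_appears_dead(self, gw):
            """Fail over to the next gateway in the list -- and stay there."""
            if gw == self.current:
                idx = self.gateways.index(gw)
                self.current = self.gateways[(idx + 1) % len(self.gateways)]

        def gateway_recovered(self, gw):
            """A well-behaved host would re-prefer gw here; this one does nothing."""
            pass

        def icmp_redirect(self, better_gw):
            """A well-behaved host would switch to better_gw; this one discards it."""
            pass

    host = StickyHost(["preferred-gw", "macom1"])
    host.gateway_appears_dead("preferred-gw")  # times are bad: lock on to macom1
    host.gateway_recovered("preferred-gw")     # times are good again...
    print(host.current)                        # ...but the host is still stuck on macom1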
peter%gr@CS.UTAH.EDU (Peter S. Ford) (07/01/87)
This may or may not be related, but on 24 June at 1800 MDT we started to see packets routed through utah-arpa-gw.arpa (CISCO -- 10.3.0.4) with

    src IP addr: 128.103.1.54   (husc4.harvard.edu)
    dst IP addr: 128.84.252.18  (cornelld.tn.cornell.edu)

To our knowledge the CISCO only advertises reachability to nets 128.110 and 192.12.56. This traffic eventually died off. Must be some swamp out there.

Peter Ford
U of Utah CS department
peter@cs.utah.edu
Mills@UDEL.EDU (07/03/87)
Peter,

Welcome to the swamps. If your gateway is peeping EGP with the core, you will see a lot of this. The stuff I see sloshing by linkabit-gw is truly awesome and includes many instances like yours. Even after I disabled all reachability to all nets from that gateway for several hours, various hosts and gateways continued to route traffic for j-random nets via it. Not knowing whether the senders were hosts or gateways, linkabit-gw spat back redirects, but few seemed to believe them (probably gateways masked in host clothing). In spite of the fact that the core gateways (for the moment) correctly reveal the routes for that gateway, about two packets per second, averaged throughout the day, are being routed incorrectly to linkabit-gw. See what you might be in for?

Dave
brescia@CCV.BBN.COM (Mike Brescia) (07/03/87)
Beginning sometime on Tuesday, an apparent (very real, in fact) meltdown of routing seems to have been caused by:

 1. frequent making and breaking of EGP neighbor connections, due to
 2. frangible EGP neighbor-alive procedures or parameters, shattered by
 3. long queues (or short queues and many dropped packets) on interfaces connected to the ARPANET, because of
 4. changes in ARPANET topology and delay characteristics.

Each change in neighborliness leads to routing changes in EGP, which propagate at one hop per 2(?) minutes, and in GGP (the core routing) causes a burgeoning of traffic as the protocol scurries to get the routing settled again.

The core gateways were being rejected (EGP Cease message sent) by many of the neighbors on the ARPANET, and then later acquired again. This may have been caused by long queues on the sending end toward the core gateway, or by the long queues frequently observed in the core gateways.

Thursday afternoon, the ARPANET problem was cleared up. I think the routing explosion has settled to a dull roar after that. I expect that the ARPANET analysts will have a description of the problem and its solution after the holiday.

Mike Brescia
(Gateway group)
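[A back-of-the-envelope sketch of the settling-time argument above: if a reachability change propagates roughly one hop per EGP update interval (about 2 minutes, per the "2(?)" hedge), then a route N hops out needs about 2*N quiet minutes before it settles, and neighbor connections flapping more often than that keep the tables in perpetual motion. The interval and hop counts here are illustrative assumptions, not measurements.]

    UPDATE_INTERVAL_MIN = 2.0   # assumed EGP propagation per hop, in minutes

    def settle_time_minutes(hops):
        """Rough time for one reachability change to propagate `hops` hops."""
        return hops * UPDATE_INTERVAL_MIN

    for hops in (1, 2, 4):
        print(f"{hops} hop(s): ~{settle_time_minutes(hops):.0f} minutes to settle")

    # If neighbor associations are made and broken every few minutes, a 4-hop
    # route never gets its ~8 quiet minutes, so the tables keep bobbing like corks.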
Mills@UDEL.EDU (07/04/87)
Mike,

I'm not sure I agree with your comment that the non-core EGP gateways are terminating associations with the core gateways; in fact, I believe it's the other way around. Following is a summary of data collected over a 36-hour interval from old fuzzball EGP slugger dcn-gw, which ordinarily sustains associations with all three core EGP speakers ISI [10.3.0.27], BBN2 [10.7.0.63] and PURDUE [10.2.0.37]. As specified in the initial dialog, the hello interval is 60 seconds, while the poll interval is 180 seconds.

The table below shows the events and actions of the state machine for each peer as specified in RFC-904. A Cease event represents a termination on the part of the core gateway, while a Cease action represents a termination on the part of dcn-gw, almost always as the result of an extended period when no messages whatsoever have been received. As can be seen, the core gateway terminates the association between two and four times as often as dcn-gw.

Hello interval 60   Poll interval 180

Neighbor ->       [10.3.0.27]      [10.7.0.63]      [10.2.0.37]
Tally            Event  Action    Event  Action    Event  Action
-----------------------------------------------------------------
Request              0      65        0       8        0      36
Confirm             29       0        5       0       19       0
Refuse               2       0        0       0        0       0
Cease               26       9        3       1       16       4
Cease-ack            4      26        1       3        2      16
Hello                0    2172        0    2284        0    2232
I-H-U             1568       0     1904       0     1735       0
Poll               699     609      829     741      750     667
Update             532     669      656     824      604     736
Down               221               52              123
Bad sequence       151              129              148

Note the rather high incidence of Down events. Using the j,k parameters suggested in RFC-904 and the 60-second hello interval, a Down event would occur if three out of four reachability indications during the last four minutes were lost. This sounds rather extreme. Note also the rather high incidence of Bad sequence events, which occur when a reply to a hello message has an incorrect sequence number and is discarded. There is a strong argument for ignoring the sequence-number check, since the order of reachability events is seldom meaningful. It may be useful in the present regime of positive network void coefficients to do that to avoid further meltdown.

Dave
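[A minimal sketch of the reachability bookkeeping Dave refers to, under the parameters stated above: a 60-second hello interval and a Down declaration when three of the last four reachability indications are lost. The window size and threshold come from the text; everything else (names, structure) is an illustrative assumption, not the actual fuzzball or core gateway code.]

    from collections import deque

    class NeighborReachability:
        def __init__(self, window=4, min_heard=2):
            # Keep the outcome (heard / not heard) of the last `window` hello
            # intervals; the neighbor stays Up only if we heard from it in at
            # least `min_heard` of them, i.e. Down when 3 of 4 are lost.
            self.history = deque([True] * window, maxlen=window)
            self.min_heard = min_heard
            self.up = True

        def hello_interval_elapsed(self, heard_i_h_u):
            """Record one 60-second interval; heard_i_h_u is True if an I-H-U
            (or other reachability indication) arrived during that interval."""
            self.history.append(heard_i_h_u)
            was_up = self.up
            self.up = sum(self.history) >= self.min_heard
            if was_up and not self.up:
                print("Down event: 3 of the last 4 indications lost")

    peer = NeighborReachability()
    for heard in (True, False, False, False):   # three consecutive losses
        peer.hello_interval_elapsed(heard)      # Down fires on the third loss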