[comp.protocols.tcp-ip] EGP and ROUTING

brescia@PARK-STREET.BBN.COM (Mike Brescia) (12/21/88)
People,

Various different-sounding routing complaints have been coming in via the
egp-people mailing list, the tcp-ip mailing list, the gated-people mailing
list, private mail, and telephone.  Some messages are extracted at the end.

Problems reported:

1. My host X cannot get to host Y.

2. My gateway X has no route for net Y.

3. My gateway X cannot run its egp with core server S (or T, or U).

4. My gateway X runs egp, and gets no routing info (NR messages) from core.

5. My gateway X runs egp, and gets partially garbled routes from core.

Some explanations:

1. is the simplest if the person at host X can report the results of
   "netstat -r" and point out the default or other gateways used to get to the
   net where host Y sits; conversely, I hope that X could call Y and ask the
   same sort of questions for the return path.  If X and Y cannot communicate,
   then we often need to figure out whether the problem is on the path from X
   to Y or the reverse path back from Y to X.

   Given the hosts are O.K., the question recurses to one of the gateways
   involved in problems 2-5.

2. This has to be broken down: run some EGP trace logs on your gateway X to
   see whether it is problem 3, 4, or 5.

3. Growth!  Some of the core gateways were oversubscribed, and the total
   neighbor space (peer slots?) available, especially on milnet, was too
   small.

   The fix here was to, yet again, squeeze more net and neighbor space into
   the LSI11 core gateways.  Steve Atlas has been working hard to maintain
   these bears.  The 3 egp core servers on the Arpanet, and 2 out of the 3 on
   the Milnet, have been upgraded today.

4. Growth!  Some versions of EGP have suffered from the growing number of nets
   reported in the NR messages, when the reassembled packet size crept over
   2K bytes.  Some were able to recompile with the EGPPACKETSIZE constant set
   larger, like 4K.  Some noted that not all the modules needed were
   recompiled by the normal 'make' rules, and recommended recompiling the
   whole EGPUP program.

   Some sites run 4.3bsd unix, and were able to incorporate fixes mentioned on
   the egp-people list advising how to use the 'setsockopt' system call to
   assign more buffering to the egp connection, so that fragments of large
   packets could be reassembled and delivered to the egp process.

   Some sites run the 'gated' version of EGP, and get some great support from
   the people at Cornell (gated-people-request@devvax.tn.cornell.edu).

5. Growth! and a bug in the LSI11 egp code.  The bug was introduced in the
   version that began handling more than 256 nets.  It caused the info in the
   NR message to be sent with the distances no longer in ascending order, so
   that more than 255 distances were reported for a single neighbor, and the
   code tried to stuff that number (e.g. 264) into an 8 bit field.

   This afternoon, Tuesday 12/20, the fix for this has been put in the 5 egp
   servers that have been reloaded so far (mentioned in 3 above).
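
As a rough illustration of the buffering fix mentioned under 4: the sketch
below, in Python rather than the C that the 4.3bsd sites would have used, asks
the kernel for a larger receive buffer with setsockopt and reads back what was
granted.  The SO_RCVBUF option, the 16K figure, and the helper name
grow_receive_buffer are assumptions made for this example; the fixes posted to
egp-people are not reproduced here.  A UDP socket stands in for the raw EGP
socket, which normally needs root to open.

```python
import socket


def grow_receive_buffer(sock, wanted_bytes):
    """Request a larger receive buffer; return the size the kernel reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, wanted_bytes)
    # The kernel may round the request up or cap it, so read the value back.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


# Stand-in socket (assumption): a real EGP listener would use a raw IP socket.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
granted = grow_receive_buffer(sock, 16 * 1024)
sock.close()
```

Reading the value back matters because the request is only advisory: if the
granted size is still smaller than the reassembled NR message, the fragments
will be dropped before the EGP process ever sees them.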

Keep those packet dumps coming.

Mike Brescia
BBNCC Gateway Development Group
800-492-4992 (or 617-873-3662)

------------------- some forwarded messages, excerpted ---------------

Date:     Sat, 17 Dec 88 1:04:11 EST
From: Tim Smith (USNA|tcs) <tsmith@BRL.MIL>
To: control@bbn.com, tcp-ip@sri-nic.ARPA, gated-people@devvax.TN.CORNELL.EDU
Subject:  core routing capacity exceeded?
Message-Id:  <8812170104.aa18644@SEM.BRL.MIL>

Morning all,

I have been experiencing a bit of trouble acquiring routing
information from the core gateways over the last few days. We use
gated (version 1.3.1.36) to speak EGP to the core gateways and have
noticed that gated has not been providing nearly as good routing
information as it usually does; we have been losing contact with the
core gateways and gated has been mysteriously dying. 

I turned on tracing and came across the following:
[...]
EGP RECV 26.1.0.65 -> 26.7.0.102 Sat Dec 17 00:10:21 1988
vers 2, type ACQUIRE(3), code REFUSE(2), status INSUFFICIENT RESOURCES(3), 
AS# 1, id 1

EGP RECV 26.1.0.40 -> 26.7.0.102 Sat Dec 17 00:10:21 1988
vers 2, type ACQUIRE(3), code REFUSE(2), status INSUFFICIENT RESOURCES(3), 
AS# 2049, id 1

EGP RECV 26.3.0.75 -> 26.7.0.102 Sat Dec 17 00:10:21 1988
vers 2, type ACQUIRE(3), code REFUSE(2), status INSUFFICIENT RESOURCES(3), 
AS# 1, id 1

Is it possible that the routing tables have grown too large and
exceeded the core's capacity? What other reasons are there for the
insufficient resources message?
[...]
What does everyone else think?

	Tim Smith -[hp]ostmaster and general network person
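
The fields in the trace lines above (vers, type, code, status, AS#, id) match
the fixed ten-byte header that begins every EGP message.  The sketch below
decodes that header, assuming the RFC 904 layout of four one-byte fields
followed by three 16-bit fields (checksum, autonomous system number, sequence
number).  Only the values ACQUIRE(3), REFUSE(2) and INSUFFICIENT RESOURCES(3)
are confirmed by the trace; the other table entries are taken from RFC 904 and
should be checked against it.  The function name decode_egp_header is
hypothetical and is not part of gated.

```python
import struct

# Type and code names; only ACQUIRE(3) and REFUSE(2) appear in the trace.
EGP_TYPES = {1: "UPDATE", 2: "POLL", 3: "ACQUIRE", 5: "REACHABILITY", 8: "ERROR"}
ACQUIRE_CODES = {0: "REQUEST", 1: "CONFIRM", 2: "REFUSE", 3: "CEASE", 4: "CEASE-ACK"}
ACQUIRE_STATUS = {3: "INSUFFICIENT RESOURCES"}


def decode_egp_header(packet):
    """Decode the fixed EGP header (assumed RFC 904 layout) into a dict."""
    if len(packet) < 10:
        raise ValueError("EGP header needs at least 10 bytes")
    vers, typ, code, status, cksum, asn, seq = struct.unpack("!BBBBHHH", packet[:10])
    is_acquire = typ == 3
    return {
        "vers": vers,
        "type": EGP_TYPES.get(typ, str(typ)),
        # Code and status names are only known here for acquisition messages.
        "code": ACQUIRE_CODES.get(code, str(code)) if is_acquire else str(code),
        "status": ACQUIRE_STATUS.get(status, str(status)) if is_acquire else str(status),
        "checksum": cksum,
        "as": asn,
        "id": seq,
    }


# Rebuild the second refusal from the trace (the checksum is left as zero).
refusal = decode_egp_header(struct.pack("!BBBBHHH", 2, 3, 2, 3, 0, 2049, 1))
```

A refusal with status 3 arrives before any routing data is exchanged, so it
points at the peer running out of neighbor slots rather than at anything in
the local routing table.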

------------------- some forwarded messages, excerpted ---------------

Return-Path: <cal@okc-unix.ARPA>
Message-Id: <8812192039.AA15146@okc-unix.ARPA>
Date: Mon Dec 19 14:39:15 1988
From: cal@okc-unix.ARPA (Charles Leach)
Subject: EGP Sick?
To: egp-people@bbn.com

For the past week or so, EGP has been very intermittent in acquiring
routes. Is there any reason for this behavior or is it virus/worm
fallout that we can come to expect?

charles..

------------------- some forwarded messages, excerpted ---------------

To: cal@okc-unix.ARPA (Charles Leach)
cc: egp-people
Subject: Re: EGP Sick? 
In-reply-to: Your message of Mon, 19 Dec 00 19:88:15 +0000.
             <8812192039.AA15146@okc-unix.ARPA> 
Date: Mon, 19 Dec 88 16:50:56 -0500
From: Mike Brescia <brescia@park-street>

     For the past week or so, EGP has been very intermittent in acquiring
     routes. Is there any reason for this behavior ...

Two factors here.

1. the size of the Net Reachability message sent by the core is growing.  If
your host kernel cannot reassemble and deliver, or your EGP cannot receive
packets much larger than 2K (recommend 4K), you will probably see EGP
apparently stop receiving any net reachability information at all.  The
Acquire cycle works, the Hello cycle works, but when you send a Poll, you will
receive 2 or 3 fragments, totalling more than 2,000 bytes.

2. A bug has just been exhibited by Walter Prue at ISI, where the LSI11 code
sending an EGP message sends the NR information with the distances out of
order, creating the apparent need for stuffing more than 255 distance reports
through a single 'neighbor'.  The result is that your EGP will receive some NR
message, but only a few nets show up in your routing table, because
the NR message is badly formed.

In the first case, your EGP trace will probably show no NR message at all; in
the second case, a trace should show some NR message received, but with some
error condition.
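
To see why out-of-order distances matter in the second case, here is a small
sketch.  It assumes that an NR message lists nets in runs sharing one
distance, and that each neighbor carries a one-byte count of those runs.  The
route list, the figure of 264 nets, and the helper names are invented for
illustration, and truncating the count to its low byte is only one plausible
way the field could have gone wrong; the actual LSI11 behaviour is not
documented here.

```python
from itertools import groupby


def distance_blocks(routes):
    """Group consecutive (net, distance) pairs that share a distance."""
    return [
        (dist, [net for net, _ in run])
        for dist, run in groupby(routes, key=lambda r: r[1])
    ]


def count_field(n):
    """Value that ends up in an 8-bit count field if n is simply truncated."""
    return n & 0xFF


# Hypothetical table: 264 nets alternating between two distances.
routes = [("net%d" % i, 1 + i % 2) for i in range(264)]

# Unsorted, every change of distance starts a new run.
unsorted_blocks = distance_blocks(routes)
# Sorted by distance, there is one run per distinct distance.
sorted_blocks = distance_blocks(sorted(routes, key=lambda r: r[1]))

truncated = count_field(len(unsorted_blocks))
```

With the distances in ascending order the number of runs can never exceed the
number of distinct distance values, so it always fits in one byte.  Out of
order, the number of runs is bounded only by the number of nets, and a
receiver that trusts the truncated count will read just the first few runs,
which is consistent with only a few nets showing up.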

We are dragging out the big guns to fix this second problem ASAP.

[BANG..:-]

------------------- some forwarded messages, excerpted ---------------

Date: 8 Dec 88 04:11:11 GMT
From: haven!aplcen!aplcomm!trn%aplcomm.jhuapl.edu@mimsy.umd.edu  (Tony Nardo)
Organization: Johns Hopkins University/APL (Baltimore, Md.)
Subject: Is someone playing games with the MILNET/ARPANET interface?
Message-Id: <2648@aplcomm.jhuapl.edu>
Sender: tcp-ip-relay@sri-nic.arpa
To: tcp-ip@sri-nic.arpa

I am on a MILNET site.  I have noticed three times in the past week (and
twice in the past two days) that, while I cannot reach a site directly, I
*can* reach it thru BRL.ARPA.  For example,

	finger @maryland.arpa

will come back with a "Network is unreachable" response, but

	finger @maryland.arpa@brl.arpa

gives the desired "finger" output.  Likewise, while I can't send mail directly
to a site without it languishing in a mail queue (the name server can't connect
to resolve the address), I *CAN* send the mail thru BRL.ARPA.

This situation did not arise until CNNC's decision to yank the MILNET/
ARPANET link for "technical difficulties".  The first two times, the problem
eventually "cleared itself".  This is the third time the problem has arisen.


Is someone still playing games with the MILNET/ARPANET interface?  From my
rather untutored perspective, it looks as if "routed" is dying or being
deliberately killed somewhere.

Anyone have any insights?

[...]

[ check routing ? - m ]