[comp.protocols.tcp-ip] *.JHUAPL.EDU -- SERIOUS GATEWAY THRASHING

trn@aplcen.apl.jhu.edu (Tony Nardo) (02/27/90)

For the past few weeks, links between the *.jhuapl.edu nodes and the
non-MILNET community have been somewhat unstable.  Today, however, is
the first time that I've seen an extreme case of gateway thrashing:

warper.110% traceroute uunet.uu.net
traceroute to uunet.uu.net (192.48.96.2), 30 hops max, 40 byte packets
 1  apl-b3-gw (128.244.3.1)  0 ms  10 ms  0 ms
 2  apl-gw (128.244.1.1)  0 ms  10 ms  0 ms
 3  RESTON-DCEC-MB.DDN.MIL (26.21.0.104)  290 ms MARINA-DEL-REY-MB.DDN.MIL (26.6.0.103)  320 ms  330 ms
 4  MARINA-DEL-REY-MB.DDN.MIL (26.6.0.103)  690 ms CAMBRIDGE-MB.DDN.MIL (10.3.0.5)  430 ms  430 ms
 5  CAMBRIDGE-MB.DDN.MIL (10.3.0.5)  740 ms * MARINA-DEL-REY-MB.DDN.MIL (26.6.0.103)  1340 ms
 6  MARINA-DEL-REY-MB.DDN.MIL (26.6.0.103)  1210 ms CAMBRIDGE-MB.DDN.MIL (10.3.0.5)  2060 ms  2080 ms
 7  CAMBRIDGE-MB.DDN.MIL (10.3.0.5)  990 ms MARINA-DEL-REY-MB.DDN.MIL (26.6.0.103)  1290 ms  1710 ms 
 8  * CAMBRIDGE-MB.DDN.MIL (10.3.0.5)  1640 ms * 
 9  * MARINA-DEL-REY-MB.DDN.MIL (26.6.0.103)  3010 ms  2490 ms 
10  MARINA-DEL-REY-MB.DDN.MIL (26.6.0.103)  2240 ms * * 
11  CAMBRIDGE-MB.DDN.MIL (10.3.0.5)  2220 ms * * 
12  * CAMBRIDGE-MB.DDN.MIL (10.3.0.5)  3920 ms * 
13  * * * 
14  * * * 
15  * * *

etc.

Does anyone have any insights as to how this thrashing starts?  How it
may be stopped?
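
A rough way to spot this pattern mechanically is to flag any gateway
that answers at more than one TTL, which should not normally happen on
a loop-free path. The sketch below is only an illustration: the name
find_loops, the regular expressions, and the shortened host names in
SAMPLE are assumptions made here, not part of traceroute itself.

```python
import re
from collections import defaultdict

# A hop line starts with the TTL; continuation or header lines do not match.
HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")
# A responding gateway is printed as "name (a.b.c.d)".
GATEWAY = re.compile(r"\S+ \((\d+\.\d+\.\d+\.\d+)\)")


def find_loops(traceroute_text, min_repeats=2):
    """Return {address: [ttl, ...]} for gateways seen at several TTLs."""
    seen = defaultdict(list)
    for line in traceroute_text.splitlines():
        match = HOP_LINE.match(line)
        if not match:
            continue
        ttl = int(match.group(1))
        for addr in GATEWAY.findall(match.group(2)):
            if ttl not in seen[addr]:
                seen[addr].append(ttl)
    return {a: ttls for a, ttls in seen.items() if len(ttls) >= min_repeats}


# Hypothetical, abbreviated input modelled on hops 3 to 5 above.
SAMPLE = """\
 3  RESTON (26.21.0.104)  290 ms MARINA (26.6.0.103)  320 ms  330 ms
 4  MARINA (26.6.0.103)  690 ms CAMBRIDGE (10.3.0.5)  430 ms  430 ms
 5  CAMBRIDGE (10.3.0.5)  740 ms * MARINA (26.6.0.103)  1340 ms
"""

if __name__ == "__main__":
    for addr, ttls in find_loops(SAMPLE).items():
        print(addr, "answered at TTLs", ttls)
```

A gateway listed at many different TTLs, as both mail bridges are in
the trace above, suggests the packets are bouncing between them rather
than making progress.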
--
Tony Nardo,		   INET: trn@warper.jhuapl.edu, trn@aplcen.apl.jhu.edu
 Johns Hopkins Univ./APL   UUCP: {backbone!}mimsy!aplcen!trn
		    Quote(s) relocated to my finger .plans

curt@dtix.dt.navy.mil (Welch) (02/27/90)

In article <4790@aplcen.apl.jhu.edu> trn@aplcen.apl.jhu.edu (Tony Nardo) writes:
>For the past few weeks, links between the *.jhuapl.edu nodes and the
>non-MILNET community have been somewhat unstable.  Today, however, is
>the first time that I've seen an extreme case of gateway thrashing:
>
>warper.110% traceroute uunet.uu.net
>traceroute to uunet.uu.net (192.48.96.2), 30 hops max, 40 byte packets
> 1  apl-b3-gw (128.244.3.1)  0 ms  10 ms  0 ms
> 2  apl-gw (128.244.1.1)  0 ms  10 ms  0 ms
> 3  RESTON-DCEC-MB.DDN.MIL (26.21.0.104)  290 ms MARINA-DEL-REY-MB.DDN.MIL (26.6.0.103)  320 ms  330 ms

  [multiple MB hops deleted]

>12  * CAMBRIDGE-MB.DDN.MIL (10.3.0.5)  3920 ms *
>etc.
>
>Does anyone have any insights as to how this thrashing starts?  How it
>may be stopped?

We have been seeing this same problem for weeks.  One minute,
traceroute shows a normal route off of the MILNET through one of the
mail-bridges, and the next minute, we see traceroute output like the
example above.  Our packets are being passed around between the mail
bridges, but they never leave the MILNET/ARPANET.  Whenever this
gateway thrashing starts, it lasts long enough to break TCP
connections.

It has gotten so bad in the last week that it almost stopped our news
feed.  The nntp connections, when they could get started, would only
last for about 3 to 5 minutes before being disconnected.

For weeks, ftp and telnet connections to anywhere off of the MILNET
have been terrible.  They would only last a few minutes before
disconnecting, and even when they were connected they were really too
slow to use.

For the past few months, ftp connections to non-MILNET sites have been
getting worse and worse.  I installed traceroute a month ago in an
effort to get a handle on these network problems.

When I first saw this problem, I assumed that some of the mail bridges
must be going down.  Now, I would guess that this problem is being
caused by too much traffic through the mail bridges.

Who runs the mail bridges and who can tell me what's going on?

What has been changing in the past few months that has caused this?

Has the traffic really been increasing or has the number of gateways
been decreasing?  Or is something much more complex causing this problem?

Why do the mail bridges bounce packets around like that?  Do they really
think that the best route is through the other bridge, or do they use
a packet routing algorithm that gives the packets to another bridge
when the queue for the outbound link is full?

Who do I need to contact to get this problem resolved?

Do we have to get a connection to NSFnet to get away from this problem?

Is there anything we could be doing wrong to cause this?

Is there anything we can change to get around this problem?

Thanks in advance for any help anyone can give us.
(while I can still talk to you...)

Curt Welch
curt@dtix.dt.navy.mil

P.S. Our gateway to the MILNET is through dtrc-b1-gw.dt.navy.mil,
     a cisco router, MILNET address 26.22.0.81.

ron@MANTA.NOSC.MIL (Ron Broersma) (03/02/90)

I'm wondering if some of this gateway thrashing isn't related to the
fact that EGP packets from the MILNET core started exceeding 4096 bytes
a month or two ago.  At that time, I was tracing some thrashing problems
and I noticed the following symptoms.  Over the course of an hour, the
packets would gradually increase in size.  Just as they got within 10
to 20 bytes of 4096, many of the EGP implementations would suddenly
start getting checksum errors or buffer overflows because they had
4K buffers.  The ones that got a few packets with bad checksums would
suddenly stop peering with that core gateway.  Then, all of a sudden, the
EGP packets out of the core would be smaller by a few hundred bytes because
there were many fewer peers.  As the EGP players all tried to acquire a
different gateway, they would not get the checksum errors for an hour or
so, until the packets again approached 4096 bytes and they would once more
perform this dance-of-the-gateways.
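
The cycle described above can be illustrated with a toy model. This is
only a sketch under invented assumptions: the starting size, the growth
per step, and the 300-byte shrink are made-up numbers, and a peer is
treated as failing as soon as the update exceeds its 4096-byte buffer.
Nothing here reflects how any real EGP implementation behaves.

```python
def simulate_dance(start=3800, growth=40, limit=4096, shrink=300, steps=30):
    """Toy model of EGP update size over time.

    Returns a list of (step, size, dropped) tuples. 'dropped' is True on
    steps where the update no longer fits a 'limit'-byte buffer, so the
    fragile peers stop peering and the next update is 'shrink' bytes
    smaller. All parameter values are hypothetical.
    """
    size = start
    history = []
    for step in range(steps):
        dropped = size > limit
        history.append((step, size, dropped))
        if dropped:
            size -= shrink   # fewer peers, so fewer networks in the update
        else:
            size += growth   # peers re-acquire a gateway, update grows again
    return history


if __name__ == "__main__":
    for step, size, dropped in simulate_dance():
        print(step, size, "peers drop" if dropped else "")
```

With these made-up numbers the update size saws up and down just below
and just above the limit, and the peers drop at regular intervals, which
is the shape of the behaviour described above.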

The message here is to make sure your EGP implementation can handle
packets larger than 4K bytes.  The most recent gated supports 8K packets
as I recall.

Something I had considered was to make a list of all the networks that
disappeared from the EGP packets right after the "dance".  Then, if one
could determine who announced those nets to the core, one could get a
handle on where the broken EGP implementations were located.
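
As a sketch of that bookkeeping, assuming the network lists have already
been extracted from the EGP updates by some other means: the function
names, the "unknown" label, and the sample network numbers and gateway
names below are all hypothetical.

```python
def vanished_networks(before, after):
    """Networks present before the "dance" but missing right after it."""
    return sorted(set(before) - set(after))


def suspects(before, after, announced_by):
    """Group the vanished networks by the gateway that announced them.

    'announced_by' maps network -> announcing gateway; how that mapping
    is obtained is left open here.
    """
    result = {}
    for net in vanished_networks(before, after):
        gateway = announced_by.get(net, "unknown")
        result.setdefault(gateway, []).append(net)
    return result


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    before = ["192.0.2.0", "198.51.100.0", "203.0.113.0"]
    after = ["203.0.113.0"]
    announced_by = {"192.0.2.0": "gw-a", "198.51.100.0": "gw-a"}
    print(suspects(before, after, announced_by))
```

Gateways that show up repeatedly across several "dances" would be the
first places to look for an EGP implementation with 4K buffers.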

There's some other strangeness going on too.  We had a case this week
where one site running EGP was announcing its network to the core but
the core wasn't telling anybody else about it.  By peering with a
different mailbridge, it started working.  Strange.

And to top it off, the ground started shaking yesterday.  But we think
that is an unrelated (hardware) problem.

--Ron

mcdaniel%hqeis.decnet@HQAFSC-VAX.AF.MIL ("HQEIS::MCDANIEL") (03/03/90)

Andrews AFB

                  I N T E R O F F I C E   M E M O R A N D U M

                                        Date:     02-Mar-1990 11:46am EST
                                        From:     Mr Rodney McDaniel
                                                  MCDANIEL
                                        Dept:     HQ AFSC/SCXP
                                        Tel No:   AV 858-7909 COMM 981-7909
                                        Owner:    Mr Rodney McDaniel

TO:  _MAILER!                             ( _DDN[TCP-IP@NIC.DDN.MIL] )
TO:  _MAILER!                             ( _DDN[NIC@NIC.DDN.MIL] )

CC:  _MAILER!                             ( _DDN[RON@MANTA.NOSC.MIL] )
CC:  _MAILER!                             ( _DDN[CURT@DTIX.DT.NAVY.MIL] )

Subject: RE: *.JHUAPL.EDU -- SERIOUS GATEWAY THRASHING

HAS ANYONE THOUGHT ABOUT CONTACTING THE FOLLOWING OFFICES RELATING 
TO DDN MILNET PROBLEMS:

CONUS MILNET MONITORING CENTER
AUTOVON 222-2268/5726
COMM: 202-692-2268/5726
EMAIL ADDRESS:  DCA-MMC.DCA-EMS.DCA.MIL

CONUS TROUBLE DESK (MILNET & DSNET)
1-800-451-7413
AUTOVON 231-1787
COMM: 202-486-1982

NAVY POINT OF CONTACT:
THIS FOLLOW-UP PROBLEM WAS FURTHER IDENTIFIED BY TWO NAVY.MIL
SYSTEMS.

NAVAL TELECOMMUNICATIONS COMMAND     AUTOVON 292-0381
ATTN: N521                           COMM: 202-282-0381
4401 MASSACHUSETTS AVENUE NW         EMAIL: NAVTELCOM@DDN2.DCA.MIL
WASHINGTON, DC 20390-5290

DCA/DDOM (B651)
WASHINGTON, DC 20305-2000
MAJOR CORDER - MILNET MANAGER
AUTOVON 222-7580
COMM: 202-692-7580
EMAIL ADDRESS: MILNETMGR@DDN3.DCA.MIL

THIS INFORMATION IS AVAILABLE IN DDN NEWSLETTER #56, STORED ON
NIC.DDN.MIL (<TACNEWS> MENU ITEM 6 OPTION), VIA TELENET, BY
CALLING 1-800-235-3155, OR BY SENDING A REQUEST TO NIC@NIC.DDN.MIL
(USER ASSISTANCE).  THIS NEWSLETTER PROVIDES THE POINTS OF CONTACT
FOR DDN PROBLEMS.  PLEASE NOTE: A NEW DDN NEWSLETTER #57 IS
FORTHCOMING, AND A DRAFT CAN BE OBTAINED THE SAME WAY AS ABOVE OR BY
USING FTP AND REQUESTING NETINFO:WHO-DDN.TXT.  IT PROVIDES ALL THE
DCA DDN PROGRAM OFFICE FUNCTIONS AND PERSONNEL.
HOWEVER, WE ARE STILL AWAITING THE DDN NIC TO POST AN UPDATED VERSION
OF DDN NEWSLETTER #56, DATED 8 JUN 88.  HOPE THIS HELPS DIRECT
THE PROBLEM INTO THE PROPER CHANNELS.  ALSO, DCA IS RESPONSIBLE
FOR THE MAILBRIDGES BETWEEN MILNET & INTERNET, SO I SUGGEST THIS BE
DIRECTED TO THE DDN MILNET POC'S LISTED ABOVE FOR WORKING A
POSSIBLE EGP PROBLEM.  I WOULD LIKE TO SEE A SUMMARY RESPONSE ON THE
TCP-IP MAILER ON HOW THE PROBLEM WAS CORRECTED.

RODNEY A. MCDANIEL, DAFC
AIR FORCE SYSTEMS COMMAND
DDN PROGRAM MANAGER
EMAIL: MCDANIEL@HQAFSC-VAX.AF.MIL
ANDREWS AFB MD - AUTOVON 858-7909 - COMM: 301-981-7909