[comp.protocols.tcp-ip] routing changes

heker@JVNCA.CSC.ORG.UUCP (11/05/87)

We are experiencing a large number of routing changes in the kernel of one
of our VAX8600 "gateways".  The number of changes has increased dramatically
due to some route instabilities (that are not the topic of this message).

The question is: how can the number of changes affect the performance of
our system?

We see about 1000 route changes in the kernel in periods of 10 minutes.  This
is, as you can see, *extremely* high.  But does this degrade the system
performance at all?

I also want to point out that these route changes are then propagated to
other systems (VAX750s), and they all dance to the same rhythm.

Any comments about this will be greatly appreciated.

						-- Sergio

-----------------------------------------------------------------------------
Sergio Heker				tel:	(609) 520-2000
Internet: "heker@jvnca.csc.org"		Bitnet:	"heker@jvnc"
JOHN VON NEUMANN NATIONAL SUPERCOMPUTER CENTER, JVNCnet Network Manager
-----------------------------------------------------------------------------

Mills@UDEL.EDU (11/05/87)

Sergio,

I'm not sure what you mean by "routing changes." There certainly are vast
quantities of changes involving relatively small changes in delay and
even uncomfortable quantities involving significant (factor of two) changes.
Not many of these involve changes in route, however. While the situation
is serious and must be fixed, I don't think the routing overhead itself
is a significant factor in performance. Hellograms are rate-limited to
no more than one every 400 ms in even the worst case.
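
To put the two figures quoted so far side by side, here is a small arithmetic
sketch. It only compares rates; it assumes nothing about how many hello
messages a single route change generates, which the thread does not say.

```python
# Figures quoted in the thread.
route_changes = 1000            # kernel route changes observed
window_s = 10 * 60              # over a 10-minute window
hello_min_interval_ms = 400     # at most one hello message every 400 ms

changes_per_s = route_changes / window_s
max_hellos_per_s = 1000 / hello_min_interval_ms
max_hellos_in_window = (window_s * 1000) // hello_min_interval_ms

print(f"{changes_per_s:.2f} route changes per second observed")
print(f"{max_hellos_per_s:.1f} hello messages per second at most")
print(f"{max_hellos_in_window} hello messages at most per 10-minute window")
```

So the observed change rate sits below the stated hello-message ceiling, which
is consistent with the point that the rate limit bounds the routing overhead.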

Let's hear it for all those gated's honking strange distances to the
fuzzies. Can someone answer the questions I put out about their behavior?

Dave

Mills@UDEL.EDU (11/06/87)

Sergio,

Ah yes, the infamous 192.31.x nets. These dudes have been bouncing all over
the map for some time now. The distance values for these nets are not provided
by the fuzzballs, but by gated at some site or other. When they count to
infinity they have in fact become unreachable. This is a classic example
of what unstable metrics can do to a distributed Bellman-Ford algorithm. I
have been working feverishly to harden the algorithm so that even these wild
swings won't destabilize the algorithm, but when distances change from one
sample to the next by over fifty percent, what can any algorithm do? I repeat
my statement made at least a dozen times: where is the source of those violent
delay excursions and what gated is generating them?
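
For readers unfamiliar with "counting to infinity", the following is a minimal
textbook sketch of the effect in a distance-vector (Bellman-Ford) exchange. It
is not the fuzzball or gated code: the two routers A and B, the hop-count
metric, and the value 16 for "infinity" are all assumptions chosen for
illustration.

```python
def count_to_infinity(infinity=16, link_cost=1):
    """Two routers, A and B, and a net N.

    A used to reach N directly (distance 1); B reaches N only through A
    (distance 2). The A-N link has just failed, but B still advertises
    its stale distance, and neither side uses split horizon.
    Returns the (dist_a, dist_b) pairs after each exchange.
    """
    dist_a, dist_b = infinity, 1 + link_cost
    history = [(dist_a, dist_b)]
    while dist_a < infinity or dist_b < infinity:
        # A hears B's advertisement and believes N is reachable via B.
        dist_a = min(infinity, dist_b + link_cost)
        # B's only route is via A, so it follows A's new distance.
        dist_b = min(infinity, dist_a + link_cost)
        history.append((dist_a, dist_b))
    return history


for step, (a, b) in enumerate(count_to_infinity()):
    print(f"exchange {step}: A={a:2d}  B={b:2d}")
```

Each exchange raises both distances a little, and traffic for N loops between
A and B until the distances finally reach the "infinity" value.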

Having said that, note that even these severe transients should not adversely
affect the system throughput, at least for the nets not rocking to and fro,
since the hello messages are rate-limited. On the other hand, traffic for
nets counting to infinity can clearly gobble up dangerous levels of traffic.
That's why I have been spending so much time trying to avoid the counting
problem. The only way to do that is to latch sudden increases in delay
and prevent further decreases until the hold-down timer expires, which is
what the present system does. I have had to experiment somewhat in order to
gauge the sensitivity of the latch, which is presently set at a factor of
two. The latch regularly snares at least some of the surges, but not all, as
you can see from your data. I can't make the latch more sensitive without
snaring a lot of benign wobbles, such as occasional retransmissions on UIUC -
NCAR lines, for example. Nevertheless, I have tuned the algorithm a lot in the
past month and, at least in the testing swamps, it seems to be working well.
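
One way to picture the latch described above is the sketch below. It is an
interpretation, not the fuzzball implementation: the class name DelayLatch, the
60-second hold-down, and the exact comparison rules are assumptions. Only the
factor-of-two sensitivity comes from the message.

```python
class DelayLatch:
    """Hold-down latch on a per-net delay estimate (illustrative only)."""

    def __init__(self, factor=2.0, hold_down_s=60.0):
        self.factor = factor            # sensitivity: factor of two, per the message
        self.hold_down_s = hold_down_s  # hypothetical hold-down period
        self.delay = None               # delay currently in use
        self.hold_until = None          # time at which the hold-down expires

    def update(self, sample, now):
        """Feed one delay sample taken at time `now` (seconds).

        Returns the delay the router would keep using.
        """
        if self.delay is None:
            self.delay = sample
            return self.delay
        holding = self.hold_until is not None and now < self.hold_until
        if holding and sample < self.delay:
            # Latched: decreases are ignored until the timer expires.
            return self.delay
        if self.delay > 0 and sample >= self.factor * self.delay:
            # Sudden increase: accept it and start the hold-down timer.
            self.hold_until = now + self.hold_down_s
        self.delay = sample
        return self.delay


latch = DelayLatch()
for now, sample in [(0, 100), (10, 150), (20, 400), (30, 120), (90, 120)]:
    print(now, sample, latch.update(sample, now))
```

In this sketch the jump from 150 to 400 trips the latch, the sample of 120 at
time 30 is ignored, and the same sample is accepted at time 90 once the
hold-down has expired. Lowering the factor would catch more surges but would
also latch on the benign wobbles mentioned above.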

It has been suggested that JVNC has more trouble than most because that is
the only spot running gated on two machines on the same Ether. I thought
Maryland was doing that as well. While they seem to be having trouble of
their own, destabilized routes do not seem to be a serious problem there.
There are two things I would recommend (again): first, identify all those
gated configurations where only a single path is available to the networks
being squawked and set the squawked delay to zero, just plain zero. Second,
where multiple paths to a net exist, pray to the metric-translation god and
really, truly and verily conform to the rules I suggested in my earlier memo.
In any case, the clock-offset fields associated with each net in the hello
message should be set to zero and the date in the header should be marked
invalid. This seems like a pretty simple thing to check.

Dave

Mills@UDEL.EDU (11/06/87)

Folks,

My apologies to the tcp-ip list for my recent reply to Sergio's message,
which must have seemed rather esoteric to most of you. I overlooked the
"tcp-ip" addressee in the return address list of the message. On the other
hand, if someone wants to start that game, I would be happy to play.

Dave

fedor@NIC.NYSER.NET (Mark Fedor) (11/06/87)

>Date:     Thu, 5 Nov 87 14:22:12 EST
>From: Mills@udel.edu
>Subject:  Re:  routing changes
>

	[ DELETED TEXT - MF ]

>Let's hear it for all those gated's honking strange distances to the
>fuzzies. Can someone answer the questions I put out about their behavior?
>
>Dave
>

	Dave,

	I must admit due to some traveling and moving, I have not
	read my mail too carefully.  As soon as I catch up and
	find the questions you put out, I will be glad to answer
	them.  Or you can send me a summary of your questions and
	I'll see what I can do.....

	Mark
	NYSERNet Inc.  (this is the last time I specify this!  Y'all
			should know I work there by now.)    :^)

	P.S.  can you elaborate "strange distances"?

Mills@UDEL.EDU (11/06/87)

Mark,

Yeah, I know who you work for now, but if I admitted that you might have
an excuse to wiggle off the hook. The hook seems to have already impaled
me, as you may have noticed.

Strange distances mean anything from 100 ms to somewhere in the middle of
Channel 4. Sergio's is a typical example. As for rounding up all the messages
I sent on the topic, gimme a break. There must be a hundred of them last
month alone. From reports by returning scouts to the INENG meeting, the
likely cause may be (a) incompatible gated versions and/or configurations,
(b) unstable ripspeakers behind gated or (c) metric conversion violations
when more than a single access path is available.

Dave