[comp.sys.proteon] routing problem

dlw@VIOLET.BERKELEY.EDU (David Wasley) (08/03/88)

Having puzzled about a routing problem here for some time, without resolution,
I've just heard that another site is experiencing the same problem. So let me
pose it for this group in case anyone else is seeing it too, or can shed light.

Below is a schematic similar to our situation:


      net-A         net-B            net-C         net-D      ethernets
        |             |                |             |
    +-------+     +-------+        +-------+     +-------+
    | p4200 |     | p4200 |        | p4200 |     | p4200 |
    |  GW1  |     |  GW2  |        |  GW3  |     |  GW4  |
    +-------+     +-------+        +-------+     +-------+
        \             /                \             /
         \           /                  \           /
     pp-EA\         /pp-EB          pp-EC\         /pp-ED     pt-to-pt circuits
           \       /                      \       /
            \     /                        \     /
           +-------+                      +-------+
           | p4200 |                      | p4200 |
           |  GW5  |                      |  GW6  |
           +-------+                      +-------+
               |                              |           ethernet
   net-E ------o------------------------------o-----------------------

All GW's are running release 7.4b and all use RIP.


I have a process running on a machine within net-E that sends a RIP "query"
to GW5 every 30 seconds, and notes any change in advertised metrics. (No, we
don't have the ability to do it with SNMP (yet) nor is the monitor machine
on the same ethernet. We're working on that.) A similar process monitors GW6.
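
A monitor of this kind could be sketched roughly as follows. This is not the actual Berkeley monitoring program; it assumes the RIP version 1 packet layout from RFC 1058, and the function names are made up for illustration. It builds a whole-table request, parses a response, and reports which metrics changed between two samples.

```python
import struct

RIP_PORT = 520   # UDP port assigned to RIP
INFINITY = 16    # RIP's "unreachable" metric

def build_full_table_request():
    """RIP v1 request for the whole table: one entry, AFI 0, metric 16."""
    header = struct.pack("!BBH", 1, 1, 0)   # command=request, version=1
    entry = struct.pack("!HHIIII", 0, 0, 0, 0, 0, INFINITY)
    return header + entry

def parse_response(packet):
    """Return {dotted-quad address: metric} from a RIP v1 response."""
    command, version, _ = struct.unpack("!BBH", packet[:4])
    if command != 2:
        raise ValueError("not a RIP response")
    routes = {}
    # Each route entry is 20 bytes, following the 4-byte header.
    for off in range(4, len(packet) - 19, 20):
        afi, _, addr, _, _, metric = struct.unpack("!HHIIII", packet[off:off + 20])
        if afi == 2:   # IP entries only
            dotted = ".".join(str((addr >> s) & 0xFF) for s in (24, 16, 8, 0))
            routes[dotted] = metric
    return routes

def metric_changes(old, new):
    """List (net, old_metric, new_metric) for every net whose metric changed.

    A net missing from a sample is treated as unreachable (metric 16).
    """
    changes = []
    for net in sorted(set(old) | set(new)):
        before = old.get(net, INFINITY)
        after = new.get(net, INFINITY)
        if before != after:
            changes.append((net, before, after))
    return changes
```

A polling loop would send the request to UDP port 520 on the gateway every 30 seconds, parse the reply, and log whatever metric_changes() returns; the socket handling is left out here to keep the sketch short.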

The problem: as many as 15 times a day, the metrics for pp-EC, net-C, net-D,
and pp-ED **as seen by GW5** go to infinity, and then come back anywhere from
30 seconds (the next sample) to 30 minutes later. However, during those same
days, GW6 **never** loses routes to those nets, but it does lose routes
to the things beyond GW5. In other words, the problem shows symmetry. (The
exact times and frequencies vary between the 2 GW's, but the symptoms are
symmetrical.)

Below are extracts from the actual log file for GW5. (Only the net names have
been changed, to correspond to the picture.)

Has anyone else observed this behavior? Can anyone think of a plausible
explanation?

Thanks,
	David Wasley
	U C Berkeley



---- Scenario 1: lose all routes synchronously, very common ---

Jul 28 13:31:32 pp-EC       2 -> 16
Jul 28 13:31:32 pp-ED       2 -> 16
Jul 28 13:31:32 net-D       3 -> 16
Jul 28 13:31:32 net-C       3 -> 16

Jul 28 13:32:04 pp-EC      16 ->  2
Jul 28 13:32:04 pp-ED      16 ->  2
Jul 28 13:32:04 net-D      16 ->  3
Jul 28 13:32:04 net-C      16 ->  3

Jul 28 15:58:35 pp-EC       2 -> 16
Jul 28 15:58:35 pp-ED       2 -> 16
Jul 28 15:58:35 net-D       3 -> 16
Jul 28 15:58:35 net-C       3 -> 16

Jul 28 15:59:08 pp-EC      16 ->  2
Jul 28 15:59:08 pp-ED      16 ->  2
Jul 28 15:59:08 net-D      16 ->  3
Jul 28 15:59:08 net-C      16 ->  3

---- Scenario 2: lose nets, then p-p links, then regain in reverse order ----

Jul 29 13:48:22 net-D       3 -> 16
Jul 29 13:48:22 net-C       3 -> 16

Jul 29 13:49:28 pp-EC       2 -> 16
Jul 29 13:49:28 pp-ED       2 -> 16

Jul 29 13:50:33 pp-EC      16 ->  2
Jul 29 13:50:33 pp-ED      16 ->  2

Jul 29 13:54:53 net-D      16 ->  3
Jul 29 13:54:53 net-C      16 ->  3

---- Scenario 3: lose nets and p-p links asynchronously!?! ----

Jul 30 04:05:59 net-D       3 -> 16
Jul 30 04:05:59 net-C       3 -> 16

Jul 30 04:07:04 net-D      16 ->  3
Jul 30 04:07:04 net-C      16 ->  3

Jul 30 07:50:03 pp-ED       2 -> 16
Jul 30 07:50:03 pp-EC       2 -> 16

Jul 30 07:57:37 pp-EC      16 ->  2
Jul 30 07:57:37 pp-ED      16 ->  2

Jul 30 08:00:52 net-D       3 -> 16
Jul 30 08:00:52 net-C       3 -> 16

Jul 30 08:12:08 net-D      16 ->  3
Jul 30 08:12:08 net-C      16 ->  3

GR.PJL@ISUMVS.BITNET ("Paul Lustgraaf") (08/03/88)

I believe we are experiencing almost the same thing on MIDNET,
which is organized as a ring of 12 p4200s.  If you telnet to one
of the p4200s and watch the error log (t 2), you can see RIP
routes time out and then, a few seconds later, an update comes
along for the same route.  I believe it is some sort of timing
problem, that is, RIP updates are not happening often enough.
I must confess, though, that I have never really mastered RIP,
so if I'm all wet, please send flames to /dev/null.
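
The timing theory can be illustrated with a small sketch. It assumes the timer values from RFC 1058 (an update every 30 seconds, a route expiring after 180 seconds without one); whether the p4200 software uses exactly these values is an assumption, and the function name is hypothetical.

```python
UPDATE_INTERVAL = 30   # seconds between periodic updates (RFC 1058)
ROUTE_TIMEOUT = 180    # seconds without an update before a route expires (RFC 1058)

def timed_out_intervals(update_times, timeout=ROUTE_TIMEOUT):
    """Given sorted arrival times (in seconds) of updates for one route,
    return (expired_at, restored_at) pairs where the route had timed out."""
    gaps = []
    for prev, cur in zip(update_times, update_times[1:]):
        if cur - prev > timeout:
            # The route expires `timeout` seconds after the last update
            # and comes back when the next update finally arrives.
            gaps.append((prev + timeout, cur))
    return gaps
```

For example, timed_out_intervals([0, 30, 60, 245]) returns [(240, 245)]: the route expires at 240 seconds and an update restores it five seconds later, much like what the error log shows. With these timers it takes about six consecutive missed updates to expire a route.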


Paul Lustgraaf              GR.PJL@ISUMVS.BITNET
Network Specialist          GRPJL@VAXD.IASTATE.EDU
Computation Center
Iowa State University
Ames, IA  50011
515-294-1556 or 294-0324

jch@SONNE.TN.CORNELL.EDU (Jeffrey C Honig) (08/03/88)

Some points to ponder:

I) What gateway addresses are you querying? If you do a RIP query to the
net-E address of GW5 it should respond with a metric of 16 for all of
GW6's nets; that's the way Split Horizon/Poisoned Reverse works.  Are
you consistently querying the same IP address for gateways 5 and 6?
What does a dump of the routing table from the console show?

II) Are you having Ethernet load problems?  Are you using p4213/4214
boards and DECnet?  If you turn on more tracing do you see RIP packets
at GW5 from GW6?
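
The Split Horizon/Poisoned Reverse behavior in point I can be sketched as below. This is a simplified model rather than the p4200 code; the table layout and function name are hypothetical.

```python
INFINITY = 16

def advertise(routing_table, out_interface):
    """Build the metrics a gateway would advertise on out_interface.

    routing_table maps net -> (metric, interface the route was learned on).
    Routes learned via out_interface are poisoned, i.e. advertised as 16.
    """
    advert = {}
    for net, (metric, via) in routing_table.items():
        advert[net] = INFINITY if via == out_interface else metric
    return advert

# Hypothetical GW5 table: GW6's nets learned over net-E, net-A over pp-EA.
gw5_table = {
    "pp-EC": (2, "net-E"),
    "net-C": (3, "net-E"),
    "net-A": (2, "pp-EA"),
}
```

With this table, advertise(gw5_table, "net-E") gives 16 for pp-EC and net-C but 2 for net-A, while advertising on any other interface gives the real metrics. A query answered on the interface a route was learned from would therefore show infinity even though the route is healthy.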

Jeff

swb@DAINICHI.TN.CORNELL.EDU (Scott Brim) (08/05/88)

Dave, is there anything else involved in the routing?  Backdoor
connections not shown on your map?  Something translating between
protocols?  EGP peers which are not really a homogeneous group?

Do the metrics instantly pop to infinity or do they count to it?  What
filtering do you have on your interfaces (any?) to avoid routing "echoes"?

dlw@VIOLET.BERKELEY.EDU (David Wasley) (08/05/88)

Re:
	From swb@dainichi.tn.cornell.edu Thu Aug  4 14:22:57 1988
	To: dlw@violet.berkeley.edu (David Wasley)
	Cc: p4200@devvax.TN.CORNELL.EDU, cliff@cmsa.berkeley.edu,
	        vaf@score.stanford.edu, swb@dainichi.tn.cornell.edu
	Subject: Re: routing problem 
	Date: Thu, 04 Aug 88 17:21:26 -0400
	From: Scott Brim <swb@dainichi.tn.cornell.edu>
	
	Dave, is there anything else involved in the routing?  Backdoor
	connections not shown on your map?  Something translating between
	protocols?  EGP peers which are not really a homogeneous group?
	
	Do the metrics instantly pop to infinity or do they count to it?
	What filtering do you have on your interfaces (any?) to avoid
	routing "echoes"?

There are no back doors between the remote nets. The only routing protocol
used is RIP. The metrics (seem to) go instantly to infinity, as shown in
the log file output I sent. (The numbers are the actual metrics seen.)
I think this is reasonable assuming poisoned reverse.

There is one thing I didn't mention because I didn't think it relevant :-)
We're using 2 IP addresses on the same ethernet network controller on several
p4200's.  This is to implement different controls on routing information
interchange while avoiding an extra hop. Below is the more complete picture:

   (To NSFNET)
      net-A         net-B            net-C         net-D      ethernets
        |             |                |             |
    +-------+     +-------+        +-------+     +-------+
    | p4200 |     | p4200 |        | p4200 |     | p4200 |
    |  GW1  |     |  GW2  |        |  GW3  |     |  GW4  |
    +-------+     +-------+        +-------+     +-------+
        \             /                \             /
         \           /                  \           /
     pp-EA\         /pp-EB          pp-EC\         /pp-ED     pt-to-pt circuits
           \       /                      \       /
            \     /                        \     /
           +-------+                      +-------+
           | p4200 |                      | p4200 |
           |  GW5  |                      |  GW6  |
           +-------+                      +-------+
               |\                             |\
   net-E ------o-\----------------------------o-\---------------------
                  \                              \    one ethernet, 2 IP nets
   net-F ----------o--------o---------------o-----o-------------------
                            |               |
                        +-------+       +-------+
                        | p4200 |       | p4200 |
                        |  GW7  |       |  GW8  |
                        +-------+       +-------+
                          |   |           |   |
                       (To other subnets of net-F)

I was monitoring the net-F interface on box GW5, and seeing the 4 nets
beyond GW6 disappear, and then come back. There are no other nets beyond GW6.

One theory is that the routes were flopping between the net-E interface
and the net-F interface. This shouldn't happen because the metrics would
be identical, but it may be happening anyway. (We ran an ethernet level
trace this morning
but haven't analyzed it yet.) The non-16 metrics as seen by a query on
the net-F interface would indicate GW5 saw a route via its net-E interface;
metric 16's would indicate that the route was (then) via the net-F interface.

If it is flopping, I would have expected to see a more even distribution
of 16 & non-16 metrics.  On the other hand, maybe this is yet another example
of Van Jacobson's lock-step phenomenon.  But in that case, I would expect
them *always* to change together, which they don't.

To test this by inference, I changed the configurations to "send nets" on
only the net-E interface (and allowed GW7 & 8 to listen only on that net).
It has been much more stable since then. I'll believe it after a day or so.

Maybe I'm just tilting at windmills...
	David