dlw@VIOLET.BERKELEY.EDU (David Wasley) (08/03/88)
Having puzzled about a routing problem here for some time, without resolution, I've just heard that another site is experiencing the same problem. So let me pose it for this group in case anyone else is seeing it too, or can shed light.

Below is a schematic similar to our situation:

    net-A       net-B              net-C       net-D      ethernets
      |           |                  |           |
  +-------+   +-------+          +-------+   +-------+
  | p4200 |   | p4200 |          | p4200 |   | p4200 |
  |  GW1  |   |  GW2  |          |  GW3  |   |  GW4  |
  +-------+   +-------+          +-------+   +-------+
       \         /                    \         /
        \       /                      \       /
    pp-EA\     /pp-EB              pp-EC\     /pp-ED      pt-to-pt circuits
          \   /                          \   /
           \ /                            \ /
        +-------+                      +-------+
        | p4200 |                      | p4200 |
        |  GW5  |                      |  GW6  |
        +-------+                      +-------+
            |                              |              ethernet
net-E ------o------------------------------o-----------------------

All GW's are running release 7.4b and all use RIP.

I have a process running on a machine within net-E that sends a RIP "query" to GW5 every 30 seconds, and notes any change in advertised metrics. (No, we don't have the ability to do it with SNMP (yet) nor is the monitor machine on the same ethernet. We're working on that.) A similar process monitors GW6.

The problem: as many as 15 times a day, the metrics for pp-EC, net-C, net-D, and pp-ED **as seen by GW5** go to infinity, and then come back anywhere from 30 seconds (the next sample) to 30 minutes later. However, during those same days, GW6 **never** loses routes to those nets, but it does lose routes to the things beyond GW5. In other words, the problem shows symmetry. (The exact times and frequencies vary between the 2 GW's, but the symptoms are symmetrical.)

Below are extracts from the actual log file for GW5. (Only the net names have been changed, to correspond to the picture.)

Has anyone else observed this behavior? Can anyone think of a plausible explanation?
Thanks,
David Wasley
U C Berkeley

---- Scenario 1: lose all routes synchronously, very common ---
Jul 28 13:31:32 pp-EC 2 -> 16
Jul 28 13:31:32 pp-ED 2 -> 16
Jul 28 13:31:32 net-D 3 -> 16
Jul 28 13:31:32 net-C 3 -> 16
Jul 28 13:32:04 pp-EC 16 -> 2
Jul 28 13:32:04 pp-ED 16 -> 2
Jul 28 13:32:04 net-D 16 -> 3
Jul 28 13:32:04 net-C 16 -> 3
Jul 28 15:58:35 pp-EC 2 -> 16
Jul 28 15:58:35 pp-ED 2 -> 16
Jul 28 15:58:35 net-D 3 -> 16
Jul 28 15:58:35 net-C 3 -> 16
Jul 28 15:59:08 pp-EC 16 -> 2
Jul 28 15:59:08 pp-ED 16 -> 2
Jul 28 15:59:08 net-D 16 -> 3
Jul 28 15:59:08 net-C 16 -> 3

---- Scenario 2: lose nets, then p-p links, then regain in reverse order ----
Jul 29 13:48:22 net-D 3 -> 16
Jul 29 13:48:22 net-C 3 -> 16
Jul 29 13:49:28 pp-EC 2 -> 16
Jul 29 13:49:28 pp-ED 2 -> 16
Jul 29 13:50:33 pp-EC 16 -> 2
Jul 29 13:50:33 pp-ED 16 -> 2
Jul 29 13:54:53 net-D 16 -> 3
Jul 29 13:54:53 net-C 16 -> 3

---- Scenario 3: lose nets and p-p links asynchronously!?! ----
Jul 30 04:05:59 net-D 3 -> 16
Jul 30 04:05:59 net-C 3 -> 16
Jul 30 04:07:04 net-D 16 -> 3
Jul 30 04:07:04 net-C 16 -> 3
Jul 30 07:50:03 pp-ED 2 -> 16
Jul 30 07:50:03 pp-EC 2 -> 16
Jul 30 07:57:37 pp-EC 16 -> 2
Jul 30 07:57:37 pp-ED 16 -> 2
Jul 30 08:00:52 net-D 3 -> 16
Jul 30 08:00:52 net-C 3 -> 16
Jul 30 08:12:08 net-D 16 -> 3
Jul 30 08:12:08 net-C 16 -> 3
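For anyone wanting to compare outage lengths across the three scenarios, here is a minimal Python sketch that pairs each "-> 16" line with the matching recovery line and reports how long the route was advertised as unreachable. The function name `outages`, the four-line sample, and the assumed year are illustrative only; this is not the monitoring program described in the post, and month names are parsed with the default (English) locale.

```python
from datetime import datetime

INFINITY = 16  # RIP's "unreachable" metric

# A few lines copied from the extracts above, in the same format.
SAMPLE_LOG = """\
Jul 28 13:31:32 pp-EC 2 -> 16
Jul 28 13:32:04 pp-EC 16 -> 2
Jul 29 13:48:22 net-D 3 -> 16
Jul 29 13:54:53 net-D 16 -> 3
"""


def outages(log_text, year=1988):
    """Return (route, lost_at, seconds_down) for each loss/recovery pair.

    The log lines carry no year, so one is assumed (hypothetical default).
    """
    down_since = {}
    result = []
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # skip scenario headings and blank lines
        month, day, clock, route, old, _arrow, new = fields
        when = datetime.strptime(
            f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
        )
        if int(new) == INFINITY:
            down_since[route] = when
        elif int(old) == INFINITY and route in down_since:
            lost_at = down_since.pop(route)
            seconds = int((when - lost_at).total_seconds())
            result.append((route, lost_at, seconds))
    return result


for route, lost_at, seconds in outages(SAMPLE_LOG):
    print(f"{route}: lost at {lost_at:%b %d %H:%M:%S}, down for {seconds} s")
```

Because the monitor samples every 30 seconds, each computed duration is only accurate to within about one sampling interval.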
GR.PJL@ISUMVS.BITNET ("Paul Lustgraaf") (08/03/88)
I believe we are experiencing almost the same thing on MIDNET, which is organized as a ring of 12 p4200s. If you telnet to one of the p4200s and watch the error log (t 2), you can see RIP routes time out and then, a few seconds later, an update comes along for the same route.

I believe it is some sort of timing problem, that is, RIP updates are not happening often enough. I must confess, though, that I have never really mastered RIP, so if I'm all wet, please send flames to /dev/null.

Paul Lustgraaf               GR.PJL@ISUMVS.BITNET
Network Specialist           GRPJL@VAXD.IASTATE.EDU
Computation Center
Iowa State University
Ames, IA 50011
515-294-1556 or 294-0324
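The "route times out, then an update arrives a few seconds later" pattern can be illustrated with a small Python sketch. It assumes the RFC 1058 default timers (an update every 30 seconds, a route expiring after 180 seconds without one); whether the p4200 software uses those values is an assumption, and `expiry_times` is a made-up helper name.

```python
UPDATE_INTERVAL = 30   # seconds between RIP updates (RFC 1058 default, assumed)
ROUTE_TIMEOUT = 180    # seconds of silence before a route expires (RFC 1058 default, assumed)


def expiry_times(update_arrivals, timeout=ROUTE_TIMEOUT):
    """Given sorted arrival times (in seconds) of updates for one route,
    return the times at which the route expired before the next update came."""
    expired = []
    for prev, nxt in zip(update_arrivals, update_arrivals[1:]):
        if nxt - prev > timeout:
            expired.append(prev + timeout)
    return expired


# Updates arrive at 0, 30 and 60 s, then nothing is heard until 245 s.
arrivals = [0, 30, 60, 245, 275]
print(expiry_times(arrivals))
```

With these assumed timers, roughly six consecutive updates have to be lost or delayed before a route expires, so a timeout followed seconds later by a fresh update would point at sustained loss or delay rather than a single dropped packet.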
jch@SONNE.TN.CORNELL.EDU (Jeffrey C Honig) (08/03/88)
Some points to ponder:

I) What gateway addresses are you querying? If you do a RIP query to the net-E address of GW5, it should respond with a metric of 16 for all of GW6's nets; that's the way Split Horizon/Poisoned Reverse works. Are you consistently querying the same IP address for gateways 5 and 6? What does a dump of the routing table from the console show?

II) Are you having Ethernet load problems? Are you using p4213/4214 boards and DECnet? If you turn on more tracing, do you see RIP packets at GW5 from GW6?

Jeff
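Point I can be made concrete with a minimal Python sketch of split horizon with poisoned reverse. The routing-table layout and the `advertise` function are hypothetical and do not reflect the p4200's internals; the metrics are the ones GW5 reports in the log extracts, and the sketch assumes GW5 learned those routes over net-E.

```python
INFINITY = 16  # RIP's "unreachable" metric


def advertise(routing_table, out_interface):
    """Return the metrics a router would advertise on out_interface.

    routing_table maps destination -> (metric, interface the route was
    learned on). Routes learned on out_interface are advertised back out
    of it with metric 16 (poisoned reverse).
    """
    advert = {}
    for dest, (metric, learned_on) in routing_table.items():
        if learned_on == out_interface:
            advert[dest] = INFINITY
        else:
            advert[dest] = metric
    return advert


# Hypothetical view of GW5's learned routes toward GW6's side.
gw5_table = {
    "pp-EC": (2, "net-E"),
    "pp-ED": (2, "net-E"),
    "net-C": (3, "net-E"),
    "net-D": (3, "net-E"),
}

print(advertise(gw5_table, "net-E"))  # every route poisoned
print(advertise(gw5_table, "pp-EA"))  # real metrics
```

Under this model, which address of GW5 is queried decides whether the answer shows the real metric or 16, so a monitor has to query the same address every time for its samples to be comparable.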
swb@DAINICHI.TN.CORNELL.EDU (Scott Brim) (08/05/88)
Dave, is there anything else involved in the routing? Backdoor connections not shown on your map? Something translating between protocols? EGP peers which are not really a homogeneous group? Do the metrics instantly pop to infinity or do they count to it? What filtering do you have on your interfaces (any?) to avoid routing "echoes"?
dlw@VIOLET.BERKELEY.EDU (David Wasley) (08/05/88)
Re:
    From swb@dainichi.tn.cornell.edu Thu Aug 4 14:22:57 1988
    To: dlw@violet.berkeley.edu (David Wasley)
    Cc: p4200@devvax.TN.CORNELL.EDU, cliff@cmsa.berkeley.edu, vaf@score.stanford.edu, swb@dainichi.tn.cornell.edu
    Subject: Re: routing problem
    Date: Thu, 04 Aug 88 17:21:26 -0400
    From: Scott Brim <swb@dainichi.tn.cornell.edu>

    Dave, is there anything else involved in the routing? Backdoor connections not shown on your map? Something translating between protocols? EGP peers which are not really a homogeneous group? Do the metrics instantly pop to infinity or do they count to it? What filtering do you have on your interfaces (any?) to avoid routing "echoes"?

There are no back doors between the remote nets. The only routing protocol used is RIP. The metrics (seem to) go instantly to infinity, as shown in the log file output I sent. (The numbers are the actual metrics seen.) I think this is reasonable assuming poisoned reverse.

There is one thing I didn't mention because I didn't think it relevant :-) We're using 2 IP addresses on the same ethernet network controller on several p4200's. This is to implement different controls on routing information interchange while avoiding an extra hop. Below is the more complete picture:

 (To NSFNET)
    net-A       net-B              net-C       net-D      ethernets
      |           |                  |           |
  +-------+   +-------+          +-------+   +-------+
  | p4200 |   | p4200 |          | p4200 |   | p4200 |
  |  GW1  |   |  GW2  |          |  GW3  |   |  GW4  |
  +-------+   +-------+          +-------+   +-------+
       \         /                    \         /
        \       /                      \       /
    pp-EA\     /pp-EB              pp-EC\     /pp-ED      pt-to-pt circuits
          \   /                          \   /
           \ /                            \ /
        +-------+                      +-------+
        | p4200 |                      | p4200 |
        |  GW5  |                      |  GW6  |
        +-------+                      +-------+
            |\                             |\
net-E ------o-\----------------------------o-\---------------------
               \                              \   one ethernet, 2 IP nets
net-F ----------o--------o---------------o-----o-------------------
                         |               |
                     +-------+       +-------+
                     | p4200 |       | p4200 |
                     |  GW7  |       |  GW8  |
                     +-------+       +-------+
                         |               |
                         |               |
                    (To other subnets of net-F)

I was monitoring the net-F interface on box GW5, and seeing the 4 nets beyond GW6 disappear, and then come back.
There are no other nets beyond GW6.

One theory is that the routes were flopping between the net-E interface and the net-F interface. This shouldn't happen because the metrics would be identical, but it may be happening anyway. (We ran an ethernet level trace this morning but haven't analyzed it yet.) The non-16 metrics as seen by a query on the net-F interface would indicate GW5 saw a route via its net-E interface; a metric of 16 would indicate that the route was (then) via the net-F interface. If it is flopping, I would have expected to see a more even distribution of 16 & non-16 metrics.

On the other hand, maybe this is yet another example of Van Jacobson's lock-step phenomenon. But in that case, I would expect them *always* to change together, which they don't.

To test this by inference, I changed the configurations to "send nets" on only the net-E interface (and allowed GW7 & 8 to listen only on that net). It has been much more stable since then. I'll believe it after a day or so.

Maybe I'm just tilting at windmills...

David