[comp.sys.proteon] Problems with P4200s not updating routing tables.

JMWOBUS@SUVM.BITNET ("John M. Wobus") (06/05/89)

>We've had trouble with our P4200s dropping routes, presumably because they
>are discarding incoming RIP information.  Proteon suggested we spend more
>money replacing what they just sold us with their newer equipment, which
>we have done to some extent.  We have also eliminated all RIP information
>about networks other than our own subnets from our network (using static
>default routes instead).
>
>Has anyone else had this sort of problem with P4200s?  If so, how did you
>solve it?  It has bothered us to no end that we bought into this stuff,
>then had to spend additional money buying new versions before we got a
>working network (by "working network", I mean a network which would not
>spontaneously disconnect telnet users several times a day).
>
>Also, it strikes me that software can be written to do other things
>besides drop routes when things get busy--we've never had such a problem
>with our non-Proteon routers.  It seems to me that each of our remaining
>P4200-10s is a time-bomb ready to start killing routes again when things
>get busy, and that our P4200-31s would do the same at some busier level,
>given the priorities of the software.

Thanks for all the response and interest in the problem I outlined in the
above quote.  Here are answers to several people's queries and
suggestions:

(1) It happened under 8.0.  Upgrading to 8.1 did NOT solve the problem:
    no noticeable change.
(2) Re serial links: we have only 1 serial link (T1) and though it may have
    suffered from the problem, the problem was hurting us most on other
    gateways with no serial links.
(3) Someone asked about my comment on Proteon's recommendation.  Proteon
    wanted to see our network configuration & map to check them.  They
    recommended upgrading to 8.1.  When that didn't help, all we got
    was suggestions: reduce the load on the gateways (which we did, but
    you can only do so much of this without moving computers from building
    to building, etc); reduce the amount of RIP information by using static
    defaults (which we did); upgrade to faster P4200s (which we did).
(4) Each of our gateways had (and has) a default network route.  Now, we
    depend upon them completely, whereas during the problem, we were also
    trying to use the RIP information which NYSERNet provides us.  Subnets
    were and are handled by RIP.  Subnet routes were clearly getting dropped:
    user complaints were about reaching our own computers; network routes
    were probably getting dropped too, but they are used less and their users
    are more used to occasional failures.   We run NO Gated.  The P4200s
    that lost routes were on an Ethernet with several other (non-Proteon)
    gateways which always had all the routes.  Thus I assume the RIP
    information was on the Ethernet, but the P4200s didn't always manage
    to get it into their tables.  We went through the wringer
    and are very likely to have eliminated all simple problems like
    failing to "enable sending subnet routes".
(5) Yes, we run DECnet.  In order to reduce the chances of this problem, we
    have run gateways in parallel, one DECnet only, and one IP only--this
    also strikes me as throwing good money after bad.  We deal with all the
    routes that NYSERNet gives us (which I believe includes all the NSFnet
    regionals, but never included other networks hanging off of ARPAnet)
    which I recall being in excess of 200 networks.  We contribute 30-40 subnet
    routes of our own.  Milo S. Mendin says "As for not deleting routes
    upon the timeout of that route, the RIP spec says you're supposed to
    do that."  In fact, I believe our problem was that RIP-info on the network
    never reached the routing table: this has nothing to do with timeouts
    except that later, it is the timeout that actually erases the route.
(6) I don't know if OSPFIGP would help, but it does us little good
    until our other gateways' manufacturers also support it.  Anyhow,
    I would judge it a RIP implementation problem: who could say whether
    Proteon will do something similar when they implement OSPFIGP?
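
The failure mode described in item (5) can be sketched as follows.  This is
a toy model, not Proteon's code: the 30-second update interval and the
180-second route timeout are the values given in the RIP RFC (RFC 1058),
while the RouteTable class, the simulate function, and the destination name
"subnet-7" are made up for illustration.

```python
# Illustrative sketch only: a toy RIP route table, not Proteon's implementation.
UPDATE_INTERVAL = 30   # seconds between periodic RIP broadcasts (RFC 1058)
ROUTE_TIMEOUT = 180    # seconds without a refresh before a route expires (RFC 1058)


class RouteTable:
    def __init__(self):
        self.last_refresh = {}  # destination -> time of last accepted update

    def accept_update(self, dest, now):
        """Record that an update for dest actually reached the table."""
        self.last_refresh[dest] = now

    def expire(self, now):
        """Drop every route not refreshed within ROUTE_TIMEOUT; return them."""
        dead = [d for d, t in self.last_refresh.items()
                if now - t >= ROUTE_TIMEOUT]
        for d in dead:
            del self.last_refresh[d]
        return dead


def simulate(discarded_updates, total_updates=12):
    """A neighbor advertises 'subnet-7' every UPDATE_INTERVAL seconds.

    The first `discarded_updates` broadcasts after t=0 are thrown away by
    the (hypothetically overloaded) receiver.  Returns the time at which
    the route is lost, or None if it survives.
    """
    table = RouteTable()
    table.accept_update("subnet-7", 0)
    for i in range(1, total_updates + 1):
        now = i * UPDATE_INTERVAL
        if i > discarded_updates:
            table.accept_update("subnet-7", now)
        if table.expire(now):
            return now
    return None
```

Under these assumptions five consecutive discarded updates are survivable,
but a sixth lets the timeout fire at t=180 even though the neighbor never
stopped advertising the route.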

We started using P4200s largely because we liked the idea of using a
counter-rotating fiber ring to interconnect the buildings: it has no
single point of failure, and all that.  We decided to live with a
proprietary ring architecture while we wait for FDDI.  However, we have
found that the large majority of our problems are gateway problems rather
than network problems: we could have installed an Ethernet backbone using
fiber repeaters, avoided being locked into a single vendor and the users
would hardly notice any difference in performance or reliability.
However, more reliable gateways would easily be noticed.

John Wobus
Syracuse University

CLIFF@UCBCMSA.BITNET (Cliff Frost {415} 642-5360) (06/06/89)

John,
You may be right that this is a Proteon flaw, but I'm not sure based
on the evidence you give.  I suspect you would see the same thing if you
had the same number of routers on an ethernet, all shoving these hundreds
of routes at each other at exactly the same time.

Think about it.  Say you have 9 p4200s on your p80.  Every 30 seconds
each and every one of them broadcasts their entire routing table.  At
the precise same moment all the ethernet routers (p4200 and others)
broadcast all their routes.  I'd be willing to bet that your non-Proteon
hosts don't have anything like this amount of simultaneous packet
traffic to deal with.  This is why they don't lose routes, it isn't
because they aren't made by Proteon.
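
A rough sketch of the arithmetic behind this scenario, under stated
assumptions: the limit of 25 route entries per RIP datagram comes from
RFC 1058; the figure of 240 routes is a guess based on John's "in excess of
200 networks" plus 30-40 subnets; and every router is assumed to advertise
its full table at the same instant (the worst case).  The function names
are hypothetical.

```python
import math

ENTRIES_PER_PACKET = 25  # RFC 1058: at most 25 route entries per RIP datagram


def packets_per_update(routes):
    """Datagrams one router needs to advertise `routes` entries."""
    return math.ceil(routes / ENTRIES_PER_PACKET)


def arrivals_per_cycle(routers, routes):
    """Datagrams one router receives per 30-second cycle from its peers,
    assuming every peer sends its full table (worst case)."""
    return (routers - 1) * packets_per_update(routes)


if __name__ == "__main__":
    # 9 routers on the ring, roughly 240 routes each (assumed figures).
    print(arrivals_per_cycle(9, 240))
```

With these numbers each router would have to absorb 80 datagrams per cycle
from the ring alone; cutting the table to 100 routes brings that down to 32.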

In my experience with this (with 350-400 routes in our tables), the
routes coming from ethernets would occasionally be lost.  Since we've
reduced our routing tables to less than 100 routes we haven't seen
any of these problems at all.

I think there are two problems here, neither of which is unique to
Proteon:

1)  The RIP spec.  It says you *have* to poison all your routes.  I can
    think of no reason you should have to do this.  If you learn a
    route to net X on interface A, why bother poisoning route X on interface A
    ALL the time?  Why not simply poison it for a few minutes after
    you time it out?  This would cut down *enormously* on the number of
    simultaneous packets routers have to process.

    Proteon has no choice here--they have to follow the spec.

2)  The fact that things "synch up".  As Van Jacobson has demonstrated,
    RIP processes will end up in sync, even if they
    start out at random times.  This is why all these packets flood
    the routers all at once.

    Proteon could do something about this; there have been several
    suggested techniques (although I don't know of any proven in practice).

Actually, OSPFIGP may help with your problems.  You don't have to run it
everywhere, just run it on your p80 backbone.  It *should* (if all the
hype comes true) generate much less traffic on the ring and therefore
not clog up your routers.

        Cliff Frost                   (415) 642-5360
        Central Computing Services    <cliff@berkeley.edu>
        University of California      CLIFF AT UCBCMSA
        Berkeley, CA 94720

hedrick@geneva.rutgers.edu (Charles Hedrick) (06/06/89)

>1)  The RIP spec.  It says you *have* to poison all your routes.  I can
>    ... Proteon has no choice here--they have to follow the spec.

I wrote the RFC for RIP.  It is not quite so cut and dried.  The RFC
points out both the advantages and disadvantages of poison reverse.
It requires that you implement poison reverse, but it explicitly
allows you to provide an option to disable it, and it also allows a
hybrid scheme, such as using poison reverse for a certain period of
time after a route disappears and then dropping the route.
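
The three behaviours mentioned here can be sketched side by side.  This is
only an illustration, not the RFC's text or any vendor's code: the Route
fields, the Policy names, and the 120-second poison_period are all
hypothetical, and the hybrid case is modelled as "poison for poison_period
seconds after the route last changed, then omit it".

```python
from dataclasses import dataclass
from enum import Enum

INFINITY = 16  # RIP metric meaning "unreachable"


class Policy(Enum):
    SIMPLE_SPLIT_HORIZON = 1  # omit routes learned on the outgoing interface
    POISON_REVERSE = 2        # always advertise them with metric 16
    HYBRID = 3                # poison for a while after a change, then omit


@dataclass
class Route:
    dest: str
    metric: int
    iface: str         # interface the route was learned on
    changed_at: float  # time of the last change to this route


def build_update(routes, out_iface, policy, now, poison_period=120.0):
    """Return the (dest, metric) entries to advertise on out_iface."""
    entries = []
    for r in routes:
        if r.iface != out_iface:
            entries.append((r.dest, r.metric))
        elif policy is Policy.POISON_REVERSE:
            entries.append((r.dest, INFINITY))
        elif policy is Policy.HYBRID and now - r.changed_at < poison_period:
            entries.append((r.dest, INFINITY))
        # otherwise the reverse route is simply left out of the update
    return entries
```

For a table in which most routes were learned on the outgoing interface and
have been stable for a long time, the simple and hybrid policies produce a
much shorter update than unconditional poison reverse.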

>2)  The fact that things "synch up".  As Van Jacobson has demonstrated,
>    RIP processes will end up in sync, even if they
>    start out at random times.  This is why all these packets flood

What Van demonstrated is that DECnet syncs up, in certain
circumstances.  I had speculated that this might be a problem, and in
fact a paper I wrote with Len Bosack a year or two ago mentions it.
In preparing that paper I looked for evidence that it was happening on
our network, and wasn't able to find any, either for DECnet or IGRP
(both using cisco routers).  Van's results were for DECnet, apparently
using VAXes as routers.  The RIP RFC was written after Van's
presentation.  It mentions the problem of self-synchronization
explicitly, and requires that the implementation take precautions that
are sufficient to prevent it.  I know that the folks at Proteon are
aware of this problem.  I would expect them to have taken the
necessary precautions.
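
One precaution of this kind is to add a small random offset to the
30-second timer each time it is set.  A minimal sketch, assuming a
5-second maximum offset (MAX_JITTER is a made-up value, and the function
names are hypothetical; nothing here is known to match what Proteon does):

```python
import random

BASE_INTERVAL = 30.0  # nominal RIP update period, in seconds
MAX_JITTER = 5.0      # hypothetical upper bound on the random offset


def next_update_delay(rng=random):
    """Delay until the next periodic update: base period plus random offset."""
    return BASE_INTERVAL + rng.uniform(0.0, MAX_JITTER)


def update_times(count, rng=random):
    """Absolute times of the first `count` updates for one router."""
    times = []
    now = 0.0
    for _ in range(count):
        now += next_update_delay(rng)
        times.append(now)
    return times
```

Because each router draws its own offsets, two routers that happen to
broadcast together once are unlikely to keep doing so on later cycles.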

I do agree that sending the entire Internet routing table via RIP
could easily overload an interface.  Ballpark calculations suggest
that you could be sending on the order of 20 packets.  How dangerous
this is depends upon how close together they are.  If the packets are
sent off as they are filled, there would be a bit of space between
them.  If they are queued at once, our experience with NFS suggests
that 20 back-to-back packets are going to cause trouble for many
different kinds of interface.
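
The ballpark figure can be reproduced from the RIP packet format in
RFC 1058: a 4-byte header followed by up to 25 entries of 20 bytes each.
The sketch below is illustrative only; the function name is hypothetical,
and the route count is an assumption (20 packets corresponds to somewhere
between 476 and 500 routes, a range derived here rather than stated above).

```python
HEADER_BYTES = 4   # RIP command, version, and padding (RFC 1058)
ENTRY_BYTES = 20   # one route entry (RFC 1058)
MAX_ENTRIES = 25   # entries per datagram (RFC 1058)


def datagram_sizes(routes):
    """RIP payload size, in bytes, of each datagram in one full update.

    Sizes exclude the UDP and IP headers.
    """
    sizes = []
    remaining = routes
    while remaining > 0:
        n = min(remaining, MAX_ENTRIES)
        sizes.append(HEADER_BYTES + n * ENTRY_BYTES)
        remaining -= n
    return sizes


if __name__ == "__main__":
    sizes = datagram_sizes(490)  # assumed table size
    print(len(sizes), sizes[-1])
```

For an assumed table of 490 routes this gives 19 full datagrams of 504
bytes plus one of 304 bytes, all of which may be queued back to back.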

CLIFF@UCBCMSA.BITNET (Cliff Frost {415} 642-5360) (06/06/89)

Sigh.  It turns out I was commenting on RIP from an out-of-date draft
of the spec.  My apologies to Charles Hedrick (and everyone else...)

I believe that my analysis of what is likely happening is still
reasonable, since Proteon's RIP does split-horizon-with-poisoned-reverse
in the manner I described.  I was wrong to say that the RIP spec
*requires* this behaviour.

      Cliff

ps  I remember that Van Jacobson's original observations had to do
    with DEC stuff.  For a long time analogous behaviour by some RIP
    implementations has been observed by many people.  As I said before,
    it is not inherent to RIP but merely an artifact of implementations.