mills@UDEL.EDU.UUCP (10/03/87)
Folks,

Things have been very bad around the NSFNET since last Thursday. After several 16-hour days and much experimentation, I think I understand at least some of the reasons. If I am correct, you are not going to like the consequences.

Last Thursday the primary NSFNET gateway psc-gw became increasingly flaky, eventually to the point where it and its seventy-odd nets disappeared from EGP updates. Backup gateways linkabit-gw and cu-arpa picked up the slack, but not without considerable losses and delays due to congestion. When the new ARPANET code was installed over the weekend, psc-gw and its PSN (14) both completely expired, reportedly due to "resource shortage," the usual BBN euphemism for insufficient storage or table overflow, especially for connection blocks, which manage ARPANET virtual circuits. Apparently, BBN backed out of the new code, so the PSN is unchanged from Thursday. Meanwhile, Maryland gateway terp, also connected to a PSN (20) running the new ARPAware, began behaving badly, so much so that terp was simply turned off, leaving another Maryland gateway to hump the load. At this time (Thursday evening) the gateway is still off. Since both psc-gw and terp have similar configurations, connectivity and PSN (X.25) interfaces, one would assume the same varmint bit both of them.

Meanwhile, I was sitting off PSN 96 trying to figure out what was going on and noticed linkabit-gw 10.0.0.111 and dcn-gw 10.2.0.96 could not reach psc-gw at its ARPANET address 10.4.0.14. However, both of these buzzards could reach other hosts with no problem. Furthermore, EGP updates received from the usual core speakers revealed psc-gw was working just fine. I concluded something weird was spooking the ARPANET; however, I found that cu-arpa 10.3.0.96 and louie 10.0.0.96 could work psc-gw at its ARPANET address. I thought maybe X.25 was the key, since all of the other PSN 96 machines use 1822, and cranked up swamp-gw 10.9.0.96 using X.25, but found no joy with psc-gw either.

When Dave O'Leary of PSC called to tell me their ACC 5250 X.25 driver for the MicroVAX was spewing out error comments to the effect that insufficient virtual circuits were available, all the cards fell into place. The 5250 supports a maximum of 64 virtual circuits. Apparently the number of ARPANET gateways and other (host) clients has escalated to the point that the 64-circuit maximum was exceeded. Probably the PSN was groaning even before that, which might have led to the earlier problems over the weekend. The reason some gateways could work psc-gw anyway was that they had captured the virtual circuits due to significant traffic loads and frequent connection attempts. My tests were from lightly loaded host ports which couldn't break into the mayhem which must be going on in the psc-gw 5250 board.

I have looked at the 5250 driver code, which is pretty simplistic in how it manages the virtual-circuit inventory. It appears now of the highest priority that a more mature approach be implemented in the driver, so that virtual-circuit resources can be reclaimed on the basis of use, age, etc. In principle this is not very hard, but it would have to be done quickly. Meanwhile, I suspect a lot of X.25 client gateways (not just NSFNET) are or soon will be very sick indeed. Note that reclamation requires that open circuits to one destination may have to be closed abruptly, which can result in loss of data, then reopened to another destination.
Under thrashing conditions where the load is spread over lots of other gateways and virtual circuits are flapping like crazy, the cherished ARPANET reputation for reliable transport may be considerably tarnished. Those of us who have pondered the wisdom of underlaying X.25 virtual circuits beneath a connectionless service have repeatedly said that this kind of problem was certain to occur sooner or later. There are now about 200 gateways and 300 networks out there. As the ARPANET evolves toward a gateway-gateway (many-to-many) service, rather than a host-gateway (few-to-many) service, the problem can only get much worse. I personally believe the ARPANET architects and engineers, as well as the host and gateway vendors, must quickly come to grips with this issue. Our most precious resource may not be packet buffers, but connection blocks.

Dave
-------
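A minimal sketch, in C, of the use/age-based reclamation described above: when all 64 circuit slots are busy and a datagram must go to a new destination, the least recently used circuit is cleared abruptly and its slot reused. All names (vc_get, vc_clear, vc_call, the slot table) are hypothetical; this is not the actual ACC 5250 driver code, just an illustration of the policy.

    #define MAXVC 64                        /* hardware limit on the 5250 board */

    struct vc_slot {
            int           in_use;           /* slot holds an open circuit */
            unsigned long dest;             /* destination this circuit reaches */
            unsigned long last_used;        /* tick of last packet on this circuit */
    };

    static struct vc_slot vc_tab[MAXVC];

    extern unsigned long now(void);                     /* current clock tick (assumed) */
    extern void vc_clear(int slot);                     /* issue X.25 CLEAR (assumed) */
    extern int  vc_call(int slot, unsigned long dest);  /* issue X.25 CALL (assumed) */

    /*
     * Return the slot of an open circuit to dest, opening one if necessary.
     * If all 64 slots are busy, reclaim the least recently used circuit.
     */
    int
    vc_get(unsigned long dest)
    {
            int i, victim = -1;
            unsigned long oldest = ~0UL;

            for (i = 0; i < MAXVC; i++)
                    if (vc_tab[i].in_use && vc_tab[i].dest == dest) {
                            vc_tab[i].last_used = now();
                            return i;               /* circuit already open */
                    }

            for (i = 0; i < MAXVC; i++)             /* prefer a free slot */
                    if (!vc_tab[i].in_use) {
                            victim = i;
                            break;
                    }

            if (victim < 0) {                       /* none free: reclaim the LRU circuit */
                    for (i = 0; i < MAXVC; i++)
                            if (vc_tab[i].last_used <= oldest) {
                                    victim = i;
                                    oldest = vc_tab[i].last_used;
                            }
                    vc_clear(victim);               /* abrupt close; in-flight data may be lost */
            }

            vc_tab[victim].in_use = 1;
            vc_tab[victim].dest = dest;
            vc_tab[victim].last_used = now();
            if (vc_call(victim, dest) < 0) {
                    vc_tab[victim].in_use = 0;      /* call failed; release the slot */
                    return -1;
            }
            return victim;
    }

A use count or minimum-hold time could be folded into the victim selection so that a busy circuit is never torn down in favor of a one-packet destination, along the lines of the "use, age, etc." criteria mentioned above.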
pogran@CCQ.BBN.COM (Ken Pogran) (10/04/87)
Dave,

The message you sent to the tcp-ip list the other day regarding the NSFNET woes you observed caused us here at BBN to put on our thinking caps. We worked to understand how what you saw relates to what we know about what's happening in the ARPANET these days. I think we already understood what was behind a good bit of what you observed, and your message gave us the impetus to investigate a few more things as well. This message describes the situation as we understand it. There are four separate underlying issues:

1. The number of "reachable networks" in the Internet has just nudged upwards of 300 for the first time. (The Internet used to be growing at a rate of about 10 networks/month; that rate has accelerated over the past few months.)

2. For the week ending Thursday, 1 October, the ARPANET handled a record 202 million packets. (Traffic over the past few months has been in the 180s -- itself a record over last spring.)

3. We've begun the "beta test" on the ARPANET of the new PSN software release, PSN 7.0, and -- sure enough -- there have been a few problems.

4. And, finally, the limit you described in your message: 64 virtual circuits in the ACC 5250 X.25 driver that is used by several X.25-connected gateways on the ARPANET.

The first two issues just demonstrate that things continue to get busier and busier in the ARPANET and in the Internet. We've put out a new version of the LSI-11 "core gateway" software that allows for 400, rather than 300, reachable networks, to give the core some breathing room again. And I shudder to think what ARPANET (and, hence, Internet) performance would be like if we tried to handle over 200 million packets per week without the so-called "Routing Patch" that was installed late in the summer and considerably improved the performance of the ARPANET routing algorithm.

I think the third issue, the beginning of the PSN 7.0 beta test on the ARPANET, contributed to some of what you saw and helped to obscure some of the other causes of what you observed. As you know, last weekend we put PSN 7 into a portion of the ARPANET. CMU was one of the nodes that got PSN 7. PSN 7 contains a new "End-to-End" protocol for management of the flow of data between source PSNs and destination PSNs. It's the first re-do of the End-to-End protocol in the ARPANET EVER. We're expecting a lot of improvement in efficiency within the PSN and, hence, some network performance improvement.

To make a graceful, phased cutover to the New End-to-End feasible, PSN 7.0 contains code for both the new and the old End-to-End protocols. So as we've introduced PSN 7.0, it's been with the OLD End-to-End protocol. Now, unfortunately, having code for two End-to-End protocols coresident takes up memory space that would normally go to buffers, etc., for handling traffic. So, yes -- during the 3-4 week phased cutover, the ARPANET PSNs will be a little short on buffer space; there's not much that can be done about that. But once ALL nodes are cut over to the New End-to-End protocol, we will install PSN 7.1, which will remove the old End-to-End, reclaim that memory space, and -- in the case of the ARPANET nodes in which C/300 processors have replaced the C/30s -- be able to use DOUBLE the main memory.

Back to the problem at hand: You mentioned the report of "resource shortage"s in the PSNs. This happened with the CMU PSN for reasons we still don't understand. However, this WASN'T "the usual BBN euphemism for ... connection blocks which manage ARPANET virtual circuits" that you suggested in your message -- we've usually got plenty of those these days. The resource shortage the CMU PSN reported to the NOC had to do with the PSN's X.25 interface. Since several higher-priority problems showed up with PSN 7, we decided the best thing to do was to return the CMU node to PSN 6 and work on this one later. We have some preliminary ideas of what might have happened, and we'll be investigating this week.

As for delays in the ARPANET: It turns out that the version of PSN 7.0 that was deployed last weekend contained a bug in the "Routing Patch" that worsened, instead of improved, the performance of the routing algorithm. We are frankly embarrassed about that. This problem was fixed Thursday night, 1 October -- about the time you sent your message. We'd be very interested in hearing from you how things looked from the NSFNet side THIS weekend.

From your description it certainly sounds like the 64-VC limit in the ACC 5250 is the proximate cause of the problem at CMU last weekend. We now count 83 gateways attached to the ARPANET. A gateway on the ARPANET that's handling a lot of diverse traffic to other gateways as well as to other ARPANET hosts is very likely to need more than 64 VCs. We think we can provide a work-around for this problem over the short term. The PSN has an "idle timer" for each VC, and can initiate a Close of the VC if it hasn't been used for a while. We can configure that timer to be pretty short and thus recycle the gateway's VCs. Of course, some overhead will be incurred to re-establish a VC to send the next IP datagram to that destination, but that's probably preferable to having things plug up for lack of VCs. Note that by having the PSN reclaim idle VCs, we shouldn't see much of the "loss of data" that you alluded to in your message. We would be happy to work with administrators at sites that have gateways with ACC 5250s who would like to try this out.

In closing, let me say that we at BBN share your concerns about the issues to be faced as the ARPANET evolves toward a gateway-to-gateway service from its traditional host-to-host or host-to-gateway service. The way gateways are attached to the network is one of a number of urgent architectural and engineering issues that must be addressed.

Regards,
Ken Pogran
Manager, System Architecture
BBN Communications Corporation

P.S. TO THE COMMUNITY: As the PSN 7.0 upgrade proceeds in the ARPANET, we'll probably encounter a few more problems. As described in the DDN Management Bulletin distributed earlier, please send reports of problems to ARPAUPGRADE@BBN.COM. BBN will respond.
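A rough illustration, in C, of the idle-timer workaround described above: a periodic scan clears any circuit that has carried no traffic for longer than a configurable threshold, freeing the slot for the next destination. The names (idle_scan, x25_clear, the table, the 30-second threshold) are assumptions for illustration, not actual PSN or driver code.

    #define NVC        64                   /* circuits on this interface (assumed) */
    #define IDLE_TICKS 30                   /* assumed threshold: ~30 s of idleness */

    struct vc {
            int           open;             /* circuit is established */
            unsigned long last_traffic;     /* tick of last packet in either direction */
    };

    extern struct vc vc_tab[NVC];
    extern unsigned long now(void);         /* current clock tick (assumed) */
    extern void x25_clear(int slot);        /* send X.25 CLEAR REQUEST (assumed) */

    /* Run from a periodic timer, e.g. once per second. */
    void
    idle_scan(void)
    {
            int i;

            for (i = 0; i < NVC; i++)
                    if (vc_tab[i].open && now() - vc_tab[i].last_traffic > IDLE_TICKS) {
                            x25_clear(i);   /* circuit has been quiet; little should be in flight */
                            vc_tab[i].open = 0;
                    }
    }

Since the clear is issued only after the circuit has been quiet for the full timeout, a packet caught in transit at that moment (the case raised in the exchange that follows) should be rare, though not impossible.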
Mills@UDEL.EDU (10/09/87)
Ken,

Thanks very much for your thoughtful and informative response. Like you, I do believe the proximate cause of the psc-gw problems is running short of virtual-circuit resources in the X.25 interface; however, I am a little worried about the workaround you suggest - shortening the idle timer in the PSN itself. I have verified that packets do get lost if traffic is flowing at the time of the VC clear due to the interface itself, even in loopback. I think the eventual resolution must be rebuilding the driver to reclaim VCs on the basis of time and use, much the same way the PSNs must handle that for themselves.

Dave
pogran@CCQ.BBN.COM (Ken Pogran) (10/13/87)
Dave,

Please note that I suggested using the PSN idle timer to recycle VCs as a WORKAROUND; it certainly isn't the proper long-term RESOLUTION. That is more likely to be the rebuilding of the driver that you suggest. You mention that packets "get lost if traffic is flowing at the time of the VC CLEAR." If we use an IDLE timer to generate the clear, wouldn't we mostly avoid the problem -- because we clear when it's idle, i.e., no traffic flowing?

Regards,
Ken
Mills@UDEL.EDU (10/13/87)
Ken,

My experience is that it is not unlikely that a packet happens to be in transit from one end when the other end decides to close. While this would not occur too often, it would occur often enough to dominate losses due to other causes.

Dave