louie@sayshell.umd.edu (Louis A. Mamakos) (03/19/91)
I have a very strange tale to tell of a very weird Novell IPX problem provoked by cisco routers. [Quick summary for those that don't want to read the long, sad story: Novell XMSNET3 client software will emit watchdog packets with incorrect source IPX network numbers. cisco not to blame.]

First, let me describe the environment and the situation. We have on our campus a number of "old" Proteon P4200 routers, with the Novell IPX forwarder. Things were working Just Fine, except we are phasing out the Proteon routers and moving these subnets over to cisco routers. After I moved two buildings off of a pair of Proteon routers to two ports on a cisco router, my users started to yell and scream that the connection to the Novell file server was being broken. I ran some tests as best I could, since (back then) I knew next to nothing about Novell IPX and really wanted to stay in that state of ignorance; but that was not to be.

After a week or two, we finally discovered that the failure mode was caused by Novell's watchdog packets failing to operate properly. You see, if a Novell client makes no requests of the server for 5 minutes, the server pokes at it with a watchdog packet; the client is supposed to send a reply, and the server is then reassured that the client is still around and thus keeps the client's "connection" to the server intact. If the client doesn't respond to the watchdog packet, the server then begins to sic watchdogs at the client every minute (actually, according to my Excelan LANalyzer, more like every 58 seconds), desperately hoping the client will respond. It keeps this up for 10 minutes, and then declares the client dead if no replies are heard.

Now the weird thing is, this only happens when the cisco routers are being used; if I switch back to the Proteon routers, things get better. Also, it affects most of the clients on the subnet, not just an isolated one or two.
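For anyone trying to picture the timing, here is a rough Python sketch of the watchdog schedule described above, for a client that never answers. The constants come straight from the numbers in this post (5 minutes idle, roughly one probe a minute, 10 minutes of retries). The function name and the exact number of probes that fit in the retry window are assumptions for illustration, not anything taken from Novell documentation.

```python
# All times are in seconds after the client's last request to the server.
IDLE_BEFORE_FIRST_PROBE = 5 * 60  # silence tolerated before the first watchdog
RETRY_INTERVAL = 60               # nominal; the LANalyzer showed about 58 s
RETRY_WINDOW = 10 * 60            # how long the server keeps retrying


def watchdog_schedule():
    """Return (probe_times, drop_time) for a client that never replies.

    Hypothetical helper: it only models the timing described in the post.
    """
    first = IDLE_BEFORE_FIRST_PROBE
    probes = [first]
    t = first + RETRY_INTERVAL
    # Keep probing until the retry window has been used up.
    while t < first + RETRY_WINDOW:
        probes.append(t)
        t += RETRY_INTERVAL
    return probes, first + RETRY_WINDOW


if __name__ == "__main__":
    probes, drop_time = watchdog_schedule()
    print("watchdog probes sent at:", probes)
    print("connection dropped at:", drop_time)
```

Under these assumptions a client whose replies are lost is cut off about 15 minutes after its last real request, which fits users seeing their file server connection vanish after sitting idle for a while.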
The other weird thing is that not *all* of the cisco-attached Novell IPX networks exhibited this problem, just a few. This made it very difficult to try to monitor the problem, as it meant treks across campus to the few sites that did have the problem. Note that the Novell IPX network number was unchanged between the Proteon router and the cisco router; the only visible difference would have been the MAC level address, and perhaps timing related stuff.

I had cisco Systems involved in trying to resolve the problem, which I felt certain *had* to be a cisco problem; after all, it worked with the Proteon router, right? They were *very* responsive and cooperative in trying to help me track down and resolve the problem. For instance, one [grasping at straws] hypothesis was that the Ethernet frame padding was somehow significant (since the watchdog packets were smaller than the 64 byte minimum frame size). Cisco built a version of the router code that preserved the Ethernet frame padding as the frame arrived on one Ethernet, traversed an FDDI ring and was deposited on the destination Ethernet. That didn't actually help.

At some point, after gathering a pile o' traces, we noticed that the source Novell IPX network number in the watchdog reply packet was zero, rather than the "correct" IPX network number. Now, I didn't know if this was a problem or not, since these silly protocols aren't documented anywhere that mere mortals can get at. We set up simultaneous packet captures using a pair of Excelan LANalyzers, and noted again that the cisco router passed the packets unmolested, except for the "transport control" field (hop count) and the Ethernet frame padding. We also noticed that when using the Proteon routers on the same network, the watchdog replies had the *correct* Novell network number in the packet. Hmmmm...

Cisco built me a software load that checked the source Novell IPX network number for 0, and replaced it with the "correct" Novell network number for that Ethernet.
Magically, everything began to work! Now this was very weird; why would the network number be 0 in the watchdog packet replies with the cisco, but the correct value when using the Proteon routers on the exact same network? Hmmm.. We also noticed that this only occurred when using the XMSNET3 or XMSNET4 shell on the PC client; the problem was *never* observed using the non-extended memory versions, NET3 and NET4. We're also using the BYU packet driver versions of these programs.

So at this point I thought our problems were pretty much solved in the short term; I decided that cisco was off the hook, since the PC client was clearly emitting a garbaged packet for whatever reason. I had a workaround from cisco to repair the damaged packet (but only if the IPX fast switching code was turned off). I was all set to go, and I got my users off my back and they were no longer trying to kill me.

Then it happened again. The Novell code must have some AI, because it knew we were on to it. I began to observe broken behavior on the clients. $@#&*() said I, having thought I had my life back to do other things for a change. I dragged my LANalyzer over to the site yet again (thinking what a good idea it was to have it in a portable PC) and captured some more traces. I was seeing the same sort of thing; the watchdog packet replies were not being returned. But this time, rather than the source Novell IPX network number being 0, it was effectively trash. For example, rather than being 8008A600 it was 02A14ECA. This was a software bug that fought back.

I called the cisco engineer that I was working with (yes, I even had a direct phone number for an engineer!) and described the problem. An hour later I had a new software load that checked the transport control (hop count) field of the IPX packets being received; if it was 0 (originated on THIS Ethernet), then it made sure that the source IPX network number was correct, no matter what it was before.
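The logic of that last workaround is simple enough to sketch in a few lines of Python. This is only an illustration of the idea as described above, not cisco's code: the function name `fix_source_network` and the way the packet is handed in as raw bytes are assumptions. The offsets follow the standard 30-byte IPX header, where transport control is the byte at offset 4 and the source network number is the 4 bytes at offset 18.

```python
import struct

IPX_HEADER_LEN = 30           # checksum, length, transport control, type, dst, src
TRANSPORT_CONTROL_OFFSET = 4  # one byte: number of routers crossed so far
SRC_NETWORK_OFFSET = 18       # four bytes, big-endian


def fix_source_network(packet: bytes, local_network: int) -> bytes:
    """Force the source network on packets that originated on this segment.

    Hypothetical helper. `local_network` is the IPX network number
    configured on the interface the packet arrived on.
    """
    if len(packet) < IPX_HEADER_LEN:
        return packet  # too short to be an IPX packet; leave it alone
    if packet[TRANSPORT_CONTROL_OFFSET] != 0:
        return packet  # already crossed a router, so not from this segment
    fixed = bytearray(packet)
    # Overwrite whatever the client put there (0, trash, or the right value).
    struct.pack_into(">I", fixed, SRC_NETWORK_OFFSET, local_network)
    return bytes(fixed)
```

With the values from the trace, a reply carrying source network 02A14ECA and a hop count of 0 would leave the router carrying 8008A600 instead, while any packet with a nonzero hop count passes through untouched. The earlier workaround was the same rewrite, but triggered only when the source network number was exactly 0.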
My problems are solved (or at least worked around) and have been for a span of time measured in weeks. I can sleep again.

So: I hope that this discussion might save someone else's hair from being pulled out. Does anyone know anything about Novell XMSNET3 programs that trash themselves like this? I figure it is only a matter of time before a random store (if that's what it is) hoses something else in XMSNET3 instead of the source IPX network number for watchdog packets. My users really want to continue to use the extended memory (XMS...) versions, since they leave more memory to run their applications in. My suggestions that they get a real computer and run a real operating system with real networking were not well received :-)

Given that we have to continue to live with Novell IPX, is anyone aware of a version of XMSNET3 that doesn't have this problem? We tried a variety of different versions of XMSNET3, each of which seemed to have the problem. Note that the BYU packet driver stuff is in yet another program, called IPX (or NEWIPX or IPXPKT), and NOT in the XMSNET3 program. That tends to make me not blame the BYU packet driver stuff. But then again, this is crufty PC software, so who knows WHAT is going on in there.

Also, I'm told that the current release of cisco gateway server software incorporates this fix; I don't know for certain.

Once again, I have to thank cisco for the time and effort they put in to help resolve this problem, even AFTER it was clear to me that there was no fault in the cisco product. I wish my relationship with other vendors was as good.

Louis Mamakos
Assistant Manager, Network Infrastructure
University of Maryland, College Park