[comp.sys.novell] Ultimate weird Novell IPX bug and cisco routers: save your sanity!

louie@sayshell.umd.edu (Louis A. Mamakos) (03/19/91)
I have a very strange tale to tell of a very weird Novell IPX problem
provoked by cisco routers.  [Quick summary for those that don't want
to read the long, sad story: Novell XMSNET3 client software will emit
watchdog packets with incorrect source IPX network numbers.  cisco not
to blame.]

First, let me describe the environment and the situation.  We have on
our campus a number of "old" Proteon P4200 routers, with the Novell
IPX forwarder.  Things were working Just Fine, except we are phasing out
the Proteon routers and moving these subnets over to cisco routers.

After I moved two buildings off of a pair of Proteon routers to two
ports on a cisco router, my users started to yell and scream that the
connection to the Novell file server was being broken.  I ran some
tests as best I could, since (back then) I knew next to nothing about
Novell IPX and really wanted to stay in that state of ignorance; but
that was not to be. 

After a week or two, we finally discovered that the failure mode was
caused by Novell's watchdog packets failing to operate properly.  You
see, if a Novell client makes no requests of the server for 5 minutes,
the server pokes at it with a watchdog packet; the client is supposed
to send a reply and the server is then reassured that the client is
still around and thus keeps its "connection" to the server intact.  If
the client doesn't respond to the watchdog packet, the server then
begins to sic watchdogs at the client every minute (actually,
according to my Excelan LANalyzer, more like 58 seconds) desperately
hoping the client will respond.  It keeps this up for 10 minutes, and
then will declare the client dead if no replies are heard.
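
To make that timing concrete, here is a rough sketch of the schedule as
described above, for a client that never answers.  The function name and
the exact boundary behavior (whether a last probe goes out at the moment
the server gives up) are assumptions for illustration, not anything taken
from Novell documentation.

```python
def watchdog_schedule(idle_limit=5 * 60, retry_interval=60, retry_window=10 * 60):
    """Return (probe_times, give_up_time) in seconds after the client's
    last request, assuming the client never replies to any probe."""
    first_probe = idle_limit                 # first watchdog after 5 idle minutes
    give_up = first_probe + retry_window     # server retries for 10 more minutes
    probes = [first_probe]
    t = first_probe + retry_interval
    while t < give_up:                       # assumption: no probe at give-up time
        probes.append(t)
        t += retry_interval
    return probes, give_up


probes, give_up = watchdog_schedule()
print("probes sent at (s):", probes)
print("client declared dead at (s):", give_up)

# The LANalyzer showed roughly 58 seconds between retries, not 60.
probes_58, _ = watchdog_schedule(retry_interval=58)
print("probes with 58 s spacing:", len(probes_58))
```

In other words, an idle client that cannot get a reply back to the server
loses its connection about 15 minutes after its last request.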

Now the weird thing is, this only happens when the cisco routers are
being used; if I switch back to the Proteon routers, things get
better.  Also, it affects most of the clients on the subnet, not just
an isolated one or two.

The other weird thing is that not *all* of the cisco attached Novell
IPX networks exhibited this problem, just a few.  This made it very
difficult to try to monitor the problem, as it meant treks across
campus to the few sites that did have the problem.  Note that the
Novell IPX network number was unchanged between the Proteon router and
the cisco router; the only visible difference would have been the MAC
level address, and perhaps timing related stuff.

I had cisco Systems involved in trying to resolve the problem, which I
felt certain *had* to be a cisco problem; after all it worked with the
Proteon router, right?  They were *very* responsive and cooperative in
trying to help me track down and resolve the problem.  For instance,
one [grasping at straws] hypothesis was that the Ethernet frame padding
was somehow significant (since the watchdog packets were smaller than
the 64 byte minimum frame size).  Cisco built a version of the router
code that preserved the ethernet frame padding as the frame arrived on
one ethernet, traversed an FDDI ring and was deposited on the
destination ethernet.  That didn't actually help.

At some point after gathering a pile o' traces we noticed that the
Source Novell IPX network number in the watchdog reply packet was
zero, rather than the "correct" IPX network number.  Now, I didn't
know if this was a problem or not since these silly protocols aren't
documented anywhere that mere mortals can get at.  We set up
simultaneous packet captures using a pair of Excelan LANalyzers, and
noted again that the cisco router passed the packets unmolested,
except for the "transport control" field (hop count), and the ethernet
frame padding.  

We also noticed that when using the Proteon routers on the same
network that the watchdog replies had the *correct* Novell network
number in the packet.  Hmmmm...

Cisco built me a software load that checked the source Novell IPX
network number for 0, and replaced it with the "correct" Novell
network number for that Ethernet.  Magically, everything began to
work!
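
For anyone who wants to see what that repair amounts to, here is a
minimal sketch operating on a raw IPX packet.  It is NOT cisco's code;
the helper name is made up, and the sample node addresses and socket
numbers are placeholders.  It relies only on the standard 30-byte IPX
header, in which the source network number sits at byte offsets 18-21.

```python
import struct

# IPX header (30 bytes, big-endian): checksum, length, transport control,
# packet type, then network/node/socket for destination and for source.
IPX_HEADER = struct.Struct("!HHBB4s6sH4s6sH")
SRC_NET = slice(18, 22)  # byte offsets of the source network number


def patch_zero_source_network(packet: bytes, local_network: bytes) -> bytes:
    """Hypothetical helper: if the source network number is all zeros,
    replace it with the network number of the receiving interface."""
    if len(packet) < IPX_HEADER.size:
        raise ValueError("too short to be an IPX packet")
    if len(local_network) != 4:
        raise ValueError("IPX network numbers are 4 bytes")
    if packet[SRC_NET] == b"\x00" * 4:
        return packet[:18] + local_network + packet[22:]
    return packet  # anything non-zero is left alone


# A made-up reply with a zero source network (all other fields are placeholders).
reply = IPX_HEADER.pack(
    0xFFFF, IPX_HEADER.size, 0, 0,
    bytes.fromhex("00000001"), bytes(6), 0x4000,       # destination
    bytes(4), bytes.fromhex("02608c123456"), 0x4001,   # source, network = 0
)
fixed = patch_zero_source_network(reply, bytes.fromhex("8008A600"))
print("source network after patch:", fixed[SRC_NET].hex())
```

Note that this version only catches a source network of exactly zero;
any other wrong value passes through untouched.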

Now this was very weird; why would the network number be 0 in the
watchdog packet replies with the cisco, but the correct value when
using the Proteon routers on the exact same network?  Hmmm..  We also
noticed that this only occurred when using the XMSNET3 or XMSNET4 shell
on the PC client; the problem was *never* observed using the
non-extended memory versions, NET3 and NET4.  We're also using the BYU
packet driver versions of these programs.

So at this point I thought our problems were pretty much solved in the
short term; I decided that cisco was off the hook since the PC client
was clearly emitting a garbaged packet for whatever reason.  I had a
workaround from cisco to repair the damaged packet (but only if the
IPX fast switching code was turned off).  I was all set to go, and I
got my users off my back and they were no longer trying to kill me.

Then it happened again.

The Novell code must have some AI, because it knew we were on to it.  I
began to observe broken behavior on the clients.  $@#&*() said I,
having thought I had my life back to do other things for a change.  I
dragged my LANalyzer over to the site yet again (thinking what a good
idea it was to have it in a portable PC) and captured some more
traces.  I was seeing the same sort of thing; the watchdog packet
replies were not being returned.  But this time, rather than the
source Novell IPX network number being 0, it was effectively
trash.  For example, rather than being 8008A600 it was 02A14ECA.

This was a software bug that fought back.

I called the cisco engineer that I was working with (yes, I even had a 
direct phone number for an engineer!) and described the problem.  An hour
later I had a new software load that checked the transport control (hop count)
field of the IPX packets being received; if it was 0 (originated on THIS
ethernet), then it made sure that the source IPX network number was correct,
no matter what it was before.
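
As a rough illustration of that rule (again, not cisco's actual code;
the function name is hypothetical), the check keys off the transport
control byte at offset 4 of the IPX header rather than off the value of
the source network number itself:

```python
TRANSPORT_CONTROL = 4              # byte offset of the hop count
SRC_NET_START, SRC_NET_END = 18, 22  # byte offsets of the source network
IPX_HEADER_LEN = 30


def force_local_source_network(packet: bytes, local_network: bytes) -> bytes:
    """Hypothetical helper: a hop count of 0 means the packet originated
    on this Ethernet, so its source network must be this interface's
    network number.  Overwrite it whatever it currently says."""
    if len(packet) < IPX_HEADER_LEN:
        raise ValueError("too short to be an IPX packet")
    if len(local_network) != 4:
        raise ValueError("IPX network numbers are 4 bytes")
    if packet[TRANSPORT_CONTROL] == 0:
        return packet[:SRC_NET_START] + local_network + packet[SRC_NET_END:]
    return packet  # already crossed a router; not ours to rewrite


# A made-up header carrying the trashed value seen in the traces.
trashed = bytearray(IPX_HEADER_LEN)
trashed[SRC_NET_START:SRC_NET_END] = bytes.fromhex("02A14ECA")
repaired = force_local_source_network(bytes(trashed), bytes.fromhex("8008A600"))
print("source network after repair:", repaired[SRC_NET_START:SRC_NET_END].hex())
```

The point of testing the hop count is that the router no longer has to
guess which source values are garbage: zero, 02A14ECA, or anything else
gets corrected, while packets forwarded from other networks are not
touched.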

My problems are solved (or at least worked-around) and have been for a
span of time measured in weeks.  I can sleep again.

So:

I hope that this discussion might save someone else's hair from
being pulled out.  

Does anyone know anything about Novell XMSNET3 programs that trash
themselves like this?  I figure it is only a matter of time before a
random store (if that's what it is) hoses something else in XMSNET3
instead of the source IPX network number for watchdog packets.

My users really want to continue to use the extended memory (XMS...)
versions since it gives them more memory to run their applications in.
My suggestions that they get a real computer and run a real operating
system with real networking were not well received :-)  Given that we
have to continue to live with Novell IPX, is anyone aware of a version
of XMSNET3 that doesn't have this problem?  

We tried a variety of different versions of XMSNET3, each of which
seemed to have the problem.  Note that the BYU packet driver stuff is
in yet another program, called IPX (or NEWIPX or IPXPKT), and NOT in
the XMSNET3 program.  That tends to make me not blame the BYU packet
driver stuff.  But then again, this is crufty PC software, so who
knows WHAT is going on in there.

Also, I'm told that the current release of cisco gateway server
software incorporates this fix; I don't know for certain.  Once again,
I have to thank cisco for the time and effort they put in to help
resolve this problem, even AFTER it was clear to me that there was no
fault in the cisco product.  I wish my relationship with other vendors
was as good.


Louis Mamakos
Assistant Manager, Network Infrastructure
University of Maryland, College Park