[comp.sys.novell] Compaq crashes on campus backbone

ehm1@wally.cc.msstate.edu (Eddie Mikell) (04/16/91)

Fellow netters:

I need help with a problem which has plagued me for almost 2 YEARS, but
I have had no luck from vendors or netwire on a solution or a possible
course of action.

I'll try to break my setup down:

The backbone:

Our backbone consists of a fiber-optic loop connecting all the buildings
on campus.  On that loop we have a variety of systems running: Suns, DECs,
etc.  We also have 13 Novell servers connected to the backbone.

Each server is isolated from the backbone by a Retix bridge, and each
building is isolated from the fiber by another Retix or Cabletron bridge.

The Novell servers can all communicate with each other, and we have had few
problems.  4 of the servers are used in a lab environment with boot ROMs and
remote reset image files.  All the machines are running Advanced NetWare 2.15
Version C, except the one giving me trouble, which is running NetWare SFT 2.15 C.



The machine giving me the trouble:

The server is a Compaq 386 16 MHz "Deskpro".  It is using a Racal Interlan
NP600 comm card.  It is running NetWare 2.15 version C with SFT, and
TTS is turned on (note: it is the only machine on campus with TTS).



The 2 year problem:

Whenever the Compaq is placed on the campus backbone, it will crash.  The
time between crashes is random, and it does not appear to have anything
to do with the load on the Ethernet, as it will crash at 3 in the morning
or 3 in the afternoon.

It does not give an error message, and will not take any input on the keyboard
after the crash, so it is totally locked up.

If you leave the machine off the net (by turning off its Retix bridge), the
machine will run with no problems.



What I've done:


Since I suspected hardware problems first, I ran the Compaq diagnostics.
They turned up nothing.  The machine was running with 3 megs of memory, but
since it was handling routing requests for the whole net, I upgraded the
memory to 8 megs.  Same problem.


The machine has been totally replaced with another machine of the same
make and model... only the disk drives were moved from one to the
other.  Same problem....

I took the sniffer up and let it run till the server crashed again.  Nothing
out of the ordinary was indicated.  No packet fragments, etc.  I did see some
intercampus network activity going on behind the bridge, so I swapped
that bridge with another Retix, and with another Cabletron, but noted
the same traffic.  I checked behind the other bridges on
campus, and this same traffic was seen on all the bridges that
were connected to Novell servers, so it seems to be something
that is happening to all the servers on campus
(that's another mystery which I need to investigate; I
thought bridges were supposed to filter out that stuff).


Call for help:

I'm now at a loss why this particular machine should crash the way it
does.  Since I've replaced everything hardware-wise except the disk drive
(which passed the diagnostics, the hot fix area hasn't been used, etc.),
I'd have to rule out hardware problems.

As I've noted before, it's the only machine on the net that's
crashing like this.  The other servers are various brands of PC, but
mostly your 386 16 MHz clones, and they do not have this
problem.

No other system on campus runs the SFT software.  It seems hard to 
believe that somehow the TTS and the campus backbone are colliding, but
there could be a very logical reason why this is so.

I'm open to any suggestions, either from people who have had this
kind of problem, or on what to test next, what I should
replace, etc.

I've left this same message on Netwire, but no solutions have come from
that direction either.



Thanks


Eddie H. Mikell

jlamb@npd.Novell.COM (Jason "Nematode" Lamb) (04/18/91)

Have you checked FCONSOLE statistics for the server? Particularly the Dynamic
Memory Pool 1 Maximum and Peak Used stats. If you haven't, log in to the
server, run FCONSOLE, and pull up the Statistics Summary screen. Then
let the server do its thing. When it crashes again, note the DMP 1 stats. If
Peak Used is really close to Maximum, you will have locked up the server by
maxing out DMP 1 usage. You can then try a couple of things. 1) Turn off TTS.
(The server will then go through the right error handling for maxing out
DMP 1.) 2) If that's not acceptable, try to limit DMP 1 usage. (If you don't
know what takes up DMP 1, reply.) Also, I believe this problem is taken care
of in 2.2.

Jason Lamb

kenh@techbook.com (Ken Haynes) (04/20/91)

Jason,
What about routing errors?  I've heard of this kind of thing happening in
other multiprotocol environments on Ethernet, and I think I recall the problem
was in the routing of "other" packets internally.  I also heard that if the
server and workstations were ECONFIG'ed the problem went away, because the
server now understood and could handle the "alien" packet traffic properly.
This is all third hand, but I thought I'd throw my $.02 worth in.

Ken

-- 
******************************************************************************
* Ken Haynes, CNE                    | 1-900-PRO-HELP
* Technical Support Product Manager, 900 Support
* UUCP: {nosun, sequent, tessi} kenh@techbook