[comp.sys.proteon] Overview Problem

klong@umd5.umd.edu (Kim Long) (10/17/90)

I'm writing on behalf of the SURAnet NOC in regard to a problem
we are experiencing with the Overview Network Monitoring package.
Sometime last Saturday, the package began to behave in an unpredictable
manner without apparent reason.  No changes to the topography had been
made in over a week nor had any changes been made to the platform or
hard disk.  

The symptoms that occur are as follows:

	All nodes on the map turn red, links blue
	Keyboard/mouse lockup
	Disappearing links
	Relocation of nodes
	Reappearing links that randomly connect to nodes
	Total disappearance of the map from the screen leaving
		no datafiles in OVDATA upon reboot

Any one or all of the symptoms can occur at a given time.  Presently,
we perform a warm boot (or a cold boot, if necessary) to redraw the
screen and get the current status.  We are currently having to reboot
about every 15 minutes or less.

Using a sniffer we have examined the packets originating and destined
for the Overview unit.  The DLC, IP, and UDP protocol layers appear
to be intact.  However, the SNMP layer seems to become garbled
in various ways, generating different messages on the sniffer, from
"unknown syntax" to an "error" in the command area of the SNMP data
layer.  In 10 minutes time we captured over 4000 packets associated
with the Overview host. We are not seeing any collisions, CRC errors, or
framing errors on the Ethernet.  Presently, we have 116 nodes that we
query using Overview.  The cycle time is set to 55 and the resp time is
set to 1.5.
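
As a rough cross-check on the numbers above, here is a back-of-the-envelope
sketch in Python. It assumes the cycle time of 55 is in seconds and that each
poll is one request plus one response; the post states neither, so treat both
as assumptions. The function name is hypothetical.

```python
def expected_packets(nodes, cycle_seconds, window_seconds, packets_per_poll=2):
    """Estimate packets seen in a capture window if every node is
    polled once per cycle (assumption: one request + one response)."""
    cycles = window_seconds / cycle_seconds
    return nodes * cycles * packets_per_poll


# Figures from the post: 116 nodes, cycle time 55, 10-minute capture.
estimate = expected_packets(nodes=116, cycle_seconds=55, window_seconds=600)
observed = 4000  # "over 4000 packets" in 10 minutes
ratio = observed / estimate

print(f"expected about {estimate:.0f} packets, observed over {observed}")
print(f"observed/expected ratio: {ratio:.2f}")
```

Under those assumptions the capture holds roughly one and a half times the
traffic a single query per node per cycle would produce, which might point to
multiple queries per node or to retries after missed responses.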

Would  you have any idea as to the cause of the problem?   Would a math
co-processor take some of the load off?  Thanks for your input.

Kim Long
SURAnet Operations

louie@sayshell.umd.edu ("Louis A. Mamakos") (10/17/90)

I could begin a minor flame about how the Overview product hasn't been
supported by Proteon with any maintenance releases since its
inception, but I suspect that you all know that.

The one fatal problem that we have discovered which is going to cause
us to abandon use of our Overview is that it cannot deal with routers
which have more than 16 interfaces.  I'm not sure of the specifics of
how it loses its cookies, but it fails hard when it does.  And it
does fairly quickly.  Perhaps this is your problem.  We have to use
PING to monitor those routers (cisco AGS+ routers) rather than SNMP,
and that's not nearly as useful.

We are currently looking for more robust, supported, UNIX based SNMP
monitoring platforms for our NOC.  I'm glad that we don't pay for
software maintenance on the Overview product given the nonexistent
level of support for it.

louie

gutierre@noc.arc.nasa.gov (Robert Michael Gutierrez) (10/19/90)

[I originally wrote this to the p-4200 distribution list, but it never made it here]

klong@umd5.umd.edu (Kim Long) writes on the P-4200 dist. list:

>I'm writing on behalf of the SURAnet NOC in regard to a problem
>we are experiencing with the Overview Network Monitoring package.

NASA Science Internet Network Operations uses the same platform
currently to monitor AMES-NET (NSIPO) from the Ames Research Center.
We are also experiencing problems with that platform.

The current problem we have had is that all of the buffers on the
device driver for PC-TCP seem to fill up, causing a transmission
lockup on the ethernet interface.  The subsequent lockup causes
Overview to status all the nodes as unreachable (red nodes and
red lines).  We can exit the program (slowly) though, but this
does not free up the buffers.  A warm reboot is needed to clear
the buffers at this point.

We swapped out the complete box (except the EGA display card...the
replacement PC didn't seem to have a display card), but we still
have the same problem.

>Presently, we have 116 nodes that we query using Overview.
>The cycle time is set to 55 and the resp time is set to 1.5.

We have Overview Version 1.00 (dated 12/88...is this the only version
ever made!?!) and PC-TCP Version 2.03.  We currently have 54 nodes
on 1 level (we only use one 'cloud' [member] to group 4 nodes). Of
those nodes, 52 are SNMP.  Our query times are set fairly high (20
seconds on a response of 2.0...this is the only way to catch framing
and CRC hits).

>Would  you have any idea as to the cause of the problem?   Would a math
>co-processor take some of the load off?  Thanks for your input.

I thought about this angle, but it would only work if the Overview program
was compiled with the appropriate options.

Was the intention of Overview only to support small networks?  Is it
incapable of handling anything over "X" amount of nodes? (Our magic number
seemed to have been around 35 nodes).  It was a platform that performed
fine for the initial small network we previously had, but as nodes were
added, it was apparent that it was not going to be as flexible as we
were expecting.

We are trying to implement SunNet Manager, but that's a whole 'nother
can of worms right now.



   Robert Michael Gutierrez
   NASA Science Internet - Network Operations Center.

kwe@buit13.bu.edu (Kent England) (10/19/90)

In article <9010170358.AA02833@umd5.UMD.EDU>
 klong@umd5.umd.edu (Kim Long) writes:
>I'm writing on behalf of the SURAnet NOC in regard to a problem
>we are experiencing with the Overview Network Monitoring package.
>Sometime last Saturday, the package began to behave in an unpredictable
>manner without apparent reason.  No changes to the topography had been
>made in over a week nor had any changes been made to the platform or
>hard disk.  
>

	You may be subject to a slowly filling disk.  Try to free up
some space and see if this helps.

	--Kent
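
A minimal sketch of the free-space check suggested above, written in Python
for illustration only (the Overview station itself was a PC, and nothing in
the thread says how disk usage was inspected). The function names and the
10% threshold are assumptions.

```python
import shutil


def free_fraction(path="."):
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def disk_is_low(path=".", threshold=0.10):
    """True when less than `threshold` of the disk is free.
    The 10% default is an arbitrary illustrative value."""
    return free_fraction(path) < threshold


print(f"free: {free_fraction('.'):.1%}, low: {disk_is_low('.')}")
```

Logging this value periodically would show whether the disk is in fact
filling slowly over time.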

oleary@noc.sura.net (dave o'leary) (10/20/90)

Our Overview problems seem to be resolved for the moment.

There were a few different things that we did - 

1/ we watched the ethernet with a sniffer and saw that there were
	no broken ethernet packets, and no broken IP packets coming
	from Overview, even the UDP checksums were okay, however the 
	SNMP part of the packets (i.e. the UDP data) was somehow 
	munged.  The gateways weren't responding to the broken queries,
	so they turned red on the monitor, even though they were up.
	Worse, some of the gateways were crashing with a NM_6B8 bughalt.
	Needless to say, this is less than ideal behavior.  We also got
	an error on the monitor process of the gateway reporting a 
	bad SNMP packet.  As a hack to get around this we started pinging
	a bunch of the gateways instead of SNMP querying them - at least
	it kept Overview and the gateways from crashing.  We also raised
	the time between queries on the various gateways.  (hint to 
	network management package implementors:  it would have made our 
	lives much easier if we could do this by changing all nodes at
	once rather than changing the hundred+ nodes  one at a time).

2/ We got new software from Proteon - I'm not sure of the details, but
	it was four executables that replaced older versions.  This 
	seemed to help significantly.

3/ We backed up the hard disk, reformatted it, and reinstalled everything.
	This was completed at about 10 last night and things seem to 
	have worked since last night.
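
The kind of check described in item 1/ (UDP payload intact, SNMP framing
broken) can be sketched as follows. This is a minimal Python illustration,
not the sniffer's actual logic: it only verifies the outer BER framing of an
SNMPv1 message (SEQUENCE, version, community, PDU tag) and does not decode
the variable bindings. The function names are hypothetical.

```python
def read_tlv(buf, pos):
    """Return (tag, value_start, value_end) for the BER element at pos."""
    if pos + 2 > len(buf):
        raise ValueError("truncated header")
    tag = buf[pos]
    first = buf[pos + 1]
    pos += 2
    if first < 0x80:                      # short-form length
        length = first
    else:                                 # long-form length
        n = first & 0x7F
        if n == 0 or pos + n > len(buf):
            raise ValueError("bad length field")
        length = int.from_bytes(buf[pos:pos + n], "big")
        pos += n
    if pos + length > len(buf):
        raise ValueError("value runs past end of payload")
    return tag, pos, pos + length


def check_snmpv1(payload):
    """Return None if the outer SNMPv1 framing looks sane, else a reason."""
    try:
        tag, start, end = read_tlv(payload, 0)
        if tag != 0x30:
            return "outer element is not a SEQUENCE"
        if end != len(payload):
            return "trailing bytes after message"
        tag, vs, ve = read_tlv(payload, start)
        if tag != 0x02 or payload[vs:ve] != b"\x00":
            return "version is not INTEGER 0"
        tag, cs, ce = read_tlv(payload, ve)
        if tag != 0x04:
            return "community is not an OCTET STRING"
        tag, ps, pe = read_tlv(payload, ce)
        if not 0xA0 <= tag <= 0xA4:       # GetRequest .. Trap in SNMPv1
            return "unknown PDU type"
        if pe != end:
            return "PDU length disagrees with message length"
    except ValueError as exc:
        return str(exc)
    return None


# Hand-built GetRequest for sysDescr.0 with community "public".
good = bytes.fromhex(
    "3026" "020100" "04067075626c6963" "a019"
    "020101" "020100" "020100"
    "300e" "300c" "06082b06010201010100" "0500"
)
bad = bytearray(good)
bad[13] = 0x55                            # corrupt the PDU tag

print(check_snmpv1(good))
print(check_snmpv1(bytes(bad)))
print(check_snmpv1(good[:20]))
```

Running a check like this over captured UDP payloads would separate queries
that are malformed at the framing level from ones that are well-formed but
carry bad contents.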

Thanks very much for all of your suggestions in our time of need, and
to Proteon for the new software.  Kim Long spearheaded our efforts
here to get the problem resolved, so she may be able to provide more
details about the various fixes.  Thanks Kim !!

					dave o'leary
					SURAnet NOC Mgr.

tvm@proteon.com (Tom Miceli) (10/20/90)

In regard to the message below referencing the OVERview units and Seagate
hard drives, there needs to be some clarification. Although we have had
problems with the Seagate hard drives shipped with the OVERview units, they
have been dealt with on a case-by-case basis. Empirical data tells us that
when a unit operates properly over an extended period, the hard drive
remains intact; i.e., the hard drives suffer from infant mortality. The
newer hard drives that we are now shipping do not suffer from this problem.
At NIH we are going to replace 2 of the OVERview units and see how they
perform. If this solves the problems, we will then replace the remaining
units.

Tom Miceli
Mgr, Tech Support

=========================================================================
From: "Jay E. Vinton" <JEV@CU.NIH.GOV>
Date:     Fri, 19 Oct 90  13:29:16 EDT
Subject:  overview problem

> Date: Tue, 16 Oct 90 23:58:28 EDT
> From: klong@umd5.umd.edu (Kim Long)
> Message-Id: <9010170358.AA02833@umd5.UMD.EDU>
> To: p4200@devvax.TN.CORNELL.EDU
> Subject: Overview Problem
> Cc: ops@noc.sura.net
>
> I'm writing on behalf of the SURAnet NOC in regard to a problem
> we are experiencing with the Overview Network Monitoring package.
> Sometime last Saturday, the package began to behave in an unpredictable
> manner without apparent reason.  No changes to the topography had been
> made in over a week nor had any changes been made to the platform or
> hard disk.
>
> The symptoms that occur are as follows:
>
>        All nodes on the map turn red, links blue
>        Keyboard/mouse lockup
>        Disappearing links
>        Relocation of nodes
>        Reappearing links that randomly connect to nodes
>         Total disappearance of the map from the screen leaving
>                no datafiles in OVDATA upon reboot

I am not subscribed to p4200 but this mail was forwarded to me by
RAF@CU.NIH.GOV. We at NIH currently have 7 OVERVIEW stations.  All of
our stations have crashed/hung at one time or another with similar
symptoms.  Proteon customer service is telling us that there is a
problem with the Seagate hard drives originally supplied with these
units.  They say that they are now using Conner hard drives instead
for newly made units.  They say that they will replace all of our
units with the new drives, however, they say that it may be 3 weeks
until we get our first replacement.  I will post to this list the
results of our experiences with the replacement units if and when
that happens.

gutierre@noc.arc.nasa.gov (Robert Michael Gutierrez) (10/20/90)

oleary@noc.sura.net (dave o'leary) writes:
> Our Overview problems seem to be resolved for the moment.
> 
> There were a few different things that we did - 
> 
> 1/ we watched the ethernet with a sniffer and saw that there were
> 	no broken ethernet packets  ....  however the 
> 	SNMP part of the packets (i.e. the UDP data) was somehow 
> 	munged.

Bingo.  We found the same problem last night when our Overview showed
the usual signs of crashing (this time, only 1 node went red instead of
all of them).  Watching the packets, our engineer noticed that the SNMP
data was corrupted, but after a reboot, all was fine.

> 	Worse, some of the gateways were crashing with a NM_6B8 bughalt.

We configured all our gateways as read-only (I thought this was a little
paranoid, but now it seems to have been A Good Thing).  Were your
gateways configured as full read-write???  We never looked into whether
any of our gateways crashed at the same time Overview crashed, because
we're too busy waiting for the PC to boot back up and trying to delete
all the alerts that had accumulated.

> 	Needless to say, this is less than ideal behavior.  We also got
> 	an error on the monitor process of the gateway reporting a 
> 	bad SNMP packet.  As a hack to get around this we started pinging
> 	a bunch of the gateways instead of SNMP querying them - at least
> 	it kept Overview and the gateways from crashing.

I currently have console output from one of our routers being monitored 
& logged because it was crashing numerous times.  Now that this connection
between bad SNMP packets and crashing routers is a possibility, I'll
sort through the output for those appropriate messages.

> 2/ We got new software from Proteon - I'm not sure of the details, but
> 	it was four executables that replaced older versions.  This 
> 	seemed to help significantly.

Was this Overview software, or PC-TCP software?  Again, when our Overview
crashes, we have no buffers free anymore, hence no programs can communicate
with the PC-TCP driver anymore.

> 3/ We backed up the hard disk, reformatted it, and reinstalled everything.
> 	This was completed at about 10 last night and things seem to 
> 	have worked since last night.

That was our first step a loooong time ago.  Obviously, it never worked.

   Robert Michael Gutierrez
   NASA Science Internet Office - Network Operations Center.
   Ames Research Center, Moffett Field, California.  USA.