[comp.dcom.sys.cisco] Bugs in cisco SNMP integer variable encoding

swatt@noc.net.yale.edu (Alan S. Watt) (08/21/90)
After several rounds of correspondence with cisco, I have now confirmed
there is a bug in the integer encoding of SNMP variables.  This bug
affects 8.1, and presumably 8.0.  Prior versions had other bugs.  This
one was confirmed after the 8.1(19) maintenance release, so I assume the
fix will appear in the next maintenance release or in 8.2.

The symptoms are bad values at and immediately after a 32-bit integer
overflow.  The increment implied for the interval containing the
overflow is significantly larger than average, and several sampling
periods later a value appears that looks totally random.

Greg Satz of cisco confirms the problem occurs when a value crosses a
0xff000000 boundary.  I don't have any more details on this, so the
following might be wrong, but it appears the data can be fixed by
throwing out the values between the last one with the 0xff000000 bits
set and the apparently bogus one.  That is, assume the second
discontinuity is the real overflow and the samples in between are in
error.  You can then replace those samples with values calculated by
spreading the total increment evenly over the interval.

The other problems with 7.1 releases show up as an entirely bad sample
(that is, all the variables appear to be wrong).  I have found that if I
just throw that sample away, the next sample is a reasonable
continuation of the previous one.

Following are (1) an example of the problem, and (2) an "awk" script I
use to check for discontinuities in my snmppoll log files.   It works
with the snmppoll format of NYSER SNMP distribution 3.0; I assume it
can be adapted to work with other versions.


	- Alan S. Watt
	  High Speed Networking, Yale University
	  Computing and Information Systems
	  Box 2112 Yale Station
	  New Haven, CT  06520-2112
	  (203) 432-6600 X394
	  Watt-Alan@Yale.Edu

Disclaimer: It is a violation of federal law to use this article in
	    a manner inconsistent with this disclaimer.

======================================================================
Fri Aug 10 18:09:18 1990 ifOutOctets_2	0xff508f2a   4283469610
Fri Aug 10 18:24:23 1990 ifOutOctets_2	0xff7d315a   4286394714
Fri Aug 10 18:39:30 1990 ifOutOctets_2	0x00c060c8     12607688
Fri Aug 10 18:54:36 1990 ifOutOctets_2	0x00df8442     14648386
Fri Aug 10 19:09:42 1990 ifOutOctets_2	0x00f63868     16136296
Fri Aug 10 19:24:49 1990 ifOutOctets_2	0x00346748      3434312
Fri Aug 10 19:39:53 1990 ifOutOctets_2	0x0051fbdc      5372892
Fri Aug 10 19:55:01 1990 ifOutOctets_2	0x005cfb52      6093650

If you look at the increments at each interval, you get:

	record    2: diff =      2925104
	record    3: diff =     21180270
	record    4: diff =      2040698
	record    5: diff =      1487910
	record    6: diff =   4282265312
	record    7: diff =      1938580
	record    8: diff =       720758

The increment between records 2 and 3 (when the overflow occurred) is
within Ethernet capacity, but is so much higher than the average for
the entire day (not shown) that it is likely to be an error.  If record
3 is a genuine overflow, the value at record 6 is obviously wrong: it
would mean a second overflow, with over 4 billion octets sent in a
single 15-minute interval.
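
The increments above can be reproduced with something like the
following.  This is only a sketch, not the script I used: the function
name counter_diffs is made up for illustration, it assumes the snmppoll
field layout described further down (name in $6, decimal value in $8),
and it assumes any decrease is exactly one wrap of a 32-bit counter.

```shell
# Print the increment of each counter between successive records,
# treating any decrease as a single 32-bit wrap (an assumption).
counter_diffs() {
	awk '
	{
		name = $6
		value = $8 + 0			# force numeric
		if (name in prev) {
			diff = value - prev[name]
			if (diff < 0)
				diff += 4294967296	# 2^32
			# %.0f rather than %d: some awks truncate %d to 31 bits
			printf "record %4d: diff = %12.0f\n", NR, diff
		}
		prev[name] = value
	}
	' "$@"
}
```

Run as "counter_diffs sample.dat", or with the log on standard input.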

Following is the "corrected" sample, done by assuming the real overflow
occurred at record 6, and that records 3-5 should be replaced by a
calculated average increment between records 2 and 6:

Fri Aug 10 18:09:18 1990 ifOutOctets_2 0xff508f2a 4283469610
Fri Aug 10 18:24:23 1990 ifOutOctets_2 0xff7d315a 4286394714
Fri Aug 10 18:39:30 1990 ifOutOctets_2 0xffaafed5 4289396437
Fri Aug 10 18:54:36 1990 ifOutOctets_2 0xffd8cc50 4292398160
Fri Aug 10 19:09:42 1990 ifOutOctets_2 0x000699cb     432587
Fri Aug 10 19:24:49 1990 ifOutOctets_2 0x00346748    3434312
Fri Aug 10 19:39:53 1990 ifOutOctets_2 0x0051fbdc    5372892
Fri Aug 10 19:55:01 1990 ifOutOctets_2 0x005cfb52    6093650

Which gives increments of:

	record    2: diff =      2925104
	record    3: diff =      3001723
	record    4: diff =      3001723
	record    5: diff =      3001723
	record    6: diff =      3001725
	record    7: diff =      1938580
	record    8: diff =       720758
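
The replacement itself can be sketched as below.  Again this is an
illustration under assumptions, not a tested tool: the name
interpolate_wrap and its two arguments are hypothetical, you must decide
by eye which record is the last good one and which is the real overflow,
and it assumes a POSIX awk (user-defined functions and -v).  It uses the
whole-number average increment, so any remainder lands in the last
interval, as in the figures above.

```shell
# usage: interpolate_wrap GOOD OVERFLOW [file ...]
#   GOOD      record number of the last trusted sample before the bad run
#   OVERFLOW  record number of the sample taken to be the real overflow
# Records strictly between the two are replaced by evenly spaced values.
interpolate_wrap() {
	good=$1 overflow=$2
	shift 2
	awk -v good="$good" -v overflow="$overflow" '
	# Format n (0 <= n < 2^32) as 0x plus eight hex digits, without
	# relying on printf %x handling values above 2^31.
	function hex8(n,    s, i, d) {
		s = ""
		for (i = 0; i < 8; i++) {
			d = n % 16
			s = substr("0123456789abcdef", d + 1, 1) s
			n = (n - d) / 16
		}
		return "0x" s
	}
	{
		stamp[NR] = $1 " " $2 " " $3 " " $4 " " $5
		name[NR] = $6
		hex[NR] = $7
		dec[NR] = $8 + 0
	}
	END {
		good += 0; overflow += 0
		wrap = 4294967296		# 2^32
		total = dec[overflow] - dec[good]
		if (total < 0)
			total += wrap		# one wrap assumed
		step = int(total / (overflow - good))
		for (i = 1; i <= NR; i++) {
			if (i > good && i < overflow) {
				dec[i] = (dec[good] + (i - good) * step) % wrap
				hex[i] = hex8(dec[i])
			}
			printf "%s %s %s %10.0f\n", stamp[i], name[i], hex[i], dec[i]
		}
	}
	' "$@"
}
```

For the sample above, that would be "interpolate_wrap 2 6 sample.dat".
It expects a file holding a single variable.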

======================================================================
	#! /bin/sh

	#
	# Format of snmppoll data:
	#
	# Sun Aug 19 02:02:09 1990	_mgmt_mib_system_sysUpTime_0	\
	#					0xcb58b9c   213224348
	#
	#	$1	Day of week	("Mon", "Tue", ...)
	#	$2	Month		("Jan", "Feb", ...)
	#	$3	Day of month	(1, 2, 3, ...)
	#	$4	Time of day	(HH:MM:SS [24-hour clock])
	#	$5	Year
	#	$6	variable name
	#	$7	value in hex
	#	$8	value in decimal
	#

	awk '
	BEGIN {
		params["bogus"] = 0
	}
	{
		# reset NR when filename changes
		# also clear out stored previous values
		if (FILENAME != prevFILENAME) {
			NR = 1
			prevFILENAME = FILENAME
			for (var in params)
				params[var] = 0
		}
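		# Get the variable name and value; "+ 0" forces a numeric
		# (not string) comparison below
		varName = $6
		varValue = $8 + 0
		varTime = $2 " " $3 " " $4

		# A counter that decreases has either wrapped or is bad.
		# (No single quotes in the format: they would end the
		# shell quoting around this awk program.)
		if (varValue < params[varName]) {
			printf "discontinuity: %s:%d; at %s; for %s\n", \
				FILENAME, NR, varTime, varName
		}
		params[varName] = varValue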
	}
	' "$@"
======================================================================
This script, when run on the original sample data above, produces:

	discontinuity: sample.dat:3; at Aug 10 18:39:30; for ifOutOctets_2
	discontinuity: sample.dat:6; at Aug 10 19:24:49; for ifOutOctets_2