[comp.sys.proteon] Op err 8704 hst 0 nt 2 int Pro/0

CLIFF@UCBCMSA.BITNET (Cliff Frost {415} 642-5360) (09/20/88)

Hi,
We recently had a partial ring meltdown where one symptom was that
all the ring p4200s were spitting out the following error msg:

        Op err 8704 hst 0 nt 2 int Pro/0

The external symptom was that large data transfers would work in only
one direction across the ring.  Using large ftps, we isolated the
flakey component, which turned out to be a p3280.

Now we're seeing this same error msg on one particular gateway, and
not on any others.  It doesn't appear to be serious yet, but we are
wondering:

1)  What does this error msg mean?  We can't seem to find more than
    generalities in the manuals (maybe we aren't looking in the right
    place...).

2)  What should we replace first?  (The first choice would seem to be
    the p80 interface in the gw, but any advice is welcome.)

3)  Lots of these error msgs seem to correlate well with a high ratio of
    'output refused' to 'output packets' on the p80 interface.  Just from
    simple monitoring of our ring, a ratio of less than .01 seems to be
    'normal' and it has climbed up to .06 - .07 on this one gw.  Does
    anyone have a sense of what these numbers 'should' be?
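
The ratio described in (3) can be computed from the two interface
counters. The following is a minimal sketch only: the 0.01 baseline is
the figure observed on this one ring, not a vendor specification, and
the function names and counter values are hypothetical.

```python
# Baseline taken from the observation above (about 0.01 on a healthy ring).
# This is an empirical figure from one site, not a vendor specification.
NORMAL_RATIO = 0.01


def refused_ratio(output_refused, output_packets):
    """Return output_refused / output_packets, or 0.0 if nothing was sent."""
    if output_packets == 0:
        return 0.0
    return output_refused / output_packets


def looks_abnormal(output_refused, output_packets, baseline=NORMAL_RATIO):
    """True when the ratio is above the site's usual baseline."""
    return refused_ratio(output_refused, output_packets) > baseline


# Hypothetical counter values, for illustration only.
print(refused_ratio(650, 10000))
print(looks_abnormal(650, 10000))
print(looks_abnormal(50, 10000))
```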

        Thanks,
                Cliff

fedor@NISC.NYSER.NET (Mark S. Fedor) (09/20/88)

	Cliff,

	I saw this error while working at Cornell, which has a
	ProNET-10.  As you mentioned, it can be caused by a number
	of problems, including a broken component on the ring.

	In the gateway, however, this error seemed to occur when the
	ProNET-10 card was bad (usually the CTL board) or the power
	supply was freaking out.  I usually started by swapping in a new
	ProNET-10 board (usually the CTL first).

	Also check your cabling to the wire center, and use
	high-quality cable.

	Mark

	P.S.  I found the ProNET-10 CTL boards to be quite fragile!

tvm@proteon.COM (Tom Miceli) (09/22/88)

Date: Tue, 20 Sep 88 10:23:51 EDT
To: cgw
In-Reply-To: Cliff Frost {415} 642-5360's message of Mon, 19 Sep 88  15:25 PDT <8809200013.AA20600@devvax.TN.CORNELL.EDU>
Subject: Op err 8704 hst 0 nt 2 int Pro/0

Cliff,
	Here is the response that the development people gave for the
8704 error.
					TOM

The error number is a generic MOS device error code.  Reported codes
always have the 0x8000 bit set, which is the error bit.  Here is the
list of generic codes (shown without the error bit):

/* Device errors */
ID_OFLN 0x0104	/* Device offline */
ID_EOM  0x0204	/* End of medium */
ID_DAT  0x0304	/* Data error */
ID_OVFL 0x0404	/* Data overflow (overrun) */
ID_HRDE 0x0504	/* Hard device error */
ID_EOF  0x0604	/* End of file mark */
ID_OTER 0x0704	/* Output error */
ID_IPPE 0x0804	/* IPP mate error (??) */
ID_RNGE 0x0904	/* Ring buffer error */
ID_NXME 0x0A04	/* Nonexistent memory */
ID_ATTN 0x0B04	/* Attention on */
ID_DEAD 0x0C04	/* Device address error */
ID_FLSH 0x0D04	/* Device output flushed */
ID_HDWA 0x0E04	/* Hardware ACK failed */
ID_NSD  0x0F04	/* No such destination */

/* User errors */
IU_NDEV 0x0102	/* Nonexistent device */
IU_UNDF 0x0202	/* Undefined device */
IU_ALER 0x0302	/* Device already allocated */
IU_FNER 0x0402	/* Illegal function error */
IU_UNPV 0x0502	/* Unpriv I/O request error */
IU_WPER 0x0602	/* Write protect error */
IU_ODAD 0x0702	/* Odd address error */
IU_DAER 0x0802	/* Device address error */
IU_DRST 0x0902	/* Device reset error */
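
As an illustration of how the error bit combines with the generic
codes, here is a minimal Python sketch that decodes the number printed
after "Op err". It assumes the printed number is hexadecimal; the
function and table names are hypothetical, and the table contents are
copied from the list above.

```python
ERROR_BIT = 0x8000  # always set in a reported error code

# Generic MOS codes, copied from the list above.
DEVICE_ERRORS = {
    0x0104: ("ID_OFLN", "Device offline"),
    0x0204: ("ID_EOM", "End of medium"),
    0x0304: ("ID_DAT", "Data error"),
    0x0404: ("ID_OVFL", "Data overflow (overrun)"),
    0x0504: ("ID_HRDE", "Hard device error"),
    0x0604: ("ID_EOF", "End of file mark"),
    0x0704: ("ID_OTER", "Output error"),
    0x0804: ("ID_IPPE", "IPP mate error (??)"),
    0x0904: ("ID_RNGE", "Ring buffer error"),
    0x0A04: ("ID_NXME", "Nonexistent memory"),
    0x0B04: ("ID_ATTN", "Attention on"),
    0x0C04: ("ID_DEAD", "Device address error"),
    0x0D04: ("ID_FLSH", "Device output flushed"),
    0x0E04: ("ID_HDWA", "Hardware ACK failed"),
    0x0F04: ("ID_NSD", "No such destination"),
}

USER_ERRORS = {
    0x0102: ("IU_NDEV", "Nonexistent device"),
    0x0202: ("IU_UNDF", "Undefined device"),
    0x0302: ("IU_ALER", "Device already allocated"),
    0x0402: ("IU_FNER", "Illegal function error"),
    0x0502: ("IU_UNPV", "Unpriv I/O request error"),
    0x0602: ("IU_WPER", "Write protect error"),
    0x0702: ("IU_ODAD", "Odd address error"),
    0x0802: ("IU_DAER", "Device address error"),
    0x0902: ("IU_DRST", "Device reset error"),
}


def decode_op_err(text):
    """Decode the number after 'Op err' (assumed hex) into (kind, name, description)."""
    value = int(text, 16)
    if not value & ERROR_BIT:
        raise ValueError("error bit 0x8000 is not set in %r" % text)
    code = value & ~ERROR_BIT  # strip the error bit to get the generic code
    if code in DEVICE_ERRORS:
        return ("device",) + DEVICE_ERRORS[code]
    if code in USER_ERRORS:
        return ("user",) + USER_ERRORS[code]
    return ("unknown", None, None)


print(decode_op_err("8704"))
```

For the message in this thread, 0x8704 with the error bit stripped is
0x0704, which the list gives as ID_OTER.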


The mappings between the actual device errors and the error codes are
device specific.  Here is a partial list for ProNET-10/80:

BIT			Code	Hex
----------------------	-------	------
input overrun		ID_OVFL	0x8404
input bad format	ID_DAT	0x8304
input parity (10 only)	ID_DAT	0x8304
output bad format	ID_OTER	0x8704
output timeout		ID_DAT	0x8304
output overrun		ID_DAT	0x8304
output refused		ID_HDWA	0x8E04

We are working on getting all of these mappings into the
documentation.

So, your error code 8704 probably indicates output bad format.  There
is also an output bad format counter, which I'm sure is counting away.
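
The partial ProNET-10/80 table above can be read in reverse, from a
reported code back to the conditions that may have produced it. This is
a sketch only, with hypothetical names; the rows are copied from the
partial list, so any condition missing from that list is missing here
too.

```python
# (condition, generic code name, reported hex value), from the partial list above.
PRONET_CONDITIONS = [
    ("input overrun", "ID_OVFL", 0x8404),
    ("input bad format", "ID_DAT", 0x8304),
    ("input parity (10 only)", "ID_DAT", 0x8304),
    ("output bad format", "ID_OTER", 0x8704),
    ("output timeout", "ID_DAT", 0x8304),
    ("output overrun", "ID_DAT", 0x8304),
    ("output refused", "ID_HDWA", 0x8E04),
]


def conditions_for(code):
    """Return every listed condition that is reported with this code."""
    return [cond for cond, _name, value in PRONET_CONDITIONS if value == code]


print(conditions_for(0x8704))
print(conditions_for(0x8304))
```

Note that 0x8304 is shared by several conditions, so the code alone
does not always identify one; the per-condition counters are needed to
tell them apart.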


The hst part of the error message is the "address" the packet is going
to, in octal.  On a ProNET-10/80, that is the actual destination
address.  On networks with "large" addresses (Ethernet), it is a
pointer to the destination address.  (Sorry, I know that's not very
useful.)
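
To make the field layout concrete, here is a small sketch that splits
an "Op err" line into its parts. It assumes the error number is
hexadecimal and, as stated above, reads hst as octal. The meaning of the
"nt" field is not explained in this thread, so it is kept as a raw
string. The regular expression and function name are hypothetical.

```python
import re

# Pattern for lines such as: Op err 8704 hst 0 nt 2 int Pro/0
OP_ERR_RE = re.compile(
    r"Op err (?P<code>[0-9A-Fa-f]+) hst (?P<hst>[0-7]+) "
    r"nt (?P<nt>\S+) int (?P<intf>\S+)"
)


def parse_op_err(line):
    """Return the fields of an 'Op err' line as a dict, or None if no match."""
    m = OP_ERR_RE.search(line)
    if m is None:
        return None
    return {
        "code": int(m.group("code"), 16),  # assumed hexadecimal
        "hst": int(m.group("hst"), 8),     # octal, per the explanation above
        "nt": m.group("nt"),               # meaning not given here; left raw
        "interface": m.group("intf"),
    }


print(parse_op_err("Op err 8704 hst 0 nt 2 int Pro/0"))
```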


The most important ProNET-10/80 errors are input and output bad
format.  These indicate that something on the ring is mashing the bits
on the way around.  This could include:
	- A wire center relay operating (a normal event)
	- A sick fiber unit (optical power level problems, or
	  a problem in the reclocker on 80)
	- A cable with a broken or frayed wire
	- A sick node

One important feature of ProNET-80 is that one of the errors is
isolating; that is, it is per-hop.  The input parity error bit on
ProNET-80 was redefined as a link parity error.  Parity (actually
code violations) is checked by every node on the ring as it repeats
packets.  If a node detects a code violation, it sets the parity
error bit and flags the packet as bad.  Thus, the node where the
input parity error counter is counting up is the node immediately
downstream of the problem in the ring.  (As for the damaged packet,
both sender and recipient will get bad format errors.)

So, on ProNET-80, bad format shows that something is wrong, and parity
error will tell you where it is.  (The fault domain in 802.5 parlance.)
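
The isolation rule can be sketched as follows, assuming you have two
samples of the input parity error counter for each node and know the
order of the nodes in the direction of data flow. The node names,
counter values, and function name are all hypothetical, and how the
counters are collected is not shown.

```python
def suspect_spans(ring_order, before, after):
    """Return (upstream, downstream) pairs bracketing each suspected fault.

    ring_order lists the nodes in the direction of data flow.  A node whose
    input parity counter increased between the two samples is taken to be
    immediately downstream of a problem.
    """
    spans = []
    for i, node in enumerate(ring_order):
        if after.get(node, 0) > before.get(node, 0):
            upstream = ring_order[i - 1]  # index -1 wraps to the last node
            spans.append((upstream, node))
    return spans


# Hypothetical ring and counter samples, for illustration only.
ring = ["gw-a", "gw-b", "gw-c", "gw-d"]
before = {"gw-a": 0, "gw-b": 3, "gw-c": 10, "gw-d": 0}
after = {"gw-a": 0, "gw-b": 3, "gw-c": 57, "gw-d": 0}
print(suspect_spans(ring, before, after))
```

Here only gw-c is counting up, so the suspect span is the hop from
gw-b to gw-c, including the cabling and any fiber units between them.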

Now try removing the node on either side of the fault domain, or
swapping out the fiber units involved.  Do things one at a time, until
you find the component whose removal causes the errors to stop
counting.  Replace that component.

CLIFF@UCBCMSA.BITNET (Cliff Frost {415} 642-5360) (09/22/88)

Tom,
Thank you!  That is exactly the kind of information I was looking for.

We've seen clumps of these error msgs twice, and both times the symptoms
were different from what your note describes.

1)  The first time, all the gateways had very large numbers of 'output
    refused', and no significant numbers of either 'output bad format'
    or 'input parity errors'.  We isolated the problem to a particular
    p3280 by transferring large files back and forth through each gateway
    (one direction would always fail--sending along the ring in the way
    that went through the bad p3280).

2)  The second time only one gateway was putting out the 'Op err' msgs,
    and it had large numbers of 'output refused' messages.  Again, it
    was a flakey p3280.

So, in no case have we seen any 'input parity error' or 'output bad format'
counters go up along with this 8704 msg, only 'output refused'.

I guess this just shows there's more than one way to skin a network...
        Thanks again,
                        Cliff