CLIFF@UCBCMSA.BITNET (Cliff Frost {415} 642-5360) (09/20/88)
Hi,

We recently had a partial ring meltdown where one symptom was that all the ring p4200s were spitting out the following error msg:

    Op err 8704 hst 0 nt 2 int Pro/0

The external symptom was that large data transfers would work in only one direction across the ring. Using large ftps, we isolated the flakey component, which turned out to be a p3280.

Now we're seeing this same error msg on one particular gateway, and not on any others. It doesn't appear to be serious yet, but we are wondering:

1) What does this error msg mean? We can't seem to find more than generalities in the manuals (maybe we aren't looking in the right place...).

2) What should we replace first? (The first choice would seem to be the p80 interface in the gw, but any advice is welcome.)

3) Lots of these error msgs seem to correlate well with a high ratio of 'output refused' to 'output packets' on the p80 interface. Just from simple monitoring of our ring, a ratio of less than .01 seems to be 'normal' and it has climbed up to .06 - .07 on this one gw. Does anyone have a sense of what these numbers 'should' be?

Thanks,
Cliff
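The ratio in question 3 can be tracked with a few lines of code. What follows is only a minimal sketch: the function names and the snapshot format are hypothetical, and the 0.01 threshold is simply the "normal" figure quoted above, not a vendor-specified limit. It assumes the two counters are cumulative, so it works on the difference between two samples.

```python
# Hypothetical threshold: the "normal" ratio observed on this one ring.
NORMAL_RATIO = 0.01


def refused_ratio(prev, curr):
    """Ratio of 'output refused' to 'output packets' between two samples.

    prev and curr are (output_refused, output_packets) counter snapshots,
    assumed to be cumulative counts read from the p80 interface.
    """
    refused = curr[0] - prev[0]
    packets = curr[1] - prev[1]
    if packets <= 0:
        # No traffic (or a counter reset) between samples: nothing to report.
        return 0.0
    return refused / packets


def is_suspect(prev, curr, threshold=NORMAL_RATIO):
    """True when the ratio over the interval reaches the threshold."""
    return refused_ratio(prev, curr) >= threshold


if __name__ == "__main__":
    earlier = (10, 1000)   # made-up counter values for illustration
    later = (75, 2000)
    print(round(refused_ratio(earlier, later), 3), is_suspect(earlier, later))
```

Sampling over an interval rather than dividing the raw totals keeps an old burst of refusals from masking or exaggerating the current state of the interface.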
fedor@NISC.NYSER.NET (Mark S. Fedor) (09/20/88)
Cliff,

I have seen this error while working at Cornell. They have a ProNET-10. As you mentioned, this can be caused by a number of problems, including a broken component of the ring. In the gateway, however, this error seemed to occur when the ProNET-10 card was bad (usually the CTL board) or the power supply was freaking out. I usually started by swapping in a new ProNET-10 board (usually the CTL first). Also check your cabling to the wire center. Use high quality stuff.

Mark

P.S. I found the ProNET-10 CTL boards to be quite fragile!
tvm@proteon.COM (Tom Miceli) (09/22/88)
Date: Tue, 20 Sep 88 10:23:51 EDT
To: cgw
In-Reply-To: Cliff Frost {415} 642-5360's message of Mon, 19 Sep 88 15:25 PDT <8809200013.AA20600@devvax.TN.CORNELL.EDU>
Subject: Op err 8704 hst 0 nt 2 int Pro/0

Cliff,

Here is the response that the development people gave for the 8704 error.

TOM

The error number is a generic MOS device error code. They always have the 8000 bit set, which is the error bit. Here is the list of generic codes:

    /* Device errors */
    ID_OFLN  0x0104  /* Device offline */
    ID_EOM   0x0204  /* End of medium */
    ID_DAT   0x0304  /* Data error */
    ID_OVFL  0x0404  /* Data overflow (overrun) */
    ID_HRDE  0x0504  /* Hard device error */
    ID_EOF   0x0604  /* End of file mark */
    ID_OTER  0x0704  /* Output error */
    ID_IPPE  0x0804  /* IPP mate error (??) */
    ID_RNGE  0x0904  /* Ring buffer error */
    ID_NXME  0x0A04  /* Non existant memory */
    ID_ATTN  0x0B04  /* Attention on */
    ID_DEAD  0x0C04  /* Device address error */
    ID_FLSH  0x0D04  /* Device output flushed */
    ID_HDWA  0x0E04  /* Hardware ACK failed */
    ID_NSD   0x0F04  /* No such destination */

    /* User errors */
    IU_NDEV  0x0102  /* Non existant device */
    IU_UNDF  0x0202  /* Undefined device */
    IU_ALER  0x0302  /* Device already allocated */
    IU_FNER  0x0402  /* Illegal function error */
    IU_UNPV  0x0502  /* Unpriv I/O request error */
    IU_WPER  0x0602  /* Write protect error */
    IU_ODAD  0x0702  /* Odd address error */
    IU_DAER  0x0802  /* Device address error */
    IU_DRST  0x0902  /* Device reset error */

The mappings between the actual device errors and the error codes are device specific. Here is a partial list for ProNET-10/80:

    BIT                      Code      Hex
    ----------------------   -------   ------
    input overrun            ID_OVFL   0x8404
    input bad format         ID_DAT    0x8304
    input parity (10 only)   ID_DAT    0x8304
    output bad format        ID_OTER   0x8704
    output timeout           ID_DAT    0x8304
    output overrun           ID_DAT    0x8304
    output refused           ID_HDWA   0x8E04

We are working on getting all of these mappings into the documentation.

So, your error code 8704 probably indicates output bad format.
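The decoding described above (strip the 0x8000 error bit, look up the generic code, then map it to the ProNET-10/80 conditions) can be sketched as follows. This is an illustration only, not Proteon code: the function name `decode_error` and the dictionary names are hypothetical, and the tables contain nothing beyond the lists posted in this message.

```python
ERROR_BIT = 0x8000  # always set on a MOS device error, per the message

# Generic codes exactly as posted (device errors, then user errors).
GENERIC_CODES = {
    0x0104: ("ID_OFLN", "Device offline"),
    0x0204: ("ID_EOM", "End of medium"),
    0x0304: ("ID_DAT", "Data error"),
    0x0404: ("ID_OVFL", "Data overflow (overrun)"),
    0x0504: ("ID_HRDE", "Hard device error"),
    0x0604: ("ID_EOF", "End of file mark"),
    0x0704: ("ID_OTER", "Output error"),
    0x0804: ("ID_IPPE", "IPP mate error"),
    0x0904: ("ID_RNGE", "Ring buffer error"),
    0x0A04: ("ID_NXME", "Non existant memory"),
    0x0B04: ("ID_ATTN", "Attention on"),
    0x0C04: ("ID_DEAD", "Device address error"),
    0x0D04: ("ID_FLSH", "Device output flushed"),
    0x0E04: ("ID_HDWA", "Hardware ACK failed"),
    0x0F04: ("ID_NSD", "No such destination"),
    0x0102: ("IU_NDEV", "Non existant device"),
    0x0202: ("IU_UNDF", "Undefined device"),
    0x0302: ("IU_ALER", "Device already allocated"),
    0x0402: ("IU_FNER", "Illegal function error"),
    0x0502: ("IU_UNPV", "Unpriv I/O request error"),
    0x0602: ("IU_WPER", "Write protect error"),
    0x0702: ("IU_ODAD", "Odd address error"),
    0x0802: ("IU_DAER", "Device address error"),
    0x0902: ("IU_DRST", "Device reset error"),
}

# Partial ProNET-10/80 mapping from the message: generic code -> conditions.
PRONET_CONDITIONS = {
    "ID_OVFL": ["input overrun"],
    "ID_DAT": ["input bad format", "input parity (10 only)",
               "output timeout", "output overrun"],
    "ID_OTER": ["output bad format"],
    "ID_HDWA": ["output refused"],
}


def decode_error(hex_text):
    """Decode the hex number from an 'Op err' line, e.g. '8704'."""
    code = int(hex_text, 16)
    if not code & ERROR_BIT:
        raise ValueError("error bit 0x8000 is not set in %s" % hex_text)
    generic = code & ~ERROR_BIT  # 0x8704 -> 0x0704
    name, text = GENERIC_CODES.get(generic, ("UNKNOWN", "not in the posted list"))
    return name, text, PRONET_CONDITIONS.get(name, [])


if __name__ == "__main__":
    print(decode_error("8704"))
    # -> ('ID_OTER', 'Output error', ['output bad format'])
```

Because several ProNET conditions share ID_DAT, a code of 8304 alone cannot say which of them occurred; the per-condition interface counters would still be needed to tell them apart.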
There is also an output bad format counter, which I'm sure is counting away.

The hst part of the error message is the "address" the packet is going to, in octal. On a ProNET-10/80, that is the actual destination address. On networks with "large" addresses (Ethernet), it is a pointer to the destination address. (Sorry, I know that's not very useful.)

The most important ProNET-10/80 errors are input and output bad format. These indicate that something on the ring is mashing the bits on the way around. This could include:

- A wire center relay operating (a normal event)
- A sick fiber unit (optical power level problems, or a problem in the reclocker on 80)
- A cable with a broken or frayed wire
- A sick node

One important feature of ProNET-80 is that one of the errors is isolating, that is, it is per-hop. The input parity error bit on ProNET-80 was redefined as a link parity error. Parity errors (actually code violations) are checked by every node on the ring as they repeat packets. If a node detects a code violation, it will set the parity error bit and flag the packet as bad. Thus, the node where the input parity error bit is counting up is the node immediately downstream of the problem in the ring. (As for the damaged packet, both sender and recipient will get bad format errors.)

So, on ProNET-80, bad format shows that something is wrong, and parity error will tell you where it is. (The fault domain, in 802.5 parlance.) Now, you try removing the node on either side of the fault domain, or swapping out the fiber units involved. Do things one at a time, until you find the component whose removal causes the errors to stop counting. Replace that component.
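The isolation rule in Tom's message (the node whose input parity error counter is climbing sits immediately downstream of the fault) can be sketched in code. This is a hypothetical illustration, not a Proteon tool: the function name, the node names, and the input format are all invented, and it assumes you already know the order in which packets travel around the ring and have counter increases per node for some interval.

```python
def fault_domains(ring, parity_deltas):
    """Return (upstream, downstream) node pairs that bracket a suspected fault.

    ring          -- node names listed in the direction packets travel
    parity_deltas -- node name -> increase in its input parity error
                     counter over the sampling interval (assumed input)
    """
    domains = []
    for i, node in enumerate(ring):
        if parity_deltas.get(node, 0) > 0:
            # The problem lies between this node and the one before it.
            # Index -1 wraps around, since the ring is circular.
            upstream = ring[i - 1]
            domains.append((upstream, node))
    return domains


if __name__ == "__main__":
    ring = ["gw-a", "gw-b", "gw-c", "gw-d"]   # made-up node names
    deltas = {"gw-a": 0, "gw-b": 0, "gw-c": 42, "gw-d": 0}
    print(fault_domains(ring, deltas))
    # -> [('gw-b', 'gw-c')]
```

Each pair returned names the two nodes on either side of a fault domain; the cable or fiber units between them, and the nodes themselves, are the candidates to remove or swap one at a time.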
CLIFF@UCBCMSA.BITNET (Cliff Frost {415} 642-5360) (09/22/88)
Tom,

Thank you! That is exactly the kind of information I was looking for.

We've seen clumps of these error msgs twice, and both times the symptoms were different from what your note describes.

1) The first time, all the gateways had very large numbers of 'output refused', and no significant numbers of either 'output bad format' or 'input parity errors'. We isolated the problem to a particular p3280 by transferring large files back and forth through each gateway (one direction would always fail: the one that sent traffic around the ring through the bad p3280).

2) The second time, only one gateway was putting out the 'Op err' msgs, and it had large numbers of 'output refused' messages. Again, it was a flakey p3280.

So, in no case have we seen any 'input parity error' or 'output bad format' counters go up along with this 8704 msg, only 'output refused'.

I guess this just shows there's more than one way to skin a network...

Thanks again,
Cliff