Daniel.Karrenberg@cwi.nl (Daniel Karrenberg) (02/27/91)
Appliques can fail ... and it can be hard to detect.

This is long but I wish I had read something like it before I had to
deal with this one!

I have come across a very hard to detect serial line problem recently.
Our setup is as follows (place names included to add local flavour):

   KTH Stockholm Sweden                    CWI Amsterdam Netherlands

+--------+    +--------+                   +--------+    +--------+
|        |V.35|        |                   |        |V.35|        |
| CISCO  +====+  MUX   +-- Various Telcos -+ MODEM  +====+ CISCO  |
|        |    |        |                   |        |    |        |
+--------+    +--------+                   +--------+    +--------+
   (1)           (2)                          (3)           (4)

We were having some strange problems with that link. The keepalives were
getting thru OK and small IP packets were doing reasonably well (1-2% loss).
Large IP packets (1000B) weren't getting thru at all.
So we went to tackle the problem:

Step 1: Make local loop at (3).
        Ping yourself from 4: OK.
        Conclusion: Cisco (4) plus cable to (3) and digital interface
        of (3) are OK.

Step 2: Make local loop at (2).
        Ping yourself from 1: OK.
        Conclusion: Cisco (1) plus cable to (2) and digital interface
        of (2) are OK.

At this point both Sweden and Holland conclude it is the Telcos
(we call them PTTs) again and call in the problem.
Note: PTT demarc is the digital interface of the PTT owned Modem/Mux
(call it CSU/DSU if you like).

The PTTs make various loops, say they found something and declare the
line OK. Testing revealed that the problem persisted albeit a little less
severe, only 95% of 1000 byte packets between (1) and (4) were
dropped :-) :-(. We repeat steps 1 and 2 above with the same results.
We start suspecting a clocking problem. So Sweden gets a BERT tester
(actually a Vitalink) and we go to

Step 3: Connect BERT tester to (2), remote loop at (3).
        Run a few minutes of BERT pattern: Works fine.
        Conclusion: The PTTs are not really to blame for this one.

Step 4: Reset remote loop at (3), shut down interface at (4).
        Run a few minutes of BERT pattern: Works fine.
        Conclusion: Cable and applique in (4) OK.
Step 5: Reconnect (1), interface at (4) remains shut down.
        Have (1) ping itself: Works fine.
        Conclusion: Nothing wrong in Sweden, really.

Step 6: Shut down interface in (1), enable interface in (4).
        Have (4) ping itself: All large packets get dropped.
        Conclusion: Head scratching. Must be in (4) or some weird
        clocking problem between (3) and (4).

Step 7: Swap MCI card in (4).
        Have (4) ping itself: All large packets get dropped.
        Conclusion: More head scratching. Should it be the applique or
        the flat cable in the cisco ?????!?!?!?!?

Step 8: Swap back MCI card and use different applique in (4).
        Have (4) ping itself: Works!
        Conclusion: Happiness and disbelief.

We subsequently swapped a few appliques with the conclusion that
old ones (bar code serial <10000) consistently don't work on this line.
New ones do work although the link is not 100% stable yet but this might
be due to other problems.

I am still not 100% sure what makes the old appliques not work since
I haven't had the time to put a scope on the line. Any ideas?

Lessons learned:

        1) In some (rare) circumstances local loopbacks
           do not detect local problems.

        2) A spare parts pool needs to include appliques.

        3) Problems like this can only be found by both ends testing
           synchronously with an open telephone connection to discuss
           things as they go.

Daniel
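A rough way to see why keepalives and small packets can survive a
marginal line while 1000-byte packets almost never do: if each bit is
corrupted independently with some small probability, the chance that a
whole frame arrives intact falls off exponentially with its length. The
Python sketch below works that out. The independent-error model, the
function names and the smaller packet sizes are assumptions made for
illustration; nothing in the post says the errors on this link were
independent, and a clock-phase fault may well behave differently.

```python
def frame_loss(bit_error_rate: float, frame_bytes: int) -> float:
    """Probability that at least one bit of a frame is corrupted,
    assuming independent bit errors (an illustrative assumption)."""
    bits = 8 * frame_bytes
    return 1.0 - (1.0 - bit_error_rate) ** bits


def implied_bit_error_rate(loss: float, frame_bytes: int) -> float:
    """Invert frame_loss(): the bit error rate that would produce the
    given loss fraction for frames of the given size."""
    bits = 8 * frame_bytes
    return 1.0 - (1.0 - loss) ** (1.0 / bits)


if __name__ == "__main__":
    # 95% loss of 1000-byte packets is the figure reported in the post.
    ber = implied_bit_error_rate(0.95, 1000)
    # The smaller sizes are hypothetical; the post does not give them.
    for size in (32, 64, 256, 1000):
        print(f"{size:5d} bytes: {frame_loss(ber, size):6.1%} lost")
```

Whatever the real error mechanism was, the practical point stands: a
ping test with small packets can look healthy on a link that is unusable
for large ones, so it is worth testing with a range of sizes.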
MAP@lcs.mit.edu (Michael A. Patton) (02/27/91)
   From: Daniel Karrenberg <Daniel.Karrenberg@cwi.nl>
   Date: Wed, 27 Feb 91 10:52:54 +0100

After discussing:

   Appliques can fail ... and it can be hard to detect.

Concludes with:

   3) Problems like this can only be found by both ends testing
      synchronously with an open telephone connection to discuss
      things as they go.

Dave Clark, years ago (circa 1980), made the observation that "the most
powerful tool when debugging a network is another network that's still
working." This comment was made to two undergrads trying to get the
first IP bits between MIT-MULTICS and MIT-CSR (Gee, it must be ancient
history, no domains :-). The two undergrads were occasionally hollering
down the hall and Dave came up with this observation when suggesting
that they use an open telephone connection. So you don't have to be in
separate countries for it to be useful! I guess it's just more obvious
then.

            __
  /|  /|  /|  \         Michael A. Patton, Network Manager
 / | / | /_|__/         Laboratory for Computer Science
/  |/  |/  |atton       Massachusetts Institute of Technology

Disclaimer: The opinions expressed above are a figment of the phosphor
on your screen and do not represent the views of MIT, LCS, or MAP. :-)
And even then, they're only my recollection of the event! :-)
bmar@cac.washington.edu (Bill Mar) (02/28/91)
We had similar symptoms while evaluating the new Codex 3500 56k DSUs on
a production circuit between two cisco routers. Problem was resolved by
replacing the V.35 applique with a later vrs 6. Apparently vrs 4 and 5
incorrectly inverted some of the data and/or clock lines, which was
corrected in vrs 6. The symptoms do not necessarily show up
immediately, because four pairs of different vendor/model DSUs tested
fine on this circuit. When the 3500s were tried, they BERT'd the
circuit ok, PING'd locally ok, but could not PING across the link.
Codex concludes the 3500 is designed to enforce industry standard data
and clock phase relation, while the older DSUs allowed out of spec phase
and the frequently less than desirable results.

Bill Mar
Univ of Wash
Seattle, WA
fortinp@bwdls56.bnr.ca (Pierre Fortin) (03/04/91)
In article <32744@boulder.Colorado.EDU>, bmar@cac.washington.edu (Bill Mar) writes:
>
> We had similar symptoms while evaluating the new Codex 3500 56k DSUs on
> a production circuit between two cisco routers. Problem was resolved by
> replacing the V.35 applique with a later vrs 6. Apparently vrs 4 and 5
> incorrectly inverted some of the data and/or clock lines, which was
> corrected in vrs 6. The symptoms do not necessarily show up
> immediately, because four pairs of different vendor/model DSUs tested
> fine on this circuit. When the 3500s were tried, they BERT'd the
> circuit ok, PING'd locally ok, but could not PING across the link.
> Codex concludes the 3500 is designed to enforce industry standard data
> and clock phase relation, while the older DSUs allowed out of spec phase
> and the frequently less than desirable results.

I too spent *MANY* long hours working this problem about 20 months ago;
I posted a number of replies in the past...

In your reply, you are quite correct in stating that the problem is
fixed in the R6 appliques. The problem was with inverted clock pairs.
Let's see if I can summarize quickly:

R3:  inverted clocks
R3+: (+ means jumpers and trace cuts) some boards were improperly
     modified (QA problem)
R4+: cisco forgot to tell the manufacturer to *stop* applying the
     mods... result: undid the etched fix.
R4:  I don't recall if this one was completely OK (all these months
     and a week in the Mexican sun... :^)
R6:  OK, although I would have made one more minor change; I agreed
     with cisco at that time that this last one was a cosmetic nit.

The reason that some units _appear_ to work is related to either their
signal rise/fall times (worse as slope gets longer), or the data/clock
relationship (measured in nanoseconds). The bottom line here is that
the data lines were changing at the *same* time as the data was being
clocked into the modems. The problem was always on the sending end.
If you are having these problems, you might try the following:
 - use a breakout box or
 - make a special cable
to
 - invert SCT or
 - invert SCTE or
 - invert both

Another problem area is the cable type you use between the applique and
modem. We eventually designed our own cable since most generally
available cables will not work properly (loss and crosstalk) over more
than a couple of meters. We tested our design to 70 feet, but order
only 35- and 50-foot units.

>
> Bill Mar
> Univ of Wash
> Seattle, WA

Cheers,
Pierre

P.S.: If anyone kept copies of my original postings, please repost them
(or email to me at fortinp@bnr.ca). I suppose we should have written
a book on V.35 back then... Looking back, it would have been salt in
our wounds... :^)

Cheers,
Pierre Fortin     fortinp@bnr.ca     (613)763-2598
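The remark above that the data lines were changing at the same time as
the data was being clocked in, and the suggestion to try inverting a
clock, can be put into numbers with a small Python sketch. It is a
simplified model and not a description of the actual applique or modem
circuitry: it collapses SCT and SCTE into a single sampling clock,
treats edges as ideal, and the function and parameter names are
hypothetical. Inverting the clock is modelled as moving the sampling
instant by half a bit period.

```python
def sampling_margin_ns(bit_period_ns: float,
                       data_delay_ns: float,
                       clock_inverted: bool) -> float:
    """Distance in ns from the sampling instant to the nearest data
    transition.

    Data transitions are assumed to occur at data_delay_ns + k * T and
    the sampling edge at 0 + k * T, or at T/2 + k * T when the clock is
    inverted (T is the bit period). A margin near zero means the data
    is sampled while it is changing.
    """
    sample_at = bit_period_ns / 2.0 if clock_inverted else 0.0
    # Position of the sampling instant inside the current bit cell.
    position = (sample_at - data_delay_ns) % bit_period_ns
    return min(position, bit_period_ns - position)


if __name__ == "__main__":
    period = 1e9 / 56_000  # ns per bit at 56 kbit/s (assumed line rate)
    for delay in (0.0, 50.0, 500.0):  # hypothetical data/clock skews
        for inverted in (False, True):
            margin = sampling_margin_ns(period, delay, inverted)
            print(f"skew {delay:6.1f} ns, inverted={inverted}: "
                  f"margin {margin:8.1f} ns")
```

In this model a skew of only a few tens of nanoseconds leaves almost no
margin with the wrong clock polarity, while the opposite polarity gives
close to half a bit period. That fits the observation that some units
only appear to work, depending on rise/fall times and cabling.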
lars@spectrum.CMC.COM (Lars Poulsen) (03/09/91)
In article <32714@boulder.Colorado.EDU> Daniel.Karrenberg@cwi.nl
(Daniel Karrenberg) writes a great and detailed "war story" about
customer-debugging of a serial line problem involving a pair of cisco
routers connected via V.35 modems.

> ....
>We were having some strange problems with that link. The keepalives were
>getting thru OK and small IP packets were doing reasonably well (1-2% loss).
>Large IP packets (1000B) weren't getting thru at all.
> ....
>We subsequently swapped a few appliques with the conclusion that
>old ones (bar code serial <10000) consistently don't work on this line.
>New ones do work although the link is not 100% stable yet but this might
>be due to other problems.
>
>Lessons learned:
>
>       1) In some (rare) circumstances local loopbacks
>          do not detect local problems.

Being originally (and now again) a software engineer, I spent a couple
of years running a customer support organization for similar stuff.

A possible source for the problem could be an engineering / design error
in the V.35 applique[1]. I don't know if cisco had such an error, but
several implementors have had the same problem.

For some reason, designers of serial interfaces have a hard time keeping
their plusses and minuses straight, especially on synchronous interface
clocks. Synchronous modem clocking is intended to be set up such that
the data is sampled in the middle of the bit cell, where it is
presumably most stable, and "ringing", "overshoot", "round shoulders"
and other boundary effects at the edge of the bit cell have died down.
If the clock is inverted, the data will instead be sampled near the edge
of the bit cell. You would think that it would not work at all, but with
some luck, it will actually work part of the time, but the link will be
enormously sensitive to minor changes in cabling, grounding etc.
Of course, loopbacks will work fine, since there will be symmetrical
inversions on the send and receive side.
Also, it will work fine in local "null modem" hookups in the lab.

The V.35 interface only started to come into widespread use four years
ago, and most manufacturers started to build them "from paper": Having
only a spec to work from[2], and no compatible equipment to compare and
test against. I know of several manufacturers that got several products
out to the field with design problems, both on the DTE side and on the
modem side. The embarrassment at making a design error that can be
described in terms this simple has led to coverups that have greatly
complicated the recovery process.

Note that having spare appliques would not have helped you, since they
would have been of the same engineering revision as the original ones.

Footnotes:
[1] Why does cisco use the word "applique" instead of "adapter" ?
    I have seen many computer operators confused by the term.
[2] The spec even has problems. There seem to be two different physical
    connectors allowed. I vaguely remember that they looked identical,
    but one used metric dimensions, the other inches ...
[3] The above should not be construed as a putdown of cisco's
    engineering, for which I have the highest respect.
[4] The most common cause of errors that get more frequent with
    increasing frame size, is misconfigured clocks in the telco domain
    (i.e. the two CSU/DSUs are not slaved to the same master clock).
    This can happen either by misconfiguring a modem (enabling one of
    them as a clock master when telco is providing clock) or by a
    mis-set switch in any telco MUX that the link passes through.
    When this happens, the clock phase is slowly drifting in and out
    of sync. Often, the slip will be less than one bit per million,
    causing you to have "a few bad minutes every two or three hours".
--
/ Lars Poulsen, SMTS Software Engineer
  CMC Rockwell  lars@CMC.COM
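The arithmetic behind footnote [4] can be sketched in a few lines of
Python. The model is an assumption made for illustration: two
free-running clocks whose frequencies differ by a small relative offset
drift apart by one full bit cell every 1 / (line rate x offset) seconds.
The function names and the example line rates are hypothetical; the
thread does not state the speed of the link.

```python
def seconds_per_bit_slip(line_rate_bps: float,
                         relative_offset: float) -> float:
    """Time for two clocks to drift apart by one whole bit cell.

    relative_offset is the fractional frequency difference between the
    clocks, e.g. 1e-6 for one part per million. Clocks that are locked
    to the same master (offset 0) never slip.
    """
    if relative_offset == 0:
        return float("inf")
    return 1.0 / (line_rate_bps * abs(relative_offset))


def offset_for_slip_interval(line_rate_bps: float,
                             interval_s: float) -> float:
    """Relative frequency offset that yields one bit slip per interval."""
    return 1.0 / (line_rate_bps * interval_s)


if __name__ == "__main__":
    rate = 64_000.0  # bit/s, an assumed line rate
    for ppm in (1.0, 0.1, 0.01):
        t = seconds_per_bit_slip(rate, ppm * 1e-6)
        print(f"{ppm:5.2f} ppm offset: one bit slip every {t:8.1f} s")
    # Offset that would give one slip every 2.5 hours at this rate.
    print(offset_for_slip_interval(rate, 2.5 * 3600))
```

Under this model, a cycle measured in hours corresponds to an offset far
smaller than one part per million, which is consistent with the "less
than one bit per million" in the footnote.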