[comp.dcom.sys.cisco] Appliques can fail .....

Daniel.Karrenberg@cwi.nl (Daniel Karrenberg) (02/27/91)

Appliques can fail .....
	... and it can be hard to detect.


This is long but I wish I had read something like it before I had to
deal with this one!

I have come across a very hard to detect serial line problem recently.
Our setup is as follows (place names included to add local flavour):


     KTH Stockholm Sweden                     CWI Amsterdam Netherlands

   +--------+    +--------+                   +--------+    +--------+
   |        |V.35|        |                   |        |V.35|        |
   | CISCO  +====+  MUX   +-- Various Telcos -+ MODEM  +====+ CISCO  |
   |        |    |        |                   |        |    |        |
   +--------+    +--------+                   +--------+    +--------+

      (1)           (2)                          (3)            (4)


We were having some strange problems with that link. The keepalives were
getting thru OK and small IP packets were doing reasonably well (1-2% loss).
Large IP packets (1000B) weren't getting thru at all.

So we went to tackle the problem:

	Step 1:   Make local loop at (3).
		  Ping yourself from 4: OK.
		  Conclusion: Cisco (4) plus cable to (3) and digital
		  interface of (3) are OK.

	Step 2:   Make local loop at (2).
		  Ping yourself from 1: OK.
		  Conclusion: Cisco (1) plus cable to (2) and digital
		  interface of (2) are OK.

At this point both Sweden and Holland conclude it is the Telcos (we call
them PTTs) again and call in the problem. Note: PTT demarc is the
digital interface of the PTT owned Modem/Mux (call it CSU/DSU if you like).
The PTTs make various loops, say they found something and declare the
line OK. Testing revealed that the problem persisted, albeit a little
less severely: only 95% of 1000 byte packets between (1) and (4) were dropped
:-) :-(.

We repeat steps 1 and 2 above with the same results. We start suspecting
a clocking problem. So Sweden gets a BERT tester (actually a Vitalink)
and we go to 

	Step 3: Connect BERT tester to (2), remote loop at (3).
		Run a few minutes of BERT pattern: Works fine.
		Conclusion: The PTTs are not really to blame for
		this one.

	Step 4: Reset remote loop at (3), shut down interface
		at (4). 
		Run a few minutes of BERT pattern: Works fine.
		Conclusion: Cable and applique in (4) OK.

	Step 5: Reconnect (1), interface at (4) remains shut down.
		Have (1) ping itself: Works fine.
		Conclusion: Nothing wrong in Sweden, really.

	Step 6: Shut down interface in (1), enable interface in (4).
		Have (4) ping itself: All large packets get dropped.
		Conclusion: Head scratching. Must be in (4) or
		some weird clocking problem between (3) and (4).

	Step 7: Swap MCI card in (4).
		Have (4) ping itself: All large packets get dropped.
		Conclusion: More head scratching. Should it be the
		applique or the flat cable in the cisco ?????!?!?!?!?

	Step 8: Swap back MCI card and use different applique in (4).
		Have (4) ping itself: Works!
		Conclusion: Happiness and disbelief.

We subsequently swapped a few appliques with the conclusion that
old ones (bar code serial <10000) consistently don't work on this line.
New ones do work although the link is not 100% stable yet but this might
be due to other problems.

I am still not 100% sure what makes the old appliques not work since
I haven't had the time to put a scope on the line. Any ideas?

Lessons learned:

	1) In some (rare) circumstances local loopbacks
	   do not detect local problems.

	2) A spare parts pool needs to include appliques.

	3) Problems like this can only be found by both
	   ends testing synchronously with an open telephone
	   connection to discuss things as they go.

Daniel

MAP@lcs.mit.edu (Michael A. Patton) (02/27/91)

   From: Daniel Karrenberg <Daniel.Karrenberg@cwi.nl>
   Date: Wed, 27 Feb 91 10:52:54 +0100

After discussing:
   Appliques can fail ... and it can be hard to detect.

Concludes with:

	   3) Problems like this can only be found by both
	      ends testing synchronously with an open telephone
	      connection to discuss things as they go.

Dave Clark, years ago (circa 1980), made the observation that "the
most powerful tool when debugging a network is another network that's
still working."  This comment was made to two undergrads trying to get
the first IP bits between MIT-MULTICS and MIT-CSR (Gee, it must be
ancient history, no domains :-).  The two undergrads were occasionally
hollering down the hall and Dave came up with this observation when
suggesting that they use an open telephone connection.  So you don't
have to be in separate countries for it to be useful!  I guess it's
just more obvious then.

            __
  /|  /|  /|  \         Michael A. Patton, Network Manager
 / | / | /_|__/         Laboratory for Computer Science
/  |/  |/  |atton       Massachusetts Institute of Technology

Disclaimer: The opinions expressed above are a figment of the phosphor
on your screen and do not represent the views of MIT, LCS, or MAP. :-)
And even then, they're only my recollection of the event! :-)

bmar@cac.washington.edu (Bill Mar) (02/28/91)

We had similar symptoms while evaluating the new Codex 3500 56k DSUs on
a production circuit between two cisco routers.  Problem was resolved by
replacing the V.35 applique with a later vrs 6.  Apparently vrs 4 and 5
incorrectly inverted some of the data and/or clock lines, which was
corrected in vrs 6.  The symptoms do not necessarily show up
immediately, because four pairs of different vendor/model DSUs tested
fine on this circuit.  When the 3500s were tried, they BERT'd the
circuit ok, PING'd locally ok, but could not PING across the link. 
Codex concludes the 3500 is designed to enforce industry standard data
and clock phase relation, while the older DSUs allowed out of spec phase
and the frequently less than desirable results. 

Bill Mar
Univ of Wash
Seattle, WA

fortinp@bwdls56.bnr.ca (Pierre Fortin) (03/04/91)

In article <32744@boulder.Colorado.EDU>, bmar@cac.washington.edu (Bill Mar) writes:
>
> We had similar symptoms while evaluating the new Codex 3500 56k DSUs on
> a production circuit between two cisco routers.  Problem was resolved by
> replacing the V.35 applique with a later vrs 6.  Apparently vrs 4 and 5
> incorrectly inverted some of the data and/or clock lines, which was
> corrected in vrs 6.  The symptoms do not necessarily show up
> immediately, because four pairs of different vendor/model DSUs tested
> fine on this circuit.  When the 3500s were tried, they BERT'd the
> circuit ok, PING'd locally ok, but could not PING across the link. 
> Codex concludes the 3500 is designed to enforce industry standard data
> and clock phase relation, while the older DSUs allowed out of spec phase
> and the frequently less than desirable results. 

I too spent *MANY* long hours working this problem about 20 months ago; I
posted a number of replies in the past...

In your reply, you are quite correct in stating that the problem is fixed in 
the R6 appliques.  The problem was with inverted clock pairs.  Let's see if
I can summarize quickly:

    R3:  inverted clocks
    R3+: (+ means jumpers and trace cuts) some boards were improperly
         modified (QA problem)
    R4+: cisco forgot to tell the manufacturer to *stop* applying the mods...
         result: undid the etched fix.
    R4:  I don't recall if this one was completely OK (all these months and
         a week in the Mexican sun... :^)
    R6:  OK, although I would have made one more minor change; I agreed with
         cisco at that time that this last one was a cosmetic nit.

The reason that some units _appear_ to work is related to either their
signal rise/fall times (worse as slope gets longer), or the data/clock
relationship (measured in nanoseconds).  The bottom line here is that the
data lines were changing at the *same* time as the data was being clocked
into the modems.  The problem was always on the sending end.
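
The timing argument above can be put into rough numbers. What follows is
only a sketch under stated assumptions, not a measurement of any real
applique: the function name, the 56 kbit/s rate, and the skew and edge
times are all hypothetical values chosen for illustration. It assumes the
data line changes at the start of each bit cell and needs half the edge
time on either side of that instant to settle.

```python
def sampling_margin_ns(bit_rate_bps, skew_ns, edge_time_ns, clock_inverted):
    """Time (ns) between the sampling instant and the nearest moment the
    data line is still in transition. A negative result means the data is
    sampled while it is still changing.

    Assumptions (illustrative only): data transitions are centred on the
    bit-cell boundary; a correct clock samples half a bit later; an
    inverted clock samples on the boundary itself, offset by skew_ns.
    """
    bit_time_ns = 1e9 / bit_rate_bps
    nominal_offset = 0.0 if clock_inverted else bit_time_ns / 2
    sample_at = (nominal_offset + skew_ns) % bit_time_ns
    # Distance to the nearest data transition (start or end of the cell).
    distance = min(sample_at, bit_time_ns - sample_at)
    return distance - edge_time_ns / 2


if __name__ == "__main__":
    # Hypothetical numbers: 56 kbit/s line, 100 ns edges.
    for inverted in (False, True):
        for skew in (30, 200):
            margin = sampling_margin_ns(56000, skew, 100, inverted)
            print(f"inverted={inverted} skew={skew} ns margin={margin:.0f} ns")
```

With these made-up values the correct clock leaves several microseconds
of margin, while the inverted clock gives a negative margin at 30 ns of
skew and a positive one at 200 ns. That is one way to picture why some
units _appear_ to work and why slower edges make things worse.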

If you are having these problems, you might try the following:
    - use a breakout box or
    - make a special cable to

      - invert SCT or
      - invert SCTE or
      - invert both

Another problem area is the cable type you use between the applique and 
modem.  We eventually designed our own cable since most generally available 
cables will not work properly (loss and crosstalk) over more than a couple
of meters.  We tested our design to 70 feet, but order only 35- and 50-foot
units.

> 
> Bill Mar
> Univ of Wash
> Seattle, WA

Cheers,
Pierre

P.S.:  If anyone kept copies of my original postings, please repost them
       (or email to me at fortinp@bnr.ca).  I suppose we should have written
       a book on V.35 back then...  Looking back, it would have been salt in
       our wounds...  :^)

Cheers,                      
Pierre Fortin       fortinp@bnr.ca         (613)763-2598

lars@spectrum.CMC.COM (Lars Poulsen) (03/09/91)

In article <32714@boulder.Colorado.EDU>
   Daniel.Karrenberg@cwi.nl (Daniel Karrenberg) writes a great and
   detailed "war story" about customer-debugging of a serial line
   problem involving a pair of cisco routers connected via V.35 modems.
> ....
>We were having some strange problems with that link. The keepalives were
>getting thru OK and small IP packets were doing reasonably well (1-2% loss).
>Large IP packets (1000B) weren't getting thru at all.
> ....
>We subsequently swapped a few appliques with the conclusion that
>old ones (bar code serial <10000) consistently don't work on this line.
>New ones do work although the link is not 100% stable yet but this might
>be due to other problems.
>
>Lessons learned:
>
>	1) In some (rare) circumstances local loopbacks
>	   do not detect local problems.

Being originally (and now again) a software engineer, I spent a couple
of years running a customer support organization for similar stuff.
A possible source for the problem could be an engineering / design
error in the V.35 applique[1]. I don't know if cisco had such an error,
but several implementors have had the same problem.

For some reason, designers of serial interfaces have a hard time keeping
their plusses and minuses straight, especially on synchronous interface
clocks. Synchronous modem clocking is intended to be set up such that
the data is sampled in the middle of the bit cell, where it is
presumably most stable, and "ringing", "overshoot", "round shoulders"
and other boundary effects at the edge of the bit cell have died down.
If the clock is inverted, the data will instead be sampled near the edge
of the bit cell. You would think that it would not work at all, but with
some luck, it will actually work part of the time, but the link will be
enormously sensitive to minor changes in cabling, grounding etc. Of
course, loopbacks will work fine, since there will be symmetrical
inversions on the send and receive side. Also, it will work fine in
local "null modem" hookups in the lab.
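
The "works part of the time" behaviour can be illustrated with a small
simulation. This is a toy model, not a description of any particular
hardware: the linear edge shape, the Gaussian jitter, and every parameter
value are assumptions made for the sketch. All times are expressed as
fractions of one bit cell, with the data transition centred on the start
of the cell.

```python
import random


def bit_error_rate(sample_offset, edge_time, jitter, n_bits=20000, seed=1):
    """Fraction of bits read wrongly by a receiver that samples each bit
    cell at sample_offset (0.0 = cell edge, 0.5 = middle of the cell).

    Toy model: the line ramps linearly from the previous bit to the
    current bit over edge_time, centred on the cell boundary, and the
    sampling instant wobbles by Gaussian jitter. Jitter is assumed small
    enough that the sample never reaches the next cell boundary.
    """
    rng = random.Random(seed)
    bits = [rng.randint(0, 1) for _ in range(n_bits)]
    errors = 0
    for i in range(1, n_bits):
        t = sample_offset + rng.gauss(0.0, jitter)
        prev, cur = bits[i - 1], bits[i]
        if t <= -edge_time / 2:
            level = prev          # still showing the previous bit
        elif t >= edge_time / 2:
            level = cur           # edge has fully settled
        else:
            level = prev + (cur - prev) * (t + edge_time / 2) / edge_time
        read = 1 if level > 0.5 else 0
        errors += (read != cur)
    return errors / (n_bits - 1)


if __name__ == "__main__":
    for offset in (0.5, 0.1, 0.03, 0.0):
        print(offset, bit_error_rate(offset, edge_time=0.1, jitter=0.02))
```

In this model, sampling in the middle of the cell gives no errors, while
sampling on or very near the edge gives an error rate that swings widely
with small changes in offset, which is the kind of sensitivity to
cabling and grounding described above.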

The V.35 interface only started to come into widespread use four years
ago, and most manufacturers started to build them "from paper": Having
only a spec to work from[2], and no compatible equipment to compare and
test against. I know of several manufacturers that got several products
out to the field with design problems, both on the DTE side and on the
modem side. The embarrassment at making a design error that can be
described in terms this simple has led to coverups that have greatly
complicated the recovery process.

Note that having spare appliques would not have helped you, since they
would have been of the same engineering revision as the original ones.

Footnotes:
[1] Why does cisco use the word "applique" instead of "adapter" ?
    I have seen many computer operators confused by the term.
[2] The spec even has problems. There seem to be two different physical
    connectors allowed. I vaguely remember that they looked identical, but
    one used metric dimensions, the other inches ...
[3] The above should not be construed as a putdown of cisco's
    engineering, for which I have the highest respect.
[4] The most common cause of errors that get more frequent with
    increasing frame size, is misconfigured clocks in the telco domain
    (i.e. the two CSU/DSU's are not slaved to the same master clock).
    This can happen either by misconfiguring a modem (enabling one of
    them as a clock master when telco is providing clock) or by a
    mis-set switch in any telco MUX that the link passes through.
    When this happens, the clock phase is slowly drifting in and out of
    sync. Often, the slip will be less than one bit per million, causing
    you to have "a few bad minutes every two or three hours". 
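
    As a rough illustration of why errors grow with frame size, here is
    a sketch under simplifying assumptions: bit errors are independent,
    and a single bad bit causes the whole frame to be discarded. Neither
    assumption is claimed to hold for the link discussed in this thread.
    The function name, the bit error rate, and the 64-byte "small" frame
    size are hypothetical.

```python
def frame_loss_probability(frame_bytes, bit_error_rate):
    """Probability that a frame of frame_bytes contains at least one bad
    bit, assuming independent bit errors (an idealisation)."""
    return 1.0 - (1.0 - bit_error_rate) ** (8 * frame_bytes)


if __name__ == "__main__":
    ber = 1e-4  # hypothetical bit error rate
    for size in (64, 1000):
        loss = frame_loss_probability(size, ber)
        print(f"{size:5d} bytes: {loss:.1%} of frames lost")
```

    Under this model a frame roughly sixteen times longer is lost far
    more often at the same bit error rate, so a line that carries
    keepalives and small packets acceptably can still be close to
    useless for large ones.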
-- 
/ Lars Poulsen, SMTS Software Engineer
  CMC Rockwell  lars@CMC.COM