[comp.sys.proteon] 4-into-6 coding, and the "classic" pronet-80 problem

enger@SEKA.SCC.COM (Robert M. Enger) (03/19/90)

Folks:

As many of you unfortunately know, Pronet-80 boards can exhibit
a sensitivity to the contents of the frames they are asked
to convey.   33hex is always specified as one of the data
types that annoys the Pronet-80 boards.  

Does anyone know what 33Hex maps into under the 
4-into-6 coding used by Proteon?  Is 33hex the
worst case test for the Pronet-80 given the
4-into-6 mapping they use?  If not, is there
a more sensitive test that we can use to detect when our
boards are on their way to the great repair-depot in the sky?

Can anyone offer a technical explanation of the situation?
Is it that the "rf" (120mbps) stages become mistuned,
causing waveform deformation beyond the ability of the
receivers to accurately "demodulate" (loose terminology)
the signal?  Why is their design so sensitive?  These
boards keep going bad.  Are the drivers (amplifiers)
too weak to fight the cable capacitance or something?

Has anyone found a way to "help" the situation?
Perhaps a way to appease the deficiency of the design
(or whatever the problem is) by using lower-capacitance
cabling, or cabling with a different characteristic
impedance, etc??

Thanks for any insight anyone has to offer,
Bob Enger
Contel Federal Systems
enger@seka.scc.com

sting@LAOTSE.CAM.NIST.GOV (s. ting) (03/19/90)

   Our ProNET-80 ring has experienced the problem. It has not been 
 resolved yet.  I too would like Proteon to explain the problem in
 great detail and present, if there is one, an effective solution.
 
 Questions:
   
   .  What causes the p3280 or ProNET-80 CTL card to go out of alignment?
   
   .  Why do they go out of alignment so easily?
   
   .  Why do they have problems with characters like x33?
   
   .  What is a quick and effective way to find which p3280 or CTL
      card, among the many on the ring, is already out of alignment?
      
   .  Is it true that Proteon, concentrating on FDDI as the replacement
      for ProNET-80, is not making a full effort to resolve the problem?
      
   .  Should a customer who has no hardware maintenance agreement
      covering the out-of-alignment units have to pay for the repair?
      
      
  Michael Ting,
  NIST

CLIFF@UCBCMSA.BITNET (Cliff Frost {415} 642-5360) (03/20/90)

Hi,
We have had a fair amount of experience with this problem here at UC
Berkeley, and I think we have essentially banished it in this form.
With Proteon's help, you should be able to also.

> Does anyone know what 33Hex maps into under the 4-into-6 bit coding
> used by Proteon?  etc...

Hex 33 is useful because it maps to the ASCII character "3", so
you can easily fill a file with the letter "3" (but don't put too many
newlines or carriage returns in).  Hex 33 maps into: 100011100011 which
when followed by another hex 33 becomes a series of 3 zeros followed by
3 ones.  I believe it is the lack of transitions over 3 bits that is
hard for controllers that are drifting out of spec.

There are several other data patterns that are at least as bad as this,
hex: 36, 63, 66, BE, BB, EB, EE, and undoubtedly more.
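
The run-length claim for hex 33 can be checked mechanically.  The Python
sketch below takes only the 12-bit pattern quoted above as input (the full
4-into-6 table is not reproduced in this thread, so nothing else is
encoded) and lists the runs in a stream of back-to-back hex 33 bytes.
The function names are illustrative.

```python
from itertools import groupby

# 12-bit line pattern for hex 33, copied from the post above.
HEX33_LINE_CODE = "100011100011"


def run_lengths(bits):
    """Return a (bit, length) pair for each run of identical bits."""
    return [(bit, len(list(group))) for bit, group in groupby(bits)]


def longest_run(bits):
    """Length of the longest run of identical bits in the string."""
    return max(length for _, length in run_lengths(bits))


# Three consecutive hex 33 bytes as they would follow one another.
stream = HEX33_LINE_CODE * 3
print(run_lengths(stream))
print("longest run:", longest_run(stream))
```

For repeated hex 33 the longest run is three bits, which matches the
description above.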

> Can anyone offer a technical explanation of the situation?  Is it that
> the "rf" (120mbps) stages become mistuned, ...etc?

Well, I'm a software kind of guy, and our hardware techs sometimes use
the phrase "programmer with a screwdriver" in a sarcastic way, so
take what I say with some sized grain of salt.  ;-)

Each active device on the p80 ring reads the data that comes in using
its own clock to decode it.  If the data is for a node downstream, the
device regenerates it, again using its own clock.  This means that
all the devices on the ring had better have clocks that are in close
alignment with each other.  The clocks are all supposed to be at 120 MHz
+/- a tiny fraction (10 kHz?).  These clocks tick totally independently
of each other; there is no "master" clock.

This design appears (to me) to lead to some difficult debugging
situations.  You can have a ring that is working ok but has some clocks
at the ragged edge, introduce a new node and all of a sudden your
ring is shot.  The new node may actually be OK, but you might "fix" the
problems by putting in a different controller.  Or you might "fix"
the problems by plugging the controllers in in a different order.
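
For a rough feel of what the clock tolerance means, here is a
back-of-the-envelope Python sketch.  It assumes the 120 MHz nominal rate
and the tentative 10 kHz tolerance mentioned above, and it assumes two
neighboring clocks sit at opposite extremes and are never resynchronized.
How the real receivers recover timing is not described in this thread, so
treat the result as an illustration only.

```python
NOMINAL_HZ = 120e6    # nominal clock rate quoted in the post
TOLERANCE_HZ = 10e3   # the post's own tentative figure, not a datasheet value


def bits_until_slip(nominal_hz, tolerance_hz, slip_fraction=0.5):
    """Bit times until two free-running clocks at opposite tolerance
    extremes drift apart by slip_fraction of one bit time."""
    worst_case_offset = 2 * tolerance_hz / nominal_hz  # relative rate difference
    return slip_fraction / worst_case_offset


print(round(bits_until_slip(NOMINAL_HZ, TOLERANCE_HZ)))
```

Under these assumed numbers the two clocks drift half a bit apart after
about 3000 bit times.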

P3280s seem to have the worst problems.  Maybe it's because they have
two independent clocks, or maybe because they get too hot in their
little boxes or maybe their circuitry is really different (big help,
huh?).

> Has anyone found a way to "help" the situation?
>> What is the quick and effective way to find which p3280 or CTL
>> card among the many on the ring already out of alignment?

The only way I know how to deal with this requires real work, but
it is what you have to do:

1)  First you have to determine what order the nodes are in the ring.
This is crucial because of the way the data is clocked and regenerated
by each node.  In order to pinpoint a problem node you have to know
the exact path that data will take through your ring.

    To do this, you go and look at your wire center.  Data will flow
around it in a counter-clockwise direction.

IMPORTANT:  You have to realize that at the link level each packet is
going to go all the way around the ring.  Node A sends it to node B,
and if all goes well node B sends it back with the ACK bit set.  If
all doesn't go well (either the ACK bit is off or the packet is
trashed), node A will retransmit the packet (up to several times).

You need to keep this in mind.  This is the root mechanism that causes
duplicate packets to show up.  Also, if the path from B to A is bad,
A will spend a certain amount of time retransmitting unnecessarily
and this will slow down throughput from A to B--although not nearly
as much as from B to A.

2)  Next you have to have a way to test each node.  Let's say you have
p4200 routers which have a p80 interface and some others, say an ethernet.
You need access to one of the ethernets from each router.

What you do is ship data across your ring.  From point A to point B
you ship (e.g.) a file with nothing but 3's in it.  Then you ship the same
size file with 1's in it (1's are innocuous).  Then do the same tests
from B to A.

-If the 3's are causing problems, you will see very different
throughput rates.
-If there is only one broken node in the ring you will see that the
throughput for the 3's file is dramatically worse in one direction
than the other.
-If there are several broken nodes in the ring you have a much more
difficult hunt, but you can USUALLY get pretty far if not all the way.
I've seen some strange things with this.  Sometimes I've had to
reorder things in the ring to find a bad component.

3)  If you note any funniness across your p3280 links, get your
p3280s upgraded to the latest revs.  We have not had this problem
with our p3280s since we did this.  (We have had a couple of total
failures, but that is at least pretty easily identifiable.)
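
The test files from step 2 can be generated with a few lines of Python.
This is only a sketch: the file names and the 1 MiB size are arbitrary
choices, not values from the procedure above.  The files contain no
newlines, so every byte shipped is the fill character.

```python
import os
import tempfile


def write_fill_file(path, fill_char, size_bytes):
    """Write size_bytes copies of fill_char to path, with no newlines."""
    with open(path, "wb") as f:
        f.write(fill_char.encode("ascii") * size_bytes)


if __name__ == "__main__":
    size = 1024 * 1024                  # arbitrary test size (1 MiB)
    out_dir = tempfile.mkdtemp()        # scratch directory for the files
    for name, char in (("threes.dat", "3"), ("ones.dat", "1")):
        path = os.path.join(out_dir, name)
        write_fill_file(path, char, size)
        print(path, os.path.getsize(path))
```

Ship both files in each direction and compare the transfer times as
described in step 2.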

=====
I have some tools that can help.  They are available for anonymous
ftp from jade.berkeley.edu (128.32.136.9).

1)  pub/ping.c and pub/ping.8:  This lets you specify the data fill
pattern for the packets sent.  This helps you spot the problem early.
Since each ping packet travels in both directions, however, it is no
help in pinpointing which node is at fault.

2)  pub/netout.c:  This sends data to the TCP discard port of a remote
machine.  You can specify the data fill pattern.  This is easier to
use for pinpointing things than ftp, since you don't need an account
on the remote host.  Unfortunately, not everybody has implemented the
TCP discard port code.
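
The same idea can be sketched in Python.  This is not netout.c and makes
no claim to match its options; it simply opens a TCP connection to the
discard service (port 9), sends a fixed fill byte, and times the transfer.
The function names, and the host name in the note below, are hypothetical.

```python
import socket
import time

DISCARD_PORT = 9  # standard TCP discard service port


def send_fill(host, fill_byte, total_bytes, port=DISCARD_PORT, chunk_size=8192):
    """Send total_bytes copies of fill_byte to host:port.

    Returns the elapsed time in seconds.
    """
    chunk = bytes([fill_byte]) * chunk_size
    remaining = total_bytes
    start = time.monotonic()
    with socket.create_connection((host, port)) as sock:
        while remaining > 0:
            n = min(remaining, chunk_size)
            sock.sendall(chunk[:n])
            remaining -= n
    return time.monotonic() - start


def compare_patterns(host, total_bytes=1_000_000, port=DISCARD_PORT):
    """Time a 0x33 fill against an innocuous 0x31 fill over the same path."""
    return {
        "0x33": send_fill(host, 0x33, total_bytes, port),
        "0x31": send_fill(host, 0x31, total_bytes, port),
    }
```

A call such as compare_patterns("router-b.example.edu") returns the two
elapsed times for comparison.  The timings are only meaningful when the
transfer is much larger than the local socket buffers, since sendall()
returns once the data has been handed to the kernel.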

=====
We can identify when we are starting to have problems in a couple
of ways.  One is from SNMP collected output errors on the p80 ring
interfaces.  Another is looking at "T 2" in the router consoles and
seeing lots of 8704 errors on the p80 interfaces.  "Lots" is
defined very fuzzily in my mind--it's based on experience...

I don't mind discussing these problems with folks.  I hope this is
helpful to someone, my hands are tired.  ;-)

        Cliff Frost                   (415) 642-5360
        Central Computing Services    <cliff@berkeley.edu>
        University of California      CLIFF AT UCBCMSA
        Berkeley, CA 94720

enger@SEKA.SCC.COM (Robert M. Enger) (03/20/90)

Cliff:

Thanks for the info.  Yes, it is helpful.

Do you have the mapping table for the 4-into-6 coding used by Proteon?
Can I get a copy?
Do any of the allowed codes result in strings of four zeros, four ones?

Are any of the pronet-80 interface statistics of any real use in
locating the culprit?  Given the frequency and magnitude of the
problem, I would hope that some parameter they report is useful.

I do not have any fiber equipment here to blame the problem on.
I have 8 P4200s, sitting in the same room, connected to a 
ganged wire-center.  Intuition would lead one to believe
that this would be a "piece of cake" installation.  
Since all boxes are subject to the same environmental conditions
all the clocks and chips should suffer temperature drift in the
same direction, if not the same amount.  The AC power supplied
to all the units should be closely matched.  Even the pronet-80
cables are short, so capacitive effects should be small.

I guess I should express an overdue thanks.  We have been using
the special ping program from jade for quite some time.
That is how we have been poking at the ring!

From your description of the problem, is it correct to sum up that
the problem is not one of wave-form deformation, but rather
receiver time-base instability?  (ie, receiver can't be trusted
to sample signal near the middle of the bit-time?)

thanks,
Bob

CLIFF@UCBCMSA.BITNET (Cliff Frost {415} 642-5360) (03/20/90)

Bob,
> From your description of the problem, is it correct to sum up that
> the problem is not one of wave-form deformation, but rather
> receiver time-base instability?  (ie, receiver can't be trusted
> to sample signal near the middle of the bit-time?)

Keep in mind what I said about grains of salt.  I *think* this is
correct, but I'm not an engineer by any means.  There may be more
than one thing going on, this is the only one we've gotten a handle on.

> Do you have the mapping table for the 4-into-6 coding used by Proteon?
> Can I get a copy?

I have a reprint of an article by Howard Salwen, Alan C. Marshall, and
Nathan K. Salwen.  I have no recollection of where it came from, you should
ask Proteon for a copy.  It's called "ProNET An 80 MBIT/S Token Ring
For High-Speed LAN Applications".

> Do any of the allowed codes result in strings of four zeros, four ones?

No.  I think you can get runs of 4 and 3, but not 4 and 4.

> I do not have any fiber equipment here to blame the problem on.
> I have 8 P4200s, sitting in the same room, connected to a
> ganged wire-center.  Intuition would lead one to believe
> that this would be a "piece of cake" installation.

Are you hinting something nasty about Proteon's Quality Assurance?

Or, are you asking for advice on how to proceed?   ;-)

I would try the file transfer method.  If it works you will be left
with the question: "is it the transmitter or the receiver?".  Ie--it
won't tell you exactly which p4200 is having the problem.  What you
do then is swap the p4200's positions in the wire center.  In the
"classic" case the problem follows one or the other.  In the nasty
case the problem goes away for a while or does something else weird.

> Are any of the pronet-80 interface statistics of any real use in
> locating the culprit?  Given the frequency and magnitude of the
> problem, I would hope that some parameter they report is useful.

The "Output bad format" and "Input parity error" counters are
useful to watch.  Proteon will tell you a trouble-shooting method
that uses the Input parity error counter.  It has sometimes been
useful here, and I should have mentioned it.

        Cliff