[comp.dcom.lans] Can ethernet TCP/IP lock up?

wrd3156@fedeva.UUCP (Bill Daniels) (05/19/88)

Some folks in my organization have been led to believe that a "screaming" 
modem/transceiver can lock up an ethernet by asserting carrier forever.
Supposedly token stuff like GM/MAP does not permit this.  I have little 
knowledge of lans and laning but I can see in the literature that etherneted
TCP/IP has about 99.999% of the non-IBM networking in the universe.  It just
doesn't seem prudent to buck such massive trends.

What do you think?
-- 
bill daniels
federal express, memphis, tn
{hplabs!csun,gatech!emcard}!fedeva!wrd3156

rpw3@amdcad.AMD.COM (Rob Warnock) (05/19/88)

In article <299@fedeva.UUCP> wrd3156@fedeva.UUCP (Bill Daniels) writes:
+---------------
| Some folks in my organization have been led to believe that a "screaming" 
| modem/transceiver can lock up an ethernet by asserting carrier forever.
| Supposedly token stuff like GM/MAP does not permit this...
| What do you think?
+---------------

Any piece of hardware can fail. Your token-ring transmitters can go into
"screaming" mode, too. But well-designed hardware tries to avoid failing in
ways that will take down the whole net. In particular, Ethernet transceivers
(see the DEC/Intel/Xerox Ethernet spec) have what is called "jabber control"
to prevent exactly this kind of "screaming" (which is usually a controller
board fault, b.t.w., not the transceiver itself). The odds that a controller
will go berserk *AND* that the jabber control will have failed at the same time
are much less than either fault alone. And each fault is itself very rare.

Ethernet (*any* net!) usually suffers much more from: (1) badly planned cable
runs [such that the cable gets continual motion, for example]; (2) poor/broken
wires [due to #1, or just rough handling]; (3) badly trained installers;
(4) accidental damage by unrelated maintenance workers [as when changing a
fluorescent light]; (5) poor/broken software on the hosts [*sigh*]; (6) the
very robustness of upper-level protocols which hide problems from you. (All
of these also affect token systems.) Still, it's a *very* reliable technology.

As an aside, there seem to be a number of token advocates whose major style
of promoting token rings is to knock Ethernet, usually with some panic stories
about Ethernet "locking up" or "overloading". (From the tone of your question,
you have one or more of these in your organization.) Some of this comes from
not understanding Ethernet (which is a "controlled CSMA/CD" system, and does
not "collapse" like uncontrolled CSMA, or even uncontrolled CSMA/CD), while
some comes from politicking a vested interest.

While it is certainly possible to have a badly overloaded Ethernet (witness
some of the "broadcast storms" some diskless workstations can get into), it
is also just as possible to have a badly overloaded token net. *ANY* shared
resource will experience a sharp increase in delay as the average load exceeds
70-85%. (See any basic book on queueing theory.) There are good and bad points
about both Ethernet and token rings. (Just ask about recovering from a lost
token... Oops! There I go, doing what I was criticizing... ;-}  ;-}  )
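
To put rough numbers on that knee in the delay curve, here is a minimal
sketch. It assumes the simplest textbook model, an M/M/1 queue (random
arrivals, one shared server), where mean delay is 1/(1 - load) times the
service time. Neither Ethernet nor a token ring is exactly M/M/1, so treat
the figures as showing the shape of the curve only.

```python
def mm1_delay_factor(load):
    """Mean time in system, in units of the service time, for M/M/1."""
    if not 0 <= load < 1:
        raise ValueError("load must be in [0, 1)")
    return 1.0 / (1.0 - load)


# Delay grows slowly at light load and very quickly past 70-85%.
for load in (0.10, 0.50, 0.70, 0.85, 0.95):
    factor = mm1_delay_factor(load)
    print(f"load {load:4.0%}  delay = {factor:5.1f} x service time")
```

The same arithmetic applies to any shared medium, whatever the access
method.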

Any technology needs to be analyzed for its suitability before being used.
Token rings have a place in certain constrained process-control environments.
But note that even here you can't permit "general timesharing" on the same
net as your process-control, or you'll blow your real-time constraints.
Conversely, a dedicated Ethernet can be run as a "virtual token bus", and
meet essentially the same performance constraints. All such "guarantees",
however, assume there will be *NO* data errors, as these completely upset
the real-time constraints. (Hence my comment above about lost tokens.)

The major differences in performance between Ethernet and 10 Mbit/sec token
rings occur in the very-high-average-load regime, where you never want to
design a general-purpose net to run. At reasonable loads (under 70-85% or so),
the two technologies are practically identical. Also, token rings do better
at the higher data rates (above 50 Mbit/sec) or for geographically very large
nets (diameter >2500 meters), regimes where CSMA/CD doesn't work as well
(or at all!).
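
A rough sketch of why CSMA/CD runs out of room at higher data rates. It
assumes the 512-bit minimum frame of 10 Mbit/sec Ethernet is kept fixed, and
that the usable diameter simply scales with the time it takes to send that
frame. Real delay budgets also include repeater and transceiver delays, which
this ignores, so the distances are illustrative only.

```python
MIN_FRAME_BITS = 512        # 64-byte minimum frame (assumed held constant)
REF_RATE_BPS = 10e6         # 10 Mbit/sec Ethernet
REF_DIAMETER_M = 2500.0     # diameter figure quoted above for 10 Mbit/sec


def slot_time_us(bit_rate_bps):
    """Time to send one minimum-size frame, in microseconds."""
    return MIN_FRAME_BITS / bit_rate_bps * 1e6


def scaled_diameter_m(bit_rate_bps):
    """Diameter if the collision window shrinks in proportion to bit time."""
    return REF_DIAMETER_M * REF_RATE_BPS / bit_rate_bps


for rate in (10e6, 50e6, 100e6):
    print(f"{rate / 1e6:5.0f} Mbit/sec: slot {slot_time_us(rate):5.2f} us, "
          f"diameter about {scaled_diameter_m(rate):6.0f} m")
```

A sender must still be transmitting when a collision from the far end of the
net gets back to it, so a faster net with the same minimum frame has to be
physically smaller.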

Anyway, Ethernet's there, it's (relatively) cheap (finally!), and everyone
from clone-makers to Big Blue supports it. Where it fits, it fits very well
indeed...


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun,attmail}!redwood!rpw3
ATTmail:  !rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

phil@amdcad.AMD.COM (Phil Ngai) (05/20/88)

In article <21674@amdcad.AMD.COM> rpw3@amdcad.UUCP (Rob Warnock) writes:
>In article <299@fedeva.UUCP> wrd3156@fedeva.UUCP (Bill Daniels) writes:
>| Some folks in my organization have been led to believe that a "screaming" 
>| modem/transceiver can lock up an ethernet by asserting carrier forever.
>| Supposedly token stuff like GM/MAP does not permit this...
>| What do you think?
>
>Any piece of hardware can fail. Your token-ring transmitters can go into
>"screaming" mode, too. But well-designed hardware tries to avoid failing in
>ways that will take down the whole net. In particular, Ethernet transceivers
>(see the DEC/Intel/Xerox Ethernet spec) have what is called "jabber control"
>to prevent exactly this kind of "screaming" (which is usually a controller
>board fault, b.t.w., not the transceiver itself). The odds that a controller
>will go berserk *AND* that the jabber control will have failed at the same time
>are much less than either fault alone. And each fault is itself very rare.

In addition to Rob's comments, it should be noted that the jabber
control is supposed to be implemented so as to eliminate the chance of
the network being overloaded if any one component fails. In one design
I saw, the jabber control was replicated three times. Any one of them
could shut down the transceiver if carrier were asserted too long. 

The jabber control functions independently of the controller or the
transmit or receive circuitry. It listens to the trunk cable; there is
no possible failure of the transmitter or receiver that could disable it. 
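
The idea can be sketched as a simple watchdog timer. This is a toy model,
not any vendor's circuit: the class name, the 50 ms limit, and the latching
behaviour (stay off until explicitly reset) are all assumptions made for
illustration. For scale, a maximum-length 1518-byte frame takes about 1.2 ms
at 10 Mbit/sec, so any sensible limit is far longer than a legal packet.

```python
class JabberWatchdog:
    """Toy jabber control: cut off a transmitter that talks too long."""

    def __init__(self, limit_s=0.05):   # illustrative limit, not from a spec
        self.limit_s = limit_s
        self.started = None             # when the current transmission began
        self.tripped = False

    def sample(self, now_s, transmitting):
        """Record one observation; return True while output is enabled."""
        if self.tripped:
            return False                # latched off until reset()
        if not transmitting:
            self.started = None         # line went quiet, timer clears
            return True
        if self.started is None:
            self.started = now_s
        if now_s - self.started > self.limit_s:
            self.tripped = True         # carrier held too long: disconnect
            return False
        return True

    def reset(self):
        """Model a power cycle or explicit reset."""
        self.started = None
        self.tripped = False
```

The point of keeping the timer this simple is that it has very little to go
wrong, and it does not depend on the controller behaving itself.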

Of course, this increases the chance that one node will be cut off if
the jabber control activates when it shouldn't. But this is consistent
with the philosophy of protecting the network even at the cost of
slightly decreased availability for a particular node. 

All this notwithstanding, we have hundreds of transceivers in active
use at my company and I haven't ever seen one fail. 

-- 
Make Japan the 51st state!

I speak for myself, not the company.
Phil Ngai, {ucbvax,decwrl,allegra}!amdcad!phil or phil@amd.com

smb@ulysses.homer.nj.att.com (Steven Bellovin) (05/20/88)

This afternoon, we had some sort of network lockup that could have been
a two-point failure.  Ulysses (a Sun-3/280) suddenly started muttering
``ie1: Ethernet jammed''.  The lights on the transceiver showed
continuous receive, as if someone were indeed talking continuously.
I've only seen something like this once before, and I deliberately
applied the technique I discovered accidentally last time:  I
unterminated the coax.  That caused the jabberer to see a collision,
and hence to shut up.  The network recovered immediately, and
everything was able to talk once again.

The particular net in question is a very difficult one to debug.  It's
our backbone, and consists of a very short segment of coax with lots of
repeaters to other segments.  Some of those repeaters and transceivers
are ancient; our segment may be connected via a 5 or 6 year-old 3Com
transceiver.  I have no idea which host was misbehaving; it could even
have been Ulysses, since the repeater may have isolated such a failing
segment.

All that, right on the heels of this discussion (and while a network
equipment sales rep was in my office, trying to sell me a network
management gizmo!) got me thinking.  There is in general *no way to
know* if a jabber-detect has failed -- there is no standard diagnostic
for it!  Thus, the second failure (of a controller) can happen at any
time in the future; the two don't have to be coincident.  (As has been
noted, some jabber circuits are redundant, but not necessarily all of
them.)
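
A back-of-the-envelope sketch of why the latent fault matters. It assumes the
two faults are independent, that each has a constant yearly probability, and
that a dead jabber circuit stays dead and unnoticed. The 1% rates are invented
for illustration, not measured.

```python
def p_protection_dead(p_jabber_fail_per_year, years_untested):
    """Chance the jabber circuit has silently failed after N untested years."""
    return 1.0 - (1.0 - p_jabber_fail_per_year) ** years_untested


def p_lockup_this_year(p_ctrl_per_year, p_jabber_fail_per_year, years_untested):
    """Chance a controller fault this year finds the protection already dead."""
    dead = p_protection_dead(p_jabber_fail_per_year, years_untested)
    return p_ctrl_per_year * dead


P_CTRL = 0.01   # hypothetical yearly chance a controller starts jabbering
P_JAB = 0.01    # hypothetical yearly chance the jabber circuit dies

for years in (1, 5, 10):
    risk = p_lockup_this_year(P_CTRL, P_JAB, years)
    print(f"{years:2d} years untested: lockup risk {risk:.6f} per year")
```

Under these assumptions the yearly risk keeps climbing the longer the jabber
circuit goes untested, which is the case for a periodic diagnostic.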


			--Steve Bellovin
			ulysses!smb, smb@ulysses.att.com

kwe@bu-cs.BU.EDU (kwe@bu-it.bu.edu (Kent W. England)) (05/20/88)

In article <299@fedeva.UUCP> wrd3156@fedeva.UUCP (Bill Daniels) writes:
>Some folks in my organization have been led to believe that a "screaming" 
>modem/transceiver can lock up an ethernet by asserting carrier forever.
>Supposedly token stuff like GM/MAP does not permit this. 

	A screaming baseband transceiver can take down a single
Ethernet segment.

	A screaming broadband modem can take down a broadband CATV
network.  MAP runs on a broadband network, using token bus.

	A broadcast medium like Ethernet or broadband CATV can be
disabled by hardware failures in transmitters.  This is independent of
the medium acquisition methodology.  MAP and Ethernet/802.3 are
equivalent in this respect.

	Kent England, Boston University

kwe@bu-cs.BU.EDU (kwe@bu-it.bu.edu (Kent W. England)) (05/20/88)

In article <10303@ulysses.homer.nj.att.com> smb@ulysses.homer.nj.att.com (Steven Bellovin) writes:
>This afternoon, we had some sort of network lockup that could have been
>a two-point failure.  Ulysses (a Sun-3/280) suddenly started muttering
>``ie1: Ethernet jammed''.  
>
>The particular net in question is a very difficult one to debug.  It's
>our backbone, and consists of a very short segment of coax with lots of
>repeaters to other segments.  Some of those repeaters and transceivers
>are ancient; our segment may be connected via a 5 or 6 year-old 3Com
>transceiver.  I have no idea which host was misbehaving; it could even
>have been Ulysses, since the repeater may have isolated such a failing
>segment.
>
	Old transceivers may not implement jabber control.  Old
repeaters do not provide fault isolation.

	For new equipment, specify 802.3 compliance and test
compliance.  Transceivers should implement jabber control and new
repeaters should implement the new IEEE 802.3 repeater specification
which provides a degree of fault isolation and should interrupt
repeating of jabbering [illegal] signals.  I believe all implementations
of the multiport repeater follow the new 802.3 repeater rules.  I
think all new implementations of Ethernet concentrators (a la the new
twisted pair concentrators) should implement the new rules.

	Kent England, Boston U

phil@amdcad.AMD.COM (Phil Ngai) (05/21/88)

In article <10303@ulysses.homer.nj.att.com> smb@ulysses.homer.nj.att.com (Steven Bellovin) writes:
>There is in general *no way to
>know* if a jabber-detect has failed -- there is no standard diagnostic
>for it!  Thus, the second failure (of a controller) can happen at any
>time in the future; the two don't have to be coincident.

This is a very important principle. I call it the "testing your
spare tire" policy. Redundancy without an alarm to notify you
when it has been invoked is very dangerous.

I tend to think of things like this as belonging under network
management.

The designers of Ethernet were concerned about this. That is why
version 2 has the "heartbeat" or collision presence test
at the end of every packet.

Unfortunately, jabber detect is not automated like this. There are
transceiver testers that can be used for this. Either Cabletron
or Titn made one that had such a test; unfortunately, I looked
at this two years ago and don't remember which one. In any case,
you'd have to manually go out and hook up the transceiver tester
to check the jabber detect.
-- 
Make Japan the 51st state!

I speak for myself, not the company.
Phil Ngai, {ucbvax,decwrl,allegra}!amdcad!phil or phil@amd.com

eshop@saturn.ucsc.edu (Jim Warner) (05/21/88)

In article <21695@amdcad.AMD.COM> phil@amdcad.UUCP (Phil Ngai) writes:
>
>This is a very important principle. I call it the "testing your
>spare tire" policy. Redundancy without an alarm to notify you
>when it has been invoked is very dangerous.
>
Not quite.  If a transceiver's jabber circuit causes it to disconnect
it will stay disconnected until either (a) the power is cycled or
(b) it is explicitly reset over control lines that are not implemented
in any products I know of.  Nothing more will be transmitted til the
fault is cleared by (a) or (b). I'd say that's pretty good notification
that the fail safe has tripped.

>The designers of Ethernet were concerned about this. That is why
>the version 2 has the "heartbeat" or collision presence test
>at the end of every packet.

I have a transceiver cable breakout box.  I used it to disconnect
the collision pair between several systems and their transceivers.
I expected to see messages start appearing
on the console.  What I got instead was silence.  The OS was never
notified.  I did this to a 3Com 3C501 and ran the self test that
came on a diskette with the interface.  It told me that my system
"passed with flying colors."  

jim warner

smb@ulysses.homer.nj.att.com (Steven Bellovin) (05/21/88)

The whole discussion about Ethernets locking up raises another issue:
what do folks use for network control, monitoring, and management?
We have a moderately complex topology:  one building wired with thinwire
Ethernet according to the DECconnect wiring scheme (which I described
recently in a long posting to comp.sys.sun), linked by LANbridges and
fiber transceivers to another building; it in turn has an IP-level
gateway to two other Ethernets, one of which is a multi-organization
backbone.  There are assorted other links as well, using varying
technologies.

The question is this:  what can we buy/build to monitor this?  I'm
especially concerned about the Ethernets; I'm interested in routine
monitoring for collision rates, plus fault isolation in case of network
meltdowns, collision storms, etc.  Presumably we need a TDR (and baseline
photographs of each segment); what else do we need?  Are etherfind(1) and
traffic(1) on our Suns sufficient?  How about an Excelan Lanalyzer, or
a Cabletron LAN MD or LAN SPECIALIST?  Does anyone have any experience
with any of those products?


		--Steve Bellovin
		ulysses!smb
		smb@ulysses.att.com

phil@amdcad.AMD.COM (Phil Ngai) (05/22/88)

In article <3366@saturn.ucsc.edu> eshop@saturn.ucsc.edu (Jim Warner) writes:
>In article <21695@amdcad.AMD.COM> phil@amdcad.UUCP (Phil Ngai) writes:
..This is a very important principle. I call it the "testing your
..spare tire" policy. Redundancy without an alarm to notify you
..when it has been invoked is very dangerous.
..
.Not quite.  If a transceiver's jabber circuit causes it to disconnect
.it will stay disconnected until either (a) the power is cycled or
.(b) it is explicitly reset over control lines that are not implemented

Sorry about that, my brain was going faster than my fingers at that
point. There are two concerns. First, you want a way to test backup
features which are normally rarely put in use. Second, you want to
be notified when a backup feature is put in use.

An example of the first is the heartbeat signal for the transceiver's
collision detect function. There is nothing similar for jabber.

An example of the second would be for the OS to complain if it wasn't
receiving heartbeat. As you have noted, jabber is very noticeable when
it trips, so that is not a problem. 

.I have a transceiver cable breakout box.  I used it on several
.systems to disconnect the collision pair between several systems
.and their transceivers.  I expected to see messages start appearing
.on the console.  What I got instead was silence.  The OS was never
.notified.  I did this to a 3com 3C501 and ran the self test that
.came on a diskette with the interface.  It told me that my system
."passed with flying colors."  

I think that says something about 3Com.
-- 

I speak for myself, not the company.
Phil Ngai, {ucbvax,decwrl,allegra}!amdcad!phil or phil@amd.com

ron@topaz.rutgers.edu (Ron Natalie) (05/25/88)

Broken devices can blow nearly all networks.  The continuously jabbering
transceiver on Ethernet is subject to the same problem as broken RF
modems on MAP broadbands.  The fact that one uses token passing rather
than carrier sense to arbitrate the bus does not help when some host
decides not to play by the rules.

This whole thing has nothing whatsoever to do with TCP/IP or the
higher level protocols; it's a media issue.

-Ron

ron@topaz.rutgers.edu (Ron Natalie) (05/25/88)

Well, transceivers do fail.  A jabbering transceiver is very very hard
to find.  More frequently what fails is the vampire tap, which can be found
with TDR, but finding a malicious transceiver is very hard.  BRL had one
that would intermittently transmit for about 15 seconds straight.  This
only became really noticeable as the Braindamaged microcode in one of our
Ethernet blew up when they couldn't get on the net for seven seconds.

-Ron