[comp.dcom.lans] Ethernet "heartbeat"

hardiman@csd4.csd.uwm.edu (Paul V Hardiman) (05/16/91)

While looking through vendor literature for Ethernet repeaters and
transceivers I've noticed several references to something called a
"heartbeat".  Can anyone give me a short summary of what a "heartbeat"
is and whether or not I should choose equipment that supports it?

Thanks,
Paul Hardiman
University of Wisconsin - Milwaukee

mikel@berlioz.nsc.com (Michael G. Lohmeyer) (05/16/91)

In article <12164@uwm.edu> hardiman@csd4.csd.uwm.edu (Paul V Hardiman) writes:
>While looking through vendor literature for Ethernet repeaters and
>transceivers I've noticed several references to something called a
>"heartbeat".  Can anyone give me a short summary of what a "heartbeat"
>is and whether or not I should choose equipment that supports it?
>
>Thanks,
>Paul Hardiman
>University of Wisconsin - Milwaukee

     Heartbeat is a pulse sent from the transceiver to the transmitting
ethernet card after a packet has been transmitted.  The pulse is sent
on the collision line of the AUI cable.  When transmitting an ethernet
packet, the data is transmitted on the TX+/- line of the AUI, and the
transceiver loops that data back on the RX+/- line so that the ethernet
board can monitor its own transmission and do a CRC or other check
on the packet.  This allows the transmitting node to make sure that it
is transmitting ok.  In this process, the collision lines, CD+/-, are not
actually checked, because no data/pulses are received on these lines during
a successful (non-collision) transmission.  The Heartbeat provides that check.
After transmitting a packet, if the transmitting node does not get a
Heartbeat signal from the transceiver, then it knows that the CD+/- lines
were not connected correctly.  If these lines were not connected correctly
during transmission, then there may have been a collision during the
transmission that the ethernet board did not sense.  Hence, the Heartbeat
verifies that the AUI cable is correctly connected to the board.
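
     The check described above can be sketched as a toy model.  This is
Python rather than driver code, and the names (Transceiver,
sqe_test_enabled, ci_pair_connected, transmit) are made up for
illustration; they do not correspond to any real hardware or driver
interface.

```python
from dataclasses import dataclass


@dataclass
class Transceiver:
    """Toy model of the transceiver at the far end of an AUI cable."""
    sqe_test_enabled: bool = True    # heartbeat switch on the transceiver
    ci_pair_connected: bool = True   # is the collision pair (CD+/-) wired through?

    def heartbeat_after_transmit(self) -> bool:
        """True if the station sees a pulse on the collision pair
        after a collision-free transmission ends."""
        return self.sqe_test_enabled and self.ci_pair_connected


def transmit(mau: Transceiver, stats: dict) -> None:
    """Send one frame and record whether the heartbeat arrived."""
    stats["sent"] += 1
    if not mau.heartbeat_after_transmit():
        # The station cannot tell a disabled heartbeat from a broken
        # collision path: both look like a missing pulse.
        stats["sqe_test_errors"] += 1


stats = {"sent": 0, "sqe_test_errors": 0}
transmit(Transceiver(), stats)                          # healthy path
transmit(Transceiver(ci_pair_connected=False), stats)   # broken collision pair
transmit(Transceiver(sqe_test_enabled=False), stats)    # heartbeat switched off
print(stats)
```

The last two cases are indistinguishable to the station, which is why a
transceiver with heartbeat switched off produces the same error count as
a bad collision pair.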

     Most equipment supports it.  It is part of the IEEE 802.3 spec.  Generally,
you want to make sure that the transceivers and MUXes that you get 
do support it, and also allow you to turn it off.  Sometimes you need
to be able to turn it off when connecting a MUX to a repeater for example,
depending on the equipment you buy.

Mike
-------------
Mike Lohmeyer				mikel@berlioz.nsc.com
National Semiconductor Corporation
(408) 721-8075

vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (05/17/91)

In article <1991May16.004523.21301@berlioz.nsc.com>, mikel@berlioz.nsc.com (Michael G. Lohmeyer) writes:
>
>   ...[nice description deleted]...
> 
>      Most equipment supports it.  It is part of the IEEE 802.3 spec.  Generally,
> you want to make sure that the transceivers and MUXes that you get 
> do support it, and also allow you to turn it off.  Sometimes you need
> to be able to turn it off when connecting a MUX to a repeater for example,
> depending on the equipment you buy.
> 
> Mike Lohmeyer				mikel@berlioz.nsc.com
> National Semiconductor Corporation


From what I've seen, while most hardware more or less supports it, all
software ignores it.  If you write a BSD UNIX style driver, there's no
good place to count or report missing heartbeats.  You would not
want to printf on every missing heartbeat.  (I suppose you could do a
printf to the kernel or system once a day.)  If you take systems to major
trade shows, the people running the network always (in my experience)
refuse to listen to network problem reports until they've checked that the
transceivers have SQE turned off.

Perhaps this can be predicted from the fact that SQE is practically
optional, because it is not present in all of V1, V2, and 802.3, and
because all transceivers (that I've seen) allow you to turn it off.  What
is optional or commonly turned off cannot be relied upon.  Any error
reporting mechanism that cannot be relied upon, and is useful only for
detecting a relatively unlikely error is likely to atrophy.  Yes, I've
discovered and fixed loose connectors that heartbeat would have detected.
Has anyone ever had a system diagnose such a problem on its own?



Vernon Schryver,  vjs@sgi.com

oberman@ptavv.llnl.gov (05/17/91)

In article <104479@sgi.sgi.com>, vjs@rhyolite.wpd.sgi.com (Vernon Schryver) writes:
> 
> Perhaps this can be predicted from the fact that SQE is practically
> optional, because it is not present in all of V1, V2, and 802.3, and
> because all transceivers (that I've seen) allow you to turn it off.  What
> is optional or commonly turned off cannot be relied upon.  Any error
> reporting mechanism that cannot be relied upon, and is useful only for
> detecting a relatively unlikely error is likely to atrophy.  Yes, I've
> discovered and fixed loose connectors that heartbeat would have detected.
> Has anyone ever had a system diagnose such a problem on its own?

Heartbeat is an Ethernet V2 term. V1 had no heartbeat or anything like it.
802.3 has a SIMILAR thing called SQE. SQE is recommended for all devices save
repeaters where it is forbidden.

On the other hand, Heartbeat is present and non-switchable on all Ethernet V2
transceivers. Ethernet V2 repeaters do require heartbeat and are not satisfied
by SQE.

SQE is unlikely to be very helpful in finding loose connectors, but I do have
some workstations that generate LOTS of errors if they don't see it. If they
run NFS the error reporting can essentially kill the system.

R. Kevin Oberman			Lawrence Livermore National Laboratory
Internet: oberman@icdc.llnl.gov		(415) 422-6955

Disclaimer: Don't take this too seriously. I just like to improve my typing
and probably don't really know anything useful about anything. Especially
anything gnu.

werme@Alliant.COM (Ric Werme) (05/17/91)

In article <104479@sgi.sgi.com> vjs@rhyolite.wpd.sgi.com (Vernon Schryver) writes:
>From what I've seen, while most hardware more or less supports it, all
>software ignores it.  If you write a BSD UNIX style driver, there's no
>good place to count or report missing heartbeats.

In Alliant's in house net, we generally disable heartbeat on all our
transceivers so that we don't have to remember to check before we plug one
into a repeater.  Heck, if you can't use it sometimes, why use it at all?

An ioctl in our CMC-130 driver does let us get to the statistics the CMC board
keeps:

marley 28% ifstats en0
Transmit data:
frames sent without errors:                             349978
frames sent despite SQE test errors:                    316660
frames sent after deferral due to active medium:         23704
frames sent after a single collision:                    10078
frames sent after multiple collisions:                    6328
frames abandoned after 16 collisions:                        0
frames abandoned due to late collision:                      1
frames abandoned due to no carrier:                          8
frames abandoned due to length > 1518:                       0
frames abandoned due to silo underflow:                      0
frames abandoned due to board level memory error:            0

Receive data:
frames received without errors:                         385372
frames received with CRC errors:                            53
frames received with alignment errors:                      50
frames lost due to no receive buffers:                      77
frames lost due to silo overrun:                             1
frames lost due to board level memory error:                 0

What I've never bothered to figure out is why the count of frames with SQE
errors doesn't match the number of frames sent.  It may be that the SQE test
is skipped if the board has another frame to send after one finishes.
-- 

| A pride of lions              | Eric J Werme                   |
| A gaggle of geese             | uucp: mit-eddie!alliant!werme  |
| An odd lot of programmers     | Phone: 508-486-1214            |

hedrick@athos.rutgers.edu (Charles Hedrick) (05/19/91)

>SQE is unlikely to be very helpful in finding loose connectors, but I do have
>some workstations that generate LOTS of errors if they don't see it. If they
>run NFS the error reporting can essentially kill the system.

I'm not sure why you say this.  Some older cisco routers use Interlan
multibus Ethernet controllers.  With those controllers, cisco checks
heartbeat, and declares the interface down if it's missing.  This
seems to detect disconnected cables fine, except for a few cases where
it's partly in and so not all signals are interrupted.  (The test can
be disabled by a configuration command.)  Newer cisco products use a
cisco controller card, which can hear its own transmissions.  This
allows a more complete test than what you get from heartbeat. But for
the majority of controllers heartbeat is the best you can do, and does
seem worth doing.

vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (05/19/91)

>>SQE is unlikely to be very helpful in finding loose connectors, but I do have
>>some workstations that generate LOTS of errors if they don't see it. If they
>>run NFS the error reporting can essentially kill the system.

>                                 Some older cisco routers use Interlan
> multibus Ethernet controllers.  With those controllers, cisco checks
> heartbeat, and declares the interface down if it's missing.


Ok, it seems some unnamed workstations (which brand and models?), some older
Cisco routers (what about newer ones?), and VMS-VAXen pay attention to
heartbeat.  Are there any others?  Can any or all three of these be
configured to ignore the absence of heartbeat, and are they commonly so?

Consider that Cisco routers tend to be far less common than workstations,
and are less likely to have loose or disconnected cables (tho with
unhappier consequences).  VAXen are not the most popular machine this
decade.  If these are the only machines that care, it seems to me that
heartbeat (or SQE which is not identical, but close enough) is almost
useless in real life.

I care, because I'm wondering if a bunch of BSD-style ethernet drivers I
know about should be changed to notice missing heartbeats.  They did notice
about 3 years ago, but only by mistake.  If they should care, where should
they report the problem?  Would a printf be appreciated or cursed?  Say one
printf on the first missing heartbeat?  My guess is it would be cursed,
because almost all transceivers have heartbeat turned off.
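
One way to pin down "one printf on the first missing heartbeat" is
sketched below in Python.  It is only a model of the policy, not driver
code.  Two choices in it are assumptions: the report is re-armed once a
heartbeat has been seen again, and it is suppressed when the same frame
also reported no carrier.  The class and method names are invented for
the example.

```python
class HeartbeatReporter:
    """Count every missing heartbeat, but report only the first one
    after start-up or after heartbeats have been seen again."""

    def __init__(self):
        self.missing_count = 0
        self.messages = []      # stands in for printf output
        self._armed = True      # report the next missing heartbeat

    def frame_done(self, heartbeat_seen, no_carrier=False):
        if heartbeat_seen:
            self._armed = True  # assumption: seeing a heartbeat re-arms the report
            return
        self.missing_count += 1
        if no_carrier:
            return              # assumption: the no-carrier report covers this case
        if self._armed:
            self.messages.append("missing heartbeat")
            self._armed = False


reporter = HeartbeatReporter()
events = [
    (False, False),  # first miss: reported
    (False, False),  # repeat miss: counted only
    (True, False),   # heartbeat seen: re-arm
    (False, True),   # miss with no carrier: counted only
    (False, False),  # first miss since re-arm: reported
]
for heartbeat_seen, no_carrier in events:
    reporter.frame_done(heartbeat_seen, no_carrier)
print(reporter.missing_count, len(reporter.messages))
```

With heartbeat permanently switched off, this policy produces exactly
one message per boot, which is the case the guess about cursing is
concerned with.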


Vernon Schryver,   vjs@sgi.com

oberman@ptavv.llnl.gov (05/20/91)

In article <May.18.20.17.29.1991.13010@athos.rutgers.edu>, hedrick@athos.rutgers.edu (Charles Hedrick) writes:
>>SQE is unlikely to be very helpful in finding loose connectors, but I do have
>>some workstations that generate LOTS of errors if they don't see it. If they
>>run NFS the error reporting can essentially kill the system.
> 
> I'm not sure why you say this.  Some older cisco routers use Interlan
> multibus Ethernet controllers.  With those controllers, cisco checks
> heartbeat, and declares the interface down if it's missing.  This

I didn't mean to imply that SQE could not detect a disconnection, just that it
was usually obvious from the lack of received packets. I just think that there
are better reasons to use SQE (like what it was designed for). While I've
certainly had my share of failures because the cables fall off of MAUs and
workstations, these are easy to find.

I'm more concerned with detecting the flakey unit than the totally dead one.

R. Kevin Oberman			Lawrence Livermore National Laboratory
Internet: oberman@icdc.llnl.gov		(415) 422-6955


oberman@ptavv.llnl.gov (05/20/91)

In article <105002@sgi.sgi.com>, vjs@rhyolite.wpd.sgi.com (Vernon Schryver) writes:

> Ok, it seems some unnamed workstations (which brand and models?), some older
> Cisco routers (what about newer ones?), and VMS-VAXen pay attention to
> heartbeat.  Are there any others?  Can any or all three of these be
> configured to ignore the absence of heartbeat, and are they commonly so?
> 
> Consider that Cisco routers tend to be far less common than workstations,
> and are less likely to have loose or disconnected cables (tho with
> unhappier consequences).  VAXen are not the most popular machine this
> decade.  If these are the only machines that care, it seems to me that
> heartbeat (or SQE which is not identical, but close enough) is almost
> useless in real life.
> 
> I care, because I'm wondering if a bunch of BSD-style ethernet drivers I
> know about should be changed to notice missing heartbeats.  They did notice
> about 3 years ago, but only by mistake.  If they should care, where should
> they report the problem?  Would a printf be appreciated or cursed?  Say one
> printf on the first missing heartbeat?  My guess is it would be cursed,
> because almost all transceivers have heartbeat turned off.

The bottom line is that SQE is badly understood. This thread just goes to prove
it. It was not designed to detect loose cables, although it's quite good at it.
It is there to detect the failure of MAUs and especially collision detect
circuitry. And, since MAUs tend to be very reliable, people don't see the
importance of this, so they don't bother with SQE. That's a lazy action. It
violates the standard (8802-3), and that alone is, IMHO, a reason to use it.

Trouble-shooting a LAN is non-trivial. I have spent hours of time tracking down
components which were sending out garbled packets which were breaking other
things. Any tool I can use is valuable. SQE MIGHT have saved me several hours,
IF IT WAS PROPERLY IMPLEMENTED by those who write the software.

On the other hand, I would probably curse a printf. The failure should be
counted and logged. I don't want to fill 10 KB of disk space every time there
is a failure. That's what SNMP is for. The failure information needs to be
there so I can use it to track down a problem. But since errors tend to occur
in large numbers (1 per packet), I feel a printf is over-kill.
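
A rough sketch of "counted and logged" without one message per packet
follows.  It is a Python model, not kernel code; the class name, the
600-second interval and the message text are all arbitrary choices for
illustration.

```python
class SqeErrorCounter:
    """Count every missing heartbeat; emit at most one log line per interval."""

    def __init__(self, log_interval_s=600.0):
        self.count = 0
        self.log_interval_s = log_interval_s
        self.log_lines = []            # stands in for the system log
        self._last_log_time = None
        self._count_at_last_log = 0

    def record_missing_heartbeat(self, now):
        self.count += 1
        due = (self._last_log_time is None
               or now - self._last_log_time >= self.log_interval_s)
        if due:
            new = self.count - self._count_at_last_log
            self.log_lines.append(
                f"{new} missing heartbeat(s) since last report, {self.count} total")
            self._last_log_time = now
            self._count_at_last_log = self.count


# One failure per second for 1000 seconds, i.e. an error on every packet.
counter = SqeErrorCounter()
for t in range(1000):
    counter.record_missing_heartbeat(float(t))
print(counter.count, len(counter.log_lines))
```

The full count stays available for whatever management tool reads it,
while the log grows by one line per interval instead of one per failure.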

R. Kevin Oberman			Lawrence Livermore National Laboratory
Internet: oberman@icdc.llnl.gov		(415) 422-6955


phil@brahms.amd.com (Phil Ngai) (05/20/91)

werme@Alliant.COM (Ric Werme) writes:
>What I've never bothered to figure out is why the count of frames with SQE
>errors doesn't match the number of frames sent.  It may be that the SQE test
>is skipped if the board has another frame to send after one finishes.

SQE is there for the purpose of testing the collision path. If a
transmission is collided with (one or more times) and then later
succeeds, the controller does not require the "heartbeat" after a
successful transmission since the collision path is known to be working
for that packet.

I am not claiming that any particular device works this way, but that
is how it should work.
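
As a sketch of that rule, and with the same caveat that no particular
controller is claimed to behave this way, the decision could look like
the Python function below.  The function and parameter names are
invented for the example.

```python
def sqe_test_error(collisions_before_success: int, heartbeat_seen: bool) -> bool:
    """Decide whether a successfully sent frame counts as an SQE test error."""
    if collisions_before_success > 0:
        # The collision signal was already seen for this frame, so the
        # collision path is known to work and no heartbeat is required.
        return False
    return not heartbeat_seen


print(sqe_test_error(0, heartbeat_seen=False))  # clean send, no heartbeat
print(sqe_test_error(2, heartbeat_seen=False))  # collided twice, then sent
print(sqe_test_error(0, heartbeat_seen=True))   # clean send, heartbeat seen
```

Under this rule only the first case is an error, so the SQE error count
would not include frames that were sent after one or more collisions.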

--

phil@brahms.amd.com (Phil Ngai) (05/20/91)

hedrick@athos.rutgers.edu (Charles Hedrick) writes:

>be disabled by a configuration command.)  Newer cisco products use a
>cisco controller card, which can hear its own transmissions.  This
>allows a more complete test than what you get from heartbeat. But for

Hearing your own transmission only allows you to check your transmitted
data. It does nothing to exercise the collision signal path. Remember,
the AUI cable has three signal pairs, XMT, RCV, and SQE.

Also remember the last two letters of CSMA/CD.

--

phil@brahms.amd.com (Phil Ngai) (05/20/91)

vjs@rhyolite.wpd.sgi.com (Vernon Schryver) writes:
>decade.  If these are the only machines that care, it seems to me that
>heartbeat (or SQE which is not identical, but close enough) is almost
>useless in real life.

I think that people who understand what CSMA/CD is about try to run
with SQE, but of course there are lots of sites and vendors out
there who don't understand as much as they should.

>I care, because I'm wondering if a bunch of BSD-style ethernet drivers I
>know about should be changed to notice missing heartbeats.  They did notice
>about 3 years ago, but only by mistake.  If they should care, where should
>they report the problem?  Would a printf be appreciated or cursed?  Say one
>printf on the first missing heartbeat?  My guess is it would be cursed,
>because almost all transceivers have heartbeat turned off.

I would say it ought to be maintained as a statistic by the driver
and accessible via ioctl. Ideally, you'd bug the system manager about
no SQE, but realistically, you'd probably get calls from people who
don't understand 802.3 and don't want to.

As far as printf on first missing heartbeat, what does "first" mean?
What if someone unplugs the AUI cable for network maintenance and
then reconnects it an hour later?

--

phil@brahms.amd.com (Phil Ngai) (05/20/91)

oberman@ptavv.llnl.gov writes:
>I'm more concerned with detecting the flakey unit than the totally dead one.

Right on! Especially the flakey units that slow down the network for
everyone in a way that makes it almost impossible to pinpoint the culprit.

--

mikel@berlioz.nsc.com (Michael G. Lohmeyer) (05/21/91)

In article <1991May19.133507.1@ptavv.llnl.gov> oberman@ptavv.llnl.gov writes:
>The bottom line is that SQE is badly understood. This thread just goes to prove
>it. It was not designed to detect loose cables, although it's quite good at it.
>It is there to detect the failure of MAUs and especially collision detect
>circuitry. And, since MAUs tend to be very reliable, people don't see the
>importance of this, so they don't bother with SQE. That's a lazy action. It
>violates the standard (8802-3), and that alone is, IMHO, a reason to use it.

     When I originally said that heartbeat could be used to detect a loose
cable, I was saying that if the transmit and receive pairs were touching 
and the collision pair was not, then the lack of a heartbeat at the end of
a transmission will signify that the collision lines were not touching.  
I should have mentioned, as is said above, that the main purpose of heartbeat
is to test the MAU's collision (or SQE) circuitry (which includes the cable).
Sorry to have caused any confusion.  In any case, in my opinion, you
should not disable the SQE.  Also, it would be nice if driver software,
in general, reported heartbeat errors, but, as has been stated, this is
not the case.

     An interesting point to make about heartbeat and SQE, as many of you
probably already know, is that SQE (which was defined in IEEE 802.3) stands
for Signal Quality Error.  SQE does not specifically stand for heartbeat
(which is a term that comes from Ethernet II).  As the name implies, SQE
signifies an error in the signal transmitted or received.  So, collisions
and heartbeat are subsets of SQE.  I bring this up because most MAUs, etc.
always talk about disabling SQE.  Taking this literally means that you
are disabling the collision detection circuitry, not just heartbeat. 
Of course, the SQE disable on these MAUs means to disable heartbeat, not
the entire collision circuit.  That's my trivia for the day.
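
     Since collisions and heartbeat arrive on the same pair, a station can
only tell them apart by timing.  The Python sketch below illustrates the
idea.  The function name and the 2.0 microsecond window are placeholders
chosen for the example, not values taken from the standard.

```python
def classify_ci_signal(signal_start_us, tx_end_us, test_window_us=2.0):
    """Classify a signal on the collision pair by when it starts
    relative to the end of our own transmission."""
    if signal_start_us <= tx_end_us:
        return "collision"      # arrived while we were still transmitting
    if signal_start_us - tx_end_us <= test_window_us:
        return "heartbeat"      # arrived shortly after we finished
    return "other"              # not related to this transmission


print(classify_ci_signal(40.0, tx_end_us=57.6))
print(classify_ci_signal(58.5, tx_end_us=57.6))
print(classify_ci_signal(90.0, tx_end_us=57.6))
```

A station that cannot make this distinction will treat the heartbeat as
a collision, which is one reason the switch on the transceiver exists.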

Mike
-------------
Mike Lohmeyer				mikel@berlioz.nsc.com
Local Area Networks
National Semiconductor Corporation
(408) 721-8075

lstowell@pyrnova.pyramid.com (Lon Stowell) (05/21/91)

In article <1991May20.185642.6704@berlioz.nsc.com> mikel@berlioz.nsc.com (Michael G. Lohmeyer) writes:

>Sorry to have caused any confusion.  In any case, in my opinion, you
>should not disable the SQE.  Also, it would be nice if driver software,
>in general, reported heartbeat errors, but, as has been stated, this is
>not the case.
>
  It would be nice if drivers reported physical errors period.

  Many drivers ignore SQE errors simply because it is so poorly
  understood...and mis-jumpered by installers.  

  V1 did not support it....if it is enabled, the station thinks
  it is getting a collision signal....which most drivers had
  trouble ignoring if associated with Xmit Done status....unless
  the hardware and software could note that it happened AFTER
  data was xmitted.

  V2 did support it, but not all "V2" compliance claiming
  hardware did....and few installers are/were aware of the
  subtle difference....since as noted they will interoperate as
  long as the transceiver and the station don't annoy each
  other.

  SQE is there to protect OTHER stations on the LAN from a
  station/MAU combination which is unable to detect a collision.
  Depending on your philosophy about designing products for
  cost-effective service, you would try to be a good LAN citizen
  or not.

  Absence of SQE is intended to have the offending station/MAU
  shut down after the first xmission attempt....to protect other
  stations from collisions.  It implies that your MAU is unable
  to detect a collision...or that your DTE is unable to see the
  collision signal--as the only difference between the two is
  WHEN the signal occurs.  

  There are other techniques for noticing that you have physical
  level problems (with software).   SQE status would just help
  out the poor techie....and we all know how most programmers
  feel about service technicians.... >:-)

vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (05/21/91)

In article <1991May20.165442.23834@amd.com>, phil@brahms.amd.com (Phil Ngai) writes:
> 
> I would say [count of missed heartbeats] ought to be maintained as a
>	statistic by the driver
> and accessible via ioctl.

If it's just available via ioctl, then what good is it?  Those of us who
might write a program to fetch the count are as likely to have the spare
hardware to try some easter egging.  The rest would still be in the dark.

>                           Ideally, you'd bug the system manager about
> no SQE, but realistically, you'd probably get calls from people who
> don't understand 802.3 and don't want to.

This sounds like an argument against a printf.

> As far as printf on first missing heartbeat, what does "first" mean?
> What if someone unplugs the AUI cable for network maintenance and
> then reconnects it an hour later?

I meant "first" as in "first missing heartbeat after initialization or after
having seen heartbeats."  I'd suppress the printf if there were a
"simultaneous" no-carrier report, which covers most disconnected cables.


Others have written bad words about printf's.  So until and unless there is
something in an SNMP MIB, I guess the count will stay a secret here.



Vernon Schryver,   vjs@sgi.com