hardiman@csd4.csd.uwm.edu (Paul V Hardiman) (05/16/91)
While looking through vendor literature for Ethernet repeaters and transceivers I've noticed several references to something called a "heartbeat". Can anyone give me a short summary of what a "heartbeat" is and whether or not I should choose equipment that supports it? Thanks, Paul Hardiman University of Wisconsin - Milwaukee
mikel@berlioz.nsc.com (Michael G. Lohmeyer) (05/16/91)
In article <12164@uwm.edu> hardiman@csd4.csd.uwm.edu (Paul V Hardiman) writes:
>While looking through vendor literature for Ethernet repeaters and
>transceivers I've noticed several references to something called a
>"heartbeat". Can anyone give me a short summary of what a "heartbeat"
>is and whether or not I should choose equipment that supports it?
>
>Thanks,
>Paul Hardiman
>University of Wisconsin - Milwaukee

Heartbeat is a pulse sent from the transceiver to the transmitting Ethernet card after a packet has been transmitted. The pulse is sent on the collision pair of the AUI cable.

When transmitting an Ethernet packet, the data is transmitted on the TX+/- pair of the AUI, and the transceiver loops that data back on the RX+/- pair so that the Ethernet board can monitor its own transmission and do a CRC or other check on the packet. This allows the transmitting node to make sure that it is transmitting OK. In this process, the collision lines, CD+/-, are not actually checked, because no data or pulses are received on these lines during a successful (non-collision) transmission.

Heartbeat provides that check. After transmitting a packet, if the transmitting node does not get a heartbeat signal from the transceiver, then it knows that the CD+/- lines were not connected correctly. If these lines were not connected correctly during transmission, then there may have been a collision during the transmission that the Ethernet board did not sense. Hence, the heartbeat verifies that the AUI cable is correctly connected to the board.

Most equipment supports it. It is part of the IEEE 802.3 spec. Generally, you want to make sure that the transceivers and MUXes that you get do support it, and also allow you to turn it off. Sometimes you need to be able to turn it off, when connecting a MUX to a repeater for example, depending on the equipment you buy.

Mike
-------------
Mike Lohmeyer mikel@berlioz.nsc.com
National Semiconductor Corporation (408) 721-8075
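The two checks described in this post (looped-back data on RX+/-, then a pulse on CD+/-) can be modelled in a few lines. This is only an illustrative sketch, not driver code: `check_transmission` and its arguments are hypothetical names, and a real controller reports these conditions through its own status bits.

```python
from typing import List


def check_transmission(sent: bytes, looped_back: bytes, cd_pulse_after: bool) -> List[str]:
    """Return the problems a transmitting station could infer after one frame.

    sent           -- data the board put on TX+/-
    looped_back    -- data the transceiver returned on RX+/-
    cd_pulse_after -- True if CD+/- pulsed after the frame ended (heartbeat)
    """
    problems = []
    if looped_back != sent:
        # The self-monitoring check: what came back is not what went out.
        problems.append("loopback mismatch on RX+/-")
    if not cd_pulse_after:
        # Nothing exercised the collision pair, so a collision could have gone unseen.
        problems.append("no heartbeat on CD+/-")
    return problems


# A frame that loops back intact but gets no heartbeat still raises a flag.
print(check_transmission(b"frame", b"frame", cd_pulse_after=False))
```

Note that a missing heartbeat is ambiguous in this model: it looks the same whether the collision pair is broken or heartbeat is simply switched off at the transceiver.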
vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (05/17/91)
In article <1991May16.004523.21301@berlioz.nsc.com>, mikel@berlioz.nsc.com (Michael G. Lohmeyer) writes:
>
> ...[nice description deleted]...
>
> Most equipment supports it. It is a spec for IEEE 802.3. Generally,
> you want to make sure that the transceivers and MUXes that you get
> do support it, and also allow you to turn it off. Sometimes you need
> to be able to turn it off when connecting a MUX to a repeater for example,
> depending on the equipment you buy.
>
> Mike Lohmeyer mikel@berlioz.nsc.com
> National Semiconductor Corporation

From what I've seen, while most hardware more or less supports it, all software ignores it. If you write a BSD UNIX style driver, there's no good place to count or report missing heartbeats. You would not want to printf on every missing heartbeat. (I suppose you could do a printf to the kernel or system log once a day.)

If you take systems to major trade shows, the people running the network always (in my experience) refuse to listen to network problem reports until they've checked that the transceivers have SQE turned off.

Perhaps this can be predicted from the fact that SQE is practically optional, because it is not present in all of V1, V2, and 802.3, and because all transceivers (that I've seen) allow you to turn it off. What is optional or commonly turned off cannot be relied upon. Any error reporting mechanism that cannot be relied upon, and is useful only for detecting a relatively unlikely error, is likely to atrophy. Yes, I've discovered and fixed loose connectors that heartbeat would have detected. Has anyone ever had a system diagnose such a problem on its own?

Vernon Schryver, vjs@sgi.com
oberman@ptavv.llnl.gov (05/17/91)
In article <104479@sgi.sgi.com>, vjs@rhyolite.wpd.sgi.com (Vernon Schryver) writes:
>
> Perhaps this can be predicted from the fact that SQE is practically
> optional, because it is not present in all of V1, V2, and 802.3, and
> because all transceivers (that I've seen) allow you to turn it off. What
> is optional or commonly turned off cannot be relied upon. Any error
> reporting mechanism that cannot be relied upon, and is useful only for
> detecting a relatively unlikely error is likely to atrophy. Yes, I've
> discovered and fixed loose connectors that heartbeat would have detected.
> Has anyone ever had a system diagnose such a problem on its own?

Heartbeat is an Ethernet V2 term. V1 had no heartbeat or anything like it. 802.3 has a SIMILAR thing called SQE. SQE is recommended for all devices save repeaters, where it is forbidden. On the other hand, heartbeat is present and non-switchable on all Ethernet V2 transceivers. Ethernet V2 repeaters do require heartbeat and are not satisfied by SQE.

SQE is unlikely to be very helpful in finding loose connectors, but I do have some workstations that generate LOTS of errors if they don't see it. If they run NFS the error reporting can essentially kill the system.

R. Kevin Oberman
Lawrence Livermore National Laboratory
Internet: oberman@icdc.llnl.gov (415) 422-6955

Disclaimer: Don't take this too seriously. I just like to improve my typing and probably don't really know anything useful about anything. Especially anything gnu.
werme@Alliant.COM (Ric Werme) (05/17/91)
In article <104479@sgi.sgi.com> vjs@rhyolite.wpd.sgi.com (Vernon Schryver) writes:
>From what I've seen, while most hardware more or less supports it, all
>software ignores it. If you write a BSD UNIX style driver, there's no
>good place to count or report missing heartbeats.

In Alliant's in-house net, we generally disable heartbeat on all our transceivers so that we don't have to remember to check before we plug one into a repeater. Heck, if you can't use it sometimes, why use it at all?

An ioctl in our CMC-130 driver does let us get to the statistics the CMC board keeps:

marley 28% ifstats en0
Transmit data:
    frames sent without errors: 349978
    frames sent despite SQE test errors: 316660
    frames sent after deferral due to active medium: 23704
    frames sent after a single collision: 10078
    frames sent after multiple collisions: 6328
    frames abandoned after 16 collisions: 0
    frames abandoned due to late collision: 1
    frames abandoned due to no carrier: 8
    frames abandoned due to length > 1518: 0
    frames abandoned due to silo underflow: 0
    frames abandoned due to board level memory error: 0
Receive data:
    frames received without errors: 385372
    frames received with CRC errors: 53
    frames received with alignment errors: 50
    frames lost due to no receive buffers: 77
    frames lost due to silo overrun: 1
    frames lost due to board level memory error: 0

What I've never bothered to figure out is why the count of frames with SQE errors doesn't match the number of frames sent. It may be that the SQE test is skipped if the board has another frame to send after one finishes.

--
| A pride of lions | Eric J Werme |
| A gaggle of geese | uucp: mit-eddie!alliant!werme |
| An odd lot of programmers | Phone: 508-486-1214 |
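For comparing counters like these across machines, a short script is handier than reading them by eye. The sketch below assumes one counter per line in `label: value` form; `parse_ifstats` is a hypothetical helper written for this illustration, not part of the Alliant tools, and the sample text is a subset of the figures posted above.

```python
from typing import Dict

SAMPLE = """\
Transmit data:
    frames sent without errors: 349978
    frames sent despite SQE test errors: 316660
    frames abandoned due to length > 1518: 0
Receive data:
    frames received with CRC errors: 53
"""


def parse_ifstats(text: str) -> Dict[str, int]:
    """Collect 'label: value' counters, skipping section headers and blank lines."""
    counters = {}
    for line in text.splitlines():
        # Split on the LAST colon so labels may contain other punctuation.
        label, sep, value = line.strip().rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            counters[label.strip()] = int(value)
    return counters


stats = parse_ifstats(SAMPLE)
sent = stats["frames sent without errors"]
sqe_errors = stats["frames sent despite SQE test errors"]
print(f"sent without errors: {sent}, SQE test errors: {sqe_errors}, "
      f"difference: {sent - sqe_errors}")
```

The script only makes the gap between the two counters explicit; it does not explain it.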
hedrick@athos.rutgers.edu (Charles Hedrick) (05/19/91)
>SQE is unlikely to be very helpful in finding loose connectors, but I do have
>some workstations that generate LOTS of errors if they don't see it. If they
>run NFS the error reporting can essentially kill the system.

I'm not sure why you say this. Some older cisco routers use Interlan multibus Ethernet controllers. With those controllers, cisco checks heartbeat, and declares the interface down if it's missing. This seems to detect disconnected cables fine, except for a few cases where it's partly in and so not all signals are interrupted. (The test can be disabled by a configuration command.)

Newer cisco products use a cisco controller card, which can hear its own transmissions. This allows a more complete test than what you get from heartbeat. But for the majority of controllers heartbeat is the best you can do, and does seem worth doing.
vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (05/19/91)
>>SQE is unlikely to be very helpful in finding loose connectors, but I do have
>>some workstations that generate LOTS of errors if they don't see it. If they
>>run NFS the error reporting can essentially kill the system.

> Some older cisco routers use Interlan
> multibus Ethernet controllers. With those controllers, cisco checks
> heartbeat, and declares the interface down if it's missing.

OK, it seems some unnamed workstations (which brand and models?), some older Cisco routers (what about newer ones?), and VMS VAXen pay attention to heartbeat. Are there any others? Can any or all three of these be configured to ignore the absence of heartbeat, and are they commonly so configured?

Consider that Cisco routers tend to be far less common than workstations, and are less likely to have loose or disconnected cables (though with unhappier consequences). VAXen are not the most popular machine this decade. If these are the only machines that care, it seems to me that heartbeat (or SQE, which is not identical, but close enough) is almost useless in real life.

I care, because I'm wondering if a bunch of BSD-style Ethernet drivers I know about should be changed to notice missing heartbeats. They did notice about 3 years ago, but only by mistake. If they should care, where should they report the problem? Would a printf be appreciated or cursed? Say one printf on the first missing heartbeat? My guess is it would be cursed, because almost all transceivers have heartbeat turned off.

Vernon Schryver, vjs@sgi.com
oberman@ptavv.llnl.gov (05/20/91)
In article <May.18.20.17.29.1991.13010@athos.rutgers.edu>, hedrick@athos.rutgers.edu (Charles Hedrick) writes:
>>SQE is unlikely to be very helpful in finding loose connectors, but I do have
>>some workstations that generate LOTS of errors if they don't see it. If they
>>run NFS the error reporting can essentially kill the system.
>
> I'm not sure why you say this. Some older cisco routers use Interlan
> multibus Ethernet controllers. With those controllers, cisco checks
> heartbeat, and declares the interface down if it's missing. This

I didn't mean to imply that SQE could not detect a disconnection, just that a disconnection is usually obvious from the lack of received packets. I just think that there are better reasons to use SQE (like what it was designed for).

While I've certainly had my share of failures because the cables fall off of MAUs and workstations, these are easy to find. I'm more concerned with detecting the flakey unit than the totally dead one.

R. Kevin Oberman
Lawrence Livermore National Laboratory
Internet: oberman@icdc.llnl.gov (415) 422-6955

Disclaimer: Don't take this too seriously. I just like to improve my typing and probably don't really know anything useful about anything. Especially anything gnu.
oberman@ptavv.llnl.gov (05/20/91)
In article <105002@sgi.sgi.com>, vjs@rhyolite.wpd.sgi.com (Vernon Schryver) writes:
> Ok, it seems an unnamed workstations (which brand and models?), some older
> Cisco routers (what about newer ones?), and VMS-VAXen pay attention to
> heartbeat. Are there any others? Can and are any or all three of these
> commonly configured to ignore the absence of heartbeat?
>
> Consider that Cisco routers tend to be far less common than workstations,
> and are less likely to have loose or disconnected cables (tho with
> unhappier consequences). VAXen are not the most popular machine this
> decade. If these are the only machines that care, it seems to me that
> heartbeat (or SQE which is not identical, but close enough) is almost
> useless in real life.
>
> I care, because I'm wondering if a bunch of BSD-style ethernet drivers I
> know about should be changed to notice missing heartbeats. They did notice
> about 3 years ago, but only by mistake. If they should care, where should
> they report the problem? Would a printf be appreciated or cursed? Say one
> printf on the first missing heartbeat? My guess is it would be cursed,
> because almost all transceivers have heartbeat turned off.

The bottom line is that SQE is badly understood. This thread just goes to prove it. It was not designed to detect loose cables, although it's quite good at it. It is there to detect the failure of MAUs and especially collision detect circuitry. And, since MAUs tend to be very reliable, people don't see the importance of this, so they don't bother with SQE. That's a lazy action. It violates the standard (8802-3), and that alone is, IMHO, a reason to use it.

Troubleshooting a LAN is non-trivial. I have spent hours tracking down components which were sending out garbled packets which were breaking other things. Any tool I can use is valuable. SQE MIGHT have saved me several hours, IF IT WAS PROPERLY IMPLEMENTED by those who write the software.

On the other hand, I would probably curse a printf. The failure should be counted and logged. I don't want to fill 10 KB of disk space every time there is a failure. That's what SNMP is for. The failure information needs to be there so I can use it to track down a problem. But since errors tend to occur in large numbers (1 per packet), I feel a printf is overkill.

R. Kevin Oberman
Lawrence Livermore National Laboratory
Internet: oberman@icdc.llnl.gov (415) 422-6955

Disclaimer: Don't take this too seriously. I just like to improve my typing and probably don't really know anything useful about anything. Especially anything gnu.
phil@brahms.amd.com (Phil Ngai) (05/20/91)
werme@Alliant.COM (Ric Werme) writes:
>What I've never bothered to figure out is why the frames with SQE errors
>doesn't match the number for frames sent. It may be that the SQE test is
>skipped if the board has another frame to send after one finishes.

SQE is there for the purpose of testing the collision path. If a transmission is collided with (one or more times) and then later succeeds, the controller does not require the "heartbeat" after the successful transmission, since the collision path is known to be working for that packet.

I am not claiming that any particular device works this way, but that is how it should work.

--
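The rule in this post can be written as a small predicate. It is a sketch of how the post says a controller should behave, not of any particular device; `sqe_test_error` and `collisions_before_success` are hypothetical names chosen for the illustration.

```python
def sqe_test_error(collisions_before_success: int, heartbeat_seen: bool) -> bool:
    """Flag an SQE test error only when nothing else exercised the collision path."""
    if collisions_before_success > 0:
        # A collision was signalled for this packet, so CD+/- is known to work.
        return False
    return not heartbeat_seen


# (collisions before success, heartbeat seen) for four hypothetical frames
frames = [(0, True), (0, False), (2, False), (1, True)]
errors = sum(sqe_test_error(c, hb) for c, hb in frames)
print(f"{errors} SQE test error(s) out of {len(frames)} frames")
```

Under this rule the SQE error counter can never exceed the number of frames that went out on the first attempt, which is one reason such a counter need not track the total number of frames sent.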
phil@brahms.amd.com (Phil Ngai) (05/20/91)
hedrick@athos.rutgers.edu (Charles Hedrick) writes: >be disabled by a configuration command.) Newer cisco products use a >cisco controller card, which can hear its own transmissions. This >allows a more complete test than what you get from heartbeat. But for Hearing your own transmission only allows you to check your transmitted data. It does nothing to exercise the collision signal path. Remember, the AUI cable has three signal pairs, XMT, RCV, and SQE. Also remember the last two letters of CSMA/CD. --
phil@brahms.amd.com (Phil Ngai) (05/20/91)
vjs@rhyolite.wpd.sgi.com (Vernon Schryver) writes:
>decade. If these are the only machines that care, it seems to me that
>heartbeat (or SQE which is not identical, but close enough) is almost
>useless in real life.

I think that people who understand what CSMA/CD is about try to run with SQE, but of course there are lots of sites and vendors out there who don't understand as much as they should.

>I care, because I'm wondering if a bunch of BSD-style ethernet drivers I
>know about should be changed to notice missing heartbeats. They did notice
>about 3 years ago, but only by mistake. If they should care, where should
>they report the problem? Would a printf be appreciated or cursed? Say one
>printf on the first missing heartbeat? My guess is it would be cursed,
>because almost all transceivers have heartbeat turned off.

I would say it ought to be maintained as a statistic by the driver and accessible via ioctl. Ideally, you'd bug the system manager about no SQE, but realistically, you'd probably get calls from people who don't understand 802.3 and don't want to.

As far as printf on first missing heartbeat, what does "first" mean? What if someone unplugs the AUI cable for network maintenance and then reconnects it an hour later?

--
phil@brahms.amd.com (Phil Ngai) (05/20/91)
oberman@ptavv.llnl.gov writes:
>I'm more concerned with detecting the flakey unit than the totally dead one.

Right on! Especially the flakey units that slow down the network for everyone in a way that makes it almost impossible to pinpoint the culprit.

--
mikel@berlioz.nsc.com (Michael G. Lohmeyer) (05/21/91)
In article <1991May19.133507.1@ptavv.llnl.gov> oberman@ptavv.llnl.gov writes: >The bottom line is that SQE is badly understood. This thread just goes to prove >it. It was not designed to detect loose cables, although it's quite good at it. >It is there to detect the failure of MAUs and especially collision detect >circuitry. And, since MAUs tend to be very reliable, people don't see the >importance of this, so they don't bother with SQE. That's a lazy action. It >violates the standard (8802-3), and that alone is, IMHO, a reason to use it. When I originally said that heartbeat could be used to detect a loose cable, I was saying that if the transmit and receive pairs were touching and the collision pair was not, then the lack of a heartbeat at the end of a transmission will signify that the collision lines were not touching. I should have mentioned, as is said above, that the main purpose of heartbeat is to test the MAU's collision (or SQE) circuitry (which includes the cable). Sorry to have caused any confusion. In any case, in my opinion, you should not disable the SQE. Also, it would be nice if driver software, in general, reported heartbeat errors, but, as has been stated, this is not the case. An interesting point to make about heartbeat and SQE, as many of you probably already know, is that SQE (which was defined in IEEE 802.3) stands for Signal Quality Error. SQE does not specifically stand for heartbeat (which is a term that comes from Ethernet II). As the name implies, SQE signifies an error in the signal transmitted or received. So, collisions and heartbeat are subsets of SQE. I bring this up because most MAUs, etc. always talk about disabling SQE. Taking this literally means that you are disabling the collision detection circuitry, not just heartbeat. Of course, the SQE disable on these MAUs means to disable heartbeat, not the entire collision circuit. That's my trivia for the day. 
Mike ------------- Mike Lohmeyer mikel@berlioz.nsc.com Local Area Networks National Semiconductor Corporation (408) 721-8075
lstowell@pyrnova.pyramid.com (Lon Stowell) (05/21/91)
In article <1991May20.185642.6704@berlioz.nsc.com> mikel@berlioz.nsc.com (Michael G. Lohmeyer) writes:
>Sorry to have caused any confusion. In any case, in my opinion, you
>should not disable the SQE. Also, it would be nice if driver software,
>in general, reported heartbeat errors, but, as has been stated, this is
>not the case.
>

It would be nice if drivers reported physical errors, period. Many drivers ignore SQE errors simply because SQE is so poorly understood...and mis-jumpered by installers.

V1 did not support it....if it is enabled, the station thinks it is getting a collision signal....which most drivers had trouble ignoring if associated with Xmit Done status....unless the hardware and software could note that it happened AFTER data was xmitted.

V2 did support it, but not all hardware claiming "V2" compliance did....and few installers are/were aware of the subtle difference....since as noted they will interoperate as long as the transceiver and the station don't annoy each other.

SQE is there to protect OTHER stations on the LAN from a station/MAU combination which is unable to detect a collision. Depending on your philosophy about designing products for cost-effective service, you would try to be a good LAN citizen or not.

Absence of SQE is intended to have the offending station/MAU shut down after the first xmission attempt....to protect other stations from collisions. It implies that your MAU is unable to detect a collision...or that your DTE is unable to see the collision signal, as the only difference between the two is WHEN the signal occurs.

There are other techniques for noticing that you have physical level problems (with software). SQE status would just help out the poor techie....and we all know how most programmers feel about service technicians.... >:-)
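Since the collision signal and the heartbeat arrive on the same pair and differ only in when they occur, telling them apart is a matter of timing. The sketch below illustrates that idea only: `classify_cd_signal` is a hypothetical function, times are in arbitrary units, and the default `window` is an illustrative placeholder, not a figure taken from the standard.

```python
def classify_cd_signal(cd_start: float, tx_start: float, tx_end: float,
                       window: float = 2.0) -> str:
    """Classify activity on CD+/- by where it falls relative to our own transmission."""
    if tx_start <= cd_start <= tx_end:
        # Signal while we are still sending: a real collision.
        return "collision"
    if tx_end < cd_start <= tx_end + window:
        # Signal shortly AFTER the data went out: the heartbeat (SQE test).
        return "heartbeat"
    # Anything else is not tied to this transmission.
    return "other"


print(classify_cd_signal(cd_start=50.0, tx_start=0.0, tx_end=100.0))
print(classify_cd_signal(cd_start=101.0, tx_start=0.0, tx_end=100.0))
```

A station that cannot make the second distinction treats every heartbeat as a collision, which matches the V1 behaviour described in this post.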
vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (05/21/91)
In article <1991May20.165442.23834@amd.com>, phil@brahms.amd.com (Phil Ngai) writes:
>
> I would say [count of missed heartbeats] ought to be maintained as a
> statistic by the driver
> and accessible via ioctl.

If it's just available via ioctl, then what good is it? Those of us who might write a program to fetch the count are as likely to have the spare hardware to try some easter egging. The rest would still be in the dark.

> Ideally, you'd bug the system manager about
> no SQE, but realistically, you'd probably get calls from people who
> don't understand 802.3 and don't want to.

This sounds like an argument against a printf.

> As far as printf on first missing heartbeat, what does "first" mean?
> What if someone unplugs the AUI cable for network maintenance and
> then reconnects it an hour later?

I meant "first" as in "first missing heartbeat after initialization or after having seen heartbeats." I'd suppress the printf if there were a "simultaneous" no-carrier report, which covers most disconnected cables.

Others have written bad words about printfs. So until and unless there is something in an SNMP MIB, I guess the count will stay a secret here.

Vernon Schryver, vjs@sgi.com
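The reporting policy described in this post (count every miss, print only on the first miss after initialization or after heartbeats have been seen, and stay quiet when no-carrier is reported at the same time) can be sketched as a small state machine. `HeartbeatReporter` and its method names are hypothetical; a real BSD-style driver would keep this state in its per-interface structure. Leaving the reporter armed after a suppressed no-carrier miss is an assumption of this sketch, not something the post specifies.

```python
class HeartbeatReporter:
    """Counts missing heartbeats and decides when a single message is warranted."""

    def __init__(self) -> None:
        self.missing_count = 0  # always maintained, e.g. for an ioctl or a MIB
        self.armed = True       # armed at init, re-armed whenever a heartbeat is seen

    def frame_sent(self, heartbeat_seen: bool, no_carrier: bool = False) -> bool:
        """Record one transmitted frame; return True when a message should be printed."""
        if heartbeat_seen:
            self.armed = True
            return False
        self.missing_count += 1
        if no_carrier:
            # The no-carrier report already covers a disconnected cable.
            # Assumption: stay armed so a later miss with carrier still reports.
            return False
        if self.armed:
            self.armed = False
            return True
        return False


reporter = HeartbeatReporter()
for seen in (False, False, True, False):
    if reporter.frame_sent(heartbeat_seen=seen):
        print("en0: heartbeat missing after transmit")
print("missing heartbeats so far:", reporter.missing_count)
```

This answers the "what does first mean" objection in one specific way: unplugging and reconnecting the cable produces at most one message per outage, while the counter keeps the full total.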