tcs@USNA.ARPA (Terry Slattery) (09/25/86)
The fiber Ethernet problem has been solved. This is rather long (~200 lines) and includes info supplied by various people on the net. If you are only interested in the solution, skip down to "THE SOLUTION". First, a synopsis of the problem.

SYNOPSIS

A Gould 9050 connected via a Codenoll Ethernet-over-fiber-optic network wouldn't talk to a Vax 780 (or a Gould 6050 co-located with the 780). The Goulds run their UTX Unix product. The fiber's optical power levels checked out to be within spec. The 9050 could hear rwho packets from the 780. The 9050 would report errors on EVERY outgoing packet. Timing estimates showed the 9050 getting receive carrier about 8us after transmit started. The controller manual noted a bit which, when set, would "Transmit even if no carrier sense signal is detected," but the driver didn't contain code to set the bit. A recent note to these lists from Cornell (I hope this is right this time) suggested that the 9050 needed to have this bit set.

INPUT FROM OTHERS ON THESE LISTS

From: swb@devvax.tn.cornell.edu (Scott Brim)

I'm pretty sure I know what that bit is (haven't looked at the Gould ethernet manual for a year) -- and I don't think it's your problem -- but here's an explanation of the bit: The Gould board uses the 82586 controller chip. There's an option in the 82586 to look for its own signal coming back from the transceiver within a certain amount of time. It used to be that you couldn't use a "transceiver cable" more than a certain length (I remember 6 ns, but that seems awfully short). They turned this off for us -- it's just a change to an initialization bit in the PROMs on the Ethernet board.
						Scott

[Ed note: I didn't get any info from Gould on the actual time delay over which the system would break, but our 6050 would work in place of the 780, and the time delay there was ~4us (calculated and measured).]

From: <BEAME%MCMASTER.BITNET@WISCVM.WISC.EDU>

Here at McMaster University we have a star coupler with 4 legs of around 1500 ft. We were using Codelink 2020A modems; we found that we were able to transmit from leg 1 to legs (2,3,4) and from leg 2 to legs (1,3,4), but we were unable to transmit from leg 3 to leg 4. This problem cleared up after trying several modems on leg 4. Thus the matching of modems IS important. Next, if you are using 2020A's, the echo from the modem (back down the receive line while transmitting) is not done within the modem but at the star coupler. Thus a machine might complain about a late echo. We have just installed some 3030A's which do local echo and block out the echo from the coupler, and all of our problems and errors (DECNET errors on every packet with the 2020's) have gone away.
						- Carl Beame

[Ed note: The 2020 is an old product and is no longer sold. Codenoll told me that the modems (we have 3030 and 3030S) do not do local echo. The measurements with the scope showed delays identical to the calculated delays on the three modems we have.]
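[Ed note: For anyone wanting to check "calculated" delay figures like these, here is a rough sketch of the arithmetic in C. The ~2e8 m/s propagation speed in glass fiber (index about 1.5) and the 1500 ft leg length (borrowed from Carl's description, not from our plant) are assumptions, and electronics delay in the modems and transceivers is ignored.

    /* echo_delay.c -- rough round-trip echo delay over a passive star.
     * Propagation speed and leg length are assumed figures, not
     * measurements of either network described above.
     */
    #include <stdio.h>

    #define FIBER_SPEED 2.0e8          /* meters/second, assumed */
    #define FT_TO_M     0.3048

    int main(void)
    {
        double leg_ft = 1500.0;        /* one leg, station to star coupler */
        double leg_m  = leg_ft * FT_TO_M;

        /* The echo goes out to the coupler and back, so the delay from
         * the start of transmit to receive carrier at the sender is
         * twice the one-way leg delay. */
        double one_way_us = leg_m / FIBER_SPEED * 1.0e6;
        double echo_us    = 2.0 * one_way_us;

        printf("one-way leg delay:      %.2f us\n", one_way_us);
        printf("round-trip echo delay:  %.2f us\n", echo_us);
        return 0;
    }

A 1500 ft leg comes out to roughly 2.3us each way, or about 4.6us before the transmitter hears its own carrier back from the coupler, which is the same ballpark as the ~4us figure measured on the working 6050.]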
From: leong@andrew.cmu.edu (John Leong)

Looking at your topology, it does not look like you have a distance or propagation time problem. I am assuming GUS is your problem GOULD machine. Have you established that the GUS interface, fibre transceiver, and the fibre are all O.K.? If it were us, we would have checked it out as follows: disconnect both USNA and USNA-CS from the net and put a portable PC running the Netwatch program provided with MIT's PCIP on the star hub to do snooping, just to make sure that GUS is transmitting fine.

I know of a number of problems associated with asymmetrical passive star hub networks where some spines are much longer than the others, although it doesn't necessarily explain your problem. However, you may be interested for future reference just the same. One problem is receiver saturation. The receiver of a station near the hub can get blasted by the transmitter of another station also near the hub. Your 75M link to USNA-CS may qualify for such a problem, but then again, it may not. Another, more obscure problem has to do with collisions. Most receivers have an AGC (Automatic Gain Control) which essentially tries to pick the signal out of the background noise. When a station on a long spine transmits to a station on another long spine *at the same time as* one on a short spine starts transmitting, it is a normal collision. However, because of the relative signal strength, the AGC of the nearer station's receiver may view the remote signal as noise and not count it as a collision for retry. On the other hand, the relative value as seen by the receiving station may not be significant enough for the remote station's signal to be dropped off as noise. In which case, you have an undetected collision. Ungermann-Bass sells the same setup as Codenoll. However, for asymmetrical networks, they strongly recommend the use of an active hub ... but there again, they may just like to make money, since the active hub is anything but cheap.
						John Leong
						leong@andrew.cmu.edu

[Ed note: I also thought the problem was optical power levels. I spent a lot of time checking that aspect. Only after I checked the optical signal levels and then got out the scope, opened up the xcvr, and checked the transmit and receive signals at the transceiver cable connector did I decide that the optical stuff was indeed ok. A rough numerical sketch of John's signal-strength point appears at the end of MISC INFO below.]

From: "Robert J. Reschly Jr." <reschly@BRL.ARPA>

I don't have any helpful answers for you, but I do have an aside. One of the nicer things that Gould has provided is a program called enfunc(8). When invoked with the stats option (/etc/enfunc en0 stats) it displays a nicely formatted summary of the interface's activity. Just thought you may be interested (in case you had not noticed it yet).

[Ed note: This program is crucial to the problem solution (at least for us).]

From: Preston Mullen <mullen@nrl-mpm.arpa>

On our 9005, Gould had to modify their Ethernet device driver software to turn on that "transmit regardless of carrier sense" bit so that their Ethernet card would work properly here with a DEC DELNI. This was supposedly because of some timing problems, for which I've never had an adequate explanation. (Broadband Ethernet, e.g. DEC DECOM, would supposedly require the same fix.) They also changed the Ethernet board itself in some way (perhaps to implement the "ignore carrier sense" bit?). The work was done in early June; I think the board dated back to November-December. I was told that the ability to set or reset this would eventually be moved into the kernel so that it could be changed dynamically. Caveat: I may have some of this wrong; unfortunately, I never got anything in writing from Gould about this problem and solution. Everything has worked fine since the change was made.
						Preston Mullen

[Ed note: Just like Scott's fix at Cornell.]

MISC INFO

Lew Law at Harvard University also called me Monday (they have a rather large configuration there). He couldn't offer much in the way of concrete solutions after discussing all the testing I had already done.
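Before going on to the solution, one footnote to John Leong's signal-strength point above. In a passive star, both colliding signals pass through the same coupler and the same receive spine, so the power difference seen by a receiver is just the attenuation difference of the two transmit spines. The sketch below puts a rough number on that; the 3.5 dB/km fiber loss and the spine lengths are assumed illustrative figures, not measurements of either network.

    /* agc_margin.c -- relative received power of two colliding stations
     * through a passive star coupler.  Fiber loss and spine lengths are
     * assumed illustrative figures, not measurements of either network.
     */
    #include <stdio.h>
    #include <math.h>

    #define FIBER_LOSS_DB_PER_KM 3.5   /* assumed, 850 nm multimode */

    int main(void)
    {
        double near_spine_km = 0.075;  /* e.g. a 75 m spine            */
        double far_spine_km  = 2.0;    /* e.g. a long outlying spine   */

        /* The coupler splitting loss and the receive spine are common to
         * both paths, so only the transmit-spine losses differ. */
        double delta_db = (far_spine_km - near_spine_km) * FIBER_LOSS_DB_PER_KM;
        double power_ratio = pow(10.0, delta_db / 10.0);

        printf("far signal arrives %.1f dB weaker (about %.1fx less power)\n",
               delta_db, power_ratio);
        /* If the receiver's AGC has locked onto the strong near signal, a
         * signal this much weaker can be dismissed as noise, and the
         * collision goes undetected. */
        return 0;
    }

Whether that margin is enough for a given AGC to dismiss the weak signal as noise depends on the receiver, but it shows how the asymmetry John describes arises.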
Bart Brooks at Gould was really prompt; he called EARLY Monday and said that the control bit in the interface was indeed the problem and that there were two solutions (see below).

THE SOLUTION

Bart Brooks at Gould confirmed that the "transmit without carrier sense" bit was the problem. There were two solutions:

1. Cornell (and NRL) have a different PROM set which turns the bit on at initialization; get a set of those PROMs for the interface.

2. The UTX 2.0 software driver (and the enfunc program mentioned above) contain code to set the bit. Get a copy of this software and use it to set the bit.

I made the necessary changes to the Ethernet driver to set the "tnosense" bit. Running enfunc (a version that knows about setting tnosense) set the bit (as reported by the driver). However, that didn't make the thing work. The timing measurements showed that the 9050 was only sending 10us-long packets - much shorter than a full Ethernet packet. I called Gould to ask for help. Bart Brooks emailed me a manual page on enfunc (received this morning). One of the notes was to "re-ifconfig" the interface after running enfunc (see the P.S. below). Funny, when that procedure is used, it works! We've not seen any errors reported by the interface since this morning when it started working. For those interested, the functionality of the new driver and enfunc will be in UTX 2.0.

One bad thing about this interface is that it uses an on-board micro which doesn't contain code to allow the user to examine the state of the on-board control bits. The driver tells the board what to do and remembers what has been sent. The people at Codenoll were very patient with my questions during the testing and diagnosis of the problem (which turned out not to be their fault).

As an aside, our Tektronix 6130 had the same symptoms when attached to the fiber transceiver. I called Excelan about their interface (which we will be using in a gateway on the fiber net later this year) and was told that there is a jumper on the card to effect the same "transmit with no carrier sense" operation.

I have one remaining question: The old December 1982 IEEE 802.3 DRAFT I have (our final is still on order) says under the section on "Transmit Media Access Management":

"After the last bit of the passing frame (i.e., when carrierSense changes from true to false), the CSMA/CD Media Access sublayer continues to defer for a proper interframe spacing, interFrameSpacing (see Section 4.2.3.2.2). At the end of that time, if it has a frame waiting to be transmitted, transmission is initiated independent of the value of carrierSense. When transmission has completed (or immediately, if there was nothing to transmit) the CSMA/CD Media Access sublayer resumes its original monitoring of carrierSense."

This seems to imply that the interface should not monitor carrier during transmit. Could someone more familiar with the spec elaborate?

Thanks to everyone for their help; especially Gould, who had a bunch of people working on it.

						-tcs

Terry Slattery     U.S. Naval Academy     301-267-4413
ARPA: tcs@usna.arpa     UUCP: decvax!brl-smoke!usna!tcs
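P.S. For anyone who ends up doing the same enfunc-then-ifconfig dance: the sketch below is roughly what the "re-ifconfig" step amounts to, done with the standard 4.2BSD interface-flags ioctls. The interface name en0 is ours; whether a simple down/up cycle is enough on UTX, or whether you need to rerun ifconfig with the address to force a full reinitialization, I have not verified, so treat this as an illustration rather than a recipe.

    /* bounce_if.c -- take an interface down and back up with the
     * standard 4.2BSD SIOCGIFFLAGS/SIOCSIFFLAGS ioctls, roughly what
     * "re-ifconfig"ing the interface does by hand.  "en0" and the need
     * to bounce it at all are specific to our Gould/UTX setup.
     */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/ioctl.h>
    #include <net/if.h>

    int main(void)
    {
        struct ifreq ifr;
        int s = socket(AF_INET, SOCK_DGRAM, 0);

        if (s < 0) {
            perror("socket");
            return 1;
        }
        strncpy(ifr.ifr_name, "en0", sizeof(ifr.ifr_name));

        /* read the current interface flags */
        if (ioctl(s, SIOCGIFFLAGS, (char *)&ifr) < 0) {
            perror("SIOCGIFFLAGS");
            return 1;
        }

        /* mark it down, then up again, so the driver re-runs its
         * initialization with the new board configuration */
        ifr.ifr_flags &= ~IFF_UP;
        if (ioctl(s, SIOCSIFFLAGS, (char *)&ifr) < 0) {
            perror("SIOCSIFFLAGS (down)");
            return 1;
        }
        ifr.ifr_flags |= IFF_UP;
        if (ioctl(s, SIOCSIFFLAGS, (char *)&ifr) < 0) {
            perror("SIOCSIFFLAGS (up)");
            return 1;
        }
        printf("en0 bounced\n");
        close(s);
        return 0;
    }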
Murray.pa@XEROX.COM (09/26/86)
"This seems to imply that the interface should not monitor carrier during transmit. Could someone more familiar with the spec elaborate?" The main idea is that there should be a 9.6 microsecond minimum gap between packets so that the receiver can get ready to grab the next packet. Dropping packets can easily have disasterous impact on performance. A bit of time will normaly simplify the hardware design. The fine print is trying to say (I think) that after the transmitter waits 9.6 microseconds, it shouldn't wait again/more (as if it were starting fresh and the middle of a packet was already on the wire) just because it now looks like there is a packet already on the wire. That packet started just a very short while ago, probably less that a bit time, (if everybody is following the rules). If nothing else, the fraction of a bit difference in the phase of the transmit clocks at the two stations could easily provoke this case. When the (second/interesting) transmitter does starts to transmit, it will cause a collision. That's the desired result when two stations try to transmit at the "same" time. Note that the fractional bit race condition actually happens quite often. Consider three stations on an ethernet. Call them left to right A, B, and C. Suppose A is transmitting and B and C are waiting to send. When A finishes, the end of packet will sweep down the wire. When it gets to B, B's 9.6 microsecond clock starts ticking. A while later, C's clock will start too. When B's clock expires, the wire (around B) is empty so B starts transmitting. When C's clock goes off, B's new packet is just about to arrive at C or has just arrived at C. Because the wire delays cancel out in this configuration, fractions of a bit dure to clock synchronization are important.