[mod.protocols.tcp-ip] Fiber Ethernet problem solved

tcs@USNA.ARPA (Terry Slattery) (09/25/86)

The fiber ethernet problem has been solved.  This is rather long (~200
lines) and includes info supplied by various people on the net.  If you
are only interested in the solution, skip down to "THE SOLUTION".

First, a synopsis of the problem.

SYNOPSIS

A Gould 9050 connected via a Codenoll Ethernet over fiber optic network
wouldn't talk to a Vax 780 (or Gould 6050 co-located with the 780).  The
Goulds run their UTX Unix product.  The fiber's optical power levels
checked out to be within spec.  The 9050 could hear rwho packets from
the 780. The 9050 would report errors on EVERY outgoing packet. Timing
estimates showed the 9050 getting receive carrier about 8us after
transmit started.  The controller manual noted a bit which when set
would:

  "Transmit even if no arrrier sense signal is detected."

but the driver didn't contain code to set the bit.  A recent note to
these lists from Cornell (I hope this is right this time) suggested that
the 9050 needed to have this bit set.

INPUT FROM OTHERS ON THESE LISTS

From: swb@devvax.tn.cornell.edu (Scott Brim)
I'm pretty sure I know what that bit is (haven't looked at the Gould
ethernet manual for a year) -- and I don't think it's your problem --
but here's an explanation of the bit: The Gould board uses the 82586
controller chip.  There's an option in the 82586 to look for its own
signal coming back from the transceiver within a certain amount of
time.  It used to be that you couldn't use a "transceiver cable" more
than a certain length (I remember 6 ns, but that seems awfully short).
They turned this off for us -- it's just a change to an initialization
bit in the PROMs on the Ethernet board.
							Scott

[Ed note: I didn't get any info from Gould on the actual time delay over
which the system would break, but our 6050 would work in place of the
780 and the time delay there was ~4us (calculated and measured).]
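
For what it's worth, those delays are about what you would expect from
the leg lengths alone.  A rough back-of-the-envelope check, assuming
about 5us/km of one-way propagation in glass and ignoring transceiver
and coupler latency (the leg lengths below are round-number examples,
not our measured plant):

    #include <stdio.h>

    /* Light in glass fiber travels at roughly c/1.5, i.e. about 5 us
       per km one way.  The transmitter hears its own carrier only
       after the signal has gone out to the star coupler and back.   */
    #define US_PER_KM 5.0

    static double echo_delay_us(double km_to_star)
    {
        return 2.0 * km_to_star * US_PER_KM;    /* out and back */
    }

    int main(void)
    {
        double legs_km[] = { 0.4, 0.8 };    /* example leg lengths */
        int i;

        for (i = 0; i < 2; i++)
            printf("%.1f km leg -> carrier echo after ~%.1f us\n",
                   legs_km[i], echo_delay_us(legs_km[i]));
        return 0;
    }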

From:        <BEAME%MCMASTER.BITNET@WISCVM.WISC.EDU>
      Here at McMaster University we have a star coupler with 4 legs of
around 1500 ft.  We were using Codelink 2020A modems and found that we
were able to transmit from leg 1 to legs (2,3,4) and from leg 2 to legs
(1,3,4), but we were unable to transmit from leg 3 to leg 4.  This
problem cleared up after trying several modems on leg 4.  Thus the
matching of modems IS important.

     Next, if you are using 2020A's, the echo from the modem (back down the
receive line while transmitting) is not done within the modem but at
the star coupler. Thus a machine might complain about late echo.

     We have just installed some 3030A's, which do local echo and block out
the echo from the coupler, and all of our problems and errors (DECNET errors
on every packet with the 2020's) have gone away.

         - Carl Beame

[Ed note:  The 2020 is an old product and is no longer sold.
Codenoll told me that the modems (we have 3030 and 3030S)
do not do local echo.  The measurements with the scope showed delays
identical to the calculated delays on the three modems we have.]

From: leong@andrew.cmu.edu (John Leong)
Looking at your topology, it does not look like you have a distance or
propagation time problem. 

I am assuming GUS is your problem GOULD machine. Have you established that
the GUS interface, fibre transceiver and the fibre are all O.K.?  If it were
us, we would have checked it out as follows: disconnect both USNA and
USNA-CS from the net and put a portable PC running the Netwatch program
provided with MIT's PCIP at the star hub to snoop, just to make sure that
GUS is transmitting fine.

I know of a number of problems associated with asymmetrical passive star hub
networks where some spines are much longer than the others, although that
doesn't necessarily explain your problem. However, you may be interested for
future reference just the same.

One problem is receiver saturation. The receiver of a station near the hub
can get blasted by the transmitter of another station that is also near the
hub. Your 75M link to USNA-CS may qualify for such a problem, but then
again, it may not.


Another, more obscure, problem has to do with collisions. Most receivers have
an AGC (Automatic Gain Control) which essentially tries to pick the signal
out of the background noise.  When a station on a long spine transmits to a
station on another long spine *at the same time as* one on the short spine
starts transmitting, that is a normal collision. However, because of the
relative signal strengths, the AGC of the nearer station's receiver may view
the remote signal as noise and not count it as a collision for retry. On the
other hand, the relative levels as seen by the receiving station may not
differ enough for the remote station's signal to be dropped as noise. In
that case, you have an undetected collision. Ungermann-Bass sells the same
setup as Codenoll.  However, for an asymmetrical network they strongly
recommend the use of an active hub ... but then again, they may just like
to make money, since the active hub is anything but cheap.

John Leong
leong@andrew.cmu.edu

[Ed note:  I also thought the problem was optical power levels.  I spent
a lot of time checking that aspect.  Only after I checked the optical
signal levels and then got out the scope, opened up the xcvr and checked
the transmit and receive signals at the transceiver cable connector did
I decide that the optical stuff was indeed ok.]
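
To see why John's asymmetry argument matters, it helps to put rough
numbers on it.  The sketch below runs a crude optical power budget for
a near and a far station through a passive star; every figure in it
(launch power, fiber loss, coupler loss, leg lengths) is a generic
illustrative value, not a measurement of our plant or a Codenoll spec:

    #include <math.h>
    #include <stdio.h>

    /* Very rough optical power budget through a passive star coupler.
       All of these numbers are illustrative guesses, not measured
       values or manufacturer specs.                                 */
    #define LAUNCH_DBM      -12.0   /* power launched into the fiber */
    #define FIBER_DB_PER_KM   4.0   /* multimode loss, ballpark      */
    #define STAR_LEGS         8     /* coupler splits power N ways   */
    #define STAR_EXCESS_DB    3.0   /* coupler excess loss           */

    static double rx_dbm(double tx_leg_km, double rx_leg_km)
    {
        double split_db = 10.0 * log10((double)STAR_LEGS)
                        + STAR_EXCESS_DB;

        return LAUNCH_DBM
             - FIBER_DB_PER_KM * (tx_leg_km + rx_leg_km)
             - split_db;
    }

    int main(void)
    {
        double near = rx_dbm(0.05, 0.05);   /* sender on a short leg */
        double far  = rx_dbm(1.5,  0.05);   /* sender 1.5 km away    */

        printf("near sender arrives at %.1f dBm\n", near);
        printf("far  sender arrives at %.1f dBm (%.1f dB weaker)\n",
               far, near - far);
        return 0;
    }

A receiver whose gain has settled on the stronger nearby signal can
plausibly treat something several dB weaker as noise, which is exactly
the undetected-collision case John describes.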

From:     "Robert J. Reschly Jr." <reschly@BRL.ARPA>
   I don't have any helpful answers for you, but I do have an aside.

   One of the nicer things that Gould has provided is a program called
enfunc(8).  When invoked with the stats option (/etc/enfunc en0 stats)
it displays a nicely formatted summary of the interface's activity.

   Just thought you may be interested (in case you had not noticed it
yet).

[Ed note:  This program is crucial to the problem solution (at least
for us).]

From: Preston Mullen <mullen@nrl-mpm.arpa>
On our 9005, Gould had to modify their Ethernet device driver software
to turn on that "transmit regardless of carrier sense" bit so that their
Ethernet card would work properly here with a DEC DELNI.  This was supposedly
because of some timing problems, for which I've never had an adequate
explanation.  (Broadband Ethernet, e.g. DEC DECOM, would supposedly
require the same fix.)  They also changed the Ethernet board itself
in some way (perhaps to implement the "ignore carrier sense" bit?).
The work was done in early June; I think the board dated back to
November-December.

I was told that the ability to set or reset this would eventually
be moved into the kernel so that it could be changed dynamically.

Caveat: I may have some of this wrong; unfortunately, I never got anything
in writing from Gould about this problem and solution.  Everything has
worked fine since the change was made.

	Preston Mullen

[Ed note:  Just like Scott's fix at Cornell.]

MISC INFO

Lew Law at Harvard University also called me Monday (they have a rather
large configuration there).  He couldn't offer much in the way of
concrete solutions after discussing all the testing I had already done.

Bart Brooks at Gould was really prompt; he called EARLY Monday and
said that the control bit in the interface was indeed the problem
and that there were two solutions (see below).

THE SOLUTION

Bart Brooks at Gould confirmed that the "Transmit without carrier
sense" bit was the problem.  There were two solutions:

1. Cornell (and NRL) have a different prom set which turns the
bit on at initialization; get a set of those proms for the interface.

2. The UTX 2.0 software driver (and the enfunc program mentioned above)
contains code to set the bit.  Get a copy of this software and
use it to set the bit.

I made the necessary changes to the ethernet driver to set the
"tnosense" bit.  Running enfunc (a version that knows about setting
tnosense) set the bit (as reported by the driver).  However,
that didn't make the thing work.  The timing measurements showed
that the 9050 was only sending 10us long packets - much shorter
than a full ethernet packet.  I called Gould to ask for help.
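
To put that 10us figure in perspective: at 10 Mb/s it is only about a
hundred bits, nowhere near even a minimum-size frame.  A quick sanity
check (plain arithmetic, nothing Gould-specific):

    #include <stdio.h>

    /* How much of a frame fits in a 10 us burst at 10 Mb/s?         */
    int main(void)
    {
        double rate_bps = 10e6;         /* Ethernet bit rate         */
        double burst_us = 10.0;         /* what the scope showed     */
        double bits     = rate_bps * burst_us / 1e6;
        double min_bits = 64.0 * 8.0;   /* minimum frame, incl. CRC  */

        printf("10 us burst carries %.0f bits (%.1f bytes)\n",
               bits, bits / 8.0);
        printf("a minimum frame is %.0f bits (%.1f us on the wire)\n",
               min_bits, min_bits / rate_bps * 1e6);
        return 0;
    }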

Bart Brooks emailed me a manual page on enfunc (received this morning).
One of the notes was to "re-ifconfig" the interface after running
enfunc. Funny, when that procedure is used, it works!  We've not seen
any errors reported by the interface since this morning when it started
working.  For those interested, the functionality of the new driver
and enfunc will be in UTX 2.0.

One bad thing about this interface is that it uses an on-board micro
whose code doesn't allow the user to examine the state of the on-board
control bits.  The driver tells the board what to do and remembers what
has been sent.
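
In other words, the driver's picture of the board is only a shadow copy
of what it last sent.  Conceptually it works like the sketch below (the
names and fields are invented for illustration, not Gould's code); the
obvious hazard is that if the on-board micro resets or drops a command,
the remembered state and the real state silently disagree:

    #include <stdio.h>

    /* The board's control bits are effectively write-only, so the
       driver keeps a shadow copy of whatever it last told the
       on-board micro.  All names here are invented for illustration. */
    struct en_shadow {
        int tnosense;   /* 1 = transmit without carrier sense        */
        int promisc;    /* 1 = receive all packets                   */
    };

    static struct en_shadow shadow;     /* what the driver remembers */

    static void en_command(int tnosense, int promisc)
    {
        /* a real driver would hand the bits to the on-board micro
           here; there is no way to read them back afterwards        */
        shadow.tnosense = tnosense;
        shadow.promisc  = promisc;
    }

    int main(void)
    {
        en_command(1, 0);
        printf("driver believes: tnosense=%d promisc=%d\n",
               shadow.tnosense, shadow.promisc);
        /* if the micro ignored the command, this report is wrong    */
        return 0;
    }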

The people at Codenoll were very patient with my questions during
the testing and diagnosis of the problem (which turned out to not
be their fault).

As an aside, our Tektronix 6130 had the same symptoms when attached
to the fiber transceiver.  I called Excelan about their interface
(which we will be using in a gateway on the fiber net later this year)
and was told that there is a jumper on the card to effect the
same "transmit with no carrier sense" operation.

I have one remaining question:

The old December 1982 IEEE 802.3 DRAFT I have (our final is still on
order) says under the section on "Transmit Media Access Management":

"After the last bit of the passing frame (i.e., when carrierSense
changes from true to false), the CSMA/CD Media Access sublayer continues
to defer for a proper interframe spacing, interFrameSpacing (see Section
4.2.3.2.2).

At the end of the interframe spacing of that time, if it has a frame
waiting to be transmitted, transmission is initiated independent of the
value of carrierSense.  When transmission has completed (or immediately,
if there was nothing to transmit) the CSMA/CD Media Access sublayer
resumes its original monitoring of carrierSense."
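
Read literally, the quoted rule amounts to the little routine below.
This is only my pseudo-C paraphrase of the draft text (the 9.6us figure
is the usual interframe spacing at 10 Mb/s), with the medium stubbed
out; it is not code from any real driver:

    #include <stdio.h>

    #define IFS_US 9.6  /* interFrameSpacing at 10 Mb/s              */

    /* Stand-ins for the MAC's view of the medium; in hardware these
       come from the transceiver, here they are just stubs.          */
    static int  carrier_sense(void)  { return 0; }
    static void wait_us(double us)   { (void)us; }
    static void start_transmit(void) { puts("transmitting"); }

    /* One reading of "Transmit Media Access Management":            */
    static void transmit_when_ready(int frame_waiting)
    {
        while (carrier_sense())
            ;                      /* defer while a frame passes     */
        wait_us(IFS_US);           /* then defer one interframe gap  */
        if (frame_waiting)
            start_transmit();      /* regardless of carrierSense now */
        /* afterwards, resume normal monitoring of carrierSense      */
    }

    int main(void)
    {
        transmit_when_ready(1);
        return 0;
    }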

This seems to imply that the interface should not monitor carrier
during transmit.  Could someone more familiar with the spec elaborate?

Thanks to everyone for their help, especially Gould, who had a
bunch of people working on it.

	-tcs
	Terry Slattery	  U.S. Naval Academy	301-267-4413
	ARPA: tcs@usna.arpa
	UUCP: decvax!brl-smoke!usna!tcs

Murray.pa@XEROX.COM (09/26/86)

"This seems to imply that the interface should not monitor carrier
during transmit.  Could someone more familiar with the spec elaborate?"

The main idea is that there should be a 9.6 microsecond minimum gap
between packets so that the receiver can get ready to grab the next
packet. Dropping packets can easily have a disastrous impact on
performance. A bit of time will normally simplify the hardware design.

The fine print is trying to say (I think) that after the transmitter
waits 9.6 microseconds, it shouldn't wait again/more (as if it were
starting fresh and the middle of a packet was already on the wire) just
because it now looks like there is a packet already on the wire. That
packet started just a very short while ago, probably less than a bit
time (if everybody is following the rules).

If nothing else, the fraction of a bit difference in the phase of the
transmit clocks at the two stations could easily provoke this case. When
the (second/interesting) transmitter does start to transmit, it will
cause a collision. That's the desired result when two stations try to
transmit at the "same" time.

Note that the fractional bit race condition actually happens quite
often. Consider three stations on an ethernet. Call them left to right
A, B, and C. Suppose A is transmitting and B and C are waiting to send.
When A finishes, the end of packet will sweep down the wire. When it
gets to B, B's 9.6 microsecond clock starts ticking. A while later, C's
clock will start too. When B's clock expires, the wire (around B) is
empty so B starts transmitting. When C's clock goes off, B's new packet
is just about to arrive at C or has just arrived at C. Because the wire
delays cancel out in this configuration, fractions of a bit due to
clock synchronization are important.
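
For what it's worth, the arithmetic behind that last point works out as
below; the B-to-C delay is an arbitrary example value, and the point is
only that it cancels out of the comparison:

    #include <stdio.h>

    int main(void)
    {
        double ifs_us  = 9.6;   /* interframe gap                    */
        double d_bc_us = 2.0;   /* example B-to-C propagation delay  */

        /* A's end-of-packet passes B at t=0 and passes C at t=d_bc. */
        double b_starts_tx = 0.0 + ifs_us;           /* B transmits  */
        double b_reaches_c = b_starts_tx + d_bc_us;  /* seen at C    */
        double c_timer_up  = d_bc_us + ifs_us;       /* C's gap ends */

        printf("B's packet reaches C at t = %.1f us\n", b_reaches_c);
        printf("C's gap timer expires at t = %.1f us\n", c_timer_up);
        printf("difference = %.1f us; one bit time is 0.1 us\n",
               b_reaches_c - c_timer_up);
        return 0;
    }

With the propagation delay cancelling exactly, whether C sees B's
carrier just before or just after its own timer expires comes down to
the sub-bit phase of the two stations' clocks.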