[comp.sys.sgi] Ethernet controller differences between 4D/340 and 4D/25

sdempsey@UCSD.EDU (Steve Dempsey) (01/19/91)

				Warning!
People prone to nervous breakdowns when dealing with network problems
		are advised to stop reading this now!
			     -------------

Hardware:
	4D/340VGX with IO2 i/o controller
	4D/25TG

Software:
	all machines at 3.3.1

Situation:
    When I attempt to transmit a file from our 340 to a remote site with
    'ftp' or 'rcp' the transfer is either very slow or terminates with a lost
    connection.  I then disconnect the 340 from the net by unplugging the
    drop line from the 340 I/O panel and plug the drop line into the I/O
    panel of the 4D/25TG, to ensure that both machines use the same physical
    net connection.  Now I attempt the same file transfer from the 4D/25TG
    and it goes without a hitch.

What I Know So Far:
    I have monitored our subnet to watch the packets.  What I see is that
    packets from the 340 appear with varying frequency.  The time between
    delayed packets will double from 1 to 2, then 4, 8, 16, 32, and finally
    64 seconds will elapse between packets.  On occasion six consecutive
    packets will be delayed by 64 seconds, at which point the connection is
    summarily dropped.  More frequently the timeout resets back to 1 second
    and then starts doubling again.  When the 4D/25TG is used all of the
    packets are sent within a second or two without any unusual delays.
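
    The doubling pattern is what a TCP sender's exponential retransmit
    backoff would look like from the wire.  A minimal sketch in Python,
    assuming a 1-second starting delay, a 64-second cap, and a drop after
    six retransmissions at the cap.  These numbers are taken from the
    observations above, not from the IRIX source, and the function name is
    made up for illustration:

```python
def backoff_schedule(base=1, cap=64, tries_at_cap=6):
    """Return the list of delays (seconds) before each retransmission.

    Assumed model: the delay doubles from `base` until it reaches `cap`,
    then stays at `cap` for `tries_at_cap` attempts before giving up.
    """
    delays = []
    delay = base
    while delay < cap:
        delays.append(delay)
        delay *= 2
    delays.extend([cap] * tries_at_cap)
    return delays


if __name__ == "__main__":
    schedule = backoff_schedule()
    print(schedule)
    print("seconds of silence before the drop:", sum(schedule))
```

    Under these assumptions a connection that never hears an ACK spends
    447 seconds (about seven and a half minutes) retransmitting before it
    is dropped.  An ACK arriving at any point would put the delay back to
    its starting value, which would match the more frequent case of the
    timeout returning to 1 second.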

    The only clue I have from the 340's point of view is that if I run
    'netstat -p tcp' while the transfer is in progress I see these counts
    incrementing:
       6241 retransmit timeouts
	   11 connections dropped by rexmit timeout

    This problem has been reported to the HOTLINE (call # H2774) on 10Jan91
    but no answers yet.  Our FE tried replacing the IO2 board but it had no
    effect.

    A 240GTX on the same subnet also experiences these delays, while several
    other PIs (Personal Irises) on the same subnet have no problems.

The Big Question:
    What is different about the 240/340 and 25 that would account for this
    behavior? 

---------------------------------------------------------------------------
Steve Dempsey					voice:	  (619) 534-0208
Dept. of Chemistry Computer Facility, 0314	UUCP:	  ucsd!sdempsey
University of Calif. at San Diego		BITNET:	  sdempsey@ucsd
9500 Gilman Drive				INTERNET: sdempsey@ucsd.edu
La Jolla, CA 92093-0314				fax:	  (619) 534-0058

vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (01/19/91)

In article <9101182254.AA12177@chem.chem.ucsd.edu>, sdempsey@UCSD.EDU (Steve Dempsey) writes:
> ....
> The Big Question:
>     What is different about the 240/340 and 25 that would account for this
>     behavior? 

They're very similar designs based on 7990's, as you no doubt noticed if
you examined the chips on the IO2 and the 4D25.

It would pay to check for errors with `netstat -i`, and for the tired old
ethernet complaints on the consoles (e.g. "late collisions" on the machines
in question or on any other IRIS on any ethernet between the ends of the
FTP transfer).  Perhaps there is some kind of grounding or other difference
that makes one of the machines unable to hear the ACK's from the remote
machine.  (E.g. the frame grounding differences among Ethernet 1, 2, and
802.3 cables and transceivers.)  I also seem to recall some differences in
the 802.3/e1/e2 transformers between the IO2 and the 4D25.  It might pay to
switch cables and transceivers.


Vernon Schryver,   vjs@sgi.com

sdempsey@UCSD.EDU (Steve Dempsey) (01/24/91)

Let me try asking a specific question, since I'm not getting anywhere
with this problem.  What causes the "retransmit timeouts" that 'netstat -p tcp'
reports on a 340 using the IO2 board?  Here is the output:

tcp:
        2241271 packets sent
                1942725 data packets (1701239978 bytes)
                11970 data packets (14359783 bytes) retransmitted
                239995 ack-only packets (227522 delayed)
                278 URG only packets
                943 window probe packets
                41099 window update packets
                4261 control packets
        2017922 packets received
                1393024 acks (for 1701333400 bytes)
                20860 duplicate acks
                0 acks for unsent data
                887096 packets (221013020 bytes) received in-sequence
                1592 completely duplicate packets (85826 bytes)
                248 packets with some dup. data (993 bytes duped)
                7435 out-of-order packets (5046742 bytes)
                54 packets (5 bytes) of data after window
                5 window probes
                40083 window update packets
                27 packets received after close
                0 discarded for bad checksums
                0 discarded for bad header offset fields
                0 discarded because packet too short
        1553 connection requests
        1532 connection accepts
        3024 connections established (including accepts)
        3477 connections closed (including 26 drops)
        174 embryonic connections dropped
        1378640 segments updated rtt (of 1389598 attempts)
>>	7373 retransmit timeouts
>>		12 connections dropped by rexmit timeout
        354 persist timeouts
        334 keepalive timeouts
                30 keepalive probes sent
                0 connections dropped by keepalive
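
To put these counters in proportion, here is a rough Python sketch that
pulls the numbers out of 'netstat -p tcp' text and computes the retransmit
ratio.  The parsing is an assumption based only on the line layout shown
above (a count followed by a label, with byte counts in parentheses), and
the function name is hypothetical:

```python
import re

# A few lines copied from the output above, including the ">>" highlights.
SAMPLE = """\
                1942725 data packets (1701239978 bytes)
                11970 data packets (14359783 bytes) retransmitted
>>      7373 retransmit timeouts
>>              12 connections dropped by rexmit timeout
"""


def parse_counters(text):
    """Map each label (parenthesised byte counts removed) to its leading count.

    Sketch only: labels that occur in both the sent and received sections
    of the full output (e.g. "window update packets") would overwrite
    each other, so this is meant for short excerpts like SAMPLE.
    """
    counters = {}
    for line in text.splitlines():
        match = re.match(r"[>\s]*(\d+)\s+(.+)", line)
        if not match:
            continue
        label = re.sub(r"\s*\([^)]*\)", "", match.group(2)).strip()
        counters[label] = int(match.group(1))
    return counters


if __name__ == "__main__":
    c = parse_counters(SAMPLE)
    ratio = c["data packets retransmitted"] / c["data packets"]
    print(f"retransmitted data packets: {ratio:.2%}")
    # Growth since the counts quoted in the first post (6241 timeouts, 11 drops).
    print("new retransmit timeouts:", c["retransmit timeouts"] - 6241)
    print("new rexmit drops:", c["connections dropped by rexmit timeout"] - 11)
```

Because these counters are running totals across every TCP connection on
the machine, the overall ratio (well under one percent here) says little
about the failing transfers.  Taking one snapshot before and one after a
single 'ftp' run and subtracting would isolate them better.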

vjs@rhyolite.wpd.sgi.com (Vernon Schryver) (01/24/91)

In article <9101232123.AA17793@chem.chem.ucsd.edu>, sdempsey@UCSD.EDU (Steve Dempsey) writes:
> Let me try asking a specific question since I'm not getting anywhere
> with this problem.  What causes "retransmit timeouts" on a 340 using the IO2
> board as reported by 'netstat -p tcp'?:
> ...


Not receiving ack's from the other end.
Sorry, but that's the most that can be inferred from the symptom.

Possible reasons for not receiving stuff from the other end are many.
They include wild backhoes, broken wires, people pushing reset buttons,
improperly installed ethernets, and broken hardware or software.


Vernon Schryver,   vjs@sgi.com